Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.


Stream: theory: applied category theory

Topic: Unnormalised probabilities


John Baez (Apr 29 2020 at 16:45):

Nathaniel Virgo said:

It's hard to conceptualise the disjoint union of two probability distributions. An element of the sample space is either an event in A or an event in B. The question is what should the probability of such an event be, and how can we interpret it probabilistically?

You could say "flip a coin, then sample from A if it's heads and from B if it's tails." Then for an event in A we would have $p_{A+B}(a) = \frac{1}{2}p_A(a)$, and for an event in B we'd have $p_{A+B}(b) = \frac{1}{2}p_B(b)$. But I think you perceived that this is a little arbitrary...

This is one reason I like to relax the condition that probabilities sum to one, and work with unnormalized distributions $p \colon A \to [0,\infty)$ on finite sets, or more generally measures on measure spaces. You can always normalize your distribution at the very end, when you define expected values of observables. Given $f \colon A \to \mathbb{R}$ you define its expected value to be

$$\langle f \rangle = \frac{\sum_{i \in A} f(i)\, p(i)}{\sum_{i \in A} p(i)}$$

But until you do that, work with unnormalized distributions!

Then there's no problem figuring out how to take a distribution on $A$ and one on $B$ and get one on $A + B$.
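The deferred-normalization recipe can be sketched in a few lines of Python. This is a minimal illustration, not anything from the thread: the dict representation, the tags `'A'`/`'B'`, and the particular weights are all invented for the example.

```python
# Sketch: an unnormalized distribution is a plain dict of nonnegative
# weights; normalization is deferred until we take an expectation.

def expectation(f, p):
    """<f> = sum_i f(i) p(i) / sum_i p(i), normalizing only at the end."""
    total = sum(p.values())
    return sum(f(i) * w for i, w in p.items()) / total

def disjoint_union(p_A, p_B):
    """Distribution on A + B: just keep both sets of weights.
    Tagging with 'A'/'B' keeps the coproduct's elements disjoint."""
    out = {('A', a): w for a, w in p_A.items()}
    out.update({('B', b): w for b, w in p_B.items()})
    return out

p_A = {'x': 2.0, 'y': 1.0}     # total weight 3
p_B = {'u': 1.0}               # total weight 1
p = disjoint_union(p_A, p_B)   # total weight 4

# After normalizing, A-outcomes carry 3/4 of the probability and
# B-outcomes 1/4 -- no arbitrary coin flip is needed.
prob_in_A = expectation(lambda i: 1.0 if i[0] == 'A' else 0.0, p)
print(prob_in_A)  # 0.75
```

Note that if $p_A$ and $p_B$ each happened to sum to 1, this construction would reproduce the "fair coin" answer automatically.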

Moreover in statistical mechanics this is how it actually works: when you get probability distributions from Boltzmann's prescription you first get unnormalized distributions

$$p(i) = \exp(-\beta H(i))$$

where $H(i)$ is the energy of the $i$th state. You normalize them at the end, when you define expected values of observables. This second argument is the one that's really convincing to me: this is, apparently, how nature actually does things!
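As a concrete sketch of Boltzmann's prescription: the energies and the value of the inverse temperature $\beta$ below are made-up example numbers, and the convention $p(i) = e^{-\beta H(i)}$ is assumed.

```python
import math

# Hypothetical two-level system: H[i] is the energy of state i,
# beta is the inverse temperature (example values, not from the thread).
H = [0.0, 1.0]
beta = 1.0

# Unnormalized Boltzmann weights -- kept unnormalized until the end.
p = [math.exp(-beta * H_i) for H_i in H]

# Expected energy <H>: we normalize only when taking the expectation.
Z = sum(p)  # the partition function
mean_H = sum(H_i * p_i for H_i, p_i in zip(H, p)) / Z
print(mean_H)  # 1 / (e + 1), about 0.269
```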

Nathaniel Virgo (Apr 29 2020 at 17:16):

I quite like unnormalised probabilities too, but it seems to me they model something different from normalised ones. This is something I would like to understand better. If you have any insights it would be great!

For example, in the case of the disjoint union, if both distributions initially sum to 1, then forming $p_{A+B}$ in the obvious way and then normalising at the end amounts to setting $p_{A+B}(a) = \frac{1}{2}p_A(a)$ and $p_{A+B}(b) = \frac{1}{2}p_B(b)$, but if they are initially unnormalised then the result will be weighted in proportion to their initial sums. So an unnormalised distribution seems to model a probability distribution together with a weight that says how strongly it should count when combined with other distributions.

Another example of this is Markov processes. The unnormalised analog of a discrete time Markov chain is iteratively multiplying by a not-necessarily-stochastic matrix with nonnegative entries. In such a process the future can affect the past, in that multiplying the whole chain by an unnormalised matrix can affect the relative marginal weights of the initial state. This is weird if we're modelling a classical stochastic process, but it makes a lot of sense if we're using it to model a growing biological population, because you're effectively weighting each member of the initial population by its number of descendants.
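The population reading of an unnormalised Markov chain can be sketched numerically. This is a hypothetical two-type branching example with an invented matrix: each founder's weight in the chain's initial marginal gets multiplied by its descendant count, so the fast-growing type dominates retrospectively.

```python
# M[i][j] is the expected number of type-j offspring per type-i
# individual: nonnegative, but rows need not sum to 1, so this is an
# unnormalised analogue of a Markov transition matrix.
M = [[2.0, 0.0],
     [0.0, 0.5]]
v = [1.0, 1.0]  # equal unnormalised weights on the two founder types

def mat_vec(M, x):
    """(M x)[i] = sum_j M[i][j] x[j]."""
    return [sum(M[i][j] * x[j] for j in range(len(x)))
            for i in range(len(M))]

# Descendant count after t steps of a single type-i founder: (M^t 1)[i].
t = 3
descendants = [1.0, 1.0]
for _ in range(t):
    descendants = mat_vec(M, descendants)

# Marginal weight of the *initial* state within the whole length-t
# chain: founder i is weighted by v[i] times its descendant count --
# the "future affecting the past".
initial_marginal = [vi * di for vi, di in zip(v, descendants)]
total = sum(initial_marginal)
print([w / total for w in initial_marginal])
```

Here the founders start with equal weight, but after conditioning on three steps of growth the type with offspring rate 2 carries almost all of the initial marginal.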

Then there is the strangeness of the Kullback-Leibler divergence for unnormalised distributions. Amari argues that to make information geometry work, the KL divergence should be extended to

$$\sum_i p_i \log \frac{p_i}{q_i} - \sum_i p_i + \sum_i q_i$$

(which can be further generalised to an $\alpha$-divergence). I have some vague intuitions about what this is and where it comes from, but it's strange because it doesn't seem to behave in any particularly nice way if you multiply all the probabilities in one of the distributions by a constant amount, and I don't have a good way to interpret the extra terms in terms of information.
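Amari's extended divergence is a one-liner to transcribe. A minimal sketch (the weights are invented example values); note that each summand $p_i \log(p_i/q_i) - p_i + q_i$ is nonnegative by convexity of $x \log x$, so the whole expression is nonnegative and vanishes exactly when $p = q$, just like the ordinary KL divergence.

```python
import math

def unnormalized_kl(p, q):
    """Amari's extension of the KL divergence to unnormalized
    distributions p, q, given as lists of positive weights."""
    return (sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
            - sum(p) + sum(q))

p = [0.5, 1.5]
q = [1.0, 1.0]

print(unnormalized_kl(p, p))       # 0.0
print(unnormalized_kl(p, q) > 0)   # True
```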

John Baez (Apr 29 2020 at 18:11):

I found that the KL divergence for unnormalized distributions shows up quite naturally in chemistry...

John Baez (Apr 29 2020 at 18:12):

It's connected to "free energy" and under the right conditions chemical reactions tend to decrease it:

John Baez (Apr 29 2020 at 18:13):

See starting on page 16, where equation (21) defines the KL divergence in this context.

John Baez (Apr 29 2020 at 18:14):

So I think it's a very natural concept, though it's still not understood well enough!

Nathaniel Virgo (Apr 29 2020 at 18:38):

Oh, nice derivation, that's cool. I agree it's natural, I just wish I had a good intuition for what the terms mean.

John Baez (Apr 29 2020 at 18:39):

Yes, me too. It's used in the literature on chemical reaction networks, but I've never seen anyone talk much about what it "means".

John Baez (Apr 29 2020 at 18:40):

Maybe we could say: "entropy" was hard enough for people to understand in the first place that its generalization to non-normalized probability distributions should be expected to require further thought! But so far the people using it have not been like Boltzmann or Shannon or Jaynes - they haven't wanted to talk a lot about what things mean.

Nathaniel Virgo (Apr 29 2020 at 19:25):

Here's one way to get to it that maybe provides a little bit of intuition. This is about as far as I got in trying to work out its meaning.

Let's suppose we have a set of rates for Poisson processes, $\eta_1, \dots, \eta_n$, and we consider what will happen in the next small time interval. As $\delta t \to 0$ the probability that the $i$th Poisson process will fire tends to $q_i = \eta_i\, \delta t$, and we also have the much larger probability that nothing happens, $q_0 = 1 - \delta t \sum_i \eta_i$.

But then, in Bayesian fashion, we get some new information that causes us to update our estimates of the rates to new values $\mu_1, \dots, \mu_n$. Then we have a new probability distribution given by $p_i = \mu_i\, \delta t$ and $p_0 = 1 - \delta t \sum_i \mu_i$.

We can use the regular KL divergence to calculate how much information we received about the next time interval. It's given by

$$\sum_{i=0}^n p_i \log \frac{p_i}{q_i} \quad=\quad \delta t \sum_{i=1}^n \mu_i \log \frac{\mu_i}{\eta_i} \,\,+\,\, \left(1 - \delta t \sum_{i=1}^n \mu_i\right) \log \frac{1 - \delta t \sum_i \mu_i}{1 - \delta t \sum_i \eta_i}.$$

Then we can use $\log(1 - x\, \delta t) \approx -x\, \delta t$ and neglect higher-order terms in $\delta t$ to get

$$\delta t \left( \sum_{i=1}^n \mu_i \log \frac{\mu_i}{\eta_i} - \sum_{i=1}^n \mu_i + \sum_{i=1}^n \eta_i \right).$$

So the unnormalised KL arises as a "KL divergence rate" between the two Poisson processes, and the extra terms arise from the $p_0$ term in the regular KL divergence, corresponding to nothing happening in the next small time interval.
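The limit can be checked numerically. A small sketch with invented rate values: for small $\delta t$, the ordinary KL divergence between the two one-step distributions, divided by $\delta t$, should approach the unnormalised divergence between the rate vectors.

```python
import math

eta = [1.0, 2.0]   # original Poisson rates (example values)
mu = [1.5, 0.5]    # updated rates

def kl(p, q):
    """Ordinary KL divergence between two distributions on the same set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def one_step(rates, dt):
    """Distribution over 'process i fired' plus a final 'nothing happened'
    outcome, for one small time interval dt."""
    probs = [r * dt for r in rates]
    return probs + [1.0 - sum(probs)]

# The unnormalised KL between the rate vectors themselves.
rate = kl(mu, eta) - sum(mu) + sum(eta)

# The ordinary KL per unit time, for a small interval.
dt = 1e-6
ratio = kl(one_step(mu, dt), one_step(eta, dt)) / dt

print(ratio, rate)  # these agree to leading order in dt
```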

This gives me some intuition for the extra terms in cases like this where the probabilities are unnormalised because they represent sub-outcomes of an unlikely event. But I don't feel quite satisfied by it - it seems like there should be a more first-principles way to understand it. (I agree that sort of thing tends to come after the formalism has been worked out.)

Sam Tenka (Apr 30 2020 at 18:02):

@Nathaniel Virgo Un-normalized distributions, often interpreted in the Boltzmann style @John Baez wrote about, pervade Probabilistic Programming! They often go under the name "score functions". As with your evolution example, they are sometimes also called "weights", e.g. in Importance Sampling.
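The importance-sampling usage mentioned here is exactly the "normalize at the very end" pattern from earlier in the thread. A hypothetical sketch (the target, proposal, and sample size are all invented): we can only evaluate the target up to a constant, so we carry unnormalized weights and divide by their sum when taking an expectation (self-normalized importance sampling).

```python
import math
import random

random.seed(0)

def p_tilde(x):
    """Unnormalized target: a standard normal without its 1/sqrt(2 pi)."""
    return math.exp(-x * x / 2)

# Proposal: uniform on [-5, 5], with density 1/10.
samples = [random.uniform(-5, 5) for _ in range(100_000)]
weights = [p_tilde(x) / 0.1 for x in samples]

# <x^2> under the target, normalizing the weights only at the end --
# the same expectation formula as before, with sampled points.
estimate = (sum(x * x * w for x, w in zip(samples, weights))
            / sum(weights))
print(estimate)  # close to 1, the variance of a standard normal
```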

Nathaniel Virgo (Apr 30 2020 at 19:58):

Do people use the unnormalised KL in those contexts? It would be neat to see it used somewhere if they do.

[I wrote a lot of other stuff here about exponential families and their relationship to unnormalised distributions, but then I realised there's maybe a better way to think about it, so I want to think through it a bit more before posting, probably tomorrow.]

Sam Tenka (May 02 2020 at 16:02):

Not that I know of :/ ... I thought AIDE (Auxiliary Inference Divergence Estimator) or something similar would, but after skimming the papers, I think I was wrong.