You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.
Nathaniel Virgo said:
It's hard to conceptualise the disjoint union of two probability distributions. An element of the sample space is either an event in A or an event in B. The question is what should the probability of such an event be, and how can we interpret it probabilistically?
You could say "flip a coin, then sample from A if it's heads and from B if it's tails." Then for an event $$a$$ in $$A$$ we would have $$P(a) = \frac{1}{2} p_A(a)$$, and for an event $$b$$ in $$B$$ we'd have $$P(b) = \frac{1}{2} p_B(b)$$. But I think you perceived that this is a little arbitrary...
This is one reason I like to relax the condition that probabilities sum to one, and work with unnormalized distributions on finite sets, or more generally measures on measure spaces. You can always normalize your distribution at the very end, when you define expected values of observables. Given an observable $$f$$ and an unnormalized distribution $$p$$ on a finite set, you define its expected value to be

$$\langle f \rangle = \frac{\sum_i p_i f(i)}{\sum_i p_i}$$
But until you do that, work with unnormalized distributions!
Then there's no problem figuring out how to take a distribution on $$A$$ and one on $$B$$ and get one on $$A + B$$.
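Here's a small sketch of the "normalize only at the end" idea in code. The function names (`disjoint_union`, `expect`) and the example distributions are illustrative, not from the thread:

```python
# Sketch: disjoint union of unnormalized distributions on finite sets,
# with normalization deferred to the expected-value step.

def disjoint_union(p, q):
    """Combine distributions on A and B into one on A + B by tagging
    each outcome with its origin; no renormalization happens here."""
    return {("A", a): w for a, w in p.items()} | {("B", b): w for b, w in q.items()}

def expect(dist, f):
    """Expected value of an observable f, normalizing only here."""
    total = sum(dist.values())
    return sum(w * f(x) for x, w in dist.items()) / total

p = {"heads": 0.5, "tails": 0.5}   # already sums to 1
q = {"x": 1.0}                     # a point mass
pq = disjoint_union(p, q)
# Each side contributes in proportion to its total mass: both sum to 1
# here, so A and B each get weight 1/2 after normalizing.
print(expect(pq, lambda x: 1.0 if x[0] == "A" else 0.0))  # → 0.5
```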
Moreover, in statistical mechanics this is how it actually works: when you get probability distributions from Boltzmann's prescription, you first get unnormalized distributions

$$p_i = e^{-E_i / kT}$$

where $$E_i$$ is the energy of the $$i$$th state. You normalize them at the end, when you define expected values of observables. This second argument is the one that's really convincing to me: this is, apparently, how nature actually does things!
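A minimal sketch of Boltzmann's prescription in code, with made-up energies: the weights stay unnormalized, and the partition function only appears when an expected value is taken.

```python
import math

# Unnormalized Boltzmann weights exp(-E_i / kT); energies are made up.
kT = 1.0
energies = [0.0, 1.0, 2.0]
weights = [math.exp(-E / kT) for E in energies]  # unnormalized distribution

# Normalize only at the end, inside the expectation:
Z = sum(weights)                                  # partition function
mean_E = sum(w * E for w, E in zip(weights, energies)) / Z
print(mean_E)  # expected energy, dominated by the low-energy states
```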
I quite like unnormalised probabilities too, but it seems to me they model something different from normalised ones. This is something I would like to understand better. If you have any insights it would be great!
For example, in the case of the disjoint union, if both distributions initially sum to 1, then forming the distribution on $$A + B$$ in the obvious way and then normalising at the end amounts to doing $$\frac{1}{2} p_A$$ and $$\frac{1}{2} p_B$$, but if they are initially unnormalised then it will be weighted according to the relative difference in their initial sums. So an unnormalised distribution seems to model a probability distribution, together with a weight that says how strongly it should count when combined with other distributions.
Another example of this is Markov processes. The unnormalised analog of a discrete time Markov chain is iteratively multiplying by a not-necessarily-stochastic matrix with nonnegative entries. In such a process the future can affect the past, in that multiplying the whole chain by an unnormalised matrix can affect the relative marginal weights of the initial state. This is weird if we're modelling a classical stochastic process, but it makes a lot of sense if we're using it to model a growing biological population, because you're effectively weighting each member of the initial population by its number of descendants.
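The population reading above can be sketched numerically. All the numbers here are made up; the point is just that weighting each initial type by its descendant count skews the initial marginal, even from a uniform start:

```python
import numpy as np

# M[j, i] = expected number of type-j offspring per type-i individual
# per generation: nonnegative, but columns need not sum to 1.
M = np.array([[1.5, 0.2],
              [0.1, 0.8]])
v0 = np.array([1.0, 1.0])   # start with one individual of each type

# Expected descendants of each initial type after 3 generations:
descendants = np.linalg.matrix_power(M, 3).sum(axis=0)
# Weight each initial member by its descendant count; the marginal over
# initial types is no longer uniform, even though v0 was:
w = v0 * descendants
share = w / w.sum()
print(share)  # the faster-growing type 0 dominates
```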
Then there is the strangeness of the Kullback-Leibler divergence for unnormalised distributions. Amari argues that to make information geometry work, the KL divergence should be extended to

$$D(p \,\|\, q) = \sum_i \left( p_i \log\frac{p_i}{q_i} - p_i + q_i \right)$$

(which can be further generalised to an $$\alpha$$-divergence). I have some vague intuitions about what this is and where it comes from, but it's strange because it doesn't seem to behave in any particularly nice way if you multiply all the probabilities in one of the distributions by a constant amount, and I don't have a good way to interpret the extra terms in terms of information.
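A minimal implementation of that extended divergence, with made-up inputs, also showing the "not nice under scaling" behaviour: scaling one argument by a constant does not simply rescale the divergence.

```python
import math

# Extended KL divergence for unnormalized distributions given as
# equal-length lists of positive reals.
def kl_ext(p, q):
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

# It vanishes when p == q, and the extra -p_i + q_i terms cancel in
# total whenever both inputs sum to the same mass:
print(kl_ext([0.3, 0.7], [0.3, 0.7]))  # → 0.0

# Multiplying one distribution by a constant c is not a simple rescaling:
c = 2.0
p, q = [0.2, 0.5], [0.4, 0.3]
print(kl_ext([c * x for x in p], q) - c * kl_ext(p, q))  # nonzero in general
```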
I found that the KL divergence for unnormalized distributions shows up quite naturally in chemistry...
It's connected to "free energy" and under the right conditions chemical reactions tend to decrease it:
See starting on page 16, where equation (21) defines the KL divergence in this context.
So I think it's a very natural concept, though it's still not understood well enough!
Oh, nice derivation, that's cool. I agree it's natural, I just wish I had a good intuition for what the terms mean.
Yes, me too. It's used in the literature on chemical reaction networks, but I've never seen anyone talk much about what it "means".
Maybe we could say: "entropy" was hard enough for people to understand in the first place that its generalization to non-normalized probability distributions should be expected to require further thought! But so far the people using it have not been like Boltzmann or Shannon or Jaynes - they haven't wanted to talk a lot about what things mean.
Here's one way to get to it that maybe provides a little bit of intuition. This is about as far as I got in trying to work out its meaning.
Let's suppose we have a set of rates for Poisson processes, $$r_1, \dots, r_n$$, and we consider what will happen in the next small time interval $$dt$$. As $$dt \to 0$$, the probability that Poisson process $$i$$ will fire tends to $$r_i\,dt$$, and we also have the much larger probability $$1 - \sum_i r_i\,dt$$ that nothing happens.
But then in Bayesian fashion we get some new information that causes us to update our estimates of the rates to new values $$s_1, \dots, s_n$$. Then we have a new probability distribution for the next time interval, given by $$s_i\,dt$$ and $$1 - \sum_i s_i\,dt$$.
We can use the regular KL divergence to calculate how much information we received about the next time interval. It's given by

$$D = \sum_i s_i\,dt \log\frac{s_i\,dt}{r_i\,dt} + \left(1 - \sum_i s_i\,dt\right)\log\frac{1 - \sum_i s_i\,dt}{1 - \sum_i r_i\,dt}$$

Then we can use $$\log(1 - x) \approx -x$$ and neglect higher-order terms in $$dt$$ to get

$$D \approx dt \sum_i \left( s_i \log\frac{s_i}{r_i} - s_i + r_i \right)$$
So the unnormalised KL arises as a "KL divergence rate" between the two Poisson processes, and the extra terms arise from the term in the regular KL divergence corresponding to nothing happening in the next small time interval.
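The derivation can be checked numerically: the ordinary KL divergence between the two next-interval distributions, divided by $$dt$$, should approach the unnormalised KL of the rate vectors as $$dt \to 0$$. The rate values here are made up.

```python
import math

r = [1.0, 2.0]   # old rate estimates
s = [1.5, 0.5]   # updated rate estimates

def kl(p, q):
    """Ordinary KL divergence between finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_rate(r, s, dt):
    # Distributions over "process i fires in dt" plus "nothing happens":
    p = [si * dt for si in s] + [1 - sum(s) * dt]
    q = [ri * dt for ri in r] + [1 - sum(r) * dt]
    return kl(p, q) / dt

# The unnormalised KL of the rates themselves:
exact = sum(si * math.log(si / ri) - si + ri for si, ri in zip(s, r))
print(abs(kl_rate(r, s, 1e-6) - exact))  # tiny: agreement to first order in dt
```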
This gives me some intuition for the extra terms in cases like this where the probabilities are unnormalised because they represent sub-outcomes of an unlikely event. But I don't feel quite satisfied by it - it seems like there should be a more first-principles way to understand it. (I agree that sort of thing tends to come after the formalism has been worked out.)
@Nathaniel Virgo Un-normalized distributions, often interpreted Boltzmannly as @John Baez wrote, pervade Probabilistic Programming! They often go under the name "score functions". As with your evolution example, they are sometimes also called "weights", e.g. in Importance Sampling.
Do people use the unnormalised KL in those contexts? It would be neat to see it used somewhere if they do.
[I wrote a lot of other stuff here about exponential families and their relationship to unnormalised distributions, but then I realised there's maybe a better way to think about it, so I want to think through it a bit more before posting, probably tomorrow.]
Not that I know of :/ ... I thought AIDE (Auxiliary Inference Divergence Estimator) or something similar would, but after skimming the papers, I think I was wrong.