

Stream: theory: categorical probability

Topic: Improper priors


Nathaniel Virgo (Sep 20 2020 at 15:19):

I've been working through a lot of the stuff I know about probability and inference in a Markov category framework, and seem to be making good progress and generating nice insights.

However, sometimes I need to do things that involve improper priors, meaning measures that can't be normalised, but can become normalisable after applying Bayes' theorem. I'm wondering if there is any work on making improper priors meaningful in a Markov category or probability monad framework.

I realise I could just work in a category of unnormalised measures. But that seems like introducing a lot of heavy machinery that might be unnecessary - intuitively, it seems like it should be possible to define a Markov category (in which delete is still natural), where morphisms can represent improper distributions as well as proper ones. Has anything like this been done? (@Tobias Fritz, do you know?)

Tobias Fritz (Sep 20 2020 at 19:03):

That's a great question! I haven't worked with improper priors before, but @Sam Staton has, so hopefully he can tell us more :smile: One thing that I imagine he will say is that working with unnormalized measures doesn't require a lot of additional heavy machinery. In the synthetic approach of Markov categories, you can just drop the naturality of delete, and most things can still be done with the proper additional care; but I'm sure you know this already.

Sam Staton (Sep 21 2020 at 10:17):

Hi! The category of s-finite kernels works fine for unnormalized/unnormalizable distributions. On the axiomatic side, I think Tobias is right about dropping the naturalness of delete (CD categories?) although there was that conjecture about disintegration/Radon-Nikodym on the CPS stream that we should finish!

Sam Staton (Sep 21 2020 at 10:18):

Actually I came to this more worried about improper posteriors and unnormalizable sub-expressions than priors. In some sense you can end up with these even with proper priors. I wrote a bit more about this starting on page 13 of my chapter here.

Sam Staton (Sep 21 2020 at 10:20):

Also, I have been wondering whether this synthetic or categorical perspective can give a better understanding of what improper priors really mean. For example, in the Rado topos towards the end of my CPS talk, there is a “measure” that is a bit like the uniform distribution on the nodes of the Rado graph, and yet it still forms a Markov category.

Nathaniel Virgo (Sep 22 2020 at 00:56):

Thank you both. My worry about introducing unnormalised kernels was just that, having got used to working in a Markov category context where delete is natural, I'm unsure how much I'd have to change to make everything still work if it's not. It could be that it's not very much after all, I haven't really thought about it that much. It seems like you'd need an explicit normalising operation though, and ideally a way to reason with it in string diagrams.

It still seems like it might not be necessary though. For example, I'm wondering if we can define a Markov category where instead of demanding the kernels be normalised, we make the morphisms equivalence classes of kernels (let's say s-finite ones), where two kernels are considered the same if they're the same up to normalisation, sort of like how states in quantum mechanics are defined as rays in Hilbert space instead of normalised vectors. I'm imagining doing that in such a way that "proper" kernels would behave the way they normally do, while improper ones would also exist and behave reasonably. I'm not sure how the tensor product should be defined in this case though.

Do your intuitions say something like that should be possible or impossible? If it works it might be a neat way to get towards Sam's "what are they really" question.

Sam Staton (Sep 22 2020 at 08:33):

Hi, that's a curious idea, but I suspect it wouldn't work well with composition. Here's an example based on importance sampling. Let $p$ and $q$ be probability measures on $\mathbb{R}$ with densities $f$ and $g$ respectively, and suppose $p$ has full support. For a function $h: \mathbb{R} \to \mathbb{R}$, let $\mathrm{weight}(h): \mathbb{R} \to \mathbb{R}$ be the measure kernel given by $\mathrm{weight}(h)(r)(U) = h(r)$ if $r \in U$, and $0$ otherwise. So it's like dirac, but unnormalized. Then the composite of kernels

$$1 \xrightarrow{\ p\ } \mathbb{R} \xrightarrow{\ \mathrm{weight}(1/f)\ } \mathbb{R} \xrightarrow{\ \mathrm{weight}(g)\ } \mathbb{R}$$

is the same as $q$, hence normalized. But the composite of the first two morphisms is not normalizable: it is Lebesgue measure.
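Spelling this out (using that $p$ has density $f$ with full support): the composite of the first two kernels sends an event $U$ to

$$\int_U \frac{1}{f(r)}\, p(\mathrm{d}r) = \int_U \frac{f(r)}{f(r)}\, \mathrm{d}r = \mathrm{Leb}(U),$$

and weighting Lebesgue measure by $g$ then gives $U \mapsto \int_U g(r)\, \mathrm{d}r = q(U)$.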

Nathaniel Virgo (Sep 22 2020 at 08:46):

Is that a problem? It seems like this example is the kind of thing we would like to allow. Or I would like to, anyway, if it can be done without causing contradictions. It's actually quite nice if you can have a kernel from $1$ to $\mathbb{R}$ that represents Lebesgue measure; the question is just whether we can have that while still being in a Markov category (i.e. while still having delete be natural).

Nathaniel Virgo (Sep 22 2020 at 08:53):

I guess by "the same up to normalisation" what I mean is something along the lines of: we consider measures $f$ and $g$ the same if there exists some $\lambda \in (0, \infty)$ such that, for all events $U$ in the sample space, $f(U) = \lambda g(U)$. So two measures can be the same up to normalisation without being normalisable. For kernels we probably need $\lambda$ to be able to depend on the parameters.

Sam Staton (Sep 22 2020 at 08:57):

Thanks. My concern is that in my example, $\mathrm{weight}(1/f)$ and $\mathrm{weight}(g)$ would be equated with dirac, but you wouldn't want to equate $p$ and $q$.
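Concretely: if $\lambda$ may depend on the parameter, then $\mathrm{weight}(h)(r) = h(r)\,\delta_r$ is identified with $\delta_r = \mathrm{dirac}(r)$ by taking $\lambda(r) = h(r)$ (at least where $h > 0$). So the composite above would be equated with $p$ followed by two copies of the identity, i.e. with $p$ itself, even though the actual composite is $q$.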

Nathaniel Virgo (Sep 22 2020 at 08:58):

Ah, I got it. Thanks, that's a good point. I guess composition would have to be defined so they behave like identities, not like weightings. Let me think about that.

Tobias Fritz (Sep 22 2020 at 10:00):

Here are some (slightly off-topic) comments on the naturality of delete. I'm very much on board with @Nathaniel Virgo's idea here, assuming it is possible at all. The naturality of delete is so convenient in practice that I wouldn't want to give up on it unless strictly necessary. On the other hand, dropping it does seem to be necessary to formalize some aspects of measure-theoretic probability which are often considered central, such as the treatment of densities and the Radon-Nikodym theorem (see the discussion Sam linked to). That said, I also have some ideas on how to treat densities even with delete being natural, by defining a density to be a deterministic morphism to a suitable "real number object", but so far it's only a vague idea, and I doubt that such a treatment would be as elegant as taking a density to be simply a morphism to $1$, which is what happens without naturality of delete.

So perhaps the situation is a bit like with symmetric vs braided monoidal categories: most things which can be done with SMC's can be done more generally for BMC's, and in some contexts (such as knot theory) one really needs the greater generality of BMC's. But many people still work with SMC's in practice because of the greater convenience. So we end up having two separate but closely related formalisms. My current impression is that something like this is happening with Markov categories and the naturality of delete as well.

I wanted to mention this also in order to try and get some input from you on what a good naming scheme would be. The term "Markov category" seems to have caught on, which I'm very happy about, and (at least currently) the naturality of delete is included in the axioms. With this axiom dropped, I've recently seen the term "CD category" being used (going back to Cho and Jacobs). But probably it would be better to uniformize the terminology a bit, no? So what would be a good pair of terms which makes clear that there are two slightly different but closely related concepts? And which one of these two should the plain "Markov category" stand for, if any? One possibility could be to say "Markov category" for the non-natural-deletion case, and "affine Markov category" or "normalized Markov category" for the natural-deletion case. Or something the other way around. Any thoughts on this? (We can move this to a separate thread if it's too distracting here.)

Nathaniel Virgo (Sep 22 2020 at 10:24):

I'd suggest not changing the meaning of "Markov category": the convention that delete is natural seems fairly well established now, and changing it would be confusing.

But the distinction between normalised and unnormalised kernels reminds me of the distinction between directed graphical models (i.e. Bayesian networks) and undirected graphical models (i.e. Markov random fields), and the physicist in me tends to think of an unnormalised measure as being something like a factor in a Gibbs measure. This makes me want to very tentatively suggest "Gibbs category" as the name for a copy-delete category in the context of probability applications.

Nathaniel Virgo (Sep 22 2020 at 10:41):

I also think "Markov category" and "unnormalised Markov category" work quite well as terms.

Sam Staton (Sep 22 2020 at 10:45):

I'm fine with unnormalized Markov category too.
As you probably know, in the unnormalized setting there is a related notion which has been called "synthetic measure theory": a ccc (cartesian closed category) with countable sums and products, and a commutative monad $M$ such that $M$ takes sums to products.
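To make this concrete, here is a minimal sketch (my own illustration, with made-up names, not taken from any of the papers mentioned) of the unnormalized-measure monad on discrete spaces, where `bind` integrates a kernel against a measure and commutativity amounts to Fubini for finite sums:

```python
# Unnormalized finite measures as lists of (value, weight) pairs:
# a toy model of the monad underlying synthetic measure theory.

def unit(x):
    # Dirac measure at x, with total mass 1.
    return [(x, 1.0)]

def bind(measure, kernel):
    # Integrate the kernel against the measure: weights multiply,
    # so total mass need not be preserved (delete is not natural).
    return [(y, w * v) for (x, w) in measure for (y, v) in kernel(x)]

def weight(h):
    # Sam's weight(h): like dirac at x, but carrying mass h(x).
    return lambda x: [(x, h(x))]

def total_mass(measure):
    return sum(w for (_, w) in measure)

# A proper measure pushed through a weighting becomes unnormalized:
p = [(0, 0.5), (1, 0.5)]
q = bind(p, weight(lambda x: 4.0 * x))
print(q, total_mass(q))  # [(0, 0.0), (1, 2.0)] 2.0
```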

Richard Samuelson (Feb 13 2023 at 08:24):

@Nathaniel Virgo

In the hypergraph category $\mathrm{Cond}(\mathrm{GaussEx})$, which has uninformative priors, morphisms

$$d: m \to n$$

are energy functions $d: \mathbb{R}^{m + n} \to \mathbb{R}$ in a Gibbs measure. An uninformative prior

$$d: 0 \to m$$

is the constant function

$$d(*, x) = 0.$$

I don't know physics, so this connection baffles me. Where are you getting your intuition?

Nathaniel Virgo (Feb 13 2023 at 08:26):

I don't know what Cond(GaussEx) is - can you give a definition or reference to where it's defined?

Richard Samuelson (Feb 13 2023 at 08:30):

It's defined in these papers via an equivalence class.

I noticed that you can also think of morphisms as convex energy functions.

Richard Samuelson (Feb 13 2023 at 08:40):

It is a little late, but I can go into more detail tomorrow.

Richard Samuelson (Feb 13 2023 at 18:26):

The idea is this:

@Tobias Fritz defines in this paper a category $\mathrm{Gauss}$ of general linear models / Gaussian conditionals. We want to perform Bayesian inference in this category, so we embed it into a hypergraph category $\mathrm{Cond}(\mathrm{GaussEx})$ and employ the caps to observe values.

A morphism $d: m \to n$ in $\mathrm{Cond}(\mathrm{GaussEx})$ can be thought of as an equivalence class of convex functions

$$E: \mathbb{R}^{m + n} \to [-\infty, \infty],$$

where

$$E_1 \sim E_2 \iff E_1 = E_2 + \alpha \text{ for some } \alpha \in \mathbb{R}.$$

We think of any function in this equivalence class as being the energy function of a Gibbs measure

$$p(x, y) = \frac{1}{Z} e^{-E(x, y)},$$

the support of which is the set

$$\{(x, y) \mid E(x, y) < \infty\}.$$

In particular, if $\psi: 0 \to n$ is a morphism in $\mathrm{Gauss}$ (i.e. a Gaussian distribution), then we embed it into $\mathrm{Cond}(\mathrm{GaussEx})$ by mapping it to its associated energy function.
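For instance (my own illustration, assuming $\Sigma$ is invertible), $\mathcal{N}(\mu, \Sigma)$ on $\mathbb{R}^n$ would map to the energy function

$$E(*, x) = \tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu),$$

with the $\log Z$ term dropped, since additive constants are absorbed by the equivalence relation.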

An uninformative prior $d: 0 \to n$ is given by the constant energy function

$$E(*, x) = 0.$$

A failure state $\bot: 0 \to n$ is given by the constant energy function

$$E(*, x) = -\infty.$$

Composition occurs via multiplication of convex bifunctions:

$$(EF)(x, z) = \inf\,\{\, E(x, y) + F(y, z) \mid y \,\}.$$
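As a sanity check (my own computation, not from the papers): take $E(x, y) = \tfrac{1}{2}(y - x)^2$ and $F(y, z) = \tfrac{1}{2}(z - y)^2$, the energies of two unit-variance Gaussian conditionals. The infimum over $y$ is attained at $y = (x + z)/2$, giving

$$(EF)(x, z) = \tfrac{1}{4}(z - x)^2,$$

the energy of a variance-$2$ Gaussian conditional, which is exactly what composing the two kernels in $\mathrm{Gauss}$ should give; the dropped normalization constants disappear into the equivalence relation.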

Richard Samuelson (Feb 13 2023 at 18:38):

Incidentally, I have been working on a Julia implementation of this category: https://github.com/samuelsonric/AlgebraicInference.jl