Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.


Stream: theory: applied category theory

Topic: Category of probability distributions


view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 08:15):

I've been working with generative testing in Haskell a bit, and there are some strange patterns that keep cropping up that I think might be accounted for using category theory. It seems to me that there is such a thing as a category of "sets equipped with probability distributions". An object of this category consists of a set $A$ and a probability function:

$$\rho : A \to \mathbb{R} \mid \sum_{a \in A} \rho(a) = 1$$

A morphism in this category is then a function between the underlying sets $f : A \to B$ such that:

$$\forall b \in B. \rho(b) = \sum \rho(f^{-1})$$

There is an obvious identity function, and associativity feels like it should work, although I haven't yet proved it works (if you can see how it already falls apart here, please let me know!).

To take a simple example, we might consider uniform distributions on the integers and booleans, and a morphism between them is the function which determines whether an integer is even.
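Since there is no uniform distribution on all the integers, a finite analogue of this example can be sketched in code. The following Python fragment is purely illustrative (the sets, names, and `Fraction`-valued distributions are mine, not anything from a Haskell testing library): it checks the morphism condition that each codomain element's probability equals the total mass of its preimage.

```python
from fractions import Fraction

def is_measure_preserving(f, rho_a, rho_b):
    """Check that each b's probability equals the total mass of its preimage."""
    return all(
        rho_b[b] == sum(p for a, p in rho_a.items() if f(a) == b)
        for b in rho_b
    )

rho_a = {n: Fraction(1, 6) for n in range(6)}          # uniform on {0,...,5}
rho_b = {True: Fraction(1, 2), False: Fraction(1, 2)}  # uniform on booleans

def is_even(n):
    return n % 2 == 0

assert is_measure_preserving(is_even, rho_a, rho_b)    # parity is a morphism
```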

Building further on the house of cards, it seems to me that this category is monoidal in a couple of interesting ways.

Given two sets with probability distributions, we have a probability distribution on the cartesian product of the two sets, such that the probability of any pair is the product of the probabilities assigned to its elements (i.e. the joint probability of independent events). The singleton set has an obvious probability distribution that will have no effect when combined in this manner with the probability distribution of any other set. Given an allocation of privileged probability distributions to sets, we end up with a lax monoidal functor from a subcategory of sets to this category of distributions, cohering the product structure in $\mathbf{Set}$ with this monoidal structure of joint probabilities and cartesian products.
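The joint-probability product just described can be sketched as follows (an illustrative Python fragment assuming finite supports; the distributions are made up for the example):

```python
from fractions import Fraction

def product_dist(p, q):
    """Joint distribution of independent p and q on the cartesian product."""
    return {(a, b): pa * qb for a, pa in p.items() for b, qb in q.items()}

coin = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}
die  = {k: Fraction(1, 3) for k in (1, 2, 3)}

joint = product_dist(coin, die)
assert sum(joint.values()) == 1      # the product is still normalised

# The singleton's distribution acts as a unit: it only relabels outcomes.
unit = {(): Fraction(1)}
assert sum(product_dist(unit, coin).values()) == 1
```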

The disjoint union of two sets can also be assigned a probability distribution, in which the probability of any element is its original probability scaled by the size of its containing set as a fraction of the size of the disjoint union. Here I run into some difficulty because the unit would presumably be the empty set, but there's no way to construct a sensible probability function on it (i.e. there is no way to make it integrate to 1). I tried tweaking the concept to assign probability to subsets instead of the elements of the set, which gives me a way to get a probability function for the empty set, but that creates a big muddle out of everything else. Do you think there's a way to salvage the idea?
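The rescaled disjoint union can likewise be sketched (illustrative Python, finite supports assumed). Note how the empty-set problem mentioned above shows up concretely: when both sets are empty, the scaling factor is a division by zero.

```python
from fractions import Fraction

def scaled_disjoint_union(p, q):
    """Disjoint union, with each side's mass scaled by its relative size.
    Tags (0, _) and (1, _) keep the two copies disjoint."""
    nx, ny = len(p), len(q)
    total = nx + ny  # division by zero when both sets are empty: no unit!
    out = {(0, a): Fraction(nx, total) * pa for a, pa in p.items()}
    out.update({(1, b): Fraction(ny, total) * qb for b, qb in q.items()})
    return out

d4 = {k: Fraction(1, 4) for k in range(1, 5)}
d6 = {k: Fraction(1, 6) for k in range(1, 7)}
mix = scaled_disjoint_union(d4, d6)
assert sum(mix.values()) == 1
# Two uniform distributions combine to a uniform one on the union:
assert all(v == Fraction(1, 10) for v in mix.values())
```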

That's as far as I got before I thought I'd ask here. Is this idea something that is already well studied, either abstractly or in the specific context of generative testing?

view this post on Zulip Exequiel Rivas (Apr 29 2020 at 08:33):

One thing that people do is to work with the "subdistribution monad", i.e. sum bounded by 1, instead of equal to 1.
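Concretely, the relaxation replaces "sums to exactly 1" with "sums to at most 1", which makes the empty set unproblematic. A minimal sketch, assuming finite supports:

```python
from fractions import Fraction

def is_subdistribution(rho):
    """Sub-probability distribution: nonnegative mass summing to at most 1."""
    return all(p >= 0 for p in rho.values()) and sum(rho.values()) <= 1

assert is_subdistribution({})                       # empty set: total mass 0
assert is_subdistribution({'a': Fraction(1, 3)})    # mass may be "lost"
assert not is_subdistribution({'a': Fraction(3, 2)})
```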

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 08:50):

I initially assumed you were talking about probability-preserving functions, but I realize I do not understand your notation. You wrote a condition that says "$f:A\to B$ such that: ...'' but in the ... you just wrote a set without any condition. What is your condition?
In my response that follows, I will assume you meant to say something like $\rho(b)=\sum_{a\in f^{-1}(b)}\rho(a)$ (that's what it means to be measure-preserving for finite sets). Some of these ideas are described in a very readable paper by Baez, Fritz, and Leinster https://arxiv.org/abs/1106.1791 (that's where I first learned about this stuff). In particular, they describe a "disjoint union" structure similar to yours in terms of what they call "convex combinations." They call this category FinProb. I don't think they talk about the tensor product, but this is discussed in other papers, such as Fritz https://arxiv.org/abs/1908.07021.
Some points:

  1. If you're not working with finite sets, then you have to be careful about what you mean by uniform distribution. There is no uniform probability measure on the integers. And if you're working with infinite sets, you're probably using the language of measurable spaces (and then, yes, you are assigning probabilities to subsets and this leads one to the study of measure spaces).
  2. The empty set is a perfectly fine set and has a natural measure on it (zero), but no probability measure, so there is no unit for it in your category. @Exequiel Rivas provided a reasonable solution: use sub-probability measures.
  3. I'm going to assume you're working with finite sets for simplicity. Is your functor well-defined? If you have two sets X and Y and a function from X to Y, how do you guarantee that if you assign probability measure p to X and q to Y that f pushes p to q as well? This won't work even if you choose uniform distributions. For example, take X={1,2,3} and Y={1,2}. Although there are many functions from X to Y, none of them get sent to anything in your category. Maybe I'm misunderstanding your morphisms?
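Point 3 can be checked by brute force. The following illustrative Python fragment enumerates all eight functions from X={1,2,3} to Y={1,2} and confirms that none of them pushes the uniform distribution on X to the uniform distribution on Y (the preimage masses are multiples of 1/3 and can never equal 1/2):

```python
from fractions import Fraction
from itertools import product

X = (1, 2, 3)
Y = (1, 2)
p = {x: Fraction(1, 3) for x in X}   # uniform on X
q = {y: Fraction(1, 2) for y in Y}   # uniform on Y

def preserves(f):
    """Does f push p forward to q?"""
    return all(q[y] == sum(p[x] for x in X if f[x] == y) for y in Y)

# All 2^3 = 8 functions X -> Y, represented as dicts.
functions = [dict(zip(X, images)) for images in product(Y, repeat=len(X))]
assert not any(preserves(f) for f in functions)
```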

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 09:05):

Sorry about that. The condition I meant to describe in plain English is that the probability assigned to any element of the codomain is the sum of the probabilities assigned to the elements of its preimage.

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 09:06):

Perfect, then that's exactly the same as the condition I wrote, so I believe what I said applies.

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 09:07):

Thanks. I have to go to bed for now, but I'll go through what you said tomorrow.

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 09:14):

Cool. Feel free to ask questions or if something I said needs clarification. Btw, I like the idea of rescaling the probability measures by the size of the set. Even if it doesn't define a monoidal product on the category, it might still define an associative product... (I need to think about it).

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 09:41):

Based on some rough calculations, it seems that your disjoint union assignment $\mathbf{FinProb}\times\mathbf{FinProb}\to\mathbf{FinProb}$ is not a functor. It is associative on objects, but look at what happens if you take two probability-preserving functions $f:(X,p)\to(X',p')$ and $g:(Y,q)\to(Y',q')$. You need the function $f\amalg g:X\amalg Y\to X'\amalg Y'$ ($\amalg$ is my notation for disjoint union) to send the probability measure $\frac{|X|}{|X|+|Y|}p+\frac{|Y|}{|X|+|Y|}q$ on $X\amalg Y$ to $\frac{|X'|}{|X'|+|Y'|}p'+\frac{|Y'|}{|X'|+|Y'|}q'$ on $X'\amalg Y'$. I don't think this will work in general even if $f$ and $g$ were probability-preserving, unless $|X'||Y|=|X||Y'|$, i.e. $\frac{|X'|}{|X|}=\frac{|Y'|}{|Y|}$ (which, now that I think about it, makes intuitive sense). To see this, use the probability-preserving condition to see what the probabilities of $x'\in X'$ and $y'\in Y'$ are.
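The failure can be witnessed on a small instance. In the illustrative Python sketch below, f collapses a uniform two-element set to a point and g is the identity on a one-element set; both are probability-preserving, but the size ratios differ (2 vs. 1), so the disjoint union of the two maps does not preserve the rescaled measures:

```python
from fractions import Fraction

def pushforward(f, p):
    """Image measure of p along the function f (given as a dict)."""
    out = {}
    for a, pa in p.items():
        out[f[a]] = out.get(f[a], Fraction(0)) + pa
    return out

def scaled(p, q):
    """Rescaled disjoint-union measure, with tags 0 and 1."""
    n, m = len(p), len(q)
    d = {(0, a): Fraction(n, n + m) * pa for a, pa in p.items()}
    d.update({(1, b): Fraction(m, n + m) * qb for b, qb in q.items()})
    return d

p, p2 = {1: Fraction(1, 2), 2: Fraction(1, 2)}, {'*': Fraction(1)}
q, q2 = {1: Fraction(1)}, {1: Fraction(1)}
f, g = {1: '*', 2: '*'}, {1: 1}          # both probability-preserving

fg = {(0, a): (0, f[a]) for a in p} | {(1, b): (1, g[b]) for b in q}
# The pushforward gives 2/3 on the X' side, but the target measure has 1/2.
assert pushforward(fg, scaled(p, q)) != scaled(p2, q2)
```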

view this post on Zulip Jules Hedges (Apr 29 2020 at 09:50):

You can start with the category of conditional distributions (i.e. the Kleisli category of the finite-support probability monad on $\mathbf{Set}$) and then take its coslice category under the 1-element set. Objects are sets together with a probability distribution (since Kleisli morphisms $1 \to X$ are the same as distributions on $X$), and morphisms are conditional distributions $X \to Y$ satisfying the condition that $\mathbb{P}[y] = \sum_x \mathbb{P}[y \mid x]\,\mathbb{P}[x]$. I believe you have the subcategory of this where you require morphisms to be deterministic, i.e. $f : X \to D(Y)$ factors through the Dirac unit $\delta : Y \to D(Y)$
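A finite-support version of this Kleisli category is easy to sketch in Python (all the names and example distributions below are illustrative): morphisms are maps into distributions, composed by summing over intermediate outcomes, and a distribution on X is exactly a Kleisli map out of the one-element set.

```python
from fractions import Fraction

def dirac(x):
    """Unit of the finite-support distribution monad."""
    return {x: Fraction(1)}

def kleisli(f, g):
    """Compose f : A -> D(B) with g : B -> D(C) by marginalising over B."""
    def h(a):
        out = {}
        for b, pb in f(a).items():
            for c, pc in g(b).items():
                out[c] = out.get(c, Fraction(0)) + pb * pc
        return out
    return h

# An object of the coslice category: a Kleisli map 1 -> X, i.e. a distribution.
point = lambda _: {'rain': Fraction(1, 4), 'sun': Fraction(3, 4)}
# A conditional distribution X -> Y.
mood = lambda w: ({'sad': Fraction(1, 2), 'ok': Fraction(1, 2)}
                  if w == 'rain' else dirac('ok'))

marginal = kleisli(point, mood)(())
assert marginal == {'sad': Fraction(1, 8), 'ok': Fraction(7, 8)}
```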

view this post on Zulip Jules Hedges (Apr 29 2020 at 09:53):

If that's right then you probably get a symmetric monoidal structure by general abstract nonsense, since conditional distributions has a symmetric monoidal structure (among other things) and it most likely passes through those 2 constructions

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 10:25):

@Jules Hedges , I think @Asad Saeeduddin is trying to work with two monoidal structures. One is the one I think you're describing (which I think @Asad Saeeduddin already understands---this is the product of probability distributions). But there's another one he was trying to formulate, which involves appropriately rescaled disjoint unions. The category of conditionals does have a disjoint union, but this is not what I believe @Asad Saeeduddin was attempting to construct. The general abstract nonsense argument doesn't apply in this case because he is explicitly using the probability measures in his definition of the "rescaled disjoint union" (it does not come from a monoidal product in the category of conditionals).

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 10:59):

@Jules Hedges Out of curiosity, what happens to the disjoint union in the coslice category? Maybe you can elaborate on the abstract nonsense, because I think one needs some additional assumptions to apply it. Let's call the monoidal product on the category of conditionals $\boxtimes$ (it can be either cartesian product or disjoint union). Given objects $(X,p)$ and $(Y,q)$ in $\mathbf{FinProb}$ (objects in your coslice cat), how is $(X,p)\boxtimes(Y,q)$ defined? One requires a natural map $\{\bullet\}\to\{\bullet\}\boxtimes\{\bullet\}$ satisfying certain conditions in order to define the composite $\{\bullet\}\to\{\bullet\}\boxtimes\{\bullet\}\to X\boxtimes Y$ to get an object. In the case of cartesian product, you have such a map and it's unique. In the case of disjoint union, such a map corresponds to a probability measure on a 2-element set. I believe there is no such probability measure that will turn this into a monoidal product in the coslice category. There are two choices to get an associative product (the two Dirac distributions), but it will not be symmetric with either of those choices. There is only one choice that will give a symmetric product (uniform distribution), but it will not be associative (hmm... reminds me of some operations in algebraic topology!). One of the interesting things is that @Asad Saeeduddin suggested an idea that's at least associative and symmetric on objects. However, it is only a partially defined operation.

view this post on Zulip Nathaniel Virgo (Apr 29 2020 at 11:00):

It's hard to conceptualise the disjoint union of two probability distributions. An element of the sample space is either an event in A or an event in B. The question is what should the probability of such an event be, and how can we interpret it probabilistically?

You could say "flip a coin, then sample from A if it's heads and from B if it's tails." Then for an event in A we would have $p_{A+B}(a) = \frac{1}{2}p_A(a)$, and for an event in B we'd have $p_{A+B}(b) = \frac{1}{2}p_B(b)$. But I think you perceived that this is a little arbitrary - why choose an unbiased coin instead of a biased one? A general (possibly biased) coin would give $p_{A+B}(a) = (1-\lambda)p_A(a)$ and $p_{A+B}(b) = \lambda p_B(b)$, for some $\lambda\in [0,1]$, and the question is how to choose $\lambda$.

From what I understood above, you want to use a biased coin where the bias is proportional to the size of the sample space, so $\lambda = |B|/(|A| + |B|)$. This may make sense in some cases, but it's also somewhat arbitrary - why should we be 3/2 times more likely to roll a D6 than a D4, and how should we handle infinite supports?
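One way to see the arbitrariness is that every choice of $\lambda$ yields a perfectly good normalised mixture; the size-proportional choice is just one point in that family. An illustrative check (the D4/D6 distributions are made up for the example):

```python
from fractions import Fraction

def mixture(lam, p_a, p_b):
    """Flip a lam-biased coin; sample from p_b on heads, p_a on tails."""
    d = {(0, a): (1 - lam) * pa for a, pa in p_a.items()}
    d.update({(1, b): lam * pb for b, pb in p_b.items()})
    return d

d4 = {k: Fraction(1, 4) for k in range(1, 5)}
d6 = {k: Fraction(1, 6) for k in range(1, 7)}

# Every bias in [0, 1] gives a normalised distribution on the disjoint union.
assert all(sum(mixture(Fraction(k, 10), d4, d6).values()) == 1
           for k in range(11))
```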

However, if you instead choose a category where the objects are spaces of probability distributions, then this arbitrariness can be avoided: you just have one distribution in your space for every possible value of $\lambda$.

This might be a good time to mention this great talk that @Sam Tenka (naive student) gave, which I found online and have watched several times. He defines a category where the objects are spaces of probability distributions, in such a way that what I just described is the coproduct. The category also has a Cartesian product, which is the space of all distributions of the form $p_{A\times B}(a,b) = p_A(a)p_B(b)$ (i.e. where they're independent), and a tensor product, which I believe is the space of all possible joint distributions $p(a,b)$. (My description here is kind of rough and possibly not quite right - it's better to watch the talk for the details.) It comes from thinking about probability distributions not as normalised measures but as points in convex sets, which live in a general category of convex spaces. It's an amazingly cool idea.

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 11:29):

@Nathaniel Virgo Although these resources don't describe everything Sam does in his talk, you might find Chapter 1 in @Paolo Perrone's PhD thesis (http://www.paoloperrone.org/phdthesis.pdf) a nice read. Particularly, the relationship between affine maps of convex spaces and stochastic maps is discussed. There's also some overlap in @Tobias Fritz' convex spaces paper (https://arxiv.org/abs/0903.5522). Though I do agree, I definitely learned things in Sam's talk that I did not learn elsewhere.

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 14:56):

@Arthur Parzygnat Thanks for investigating this further. I don't quite understand your argument about the functor. In particular, I don't understand how the sum of probability functions arises. An element in the codomain of $f \amalg g : X \amalg Y \to X' \amalg Y'$ is either an element of $X'$ or an element of $Y'$. If it is $x' \in X'$, we define its probability to be $\frac{|X'|}{|X'| + |Y'|} p'(x')$. If on the other hand it is $y' \in Y'$, we define its probability to be $\frac{|Y'|}{|X'| + |Y'|} q'(y')$. The requirement on $f \amalg g$, if I'm thinking correctly, is to independently ensure that:

$$\forall x' \in X'.\ \frac{|X'|}{|X'| + |Y'|} p'(x') = \sum \{\, p(x) \mid x \in f^{-1}(x') \,\}$$

and that:

$$\forall y' \in Y'.\ \frac{|Y'|}{|X'| + |Y'|} q'(y') = \sum \{\, q(y) \mid y \in g^{-1}(y') \,\}$$

Am I misunderstanding?

view this post on Zulip Reid Barton (Apr 29 2020 at 15:02):

Take a simple example. Consider the following two functions:

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 15:03):

I think the requirement for probability-preservation is (let me just do it for $f$):
$$\frac{|X'|}{|X'|+|Y'|}p'(x')=\sum_{x\in f^{-1}(x')}\frac{|X|}{|X|+|Y|}p(x)$$
but as you can see, the left-hand side is
$$\frac{|X'|}{|X'|+|Y'|}\sum_{x\in f^{-1}(x')}p(x)$$
which only equals the right-hand side if the relationship I provided earlier holds.

view this post on Zulip Reid Barton (Apr 29 2020 at 15:03):

Then there should be a function $f \amalg g$ from $\{a, b\}$ to $\{a', b', c'\}$, but the probability distributions you defined on these two sets are such that there are no probability-preserving functions between them at all.

view this post on Zulip Reid Barton (Apr 29 2020 at 15:03):

Because on the left you have probabilities like $1/2$, and on the right probabilities like $1/3$ and $2/3$ and $0$.

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 15:16):

Right, and as you can also check, @Reid Barton's example does not satisfy the ratio equality (the size ratio between source and target must be the same for both functions).

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:19):

OK, I'll set aside the disjoint union operation for now.

If you're not working with finite sets, then you have to be careful about what you mean by uniform distribution. There is no uniform probability measure on the integers. And if you're working with infinite sets, you're probably using the language of measurable spaces (and then, yes, you are assigning probabilities to subsets and this leads one to the study of measure spaces).

What I'm working with is inductive data (i.e. what I'm loosely interpreting as sets are actually types). So they need not be finite, but the way in which they are infinite is somewhat restricted. I'm not sure whether you find this relevant to anything though

view this post on Zulip Oliver Shetler (Apr 29 2020 at 15:21):

Can you provide a formal definition of inductive data and types in this context?

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:22):

The empty set is a perfectly fine set and has a natural measure on it (zero), but no probability measure, so there is no unit for it in your category. @Exequiel Rivas provided a reasonable solution: use sub-probability measures.

I haven't managed to understand much about measures yet, but based on my cursory reading: wouldn't $p(\{\}) = 1$ be a valid probability measure for $\{\}$ (with $\{\{\}\}$ being the $\sigma$-algebra on the sample space $\{\}$)?

view this post on Zulip Reid Barton (Apr 29 2020 at 15:24):

No, because $P(\emptyset) = 0$ is an axiom of a measure.

view this post on Zulip Oliver Shetler (Apr 29 2020 at 15:26):

Could the unit just be a set with one element?

view this post on Zulip Arthur Parzygnat (Apr 29 2020 at 15:26):

Btw, the reason it is an axiom is because the union of the empty set with itself is still the empty set. If $p(\varnothing)=1$ then $1=p(\varnothing)=p(\varnothing\cup\varnothing)=p(\varnothing)+p(\varnothing)=1+1=2$, which would only allow $p(\varnothing)=\infty$, and then every set has infinite measure, and you generally don't want to do that.

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:29):

@Oliver Shetler There is some explanation here: https://ncatlab.org/nlab/show/inductive+type if you're interested. I'm not an expert in type theory either, so I wouldn't want to try and explain it

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:30):

Could the unit just be a set with one element?

Which operation are you considering the unit of?

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:33):

@Arthur Parzygnat

Is your functor well-defined?

Good point, it is not. I think I had the functor backwards: there is at least a functor that merely forgets the extra probabilistic equipment, no?

view this post on Zulip Oliver Shetler (Apr 29 2020 at 15:36):

Which operation are you considering the unit of?

Oops, I was thinking of products but I see you were talking about disjoint unions.

view this post on Zulip Oliver Shetler (Apr 29 2020 at 15:38):

The sub-probability measures fix seems semantically sound though. If the empty set canonically represents an impossible outcome, then the union of two impossible outcomes also ought to be impossible. That's how Coecke and Kissinger's Quantum string diagrams also handle it.

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:42):

Hmm. I hadn't thought of this before, but in the context of generative testing it is actually the case that sampling may fail entirely, and a generator that always fails acts as the unit of the append operation I am trying to reconcile with a lax monoidal functor structure. This is perhaps another pointer that I really should be working with sub-probability measures.

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 15:54):

Assuming both a category of sets with full probability measures and a category of sets with sub-probability measures exist, is there some relationship between the two?

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 16:02):

One of the problems I'm trying to solve in the applied domain is that generators that can fail (which, it seems, correspond well to sub-probability measures) are inefficient, since the probability of getting a successful result tends to diminish as generators are compounded. So we end up with generators that we have to run several times in order to get a successful sample.

If there is some systematic way of assigning probability distributions and restricting functions between them such that a proper probability distribution can be assigned to the disjoint union of any two underlying sets (and a probability preserving function to the product of any two probability preserving functions), this would be very useful in the context of a generative testing library.
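The compounding failure described above is easy to simulate. The following Python sketch is a toy model, not any particular Haskell library's API: each component generator fails half the time, so composing three of them succeeds only about an eighth of the time.

```python
import random

def compose_gens(*gens):
    """Run each generator in turn; fail (return None) if any component fails."""
    def gen(rng):
        out = []
        for g in gens:
            v = g(rng)
            if v is None:
                return None
            out.append(v)
        return tuple(out)
    return gen

def half(rng):
    """Toy generator succeeding with probability 1/2."""
    return rng.random() if rng.random() < 0.5 else None

rng = random.Random(0)
trials = 10_000
g = compose_gens(half, half, half)   # success probability (1/2)**3
successes = sum(g(rng) is not None for _ in range(trials))
print(successes / trials)            # close to 0.125
```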

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 16:25):

@Nathaniel Virgo Thank you. The talk is very interesting, but I'm having some trouble imagining a continuous "mixing" function that applies to my use case.

view this post on Zulip Oliver Shetler (Apr 29 2020 at 16:36):

How would this solve the generator problem you mentioned above? By reducing the need for generator composition?

view this post on Zulip Asad Saeeduddin (Apr 29 2020 at 16:37):

Well, if the probability distributions are total, it does not matter how much we compose the generators: sampling cannot fail.

view this post on Zulip Oliver Shetler (Apr 29 2020 at 16:41):

I wish I knew more about the context here. It's a little hard for me to grasp the problem you are trying to solve with disjoint unions. Can you give a concrete example of a case where this problem comes up?