Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive, refer to the same person.


Stream: event: Categorical Probability and Statistics 2020 workshop

Topic: Jun 6: Eigil Rischel's talk


view this post on Zulip Paolo Perrone (Jun 04 2020 at 19:11):

Hey all,
This is the discussion thread of Eigil Rischel's talk, "Introduction to Markov categories".
The talk, besides being on Zoom, is livestreamed here: https://youtu.be/zn7k15sq_TE

view this post on Zulip Paolo Perrone (Jun 04 2020 at 19:24):

Date and time: Saturday, 6 Jun, 13h UTC.

view this post on Zulip Paolo Perrone (Jun 06 2020 at 13:00):

(Just started)

view this post on Zulip Arthur Parzygnat (Jun 06 2020 at 13:59):

@Oliver Shetler (I have the correct Oliver, right?) mentioned this during the talk: "If a statistic is not sufficient, then your estimator depends on the parameter that you don't know, so you get a "catch 22" where you need the parameter for your estimator to work. Sufficiency is just the condition for avoiding this." I'd like to know what this means exactly, which is why I want to keep a record of it here. In particular, what is an estimator?

view this post on Zulip Paolo Perrone (Jun 06 2020 at 14:01):

Does anyone know if there's a nice category generalizing Gauss, where one allows for more general exponential families?

view this post on Zulip Peter Arndt (Jun 06 2020 at 14:02):

@Eigil Rischel Nice talk, thanks!
I have a question on your remark at the end, that it is hard to treat measures with density in a Markov category. You said that the problem is that the comultiplication produces Dirac measures, which have no density. Did you just mean that, because of this, you cannot define a Markov category of measurable spaces where _every_ measure has a density? Or are there more problems?

view this post on Zulip Tobias Fritz (Jun 06 2020 at 14:04):

Hi @Dario Stein! I'd be very interested in seeing the type theory or internal language of Markov categories that you mentioned, if you think that you could sketch it here? And how would you express the existence of conditionals? (Conditionals aren't part of the vanilla definition, but can be imposed as an additional axiom.) Can you have an inference rule for conditioning despite the issue of non-uniqueness of conditionals?

And how would you express additional axioms? I guess in terms of equality judgments? I have two other axioms called "positivity" and "causality" (although the latter may deserve a better name) which are logically in the form of Horn formulas. Would you translate axioms of this type into inference rules for equality judgments in the obvious way?

view this post on Zulip Oliver Shetler (Jun 06 2020 at 14:22):

Arthur Parzygnat said:

Oliver Shetler (I have the correct Oliver, right?) mentioned this during the talk: "If a statistic is not sufficient, then your estimator depends on the parameter that you don't know, so you get a "catch 22" where you need the parameter for your estimator to work. Sufficiency is just the condition for avoiding this." I'd like to know what this means exactly, which is why I want to keep a record of it here. In particular, what is an estimator?

Hi Arthur. Thanks for asking. I was hoping to clarify the meaning of statistical sufficiency for the non-statisticians in the crowd. In this context, an estimator is a function from a sample to the real numbers. For example, if I take the average of my sample, that's an estimator. Another example would be if I take the average, and then compute the sum of the squared differences between each sample point and the average, and then divide by the degrees of freedom (in this case that's n-1 where n is the sample size). That would be an estimator of the variance of an assumed normal distribution.

The reason we call these functions "estimators" is because we are pretty much always trying to use them to estimate a parameter of a distribution. For example, when we take the average of a sample, we are trying to estimate the true mean of the distribution.

So if the goal of an estimator is to learn something about the parameter, it makes sense that we would want to make sure that the estimator does not depend on the parameter as input. This might seem trivial for the example of a mean, but it can happen more often than you'd think with complicated estimators. When that happens, the estimator is useless because you have to assume the parameter in order to use it. This is the "catch 22" I was referring to. You need the parameter to get the estimate, but you want the estimate so you can learn about the parameter.

Statistical sufficiency just says that the parameter is independent from the estimator, so you avoid this "catch 22."
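
Concretely, here's a minimal Python sketch of the two estimators above (just an illustration, not anything from the talk). Note that the unknown parameters never appear among the inputs:

    # Estimators are just functions from a sample (a list of numbers)
    # to the reals; the true mean and true variance are nowhere in sight.

    def mean_estimator(sample):
        return sum(sample) / len(sample)

    def variance_estimator(sample):
        # sum of squared deviations from the sample mean, divided by
        # the degrees of freedom n - 1
        m = mean_estimator(sample)
        n = len(sample)
        return sum((x - m) ** 2 for x in sample) / (n - 1)

    print(mean_estimator([1.0, 2.0, 3.0]))      # 2.0
    print(variance_estimator([1.0, 2.0, 3.0]))  # 1.0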

view this post on Zulip Arthur Parzygnat (Jun 06 2020 at 14:25):

This was incredibly helpful, thank you!

view this post on Zulip Oliver Shetler (Jun 06 2020 at 14:45):

@Eigil Rischel or @Tobias Fritz , have you worked on demonstrating that estimators are unbiased using this categorical framework? Also, what about using the string diagrams in particular?

view this post on Zulip Tobias Fritz (Jun 06 2020 at 15:14):

No, we haven't worked on estimators and unbiasedness yet. But our next speaker @Evan Patterson may have more to say about this! In order for the formalism to be convincing as a complete framework for statistics, I agree that this will have to be done.

view this post on Zulip Dario Stein (Jun 06 2020 at 15:39):

Tobias Fritz said:

Hi Dario Stein! I'd be very interested in seeing the type theory or internal language of Markov categories that you mentioned, if you think that you could sketch it here? And how would you express the existence of conditionals? (Conditionals aren't part of the vanilla definition, but can be imposed as an additional axiom.) Can you have an inference rule for conditioning despite the issue of non-uniqueness of conditionals?

And how would you express additional axioms? I guess in terms of equality judgments? I have two other axioms called "positivity" and "causality" (although the latter may deserve a better name) which are logically in the form of Horn formulas. Would you translate axioms of this type into inference rules for equality judgments in the obvious way?

Hi @Tobias Fritz , I would imagine the internal language of a Markov category pretty much looks like the first-order fragment of a language like ML. So you can write

let x = normal(0,1) in (2*x + 3, x)
or
(2*normal(0,1), normal(0,1))

where normal, *, +, and the constants are just morphisms of Stoch. A slight subtlety is the treatment of tuples, as they denote the tensor product and not the cartesian product. However, unlike in linear logic, we can make use of the comonoids to copy freely and project out from tuples, so it wouldn't look too bad. Unlike in http://www.cs.ox.ac.uk/people/samuel.staton/papers/esop2017.pdf, there is no a priori distinction between deterministic and probabilistic judgements, but it would also be interesting to annotate determinism explicitly, as you get better equations and could make use of e.g. positivity/causality. Properties of the Markov category would feature as admissible equations in some program logic. This is a useful practical view on probabilistic languages anyway, e.g. https://arxiv.org/pdf/1802.09598.pdf.
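
Concretely, here's a rough sampling-semantics reading of the two programs above (a Python sketch under my own toy encoding, nothing official): the comonoid copy in the let-binding reuses a single draw, while the tensor version samples twice, independently.

    import random

    # Hypothetical sampling semantics: a morphism returns a random
    # sample; 'normal' plays the role of the kernel normal : I -> R.
    def normal(mu, sigma):
        return random.gauss(mu, sigma)

    # let x = normal(0,1) in (2*x + 3, x): the comonoid copies the
    # sample x, so both components use the same draw (correlated).
    def program_let():
        x = normal(0, 1)
        return (2 * x + 3, x)

    # (2*normal(0,1), normal(0,1)): a tensor of two independent draws.
    def program_tensor():
        return (2 * normal(0, 1), normal(0, 1))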

The question of a syntax for conditionals is a very good one. I'm currently trying to model it as a separate algebraic effect. Conditioning isn't a morphism in the Markov category, but maybe in a larger category. The question is then whether this larger category still has nice properties (it wouldn't be a Markov category as the conditioning effect is not discardable). Other difficulties are keeping track of supports to deal with the nonuniqueness of disintegrations, and how conditioning interacts with other program equations (in general not well: this is Borel's paradox).

view this post on Zulip Manuel Bärenz (Jun 06 2020 at 16:31):

Paolo Perrone said:

Does anyone know if there's a nice category generalizing Gauss, where one allows for more general exponential families?

That's a great question. Couldn't you take any family of conjugate priors (https://en.wikipedia.org/wiki/Conjugate_prior) and make a Markov category of that?
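
For concreteness, here's the standard Beta-Bernoulli example of conjugacy (a toy Python sketch, not tied to any Markov-categorical construction): conditioning on Bernoulli observations never leaves the Beta family.

    # A Beta(a, b) prior together with n Bernoulli observations
    # (k of them successes) has posterior Beta(a + k, b + n - k),
    # so conditioning stays inside the family.
    def beta_bernoulli_posterior(a, b, observations):
        k = sum(observations)
        n = len(observations)
        return (a + k, b + n - k)

    print(beta_bernoulli_posterior(1, 1, [1, 0, 1, 1]))  # (4, 2)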

view this post on Zulip Tobias Fritz (Jun 06 2020 at 17:19):

I've been wondering that too! And in particular whether conjugate priors give us a Markov category with conditionals. It's not obvious, since the parameter space and the sample space of an exponential family seem to have different kinds of structure, with the parameter space in particular being a vector space. But this is the sort of thing that @Evan Patterson 's lattices of supplies are made for, so perhaps he will be able to say more.

view this post on Zulip Evan Patterson (Jun 06 2020 at 18:36):

Tobias Fritz said:

No, we haven't worked on estimators and unbiasedness yet. But our next speaker Evan Patterson may have more to say about this! In order for the formalism to be convincing as a complete framework for statistics, I agree that this will have to be done.

A statistician might distinguish between three stages of fitting a model:

  1. Specifying a statistical model
  2. Choosing an estimator for the model
  3. Finding an algorithm to compute the estimator

My impression is that most work on CT + stats, certainly including my own, has been about (1), with work on probabilistic programming also being about (3). I completely agree with Tobias that a fully convincing formalism would say something about (2).

One idea I've thought about superficially is introducing an expectation operator $\mathbb{E}$ into the formalism, so that if $M : \mathcal{X} \to \mathcal{Y}$ is a Markov kernel with $\mathcal{Y}$ a vector space, then $\mathbb{E} M : \mathcal{X} \to \mathcal{Y}$ is the deterministic kernel $(\mathbb{E} M)(x) := \mathbb{E}[M(x)]$. This would allow one to express unbiasedness. In terms of statistical theories, one could express that a probability distribution is parameterized by mean, which would often be useful. I think it would be interesting to properly work out this idea.
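
Numerically, one could read $\mathbb{E}$ as a Monte Carlo average (a rough sketch with a made-up kernel, just to fix ideas):

    import random

    # Hypothetical Markov kernel M : R -> R, represented as a sampler.
    def M(x):
        return random.gauss(2 * x, 1.0)

    # Monte Carlo approximation of the deterministic kernel (E M)(x).
    def E(M, n=100_000):
        def EM(x):
            return sum(M(x) for _ in range(n)) / n
        return EM

    print(E(M)(1.0))  # approximately E[M(1.0)] = 2.0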

view this post on Zulip Alex Simpson (Jun 06 2020 at 19:02):

Hi Eigil. I liked your talk. Regarding the formulation of the infinite tensor that I asked about at the end: would an equivalent way of axiomatising this be to say that the category of deterministic maps has countable products, and that the inclusion functor into general maps preserves cofiltered limits of diagrams of deterministic maps consisting of projections between finite products?

view this post on Zulip Oliver Shetler (Jun 06 2020 at 19:10):

In this case, then, the goal would simply be to show that $\mathbb{E}[M(X)] \approx M(\mathbb{E}[X])$?

view this post on Zulip Evan Patterson (Jun 06 2020 at 19:15):

In this case, then, the goal would simply be to show that $\mathbb{E}[M(X)] \approx M(\mathbb{E}[X])$?

I was thinking that given a statistical model $P : \Omega \to \mathcal{X}$ and an estimator $d : \mathcal{X} \to \Omega$, unbiasedness is the equation $\mathbb{E}[P \cdot d] = 1_\Omega : \Omega \to \Omega$.
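
As a sanity check (a toy numerical sketch with a Gaussian location model and the sample mean as estimator), averaging $d(P(\mu))$ over many runs should recover $\mu$:

    import random

    # Model P : Omega -> X, here: n i.i.d. draws from N(mu, 1).
    def P(mu, n=10):
        return [random.gauss(mu, 1.0) for _ in range(n)]

    # Estimator d : X -> Omega, here: the sample mean.
    def d(sample):
        return sum(sample) / len(sample)

    # E[P . d] = 1_Omega: averaging d(P(mu)) over many runs returns mu.
    mu, trials = 3.0, 20_000
    print(sum(d(P(mu)) for _ in range(trials)) / trials)  # ~ 3.0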

view this post on Zulip Oliver Shetler (Jun 06 2020 at 19:18):

Evan Patterson said:

In this case, then, the goal would simply be to show that $\mathbb{E}[M(X)] \approx M(\mathbb{E}[X])$?

I was thinking that given a statistical model $P : \Omega \to \mathcal{X}$ and an estimator $d : \mathcal{X} \to \Omega$, unbiasedness is the equation $\mathbb{E}[P \cdot d] = 1_\Omega : \Omega \to \Omega$.

Yes. I was trying to suggest that there is a map to the sample and a deterministic map from the sample to its average. Then, if we take the composite of the two maps as a random variable, its expectation equals the expectation of the distribution we sampled from, so if we go out and back, we get an identity endomorphism.

view this post on Zulip Oliver Shetler (Jun 06 2020 at 19:24):

This could work for any estimator, too. Variance would rely on the same equation.

Another interesting problem is to ask: what does it mean to extend the concepts of a parameter and an estimator to the context of sentences in grammars, with a semantics category such as the vector space of word associations or conceptual spaces (i.e. convex relations, re: Coecke's implementation of Peter Gärdenfors's work)?

view this post on Zulip Oliver Shetler (Jun 06 2020 at 19:26):

For example, can we build a sigma algebra in the semantic space of sentences and define a meaningful notion of the "average" of a sample of sentences? The application I have in mind is to surveys, where a psychologist or social scientist might ask for a one-sentence answer to a fairly closed-ended question. You might not be able to give an "average sentence," but if you could define the "average meaning," then you could define a random variable with a much smaller variance than your sample and generate a few representative sentences that "capture the meaning" of the larger set, insofar as an average "captures the center" of a numerical data set. I'm still not mathematically equipped to really tackle this question, but it's been on my mind for a while.

view this post on Zulip Evan Patterson (Jun 06 2020 at 21:36):

Tobias Fritz said:

I've been wondering that too! And in particular whether conjugate priors give us a Markov category with conditionals. It's not obvious, since the parameter space and the sample space of an exponential family seem to have different kinds of structure, with the parameter space in particular being a vector space. But this is the sort of thing that Evan Patterson 's lattices of supplies are made for, so perhaps he will be able to say more.

Great questions. Until now, I never thought about whether there is a Markov category of exponential families (or maybe some larger class, say including exponential dispersion families). A classic example is that the negative binomial family is a composite of the gamma and Poisson families. (I put this example in Sec 3.1 of my thesis, just for fun.) Still, it seems too good to be true that such a class would be closed under composition in general. Another subtlety is that exponential families have multiple reasonable parameterizations. The canonical parameter belongs to a vector space, but under regularity conditions, the family can be invertibly reparameterized by its mean, which in general only belongs to a convex set. (This provides the "canonical link function" in generalized linear models; see Sec 4.4 of my thesis for examples as statistical theories.) But exponential families play a fundamental role in statistics and machine learning, so it would be great to understand them from a compositional viewpoint, if possible.
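
For what it's worth, the gamma-Poisson composite is easy to check numerically (a sketch assuming numpy's convention that the negative binomial counts failures before r successes):

    import numpy as np

    # With lambda ~ Gamma(shape=r, scale=(1-p)/p) and
    # K | lambda ~ Poisson(lambda), K is negative binomial
    # with parameters (r, p) in numpy's convention.
    rng = np.random.default_rng(0)
    r, p, n = 5.0, 0.4, 200_000

    lam = rng.gamma(shape=r, scale=(1 - p) / p, size=n)
    composite = rng.poisson(lam)                      # gamma, then Poisson
    direct = rng.negative_binomial(r, p, size=n)      # the family itself

    print(composite.mean(), direct.mean())  # both ~ r*(1-p)/p = 7.5
    print(composite.var(), direct.var())    # both ~ r*(1-p)/p**2 = 18.75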

view this post on Zulip Tomáš Gonda (Jun 06 2020 at 22:13):

Evan Patterson said:

One idea I've thought about superficially is introducing an expectation operator $\mathbb{E}$ into the formalism, so that if $M : \mathcal{X} \to \mathcal{Y}$ is a Markov kernel with $\mathcal{Y}$ a vector space, then $\mathbb{E} M : \mathcal{X} \to \mathcal{Y}$ is the deterministic kernel $(\mathbb{E} M)(x) := \mathbb{E}[M(x)]$. This would allow one to express unbiasedness. In terms of statistical theories, one could express that a probability distribution is parameterized by mean, which would often be useful. I think it would be interesting to properly work out this idea.

I am not positive I understand exactly what you are aiming for, but how about something like this:

  1. Rather than considering a vector space, provide $\mathcal{Y}$ with an algebra $e \colon P\mathcal{Y} \to \mathcal{Y}$ for a probability monad $P$, which (deterministically) specifies how formal convex combinations of elements of $\mathcal{Y}$ are evaluated to a particular element of $\mathcal{Y}$.
  2. For a Markov kernel $M \colon \mathcal{X} \to \mathcal{Y}$, define $\mathbb{E}[M] := e \circ PM \colon P\mathcal{X} \to \mathcal{Y}$. This applies the random variable $M$ (or, indeed, any kernel) to elements of the convex mixtures and then evaluates the convex combination within $\mathcal{Y}$.
  3. The unbiasedness of $d \colon \mathcal{X} \to \Omega$ for $p \colon \Omega \to \mathcal{X}$ would then presumably look like $\mathbb{E}[d \circ p] = e_\Omega$?

Evan Patterson said:

I was thinking that given a statistical model $P : \Omega \to \mathcal{X}$ and an estimator $d : \mathcal{X} \to \Omega$, unbiasedness is the equation $\mathbb{E}[P \cdot d] = 1_\Omega : \Omega \to \Omega$.

By $P \cdot d$, do you mean $d \circ P$?
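
To make items 1 and 2 concrete, here is a finite-support toy encoding (my own sketch: distributions as Python dicts, $\mathcal{Y} = \mathbb{R}$ with $e$ the weighted average). Since $e$ is an algebra map, $\mathbb{E}[M]$ reduces to a double average:

    # A finitely supported distribution is a dict {outcome: probability}.
    # Take Y = R with algebra e = evaluating the convex combination.
    def e(dist):
        return sum(p * y for y, p in dist.items())

    # E[M] := e . PM : P(X) -> Y. Since e is an algebra map, this is
    # the double average: first under M(x), then under the input dist.
    def E(M):
        def EM(dist_x):
            return sum(p * e(M(x)) for x, p in dist_x.items())
        return EM

    # Example kernel M(x) = uniform distribution on {x, x+1}.
    M = lambda x: {x: 0.5, x + 1: 0.5}
    print(E(M)({0: 0.25, 2: 0.75}))  # 0.25*0.5 + 0.75*2.5 = 2.0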

view this post on Zulip Evan Patterson (Jun 06 2020 at 22:23):

Ok, nice, that seems like a more systematic way to think about it, which does not require introducing a vector space structure.

And yes, by $P \cdot d$, I meant $d \circ P$. Sorry, I get too used to that notation.

view this post on Zulip Juan Pablo Vigneaux (Jun 06 2020 at 22:49):

Peter Arndt said:

@Eigil Rischel Nice talk, thanks!
I have a question on your remark at the end, that it is hard to treat measures with density in a Markov category. You said that the problem is that the comultiplication produces Dirac measures, which have no density. Did you just mean that, because of this, you cannot define a Markov category of measurable spaces where _every_ measure has a density? Or are there more problems?

@Peter Arndt @Eigil Rischel I wonder if it is not possible to solve this problem by keeping track of the support of the corresponding probability measures, and not only of the density. For instance, if X is a continuous r.v. on R with density f, then (X,X) is a r.v. on R^2 supported on the diagonal, with a density w.r.t. the Hausdorff measure on it. A similar problem appears when disintegrating (conditioning): in any nontrivial case, the disintegrated measures do not have a density with respect to the initial reference measure. I took this approach in Chapters 9-11 of my dissertation https://webusers.imj-prg.fr/~juan-pablo.vigneaux/these , but this is not written in the language of Markov categories (and I don't know if it can be adapted).

view this post on Zulip Juan Pablo Vigneaux (Jun 06 2020 at 22:55):

@Tobias Fritz I also have a general question about this "synthetic" probability theory: many examples were mentioned in the talk, and even more in your paper. If the synthetic viewpoint is fruitful, I would expect proofs of the key theorems in any of these examples to make sense in all the others, once written in the general language. Hence the main theorems should be re-interpretable in terms of the other examples. Do you have this kind of result? Say, what is Kolmogorov's 0-1 law for commutative rings? Etc.

view this post on Zulip Tobias Fritz (Jun 07 2020 at 00:13):

Hi Juan Pablo! Those are great questions. Let me start with a disclaimer: although Markov categories have become one approach for doing synthetic probability theory, there may well be other ones, and my understanding is that @Alex Simpson will be talking about another one on Monday. It's also unclear how far we can actually go in developing probability theory synthetically. For example, reproducing the central limit theorem still seems out of reach at the moment. There's also the caveat that some results which usually are theorems rather become definitions in a synthetic treatment.

That being said, I perfectly agree that all results of synthetic probability theory with Markov categories acquire a much greater generality than the usual one, and that this is worth exploring. For example, one can hope to instantiate synthetic theorems on diagram categories in a Markov category, since these again form Markov categories as sketched in Section 7 of my paper. In particular, this makes the synthetic theorems apply e.g. to stochastic processes.

Our synthetic version of Kolmogorov's 0/1-law indeed applies in any Markov category in which our Kolmogorov products exist. In the paper, we've applied it to a certain Markov category of topological spaces, and obtained the following as a special case:

Theorem. Let $(X_i)_{i \in J}$ be a family of topological spaces, $Y$ a Hausdorff space, and let $f : \prod_i X_i \to Y$ be a continuous function which is independent of any finite subset of its arguments. Then $f$ is constant.

This is also quite easy to see directly, but it may give a flavour of the kind of statement that our result instantiates to. I don't know what it instantiates to for commutative rings, and that sounds like an interesting question! Generally we're still very much at the beginning with all of this: only very few theorems have been developed synthetically, and those that have, have been instantiated in only very few Markov categories. So there's a lot left to do.

BTW, I also wonder whether there's a Markov category for information theory. Something like this: the objects are finite sets (for simplicity), and a morphism $f : A \to B$ is a family of Markov kernels $f_n : A^{\times n} \to B^{\times n}$ which are suitably compatible across all $n \in \mathbb{N}$, and quotiented with respect to a suitable notion of asymptotic equivalence of morphisms as $n \to \infty$. So in this way, (aspects of) information theory may fall under the umbrella of synthetic probability theory. My hope is that this could make the parallels between probability theory and information theory precise; for example, conditional entropy, which has a well-known similarity to conditional probability, could perhaps be given exactly by the conditional in the Markov category. But so far this idea is just wild speculation. In case it makes any sense, perhaps it could also be interesting to ask whether one can turn this upside down and find a cohomological interpretation of conditional probability :) But probably you've already thought about this?

view this post on Zulip Paolo Perrone (Jun 07 2020 at 01:03):

Hi all! Here's the video.
https://youtu.be/534f-9Qx2Lg

view this post on Zulip Bas Spitters (Jun 07 2020 at 16:20):

@Paolo Perrone @Eigil Rischel Are the slides available?

view this post on Zulip Juan Pablo Vigneaux (Jun 07 2020 at 22:50):

Tobias Fritz said:

That being said, I perfectly agree that all results of synthetic probability theory with Markov categories acquire a much greater generality than the usual one, and that this is worth exploring. For example, one can hope to instantiate synthetic theorems on diagram categories in a Markov category, since these again form Markov categories as sketched in Section 7 of my paper. In particular, this makes the synthetic theorems apply e.g. to stochastic processes.

Interesting! Have you tried to simplify the presentation of basic theorems on stochastic processes using these ideas? Are you aware of two different theorems in the literature that are actually instances of the same synthetic result?

I think that it'd be worth exploring other instances of the 0/1 law or any of the other known results.

Give me some time to think about a possible information-theoretic version.

view this post on Zulip Tobias Fritz (Jun 07 2020 at 23:04):

No, as far as I know nobody has applied the theory to stochastic processes yet. And although theorems of synthetic probability theory (in terms of Markov cats) can be instantiated on stochastic processes, this only gives results which do not make explicit reference to time. So my suspicion is that those results will be somewhat uninteresting, and possibly merely pointwise in time. Nevertheless, one can probably develop a synthetic theory of stochastic processes if one equips Markov categories with some extra structure, such as a "time translation functor" given by an automorphism of the category, corresponding to shifting the diagrams of the original category in which the process lives. This would give explicit access to time.

Sure, if you have any thoughts on the information-theoretic version (if it makes any sense at all), it would be great to hear about them!

view this post on Zulip Paolo Perrone (Jul 14 2020 at 19:46):

If anyone is interested, this week Tobias Fritz will give a talk about Markov categories at the MIT Categories Seminar.
Thread here: https://categorytheory.zulipchat.com/#narrow/stream/229457-MIT-Categories.20Seminar/topic/July.2016.3A.20Tobias.20Fritz'.20talk

view this post on Zulip Paolo Perrone (Nov 17 2020 at 14:37):

Eigil Rischel is giving another talk on the topic here: https://categorytheory.zulipchat.com/#narrow/stream/229457-MIT-Categories.20Seminar/topic/November.2019.3A.20Eigil.20Rischel's.20talk