Hello! Here's Sam Staton's tutorial video, a categorical introduction to probabilistic programming.
https://youtu.be/JimCpEG0nts
Any questions about the video go in this thread!
That is a great talk, Sam! I have a few questions if you don't mind.
Hi Tobias!
Tobias Fritz said:
Could you elaborate a bit on why you prefer thinking of the/a category of probability kernels as a multicategory rather than a symmetric monoidal category?
What are the main differences between different probabilistic programming languages? How much do they differ with respect to which kernels can be defined?
I'll just list some:
With the question about whether s-finite kernels form a Kleisli category, are you considering all measurable spaces as objects? Or only a subclass like Polish spaces?
Very interesting, thanks! Yes, I see the point about the de Finetti theorem having more of a multicategorical flavour.
Out of curiosity: to what extent are you a user and/or developer of probabilistic programming languages, in addition to studying them at the theoretical level?
David Myers once pointed out to me that lax monoidal functors, as opposed to strong ones, are naturally the morphisms not of monoidal categories but of the underlying multicategories. If the structural functors that appear on probability kernels tend to be lax monoidal rather than strong (the underlying functor of the probability monad is certainly lax monoidal in this form), this could be an additional witness that "really Fubini is about multicategories".
Why is it more natural for a functor between multicategories rather than monoidal categories to be lax?
Oscar Cunningham said:
Why is it more natural for a functor between multicategories rather than monoidal categories to be lax?
The idea is that if $T$ is lax monoidal, then it canonically maps a morphism $f \colon A \times B \to C$ to a morphism $TA \times TB \to TC$.
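Spelled out (a standard unfolding of the definition, not something specific to the talk): the lax structure map $\mu_{A,B} \colon TA \times TB \to T(A \times B)$ is exactly what makes the composite
$$TA \times TB \xrightarrow{\ \mu_{A,B}\ } T(A \times B) \xrightarrow{\ Tf\ } TC$$
available, and $\mu_{A,B}$ is not required to be invertible. Acting on binary (and, similarly, $n$-ary) maps in this way is precisely the data of a morphism of the underlying multicategories, whereas a strong monoidal functor would additionally demand that $\mu_{A,B}$ be an isomorphism.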
Tobias Fritz said:
to what extent are you a user and/or developer of probabilistic programming languages
I dabble a bit, mainly because I'm interested to know what could be useful. I'm involved in a project with some social scientists on analyzing hate events on twitter and I've been writing probabilistic programs for that.
Hi Sam, thanks for the tutorial! I was a bit confused by the first example of the weighted Monte Carlo method (4 buses in an hour), so let me rephrase it to check that I got it right:
I think what threw me off at first was the naive intuition that, in this example, somehow the most 'complicated' part of the calculation is computing the likelihood. Therefore, I was subconsciously expecting the simulation to be approximating that, but then the likelihoods just entered as an input to the algorithm.
Hi! Good question. Tomáš Gonda said:
The [weighted Monte Carlo] algorithm just samples from a uniform distribution and then scores each sample with the likelihood that on the given sampled day, one would see 4 buses. In the end, one then counts the weighted proportion of samples corresponding to a given hypothesis to get a posterior.
That's exactly right. (The example is very simple, and in practice you would be sampling from a more interesting prior.)
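In case a concrete version helps later readers, here is a minimal sketch of that weighted Monte Carlo loop in Python (not code from the talk). The two kinds of day and their bus rates are invented for illustration; the only thing taken from the example is that each sample gets scored by the Poisson likelihood of seeing 4 buses.

```python
import math
import random

def poisson_pmf(k, rate):
    """Probability of seeing k events under a Poisson distribution with the given rate."""
    return rate**k * math.exp(-rate) / math.factorial(k)

# Hypothetical prior: each sampled "day" is one of two kinds, with a made-up bus rate.
hypotheses = {"quiet day": 2.0, "busy day": 6.0}

samples = []
for _ in range(10_000):
    h = random.choice(list(hypotheses))       # sample from the (uniform) prior
    weight = poisson_pmf(4, hypotheses[h])    # score by the likelihood of seeing 4 buses
    samples.append((h, weight))

# Posterior over hypotheses: the weighted proportion of samples for each one.
total = sum(w for _, w in samples)
posterior = {h: sum(w for h2, w in samples if h2 == h) / total for h in hypotheses}
print(posterior)
```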
somehow the most 'complicated' part of the calculation is computing the likelihood
Indeed, there are various approaches to automatically converting a generative model into a density / likelihood function. But here I assume that the likelihood function is given to us (the Poisson density), and that is often the approach taken in probabilistic programming in practice.
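For reference, the Poisson density in question: if the hourly bus rate is $r$, the likelihood of observing $k = 4$ buses in an hour is
$$p(k \mid r) = \frac{r^k e^{-r}}{k!},$$
and it is this function of $r$, with $k$ fixed at the observed count, that supplies the scores/weights, whether it is written by hand or derived automatically from a generative model.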
Thanks for your talk! I have some very basic questions. Let $\Theta$ be the space describing the values $a$ and $b$ in $y = ax + b$, so it's the space parametrizing affine maps from $\mathbb{R}$ to itself. Given a fixed $(a, b)$, it is not guaranteed that when an observation is made, the values will all lie along a straight line. If $X$ denotes the observation space, then this is described by a Markov kernel $\Theta \to X$. If we assume that there is a definite value of $(a, b)$, then this corresponds to a Dirac delta measure $\delta_{(a,b)}$ on $\Theta$. We can push this measure forward to get one on the observation space, which describes the probabilities of witnessing certain observations. But what is $X$ exactly in this example? If we have observed $n$ data points, is it just the disjoint union of $n$ copies of $\mathbb{R}$? If so, to obtain the posterior that you plotted visually as a collection of straight lines, do we apply Bayesian inversion to produce the associated Markov kernel $X \to \Theta$ and evaluate it at the specific observed data?
Hi Arthur Parzygnat. I think $X = (\mathbb{R}^2)^n$, if there are $n$ observations in the plane. But maybe I misunderstood your notation?
@Sam Staton Ah, I assumed that because the $x$ values are only natural numbers, you get the disjoint union. But yes, if you allow arbitrary $x$ positions, then I agree. In any case, it's good to know we agree.
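To tie this back to the weighted Monte Carlo discussion above, here is a small hypothetical sketch of the regression example in the same style (assumed, since the thread doesn't specify them: a Gaussian prior on $(a, b)$ and Gaussian observation noise). The posterior "collection of straight lines" is then the weighted set of sampled $(a, b)$ pairs.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std**2)) / (std * math.sqrt(2 * math.pi))

# Observed data points in the plane (made up for illustration).
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

weighted_lines = []
for _ in range(10_000):
    a = random.gauss(0.0, 3.0)   # prior on the slope (assumed)
    b = random.gauss(0.0, 3.0)   # prior on the intercept (assumed)
    # Likelihood: each observed y is the line's value at x plus Gaussian noise.
    weight = 1.0
    for x, y in data:
        weight *= gaussian_pdf(y, a * x + b, 0.5)
    weighted_lines.append(((a, b), weight))

# The posterior over lines is represented by these weighted (a, b) samples;
# plotting the highest-weight lines gives the "collection of straight lines" picture.
weighted_lines.sort(key=lambda p: p[1], reverse=True)
print(weighted_lines[:5])
```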