Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.


Stream: theory: categorical probability

Topic: generalizing mutual information


view this post on Zulip Gurkenglas (Nov 19 2021 at 22:58):

I want to dissect a neural network into a hierarchy of modules to understand one module without having understood the others. My proxy for whether I can is how well I can predict one module's reaction to an input from another module's reaction: mutual information.

I model a neural network as a differentiable function from an input vector space to an activation vector space consisting of every real number computed anywhere in the network. I call each real-number slot a neuron. An input distribution induces an activation distribution, and every module (set of neurons) gets a marginal distribution.

For tractability I replace the input distribution with a normal distribution of negligible variance around a given input. Then the covariance matrix of the activation distribution is just the Jacobian times its transpose (up to the input variance), and the mutual information between two modules is essentially half the log of (the determinant of one block times the determinant of the other block, divided by the determinant of the block for their union).
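Concretely, a minimal numpy sketch of that computation, assuming the Jacobian at the chosen input has already been computed (the index sets here are placeholders, and the overall input variance is dropped since it cancels in the determinant ratio):

```python
import numpy as np

def gaussian_mi_bits(jacobian, module_a, module_b):
    """Mutual information (bits) between two modules under the Gaussian
    approximation: activation covariance ~ J J^T (the input variance cancels
    in the ratio of determinants, so it is omitted)."""
    cov = jacobian @ jacobian.T
    a, b = list(module_a), list(module_b)

    def logdet(idx):
        # slogdet returns (sign, log|det|); the log-det is -inf for a singular block
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]

    # I(A;B) = 1/2 * (log det Sigma_A + log det Sigma_B - log det Sigma_{A union B})
    return 0.5 * (logdet(a) + logdet(b) - logdet(a + b)) / np.log(2)

# toy usage: 4 neurons, 8 input dimensions, two disjoint modules
rng = np.random.default_rng(0)
J = rng.standard_normal((4, 8))
print(gaussian_mi_bits(J, [0, 1], [2, 3]))
```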

Averaging the mutual information between two modules around each training input is almost the mutual information conditioned on the input, but not quite - since the input determines all other numbers, conditioning on the input makes everything trivial. What am I looking at?

The mutual information I get between modules takes values like 2 bits, 3.5 bits, 5 bits, infinity bits: sometimes one degree of freedom, one number, is shared exactly between the modules, which makes the mutual information infinite. That sounds wrong, as if I'm failing to account for some structure. How might one generalize mutual information to categories other than Set? One could just hackily add a noise vector onto every activation vector, which sounds like it might add some topological structure, but perhaps we should be respecting everything from the shape of a picture tensor to the causal dependency graph between neurons.
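For reference, the hack amounts to one extra term in the sketch above (noise_var is a made-up knob; once noise is added the result is no longer scale-invariant but depends on the signal-to-noise ratio):

```python
import numpy as np

def noisy_mi_bits(jacobian, module_a, module_b, noise_var=1e-6):
    """Gaussian MI after adding independent noise of variance noise_var to
    every activation; every block becomes nonsingular, so the result is finite."""
    cov = jacobian @ jacobian.T + noise_var * np.eye(jacobian.shape[0])
    a, b = list(module_a), list(module_b)
    logdet = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(a) + logdet(b) - logdet(a + b)) / np.log(2)
```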

I suspect that counting the degrees of freedom shared between two modules, that is, the rank of one block plus the rank of the other block minus the rank of the block for their union, is what mutual information becomes in some other category.
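In the same sketch that count looks like this (the example is contrived so that the two modules share exactly one direction):

```python
import numpy as np

def shared_degrees_of_freedom(jacobian, module_a, module_b, tol=1e-10):
    """rank(Sigma_A) + rank(Sigma_B) - rank(Sigma_{A union B}) for the
    covariance Sigma = J J^T."""
    cov = jacobian @ jacobian.T
    a, b = list(module_a), list(module_b)
    rank = lambda idx: np.linalg.matrix_rank(cov[np.ix_(idx, idx)], tol=tol)
    return rank(a) + rank(b) - rank(a + b)

# every neuron depends on the input only through the single direction u,
# so the two modules share exactly one degree of freedom
u = np.array([[1.0, 2.0, -1.0]])
J = np.array([[1.0], [2.0], [-1.0], [0.5]]) @ u   # shape (4, 3), rank 1
print(shared_degrees_of_freedom(J, [0, 1], [2, 3]))  # -> 1
```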

view this post on Zulip Gurkenglas (Nov 20 2021 at 15:56):

Oh, the "mutual information between modules conditional on the input" is in fact the mutual information between modules conditional on (the input plus a negligible noise vector), neat. That does seem to make the hack less hacky. Still, can category theory help?

view this post on Zulip Morgan Rogers (he/him) (Nov 21 2021 at 09:36):

Gurkenglas said:

How might one generalize mutual information to other categories than Set?

What do you mean by this comment? Mutual information is defined over some category of random variables and measure spaces; these might be formally built over a Set-theoretic foundation, but it's unclear that this is a quantity particularly associated with Set.

view this post on Zulip Morgan Rogers (he/him) (Nov 21 2021 at 09:44):

Certainly you can formulate this categorically, or at least order-theoretically. You have a lattice (or category, but let's stick to lattice for now) of modules, and a quantity defined on pairs of elements in that lattice. You have ways of combining such lattices, both compositionally and substitutionally. This feels like an operad-y situation, so hopefully someone well-versed in those can give you some insight.
Your questions become "how do substitution and composition interact with my quantity (mutual information)", and these will be clarified by working out how the composition and substitution structures behave.

view this post on Zulip Gurkenglas (Dec 05 2021 at 15:37):

Has anyone yet done https://ncatlab.org/johnbaez/show/Entropy+as+a+functor for relative entropy so that it might generalize beyond the finite case, as laid out in 2. in https://golem.ph.utexas.edu/category/2011/05/categorytheoretic_characteriza.html#c037762 ?

view this post on Zulip Nathaniel Virgo (Dec 06 2021 at 04:46):

There is Baez and Fritz (2014) - A Bayesian Characterisation of Relative Entropy, which I think is a similar but different construction - it could be what you're looking for.

view this post on Zulip Tobias Fritz (Dec 06 2021 at 05:01):

And concerning "beyond the finite case", there's A categorical characterization of relative entropy on standard Borel spaces by Gagné and Panangaden, who extended our characterization to standard Borel spaces. I'm still amazed by them having managed to do that!

view this post on Zulip Gurkenglas (Dec 24 2021 at 23:01):

Nice! Have Gagné/Panangaden or others considered transporting the machinery from section 6.4 (Aggregating predictions from many forecasters) to inform aggregation of utilities? That is also a setting where the usual averaging isn't really appropriate, and it seems to lie just across the gap between observation and control.

view this post on Zulip Tobias Fritz (Dec 25 2021 at 18:19):

Not that I know of. But others may know of things that I don't know of!

view this post on Zulip Gurkenglas (Jan 10 2022 at 16:54):

Tobias Fritz said:

A categorical characterization of relative entropy on standard Borel spaces

A coherent pair is just a split idempotent on a free Γ-algebra (except that this forgets the equipped measure, which could be re-equipped with a morphism from the algebra Γ1 → 1).
Convex linearity means that the rectangle on the right commutes: http://sketchtoy.com/70390288
Relative entropy feels like it will be a 2-cell in the triangle with e and twice l.
Thoughts? Suggestions? Arcane prophecies?