Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.

Stream: learning: questions

Topic: A necessary condition for generalization in neural nets?

Isidor Seppälä Manning (May 07 2026 at 17:53):

Hi,

I’m an incoming undergrad from Sweden, and I’ve been immersed in category theory and its applications in deep learning for a few months. I am currently exploring ideas regarding emergence in neural networks using monads, $\mathbf{PROP}$, and polynomial functors. I was initially inspired by Bruno Gavranović's Fundamental Components of Deep Learning.

I want to share my direction to hopefully learn a lot from everyone here.

I’m interested in finding a necessary condition for generalization that is independent of optimization dynamics and parameters. Following Geometric Deep Learning, I view the structure of tasks (e.g., equivariance for object detection) as certain "axioms" or pre-defined structures that any neural network must obey in order to "learn the task." I was later drawn towards category theory and Bruno's position paper, which is where I was introduced to using monads to represent the inherent rules of a task. I've looked at the monad $M$ defined over $\mathbf{FVect}_{\mathbb{R}}$, and I am thinking of $M$-algebras $(D, \alpha)$ as representing datasets $D$ that obey these priors via the evaluation map $\alpha$.
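For reference, these are the two standard laws that make $(D, \alpha)$ an $M$-algebra. The names $\eta$ (unit) and $\mu$ (multiplication) for the monad structure are notation assumed here, not used elsewhere in this post:

```latex
% Structure map of the algebra
\alpha : M D \to D

% Unit law: evaluating a "trivially wrapped" element returns it unchanged
\alpha \circ \eta_D = \mathrm{id}_D

% Associativity law: flattening then evaluating equals evaluating twice
\alpha \circ M(\alpha) = \alpha \circ \mu_D
```

In the reading above, these laws are what "the dataset obeys the priors" would mean concretely: evaluating with $\alpha$ is compatible with the task structure encoded by $M$.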

As for what a "model" is in deep learning, I am viewing models as morphisms in $\mathbf{PROP}$, generated by some abstract element $T$. For now, I haven't thought a lot about the distinguished object $T$ and what it could represent. I thought about $\mathbf{FVect}_{\mathbb{R}}$, but it doesn't seem to be necessary, as I am focusing on isolating the wiring of the model. As for modeling neural networks categorically, I've also looked into $\mathbf{Para}(\mathbf{FVect}_{\mathbb{R}})$, but since I want to focus on the structure of neural networks independently of parameter spaces, I have used $\mathbf{PROP}$.
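To make "isolating the wiring" a bit more concrete, here is a minimal Python sketch that records a morphism $T^m \to T^n$ only by a name and its arities. It is arity bookkeeping, not a full PROP (no symmetries and no equations between diagrams), and the class `Wire` and the generators `layer` and `merge` are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wire:
    """A morphism T^dom -> T^cod, remembered only by name and arity."""
    name: str
    dom: int
    cod: int

    def then(self, other: "Wire") -> "Wire":
        # Sequential composition: run self, then other. Arities must match.
        if self.cod != other.dom:
            raise ValueError(
                f"cannot compose {self.name} (cod {self.cod}) "
                f"with {other.name} (dom {other.dom})"
            )
        return Wire(f"({self.name} ; {other.name})", self.dom, other.cod)

    def beside(self, other: "Wire") -> "Wire":
        # Monoidal product: place two diagrams side by side; arities add.
        return Wire(
            f"({self.name} (x) {other.name})",
            self.dom + other.dom,
            self.cod + other.cod,
        )


# Hypothetical generators, chosen only to illustrate the bookkeeping.
layer = Wire("layer", 1, 1)   # one wire in, one wire out
merge = Wire("merge", 2, 1)   # two wires in, one wire out

# Two parallel layers whose outputs are merged: a morphism T^2 -> T.
two_branch = layer.beside(layer).then(merge)
print(two_branch.name, two_branch.dom, two_branch.cod)
```

Nothing here refers to parameters or vector spaces, which is the point of working in $\mathbf{PROP}$ rather than $\mathbf{Para}(\mathbf{FVect}_{\mathbb{R}})$.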

To study the interaction between the monad on $\mathbf{FVect}_{\mathbb{R}}$ (the task) and models $f$ in $\mathbf{PROP}$ (the architecture), I am exploring how these interact in $\mathbf{Poly}(\mathbf{FVect}_{\mathbb{R}})$. I hypothesize that if $f$ has some measure of "maximum multiplicative degree," and that maximum degree is less than the required arity of the task (defined by the monad), then the model cannot successfully factor through the priors and can thus never generalize. In this view, "true generalization" or emergence becomes a categorical requirement relating the degree of the model to the arity of the task. I've felt a bit stuck on how this interaction between the architecture and the task, as well as the "maximum multiplicative degree," could be formalized. Maybe the degree could be formulated as the highest polynomial degree in the model's "polynomial representation," which led me to look into the idea of some functor $F:\mathbf{PROP}\to{}\mathbf{Poly}(\mathbf{FVect}_{\mathbb{R}})$. This is not something I've looked into further, but intuitively this functor would map the wiring of a model to a polynomial representation. I've also read that monads have a connection to monoids in $\mathbf{Poly}$. I know this is informal and a bit haphazardly thrown together. These are things I am currently learning.

To give an example of the idea, in modular addition the arity of the task is $2$. MLPs would be "degree 1" models since they consist of sequential compositions of linear layers and activations, so in $\mathbf{PROP}$ an MLP is a morphism $f_{\text{mlp}}:T\to{}T$. A single attention layer would be $f_{\text{attn}}:T^{3}\to{}T$, and a bilinear model $f_{\text{bilin}}:T^{2}\to{}T$. Then, applied to this case, this categorical approach could maybe prove mathematically that the bilinear and attention models are the only ones that could exhibit emergence under modular addition, as they have a "maximum multiplicative degree" greater than or equal to $2$.
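One possible reading of "maximum multiplicative degree" can be sketched in Python. This is only an illustration under stated assumptions, not a formalization: linear layers and pointwise activations (including softmax) are treated as degree-preserving, matching the convention above that an MLP is "degree 1," and a multiplicative interaction adds the degrees of its inputs. All names (`Input`, `Linear`, `Multiply`, `degree`, `can_match_task`) are hypothetical:

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Input:
    """One input wire of the model."""
    index: int


@dataclass(frozen=True)
class Linear:
    """Linear layer or pointwise activation; assumed to preserve degree."""
    arg: "Term"


@dataclass(frozen=True)
class Multiply:
    """Multiplicative interaction between branches; degrees add."""
    args: Tuple["Term", ...]


Term = Union[Input, Linear, Multiply]


def degree(term: Term) -> int:
    """Maximum multiplicative degree of a wiring term (toy definition)."""
    if isinstance(term, Input):
        return 1
    if isinstance(term, Linear):
        return degree(term.arg)
    if isinstance(term, Multiply):
        return sum(degree(a) for a in term.args)
    raise TypeError(f"unknown term: {term!r}")


def can_match_task(term: Term, task_arity: int) -> bool:
    """The hypothesized necessary condition: degree >= arity of the task."""
    return degree(term) >= task_arity


# f_mlp : T -> T, a stack of linear layers and activations.
f_mlp = Linear(Linear(Input(0)))

# f_bilin : T^2 -> T, a product of two linearly transformed inputs.
f_bilin = Multiply((Linear(Input(0)), Linear(Input(1))))

# f_attn : T^3 -> T, (query * key) passed through softmax, then times value.
f_attn = Multiply((
    Linear(Multiply((Linear(Input(0)), Linear(Input(1))))),
    Linear(Input(2)),
))

MODULAR_ADDITION_ARITY = 2  # arity of the task, as stated above

for name, model in [("mlp", f_mlp), ("bilin", f_bilin), ("attn", f_attn)]:
    print(name, degree(model), can_match_task(model, MODULAR_ADDITION_ARITY))
```

Under these assumptions the MLP has degree 1 and fails the condition for modular addition, while the bilinear model (degree 2) and the attention layer (degree 3) pass it. Whether treating activations as degree-preserving is justified is exactly the part that would need a real formalization.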

I’m wondering if this approach could formally distinguish why certain architectures exhibit "grokking" or emergence on certain tasks while others simply memorize.

I realize this is currently informal and perhaps trivial from an empirical standpoint. Moreover, I've been a bit doubtful about the approach of applying category theory to explain emergence and to distinguish generalization from memorization in neural networks. But that's also why I am making this post. I wanted to reach out to people who understand research and mathematics more deeply, and more importantly, I want to learn. My goal with this is to contribute to the development of categorical deep learning, and to a more general mathematics that could be used not only to unify distinct parts of deep learning, but ultimately to make predictions and understand neural networks better.

If you have any literature recommendations, direct critiques of my direction, or guidance regarding polynomial functors, I would appreciate hearing it and discussing it.

Thank you for your time!