Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.

Stream: theory: categorical probability

Topic: Is Prevalence a Weighted Colimit?

Ellis D. Cooper (Jul 13 2025 at 15:32):

$W$ denotes a non-empty finite set of "words." A sentence is an assignment of words to positions, $\tilde{w}:S\to W$. Thinking of $\tilde{w}$ as a word-valued random variable defined on positions (like a "field" of words), $\mathbb{P}[\tilde{w}=k]$ is the probability of word $k$ occurring at some position of the sentence.

The triple $(S,\tilde{w},W)$ with $\operatorname{dom}\tilde{w}=S$ and $\operatorname{cod}\tilde{w}=W$ is exactly an object of the slice category $\mathcal{S}et/W$. Call it the category of sentences over $W$.

A document is a "field" of sentences over a discrete category $D$ of locations, that is, a functor $\tilde{S}:D\to \mathcal{S}et/W$. The sentence $\tilde{S}q$ at location $q$ in the document is denoted by $\tilde{w}_q:\tilde{S}_q\to W$. By the universal property of the coproduct in $\mathcal{S}et$, there is a random variable

$\coprod\limits_{q\in D}\tilde{S}_q\xrightarrow{\tilde{W}}W$

such that if $p\in\tilde{S}_q$ for $q\in D$, then $\tilde{W}p=\tilde{w}_{q}p$ is the word at position $p$ in the sentence at location $q$. Then $\mathbb{P}[\tilde{W}=k]$ is the probability of word $k$ occurring at some position of some location in the document.
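The construction above can be made concrete with a minimal Python sketch. It assumes that positions carry the uniform (counting) measure, so that $\mathbb{P}[\tilde{W}=k]$ is just the relative frequency of word $k$ in the document; the post does not fix a measure, so this is only one possible reading. The names `document`, `coproduct_map`, and `word_probability` are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical document: each location q in D is mapped to a sentence.
# A sentence is a list, so position p of the sentence at q carries the word doc[q][p].
document = {
    0: ["the", "cat", "sat"],
    1: ["the", "dog", "sat", "down"],
    2: ["a", "cat", "ran"],
}

def coproduct_map(doc):
    """Induced map on the disjoint union of positions, keyed by (q, p)."""
    return {(q, p): word
            for q, sentence in doc.items()
            for p, word in enumerate(sentence)}

def word_probability(doc):
    """Relative frequency of each word (assumption: uniform measure on positions)."""
    w_tilde = coproduct_map(doc)
    counts = Counter(w_tilde.values())
    total = len(w_tilde)
    return {k: Fraction(c, total) for k, c in counts.items()}

probs = word_probability(document)
```

Keying positions by the pair `(q, p)` is what makes the union disjoint: the same index `p` in two different sentences gives two different elements of the domain of $\tilde{W}$.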

For natural language understanding it is of interest to determine, for a sample $A\subseteq D$ of sentences (such as the consecutive sentences in a paragraph), the "density of highest-probability words" in $A$. The density of word $k\in W$ in sample $A$ is

$\frac{\#\{\,p\in\tilde{S}_q\,|\,q\in A \wedge \tilde{W}p=k\,\}}{\#A}$

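In applications a document is subdivided into consecutive intervals $A$ of fixed length, so the denominator $\#A$ is constant and the density is determined by its numerator. Write $\#\tilde{W}_A^{-1}k$ for that numerator, where $\tilde{W}_A$ denotes the restriction of $\tilde{W}$ to the positions of the sentences located in $A$.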
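As an illustration of the density formula, here is a short Python sketch. The toy `document` and the helper names `occurrences` and `density` are assumptions made for the example; a sample $A$ is represented as a list of locations.

```python
from fractions import Fraction

# Hypothetical document: location q -> sentence (list of words by position).
document = {
    0: ["the", "cat", "sat"],
    1: ["the", "dog", "sat", "down"],
    2: ["a", "cat", "ran"],
}

def occurrences(doc, sample, k):
    """Count positions p in sentences at locations q in `sample` whose word is k."""
    return sum(1 for q in sample for word in doc[q] if word == k)

def density(doc, sample, k):
    """Occurrences of k in the sample divided by the number of sentences in it."""
    return Fraction(occurrences(doc, sample, k), len(sample))

sample = [0, 1]  # two consecutive locations
cat_density = density(document, sample, "cat")
```

Note that the denominator counts sentences, not positions, so a density can exceed 1 when a word occurs more than once per sentence on average.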

Define the $\mathbf{prevalence}$ $\operatorname{Prev}_N A$ relative to a set $K\subset W$ of words of highest rank (according to some threshold $N\in\mathbb{N}$) by

$\operatorname{Prev}_N A:=\sum_{k \in K} \#\tilde{W}_A^{-1}k\cdot \mathbb{P}[\tilde{W}=k]\,.$
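The prevalence formula can be sketched in Python as follows. Two assumptions are made that the post does not state: $\mathbb{P}[\tilde{W}=k]$ is taken to be the relative frequency of $k$ over all positions of the document, and $K$ is read as the $N$ most probable words, with ties broken alphabetically. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical document: location q -> sentence (list of words by position).
document = {
    0: ["the", "cat", "sat"],
    1: ["the", "dog", "sat", "down"],
    2: ["a", "cat", "ran"],
}

def word_probability(doc):
    """Relative frequency of each word (assumption: uniform measure on positions)."""
    counts = Counter(word for sentence in doc.values() for word in sentence)
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in counts.items()}

def top_words(probs, n):
    """Assumed reading of K: the n most probable words, ties broken alphabetically."""
    ranked = sorted(probs, key=lambda k: (-probs[k], k))
    return set(ranked[:n])

def prevalence(doc, sample, n):
    """Sum over k in K of (occurrences of k in the sample) * P[W = k]."""
    probs = word_probability(doc)
    K = top_words(probs, n)
    counts = Counter(word for q in sample for word in doc[q])
    return sum(counts[k] * probs[k] for k in K)

prev = prevalence(document, [0, 1], 2)
```

In this toy document the words "the", "cat", and "sat" tie for the highest probability, so the tie-breaking rule decides which of them land in $K$; a different convention would give a different prevalence.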

The question is whether prevalence is a weighted colimit. The evidence is the following detailed formal correspondence:

$\frac{\operatorname{colim}^W F}{\operatorname{Prev}_{N}{A}}\quad \frac{\int^{c \in C}}{\sum\limits_{k\in K}}\quad \frac{W( \underline{\hspace{0.5em}} )}{\#\tilde{W}_A^{-1}\underline{\hspace{0.5em}}}\quad \frac{F\underline{\hspace{0.5em}}} {\mathbb{P}[\tilde{W}=\underline{\hspace{0.5em}}]}$

$\frac{ \operatorname{colim}^W F \cong \int^{c \in C} W c \cdot F c} { \operatorname{Prev}_N{A}:=\sum\limits_{k\in K} \#\tilde{W}_A^{-1}k\cdot\mathbb{P}[\tilde{W}=k] }$

A statement and proof of a theorem substantiating this claim and evidence might involve "enriched category theory." Of course, references to the literature would be best.