Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.


Stream: theory: categorical probability

Topic: Is Prevalence a Weighted Colimit?


Ellis D. Cooper (Jul 13 2025 at 15:32):

$$W$$ denotes a non-void finite set of "words." A sentence is an assignment of words to positions, $$\tilde{w}:S\to W$$. If we think of $$\tilde{w}$$ as a word-valued random variable defined at positions (like a "field" of words), then $$\mathbb{P}[\tilde{w}=k]$$ is the probability of word $$k$$ occurring at some position of the sentence.

The triple $$(S,\tilde{w},W)$$ with dom$$\tilde{w}=S$$ and cod$$\tilde{w}=W$$ is exactly an object of the slice category $$\mathcal{S}et/W$$. Call it the category of sentences over $$W$$.

A document is a "field" of sentences over a discrete category of locations, as in $$\tilde{S}:D\to \mathcal{S}et/W$$. The sentence $$\tilde{S}q$$ in the document is denoted by $$\tilde{w}_q:\tilde{S}_q\to W$$. By the universal property of coproduct in $$\mathcal{S}et$$, there is a random variable

$$\bigcup\limits_{q\in D}\tilde{S}_q\xrightarrow{\tilde{W}}W$$

such that if $$p\in\tilde{S}_q$$ for $$q\in D$$, then $$\tilde{W}p=\tilde{w}_{q}p$$ is the word at position $$p$$ in the sentence at location $$q$$. And $$\mathbb{P}[\tilde{W}=k]$$ is the probability of word $$k$$ occurring at some position of some location in the document.

For natural language understanding it is of interest to determine, for a sample $$A\subseteq D$$ of sentences (such as the consecutive sentences in a paragraph), the "density of highest-probability words" in $$A$$. The density of word $$k\in W$$ in sample $$A$$ is

$$\frac{\#\{\,p\in\tilde{S}_q\,|\,q\in A \wedge \tilde{W}p=k\,\}}{\#A}$$

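In applications a document is subdivided into consecutive intervals $$A$$ of fixed length, so the denominator $$\#A$$ is constant and only the numerator matters. Write the numerator as $$\#\tilde{W}_A^{-1}k$$, the number of positions in the sentences of $$A$$ at which word $$k$$ occurs.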

Define the $$\mathbf{prevalence}$$ $$\operatorname{Prev}_NA$$ relative to a set $$K\subset W$$ of words of highest rank (according to some threshold $$N\in\mathbb{N}$$) by

$$\operatorname{Prev}_N A:=\sum_{k \in K} \#\tilde{W}_A^{-1}k\cdot \mathbb{P}[\tilde{W}=k]\,.$$
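As a concrete check of the definition, here is a minimal Python sketch that computes prevalence on a toy document. Everything in it beyond the formula is an assumption made for illustration: the toy `document`, the function names, the use of empirical frequencies with uniformly weighted positions for $$\mathbb{P}[\tilde{W}=k]$$, and the reading of the threshold $$N$$ as "the $$N$$ most probable words."

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy document: location q -> sentence, given as the list of
# words at positions 0, 1, 2, ... (the list plays the role of w_q : S_q -> W).
document = {
    0: ["the", "cat", "sat"],
    1: ["the", "dog", "sat", "down"],
    2: ["a", "cat", "ran"],
}

def word_counts(doc, locations):
    """For each word k, count the positions p in sentences at the given
    locations with W(p) = k, i.e. the size of the fibre over k."""
    counts = Counter()
    for q in locations:
        counts.update(doc[q])
    return counts

def probabilities(doc):
    """Empirical P[W = k], assuming all positions are weighted uniformly."""
    counts = word_counts(doc, doc.keys())
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in counts.items()}

def top_words(prob, n):
    """K: the n words of highest probability (ties broken alphabetically,
    which is an arbitrary choice made only for this sketch)."""
    ranked = sorted(prob, key=lambda k: (-prob[k], k))
    return set(ranked[:n])

def prevalence(doc, sample, n):
    """Prev_N(A) = sum over k in K of (#fibre of k in A) * P[W = k]."""
    prob = probabilities(doc)
    K = top_words(prob, n)
    fibre = word_counts(doc, sample)
    return sum(fibre[k] * prob[k] for k in K)

# Sample A = {0, 2}, threshold N = 3.
result = prevalence(document, [0, 2], 3)
print(result)
```

In this toy document the three most probable words are "the", "cat" and "sat", each with probability 1/5, and the sample $$A=\{0,2\}$$ contains them 1, 2 and 1 times respectively. Exact fractions are used so the sum can be compared with a hand computation.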

The question is whether prevalence is a weighted colimit. Evidence is a detailed formal correspondence:

$$\frac{\operatorname{colim}^W F}{\operatorname{Prev}_{N}{A}}\quad \frac{\int^{c \in C}}{\sum\limits_{k\in K}}\quad \frac{W( \underline{\hspace{0.5em}} )}{\#\tilde{W}_A^{-1}\underline{\hspace{0.5em}}}\quad \frac{F\underline{\hspace{0.5em}}} {\mathbb{P}[\tilde{W}=\underline{\hspace{0.5em}}]}$$

$$\frac{ \operatorname{colim}^W F \cong \int^{c \in C} W c \cdot F c} { \operatorname{Prev}_N{A}:=\sum\limits_{k\in K} \#\tilde{W}_A^{-1}k\cdot\mathbb{P}[\tilde{W}=k] }$$
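One observation that may help frame the question, offered as a sketch rather than a theorem: when the indexing category is discrete, a coend has no identifications to make and reduces to a coproduct. Taking $$C$$ to be the set $$K$$ regarded as a discrete category, and taking the fibre $$\tilde{W}_A^{-1}k$$ as the weight at $$k$$ (both are assumptions suggested by the correspondence, not established facts about prevalence), the weighted colimit formula reads:

```latex
\operatorname{colim}^{W} F
  \;\cong\; \int^{k \in K} W k \cdot F k
  \;\cong\; \coprod_{k \in K} W k \cdot F k ,
\qquad W k := \tilde{W}_A^{-1}k .
```

Here $$W$$ is the weight functor, not the set of words, and $$Wk\cdot Fk$$ is the copower of $$Fk$$ by the finite set $$Wk$$, that is, a coproduct of $$\#\tilde{W}_A^{-1}k$$ copies of $$Fk$$. Matching this with $$\operatorname{Prev}_N A$$ would still require a target category in which $$\mathbb{P}[\tilde{W}=k]$$ is an object and coproducts behave like sums; that choice is left open here.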

A statement and proof of a theorem substantiating this claim might involve "enriched category theory." Of course, the best would be references to the literature.