You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.
I've been thinking about the model and data provenance problem and am trying to come up with a system that makes tracking changes in models and data more efficient. In this case, "models" refers to any data-driven model (e.g. machine learning, deep learning) and "data" refers to the training dataset that was used to drive the parameters of that model. In the wild, what happens is that a team starts with an initial training dataset, trains a model, and then, wanting to improve that model's performance, tweaks the training dataset a bit and retrains. This can go on for many iterations over several months and across multiple team members. The way I've seen people solve this problem is by naming their model files and directories according to some naming convention, but that convention gets stale or fails to adapt. The reason people even want to track this is that they may want to revert to an older version of the model and retrain from there, or they might point to an arbitrary model and ask "what training data was used to train this model?". Both of these requests take a long time to address under the current organizational system. I feel like this is a problem that can be modeled using category theory. I drew a picture based on the bit of category theory I know and my questions are:
The purple rectangular box is a Model and the purple square boxes inside are versions of that model. The three purple circles inside the square boxes refer to the model checkpoints. For example, these can be saved model files at epoch 50, 100, and 150. The orange rectangular box is the Training Dataset for that model and the orange square boxes inside are versions of that dataset. The nine orange circles inside the square boxes refer to elements in the training dataset. In this example, these could be nine rows in a table or nine images in a training set.
The lightest purple square, on the top-left, refers to the model that was trained on the initial training set. That model gets fine-tuned (1st "train" arrow) based on a subset of the initial training set ("filter" arrow). The resulting model gets fine-tuned again (2nd "train" arrow) on an updated version of the initial training set ("update" arrow). An update can be something like adding another column in a table or changing images from RGB to grayscale.
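The provenance chain described above can be sketched as a few toy functions (all names and implementations here are hypothetical stand-ins, just to make the arrows concrete): "filter" keeps a subset of the data, "update" edits every element, and "train" records which dataset version produced which model version.

```python
# A toy sketch of the diagram's provenance chain (all names hypothetical):
# train on the initial dataset, fine-tune on a filtered subset, then
# fine-tune again on an updated version of the initial dataset.

def filter_subset(dataset):
    """The 'filter' arrow: keep a subset of the data (here, even indices)."""
    return dataset[::2]

def update(dataset):
    """The 'update' arrow: transform every element (here, a stand-in edit,
    like recoloring images or adding a column)."""
    return [("edited", x) for x in dataset]

def train(model, dataset):
    """The 'train' arrow: fine-tune `model` on `dataset` (here, a stub that
    just records the lineage so provenance questions are answerable)."""
    return {"parent": model, "trained_on": dataset}

d0 = list(range(9))        # the nine orange circles in the initial version
m0 = train(None, d0)       # model trained on the initial training set
d1 = filter_subset(d0)     # "filter" arrow: subset of the initial data
m1 = train(m0, d1)         # 1st "train" arrow: fine-tune on the subset
d2 = update(d0)            # "update" arrow: edited version of the data
m2 = train(m1, d2)         # 2nd "train" arrow: fine-tune on the update
print(m2["trained_on"][0])  # ('edited', 0)
```

With lineage recorded this way, "what training data was used to train this model?" is just a field lookup rather than an archaeology project over directory names.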
The way I see it, it seems that Model and Training Dataset are categories. The endofunctors on each say how each is being transformed. There is a functor that sends versions of the Model to the corresponding versions of the Training Dataset.
Note: checkpoints of a machine-learning model are files that are saved at fixed epoch intervals during training.
Hi Angeline! I'll come back and say more after a meeting, but for now it might be helpful if you walked through each of the elements in your diagram and try to explain what each piece in the diagram is, just in words or through an example.
Now I see that's coming.
(I hope I moved the message to the right topic)
Hi Spencer! Thanks for taking a look at my post! I updated the initial post with a more detailed walk-through.
Am I right that the downwards dashed arrows are tracking that parts of the model have provenance in parts of the data?
If I understand your question correctly, the answer is yes. The downward dashed arrows say that the model (tail) was trained on the data (head).
Is it "functional" - every individual bit of model points to exactly one thing in the data?
Hmmm, I think considering an example will help me decide if it’s functional.
To simplify, we can assume the model is a support vector machine and a purple square refers to a training session. Each purple circle refers to parameters of a hyperplane that attempt to separate the data (orange circles) of the corresponding orange square into two classes. This is done by passing all the orange circles through the support vector machine (purple square) multiple times and adjusting the hyperplane parameters. Every time we pass all the orange circles through, we save out the hyperplane parameters for that stage, creating one purple circle for that training session. If we pass all the orange circles through three times, we can get three purple circles within one purple square.
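The checkpointing described above can be sketched in code (a toy 1-D perceptron-style update standing in for the SVM; all names are illustrative): each full pass over the orange circles adjusts the hyperplane parameters and saves a snapshot, so three passes yield three purple circles inside one purple square.

```python
# A toy sketch of "one purple square": each full pass over the data nudges
# the hyperplane parameters (w, b) and saves one checkpoint per pass.

def train_session(data, passes=3, lr=0.1):
    """Run `passes` epochs of a toy perceptron-style update; return one
    checkpoint (w, b) per pass -- the 'purple circles' of this session."""
    w, b = 0.0, 0.0
    checkpoints = []
    for _ in range(passes):
        for x, label in data:             # label is +1 or -1
            if label * (w * x + b) <= 0:  # misclassified: adjust hyperplane
                w += lr * label * x
                b += lr * label
        checkpoints.append((w, b))        # save parameters after each pass
    return checkpoints

# Nine "orange circles": 1-D points with two classes.
data = [(x, 1) for x in (2.0, 2.5, 3.0, 3.5, 4.0)] + \
       [(x, -1) for x in (-2.0, -2.5, -3.0, -3.5)]
ckpts = train_session(data)
print(len(ckpts))  # 3 checkpoints, one per pass
```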
So regarding your question, if we say "one thing in the data" is one orange square in the orange rectangle, then yes: every purple square maps to a single orange square, so it's a function. And from my perspective, I don't think we can say anything stricter. For instance, it's not injective or surjective.
(Wow, lots of shapes)
I'm going to start even simpler, with just a normal distribution, and we can work our way up from there.
Our model will be a random variable of the form $\mathcal{N}(\mu, \sigma^2)$. In other words, there is a space of models $M = \{(\mu, \sigma) \mid \mu \in \mathbb{R},\ \sigma > 0\}$.
A dataset in this simple context will just be a list of values $x_1, \ldots, x_n \in \mathbb{R}$. That means the space of possible data sets is something like $D = \coprod_n \mathbb{R}^n$. A multi-set (no ordering) might be more appropriate here, but I'll use lists for simplicity.
There is a "training" function $t \colon D \to M$ from data to models that lets us extract a model from a dataset.
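A minimal sketch of this training map for the normal-distribution example (function names are illustrative): a dataset is a list of reals, a model is a $(\mu, \sigma)$ pair, and "training" is just maximum-likelihood estimation.

```python
# A sketch of t : D -> M for the normal-distribution example.
import math

def train(data):
    """t : D -> M. Map a list of reals to the (mu, sigma) of a fitted
    normal distribution, using the maximum-likelihood estimates."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)  # MLE (biased) std
    return (mu, sigma)

model = train([1.0, 2.0, 3.0])
print(model)  # (2.0, ~0.8165)
```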
I think that (some of?) the dashed arrows in your diagram are indicating this relationship: some specific model came from some specific data set.
Yup! Some specific model came from some specific dataset.
From here, we can start thinking about various kinds of updates.
For example, you can think of a filter (e.g., remove outliers) as a function $f \colon D \to D$; by precomposing that with $t$, we get a new "training map" $t \circ f \colon D \to M$.
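The precomposition can be sketched as follows (names illustrative; `train` here fits a normal distribution as in the earlier example): the filter transforms data to data, and composing it before training yields a new training map of the same shape.

```python
# A sketch of "filter then train": f : D -> D precomposed with t : D -> M.
import math

def train(data):
    """t : D -> M: list of reals -> (mu, sigma) of a fitted normal."""
    n = len(data)
    mu = sum(data) / n
    return (mu, math.sqrt(sum((x - mu) ** 2 for x in data) / n))

def remove_outliers(data, cutoff=10.0):
    """f : D -> D: drop values with large magnitude (toy outlier filter)."""
    return [x for x in data if abs(x) < cutoff]

def train_filtered(data):
    """The new training map t . f : D -> M."""
    return train(remove_outliers(data))

print(train_filtered([1.0, 2.0, 3.0, 100.0])[0])  # mean without the outlier: 2.0
```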
From here there are several areas that might be worth thinking about.
Hi @Angeline Aguinaldo. A few more thoughts on this topic.
First, note that the relationship that you care about (which data produced this model) goes in the opposite direction to the training function. This is called a (partial) section, a map $s \colon M \to D$ from models to data such that $t \circ s = \mathrm{id}$, i.e., a data set that generates the model that you started from.
The "partial" above corresponds to the fact that we don't care about all models, just some subset that we have designated as important for some reason. If $I$ is the index set, then the subset corresponds to a function $m \colon I \to M$, and the data that generated the models to a function $d \colon I \to D$. The partial section condition is just a commutative triangle $t \circ d = m$.
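The commutative triangle is exactly a consistency check on provenance bookkeeping, which can be sketched directly (all names here are illustrative): for each tracked index, the recorded data must really train to the recorded model.

```python
# A sketch of the partial-section condition t . d = m as a provenance check.

def check_triangle(I, m, d, t):
    """Return True iff t(d(i)) == m(i) for every tracked index i in I,
    i.e. the triangle formed by m : I -> M, d : I -> D and t : D -> M
    commutes."""
    return all(t(d(i)) == m(i) for i in I)

# Toy instance: t takes a dataset (tuple of reals) to its mean.
t = lambda data: sum(data) / len(data)
datasets = {0: (1.0, 3.0), 1: (4.0, 6.0)}   # d : I -> D, the recorded data
models = {0: 2.0, 1: 5.0}                   # m : I -> M, the tracked models
print(check_triangle(datasets.keys(), models.get, datasets.get, t))  # True
```

If someone swaps out a dataset without retraining, the check fails, flagging the broken provenance record.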
In the extreme case ($I = 1$), we just care about one model $m \in M$ generated from one data set $d \in D$, with $t(d) = m$. This is an arrow in the category of pointed sets, often denoted $\mathbf{Set}_*$. I would nominate these as the objects of the category that you want to draw string diagrams in.
You might call such a thing a "trained model". It consists of
- a training map $t \colon D \to M$,
- a data set $d \in D$,
- a model $m \in M$, and
- the condition $t(d) = m$.
Note that this list is over-specified, since you can get the last two items "for free" by defining $m := t(d)$.
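One way to sketch such a "trained model" object (names and representation are illustrative, with datasets as tuples of reals and models as floats): construct $m$ as $t(d)$, so the condition $t(d) = m$ holds by construction and never needs to be checked.

```python
# A sketch of a "trained model": the last two items of the list come for
# free because m is *defined* as t(d).
from dataclasses import dataclass, field
from typing import Callable, Tuple

@dataclass(frozen=True)
class TrainedModel:
    t: Callable[[Tuple[float, ...]], float]  # training map t : D -> M
    d: Tuple[float, ...]                     # the chosen data set
    m: float = field(init=False)             # the model, defined as t(d)

    def __post_init__(self):
        # frozen dataclass, so set the derived field via object.__setattr__
        object.__setattr__(self, "m", self.t(self.d))

mean = lambda xs: sum(xs) / len(xs)
tm = TrainedModel(t=mean, d=(1.0, 2.0, 3.0))
print(tm.m)  # 2.0
```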
Now the challenge is to identify the relevant maps between these objects. I have the sense that you want to identify some set of operations, like filtering or updating the data.
Then you could use string diagrams to keep track of model operations.
@Spencer Breiner Thanks so much for these ideas! It's really helpful to see how to formalize these concepts. I haven't gotten a chance to think about this topic in more depth the past few days, so I'll get back to you with more detailed questions once I do.