You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.
I've been thinking about the model and data provenance problem and am trying to come up with a system that makes tracking changes in models and data more efficient. In this case, "models" refers to any data-driven model (e.g. machine learning, deep learning) and "data" refers to the training dataset that was used to drive the parameters of that model. In the wild, what happens is that a team starts with an initial training dataset, trains a model, and then, wanting to improve that model's performance, tweaks the training dataset a bit and retrains. This can go on for many iterations over several months and across multiple team members. The way I've seen people solve this problem is by naming their model files and directories according to some naming convention, but that convention gets stale or fails to adapt. The reason people even want to track this is that they may want to revert to an older version of the model and retrain from there, or they might point to an arbitrary model and ask "what training data was used to train this model?". Both of these requests take a long time to address under the current organizational system. I feel like this is a problem that can be modeled using category theory. I drew a picture based on the bit of category theory I know and my questions are:
The purple rectangular box is a Model and the purple square boxes inside are versions of that model. The three purple circles inside the square boxes refer to the model checkpoints. For example, these can be saved model files at epoch 50, 100, and 150. The orange rectangular box is the Training Dataset for that model and the orange square boxes inside are versions of that dataset. The nine orange circles inside the square boxes refer to elements in the training dataset. In this example, these could be nine rows in a table or nine images in a training set.
The lightest purple square, on the top-left, refers to the model that was trained on the initial training set. That model gets fine-tuned (1st "train" arrow) based on a subset of the initial training set ("filter" arrow). The resulting model gets fine-tuned again (2nd "train" arrow) on an updated version of the initial training set ("update" arrow). An update can be something like adding another column in a table or changing images from RGB to grayscale.
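The provenance chain described above can be sketched as a few toy functions (all names and implementations here are hypothetical stand-ins, just to make the arrows concrete): "filter" keeps a subset of the data, "update" edits every element, and "train" records which dataset version produced which model version.

```python
# A toy sketch of the diagram's provenance chain (all names hypothetical):
# train on the initial dataset, fine-tune on a filtered subset, then
# fine-tune again on an updated version of the initial dataset.

def filter_subset(dataset):
    """The 'filter' arrow: keep a subset of the data (here, even indices)."""
    return dataset[::2]

def update(dataset):
    """The 'update' arrow: transform every element (here, a stand-in edit,
    like recoloring images or adding a column)."""
    return [("edited", x) for x in dataset]

def train(model, dataset):
    """The 'train' arrow: fine-tune `model` on `dataset` (here, a stub that
    just records the lineage so provenance questions are answerable)."""
    return {"parent": model, "trained_on": dataset}

d0 = list(range(9))        # the nine orange circles in the initial version
m0 = train(None, d0)       # model trained on the initial training set
d1 = filter_subset(d0)     # "filter" arrow: subset of the initial data
m1 = train(m0, d1)         # 1st "train" arrow: fine-tune on the subset
d2 = update(d0)            # "update" arrow: edited version of the data
m2 = train(m1, d2)         # 2nd "train" arrow: fine-tune on the update
print(m2["trained_on"][0])  # ('edited', 0)
```

With lineage recorded this way, "what training data was used to train this model?" is just a field lookup rather than an archaeology project over directory names.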
The way I see it, it seems that Model and Training Dataset are categories. The endofunctors on each say how each is being transformed. There is a functor that sends versions of the Model to the corresponding versions of the Training Dataset.
Note: checkpoints of a machine-learning model are files that are saved at fixed epoch intervals during training.
Hi Angeline! I'll come back and say more after a meeting, but for now it might be helpful if you walked through each of the elements in your diagram and try to explain what each piece in the diagram is, just in words or through an example.
Now I see that's coming.
(I hope I moved the message to the right topic)
Hi Spencer! Thanks for taking a look at my post! I updated the initial post with a more detailed walk-through.
Am I right that the downwards dashed arrows are tracking that parts of the model have provenance in parts of the data?
If I understand your question correctly, the answer is yes. The downward dashed arrows say that the model (tail) was trained on the data (head).
Is it "functional" - every individual bit of model points to exactly one thing in the data?
Hmmm, I think considering an example will help me decide if it’s functional.
To simplify, we can assume the model is a support vector machine and a purple square refers to a training session. Each purple circle refers to parameters of a hyperplane that attempt to separate the data (orange circles) of the corresponding orange square into two classes. This is done by passing all the orange circles through the support vector machine (purple square) multiple times and adjusting the hyperplane parameters. Every time we pass all the orange circles through, we save out the hyperplane parameters for that stage, creating one purple circle for that training session. If we pass all the orange circles through three times, we can get three purple circles within one purple square.
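The checkpointing described above can be sketched in code (a toy 1-D perceptron-style update standing in for the SVM; all names are illustrative): each full pass over the orange circles adjusts the hyperplane parameters and saves a snapshot, so three passes yield three purple circles inside one purple square.

```python
# A toy sketch of "one purple square": each full pass over the data nudges
# the hyperplane parameters (w, b) and saves one checkpoint per pass.

def train_session(data, passes=3, lr=0.1):
    """Run `passes` epochs of a toy perceptron-style update; return one
    checkpoint (w, b) per pass -- the 'purple circles' of this session."""
    w, b = 0.0, 0.0
    checkpoints = []
    for _ in range(passes):
        for x, label in data:             # label is +1 or -1
            if label * (w * x + b) <= 0:  # misclassified: adjust hyperplane
                w += lr * label * x
                b += lr * label
        checkpoints.append((w, b))        # save parameters after each pass
    return checkpoints

# Nine "orange circles": 1-D points with two classes.
data = [(x, 1) for x in (2.0, 2.5, 3.0, 3.5, 4.0)] + \
       [(x, -1) for x in (-2.0, -2.5, -3.0, -3.5)]
ckpts = train_session(data)
print(len(ckpts))  # 3 checkpoints, one per pass
```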
So regarding your question, if we say "one thing in the data" is one orange square in the orange rectangle, then yes: every purple square maps to a single orange square, so it's a function. And from my perspective, I don't think we can say anything stricter. For instance, it's not injective or surjective.
(Wow, lots of shapes)
I'm going to start even simpler, with just a normal distribution, and we can work our way up from there.
Our model will be a random variable of the form $\mathcal{N}(\mu, \sigma^2)$. In other words, there is a space of models $M = \{(\mu, \sigma) \mid \mu \in \mathbb{R},\ \sigma > 0\}$.
A dataset in this simple context will just be a list of values $x_1, \ldots, x_n \in \mathbb{R}$. That means the space of possible data sets is something like $D = \coprod_n \mathbb{R}^n$. A multi-set (no ordering) might be more appropriate here, but I'll use lists for simplicity.
There is a "training" function $t \colon D \to M$ from data to models that lets us extract a model from a dataset.
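A minimal sketch of this training map for the normal-distribution example (function names are illustrative): a dataset is a list of reals, a model is a $(\mu, \sigma)$ pair, and "training" is just maximum-likelihood estimation.

```python
# A sketch of t : D -> M for the normal-distribution example.
import math

def train(data):
    """t : D -> M. Map a list of reals to the (mu, sigma) of a fitted
    normal distribution, using the maximum-likelihood estimates."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)  # MLE (biased) std
    return (mu, sigma)

model = train([1.0, 2.0, 3.0])
print(model)  # (2.0, ~0.8165)
```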
I think that (some of?) the dashed arrows in your diagram are indicating this relationship: some specific model came from some specific data set.
Yup! Some specific model came from some specific dataset.
From here, we can start thinking about various kinds of updates.
For example, you can think of a filter (e.g., remove outliers) as a function $f \colon D \to D$; by precomposing that with $t$, we get a new "training map" $t \circ f \colon D \to M$.
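The precomposition can be sketched as follows (names illustrative; `train` here fits a normal distribution as in the earlier example): the filter transforms data to data, and composing it before training yields a new training map of the same shape.

```python
# A sketch of "filter then train": f : D -> D precomposed with t : D -> M.
import math

def train(data):
    """t : D -> M: list of reals -> (mu, sigma) of a fitted normal."""
    n = len(data)
    mu = sum(data) / n
    return (mu, math.sqrt(sum((x - mu) ** 2 for x in data) / n))

def remove_outliers(data, cutoff=10.0):
    """f : D -> D: drop values with large magnitude (toy outlier filter)."""
    return [x for x in data if abs(x) < cutoff]

def train_filtered(data):
    """The new training map t . f : D -> M."""
    return train(remove_outliers(data))

print(train_filtered([1.0, 2.0, 3.0, 100.0])[0])  # mean without the outlier: 2.0
```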
From here there are several areas that might be worth thinking about.
Hi @Angeline Aguinaldo. A few more thoughts on this topic.
First, note that the relationship that you care about (which data produced this model) goes in the opposite direction to the training function. This is called a (partial) section, a map $s \colon M \to D$ from models to data such that $t \circ s = \mathrm{id}$, i.e., a data set that generates the model that you started from.
The "partial" above corresponds to the fact that we don't care about all models, just some subset that we have designated as important for some reason. If $I$ is the index set, then the subset corresponds to a function $m \colon I \to M$, and the data that generated the models to a function $d \colon I \to D$. The partial section condition is just a commutative triangle $t \circ d = m$.
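The commutative triangle is exactly a consistency check on provenance bookkeeping, which can be sketched directly (all names here are illustrative): for each tracked index, the recorded data must really train to the recorded model.

```python
# A sketch of the partial-section condition t . d = m as a provenance check.

def check_triangle(I, m, d, t):
    """Return True iff t(d(i)) == m(i) for every tracked index i in I,
    i.e. the triangle formed by m : I -> M, d : I -> D and t : D -> M
    commutes."""
    return all(t(d(i)) == m(i) for i in I)

# Toy instance: t takes a dataset (tuple of reals) to its mean.
t = lambda data: sum(data) / len(data)
datasets = {0: (1.0, 3.0), 1: (4.0, 6.0)}   # d : I -> D, the recorded data
models = {0: 2.0, 1: 5.0}                   # m : I -> M, the tracked models
print(check_triangle(datasets.keys(), models.get, datasets.get, t))  # True
```

If someone swaps out a dataset without retraining, the check fails, flagging the broken provenance record.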
In the extreme case ($I = 1$), we just care about one model $m \in M$ generated from one data set $d \in D$, with $t(d) = m$. This is an arrow in the category of pointed sets, often denoted $\mathbf{Set}_*$. I would nominate these as the objects of the category that you want to draw string diagrams in.
You might call such a thing a "trained model". It consists of
- a training map $t \colon D \to M$,
- a data set $d \in D$,
- a model $m \in M$, and
- the condition $t(d) = m$.
Note that this list is over-specified, since you can get the last two items "for free" by defining $m := t(d)$.
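One way to sketch such a "trained model" object (names and representation are illustrative, with datasets as tuples of reals and models as floats): construct $m$ as $t(d)$, so the condition $t(d) = m$ holds by construction and never needs to be checked.

```python
# A sketch of a "trained model": the last two items of the list come for
# free because m is *defined* as t(d).
from dataclasses import dataclass, field
from typing import Callable, Tuple

@dataclass(frozen=True)
class TrainedModel:
    t: Callable[[Tuple[float, ...]], float]  # training map t : D -> M
    d: Tuple[float, ...]                     # the chosen data set
    m: float = field(init=False)             # the model, defined as t(d)

    def __post_init__(self):
        # frozen dataclass, so set the derived field via object.__setattr__
        object.__setattr__(self, "m", self.t(self.d))

mean = lambda xs: sum(xs) / len(xs)
tm = TrainedModel(t=mean, d=(1.0, 2.0, 3.0))
print(tm.m)  # 2.0
```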
Now the challenge is to identify the relevant maps between these objects. I have the sense that you want to identify some set of operations, like filtering or updating the data.
Then you could use string diagrams to keep track of model operations.
@Spencer Breiner Thanks so much for these ideas! It's really helpful to see how to formalize these concepts. I haven't gotten a chance to think about this topic in more depth the past few days, so I'll get back to you with more detailed questions once I do.