Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.


Stream: theory: applied category theory

Topic: Categorification of storages, backups and shardings?


view this post on Zulip Ignat Insarov (Oct 22 2024 at 09:02):

Say I run an e-commerce web site that sells gadgets. It is implemented as an HTTP server program X that can write stuff to a PostgreSQL database Y or to an ElasticSearch installation Z. Then:

Here, Y and Z are two backups of the same data. Here is another possible situation:

Whenever I see words and and or, I think there must be a category with sums and products in play. But in this case I cannot think of a way to formalize this intuition.

For one approach, we can say that objects are storages and arrows are connections, and we can ask that connections compose associatively. So, for example, X would have composite connection to Y₁ through Y. And maybe we can say that now there is a «virtual» storage (Y₁ + Y₂) × Z that X is connected to by virtue of being connected to Y and to Z.

One aspect this model does not account for is the difference between potentiality and actuality at a given time. Z can in principle hold any element of the set of gadgets, but Z is not the set of gadgets — at any given time it is some subset of that set. And X is either the empty set (when the HTTP server is idling) or a singleton set holding one gadget (when the HTTP server is working) at any given time, even though in potentiality it holds all possible gadgets.

Where can I go from here?

view this post on Zulip David Egolf (Oct 22 2024 at 17:23):

If I understand you correctly, you first note that we can form a category from "storages". We start as follows:

It's not totally clear to me what exactly a "connection" is. I have at least two questions about that: (1) Can we have a connection from AA to BB without a connection from BB to AA? (2) Can we have multiple distinct connections from AA to BB?

At any rate, we end up with a (directed multi-)graph of storages and connections, and from that we can freely generate a category.

view this post on Zulip David Egolf (Oct 22 2024 at 17:23):

However, you then note that this category only describes (to some level of detail) how information can flow around between our storages. It doesn't keep track of the particular state of any storage at some moment in time.

view this post on Zulip David Egolf (Oct 22 2024 at 17:32):

I don't know enough about this kind of thing to give specific suggestions. (Maybe someone else can chime in though?)

Here are some resources that might possibly be helpful to you, that talk about ways to model things that change over time using categories:

(I haven't read these yet - just bits and pieces of some of them. But I would like to read them properly someday!)

view this post on Zulip David Egolf (Oct 22 2024 at 17:38):

On a side note, I find it interesting that two kinds of information flow are present here: reading and writing. This reminds me of medical imaging, where a transducer can both transmit and receive. It seems like there ought to be a way to dream up a category where objects involve two-way interactions... but I'm not sure what morphisms should be! (What should a morphism be between two reading-writing systems? What should a morphism be between two transmitting-receiving systems?)

view this post on Zulip Ignat Insarov (Oct 22 2024 at 18:14):

Thank you David. These are great questions.

Say we have a logic. Then:

By analogy, we may say that:

Either way, connections are not generally symmetric. In some situations, like say a blockchain, we have multiple independent communicating agents that all have connection to one another. In other situations, like say an e-commerce installation, there is a hierarchy with few if any cycles.

One way we can think about this «duality» of reading and writing is by saying that there are two opposite categories — one where arrows are reading connections and another where arrows are writing connections. Maybe this should even be one single dagger category. But I am not ready to do any meaningful mathematical work in this direction yet.

None of this brings us closer to understanding transactions as they happen over time. The straightforward way would be to ask for a time-indexed sequence of categories defined as above — but we still need some secret sauce in order to be able to say that at time t₁ A knows about gadget g but B does not, and at time t₂ both A and B know about g. The things we may wish to prove here are like so:

Practically, in a non-trivial installation there will be many storages, some storages will store more than one type of information, and not all storages will be allowed to store information for time intervals longer than some value. For example:

I have no idea how to formalize any of this.

I read some of David Jaz Myers's book, and I read some of Brendan Fong and David Spivak's book. But their ideas do not live in me yet. I can draw composition of lenses if you wake me at 3 AM — I have been trying to apply this stuff for the last two years. But still it is not easy to explain how, for example, a non-trivial React web app is a composite lens. I have not seen John Baez's series of blog posts before — it looks interesting!

view this post on Zulip Julius Hamilton (Nov 14 2024 at 17:14):

Ok. I’m curious, why do Y1Y_1 and Y2Y_2 have identical schemas in this scenario - is it like for “data resilience”, in case one goes down or gets deleted?

In the type of categorical data science which I am learning, created by @Ryan Wisnesky and others, schemas are logical (“regular”) theories which have categories as their semantics (I believe), and an instance of a schema is a Set-valued functor on that category. (I am still learning, this may not be accurate).

Once you have a “database” (which is a schema and an instance), you can run queries to manipulate/transform/migrate that data. I am pretty sure that queries are actually natural transformations.

For one approach, we can say that objects are storages and arrows are connections, and we can ask that connections compose associatively. So, for example, X would have composite connection to Y₁ through Y. And maybe we can say that now there is a «virtual» storage (Y₁ + Y₂) × Z that X is connected to by virtue of being connected to Y and to Z.

Let’s say a “storage” is an instance. We can update and query these instances, with SQL commands like Insert and Select.

One thing your question reminds me of is a question of my own about “data synchronization”. I was also recently thinking about a scenario where I wanted updates to one instance to be automatically passed to other instances. In SQL, I think this is done with “triggers”. In Categorical Query Language, I don’t know if this is something that can be implemented or not.

In categorical query language, migrations are written using the transform keyword (I think). And I do believe transformations can be composed! Let’s hope Ryan gives us some input.

One aspect this model does not account for is the difference between potentiality and actuality at a given time. Z can in principle hold any element of the set of gadgets, but Z is not the set of gadgets — at any given time it is some subset of that set. 

I think categorical query language can handle this. First, you’ll define the schema of ZZ, which you said was “ElasticSearch”. In CQL, a schema is defined essentially using types, functions, and equations. I don’t know the structure of ElasticSearch, but we can work on it together if you like. CQL comes with many commonly used schemas already defined. For example, it already has the schema of Apache TinkerPop graph databases, so you can load an actual TinkerPop database into CQL and transform that data in various ways. If you define the schema of your e-commerce data, and of ElasticSearch, you can define “mappings” (CQL’s keyword for functors) which will transform data between the different schemas. Sometimes the mapping will be invertible, so you might be able to synchronize back and forth between ZZ and YY, I don’t know.

When you say “potentiality and actuality”, I think you are thinking about the difference between a schema, which would include a type like “Gadget”, and an instance, which would specify which gadgets some instance II has.

Where can I go from here?

Trying to model this in CQL could be a beneficial exercise. I’m happy to help.

view this post on Zulip Ryan Wisnesky (Nov 14 2024 at 19:04):

Remember that all models are wrong, but some models are useful - to determine how to formalize, one must know what one wants to do with the formalization. For example, this could be implemented, and proved correct, in Coq - even to the level of reasoning about HTTP protocols and file systems (see the paper 'certified web services in ynot' for example). Or you could formalize a series of updates using techniques from database theory, or CQL; there's several different ways "updates" in SQL can be formalized, even in the Spivak body of work. Sometimes if I don't know what I want to do with the formalization, I pick "run a simulation" as a default - could you implement a toy simulator of your system as a state machine, or Petri net, or python code, or ... ?

view this post on Zulip Ignat Insarov (Dec 06 2024 at 20:21):

Julius Hamilton said:

Ok. I’m curious, why do Y1Y_1 and Y2Y_2 have identical schemas in this scenario - is it like for “data resilience”, in case one goes down or gets deleted?

Cool question!

I never thought about it this way!

In practice, I figure people do the «simultaneous write» construction when there are two different systems that should handle the same data. For example, ElasticSearch may be used for, well, searching — it will handle complex search queries better than SQL, while not generally being as fast and reliable as PostgreSQL. And people do the «optional write» construction when they want to distribute load between several back ends. For example, we may want to have two ElasticSearch instances because it is kinda slow and memory inefficient. We can divide rows randomly between the two instances. Then, when we need to search for something, we shall ask both and merge the results. We do not do this currently in the project I use ElasticSearch in, but it is plausible.


So, I have been thinking about this topic a bit more and I have come up with some half-baked ideas which I am now going to share. Hopefully this will help me fully bake them without being much of a nuisance to everyone else.

  1. David Spivak's *Functorial Data Migration* seems to be a fitting formalization. It is definitely easier to handle than his earlier simplicial data bases thing, and surely better than the classic relational theory of data bases. So, I am going to use terminology and results from Functorial Data Migration freely from now on.

  2. What we are digging into here are not merely schema instances, but schema instances indexed by discrete time. So, in the example above, the HTTP server X is at any time an instance with a single row — it is a stateless «function». But this row will generally be different at different time indices. Meanwhile, Y will be growing over time: t₁ ≤ t₂ ⇒ Y (t₁) ⊆ Y (t₂). Generally, instances may either grow or shrink over time.

  3. Whenever some X has a connexion to some Y, it is motivated by the need for some information to be synchronized between them. So, we want to say that some rows are in agreement between X and Y at all times.

  4. Practically, X and Y may have different clocks and there may be some delay in their connexion, but if so we ask that there be a monotone and continuous function between their time lines, so that for every time interval in X's time line there were a time interval in Y's time line where they are in agreement.

The question now is what it means for two instances, say Y₁ and Y₂, to be in agreement. Practically, this means that we can run a select A₁ from B₁ where C₁ in Y₁ and select A₂ from B₂ where C₂ in Y₂ and get exactly the same instance Z — at all times.

In the terminology of Functorial Data Migration we can formalize this select … thing as «a single regressive update followed by a single progressive update» (proposition 3.5.6). For conceptual simplicity, I assume that the regressive update is a natural family of one to one functions, which means that we delete some rows, and the progressive update is a natural family of functions onto Z, which means that we drop some columns. These spans will be arrows in the category of instances and simple updates. Its construction is a bit technical but, in short, our arrows will be relations that reflect elements and distinctions. I checked and it turns out that these properties are preserved by composition of relations, so we do have here a subcategory of the category of sets and relations.

(I went about the proofs by translating statements about relations to first order logic and then asking an online solver to draw me some pretty pictures like this (reflexion of elements is preserved by composition) and this (reflexion of distinctions is preserved by composition).)

So, if select A₁ from B₁ where C₁ in Y₁ is the relation R₁ and select A₂ from B₂ where C₂ in Y₂ is the relation R₂, and they form an inward span Y₁ → Z ← Y₂, we can take the pullback Y₁ ← P → Y₂. This P tells us exactly which rows of Y₁ are synchronized with which rows of Y₂. Let us call such a pullback a «synchrony».

Now, a pullback in the category of sets and relations is itself a relation. So we can now pass from the category of instances and simple updates to the purple bubble foam of instances and synchronies. To recapitulate, if Y₁ and Y₂ are time indexed instances, a synchrony from Y₁ to Y₂ is given by:

This is a bit technical. And I have not done all my homework yet.

Does this make sense?

view this post on Zulip Ryan Wisnesky (Dec 06 2024 at 20:53):

Let me know if you want to try to formalize some of this in CQL (categoricaldata.net), which is based on functorial data migration - I can help.

view this post on Zulip Ignat Insarov (Dec 06 2024 at 21:40):

@Ryan Wisnesky   Haha thanks, I am honoured! I was staring at that page for an extended period of time today. I guess I should start with the tutorial.