Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.

Stream: theory: applied category theory

Topic: Categorification of storages, backups and shardings?

Ignat Insarov (Oct 22 2024 at 09:02):

Say I run an e-commerce web site that sells gadgets. It is implemented as an HTTP server program X that can write stuff to a PostgreSQL database Y or to an ElasticSearch installation Z. Then:

When a manager adds a new gadget, X should write to Y and to Z.
When a customer views a gadget, X may ask the details from either Y or Z.

Here, Y and Z are two backups of the same data. Here is another possible situation:

When a gadget is written to Y, Y may write to either Y₁ or Y₂, both of which are SQL databases with identical schemata.
When a gadget is asked of Y, X must ask the details from Y₁ and maybe also from Y₂ (or in the other order, or in parallel).

Whenever I see words and and or, I think there must be a category with sums and products in play. But in this case I cannot think of a way to formalize this intuition.

For one approach, we can say that objects are storages and arrows are connections, and we can ask that connections compose associatively. So, for example, X would have composite connection to Y₁ through Y. And maybe we can say that now there is a «virtual» storage (Y₁ + Y₂) × Z that X is connected to by virtue of being connected to Y and to Z.

One aspect this model does not account for is the difference between potentiality and actuality at a given time. Z can in principle hold any element of the set of gadgets, but Z is not the set of gadgets — at any given time it is some subset of that set. And X is either the empty set (when the HTTP server is idling) or a singleton set holding one gadget (when the HTTP server is working) at any given time, even though in potentiality it holds all possible gadgets.

Where can I go from here?

David Egolf (Oct 22 2024 at 17:23):

If I understand you correctly, you first note that we can form a category from "storages". We start as follows:

the objects are storages
we put an arrow from $A$ to $B$ if we have a "connection" from $A$ to $B$

It's not totally clear to me what exactly a "connection" is. I have at least two questions about that: (1) Can we have a connection from $A$ to $B$ without a connection from $B$ to $A$ ? (2) Can we have multiple distinct connections from $A$ to $B$ ?

At any rate, we end up with a (directed multi-)graph of storages and connections, and from that we can freely generate a category.

David Egolf (Oct 22 2024 at 17:23):

However, you then note that this category only describes (to some level of detail) how information can flow around between our storages. It doesn't keep track of the particular state of any storage at some moment in time.

David Egolf (Oct 22 2024 at 17:32):

I don't know enough about this kind of thing to give specific suggestions. (Maybe someone else can chime in though?)

Here are some resources that might possibly be helpful to you, that talk about ways to model things that change over time using categories:

a series of blog posts on "Agent-Based Models" by John Baez: here
"Categorical System Theory" by David Jaz Myers: here
"Seven Sketches in Compositionality" by Brendan Fong and David Spivak: here

(I haven't read these yet - just bits and pieces of some of them. But I would like to read them properly someday!)

David Egolf (Oct 22 2024 at 17:38):

On a side note, I find it interesting that two kinds of information flow are present here: reading and writing. This reminds me of medical imaging, where a transducer can both transmit and receive. It seems like there ought to be a way to dream up a category where objects involve two-way interactions... but I'm not sure what morphisms should be! (What should a morphism be between two reading-writing systems? What should a morphism be between two transmitting-receiving systems?)

Ignat Insarov (Oct 22 2024 at 18:14):

Thank you David. These are great questions.

Say we have a logic. Then:

There is a category of propositions and implications in this logic, which is a kind of a preorder category — either proposition B follows from proposition A or not. This is the way I think about mathematical problems where I do not usually try to do induction on proof terms or anything such.
There is a category of propositions and proofs in this logic, which allows for multiple arrows — there is generally any number of ways to prove B given A. This is the way I think about code in Agda or another proof relevant dependently typed programming language.

By analogy, we may say that:

There is a category of computers and ways they are allowed to connect. This is how I should think about virtual machines and other abstract resources when I design a cloud based deployment. Either connection from A to B is allowed and will be happening with some regularity, or such connection is not allowed and will never be happening.
There is a category of storages and «connection options», where an object is also a computer, and an arrow is a specific private SSH key — or an encryption key of another type, such as a PostgreSQL connection string. Now it is possible to have multiple arrows from one computer to another. This lets us talk concretely about how two specific computers will be connected.

Either way, connections are not generally symmetric. In some situations, like say a blockchain, we have multiple independent communicating agents that all have connection to one another. In other situations, like say an e-commerce installation, there is a hierarchy with few if any cycles.

One way we can think about this «duality» of reading and writing is by saying that there are two opposite categories — one where arrows are reading connections and another where arrows are writing connections. Maybe this should even be one single dagger category. But I am not ready to do any meaningful mathematical work in this direction yet.

None of this brings us closer to understanding transactions as they happen over time. The straightforward way would be to ask for a time-indexed sequence of categories defined as above — but we still need some secret sauce in order to be able to say that at time t₁ A knows about gadget g but B does not, and at time t₂ both A and B know about g. The things we may wish to prove here are like so:

A can always obtain knowledge about any g at any time t > t₁ given some network of connections.
There are some g such that removing any connection will result in A being unable to obtain knowledge about g at some time t > t₁.

Practically, in a non-trivial installation there will be many storages, some storages will store more than one type of information, and not all storages will be allowed to store information for time intervals longer than some value. For example:

An HTTP server is not guaranteed to store anything — it may cache stuff at its own discretion, but it may be forcibly rebooted any time, losing all knowledge.
An HTTP server may push its logs to a third party logging service that will retain these logs indefinitely, without possibility of erasure.
The same HTTP server may read and write stuff to the SQL database… but also delete stuff from it!
The information stored in ElasticSearch may be dropped and regenerated from the SQL database — we may think that there is a one way connection from the SQL database to the ElasticSearch service, even though it goes through some other program that connects to both, reads form SQL, performs the necessary conversions and writes to ElasticSearch. It is as if the ElasticSearch service was a sub-object of the SQL database…

I have no idea how to formalize any of this.

I read some of David Jaz Myers's book, and I read some of Brendan Fong and David Spivak's book. But their ideas do not live in me yet. I can draw composition of lenses if you wake me at 3 AM — I have been trying to apply this stuff for the last two years. But still it is not easy to explain how, for example, a non-trivial React web app is a composite lens. I have not seen John Baez's series of blog posts before — it looks interesting!

Julius Hamilton (Nov 14 2024 at 17:14):

When a gadget is written to Y, Y may write to either Y₁ or Y₂, both of which are SQL databases with identical schemata.

Ok. I’m curious, why do $Y_1$ and $Y_2$ have identical schemas in this scenario - is it like for “data resilience”, in case one goes down or gets deleted?

When a gadget is asked of Y, X must ask the details from Y₁ and maybe also from Y₂ (or in the other order, or in parallel). Whenever I see words and and or, I think there must be a category with sums and products in play. But in this case I cannot think of a way to formalize this intuition.

In the type of categorical data science which I am learning, created by @Ryan Wisnesky and others, schemas are logical (“regular”) theories which have categories as their semantics (I believe), and an instance of a schema is a Set-valued functor on that category. (I am still learning, this may not be accurate).

Once you have a “database” (which is a schema and an instance), you can run queries to manipulate/transform/migrate that data. I am pretty sure that queries are actually natural transformations.

For one approach, we can say that objects are storages and arrows are connections, and we can ask that connections compose associatively. So, for example, X would have composite connection to Y₁ through Y. And maybe we can say that now there is a «virtual» storage (Y₁ + Y₂) × Z that X is connected to by virtue of being connected to Y and to Z.

Let’s say a “storage” is an instance. We can update and query these instances, with SQL commands like Insert and Select.

One thing your question reminds me of is a question of my own about “data synchronization”. I was also recently thinking about a scenario where I wanted updates to one instance to be automatically passed to other instances. In SQL, I think this is done with “triggers”. In Categorical Query Language, I don’t know if this is something that can be implemented or not.

In categorical query language, migrations are written using the transform keyword (I think). And I do believe transformations can be composed! Let’s hope Ryan gives us some input.

One aspect this model does not account for is the difference between potentiality and actuality at a given time. Z can in principle hold any element of the set of gadgets, but Z is not the set of gadgets — at any given time it is some subset of that set.

I think categorical query language can handle this. First, you’ll define the schema of $Z$ , which you said was “ElasticSearch”. In CQL, a schema is defined essentially using types, functions, and equations. I don’t know the structure of ElasticSearch, but we can work on it together if you like. CQL comes with many commonly used schemas already defined. For example, it already has the schema of Apache TinkerPop graph databases, so you can load an actual TinkerPop database into CQL and transform that data in various ways. If you define the schema of your e-commerce data, and of ElasticSearch, you can define “mappings” (CQL’s keyword for functors) which will transform data between the different schemas. Sometimes the mapping will be invertible, so you might be able to synchronize back and forth between $Z$ and $Y$ , I don’t know.

When you say “potentiality and actuality”, I think you are thinking about the difference between a schema, which would include a type like “Gadget”, and an instance, which would specify which gadgets some instance $I$ has.

Where can I go from here?

Trying to model this in CQL could be a beneficial exercise. I’m happy to help.

Ryan Wisnesky (Nov 14 2024 at 19:04):

Remember that all models are wrong, but some models are useful - to determine how to formalize, one must know what one wants to do with the formalization. For example, this could be implemented, and proved correct, in Coq - even to the level of reasoning about HTTP protocols and file systems (see the paper 'certified web services in ynot' for example). Or you could formalize a series of updates using techniques from database theory, or CQL; there's several different ways "updates" in SQL can be formalized, even in the Spivak body of work. Sometimes if I don't know what I want to do with the formalization, I pick "run a simulation" as a default - could you implement a toy simulator of your system as a state machine, or Petri net, or python code, or ... ?

Ignat Insarov (Dec 06 2024 at 20:21):

Julius Hamilton said:

When a gadget is written to Y, Y may write to either Y₁ or Y₂, both of which are SQL databases with identical schemata.

Ok. I’m curious, why do $Y_1$ and $Y_2$ have identical schemas in this scenario - is it like for “data resilience”, in case one goes down or gets deleted?

Cool question!

If X writes everything to Y and Z simultaneously, then one of them can go down and X can still read whatever it wants to recall. Say Y and Z have 50% chance to be down each. Now we have 25% chance to be unable to read (improvement), but 75% chance to be unable to write (regression). So, this is a trade-off.
If X writes everything either to Y₁ or Y₂ at its choice, then both of them have to be online in order for X to read whatever it wants to recall. Say Y₁ and Y₂ have 50% chance to be down each. Now we have 50% chance to be unable to read and 25% chance to be unable to write. So, this is a clear-cut improvement.

I never thought about it this way!

In practice, I figure people do the «simultaneous write» construction when there are two different systems that should handle the same data. For example, ElasticSearch may be used for, well, searching — it will handle complex search queries better than SQL, while not generally being as fast and reliable as PostgreSQL. And people do the «optional write» construction when they want to distribute load between several back ends. For example, we may want to have two ElasticSearch instances because it is kinda slow and memory inefficient. We can divide rows randomly between the two instances. Then, when we need to search for something, we shall ask both and merge the results. We do not do this currently in the project I use ElasticSearch in, but it is plausible.

So, I have been thinking about this topic a bit more and I have come up with some half-baked ideas which I am now going to share. Hopefully this will help me fully bake them without being much of a nuisance to everyone else.

David Spivak's *Functorial Data Migration* seems to be a fitting formalization. It is definitely easier to handle than his earlier simplicial data bases thing, and surely better than the classic relational theory of data bases. So, I am going to use terminology and results from Functorial Data Migration freely from now on.
What we are digging into here are not merely schema instances, but schema instances indexed by discrete time. So, in the example above, the HTTP server X is at any time an instance with a single row — it is a stateless «function». But this row will generally be different at different time indices. Meanwhile, Y will be growing over time: t₁ ≤ t₂ ⇒ Y (t₁) ⊆ Y (t₂). Generally, instances may either grow or shrink over time.
Whenever some X has a connexion to some Y, it is motivated by the need for some information to be synchronized between them. So, we want to say that some rows are in agreement between X and Y at all times.
Practically, X and Y may have different clocks and there may be some delay in their connexion, but if so we ask that there be a monotone and continuous function between their time lines, so that for every time interval in X's time line there were a time interval in Y's time line where they are in agreement.

The question now is what it means for two instances, say Y₁ and Y₂, to be in agreement. Practically, this means that we can run a select A₁ from B₁ where C₁ in Y₁ and select A₂ from B₂ where C₂ in Y₂ and get exactly the same instance Z — at all times.

In the terminology of Functorial Data Migration we can formalize this select … thing as «a single regressive update followed by a single progressive update» (proposition 3.5.6). For conceptual simplicity, I assume that the regressive update is a natural family of one to one functions, which means that we delete some rows, and the progressive update is a natural family of functions onto Z, which means that we drop some columns. These spans will be arrows in the category of instances and simple updates. Its construction is a bit technical but, in short, our arrows will be relations that reflect elements and distinctions. I checked and it turns out that these properties are preserved by composition of relations, so we do have here a subcategory of the category of sets and relations.

(I went about the proofs by translating statements about relations to first order logic and then asking an online solver to draw me some pretty pictures like this (reflexion of elements is preserved by composition) and this (reflexion of distinctions is preserved by composition).)

So, if select A₁ from B₁ where C₁ in Y₁ is the relation R₁ and select A₂ from B₂ where C₂ in Y₂ is the relation R₂, and they form an inward span Y₁ → Z ← Y₂, we can take the pullback Y₁ ← P → Y₂. This P tells us exactly which rows of Y₁ are synchronized with which rows of Y₂. Let us call such a pullback a «synchrony».

Now, a pullback in the category of sets and relations is itself a relation. So we can now pass from the category of instances and simple updates to the purple bubble foam of instances and synchronies. To recapitulate, if Y₁ and Y₂ are time indexed instances, a synchrony from Y₁ to Y₂ is given by:

a monotone and continuous function f from time to time
a pair of updates that form an inward span Y₁ → Z ← Y₂
to every time index t, a pullback Y₁ (t) ← P (t) → Y₂ (f (t)) of that inward span
to every time index t, a proof that P (t) is exactly the pre-image of Z (t) along either side of the pullback square, and not some smaller instance

This is a bit technical. And I have not done all my homework yet.

I should like to show that the purple bubble foam of instances and their synchronies is really a category.
It would be interesting to explore whether the pullback Y₁ ← P → Y₂ that I defined as an outward span of relations can be understood as updates, and what does P look like.
I talk of all this as if my instances were simple sets, but they are really diagrams made of multiple sets, arrows and equations. Everything will hopefully work smoothly if I replace relations by natural families of relations as appropriate, but I have not really checked.
We can probably handle a more general situation where Y₁ and Y₂ are instances of two different schemate connected by a translation functor, using the contravariant Δ data migration functor, but this also needs to be looked into.

Does this make sense?

Ryan Wisnesky (Dec 06 2024 at 20:53):

Let me know if you want to try to formalize some of this in CQL (categoricaldata.net), which is based on functorial data migration - I can help.

Ignat Insarov (Dec 06 2024 at 21:40):

@Ryan Wisnesky Haha thanks, I am honoured! I was staring at that page for an extended period of time today. I guess I should start with the tutorial.