Some Guidance on Modeling a Software System · practice: software

The first. Consider the Stack Exchange API (as an example of any REST API in general).

REST APIs are probably much simpler systems than other types of programs, like complex applications, programming languages, etc. They are basically a collection of inputs and outputs, of a single data type, “string”. There is some conditional logic, because different API commands can take different parameters.

For a long time, I have had this idea where you could enumerate every valid sequence of keywords in a programming language, up to some number

n

of keywords.

The reason this interests me is because programming languages are written in lower level languages. Every defined keyword in the Python language translates directly to some sequence of C keywords.

The rough idea is that you could generate a library of functions from every single valid combination of keywords in the language. Why? I guess it just really interests me that instead of manually defining a few specific functions, you realize that the language itself inherently implies a specific set of possible sequences of those keywords. Why not name them, and absorb a rich vocabulary of new keywords, each of which corresponds to 2, 3, 4, or maybe even 5 keywords in the original language?

That isn’t quite what I want to do with the Stack Exchange API, but it is tangentially related. I want to create a model of the API which would allow me to understand the structure in its entirety. I don’t think REST APIs count as “typed”, but of course, they return error messages for invalid input.

I think the motivation is related to a lot of these “verified” software systems nowadays. I don’t want to manually browse the API to see which functions are or are not available. It’s more like I want to build a second-layer library of functions, out of the API, which is strongly typed and verified and so on.

I was thinking it could be cool to have something kind of like a “fuzzer”, which tests every possible sequence of API actions to discover which ones would be valid. I think of it as a “tracer” - you drop it in and it automatically builds a structural model of the API for you, because it attempts to find every single valid sequence of commands, while going through tons of invalid ones.

I need to develop my thoughts more on the above, but luckily, I have a very concrete goal motivating me, which is building a different UI for Stack Exchange, using the API.

My second similar project would be creating a “data model” for this app, Zulip. Zulip is open source. I want to know the exact structure of the application data - so that, perhaps, I could with ease query that data, and use it with a different user interface; or build new custom functions that plug in seamlessly to the data model.

My intention is to use category theory here, maybe something like string diagrams or ologs, and I’m wondering how someone might pull this off.

Ryan Wisnesky (Mar 31 2024 at 01:20):

If the API is stateless, it should be easy to axiomatize, at least partially, in equational logic, or first order logic, etc. Tooling-wise, you can reason about equational theories and/or first-order theories in tools such as Coq, or CQL, or CASL (https://en.wikipedia.org/wiki/Common_Algebraic_Specification_Language). Enumerating terms is a common operation, for example, to compute an initial model of a decidable equational theory, you can keep adding more terms to the empty model until all the new terms you try to add are provably equal to one already in the model you're adding to (this is how CQL works internally). Re: zulipchat, it runs on Postgres, https://zulip.readthedocs.io/en/stable/production/postgresql.html, so it has an SQL data model; as such, you can import it into CQL and many other tools (Toad, Erwin, data frames, etc) via JDBC, ODBC, etc.

Ryan Wisnesky (Mar 31 2024 at 01:30):

follow-up: zulip appears to use an object relational mapper (in python, called Django); eg, here's the ORM for messages: https://github.com/zulip/zulip/blob/main/zerver/models/messages.py . Once consequence is that I can't seem to find a clear statement of the SQL schema, because the ORM creates it. Another consequence is that even if you pull the SQL tables from a live zulip instance, their semantics is incomplete without the python code. I.e., to model zulip one must model the python objects, not the SQL backing them. (I think)

Julius Hamilton (Mar 31 2024 at 16:21):

Bummer, finger just accidentally pressed the delete button of my own message. Would be good if Zulip had an undo feature for that.

Julius Hamilton (Mar 31 2024 at 16:23):

I am imagining diagramming an API could begin with the character set - I will begin with ASCII before generalizing.

Julius Hamilton (Mar 31 2024 at 16:27):

The categorical notion of a product, to me, is about “containment”. For

A

B

, and

C

C

is the product of

A

and

B

, if for any

D

such that

A

and

B

both have a morphism to

D

, there is a path

A \rightarrow C \rightarrow D

, and

B \rightarrow C \rightarrow D

, which is equivalent.

Julius Hamilton (Mar 31 2024 at 16:30):

The collection of all possible strings from a set of

n

symbols is adequately modeled by the set product, since the terms of the product are ordered.

Julius Hamilton (Mar 31 2024 at 16:32):

To express a collection of triplets instead of pairs, I presume we can show diagrammatically that the order in which one takes products -

(A \times B) \times C

, vs.

(A \times (B \times C)

, is equivalent.

Julius Hamilton (Mar 31 2024 at 16:33):

I will think on this and draw up some diagrams with Mermaid.js, but my desire is to express categorically, “the union of n-term products on the set

S

, for any natural number n”, to represent the set of all possible ASCII strings.

Julius Hamilton (Mar 31 2024 at 16:35):

From there, I want to express different “layers” of acceptance, by the API. There may be some ASCII codes (like control codes) that would be immediately rejected by the computer before even being sent as an HTTP request. There would be a collection of strings where the API would return error 400, command not found. Then there would be a subset of strings which are recognized commands.

Julius Hamilton (Mar 31 2024 at 16:36):

Julius Hamilton (Mar 31 2024 at 18:07):

Categorical-products.pdf
These are my thoughts so far on the nature of products. I understand I am probably working up to something like monads, or the free monoid object in a monoidal category, or something like that. Thanks.

Ryan Wisnesky (Mar 31 2024 at 20:48):

In functional languages, there are ways to insert isomorphisms automatically during type inference, to mediate between types such as (AB)C and A(BC), although that's not always regarded as a feature (it may obscure additional errors). In any case, my guidance is to work formally - you can, for example, express the type of all finite ASCII strings in Coq, in CQL, in Haskell, in OCaml, in CASL, etc, and go from there.

Julius Hamilton (Mar 31 2024 at 21:06):

I’ll respond in more depth soon. I’m most used to Coq so far. Is the idea to express the type of ascii strings without providing a function to actually enumeratively construct them (which I believe in Coq would be a ‘fixpoint’)?

Julius Hamilton (Mar 31 2024 at 21:06):

In Coq I have also struggled with the same thing in my Olog thread. I am not sure the optimal way to define “a set of abstract elements” (as opposed to pure sets). Perhaps the Ensemble type?

Julius Hamilton (Mar 31 2024 at 21:07):

Ryan Wisnesky (Mar 31 2024 at 22:00):

There's (at least) two ways to embed one language (eg your API) into another (eg Coq). In a 'shallow embedding', you use the target language constructs directly, for example, "Inductive String := Nil | Cons Char String" in Coq. Very simple, but the downside is that you can't really talk about "terms". So in a "deep embedding", you use the target language to define abstract syntax trees, for example, "Inductive Term := Var String | Const String | App Term Term" in Coq. The downside is that it is a lot of work, but then you can indeed define a Coq function to e.g. compute the initial model by enumerating terms (although this would have to be phrased as a CoFixpoint or bounded by a number of steps, because in general, an algebraic theory may not have a finite initial model and Coq won't allow divergence).

Ryan Wisnesky (Mar 31 2024 at 22:05):

Here is an example of what I suspect you are looking for, in CQL. It defines an equational theory for the natural numbers mod 4, and then displays the initial model in Cayley graph form, where you can clearly observe the four elements being constructed from terms.
Screenshot-2024-03-31-at-3.03.25PM.png

Julius Hamilton (Mar 31 2024 at 22:09):

Awesome. I actually posted to the CQL email group, not sure if you saw it. I just happened to notice the CQL IDE GitHub repo’s last update was 3 years ago. Am I looking at the right repo?

Julius Hamilton (Mar 31 2024 at 22:10):

Ryan Wisnesky (Mar 31 2024 at 22:17):

Indeed, due to popular demand, I just posted updated binaries :-). Source code will follow. Here are the features, for anyone curious:

Julius Hamilton (Apr 01 2024 at 13:29):

K. I downloaded it but am feeling like working through Coq first is going to help me prepare for CQL. Need more foundations.

Julius Hamilton (Apr 01 2024 at 17:24):

Foundations

ZFC

The first-order language of ZFC includes 2 predicate symbols,

=

and

\in

, such that:

\forall

and

\exists

are first-order functions

\forall, \exists: X \rightarrow 2

, where

X

is the domain of discourse, and

2

is the ordinal set

\{0, 1\}

. One problem I have with this is it assumes the existence of a set

2

, whereas I was hoping to prove it existed before defining anything in terms of it.

\cap

is a zeroth-order function

\cap: X \rightarrow X

, defined as

\{ x : x \in y \wedge x \in z\}

. I also have a problem with this, since the axiom of restriction has not been stated yet.

I used to think these axioms could be classified as those which roughly “say what exists”, those which “say what is constructible from something else”, and those which “say what is not allowed”. However, I am no longer sure, as the previous was influenced by me assuming equality in the theory, instead of defining it as a first-order predicate.

ETCS

The first axiom, Finite Roots, states two symbols in Objects, 0 and 1. It states that for any two symbols in Object, there is a specific third symbol, and a specific symbol in Mappings, with a particular syntactic relationship:

(Lawvere says “satisfying the usual mapping properties” and I assume he refers to the universal property of products.)

SEAR

Relational Algebras

In order to define relations, one must defined ordered sequences. The Kuratowski definition is

(x, y) = \{\{x\}, \{x, y\}\}

. (…)

SQL

SQL is a query language for relational databases. A relational database can be defined as a multi-sorted relational algebra. SQL comes with 5 operations by convention:

Notes

As you can see, I am trying to build up my understanding of what a relational database is, in order to use CQL. I appreciate any feedback from anyone. Thanks.

Ryan Wisnesky (Apr 01 2024 at 18:55):

It's always good to brush up on the basics, but in your case relational theory and zfc may not be necessary - in the example provided of Nat mod 4, the only thing CQL is using is the notion of a multi-sorted equational theory and its initial (term) model.

Julius Hamilton (Apr 01 2024 at 19:04):

Ok. Could we discuss that example?
What is “typeside”?
Why is zero seemingly a function without a domain? Is it a terminal or initial object?
Why are the arrows labeled “succ_0^0”? Does this have to do with the categorical notion of an exponential object?

This is an inductive type. But what does it represent? Is it “the set of natural numbers mod4”? Or “the type of natural numbers mod 4”?

If I tried to do the same for “the collection of all finite ASCII strings”, I suppose I just need a loop, from the set of symbols S, to the product of that set with itself? (Iteratively, this approaches the infinite product).

Ryan Wisnesky (Apr 01 2024 at 19:10):

CQL uses multi-sorted equational theories to provide user-defined functions during data migration; such theories are called "type sides". The zero function is a constant, a point/morphism out of the terminal object. The _0^0 has to do with drawing Cayley graphs for theories with n-ary function symbols, not just unary ones; suffice it to say such 'Cayley hypergraphs' need additional edge labelling to help people understand what is happening. Re: the set or type of natural numbers, I'm not sure it matters what it is called, what it is is the initial model of the theory. Asking what you can do with NatMod4 in CQL is up to you: CQL provides certain operations to use functions such as zero and succ during functorial data migration, but many software systems allow expressing equational theories such as this one and reasoning about them, and as for what to do with your API, that's up to you.

Julius Hamilton (Apr 01 2024 at 19:13):

Gotcha. Ok, I’m gonna work through the manual/documentation/wiki on the site now. Thank you, really appreciate it.

Julius Hamilton (Apr 02 2024 at 13:17):

I might just look into making the installation a little easier before trying out some of the commands in the tutorial.

Ryan Wisnesky (Apr 02 2024 at 14:32):

All you should have to do is double click cql.jar after you've downloaded it and installed java 17. There's definitely no need to build the source code.

Julius Hamilton (Apr 02 2024 at 14:36):

Consider a to do application. Data has fixed metadata fields like “due date”, “sub tasks”, “person assigned”.

Instead imagine that in reality, task management is a complex, cognitive domain. Maybe we needn’t assume every “task” has identical metadata needs. Maybe as the human assess their project management goals, they spontaneously think of data manipulation functions they would like to have, or new relation types on the data.

Was just checking out the Tutorial when I had some minor issues on Windows but perhaps I was using too new of a version of Java.

Ryan Wisnesky (Apr 02 2024 at 15:28):

I use a Mac, so most testing is performed there. Certainly there rougher edges on java-windows than on java-mac (especially around high-DPI screens and key bindings), but if you run into anything too egregious, please let me know. The usual advice I give people is to first specify a software system using math, then start programming it; like, programming languages don't tend to give one the ability to express things they can't already define, they automate that which you already understand. I can't actually tell if CQL will help with your problem or not until you've defined it; at that point, we can prove or disprove where CQL's operations can solve it. Do you know how you would solve your problem in say C or python? Or math?

Ryan Wisnesky (Apr 02 2024 at 15:29):

Julius Hamilton (Apr 02 2024 at 17:26):

Thanks. I'm working on formulating some elementary notions in Python and/or math.

A relational database is fundamentally a collection of relations, plus certain operations that can be done on them.

An n-ary relation is a subset of the cartesian product of n (abstract) sets (of elements).

people = set('Tom', 'Sam', 'Gia')
toasters = set('Barnes and Noble 3000', 'Electric Toastmaker', 'Antique Handmade Toaster')
vacation_destination = set('Hawaii', 'Paris')

example binary relation:
people_has_toaster

\subset

people

\times

toasters := set([Tom, Barnes and Noble], [Gia, Electric Toastmaker], [Sam, null])

SQL originally had 8 fundamental operations. 4 are closed binary relations, on the relations of the database. I believe it is required that the two relations in the domain have the same type. For example, if I had 2 different relations "people_has_toasters", I could take their union, to create a single table of all known people-has-toasters relation. I could take the set difference, in case the first relation is related to a different context than the second. If people-has-toasters1 are the people who own a home in Hawaii, and people-has-toasters2 are the people who own a home in Paris, their difference is the people-has-toasters info for people who have own a house in hawaii, but not in paris.
and so on.

selection:
this is like restricted set comprehension, or a "filter".
projection: this just allows you to select one of the original sets used in a set product type.
I just need to understand inner/natural join and relational divison next.

(sorry, really helps me to discuss my thoughts with somebody. im working up to showing what my idea is, once i understand relational databases better.)

As for Windows, all i needed to do was uninstall java8 that i accidentally had, and it works with java17. currently studying how to use the scheme keyword, after having defined a typeside.

Ryan Wisnesky (Apr 02 2024 at 17:33):

Julius Hamilton (Apr 03 2024 at 02:43):

Yeah, they look really good. I just got laid off so now I have more time to study categorical data science. I’ll probably start with the “Algebraic Model Management: A Survey” paper.

Julius Hamilton (Apr 03 2024 at 19:33):

To supplement my learning, could you provide any example in CQL of something like the following, please?

I think I read somewhere that categorical data is sort of able to logically compute new data types, which are known to fit in algebraically perfectly to some existing data scheme. I can’t give good examples, but some of those papers I was checking out talk about how they can do things like “aggregate data”. My sense is that there is far less of a distinction between “data” and “metadata” in categorical/algebraic data.

I suppose I am interested in how a good “data” language shouldn’t have any major hurdles in querying vs. generating data. Querying is just a retrieval that doesn’t modify the state. Generating has huge potential. It amounts to a kind of logical or conceptual inference-making.

Imagine making a to do list, but you are generating the data content and the data schematic at the same time. So you initially think “I’ll need a type, “task””, and define it. You instantiate a single instance, “wash my car”. You define a function “add_task” which takes a string and instantiates an instance of type “task” containing that string. You add a few more tasks using that function. You start to think about something more that needs to be said or done about car-washing. So you consider declaring that instance “wash car” could be assigned attributes like “estimated-time” (ie, time taken to complete the action), and “time” (ie, do at this time). But you hesitate - you aren’t certain every element of type “task” actually needs those attributes. It could be cluttered to generate those attributes needlessly. So, maybe you decide that “wash-car” is a special sub-type of “task”, a sub-type that has an “estimated-time” and “time” fields. But it doesn’t stop there. You realize that the estimated time for washing your car should not be a strict temporal range, and you also aren’t sure (and don’t care) what number system it’s defined in (rationals, reals). You know there are multiple options for defining like, “a fuzzy span of time”. But you don’t want to worry about choosing one particular option, like, “I’ll assign a real number from 0-1 for every point on a real-valued time interval, representing the “plausibility” or “suggestedness” of washing my car during that time, or, it taking that long. You want to stay general - you just want some kind of category-theoretic/type-theoretic way of expressing whatever the most essential feature of a “time interval” is. That way, later, you can “plug in” different specific functions to represent a fuzzy time interval, all of which would be compatible since they accord with some fundamental requirement on the type of all time intervals.

Can you show me anything like that in CQL, or is my description not sufficiently clear?

Ryan Wisnesky (Apr 03 2024 at 19:44):

fwiw, functorial data migration usually rules aggregation out of scope; i.e., aggregation, naively understood, is not functorial. In any case, I would definitely encourage you to start working formally, doesn't have to be in CQL- I don't think I can really contribute anything besides a slogan, "if you can't write it in SQL or Coq, you can't write it better in CQL".

Julius Hamilton (Apr 04 2024 at 20:08):

Julius Hamilton (Apr 12 2024 at 02:42):

Ryan Wisnesky (Apr 12 2024 at 04:46):

the usual rule in programming is, 'first get an algorithm working on paper' - once you have pseduo-code, I'd be happy to help translate into SQL, or CQL, or excel, or java, or MATLAB, or ... . I read the stack exchange post but don't understand what to write in any language, not just CQL. Remember: CQL is basically datalog + existential quantifiers; it isn't designed for maximal expressiveness, but to express data integrations only. Even better than pseduo-code, can you say how you'd attempt it in any language?

Julius Hamilton (Apr 12 2024 at 05:08):

The example was meant to be simple - I hoped math would be the best formalization, before translating into any programming language.

In this case, I was just curious to see how to generate all possible n-tuples, with elements from a set. Mathematically, this is the Cartesian n-product of a set S, or the set exponentation

S^n

. My intuition is not strong enough yet to consider if you could write this in Python without the use of lambda. I believe this SQL achieves it:

WITH RECURSIVE set_power(elements, depth) AS (
    -- Base case: 1st power of the set
    SELECT ARRAY[element], 1
    FROM base_set

    UNION ALL

    -- Recursive case: (n+1)-th power of the set
    SELECT sp.elements || bs.element, sp.depth + 1
    FROM set_power sp, base_set bs
    WHERE sp.depth < n
)
SELECT elements
FROM set_power
WHERE depth = n;

I just wanted to see another cool diagram basically, like the one you showed me for Nat4, if possible. Thanks.

Ryan Wisnesky (Apr 12 2024 at 07:02):

cartesian product can be modeled a number of ways; in the CQL built-in example "pullback", both chasing with regular logic and evaluating with pro-functors are used to construct pullbacks, of which cartesian products are a degenerate case.
pullback.txt

Ryan Wisnesky (Apr 12 2024 at 07:04):

Ryan Wisnesky (Apr 12 2024 at 07:05):

The 'Cayley' built-in example has some neat Cayley graphs, you have to click the Cayley tab and enter a 'bound' (eg 4) to compute.
Screenshot-2024-04-12-at-12.04.43AM.png

Stream: practice: software

Topic: Some Guidance on Modeling a Software System

Julius Hamilton (Mar 30 2024 at 23:43):

Ryan Wisnesky (Mar 31 2024 at 01:20):

Ryan Wisnesky (Mar 31 2024 at 01:30):

Julius Hamilton (Mar 31 2024 at 16:21):

Julius Hamilton (Mar 31 2024 at 16:23):

Julius Hamilton (Mar 31 2024 at 16:27):

Julius Hamilton (Mar 31 2024 at 16:30):

Julius Hamilton (Mar 31 2024 at 16:32):

Julius Hamilton (Mar 31 2024 at 16:33):

Julius Hamilton (Mar 31 2024 at 16:35):

Julius Hamilton (Mar 31 2024 at 16:36):

Julius Hamilton (Mar 31 2024 at 18:07):

Ryan Wisnesky (Mar 31 2024 at 20:48):

Julius Hamilton (Mar 31 2024 at 21:06):

Julius Hamilton (Mar 31 2024 at 21:06):

Julius Hamilton (Mar 31 2024 at 21:07):

Ryan Wisnesky (Mar 31 2024 at 22:00):

Ryan Wisnesky (Mar 31 2024 at 22:05):

Julius Hamilton (Mar 31 2024 at 22:09):

Julius Hamilton (Mar 31 2024 at 22:10):

Ryan Wisnesky (Mar 31 2024 at 22:17):

Julius Hamilton (Apr 01 2024 at 13:29):

Julius Hamilton (Apr 01 2024 at 17:24):

Foundations

ZFC

ETCS

SEAR

Relational Algebras

SQL

Notes

Ryan Wisnesky (Apr 01 2024 at 18:55):

Julius Hamilton (Apr 01 2024 at 19:04):

Ryan Wisnesky (Apr 01 2024 at 19:10):

Julius Hamilton (Apr 01 2024 at 19:13):

Julius Hamilton (Apr 02 2024 at 13:17):

Ryan Wisnesky (Apr 02 2024 at 14:32):

Julius Hamilton (Apr 02 2024 at 14:36):

Ryan Wisnesky (Apr 02 2024 at 15:28):

Ryan Wisnesky (Apr 02 2024 at 15:29):

Julius Hamilton (Apr 02 2024 at 17:26):

Ryan Wisnesky (Apr 02 2024 at 17:33):

Julius Hamilton (Apr 03 2024 at 02:43):

Julius Hamilton (Apr 03 2024 at 19:33):

Ryan Wisnesky (Apr 03 2024 at 19:44):

Julius Hamilton (Apr 04 2024 at 20:08):

Julius Hamilton (Apr 12 2024 at 02:42):

Ryan Wisnesky (Apr 12 2024 at 04:46):

Julius Hamilton (Apr 12 2024 at 05:08):

Ryan Wisnesky (Apr 12 2024 at 07:02):

Ryan Wisnesky (Apr 12 2024 at 07:04):

Ryan Wisnesky (Apr 12 2024 at 07:05):

Julius Hamilton (Apr 12 2024 at 13:38):