You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.
There is a certain train of thought I have been going over for about a year. I have struggled a lot to even formulate the question I am trying to ask. I wonder if conversing about it with others can help me make progress in it.
I will try to mention just a few aspects of it. I am never able to send off what I write because I keep trying to develop what I'm saying, and can never finish what I'm writing.
Computer memory, I believe, could be thought of as one single binary sequence. It's OK if the actual details are different. Memory has addresses. The values are binary.
It's a sequence of elements (values, characters).
I have not had the time to develop my understanding of this, but in set theory at least, a sequence can be thought of as a set indexed by the natural numbers; I believe the important part is that it is a totally ordered set.
Orders are defined as relations.
I have read that the category of finite ordered sets is the simplex category. This intrigues me. As I said, I have not had the time to study the simplex category deeply yet.
I am very interested in ontology. My background is in linguistics.
I also program a fair amount.
Many of the projects I try to work on revolve around my desire to process data (i.e., text, say, webpages) into something indicating semantic comprehension - like a knowledge graph.
A very common idea is that knowledge graphs are built from triples. Between two concepts, a third concept declares their relationship. So, (Donald Trump Jr, Father, Donald Trump Sr.), for example.
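To make the triple idea concrete, here is a tiny Python sketch (the `triples` set and the `query` helper are my own illustrative names, not any particular triple-store API):

```python
# A knowledge graph as a bare set of (subject, relation, object) triples.
triples = {
    ("Donald Trump Jr.", "father", "Donald Trump Sr."),
    ("Donald Trump Jr.", "occupation", "businessman"),
}

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given pattern; None matches anything."""
    return [
        (s, r, o) for (s, r, o) in triples
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]

print(query(relation="father"))
```

Real triple stores (RDF, SPARQL endpoints) add typing, namespaces, and indexing, but the underlying shape is this simple.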
I have often gotten side-tracked considering variations to the triple. Maybe I won't go into it right now.
Here is where I start to struggle to say what I want to say. Forgive me for being vague. With time, I can try to get out what I am trying to say.
I am extremely interested in unsupervised learning. On a very intuitive, non-rigorous level, whatever "data" is, we can imagine and understand that it contains latent information, relationships, inside of it.
I believe that in Claude Shannon-type information theory there is this interesting idea (strictly speaking, Shannon entropy is maximized by pure randomness, so what I mean is closer to redundancy, or compressibility): the less random something is - the more it deviates from randomness - the more structure it contains. Say, a signal being transmitted.
I think this is often used in statistics as well. There is a kind of "null hypothesis" that, given no other assumptions, we should just assume the default, arbitrary state of something to be amorphous randomness. You can calculate the extent to which data deviates from randomness, which is telling: it tells us that, on the contrary, our data does not appear to be random. And if it is not random, then (again, very informally and intuitively) it has some reason, rule, law, pattern, explanation, or cause for why it is other than random. How did it get that way? What preconditions influenced the data to emerge in this non-arbitrary form?
Maybe this is jumping ahead, but at least from a mathematical perspective, we can kind of assert that any binary sequence can be described in terms of some rules that specify or generate that sequence.
I can't remember the term - it might be Kolmogorov complexity or something similar - but there is this fascinating question in the field of induction: you might know that a binary sequence is certainly not random, and yet it may be intrinsically mysterious, difficult, even unknowable, to prove what set of rules / functions generated that sequence of data.
If there is some sort of recurrent pattern in the data, you are able to describe the information of that sequence... with less "information". You do not need to tell someone, one hundred times over: 1, then 0, then 1, then 0. Assuming you have some sort of binary language to encode such an expression, you can express the idea "repeat 10 100 times".
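As a rough illustration of this "shorter description" idea, a general-purpose compressor already exploits exactly that kind of repetition. In this Python sketch (my own toy example), a patterned string compresses far more than a pseudorandom one of the same length:

```python
import random
import zlib

# "repeat 10 100 times" is a short rule; a compressor exploits that
# regularity, while a (pseudo)random string offers little to exploit.
patterned = "10" * 100                                    # 200 characters
random.seed(0)
noisy = "".join(random.choice("01") for _ in range(200))  # 200 characters

print(len(zlib.compress(patterned.encode())))  # much smaller than 200
print(len(zlib.compress(noisy.encode())))      # noticeably larger
```

The exact byte counts depend on the zlib version, but the patterned string reliably comes out far shorter.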
Let me see if I can skip ahead to the point, a bit, then add in details as needed.
Imagine, in a genuinely practical way, that you want to run some kind of unsupervised algorithm on some textual data in order to extract / discover the inherent patterns / relationships in it.
(as I wish to do.)
In a lot of conventional situations in computer science, data science, and databases, trying to extract knowledge or data sometimes comes across to me as sort of paltry, even crude. Like SQL, labeled data... it requires you to know in advance exactly what data fields you want. It's very, very simple: like name, age, gender, height, home address, etc.
In a lot of conventional data scenarios, you just choose in advance what "data model" you want, what categories are present. It may be a more elaborate process to harvest / mine textual data for information to then place into those "buckets". But it's like that is only half of the equation of "extracting information" from data... you may have used algorithms to process the data to discover the pre-defined data labels you were looking for - but your choice of data labels was arbitrary. It feels so much deeper to also generate the labels of your data from the raw data itself: to derive them.
This brings us back to unsupervised learning, I guess.
What is so appealing about unsupervised learning is that it's like turning something inside out... it is not preferential. You are not choosing highly particular algorithms that will only seek certain types of patterns to the exclusion of others. To me, the ideal unsupervised algorithm could somehow comprehensively register every possible "relationship" in the data. But I have struggled conceptually to define what that means.
I think this might have some overlap with Stephen Wolfram's recent ideas about the "ruliad", which really interested me, but because I'm so new to this, I do not yet know if his approach would actually apply to what I am (vaguely) trying to do.
So... we look at computer memory, and we see a string of binary. But we know that this binary has deep, highly non-arbitrary information / patterns in it. Let's say it's actually an encoding of Tolstoy's War and Peace.
I am a little bit stuck here. I have more to say, but maybe I need a little time to gather my thoughts.
It seems like semantics is profoundly compositional.
There are some approaches where people build concepts from a hierarchy of concepts, ultimately built from a smaller collection of semantic primitives. Based on my musings, I have become pretty comfortable with the idea that this must surely be how the actual human mind works.
There is this one constructed language called the Information Economy MetaLanguage, by Pierre Lévy: https://journals.sagepub.com/doi/10.1177/26339137231207634
I think I need to come back to this. This was a good start, but I haven't really gotten to the heart of what I'm trying to get at.
Part of the idea is the hope that you can go from some sort of "brute force" "hypothesis generating" on binary data and, if your generation of all possible "relations" is comprehensive and totally ordered (you start from the simplest possible relations and progress towards more and more "higher-order" relations), eventually be able to generate a semantic knowledge graph from raw binary data. You do not need a "particular" form of "feature discovery" suited to the data's modality - i.e., it might be more conventional, if working with textual data, to apply natural language parsing before running some kind of unsupervised feature extraction algorithm; or, if working with visual data, to focus on things like edges. But this is meant to be the single most simple, general, all-encompassing "induction" algorithm possible, on the most abstract level possible.
I cannot justify why, but I have constantly considered the most basic concept of a "relation" to be co-occurrence.
So in a sequence (this is the technique of n-grams in NLP), you just count how many times "a" was adjacent to "d". You do that for every pair of characters.
We can imagine that the co-occurrences with a higher tally are important, are meaningful. If you did this on the English language, you might find, in the early, simple stage of "hypothesis generating", that the character "." very often occurs next to the character " " (because a sentence ends with a period followed by a space).
That is the simplest level of hypothesis generating: single characters, direct adjacency. But we can progress through a sequence of types of hypotheses, and we will discover new significant relationships.
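This simplest level is easy to sketch in Python (the function name is mine, purely illustrative):

```python
from collections import Counter

def adjacency_counts(text):
    """Tally every adjacent character pair (character-level bigrams)."""
    return Counter(zip(text, text[1:]))

counts = adjacency_counts("The end. The start. The middle.")
print(counts[('.', ' ')])  # -> 2: periods followed by spaces in this toy text
```

On a real corpus the high-tally pairs ("e" before " ", "." before " ", "q" before "u") immediately stand out against the near-zero pairs.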
I think this is very similar to how, in set theory, you construct all mathematical concepts, like functions, from a few basic ingredients. Functions are actually sets of pairs. The infinite set of natural numbers is actually built from just the empty set and a successor operation. You define surprisingly sophisticated concepts from very simple elements.
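For instance, the von Neumann construction of the naturals can be mimicked directly. A toy sketch, using Python frozensets to stand in for sets:

```python
# Each natural number is the set of all smaller naturals,
# built from nothing but the empty set and a successor operation.
zero = frozenset()

def succ(n):
    """Successor: n + 1 is defined as n together with {n}."""
    return frozenset(n | {n})

one = succ(zero)    # {0}
two = succ(one)     # {0, 1}
three = succ(two)   # {0, 1, 2}

print(len(three))   # -> 3: a number's cardinality recovers its value
```

Nothing here beyond the empty set and one rule, yet ordering, counting, and arithmetic can all be defined on top of it.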
We can imagine roughly the same thing with the English language: we would like to express that a space also very often occurs next to a capital letter. The way we generate hypotheses will need to construct something that corresponds to the concept of "capital letter", in order to be able to tally that relationship (co-occurrence) and discover that it is one with a high tally - non-arbitrary.
How? Pretty obvious: the concept "capital letter" is represented by the set of all capital letters.
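As a Python sketch (the class names here are mine, hand-picked for illustration rather than discovered): lift each character to a class, then tally co-occurrences at the class level.

```python
import string
from collections import Counter

# The concept "capital letter" represented as the set of all capital letters.
CAPITAL = set(string.ascii_uppercase)

def to_class(ch):
    """Map a character to a class label; unclassed characters stand for themselves."""
    if ch in CAPITAL:
        return "CAPITAL"
    if ch == " ":
        return "SPACE"
    return ch

def class_bigrams(text):
    classes = [to_class(ch) for ch in text]
    return Counter(zip(classes, classes[1:]))

counts = class_bigrams("The end. The start. The middle.")
print(counts[("SPACE", "CAPITAL")])  # -> 2 in this toy text
```

A genuinely unsupervised version would have to *derive* the set CAPITAL from the data (e.g. as a cluster of characters with similar co-occurrence profiles) rather than hand-code it, which is exactly the hard part.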
Hmmm a potential problem with your picture so far is that there's no way to justify that the relationships that are "most relevant" will be fully contained in the data. GPT has had success only by digesting incomprehensibly vast quantities of data. A large book on its own is not going to provide sufficient context for deducing all of the pathologies of a language's grammar, let alone correctly contextualizing all of the words and identifying all of the relationships between them: what hope does a machine have of extracting an assertion in the form of a metaphor referencing human experience that is not more deeply explained in the text?
It also seems to me that insignificant relationships could easily outweigh significant ones. Consider a text in which the entries all have a common form, such as a table with a standard set of headings. The repetition of those headings would be much easier to detect (even analysing a binary encoding of the text) than the consequent relationships between the actual contents of the tables.
That sounds a bit negative, I'm not trying to discourage you. I would like to understand in more detail the assumptions you want to work from.
(This was what I was writing before you responded; I'll respond in a second.)
Basically, the generation of hypotheses (I think) is built on three things.
Where it has constantly fallen apart in my mind is the attempt to sequentially pass through "all possibilities".
This algorithm could discover the concept of "noun" on its own. Why? Because it will iteratively try out every single combination of characters as a candidate hypothesis. Many of them will be duds. It will check bizarre hypotheses, like: how many times did a string like "a___(three characters in between)zx6" occur within 7 characters of the string "!e23"? Answer: 0. Or perhaps, in a massive data set, we get that kind of random "noise": we find occasional specimens, but in the scope of the data, they are negligible anomalies.
And yet, when this algorithm decides to check how often the class of nouns is followed by the class of verbs, it will discover an enormous tally - it will have discovered a contender for a rule of grammar.
The key thing is that it must build concepts hierarchically.
I believe this makes it a kind of "hypergraph", where there can be an arrow (a relation) between anything already present in the graph - a vertex, an edge, a collection of vertices and edges, a subgraph, etc.
This is how you are able to begin to discover relationships like: "woman" and "girl" are similar, in some way.
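A rough Python sketch of that idea (my own toy encoding, not a standard hypergraph library): give every relation an id when it is added, so that edges themselves can be endpoints of later, higher-order relations.

```python
# Graph store: relation id -> (label, source, target), where source and
# target may be raw items, sets of items, or ids of earlier relations.
graph = {}
next_id = 0

def add_relation(label, source, target):
    """Record an edge between any two existing things and return its id,
    so the edge itself can participate in later relations."""
    global next_id
    rid = next_id
    next_id += 1
    graph[rid] = (label, source, target)
    return rid

nouns = frozenset({"woman", "girl", "dog"})   # a derived class, as a set
verbs = frozenset({"runs", "sleeps"})
e1 = add_relation("followed_by", nouns, verbs)   # class-to-class edge
e2 = add_relation("high_tally", e1, 9000)        # an edge *about* an edge
print(graph[e2])
```

Allowing ids as endpoints is what makes this hypergraph-like: relations over relations, classes over classes, without limit on the nesting depth.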
I feel bad because I still haven't clearly explained what I'm asking. I will try to come back to this later to make it clearer and easier for people to understand what I am getting at.
I actually tried to draft an email to David Spivak, because I'm extremely interested in his work on ologs and categorical databases and stuff,
but yet again I got stuck.
The long-term goal is basically to build a queryable knowledge graph out of textual data.
The key thing is that you do not have predefined labels / fields in advance.
Julius said:
This algorithm could discover the concept of "noun" on its own. Why? Because it will iteratively try out every single combination of characters as a candidate hypothesis. Many of them will be duds. It will check bizarre hypotheses, like: how many times did a string like "a___(three characters in between)zx6" occur within 7 characters of the string "!e23"? Answer: 0. Or perhaps, in a massive data set, we get that kind of random "noise": we find occasional specimens, but in the scope of the data, they are negligible anomalies.
What you're describing does sound a lot like how large language models work already, by keeping track of the probabilities that certain strings of characters appear in proximity to one another. How do you imagine transitioning from that to actually extracting/labelling the concepts?
You could query your data with something like "jobs" "most common" "2007", and it would return a list of the most common jobs in the year 2007.
Anyway, I will come back to this later. Thank you for your consideration.
I don't think you could avoid having to name the concepts produced by whatever procedure you're working towards. Even though an LLM can correctly identify when a noun is called for, if learning unsupervised from a text, how would it know that a word with that function is called a noun? A potential work-around is to have an unsupervised meta-learning task to teach the LLM grammar, but that's going to completely bias such a machine towards human-determined categorization, which might not be desirable.
Morgan Rogers (he/him) said:
How do you imagine transitioning from that to actually extracting/labelling the concepts?
I haven't read all of this thread carefully, but this question caught my eye. I'm wondering if Julius (or for that matter Morgan) is familiar with "formal concept analysis", a mathematical approach to extracting "concepts" from data. Here's a nice introduction to it:
Of course you need to do real work to make formal concept analysis useful - e.g. I used the word "often" without saying exactly how I would define it, but that's part of the work you need to do.
For more, this is a great introduction:
As you probably know, Tai-Danae is one of the best expositors in applied category theory.
@John Baez Julius already mentioned knowledge graphs, which are another formalization of the kind of structured information one might want to extract. To present the problem that I think Julius is aiming at in terms of formal concept analysis, they would want to be able to extract not only "objects" and "attributes" but also the "relation" between these from raw, unlabelled, binary data. I actually think that knowledge graphs are a more reasonable thing to aim for here, since the only type of attribute that Julius has proposed is that of belonging to a particular group, but I don't expect the relations determining such groups (e.g. "occurs in similar contexts to ...") to be transitive.
In the language of knowledge graphs, I think that even if statistical measures of proximity and correlation might extract meaningful relationships between fragments of data in principle, I don't know how an unsupervised learner could learn to name or compare attributes. A mild version of this problem is that a system wouldn't be able to derive grammatical names like "synonym" or "noun" that we give to the kinds of relationship/category of word. That's fine if we can recognize these after the fact, but more problematic situations would be category errors on the part of a system: if "synonyms" correspond to pairs of words that are typically positionally interchangeable, then how would the system 'know' to draw the line to eliminate the pair "(beer,wine)" from that relation?
And all of that is assuming we're in a natural language context that the system has been able to reconstruct from raw binary!
@Morgan Rogers (he/him)
Thank you very much for engaging with me, this is exactly what I wanted, and it helps me so much.
A potential problem with your picture so far is that there's no way to justify that the relationships that are "most relevant" will be fully contained in the data.
You are correct, but that is an incidental problem.
I want to consider the ideal scenario in order to develop a theory, with the assumption that the practical necessities of implementing the theory can be factored out entirely.
What you're describing does sound a lot like how large language models work already.
My idea is heavily influenced by how large language models work. A key difference is that LLMs are "fuzzy" (statistical, based approximately on continuous functions). My idea is 100% "algebraic". (This is but one point of many which I can expand on greatly - I believe my problem with my idea is that there are so many interconnected parts, and none of them are developed enough yet, so I really, really need to break it apart into modular pieces and focus rigorously on each piece.)
How do you imagine transitioning from that to actually extracting/labelling the concepts?
A key part of the puzzle is this: https://openai.com/research/language-models-can-explain-neurons-in-language-models
Sometimes, it is hard to have the "meta-knowledge" to realize how what you know about something is very different from what other people know about something. You forget how different their picture is than yours.
When ChatGPT came out, I was utterly fascinated by how it worked, what its properties and behaviors were. I tinkered with it enough to develop a personal hypothesis about it. I came to feel I had seen through the illusion, in certain ways. My view had grown to differ from, say, the news buzz that came out when a Google engineer proclaimed that LaMDA (Google's earlier LLM) was "sentient", just because it could answer conversational questions fluently in the first person. (Again: I can expand on this.)
Working on my own, at times I felt my learning progression was a kind of "discovery". Of course, in the vast world of AI users and researchers, many people were coming to similar conclusions at the same time, or had already known such things long before me. Regardless, when the above (to me, remarkable) paper came out, it confirmed my own train of thought. Large language models build ontologies.
In latent semantic analysis, you can represent a concept as a set of related words. What becomes deeply insightful and philologically fascinating is how you can combine somewhat arbitrary bundles of words, like "bag", "nuclear fission", and "emblematic", and use things like word vectors / embeddings to figure out what "vector" (token, word, phrase, or string) calculates as "most similar". You may not have realized what they have in common, but the algorithm can surprise you by drawing a connection which fuses / synthesizes them, or at least find a context which encompasses them.
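A hedged illustration of that "most similar" calculation, with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions learned from corpora, e.g. by word2vec; every number below is invented):

```python
import math

# Toy word vectors; the geometry, not the values, is the point.
toy_vectors = {
    "bag":     [0.9, 0.1, 0.2],
    "sack":    [0.8, 0.2, 0.1],
    "fission": [0.1, 0.9, 0.7],
}

def cosine(u, v):
    """Cosine similarity: the usual 'most similar' measure for embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# In this toy space "bag" lands closer to "sack" than to "fission".
print(cosine(toy_vectors["bag"], toy_vectors["sack"]) >
      cosine(toy_vectors["bag"], toy_vectors["fission"]))
```

Combining "bundles of words" then amounts to averaging their vectors and asking which vocabulary vector is nearest to the result.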
There are ways in which these semantic techniques can provide conceptual knowledge to humans that is difficult and non-intuitive for them to figure out themselves, but which they can confirm seems accurate when provided with it.
This is also a similar, key paradigm:
"Neural Networks are Decision Trees"
https://arxiv.org/abs/2210.05189
In a way, it seems that neural networks use "continuous mathematics" only as a route to approach, through gradual movement, what is a much more discrete model of the world. As is well known, the "neurons" in neural networks can sometimes be clearly interpreted as to what factor, in a given context, they essentially "decide" on, before routing that information further to yet another decision (or multiple decisions) conditional on the prior ones.
But that doesn't really answer the question you asked, and it is a very good one (for me to think about).
Again,
... by keeping track of the probabilities that certain strings of characters appear in proximity to one another. How do you imagine transitioning from that to actually extracting/labelling the concepts?
As far as I know, what you touched upon is a very, very famous and active point of research and debate amongst LLM people, including top researchers. Google DeepMind publishes research trying to settle hard questions about what's really going on inside them. Andrew Ng tweeted an article about "OthelloGPT", which is one point scored for the party of the slogan, "LLMs build actual conceptual models, and therefore can actually 'think', in a limited way".
"Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see?"
"Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task"
https://arxiv.org/abs/2210.13382
I even saw a meme that summed it up nicely. It is ever so common to hear people say (exactly as I have), "LLMs are 'statistical parrots'. They relatively mindlessly regurgitate similar-sounding blather based on a salad of similar words in some discourse. If their response sounds accurate, you got lucky. It probably just coughed up a fusion of some terms it learned from a Wikipedia article and some research papers."
The meme goes,
Person A: "LLMs only predict the next token in a sequence of tokens."
Person B: "You keep using that word 'only'."
Similar to self-driving cars, people have wondered if an apparently reductive task actually requires deep enough cognitive abilities that in order to solve that problem, you actually have to solve AGI. With LLMs, the idea is, if a system can often predict the next token relatively well, maybe that would not be possible if it did not actually have some semblance of deep conceptual understanding.
I am going to leave off on this point right now, but it is another one that I should expand on and refine.
You ask how we can label the concepts learned by an LLM. The answer is, they are already roughly labeled. How? By the inputs. When you talk to an LLM, your input string is a label. The output the LLM generates is the data returned by that label. This is a succinct way of thinking about it, but requires so much more development, much more to say.
@John Baez Thank you for those, they look relevant. I feel like you can guide me in developing this. You can both point me towards highly formalized and developed concepts in category theory that are the structures I am looking for, and probably also with the more soft-side of just strategies for doing research and mathematical/intellectual theorization.
I meant to post something about formal concept analysis but I see I left it as a draft. Here it is:
One nice thing about formal concept analysis is that it uses a bit of category theory. The idea is that we have a relation R between a set X (of words, or phrases, etc. - it doesn't matter) and a set Y (of pictures, etc. - again it doesn't matter). This relation could be "the word x is often found next to the picture y". Or anything else, but let's just use that example.
This gives a map f sending subsets of X to subsets of Y. The idea is that given a set S of words, we say f(S) is the set of all pictures such that all the words in S are often found next to those pictures.
This map f is a map of posets from the power set P(X) to P(Y)^op, where the "op" is thrown in because the bigger S is, the smaller f(S) is.
Thus - to cut a long story short, which Simon Willerton explains better than I could here - the map f gives rise to a Galois correspondence between P(X) and P(Y)^op. And the "fixed points" of this Galois correspondence are called concepts, because they really do act a lot like concepts.
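For a concrete toy illustration of those "fixed points", here is a short Python sketch (all names are mine) that enumerates the concepts of a tiny formal context: the relation is a set of (word, picture) pairs, and a concept is a pair (A, B) of subsets that are each other's images under the Galois connection.

```python
from itertools import combinations

# Toy formal context: which word is "often found next to" which picture.
relation = {("cat", "p1"), ("cat", "p2"), ("dog", "p2"), ("dog", "p3")}
words = {w for w, _ in relation}
pictures = {p for _, p in relation}

def up(A):      # words -> pictures next to all of them
    return {p for p in pictures if all((w, p) in relation for w in A)}

def down(B):    # pictures -> words next to all of them
    return {w for w in words if all((w, p) in relation for p in B)}

# Close every subset of words; the distinct (down(up(A)), up(A)) pairs
# are exactly the formal concepts of the context.
concepts = set()
for r in range(len(words) + 1):
    for A in combinations(sorted(words), r):
        B = up(set(A))
        concepts.add((frozenset(down(B)), frozenset(B)))

for A, B in sorted(concepts, key=lambda c: sorted(c[0])):
    print(sorted(A), sorted(B))
```

This brute-force enumeration is exponential in the number of words; practical FCA tools use cleverer algorithms, but the definition of "concept" is exactly the closure condition checked here.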
Nitpick: I think it's usually called a Galois connection. A Galois correspondence is a Galois connection where the two contravariant maps (one from P(X) to P(Y), the other going the other way) are inverse to each other.
Okay, thanks, I didn't know they had separate names.
Or maybe I did know, once....
Thanks.
I’m going to read those papers, but first I should reason a bit more on my own, as I need to make sure I know what I’m looking for.
I wrote a draft of an introduction to a paper about this.
There is an informal philosophical idea that I hope could be made more precise mathematically, and there is a chance Wolfram's "ruliology" does that.
The idea is that whatever we call “information”, it requires a pre-existing context in which it is expressed.
I think that Wolfram’s “multi-way systems” have a way of showing how information has a relationship to an “observer”.
I think Wolfram believes that a ruliad exists. I think Andrej Bauer believes it does not, and we have to embrace that we live in a logical “multiverse”.
In either case, it helps me try to define “all possible hypotheses”, for induction on arbitrary data. The idea may be misguided; there are different algorithms which lead to different constructions on the data. We need not seek a single unifying one.