106 points JPLeRouzic 3 days ago 60 comments
It generates text that seems to me at least on par with tiny LLMs, such as those demonstrated by NanoGPT. Here is an example:
jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$
./SLM10b_train UriAlon.txt 3
Training model with order 3...
Skip-gram detection: DISABLED (order < 5)
Pruning is disabled
Calculating model size for JSON export...
Will export 29832 model entries
Exporting vocabulary (1727 entries)...
Vocabulary export complete.
Exporting model entries...
Processed 12000 contexts, written 28765 entries (96.4%)...
JSON export complete: 29832 entries written to model.json
Model trained and saved to model.json
Vocabulary size: 1727
jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$ ./SLM9_gen model.json
Aging cell model requires comprehensive incidence data. To obtain such a large medical database of the joints are risk factors. Therefore, the theory might be extended to describe the evolution of atherosclerosis and metabolic syndrome. For example, late‐stage type 2 diabetes is associated with collapse of beta‐cell function. This collapse has two parameters: the fraction of the senescent cells are predicted to affect disease threshold . For each individual, one simulates senescent‐cell abundance using the SR model has an approximately exponential incidence curve with a decline at old ages In this section, we simulated a wide range of age‐related incidence curves. The next sections provide examples of classes of diseases, which show improvement upon senolytic treatment tends to qualitatively support such a prediction. model different disease thresholds as values of the disease occurs when a physiological parameter ϕ increases due to the disease. Increasing susceptibility parameter s, which varies about 3‐fold between BMI below 25 (male) and 54 (female) are at least mildly age‐related and 25 (male) and 28 (female) are strongly age‐related, as defined above. Of these, we find that 66 are well described by the model as a wide range of feedback mechanisms that can provide homeostasis to a half‐life of days in young mice, but their removal rate slows down in old mice to a given type of cancer have strong risk factors should increase the removal rates of the joint that bears the most common biological process of aging that governs the onset of pathology in the records of at least 104 people, totaling 877 disease category codes (See SI section 9), increasing the range of 6–8% per year. The two‐parameter model describes well the strongly age‐related ICD9 codes: 90% of the codes show R 2 > 0.9) (Figure 4c). This agreement is similar to that of the previously proposed IMII model for cancer, major fibrotic diseases, and hundreds of other age‐related disease states obtained from 10−4 to lower cancer incidence. A better fit is achieved when allowing to exceed its threshold mechanism for classes of disease, providing putative etiologies for diseases with unknown origin, such as bone marrow and skin. Thus, the sudden collapse of the alveoli at the outer parts of the immune removal capacity of cancer. For example, NK cells remove senescent cells also to other forms of age‐related damage and decline contribute (De Bourcy et al., 2017). There may be described as a first‐passage‐time problem, asking when mutated, impair particle removal by the bronchi and increase damage to alveolar cells (Yang et al., 2019; Xu et al., 2018), and immune therapy that causes T cells to target senescent cells (Amor et al., 2020). Since these treatments are predicted to have an exponential incidence curve that slows at very old ages. Interestingly, the main effects are opposite to the case of cancer growth rate to removal rate We next consider the case of frontline tissues discussed above.
MarkusQ 2 days ago | parent
Sohcahtoa82 3 hours ago | parent
But then, Markov chains fall apart when the source material is very large. Try training a chain on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of the output turning into simple regurgitation.
In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov chains are discrete in their next-token prediction while an LLM is fuzzier. LLMs have a latent space where the meaning of a word basically exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov chains will ONLY generate sequences that existed in the source.
This is why it's impossible to create a digital assistant, or really anything useful, with a Markov chain. The fact that they only generate sequences that existed in the source means that they will never come up with anything creative.
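For anyone who hasn't built one: a minimal sketch of the kind of word-level, order-2 chain being described (the whitespace tokenizer and table layout are just one possible choice, not a reference implementation):

    import random
    from collections import defaultdict

    def train_markov(text, order=2):
        # Map each `order`-token context to the tokens observed right after it.
        tokens = text.split()
        table = defaultdict(list)
        for i in range(len(tokens) - order):
            context = tuple(tokens[i:i + order])
            table[context].append(tokens[i + order])
        return table

    def generate(table, order=2, length=30):
        context = random.choice(list(table.keys()))
        out = list(context)
        for _ in range(length):
            followers = table.get(tuple(out[-order:]))
            if not followers:  # this context was never seen in training
                break
            out.append(random.choice(followers))
        return " ".join(out)

By construction, every window of order + 1 consecutive output tokens appeared somewhere in the training text; any coherence beyond that window is accidental.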
johnisgood 3 hours ago | parent
I have seen the argument that LLMs can only give you what they have been trained on, i.e. they will not be "creative" or "revolutionary", that they will not output anything "new", but "only what is in their corpus".
I am quite confused right now. Could you please help me with this?
Somewhat related: I like the work of David Hume, and he explains quite well how we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them. We know what dragons typically look like, and we know what a pig looks like, and so we can imagine (through our creativity and the combination of these two ideas) what a pig with a dragon head would look like. I wonder how this applies to LLMs, if it applies at all.
Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that are not already in their training data. Is this really true or not?
jldugger 3 hours ago | parent
1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The temperature settings commonly exposed control how often the system takes the most likely candidate tokens (see the sampling sketch after this list).
2. A Markov chain model will literally have a matrix entry for every possible combination of inputs, so a 2-degree chain will have N^2 weights, where N is the number of possible tokens. In that situation "pig with" can never be completed with a brand-new sentence, because unseen continuations have a literal probability of 0. In contrast, transformers consider huge context windows, and start with random weights in huge neural-network matrices. What people hope happens is that the NN begins to represent ideas, and connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.
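A rough sketch of point 1, assuming you have raw logits for the candidate tokens (the exact sampling scheme varies between implementations):

    import numpy as np

    def sample_next_token(logits, temperature=1.0):
        # Lower temperature sharpens the distribution toward the top token;
        # higher temperature flattens it and makes rare continuations likelier.
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())   # numerically stable softmax
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))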
thaumasiotes 3 hours ago | parent
> I am quite confused right now. Could you please help me with this?
This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.
Sohcahtoa82 3 hours ago | parent
thaumasiotes 3 hours ago | parent
More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.
johnisgood 3 hours ago | parent
shagie 2 hours ago | parent
Make up puzzles of your own and see if it is able to solve them or not.
The blanket claim that it "cannot solve problems that are not in its training data" seems to be something that can be disproven by making up a puzzle from your own human creativity and seeing if it can solve it - or, for that matter, how it attempts to solve it.
It appears that there is some ability for it to reason about new things. I believe much of this "an LLM can't do X" or "an LLM is parroting tokens it was trained on" comes from trying to claim that all the material it creates was created before by a human, and that any use of an LLM is therefore stealing from some human and thus unethical.
( ... and maybe if my block world or wizards and warriors and witches puzzle was in the training data somewhere, I'm unconsciously copying something somewhere else and my own use of it is unethical. )
wadadadad 1 hour ago | parent
In your second example with the wizards, did you notice that it failed to follow the rules? In step 3, the witch was summoned by the wizard. I'm curious why you didn't comment either way on this.
On a related note, instead of puzzles, what about presenting riddles? I would argue that riddles are creative, pulling bits and pieces of meaning from words to create an answer. If an AI can solve riddles it has not seen before, would that count as being creative rather than solving problems already in its dataset?
Here's one I created and presented (the first incorrect answer I got was Escape Room; I gave it 10 attempts and it didn't get the answer I was thinking of):
---
Solve the riddle:
Chaos erupts around
The shape moot
The goal is key
purple_turtle 3 hours ago | parent
2) LLMs are not Markov Chains
thaumasiotes 1 hour ago | parent
Here's wikipedia:
> a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.
> 2) LLMs are not Markov Chains
As far as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool of text generation. A Markov chain for this purpose is a hash table. The key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.
To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".
Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.
A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".
Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".
You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.
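To put the same point in code (a toy sketch; the table entries and the stand-in function are made up for illustration):

    from typing import Dict, Tuple

    Context = Tuple[str, ...]          # the last N tokens
    Distribution = Dict[str, float]    # token -> probability

    # Markov-chain view: an explicit table keyed by the context.
    chain_table: Dict[Context, Distribution] = {
        ("pig", "with"): {"a": 0.9, "the": 0.1},
    }

    # LLM view: a function of the context. The body here is a placeholder for a
    # neural-network forward pass; the interface is what matters.
    def llm(context: Context) -> Distribution:
        return {"a": 0.9, "the": 0.1}

    # Either way, the caller asks the same question and gets the same kind of answer.
    def next_distribution(context: Context, use_table: bool = True) -> Distribution:
        return chain_table[context] if use_table else llm(context)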
koliber 3 hours ago | parent
Imagine a source corpus that consists of:
Cows are big. Big animals are happy. Some other big animals include pigs, horses, and whales.
A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".
An LLM can get a sense of meaning in these words and can return ideas expressed in the input corpus. So in this case it might say "Pigs and horses are happy". It's not limited to responding with verbatim sequences. It can be seen as a bit more creative.
However, LLMs will not be able to represent ideas they have not encountered before. They won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have an unbounded creativity that LLMs do not.
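For concreteness, here is the order-1 (single word of context) transition structure of that toy corpus, ignoring case and punctuation; the chain can only walk along these observed edges:

    from collections import defaultdict

    corpus = ("Cows are big. Big animals are happy. "
              "Some other big animals include pigs, horses, and whales.")

    successors = defaultdict(set)
    words = corpus.lower().replace(".", "").replace(",", "").split()
    for prev, nxt in zip(words, words[1:]):
        successors[prev].add(nxt)

    print(dict(successors))
    # e.g. 'are' -> {'big', 'happy'}, 'big' -> {'big', 'animals'}, 'animals' -> {'are', 'include'}

Note that in this corpus 'happy' only ever follows 'are', which is what limits an order-1 chain to those observed paths.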
marcellus23 2 hours ago | parent
Just for my own edification, do you mean "Are big animals are happy"? "animals happy" never shows up in the source text so "happy" would not be a possible successor to "animals", correct?
vidarh 2 hours ago | parent
There's absolutely no evidence to support this claim. It'd require humans to exceed the Turing computable, and we have no evidence that is possible.
koliber 2 hours ago | parent
Humans created novel concepts like writing literally out of thin air. I like how the book "Guns, Germs, and Steel" describes that novel creative process and contrasts it with a derivative process of dissemination.
fragmede 1 hour ago | parent
Sure they do. We call them hallucinations and complain that they're not true, however.
koliber 39 minutes ago | parent
In people there is a difference between unconscious hallucination and intentional creativity. However, there might be situations where they are not distinguishable. In LLMs, it's hard to talk about intentionality.
I love where you took this.
Sohcahtoa82 3 hours ago | parent
LLMs can absolutely create things that are creative, at least for some definition of "creative".
For example, I can ask an LLM to create a speech about cross-site scripting in the style of Donald Trump:
> Okay, folks, we're talking about Cross-Site Scripting, alright? I have to say, it's a bit confusing, but let's try to understand it. They call it XSS, which is a fancy term. I don't really know what it means, but I hear it's a big deal in the tech world. People are talking about it, a lot of people, very smart people. So, Cross-Site Scripting. It's got the word "scripting" in it, which sounds like it's about writing, maybe like a script for a movie or something. But it's on the internet, on these websites, okay? And apparently, it's not good. I don't know exactly why, but it's not good. Bad things happen, they tell me. Maybe it makes the website look different, I don't know. Maybe it makes things pop up where they shouldn't. Could be anything! But here's what I do know. We need to do something about it. We need to get the best people, the smartest people, to look into it. We'll figure it out, folks. We'll make our websites safe, and we'll do it better than anyone else. Trust me, it'll be tremendous. Thank you.
Certainly there's no text out there that contains a speech about XSS from Trump. There are some snippets here and there that likely sound like Trump, but a Markov chain simply is incapable of producing anything like this.
johnisgood 3 hours ago | parent
0cf8612b2e1e 26 minutes ago | parent
If you similarly trained a Markov chain at the token level on an LLM-sized corpus, it could produce something similar. Lacking an attention mechanism, the token probabilities would be terribly unconstructive for the effort, but it is not impossible.
umanwizard 3 hours ago | parent
People who claim this usually don’t bother to precisely (mathematically) define what they actually mean by those terms, so I doubt you will get a straight answer.
pama 2 hours ago | parent
godelski 2 hours ago | parent
> I have seen the argument that LLMs can only give you what they have been trained on
There's confusing terminology here, and without clarification people talk past one another. "What it's been trained on" is a distribution. It can produce things from that distribution and only things from that distribution. If you train on multiple distributions, you get the union of those distributions, which is itself a distribution.
This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind-bogglingly impressive machine!)
A distribution is more than its samples; it includes the things in between, too. Does the LLM perfectly capture the distribution? Of course not. But it's a compression machine, so it compresses the distribution. Again, that is different from compressing the samples, as one does with a zip file.
So distributionally, can it produce anything novel? No, of course not. How could it? It's not magic. But sample-wise, can it produce novel things? Absolutely! It would be an incredibly unimpressive machine if it couldn't, and it's pretty trivial to prove that it can. Hallucinations are a good indication that this happens, but verifying it is impossible on anything but small LLMs, since you can't prove that a given output isn't among the samples the model was trained on (they're just trained on too much data).
> people have been telling me that LLMs cannot solve problems that are not already in their training data. Is this really true or not?
Up until very recently, most LLMs have struggled with the prompt:
Solve: 5.9 = x + 5.11
This is certainly in their training distribution and has been for years, so I wouldn't even conclude that they can solve problems "in their training data". But that's why I said it's not a perfect model of the distribution.
> a pig with a dragon head
One needs to be quite careful with examples, as you'll have to make the assumption that such a sample does not exist in the training data, and with the size of training data this is effectively unverifiable. But I would also argue that humans can do more than that. Yes, we can combine concepts, but this is a lower level of intelligence that is not unique to humans. A variation of this is applying a skill from one domain in another; you might see how that's pretty critical to most animals' survival. But humans created things that are entirely outside nature and that require more than a highly sophisticated cut-and-paste operation. Language, music, mathematics, and so much more are beyond that. We could be daft and claim music is simply a cut-and-paste of songs which can all naturally be reproduced, but that will never explain away the feelings or emotion it produces, or how we formulated the sounds in our heads long before giving them voice. There is rich depth to our experiences if you look. But doing that is odd and easily dismissed, as our own familiarity deceives us into overlooking it.
franciscator 2 hours ago | parent
andoando 1 hour ago | parent
hugkdlief 1 hour ago | parent
Funny choice of combination, pig and dragon, since Leonardo Da Vinci famously imagined dragons themselves by combining lizards and cats: https://i.pinimg.com/originals/03/59/ee/0359ee84595586206be6...
johnisgood 1 hour ago | parent
I should totally try to generate images using AI with some of these prompts!
thfuran 3 hours ago | parent
A Markov chain of order N will only generate sequences of length N+1 that were in the training corpus, but it is likely to generate sequences of length N+2 that weren't (unless N was too large for the training corpus and the chain is degenerate).
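A quick way to see the N+2 case, sketched for order N = 2 with a made-up corpus:

    from collections import defaultdict

    # "the cat sat down" (length N+2 = 4) never appears in the corpus, yet an
    # order-2 chain can emit it by stitching two observed 3-grams together.
    corpus = "the cat sat up . the dog ran . cat sat down fast".split()

    table = defaultdict(list)
    for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
        table[(a, b)].append(c)

    print(table[("cat", "sat")])   # ['up', 'down']  -> "the cat sat down" is reachable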
Isamu 3 hours ago | parent
Sohcahtoa82 2 hours ago | parent
If you use a context window of 2, then yes, you might know that word C can follow words A and B, and D can follow words B and C, and therefore generate ABCD even if ABCD never existed.
But it could be that ABCD is incoherent.
For example, if A = whales, B = are, C = mammals, D = reptiles.
"Whales are mammals" is fine, "are mammals reptiles" is fine, but "Whales are mammals reptiles" is incoherent.
The longer you allow the chain to get, the more incoherent it becomes.
"Whales are mammals that are reptiles that are vegetables too".
Any 3-word fragment of that sentence is fine. But put it together, and it's an incoherent mess.
vjerancrnjak 2 hours ago | parent
Something like Markov Random Field is much better.
Not sure if anyone has managed to create latent hierarchies from chars to words to concepts. Learning NNs is far more tinkery than the brute force of probabilistic graphical models.
ssivark 2 hours ago | parent
psychoslave 2 hours ago | parent
I also do that kind of thing with LLMs. The other day, I don't remember the prompt (something casual really, not trying to trigger any issue), but le chat mistral started to regurgitate "the the the the the...".
And this morning I was trying some local models, trying to see if they could output some Esperanto. Well, that was really a mess of random morphs thrown together. Not syntactically wrong, but so out of touch with any possible meaningful sentence.
lotyrin 2 hours ago | parent
papyrus9244 32 minutes ago | parent
Or, in other words, a Markov chain won't hallucinate. Having a system that only repeats sentences from its source material and doesn't create anything new on its own is quite useful in some scenarios.
AndrewKemendo 3 hours ago | parent
I’d offer an alternative interpretation: LLMs follow Markov decision modeling properties to encode the problem, but use a very efficient policy solver for the specific token-based action space.
That is to say, they are both within the concept of a "Markovian problem" but have wildly different path solvers. MCMC is a solver for an MDP, as is an attention network.
So same same, but different
aespinoza 3 hours ago | parent
spencerflem 3 hours ago | parent
benob 2 hours ago | parent
inciampati 3 hours ago | parent
zwaps 3 hours ago | parent
kittikitti 2 hours ago | parent
vjerancrnjak 1 hour ago | parent
Difference is obviously there but nothing prevents you from undirected conditioning of long range dependencies. There’s no need to chain anything.
The problem from a math standpoint is that it’s an intractable exercise. The moment you start relaxing the joint opt problem you’ll end up at a similar place.
kleiba 3 hours ago | parent
But then came deep-learning models - think transformers. Here, you don't represent your inputs and states discretely; instead you have a representation in a higher-dimensional space that aims at preserving some sort of "semantics": proximity in that space means proximity in meaning. This allows nuances to be captured much more finely than is possible with sequences of symbols from a discrete set.
Take this example: you're given a sequence of n words and are to predict a good word to follow that sequence. That's the thing that LMs do. Now, if you're an n-gram model and have never seen that sequence in training, what are you going to predict? You have no data in your probability tables. So what you do is smoothing: you take away some of the probability mass that you assigned during training to the samples you encountered and give it to samples you have not seen. How? That's the secret sauce, but there are multiple approaches.
With NN-based LLMs, you don't have that exact same issue: even if you have never seen that n-word sequence in training, it will get mapped into your high-dimensional space. And from there you'll get a distribution that tells you which words are good follow-ups. If you have seen sequences of similar meaning (even with different words) in training, these will probably be better predictions.
But for n-grams, just because you have seen sequences of similar meaning (but with different words) during training, that doesn't really help you all that much.
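One of the simplest versions of that secret sauce is add-one (Laplace) smoothing, sketched here for a bigram model; real n-gram systems use more refined schemes such as Kneser-Ney, and the toy corpus below is made up:

    from collections import Counter

    def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev, word):
        # Add-one smoothing: pretend every possible continuation was seen once more
        # than it was, so unseen bigrams get a small nonzero probability.
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

    tokens = "the cat sat on the mat".split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    print(laplace_bigram_prob(bigrams, unigrams, len(unigrams), "the", "dog"))  # unseen, yet > 0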
esafak 2 hours ago | parent
tlarkworthy 3 hours ago | parent
thatjoeoverthr 3 hours ago | parent
But also important are embeddings.
Tokens in a classic Markov chain are discrete surrogate keys. “Love”, for example, and “love” are two different tokens. As are “rage” and “fury”.
In a modern model, we start with an embedding model, and build a LUT mapping token identities to vectors.
This does two things for you.
First, it solves the above problem, which is that "different" tokens can be conceptually similar. They're embedded in a space where they can be compared and contrasted in many dimensions, and the model becomes less sensitive to wording.
Second, because the incoming context is now a tensor, it can be used with a differentiable model, backpropagation, and so forth.
I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emissions. It's rough but manages multiturn conversation on a consumer GPU.
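The LUT part looks roughly like this (a toy sketch: the four-word vocabulary is made up, and random vectors stand in for a trained embedding table):

    import numpy as np

    vocab = {"Love": 0, "love": 1, "rage": 2, "fury": 3}   # discrete surrogate keys
    dim = 8
    rng = np.random.default_rng(0)
    embedding_table = rng.normal(size=(len(vocab), dim))   # the LUT; learned in practice

    def embed(token):
        return embedding_table[vocab[token]]

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # With trained embeddings, cosine(embed("rage"), embed("fury")) would come out high
    # even though the two surrogate keys are unrelated; here the vectors are random.
    print(cosine(embed("rage"), embed("fury")))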
yobbo 2 hours ago | parent
Transformers can be interpreted as tricks that recreate the state as a function of the context window.
I don't recall reading about attempts to train very large discrete (million states) HMMs on modern text tokens.
qoez 2 hours ago | parent
Untwittered: A Markov model and a transformer can both achieve the same loss on the training set. But only the transformer is smart enough to be useful for other tasks. This invalidates the claim that "all transformers are doing is memorizing their training data".
currymj 2 hours ago | parent
https://web.stanford.edu/~jurafsky/slp3/ed3book_aug25.pdf
I don't know the history but I would guess there have been times (like the 90s) when the best neural language models were worse than the best trigram language models.
ssivark 2 hours ago | parent
The problem with HMMs is that the sequence model (Markov transition matrix) accounts for much less context than even Tiny LLMs. One natural way to improve this is to allow the model to have more hidden states, representing more context -- called "clones" because these different hidden states would all be producing the same token while actually carrying different underlying contexts that might be relevant for future tokens. We are thus taking a non-Markov model (like a transformer) and re-framing its representation to be Markov. There have been sequence models with this idea aka Cloned HMMs (CHMMs) [1] or Clone-Structured Cognitive Graphs (CSCGs) [2]. The latter name is inspired by some related work in neuroscience, to which these were applied, which showed how these graphical models map nicely to "cognitive schemas" and are particularly effective in discovering interpretable models of spatial structure.
I did some unpublished work a couple of years ago (while at Google DeepMind) studying how CHMMs scale to simple ~GB sized language data sets like Tiny Stories [3]. As a subjective opinion, while they're not as good as small transformers, they do generate text that is surprisingly good compared with naive expectations of Markov models. The challenge is that learning algorithms that we typically use for HMMs (eg. Expectation Maximization) are somewhat hard to optimize & scale for contemporary AI hardware (GPU/TPU), and a transformer model trained by gradient descent with lots of compute works pretty well, and also scales well to larger datasets and model sizes.
I later switched to working on other things, but I still sometimes wonder whether it might be possible to cook up better learning algorithms attacking the problem of disambiguating contexts during the learning phase. The advantage with an explicit/structured graphical model like a CHMM is that it is very interpretable, and allows for extremely flexible queries at inference time -- unlike transformers (or other sequence models) which are trained as "policies" for generating token streams.
When I say that transformers don't allow flexible querying, I'm glossing over in-context learning capabilities, since we still lack a clear/complete understanding of them and of what kinds of pre-training and fine-tuning one needs to elicit them (these are frontier research questions at the moment, and they require a more nuanced discussion than a quick HN comment).
It turns out, funnily, that these properties of CHMMs actually proved very useful [4] in understanding the conceptual underpinnings of in-context learning behavior using simple Markov sequence models instead of "high-powered" transformers. Some recent work from OpenAI [5] on sparse+interpretable transformer models seems to suggest that in-context learning in transformer LLMs might work analogously, by learning schema circuits. So the fact that we can learn similar schema circuits with CHMMs makes me believe that what we have is a learning challenge and it's not actually a fundamental representational incapacity (as is loosely claimed sometimes). In the spirit of full disclosure, I worked on [4]; if you want a rapid summary of all the ideas in this post, including a quick introduction to CHMMs, I would recommend the following video presentation / slides [6].
[1]: https://arxiv.org/abs/1905.00507
[2]: https://www.nature.com/articles/s41467-021-22559-5
[3]: https://arxiv.org/abs/2305.07759
[4]: https://arxiv.org/abs/2307.01201
[5]: https://openai.com/index/understanding-neural-networks-throu...
[6]: https://slideslive.com/39010747/schemalearning-and-rebinding...
Anon84 2 hours ago | parent
You could probably point your code to Google Books N-grams (https://storage.googleapis.com/books/ngrams/books/datasetsv3...) and get something that sounds (somewhat) reasonable.
robot-wrangler 2 hours ago | parent
lukev 1 hour ago | parent
I think it's worth mentioning that you have indeed identified a similarity, in that both LLMs and Markov chain generators have the same algorithmic structure: autoregressive next-token generation.
Understanding Markov chain generators is actually a really, really good step towards understanding how LLMs work overall, and I think it's a really good pedagogical tool.
Once you understand Markov generation, doing a bit of handwaving to say "and LLMs are just like this, except with a more sophisticated statistical approach" has the benefit of being true, demystifying LLMs, and also preserving a healthy respect for just how powerful that statistical model can be.
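That shared structure is literally the same outer loop; only the callback differs. A hedged sketch (the callback receives the full context as a tuple and is free to truncate it to a fixed window, as a Markov chain would, or attend over all of it, as an LLM would):

    import random

    def generate(next_token_distribution, prompt_tokens, max_new_tokens=50):
        # The outer loop is identical for a Markov chain and an LLM; the only thing
        # that differs is how next_token_distribution maps a context to probabilities.
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            dist = next_token_distribution(tuple(tokens))   # {token: probability}
            if not dist:
                break
            choices, weights = zip(*dist.items())
            tokens.append(random.choices(choices, weights=weights, k=1)[0])
        return tokens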
canjobear 1 hour ago | parent
kleiba 1 hour ago | parent
OvrUndrInformed 47 minutes ago | parent
unoti 42 minutes ago | parent
bilsbie 24 minutes ago | parent