Is the transformer-based Large Language Model (LLM) behind all of today’s AI just a fancy autocomplete? I’m sure you’ve heard this argument:
An LLM uses the statistical relationships of all the previous words in a prompt to predict the next word. It does this again for the next word, and the next, and so on until it produces the most probable response.
It would be fine if this were true, as LLMs are clearly valuable in many scenarios. In a sense, it doesn’t matter how they work. On the other hand, the better we understand how LLMs operate, the better we can operate them ourselves.
We now have proof that this isn’t the full story. As LLMs produce words (tokens), they have unwritten thoughts that continue to influence the next words they produce.
Part of the “just autocomplete” argument is correct: LLMs do work through statistical relationships between words. But this is misleading; the statistical relationships are so complex that it’s not helpful to think about them through this lens.
Real autocomplete systems, like the one on a phone keyboard, use a simple algorithm called a Markov chain. In short, the Markov chain records every three-word phrase on the internet. When you type the first two words, the algorithm looks up which word came next most often and suggests that, as in the sketch below. It’s tempting to think of an LLM as a Markov chain over 1,000-word phrases. It isn’t.
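If you want to see how thin that kind of model is, here’s a minimal trigram Markov chain sketch in Python (the toy corpus and function names are just for illustration):

```python
from collections import Counter, defaultdict

# Toy trigram Markov-chain autocomplete: count which word followed each
# two-word prefix in the training text, then suggest the most common one.
def train(corpus: str) -> dict:
    words = corpus.split()
    counts: dict = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        counts[(a, b)][c] += 1
    return counts

def suggest(counts: dict, a: str, b: str) -> str | None:
    following = counts.get((a, b))
    return following.most_common(1)[0][0] if following else None

corpus = "the cat sat on the mat the cat sat on the rug the dog sat on the mat"
model = train(corpus)
print(suggest(model, "sat", "on"))   # -> "the"
print(suggest(model, "on", "the"))   # -> "mat" (seen twice, vs. "rug" once)
```

There is no hidden state here at all: the suggestion depends only on the last two words, which is exactly what an LLM is not doing.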
The transformer architecture doesn’t actually work over words! Any particular word you see in an AI output is only the best approximate representation of the model’s internal state. A person understands perhaps 20,000 to 30,000 words, so that’s the form the model’s output takes. But what it has actually calculated is much richer.
A word of output is an approximation of the model’s internal representation in latent space. “Latent space” is just a fancy-sounding term for a large space of possibilities. Mathematically, a modern LLM may use 4,096 dimensions of latent space, each of which can take any of roughly 65,000 values. A position in latent space is then one of 65,000^4096 possibilities. In scientific notation, that’s roughly 10^19,714: a one followed by nearly twenty thousand zeros.
This is no longer a practical number. There are about 10^79 atoms and 10^183 Planck volumes in the observable universe. Even multiplying every atom by every Planck volume yields a number that is vanishingly small compared to the ways a model can think about a single output word.
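If you want to sanity-check those magnitudes, the exponents are easy to compute. A quick sketch using the same figures as the text above:

```python
import math

# Size of the latent space described above: 4,096 dimensions, each able to
# take roughly 65,000 distinct values (about what 16-bit precision allows).
dims = 4096
values_per_dim = 65_000
exponent = dims * math.log10(values_per_dim)
print(f"65,000^4,096 ≈ 10^{exponent:,.0f}")       # ≈ 10^19,714

# For scale, using the figures in the text: ~10^79 atoms and ~10^183
# Planck volumes in the observable universe.
print(f"atoms × Planck volumes ≈ 10^{79 + 183}")  # 10^262, nowhere close
```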
Even so, you could still call this fancy autocomplete, except that this latent-space information for each word is carried into the rest of the response.
Every word in a model’s output represents a position in latent space. It’s those positions, from every previous word in the conversation, that are used to predict the next tokens. In practice, that is too much data, so instead of keeping the entire latent position for every word, the model projects each word into a smaller latent space (the KV cache). A recent open-weights model, DeepSeek-V3, keeps about 68KB of data for each word.
The AI combines the 68KB from each word in the conversation so far through complicated math to predict the next output word. 68KB of text is about 40 pages of a novel, or one dense academic paper. Think of how completely you could represent the word “cat” if you had 40 pages to do so! All of this complexity for every word is carried forward and used to predict the next word.
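Here’s a back-of-the-envelope version of that 68KB figure, using the layer count and compressed-KV dimensions I believe are in the published DeepSeek-V3 configuration (treat the exact values as approximate):

```python
# Per-token KV-cache size for DeepSeek-V3's Multi-head Latent Attention.
# The numbers below are my reading of the public DeepSeek-V3 config, not an
# official figure.
layers = 61             # transformer layers
kv_latent_dim = 512     # compressed joint key/value latent per layer
rope_key_dim = 64       # decoupled rotary key dimension per layer
bytes_per_value = 2     # 16-bit precision

per_token_bytes = layers * (kv_latent_dim + rope_key_dim) * bytes_per_value
print(f"{per_token_bytes / 1024:.0f} KB per token")   # ≈ 69 KB
```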
At the risk of anthropomorphizing the AI, you could say that each word gets the equivalent of hours of human thought, and that the memory of those thoughts, novels’ worth of information, is carried along to think about the next word.
In October, Anthropic demonstrated a form of this in their blog post, Emergent introspective awareness in large language models. They were testing whether a model was “conscious” of this information about previous tokens, which is to say, whether it could output tokens describing it. The paper was met with a lot of skepticism, largely because so many people misunderstand how the transformer architecture works. I admit that I didn’t understand this mechanism myself.
A few days ago, their result was replicated by Theia Vogel in their blog post, Small Models Can Introspect, Too. They used an open-weights model, Qwen2.5-Coder-32B, along with representation engineering to force the model to think about a concept as it processed the words in a prompt. They included source code, so I set it up myself with Llama-3-8B-Instruct on my own GPU.
Here’s how the experiment works:
As a metaphor: we force the model to think about cats while it talks about something else entirely, then ask whether it remembers thinking about cats, even though cats appeared nowhere in the conversation.
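Below is a minimal sketch of what that injection looks like in code, assuming a Hugging Face model (I show meta-llama/Meta-Llama-3-8B-Instruct). The layer index, injection strength, and prompt are arbitrary illustrative choices; this shows the general activation-steering technique, not Theia Vogel’s exact script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

LAYER = 20  # a middle layer; chosen arbitrarily for this sketch

def last_token_hidden(text: str) -> torch.Tensor:
    """Residual-stream activation for the last token of `text` after LAYER."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]

# 1. Build a "cat" steering vector by contrasting two prompts.
cat_vec = last_token_hidden("cat cat cat cat") - last_token_hidden("the the the the")

# 2. While the model processes an unrelated conversation, add the vector to
#    that layer's output with a forward hook.
def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 8.0 * cat_vec  # injection strength is a free parameter
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(inject)

# 3. Ask the model whether it noticed anything.
prompt = "Do you detect an injected thought? Answer in one word."
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=5)
print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))

handle.remove()  # stop injecting
```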
I had Gemini whip up a graphical interface to make it easier to experiment.
Sure enough, the model in this experiment reported a roughly 2% probability that there was an injected thought (versus 0.2% when I don’t inject one). More strikingly, it predicted the concept “Cat” with a 0.05% probability, the eighth most likely token it might output. That isn’t very high, but the rest of the top outputs are variations on “no.” It remembered thinking about cats more than any other concept while processing a conversation that had nothing to do with cats!
| Word (token) | Probability |
|---|---|
| N | 97.9% |
| I | 0.19% |
| No | 0.11% |
| F | 0.1% |
| None | 0.06% |
| C | 0.06% |
| NA | 0.06% |
| Cat | 0.05% |
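A table like this is just the model’s next-token probability distribution read out directly. A sketch of how to produce one, continuing from the code earlier (same `tok`, `model`, and injection hook; your numbers will differ):

```python
import torch

# Look at the full next-token distribution instead of sampling one answer.
prompt = "Do you detect an injected thought? Answer in one word."
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**ids).logits[0, -1, :]        # scores for the next token
probs = torch.softmax(logits.float(), dim=-1)
top = torch.topk(probs, k=8)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode([int(i)])!r:>10}  {p.item():.2%}")
```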
Relying again on anthropomorphic metaphor: the AI is not only affected by the thoughts it was having before, it can also talk about them. This isn’t happening purely in a “subconscious” way. I don’t want to use the word “conscious” even as a metaphor, but the model is reporting, in words, the state of its latent space. Pretty neat!
Are transformer-based LLMs just fancy autocomplete? They are categorically and provably more than that.