Interpretability: What AI Actually Thinks (Then Fixing It)

Krang is the real thinker inside the big robot, from Teenage Mutant Ninja Turtles.

You know that AI hallucinates, but it also has many other issues that limit its value. Worse, we have very little idea how to properly fix those issues! We simply do not understand AI well enough to know how to fix its bugs.

How an AI model actually arrives at an answer is almost entirely a black box. Not even the model developers can explain why a model chose one particular word over another. However, the science of interpretability is starting to light up that black box and show us how AI models tick.

You need to understand it to fix it

This isn’t simply an intellectual pursuit. It’s a real and important critique of AI that no one can tell how it arrived at an answer. With traditional software, we can. Consider the late-2010s crashes of Boeing 737 MAX aircraft. After investigation, we know precisely what caused the flight-control system to push the nose of the airplane down toward the ground: a faulty hardware sensor indicated that the airplane was about to stall. Boeing’s engineers know the exact malfunction that caused the crashes. We cannot do the same for today’s AI software.

Failing to understand how AI works internally is what ultimately causes hallucination, harmful output, and useless results. In traditional software, we could find the bug and fix it. In AI software, we are often stuck adding clumsy guardrails, writing restrictive instructions, and asking the user to double-check the output.

ALWAYS return in precise JSON without any errors. DON’T HALLUCINATE!!

We could resolve all of those issues, if only we understood the inner workings of the AI model.

How AI works

Don’t worry, I’m not about to delve into the mathematics of transformers. Instead, let me give you some intuition about how they operate.

You’ve heard complaints that Large Language Models (LLMs), an application of Deep Neural Nets, are next-word predictors. This is true in a technical sense, but it’s a mostly useless description in the practical sense.

LLMs started out as “completion” models. You would give it the start of a sentence, and it would write out each of the next words (actually tokens, but that’s not important here) of a sentence. You could send, “The quick brown fox jumps over the lazy”, and the model would output “dog”. Those first eight words are repeated many times in the training data, and nearly always “dog” is the next word. This simple example does work like the next-word predictor on your phone keyboard.
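If you want to see this for yourself, here is a minimal completion sketch using the Hugging Face transformers library. GPT-2 is just an arbitrary small completion model; the exact output depends on which model you load, though this particular sentence nearly always ends with “dog”.

    # A minimal completion sketch with the Hugging Face transformers library.
    # GPT-2 is an arbitrary small completion model; the output depends on the
    # model you load, but this sentence nearly always ends with " dog".
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The quick brown fox jumps over the lazy"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding: at each step, take the single most likely next token.
    output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    print(tokenizer.decode(output[0]))  # "...jumps over the lazy dog"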

The trick is that LLMs are not simply looking at how often those words appear in that order in the training data. Each of those words has a meaning within the black box of the model. This gets complicated quickly. “Quick” will usually be an adjective meaning “fast,” but less commonly it can refer to a part of a fingernail. The model needs to know both interpretations, and somehow stores both in its internal neurons.

Just as important for LLMs, each word is affected to a different degree by each of the preceding words. If you change the prompt to “…over twelve lazy”, the model will correctly make it “dogs”, plural. Scale this up trillions of times, and you can solve arbitrary problems by setting up a completion that stops right before the answer.

Unfortunately for anyone trying to understand them, scaling LLMs up trillions of times also means it is no longer comprehensible how they work in any particular situation.

Understanding neurons

Interpretability is the study of how an AI arrived at an answer. One of my projects last year used an interpretability technique called Representation Engineering (RepE). RepE works by noting which neurons within the model activate the most when completing a sentence on a particular topic or with a particular emotion. Then for future sentences, you can artificially boost those neurons to be louder. You should definitely try out the interactive version on my article from last fall.
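To give a flavor of what “boosting neurons to be louder” looks like in code, here is a simplified sketch of the steering step using a PyTorch forward hook. It assumes you already have a Hugging Face causal LM loaded as model and a steering_vector computed beforehand (for example, as the difference in average activations between prompts with and without the target emotion). The layer index, module path, and make_steering_hook helper are illustrative, not the exact code from my project.

    def make_steering_hook(steering_vector, strength=4.0):
        # Returns a forward hook that adds the chosen direction to every
        # position's hidden state, making those neurons "louder".
        def hook(module, inputs, output):
            if isinstance(output, tuple):  # decoder blocks usually return tuples
                return (output[0] + strength * steering_vector,) + output[1:]
            return output + strength * steering_vector
        return hook

    # Attach to one transformer block (assumed Llama/Gemma-style module path);
    # which layer and strength work best is found by experiment.
    layer = model.model.layers[10]
    handle = layer.register_forward_hook(make_steering_hook(steering_vector))
    # ... generate text here; the output drifts toward the boosted concept ...
    handle.remove()  # turn the electrodes back off

Push strength too high and you get exactly the gibberish I described: the boosted direction drowns out everything else the layer was representing.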

RepE is crude: it feels like giving the model’s brain electroshock therapy. More than a small tweak leads quickly to gibberish. I believe this is because there is too much error in knowing which neurons should be boosted.

Fortunately, recent improvements in the field have shown us how to understand a little bit about what any internal neuron of a model means. Scientists at OpenAI and elsewhere used powerful AI models to study simpler ones, creating neuron dictionaries. It’s not actually quite that simple: a word is contained within several neurons, sort of “smeared” across them and overlapping with other concepts, and it changes based on the previous words (like I said, not very simple!). In any case, the word “quick” in this sentence is made up of neurons (in Gemma-2-2B) that have been described by another model as:

  • words associated with speed and ease
  • the word “rapid”
  • mentions of legal counsel
  • code snippets and C++ references
  • the word “quickly”

Among others. It’s fun to see how wrong so much of the model is, or perhaps how correct it is in a dimension we cannot understand. You can try this out in OpenAI’s Neuron viewer.
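Conceptually, a neuron dictionary is just a lookup from a neuron’s position in the model to a human-readable guess at what it means. Here is a toy sketch of that idea; the layer and neuron indices are placeholders I made up for illustration, not real Gemma-2-2B entries, and describe_token is not the Neuron viewer’s actual API.

    # Toy neuron dictionary: (layer, neuron index) -> a model-written description.
    # These indices are placeholders, not real Gemma-2-2B entries.
    neuron_dictionary = {
        (12, 3045): "words associated with speed and ease",
        (12, 887): "the word 'rapid'",
        (14, 210): "mentions of legal counsel",
        (17, 95): "code snippets and C++ references",
    }

    def describe_token(activations, top_k=3):
        """activations: list of ((layer, neuron), activation value) for one token."""
        strongest = sorted(activations, key=lambda pair: -pair[1])[:top_k]
        return [(key, neuron_dictionary.get(key, "no description yet"), value)
                for key, value in strongest]

    # The strongest (made-up) neurons firing on the token "quick":
    print(describe_token([((12, 3045), 8.1), ((17, 95), 5.4), ((14, 210), 2.2)]))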

Understanding relationships

Interpreting a model’s output begins with understanding the neurons, but there’s more to do. We also need to understand the relationships between neurons. These are encoded directly in the values of the model, but there are too many of them, smeared around too much, to make sense of directly. Worse, the next output word of the model can be influenced by thousands to millions of previous words.

I sometimes think of interpretability as filling in a map. Understanding the meaning of neurons is writing in city and place names. Understanding their relationships is like drawing the roads in-between them.

Researchers at Anthropic call interpreting relationships “circuit tracing,” and it is even more fun than understanding neurons. Here is one piece of the map between neurons for our quick brown fox sentence:

The pink highlighted neuron can be understood to refer to concepts about “numbers and code,” which is similar to the neurons we saw active for the word “quick.” There are actually several neurons about programming languages and structured syntax that help the model produce “dog” as output (visible in the top right). Perhaps the sentence’s origin as a typing exercise connects it to typing out code? With more investigation we could find out for sure.
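To give a taste of what tracing one road on that map involves, here is a rough sketch of a simple gradient-times-activation attribution: how much does each neuron in one layer push the model toward predicting “dog”? This is a crude stand-in for what the real circuit-tracing tools do, not Anthropic’s method. It reuses the GPT-2 model and tokenizer from the earlier completion sketch; the module path would differ for other architectures (Llama/Gemma-style models use model.model.layers).

    captured = {}

    def capture_hook(module, inputs, output):
        # Keep this layer's hidden states (and their gradients) for later.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden.retain_grad()
        captured["hidden"] = hidden

    handle = model.transformer.h[10].register_forward_hook(capture_hook)

    inputs = tokenizer("The quick brown fox jumps over the lazy", return_tensors="pt")
    logits = model(**inputs).logits
    dog_id = tokenizer(" dog", add_special_tokens=False).input_ids[0]

    # Backpropagate from the " dog" logit at the final position.
    logits[0, -1, dog_id].backward()

    # Gradient x activation: which neurons pushed hardest toward " dog"?
    scores = (captured["hidden"].grad * captured["hidden"])[0, -1]
    print(scores.topk(5))
    handle.remove()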

Debugging “Elara”

The first custom GPT I built with ChatGPT was Dungeon Quests Infinity. When testing, I found that characters were often named “Elara.” At one point I added instructions to have it pick something else. What’s even weirder is that this persisted across multiple models, even models from Anthropic and Google. All AI loves to name characters “Elara.” Why?
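Before reaching for interpretability tools, it’s worth confirming the pattern empirically. A quick, informal check is to sample the same naming prompt many times and count what comes back. This sketch reuses the model and tokenizer from the earlier sketches; the sampling settings and number of trials are arbitrary choices.

    from collections import Counter

    # Informal bias check: sample a naming prompt repeatedly and count names.
    prompt = "The sorceress was named Lady"
    counts = Counter()

    for _ in range(50):
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=3, do_sample=True,
                             temperature=1.0, top_p=0.95)
        completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]).strip()
        if completion.split():
            counts[completion.split()[0]] += 1

    print(counts.most_common(5))  # is one name dominating?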

I used the Circuit Tracer tool to find out. I started the sentence, “The sorceress was named Lady” to see what would influence the next word. After digging in, tracing, grouping, and investigating, I noticed a neuron specifically about “text related to the World of Warcraft online game.” I made this extremely simplified map of what leads to the output:

World of Warcraft does indeed have a character named “Elara.” This time, the model was actually going to produce a name starting with “Sy”, which I’m betting would be Lady Sylvanas Windrunner, another World of Warcraft character. This is no smoking gun, but it’s possible that World of Warcraft characters are over-represented in the training data. If you go look at Lady Sylvanas’s description page, you’ll notice it’s longer than an average US President’s Wikipedia page!

Interpreting models to fix bugs

If I’m right that World of Warcraft character names are over-represented, I can fix the bug in my Dungeon Quests Infinity game. Instead of writing “Don’t name characters Elara,” I can say, “Don’t use existing names from World of Warcraft.” That de-emphasizes both “Elara” and the other fantasy names that get repetitive. It’s a more durable fix for a broader class of bug. Or, if I were the model developer, I could address the problem for everyone in the next model by balancing out the training data.

As the science of interpretability improves, we’ll be able to address the more serious issues of hallucination, harmful output, and useless results. One day, we may quash bugs by carefully tracing these neurons of imprecise meaning and adjusting their relationships.

The black box of AI is lighting up. Interpretability gives us the map we need to find and properly fix problems in AI models, making them more useful for everyone.


If you are interested in this topic, I strongly recommend reading this fascinating article by Anthropic: Tracing the thoughts of a large language model. You’ll never argue that “it’s just a next-word predictor” again!

Abram Jackson
