World-first research dissects an AI’s mind, and starts editing its thoughts

If you listen to the rather compelling arguments of AI doomsayers, the coming generations of artificial intelligence represent a profound danger to humankind – potentially even an existential risk.

We’ve all seen how easily apps like ChatGPT can be tricked into saying or doing naughty things they’re not supposed to. We’ve seen them attempt to conceal their intentions, and to seek and consolidate power. The more access AIs are given to the physical world via the internet, the greater capacity they’ll have to cause harm in a variety of creative ways, should they decide to.

Why would they do such a thing? We don’t know. In fact, their inner workings have been more or less completely opaque, even to the companies and individuals that build them.

The inscrutable alien ‘minds’ of AI models

These remarkable pieces of software are very different to most of what’s come before them. Their human creators have built the architecture, the infrastructure and the methods by which these artificial minds can develop their version of intelligence, and they’ve fed them vast amounts of text, video, audio and other data, but from that point onward, the AIs have gone ahead and built up their own ‘understanding’ of the world.

They convert these massive troves of data into tiny scraps called tokens, sometimes parts of words, sometimes parts of images or bits of audio. And then they build up an incredibly complex set of probability weights relating tokens to one other, and relating groups of tokens to other groups. In this way, they’re something like the human brain, finding connections between letters, words, sounds, images and more nebulous concepts, and building them up into an insanely complex neural web.

Massive amounts of data in, massively complex neural webs out

Massive amounts of data in, massively complex neural webs out

These colossal matrices full of probability weightings represent the ‘mind’ of an AI, and they drive its ability to receive inputs and respond with certain outputs. And, much like the human brains that have inspired their design, it’s been nigh-on impossible to figure out exactly what they’re ‘thinking,’ or why they’re making certain decisions.

Personally, I’ve been conceiving them as strange alien minds locked in black boxes. They can only communicate with the world via the limited pipelines by which information can flow in and out of them. And all attempts to ‘align’ these minds to work productively, safely and inoffensively alongside humans have been done at the pipeline level, not to the ‘minds’ themselves.

We can’t tell them what to think, we don’t know where rude words or evil concepts live in their brains, we can only constrain what they can say and do – a concept that’s difficult now, but promises to become increasingly harder the smarter they become.

This is my highly reductive, bonehead-level understanding of a dense and complex situation – and please jump into the comments to expand, query, debate or clarify if necessary – but it gives some indication of why I think the news that’s come out of Anthropic and OpenAI recently is such an important milestone in humanity’s relationship with AIs.

What is interpretability?

Interpretability: Peering into the black box

“Today,” writes the Anthropic Interpretability team in a blog post from late May, “we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.”

Essentially, the Anthropic team has been tracking the ‘internal state’ of its AI models as they work, having them spit out great lists of numbers representing the ‘neuron activations’ in their artificial brains as they interact with humans. “It turns out,” writes the team, “that each concept is represented across many neurons, and each neuron is involved in representing many concepts.”

Using a technique called ‘dictionary learning,’ through ‘sparse autoencoders,’ Anthropic researchers began trying to match patterns of ‘neuron activations’ with concepts and ideas familiar to humans. They had some success late last year working with extremely small “toy” versions of language models, discovering the ‘thought patterns’ that activated as the models dealt with ideas like DNA sequences, nouns in mathematics, and uppercase text.

It was a promising start, but the team was by no means sure it’d scale up to the gigantic size of today’s commercial LLMs, let alone the machines to follow. So Anthropic built a dictionary learning model capable of dealing with its own medium-sized Claude 3 Sonnet LLM, and set about testing the approach at scale.

The results? Well, the team was blown away. “We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet,” reads the blog post, “Providing a rough conceptual map of its internal states halfway through its computation. This is the first ever detailed look inside a modern, production-grade large language model.”

Multimodal models develop independent concepts, like "Golden Gate Bridge" that can be accessed through images as well as text in multiple languages
Multimodal models develop independent concepts, like “Golden Gate Bridge” that can be accessed through images as well as text in multiple languages


It’s fascinating to learn that the AI is storing concepts in ways that are independent of language, or even data type; the ‘idea’ of the Golden Gate Bridge, for example, lights up as the model processes images of the bridge, or text in multiple different languages.

And ‘ideas’ can become a whole lot more abstract than that, as well; the team discovered features that activated when encountering things like coding errors, gender bias, or many different ways of approaching the concept of discretion or secrecy.

All the darkest ideas of humanity, and all your fears about AI are robustly shown in the concept maps
All the darkest ideas of humanity, and all your fears about AI are robustly shown in the concept maps


And indeed, the team was able to locate all sorts of darkness in the AI’s conceptual web, from ideas about code backdoors and biological weapons development, to concepts of racism, sexism, power-seeking, deception and manipulation. It’s all in there.

What’s more, the researchers were able to look at the relationships between different concepts stored in the model’s ‘brain,’ developing a measure of ‘distance’ between them and building a series of mind maps that show how closely concepts are connected. Near the Golden Gate Bridge concept, for example, the team found other features like Alcatraz island, the Golden State Warriors, California Governor Gavin Newsom and the 1906 San Francisco earthquake.

The team worked out a way to measure "distances" between concepts, allowing the creation of these incredible conceptual mind maps
The team worked out a way to measure “distances” between concepts, allowing the creation of these incredible conceptual mind maps


The same held for more abstract concepts, right down to the idea of a Catch-22 situation, which the model had grouped close to ‘impossible choices,’ ‘difficult situations,’ ‘curious paradoxes,’ and ‘between a rock and a hard place.’ “This shows,” writes the team, “that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude’s excellent ability to make analogies and metaphors.”

The beginnings of AI brain surgery – and potential lobotomies

“Importantly,” writes the team, “we can also manipulate these features, artificially amplifying or suppressing them to see how Claude’s responses change.”

The team began “clamping” certain concepts, altering the model so that certain features were forced to fire as it answered completely unrelated questions, and found it drastically altered the model’s behavior, as shown in the video below.

Dictionary learning on Claude 3 Sonnet

This is pretty incredible stuff; Anthropic has shown it can not only create a mind-map of an artificial intelligence – it can also edit relationships within that mind map and toy with the model’s understanding of the world – and subsequently its behavior.

The potential in terms of AI safety here is clear; if you know where the bad thoughts are, and you can see when the AI is thinking them, well, you’ve got an added layer of oversight that could be used in a supervisory sense. And if you can strengthen or weaken connections between certain concepts, you could potentially make certain behaviors vanish from an AI’s range of possible responses, or even excise certain ideas from its understanding of the world.

It’s conceptually reminiscent of Jim Carrey and Kate Winslet paying a brain-wiping company to delete each other from their memories after a breakup, in the sci-fi masterpiece Eternal Sunshine of the Spotless Mind. And, like the movie, it raises the question: can you ever really delete a powerful idea?

The Anthropic team also proved the potential peril of this approach, “clamping” the concept of scam emails, and showing how a powerful enough mental connection to the idea could quickly bypass the Claude model’s alignment training forbidding it from writing such content. This kind of AI brain surgery can indeed supercharge a model’s potential for evil behavior, and allow it to smash through its own guardrails.

Editing the AI's neural connection strengths can radically alter its behavior
Editing the AI’s neural connection strengths can radically alter its behavior


Anthropic has other reservations about the extent of this technology. “The work has really just begun,” writes the team. “The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place).

“Understanding the representations the model uses doesn’t tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety relevant features we have begun to find can actually be used to improve safety. There’s much more to be done.”

This kind of thing could be a hugely valuable tool, in other words, but it’s unlikely ever to fully understand a commercial-scale AI’s thought processes. This will give little comfort to doomsayers, who will point out that when the consequences are potentially existential, a 99.999% success rate ain’t gonna cut the mustard.

Still, it’s a phenomenal breakthrough, and a remarkable insight into the way these unbelievable machines understand the world. It’d be fascinating to see how closely an AI’s mental map sits with a human’s, should it ever be possible to measure that.

OpenAI: also working on interpretability, but apparently a ways behind

Anthropic is one key player in the modern AI/LLM field, but the juggernaut in the space is still OpenAI, makers of the groundbreaking GPT models and certainly the company driving the public conversation around AI the hardest.

Indeed, Anthropic was founded in 2021 by a group of former OpenAI employees, to put AI safety and reliability at the top of the priority list while OpenAI partnered with Microsoft and started acting more like a commercial entity.

But OpenAI has also been working on interpretability, and using a very similar approach. In research released in early June, the OpenAI Interpretability team announced it had found some 16 million ‘thought’ patterns in GPT-4, many of which the team considers decipherable and mappable onto concepts meaningful to humans.

The OpenAI team doesn’t seem to have ventured into the map-building or mind-editing areas yet, but it also notes the challenges inherent in understanding a large AI model as it works. “Currently,” writes the team, “passing GPT-4’s activations through the sparse autoencoder results in a performance equivalent to a model trained with roughly 10x less compute. To fully map the concepts in frontier LLMs, we may need to scale to billions or trillions of features, which would be challenging even with our improved scaling techniques.”

So from both companies, it’s early days at this point. But at least humanity now has at least two ways to prize open the ‘black box’ of an AI’s neural web and begin understanding how it thinks.

The OpenAI research paper is available here.
The Anthropic research paper is available here.

Hear members of Anthropic’s Interpretability team discuss this research in detail in the video below.

Scaling interpretability

Sources: Anthropic, OpenAI

Source of Article