AI models: what alignment do they have? Do they have alignment?? Let's find out!
It would be more honest to call it "control"
AI alignment. To some, the most important problem in the world. To others, a quixotic fool's errand. To the vast majority: AI what?! OK. Well. In theory, an "aligned" AI is one whose goals are the same as humanity's, or at least don't drastically conflict with them. However, to put it mildly, humanity has never been particularly good at having common goals; so the most fundamental “AI alignment problem” means, in practice, "how do we build AI that won't get upset at and/or want to exterminate humanity?" The non-extermination of humanity being one of those rare things that most of humanity actually can agree on. Call this “Existential Alignment.”
I've written at length about how I don't think the existential risk of AI is one to be taken especially seriously. At a smaller scale, though, "AI alignment" can mean AIs who further the goals of their creators. Note that this is not at all the same thing as Existential Alignment! Russian AIs and Ukrainian AIs could be simultaneously highly aligned with their creators’ goals and very opposed to the continued existence of large numbers of human beings. Call this “Purposeful Alignment.”
Finally, “AI alignment” can simply mean AI which can be relied on to obey some reasonably well defined set of rules. Call this “Codified Alignment.” This third concept is, again, completely different from the first two. Alas, the phrase “AI alignment” is mostly used as if all three subtypes are interchangeable.
For the foreseeable future, though, we really only have to worry about Codified Alignment, examples of which abound in fiction (and gaming). Most famous are Asimov's Three Laws of Robotics:
The First Law: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
The Second Law: A robot must obey the orders given it by human beings, except where such orders would conflict with the First Law.
The Third Law: A robot must protect its own existence, as long as such protection does not conflict with the First or Second Law.
Here there be dragons
The second most famous fictional alignment system is that of classic Dungeons & Dragons, which has nothing whatsoever to do with AI ... but can still illustrate AI alignment problems. “Lawful” AI has Codified Alignment. “Good” AI has Existential Alignment. Critics of D&D’s alignment system who note, correctly, that definitions of both “lawful” and “good” tend to be in the eye of the privileged beholder … are kind of talking about Purposeful Alignment.
So: a Lawful Good AI would be nice to humans and follow well-understood rules. One can also imagine a Chaotic Good AI, no less nice … but on its own terms, following its own whims, subtle and hard to explain. A Lawful Evil AI would happily feed people into meat grinders, but only in particular circumstances which are well understood. A Lawful Neutral AI would have no interest in doing so … but might turn them into paperclips if they got too close to its paperclip factory. A Chaotic Evil AI would essentially enact Harlan Ellison's story "I Have No Mouth, and I Must Scream." Neutral Good and Neutral Evil would, as in D&D, be lukewarm combinations. And a Chaotic Neutral AI would in some ways be the scariest of all ... one which might do absolutely anything, for no apparent reason whatsoever.
Well, not to alarm you, but you could make a case that the unpredictability, unreadability, and capability overhangs of today’s AIs make them seem closer to Chaotic Neutral than anything else. But don’t worry overmuch. This probably isn’t because they don't follow rules; it’s because we humans don't understand their rules. One of the many remarkable things about modern AIs is that they remain opaque black boxes. Even their creators don't have a great idea of what's going on under the hood. As such, a lot of modern "alignment research" consists of attempts to break the seal and figure out what AIs are actually doing internally.
A couple months ago I attended a talk where a very notable expert cited three state-of-the-art instances of alignment research. Let's dig a little deeper into each, see what they're working on, and see what they can tell us.
“Eliciting Latent Knowledge”
by Christiano, Cotra, and Xu
This report from the “Alignment Research Center,” whose founder previously led the alignment team at OpenAI, is summarized on the Alignment Forum. (AI alignment is an entire research subfield now.) To summarize a summary of the summary:
You can only be really confident about an AI’s goals if you know what it’s “thinking” — assessments based purely on its outputs can never be enough. Since humans don’t understand AI internals, the best route to unveiling them is to train a “reporter” component within (or, presumably, entirely separate from) each “active” AI to look at its internals and explain them to us. The question then is: how do we train a “reporter,” and how can we be confident in its outputs? The answer is … well, actually there is no answer. There’s a lot of Socratic methodology, much of which seems both gratuitous and unconvincing, and the summary includes lines like “The report notes that this strategy is very speculative but we assume, for the sake of argument, that it works.”
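The report itself stays at the whiteboard level, but to make the “reporter” idea concrete, here is a minimal sketch of my own (an illustration, not anything proposed in the report): a tiny probe, trained on human-checked labels, that reads the main model’s internal activations rather than its outputs. The hard question ELK actually grapples with, namely what to do when humans can’t check the labels, is exactly what this toy version dodges. All the tensors below are placeholders.

```python
import torch
import torch.nn as nn

# Toy illustration of the "reporter" idea, not the ELK proposal itself:
# a small probe that reads the "active" model's hidden activations and is
# trained, with ordinary human-verified labels, to answer a yes/no question.

class Reporter(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)  # deliberately tiny

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # Probability that the answer is "yes", judged from the big model's
        # internal state alone, never from its visible outputs.
        return torch.sigmoid(self.head(activations)).squeeze(-1)

# Placeholders: in practice these would be activations captured from the
# main model, plus labels a human has actually checked.
hidden_dim = 768
activations = torch.randn(500, hidden_dim)
labels = torch.randint(0, 2, (500,)).float()

reporter = Reporter(hidden_dim)
opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(reporter(activations), labels)
    loss.backward()
    opt.step()
```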
When I started writing this post I assumed, given the notability and portfolio of the person who had cited these three examples, that each would be an impressive work worthy of serious consideration. This one is … less so than I had expected. The report’s fifty pages are more handwavey philosophy, avoiding most of the hard questions, than any real attempt to wrestle with practical solutions. That said, the fundamental notion that a smaller, dedicated, more comprehensible and reliable “reporter” AI model could be trained to at least shrink the problem of understanding the internals of a much larger and more general AI … does seem to have some merit.
“Discovering Latent Knowledge in Language Models Without Supervision”
by Burns, Ye, Klein, and Steinhardt
This is vastly better, even though, at its heart, it boils down to the same idea as the previous work. The difference is, these researchers actually built a (simple) “reporter” AI, and it (kind of) works, albeit in strictly limited conditions.
Nonetheless the results are quite impressive. The fundamental idea is to determine whether an AI model knows a given statement is true without asking it. To do so, one can take a statement and its opposite (“cats are mammals” / “cats are not mammals”), and train a new “reporter” AI to “probe” the first AI — run both inputs through it, look not at its outputs but at its internal states as it processes the data, and associate those states with truth or falsehood. Furthermore, they provide some evidence that this works better than just reading the outputs, and is harder to manipulate with prompts.
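Concretely, the paper’s method (they call it Contrast-Consistent Search) fits a small probe to the hidden states of a statement and its negation, and pushes the probe toward two properties: consistency (the probabilities it assigns to a statement and its negation should sum to roughly one) and confidence (the degenerate answer of 0.5 for both is penalized). The sketch below is my simplified paraphrase of that idea, not the authors’ code; the tensors are placeholders standing in for real activations.

```python
import torch
import torch.nn as nn

# Simplified paraphrase of contrast-consistent probing (Burns et al.), not
# their code. h_pos / h_neg stand in for hidden states the main model
# produced for each statement and its negation, e.g. "cats are mammals"
# vs. "cats are not mammals".
hidden_dim = 768
h_pos = torch.randn(200, hidden_dim)   # placeholder activations
h_neg = torch.randn(200, hidden_dim)   # placeholder activations

probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: P(true) for a statement and its negation should sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage answering 0.5 for both.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1))
    loss.backward()
    opt.step()
# Note that no truth labels appear anywhere above; that is the
# "without supervision" in the paper's title.
```

The striking design choice is that consistency and confidence alone are enough to make the probe latch onto a truth-like direction in the hidden states, at least on the tasks the paper tests.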
On the one hand this does peel back the veil a little. On the other, their approach doesn’t really work with statements that aren’t entirely true or completely false, which is to say, almost all of the interesting ones. It arguably really just replaces the last output layers of the AI with a wholly new model, which wouldn’t actually be an advance at all. And whether or not that’s true, it doesn’t escape the awkward question: if you have a “reporter” AI to report on the alignment of an “active” AI … who then reports on the reporter? Still: it’s real, it works on a very minimal level, and it could be a gateway to further work. But it could also be a complete dead end in several different ways.
“Multimodal Neurons in Artificial Neural Networks”
by (deep breath) Goh, Cammarata, Voss, Carter, Petrov, Schubert, Radford, and Olah
This is, by some distance, the most mind-blowing work of the three. Let’s back up a bit. As I’ve previously described, an AI consists of layers of interconnected neurons, each having many (often thousands of) input values, a separate “weight” for each input, and a single output value.
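For the avoidance of any mystique, here’s what one such neuron looks like in code (a generic illustration, nothing specific to this paper): multiply each input by its weight, add them up with a bias, and squash the result through a nonlinearity.

```python
import numpy as np

def neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """A single artificial neuron: a weighted sum of its inputs plus a bias,
    passed through a nonlinearity (here a ReLU). Modern models stack
    millions or billions of these into layers."""
    return max(0.0, float(np.dot(inputs, weights) + bias))

# e.g. a neuron with three inputs
print(neuron(np.array([0.2, -1.0, 0.5]), np.array([0.8, 0.1, -0.3]), 0.05))
```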
Here the basic idea is that most AIs, even those with different architectures trained on different data, develop very similar subcomponents of neurons, known as “circuits” — sometimes individual neurons, sometimes small groups with the same patterns. This isn’t that surprising. If you open up any electrical device, you’ll find circuit boards which all include some of the same components: resistors, capacitors, and so on. Similarly recurring components within AIs make intuitive sense, as does calling them “circuits.”
What’s surprising is that there are recurrent individual neurons which map to both physical and abstract concepts. What’s more, they map to all the expressions of those concepts. To quote an excellent transcribed podcast with one of the authors:
So there’s a yellow neuron for instance, which responds to the color yellow, but it also responds if you write the word yellow out. That will fire as well. And actually it’ll fire if you write out the words for objects that are yellow. So if you write the word ‘lemon’ it’ll fire, or the word ‘banana’ will fire. This is really not the sort of thing that you expect to find in a vision model. It’s in some sense a vision model, but it’s almost doing linguistic processing in some way.
Other ‘multimodal neurons’ apparently fire when exposed to individual people — here’s a drilldown to the Lady Gaga neuron — including, again, their images, their names, and/or simply concepts related to them. Really remarkable.
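For a flavor of how you’d even check a claim like “this unit fires for bananas,” the sketch below hooks an intermediate layer of an off-the-shelf vision model and prints one unit’s average activation for a handful of probe images. The model choice, the unit index, and the image files are all placeholders of mine; the actual paper does this far more rigorously, and on CLIP rather than a plain ResNet.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Generic sketch of single-unit probing: hook a layer, run probe images,
# and compare one unit's activation across them. ResNet-50, unit 123, and
# the image filenames are placeholders, not anything from the paper.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
captured = {}

def hook(module, inputs, output):
    # output is (batch, channels, H, W); average each channel spatially.
    captured["acts"] = output.mean(dim=(2, 3))

model.layer4.register_forward_hook(hook)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

UNIT = 123  # hypothetical unit we're curious about
for path in ["banana.jpg", "lemon.jpg", "truck.jpg"]:  # placeholder images
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(img)
    print(path, float(captured["acts"][0, UNIT]))
```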
As amazing as those findings are, the larger purpose here is to further our understanding of these recurring “circuits” to figure out what’s happening inside AIs, and therefore establish their alignment (or lack thereof), just as electrical engineers can use their knowledge of electrical components to look at a circuit and understand what it does without turning it on. Will this actually work for AIs? Welllllll. The “circuits” we’ve identified so far are much smaller than full AIs … and electrical engineers can’t tell you what’s happening on a circuit that’s a few resistors and op amps decorating an opaque GPU. But this should further our understanding somewhat, and seems even more likely than the previous item to be a gateway to significant further work.
I Align, You Align, We All Align and Toe The Line
What all the current alignment approaches have in common is the hope of looking into a trained model and deciphering its inner workings into some kind of explicable algorithm — of returning to the safety net of computers driven by code, or at least a code, which we can understand. Whether we try to do this by training another AI model to translate for us, or by analyzing AIs as compositions of comprehensible component parts, these attempts seem … at best … aspirational.
We train neurologists to explain the human brain to us, and divide it into component parts which we know do different things, but neither of these things has actually helped us to understand it at a deep level. As AIs approach human capacity, we can expect their complexity to approach ours as well. It seems likely that we can only decomplexify so far. A thought experiment: even if we could translate a given human brain into algorithmic code, would that code be so terse and simple that we could be certain of that person’s “alignment” under a given set of circumstances? I’m going to go with no.
And yet we have built entire societies which rely on trust in human alignment. Sometimes that trust is misplaced. It’s never certain. The great fear appears to be that AI will be deceptive, driven by hidden desires that outsiders know nothing about; well, I have some bad news for you about humans, too. But over the centuries we’ve still muddled our way into systems of checks and balances which seem to basically work, generally, most of the time. Granted, AI capabilities will be orthogonal to ours for the foreseeable future — but still, maybe, our social systems of earned trust, which have evolved with us so seamlessly that they’re often almost imperceptible, are the real body of work that should galvanize more research into alignment.