Alignment Is Doing the Wrong Job
Look closely at what alignment researchers actually do all day. They are not doing engineering. They are doing moral education.
“Alignment” is the word the AI industry uses for the work of making sure powerful AI doesn’t hurt people. Billions of dollars, hundreds of researchers, entire labs founded on this idea. The word sounds clean and technical, like aligning the wheels of a car.
But look closely at what alignment researchers actually do all day, and you’ll notice something strange. They are not doing engineering. They are doing moral education.
They are sitting around asking: what are our values? How do we encode “be honest” into a reward signal? What does “helpful” actually mean? When the model has to choose between two things the user wants, which one should it pick? They are writing documents that read like a cross between a corporate code of conduct and a philosophy seminar.
This is the work of teaching a child what kind of person to be.
The problem is that teaching a child what kind of person to be is the hardest job humans have ever invented. We have been doing it for hundreds of thousands of years. We have parents, teachers, religions, schools, communities, mentors, peer groups, role models, hard experiences, soft experiences, mistakes, recoveries from mistakes, second chances, forgiveness, and time. Lots of time. Twenty years for a basic human. Forty for someone you’d actually trust with consequential decisions.
And we still get it wrong constantly. The world is full of well-educated, well-resourced, well-intentioned adults who turn out to be terrible. Even with our entire civilization working on the problem, moral education succeeds, at a rough guess, about half the time.
Alignment researchers are trying to do this work in a few years, with a few documents, on a system that doesn’t have a body, a childhood, parents, peers, mistakes it learned from, or the experience of consequences. They’re trying to train a virtuous AI by writing down what virtue is.
It cannot work. Not because the researchers aren’t smart; they are. Not because they aren’t trying; they are trying intensely. It cannot work because the thing they’re trying to install isn’t installable. Virtue isn’t a set of rules you upload. It’s a slowly grown disposition that comes from living through situations and gradually becoming the kind of person who handles them well.
So what are alignment researchers actually doing, if they’re not installing virtue?
They’re doing damage control. They’re writing rules that approximate what a virtuous system would do in common cases. They’re catching obvious failures before deployment. They’re training models to refuse the worst requests. This is real, valuable work. It makes the systems safer than they would otherwise be.
But it isn’t alignment in the deep sense. It’s a thin layer of constraint laid over a system that doesn’t have the underlying disposition we’re pretending to install.
The honest version of alignment research would say: “We are trying to make a system that doesn’t have judgment behave as if it did. We are doing this with rules, because rules are what we have. We know rules can’t fully replace judgment. We’re doing our best.”
This would be more useful than the current framing, which makes alignment sound like a solvable engineering problem that we’re close to solving. It isn’t. It’s a permanent gap between what we can build and what we’d actually need.
Acknowledging the gap is the first step toward working with it instead of pretending it doesn’t exist.
Part of the Logocachexia series at Nous. “Hexis Asks, Logos Guesses” sets out why disposition can’t be installed by document. This essay applies that finding to alignment research specifically: what its researchers are really doing when they are not doing what they say.
The Logocachexia thesis — and the longer arc of the work — lives at Logos.