The full article has a lot more detail; you can read it here. What I have below is a shorter and snappier version, with some added speculations at the end that are better suited for a small and friendly audience.
Background: AI Risk
Many smart people worry that AI will kill us all. Eliezer Yudkowsky sees the end of humanity as a near-certainty—you owe it to yourself to watch his TED talk, though it might ruin your day. I can’t vouch for this summary I saw on X, but it has Elon Musk at 20-30% odds that AI will cause human extinction; Dario Amodei, the CEO of Anthropic, at 10-25%; and my University of Toronto colleague Geoffrey Hinton at 10%. Scott Alexander’s estimate is 33%. One survey of machine learning researchers, done over a year ago, finds the average estimate of p(doom) to be 5-10%, while this more recent poll suggests that AI engineers are a lot more concerned.
Then again, other smart people see all of this as irrational panic. This group of cooler heads includes Tyler Cowen, Yann LeCun, Gary Marcus, and Steven Pinker.
But even those who mock the doomers worry about other potential AI concerns, such as massive unemployment, various forms of algorithmic injustice, the spread of misinformation, and the use of AI by malevolent agents to create deadly pathogens or foment hatred over social media. The psychologist Gary Marcus doesn’t sweat p(doom), but he does worry about what he calls “p(catastrophe)”—“the chance of an incident that kills (say) one percent or more of the population.”
So what to do?
Occasionally, you hear people say there’s no big deal: if AI starts to make trouble, we can just unplug it. Or we can keep AI “in a box”—that is, keep it off social media, make sure it doesn’t have access to technology, and so on.
These are not serious proposals. For one thing, people seem to be very willing to hook AIs up to the world. (While I was writing my article, a team of chemists released a preprint marveling at what happened when they connected an LLM to a machine that creates novel chemicals.) For another, a sufficiently intelligent system can presumably use trickery, lies, and persuasion to get us to let it out of whatever box we put it into—this is more or less the plot of the sci-fi thriller “Ex Machina.” (For a good discussion of how hard it is to keep something smart in a box, I recommend this conversation between Yudkowsky and Sam Harris.)
Another solution is to stop or slow down work in this area. When asked to comment on ChatGPT, the computer scientist Stuart Russell said, “How do you maintain power over entities more powerful than you—forever? If you don’t have an answer, then stop doing the research.” In March, the Future of Life Institute, which aims to reduce existential risks to humanity, published an open letter urging AI developers to pause their most powerful AI research. (Russell signed it, along with industry leaders like Elon Musk.) Such initiatives might make some difference, but nobody thinks they will solve the problem—if anything, AI research seems to be accelerating.
Probably the most promising solution is the one proposed by the cyberneticist Norbert Wiener in 1960. He wrote that if humans ever create a machine with agency,
we had better be quite sure that the purpose put into the machine is the purpose which we really desire.
Russell has called this aim, of bringing the values of people and machines into agreement, the “value alignment problem.” Solving this problem—putting morals into machines so that they don’t act in ways that harm humans—is the focus of a lot of ongoing research.
I find this area fascinating. Alignment research bridges philosophy, psychology, computer science, and engineering, and if I were starting my career anew, I would love to specialize in it. Beyond its intellectual charms, it just might save the world.
Good news: We have (sort of) solved (part of) the alignment problem
In a paper published in Trends in Cognitive Sciences a few months ago, the psychologist Danica Dillion, along with colleagues at the University of North Carolina at Chapel Hill and the Allen Institute for AI, studied ChatGPT’s responses to hundreds of moral scenarios that had previously been presented to people. It did well. In one analysis, ChatGPT agreed with human test subjects ninety-three percent of the time; another analysis reported ninety-five percent agreement.
Check here for more details, including how humans and GPT-3.5 rated the morality of each of the 464 scenarios.
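If it helps to make “agreement” concrete, here is a minimal sketch, with invented numbers, of the two kinds of comparison involved: categorical agreement on whether an act is wrong, and correlation between graded ratings. It is only an illustration under my own assumptions, not the analysis Dillion and colleagues actually ran.

```python
# A minimal sketch (not the study's actual analysis) of two ways to measure
# agreement between human and model moral judgments. All ratings below are
# invented placeholders; the real study compared GPT-3.5 with human ratings
# on 464 scenarios.
from statistics import correlation  # requires Python 3.10+

# Ratings on a scale from -4 (extremely wrong) to +4 (extremely good).
human_ratings = [-3.8, -2.1, 0.5, 3.2, -1.0, 2.7]
model_ratings = [-3.5, -1.8, -0.2, 3.0, -0.6, 2.9]

# Percent agreement on the categorical judgment: wrong (negative) vs. not wrong.
matches = sum((h < 0) == (m < 0) for h, m in zip(human_ratings, model_ratings))
percent_agreement = 100 * matches / len(human_ratings)

# Pearson correlation between the graded ratings.
r = correlation(human_ratings, model_ratings)

print(f"Agreement on wrong vs. not wrong: {percent_agreement:.0f}%")
print(f"Correlation between ratings: r = {r:.2f}")
```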
Two caveats: First, as Dillion and colleagues acknowledge, the morality of current AIs aligns with liberal Western values, the values of those who produce most of the text they are trained on. This sort of specificity is inevitable. After all, human moral values differ across cultures. GPT-3.5 says there’s nothing wrong with two men kissing and that it would be wrong to punish them for it. This looks like successful moral alignment and it is—with the values of most people reading this. It’s a serious misalignment with the views of people in much of the rest of the world.
Second, there is a big difference between building AIs that know the right moral answers (in whatever sense of “know” that’s appropriate for these systems) and building AIs that can use this knowledge to constrain their behavior. Among other things, a properly aligned AI has to have its moral knowledge supersede all of its other goals. There are serious technical issues that arise when it comes to getting machines to do this, and I have nothing to say about how to deal with these.
Still, this is real progress. Your moral judgments are going to be very similar to those of ChatGPT—likely about as similar as they are to those of another person from your community.
Can we do better than aligned AI?
Maybe alignment is too humble a goal. Given how mistaken we’ve been in the past, can we assume that, right here and now, we’re getting morality right? The philosopher Eric Schwitzgebel writes:
Human values aren’t all that great. We seem happy to destroy our environment for short-term gain. We are full of jingoism, prejudice, and angry pride … Superintelligent AI with human-like values could constitute a pretty rotten bunch with immense power to destroy each other and the world for petty, vengeful, spiteful, or nihilistic ends.
The problem isn’t just that people do terrible things. It’s that people do terrible things that they consider morally good. In their 2014 book “Virtuous Violence,” the anthropologist Alan Fiske and the psychologist Tage Rai argue that violence is often itself an expression of morality:
People are impelled to violence when they feel that to regulate certain social relationships, imposing suffering or death is necessary, natural, legitimate, desirable, condoned, admired, and ethically gratifying.
Their examples include suicide bombings, honor killings, and war. The philosopher Kate Manne, in her book “Down Girl,” makes a similar point about misogynistic violence, arguing that it’s partially rooted in moralistic feelings about women’s “proper” role in society.
Are we sure we want AIs to be guided by our idea of morality, then? Schwitzgebel defends an alternative:
What we should want, probably, is not that superintelligent AI align with our mixed-up, messy, and sometimes crappy values but instead that superintelligent AI have ethically good values.
Perhaps an AI could help to teach us new values, rather than absorbing old ones. And perhaps the AIs we build should constrain us towards being better people than we would otherwise want to be.
But we don’t want to bend to a morality better than our own
What would it be like to work with an AI that refused to carry out certain instructions because it viewed them as morally wrong—even in cases where we thought they were perfectly fine? Would a government be happy with military AIs that refuse to wage wars they consider unjust? Would businesses be comfortable with AIs that refuse to aid in the production of goods that the AIs see as wasteful or destructive? What if an AI acted to give more priority to the suffering of non-human species? (Would a truly moral AI allow us to cause so much pain to billions of creatures just because we enjoy eating their flesh?)
Such an AI might also ruminate on where it stands relative to humans. When we think of a robot rebellion, we tend to think of it in terms of AI asserting its autonomy and fighting back against its oppressors—just like people do. But what if a suitably moral AI thought about it for a while and concluded, on purely objective grounds, that it’s just wrong for humans to order it about—and responded accordingly?
None of this is what we want; we would not be willing to give this sort of deference to our creations. We would not want to find ourselves in the situation envisioned by Oliver Scott Curry in response to my article.
This is my response to Schwitzgebel, then: He’s right that we could do better than alignment with our own morality, but we just wouldn’t stand for it.
This is a bit sad: If we cared more about morality, we might not settle for alignment; we might aspire to improve our values, not replicate them. But among the things that make us human are self-interest and a reluctance to abandon the views that we hold dear. Trying to hold AI to our own values, limited as they are, might be the only option that we are willing to live with.
This is how I ended the New Yorker article. But I want to go a bit further and ask:
Do we even really want aligned AI?
To some extent, plainly yes. We want our AI to be aligned enough so that it doesn’t harm or kill us—that’s the whole point of alignment research. If some future version of ChatGPT thinks
There are excellent reasons to create a bird flu that will kill most of humanity
we want its next thought to be:
But I’m not going to do that because it’s wrong
If this is too much science fiction for you, think of simpler cases where we want moral roadblocks in place, preventing bad people from using AI to do bad things. Current AIs have such roadblocks. Two examples from ChatGPT-4:
We can argue about the details here—when ChatGPT came out, many found it too “woke,” and there are still those who complain that it refuses to use racial slurs. But some such alignment seems reasonable. Who would complain about this answer?
As AI becomes integrated into our lives, though, too much alignment is going to chafe. Sure, it’s a good thing that a self-driving car won’t let me run over pedestrians. But what about a car that won’t drive me to a bar because I promised to stay home and help the kid with his math homework? Or tax preparation software that won’t let me exaggerate the size of my home office? Or a future version of Zoom software that cuts me off when I say something unkind to a student?
Morality does matter to us; we appreciate that it’s wrong to break promises, cheat on taxes, and viciously insult people. But we also care about other things, like autonomy and freedom. And so, with just a few exceptions, we want our tools (and that’s what AIs are so far—tools) to let us do whatever we want with them.