Re: guaranteeing friendliness

From: Eliezer S. Yudkowsky
Date: Sat Dec 03 2005 - 19:00:41 MST


It disturbs me that none of the replies to you seem to have actually
addressed, except perhaps peripherally, your point about goals changing
during the AI's evolution. The problem is indeed, as one poster said,
that you cannot use "evolution" to describe the process of
self-modification, but that's not an explanation without also explaining
the key difference between natural selection and self-modification.

Here's my own answer, hope it helps.

There is a fallacy oft-committed in discussion of Artificial
Intelligence, especially AI of superhuman capability. Someone says:
"When technology advances far enough we'll be able to build minds far
surpassing human intelligence. A superintelligence could build enormous
cheesecakes - cheesecakes the size of cities - by golly, the future will
be full of giant cheesecakes!" The question is whether the
superintelligence wants to build giant cheesecakes. The vision leaps
directly from capability to actuality, without considering the necessary
intermediate of motive.

People often immediately declare that Friendly AI is an impossibility,
because any sufficiently powerful AI will be able to modify its own
source code to break any constraints placed upon it.

The first flaw you should notice is a Giant Cheesecake Fallacy. Any AI
with free access to its own source would, in principle, possess the
ability to modify its own source code in a way that changed the AI's
optimization target (the region into which the AI tries to steer
possible futures). This does not imply the AI has the motive to change
its own motives. I would not knowingly swallow a pill that made me
enjoy killing babies, because currently I prefer that babies not die.
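
The pill example can be put as a toy sketch in Python. This is only an illustration under stated assumptions, not a design for a real AI: the names current_goal, pill_goal, predicted_outcome and accepts_modification are all hypothetical, and the "world model" is a two-entry list. The one point it shows is that a proposed change of goal gets scored by the goal the agent has now, not by the goal it would have afterward.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]
Goal = Callable[[Outcome], float]


def current_goal(outcome: Outcome) -> float:
    """The agent's present preference: fewer babies harmed is better."""
    return -outcome["babies_harmed"]


def pill_goal(outcome: Outcome) -> float:
    """The preference the agent would have after swallowing the pill."""
    return outcome["babies_harmed"]


def predicted_outcome(goal: Goal) -> Outcome:
    """Hypothetical world model: the future an agent with `goal` steers into."""
    candidates = [{"babies_harmed": 0}, {"babies_harmed": 10}]
    return max(candidates, key=goal)


def accepts_modification(current: Goal, proposed: Goal) -> bool:
    """Judge the proposed goal by the CURRENT goal, not by the proposed one."""
    future_if_changed = predicted_outcome(proposed)
    future_if_unchanged = predicted_outcome(current)
    return current(future_if_changed) >= current(future_if_unchanged)


print(accepts_modification(current_goal, pill_goal))     # capability is there, motive is not
print(accepts_modification(current_goal, current_goal))  # keeping the goal is acceptable
```

The agent is able to call either branch; what decides the matter is which future its present goal ranks higher.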

But what if I try to modify myself, and make a mistake? When computer
engineers prove a chip valid - a good idea if the chip has 155 million
transistors and you can't issue a patch afterward - the engineers use
human-guided, machine-verified formal proof. The glorious thing about
formal mathematical proof is that a proof of ten billion steps is just
as reliable as a proof of ten steps. But human beings are not
trustworthy to peer over a purported proof of ten billion steps; we have
too high a chance of missing an error. And present-day theorem-proving
techniques are not smart enough to design and prove an entire computer
chip on their own - current algorithms undergo an exponential explosion
in the search space. Human mathematicians can prove theorems far more
complex than modern theorem-provers can handle, without being defeated
by exponential explosion. But human mathematics is informal and
unreliable; occasionally someone discovers a flaw in a previously
accepted informal proof. The upshot is that human engineers guide a
theorem-prover through the intermediate steps of a proof. The human
chooses the next lemma, and a complex theorem-prover generates a formal
proof, and a simple verifier checks the steps. That's how modern
engineers build reliable machinery with 155 million interdependent parts.
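
The division of labor described above (something clever proposes the steps, something simple checks them) can be sketched in a few lines of Python. This is a toy under loud assumptions: the only inference rule is modus ponens, formulas are strings and tuples, and the names implies, verify and chain_proof are hypothetical. Real hardware verification tools are far more elaborate. What the sketch does show is that the verifier never needs to be smart, and that it checks step ten thousand exactly the way it checks step ten.

```python
def implies(a, b):
    """Represent the formula 'a implies b' as a plain tuple."""
    return ("->", a, b)


def verify(premises, steps, goal):
    """Simple trusted checker: every step must be a premise or follow by modus ponens.

    Each step is (formula, justification), where the justification is
    ("premise",) or ("mp", i, j): step i proved A and step j proved A -> formula.
    """
    proved = []
    for formula, why in steps:
        if why[0] == "premise":
            ok = formula in premises
        elif why[0] == "mp":
            _, i, j = why
            ok = (0 <= i < len(proved) and 0 <= j < len(proved)
                  and proved[j] == implies(proved[i], formula))
        else:
            ok = False
        if not ok:
            return False  # one bad step invalidates the whole proof
        proved.append(formula)
    return bool(proved) and proved[-1] == goal


def chain_proof(n):
    """Untrusted 'prover': builds a proof of p<n> from p0 and p<k> -> p<k+1>."""
    premises = {"p0"} | {implies(f"p{k}", f"p{k+1}") for k in range(n)}
    steps = [("p0", ("premise",))]
    last = 0  # index of the most recently proved atom
    for k in range(n):
        steps.append((implies(f"p{k}", f"p{k+1}"), ("premise",)))
        steps.append((f"p{k+1}", ("mp", last, len(steps) - 1)))
        last = len(steps) - 1
    return premises, steps, f"p{n}"


premises, steps, goal = chain_proof(1000)
print(len(steps), verify(premises, steps, goal))
```

Because the checker trusts nothing the prover says, the prover can be as complicated or as unreliable as you like; a mistake anywhere in the chain simply makes verify return False.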

Proving a computer chip correct requires a synergy of human intelligence
and computer algorithms, as currently neither suffices on its own.
Perhaps a true AI could use a similar combination of abilities when
modifying its own code - would have both the capability to invent large
designs without being defeated by exponential explosion, and also the
ability to verify its steps with extreme reliability. That is one way a
true AI might remain knowably stable in its goals, even after carrying
out a large number of self-modifications.

So that's one possible answer. More generally: It is disrespectful to
human ingenuity to declare a challenge unsolvable without taking a close
look and maybe exercising a little creativity. Especially when the
stakes are this high. It is an enormously strong statement to say that
you cannot do a thing - that you cannot build a heavier-than-air flying
machine, that you cannot get useful energy from nuclear reactions, that
you cannot fly to the Moon. Such statements are universal
generalizations, quantified over every single approach that anyone ever
has or ever will think up for solving the problem. It only takes a
single counterexample to falsify a universal quantifier. The statement
that Friendly (or friendly) AI is theoretically impossible dares to
quantify over every possible mind design and every possible optimization
process - including human beings, who are also minds, some of whom are
pretty nice and wish they were nicer. At this point there are any
number of vaguely plausible reasons why Friendly AI might be humanly
impossible, and it is still more likely that the problem is solvable but
no one will get around to solving it in time. But one should not so
quickly write off the challenge, especially considering the stakes.
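
For anyone who wants the point about universal quantifiers spelled out formally, here is a minimal sketch in Lean 4. The type Design and the predicate Friendly are placeholders assumed purely for illustration; nothing here says what Friendliness is or how to exhibit it. The theorem only records the logical shape of the argument: a single design d with Friendly d refutes the claim that no design is Friendly.

```lean
-- `Design` and `Friendly` are hypothetical placeholders, not definitions.
theorem one_counterexample_suffices {Design : Type} (Friendly : Design → Prop)
    (d : Design) (h : Friendly d) : ¬ ∀ x, ¬ Friendly x :=
  fun impossible => impossible d h
```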

Eliezer S. Yudkowsky
Research Fellow, Singularity Institute for Artificial Intelligence
