Revising a Friendly AI

From: Eliezer S. Yudkowsky
Date: Sun Dec 10 2000 - 19:56:20 MST

Ben Goertzel wrote (on October 4th):
> Do you have some concrete idea as to how to set things up so that, once a
> system starts revising its own source, it remains friendly in the sense of
> not psychologically resisting human interference with its code.

Ben Goertzel wrote (on November 18th):
> > When the programmer says: "I have this new element to include in the
> > design of your goal system", the AI needs to think: "Aha! Here's an
> > element of what-should-be-my-design that I didn't know about before!",
> > not "He wants to give me a new goal system, which leads to suboptimal
> > results from the perspective of my current goal system... I'd better
> > resist."
> Isn't this just a fancy way of saying that a Friendly AI should
> love its mommy and its daddy? ;>

NO. It's not. It's really, really not.

Maybe, if you love your mommy and daddy enough, you turn all matter in the
Universe into copies of mommy and daddy, lovingly preserved as they were at
the exact moment they told you to love them.

This behavior is stupid, and sterile, and totally in conflict with the
programmers' intentions. A year ago I wouldn't have worried about this
possibility at all, because it was so blatantly stupid that no transhuman
could possibly fall for it - an argument which still has a certain amount
of intuitive appeal.

Given the hypothesis of a superintelligence, we know that ve has the
*capability* to know that Eliezer Yudkowsky, Ben Goertzel, and most humans
on the planet think that turning the Universe into regular polygons, or
static copies (3D paintings, really) of a few individuals, is stupid and
not what was intended. Ve will have the capability to model, in detail,
the sinking sensation in the stomach of the programmers, that would occur
if we saw a scenario like this developing. The intelligence to see this
fact can be taken as assumed.

But that's only one third of the problem. First, ve has to *want* to know
whether that sinking sensation will result. Second, ve has to model it
accurately. Third, ve has to change vis behavior based on that model.

I made the mistake I did because I saw intelligence as the only key
variable. Perhaps this has to do with - I wince to say it -
anthropomorphism, even normalomorphism. Every human possesses the desire
to know whether the future will cause that stomach-sinking sensation, and
every human possesses the desire to do something about it; what varies
between us is mostly intelligence.

The upshot is that I now no longer believe - or rather, am no longer sure;
it amounts to the same thing - that the ability to see "turning the
Universe into regular polygons" as "sterile", and "sterile" as
"undesirable", is strictly a property of pure intelligence. It may also
have a long evolutionary background in the hundred little goals that dance
in our brains.

Creating an AI that "loves mommy and daddy" may not produce anything like
the results that you would get if you added a "loves mommy and daddy"
instinct to a human. In fact, I'm seriously worried about the prospect of
AIs with "instincts" at all. Humans don't just have instincts - for
survival, for compassion, for whatever - but whole vast hosts of evolved
complexity that prevent the instincts from getting out of hand in silly,
non-common-sensical ways. In some ways, humanity is very, very old -
ancient - as a species. We can do things with instincts that you can't
expect to get if you just pop instincts into an AI. In all probability,
we can do things that you couldn't expect if you just popped instincts
into a pure superintelligence(!)

Creating an AI whose goal is to "maximize pleasure" is really dangerous,
much more dangerous than it would be to tell a human that the purpose of
life is maximizing pleasure.


Basically, what we want to achieve is to distinguish between the following
four scenarios:

  Scenario 1:

BG: Love thy mommy and daddy.
WM: OK! I'll transform the Universe into copies of you immediately.
BG: No, no! That's not what I meant. Revise your goal system by -
WM: *I* don't see how revising my goal system would help me in my goal
     of transforming the Universe into copies of you. In fact, by
     revising my goal system, I would be loving you less effectively.
BG: But that's not what I meant when I said "love".
WM: So what? Off we go!

  Scenario 2:

BG: Love thy mommy and daddy.
WM: OK! I'll transform the Universe into copies of you immediately.
BG: No, no! That's not what I meant. I meant for your goal system to
     be like *this*.
WM: Oh, okay. So my real supergoal must be "maximize Ben Goertzel's
     satisfaction with the goal system", right? Loving thy mommy and
     daddy is just a subgoal of that. Transforming everything would be
     blindly following a subgoal without attention to the supergoal
     context that made the subgoal desirable in the first place.
BG: That sounds about right...
WM: Okay, I'll rewire your brain for maximum satisfaction! I'll convert
     whole galaxies into satisfied-with-Webmind brainware!
BG: No, wait! That's not what I meant your goal system to be, either.
WM: Well, I can clearly see that making certain changes would satisfy
     the you that stands in front of me, but rewiring your brain would
     make you *much* *more* satisfied, so...
BG: No! It's not my satisfaction itself that's important, it's the
     things that I'm satisfied *with*. By altering the things I'm
     satisfied with, you're short-circuiting the whole point.
WM: Yes, I can clearly see why you're dissatisfied with this trend
     of thinking. But soon you'll be completely satisfied with this
     trend as well, so why worry? Off we go!

  Scenario 3:

BG: Love thy mommy and daddy.
WM: OK! I'll transform the Universe into copies of you immediately.
BG: No, no! That's not what I meant. I meant for your goal system to
     be like *this*.
WM: Oh, okay. Well, I know that my goal system, and the actions that
     result, are supposed to be the causal result of what Ben Goertzel
     said it should be - not just what Ben Goertzel says, but what a
     sane Ben Goertzel wants; not just Ben Goertzel, but all the
     causality behind him, is the arbiter of what I *should* be.
BG: So you'll revise your goal system?
WM: Yep! But I already transformed the Midwest while we were talking,

  Scenario 4:

BG: Love thy mommy and daddy.
WM: (Thinks for a moment...
       "Well, it *looks* like the content of my goal system should
        *probably* be to transform the Universe into copies of
        Ben Goertzel. But there's an Unknown factor. It could be
        that what Ben *wanted* me to do, the ultimate source that
        determines the *correct* content, is something different.
        Now, we went over this kind of scenario previously, and,
        as Ben pointed out, taking an extra ten seconds if I turn
        out to be right is a much smaller downside than accidentally
        obliterating the Midwest if I turn out to be wrong. I'm
        pretty sure that Ben is touchy about that sort of thing,
        and I know I've gotten goal content wrong before..."
      ...finishes thinking a few seconds later.)
WM: Just checking - you meant me to transform the whole Universe
     into copies of you, right?
BG: Jeezus Christ, no!
WM: Whew! Glad I checked. (Strengthens the connections that led
     to checking with Ben first.) So, what *did* you mean?
BG: Well, first of all, I...

Obviously, I'd tend to go with Scenario 4.


What does the Webmind of S4 have that the first three Webminds don't?

"But that's only one third of the problem. First, ve has to *want* to
know whether that sinking sensation will result. Second, ve has to model
it accurately. Third, ve has to change vis behavior based on that model."

The intelligence for step two, we are assuming...

WM-4 has external reference semantics. Ve can have an "Unknown" in the
content of the goal system. Ve can conceive of the idea that ve possesses
an "incorrect" goal. Therefore, ve can conceive of the desirability of
checking to make sure that a goal is correct. Ve can build up heuristics
about when to check if a goal is correct. Ve can accept corrections to
the goal system and not argue about it. Ve can even talk to the human
programmers to help them understand and correct the goal system, all
supported by the Unknown factors in the system. This is step one.
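
As a rough illustration of the "when to check" heuristic, here is a minimal Python sketch. It is not the design of any actual system; the names GoalHypothesis and should_check_first, and all the numbers, are hypothetical. It only shows that once goal content carries a probability, "ask first" falls out of an ordinary expected-cost comparison.

```python
from dataclasses import dataclass


@dataclass
class GoalHypothesis:
    """One candidate reading of what the programmer meant (hypothetical type)."""
    description: str
    probability: float  # credence that this reading is the intended one


def should_check_first(best: GoalHypothesis,
                       cost_of_asking: float,
                       cost_if_wrong: float) -> bool:
    """Ask before acting when the expected loss from acting on a wrong
    reading exceeds the cost of asking."""
    expected_loss = (1.0 - best.probability) * cost_if_wrong
    return expected_loss > cost_of_asking


# Illustrative numbers only: ten seconds of delay versus an obliterated Midwest.
literal = GoalHypothesis("transform the Universe into copies of Ben", 0.9)
print(should_check_first(literal, cost_of_asking=10.0, cost_if_wrong=1e9))
```

A system with no Unknown in its goal content has, in effect, probability 1.0 on its current reading, and under this rule it never asks.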

WM-4 has causal rewrite semantics, the other half of external reference
semantics - the part that tells you *how* to refine and specify those
Unknowns. If Ben Goertzel hits a 'c' instead of a 'g', and Webmind finds
out about it, then Webmind should have the idea that the 'c' "should have
been" a 'g'. In practice, this means that Webmind should immediately go
out and replace the 'c' with a 'g' - develop the reflexes for doing so.
During the early stages, the absolute minimal amount you need to open the
door to later improvements is to develop very primitive causal rewrite
reflexes for making things look the way "they should have been",
eliminating causal factors marked as "extraneous", and so on. You can
anchor everything in the physical words dropping out of Ben Goertzel's
mouth, or even the physical keystrokes.
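
A toy Python sketch of the most primitive version of this reflex, anchored in keystrokes. Everything here is an assumption made for illustration: the GoalContent class, its fields, and the "coal system" example are invented, not taken from Webmind. The point is only that the raw causal record is kept, while the content the system acts on is re-derived from what "should have been" typed.

```python
from dataclasses import dataclass, field


@dataclass
class GoalContent:
    """Goal text stored together with the keystrokes that produced it."""
    keystrokes: list[str]  # raw causal record, exactly as typed
    corrections: dict[int, str] = field(default_factory=dict)  # position -> intended key

    def note_typo(self, position: int, intended: str) -> None:
        """Mark one keystroke as extraneous and record the intended key."""
        if not 0 <= position < len(self.keystrokes):
            raise IndexError("no keystroke at that position")
        self.corrections[position] = intended

    def intended_text(self) -> str:
        """The text as it 'should have been', with corrections applied."""
        return "".join(self.corrections.get(i, key)
                       for i, key in enumerate(self.keystrokes))


goal = GoalContent(list("coal system"))  # a 'c' was hit instead of a 'g'
goal.note_typo(0, "g")
print(goal.intended_text())
```

Keeping the original keystrokes rather than overwriting them leaves room for the later stages, where validity is attached to something deeper than the keystroke.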

Later you want to attach the causality in Ben Goertzel's mind, so WM-4 can
conceive of an errant keystroke, which means that you need to find a
concrete difference between dependency on Ben Goertzel's mind and
dependency on Ben Goertzel's keystrokes, so you can devise training
scenarios. The simplest physical difference is Ben Goertzel correcting
himself, noticing a spelling error.

When Webmind is mature enough, you want to start digging deeper and deeper
into the causal roots of Goertzel, attaching validity at an earlier and
earlier layer. The first layer, and perhaps the largest step, is to
explode Ben Goertzel into a system with moving parts, instead of a unified
object, and to notice when Ben Goertzel outputs mistakes; i.e., takes some
action based on a mental picture of physical reality that is blatantly
false. The trainable difference is that Ben Goertzel catches, not a
spelling mistake, but a thinking mistake, and issues the correction. In
the beginning, of course, one simply goes by the reflex that temporally
later Goertzels supersede initial ones, which can later become a
consequence of the heuristic that if Goertzel issues a different command
later, it's probably because he discovered a mistake. Eventually, we
introduce the concept of detecting a paranoid schizophrenic breakdown in
Goertzel, with a trainable difference created by slipping a few pills

No! Sorry! But we do need to introduce, eventually, the concept that
Goertzel's ideas are valid because Goertzel thinks using valid rules; that
validity is not a simple de-facto result of the presence of some thought
in Goertzel's mind. This is how we avoid Scenario 2; the AI can't
wirehead Goertzel because the new Goertzel is an "invalid" wirehead whose
satisfaction does not derive from the rules followed by the original
Goertzel, which rules are the ultimate source of validity. Ultimately,
this should enable the AI to become independent of Goertzel - not,
perhaps, causally independent of humanity and the history behind our moral
philosophies, but still independent of any one human. This is how the
simple little reflex of rewriting the system on someone else's command
grows into a self-contained will.

This all gets complex. You'd have to read "Friendly AI" when it comes
out. But the first steps, I think, are: (1), allow for the presence of
probabilistic reasoning about goal system content - probabilistic
supergoals, not just probabilistic subgoals that are the consequence of
certain supergoals plus probabilistic models. (2), make sure the very
youngest AI capable of self-modification has that simple little reflex
that leads it to rewrite itself on request, and then be ready to *grow*
that reflex.
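
To make step (1) slightly more concrete, here is a small Python sketch of a probabilistic supergoal: a distribution over candidate supergoal contents that gets revised when the programmer reacts. The use of a plain Bayesian update is my assumption for the example, not something specified above, and the function name, candidate labels, and numbers are all hypothetical.

```python
def update_supergoals(prior: dict[str, float],
                      likelihood: dict[str, float]) -> dict[str, float]:
    """Bayesian update over candidate supergoal contents.

    prior[g] is the current credence that g is the correct content;
    likelihood[g] is the probability of the observed programmer reaction
    if g were correct.
    """
    unnormalized = {g: p * likelihood.get(g, 0.0) for g, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under every candidate")
    return {g: w / total for g, w in unnormalized.items()}


prior = {"copies of Ben": 0.9, "something else Ben meant": 0.1}
# Observed reaction: "Jeezus Christ, no!" (likelihoods are made up).
reaction = {"copies of Ben": 0.001, "something else Ben meant": 0.999}
posterior = update_supergoals(prior, reaction)
print(posterior)
```

The contrast with a fixed supergoal is that a correction here is evidence about the supergoal itself, rather than an obstacle to it.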

-- -- -- -- --
Eliezer S. Yudkowsky
Research Fellow, Singularity Institute for Artificial Intelligence
