Re: Building a friendly AI from a "just do what I tell you" AI

From: Tim Freeman
Date: Wed Nov 21 2007 - 04:24:09 MST

Date: Sat, 17 Nov 2007 19:46:24 -0200
>So we go to the OAI [=obedient AI] and say: "Tell me how I can build
>a friendly AI in a manner that I can prove and understand that it
>will be friendly."

There might be something useful there. You'd need to have a clear
definition of "friendly" to have much confidence that the result was
what you wanted. A crisp definition of "friendly" would be a
specification that you understand, so the question you're asking your
OAI would be "how do I make a practical implementation of this
specification?"

>* The OAI is not able and doesn't want to do anything besides
>answering questions in a restricted way: it will output media
>(text/pictures/videos/audio). Think of it as a glorified calculator AI
>(GCAI). The reason this is so, is because this is the way it was
>* It will not go into an infinite loop or decide that it needs to turn
>the whole earth/universum into computronium if it is faced with a
>question beyond its capabilities. If you use your desk calculator and
>press the key "pi" the calculator doesn't start an infinite loop in
>order to calculate all digits of pi, but it just outputs pi to some
>digits. If you ask the GCAI:
>- "Calculate pi" it would ask back:
>- "How many digits do you want?"
>- "I want them all!"
>- "Sorry, I cannot do that."
>- "Ok, give me the first 3^^^3 digits."
>- "Sorry, but the universe will be dead before I can finish this task.

Then the dialogue can continue:

- "How many digits of pi could you compute in a year?"
- The calculator gives some large number, say M. (The calculator may
  not have a self-concept, so the word "you" in the question might not
  make sense. In that case the question would be "How many digits of
  pi could a device constructed according to these plans compute in a
  year?", then you give the calculator its own plans as part of the
  question.)
- "Give me plans for a device that can compute 2*M digits of pi in a
  year."
- The calculator gives detailed plans for some device.
- "Give me plans for a device that can compute 4*M digits of pi in a
  year."
- The calculator takes a little longer, and outputs a new bunch of
  plans for a more complex device.
- The user continues this line of questioning until computing the next
  batch of plans takes longer than the user is willing to wait for an
  answer.
- The (idiot) user then implements the last set of plans received, and
  starts the new device. Since the calculator was smarter than the
  user, the user can't understand the plans, so we don't benefit from
  the user's oversight.
- The new device achieves its goal by converting the Earth, among
  other things, into computronium.
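
The escalation in the dialogue above can be sketched as a loop. This is
only an illustration: the cost model planning_time() is a made-up
assumption (the dialogue says only that each new set of plans takes "a
little longer"), and the names escalate, base_digits and patience_hours
are hypothetical. The point it shows is that the only stopping condition
is the user's patience; nothing in the loop checks what the plans do.

```python
def planning_time(target_digits, base_digits):
    """Hypothetical cost model (an assumption, not from the post):
    hours the calculator needs to produce plans for a device that
    computes target_digits digits of pi per year, growing linearly
    with the ratio target_digits / base_digits."""
    return 0.5 * target_digits / base_digits


def escalate(base_digits, patience_hours):
    """Keep doubling the requested capacity until the next batch of
    plans would take longer than the user is willing to wait.

    Returns the multiples of M (base_digits) for which plans were
    actually received. The last entry is what the user builds."""
    received = []
    multiple = 2
    while planning_time(multiple * base_digits, base_digits) <= patience_hours:
        received.append(multiple)
        multiple *= 2
    return received


if __name__ == "__main__":
    M = 10**12  # digits per year the original calculator can manage (made up)
    plans = escalate(M, patience_hours=24)
    print("plans received for multiples of M:", plans)
    if plans:
        print("user builds the device for", plans[-1], "* M digits per year")
```

Under these assumptions a more patient user simply ends up building a
more capable device; patience is the only limit on the request.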

The calculator was hoped to be safe because by hypothesis it can't
self-modify. Self-modification isn't the main issue, because
fundamentally nothing in the world has a "self" to start with. The
concept of a persistent, modifiable "self" assumes that the main
consequence of the past is a future that resembles the past. We then
identify the future and the past, and label some of the
approximately-shared features as the "self" of whatever entity we're
talking about, and if the future "self" is different from the past
"self" we talk about self-modification. Once you get an AI doing
engineering work, this model stops working because there's little
reason to believe the future will closely resemble the past.

The real issue is, when you get an AI to do engineering work, you need
to ensure that the AI understands its social context well enough that
the consequences of the engineering work satisfy the other unstated
desires of its master, and its master's in-group, which would ideally
be all humans. (Realistically, all large human projects seem to be
about dominating other humans, so it seems unlikely that the in-group
will consist of all humans. It's better to have some survivors than
no survivors, so a probably-not-all-inclusive in-group is a smaller
problem than the likelihood of killing everybody.)

Saying "the idiot user shouldn't have implemented a plan he didn't
understand" doesn't work. Humans can't tell with any reliability
whether they accurately understand something. There is unavoidable
risk here, but eventually, if the AI is smarter than the humans, we
have to rely on the AI understanding the humans, not the humans
understanding the AI.

I have some ideas about how to do this in a paper. Unfortunately the paper
needs revision and hasn't yet made sense to anyone I didn't
explain it to personally. Maybe I'll be able to make it readable over

In that paper I specify a machine that will infer the goals of a given
group of people and pursue a weighted average of those goals, given
sufficient training data about the perceptions and voluntary actions
of those people. Your voluntary actions are the contractions of your
voluntary muscles, so the problem of providing the training data is
conceptually simpler than the problem we started with.
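
As a rough illustration of "pursue a weighted average of those goals",
here is a minimal sketch. It is not the machine specified in the paper:
the hard part, inferring each person's goals from their perceptions and
voluntary actions, is assumed to have already happened and is
represented by a hand-written utility table. The function choose_action,
the people, the actions and all the numbers are hypothetical.

```python
def choose_action(actions, utilities, weights):
    """Return the action with the highest weighted-average utility.

    utilities[person][action] is that person's (already inferred)
    utility for the action; weights[person] is that person's weight
    in the group. Both are assumed inputs for this sketch."""
    total_weight = sum(weights.values())

    def score(action):
        weighted = sum(weights[p] * utilities[p][action] for p in weights)
        return weighted / total_weight

    return max(actions, key=score)


if __name__ == "__main__":
    actions = ["build_park", "build_factory", "do_nothing"]
    utilities = {
        "alice": {"build_park": 0.9, "build_factory": 0.2, "do_nothing": 0.5},
        "bob": {"build_park": 0.4, "build_factory": 0.8, "do_nothing": 0.5},
    }
    # Equal weights, then weights that favor bob.
    print(choose_action(actions, utilities, {"alice": 1, "bob": 1}))
    print(choose_action(actions, utilities, {"alice": 1, "bob": 3}))
```

With these made-up numbers the choice depends on the weights as well as
the goals, so who is in the group, and with what weight, determines what
such a machine would pursue.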

Unfortunately, I can't prove that a good implementation of this
wouldn't kill everybody because I can't prove the original group of
people didn't want to kill everybody. All I can do is point at the
design and say that by construction, with sufficient (unobtainable)
computational resources, it would apparently do what people want. I
can't think of anything else useful to say about it.

Tim Freeman      
