Eliezer: unconvinced by your objection to safe boxing of "Minerva AI"

From: Daniel Radetsky (daniel@radray.us)
Date: Mon Mar 07 2005 - 19:14:12 MST

This is not intended solely for Eliezer, but is in response to his comment
about AI boxing, from this thread in DeadHorse:


I don't know about the whole "honeypot" idea, but the general idea about an AI
with limited knowledge about humans seems like a good way to go about testing
it. Eliezer wrote:

> Even if you can magically build a Minerva AI, the AI can still deduce a
> heck of a lot about the builders by examining its own source code.

I'm not sure I believe this. I think that if an AI knows nothing about humans,
then its source code would tell it nothing, but I'll get into this later.
What's important is that I don't think that an AI could infer certain important
things about humans without a lot of the right kind of data, in particular, the
concept of "foolability" which the UFAI would need to make the jailer into a
slave. The argument is as follows:

1. An XAI (friendly or unfriendly) can make a human into its slave only if it
understands that the human can be made into a slave.

2. If an AI is a certain type of Minerva AI (MAI), then it doesn't understand
that humans can be made into slaves.

3. Therefore, if an AI is a certain MAI, then it cannot make a human into its
slave.

In support of (1): Imagine that you have a perfect photographic memory, mastery
of reasoning, and understanding of bayescraft. There is a certain sense in
which it may be said that you cannot be fooled. You might realize that in a
given situation, you had no idea what was going to happen, but this is not the
same as something *unexpected* happening. Suppose further that your entire
family had similar mental prowess, and that you grew up completely isolated
from outsiders (perhaps you were part of a secret experiment in intelligence
enhancement), who you know exist, but know nothing about. You of course never
try to fool a member of your family, because it is impossible. You can only rig
a situation so that they don't know what is going to happen. You can't convince
them of a false statement. Now suppose you leave your family, and encounter a
normal person, and it happens that you need something from them. Of course, you
are quite capable of fooling them into doing whatever it is that you need, but
you would never think of doing this, because fooling people has never seemed
like a possibility to you. You'd probably ask them, or appeal to their rational
interests, such as offering them a trade, possibly with safeguards ensuring you
both comply (because you're already thinking that the other won't have any
evidence that you will come through on your end of the bargain).

We all understand the concept of fooling somebody, because we've all been
fooled. When I argue with someone much dumber than myself, and they don't
understand my arguments, I can understand what's going on, because I've been in
that position myself. If I hadn't, it would baffle me. Paul Graham made an
interesting observation on this point. He said that the greatest computer
scientists always think they are just barely competent, and wonder why everyone
around them is so mind-bogglingly stupid.

I think this makes sense. One possible counter-argument is that it would be
impossible to develop rationality without being fooled at least once, but I
don't know enough about developmental psychology to answer this. Even then,
though, the superman might still assume that the fellow he was trying to get a
favor from had already gone through his developmental fooling, and now too was
perfectly rational.

In support of (2): (This one is harder, and probably where this argument will
fail if it does.)

The way to do this one is, make an AI which knows nothing about humans. It
can't reason to facts about humans without knowing that they exist or having
any contact with them any better than we can reason to facts about aliens. If
we had some alien artifact, like a stray satellite, we might make some
conjectures about it, but that's because asking questions like "how did this
come to be" is a deep, intrinsic part of how humans work. And how we actually
make the decisions is often by using a huge field of facts about how things
work, and by anthropomorphizing. The second is not an option for our MAI, and
the first can be denied.

Eliezer claims that an AI could infer from its source code facts about humans,
but I tentatively disagree. I'd be interested to hear an example of how it
might make such an inference without reference to a fact it could be made to do
without.

It seems strongly anthropomorphic to suggest that an AI would (or would have
to) ask "who wrote me and why?" All it has to know is, "My code is at point A,
I need to get it to point B; what's the quickest way?" So, if Eliezer meant by

> If you dump the concepts as well, the AI will probably just die, assuming it
> hasn't already.

that "Without this 'background' of facts, the AI will do nothing useful," then
I disagree, sort of. I'm willing to suggest that the moral framework of a MAI
can be debugged by posing problems to it in abstract terms involving
"locations" and maximization of location values (see e.g.
http://www.nickbostrom.com/ethics/infinite.pdf). If the AI actually maximizes the location
values, you say "Here is a human, and humans are the locations, and here are
some good ways to maximize the location values." If the AI ruins the locations,
you throw it away.

In other words, Eliezer might be right, but his position could stand an
argument, and I can't find one in the above-mentioned thread.

