Re: Two draft papers: AI and existential risk; heuristics and biases

From: Eliezer S. Yudkowsky
Date: Mon Jun 12 2006 - 18:34:09 MDT

Bill Hibbard wrote:
> Eliezer,
>>I don't think it
>>inappropriate to cite a problem that is general to supervised learning
>>and reinforcement, when your proposal is to, in general, use supervised
>>learning and reinforcement. You can always appeal to a "different
>>algorithm" or a "different implementation" that, in some unspecified
>>way, doesn't have a problem.
> But you are not demonstrating a general problem. You are
> instead relying on specific examples (primitive neural
> networks and systems that cannot distinguish a human from
> a smiley) that fail trivially. You should be clear whether
> you claim that reinforcement learning (RL) must inevitably
> lead to:
> 1. A failure of intelligence.
> or:
> 2. A failure of friendliness.

As it happens, my model of intelligence says that what I would call
"reinforcement learning" is not, in fact, adequate to intelligence.
However, the fact that you believe "reinforcement learning" is adequate
to intelligence, suggests that you would take any possible factor that I
thought was additionally necessary, and claim that it was part of the
framework you regarded as "reinforcement learning".

What I am presently discussing is failure of friendliness. However, the
fact that we use different models of intelligence is also responsible
for our disagreement about this second point. Explaining a model of
intelligence tends to be very difficult, and so, from my perspective,
the most important thing is that you should understand that I have a
legitimate (that is, honestly meant) disagreement with you about what
reinforcement systems do and what happens in practice when you use them.

By the way, I've got some other tasks to take on in the near future, and
I may not be able to discuss the actual technical disagreement at
length. As I said, I will include a footnote pointing to your
disagreement, and to my response.

> Your example of the US Army's primitive neural network
> experiments is a failure of intelligence. Your statement
> about smiley faces assumes a general success at intelligence
> by the system, but an absurd failure of intelligence in the
> part of the system that recognizes humans and their emotions,
> leading to a failure of friendliness.

Let me try to analyze the model of intelligence behind your statement.
You're thinking something along the lines of:

"Supervised algorithms" (sort of like those in the most advanced
artificial neural networks) give rise to "reinforcement learning";

"Reinforcement learning" gives rise to "intelligence";

"Intelligence" is what lets an AI shape the world, and also what tells
it that tiny molecular smiley faces are bad examples of happiness, while
an actual human smiling is a good example of happiness.

In your journal paper from 2004, you seem to propose using a two-layer
system of reinforcement, with the first layer being observed agreement
from humans as a reinforcer of its definition of happiness, and the
second layer being reinforcement of behaviors that lead to "happiness"
as thus defined. So in this case, we substitute: "'Intelligence' is
what tells an AI that tiny molecular speakers chirping 'Yes! Good job!'
are bad examples of agreement with its definition of happiness, while an
actual human saying 'Yes! Good job!' is a good example."

After all, it sure seems stupid to confuse human smiles with tiny
molecular smiley faces! How silly of the army tank classifier, not to
realize that it was supposed to detect tanks, instead of detecting
cloudy days!

But a neural network the size of a planet, given the same examples,
would have failed in the same way.

You previously said:

> When it is feasible to build a super-intelligence, it will
> be feasible to build hard-wired recognition of "human facial
> expressions, human voices and human body language" (to use
> the words of mine that you quote) that exceed the recognition
> accuracy of current humans such as you and me, and will
> certainly not be fooled by "tiny molecular pictures of
> smiley-faces." You should not assume such a poor
> implementation of my idea that it cannot make
> discriminations that are trivial to current humans.

It's trivial to discriminate between a photo of a forest with a
camouflaged tank, and a photo of an empty forest. They're different
pixel maps. If you transform them into strings of 1s and 0s, they're
different strings. Discriminating between them is as simple as testing
them for equality.

But there's an exponentially vast space of functions that classify all
possible pixel-maps of a fixed size into "plus" and "minus" spaces. If
you talk about the space of all possible computations that implement
these classification functions, the space is trivially infinite and
trivially undecidable.
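
A minimal sketch of the counting argument, assuming tiny binary images so
that the numbers stay printable. The 4-pixel image size and the helper
names are illustrative assumptions, not anything taken from the tank
experiment. With n binary pixels there are 2**n possible images and
2**(2**n) ways to split them into "plus" and "minus"; a handful of labeled
examples fixes only a handful of those choices.

```python
from itertools import product

def count_classifiers(num_pixels: int) -> int:
    """Count every plus/minus labeling of all binary images with num_pixels pixels."""
    num_images = 2 ** num_pixels
    return 2 ** num_images

def count_consistent(num_pixels: int, labeled: dict) -> int:
    """Count the labelings that agree with the labeled examples.

    Each distinct unlabeled image can still go either way, so every one of
    them doubles the number of classifiers that fit the training data.
    """
    num_images = 2 ** num_pixels
    return 2 ** (num_images - len(labeled))

images = list(product((0, 1), repeat=4))  # all 16 four-pixel images

# Discrimination uses only the instances themselves: an inequality test.
distinguishable = images[3] != images[5]

# Classification needs more: three labels leave 13 images undetermined.
labeled = {images[0]: False, images[3]: False, images[15]: True}
total = count_classifiers(4)
consistent = count_consistent(4, labeled)
print(distinguishable, total, consistent)
```

Even in this toy setting, three training labels leave thousands of
classifiers that all fit the data perfectly, and the data alone does not
say which of them was intended.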

Of course a super-AI, or an ordinary neural network, can trivially
discriminate between a tiny molecular picture of a smiley face and a
smiling human, or between two pictures of the same smiling human from
slightly different angles. The issue is whether the AI will *classify*
these trivially discriminable stimuli into "plus" and "minus" spaces the
way *you* hope it will.

If you look at the actual pixel-map that shows a camouflaged tank,
there's not a little XML tag in the picture itself that says "Hey,
network, classify this picture as a good example!" The classification
is not a property of the picture alone. Thinking as though the
classification is a property of the picture is an instance of Mind
Projection Fallacy, as mentioned in my AI chapter.

Maybe you actually *wanted* the neural network to discriminate sunny
days from cloudy days. So you fed it exactly the same data instances,
with exactly the same supervision, and used a slightly different
learning algorithm - and found to your dismay that the network was so
stupid, it learned to detect tanks instead of cloudy days. But a really
smart intelligence would not be so stupid that it couldn't tell the
difference between cloudy days and sunny days.
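
A minimal sketch of why the training set cannot settle this, assuming each
photo is reduced to two boolean features. The feature names, the two
rules, and the example photos are hypothetical, invented for illustration;
they are not a reconstruction of the actual tank classifier. In the
training photos "tank" and "cloudy" always coincide, so the same
supervision is equally well explained by either rule.

```python
# Hypothetical training photos: every tank photo happens to be cloudy.
train = [
    {"tank": True, "cloudy": True, "label": True},
    {"tank": True, "cloudy": True, "label": True},
    {"tank": False, "cloudy": False, "label": False},
    {"tank": False, "cloudy": False, "label": False},
]

def rule_tank(photo: dict) -> bool:
    """Classify as plus when a tank is present."""
    return photo["tank"]

def rule_cloudy(photo: dict) -> bool:
    """Classify as plus when the sky is cloudy."""
    return photo["cloudy"]

def accuracy(rule, data: list) -> float:
    """Fraction of photos whose label the rule reproduces."""
    hits = sum(rule(photo) == photo["label"] for photo in data)
    return hits / len(data)

# Both rules are perfect on the training data.
train_scores = (accuracy(rule_tank, train), accuracy(rule_cloudy, train))

# New photos break the coincidence, and the rules part ways.
new_photos = [
    {"tank": True, "cloudy": False},   # tank on a sunny day
    {"tank": False, "cloudy": True},   # empty forest on a cloudy day
]
disagreements = [rule_tank(p) != rule_cloudy(p) for p in new_photos]
print(train_scores, disagreements)
```

Which rule counts as the "stupid" one depends on what the supervisor
wanted, and that information appears nowhere in the photos or the labels.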

There are many possible ways to *classify* different data instances, and
the classification involves information that is not directly present in
the instances. In contrast, finding that two instances are not
identical uses only information present in the data instances
themselves. Saying that a superintelligence could discriminate between
tiny molecular smiley faces and human smiles is, I would say, correct.
But it is not correct to say that any sufficiently intelligent mind will
automatically *classify* the instances the way you want it to.

Let's say that the AI's training data is:

Dataset 1:

Plus: {Smile_1, Smile_2, Smile_3}
Minus: {Dog_1, Cat_1, Dog_2, Dog_3, Cat_2, Dog_4, Boat_1, Car_1, Dog_5,
Cat_3, Boat_2, Dog_6}

Now the AI grows up into a superintelligence, and encounters this data:

Dataset 2: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Smiley_1,
Dog_9, Cat_5, Smiley_2, Smile_4, Boat_3, Galaxy_2, Nanofactory_2,
Smiley_3, Cat_6, Boat_4, Smile_5, Galaxy_3}

It is not a property *of dataset 2* that the classification *you want* is:

Plus: {Smile_4, Smile_5}
Minus: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Smiley_1, Dog_9,
Cat_5, Smiley_2, Boat_3, Galaxy_2, Nanofactory_2, Smiley_3, Cat_6,
Boat_4, Galaxy_3}

Rather than:

Plus: {Smiley_1, Smiley_2, Smile_4, Smiley_3, Smile_5}
Minus: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Dog_9, Cat_5,
Boat_3, Galaxy_2, Nanofactory_2, Cat_6, Boat_4, Galaxy_3}

If you want the top classification rather than the bottom one, you must
infuse into the *prior state* of the AI some *additional information*,
not present in dataset 2. That, of course, is the point of giving the
AI dataset 1. But if you do not understand *how* the AI is classifying
dataset 1, and then the AI enters a drastically different context, there
is the danger that the AI is classifying dataset 1 using a very
different method from the one *you yourself originally used* to classify
dataset 1, and that the AI will, as a result, classify dataset 2 in ways
different from how you yourself would have classified dataset 2. (This
line of reasoning leads to "Coherent Extrapolated Volition", if I go on
to ask what happens if I would have wanted to classify dataset 1 itself
a bit differently if I had more empirical knowledge, or thought faster.)
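
A minimal sketch of the dataset 1 / dataset 2 example, assuming each
instance is just a name like "Smile_4" and that a "classifier" is a rule
on the part of the name before the underscore. Both rules below are
hypothetical stand-ins chosen for illustration; a real learner's internal
rule would not be this legible, which is the problem.

```python
def kind(item: str) -> str:
    """Strip the numeric suffix: 'Smiley_2' -> 'Smiley'."""
    return item.rsplit("_", 1)[0]

def human_smiles_only(item: str) -> bool:
    """The rule the programmer had in mind."""
    return kind(item) == "Smile"

def anything_smile_shaped(item: str) -> bool:
    """A different rule that treats smiley faces as plus too."""
    return kind(item) in {"Smile", "Smiley"}

def fits(rule, plus: list, minus: list) -> bool:
    """True when the rule reproduces the supervised labels exactly."""
    return all(rule(x) for x in plus) and not any(rule(x) for x in minus)

plus_1 = ["Smile_1", "Smile_2", "Smile_3"]
minus_1 = ["Dog_1", "Cat_1", "Dog_2", "Dog_3", "Cat_2", "Dog_4", "Boat_1",
           "Car_1", "Dog_5", "Cat_3", "Boat_2", "Dog_6"]

dataset_2 = ["Dog_7", "Cat_4", "Galaxy_1", "Dog_8", "Nanofactory_1",
             "Smiley_1", "Dog_9", "Cat_5", "Smiley_2", "Smile_4", "Boat_3",
             "Galaxy_2", "Nanofactory_2", "Smiley_3", "Cat_6", "Boat_4",
             "Smile_5", "Galaxy_3"]

# Dataset 1 contains no smiley faces, so it cannot tell the rules apart.
both_fit = (fits(human_smiles_only, plus_1, minus_1)
            and fits(anything_smile_shaped, plus_1, minus_1))

# Dataset 2 is where they diverge.
plus_top = [x for x in dataset_2 if human_smiles_only(x)]
plus_bottom = [x for x in dataset_2 if anything_smile_shaped(x)]
print(both_fit, plus_top, plus_bottom)
```

Both rules score perfectly on dataset 1, and they yield the top and bottom
classifications of dataset 2 respectively. More examples of dogs and boats
would not help; only information that separates the two rules would.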

You cannot throw computing power at this problem. Brute force, or even
brute intelligence, is not the issue here.

> If your claim is that RL can succeed at intelligence but must
> lead to a failure of friendliness, then it is reasonable to
> cite and quote me. But please use my 2004 AAAI paper . . .
>>If you are genuinely repudiating your old ideas ...
> . . . use my 2004 AAAI paper because I do repudiate the
> statement in my 2001 paper that recognition of humans and
> their emotions should be hard-wired (i.e., static). That
> is just the section of my 2001 paper that you quoted.

I will include, in the footnote, a statement that your 2004 paper
proposes a two-layer system. But this is not at all germane to the
point I was making - though the footnote will serve to notify readers
that your ideas have not remained static. Please remember that my
purpose is not to present Bill Hibbard's current ideas, but to use, as
an example of failure, an idea that you published in a peer-reviewed
journal in 2001. If you have taken alarm at the notion of hardwiring
happiness as reinforcement, then you ought to say something like:
"Though it makes me uncomfortable, I can't ethically argue that you
should not publish my old mistake as a warning to others who might
otherwise follow in my footsteps; but you must include a footnote saying
that I now also agree it's a terrible idea."

Most importantly, your 2004 paper simply does not contain any paragraph
that serves the introductory and expository role of the paragraph I
quoted from your 2001 paper. There's nothing I can quote from 2004 that
will make as much sense to the reader. If I were praising your 2001
paper, rather than criticizing it, would you have the same objection?

> Not that I am sure that hard-wired recognition of humans and
> their emotions inevitably leads to a failure of friendliness,

Okay, now it looks like you *haven't* taken alarm at this.

> since the super-intelligence (SI) may understand that humans
> would be happier if they could evolve to other physical forms
> but still be recognized by the SI as humans, and decide to
> modify itself (or build an improved replacement). But if this
> is my scenario, then why not design continuing learning of
> recognition of humans and their emotions into the system in
> the first place. Hence my change of views.

I think at this point you're just putting yourself into the SI's shoes,
empathically, using your own brain to make predictions about what the SI
will do, rather than reasoning about the technical difficulties
associated with infusing certain information into the SI.

> I am sure you have not repudiated everything in CFAI,

I can't think offhand of any particular positive proposal I would say
was correct. (Maybe the section in which I rederived the Bayesian value
of information, but that's standard.)

Some negative criticisms of other possible methods and their failures,
as presented in CFAI, continue to hold. It is far easier to say what is
wrong than what is right.

> and I
> have not repudiated everything in my earlier publications.
> I continue to believe that RL is critical to achieving
> intelligence with a feasible amount of computing resources,
> and I continue to believe that collective long-term human
> happiness should be the basic reinforcement value for SI.
> But I now think that a SI should continue to learn recognition
> of humans and their emotions via reinforcement, rather than
> these recognitions being hard-wired as the result of supervised
> learning. My recent writings have also refined my views about
> how human happiness should be defined, and how the happiness of
> many people should be combined into an overall reinforcement
> value.

It is not my present purpose to criticize these new ideas of yours at
length, only the technical problem with using reinforcement learning to
do pretty much anything.

>>I see no relevant difference between these two proposals, except that
>>the paragraph you cite (presumably as a potential replacement) is much
>>less clear to the outside academic reader.
> If you see no difference between my earlier and later ideas,
> then please use a scenario based on my later papers. That will
> be a better demonstration of the strength of your arguments,
> and be fairer to me.

If you had a paragraph serving an equivalent introductory purpose in a
later peer-reviewed paper, I would use it. But the paragraphs from your
later papers are much less clear to the outside academic reader, and it
would not be clear what I am criticizing, even though it is the same
problem in both cases. That's the sticking point from my perspective.

> Of course, it would be best to demonstrate your claim (either
> that RL must lead to a failure of intelligence, or can succeed
> at intelligence but must lead to a failure of friendliness) in
> general. But if you cannot do that and must rely on a specific
> example, then at least do not pick an example that fails for
> trivial reasons.

The reasons are not trivial; they are general. I know it seems "stupid"
and "trivial" to you, but getting rid of the stupidness and triviality
is a humongous nontrivial challenge that cannot be solved by throwing
brute intelligence at the problem.

You do not need to agree with my criticism before I can publish a paper
critical of your ideas; all the more so if I include a URL to your
rebuttal. Let the reader judge.

> As I wrote above, if you think RL must fail at intelligence,
> you would be best to quote Eric Baum.

Eric Baum's thesis is not reinforcement learning, it is Occam's Razor.
Frankly I think you are too hung up on reinforcement learning. But that
is a separate issue.

> If you think RL can succeed at intelligence but must fail at
> friendliness, but just want to demonstrate it for a specific
> example, then use a scenario in which:
> 1. The SI recognizes humans and their emotions as accurately
> as any human, and continually relearns that recognition as
> humans evolve (for example, to become SIs themselves).

You say "recognize as accurately as any human", implying it is a feature
of the data. Better to say "classify in the same way humans do".

> 2. The SI values people after death at the maximally unhappy
> value, in order to avoid motivating the SI to kill unhappy
> people.
> 3. The SI combines the happiness of many people in a way (such
> as by averaging) that does not motivate a simple numerical
> increase (or decrease) in the number of people.
> 4. The SI weights unhappiness more strongly than happiness, so that
> it focuses its efforts on helping unhappy people.
> 5. The SI develops models of all humans and what produces
> long-term happiness in each of them.
> 6. The SI develops models of the interactions among humans
> and how these interactions affect the happiness of each.

Rearranging deck chairs on the Titanic; in my view this goes down
completely the wrong pathway for how to solve the problem, and it is not
germane to the specific criticism I leveled.

> I do not pretend to have all the answers. Clearly, making RL work
> will require solutions to a number of currently unsolved problems.

RL is not the true Way. But it is not my purpose to discuss that now.

> I appreciate your offer to include my URL in your article,
> where I can give my response. Please use this (please proofread
> carefully for typos in the final galleys):

After I send you the revised draft, it would be helpful if I could see
at least some reply in that URL before final galleys, so that I know I'm
not directing my readers toward a blank page.

> If you take my suggestion, by elevating your discussion to a
> general explanation of why RL systems must fail or at least using
> a strong scenario, that will make my response more friendly since
> I am happier to be named as an advocate of RL than to be
> conflated with trivial failure.

I will probably give a URL to my own reply, which might well just be a
link to this email message. This email does - at least by my lights -
explain what I think the general problem is, and why the example given
is not due to a trivial lack of computing power or failure to read
information directly present in the data itself.

> I would prefer that you not use
> the quote you were using from my 2001 paper, as I repudiate
> supervised learning of hard-wired values. Please use some quote
> from and cite my 2004 AAAI paper, since there is nothing in it
> that I repudiate yet (but you will find more refined views in my
> 2005 on-line paper).

I am sorry and I do sympathize, but there simply isn't any introductory
paragraph in your 2004 paper that would make as much sense to the
reader. My current plan is for the footnote to say that your proposal
has changed to a two-layer system, and cite the 2004 paper. From my
perspective they are not different in any important sense.

I hope this satisfies you; I do need to move on.

Eliezer S. Yudkowsky                
Research Fellow, Singularity Institute for Artificial Intelligence
