Re: Two draft papers: AI and existential risk; heuristics and biases

From: Bill Hibbard
Date: Thu Jun 15 2006 - 08:06:08 MDT

Eliezer Yudkowsky wrote:
> Bill Hibbard wrote:
> > Eliezer,
> >
> >>I don't think it
> >>inappropriate to cite a problem that is general to supervised learning
> >>and reinforcement, when your proposal is to, in general, use supervised
> >>learning and reinforcement. You can always appeal to a "different
> >>algorithm" or a "different implementation" that, in some unspecified
> >>way, doesn't have a problem.
> >
> > But you are not demonstrating a general problem. You are
> > instead relying on specific examples (primitive neural
> > networks and systems that cannot distinguish a human from
> > a smiley) that fail trivially. You should be clear whether
> > you claim that reinforcement learning (RL) must inevitably
> > lead to:
> >
> > 1. A failure of intelligence.
> >
> > or:
> >
> > 2. A failure of friendliness.
> As it happens, my model of intelligence says that what I would call
> "reinforcement learning" is not, in fact, adequate to intelligence.
> However, the fact that you believe "reinforcement learning" is adequate
> to intelligence, suggests that you would take any possible factor that I
> thought was additionally necessary, and claim that it was part of the
> framework you regarded as "reinforcement learning".

Reinforcement learning (RL) is not a particular algorithm,
but is a formal problem statement or paradigm (Baum uses
the phrase "formal context"). As Baum describes in "What
is Thought?", there are many classes of algorithms for
solving this problem. Thus you cannot exclude algorithms,
known or yet unknown, unless they violate the RL paradigm.
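
To illustrate the distinction between the RL paradigm and any one
algorithm, here is a minimal sketch. Everything in it is an assumption
made for illustration: the Learner interface, the two toy learners,
and the PAYOFFS table are hypothetical and come from neither Baum's
book nor my papers. The only point is that two quite different
algorithms can sit behind the same act/reinforce loop.

```python
import random
from typing import Callable, Protocol


class Learner(Protocol):
    """The paradigm: choose an action, then receive a scalar reward."""
    def act(self) -> int: ...
    def reinforce(self, action: int, reward: float) -> None: ...
    def preferred(self) -> int: ...


class AveragingLearner:
    """Tracks the mean reward of each action and mostly picks the best."""
    def __init__(self, n_actions: int, rng: random.Random,
                 explore: float = 0.2) -> None:
        self.totals = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.rng = rng
        self.explore = explore

    def preferred(self) -> int:
        means = [t / c if c else float("-inf")
                 for t, c in zip(self.totals, self.counts)]
        return means.index(max(means))

    def act(self) -> int:
        # Explore while some action is untried, and occasionally afterwards.
        if 0 in self.counts or self.rng.random() < self.explore:
            return self.rng.randrange(len(self.counts))
        return self.preferred()

    def reinforce(self, action: int, reward: float) -> None:
        self.totals[action] += reward
        self.counts[action] += 1


class MutationLearner:
    """Tries random variants and keeps whichever was rewarded most so far."""
    def __init__(self, n_actions: int, rng: random.Random) -> None:
        self.n_actions = n_actions
        self.rng = rng
        self.incumbent = 0
        self.incumbent_reward = float("-inf")

    def preferred(self) -> int:
        return self.incumbent

    def act(self) -> int:
        return self.rng.randrange(self.n_actions)

    def reinforce(self, action: int, reward: float) -> None:
        if reward > self.incumbent_reward:
            self.incumbent, self.incumbent_reward = action, reward


def run(learner: Learner, reward_fn: Callable[[int], float], steps: int) -> int:
    """The same loop drives any learner that fits the paradigm."""
    for _ in range(steps):
        action = learner.act()
        learner.reinforce(action, reward_fn(action))
    return learner.preferred()


PAYOFFS = [0.1, 0.9, 0.4]  # hypothetical reward for each of three actions

for make in (AveragingLearner, MutationLearner):
    learner = make(len(PAYOFFS), random.Random(0))
    print(make.__name__, "prefers action", run(learner, PAYOFFS.__getitem__, 200))
```

Neither toy learner is remotely adequate for intelligence. The sketch
only shows that criticizing one of them says nothing about the other,
or about the loop they share.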

From my first writings about AI I picked RL as my model
for how brains work in part because it is open ended and
there is much that I don't know about how brains work.
Thus it is unfair for you to base your demonstration of
failure of my ideas on some particular algorithm that I
never claimed as adequate for intelligence.

I also picked RL as my model because it showed an approach
to protecting humans from AI that was different from
Asimov's Laws, which I felt had unresolvable ambiguities.
Yes, human brains use reason and Asimov's Laws work by
reason. But in my view learning rather than reason is
fundamental to how brains work. Reason is part of the
simulation model of the world that brains evolved in order
to solve the credit assignment problem for RL. In my view
the proper way to protect human interests is through the
reinforcement values in AIs. Rather than constraining AI
behavior by rules, it is better to design AI motives to
produce safe behavior. In order to make this argument I
did not have to specify a RL algorithm, and I didn't.

Evolution via genetic selection is an example of RL:
genetic mutations are reinforced by the survival and
reproduction of organisms carrying those mutations. The
scientific method is another good example: theories are
reinforced by whether their predictions agree with
experiment. The RL paradigm is pretty general and can be
implemented by a wide variety of algorithms. I believe
that human brains work according to the RL paradigm, using
very complex and currently unknown algorithms, and hence
demonstrate the adequacy of the RL paradigm for intelligence.

> What I am presently discussing is failure of friendliness. However, the
> fact that we use different models of intelligence is also responsible
> for our disagreement about this second point. Explaining a model of
> intelligence tends to be very difficult, and so, from my perspective,
> the main important thing is that you should understand that I have a
> legitimate (that is, honestly meant) disagreement with you about what
> reinforcement systems do and what happens in practice when you use them.

I never doubt that you honestly mean what you write.

Since you think the RL paradigm is inadequate for
intelligence, then you should see friendliness as a moot
issue for RL. If it isn't intelligent, it isn't a threat.

Your scenario of a system that is adequate for intelligence
in its ability to rule the world, but absurdly inadequate
for intelligence in its inability to distinguish a smiley
face from a human, is inconsistent. Inconsistent assumptions
can be used to demonstrate anything. If you think that RL is
inadequate for intelligence, you should argue for that rather
than using inconsistent assumptions to turn it into an
argument about friendliness.

Based on your writings, I think you probably do have a
model of intelligence that does not fit the RL paradigm.
For example, your desire to *prove* that your design for
intelligence will *never* violate certain invariants
seems difficult to reconcile with the RL paradigm, since
effective RL algorithms tend to employ outside-the-box
mechanisms like genetic mutations and inspired, crazy

> By the way, I've got some other tasks to take on in the near future, and
> I may not be able to discuss the actual technical disagreement at
> length. As said, I will include a footnote pointing to your
> disagreement, and to my response.
> > Your example of the US Army's primitive neural network
> > experiments is a failure of intelligence. Your statement
> > about smiley faces assumes a general success at intelligence
> > by the system, but an absurd failure of intelligence in the
> > part of the system that recognizes humans and their emotions,
> > leading to a failure of friendliness.
> Let me try to analyze the model of intelligence behind your statement.
> You're thinking something along the lines of:
> "Supervised algorithms" (sort of like those in the most advanced
> artificial neural networks) give rise to "reinforcement learning";

I'm not sure what this means, but doubt that I agree.
I used the phrase "supervised learning" in my 2001
paper to indicate RL (algorithm unspecified, because
the RL algorithms necessary for intelligence were
unknown in 2001, and still unknown) with reinforcements
coming from external trainers rather than some internal
encoding. I used "supervised" to indicate supervision
by an external agent, and certainly not to indicate
artificial neural networks.

> "Reinforcement learning" gives rise to "intelligence";

The RL paradigm, with currently unknown algorithms,
gives rise to intelligence.

> "Intelligence" is what lets an AI shape the world, and also what tells
> it that tiny molecular smiley faces are bad examples of happiness, while
> an actual human smiling is a good example of happiness.
> In your journal paper from 2004, you seem to propose using a two-layer
> system of reinforcement, with the first layer being observed agreement
> from humans as a reinforcer of its definition of happiness, and the
> second layer being reinforcement of behaviors that lead to "happiness"
> as thus defined. So in this case, we substitute: "'Intelligence' is
> what tells an AI that tiny molecular speakers chirping "Yes! Good job!"
> are bad examples of agreement with its definition of happiness, while an
> actual human saying "Yes! Good job!" is a good example."
> After all, it sure seems stupid to confuse human smiles with tiny
> molecular smiley faces! How silly of the army tank classifier, not to
> realize that it was supposed to detect tanks, instead of detecting
> cloudy days!
> But a neural network the size of a planet, given the same examples,
> would have failed in the same way.

But I certainly never said that neural networks were the
proper RL algorithm for intelligence. Of course, it
depends on what you mean by the phrase "neural networks".
Its general use among computer scientists is for a network
of formalized neurons without feedback, and these are
certainly inadequate for intelligence. On the other hand,
human brains are networks of real neurons (and other types
of cells) that do demonstrate intelligence in an easily
portable size.

> You previously said:
> > When it is feasible to build a super-intelligence, it will
> > be feasible to build hard-wired recognition of "human facial
> > expressions, human voices and human body language" (to use
> > the words of mine that you quote) that exceed the recognition
> > accuracy of current humans such as you and me, and will
> > certainly not be fooled by "tiny molecular pictures of
> > smiley-faces." You should not assume such a poor
> > implementation of my idea that it cannot make
> > discriminations that are trivial to current humans.
> It's trivial to discriminate between a photo of a picture with a
> camouflaged tank, and a photo of an empty forest. They're different
> pixel maps. If you transform them into strings of 1s and 0s, they're
> different strings. Discriminating between them is as simple as testing
> them for equality.
> But there's an exponentially vast space of functions that classify all
> possible pixel-maps of a fixed size into "plus" and "minus" spaces. If
> you talk about the space of all possible computations that implement
> these classification functions, the space is trivially infinite and
> trivially undecidable.
> Of course a super-AI, or an ordinary neural network, can trivially
> discriminate between a tiny molecular picture of a smiley face, or a
> smiling human, or between two pictures of the same smiling human from a
> slightly different angle. The issue is whether the AI will *classify*
> these trivially discriminable stimuli into "plus" and "minus" spaces the
> way *you* hope it will.

But there is a broad consensus among humans about
classifications. I assume that the RL paradigm is
adequate for intelligence in humans and in a super-AI,
and hence can conclude that a super-AI classifies
within the gamut of how the general human consensus
would classify.

To make your case you must demonstrate that the RL
paradigm is inadequate for intelligence, and without
assuming some particular class of RL algorithms.

> If you look at the actual pixel-map that shows a camouflaged tank,
> there's not a little XML tag in the picture itself that says "Hey,
> network, classify this picture as a good example!" The classification
> is not a property of the picture alone. Thinking as though the
> classification is a property of the picture is an instance of Mind
> Projection Fallacy, as mentioned in my AI chapter.
> Maybe you actually *wanted* the neural network to discriminate sunny
> days from cloudy days. So you fed it exactly the same data instances,
> with exactly the same supervision, and used a slightly different
> learning algorithm - and found to your dismay that the network was so
> stupid, it learned to detect tanks instead of cloudy days. But a really
> smart intelligence would not be so stupid that it couldn't tell the
> difference between cloudy days and sunny days.
> There are many possible ways to *classify* different data instances, and
> the classification involves information that is not directly present in
> the instances. In contrast, finding that two instances are not
> identical uses only information present in the data instances
> themselves. Saying that a superintelligence could discriminate between
> tiny molecular smiley faces and human smiles is, I would say, correct.
> But it is not correct to say that any sufficiently intelligent mind will
> automatically *classify* the instances the way you want them to.

But humans constantly depend on agreement with other
humans on classifications. This consensus on
classifications is adequate so that humans know what
each other mean. A sufficiently intelligent mind will
learn to classify in agreement with the human consensus
if its reinforcement value for learning classifications
is to agree with humans.

> Let's say that the AI's training data is:
> Dataset 1:
> Plus: {Smile_1, Smile_2, Smile_3}
> Minus: {Dog_1, Cat_1, Dog_2, Dog_3, Cat_2, Dog_4, Boat_1, Car_1, Dog_5,
> Cat_3, Boat_2, Dog_6}
> Now the AI grows up into a superintelligence, and encounters this data:
> Dataset 2: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Smiley_1,
> Dog_9, Cat_5, Smiley_2, Smile_4, Boat_3, Galaxy_2, Nanofactory_2,
> Smiley_3, Cat_6, Boat_4, Smile_5, Galaxy_3}
> It is not a property *of dataset 2* that the classification *you want* is:
> Plus: {Smile_4, Smile_5}
> Minus: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Smiley_1, Dog_9,
> Cat_5, Smiley_2, Boat_3, Galaxy_2, Nanofactory_2, Smiley_3, Cat_6,
> Boat_4, Galaxy_3}
> Rather than:
> Plus: {Smiley_1, Smiley_2, Smile_4, Smiley_3, Smile_5}
> Minus: {Dog_7, Cat_4, Galaxy_1, Dog_8, Nanofactory_1, Dog_9, Cat_5,
> Boat_3, Galaxy_2, Nanofactory_2, Cat_6, Boat_4, Galaxy_3}
> If you want the top classification rather than the bottom one, you must
> infuse into the *prior state* of the AI some *additional information*,
> not present in dataset 2. That, of course, is the point of giving the
> AI dataset 1. But if you do not understand *how* the AI is classifying
> dataset 1, and then the AI enters a drastically different context, there
> is the danger that the AI is classifying dataset 1 using a very
> different method from the one *you yourself originally used* to classify
> dataset 1, and that the AI will, as a result, classify dataset 2 in ways
> different from how you yourself would have classified dataset 2. (This
> line of reasoning leads to "Coherent Extrapolated Volition", if I go on
> to ask what happens if I would have wanted to classify dataset 1 itself
> a bit differently if I had more empirical knowledge, or thought faster.)
> You cannot throw computing power at this problem. Brute force, or even
> brute intelligence, is not the issue here.
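
The quoted dataset argument can be made concrete with a small sketch.
The two rules below are hypothetical stand-ins chosen for
illustration (neither email specifies how a classifier would work),
and the kind() helper simply reads the label prefix off each item
name. Both rules fit Dataset 1 perfectly, yet they split Dataset 2
differently.

```python
DATASET_1_PLUS = ["Smile_1", "Smile_2", "Smile_3"]
DATASET_1_MINUS = ["Dog_1", "Cat_1", "Dog_2", "Dog_3", "Cat_2", "Dog_4",
                   "Boat_1", "Car_1", "Dog_5", "Cat_3", "Boat_2", "Dog_6"]
DATASET_2 = ["Dog_7", "Cat_4", "Galaxy_1", "Dog_8", "Nanofactory_1",
             "Smiley_1", "Dog_9", "Cat_5", "Smiley_2", "Smile_4", "Boat_3",
             "Galaxy_2", "Nanofactory_2", "Smiley_3", "Cat_6", "Boat_4",
             "Smile_5", "Galaxy_3"]


def kind(item: str) -> str:
    """Strip the numeric suffix: 'Smiley_2' -> 'Smiley'."""
    return item.rsplit("_", 1)[0]


def human_smile_rule(item: str) -> bool:
    """Hypothetical rule: plus only for an actual human smile."""
    return kind(item) == "Smile"


def smile_shape_rule(item: str) -> bool:
    """Hypothetical rule: plus for anything with the shape of a smile."""
    return kind(item) in {"Smile", "Smiley"}


def fits_training(rule) -> bool:
    """True if the rule reproduces every Dataset 1 label."""
    return (all(rule(x) for x in DATASET_1_PLUS)
            and not any(rule(x) for x in DATASET_1_MINUS))


def plus_set(rule, data):
    """Items of `data` that the rule puts in the plus class."""
    return [x for x in data if rule(x)]


for rule in (human_smile_rule, smile_shape_rule):
    print(rule.__name__, fits_training(rule), plus_set(rule, DATASET_2))
```

Dataset 1 alone cannot separate the two rules, because it contains no
Smiley items. Which rule a learner ends up with depends on what else
goes into it, which is the point under dispute in this exchange.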

Of course there are ambiguous cases and minor
disagreements between humans, and so no AI will
match every human's classification. But the AI will
agree with the general human consensus for those
cases where consensus exists. The general human
consensus is very clear that a tiny molecular smiley
face is not a human. There is pretty unanimous
consensus on the classification "human".

> > If your claim is that RL can succeed at intelligence but must
> > lead to a failure of friendliness, then it is reasonable to
> > cite and quote me. But please use my 2004 AAAI paper . . .
> >
> >>If you are genuinely repudiating your old ideas ...
> >
> > . . . use my 2004 AAAI paper because I do repudiate the
> > statement in my 2001 paper that recognition of humans and
> > their emotions should be hard-wired (i.e., static). That
> > is just the section of my 2001 paper that you quoted.
> I will include, in the footnote, a statement that your 2004 paper
> proposes a two-layer system. But this is not at all germane to the
> point I was making - though the footnote will serve to notify readers
> that your ideas have not remained static. Please remember that my
> purpose is not to present Bill Hibbard's current ideas, but to use, as
> an example of failure, an idea that you published in a peer-reviewed
> journal in 2001. If you have taken alarm at the notion of hardwiring
> happiness as reinforcement, then you ought to say something like:
> "Though it makes me uncomfortable, I can't ethically argue that you
> should not publish my old mistake as a warning to others who might
> otherwise follow in my footsteps; but you must include a footnote saying
> that I now also agree it's a terrible idea."
> Most importantly, your 2004 paper simply does not contain any paragraph
> that serves the introductory and expository role of the paragraph I
> quoted from your 2001 paper. There's nothing I can quote from 2004 that
> will make as much sense to the reader. If I were praising your 2001
> paper, rather than criticizing it, would you have the same objection?

Sorry to hear this. I think my 2004 paper presents a much
clearer and better thought out description of intelligence
and human safety. But I appreciate your offer to include a footnote.

> > Not that I am sure that hard-wired recognition of humans and
> > their emotions inevitably leads to a failure of friendliness,
> Okay, now it looks like you *haven't* taken alarm at this.

I think my newer ideas are more likely to produce
friendly AI, and since I don't think a proof of
perpetual friendliness is possible, "more likely" is
about as good as it gets. I would intervene if someone
wanted to use my 2001 paper as the basis for actually
building a SI, but there's no chance of that.

> > since the super-intelligence (SI) may understand that humans
> > would be happier if they could evolve to other physical forms
> > but still be recognized by the SI as humans, and decide to
> > modify itself (or build an improved replacement). But if this
> > is my scenario, then why not design continuing learning of
> > recognition of humans and their emotions into the system in
> > the first place. Hence my change of views.
> I think at this point you're just putting yourself into the SI's shoes,
> empathically, using your own brain to make predictions about what the SI
> will do. Not, reasoning about the technical difficulties associated
> with infusing certain information into the SI.

As I said above, I think the SI can learn classifications
that are within the gamut of general human consensus,
including classification of long term life satisfaction.
The SI will also have the ability to make (imperfect)
predictions about individual humans just as we can make
such predictions about each other. Based on this, and the
SI's values, I can make some predictions about its behavior.
This all comes down to our disagreement about whether the RL
paradigm is adequate for intelligence, and you should make
your argument in those terms.

> > I am sure you have not repudiated everything in CFAI,
> I can't think offhand of any particular positive proposal I would say
> was correct. (Maybe the section in which I rederived the Bayesian value
> of information, but that's standard.)
> Some negative criticisms of other possible methods and their failures,
> as presented in CFAI, continue to hold. It is far easier to say what is
> wrong than what is right.
> > and I
> > have not repudiated everything in my earlier publications.
> > I continue to believe that RL is critical to achieving
> > intelligence with a feasible amount of computing resources,
> > and I continue to believe that collective long-term human
> > happiness should be the basic reinforcement value for SI.
> > But I now think that a SI should continue to learn recognition
> > of humans and their emotions via reinforcement, rather than
> > these recognitions being hard-wired as the result of supervised
> > learning. My recent writings have also refined my views about
> > how human happiness should be defined, and how the happiness of
> > many people should be combined into an overall reinforcement
> > value.
> It is not my present purpose to criticize these new ideas of yours at
> length, only the technical problem with using reinforcement learning to
> do pretty much anything.

There are much better known advocates of RL as the basis
for intelligence than me. But I am an advocate and happy
to be named as such. I am not happy to be named as an
advocate of artificial neural networks as adequate for
intelligence, or an advocate of systems that cannot
distinguish a smiley face from a human.

> >>I see no relevant difference between these two proposals, except that
> >>the paragraph you cite (presumably as a potential replacement) is much
> >>less clear to the outside academic reader.
> >
> > If you see no difference between my earlier and later ideas,
> > then please use a scenario based on my later papers. That will
> > be a better demonstration of the strength of your arguments,
> > and be fairer to me.
> If you had a paragraph serving an equivalent introductory purpose in a
> later peer-reviewed paper, I would use it. But the paragraphs from your
> later papers are much less clear to the outside academic reader, and it
> would not be clear what I am criticizing, even though it is the same
> problem in both cases. That's the sticking point from my perspective.
> > Of course, it would be best to demonstrate your claim (either
> > that RL must lead to a failure of intelligence, or can succeed
> > at intelligence but must lead to a failure of friendliness) in
> > general. But if you cannot do that and must rely on a specific
> > example, then at least do not pick an example that fails for
> > trivial reasons.
> The reasons are not trivial; they are general. I know it seems "stupid"
> and "trivial" to you, but getting rid of the stupidness and triviality
> is a humongous nontrivial challenge that cannot be solved by throwing
> brute intelligence at the problem.

Huh? Is stupidity a problem that cannot be solved by
throwing brute intelligence at it? But I think you
meant something else and just used bad wording.

Brute intelligence can produce classifications that
agree with the general human consensus, and hence
"know what we mean" in the same way that we know what
each other mean.

> You do not need to agree with my criticism before I can publish a paper
> critical of your ideas; all the more so if I include a URL to your
> rebuttal. Let the reader judge.
> > As I wrote above, if you think RL must fail at intelligence,
> > you would be best to quote Eric Baum.
> Eric Baum's thesis is not reinforcement learning, it is Occam's Razor.
> Frankly I think you are too hung up on reinforcement learning. But that
> is a separate issue.

On page 29 of "What is Thought?", Baum wrote:

  Evolution thus leads to creatures that are essentially
  reinforcement learners with an innate, programmed-in
  reward system: avoid pain, eat when hungry but not when
  full, desire parental approval, and react to stop whatever
  causes your child to cry.

This is a clear statement that humans and other animals
are reinforcement learners.

Chapter 7 is entitled "Reinforcement Learning", Chapter
10 is devoted to his experiments relating economic
principles to the solution of the very difficult credit
assignment problem of RL, and other chapters include
numerous insights into how brains learn by reinforcement.

The book does include extensive discussion of Occam's
Razor. On pages 12 and 13 Occam's Razor is used to choose
between different classes of RL algorithms, and this is
extensively elaborated throughout the book.

Any time I write that RL is the basis of intelligence, I
cite Baum's "What is Thought?" He is a widely respected
RL researcher and a more eloquent advocate than I.

> > If you think RL can succeed at intelligence but must fail at
> > friendliness, but just want to demonstrate it for a specific
> > example, then use a scenario in which:
> >
> > 1. The SI recognizes humans and their emotions as accurately
> > as any human, and continually relearns that recognition as
> > humans evolve (for example, to become SIs themselves).
> You say "recognize as accurately as any human", implying it is a feature
> of the data. Better to say "classify in the same way humans do".

I agree, your wording is better. Or "within the gamut
of general human consensus."

> > 2. The SI values people after death at the maximally unhappy
> > value, in order to avoid motivating the SI to kill unhappy
> > people.
> >
> > 3. The SI combines the happiness of many people in a way (such
> > as by averaging) that does not motivate a simple numerical
> > increase (or decrease) in the number of people.
> >
> > 4. The SI weights unhappiness stronger than happiness, so that
> > it focuses its efforts on helping unhappy people.
> >
> > 5. The SI develops models of all humans and what produces
> > long-term happiness in each of them.
> >
> > 6. The SI develops models of the interactions among humans
> > and how these interactions affect the happiness of each.
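
Points 2 to 4 of the list above describe how individual happiness
values combine into one reinforcement value. A minimal sketch follows,
under loudly stated assumptions: the [-1, 1] happiness scale, the
UNHAPPINESS_WEIGHT of 2.0, and the use of None for a deceased person
are all hypothetical choices made here for illustration, not formulas
taken from the 2004 paper.

```python
from typing import Optional, Sequence

MIN_HAPPINESS = -1.0      # assumed scale: -1.0 (worst) to 1.0 (best)
UNHAPPINESS_WEIGHT = 2.0  # assumed factor; any value above 1 makes the point


def weighted(happiness: float) -> float:
    """Point 4: unhappiness counts more strongly than happiness."""
    return happiness * UNHAPPINESS_WEIGHT if happiness < 0 else happiness


def reinforcement_value(people: Sequence[Optional[float]]) -> float:
    """Average weighted happiness over everyone; None marks a deceased person.

    Point 2: the deceased are scored at the maximally unhappy value.
    Point 3: averaging means that adding or removing people does not,
    by itself, raise the value.
    """
    if not people:
        raise ValueError("at least one person is required")
    scores = [weighted(MIN_HAPPINESS if h is None else h) for h in people]
    return sum(scores) / len(scores)


before = reinforcement_value([0.5, -0.5])
after_death = reinforcement_value([0.5, None])
print(before, after_death)
```

Under these assumptions the death of the unhappy person lowers the
value instead of raising it, and duplicating a happy person leaves the
average unchanged.
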
> Rearranging deck chairs on the Titanic; in my view this goes down
> completely the wrong pathway for how to solve the problem, and it is not
> germane to the specific criticism I leveled.
> > I do not pretend to have all the answers. Clearly, making RL work
> > will require solution to a number of currently unsolved problems.
> RL is not the true Way. But it is not my purpose to discuss that now.
> > I appreciate your offer to include my URL in your article,
> > where I can give my response. Please use this (please proof
> > read carefully for typos in the final galleys):
> >
> >
> After I send you the revised draft, it would be helpful if I could see
> at least some reply in that URL before final galleys, so that I know I'm
> not directing my readers toward a blank page.

From the time I sent that URL it contained a statement
that I am waiting to see the final version of AIRisk.pdf.
Now I have added a record of our email exchange, which
is a good explanation of the issues. I look forward to
seeing a revised draft.

> > If you take my suggestion, by elevating your discussion to a
> > general explanation of why RL systems must fail or at least using
> > a strong scenario, that will make my response more friendly since
> > I am happier to be named as an advocate of RL than to be
> > conflated with trivial failure.
> I will probably give a URL to my own reply, which might well just be a
> link to this email message. This email does - at least by my lights -
> explain what I think the general problem is, and why the example given
> is not due to a trivial lack of computing power or failure to read
> information directly present in the data itself.
> > I would prefer that you not use
> > the quote you were using from my 2001 paper, as I repudiate
> > supervised learning of hard-wired values. Please use some quote
> > from and cite my 2004 AAAI paper, since there is nothing in it
> > that I repudiate yet (but you will find more refined views in my
> > 2005 on-line paper).
> I am sorry and I do sympathize, but there simply isn't any introductory
> paragraph in your 2004 paper that would make as much sense to the
> reader. My current plan is for the footnote to say that your proposal
> has changed to a two-layer system, and cite the 2004 paper. From my
> perspective they are not different in any important sense.

I appreciate your offer of a footnote and a citation to my
2004 paper, along with your willingness to provide my URL.

> I hope this satisfies you; I do need to move on.

I will be dissatisfied if you make your case by assuming
some algorithm, such as artificial neural networks, that I
never claimed was adequate for intelligence, or if you use
an example of a system that is adequate for intelligence
in its ability to rule the world but inadequate for
intelligence in its inability to distinguish a smiley face
from a human.
