Re: Two draft papers: AI and existential risk; heuristics and biases

From: Bill Hibbard
Date: Wed Jun 07 2006 - 11:24:55 MDT


> I don't think it
> inappropriate to cite a problem that is general to supervised learning
> and reinforcement, when your proposal is to, in general, use supervised
> learning and reinforcement. You can always appeal to a "different
> algorithm" or a "different implementation" that, in some unspecified
> way, doesn't have a problem.

But you are not demonstrating a general problem. You are
instead relying on specific examples (primitive neural
networks and systems that cannot distinguish a human from
a smiley) that fail trivially. You should be clear whether
you claim that reinforcement learning (RL) must inevitably
lead to:

  1. A failure of intelligence.

  2. A failure of friendliness.

Your example of the US Army's primitive neural network
experiments is a failure of intelligence. Your statement
about smiley faces assumes a general success at intelligence
by the system, but an absurd failure of intelligence in the
part of the system that recognizes humans and their emotions,
leading to a failure of friendliness.

If your claim is that RL must lead to a failure of
intelligence, then you should cite and quote from Eric Baum's
What is Thought? (in my opinion, Baum deserves the Nobel
Prize in Economics for his experiments linking economic
principles with effective RL in multi-agent learning systems).

If your claim is that RL can succeed at intelligence but must
lead to a failure of friendliness, then it is reasonable to
cite and quote me. But please use my 2004 AAAI paper . . .

> If you are genuinely repudiating your old ideas ...

. . . use my 2004 AAAI paper because I do repudiate the
statement in my 2001 paper that recognition of humans and
their emotions should be hard-wired (i.e., static). That
is just the section of my 2001 paper that you quoted.

Not that I am sure that hard-wired recognition of humans and
their emotions inevitably leads to a failure of friendliness,
since the super-intelligence (SI) may understand that humans
would be happier if they could evolve to other physical forms
but still be recognized by the SI as humans, and decide to
modify itself (or build an improved replacement). But if this
is my scenario, then why not design continuing learning of
recognition of humans and their emotions into the system in
the first place? Hence my change of views.

I am sure you have not repudiated everything in CFAI, and I
have not repudiated everything in my earlier publications.
I continue to believe that RL is critical to achieving
intelligence with a feasible amount of computing resources,
and I continue to believe that collective long-term human
happiness should be the basic reinforcement value for SI.
But I now think that a SI should continue to learn recognition
of humans and their emotions via reinforcement, rather than
these recognitions being hard-wired as the result of supervised
learning. My recent writings have also refined my views about
how human happiness should be defined, and how the happiness of
many people should be combined into an overall reinforcement
value.

> I see no relevant difference between these two proposals, except that
> the paragraph you cite (presumably as a potential replacement) is much
> less clear to the outside academic reader.

If you see no difference between my earlier and later ideas,
then please use a scenario based on my later papers. That will
be a better demonstration of the strength of your arguments,
and be fairer to me.

Of course, it would be best to demonstrate your claim (either
that RL must lead to a failure of intelligence, or can succeed
at intelligence but must lead to a failure of friendliness) in
general. But if you cannot do that and must rely on a specific
example, then at least do not pick an example that fails for
trivial reasons.

As I wrote above, if you think RL must fail at intelligence,
you would do best to quote Eric Baum.

If you think RL can succeed at intelligence but must fail at
friendliness, but just want to demonstrate it for a specific
example, then use a scenario in which:

  1. The SI recognizes humans and their emotions as accurately
  as any human, and continually relearns that recognition as
  humans evolve (for example, to become SIs themselves).

  2. The SI values people after death at the maximally unhappy
  value, in order to avoid motivating the SI to kill unhappy
  people.

  3. The SI combines the happiness of many people in a way (such
  as by averaging) that does not motivate a simple numerical
  increase (or decrease) in the number of people.

  4. The SI weights unhappiness more strongly than happiness, so that
  it focuses its efforts on helping unhappy people.

  5. The SI develops models of all humans and what produces
  long-term happiness in each of them.

  6. The SI develops models of the interactions among humans
  and how these interactions affect the happiness of each.
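
Items 2 through 4 of this scenario can be illustrated with a toy
calculation. This is only a sketch under assumed conventions: the
[-1, 1] happiness scale, the weight of 2.0, and every name in it are
hypothetical and are not taken from any of the papers mentioned here.

```python
from dataclasses import dataclass
from statistics import mean

# Assumed conventions for this sketch (hypothetical, not from any paper):
MIN_HAPPINESS = -1.0      # happiness is measured on the scale [-1, 1]
UNHAPPINESS_WEIGHT = 2.0  # unhappiness counts twice as much as happiness


@dataclass
class Person:
    happiness: float  # modeled long-term happiness, in [-1, 1]
    alive: bool = True


def weighted(h: float) -> float:
    """Item 4: scale up negative values so unhappiness dominates."""
    return h * UNHAPPINESS_WEIGHT if h < 0 else h


def reinforcement(people: list[Person]) -> float:
    """Combine individual happiness into one reinforcement value."""
    values = []
    for p in people:
        # Item 2: a dead person is valued as maximally unhappy.
        h = p.happiness if p.alive else MIN_HAPPINESS
        values.append(weighted(h))
    # Item 3: averaging, so head count alone does not change the value.
    return mean(values)


population = [Person(0.5), Person(-0.5)]
after_killing = [Person(0.5), Person(-0.5, alive=False)]

print(reinforcement(population))       # baseline value
print(reinforcement(after_killing))    # lower: killing is penalized
print(reinforcement(population * 2))   # same as baseline: no gain from numbers
```

In this toy version, killing the unhappy person lowers the value, and
duplicating the population leaves it unchanged. Items 1, 5, and 6 (the
recognition and modeling of humans) are the hard part and are simply
assumed here as the source of the happiness numbers.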

If you demonstrate a failure of friendliness against a weaker
scenario, all that really demonstrates is that you needed the
weak scenario in order to make your case. And it is unfair to
me. As I said, best would be a general demonstration, but if
you must pick an example, at least pick a strong example.

I do not pretend to have all the answers. Clearly, making RL work
will require solutions to a number of currently unsolved problems.
Jeff Hawkins' work on hierarchical temporal memory (HTM) is
interesting in this respect, given the interactions within the
human brain between the cortex (modeled by HTM) and lower brain
areas where RL has been observed (in my view RL is in a lower area
because it is fundamental, and the higher areas evolved to create
the simulation model of the world necessary to solve the credit
assignment problem for RL). Clearly RL is not the whole answer,
but I think Eric Baum has it right that it is critical to
intelligence.

I appreciate your offer to include my URL in your article,
where I can give my response. Please use this (please proofread
carefully for typos in the final galleys):

If you take my suggestion, either by elevating your discussion to
a general explanation of why RL systems must fail or at least by
using a strong scenario, my response will be friendlier, since I
am happier to be named as an advocate of RL than to be conflated
with trivial failure. I would prefer that you not use
the quote you were using from my 2001 paper, as I repudiate
supervised learning of hard-wired values. Please use some quote
from and cite my 2004 AAAI paper, since there is nothing in it
that I repudiate yet (but you will find more refined views in my
2005 on-line paper).


p.s., Although I receive digest messages from extropy-chat,
for some reason my recent posts to it have all bounced. Could
someone please forward this message to extropy-chat?
