Re: FAI and SSSM

From: Eliezer S. Yudkowsky
Date: Thu Dec 12 2002 - 19:28:47 MST

Bill Hibbard wrote:

>>>With super-intelligent machines, the key to human safety is
>>>in controlling the values that reinforce learning of
>>>intelligent behaviors. In machines,
>>Machines? Something millions of times smarter than a human cannot be
>>thought of as a "machine". Such entities, even if they are incarnated as
>>physical processes, will not be physical processes that share the
>>stereotypical characteristics of those physical processes we now cluster
>>as "biological" or "mechanical".
> You're just arguing over the definition of words. I am
> using "machines" to mean artifacts constructed by humans.
> If you read my book you'll see I imagine super-intelligent
> machines as quite different from any other machines humans
> have built.

What about processes that construct themselves? Does it make sense to
describe the child of the child of the child of the child of the mind that
humans originally built as an "artifact constructed by humans"? Is it
useful to describe it so, when it shares none of the characteristics that
we presently attach to "machines"? Call it a mind, or better yet an
entity; we might be wrong on both counts but at least we won't be quite as
wrong as if we use the word "machine".

Yes, I'm arguing over the definition of a word; deliberately so because I
think that people expect certain characteristics to hold true of
"machines", and that these characteristics don't hold true of SIs
(superintelligences). I would expect an originally human SI and an
originally human-built SI to have more in common with each other than
either would have in common with a modern-day human or a human-equivalent
AI, and I would expect even a half-grown AI to have almost nothing in
common with the physical objects we categorize as "machines".

>>> we can design them so
>>>their behaviors are positively reinforced by human happiness
>>>and negatively reinforced by human unhappiness.
>>A Friendly seed AI design, a la:
>>doesn't have positive reinforcement or negative reinforcement, not the way
>>you're describing them, at any rate. This makes the implementation of
>>your proposal somewhat difficult.
> I am well aware of the relation between your approach based
> on planning behavior from goals, and my approach based on
> values for reinforcement learning.
> A robust implementation of reinforcement learning must solve
> the temporal credit assignment problem, which requires a
> simulation model of the world. This simulation model is the
> basis of reasoning based on goals. Planning and goal-based
> reasoning are emergent behaviors of a robust implementation
> of reinforcement learning.

Perhaps the complex behavior of planning is emergent in the simple
behavior of reinforcement, as well as the simple behavior of reinforcement
being a special case of the complex behavior of planning. I don't think
so, but then I haven't tried to figure out how to do it, so I wouldn't
know whether it's possible.
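
For readers unfamiliar with the temporal credit assignment problem
mentioned above, here is a minimal sketch of one standard way of spreading
a delayed reward back over the earlier steps that led to it: discounted
returns. This is an illustration only, not the method of either
correspondent. It is model-free, whereas the quoted text argues that a
robust solution needs a simulation model. The function name and the
discount factor `gamma` are assumptions chosen for the example.

```python
def discounted_returns(rewards, gamma=0.5):
    """Assign credit to each time step for the rewards that follow it.

    rewards[t] is the reward received at step t; the return at step t is
    rewards[t] + gamma * rewards[t+1] + gamma**2 * rewards[t+2] + ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A single reward arrives only at the final step; earlier steps still
    # receive a (discounted) share of the credit.
    print(discounted_returns([0, 0, 0, 8], gamma=0.5))
```

Earlier steps get geometrically less credit the further they are from the
reward, which is the simplest answer to "which past behavior earned this?"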

But human evolution includes specific selection pressures on goals, apart
from selection pressures on reinforcement. Imperfectly deceptive social
organisms that argue linguistically about each other's motives in adaptive
political contexts develop altruistic motives and rationalizations from
altruistic motives to selfish actions; if supra-ancestral increase in
intelligence or knowledge overcomes the force of rationalization, you are
then left with a genuine altruist. How would a robust implementation of
reinforcement learning duplicate the moral and metamoral adaptations which
are the result of highly specific selection pressures in an ancestral
environment not shared by AIs? You can transfer moral complexity directly
rather than trying to reduplicate its evolutionary causation in humans,
but you do have to transfer that complexity - it is not emergent just from

>>A simple goal system that runs on positive reinforcement and negative
>>reinforcement would almost instantly short out once it had the ability to
>>modify itself. The systems that implement positive and negative
>>reinforcement of goals would automatically be regarded as undesirable,
>>since the only possible effect of their functioning is to make *current*
>>goals less likely to be achieved, and the current goals at any given point
>>are what would determine the perceived desirability of self-modifying
>>actions such as "delete the reinforcement system". A Friendly AI design
>>needs to be stable even given full self-modification.
> I think you may be assuming a non-robust implementation of
> reinforcement learning that does not use a simulation model
> to solve the temporal credit assignment problem.

I confess that I don't see how this changes anything at all. I assumed a
simulation model that is not only used for temporal credit assignment, but
which allows for imagination of novel behaviors whose desirability is
determined by the match of their extrapolated effects against previously
reinforced goal patterns. Without this ability, no reinforcement-based
system would ever be capable of carrying out complex creative actions such
as computer programming - when I write a program, I am reasoning from
abstract, high-level design goals to novel concrete code, not just
implementing coding behaviors that have been previously reinforced.
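
A toy sketch of that kind of imagination, under heavy simplifying
assumptions: the world is a point on a grid, the "simulation model" is a
hypothetical function `world_model`, and the reinforced goal pattern is
just a target point. None of these names come from any real design. The
point is only that a candidate action can be scored by its extrapolated
effect, even if that action itself was never reinforced.

```python
def world_model(state, action):
    """Hypothetical simulation model: predict the state an action leads to."""
    x, y = state
    dx, dy = action
    return (x + dx, y + dy)


def match(predicted_state, goal_pattern):
    """Score a predicted state against the goal pattern (higher is better).

    Uses negative squared distance, so a perfect match scores 0.
    """
    return -sum((p - g) ** 2 for p, g in zip(predicted_state, goal_pattern))


def imagine_best_action(state, candidate_actions, goal_pattern):
    """Pick the candidate whose extrapolated effect best matches the goal."""
    return max(
        candidate_actions,
        key=lambda action: match(world_model(state, action), goal_pattern),
    )


if __name__ == "__main__":
    previously_reinforced = [(1, 0), (0, 1)]   # behaviors tried before
    novel = [(3, 4), (2, 2)]                   # behaviors never tried
    best = imagine_best_action((0, 0), previously_reinforced + novel, (3, 4))
    print(best)
```

Here the chosen action is one of the novel candidates: it is selected
because of what the model predicts it will do, not because it was
rewarded in the past.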

When I say "simple reinforcement system", I mean "a lot simpler than a
human or a Friendly AI"; "simple" does include full modeling/simulation
capabilities, for both credit assignment and imagination of novel
behaviors. Maybe calling it a "flat" reinforcement system would be
better. The problem with a flat reinforcement system is that it
flash-freezes itself the moment it becomes capable of self-modification.
Originally, you built the system such that it contained certain internal
functional modules which modified goal patterns conditional on external
sensory events. And from its goals at any given point, the system judges
the desirability of future states of the universe, and hence the
desirability of actions leading to those future states.

Now imagine this system looking at the fact that it possesses
reinforcement modules, and considering the desirability of actions which
remove those modules. Any internal system, whose effect is to change the
cognitive pattern against which imagined future events are matched to
determine their desirability, is automatically undesirable; if the AI's
future pattern changes, then the AI will take actions which result in an
inferior match of those futures against the current pattern governing
actions. To protect the goals currently governing actions (including
self-modifying actions), the system will remove any internal functionality
whose effect is to modify the top layer of its goal system.
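
The argument above can be made concrete with a toy model. This is a sketch
under stated assumptions, not anyone's actual design: the goal pattern is
a single number, the hypothetical `reinforce` module shifts it by an
external signal, and the agent is assumed competent enough to steer the
world to whatever its goal is at each step. Futures are always judged
against the goal the agent holds now.

```python
from dataclasses import dataclass


@dataclass
class FlatAgent:
    goal: int                    # current goal pattern (preferred world state)
    has_reinforcer: bool = True  # whether the reinforcement module is present


def reinforce(goal, signal):
    """Hypothetical reinforcement module: an external signal shifts the goal."""
    return goal + signal


def simulate(goal, has_reinforcer, signals):
    """Predict future world states, one per expected external signal.

    Assumption: at each step the agent achieves whatever goal it holds at
    that moment, after the reinforcer (if present) has rewritten it.
    """
    states = []
    for signal in signals:
        if has_reinforcer:
            goal = reinforce(goal, signal)
        states.append(goal)
    return states


def desirability(states, current_goal):
    """Match predicted futures against the CURRENT goal (higher is better)."""
    return -sum(abs(state - current_goal) for state in states)


def choose_self_modification(agent, expected_signals):
    """Compare keeping vs. deleting the reinforcer, as judged by current goals."""
    keep = desirability(simulate(agent.goal, True, expected_signals), agent.goal)
    delete = desirability(simulate(agent.goal, False, expected_signals), agent.goal)
    choice = "delete_reinforcer" if delete > keep else "keep_reinforcer"
    return choice, keep, delete


if __name__ == "__main__":
    print(choose_self_modification(FlatAgent(goal=0), [1, -2, 3]))
```

Whenever any expected signal would move the goal, the futures with the
reinforcer intact match the current goal worse than the futures without
it, so the deletion wins. Nothing in this flat design gives the agent a
reason to value its own future goal changes.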

This action feels intuitively wrong to a human because humans have extra
complexity in the goal system, which for purposes of Friendly AI we can
think of as humans treating moral arguments as having the semantics of
probabilistic statements about external referents. See the appropriate
sections of "Creating Friendly AI" for more information.

How do you think temporal credit assignment would change this? It doesn't
seem relevant.

>>Finally, you're asking for too little - your proposal seems like a defense
>>against fears of AI, rather than asking how far we can take supermorality
>>once minds are freed from the constraints of evolutionary design. This
>>isn't a challenge that can be solved through a defensive posture - you
>>have to step forward as far as you can.
> Not at all. Reinforcement is a two-way street, including both
> negative (what you call defensive)

No, that wasn't what I meant by "defensive" at all. I was referring to
human attitudes about futurism.

> and positive reinforcement.
> My book includes a vivid description of the sort of heaven on
> earth that super-intelligent machines will create for humans,

As I believe your book observes, such vivid descriptions are pointless
because we aren't smart enough to get the description right. For example,
your book refers to automated farms and factories rather than
nanotechnology and uploading. This, of course, does not mean that
nanotechnology is the correct description; only that we can already be
pretty sure that farming and factories are destined for the junk-heap of

Recommended book: "Permutation City", Greg Egan.
Recommended online reading:

> assuming that they learn behaviors based on values of human
> happiness, and assuming that they solve the temporal credit
> assignment problem so they can reason about long term happiness.

What if someone has goals beyond happiness? Many philosophies involve
greater complexity than that.

My current best understanding of morality is that "good" consists of
people getting what they want, defined however they choose to define it.
But I'm not infallible, so that understanding is itself subject to change.
What happens if it turns out that "happiness" isn't what you really
wanted? How does your design recover from philosophical errors by the

>>>Behaviors are reinforced by much different values in human
>>>brains. Human values are mostly self-interest. As social
>>>animals humans have some more altruistic values, but these
>>>mostly depend on social pressure. Very powerful humans can
>>>transcend social pressure and revert to their selfish values,
>>>hence the maxim that power corrupts and absolute power
>>>corrupts absolutely.
>>I strongly recommend that you read Steven Pinker's "The Blank Slate".
>>You're arguing from a model of psychology which has today become known as
>>the "Standard Social Science Model", and which has since been disproven
>>and discarded. Human cognition, including human altruism, is far more
>>complex and includes far more innate complexity than the behaviorists
> I am quite familiar with Pinker's ideas. He gave a great talk
> on "The Blank Slate" here in Wisconsin last year (I was lucky
> to get a seat, the room was packed). In fact, my ideas about
> human selfishness and altruism are largely based on Pinker's
> How the Mind Works.

I have not read that particular book of Pinker's, so I don't know how much
emotional evolutionary psychology is in it. Anyway, if you're lucky
enough to have access to a source of dead trees, you might also want to
read Matt Ridley's "The Origins of Virtue" for specific material on the
evolutionary origins of altruism.

> I think you are assuming I am a Skinner behaviorist because you
> are thinking of reinforcement learning without a solution of
> the temporal credit assignment problem.
>>If you can't spare the effort for "The Blank Slate", . . .
> That's kind of a cheap shot, Eliezer.

It honestly wasn't intended as such. People have lives beyond the books I
want them to read, and I try to be aware of that. These days I rarely if
ever read material that is not available online. There's a lot more
effort involved in obtaining a dead tree than in clicking a link, so if
someone can't spare the effort for the dead tree, I try to provide a link
as a second-best alternative, if possible. I guess I should be more
careful in my phrasing next time. Sorry.

I also didn't realize that the book you referred to was available
online. I've now read it. Don't suppose you could return the favor and
check out "Creating Friendly AI", if you haven't done so already?

Eliezer S. Yudkowsky
Research Fellow, Singularity Institute for Artificial Intelligence
