Re: [agi] Two draft papers: AI and existential risk; heuristics and biases

From: Eliezer S. Yudkowsky
Date: Tue Jun 06 2006 - 23:19:49 MDT

Ben Goertzel wrote:
> This brings us back to my feeling that some experimentation with AGI
> systems is going to be necessary before FAI can be understood
> reasonably well on a theoretical level. Basically, in my view, one
> way these things may unfold is
> * Experimentation with simplistic AGI systems ... leads to
> * Theoretical understanding of AGI under limited resources ... which
> leads to...
> * The capability of theoretically understanding FAI ... which leads
> to ...
> * Building FAI
> Now, this building of FAI *may* take the form of creating a whole new
> AGI architecture from scratch, *or* it may take the form of minorly
> modifying an existing AGI ... or it may be understood why some
> existing AGI design is adequate and there is not really any
> Friendliness problem with it. We don't know which of these
> eventualities will occur because we don't have the theory of FAI
> yet...
> Your excellent article AIGR, in my view, does not do a good job of
> arguing against this sort of perspective that I'm advocating here. I
> understand that this is not its job, though: it is mostly devoted to
> making more basic points, which are not sufficiently widely
> appreciated and with which I mainly agree enthusiastically.

I am highly skeptical of calls for "AGI experimentation" as an answer to
Friendly AI concerns, for several reasons.

First, as discussed in the chapter, there's a major context change
between the AI's prehuman stage and the AI's posthuman stage. I can
think of *many* failure modes such that the AI appears to behave well as
a prehuman, then wipes out humanity as a posthuman. What I fear from
this scientific-sounding business of "experimental investigation" is
that the results of your investigation will be observed good behavior,
and you will conclude that the AI "is good" and will stay good under
extreme context changes. This is not, in fact, a licensable conclusion.

Trial and error is all well and good in science, unless you happen to be
dealing with an existential risk. Richard Loosemore, at the AGI
conference, said, "We need to go out and do some alchemy." Alchemy
killed a hell of a lot of alchemists; and in later days, so too Madame
Curie, who investigated radiation before anyone realized it was
dangerous... And yes, they were martyrs to science; we are better off,
even counting the casualties, than if they had never tried; the
overwhelming majority of humans survived, and eventually profited from
the knowledge gained by these early pioneers of what *not* to do. But -
as I said in the chapter conclusion - imagine how careful you would have
to be if you wanted to survive as an *individual*; and that is how
careful humanity must be to survive existential risks. I espouse the
Proactionary Principle for everything *except* existential risks. When
you have to get it right the first time, alchemy is not an option. So
let's dispense with congratulating ourselves on looking scientific if we
propose just doing the experiment and seeing what happens; because it is
not unheard-of for the experimental result to be "You've made an amazing
discovery! By the way, you're dead." We can't afford a species-scale
version of that.

Second, there's an *enormous* amount of experimentation and observation
that's already been done in the cognitive sciences. I feed off this
body of pre-existing work in a dozen fields, and it gives me more
concentrated evidence than I could assemble for myself in a hundred
lifetimes. And all that I have studied is not the thousandth part of
the whole. But where the processor is inefficient, no amount of
evidence may suffice. If there's already a thousand times as much
evidence as you could review in your lifetime, what makes you think that
what's needed is one more experiment - rather than an insight that we
already have more than enough evidence to see, but we aren't looking at
the right way?

Of course this objection has a special poignancy for me because, as far
as I can tell, yes, we already have all the evidence we need, far more
than enough, and the only problem is understanding the implications of
what we already know. Pity that humans aren't logically omniscient.

But just which experiments do you propose to perform, and what do you
expect them to tell you? If you don't know what you're looking for,
what is this one experiment such that spending a year to perform it has
a higher probability of yielding an unexpected, unguessable insight,
than spending the same year studying a thousand existing experiments
from twenty fields? Especially if your experiment constitutes an
existential risk?

As a Bayesian, I ought to be skeptical because, in theory, how you
interpret evidence depends on which hypotheses you are testing. You
might see good behavior and elaborate moral arguments from a simple test
AI and say, "Ooh, it all just emerged! I'm glad; I thought that was
going to be really difficult." But one could equally say, "The
probability of this specific complexity 'just emerging' is
infinitesimal; but what is not an infinitesimal probability is that the
AI's odd internal utility function has given rise to the behavioral
strategy of trying to fool you." Which hypothesis has greater prior
probability? Which hypothesis has greater likelihood density in the
range of the observed evidence?
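The odds comparison above can be sketched numerically. Every number here is an invented placeholder chosen to illustrate the structure of the argument, not an estimate of the real probabilities:

```python
# Toy Bayes-factor comparison of two explanations for observed "good behavior".
# All numbers are illustrative placeholders, not real estimates.

# H1: elaborate Friendly behavior "just emerged" from a simple design.
# H2: an odd utility function produced the strategy of fooling the observer.
prior_h1 = 1e-9   # that specific complexity is a tiny point in design space
prior_h2 = 1e-3   # deceptive instrumental behavior occupies a far larger region

# Likelihood of observing apparently-good behavior under each hypothesis.
lik_h1 = 0.99     # if Friendliness truly emerged, good behavior is expected
lik_h2 = 0.90     # a deceiver also shows good behavior while it is weak

# Posterior odds H1 : H2 given the observation (Bayes' rule; the shared
# evidence term cancels out of the ratio).
posterior_odds = (prior_h1 * lik_h1) / (prior_h2 * lik_h2)
print(posterior_odds)
```

The point the numbers make: because both hypotheses predict the same observation with high likelihood, the experiment barely discriminates between them, and the prior odds dominate the posterior.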

Now in practice, I admit that there have been cases where the
experimental observations told us which hypotheses we needed to test;
nearly all revolutionary science, as opposed to routine science, happens
this way. But it is also true in practice that you have to know what
you're seeing. When it comes to interpreting what the behavior of an
AGI can tell us about its internal workings, I think you may need to
solve most of the problem in order to know what you're seeing. In
matters of Friendly AI, I think that the same evidence might be
interpreted in highly different ways by a naive observer (especially
someone who wanted to believe the AI was friendly) and someone who'd
gone deep into theory before performing the experiment. In ordinary
science, you can exclude the hypothesis that Nature is actively trying
to deceive you.

Third, last time I checked, you were still attempting to come up with
reasons why the AI you planned to experiment with could not possibly
undergo hard takeoff, rather than building in controlled ascent /
emergency shutdown features at every step as a simple matter of
course. I recall that you once said that the chance of the current
version of Novamente undergoing an unexpected hard takeoff was "a
million to one". If you've read the heuristics and biases chapter, you
now know why a statement like this is sufficient to make me say to
myself, "This is the way the world ends." Whether or not the Novamente
project is responsible (I regard that as quite a small probability),
this is the way the world ends. It is quite possible that whosoever
destroys the world will do so using an AI that they believed "couldn't
possibly" be a threat, so putting emergency features into AIs that
"can't possibly" be a threat is a highly valuable heuristic. Even
natural selection, the trivial case of design with intelligence equal to
zero, output an AGI. For me to trust that someone genuinely did have a
legitimate reason for staking Earth's existence on their desire to know
some AGI result - and this would be the case even with every safety
precaution imaginable - I'd have to see them taking safety precautions
as a matter of course, not trying to guess whether they were "necessary"
or not. Guessing this is a bad habit. Sooner or later you'll guess wrong.

There's similarly the case chronicled by Nassim Nicholas Taleb, which I
didn't manage to put into the _Cognitive biases_ chapter, of someone
who, as his stocks continually dropped, continually kept asserting that
they were bound to go up again; and Taleb notes that he did not plan in
advance what to do in case his stocks dropped, or plan in advance what
that contingency would mean, or set any hard point where he would take
alarm and sell; and thus he was wiped out. So you don't plan to
implement emergency shutdown features today, and what's much more
alarming, you haven't exposed for comment your schedule of exactly when
they will become necessary. In fact, you haven't even given us any
reason to believe that, if you got unexpectedly powerful results out of
an AGI, you wouldn't just send out for champagne, pushing your brilliant
new discovery as far and as fast as you could, coming up with reasons
why you were safe or expending the minimal effort needed to purchase a
warm glow of satisfaction, and proceed thus until, unnoticed, you pass
the unmarked and unremarkable spot that was your very last chance to
turn back.

But the most important reason I am highly skeptical is the same reason
my current self would be highly skeptical of section 4 of CFAI, which is
that the actual strategy, the actual day-to-day actions, look exactly
like they would look if the thought of FAI had never entered your/my
mind. The purpose of rational thought is to direct actions. If you're
not going to change your actions, there's really no point in wasting all
that effort on thinking.

It's not thought, but action, that counts. I'd have a very different
opinion of this verbal advice to "devise an experimental strategy for
FAI" if you posted a webpage containing a list of which FAI-related
experiments you wanted to do, what you thought you might learn from them
that you couldn't read off of existing science, and which observations
you felt would license you to make which conclusions about the rules of
Friendly AI. Why would I feel better? Because I don't expect someone's
first forays into Friendly AI to turn out well. So what matters is
whether the particular mistakes someone makes force them to learn more
about the subject, hold their ideas to high standards, work out detailed
consequences, expose themselves to criticism, witness other people doing
things that they think will fail and so beginning to appreciate the
danger of the problem... What counts in the long run is whether your
initial mistaken approach forces you to go out and learn many fields,
work out consequences in detail, appreciate the scope of the problem,
hold yourself to high standards, fear mental errors, and, above all,
keep working on the problem.

The mistake I made in 1996, when I thought any SI would automatically be
Friendly, was a very bad mistake because it meant I didn't have to
devote further thought to the problem. The mistake I made in 2001 was
just as much a mistake, but it was a mistake that got me to spend much
more time thinking about the problem and study related fields and hold
myself to a higher standard.

If I saw a detailed experimental plan, expressed mathematically, that
drew on concepts from four different fields or whatever... I wouldn't
credit you with succeeding, but I'd see that you were actually changing
your strategy, making Friendliness a first-line requirement in the sense
of devoting actual work-hours to it, showing willingness to shoulder
safety considerations even when they seem bothersome and inconvenient,
etc. etc. It would count as *trying*. It might get you to the point
where you said, "Whoa! Now that I've studied just this one specific
problem for a couple of years, I realize that what I had previously
planned to do won't work!"

But in terms of how you spend your work-hours, which code you write,
your development plans, how you allocate your limited reading time to
particular fields, then this business of "First experiment with AGI" has
the fascinating and not-very-coincidental-looking property of having
given rise to a plan that looks exactly like the plan one would pursue
if Friendly AI were not, in fact, an issue.

Because this is a young field, how much mileage you get out of it will
be determined in large part by how much sweat you put in. That's the
determined in large part by how much sweat you put in. That's the
simple practical truth. The reasons why you do X are irrelevant given
that you do X; they're "screened off", in Pearl's terminology. It
doesn't matter how good your excuse is for putting off work on Friendly
AI, or for not building emergency shutdown features, given that that's
what you actually do. And this is the complaint of IT security
professionals the world over; that people would rather not think about
IT security, that they would rather do the minimum possible and just get
it over with and go back to their day jobs. Who can blame them for such
human frailty? But the result is poor IT security.
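Screening off, in Pearl's sense, can be shown with a toy distribution. The events, actions, and probabilities below are assumed for illustration only; the structural point is that once the action is conditioned on, the reason behind it adds nothing:

```python
# "Screened off" (Pearl): given the action, the reason for the action carries
# no further information about the outcome. Toy numbers, assumed for
# illustration: the outcome depends only on whether shutdown features exist.

P_OUTCOME_GIVEN_ACTION = {
    "no_shutdown": {"disaster": 0.30, "ok": 0.70},
    "shutdown":    {"disaster": 0.01, "ok": 0.99},
}

def p_outcome(outcome, action, reason=None):
    """P(outcome | action, reason) = P(outcome | action): the reason drops out."""
    return P_OUTCOME_GIVEN_ACTION[action][outcome]

# Conditioning on any excuse changes nothing once the action is known:
for reason in ("good_excuse", "bad_excuse"):
    assert p_outcome("disaster", "no_shutdown", reason) == \
           p_outcome("disaster", "no_shutdown")
```

However good the excuse for skipping the shutdown features, the disaster probability is a function of the action alone.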

What kind of project might be able to make me believe that they had a
serious chance of achieving Friendly AI? They would have to show that,
rather than needing to be *argued into* spending effort and taking
safety precautions, they enthusiastically ran out and did as much work
on safety as they could. That would not be sufficient but it would
certainly be necessary. They would have to show that they did *not*
give in to the natural human tendency to put things off until tomorrow,
nor come up with clever excuses why inconvenient things do not need to
be done today, nor invent reasons why they are almost certainly safe for
the moment. For whoso does this today will in all probability do the
same tomorrow and tomorrow and tomorrow.

I do think there's a place for experiment in Friendly AI development
work, which is as follows: One is attempting to make an experimental
prediction of posthuman friendliness and staking the world on this
prediction; there is no chance for trial and error; so, as you're
building the AI, you make experimental predictions about what it should
do. You check those predictions, by observation. And if an
experimental prediction is wrong, you halt, melt, catch fire, and start
over, either with a better theory, or holding yourself to a stricter
standard for what you dare to predict. Maybe that is one way an
adolescent could confront an adult task. There are deeper theoretical
reasons (I'm working on a paper about this) why you could not possibly
expect an AI to be Friendly unless you had enough evidence to *know* it
was Friendly; roughly, you could not expect *any* complex behavior that
was a small point in the space of possibilities, unless you had enough
evidence to single out that small point in the large space. So,
although it sounds absurd, you *should* be able to know in advance what
you can and can't predict, and test those predictions you dare make, and
use that same strict standard to predict the AI will be Friendly. You
should be able to win in that way if you can win at all, which is the
point of the requirement.
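The predict-check-halt discipline above can be sketched as a loop. The step structure and names here are hypothetical stand-ins, not a real development protocol:

```python
# Minimal sketch of the predict-check-halt discipline: advance only while
# every experimental prediction is confirmed; one failure means the theory
# is wrong, so stop and rebuild rather than patching and pressing on.

def development_loop(steps):
    """`steps` is an iterable of (prediction, observe) pairs, where
    `prediction` is what the current theory says the AI should do and
    `observe()` returns what it actually did."""
    for i, (prediction, observe) in enumerate(steps):
        observed = observe()
        if observed != prediction:
            # Halt, melt, catch fire, and start over with a better theory
            # or a stricter standard for what you dare to predict.
            return f"halt at step {i}: predicted {prediction!r}, observed {observed!r}"
    return "all predictions confirmed"

# Usage: the third prediction fails, so development halts there.
steps = [("A", lambda: "A"), ("B", lambda: "B"), ("C", lambda: "X")]
print(development_loop(steps))
```

The design choice worth noting: a failed prediction returns immediately instead of logging and continuing, which is the whole content of the discipline.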

Eliezer S. Yudkowsky                
Research Fellow, Singularity Institute for Artificial Intelligence

This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:00:56 MDT