Re: SIAI's flawed friendliness analysis

From: Durant Schoon (
Date: Thu May 22 2003 - 19:50:38 MDT

Yikes! I go watch the "Matrix Reloaded" a couple of times and a whole
week passes with lots of action on sl4...

> From: Bill Hibbard <>
> Subject: Re: SIAI's flawed friendliness analysis
> Hi Durant,

Hi Bill,

> > > The SIAI guidelines involve digging into the AI's
> > > reflective thought process and controlling the AI's
> > > thoughts, in order to ensure safety. My book says the
> > > only concern for AI learning and reasoning is to ensure
> > > they are accurate, and that the teachers of young AIs
> > > be well-adjusted people (subject to public monitoring
> > > and the same kind of screening used for people who
> > > control major weapons). Beyond that, the proper domain
> > > for ensuring AI safety is the AI's values rather than
> > > the AI's reflective thought processes.
> >
> > Two words: "Value Hackers"
> >
> > Let us try to understand why Eliezer chooses to focus on an "AI's
> > reflective thought processes" rather than on explicitly specifying an
> > "AI's values". Let's look at it this way: if you could, wouldn't you
> > rather develop an AI which could reason about *why* the values are the
> > way they are instead of just having the values carved in stone by a
> > programmer?
> >
> > This is *safer* for one very important reason:
> >
> > The values are less likely to be corrupted, since an AI that actually
> > understands the sources for these values can reconstruct them from
> > basic principles/information-about-humans-and-the-world-in-general.
> >
> > The ability to re-derive these values in the face of a
> > changing external environment and an interior mindscape-under-
> > development is in fact *paramount* to the preservation of Friendliness.
> This sounds good, but it won't work because values can
> only be derived from other values. That's why the SIAI
> recommendations use value words like "valid", "good"
> and "mistaken". But the recommendations leave the
> definition of these words ambiguous. Furthermore,
> recommendation 1, "Friendliness-topped goal system",
> defines a base value (i.e., supergoal) but leaves it
> ambiguous. Defining rigorous standards for these words
> will require the SIAI recommendations to precisely
> define values.
> The ambiguous definitions in the SIAI analysis will be
> exploited by powerful people and institutions to create
> AIs that protect and enhance their own interests.

I just finished reading LOGI and I'm starting to re-read CFAI (this
time carefully), so bear with me, if I mangle this. CFAI focuses on
which design considerations we should undertake when designing a
Friendly AI.

You are pointing out the critical period of a seedling AI's
development in which Friendliness might not only be conveyed poorly,
but might be conveyed incorrectly with the malicious intent of
benefiting special interests.

I wonder if we can come up with meta-guidelines here, not to out AI
programmers with unFriendly intentions (I confess I skipped a few
messages ahead to Eliezer's reply), but instead to create a guideline
that tells us which guidelines serve the interests of the few at the
expense of the many. Can we make this an engineering problem?

I'm asking: Is it possible to create a special-interests detector that
could be the very first part of the Friendliness system? Such a
sensory modality would be something like the classic
cheater-detectors that are rife in our own brains. Can we make it hard
for anyone to insert subversive directives?

Even if you trust everyone on your team, it might make sense to create
this detection system as one of the first

Then if there were lurking spies, it would theoretically be harder for
them to slip in their code. Of course, you are now forced to examine
the detector for backdoors...And so the race begins as the very first
line of code is written.

Potential spies who pass enough scrutiny to appear as capable Seed AI
programmers should be smart enough to know that they are not taking a
risk like hacking a bank's software system to steal 60% of all the
money in the world; they are creating a mind that has the potential
to overpower the entire world. It would take a programmer of
pathological audacity and disregard for humanity to muck with
something like that...but you never know. I say it might be worth
investigating some simple and obvious cheater-detectors as first moves
on the board.

Thanks for bringing this up (Eliezer, feel free to slap me if this is
already carefully covered in CFAI).

> > As mentioned before, this all hinges on the ability to create an
> > AI in the first place that can understand "how and why values are
> > created" as well as what humans are, what itself is, and what the
> > world around us is. Furthermore, we are instructed by this insight
> > as to what the design of an AI should look like.
> Understanding what humans are, what the AI itself is, and
> what values are, is implicit in the simulation model of the
> world created by any intelligent mind. It uses this model to
> predict the long-term consequences of its behaviors on its
> values. Such understanding is prescribed by an intelligence
> model rather than a safe AI model.

Having just read section 1.4 of CFAI, I interpret your comments as
concentrating on "Friendliness Content", i.e. the values, rather than
on "Friendliness Acquisition" and "Friendliness Structure", i.e. the
intelligence.

We probably all agree about goal-seeking AIs having a proper model of
the world, the proper values, and the proper intellect to act
according to the values.

While trying to understand the differences between your advice and
Eliezer's, I'm wondering whether you advocate that the AI be
allowed/encouraged to modify its own values, and if so, at what point
do you diverge from what Eliezer suggests?

> > Remembering our goal is to build better brains, the whole notion of
> > the SeedAI bootstrap is to get an AI that builds a better AI. We must
> > then ask ourselves this question: Who should be in charge of
> > designating this new entity's values? The answer is "the smartest, most
> > capable thinker who is the most skilled in these areas". At some point
> > that thinker is the AI. From the get-go we want the AI to be competent
> > at this task. If we cannot come up with a way to ensure this, we
> > should not attempt to build a mind in the first place (this is
> > Eliezer's view. I happen to agree with this. Ben Goertzel & Peter
> > Voss's opposing views have been noted previously as well(*)).
> It is true that AIs, motivated by their values, will design
> and build new and better AIs. Based on the first AI's
> understanding about how to achieve its own values, it may
> design new AIs with slightly different values. As Ben says,
> there can be no absolute guarantee that these drifting values
> will always be in human interests.

Absolute or not, the aim of Friendly AI is to preserve human interests,
or something "better" should we all agree on what "better" is. As I see
it, we're sort of forced into this without any guarantee: if we don't
develop Friendly AI, someone could develop a non-Friendly AI.

> But I think that AI values
> for human happiness link AI values with human values and
> create a symbiotic system combining humans and AIs. Keeping
> humans "in the loop" offers the best hope of preventing any
> drift away from human interests. An AI with humans in its
> loop will have no motive to design an AI without humans in
> its loop. And as long as humans are in the loop, they can
> exert reinforcement to protect their own interests.

All Friendly volitional sentients are welcome here! Believe me, I
don't want to be left out of the loop...unless leaving me in the loop
would cause all sorts of damage. I have a feeling that we're going to
lose control of the ship no matter what, but if CFAI works, the ship
will take us where we want to go anyway. According to Friendliness,
our interests are very important to an FAI.

Now if we don't have a design that preserves Friendliness, we can be
removed from the loop as soon as posthuman intelligence takes off,
right? We have to "win" before that happens. Of course, humans will be
in the loop up until that point, but afterward, blammo,
singularity. If it's not a group we trust that creates AI, it will be
some group we don't trust. It's a strange endgame, but here we are,
forced to deal with it, hopefully with enough time to save ourselves.

With all of my comments, I am assuming the case where singularity
pulls the rug out from under our feet, out of our control. If you're
not concerned about that particular scenario, then perhaps I sound
overly paranoid. That assumption will have design consequences,
e.g. what you have to failsafe when. I just want to make sure we're on
the same page, worrying about the same thing.

> > In summary, we need to build the scaffolding of *deep* understanding
> > of values and their derivations into the design of AI if we are to
> > have a chance at all. The movie 2001 is already an example, in the
> > popular mind, of what can happen when this is not done.
> The "*deep* understanding of values" is implicit in superior
> intelligence. It is a very accurate simulation model of the
> world that includes understanding of how value systems work,
> and the effects that different value systems would have on
> brains. But choosing between different value systems requires
> base values for comparing the consequences of different value
> systems.

Yes, and those originally come from the programmers...I understand now
that the potential point of failure you're highlighting is right
there, when the seedling is being trained. Sorry if I didn't get that

> Reinforcement learning and reinforcement values are essential
> to intelligence. Every AI will have base values that are part
> of its definition, and will use them in any choice between
> different value systems.

Yes (note - I'm equating "reinforcement learning" with "learning",
ie. "Bayesian learning". If this is a subtle issue concerning two types
of learning, then I'm not up to date on the terminology).

> Hal in the movie 2001 values its own life higher than the
> lives of its human companions, with results predictably bad
> for humans. It's the most basic observation about safe AI.

What I was trying to say with that example is that it seemed like the
programmers just set some (moral) values and let the intelligence
crank away until it got the "wrong" answer. In retrospect, I probably
shouldn't have brought that up as a comparison with your suggestions
for AI; I should've given your approach more credit. The Hal
example is well known, and I can be fairly confident that you've
considered this particular issue even though I've never read your
book. Thank you for not smacking me too hard.

With my last post, I sincerely wanted to emphasize the AI's role in
further developing its own values. I'm hoping you'll say more
about this, if you've considered it. I don't think we humans have the
option of always staying smarter than the AIs we build, and because of
that realization (happy or unhappy as it may be) we have to set AI
on a course that will ensure all our interests when AI surpasses
us. Even assuming that we'll upgrade our personal capabilities as
eager, happy transhumanists, I predict we'll be behind the curve.

I don't think there's a way to stay ahead of this game unless we
upload a human first...and that's more dangerous than CFAI, in my opinion.

> > We cannot think of everything in advance. We must build a mind that
> > does the "right" thing no matter what happens.
> The right place to control the behavior of an AI is its
> values.
> There's a great quote from Galileo: "I do not feel obliged to
> believe that the same God who endowed us with sense, reason
> and intellect has intended us to forgo their use." When we
> endow an artifact with an intellect, we will not be able to
> control its thoughts. We can only control the accuracy of its
> world model, by the quality and quantity of brain power we
> give it, and its base values. The thoughts of an AI will be
> determined by its experience with the world, and its values.

Is there anything you specifically disagree with in CFAI 1.4
"Content, Acquisition, Structure"? I think we're all talking about the
same thing. Eliezer is saying "Yeah, we need those values (content)
and we really, really, really need to concentrate on making sure
the AI wants to develop better values and can learn new ones
(structure and acquisition)".

It sounds like you want to separate values from intelligence, but do
you really mean something stronger, like "An AI should not modify its
own values"? I'm guessing here because I'm looking for the source of the
discrepancy between your views and Eliezer's.

I hope I've made clear my interpretation of the view that an AI should
be able to modify its values: once it transcends human intelligence,
you want the structure and acquisition to scale.

> > Considering your suggestion that the AI *only* be concerned with
> > having accurate thoughts: I haven't read your book, so I don't know
> > your reasoning for this. I can imagine that it's an *easier* way to do
> > things. You don't have to worry about the hard problem of where values
> > come from, which ones are important and how to preserve the right ones
> > under self modification. Easier is not better, obviously, but I'm only
> > guessing here why you might hold "ensuring-accuracy" in higher esteem
> > than other goals of thinking (like preserving Friendliness).
> Speculating about my motives is certainly an *easier* way for
> you to argue against my ideas.

I know, I should read your book. Please help by summarizing when
possible so I can get the gist of your argument. I only have so much
time in the day :(

...though a little more for a brief window (hence these long posts)

> An accurate simulation model of the world is necessary for
> predicting and evaluating (according to reinforcement values)
> the long-term consequences of behaviors. This is integral to
> preserving friendliness, rather than an alternative.


> When you say "values ..., which ones are important and how to
> preserve the right ones" you are making value judgments about
> values. Those value judgments must be based on some set of
> base values.

Yes, there needs to be a mechanism that modifies or produces new
values and it has to be bootstrapped somehow (the programmers inputting
the information for example). This is the security concern, to which you
rightfully draw our attention.

We humans do it through a combination of hard-wired instincts,
culturally learned behaviors, and intelligent deductions based on our
experiences. AI will follow the same natural pattern: initial
conditions, external influence, internal methods for processing inputs.

> > (*) This may be the familiar topic on this list of "when" to devote
> > your efforts to Friendliness, not "if". This topic has already been
> > discussed exhaustively and I would say it comes down to how one
> > answers certain questions: "How cautious do you want to be?", "How
> > seriously do you consider that a danger could arise quickly, without a
> > chance to correct problems on a human time scale of thinking/action?"
> >
> > > In my second and third points I described the lack of
> > > rigorous standards for certain terms in the SIAI
> > > Guidelines and for initial AI values. Those rigorous
> > > standards can only come from the AI's values. I think
> > > that in your AI model you feel the need to control how
> > > they are derived via the AI's reflective thought
> > > process. This is the wrong domain for addressing AI
> > > safety.
> >
> > Just to reiterate, with SeedAI, the AI becomes the programmer, the
> > gatekeeper of modifications. We *want* the modifier of the AI's values
> > to be super intelligent, better than all humans at that task, to be
> > more trustworthy, to do the right thing better than any
> > best-intentioned human. Admittedly, this is a tall order, an order
> > Eliezer is trying to fill.
> >
> > If you are worried about rigorous standards, perhaps Eliezer's proposal
> > of Wisdom Tournaments would address your concern. Before the first
> > line of code is ever written, I'm expecting Eliezer to expound upon
> > these points in sufficient detail.
> Wisdom Tournaments will only address my concerns if they
> define precise values for a safe AI, and create a political
> movement to enforce those values.

I really like Eliezer's idea that even if the programmers get some
small part of Friendliness wrong, there is some wiggle room for the AI
to get it right. During the training period there will be a back and
forth, and Eliezer has always said that in the beginning the programmers
should be considered right.

From your call for a political movement, I gather that you

1) Prefer some amount of public participation
2) Feel safer with oversight of what the programmers are actually
   teaching the seedling.

I have to admit, I'm not sure how I'd feel if Eliezer and a group of
top-1-percent programmers disappeared to the Mongolian Underground
and created their Friendly AI in secret.

Of all the authors I've read so far, I guess I'd trust him the most,
but admittedly the concept of anyone doing that, and then having a
chance at succeeding, is unnerving.

One thing to do is to publicly try to define as much about
Friendliness as possible in advance, so that if anyone is crazy enough
to attempt it on their own, they'll follow safe(ish) guidelines (if
they know what's good for them).

I understand you want more detail for these proposed Tournaments. I'd
like more detail too. And I'm sure Eliezer, et al. would as
well. Someone has to write them first :-)

> > > Clear and unambiguous initial values are elaborated
> > > in the learning process, forming connections via the
> > > AI's simulation model with many other values. Human
> > > babies love their mothers based on simple values about
> > > touch, warmth, milk, smiles and sounds (happy Mother's
> > > Day). But as the baby's mind learns, those simple
> > > values get connected to a rich set of values about the
> > > mother, via a simulation model of the mother and
> > > surroundings. This elaboration of simple values will
> > > happen in any truly intelligent AI.
> >
> > Why will this elaboration happen? In other words, if you have a
> > design, it should not only convince us that the elaboration will
> > occur, but that it will be done in the right way and for the right
> > reasons. Compromising any one of those could have disastrous effects
> > for everyone.
> The elaboration is the same as the creation of subgoals
> in the SIAI analysis. When you say "right reasons" you
> are making a value judgment about values, that can only
> come from base values in the AI.

True. So we can all agree that we should be concerned with:

1) Where base values come from and what they are
2) How values are added/removed/modified

I'm curious: when you look at Eliezer's design suggestions, do you want
the values to be decoupled more from the intelligent processes that
operate on them? Or is that a minor concern, and you're mostly worried
about how base values are assigned (including social processes to do
so) and what, specifically, these values should be?

"Ambiguity = danger" as you propose.

> > > I think initial AI values should be for simple
> > > measures of human happiness. As the AI develops these
> > > will be elaborated into a model of long-term human
> > > happiness, and connected to many derived values about
> > > what makes humans happy generally and particularly.
> > > The subtle point is that this links AI values with
> > > human values, and enables AI values to evolve as human
> > > values evolve. We do see a gradual evolution of human
> > > values, and the singularity will accelerate it.
> >
> > I think you have good intentions. I appreciate your concern for doing
> > the right thing and helping us all along on our individual goals to be
> > happy(**), but if the letter-of-the-law is upheld and there is
> > no-true-comprehension as to *why* the law is the way it is, we could
> > all end up invaded by nano-serotonin-reuptake-inhibiting-bliss-bots.
> This illustrates the difference between short-term and
> long-term. Brain chemicals give short-term happiness
> but not long-term. An intelligent AI will learn that
> drugs do not make people happy in the long term.

I like that you appeal to a higher intelligence to determine that
making us happy this way is not in our best interests, but technically
we would be happy, since we'd never come down from our perma-highs. On
this list, Eliezer has predicted that a Friendly AI would give full
disclosure about the predicted outcome of our choices before allowing
us to do things like wireheading.

A sadist, a masochist and a super-intelligent Friendly AI walk into a
bar...oh wait, I've used that one :)

> > I think Eliezer takes this great idea you mention, of guiding the AI
> > to have human values and evolve with human values, one step
> > further. Not only does he propose that the AI have these human(e)
> > values, but he insists that the AI know *why* these values are good
> > ones, what good "human" values look like, and how to extrapolate them
> > properly (in the way that the smartest, most ethical human would),
> > should the need arise.
> Understanding all the implications of various values is
> a function of reason (i.e., simulating the world). I'm all
> in favor of that, but it is a direct consequence of superior
> intelligence. It is part of an intelligence model rather
> than a friendliness model.

I wonder if it's safe for me to write this equation:

friendliness model =

  intelligence model + FriendlinessIsSuperGoalForever

Does that work for you?

> But for "the AI (to) know *why* these values are good ones" is
> to make value judgments about values. Such value judgments
> imply some base values for judging candidate values. This
> is my point: the proper domain of AI friendliness theory
> is values.

I see, I see. You want more detail about the base values and the
programming mechanism designed to preserve Friendliness. You have no
qualms with the approach, but you want more than just a sketch of what
this looks like.

Me too!

> > Additionally, we must consider the worst case, that we cannot control
> > rapid ascent when it occurs. In that scenario we want the AI to be
> > ver own guide, maintaining and extending-as-necessary ver morality
> > under rapid, heavy mental expansion/reconfiguration. Should we reach
> > that point, the situation will be out of human hands. No regulatory
> > guidelines will be able to help us. Everything we know and cherish
> > could depend on preparing for that possible instant.
> >
> > (**) Slight irony implied but full, sincere appreciation bestowed. I
> > consider this slightly ironic, since I view happiness as a signal that
> > confirms a goal was achieved rather than a goal, in-and-of-itself.
> Yes, but happiness is playing these two different roles in
> two different brains. It is the signal of positive values in
> humans, and observation of that signal is the positive value
> for the AI. This links AI values to human values.

OK, I see the distinction. We probably agree that the AI has to be
"smart" about this, like a human would be "smart" about it so we end
up getting what we want instead of what we say we want. That means
coming up with a model of the human in question and constantly
updating it, just like we do when we model other minds. I have a
feeling I should read CFAI carefully to get an answer for this.

Your distinction, earlier, between short-term and long-term goals
is a good one. That should definitely be a consideration when we
are given "full disclosure" of the possible outcomes of our choices.

The future holds much weirdness for transhumans who will be able to
modify their own psyches. Thinking about desire and volition can
get muddled. Earlier you distinguished short-term and long-term goals,
which any intelligent being should do. But now imagine we have full
control over our own minds. If you enjoy gambling, shouldn't you be
allowed to gamble (short-term pleasure) even though you know it's bad
for you (long-term loss of money)? Or should you just turn off your
enjoyment of gambling? In this particular situation, I think one can
examine one's goals and calculate which is the better thing to do.

Or, assuming sex is still interesting, will gay people "cure"
themselves and become straight? Or will straight people cure
themselves and become bisexual or celibate? This might all be decided
by predicting likely future scenarios and their probabilities and then
we get to choose which future we want to live in...and with everyone
else making decisions in real time the models of the future will be
constantly changing. Ah, those derivatives markets should be fun...If
we can get to the point where we're not all killing each other and are
all increasing in prosperity, then the rest of this is icing on the
cake...ok, I've digressed.

Again, I think Friendliness, properly implemented, will link human
desires to the AI's desires. The AI will model us as sentients and
proceed accordingly. That's the whole idea.

> > > Morality has its roots in values, especially social
> > > values for shared interests. Complex moral systems
> > > are elaborations of such values via learning and
> > > reasoning. The right place to control an AI's moral
> > > system is in its values. All we can do for an AI's
> > > learning and reasoning is make sure they are accurate
> > > and efficient.
> >
> > I'll agree that the values are critical linchpins as you suggest, but
> > please do not lose sight of the fact that these linchpins are part of
> > a greater machine with many interdependencies and exposure to an
> > external, possibly malevolent, world.
> There are both benevolence and malevolence in the world.
> For further details, see ;)

Yes, I realize you've probably noticed it's a nasty world out there :)
I just wanted to express my hope that if we can put as many safety
measures as possible into the design, we better protect ourselves from
disaster. I'm sure you agree :)

I'm not sure it is easy to cleanly cleave values from value-modifying
processes when self-modification is involved. This may require
intensive scrutiny, and the result might not look like solutions for
systems that don't recursively self-modify. We might have to expend
extra effort to get these parts right:

1) Model of the world (including self, sentients and non-sentients)
2) Model of Friendliness/Values
3) Processes for modifying Friendliness
4) Processes for taking deliberate action in the world, while
   preserving Friendliness

All these things need to be right. All are dynamic and
interconnected. We can't write any one of these in stone and then work
on the rest of them in isolation, they all have to work in
concert. There needs to be a system-level solution.

If you take this all into account when you write:

"The right place to control an AI's moral system is in its values. All
we can do for an AI's learning and reasoning is make sure they are
accurate and efficient."

then I'm totally with you.

> > The statement: "All we can do for an AI's learning and reasoning is
> > make sure they are accurate and efficient" seems limiting to me in
> > light of Eliezer's writings. If we can construct a mind that will
> > solve this most difficult of problems (extreme intellectual ascent
> > while preserving Friendliness) for us and forever, then we should aim
> > for nothing less. Indeed, not hitting this mark is a danger that
> > people on this list take quite seriously.
> Unsafe AI is a great danger, but I think we disagree on
> the greatest source of that danger. The SIAI analysis is
> so afraid of making an error in the definition of its
> base values (i.e., supergoal) that it leaves them
> ambiguous.

Ah, I see that. But I take CFAI 1.0 as a sketch. More dirty work lies
ahead. I've suggested starting with a detector for "special interests
coerce others". This suggestion probably demonstrates my *lack* of
understanding of CFAI, but I hope to correct that shortly :)

It will be interesting to see how the initial layers are built up,
ie. what is specified in what order. As I read CFAI (for real this
time), I'm going to try to keep an eye out for this. I have a feeling
from skimming it before that such fine details were not explained.

> This ambiguity will be exploited by the main danger for
> safe AI, namely powerful people and institutions who will
> try to manipulate the singularity to protect and enhance
> their own interests.

More detail is needed! If one had a complete design spec for a
Friendly AI, should it be posted to the internet...I guess if one were
confident it was right, one would...or should one just build it?

I think you raised a good question that is probably on lots of people's
minds: What exactly goes into these Wisdom Tournaments? (or maybe we're
not ready to ask this yet, first a book on rationality needs to be
written to start a movement to garner interest to ...) I'm not really

> The SIAI analysis completely fails to recognize the need
> for politics to counter the threat of unsafe AI posed by
> the bad intentions of some people.

There are two classes of scenarios we need to protect ourselves from:

1) A public project with good intentions (that can be distorted)
2) A private project with bad intentions

We might be able to protect ourselves from #1 with oversight and good
rules for wisdom tournaments.

#2 is a bit stranger. I'm wondering if the only way to beat that is to
race to #1 and get there first. If we slow down #1 so much that we lose
to #2, we're just as hosed.

> That threat must be opposed by clearly defined values for
> safe AIs, and a broad political movement successful in
> electoral politics to enforce those values.

Here is Eliezer's snap summary about Friendliness from CFAI:

"the elimination of involuntary pain, death, coercion, and stupidity"

If we can clearly define the values for safe AIs that lead to this
better world, then by all means they should be published. And that
might help us if others who are flying below the radar of public
political processes read them as well.

PS -

I'd like to carry on this dialog as I make my way through the
document (we'll see how much time I have to type along the way, though).

Durant Schoon

This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:00:42 MDT