Re: Self-modifying FAI (was: How hard a Singularity?)

From: Eliezer S. Yudkowsky
Date: Wed Jun 26 2002 - 10:47:15 MDT

James Higgins wrote:
> At 07:29 AM 6/26/2002 -0400, Eliezer S. Yudkowsky wrote:
>> I agree with your response to Ben. We don't expect an AI's belief
>> that the sky is blue to drift over successive rounds of
>> self-modification. Beliefs with an external referent should not
>> "drift" under self-modification except insofar as they "drift" into
>> correspondence with reality. Write a definition of Friendliness made
>> up of references to things which exist outside the AI, and the content
>> has no reason to "drift". If content drifts it will begin making
>> incorrect predictions and will be corrected by further learning.
> Unfortunately, can we construct a definition of friendliness using
> external reference points which truly equals what we really want?

And here begins the fun!

Turn around the question and look at it another way. Humans are building
the AI. Humans are external to the AI. Everything inside the AI that we
*want* to be there is there because of something outside the AI. Every time
you ask yourself a question about Friendship design, a decision takes place
inside your mind. That decision is a real thing and it is shaped by
reference to a number of things, from many different places perhaps, but all
outside the AI. Whatever exists inside the AI should be, from the *AI's*
perspective, an external reference to the intentions of the programmers,
which in turn may be treated as an external reference to whatever
considerations the programmer used to make the decision. As long as the AI
regards its own Friendship system in this way, it is externally anchored.

Now it may be necessary at some point for the AI to begin snipping the
programmers out of the loop and using the considerations that the
programmers use, directly, but this process is something that occurs with
the cooperation of the programmers and the supervision of the programmers,
and should not need to be rushed beyond what the AI is capable of handling
mentally at any given point. And beyond that, if the AI is to really grow
into the Singularity in the same way that humans grew beyond the goals of
the evolutionary metaprogram, it may be necessary for the AI to add
basic-level moral content beyond what was available to the programmers, but
*if* so, it would be for reasons that made sense under the moral philosophy
handed down from the humans, just as our own reasons for defying evolution
are evolved ones. In essence this just says that an AI would possess the
same ability to "grow beyond" as a human upload.

> Given
> much greater knowledge and intelligence what we attribute to friendly
> behavior may end up looking quite different.
> Your definition of ethics is a good example. If an alien landed
> tomorrow and the first person it met was a fantastic salesman, the
> salesman may appear to be exceedingly friendly. When in fact their only
> goal is to open up a new trade route and they don't in fact care one
> iota about the alien, only the result! ;)

Which problem are we discussing here? The idea that a hostile AI could
deliberately lie in order to masquerade as Friendly? Or the assertion that
a Friendship programming team would wind up with a hostile AI that appears
as Friendly because the specification was ambiguous? These problems are
very different structurally!

If there may be more than one cause of Friendly-seeming behavior, that could
break a system that anchors only in immediate feedback about what is and
isn't Friendly in the real world - for example, a blind neural network
training algorithm (if generic nets could be trained on problems like that,
which they can't). However, a system that asks questions about imaginary
scenarios, or which receives information about imaginary scenarios, is
likely to quickly receive information that distinguishes between the two
models. And a system that asks questions *about* reasons for good behavior
can *directly* disambiguate between the two models.
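
One way to picture the disambiguation described above is as a Bayesian update
over two candidate models of what "Friendly" refers to. The sketch below is
only an illustration: the model names, scenario names, and probabilities are
all made-up assumptions, not anything specified in this thread. It shows that
a real-world case on which both models agree leaves the learner's beliefs
unchanged, while a single answer about an imaginary case on which they
disagree separates them.

```python
# Hypothetical illustration; every name and number here is an assumption.

# P(programmer labels the scenario "Friendly" | model, scenario)
LIKELIHOOD = {
    # Model A: "Friendly" refers to actually caring about the other party.
    "genuine_concern": {"helpful_in_public": 0.95, "pleasant_but_harmful": 0.05},
    # Model B: "Friendly" refers only to looking friendly.
    "appearance_only": {"helpful_in_public": 0.95, "pleasant_but_harmful": 0.95},
}


def update(prior, scenario, labeled_friendly):
    """Return the posterior over models after one labeled scenario (Bayes' rule)."""
    unnormalized = {}
    for model, p_model in prior.items():
        p_friendly = LIKELIHOOD[model][scenario]
        p_label = p_friendly if labeled_friendly else 1.0 - p_friendly
        unnormalized[model] = p_model * p_label
    total = sum(unnormalized.values())
    return {model: value / total for model, value in unnormalized.items()}


prior = {"genuine_concern": 0.5, "appearance_only": 0.5}

# Real-world feedback: both models predict the same label, so nothing is learned.
after_real = update(prior, "helpful_in_public", labeled_friendly=True)

# Imaginary scenario: the models disagree, so the programmer's answer is informative.
after_imaginary = update(after_real, "pleasant_but_harmful", labeled_friendly=False)

print(after_real)
print(after_imaginary)
```

Asking about the *reasons* for a label corresponds, in this toy picture, to
querying which model is in force directly rather than inferring it from labels.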

> We may *think* we are defining friendliness via external reference
> points but actually be defining only the appearance of friendliness or
> something similar. Thus the SI would only need to appear friendly to
> us, even while it was planning to turn the planet into computing resources.

That's why you discuss (anchor externally) the *reasons* for decisions, not
just the decision outputs. You aren't anchoring the final output of the
causal system, you're anchoring *all* the nodes in the system.

>> Furthermore, programmers are physical objects and the intentions of
>> programmers are real properties of those physical objects. "The
>> intention that was in the mind of the programmer when writing this
>> line of code" is a real, external referent; a human can understand it,
>> and an AI that models causal systems and other agents should be able
>> to understand it as well. Not just the image of Friendliness itself,
>> but the entire philosophical model underlying the goal system, can be
>> defined in terms of things that exist outside the AI and are subject
>> to discovery.
> A human can understand the words "The intention that was in the mind of
> the programmer when writing this line of code", but they could never
> fully UNDERSTAND it. This is why I think you need to have more real
> life experience, Eliezer. Those of us that are married can easily
> understand why the above is not possible. You can never FULLY
> understand what someone else intends by something.

You don't need perfect understanding. You just need approximate sensory
information that provides enough information to build an approximate model
that controls decisions well enough for a Friendly AI to get by at any given
point. When the FAI is infrahuman it can just *ask* whenever it's unsure
of something. When the FAI is transhuman it can unpack all the
references that didn't make sense earlier. So infrahuman understanding
should be enough to govern infrahuman decisions, and transhuman
understanding should be enough to govern transhuman decisions.

To solve the transhuman problem posed by transhuman AI, you have to figure
out how to use the AI's transhuman intelligence to solve the problem, not
rely on solving it yourself. That's Friendly AI.

> To use Eliezer's method, while I may not be correct I'm quite certain
> you are wrong. (Does that make me an honorary Friendship Programmer?)

No, that makes you a perfectionist by successive approximation. But asking
structural questions about metawish construction scores points. (Whee!

Eliezer S. Yudkowsky
Research Fellow, Singularity Institute for Artificial Intelligence
