From: Eliezer S. Yudkowsky
Date: Mon Jun 17 2002 - 01:34:01 MDT

Anand AI wrote:
> 01. Does CFAI argue for a set of panhuman characteristics that comprise
> human moral cognition? If so, what characteristics do we have evidence for,
> and what characteristics of human moral cognition will be reproduced?

CFAI argues that there exists *some* set of panhuman characteristics, but
does not argue for a *specific* set of panhuman characteristics. The model
of Friendliness learning is based on reasoning backward from observed
specific humans to a systemic model of altruism which is grounded in
panhuman characteristics (and, if necessary, social and memetic
organizational processes). In other words, the idea is not that *you*, the
programmer, know how to build a model of altruism which is
programmer-independent, but that you, the programmer, know how to build an
AI which can arrive at such a model, given sufficient intelligence, and can
rely on the interim approximation represented by the ethics of several
specific programmers, given insufficient intelligence.

There often seems to be some confusion about the question to which "Creating
Friendly AI" is intended as an answer. In science fiction and the popular
press, a question often raised is "How do we know our AIs won't turn on us
and kill us?" CFAI happens to provide some answers for this question, but
that's not the question CFAI is intended to answer. The question CFAI is
intended to answer is: "How can you make sure that it doesn't matter who
the programmers are?" Or to be more precise: "Is there a strategy, of
bounded complexity, which if followed arrives at the same optimal good AI
regardless of who builds and teaches it?" This general form contains within
it some critical subproblems of the moral philosophy of AI creation:

1) In building a seed AI, you may (or may not) be building something
eternal - something that has a beginning, but not an end, and no end of
consequences. The AI itself might be eternal, or it might make a choice
that has an eternal effect; the morality is the same. If there's a
sensitive dependency of the goodness of the outcome on small variances in
the initial conditions, then *any* compromise from absolute perfection of
the *programmers* represents an existential risk. While I can imagine
someone arguing that it might be a moral necessity to build something
imperfect and eternal if the alternative is the complete extinction of
humanity, I am currently more inclined to label any permanent compromise of
humanity's potential absolutely unacceptable. If you can't build something
eternal and optimal, don't build something eternal.

2) In building a seed AI, a small group of programmers is standing in as a
proxy for humanity. The Singularity is something that belongs to humanity
and it would be - under my understanding of moral philosophy - deeply
immoral to steal it. This is the wellspring of a seed AI programmer's
professional ethics. One of the panhuman atoms from which individual moral
philosophies are built is the act we call "empathy" or "sympathy"; taking
another agent's viewpoint; putting yourself in someone else's shoes. If
there is a privileged correlation of the programmers' moral patterns with
the AI's moral pattern, this means it is no longer possible for an external
observer to put himself or herself in the shoes of an AI programmer, or vice
versa. If so, you cannot be standing in as a proxy for humanity; you are not
building transhumanity; you are stealing it. You have to pick a strategy
such that you'd be comfortable with a different group of programmers
following it, then follow that strategy yourself. If there's a central
optimum, you have to aim for the optimum. If there's a space of optima with
no central point (a question Rafal once raised), then the best you can do is
probably to pick out a "typical" point in that space. Think of the
structural core of Friendliness as a (specific, fleshed-out) way of saying:
"Be the AI that the best possible programmer would have built." If there
is no one "best programmer" under the definition, just a space of equal
optima, then you have to be comfortable saying to the AI: "Be the AI that a
typical 'optimal' programmer would have built, without reference to where I
myself happen to be relative to that space." Because from the perspective
of an external observer, that's what you're asking them to accept.
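The "typical point" idea can be given a toy sketch. Suppose, purely for illustration, that the equally optimal candidates can be represented as points with a distance measure between them (the function names here are hypothetical, not anything from CFAI); then a "typical" member is the medoid, the candidate closest on average to all the others:

```python
def pick_typical_optimum(optima, distance):
    """Given a set of equally optimal points with no unique central
    optimum, return the medoid: the member whose total distance to
    all other members is smallest, i.e. the most 'typical' member."""
    return min(optima, key=lambda p: sum(distance(p, q) for q in optima))

# Toy example: points on a line with two uneven clusters.  No single
# central optimum lies *in* the set, so we settle for a typical member.
optima = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = lambda a, b: abs(a - b)
print(pick_typical_optimum(optima, dist))
```

The point of the sketch is only that "typical" can be defined without reference to where any particular programmer sits relative to the space.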

> 02. Why is volition-based Friendliness the assumed model of Friendliness
> content? What will it and what will it not constitute and allow? If the
> model is entirely incorrect, how is this predicted to affect the AI's
> architecture?

Volition-based Friendliness is the best model of morality Yudkowsky-2001
could come up with and is still current as of Yudkowsky-2002. As for "What
will it and what will it not constitute and allow?", I would suggest asking
specific questions and looking over the specific answers to see what pattern
is present, since this is how volition-based Friendliness would be passed
along to a Friendly AI.

Remember, however, that by the Law of Programmer Symmetry - if I may call it
such - volition-based Friendliness is not the problem. The problem is
coming up with a strategy such that if some other programming team follows
it, their AI will eventually arrive at volition-based Friendliness [or
something better] regardless of what their programmers started out
believing. And to do that you have to pass along to the AI an understanding
of how people argue about morality, in a semantics rich enough to represent
all the structural properties thereof.

In terms of the actual moral philosophy behind volition-based morality -
well, let me throw this over to the next question:

> 03. What alternatives to volition-based Friendliness have been considered,
> and why were they not chosen?

Why volition-based morality? Well, previously, I had a more informal model
of a concrete morality based on an appreciation of life, truth, and joy.
(Incidentally, I'm sorry if I start sounding unbearably goody two-shoes
during any of this, but Anand has asked a direct question and a straight
answer takes precedence over the usual social rules about self-deprecation.)
The question, as I see it, is whether appreciating life, truth, and joy is
a universal or something about which individuals may legitimately disagree.
Absent an objective morality (Friendliness can handle this too, BTW) it
seems to me that it is something about which individuals may legitimately
disagree, and that if someone says "I want to die," their opinion on this
overrides what I see as the value of life. The shift in basic values might
be described as seeing *freedom* as the central good, with the goodness of
life, truth, and joy being special cases of my freedom to value these things.

The next moral question is whether a Friendly AI should value *only* freedom
or whether a Friendly AI should also value life, truth, and joy. The
structural power of Friendly AI means that the programmers don't necessarily
have to answer this question *correctly* - but Friendly AI *does* require
that the programmers do their best job to answer the question as such, so
that the AI gets a chance to see what kind of cognitive forces are involved
in producing a *concrete* moral answer and not just the meta-moral answer of
"Let an SI figure it out." So what's the best answer I can come up with?
Currently I'm leaning slightly away from the "pure" volition-based
Friendliness expressed in CFAI and toward a Friendly AI that respects
freedom but also has its own conception of a moral good. The Law of
Programmer Symmetry says that I should only do this if I'd be comfortable
with someone else using the same strategy to create a FAI that respected my
freedom but also had morals whose content might differ from my own, and that
the FAI can't actually get the morals directly from me using this method.

Two possibilities, failing objective morality or a unique attractor, are (a)
that extra-volitional morality is determined by majority vote of the
extra-volitional moralities of everyone involved with a Friendly AI playing
a given social role, or (b) - since the first alternative may not be
self-consistent - that the FAI has a "typical" personal morality selected
from a space of optimal moralities that has no central point. Currently I
am leaning toward the second alternative on the grounds that a Friendly AI
should be a human-equivalent philosopher; I'm not sure that going along with
a majority vote - as the ultimate cognitive grounding of morality, rather
than because you respect majority votes - is cognitively the same thing as
having your own morality. (This is also about achieving "upload
equivalence"; the system embodied by a Friendly seed AI has to be at least
as good, heading into the Singularity, as the system embodied by any upload
or social structure of uploads. An upload would have a growable personal
morality that was "owned" by the upload and not borrowed from someone else.)

A Friendly AI embodies a system that produces morality. The CFAI semantics
are supposed to be expansive enough to create an AI that learns, embodies,
and improves-under-its-own-rules *any* system that produces morality - an
individual programmer, a group of programmers, or a planet that evolves an
ecology in which evolves a species that passes around memes that are
eventually picked up by a programmer who sets out to create an AI. You can
dig back as far into the past light cone as seems philosophically necessary
- given enough intelligence to infer the events of interest, which doesn't
currently seem to require unreasonable intelligence (the events of interest
take place on a tractably high level of abstraction; it doesn't require
knowing the past position of individual atoms or anything of the sort).

So from this perspective, the question is whether the AI is *modeling* a
moral system or *being* a moral system. Should a mature FAI, dealing with a
question of truth or falsity, model the factual answer a human would arrive
at and then model the moral judgement a human would make based on that
factual answer? Or should a mature FAI actually use its *own* model of the
world to arrive at the best factual answer the *FAI* knows of, wherever a
judgement of desirability relies on a question of fact? Once you step into
the second territory, the AI is starting to cross the line from *modeling* a
moral system into *being* a moral entity. I think this is what we want and
it's why I keep using phrases like "building an independent moral
philosopher". But it also looks to me like this may imply that to build a
self-consistent moral philosophy, it has to be an individual philosophy - it
can't be the borrowed (modeled) moral philosophy of a group. If I value
life, truth, joy, and freedom and choose to make freedom primary, then it
may not be possible to self-consistently build, within the human frame of
reference, an FAI that values freedom *only*. You have to convey the
process that arrives at the decision, not the decision itself, and the
question is which of the two the AI's mind should be identified with.

> 04. How will the AI know and decide what constitutes "normativeness"?

According to the above, in the case of questions of fact, "normative" is the
*AI's* best model of the facts themselves, not the AI's best judgement of
what an idealized human would think about the facts. In this case there is
no doubt that the AI "owns" this portion of the moral philosophy and is not
borrowing it. Some other things may need to start out as being borrowed,
though, because without a good model the AI can't make the final decision as
to which parts of the model to identify with. What would be borrowed in
this way, and finally - when the AI was sure - absorbed? What the AI starts
out by borrowing is the final, surface decisions of the programmers, which
can be learned experientially, and moreover can be filled in, if blanks are
encountered, by directly asking the programmers. But this doesn't define
what's normative, and it isn't something that an FAI would finally identify
with and absorb as its own - as a final output, maybe, if the programmers
are *right*, but not as the actual system of moral philosophy.
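The "borrowing" mechanism above can be caricatured in a few lines of code (everything here - class name, the moral questions, the fallback - is hypothetical, a sketch of the control flow only): answer from experientially learned surface decisions, and when a blank is encountered, fill it in by asking the programmers directly.

```python
class BorrowedMorality:
    """Toy sketch of 'borrowing' surface decisions: answer from
    experientially learned cases, falling back to a direct query to
    the programmers when a blank is encountered."""

    def __init__(self, ask_programmers):
        self.learned = {}                    # question -> borrowed decision
        self.ask_programmers = ask_programmers

    def judge(self, question):
        if question not in self.learned:
            # Blank encountered: fill it in by asking the programmers.
            self.learned[question] = self.ask_programmers(question)
        return self.learned[question]

# Usage: the fallback lambda stands in for a live programmer query.
fai = BorrowedMorality(ask_programmers=lambda q: "ask-again-later")
fai.learned["is deception acceptable?"] = "no"
print(fai.judge("is deception acceptable?"))   # experientially learned
print(fai.judge("novel dilemma"))              # blank -> fallback
```

Note that nothing in this lookup table is the *system* of moral philosophy; it is exactly the kind of surface output the email says an FAI would not finally identify with.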

So the FAI starts digging into the programmers' past light cones to arrive
at a model of where the morality given it came from. At first what the FAI
ends up with is just a model of the programmers' thoughts - the proximal
causes of the programmers' statements. Most of this will also be individual
material, and hence not something an FAI could absorb. But some of the
elements that play a role in the production of moral thoughts may be
emotions, panhuman chunks of brainware. Let's temporarily suppose that an
emotion is something which, in a certain context, recognizes certain kinds
of thoughts, binds to those thoughts, and shades those thoughts in certain
ways that may make them directly joyful or sorrowful, prideful or shameful,
or the other various kinds of subjective negative and positive feedback
that, in various ways, uplift certain thoughts and cast others down. Within
a certain context, the thought of helping someone else - altruism - is
joyful. If the FAI is modeling an FAI programmer, the matter is probably
more complicated than that because the programmer may be aware of the
emotion and using it deliberately, or the programmer may have been
influenced by this emotion in childhood to choose a moral philosophy of
altruism in which altruism is *not* dependent on the contextual conditions
that are necessary to activate the emotion.

The point is that, at this point in the construction of the model, there is
an 'atom' of morality that is not unique to the programmer. The atom is not
a moral judgement. The programmer's moral judgements are made by very
complex 'molecules' that bear the individual signature of the programmer,
but one of the 'atoms' happens to be panhuman. Maybe the strength of the
atom is an individual variable, probably distributed along a gaussian curve
as most quantitative individual variables are, in which case you might
either substitute an 'average' strength for the altruism emotion, or arrive
at the judgement that 'more altruism is better' and select a value for the
altruism emotion from the far right of the curve, or off the curve, or a
'maximal' value if that works. This doesn't necessarily mean the FAI has
'absorbed' the emotion - just that the FAI is trying to model the production
of altruism through a causal system in which this emotion is an element, and
testing out what happens if you replace the programmer's settings for this
emotion with "typical" or "maximal" settings. If the FAI replaces the
programmer's factory settings for the altruism emotion with "maximal"
settings, and the end result is recognized by that programmer or by the
other programmers as a stronger and more altruistic philosophy, then the FAI
may decide to take a tentative step backward from the experientially learned
final outputs of the individual programmers and say that the programmers'
statements are moral because they are altruistic, rather than altruism being
interesting because it plays a role in the programmers' statements. This is
how an FAI would start to work its way back from the programmers to
humanity. It is also how the programmers would begin to learn to trust the
FAI's moral judgement over their own. It might or might not be possible to
take further steps backward into the past light cone; to ask whether
altruism itself has "evolved wrong" or could have evolved better, for example.
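The substitution experiment described above can be sketched as follows. Assume, purely for illustration, that the causal system producing a moral statement reduces to one panhuman 'atom' (the emotion's strength, roughly gaussian across the population) plus an individual 'molecule'-level bias; the function and the numbers are hypothetical stand-ins.

```python
import statistics

def altruism_output(emotion_strength, individual_bias):
    """Hypothetical stand-in for the causal system producing moral
    statements: a panhuman emotion 'atom' of some strength, combined
    with material bearing the programmer's individual signature."""
    return emotion_strength + individual_bias

# The atom's strength varies across individuals, roughly gaussian.
population_strengths = [0.8, 1.0, 1.1, 0.9, 1.2]
average = statistics.mean(population_strengths)
maximal = max(population_strengths)

programmer_bias = 0.3   # individual signature, not absorbed by the FAI

# Replace the programmer's factory setting for the emotion with
# "typical" or "maximal" settings and compare the resulting outputs.
print(altruism_output(average, programmer_bias))
print(altruism_output(maximal, programmer_bias))
```

The comparison of the two outputs against the programmers' recognition of "a stronger and more altruistic philosophy" is the step the code cannot capture; the sketch only shows the parameter substitution itself.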

So the question is, how does an FAI decide when to take a step backward in
the chain of causality? Under causal validity semantics, when do you move
back the acausal level another notch? At first, "normativeness" in this
case might be operationally defined as "what the programmers say is
normative", but this is also something where the system that produces the
programmers' judgements can be deduced by examining those judgements. At
some point, the programmers acknowledge that the AI's judgement of what is
"normative" is better than the programmers' judgement of what is normative.
At the point where the AI's judgement of normativeness, and the AI's
judgement of Friendliness system architecture, and the AI's judgement of
morality, all appear to the programmers to be of transhuman competency, it
would be time (perhaps past time) to "launch" the AI. At this point the AI
might not have finished implementing the Law of Programmer Symmetry -
fulfilling the wish "Be the best AI that we could possibly have designed" -
but you would have to rely on the AI to decide how to ground itself in a
programmer-independent way.

Much of the thinking I have been describing so far is thinking that could be
described to a comparatively young AI, but which it would take a very mature
intelligence to implement. If the critical flashpoint of a seed AI is
substantially infrahuman intelligence, there would be either the option of
cooperative ascent so that the AI can actually talk to the programmers and
check the programmers' judgement, or the option of trying to describe the
structural properties of the entire Friendliness development scenario above,
to a young AI, in sufficient detail that the AI could grow to transhuman
intelligence with much of the 'target' of Friendliness still undefined, then
use that transhuman intelligence to simulate a typical Friendliness
development scenario and thereby define the target. Both of these scenarios
have certain risks. I think that perhaps the critical pragmatic challenge
of Friendly AI will be creating the "definition of the definition of the
definition" of Friendliness in such a way that a very young AI can not only
be given the definition, but that the young AI can actually *practice*
"filling out definitions of definitions of definitions", so that you can see
whether the AI might be able to fill out the definition of the definition of
the definition of Friendliness - to what extent one would need a cooperative
ascent, or alternatively be able to go directly into the "throw" and "catch"
of a seed AI racing full speed ahead with an incompletely filled-out but
structurally complete model of Friendliness.

I hope at least part of this email was not total gibberish.

Eliezer S. Yudkowsky                
Research Fellow, Singularity Institute for Artificial Intelligence
