From: Eliezer S. Yudkowsky (sentience@pobox.com)
Date: Mon Jun 17 2002 - 01:34:01 MDT
Anand AI wrote:
> 01. Does CFAI argue for a set of panhuman characteristics that comprise
> human moral cognition? If so, what characteristics do we have evidence for,
> and what characteristics of human moral cognition will be reproduced?
CFAI argues that there exists *some* set of panhuman characteristics, but 
does not argue for a *specific* set of panhuman characteristics.  The model 
of Friendliness learning is based on reasoning backward from observed 
specific humans to a systemic model of altruism which is grounded in 
panhuman characteristics (and, if necessary, social and memetic 
organizational processes).  In other words, the idea is not that *you*, the 
programmer, know how to build a model of altruism which is 
programmer-independent, but that you, the programmer, know how to build an 
AI which can arrive at such a model, given sufficient intelligence, and can 
rely on the interim approximation represented by the ethics of several 
specific programmers, given insufficient intelligence.
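To make that division of labor concrete, here is a toy sketch in Python - purely illustrative, with every name invented for this email and nothing corresponding to an actual CFAI component - of the intended fallback structure: the AI governs itself by its own inferred, programmer-independent model once it judges that inference adequate, and by the programmers' interim approximation until then.

def current_moral_model(inferred_panhuman_model, programmer_ethics,
                        confidence, threshold=0.9):
    """Toy fallback structure, not an actual CFAI component.

    inferred_panhuman_model -- model reasoned backward from observed humans
    programmer_ethics       -- interim approximation: the ethics of the
                               specific programmers
    confidence              -- the AI's own estimate that its inference
                               is adequate
    """
    if confidence >= threshold:       # "given sufficient intelligence"
        return inferred_panhuman_model
    return programmer_ethics          # "given insufficient intelligence"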
There often seems to be some confusion about the question to which "Creating 
Friendly AI" is intended as an answer.  In science fiction and the popular 
press, a question often raised is "How do we know our AIs won't turn on us 
and kill us?"  CFAI happens to provide some answers for this question, but 
that's not the question CFAI is intended to answer.  The question CFAI is 
intended to answer is:  "How can you make sure that it doesn't matter who 
the programmers are?"  Or to be more precise:  "Is there a strategy, of 
bounded complexity, which if followed arrives at the same optimal good AI 
regardless of who builds and teaches it?"  This general form contains within 
it some critical subproblems of the moral philosophy of AI creation:
1)  In building a seed AI, you may (or may not) be building something 
eternal - something that has a beginning, but not an end, and no end of 
consequences.  The AI itself might be eternal, or it might make a choice 
that has an eternal effect; the morality is the same.  If there's a 
sensitive dependency of the goodness of the outcome on small variances in 
the initial conditions, then *any* compromise from absolute perfection of 
the *programmers* represents an existential risk.  While I can imagine 
someone arguing that it might be a moral necessity to build something 
imperfect and eternal if the alternative is the complete extinction of 
humanity, I am currently more inclined to label any permanent compromise of 
humanity's potential absolutely unacceptable.  If you can't build something 
eternal and optimal, don't build something eternal.
2)  In building a seed AI, a small group of programmers are standing in as a 
proxy for humanity.  The Singularity is something that belongs to humanity 
and it would be - under my understanding of moral philosophy - deeply 
immoral to steal it.  This is the wellspring of a seed AI programmer's 
professional ethics.  One of the panhuman atoms from which individual moral 
philosophies are built is the act we call "empathy" or "sympathy"; taking 
another agent's viewpoint; putting yourself in someone else's shoes.  If 
there is a privileged correlation of the programmers' moral patterns with 
the AI's moral pattern, this means it is no longer possible for an external 
observer to put himself or herself in the shoes of an AI programmer, or vice 
versa.  If so, you cannot be standing in as a proxy for humanity; you are not 
building transhumanity; you are stealing it.  You have to pick a strategy 
such that you'd be comfortable with a different group of programmers 
following it, then follow that strategy yourself.  If there's a central 
optimum, you have to aim for the optimum.  If there's a space of optima with no central point (a question Rafal once raised), then the best you can do is probably to pick out a "typical" point in that space.  Think of the structural core of Friendliness as a (specific, fleshed-out) way of saying: "Be the AI that the best possible programmer would have built."  If there 
is no one "best programmer" under the definition, just a space of equal 
optima, then you have to be comfortable saying to the AI:  "Be the AI that a 
typical 'optimal' programmer would have built, without reference to where I 
myself happen to be relative to that space."  Because from the perspective 
of an external observer, that's what you're asking them to accept.
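If it helps, here is a minimal, purely illustrative sketch - assuming, unrealistically, that the optima could simply be enumerated - of what "pick a typical point without reference to your own location" might look like.  The point of the rule is that it depends only on the space of optima, so any other team following it faces exactly the same choice:

import random

def pick_typical_optimum(optima, seed=0):
    """Choose a 'typical' member of a space of equal optima (toy sketch).

    The programmers' own location in that space never enters into the
    selection, so the rule is the same no matter who applies it.
    """
    rng = random.Random(seed)            # fixed seed: team-independent
    return rng.choice(sorted(optima))    # sorted() so the result depends
                                         # only on the set, not its order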
> 02. Why is volition-based Friendliness the assumed model of Friendliness
> content? What will it and what will it not constitute and allow? If the
> model is entirely incorrect, how is this predicted to affect the AI's
> architecture?
Volition-based Friendliness is the best model of morality Yudkowsky-2001 
could come up with and is still current as of Yudkowsky-2002.  As for "What 
will it and what will it not constitute and allow?", I would suggest asking 
specific questions and looking over the specific answers to see what pattern 
is present, since this is how volition-based Friendliness would be passed 
along to a Friendly AI.
Remember, however, that by the Law of Programmer Symmetry - if I may call it 
such - volition-based Friendliness is not the problem.  The problem is 
coming up with a strategy such that if some other programming team follows 
it, their AI will eventually arrive at volition-based Friendliness [or 
something better] regardless of what their programmers started out 
believing.  And to do that you have to pass along to the AI an understanding 
of how people argue about morality, in a semantics rich enough to represent 
all the structural properties thereof.
In terms of the actual moral philosophy behind volition-based morality - 
well, let me throw this over to the next question:
> 03. What alternatives to volition-based Friendliness have been considered,
> and why were they not chosen?
Why volition-based morality?  Well, previously, I had a more informal model 
of a concrete morality based on an appreciation of life, truth, and joy. 
(Incidentally, I'm sorry if I start sounding unbearably goody two-shoes 
during any of this, but Anand has asked a direct question and a straight 
answer takes precedence over the usual social rules about self-deprecation.)  The question, as I see it, is whether appreciating life, truth, and joy is a universal or something about which individuals may legitimately disagree.  Absent an objective morality (Friendliness can handle this too, BTW) it 
seems to me that it is something about which individuals may legitimately 
disagree, and that if someone says "I want to die," their opinion on this 
overrides what I see as the value of life.  The shift in basic values might 
be described as seeing *freedom* as the central good, with the goodness of 
life, truth, and joy being special cases of my freedom to value these things.
The next moral question is whether a Friendly AI should value *only* freedom 
or whether a Friendly AI should also value life, truth, and joy.  The 
structural power of Friendly AI means that the programmers don't necessarily 
have to answer this question *correctly* - but Friendly AI *does* require 
that the programmers do their best to answer the question as such, so 
that the AI gets a chance to see what kind of cognitive forces are involved 
in producing a *concrete* moral answer and not just the meta-moral answer of 
"Let an SI figure it out."  So what's the best answer I can come up with? 
Currently I'm leaning slightly away from the "pure" volition-based 
Friendliness expressed in CFAI and toward a Friendly AI that respects 
freedom but also has its own conception of a moral good.  The Law of 
Programmer Symmetry says that I should only do this if I'd be comfortable 
with someone else using the same strategy to create an FAI that respected my 
freedom but also had morals whose content might differ from my own, and that 
the FAI can't actually get the morals directly from me using this method.
Two possibilities, failing objective morality or a unique attractor, are (a) that extra-volitional morality is determined by majority vote of the extra-volitional moralities of everyone involved with a Friendly AI playing a given social role, or (b) - the first alternative may not be self-consistent - that the FAI has a "typical" personal morality selected 
from a space of optimal moralities that has no central point.  Currently I 
am leaning toward the second alternative on the grounds that a Friendly AI 
should be a human-equivalent philosopher; I'm not sure that going along with 
a majority vote - as the ultimate cognitive grounding of morality, rather 
than because you respect majority votes - is cognitively the same thing as 
having your own morality.  (This is also about achieving "upload 
equivalence"; the system embodied by a Friendly seed AI has to be at least 
as good, heading into the Singularity, as the system embodied by any upload 
or social structure of uploads.  An upload would have a growable personal 
morality that was "owned" by the upload and not borrowed from someone else.)
A Friendly AI embodies a system that produces morality.  The CFAI semantics 
are supposed to be expansive enough to create an AI that learns, embodies, 
and improves-under-its-own-rules *any* system that produces morality - an 
individual programmer, a group of programmers, or a planet that evolves an 
ecology in which evolves a species that passes around memes that are 
eventually picked up by a programmer who sets out to create an AI.  You can 
dig back as far into the past light cone as seems philosophically necessary 
- given enough intelligence to infer the events of interest, which doesn't 
currently seem to require unreasonable intelligence (the events of interest 
take place on a tractably high level of abstraction; it doesn't require 
knowing the past position of individual atoms or anything of the sort).
So from this perspective, the question is whether the AI is *modeling* a 
moral system or *being* a moral system.  Should a mature FAI, dealing with a 
question of truth or falsity, model the factual answer a human would arrive 
at and then model the moral judgement a human would make based on that 
factual answer?  Or should a mature FAI actually use its *own* model of the 
world to arrive at the best factual answer the *FAI* knows of, wherever a 
judgement of desirability relies on a question of fact?  Once you step into 
the second territory, the AI is starting to cross the line from *modeling* a 
moral system into *being* a moral entity.  I think this is what we want and 
it's why I keep using phrases like "building an independent moral 
philosopher".  But it also looks to me like this may imply that to build a 
self-consistent moral philosophy, it has to be an individual philosophy - it 
can't be the borrowed (modeled) moral philosophy of a group.  If I value 
life, truth, joy, and freedom and choose to make freedom primary, then it 
may not be possible to self-consistently build, within the human frame of 
reference, an FAI that values freedom *only*.  You have to convey the 
process that arrives at the decision, not the decision itself, and the 
question is which we should identify the AI's mind with.
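As a sketch of the distinction - an invented interface, nothing more - the two routes differ only in where the factual sub-answers come from:

def judge_desirability(action, own_world_model, model_of_human):
    """Contrast 'modeling a moral system' with 'being' one (toy interface).

    Route 1 models the human all the way down: predict the human's
    factual beliefs, then the moral judgement the human would make from
    them.  Route 2 plugs the FAI's own best factual answers into the
    learned moral criteria wherever desirability depends on a question
    of fact.
    """
    human_facts = model_of_human.predict_factual_beliefs(action)
    modeled_judgement = model_of_human.predict_moral_judgement(human_facts)

    own_facts = own_world_model.best_factual_answers(action)
    owned_judgement = model_of_human.moral_criteria(own_facts)

    return modeled_judgement, owned_judgement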
> 04. How will the AI know and decide what constitutes "normativeness"?
According to the above, in the case of questions of fact, "normative" is the 
*AI's* best model of the facts themselves, not the AI's best judgement of 
what an idealized human would think about the facts.  In this case there is 
no doubt that the AI "owns" this portion of the moral philosophy and is not 
borrowing it.  Some other things may need to start out as being borrowed, 
though, because without a good model the AI can't make the final decision as 
to which parts of the model to identify with.  What would be borrowed in 
this way, and finally - when the AI was sure - absorbed?  What the AI starts 
out by borrowing is the final, surface decisions of the programmers, which 
can be learned experientially, and moreover can be filled in, if blanks are 
encountered, by directly asking the programmers.  But this doesn't define 
what's normative, and it isn't something that an FAI would finally identify 
with and absorb as its own - as a final output, maybe, if the programmers 
are *right*, but not as the actual system of moral philosophy.
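The borrowed layer itself is structurally simple; a toy rendering (hypothetical names, nothing from the actual architecture) might be little more than a cache of observed judgements with a channel for filling in blanks:

class BorrowedSurfaceDecisions:
    """Toy stand-in for the interim layer of borrowed judgements.

    Judgements learned experientially from the programmers are cached;
    when a blank is encountered, the programmers are asked directly.
    Nothing here is 'owned' by the AI - it is raw material to reason
    backward from, not the moral philosophy itself.
    """
    def __init__(self, ask_programmers):
        self.learned = {}                     # experientially learned
        self.ask_programmers = ask_programmers

    def judgement(self, situation):
        if situation not in self.learned:     # a blank
            self.learned[situation] = self.ask_programmers(situation)
        return self.learned[situation]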
So the FAI starts digging into the programmers' past light cones to arrive 
at a model of where the morality given it came from.  At first what the FAI 
ends up with is just a model of the programmers' thoughts - the proximal 
causes of the programmers' statements.  Most of this will also be individual 
material, and hence not something an FAI could absorb.  But some of the 
elements that play a role in the production of moral thoughts may be 
emotions, panhuman chunks of brainware.  Let's temporarily suppose that an 
emotion is something which, in a certain context, recognizes certain kinds 
of thoughts, binds to those thoughts, and shades those thoughts in certain 
ways that may make them directly joyful or sorrowful, prideful or shameful, 
or the other various kinds of subjective negative and positive feedback 
that, in various ways, uplift certain thoughts and cast others down.  Within 
a certain context, the thought of helping someone else - altruism - is 
joyful.  If the FAI is modeling an FAI programmer, the matter is probably 
more complicated than that because the programmer may be aware of the 
emotion and using it deliberately, or the programmer may have been 
influenced by this emotion in childhood to choose a moral philosophy of 
altruism in which altruism is *not* dependent on the contextual conditions 
that are necessary to activate the emotion.
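Under that temporary supposition, an emotion could be caricatured like this (a caricature only - the real brainware is nothing so tidy, and the 'context' here is just a dictionary):

class Emotion:
    """Toy model of an emotion as a panhuman chunk of brainware: within
    a certain context it recognizes certain thoughts, binds to them, and
    shades them with positive or negative feedback of some strength."""

    def __init__(self, context_test, thought_test, valence, strength):
        self.context_test = context_test  # does this context activate it?
        self.thought_test = thought_test  # does it bind to this thought?
        self.valence = valence            # +1 joyful, -1 sorrowful, etc.
        self.strength = strength          # the individually variable part

    def shade(self, thought, context):
        """Return the feedback this emotion adds to a thought, if any."""
        if self.context_test(context) and self.thought_test(thought):
            return self.valence * self.strength
        return 0.0

# e.g. an 'altruism' emotion that makes the thought of helping someone
# joyful, but only within its activating context:
altruism = Emotion(
    context_test=lambda ctx: ctx.get("someone_needs_help", False),
    thought_test=lambda thought: "helping" in thought,
    valence=+1.0,
    strength=0.7,    # an individual setting; the machinery is panhuman
)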
The point is that, at this point in the construction of the model, there is 
an 'atom' of morality that is not unique to the programmer.  The atom is not 
a moral judgement.  The programmer's moral judgements are made by very 
complex 'molecules' that bear the individual signature of the programmer, 
but one of the 'atoms' happens to be panhuman.  Maybe the strength of the 
atom is an individual variable, probably distributed along a Gaussian curve 
as most quantitative individual variables are, in which case you might 
either substitute an 'average' strength for the altruism emotion, or arrive 
at the judgement that 'more altruism is better' and select a value for the 
altruism emotion from the far right of the curve, or off the curve, or a 
'maximal' value if that works.  This doesn't necessarily mean the FAI has 
'absorbed' the emotion - just that the FAI is trying to model the production 
of altruism through a causal system in which this emotion is an element, and 
testing out what happens if you replace the programmer's settings for this 
emotion with "typical" or "maximal" settings.  If the FAI replaces the 
programmer's factory settings for the altruism emotion with "maximal" 
settings, and the end result is recognized by that programmer or by the 
other programmers as a stronger and more altruistic philosophy, then the FAI 
may decide to take a tentative step backward from the experientially learned 
final outputs of the individual programmers and say that the programmers' 
statements are moral because they are altruistic, rather than altruism being 
interesting because it plays a role in the programmers' statements.  This is 
how an FAI would start to work its way back from the programmers to 
humanity.  It is also how the programmers would begin to learn to trust the 
FAI's moral judgement over their own.  It might or might not be possible to 
take further steps backward into the past light cone; to ask whether 
altruism itself has "evolved wrong" or could have evolved better, for example.
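That substitution test might be caricatured as follows - every method name here is invented for the sketch; the real inference would run over a learned causal model, not a neat API:

def test_atom_substitution(model_of_programmer, atom_name, new_strength,
                           recognizers):
    """Replace one panhuman 'atom' in the causal model of a programmer
    with a 'typical' or 'maximal' setting, regenerate the moral outputs,
    and see whether the programmers themselves recognize the result as a
    stronger, more altruistic philosophy than the original."""
    original = model_of_programmer.moral_outputs()

    modified = model_of_programmer.with_atom_setting(atom_name, new_strength)
    counterfactual = modified.moral_outputs()

    endorsed = all(r.judges_more_altruistic(counterfactual, original)
                   for r in recognizers)
    return counterfactual, endorsed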
So the question is, how does an FAI decide when to take a step backward in 
the chain of causality?  Under causal validity semantics, when do you move 
back the acausal level another notch?  At first, "normativeness" in this 
case might be operationally defined as "what the programmers say is 
normative", but this is also something where the system that produces the 
programmers' judgements can be deduced by examining those judgements.  At 
some point, the programmers acknowledge that the AI's judgement of what is 
"normative" is better than the programmers' judgement of what is normative. 
  At the point where the AI's judgement of normativeness, and the AI's 
judgement of Friendliness system architecture, and the AI's judgement of 
morality, all appear to the programmers to be of transhuman competency, it 
would be time (perhaps past time) to "launch" the AI.  At this point the AI 
might not have finished implementing the Law of Programmer Symmetry - 
fulfilling the wish "Be the best AI that we could possibly have designed" - 
but you would have to rely on the AI to decide how to ground itself in a 
programmer-independent way.
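One toy decision rule for "when to move the grounding back a notch" - again a sketch, with an invented score() interface - is to adopt the more distal model only when it accounts for the programmers' observed judgements at least as well as the current grounding, and the programmers themselves endorse the shift:

def should_step_backward(current_model, deeper_model, observed_judgements,
                         programmers_endorse):
    """Move the grounding one notch back along the chain of causality
    only if the deeper (more distal) model explains the programmers'
    observed judgements at least as well as the current one, and the
    programmers endorse the shift.  The score() interface is invented
    for this sketch."""
    fits_as_well = (deeper_model.score(observed_judgements)
                    >= current_model.score(observed_judgements))
    return fits_as_well and programmers_endorse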
Much of the thinking I have been describing so far is thinking that could be 
described to a comparatively young AI, but which it would take a very mature 
intelligence to implement.  If the critical flashpoint of a seed AI is at substantially infrahuman intelligence, there would be either the option of cooperative ascent so that the AI can actually talk to the programmers and 
check the programmers' judgement, or the option of trying to describe the 
structural properties of the entire Friendliness development scenario above, 
to a young AI, in sufficient detail that the AI could grow to transhuman 
intelligence with much of the 'target' of Friendliness still undefined, then 
use that transhuman intelligence to simulate a typical Friendliness 
development scenario and thereby define the target.  Both of these scenarios 
have certain risks.  I think that perhaps the critical pragmatic challenge 
of Friendly AI will be creating the "definition of the definition of the 
definition" of Friendliness in such a way that a very young AI can not only 
be given the definition, but that the young AI can actually *practice* 
"filling out definitions of definitions of definitions", so that you can see 
whether the AI might be able to fill out the definition of the definition of 
the definition of Friendliness - to what extent one would need a cooperative 
ascent, or alternatively be able to go directly into the "throw" and "catch" 
of a seed AI racing full speed ahead with an incompletely filled-out but 
structurally complete model of Friendliness.
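To show only the structure of that nesting (the content below is deliberately trivial, and nothing here is meant as the actual representation), each level is just a procedure that, given more information, fills out the level below it:

# A 'definition' here is just a predicate over actions; a definition of
# a definition is a procedure that, given more information, produces
# such a predicate; and so on one level further up.  Only the structure
# matters - the content is a toy.

def make_definition(labeled_examples):
    """Level 1 filled out from level 2: induce a predicate from examples."""
    positives = {case for case, label in labeled_examples if label}
    return lambda action: action in positives        # toy 'induction'

def make_definition_maker(gather_examples):
    """Level 2 filled out from level 3: given a policy for gathering
    examples, produce the procedure that will later produce the predicate."""
    def definition_maker(situation):
        return make_definition(gather_examples(situation))
    return definition_maker

# A very young AI could practice this loop on toy concepts long before
# the concept being filled out is Friendliness itself.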
I hope at least part of this email was not total gibberish.
-- Eliezer S. Yudkowsky http://intelligence.org/ Research Fellow, Singularity Institute for Artificial Intelligence