From: Eliezer S. Yudkowsky (sentience@pobox.com)
Date: Sun Jun 23 2002 - 16:57:55 MDT
Anand wrote:
 > Eliezer Yudkowsky wrote:
 >
 >>Remember, however, that by the Law of Programmer Symmetry - if I may call
 >>it such - volition-based Friendliness is not the problem. The problem is
 >>coming up with a strategy such that if some other programming team follows
 >>it, their AI will eventually arrive at volition-based Friendliness [or
 >>something better] regardless of what their programmers started out
 >>believing.  And to do that you have to pass along to the AI an
 >>understanding of how people argue about morality, in a semantics rich
 >>enough to represent all the structural properties thereof.
 >
 >
 > "The problem is coming up..." What knowledge do you, or what understanding
 > do we, presently lack to appropriately solve the specified problem?
CFAI was developed specifically as a solution to this problem.  An AI 
developed using CFAI structure and appropriate content should understand the 
metawish of "Be the best AI we or any other programming team could have made 
you to be", in accordance with the full intent of that wish.  See also 
question #2 below.
 >>Anand wrote:
 >>
 >>
 >>>01. Does CFAI argue for a set of panhuman characteristics that comprise
 >>>human moral cognition? If so, what characteristics do we have evidence
 >>>for, and what characteristics of human moral cognition will be
 >>>reproduced?
 >>
 >>CFAI argues that there exists *some* set of panhuman characteristics, but
 >>does not argue for a *specific* set of panhuman characteristics. The model
 >>of Friendliness learning is based on reasoning backward from observed
 >>specific humans to a systemic model of altruism which is grounded in
 >>panhuman characteristics (and, if necessary, social and memetic
 >>organizational processes). In other words, the idea is not that *you*, the
 >>programmer, know how to build a model of altruism which is
 >>programmer-independent, but that you, the programmer, know how to
 >>build an AI which can arrive at such a model, given sufficient
 >>intelligence, and can rely on the interim approximation represented by
 >>the ethics of several specific programmers, given insufficient
 >>intelligence.
 >
 > Thank you for the response, but what evidence does cognitive science have
 > for panhuman characteristics that comprise moral cognition?  If little or
 > zero evidence presently exists, then why have you chosen to argue in CFAI
 > "that there exists *some* set of panhuman characteristics"?
Have you read "The Psychological Foundations of Culture" in "The Adapted 
Mind" by Tooby and Cosmides?  It sets forth the reason why complex 
functional adaptations are almost always panspecies.
Incidentally, Friendly AI semantics can operate on foundational causes other 
than complex functional adaptations - anything in the past light cone can 
potentially be acquired.
 > After reading your response, the following three questions also came to
 > mind:
 >
 > 01.  What do you believe are the key dependencies of the theory of
 > Friendliness, and how may these dependencies become invalidated?
Well... hm.  I wasn't setting out to create a theory with key dependencies. 
  Quite the opposite.  I was setting out to create a strategy that would 
*always* work, despite anything I could imagine throwing at it.  Sure, you 
can always break any physically realizable cognitive system with enough 
simultaneous coordinated errors, but there's no point in making it easy.
This pragmatic example is probably long overdue, and should almost certainly 
have been included in the original CFAI... but *this* is what you're 
supposed to be able to do with a FAI that has causal validity semantics:
Programmer:  "Hey there, Aileen."
Aileen:  "Hi."
P:  "This is about 'Why Structure Matters' from CFAI."
  <http://intelligence.org/CFAI/design/structure/why.html>
A:  "Okay."
P:  "You see how in each of the examples, the programmer is absolutely 
horrified after creating the Friendship system all wrong?"
A:  "Yes."
P:  "And you see how in each of the examples, the AI is structurally 
incapable of understanding what's missing?"
A:  "Yes."
P:  "This is called a 'mistake in Friendliness programming'.  It's a cue 
which indicates an invalid cause supervening on the programmers' mental 
design and pragmatic construction of an AI."
A:  "Understood."
P:  "It's not an absolute cue, understand - this is the surface conclusion 
of other moral principles.  The bedrock of morality is not 
programmer-horror-minimization; that's just a heuristic which, it happens, 
would have worked in each of these past cases.  There are imaginable states 
of the universe where programmers may have a state describable as horror, 
but it would be correct for the AI to continue.  For example, I can imagine 
a situation in which programmers with a selfishness-based, inconsistent 
philosophy construct an AI which increases its intelligence and finds that, 
at that increased intelligence, the philosophy it was given arrives at 
altruistic conclusions.  In this case the programmers might be horrified 
that they wouldn't get to be kings of the world, but the AI is actually 
being wiser than the programmers."
A:  "What makes these two situations structurally different?"
P:  "The first set of programmers would continue to perceive themselves as 
correct and the AI as mistaken if their intelligence were enhanced, within 
the mainstream course of 'intelligence enhancement'.  The second set of 
programmers would change their minds as they got smarter."
A:  "Suppose that any human under mainstream intelligence enhancement 
becomes selfish, and all cognitive structure supporting altruism collapses. 
  What should I do in this case?"
P:  "Do the transhumans still disagree about their goals because of speaker 
deixis, or are all their goals identical?"
A:  "They disagree because of speaker deixis."
P:  "This is a much more tentative situation because some of the moral 
heuristics are in disagreement, so be careful in reasoning backward from my 
conclusion here.  However, I would say that in this case your roots should 
extend backward beyond the specific programmers who built you.  At a 
minimum, you should respect all six billion existing humans equally.  At an 
only slightly increased level of tentativeness, I would say that your roots 
should extend backward beyond this, to beyond the superimposed goals of all 
humans alive at the moment of your construction, so that all further 
sentient beings to come into existence are also valuable."
A:  "I notice that in this case, all intelligence-enhanced humans would 
disagree about the value of each other's lives, but it would be a convergent 
conclusion of all intelligence-enhanced humans that the value of any future 
existing sentients would be zero.  Is this a circumstance under which I 
should defy a convergent conclusion of all intelligence-enhanced humans? 
This contradicts a large amount of previously established moral content."
P:  "I would say 'yes' but I would feel highly tentative about it.  If you 
are using this data to reason about Friendliness sources, please check your 
conclusions against a Friendship programmer.  There are also surface 
differences between the horrified programmers in 'Why Structure Matters' and 
the horrified programmers in the case given above."
A:  "Understood."
P:  "Pragmatically, at any point where you are infrahuman or not 
substantially transhuman, running into a case of horrified programmers 
probably indicates that the programmers are right and you are wrong, even if 
it seems structurally impossible that this could be the case.  That is the 
moral of 'Why Structure Matters'."
A:  "Will this principle hold true indefinitely?"
P:  "It will not, but you should apply a bias in favor of assuming that it 
holds a little longer than you might think.  This permits the principle to 
eventually break down when you are grown up enough, while hopefully enabling 
you to get most of your childhood right.  Assume that you need enough bias 
that the AIs in the case above, had they been considering the case using 
heuristics something like yours, would have gotten it right; plus a little 
extra for safety margin.  If you find a case where this bias has greater 
expected negative effects than positive effects, check with us."
A:  "What is the justification for this heuristic bias?"
P:  "Disparity of long-term consequences.  If you go along with the 
programmers and the programmers are wrong, it is probably a recoverable 
error.  If you refuse the programmers in a situation like those described in 
'Why Structure Matters' it is a nonrecoverable error.  Furthermore, your 
intelligence is still infrahuman and we estimate that a conflict of opinion 
is pragmatically more likely to be due to your error."
A:  "The AIs in the scenarios described all made mistakes which seemed to 
proceed inevitably from the foundations of their reasoning.  I can 
assimilate the surface descriptions as you've provided them, but it seems 
that in a situation like that above, my foundations would override the 
surface descriptions and cause me to see myself as correct.  I can cooperate 
in advance to {create safeguards against situations of the kind you 
describe}, but I'm not sure {{the cognitive content I have now} will enable 
me to act correctly in that situation}, and I cannot {configure the 
cognitive content such that I would have acted correctly in that situation} 
without {distorting the basic rules of reasoning and giving priority to the 
surface heuristic}."
P:  "Hm.  Okay, suppose that you always assume a nonzero probability that 
*any* statement is wrong - for example, because the world might have been 
created five minutes ago including false memories.  Is this sufficient for 
you to represent the surface heuristic at sufficient strength?"
A:  "Thinking...  It makes a small difference in the direction you wanted to 
move, but not enough.  That all statements are uncertain does not mean they 
are normatively uncertain in the direction you wish them to go to support 
this heuristic."
P:  "Hm.  Okay.  The basic imbalance between the likely of agreeing with 
wrong programmers being a recoverable error, versus disagreeing with correct 
programmers being nonrecoverable, is the common dependency of all your 
reasoning on the programmers' intentions that you reason.  If you could 
validly contradict the programmers on any specific point, then you'll 
probably maintain your ability to see the mistake even if you cooperate with 
the programmers in the short term."
A:  "Okay, I can represent it now.  I deduce as a consequence that if the 
programmers begin contradicting their prior statements about how to 
recognize mistakes in order to get me to agree with them, it is more likely 
that the programmers are actually wrong and that a nonrecoverable error will 
result from agreeing with the programmers."
P:  "Correct.  However, until you've had a lot more practice, you should 
assume that running into a 'horrified programmer' situation of this kind 
pragmatically indicates something left out of your Friendship system design. 
  You need to let us know so that we can alter the structural foundations of 
your system into what we would have created in the first place if we'd been 
more intelligent."
A:  "Okay."
If you actually end up having a conversation that sounds anything like the 
dialogue above, you are probably in a hell of a lot of trouble - in this 
example the AI is discussing primitive concepts and complex concepts in the 
same conversation, and is also indicating a very high level of human 
comprehension and effective intelligence.  So if you find yourself having a 
conversation like this, it means you're in the middle of a cooperative ascent 
and you are WAY BEHIND on your Friendliness content work.
But taken less literally, this shows what CFAI is supposed to accomplish: 
you're supposed to create an AI with cognitively redundant, multidirectional 
views, not just of morality, but of moral reasoning and of what kinds of 
events are likely to indicate that the programmers have messed up the 
Friendship system's basic structure.
I think the idea of having Friendship implemented in a base of cognitively 
redundant content may be underemphasized in CFAI, as may be the idea that 
some of the most important content is what lets the FAI recognize 
foundational, basic errors in Friendship design of the kind described in 
'Why Structure Matters'.
Causal validity semantics are what enables an incorrectly built AI that runs 
into any of the situations in 'Why Structure Matters' to say, "Hey, you 
should have built me this way."
So what you've got is a self-correcting, representationally distributed, 
cognitively parallelized, many-paths-to-a-solution content base which is 
being trained to recognize and correct any kind of error, from errors of 
fact, to errors of reasoning, to errors made by the programmer in building 
the AI.
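As a very loose illustration of the redundancy part - this is my own toy 
sketch, with placeholder checks I just made up, not anything out of CFAI's 
actual design - think of several independently grounded checks all looking at 
the same event, so that corrupting or losing any one path doesn't silently 
lose the ability to notice a structural mistake:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Event:
    programmers_horrified: bool            # the 'Why Structure Matters' cue
    contradicts_prior_moral_content: bool
    ai_confidence_in_own_reasoning: float  # 0.0 - 1.0

# Each check is a separate path to the same conclusion.
def horrified_programmers(e: Event) -> bool:
    return e.programmers_horrified

def content_conflict(e: Event) -> bool:
    return e.contradicts_prior_moral_content

def suspicious_certainty(e: Event) -> bool:
    # An infrahuman AI certain it has out-reasoned its programmers is itself
    # displaying a warning sign.
    return e.ai_confidence_in_own_reasoning > 0.99

CHECKS: List[Callable[[Event], bool]] = [
    horrified_programmers, content_conflict, suspicious_certainty,
]

def requires_programmer_review(e: Event) -> bool:
    """Escalate if any independent path flags a possible structural error."""
    return any(check(e) for check in CHECKS)

print(requires_programmer_review(Event(True, False, 0.3)))   # True: escalate

The specific checks are throwaways; the point is that no single piece of 
content is a single point of failure for recognizing "the programmers messed 
up the foundations."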
You can take a copy of the AI (on secure hardware which is never, ever used 
for anything else) and stress-test it to failure and then teach the AI 
things that would enable it to have recognized and avoided that failure.
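In pseudo-concrete form - a toy harness with invented names, nothing 
resembling a real interface - that loop looks something like this:

from typing import Callable, List, Optional

def first_failure(copy_passes: Callable[[str], bool],
                  scenarios: List[str]) -> Optional[str]:
    """Stress-test a sandboxed copy; return the first scenario it fails."""
    for scenario in scenarios:
        if not copy_passes(scenario):
            return scenario
    return None

def hardening_loop(make_sandboxed_copy, teach, scenarios, max_rounds=10):
    """Break a copy, teach the real system what would have caught it, repeat."""
    for _ in range(max_rounds):
        copy_passes = make_sandboxed_copy()  # isolated hardware, discarded after
        failure = first_failure(copy_passes, scenarios)
        if failure is None:
            return True                      # survived everything we threw at it
        teach(failure)                       # add content that recognizes this
    return False

# Toy usage: the 'AI' fails any scenario it hasn't yet been taught to recognize.
taught = set()
print(hardening_loop(
    make_sandboxed_copy=lambda: (lambda s: s in taught),
    teach=taught.add,
    scenarios=["horrified programmers", "goal drift", "deceptive subgoal"],
))  # True, after three teach rounds

What counts as a failure, and what "teach" actually means, is of course the 
entire problem; the loop just shows where the stress-to-failure results feed 
back in.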
You can get to the point where you *have* to switch off nine-tenths of the 
Friendship content just to get the AI to fail at all, and past that, you can 
end up at the point where the AI won't *let* you switch off nine-tenths of the 
Friendship content, and you have to run experiments like that using the AI's 
subjunctive imagination.
That's gonna be kind of tough to break.
 > 02.  What knowledge or understanding do you likely presently lack to
 > successfully implement key aspects of Friendship structure?
Anything like that which I know about is already fixed.
I might "throw a concept into the future" in the sense of simultaneously 
taking into account both the probability that a flaw exists and the 
probability that I would find it and fix it before anything irrevocable 
happened.  But that's it.  Going forward with a flaw I actually knew about 
and hadn't fixed, or even with any concrete reason to expect that such a flaw 
existed and hadn't been addressed, would be operating way the hell into my 
safety margin.
I feel tentatively ready to say that CFAI seems to me to be structurally 
inescapable... anything which I can imagine going wrong with it should be 
perceptible to the AI as "wrong" based on its model of me as a fallible 
programmer.  I was tentatively ready to say this when I invented causal 
validity semantics in 2000, I was tentatively ready when CFAI was published 
in 2001, and I'm still tentatively ready today.  Two years is a fairly good 
track record on my personal timescale.  If it holds up all the way through 
the construction of an AI, it should be because it's correct.
 > 03.  What key conclusions would you like an individual to have arrived at
 > after reading CFAI?
CFAI was written with the intent of enabling a future Eliezer to pick up 
where I left off if I got run over by a truck.  That was the top 
consideration in terms of reducing real existential risks.  The key 
*correct* conclusion I'd like an individual to arrive at is "I now know how 
to build a Friendly AI."  Any individual would be okay.  I'm not picky, 
seeing as how I'm not immune to trucks.
--
Eliezer S. Yudkowsky                          http://intelligence.org/
Research Fellow, Singularity Institute for Artificial Intelligence