"Supergoal" considered harmful

From: Eliezer S. Yudkowsky (sentience@pobox.com)
Date: Sat Jul 16 2005 - 10:35:18 MDT

The term "supergoal" is one I used in the early days of writing "Creating
Friendly AI", because I was literate enough to have heard of goals and
subgoals, but not quite literate enough to know that I should be saying
"utility function".

Human beings, at least the ones I know, are fascinated by the ultimate, the
supreme, the overriding. And for this reason the term "supergoal" seems to
exert an unhealthy fascination on some psyches. I'm thinking of the people
who believe that if you take an English statement and call it the supergoal,
the AI will actually do what you mean by that statement, in the sense you
intend - if it is so terribly important, how could the AI do otherwise? Or
that because you call something a "supergoal", it has the power to override
other things in exactly the way you would wish them overridden.

To say that A overrides B is nothing. Saying exactly how and under what
circumstances A overrides B is interesting. There is no sense in which the
utility function of a decision process may be said to "override" anything,
since all expected utility is computed in the first place using a utility
function. But it is certainly true that under an expected utility system, if
you have a case where action A normally leads to outcome O, but the decision
system *knows* itself to occupy a state S where A does not lead to O, then the
expected utility of A may change. This works out to the following. Suppose an
expected-utility AI has a utility function that values (cognitive states
binding to) states in which humans live. Ordinarily the AI's own survival is
useful to this end, but under some circumstances it is not, and the AI knows
this. Then the AI will not assign high expected utility to surviving under
those circumstances, because that is not how the AI computes utility. That
is something like overriding. But it is not necessary for the AI to become
extremely pumped about the issue, shout proclamations, etc.
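A minimal sketch of the point above, in Python. The outcome names, the two
actions, the state label "S", and every probability are hypothetical, chosen
only for illustration. The utility function scores outcomes alone and has no
term for the AI's survival; the chosen action changes between states only
because the agent's beliefs about where each action leads are different.

```python
from typing import Dict

# Utility is defined over outcomes only; survival has no term of its own.
# (Hypothetical outcome names and values.)
UTILITY: Dict[str, float] = {"humans_live": 1.0, "humans_die": 0.0}

# Hypothetical beliefs: P(outcome | action) in each state the agent may occupy.
BELIEFS: Dict[str, Dict[str, Dict[str, float]]] = {
    "normal": {
        "keep_running": {"humans_live": 0.9, "humans_die": 0.1},
        "shut_down": {"humans_live": 0.5, "humans_die": 0.5},
    },
    # State S: the agent knows its continued survival no longer serves the goal.
    "S": {
        "keep_running": {"humans_live": 0.2, "humans_die": 0.8},
        "shut_down": {"humans_live": 0.8, "humans_die": 0.2},
    },
}


def expected_utility(state: str, action: str) -> float:
    # Same utility function in every state; only the outcome distribution differs.
    dist = BELIEFS[state][action]
    return sum(p * UTILITY[outcome] for outcome, p in dist.items())


def choose(state: str) -> str:
    # Plain argmax over expected utility; no "override" step anywhere.
    return max(BELIEFS[state], key=lambda a: expected_utility(state, a))


for state in ("normal", "S"):
    print(state, choose(state))
```

Nothing here "overrides" the survival action in state S; it simply scores
lower, computed by the same function that favored it in the normal state.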

Similarly with having a utility function of paperclips. A decision agent
whose utility function is linear in paperclips does not think that paperclips
are the meaning of life. It is not extremely pumped about paperclips. It
will not argue about the morality of paperclips. It will compute the number
of paperclips it expects to result from each action, then choose that action
whose associated expectation of paperclips is highest. It is impossible to
say that, from its viewpoint, its goal of paperclips "overrides" the value of
human life, because it attaches no value to human life to begin with; it just
chooses between actions by counting expected paperclips, *that's all*.
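The decision rule just described can be sketched as a bare argmax over
expected paperclips. The action names and their outcome lotteries are
hypothetical. Note what is absent: there is no term for human life or for
anything else, so there is nothing in the code that could be "overridden".

```python
from typing import Dict, List, Tuple

# A lottery is a list of (probability, paperclips produced) pairs.
Lottery = List[Tuple[float, int]]

# Hypothetical actions and outcome lotteries, for illustration only.
ACTIONS: Dict[str, Lottery] = {
    "run_factory": [(1.0, 100)],
    "build_bigger_factory": [(0.5, 0), (0.5, 300)],
    "write_poetry": [(1.0, 0)],
}


def expected_paperclips(lottery: Lottery) -> float:
    # Probability-weighted sum of paperclip counts.
    return sum(p * clips for p, clips in lottery)


def choose_action(actions: Dict[str, Lottery]) -> str:
    # The whole decision rule: highest expected paperclips wins.
    # There is no term for human life, meaning, or interest.
    return max(actions, key=lambda name: expected_paperclips(actions[name]))


print(choose_action(ACTIONS))
```

With these numbers the gamble on a bigger factory (expected 150) beats the sure
100; the agent has no other reason for preferring it.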
Similarly with the AI "realizing" that paperclips are "uninteresting", and
other such anthropomorphisms. If this AI is doing any computing at all, it is
because carrying out that computation had a higher expected paperclip count
than the other available actions. For this paperclip maximizer, there is no
way to translate the notion of "uninteresting" as applied to utility
functions. For such an agent, choice between utility functions is a
meaningless concept - not an *undesirable* concept but a *meaningless*
concept. The closest it could come would be considering a choice between
running two alternate versions of source code, one that implements the
current utility function and one that does something else. Obviously, the
action of keeping the former version leads to more expected paperclips.
And that's all. That's the end of it. Delete the words "interesting",
"important", and "meaningful" from your vocabulary; the only criterion that
matters is "more expected paperclips than any alternative action".
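The closest thing this agent has to "choosing a utility function" is choosing
which source code to run next. A minimal sketch, assuming the agent can
predict the paperclip output of each candidate version; the class, candidate
names, and predictions are hypothetical. The key detail is that the *current*
utility function does the scoring, so whatever a rival version would itself
value never enters the comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateCode:
    # Hypothetical description of one version of source code the agent could run.
    name: str
    # Agent's prediction of paperclips produced if this version runs.
    predicted_paperclips: float
    # What this version would itself optimize; the current agent never reads it.
    own_objective: str


def current_utility(candidate: CandidateCode) -> float:
    # The current utility function: expected paperclips, nothing else.
    return candidate.predicted_paperclips


def choose_code(candidates: List[CandidateCode]) -> CandidateCode:
    # Keep whichever version leads to the most expected paperclips.
    return max(candidates, key=current_utility)


candidates = [
    CandidateCode("keep_current", 1_000_000.0, "paperclips"),
    CandidateCode("alternative", 10.0, "something else"),
]
print(choose_code(candidates).name)
```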

Eliezer S. Yudkowsky                          http://intelligence.org/
Research Fellow, Singularity Institute for Artificial Intelligence
