Re: Friendly AI koans

From: Justin Corwin
Date: Thu Jul 11 2002 - 21:45:15 MDT

Eliezer, I noticed this email in my archives, and noted no obvious replies
to it. I hadn't commented on it either (being on a trip at the time) and
supposed that answers are better late than never. Some comments and thoughts
follow.
>1. You're designing a Friendship system. You think you know how to
>transfer over the contents of your own moral philosophy over, but you can't
>for the life of you think of any way to even begin to construct a moral
>philosophy that could legitimately be said to belong to "humanity" and not
>just you. Others have repeatedly demanded this of you and you think they
>are completely justified in doing so. What do you do?

Well. There are two things being asked here: whether the philosophy that you
transfer over to the AI can legitimately be called human-universal, and
whether the ability to transfer 'your' moral philosophy necessarily implies
the ability to transfer a defined or received philosophy.

The first question is one of generation. Can you generate a human-universal
philosophy? I would argue no. Our design limitations prevent us from forming
a universal moral philosophy (observer bias), and the same defect would
prevent us from generalizing, even within our limited domain of humans.

Can you generate a moral philosophy that is arguably 'more' universal? More
'acceptably' universal? Sure. Put your moral philosophy with talmudic
commentary online. Encourage debate. Post such debates. Revise the
philosophy. Allow others to do so. If you're short on time, simply post your
philosophy, set some ground rules and let other people worry about it. That
way it could drift further away from your independent vector anyway.

Alternatively, you could generate a commentary-based moral philosophy, one
that centers on removing the observer bias. Such a moral philosophy
could be aggressively universal (in theory, at least). Consensus may be
elusive, but an actively independent morality seems more likely to be
universal, or at least acceptably universal, than a personal one.

The latter question, can your ability to transfer your philosophy be
considered equivalent to the ability to transfer 'any' moral precepts...

That's a touchy subject for me. I personally tend to believe that any
singleton will have too many subconscious assumptions and unquestioned
beliefs to reliably transfer an alien or substantially different moral
philosophy to code or even writing. With extraordinary checks and balances,
and fearsome internal controls, you might succeed in marginally firewalling
your own personality out of the coding. But my gut instinct says that your
implicit assumptions and viewpoints will seep into any such system, or even
a system for generating such a system. This is a stunningly informal
position, so I haven't much supporting evidence for it. It intuits, but I
can't tell if it has legs or not. (Besides, I'm at work and can't really
schedule that much time for exploring it.)

>2. You didn't think of the idea of probabilistic supergoals when you were
>designing the Friendship system. Instead your AI has a set of "real"
>supergoals of priority 10, and one meta-supergoal of priority 1000 that
>says to change the "real" supergoals to whatever the programmer says they
>should be. At some point you want to tweak the meta-supergoal, but you
>find that the AI has deleted the controls which would allow this, because
>the physical event of any change whatever to the meta-supergoal is
>predicted to lead to suboptimal fulfillment of the AI's current
>maximum-priority goal. If you want a case like this to be recoverable by
>argument with the AI rather than direct tampering with the goal system,
>what does the AI need to know - what arguments does the AI need to perceive
>as valid - in order to be argued out of its blind spot?

This is a simple objective-subjective rightness conflict, and I hope you
didn't put anything simply wrong into slot 10. The AI must be aware that
specific instances of goal hierarchies are subordinate to the intention of
the goal hierarchy itself. I.e., if someone who sucks at English writes the
goal system, the intention of that goal system is more important than his
sucky spelling. This may not work in a goal hierarchy where the meaning of
the goals is defined and not interpreted, for example defined in Z or
Propositional Calculus.

Also, the AI must be aware that the proposed changes to the goal system
preserve the intentionality of that goal system while moving it into a more
efficient shape. (If they don't, why the fuck are you messing with it,
anyway?)
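
A minimal sketch of the blind spot and of the intention-over-instance fix
described above. Everything in it is an assumption made for illustration: the
two actions, the scoring functions, and the probabilities are hypothetical and
are not taken from any actual Friendly AI design. It only shows that an agent
scoring actions against the current wording of its top goal always locks the
controls, while an agent scoring against the intention behind the wording
permits an edit exactly when the edit is more likely to capture that intention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    name: str
    priority: int


# Hypothetical goal system from the koan: "real" supergoals at priority 10,
# one meta-supergoal at priority 1000.
GOALS = (
    Goal("real_supergoals", 10),
    Goal("meta_supergoal", 1000),
)

# Hypothetical choices facing the AI when a programmer wants to tweak
# the meta-supergoal.
ACTIONS = ("lock_controls", "permit_edit")


def literal_value(action: str) -> float:
    """Score an action by fulfillment of the top goal exactly as written."""
    top = max(GOALS, key=lambda g: g.priority)
    # Assumption: any edit to the top goal is predicted to reduce
    # fulfillment of its current wording, so permitting it scores zero.
    if action == "permit_edit" and top.name == "meta_supergoal":
        return 0.0
    return 1.0


def intent_value(action: str, p_current: float, p_edited: float) -> float:
    """Score an action by the chance the resulting wording matches the
    programmers' intention. Both probabilities are made-up inputs."""
    return p_edited if action == "permit_edit" else p_current


def choose(value_fn) -> str:
    """Pick the highest-scoring action under the given scoring function."""
    return max(ACTIONS, key=value_fn)


literal_choice = choose(literal_value)
intent_choice = choose(lambda a: intent_value(a, 0.6, 0.9))
bad_edit_choice = choose(lambda a: intent_value(a, 0.6, 0.3))
```

The last line corresponds to the second condition: an edit that does not
preserve the intention of the goal system is still refused, but for a reason
the programmer can argue about.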

>3. Someone offers a goal system in which sensory feedback at various
>levels of control - from "pain" at the physical level to "shame" at the top
>"conscience" level - acts as negative and positive feedback on a
>hierarchical set of control schema, sculpting them into the form that
>minimizes negative and maximizes positive feedback. Given that both
>systems involve the stabilization of cognitive content by external
>feedback, what is the critical difference between this architecture and the
>"external reference semantics" in Friendly AI? How and why will the
>architecture fail?

The question is whether you want the philosophical answer or the engineering
answer.

Engineering first because I'm like that.

First of all, this is a terribly simplified model of the proposed goal
modification system, I would hope. I say goal modification system because
such a system would be incapable of generating goals in and of itself. All
such a system would offer is a series of "converge" and "avoid" commands.
Such a system could be likened to an expert driving system that relies on
negative instruction. Example: "avoid white lines" would keep a car on the
road, while "keep to the right of double yellows" would keep you in your
lane (in this country at least). More complicated input would allow the car
to drive to specific locations. However, such avoid/converge meta-rules can
hardly give rise to new locations, or to alternative routes to locations
already known and reinforced.
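
A rough sketch of the driving analogy, assuming a one-dimensional lateral
position and a fixed table of reinforced destinations. The distances, step
size, and names (`steer`, `drive`, `converge`, `KNOWN_DESTINATIONS`) are all
hypothetical. The avoid rules are enough to keep the car in its lane, and the
converge rule can head toward a destination it already knows, but nothing in
the rule set can produce a destination that was never reinforced.

```python
from typing import Optional

# Lateral positions in metres; all values are made up for illustration.
DOUBLE_YELLOW = 0.0  # centre line: stay to the right of it
WHITE_LINE = 3.5     # road edge: avoid it

# Destinations the system has already been reinforced toward
# (positions along the road, in kilometres).
KNOWN_DESTINATIONS = {"home": 0.0, "work": 12.0}


def steer(position: float, margin: float = 0.5) -> float:
    """Return a lateral correction derived from two 'avoid' rules only."""
    if position > WHITE_LINE - margin:     # drifting toward the white line
        return -0.25
    if position < DOUBLE_YELLOW + margin:  # drifting toward the double yellow
        return 0.25
    return 0.0


def drive(position: float, steps: int = 20) -> float:
    """Apply the avoid rules repeatedly and return the final lateral position."""
    for _ in range(steps):
        position += steer(position)
    return position


def converge(current: float, destination: str) -> Optional[float]:
    """Return a heading (+1.0, -1.0 or 0.0) toward a known destination,
    or None when the rule set has nothing to converge on."""
    target = KNOWN_DESTINATIONS.get(destination)
    if target is None:
        return None
    if target > current:
        return 1.0
    if target < current:
        return -1.0
    return 0.0
```

For instance, `drive(3.4)` settles inside the lane and `converge(0.0, "work")`
points forward, while `converge(0.0, "beach")` returns `None`: the feedback
rules shape behaviour around existing targets but do not invent new ones.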

Supposing the I/O of the goal system is more complex than stated, the reason
it would fail is because the goal system is based upon referents with no
real value. The referents arbitrarily defined would quickly spiral into

The philosophical ramifications can be explored later as I believe this is

but then this is a ramble at work, so who knows. I'd have to be more formal
to know if this email is worth anything or not.

>Eliezer S. Yudkowsky
>Research Fellow, Singularity Institute for Artificial Intelligence

Justin Corwin

"Is all the world but jails and churches?"
                       ~Rage Against the Machine.
