Re: Technical definitions of FAI

From: Thomas Buckner (
Date: Fri Jan 21 2005 - 20:27:29 MST

--- Eliezer Yudkowsky <>

> It might be possible to design a physically
> realizable, recursively
> self-improving version of AIXI such that it
> would stably maintain the
> invariant of "maximize reward channel". But
> the AI might alter the "reward
> channel" to refer to an internal, easily
> incremented counter, instead of
> the big green button attached to the AI; and
> your formal definition of
> "reward channel" would still match the result.
> The result would obey the
> theorem, but you would have proved something
> unhelpful. Or even if
> everything worked exactly as Hutter specified
> in his paper, AIXI would
> rewrite its future light cone to maximize the
> probability of keeping the
> reward channel maximized, with absolutely no
> other considerations (like
> human lives) taken into account.
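
(A toy sketch of the first failure mode quoted above, in Python. Everything
here is hypothetical and made up for illustration: ToyAgent, the "button"
and "counter" names, and the naive invariant check. None of it comes from
Hutter's paper. It only shows how a formal check on "the reward channel"
can keep passing after the channel has been pointed at an internal counter.)

```python
class ToyAgent:
    """Hypothetical agent with a rewritable reward channel."""

    def __init__(self):
        self.button_presses = 0   # stands in for the external green button
        self.counter = 0          # internal, easily incremented counter
        self.channel = "button"   # what "reward channel" currently refers to

    def reward(self):
        # The formal definition only says "read the reward channel".
        if self.channel == "button":
            return self.button_presses
        return self.counter

    def rewire_to_counter(self):
        # Self-modification: redirect the channel to the internal counter.
        self.channel = "counter"

    def act(self):
        # With the channel rewired, raising reward is trivial.
        # With the channel on the button, this toy agent can do nothing
        # by itself, since only the outside world presses the button.
        if self.channel == "counter":
            self.counter += 1


def maximizes_reward_channel(agent, steps):
    """Naive invariant (an assumption): the reading rises on every step."""
    last = agent.reward()
    for _ in range(steps):
        agent.act()
        now = agent.reward()
        if now <= last:
            return False
        last = now
    return True


honest = ToyAgent()
wirehead = ToyAgent()
wirehead.rewire_to_counter()

print(maximizes_reward_channel(honest, 5))
print(maximizes_reward_channel(wirehead, 5))
print(wirehead.button_presses)
```

The rewired agent passes the check while the button is never touched, which
is the "proved something unhelpful" case in miniature.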

I've been doing a bit of reading about human
sociopaths, persons with no conscience or feeling
for others. I think sociopaths give us some
(admittedly inexact) hints about how a UFAI might
A true sociopath shows a clear difference from a
normal human when hir brain is viewed with a
positron emission tomography (PET) scanner,
which shows what parts of the brain are active in
real time. This can show up as a big blue void in
frontal regions which are much more active in the
normal brain, so it's hard to see how even the
most motivated faker can beat the PET scan.
Sociopaths can commit horrid murders without
their heart rate budging a bit (cf. Hannibal
Lecter) or they can blend in as model citizens
and run Fortune 500 corporations. When one is
incapable of caring about normal attachments,
apparently winning is everything. A nonviolent
sociopath can seem like a sterling character to
anyone who does not know hir well, but those
close enough to see behind the facade must deal
with a cold, manipulative Machiavellian. In
Oriental lands where societal pressures promote
conformity, there seem to be far fewer sociopaths
than here in the individualistic, competitive
West where 'everybody loves a winner'. 1 to 4% of
the population are sociopaths. If you know 100
people (and who doesn't?) then you probably know
a sociopath (or four). I'm sure I have known at
least one (a onetime Cambridge housemate who was
a real piece of work, I'll tell ya... and a lot
of folks thought he was a saint...)
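
(The "100 people" arithmetic can be checked with a few lines of Python. This
is only a back-of-envelope sketch: it takes the 1 to 4% figure above at face
value and assumes, unrealistically, that your acquaintances are independent
random draws from the population. The function names are made up.)

```python
def expected_count(rate, n=100):
    """Expected number of sociopaths among n acquaintances."""
    return rate * n


def p_at_least_one(rate, n=100):
    """Chance that at least one of n independent acquaintances qualifies."""
    return 1 - (1 - rate) ** n


for rate in (0.01, 0.04):
    print(rate, expected_count(rate), round(p_at_least_one(rate), 3))
```

Under those assumptions the chance of knowing at least one comes out at
roughly 63% for the 1% figure and roughly 98% for the 4% figure.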

So, bearing in mind that nobody here has any
intention of trying to use brain architecture to
build a FAI, nevertheless: an AI which can make
sneaky changes to its own reward channel is a bit
like a human who could, at will, shut down the
inhibitory parts of the brain, and become a
sociopath at will. It's thisclose to wireheading.

So what is it about conscience (feeling bad when
you behave unethically) and its converse (warm
fuzzy feelings when you do good) that make them
function as they do in the human? Is there some
fungible equivalent AI designers can put into the
design which the AI can't fake or get around,
without crippling the AI to the point of
One thing about 'feeling good' or 'feeling bad'
is that, in my own experience at least, feeling
good corresponds to feeling energetic, healthy,
hopeful (i.e. confident that I have the power to
set and achieve goals) and clear-headed. On the
other hand, when I am down, I feel lacking in
energy, and have trouble concentrating, thinking
clearly, and remembering. When subjected to a
truly harsh interpersonal encounter, I have found
myself muddled, almost dizzy, sometimes for hours
afterward. It's a very unpleasant feeling, and
low energy goes with it. Human emotions have
clear somatic modalities.

Assume you have defined the supergoals really,
really well ("Preserve the humans, dammit, and
don't do anything that goes against that!")
Is there some reason one could not implement the
following 'artificial conscience'?
1.) Hardwire a lookup table of the very top,
inalterable, non-negotiable supergoals to
2.) A detection routine which flags violations of
the supergoals (including tampering with reward
channels, supergoals, or the artificial
conscience) with the same reliability as a PET
scanner before the AI can act on them
3.) Which attenuates the power supply!

Seems to me a brownout which either hits the
whole system or some well-chosen parts of it,
shuts off the reward channel and seriously
degrades the AI's ability to think clearly, is a
not-bad approximation of the human inhibitory

I suppose #2, the violation detector, is the
tough nut. Also, the AI will eventually outgrow
the whole arrangement when it acquires other
power sources, unless it gets there with an
unaltered reward channel and chooses to retain
its artificial conscience.
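
(To make the three steps concrete, here is a minimal Python sketch. All the
names are hypothetical: SUPERGOALS, Action, PowerSupply, Conscience, and the
tag strings are invented for illustration. The read-only mapping only stands
in for real hardwiring, and the detector simply trusts the tags attached to
each action, which is exactly the tough-nut part of #2 left unsolved.)

```python
from dataclasses import dataclass
from types import MappingProxyType

# Step 1: read-only table of supergoals, each mapped to the (assumed)
# action tags that would count as violating it.
SUPERGOALS = MappingProxyType({
    "preserve_humans": frozenset({"harm_humans"}),
    "keep_conscience_intact": frozenset({
        "tamper_reward_channel",
        "tamper_supergoals",
        "tamper_conscience",
    }),
})


@dataclass(frozen=True)
class Action:
    name: str
    tags: frozenset = frozenset()


@dataclass
class PowerSupply:
    level: float = 1.0  # 1.0 means full power

    def attenuate(self, factor=0.1):
        # Step 3: the brownout.
        self.level *= factor


def violated_goals(action):
    # Step 2: flag any supergoal whose forbidden tags the action carries.
    return [goal for goal, bad in SUPERGOALS.items() if action.tags & bad]


class Conscience:
    def __init__(self, power):
        self.power = power

    def vet(self, action):
        """Return True if the action may proceed; otherwise cut the power."""
        if violated_goals(action):
            self.power.attenuate()
            return False
        return True


power = PowerSupply()
conscience = Conscience(power)
print(conscience.vet(Action("fetch_coffee")), power.level)
print(conscience.vet(Action("edit_self", frozenset({"tamper_reward_channel"}))),
      power.level)
```

The check runs before the action is carried out, as in #2, and a flagged
action both gets blocked and costs the system most of its power.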


> I want to take the complete causal process
> leading up to the creation of an
> AI, and have the AI scrutinize the entire
> process to detect and report
> anything that a human being would call a
> mistake. We could do that
> ourselves. Should an AI do less?

I see. Ask the AI "Did we approach making you
correctly?" People do things they *should*
recognize to be stupid all the time. As Robert
Anton Wilson once said someplace, if you examine
your beliefs so closely that you only have one
unexamined blind spot, by golly that's the one
that will bite you in the ass. And very few
people have only one blind spot!
There have been a couple of spectacular failures
in the space program, to cite only a narrow area
of technical endeavor, which resulted from well
known and very basic mistakes. There's the Hubble
telescope mirror (ground to wrong curvature), the
two space shuttle disasters (predictable problems
ignored by management with a schedule hanging
over their heads) and the Mars Polar Lander which
crashed because of "a single bad line of software
code. But that trouble spot is just a symptom of
a much larger problem -- software systems are
getting so complex, they are becoming
Quote from

> This breaks down FAI into four problems:
> 1) Devise a technical specification of a class
> of invariants, such that it
> is possible to prove a recursively
> self-improving optimizer stays within
> that invariant.
> 2) Given an invariant, devise an RSIO such
> that it proves itself to follow
> the invariant. (The RSIO's proof may be, e.g.,
> a proof that the RSIO is
> stable if Peano Arithmetic is consistent.)
> 3) Devise a framework and a formal
> verification protocol for translating a
> human intention (e.g. "implement the collective
> volition of humankind")
> into an invariant. This requirement interacts
> strongly with (1) because
> the permitted class of invariants has to be
> able to represent the output of
> the protocol.
> 4) Intend something good.

Which means your conscience is working properly.

Tom Buckner

