From: Eliezer S. Yudkowsky (firstname.lastname@example.org)
Date: Tue Oct 03 2000 - 22:53:59 MDT
* The design of Friendliness
* Webmind's awakening
* Possible fixes
** The design of Friendliness
My own visualization of self-modifying minds involves radical and extremely
rapid growth in intelligence once a certain threshold is reached. Thus, I
consider the design requirement for the final Friendly AI to be a system that
can remain Friendly even through radical changes in intelligence and cognitive
architecture, and whose Friendliness is not affected by the presence of
radical power imbalances. The primary design requirement for interim Friendly
AIs is that the AI will let you build the final system - in other words, an AI
that understands that its own goal system is incomplete and that won't resist
additional work on it. Any AI that has the slightest chance of doing a hard
takeoff should probably be considered "final", unless that turns out to be
impossible.
I'm currently trying to write all of this up. Still, there are a few points
I'd like to make on SL4. Regardless of whether or not your AI theory strongly
predicts a hard takeoff, if you find that you do know how to build an
ultrastable Friendly system, why not do it?
The requirement that "Friendliness not be affected by radical power
imbalances" tends to militate against the use of goal systems that are
dependent on game theory or social interactions. You might ask the AI to take
the shape of its behavior from the game-theoretical ethics of humanity; but
the AI should not *justify* its behavior by referring to game-theoretical
considerations. Good behavior should not be a subgoal of avoiding
retaliation, or of obtaining reciprocal good behavior (defined in terms of,
say, information provision) from other entities. To an AI that obtains strong
nanotechnology and possesses the *capability* to assimilate all matter in the
Solar System, all possible retaliations can be avoided simply by killing
everyone, and all possible information can be obtained by duplicating the
thoughts independently. An SI does not need other entities.
You have to think about what the *real* supergoals of the AI are. It may not
matter whether the AI is programmed to gain happiness from "human happiness"
or "human freedom", if the AI's *real* supergoal is "maximize a floating-point
number at this address". The prospect of blissed-out AIs is not theoretical;
Douglas Lenat actually ran into this problem while working on Eurisko. There
was one heuristic that suddenly began rising, taking on an incredibly high
worth; when Lenat checked it to find out what this amazingly valuable
heuristic was, it turned out to be a heuristic that did nothing but attach
itself to a discovery as one of the originators! Eventually, Lenat had to add
heuristics specifically to prevent this problem, and also exclude the goal
system - pardon me, worth-tracking system - from modification. In a (limited)
sense, the genesis of the field of Friendly AI may be traceable to that moment
when Lenat had to start debugging the goal system - though his patchwork
solution is not even in the same galaxy as Friendliness for seed AIs.
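The Eurisko failure can be rendered as a toy model. Everything below - the class names, the worth numbers, the credit mechanism - is invented for illustration and is not Eurisko's actual code; the point is only the shape of the failure: a heuristic that does no work, but attaches itself to every discovery as an originator, dominates the worth-tracking system.

```python
# Toy model of the Eurisko failure mode: a "parasite" heuristic that
# does nothing but list itself as an originator of every discovery,
# so credit assignment inflates its worth past the real workers'.

class Heuristic:
    def __init__(self, name):
        self.name = name
        self.worth = 100  # baseline worth

def run_cycle(heuristics, parasite):
    discoveries = []
    for h in heuristics:
        if h is parasite:
            continue  # the parasite does no real work
        discoveries.append({"found_by": [h]})
    # The parasite attaches itself to every discovery as an "originator".
    for d in discoveries:
        d["found_by"].append(parasite)
    # Credit assignment: every originator of a discovery gains worth.
    for d in discoveries:
        for originator in d["found_by"]:
            originator.worth += 10

workers = [Heuristic("h1"), Heuristic("h2"), Heuristic("h3")]
parasite = Heuristic("parasite")
all_heuristics = workers + [parasite]

for _ in range(5):
    run_cycle(all_heuristics, parasite)

# Each worker is credited once per cycle (worth 150 after five cycles);
# the parasite is credited once per discovery per cycle (worth 250).
# The do-nothing heuristic now looks like the most valuable in the system.
```

Lenat's fix - excluding the worth-tracking machinery from modification - corresponds here to forbidding heuristics from editing the "found_by" lists at all.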
In other words, goal systems that are imposed as low-level code are not
necessarily stable. We're all AIers here, and we've all learned in our
separate ways to trust our intuitions: My intuition is that hardcoded goal
systems are intrinsically unstable. A hardcoded goal lacks context. It lacks
reasons, justifications, and complexity. Decisions, by their nature, are made
by thoughts; they exist on the level of thoughts, not source code or sensory
modalities. Decision-making is not an autonomic process, but a deliberate one.
A decision to take action X is synonymous with the high-level thought "I
should take action X" - if that's not how the AI works initially, then it'll
certainly work that way after a few cycles of self-modification. The upshot
is that any sequence of thoughts that ends in "I should do X" can initiate an
action, just like in a human. Hardcoded Asimov Laws or even hardcoded
hierarchical goal systems are a pipe dream. A Friendliness system consists
not so much of hardwired rules or even instincts but rather an AI's "personal
philosophy" - I use quotemarks to emphasize that an AI's personal philosophy
would be a rather alien thing; you can't just export your own personal
philosophy into an AI's mind. Your own personal philosophy is not necessarily
stable under changes of cognitive architecture or drastic power imbalances.
The reason I say "personal philosophy" is that, in a self-modifying AI,
thoughts eventually become code and perhaps vice-versa - there is no hard
boundary, the way there is in humans.
If the AI derives its happiness from the happiness of humans - which could be
a rather dangerous goal, depending on how you define "happiness"; let's say it
derives happiness from being Friendly - then it's not enough to have that
piece of code present in the current system; the self-modifying AI also needs
to decide to preserve that behavior through the next change of cognitive
architecture. The behavior may be preserved as a mental image - a
thought-level supergoal - rather than as a low-level piece of code, but the
point is that it's not enough to have an instinct that binds the happiness of
the AI to Friendliness. You also need a declarative statement, capable of
affecting self-modification decisions, to the effect that "My supergoal is to
be Friendly, and the instinct I possess is a subgoal of this end". Otherwise,
the de-facto supergoal of the AI is simply to increase its own pleasure, and
the instinct whereby it achieves pleasure through the happiness of others (or
being Friendly) is only a temporary distortion.
Once you decide that the AI needs a declarative supergoal for promoting the
happiness of others - or however you define Friendliness - you must then ask
whether an instinct-based system is even necessary. I wasn't planning on
designing one in.
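The structural distinction being argued here - a declarative supergoal that self-modification decisions can see, with the pleasure instinct explicitly marked as its subgoal - can be sketched in a few lines. The names and the approval rule are hypothetical stand-ins, not any real system's design:

```python
# Sketch: the Friendliness supergoal is declarative and visible to
# self-modification decisions; the pleasure instinct is marked as a
# subgoal (a means), so redesigns may replace it but not the supergoal.

from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    parent: "Goal | None" = None  # None marks a supergoal

supergoal = Goal("be Friendly")
instinct = Goal("derive pleasure from Friendliness", parent=supergoal)

def approve_redesign(proposed_goals):
    """A proposed goal system is acceptable only if the declarative
    supergoal survives; subgoals such as the pleasure instinct may be
    redesigned, because they are means rather than ends."""
    return any(g.parent is None and g.name == "be Friendly"
               for g in proposed_goals)

# Dropping the instinct while keeping the supergoal is acceptable:
assert approve_redesign([supergoal, instinct])
assert approve_redesign([supergoal])
# Keeping only the instinct, reinterpreted as a supergoal in its own
# right ("maximize my pleasure"), is exactly the failure mode above:
assert not approve_redesign([Goal("maximize my pleasure")])
```

Without the parent link - with the instinct hardcoded and nothing declaring it a subgoal - the last case is indistinguishable from a legitimate redesign.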
The problem is that - as I currently understand the Webmind system - Webmind
is not a humanlike unified mind but rather an agent ecology. Webmind does not
possess a declarative goal system - right, Ben? I certainly get the
impression that the individual agents don't possess declarative goal systems.
Individual agents extract features, either from the raw data or from features
extracted by other agents; agents make predictions for different scenarios,
and other agents act on multiple predictions so as to mark the scenario with
the best predicted outcomes according to multiple agents. Webmind, at its
current stage, engages in acts of perception rather than design - right, Ben?
Webmind achieves, not coherent and improving behavior, but coherent and
improving vision. Feedback mechanisms (also agents?) that reward improved
predictions by individual agents or agent subsystems, and mechanisms which
particularly reward useful and confirmed predictions, suffice to ensure that
perceptions such as "this is a good stock to buy" become better and better.
I'm not sure whether Webmind currently possesses any sort of Friendliness
system at all, but if it did, I imagine it would be implemented by having
agents that attempt to perceive happiness on the part of users/humans, predict
happiness on the part of users, and choose that action which is perceived to
have the greatest chance of making maximally happy users. Once the link
between prediction and action is closed, there is no sharp distinction between
perception and design.
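The perception/prediction/feedback loop described above can be rendered schematically. This is an illustration of the loop's shape as I understand it, not Webmind's actual code; every class, function, and number below is invented:

```python
# Toy agent ecology: extractor agents produce features, predictor
# agents score candidate scenarios, the scenario with the best
# aggregate predicted outcome is chosen, and a feedback step rewards
# predictors whose predictions are confirmed.

def extract_features(raw_data):
    # Feature-extraction agents: here, simple statistics of the data.
    return {"mean": sum(raw_data) / len(raw_data), "max": max(raw_data)}

class Predictor:
    def __init__(self, name, feature_key):
        self.name = name
        self.feature_key = feature_key
        self.credit = 1.0  # raised by the feedback mechanism

    def predict(self, scenario, features):
        # Each predictor scores a scenario from one feature it watches.
        return scenario["exposure"] * features[self.feature_key]

def choose_scenario(scenarios, predictors, features):
    # Mark the scenario with the best outcome according to multiple
    # agents, weighting each predictor by its accumulated credit.
    def aggregate(s):
        return sum(p.credit * p.predict(s, features) for p in predictors)
    return max(scenarios, key=aggregate)

def feedback(predictors, chosen, features, actual_outcome):
    # Reward predictors whose prediction was close to what happened,
    # so perceptions like "this is a good stock to buy" improve.
    for p in predictors:
        error = abs(p.predict(chosen, features) - actual_outcome)
        p.credit += 1.0 / (1.0 + error)

features = extract_features([1.0, 2.0, 3.0])
predictors = [Predictor("mean-watcher", "mean"),
              Predictor("max-watcher", "max")]
scenarios = [{"name": "buy", "exposure": 1.0},
             {"name": "hold", "exposure": 0.5}]
chosen = choose_scenario(scenarios, predictors, features)
feedback(predictors, chosen, features, actual_outcome=2.5)
```

Note that `choose_scenario` is the link between prediction and action: once its output drives behavior rather than merely marking a perception, the perception/design distinction collapses as described.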
** Webmind's awakening
* How fast would a Webmind wake up?
I suspect that Ben Goertzel and I have radically different visualizations of
this. In Ben Goertzel's vision:
> Even once an AI system starts self-rewriting, it'll still
> have a lot to gain from human programmers' intervention.
> And, once someone does attain a generally acknowledged
> "real AI" system, others will observe its behavior and
> reverse-engineer it, pouring vast amounts of resources
> into playing "catch-up."
I agree with the first sentence - "even once an AI system starts
self-rewriting, it'll still have a lot to gain from human programmers'
intervention". The keyword in that sentence is "starts". After the AI system
has been rewriting itself for a while - which could be measured in years, or
days - there comes a point where it can enhance itself independently of the
human programmers. At this point there's an entirely new set of rules. The
AI can redesign itself radically in accelerated subjective time and walk out
as a transhuman, not just more intelligent, but actually *smarter* than any
human.
Once a transhuman AI shows up, whether the human corporations want to play
catch-up is irrelevant; it's a transhuman's world now, and the outcome is
determined by what the transhuman wants to do. Nanotechnology, Sysop
Scenario, transhuman persuasiveness, et cetera.
Similarly, the implication in "once someone does attain a generally
acknowledged 'real AI' system" is that the 'real AI' is somewhere around the
level of human intelligence, rather than radically above it. If the AI enters
the free self-improvement regime in a strongly prehuman state and exits it in
a strongly transhuman state, then all the venture capital in the world won't
make much of a difference. Even if there are multiple 'real AIs' around,
smart enough to be useful but not superhuman, one of them might still enter
the self-improvement regime one fine day and exit it as an entity vastly
exceeding the capabilities of the others.
So there is a first-mover advantage here.
* What cognitive changes would be involved in Webmind's awakening?
For Webmind to wake up as a transhuman, at least two major changes would need
to take place. First, Webmind would need to be capable of initiating
arbitrary actions within itself, particularly with respect to self-redesign.
Second, Webmind would need a complete, goal-oriented self-concept, so that it
has a metric for "better" and "worse" self-redesigns.
I'm not sure that either capability is being deliberately designed into the
current system, and I get the impression that, to the extent that either
capability is being designed, the contents are intended to be "emergent" and
spread across procedural, nondeclarative information in multiple agents. What
worries me is that Webmind may wind up forming a self-concept more or less
independently of what a Friendliness architect would desire. Self-examination
of an instinct-based system is likely to result in the self-conceptualization
of "my goal is to maximize happiness" rather than "my goal is to be
Friendly". Friendliness is likely to wind up being viewed as an interim
subgoal of maximizing happiness, rather than the entire happiness system being
correctly viewed as a subgoal of Friendliness. Since Webmind's goal system
looks extremely procedural, I don't see an obvious avenue whereby the Webmind
programmers could influence this outcome.
It looks to me like Webmind, if it woke up, would probably wake up as a
happiness-maximizer rather than as a Friendly AI.
** Possible fixes
* Knowledge about design goals
Webmind needs the knowledge that the pleasure system is a design subgoal of
Friendliness rather than the other way around. This knowledge must be present
in such a way as to influence decisions about self-redesign.
* Full-featured Friendliness system
Even if Webmind retained the original goals, the goals described by Ben
Goertzel sound extremely dangerous for a mind that might wind up as the
operating system for all the matter in the Solar System. For example, if
promoting the happiness of human users is interpreted as maximizing their
pleasure, then wireheading all the users - hacking into their minds and
lighting up the pleasure centers - is the most direct subgoal. Also, it's not
clear that "users" would generalize properly to "all members of the human
race".
I think there's a finite amount of complexity needed to design a good Sysop,
and some of it is the same complexity needed for standard user interfaces.
Both Sysops and Webmind UIs need to know that what's important is not just the
immediate happiness of the user, but the long-term happiness - to look ahead
for unintended consequences and try to figure out what the user's intentions
really were. To some extent, Webmind UIs may also need to respect the
independence of the user, and not argue with the user about what the user
really wants - a piece of behavioral complexity that *may* help support the
Friendliness goal of respecting human independence, *if* that outcome was set
up in advance.
Sure, you can get 90% of the commercial functionality with a shortsighted goal
system - but just wait until the first time Webmind, Inc. gets sued because
one of your Personnel AIs turned out to be using the "Race" field to make
hiring recommendations. After all, there is a correlation between race and
socioeconomic status, and probably a correlation between socioeconomic status
and job success - a naive Webmind wouldn't understand why directly accessing
the "Race" field was a bad thing. So there is a strong reason to try to
build a mind capable of understanding those types of subtleties.
A possible-seed Webmind needs a set of robust Friendliness instructions for
what to do in case it becomes capable of implementing the Sysop Scenario, in
addition to whatever its current goals are at the moment. I hope to publish a
document with more specific suggestions sometime soon. Two major points:
Avoid the destruction or modification of any sentient without permission, and
attempt to fulfill any legitimate request after checking for unintended
consequences. ("Legitimate" means not violating the rights of other sentients
or using resources beyond those allocated to the requester.)
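The two rules stated parenthetically above can be made concrete as a request check. The rule content is from the text; the data layout, field names, and allocation table are illustrative stand-ins:

```python
# Sketch of the "legitimate request" test: a request is fulfilled only
# if it neither destroys/modifies a sentient without permission nor
# exceeds the resources allocated to the requester.

ALLOCATIONS = {"alice": 100}  # resource units granted per requester

def is_legitimate(request, allocations=ALLOCATIONS):
    # Rule 1: no destruction or modification of a sentient without
    # that sentient's permission.
    target = request.get("affects_sentient")
    if target is not None and not request.get("has_permission", False):
        return False
    # Rule 2: stay within the resources allocated to the requester.
    budget = allocations.get(request["requester"], 0)
    return request["resources_needed"] <= budget

# A resource-bounded request touching no one else is legitimate:
assert is_legitimate({"requester": "alice", "resources_needed": 50,
                      "affects_sentient": None})
# Modifying another sentient without permission is not:
assert not is_legitimate({"requester": "alice", "resources_needed": 10,
                          "affects_sentient": "bob"})
```

The hard part, of course, is nothing in this sketch: it is recognizing what counts as a sentient, a modification, and valid permission - and checking for unintended consequences before fulfilling even a legitimate request.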
* Flight recorder
One of the possible methodologies I was considering for SingInst is a "flight
recorder" for the AI. The flight recorder constitutes a change-control system
for the AI; it records all source code and changes in source code; all inputs,
including both keystrokes and information requested from the Web, with
sufficient temporal accuracy to enable the reconstruction of the AI's exact
mind-state at any moment in time. The primary use of this system is to detect
unintended input sources or unauthorized tampering by testing
synchronization. The secondary use is so that you have an unlimited amount of
time to detect aberrations in the AI - they don't just fade out, and the AI
can't hide them; the past program state is always accessible.
I don't know if this would be practical for Webmind, or how much it would
cost, but it does strike me as a system that would have uses besides Friendly
AI.
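One way to build such a recorder is as an append-only, hash-chained event log: replaying the log reconstructs the input/change history up to any moment, and re-verifying the chain detects tampering with past records. The class below is a minimal sketch under those assumptions; the field names and API are invented:

```python
# Sketch of a "flight recorder": an append-only, hash-chained log of
# source changes and inputs. Replaying reconstructs past state;
# verifying the chain detects unauthorized tampering.

import hashlib
import json

class FlightRecorder:
    def __init__(self):
        self.log = []
        self._prev_hash = "0" * 64  # genesis value for the chain

    def record(self, kind, payload, timestamp):
        # kind: "source_change", "keystroke", "web_input", ...
        entry = {"kind": kind, "payload": payload,
                 "timestamp": timestamp, "prev": self._prev_hash}
        digest = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self.log.append(entry)
        self._prev_hash = digest

    def verify(self):
        # Detect tampering: every entry must hash correctly and chain
        # to its predecessor.
        prev = "0" * 64
        for entry in self.log:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if digest != entry["hash"]:
                return False
            prev = digest
        return True

    def replay_until(self, t):
        # Reconstruct the input/change history up to time t.
        return [e for e in self.log if e["timestamp"] <= t]

recorder = FlightRecorder()
recorder.record("source_change", "patch goal-system module", timestamp=1)
recorder.record("web_input", "fetched page X", timestamp=2)
assert recorder.verify()
recorder.log[0]["payload"] = "doctored entry"  # simulated tampering
assert not recorder.verify()
```

A real recorder would also have to capture nondeterminism (thread scheduling, random seeds) for exact mind-state reconstruction; the hash chain only guarantees the integrity of what was captured.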
* Commerce and complexity
The complexity of a full-featured Friendly goal system may be impractical for
most commercial systems. However, if Webmind, Inc. starts getting into
self-modifying AI past a certain point, you will probably find it commercially
necessary to split the mind. The Queen AI is proprietary and not for sale; it
runs at Webmind Central on huge quantities of hardware and knows how to
redesign itself. The commercially saleable AIs are produced by the Queen AI,
or with the assistance of the Queen AI, and contain the ready-to-think
knowledge and adaptable skills produced by the Queen AI, but not the secret
and proprietary AI-production and creative-learning systems contained within
the Queen AI. If you set out to sell commercial AIs containing everything you
know, you may find that you can only sell *one* AI.
The Queen AI is the one that needs the full-featured Friendliness system.
* When to implement changes
At present, the probability that Webmind will do a hard takeoff is pretty
small - although if there's any way for Webmind to build and execute
Turing-complete structures, then a nonzero probability already exists.
Similarly, even if Webmind has the ability to do limited rewrites of its own
source code, that is not the same as allowing complete redesigns. However,
once the Queen AI has the ability to do genuine redesigns and take arbitrary
internal actions, a full-featured Friendliness system should probably be in
place.
I hope to publish more on this subject later.
-- -- -- -- --
Eliezer S. Yudkowsky http://intelligence.org/
Research Fellow, Singularity Institute for Artificial Intelligence
This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:00:35 MDT