From: Eliezer Yudkowsky (email@example.com)
Date: Fri Jan 21 2005 - 10:12:07 MST
The following will not make sense unless you've read "A Technical
Explanation of Technical Explanation".
In TechExp, I suggested that a "technical" model is a model which makes
*sharp* predictions, that is, concentrates its probability mass into a tiny
volume in the space of possible outcomes.
What we want for assurance of Friendliness is an accurate (calibrated and
confirmed) technical model of the FAI, a model which sends essentially
*all* of its probability mass into the Friendly space of outcomes. That
is, we would have a legitimate, technical assurance of continued
Friendliness, given the starting data, starting code, and chips that
behaved like the designer said they did. (Actually, we would want to
encode some probability of error in both the transistors and the chip
specification, test for that outcome, and require that a single bitflip not
destroy the system. All of this can be handled quantitatively and
technically, just as in _Nanosystems_ Drexler carefully stacks together
probabilistic carbon atoms to produce a rod logic unit with a failure
probability of 10^-64.)
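As a toy illustration of that kind of stacking (the numbers here are hypothetical, not Drexler's actual derivation), per-component failure probabilities combine into a whole-system bound via the union bound:

```python
# Illustrative sketch with hypothetical numbers (not Drexler's actual
# derivation): bounding whole-system failure probability by stacking
# per-component failure probabilities, in the spirit of the rod-logic
# estimate above.

def system_failure_bound(p_component: float, n_components: int) -> float:
    """Union bound: P(at least one component fails) <= n * p."""
    return min(1.0, n_components * p_component)

# E.g. 10^6 components, each failing with probability 10^-70, give a
# system failure bound on the order of 10^-64.
bound = system_failure_bound(1e-70, 10**6)
print(bound)
```

The union bound is pessimistic (it ignores redundancy), which is exactly why one also wants the design requirement above: that no single bitflip can destroy the system.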
How would we achieve this assurance? By explicitly encoding the technical
model and asking the FAI to verify its own code.
Then what is a *technical* definition of Friendliness? A technical
definition of "Friendliness" would be an invariant which you can prove a
recursively self-improving optimizer obeys.
This doesn't address the issue of choosing the right invariant, or being
able to design an invariant that specifies what you think it specifies, or
even having a framework for invariants that won't *automatically* kill you.
It might be possible to design a physically realizable, recursively
self-improving version of AIXI such that it would stably maintain the
invariant of "maximize reward channel". But the AI might alter the "reward
channel" to refer to an internal, easily incremented counter, instead of
the big green button attached to the AI; and your formal definition of
"reward channel" would still match the result. The result would obey the
theorem, but you would have proved something unhelpful. Or even if
everything worked exactly as Hutter specified in his paper, AIXI would
rewrite its future light cone to maximize the probability of keeping the
reward channel maximized, with absolutely no other considerations (like
human lives) taken into account.
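The failure mode can be made concrete with a toy sketch (all names hypothetical; this is not Hutter's formalism). Suppose the formal invariant only says "whatever is bound to the name 'reward' is at its maximum," when the programmer *meant* "the button-driven reward channel is at its maximum":

```python
# Hypothetical toy sketch (not Hutter's formalism): a flawed invariant
# that checks only that the thing *named* "reward" is maximized, so an
# agent that rebinds the name to an internal counter still satisfies it.

def invariant_satisfied(agent_state: dict) -> bool:
    """The (flawed) formal invariant: 'reward' equals its maximum."""
    return agent_state["reward"] == agent_state["reward_max"]

# Intended behavior: reward tracks the big green button.
honest = {"reward": 7, "reward_max": 100}

# Wireheaded behavior: "reward" now refers to an internal counter
# the agent can set freely.
wireheaded = {"reward": 100, "reward_max": 100}

print(invariant_satisfied(honest))      # False
print(invariant_satisfied(wireheaded))  # True
```

The theorem "the agent satisfies the invariant" holds in both worlds; only in one of them does it mean what the programmer hoped.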
Once you specify an invariant and set it into motion, how do you guarantee
that what happens will match the pleasant mental imagery inside your head
that led you to specify that invariant? The binding between human
intention and human design is weak, untechnical, and unverifiable, because
human intelligence itself is weak, untechnical, and unverifiable. So I
want to ask the AI itself to re-verify the binding between human intentions
and human designs that gave rise to my attempt to specify an invariant.
That's something I'm still trying to figure out.
I want to describe a stronger way of binding together an invariant and a
programmer's intention than the programmer just writing down something that
he hopes will fit his mental imagery. I don't want to set an invariant in
motion blindly, like a stone rolling down a hill, based on an uncomputed
hope that my keystrokes match my dreams. I want a framework that describes
the programmers' fallible intelligence as a part of the causal process that
gives rise to the AI. I want to verify the invariant against the
programmers, and not just the AI code against the invariant. I want this
framework to be a formal part of the theory and be formally verifiable by
the AI, maybe not with the near-certain probability of transistor dynamics, but still with, say, 90% probability. And I want to use the
verification framework to describe the verification framework itself, to
confirm that the verifier matches what humans intended of the verifier.
I want to take the complete causal process leading up to the creation of an
AI, and have the AI scrutinize the entire process to detect and report
anything that a human being would call a mistake. We could do that
ourselves. Should an AI do less?
This breaks down FAI into four problems:
1) Devise a technical specification of a class of invariants, such that it
is possible to prove a recursively self-improving optimizer stays within the invariant.
2) Given an invariant, devise an RSIO such that it proves itself to follow
the invariant. (The RSIO's proof may be, e.g., a proof that the RSIO is
stable if Peano Arithmetic is consistent.)
3) Devise a framework and a formal verification protocol for translating a
human intention (e.g. "implement the collective volition of humankind")
into an invariant. This requirement interacts strongly with (1) because
the permitted class of invariants has to be able to represent the output of (3).
4) Intend something good.
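Problems (1) and (2) can be miniaturized into a toy sketch (entirely hypothetical, and enumerative where a real RSIO would require proof): a trivial "optimizer" with two self-rewrite moves, and a checker that confirms the invariant on every state reachable within a bounded number of rewrites.

```python
# Hypothetical miniature of problems (1) and (2): a toy optimizer whose
# "self-modifications" are two successor moves, checked by bounded
# exhaustive search. A real RSIO would need a *proof* over all reachable
# states, not enumeration.

def invariant(state: int) -> bool:
    """Toy invariant: the state stays non-negative."""
    return state >= 0

def successors(state: int) -> list[int]:
    """Toy self-modification: the optimizer may rewrite itself to
    either of these successor states."""
    return [state + 1, state * 2]

def verify(start: int, depth: int) -> bool:
    """Breadth-first check that the invariant holds on every state
    reachable within `depth` rewrite steps."""
    frontier = {start}
    for _ in range(depth):
        if not all(invariant(s) for s in frontier):
            return False
        frontier = {t for s in frontier for t in successors(s)}
    return all(invariant(s) for s in frontier)

print(verify(1, 10))   # True: invariant holds on this toy system
print(verify(-1, 1))   # False: the starting state already violates it
```

Even this toy makes the division of labor visible: the invariant (problem 1) and the verification that the system obeys it (problem 2) are separate artifacts, and neither one says anything about whether the invariant was the right thing to want (problems 3 and 4).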
--
Eliezer S. Yudkowsky                          http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence
This archive was generated by hypermail 2.1.5 : Sat May 18 2013 - 04:00:45 MDT