Re: Friendliness not an Add-on

From: J. Andrew Rogers
Date: Sun Feb 19 2006 - 12:57:03 MST

On Feb 19, 2006, at 6:12 AM, Ben Goertzel wrote:
> Of course, there is the following problem: If one has an AI system
> that is able to self-improve via adding new physical resources to
> itself as well as revising its code, THEN the future algorithmic
> information of this AI system may vastly exceed the algorithmic
> information of the initial version, plus the algorithmic information
> of the human society creating the AI system and all its other
> computers, etc. In this case, it would seem there is fundamentally no
> way for the human AI-creators and the initial AI to prove the
> Friendliness of the future AI system, because of the "10 pound formal
> system can't prove a 20 pound theorem" problem. (In the language of
> the above discussion of Rice's Theorem, when one allows the addition
> of a lot of new physical compute power, this makes it too hard to
> delimit the class of possible "algorithms" to be verified in
> advance.)

There is a general argument against bolt-on Friendliness that uses
this line of reasoning, the self-modeling Godel problem.

Any external Friendliness Verifier will either be more complex than the
AI it is verifying (in which case we are only shifting the problem) or
it will not work. If the external FV has imperfect information, it
can be gamed by exploiting the asymmetries in the Godel problem
mentioned above. Effective countermeasures against a trivial FV are
computationally *much* cheaper than maintaining the integrity and
quality of the FV. A countermeasure can be as simple as dumping tons of
low-order information through the FV (which may happen anyway as a
consequence of basic AI function) such that the FV quickly hits its
resource limits with a disconcertingly high error bound on false
positives. There are alternative ways of constructing this scenario,
but they all lead to the result that either the FV can be gamed at
modest expense to the AI or the FV itself is a bigger threat than the
AI it is verifying. One could argue that for very simple and
primitive AI implementations, external FV is a "good enough"
temporary measure if sufficiently clever in implementation.
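
As a rough illustration of the flooding scenario above, here is a toy
sketch in Python. Everything in it is an assumption made for
illustration: the ExternalVerifier class, the fixed per-step inspection
budget, and the use of "fraction of the stream left uninspected" as a
stand-in for the error bound. It is not a model of any real verifier.

```python
from dataclasses import dataclass


@dataclass
class ExternalVerifier:
    """Toy external FV: can fully inspect at most `budget` items per step.

    Both the class and the budget model are hypothetical.
    """
    budget: int

    def unchecked_fraction(self, stream_len: int) -> float:
        """Fraction of a stream of `stream_len` items that goes uninspected."""
        if stream_len <= 0:
            return 0.0
        inspected = min(self.budget, stream_len)
        return 1.0 - inspected / stream_len


fv = ExternalVerifier(budget=1_000)
# The AI only pays for emitting more low-order items; the FV's blind spot
# grows as the stream outruns its fixed budget.
for stream_len in (500, 1_000, 10_000, 1_000_000):
    print(stream_len, fv.unchecked_fraction(stream_len))
```

In this toy model the uninspected fraction stays at zero while the
stream fits within the budget and approaches one as the stream grows,
which is the asymmetry the paragraph describes.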

Integrating Friendliness directly in the control function of the
machinery does not avoid this issue, but it does change the nature of
the game. With a trivial external FV, the AI can play a game of
resource chicken in which it has the advantage until the FV breaks,
and an FV will not "fail safe" by definition. With Friendliness
Verification integrated into the machinery and not being distinct
from it, the AI is at a permanent disadvantage because the AI would
have to game itself, with the asymmetries of the Godel problem
working against it rather than for it. The AI will essentially be
verifying itself as a precondition for action, bounding the actions
at a low level to those it has the resources to verify. This does
not necessarily guarantee Friendliness over the long term and
"verify" is somewhat undefined here, but as a model it does prevent
some trivial exploits. I will add that I am not attempting to
prescribe any particular implementation theory here, or even saying
that robust integrated Friendliness is possible; I am just
characterizing the problem of the problem.
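
The "verify as a precondition for action" model can be sketched in a
few lines of Python. This is only a toy under stated assumptions: the
Action and SelfVerifyingAgent names, the integer verify_cost, and the
passes_check predicate are all hypothetical, and passes_check stands in
for "verify", which the text above leaves undefined.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str
    verify_cost: int  # hypothetical cost of verifying this action


class SelfVerifyingAgent:
    """Toy agent whose verification step is part of its own control loop."""

    def __init__(self, resources: int, passes_check: Callable[[Action], bool]):
        self.resources = resources        # resources available for verification
        self.passes_check = passes_check  # placeholder for "verify"

    def may_act(self, action: Action) -> bool:
        # An action the agent cannot afford to verify is never taken,
        # so the action space is bounded by the agent's own resources.
        if action.verify_cost > self.resources:
            return False
        return self.passes_check(action)


agent = SelfVerifyingAgent(resources=10, passes_check=lambda a: a.name != "harm")
print(agent.may_act(Action("small step", 3)))    # affordable and passes
print(agent.may_act(Action("huge rewrite", 50)))  # too costly to verify
print(agent.may_act(Action("harm", 1)))           # verified, but rejected
```

The design point is that there is no separate verifier to exhaust:
exceeding the verification budget removes the action rather than
letting it through unchecked.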

In short, putting Friendliness Verification in some approximation of
an external machine context appears to be a generally exploitable
vulnerability. It needs to be a property of the AI machinery itself
to have some semblance of robustness. Most AI researchers are
counting on this not being the case, but it does not appear to be a
reasonable assumption as far as I can see.

J. Andrew Rogers
