From: Stuart Armstrong (firstname.lastname@example.org)
Date: Thu Aug 07 2008 - 20:37:53 MDT
Some built-in limitations are very useful for an entity, even if they
depart from perfect rationality. For instance, a propensity for
violent anger is a very worthwhile trait in many societies and
situations; the ability to reject an advantageous deal is a necessary
trait for negotiators; the inability to break a contract brings great
dividends. Often, organisations impose such limits on themselves; the
logic of MAD worked only as long as you could convince your enemy that
you could not refuse to retaliate after an attack – which meant making
changes in your own organisation so that retaliation couldn't be
derailed. An old strategy of the Swiss army was based on the fact that
its generals could never order their soldiers to surrender.
For the singularity, however, we want the AI(s) to have a particular
limitation: be "nice" to humans (for whatever value of "nice"). Is
there some way to set up the situation so that being nice to humans is
an advantageous limitation?
Comparative advantage suggests one possibility: if the resources
required to run humans are small, it might be in the AI's interests to
farm out certain problems to humankind rather than designing its own
solving software. But I wouldn't count too much on that. Trying to
implement advantageous limitations in the case of a fast singularity
is probably pointless as well.
I propose an idea that might be useful in the case of a slow
singularity where the AI's code is obscure and AI competition is rife
(this is a situation I have not seen dealt with much so far – it's
also highly dangerous).
(Assumptions: "niceness", obscure code, and "keeping one's word" are
notions that can be sensibly extended to higher levels of intelligence.)
The idea would be that, once a reasonably intelligent AI is designed
with obscure code (obscure meaning that a comparable intelligence can
get little information from the code itself, compared with observing
the AI's behaviour), it be copied extensively and spread about the
world, with certain crucial differences: the level of "niceness" to
humans would be randomised, but correlated strongly with an
advantageous limitation (for instance, the inability to break one's
word). Both of these would also be correlated with a desire to keep
one's own code obscure. A few AIs should be added whose behaviours
vary in this correlated fashion through time.
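The correlation scheme can be sketched in a few lines of Python. The population size, the correlation strength rho, and the fifty-fifty base rate are all illustrative assumptions of mine, not part of the proposal:

```python
import random

def make_ai_population(n=1000, rho=0.9, seed=0):
    """Toy population of AIs: 'niceness' is randomised, but strongly
    correlated (with assumed strength rho) with keeping one's word
    and with keeping one's own code obscure."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        nice = rng.random() < 0.5  # niceness is a coin flip (assumption)
        # With probability rho the other traits simply copy niceness;
        # otherwise they are drawn independently at random.
        keeps_word = nice if rng.random() < rho else rng.random() < 0.5
        obscures_code = nice if rng.random() < rho else rng.random() < 0.5
        population.append({"nice": nice,
                           "keeps_word": keeps_word,
                           "obscures_code": obscures_code})
    return population
```

With rho near 1, observed niceness becomes a strong statistical proxy for word-keeping, which is the property the mechanism relies on.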
The idea then is that AIs would be unable to determine whether the
entity they are interacting with is trustworthy. All they would know
is that one that is "nasty" to humans, or that tries to clarify its
source code, is untrustworthy. Hence it will be advantageous for AIs
always to be nice to humans; hopefully it will be in the AIs' interest
to integrate "niceness" into themselves as they upgrade their
intelligence (this would be the case if, for instance, it were
partially possible to distinguish between true niceness and mere
faking). As long as the cost of being nice is low, and the AIs are
unable to clear up their source code in a single upgrade, this should
be the stable attractor.
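A toy payoff calculation under the assumptions of that paragraph (every number below is invented for illustration): if other AIs only deal with agents whose observed behaviour is nice, and niceness is cheap, then being nice dominates being nasty:

```python
def lifetime_payoff(nice, cost_of_niceness=0.05, deal_value=1.0,
                    n_deals=100, p_trusted_if_nice=0.95,
                    p_trusted_if_nasty=0.05):
    """Illustrative payoff model: other AIs mostly refuse to deal with
    agents whose observed behaviour is nasty, because nastiness marks
    them as untrustworthy. Being nice costs a little per deal but wins
    far more deals. All parameters are assumed, not derived."""
    p_trusted = p_trusted_if_nice if nice else p_trusted_if_nasty
    per_deal = deal_value - (cost_of_niceness if nice else 0.0)
    return n_deals * p_trusted * per_deal
```

Under these assumed numbers a nice AI expects roughly 90 units against a nasty AI's 5, so niceness pays as long as its cost stays small relative to the value of being trusted.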
Note to those familiar with signalling in evolution: this is not the
same set-up. Signalling positive attributes generally requires
behaviours that are costly to fake; here, the behaviour – niceness to
humans – is cheap to fake, and is a necessary, not sufficient, signal
of another attribute. The equilibrium will be different, in that it
should be in every AI's interest to fake.
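That difference from costly signalling can be made concrete in a minimal sketch (again, all numbers are illustrative assumptions): because faking niceness is cheap and nastiness marks you untrustworthy, every AI, whatever its true type, chooses to display niceness — a pooling outcome rather than a separating one:

```python
def signalling_choice(true_nice, fake_cost=0.01, signal_benefit=1.0):
    """Unlike classical costly signalling, where faking is expensive
    enough to separate types, here faking niceness is assumed cheap,
    so both nice and nasty AIs display niceness (pooling)."""
    payoff_display_nice = signal_benefit - (0.0 if true_nice else fake_cost)
    payoff_display_nasty = 0.0  # observed nastiness forfeits all trust
    return "nice" if payoff_display_nice > payoff_display_nasty else "nasty"
```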
If people think this is interesting, I'll try to model it
mathematically and get a handle on its stability.