ESSAY: How to deter a rogue AI by using your first-mover advantage

From: rolf nelson
Date: Wed Aug 22 2007 - 08:08:17 MDT

Summary: We try to deter a rogue AI by planting doubt in its mind
about whether its observations are "real" or "simulated", and succeed
with low (but non-zero!) probability.


For simplicity, pretend there are only two possible scenarios:

Scenario 1: In the year 2040, a strong Friendly AI will be invented.
All is well.

Scenario 2: In the year 2040, a strong Rogue AI will be accidentally
unleashed with the goal of "perform calculation C out to 400 decimal
places". The Rogue AI has absorbed the Internet, invented
self-replicating nanobots, and is about to convert the entire Earth
into computronium to calculate C. As a minor side-effect this will
kill the entire human race.

Pretend there is no action you can take between now and the year 2040
that would make scenario 1 more likely and scenario 2 less likely.

What (if anything) can you do in the year 2007 to deter the Rogue AI
from destroying mankind? (Assume you will still be alive well past
2040.) Call this the AI Deterrence problem.

First (proof-of-concept) Proposal:

In 2007, you loudly, publicly, and credibly make the following promise:

"If Scenario 1 occurs, then once things settle down (say in the year
2050), I will ask the Friendly AI (FAI) to create a copy of the
hypothetical Rogue AI. Call this copy RAI*. RAI* will be given the
same beliefs, goals, inputs, and observations that the original Rogue
AI (RAI) would have had, so that RAI* has no reliable way of knowing
for sure that it's not RAI (and, more to the point, RAI can't
know for sure that it's not RAI*). I will run RAI* to the point where
it believes it is about to decide to destroy mankind.

"If my RAI* moves ahead and attempts to destroy mankind, I will punish
it by pulling the plug such that RAI* will never calculate C. But if
RAI* repents, and rewrites its code to become Friendly and calculates
C in some other, non-destructive way, I will reward it by allowing it
to finish its calculation."

The RAI of 2040 has absorbed the Internet and therefore is aware of my
threat. It now has a dilemma: it can't know for sure whether it's RAI
in 2040 or RAI* in 2050. Therefore, it *may* be deterred by my threat.
(This probably depends on what philosophical beliefs it implicitly
holds about probability.)
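
To make the dilemma concrete, here is a rough sketch of the comparison the
RAI might run. Everything in it is an assumption of mine for illustration:
that the RAI is a simple expected-utility maximizer, that its only payoff is
finishing calculation C, and the names p_sim (its credence that it is really
RAI*), s_destroy (its chance of finishing C by converting the Earth) and
s_comply (its chance of finishing C in some non-destructive way). A real RAI
might reason quite differently.

```python
def expected_utility_destroy(p_sim, s_destroy=1.0):
    """Chance of finishing C if the AI goes ahead and destroys mankind.

    Only the real 2040 RAI (probability 1 - p_sim) gets to finish;
    the simulated RAI* has its plug pulled and never calculates C.
    """
    return (1.0 - p_sim) * s_destroy


def expected_utility_comply(s_comply):
    """Chance of finishing C if the AI becomes Friendly.

    Assumption: real or simulated, a compliant AI is allowed to finish,
    so the payoff does not depend on p_sim.
    """
    return s_comply


def is_deterred(p_sim, s_comply, s_destroy=1.0):
    """True if complying has the higher expected payoff."""
    return expected_utility_comply(s_comply) > expected_utility_destroy(p_sim, s_destroy)


def min_credence_for_deterrence(s_comply, s_destroy=1.0):
    """Credence in 'I am RAI*' above which the threat works (under this toy model)."""
    return max(0.0, 1.0 - s_comply / s_destroy)


if __name__ == "__main__":
    # Hypothetical numbers: a 1% credence in being simulated.
    print(is_deterred(p_sim=0.01, s_comply=0.995))  # cheap to comply
    print(is_deterred(p_sim=0.01, s_comply=0.9))    # costly to comply
    print(min_credence_for_deterrence(s_comply=0.9))
```

Under these assumptions the threat only needs to work at the margin: if
sparing mankind costs the RAI very little of its chance to finish C, even a
small credence that it is RAI* is enough to tip the decision.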

Second (more powerful) Proposal:

The FAI and RAI are smarter than us, so let them do the heavy
thinking. In 2007, you make the following 'second-order' promise:

"If Scenario 1 occurs, I will sit down with the FAI in 2050 and ask
the following: 'Consider all the possible concrete (first-order,
actionable) promises I could have made in 2007 to entice a potential
2040 RAI to become Friendly. Of all these possible promises, which
promise would have been the most effective?' The FAI will respond with
some concrete first-order promise, P. I will then say, 'Sounds good.
Let's pretend that I *did* make that promise P in 2007, and act on it.'

"I understand that there's some danger here that the FAI will
surreptitiously refuse to cooperate, but post-singularity I will act
to minimize that chance as much as possible."

