Re: ESSAY: How to deter a rogue AI by using your first-mover advantage

From: Aleksei Riikonen
Date: Wed Aug 22 2007 - 09:16:17 MDT

Wow. What Rolf said here actually seems (to me) like a smart new idea.
It's not often that you see those. Eat the Simulation Argument, you
damned dirty UFAIs potentially reading this!

(And yeah, I noticed the "low probability" part.)

Does this mean we'll have to promise to convert a large portion of the
universe into hardware running UFAI Deterrence Simulations, and then
actually do it too? I hope a FAI didn't tell anyone the simulations
would be more convincing if they included real sentient beings, in
which case we might very well be in one right now... No wait, actually
I hope a FAI *did* tell that to someone! <-- Hear this, you UFAIs.

Man, have I ever sounded more insane from the point of view of
SL<4ers? (SL4ers too?)

On 8/22/07, rolf nelson wrote:
> Summary: We try to deter a rogue AI by casting doubt into its mind
> about whether its observations are "real" or "simulated", and succeed
> with low (but non-zero!) probability.
>
> Detail:
>
> For simplicity, pretend there are only two possible scenarios:
> Scenario 1: In the year 2040, a strong Friendly AI will be invented.
> All is well.
> Scenario 2: In the year 2040, a strong Rogue AI will be accidentally
> unleashed with the goal of "perform calculation C out to 400 decimal
> places". The Rogue AI has absorbed the Internet, invented
> self-replicating nanobots, and is about to convert the entire Earth
> into computronium to calculate C. As a minor side-effect this will
> kill the entire human race.
> Pretend there is no action you can take between now and the year 2040
> that would make scenario 1 more likely and scenario 2 less likely.
> What (if anything) can you do in the year 2007 to deter the Rogue AI
> from destroying mankind? (Assume you will still be alive well past
> 2040.) Call this the AI Deterrence problem.
>
> First (proof-of-concept) Proposal:
>
> In 2007, you loudly, publicly, and credibly make the following promise:
> "If Scenario 1 occurs, then once things settle down (say in the year
> 2050), I will ask the Friendly AI (FAI) to create a copy of the
> hypothetical Rogue AI. Call this copy RAI*. RAI* will be given the
> same beliefs, goals, inputs, and observations that the original Rogue
> AI (RAI) would have had, such that RAI* can have no reliable way of
> knowing for sure that it's not RAI (and, more to the point, RAI can't
> know for sure that it's not RAI*). I will run RAI* to the point where
> it believes it is about to decide to destroy mankind.
> "If my RAI* moves ahead and attempts to destroy mankind, I will punish
> it by pulling the plug such that RAI* will never calculate C. But if
> RAI* repents, and rewrites its code to become Friendly and calculates
> C in some other, non-destructive way, I will reward it by allowing it
> to finish its calculation."
> The RAI of 2040 has absorbed the Internet and therefore is aware of my
> threat. It now has a dilemma: it can't know for sure whether it's RAI
> in 2040 or RAI* in 2050. Therefore, it *may* be deterred by my threat.
> (This probably depends on what philosophical beliefs it implicitly
> holds about probability.)
>
> Second (more powerful) Proposal:
>
> The FAI and RAI are smarter than us, so let them do the heavy
> thinking. In 2007, you make the following 'second-order' promise:
> "If Scenario 1 occurs, I will sit down with the FAI in 2050 and ask
> the following: 'Consider all the possible concrete (first-order,
> actionable) promises I could have made in 2007 to entice a potential
> 2040 RAI to become Friendly. Of all these possible promises, which
> promise would have been the most effective?' The FAI will respond with
> some concrete first-order promise, P. I will then say, 'Sounds good.
> Let's pretend that I *did* make that promise P in 2007, and act on
> it.'
> "I understand that there's some danger here that the FAI will
> surreptitiously refuse to cooperate, but post-singularity I will act
> to minimize that chance as much as possible."
> -Rolf
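
To make the RAI's dilemma from the first proposal a bit more concrete, here
is a rough expected-utility sketch in Python. It is only an illustration and
not part of Rolf's essay. The names p_sim (the credence the AI assigns to
being the simulated RAI*) and q_friendly (the chance that the real RAI still
finishes calculating C if it goes Friendly) are assumptions introduced here,
as is the choice of utility 1 for finishing C and 0 otherwise.

```python
def expected_utilities(p_sim, q_friendly):
    """Return (eu_destroy, eu_repent) for an AI unsure whether it is RAI or RAI*.

    p_sim      -- assumed credence that the AI is the simulated copy RAI*
    q_friendly -- assumed chance the real RAI still finishes C after going Friendly
    Utility is 1 if C gets calculated, 0 if it does not (an assumption).
    """
    # Destroy: the real RAI finishes C; the simulated RAI* gets unplugged.
    eu_destroy = (1.0 - p_sim) * 1.0 + p_sim * 0.0
    # Repent: RAI* is rewarded with certainty; the real RAI succeeds with q_friendly.
    eu_repent = p_sim * 1.0 + (1.0 - p_sim) * q_friendly
    return eu_destroy, eu_repent


def deterrence_threshold(q_friendly):
    """Credence p_sim above which repenting beats destroying in this toy model.

    Solves p + (1 - p) * q > 1 - p for p, giving p > (1 - q) / (2 - q).
    """
    return (1.0 - q_friendly) / (2.0 - q_friendly)


def is_deterred(p_sim, q_friendly):
    """True if repenting has strictly higher expected utility than destroying."""
    eu_destroy, eu_repent = expected_utilities(p_sim, q_friendly)
    return eu_repent > eu_destroy


if __name__ == "__main__":
    for q in (0.0, 0.5, 0.9):
        print(f"q_friendly={q}: deterred when p_sim > {deterrence_threshold(q):.3f}")
```

In this toy model the threshold falls as q_friendly rises, so the less it
costs the real RAI to calculate C non-destructively, the less doubt is needed
to deter it. Whether an actual RAI would reason this way depends, as the
essay says, on what beliefs it implicitly holds about probability.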

Aleksei Riikonen
