Re: AI box; self-modeling and computronium catastrophe (was Re: Why friendly AI (FAI) won't work)

From: Nick Hay
Date: Wed Nov 28 2007 - 21:03:10 MST

On Nov 28, 2007 5:58 PM, Nick Tarleton <> wrote:
> On Nov 28, 2007 4:06 PM, Thomas McCabe <> wrote:
> > On Nov 28, 2007 11:49 AM, Harry Chesley <> wrote:
> > > In the former, you just don't supply any output channels except ones
> > > that can be monitored and edited.
> >
> > Won't work. See
> Suppose:
> - The only input to the AI is a very narrow problem definition ("prove
> the Riemann hypothesis"). Notably, the input does not contain anything
> about human psychology or language, or even the existence of a world
> outside its system.
> - The only output comes after the AI has shut down. No interaction.
> How, then, could it manipulate humans into escaping? At worst you'd
> get a safe failure, either because it had insufficient resources or
> because you specified the problem wrong. If this allows safe passive
> AI, it could be very useful.

This increases the minimum competency of the AI, but isn't obviously
enough to make the problem impossible. For example,

(1) the AI might be able to infer the existence of humans from the
problem specification and its source code (consider: how many bits of
information are fed to the AI by its input and its code?).

(2a) the AI may be able to output through multiple channels, e.g.
shuffle bits around to generate EM waves which induce a current in you
your cellphone which... Or perhaps the sandbox is insecure, and the AI can
connect to the internet or hide code elsewhere in the machine e.g.
overwrite the firmware.
(2b) or, if the AI is being run multiple times it may be able to infer
this and cooperate with its later executions somehow,
(2c) or, it works out human psychology sufficiently so as to write the
magic symbol/paragraph which both looks like a solution and makes the
reader really want to connect the AI's box to the internet
(2d) or, the programmers (predictably) peek at the AI's memory, or
find they need to run an interactive process to get real results.

I don't know whether any of these ideas actually work, but it seems
like something may well work. It is not too hard to generate a lot of
potential vulnerabilities like this. In general you can try to patch
against each of these, but without a more rigorous method you can't
patch against all possible holes.

Btw, the archives are voluminous, and this has (I think) been
discussed there before. You might find interesting things there.

> > > This slows things down tremendously,
> > > but is much safer. In the latter, you just don't build in any motivations
> > > that go outside the internal analysis mechanisms, including no means of
> > > self-awareness. In essence, design it so it just wants to understand,
> > > not to influence.
> >
> > Understanding requires more computronium. Hence, computronium = good.
> > Hence, humans will be converted to computronium, as quickly as
> > possible.
> I have a reason to think this might not happen even in an unboxed AI.
> As I understand, part of the AGI problem is getting the AI to model
> itself as a continuous part of the world rather than a wholly separate
> entity (AIXI-style), so that e.g. it 'understands' that it should
> prevent its physical housing from being damaged. Might an AI that
> models itself as something separate not be able to 'conceive' of
> expanding its computational resources by manipulating the _external_
> world? Of course, I'm not suggesting anyone try this - I'm hardly
> confident of it - and it doesn't seem as useful as my first idea.

If it models the external world well enough to avoid damaging itself,
there seems to be a great risk it would be able to model more, e.g. the
effect of computronium (AIXI wouldn't make computronium, although I
can't prove it, but it would overwrite reality, although I can't prove
that either). Holding modeling accuracy at exactly this level is a
relatively fine line to walk for a sufficiently powerful AI.

> > > Third, defining FAI is as bug-prone as implementing it. One small
> > > mistake in the specification, either due to lack of foresight or human
> > > error (say, a typo), and it's all for nothing.
> > That's what CEV is for, see
> > The idea is that you don't specify Friendliness content; you specify
> > the process to derive Friendliness content.
> CEV is itself content capable of being mis-specified, but I basically agree.

Yes, CEV is not the first part of the AI development process. It
would catch a certain class of programmer errors, but you need more
error correcting mechanisms below it, and perhaps after it.

-- Nick Hay
