AI box; self-modeling and computronium catastrophe (was Re: Why friendly AI (FAI) won't work)

From: Nick Tarleton
Date: Wed Nov 28 2007 - 15:58:46 MST

On Nov 28, 2007 4:06 PM, Thomas McCabe wrote:
> On Nov 28, 2007 11:49 AM, Harry Chesley wrote:
> > In the former, you just don't supply any output channels except ones
> > that can be monitored and edited.
> Won't work. See

- The only input to the AI is a very narrow problem definition ("prove
the Riemann hypothesis"). Notably, the input does not contain anything
about human psychology or language, or even the existence of a world
outside its system.
- The only output comes after the AI has shut down. No interaction.

How, then, could it manipulate humans into letting it out? At worst
you'd get a safe failure, either because it had insufficient resources
or because you specified the problem wrong. If this makes a safe,
passive AI possible, it could be very useful.

> > This slows things down tremendously,
> > but is much safer. In the latter, you just don't build in any motivations
> > that go outside the internal analysis mechanisms, including no means of
> > self-awareness. In essence, design it so it just wants to understand,
> > not to influence.
> Understanding requires more computronium. Hence, computronium = good.
> Hence, humans will be converted to computronium, as quickly as
> possible.

I have a reason to think this might not happen even in an unboxed AI.
As I understand it, part of the AGI problem is getting the AI to model
itself as a continuous part of the world rather than a wholly separate
entity (AIXI-style), so that e.g. it 'understands' that it should
prevent its physical housing from being damaged. Might an AI that
models itself as something separate not be able to 'conceive' of
expanding its computational resources by manipulating the _external_
world? Of course, I'm not suggesting anyone try this - I'm hardly
confident of it - and it doesn't seem as useful as my first idea.

> > Third, defining FAI is as bug-prone as implementing it. One small
> > mistake in the specification, either due to lack of foresight or human
> > error (say, a typo), and it's all for nothing.
> That's what CEV is for, see
> The idea is that you don't specify Friendliness content; you specify
> the process to derive Friendliness content.

CEV is itself content capable of being mis-specified, but I basically agree.
