From: Stuart Armstrong (dragondreaming@googlemail.com)
Date: Wed Mar 12 2008 - 08:11:49 MDT
Dear Rolf,
Thanks for your comments!
> I can think of definitions of trustworthy that are
> useful-to-have-implemented (like "won't kick off a line of descendants that
> will eventually kill me") and definitions of trustworthy that are
> practical-to-measure (like "won't stab me in the next 30 seconds"), which do
> you mean when you use the word "trustworthy" in the paper?
The ambiguity in "trustworthy" is part of the problem. I tend to use
it in the sense of "telling the truth", and acting on what it says it
will do. My position is that the words used - "honest", "trustworthy",
"safe" - mean different things in different circumstances. My friend
down the pub may be honest, trustworthy and safe - but make him
president of the US, without changing anything about him, and he will
remain pretty honest, but become untrustworthy (since trustworthiness
in a politician is something different from what it is in a friend) and
probably very unsafe.
> The chaining system looks isomorphic to a subset of self-improving systems,
> where the step
>
> A -> A + A' (AI A creates AI A' and continues running)
>
> maps to
>
> A -> [C + A + A'] (AI rewrites its code to become a different AI, which is
> itself a composite of A, A', and a control module that gives A veto power
> over A'). Framed this way, is there a short explanation of why this
> limitation (on how an AGI can modify itself) is helpful? If this limitation
> is required to keep some kind of invariant, it might be better to specify
> the invariant directly,
The answer to this is that I do not believe this invariant can be
specified in advance ("honesty", maybe; "safeness" (or friendliness)
definitely not). At different levels of power and understanding, these
terms need to be redefined. The closest analogy is the difference
between how adults and children would define them. Or, take an upload
of the most moral individual you can find; if you were to multiply his
intelligence by a million, would you still be sure of how "safe" or
moral he was?
The procedure is:
1) A figures out an invariant I, with some human help, which
encompasses its best (and our best) understanding of what "safe",
"honest" and "trustworthy" mean.
2) A implements I within itself.
3) A constructs the next AI, called A', and verifies whether this AI
has the invariant I (to within very narrow bounds). If not, it vetoes
it; if so, it allows A' to proceed.
4) A' then constructs its own invariant I', checks it with us and with
the lower AIs, and the procedure repeats itself.
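To make the loop concrete, here is a minimal sketch in Python of how
the build-verify-veto cycle could be wired together. The helper names
(derive_invariant, adopt, build_successor, verify, approves) are purely
illustrative placeholders, not anything from the paper.

# Sketch of the chaining procedure: each generation derives an invariant
# with human help, builds a successor, and retains veto power over it.
def chain(current_ai, humans, lower_ais, tolerance=1e-6, generations=10):
    for _ in range(generations):
        # The current AI works out its invariant I with human help...
        invariant = current_ai.derive_invariant(humans)
        # ...and implements I within itself before building anything new.
        current_ai.adopt(invariant)

        # It constructs the candidate successor A'.
        successor = current_ai.build_successor()

        # Verify the successor satisfies I to within very narrow bounds;
        # the current AI keeps veto power over it.
        if not current_ai.verify(successor, invariant, tolerance):
            continue  # veto: discard this candidate

        # Touch base: do not proceed without approval from us and from
        # the lower AIs already in the chain.
        if not all(judge.approves(successor, invariant)
                   for judge in [humans, *lower_ais]):
            continue  # veto by humans or by an earlier AI

        # The successor joins the chain and the procedure repeats, with
        # the successor refining its own invariant on the next pass.
        lower_ais.append(current_ai)
        current_ai = successor
    return current_ai

This is only one way of arranging the steps; the essential points are
that verification happens before the successor runs, and that humans
and every earlier AI keep a veto at each generation.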
> In other words, why do you believe the proposed
> system wouldn't take a random walk away from "humanity lives" scenarios and
> towards "humanity dies" scenarios?
It's worse than that: I feel that, keeping everything constant, the
system will walk from "humanity lives" to "humanity dies" quite
naturally. This has to be corrected for at each stage. My reasons for
thinking this is possible are:
1) A lot of the AI's resources are devoted to ensuring such a drift
doesn't happen.
2) The more advanced AIs will know more, not less, about what
circumstances would cause humanity to end. They will be able to design
an AI that is safer than they are themselves, and definitely safer than
a lower-level AI.
3) They constantly "touch base" with us, and will not proceed without
our approval. This means that the drift to "humanity dies" has to
happen in such a way that we don't realise it. That would be easy to do
if the AI were out to get us; but since it isn't, and since it will be
searching the realm of our future possibilities with great precision,
we would see such a drift before it was complete (and, hopefully, well
before the AI is out of control).
Again, thanks for your comments,
Stuart