Re: Friendliness not an Add-on

From: Ben Goertzel (ben@goertzel.org)
Date: Mon Feb 20 2006 - 17:16:18 MST


Charles,

Indeed, defining Friendliness in a way that is precise and yet
captures our intuitions is one major problem.

However, another, separate problem is that even if this definition
problem is solved, there is likely no way to make an AI that is in any
useful sense guaranteed to continue to satisfy Friendliness as it
self-modifies over time. This is because once the AI gets
fundamentally more algorithmically complex than us, it would seem to
defy our ability to prove anything about it.

To work around this problem in a very limited way, last year I
suggested the ITSSIM approach to iterated Friendliness, which
basically requires that when an AGI self-modifies, it does so only in
such a way that it believes its self-modified version will still be
Friendly according to its own standards and will also continue to
follow the ITSSIM approach. But this doesn't fully solve the problem
either, and I have a feeling there will be no solution...
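
Just to make the shape of that requirement concrete, here is a minimal
Python sketch of the kind of self-modification gate I have in mind.
All of the names are invented for illustration, and the two predicates
are placeholders for exactly the evaluation machinery that is hard
(perhaps impossible) to build reliably:

# Hypothetical sketch of an ITSSIM-style self-modification gate.
# The two predicates below are placeholders, not a real implementation.

def believes_friendly(candidate) -> bool:
    """Does the current system judge the candidate successor to be
    Friendly, by the current system's own standards? (Placeholder.)"""
    raise NotImplementedError

def preserves_gate(candidate) -> bool:
    """Does the candidate successor itself apply this same gate to all
    of its own future self-modifications? (Placeholder.)"""
    raise NotImplementedError

def attempt_self_modification(current_system, candidate):
    # Adopt the modification only if both conditions are believed to
    # hold; otherwise keep running the current system unchanged.
    if believes_friendly(candidate) and preserves_gate(candidate):
        return candidate
    return current_system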

Singularities and guarantees, it would appear, probably don't mix very well...

-- Ben G

On 2/20/06, Charles D Hixson <charleshixsn@earthlink.net> wrote:
> On Sunday 19 February 2006 04:37 am, Ben Goertzel wrote:
> > Hi,
> >
> > About Rice's theorem... Sorry, I did not phrase my argument against
> > the relevance of this theorem very carefully. Here goes again.
> > Hopefully this reformulation is sufficiently precise.
> >
> > What that theorem says (as you know) is that for any nontrivial
> > semantic property P (roughly: any property of a program's behavior
> > that holds for some programs and not others) it is impossible to make
> > a program that will tell you, for all algorithms A, whether A has
> > property P.
> >
> > In other words, it says:
> >
> > For every nontrivial property P, it is not true that:
> > { there exists a program Q such that
> > { for all algorithms A, Q will tell you whether A has property P
> > }}
> >
> > But a Friendliness verifier does not need to do this. A Friendliness
> > verifier just needs to verify whether
> >
> > * a certain class of algorithms A (the ones that the AI system in
> > question is plausibly likely to ultimately self-modify into)
> >
> > satisfy
> >
> > * a particular property P: Friendliness
> >
> > The existence of a Friendliness verifier of this nature is certainly
> > not ruled out by Rice's Theorem.
> >
> > The problems are in formulating what is meant by Friendliness, and
> > defining the class of algorithms A.
> >
> > A log of the complete history of an AI system is not necessary in
> > order to define the plausible algorithm-class; this definition could
> > potentially be given by a priori knowledge about the nature of the AI
> > system in question.
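> >
> > To make the contrast with Rice's Theorem explicit, here is a minimal
> > Python sketch (all names invented purely for illustration) of the
> > restricted sort of verifier I mean -- one that only claims an answer
> > for algorithms inside a pre-specified class, and declines to answer
> > outside it:
> >
> > from enum import Enum
> >
> > class Verdict(Enum):
> >     FRIENDLY = 1
> >     UNFRIENDLY = 2
> >     OUTSIDE_CLASS = 3   # the verifier makes no claim here
> >
> > def in_plausible_class(algorithm) -> bool:
> >     """Membership test for the class of algorithms the AI could
> >     plausibly self-modify into, defined from a priori knowledge of
> >     the architecture. (Placeholder.)"""
> >     raise NotImplementedError
> >
> > def friendliness_check(algorithm) -> bool:
> >     """Decision procedure claimed to be valid only within the
> >     restricted class. (Placeholder.)"""
> >     raise NotImplementedError
> >
> > def verify(algorithm) -> Verdict:
> >     # Rice's Theorem rules out deciding a nontrivial property for ALL
> >     # algorithms; it says nothing about a checker that may answer
> >     # OUTSIDE_CLASS for algorithms it was never designed to handle.
> >     if not in_plausible_class(algorithm):
> >         return Verdict.OUTSIDE_CLASS
> >     if friendliness_check(algorithm):
> >         return Verdict.FRIENDLY
> >     return Verdict.UNFRIENDLY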
> >
> > > > > To put it less formally, we'd be giving our Friendliness module the
> > > > > use of a genie which is somewhat unreliable and whose reliability in
> > > > > any particular decision is, for all intents and purposes, difficult
> > > > > to check.
> > > >
> > > > True
> > >
> > > Right. Doesn't this strike you as dangerous?
> >
> > It strikes me as potentially but not necessarily dangerous -- it all
> > depends on the details of the AI architecture.
> >
> > This is not the same as the "AI boxing" issue, in which the AI in the
> > box is like a genie giving suggestions to the human out of the box.
> > In that case, the genie is proposed to be potentially a sentient mind
> > with its own goals and motivations and with a lot of flexibility of
> > behavior. In the case I'm discussing, the "genie" is a
> > hard-to-predict hypothesis-suggester giving suggestions to a logical
> > cognition component controlled by a Friendliness verifier. And the
> > hard-to-predict hypothesis-suggester does not need to be a
> > sentient mind on its own: it does not need flexible goals,
> > motivations, feelings, or the ability to self-modify in any general
> > way. It just needs to be a specialized learning component, similar in
> > some ways to Eliezer's proposed Very Powerful Optimization Process
> > used for world-simulation inside his Collective Volition proposal (I'm
> > saying that it's similar in being powerful at problem-solving without
> > having goals, motivations, feelings or strong self-modification; of
> > course the problem being solved by my hard-to-predict
> > hypothesis-suggester (hypothesis generation) is quite different from
> > the problem being solved by Eliezer's VPOP (future prediction)).
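> >
> > As a minimal sketch of the control flow I have in mind (all function
> > names are invented for illustration, and each body is a placeholder
> > for a real component):
> >
> > def suggest_hypotheses(problem):
> >     """Opaque, hard-to-predict learning component.  It proposes
> >     candidate hypotheses but has no goals, motivations, feelings, or
> >     ability to self-modify. (Placeholder.)"""
> >     raise NotImplementedError
> >
> > def derive_action(hypothesis):
> >     """Logical-cognition step that turns a hypothesis into a concrete
> >     proposed action. (Placeholder.)"""
> >     raise NotImplementedError
> >
> > def verify_friendly(action) -> bool:
> >     """Friendliness verifier applied to a concrete proposed action.
> >     (Placeholder.)"""
> >     raise NotImplementedError
> >
> > def logical_cognition(problem):
> >     # Only actions that pass the Friendliness verifier ever leave
> >     # this component; unverified suggestions are simply discarded.
> >     for hypothesis in suggest_hypotheses(problem):
> >         action = derive_action(hypothesis)
> >         if verify_friendly(action):
> >             return action
> >     return None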
> >
> > > I never said evolutionary programming was "nontraceable". What I said
> > > was "nonverifiable". I am not splitting hairs, as these are completely
> > > different things. No program is nontraceable! You can just emulate the
> > > CPU and observe the contents of the registers and memory at any point in
> > > time. With that said though, you can probably see why this sort of
> > > traceability is not good enough. What needs to be traceable is not
> > > how the bits were shuffled but how the conclusion that was reached was justified.
> >
> > a)
> > You have not presented any argument as to why verifiability in this
> > sense is needed for Friendliness verification.
> >
> > b)
> > Your criterion of verifiability seems to me to be unreasonably strong,
> > and to effectively rule out all metaphorical and heuristic inference.
> > But maybe I have misunderstood your meaning.
> >
> > Please consider the following scenario.
> >
> > Suppose we have a probabilistic-logical theorem-proving system, which
> > arrives at a conclusion. We can then trace the steps that it took to
> > arrive at this conclusion. But suppose that one of these steps was a
> > metaphorical ANALOGY, to some other situation -- a loose and fluid
> > analogy, of the sort that humans make all the time but current AI
> > reasoning software is bad at making (as Douglas Hofstadter has pointed
> > out in detail).
> >
> > Then, it seems to me that what your verifiability criterion demands is
> > not just that the conclusion arrived at through metaphorical analogy
> > be checked for correctness and usefulness -- but that a justification
> > be given as to why *that particular analogy* was chosen instead of
> > some other one.
> >
> > This means that according to your requirement of verifiability (as I
> > understand it) a stochastic method can't be used to grab one among
> > many possible analogies for handling a situation. Instead, according
> > to your requirement, some kind of verifiable logical inference needs
> > to be used to choose the possible analogy.
> >
> > In Novamente, right now, the way this kind of thing would be handled
> > would be (roughly speaking):
> >
> > a) a table would be made of the possible analogies, each one
> > quantified with a number indicating its contextual desirability
> >
> > b) one of the analogies would be chosen from the table, with a
> > probability proportional to the desirability number
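> >
> > In code, the selection in step b) is just roulette-wheel
> > (fitness-proportionate) sampling. A minimal Python sketch, with a toy
> > table of invented desirability numbers purely for illustration:
> >
> > import random
> >
> > # Toy table: candidate analogies -> contextual desirability
> > # (invented numbers).
> > analogy_table = {
> >     "two cooperating chess masters": 2.5,
> >     "genie in a box": 0.7,
> >     "peer review": 1.8,
> > }
> >
> > def choose_analogy(table):
> >     """Pick one analogy with probability proportional to its
> >     desirability."""
> >     candidates = list(table.keys())
> >     weights = list(table.values())
> >     return random.choices(candidates, weights=weights, k=1)[0]
> >
> > print(choose_analogy(analogy_table))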
> >
> > According to your definition of verifiability this is a bad approach
> > because of the use of a stochastic selection mechanism in Step b.
> >
> > However, I have my doubts whether it is really possible to achieve
> > significant levels of general intelligence under severely finite
> > resources without making this kind of stochastic selection in one form
> > or another. (I'm not claiming it is necessary to resort to
> > pseudorandom number generation; just that I suspect it's necessary to
> > resort to something equally arbitrary for selecting among options in
> > cases where there are many possibly relevant pieces of knowledge in
> > memory and not much information to go on regarding which one to use in
> > a given inference.)
> >
> > > What I meant was that when B proposes an action, A can either verify that
> > > B did the correct thing or point out a flaw in B's choice.
> > >
> > > The statement above is sufficient but not necessary to show that A is
> > > smarter than B, in the colloquial sense of the phrase.
> >
> > I find this interpretation of the "smarter" concept very inadequate.
> >
> > For instance, suppose I have a collaborator who is more reliable in
> > judgment than me but less creative than me. For the sake of concreteness,
> > let's call this individual by the name "Cassio."
> >
> > Let A=Ben, B=Cassio
> >
> > Now, it may be true that "When Ben proposes an action, Cassio can
> > either verify that Ben proposed the correct thing, or point out a flaw
> > in his choice"
> >
> > This does not necessarily imply that Cassio is smarter than Ben -- it
> > may be that Ben is specialized for hypothesis generation and Cassio is
> > specialized for quality-verification.
> >
> > The colloquial notion of "smartness" is not really sufficient for
> > discussing situations like this, IMO.
> >
> > > To illustrate, suppose Alice is helping Bob play chess. Whenever Bob
> > > suggests a move, she always says something like "I agree with your move"
> > > or "Yikes! If you go there, he'll fork your rooks in two moves! You
> > > overlooked this move here." If she can always do this, it should be
> > > absolutely clear that Alice is a better chess player than Bob.
> >
> > Yes, but I can also imagine two chess masters, where master A was
> > better at coming up with bold new ideas and master B was better at
> > pointing out subtle flaws in ideas (be they bold new ones or not).
> > These two masters, if they were able to cooperate very closely (e.g.
> > through mental telepathy), might be able to play much better than
> > either one on their own. This situation is more like the one at hand.
> >
> > (i.e., I think your internal quasirandom selection mechanism has
> > chosen a suboptimal analogy here ;-)
> >
> > This discussion has gotten fairly in-depth, but the crux of it is, I
> > don't feel you have made a convincing argument in favor of your point
> > that it is implausible-in-principle to add Friendliness on to an AGI
> > architecture designed without a detailed theory of Friendliness on
> > hand. I don't feel Eliezer has ever made a convincing argument in
> > favor of this point either. It may be true but you guys seem far from
> > demonstrating it...
> >
> > -- Ben
>
> A problem is that for trivial pieces of code it's not even possible to define
> what friendliness consists of, unless you consider a sort routine performing
> a sort correctly to be friendly. Even for most more complex parts, taken in
> isolation, you won't be able to predict their friendliness.
> E.g.: If a module for modeling possible outcomes can't model "unfriendly"
> outcomes, it won't be able to avoid them, but if it can, then it will be able
> to generate them as portions of a plan to execute them. So even at the level
> where friendliness is unambiguously recognizable, it doesn't necessarily make
> sense to exclude unfriendly thoughts.
>
> More to the point, I sometimes find myself unable to decide which of two
> proposed actions would reasonably be considered "friendly" in a larger
> context (i.e., my personal choices of how to relate to people). How much
> coercion is it "friendly" to exert to prevent an alcoholic from drinking?
> Clearly one shouldn't offer them a drink, that would, in this context, be
> unfriendly. Is one required to hide the fact that one has alcoholic drinks
> on the premises? If they bring one with them, should one refuse to allow
> them entry? What is this "friendliness"? Practically, I generally choose
> from self-interest, and refuse to allow them in with alcohol or when they
> have obviously been drinking... but is this friendly, or merely selfish?
>
> What, exactly, *is* this friendliness that we wish our AIs to exhibit?
>


