RE: Humane-ness

From: Christopher Healey
Date: Wed Feb 18 2004 - 05:43:57 MST

I'll try to orient my response away from making specific value judgments of intrinsic desirability, as you made a point of mentioning in your reply [far, far..] below. Also, I will attempt to keep my responses within my understanding of CFAI and its implications, to minimize cross-contamination with my own ideas, although I'm afraid this will occur anyhow.

> My contention is that there is almost surely NO WAY to guarantee that an AGI
> will remain "humane" under radical iterated self-modification, according to
> Eliezer's (not that clear to me) definition of "humane"
> This is a conjecture about WHAT SORT OF PROPERTIES are going to be
> potentially stable under radical iterated self-modification
> I contend that
> -- specific moral injunctions like "Be good to humans"
> -- messy, complex networks of patterns like "humane-ness"
> will probably NOT be stable in this sense, but more abstract and crisp
> ethical principles like
> -- Foster joy, growth and free-choice
> probably will be.

"Guarantee" was a bad word choice on my part. I should have approached it from the other extreme and said, "minimize our chance of catastrophic outcome across a maximal predictive horizon".
From my vantage, all three of these principles seem grounded in vastly complex experiential reference. Even given the assumption that Voluntary Joyous Growth (VJG) IS the more abstracted and crisp, I'd posit that to represent it (and all other human social generalizations) in a way that truly preserves its flexibility across contexts would still involve solving 99% of the problem.
This problem doesn't seem to be "What is the correct property?", or even "How do we get our AGI to preserve this property?", but rather "What is required in getting our AGI to converge not on the property we define, but on what we intended when we selected the property?" The point I am trying to convey is that if you can achieve this, then the AGI should converge on any supergoal we define, regardless of the goal's complexity, given time and guidance to attain and exceed human-equivalent cognition.
This seems to be the central aim of Friendly AI: first ensure that it's structurally possible (i.e. not structurally crippled), next seed and teach increasingly complex semantic primitives until the basic content of Friendliness can be represented, and finally build on these to implement a singly rooted goal structure based on the specific concept of Friendliness, which is structurally identified to be an approximation to real Friendliness (externally referenced).
So if I am wrong and confused in how I phrase an attempt at a humane or VJG supergoal, at least I wouldn't have to feel scared about the AGI plodding along without extracting the intent and confirming it in every applicable instance, interactively when possible and always against past supergoal-related experiences. The supergoal would then possibly be changed to better approximate the actual intent, should a novel situation present itself, as an instrumental action towards maximizing future fulfillment of the current supergoal. We WOULD feel scared, but due to concerns over structural design issues and not minor content vagaries.
Of course, you would still have to try to get it right, and present the AGI with honest feedback. Assuming excellent structural design of the AGI, it would still eventually converge on the intended supergoal under incompetent or error-prone education, but it would happen much less efficiently. And however long it takes to converge could easily be too long for us, especially if it's wreaking misinterpretation-based havoc (as opposed to programmer-trained and intended havoc). So, programmer intention is certainly still overridingly important, with incidental content quality much less an issue. (The average quality level of ALL content related to a specific programmer would mediate all interactions highly, however.)
My increasing perception is that simpler may not be better in this matter. It feels like trying to generalize from generalizations. Approximating from approximations. Destroying salient information.
I am still wrestling with my intuitions on this. But your own contributions on SL4 have been key in helping me define my own understanding of the matter, and in putting it to the grinding wheel.
I appreciate this opportunity to bounce these gleaned thoughts off of you :)

> I think that playing with simple self-modifying goal-driven AI systems will
> teach us enough about the dynamics of cognition under self-modification,
> that we'll get a much better idea of whether my conjecture is correct or
> not.
> Please note, I am not even arguing as to whether the goal of making
> humane-ness survive the Transcension in a self-modifying AI is DESIRABLE. I
> am arguing that it is probably IMPOSSIBLE. So I am focusing my attention on
> goals that I think are POSSIBLE, along with doing the research needed to
> better assess what is really possible or not.
> -- Ben G
Well, I'd have to agree that even considering the dangers, there are surely many things we'll fail to understand without thorough experimentation, especially nuts and bolts implementation. I do think we need to take an extremely active role in exploring those areas, but under duly diligent methodologies.
. . .
To any and all, if I am patently confused on CFAI in whole or in part, please don't be shy. Beat me up.

Thank you,
Chris Healey

This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:00:45 MDT