Fundamental problems

From: Mitchell Porter (
Date: Tue Feb 14 2006 - 06:20:48 MST

Eliezer has posted a job notice at the SIAI website, looking
for research partners to tackle the problem of rigorously
ensuring AI goal stability under self-enhancement transformations.
I would like to see this problem (or perhaps a more refined one)
stated in the rigorous terms of theoretical computer science; and
I'd like to see this list try to generate such a formulation.

There are several reasons why one might wish to refine the
problem's specification, even at an informal level. Clearly, one
can achieve goal stability by committing to an architecture
incapable of modifying its goal-setting components. This is an
insufficiently general solution for Eliezer's purposes, but he does
not specify exactly how broad the solution is supposed to be,
and perhaps he can't. Ideally, one should want a theory of
Friendliness which can say something (even if only 'irredeemably
unsafe, do not use') for all possible architectures. More on this below.

What follows has little or no actual theory; it is mostly a
methodological preamble. There's a little philosophy, a little
polemic, some general thoughts on pragmatic strategy, and
(if I get that far) some actual theory. It's also rather repetitive,
but I'm shooting this off as-is to kick-start some
discussion. A rough synopsis:

1. A characteristic statement of my skepticism about
prevailing philosophies of mind and matter, implying that current
science cannot *even in principle* understand all the issues of
the Singularity.

2. But unFriendly RPOPs are an important sub-issue; this is what
SIAI proposes to deal with, and I endorse their approach, insofar
as I understand it to be a call for rigor.

3. Some thoughts on the importance of *general* Friendliness
theory, as opposed to architecture-specific Friendliness theory.

So first, the philosophical statement of principle. I do not believe
that thought is (solely) computation, or that artificial intelligence
is actually intelligence. I regard the computational description of
mind and the mathematical description of physical nature as
ontologically impoverished; they provide a causal account of
formal changes of state, but say nothing about the 'substance'
of those states. To do that, one must do phenomenology,
epistemology, and ontology at a level more profound than
natural science permits. I would point towards Husserl's notion
of 'eidetic sciences' besides logic and mathematics as the way
forward.

The stipulation that mind is more than computation implies that
the combination of physics and computer science simply cannot
illuminate all the issues involved with the Singularity. For example,
they have nothing to say about the possibility of pseudoFriendly
superintelligences with a wrong philosophy of mind, whose
universe-optimizing strategy does not respect the actual
ontological basis of our existence (whatever that may be), but
only the continued existence of some active causal emulation of
our minds (for example). In crude terms, this is the scenario
where *substrate matters*, but the AIs cannot know this,
because they don't actually know anything; they are just
self-organizing physical processes with functionally superhuman
intelligence, homeostatically acting to bring about certain
conditions in their part of the universe.

So, while I do not believe that solving even all the
problems that SIAI sets for itself would suffice to ensure a
happy Singularity, I recognize that unFriendly RPOPs (Really
Powerful Optimization Processes) are almost certainly possible,
that they are a threat to our future in ways both blunt and
subtle (just as Friendly RPOPs would be powerful allies), and that
computer science is the relevant discipline, just as classical
mechanics is the basic discipline if you wish to land a rocket on
the moon. For this reason I personally agree that theoretical
progress in these matters is overwhelmingly important, if only
it can be achieved.

So, now to business. What do we want? A theory of 'goal
stability under self-enhancement', just as rigorous as (say) the
theory of computational complexity classes. We want to pose
exact problems, and solve them. But before we can even pose
them, we must be able to formalize the basic concepts. What
are they here? I would nominate 'Friendliness', 'self-enhancement',
and 'Friendly self-enhancement'. (I suppose even the definition of
'goal' may prove subtle.) It seems to me that the rigorous
characterization of 'self-enhancement', especially, has been
neglected so far - and this tracks the parallel failure to define
'intelligence'. We have a rough working definition - success at
prediction - which provides an *empirical* criterion, but we need
a theoretical one (prediction within possible worlds? but then
which worlds, and with what a priori probabilities?): both a way
to rate the 'intelligence' of an algorithm or a bundle of heuristics,
and a way to judge whether a given self-modification is actually
an *enhancement* (although that should follow, given a truly
rigorous definition of intelligence). When multiple criteria are in
play, there are usually trade-offs: an improvement in one direction
will eventually be a diminution in another. One needs to think
carefully about how to set objective criteria for enhancement,
without arbitrarily selecting a narrow set of assumptions.
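To make the trade-off point concrete, here is a minimal Python sketch. It assumes, purely hypothetically, that each version of a system can be scored on a fixed list of numeric criteria (higher is better), and it uses Pareto dominance as the test for 'enhancement'. The criteria and numbers are invented for illustration; nothing here is a proposed definition of intelligence.

```python
from typing import Sequence


def pareto_dominates(new: Sequence[float], old: Sequence[float]) -> bool:
    """True if `new` is at least as good as `old` on every criterion
    and strictly better on at least one (higher scores are better)."""
    if len(new) != len(old):
        raise ValueError("score vectors must have the same length")
    at_least_as_good = all(n >= o for n, o in zip(new, old))
    strictly_better = any(n > o for n, o in zip(new, old))
    return at_least_as_good and strictly_better


def classify_modification(new: Sequence[float], old: Sequence[float]) -> str:
    """Label a self-modification by comparing score vectors after and before."""
    if pareto_dominates(new, old):
        return "enhancement"
    if pareto_dominates(old, new):
        return "degradation"
    if list(new) == list(old):
        return "no change"
    # Better on some criteria, worse on others: no ranking without extra weights.
    return "trade-off (incomparable)"


# Hypothetical criteria: (prediction accuracy, speed, memory efficiency)
baseline = (0.80, 1.0, 0.5)
print(classify_modification((0.85, 1.0, 0.5), baseline))
print(classify_modification((0.90, 0.7, 0.5), baseline))
```

The interesting case is the last one: under Pareto dominance many modifications are neither enhancements nor degradations, and ranking them requires some weighting of the criteria, which is exactly where arbitrary assumptions can creep in.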

This meta-problem - of arbitrarily narrowing the problem space,
achieving tractability at the price of relevance - relates to another
issue, a question of priorities. By now, there must be very many
research groups who think they have a chance of being the first
to achieve human-equivalent or superhuman AI. If they care
about 'Friendliness' at all (regardless of whether they know that
term), they are presumably trying to solve that problem in private,
and only for the cognitive architecture which they have settled
upon for their own experiments.

Now, the probability that YOU win the race is less than 1, probably
much less; not necessarily because you're making an obvious
mistake, but just because we do not know (and perhaps cannot
know, in advance) the most efficient route to superintelligence.
Given the uncertainties and the number of researchers, it's fair to
say that the odds of any given research group being the first are
LOW, but the odds that *someone* gets there are HIGH. But this
implies that one should be working, not just privately on a
Friendliness theory for one's preferred architecture, but publicly on
a Friendliness theory general enough to say something about all
possible architectures. That sounds like a huge challenge, but it's
best to know what the ideal would be, and it's important to see
this in game-theoretic terms. By contributing to a publicly available,
general theory of Friendliness, you are hedging your bets,
accounting for the contingency that someone else, with a
different AI philosophy, will win the race.
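The asymmetry between 'LOW' and 'HIGH' can be illustrated with a few lines of Python. This is a toy model, not a forecast: it assumes (hypothetically) that N research groups are independent, that each succeeds with the same probability p, and that by symmetry each is equally likely to be the first among those who succeed. The parameter values are invented.

```python
def p_someone_succeeds(p_each: float, n_groups: int) -> float:
    """P(at least one group succeeds), assuming independent groups
    that each succeed with the same probability p_each."""
    return 1.0 - (1.0 - p_each) ** n_groups


def p_given_group_first(p_each: float, n_groups: int) -> float:
    """P(one particular group is first). Under the symmetry assumption,
    each group is equally likely to be first, given that someone succeeds."""
    return p_someone_succeeds(p_each, n_groups) / n_groups


n, p = 50, 0.05  # hypothetical: 50 groups, each with a 5% chance of success
print(f"P(someone gets there)     = {p_someone_succeeds(p, n):.3f}")
print(f"P(a given group is first) = {p_given_group_first(p, n):.3f}")
```

With these made-up numbers the chance that someone gets there is above 90%, while the chance that any particular group is first is under 2%, which is the situation in which contributing to a general theory is the better hedge.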

To expand on this: the priority of public research should be to
achieve a rigorous theoretical conception of Friendliness, to develop
a practical criterion for evaluating whether a proposed AI
architecture is Friendly or not, and then to make this a *standard*
in the world of AI research, or at least "seed AI" research.

So, again, what would I say the research problems are? To
develop behavioral criteria of Friendliness in an 'agent', whether
natural or artificial; to develop a theory of Friendly cognitive
architecture (examples - an existence proof - would be useful;
rigorous proof that these *and only these* architectures exhibit
unconditional Friendliness would be even better); to develop
*criteria* of self-enhancement (what sort of modifications
constitute an enhancement?); to develop an understanding of
which algorithms will *actually self-enhance*.

Then one can tackle questions like, which initial conditions
lead to stably Friendly self-enhancement; and which
self-enhancing algorithms get smarter fastest, when
launched at the same time.

The aim should always be to turn all of these into
well-posed problems of theoretical computer science, just
as well-posed as, say, "Is P equal to NP?" Beyond that, the
aim should be to *answer* those problems (although I
suspect that in some cases the answer will be an unhelpful
'undecidable', i.e. trial and error is all that can be advised,
and so luck and raw computational speed are all that will
matter), and to establish standards - standards of practice,
perhaps even standards of implementation - in the global
AI development community.

Furthermore, as I said, every AI project that aims to
produce human-equivalent or superhuman intelligence
should devote some fraction of its efforts to the establishment
of universal safe standards among its peers and rivals - or at
least, devote some fraction of its efforts to thinking about
what 'universal safe standards' could possibly mean. The odds
are it is in your interest, not just to try to secretly crack the
seed AI problem in your bedroom, but to contribute to
developing a *public* understanding of Friendliness theory.
(What fraction of effort should go to one's private project,
and what fraction to public discussion, I leave for individual
researchers to decide.)

One more word on what public development of Friendliness
standards would require - more than just having a
Friendliness-stabilizing strategy for your preferred architecture,
the one by means of which you hope that your team will win
the mind race. Public Friendliness standards must have
something to say on *every possible cognitive architecture* -
that it is irrelevant because it cannot achieve superintelligence
(although Friendliness is also relevant to the coexistence of
humans with non-enhancing human-equivalent AIs); that it
cannot be made safe, must not win the race, and should
never be implemented; that it can be made safe, but only
if you do it like so.

And since in the real world, candidates for first
superintelligence will include groups of humans, enhanced
individual humans, enhanced animals, and all sorts of AI-human
symbioses, as well as exercises such as massive experiments in
Tierra-like Darwinism - a theory of Friendliness, ideally, would
have a principled evaluation of all of these, along the lines I
already sketched. It sounds like a tall order, and it certainly is
one; it may even be unattainable before the Singularity. But it's worth
having an idea of what the ideal looks like.

That's all I have to say for now; as I said, this was just preamble.

This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:00:55 MDT