>> Generally speaking, given a finite amount of memory and an
>> arbitrarily long sequence of data (generated by any finite state machine
>> no matter how complex), it is possible to attain the minimum possible
>> predictive error rate using universal prediction schemes.
>
> However, this kind of "in principle" calculation is not very useful in practice,
> now is it?  only in very narrow domains, like text compression...


Yes and no, depending on your assumptions.  I actually agree with
you to the extent that you really don't need to know exactly what
the minimum predictive error rate is as long as your algorithms
approach them reasonably quickly.  Most of the limits are
computational -- doable, but very expensive for unrestricted domains.
However, I wouldn't (and don't) try to apply these calculations to a
single complex domain (no need to rehash the same old problems of AI).
Rather I have algorithms that automatically and adaptively partition
domains (in a manner that preserves all heirarchical, associative,
etc. context) such that good prediction remains computationally
tractable for any given domain space without significantly impacting
prediction error rates for the complex virtual domain (query) that it
is currently operating on; in fact, the decision of a node to
partition itself is driven largely by recognizing when a domain has
become computationally inefficient in the sense that good results
require excessive CPU churn (itself a relatively complex algorithm).
It is really just a matter of getting around the computational
complexity of the core computation without significantly sacrificing
results.

The better text compression algorithms, such as Lempel-Ziv and
arithmetic coding, actually are based on universal prediction schemes
(entropy and prediction being closely related), and are therefore
applicable to any sequence of data, not just text, though that is
where it is most commonly used. LZ is not actually an optimal
universal prediction scheme, but is an excellent universal predictor
for systems with very finite amounts of memory (frequently the
textbook example in fact). Optimal universal predictors asymptotically
approach theoretical error rates very quickly, but also tend to have
exponential memory consumption, giving rapidly diminishing returns.
Therefore, for many finite memory applications, non-optimal predictors
can actually produce better results in the given amount of resource
space even if they approach the theoretical minimum error rates much
more slowly.

My own experience indicates that it doesn't really matter if you use
an optimal universal predictor.  However, there is a very broad
spectrum in performance for universal predictors such that if you
choose a poor one, it is likely to be essentially worthless for
real-world problem spaces.  A well-designed system with poor
prediction functions will appear to have poor intelligence during
your lifetime even though the prediction functions may converge enough
to give good performance at some time far in the future.


> Then, if the system has enough time to learn, EC (or more simply,
> Monte Carlo search over program space), will cause the system to
> arrive at a program that can achieve its goals optimally


And I thought my methods were computationally expensive... :^)


> Since none of these conditions obtain, we need something much  more
> specialized and more complicated than EC inside our thinking
> machine...


I'm not a big fan of EC as a means of generating a thinking machine.
Aside from the arguable inelegance of the result (depending on your
point of view), there necessarily must exist a large number of cleaner
and more efficient methods, though requiring more specialized
architecture.


>> In short, it has been demonstrated that for any finite
>> state machine, it is possible to ascertain the minimum possible
>> predictive error rate for any data sequence given any finite amount
>> of memory.
>
> Yes, but this method will not perform adequately under the
> conditions under which a real intelligent systems have to operate.


I would disagree; it depends on how you use it.


> The problem is that the prediction schemes that are "optimal" under
> standard mathematical assumptions, are NOT optimal given the
> real-world conditions under which organisms operate.


Pretty much by definition, an "optimal" universal predictor is
impervious to real-world conditions, whatever those may be, hence the
term "universal". The algorithms are model independent.  A truly
optimal predictor would of course require infinite memory. However, it
is possible to have universal predictors that are optimal for any
given resource configuration. Even "good" universal predictors will do
the job, they just take longer to converge.


> Yes, it answers the question under the glaringly false assumption
> that minds embody general optimal predictive schemes


Optimality is not required for good predictive performance, it merely
sets the standard for how good the performance can get.


> These experiments certainly do not demonstrate that humans are
> finite-state machines. There are many  many other explanations for
> this data.  I won't bore you by reciting them.


I never claimed that this was a proof.  The problem is that it can't
be disproved that the human mind is a finite-state machine. Humans
have a knack for getting mathematically classified as FS machines via
their information theoretic behavior.  It doesn't matter to me one way
of the other, it was merely an observation.


> Most of what makes real minds interesting is NOT about optimal
> prediction or modeling, but about pretty good ways of achieving
> pretty good intelligence within very limited space and time
> resources. This is a whole different story from information theory.


Obviously my original post came across as incomplete (the hazards of
sending email at 2am), but I was talking about theoretical limits
of prediction for any finite amount of resources i.e. given a certain
amount of memory and processor, how accurately can a program predict
for any arbitrary model. That sounds a lot like predicting the limits
of intelligence in any finite resource space.


> For example, in the area of computational linguistics, Denis Yuret's
> excellent MIT PhD thesis from a few years back uses information
> theory to model language ("lexical attraction" he calls it).  All
> well and good.  It doesn't help you deal with the translation of
> language into meaning.  I think I know how to do the latter, but not
> using information theory explicitly....


I have a small pile of links and papers on this topic that admittedly
I haven't read for the most part.  However, it was my understanding
that these language models offer the best context inference
performance of any other language analysis methodology to date, and by
a good margin. However, I don't really construe "meaning" to be much
different than "context", which you may not.

I should probably start reading these...


> I mean, how do you construct a mind given current computational
> resources and real-time learning constraints, inspired primarily by
> information theory?


First of all, the architecture looks a lot like many other modern
architectures, but uses information theory to fill in some holes and
resolve some hard problems; information theory is not fundamental to
the architecture, only fundamental to the way some components of the
architecture behave.

Scalability is a serious issue given current day hardware constraints,
but I've personally minimized that to the maximum extent possible. The
kernel layer, which effectively acts as an operating system for the
mind, is a highly scalable database kernel (borrowed from past
experience) that is both SMP and cluster optimized, but which has
been tuned and modified for this particular application. This supports
an arbitrarily large address space, transparent thread/process
migration over the network, full state recovery for both single-node
and total system failures, load management, and a bunch of other
features that are nice to have.  Standard stuff for the most part.

At the application level, you have a multitude of "agents", between
which are a complex network of associative and heirarchical
relationships.  Each agent is effectively a single domain (and quite a
bit more, being active, sometimes goal-oriented components).  Nothing
to special in this aspect.

Under load, domains actively partition themselves into multiple
agents/domains in realtime to minimize theoretical prediction error
rates as a function of resource usage and few other things while still
maintaining informational integrity. This is probably the most
critical info theory driven feature and it saves a *lot* of clock
cycles while delivering superior results. Queries against the
entire system also require info theory derived algorithms and
mathematics, but I'll skip over that topic here.  In the
simplest case, you can start out with a single empty domain, start
shoving information into it, and then let the network build itself.
The resulting system, while containing everything fed to it will by
its nature extract all the context and information to within fairly
close to the theoretical limit, with some good weighting/noise
reduction thrown in. It isn't so important here, but how the data
organizes itself allows queries against the system to be computed
quickly and with a low predictive error rate that is pretty close
to the theoretical limits. Against a system that did not organize
itself in something closely approximating this manner, queries with a
similar predictive efficiency would become computationally
intractable.  How the domains actually partition is driven to a
certain extent by how the system is queried and used, as this can help
define certain types of computation efficiency. The primary benefit of
this kind of automatic and adaptive partitioning (other than keeping
the domain spaces clean and computationally efficient) is that it
allows the system to handle dirty and raucous data sources quickly and
gracefully without human intervention or serious system pollution. The
adaptive partitioning scheme was originally a noise reduction
algorithm I developed for data-mining purposes, but has proven to be
very versatile across a broad number of spaces (interestingly, it is
algorithmically similar to high-end noise reduction algorithms in
signal processing).

I glossed over a lot, but as you can see, info theory largely governs
organizational behaviors to allow high predictive efficiency in a
computationally reasonable manner and to weed out garbage.  Other than
that, it mostly looks like yet another giant network of motivated
super-neurons, albeit optimized for silicon.

Cheers,

-James Rogers
 jamesr@best.com