Pattern representation and Solomonoff

From: James Rogers (jamesr@best.com)
Date: Tue Sep 16 2003 - 02:54:05 MDT


There are so many different concepts interwoven into a single elegant model
that I get easily distracted when explaining certain aspects of my work.

My use of the word "convergent" in a previous email was perhaps unfortunate;
its importance was buried behind a layer or two of indirection and would
only be apparent to someone with the appropriate context.

I'm going to start exploring some of the basic interesting points mentioned
in the title here, but I'm going to work from a more basic data-oriented
perspective rather than from the assumption of a particular knowledge
representation framework, which would have almost no context for most of
the people reading this.

Let us assume that we have a trivial function that generates the following
infinite stream of characters:

F = "1947194719471947..."

Writing a program that can express such a function is a minor feat, one line
in Python and many other languages. By definition, this function has a low
Kolmogorov complexity. More importantly, the entropy of the function has a
very specific profile: locally high and globally low. Small fragments taken
by themselves (e.g. "4719") have high entropy and show no meaningful
regularity, while over long fragments higher-order regularities are
apparent even upon casual observation (e.g. the pattern "1947" appears to
repeat ad infinitum).
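
For concreteness, a minimal sketch of such a generator in Python
(itertools.cycle is just one way to express it):

    import itertools

    # Endlessly yields the characters '1', '9', '4', '7', '1', '9', ...
    F = itertools.cycle("1947")

The program is tiny while its output is infinite, which is exactly what a
low Kolmogorov complexity means here.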

How would one represent this function from the standpoint of efficient
knowledge representation? Assume that pattern computation is a non-issue
(relevant to AGI but not to this email or narrow discussion), and that a
reference to a character takes as much memory as the character itself in
practice.

1.) A naïve implementation would simply store the sequence of characters as
they are generated, since there is no guarantee that they will repeat
forever; the repetition is only inferred, as we have no special knowledge
of the function.

2.) A more clever implementation would use a sequential statistical model
(assume a 4th-order model, for the sake of the example), which would code a
long sequence of references to the "1947" node of the model. This is a
substantial savings: a single reference is stored instead of four
characters, giving a roughly 4:1 compression ratio after discounting the
fixed overhead of the model. Let's call the model overhead function OH(), a
function of the number of patterns stored in the model (complex and
exponential, but minor in some cases -- this becomes very important later).
A toy sketch of this idea appears after the list.

3.) An even more clever implementation doesn't use a single statistical
model. Rather than modeling the data stream alone, it also models the
references to the models, encodes references to *them*, and so forth. This
works even better, giving memory usage (net of model overhead) that is
logarithmic in the total characters stored. A toy sketch of this also
appears below.
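
Here is a toy sketch of the idea behind #2 in Python. This is a simple
block/dictionary coder standing in for the 4th-order statistical model, not
the actual implementation; the dict of unique blocks plays the role of
OH(N):

    # Chunk the stream into k-character blocks and store one reference
    # per block; the dict of unique blocks is the model overhead.
    def encode_blocks(stream, k=4):
        model = {}   # block -> reference id
        refs = []    # one reference replaces k characters
        cut = len(stream) - len(stream) % k
        for i in range(0, cut, k):
            refs.append(model.setdefault(stream[i:i+k], len(model)))
        return model, refs, stream[cut:]   # tail kept for losslessness

    model, refs, tail = encode_blocks("1947" * 1000)
    # len(model) == 1 and len(refs) == 1000: roughly 4:1 compression
    # before discounting the model overhead.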
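
And a toy sketch of the recursive idea behind #3 (again a stand-in, not the
actual implementation): the same block coding is re-applied to its own
reference stream, level after level. Each level shrinks a regular sequence
by roughly a factor of four while adding its own model, which is where the
logarithmic behavior comes from:

    # Hierarchical coder: each level's reference stream becomes the
    # next level's input; per-level tails are kept so nothing is lost.
    def encode_hierarchical(stream, k=4):
        levels = []            # one (model, tail) pair per level
        seq = list(stream)
        while len(seq) >= k:
            model, refs = {}, []
            cut = len(seq) - len(seq) % k
            for i in range(0, cut, k):
                refs.append(model.setdefault(tuple(seq[i:i+k]),
                                             len(model)))
            levels.append((model, seq[cut:]))
            seq = refs
        return levels, seq     # per-level models plus residual sequence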

Each of the implementations is more clever with representation than the
previous, but you say: "So what? It is a simple repeating pattern!
Representation doesn't matter!"

Let's change the assumptions a bit. Let's assume that the pattern only
repeats for 1,000 iterations before changing to another random sequence of
four characters (say, "6872"), and keeps changing every 1,000 iterations.
Let's call the number of different patterns "N". Implementation #1
faithfully records the characters "as is" like it always has, implementation
#2 makes adjustments to its statistical model with a modest increase in the
overhead for a space complexity of DATA/4 + OH(N), and implementation #3
faces a simple increase of N log DATA + OH(log N)*N.
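
To put rough numbers on those expressions, here is a back-of-envelope
comparison using the toy figures above (N = 1,000 patterns of four
characters, 1,000 iterations each), log base 2, and a purely hypothetical
linear OH(); the real OH() is more complex, so this only illustrates the
relative scaling:

    import math

    DATA = 1000 * 1000 * 4           # 4,000,000 characters total
    OH = lambda n: 10 * n            # placeholder overhead function
    space_1 = DATA                                                 # ~4.0e6
    space_2 = DATA // 4 + OH(1000)                                 # ~1.0e6
    space_3 = 1000 * math.log2(DATA) + OH(math.log2(1000)) * 1000  # ~1.2e5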

Let's make the problem even more interesting. Let's assume that the
function rolls over to the beginning after 1,000 different patterns are
generated 1,000 times each. If we were talking about a UTM with infinite
memory, this would not be a problem, but since we live in this universe we
have to deal with the fact that everything is an FSM, with some fixed amount
of character/reference storage "M". It is conceivable that with
implementation #1, we never reach the rollover point where the function
repeats the original pattern because the machine runs out of memory after M
characters. Implementation #2 does better, at M/4 - OH(N). Implementation
#3 does even better at N log M + OH(log N)*N, with a vastly smaller memory
footprint than either #1 or #2 to losslessly encode the same pattern.

Again you ask, so what? One implementation is substantially more efficient
than the other, but that is a matter of degree.

Something very interesting happens when the function rolls over and repeats
the same pattern again. Assume the machine has enough memory that #2 has
space to record two full iterations of the "super-pattern" (the point where
the digit patterns go back to the original "1947" and the sequence of
apparently random pattern changes repeats) before running out of memory. It
follows that #1 will never see the rollover super-pattern and will view the
sequence of pattern changes as truly random, while #3 will easily have
enough memory resources to record several iterations of the rollover
super-pattern.

For the second iteration of the super-pattern, you see a new memory
consumption function: both #2 and #3 start to average less memory used per
character stored the more times the super-pattern repeats. #2
asymptotically approaches DATA/4 for a minor improvement (taking into
account that "DATA" is twice as large after two iterations of the
super-pattern), with the function OH(N) effectively becoming a constant.
However, #3 consumes *almost no additional memory*, just some negligible
additional model overhead, and it already has more efficient OH()
scalability than #2 (the details of why there is an effective difference in
model efficiency are out of scope here).
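
The convergence effect shows up even with the toy encode_hierarchical
sketch from earlier. Encode one iteration of a small made-up super-pattern
and then two, and compare the total model sizes:

    import random
    random.seed(1)

    # 64 random 4-character patterns, each repeated 16 times in order.
    patterns = ["".join(random.choice("0123456789") for _ in range(4))
                for _ in range(64)]
    super_pattern = "".join(p * 16 for p in patterns)

    levels_one, _ = encode_hierarchical(super_pattern)
    levels_two, _ = encode_hierarchical(super_pattern * 2)
    model_size = lambda levels: sum(len(m) for m, _ in levels)
    # The data doubled, but the total model size barely moves:
    print(model_size(levels_one), model_size(levels_two))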
 
#2 and #3 represent implementations that are capable of "converging", which
is to say the memory per character stored can start decreasing as more of
an apparently FSM-generated pattern is encoded. Obviously, #2 and #3
represent two very different implementation efficiencies, as #3 requires
vastly less memory to losslessly encode a pattern, something that becomes
particularly evident as the corpus of data to encode grows. What is also
evident is that the memory profiles are different enough that #3 could
converge on patterns sufficiently complex that #2 could never converge on
them in its finite memory (and even if it could, it would be comparatively
wasteful of resources).

One thing that I didn't really cover is the pathological case of truly
random data. In this case, #3 will actually give the WORST performance of
all three. #1, #2, and #3 are actually different algorithm classes
(evidenced to a certain extent by their space complexity profiles). #1
guarantees its space requirements but never converges; #3's resource
requirements fluctuate quite a bit within certain hard limits depending on
the apparent complexity of the data source, but it will start to converge
at the slightest hint of a pattern (obvious or subtle and abstract); and #2
is somewhere in the middle. The example function was painfully simple, and
the scalability math is a bit uglier to compute with more real-world
datasets.

Which brings me to an important point. The efficiency of an implementation
is directly related to the Kolmogorov complexity of the patterns that the
implementation can losslessly store in a finite amount of memory. As in the
above example, the differences in practical efficiency can be
extraordinary. More efficient implementations are capable of recognizing
and utilizing more abstract patterns. There are actually multiple ways to
build implementations in the rough efficiency class of #3 (I've probably
come up with three or four myself), so it is really a class of algorithms,
but that is beyond the scope of this email. To ground this a bit, #1 is a
brain-dead UTM model, #2 is represented in the CS literature, and #3 is
only described in AIT mathematics (and my code base).

Now to bring up another important observation that I'm guessing most people
overlooked, probably in part because I didn't go into the implementational
details of #3, though it should be somewhat obvious upon reflection:

The entire state space of the #3 data structure will be almost devoid of
duplicate patterns. You can show why this should/must be the case from a
number of theoretical vectors (left as an exercise to the reader -- an easy
one).

The implication is important, though. It suggests that one has a structure
that approximates an exhaustive index of every pattern it has seen as an
intrinsic side effect of its function. While there is some truth to this,
the problem is much uglier in practice and much more complex in theory.

In any case, it is very late and my brain is numb. I may have screwed up
the OH() function for #3, but I can't think well enough right now to
ascertain whether or not it is correct. It looks like a footnote in this
email, but it is quite important and a small book in itself. I'll see about
making any corrections to glaring errors tomorrow.

The reference in a previous post was to a #3-class structure that converges
at very close to optimal and actually expresses a truly exhaustive pattern
index. Much harder than it sounds, because the OH() function is a serious
bitch to persuade to play nice.

Cheers and Good Night,

-James Rogers
 jamesr@best.com


