>> Generally speaking, given a finite amount of memory and an >> arbitrarily long sequence of data (generated by any finite state machine >> no matter how complex), it is possible to attain the minimum possible >> predictive error rate using universal prediction schemes. > > However, this kind of "in principle" calculation is not very useful in practice, > now is it? only in very narrow domains, like text compression... Yes and no, depending on your assumptions. I actually agree with you to the extent that you really don't need to know exactly what the minimum predictive error rate is as long as your algorithms approach them reasonably quickly. Most of the limits are computational -- doable, but very expensive for unrestricted domains. However, I wouldn't (and don't) try to apply these calculations to a single complex domain (no need to rehash the same old problems of AI). Rather I have algorithms that automatically and adaptively partition domains (in a manner that preserves all heirarchical, associative, etc. context) such that good prediction remains computationally tractable for any given domain space without significantly impacting prediction error rates for the complex virtual domain (query) that it is currently operating on; in fact, the decision of a node to partition itself is driven largely by recognizing when a domain has become computationally inefficient in the sense that good results require excessive CPU churn (itself a relatively complex algorithm). It is really just a matter of getting around the computational complexity of the core computation without significantly sacrificing results. The better text compression algorithms, such as Lempel-Ziv and arithmetic coding, actually are based on universal prediction schemes (entropy and prediction being closely related), and are therefore applicable to any sequence of data, not just text, though that is where it is most commonly used. LZ is not actually an optimal universal prediction scheme, but is an excellent universal predictor for systems with very finite amounts of memory (frequently the textbook example in fact). Optimal universal predictors asymptotically approach theoretical error rates very quickly, but also tend to have exponential memory consumption, giving rapidly diminishing returns. Therefore, for many finite memory applications, non-optimal predictors can actually produce better results in the given amount of resource space even if they approach the theoretical minimum error rates much more slowly. My own experience indicates that it doesn't really matter if you use an optimal universal predictor. However, there is a very broad spectrum in performance for universal predictors such that if you choose a poor one, it is likely to be essentially worthless for real-world problem spaces. A well-designed system with poor prediction functions will appear to have poor intelligence during your lifetime even though the prediction functions may converge enough to give good performance at some time far in the future. > Then, if the system has enough time to learn, EC (or more simply, > Monte Carlo search over program space), will cause the system to > arrive at a program that can achieve its goals optimally And I thought my methods were computationally expensive... :^) > Since none of these conditions obtain, we need something much more > specialized and more complicated than EC inside our thinking > machine... I'm not a big fan of EC as a means of generating a thinking machine. Aside from the arguable inelegance of the result (depending on your point of view), there necessarily must exist a large number of cleaner and more efficient methods, though requiring more specialized architecture. >> In short, it has been demonstrated that for any finite >> state machine, it is possible to ascertain the minimum possible >> predictive error rate for any data sequence given any finite amount >> of memory. > > Yes, but this method will not perform adequately under the > conditions under which a real intelligent systems have to operate. I would disagree; it depends on how you use it. > The problem is that the prediction schemes that are "optimal" under > standard mathematical assumptions, are NOT optimal given the > real-world conditions under which organisms operate. Pretty much by definition, an "optimal" universal predictor is impervious to real-world conditions, whatever those may be, hence the term "universal". The algorithms are model independent. A truly optimal predictor would of course require infinite memory. However, it is possible to have universal predictors that are optimal for any given resource configuration. Even "good" universal predictors will do the job, they just take longer to converge. > Yes, it answers the question under the glaringly false assumption > that minds embody general optimal predictive schemes Optimality is not required for good predictive performance, it merely sets the standard for how good the performance can get. > These experiments certainly do not demonstrate that humans are > finite-state machines. There are many many other explanations for > this data. I won't bore you by reciting them. I never claimed that this was a proof. The problem is that it can't be disproved that the human mind is a finite-state machine. Humans have a knack for getting mathematically classified as FS machines via their information theoretic behavior. It doesn't matter to me one way of the other, it was merely an observation. > Most of what makes real minds interesting is NOT about optimal > prediction or modeling, but about pretty good ways of achieving > pretty good intelligence within very limited space and time > resources. This is a whole different story from information theory. Obviously my original post came across as incomplete (the hazards of sending email at 2am), but I was talking about theoretical limits of prediction for any finite amount of resources i.e. given a certain amount of memory and processor, how accurately can a program predict for any arbitrary model. That sounds a lot like predicting the limits of intelligence in any finite resource space. > For example, in the area of computational linguistics, Denis Yuret's > excellent MIT PhD thesis from a few years back uses information > theory to model language ("lexical attraction" he calls it). All > well and good. It doesn't help you deal with the translation of > language into meaning. I think I know how to do the latter, but not > using information theory explicitly.... I have a small pile of links and papers on this topic that admittedly I haven't read for the most part. However, it was my understanding that these language models offer the best context inference performance of any other language analysis methodology to date, and by a good margin. However, I don't really construe "meaning" to be much different than "context", which you may not. I should probably start reading these... > I mean, how do you construct a mind given current computational > resources and real-time learning constraints, inspired primarily by > information theory? First of all, the architecture looks a lot like many other modern architectures, but uses information theory to fill in some holes and resolve some hard problems; information theory is not fundamental to the architecture, only fundamental to the way some components of the architecture behave. Scalability is a serious issue given current day hardware constraints, but I've personally minimized that to the maximum extent possible. The kernel layer, which effectively acts as an operating system for the mind, is a highly scalable database kernel (borrowed from past experience) that is both SMP and cluster optimized, but which has been tuned and modified for this particular application. This supports an arbitrarily large address space, transparent thread/process migration over the network, full state recovery for both single-node and total system failures, load management, and a bunch of other features that are nice to have. Standard stuff for the most part. At the application level, you have a multitude of "agents", between which are a complex network of associative and heirarchical relationships. Each agent is effectively a single domain (and quite a bit more, being active, sometimes goal-oriented components). Nothing to special in this aspect. Under load, domains actively partition themselves into multiple agents/domains in realtime to minimize theoretical prediction error rates as a function of resource usage and few other things while still maintaining informational integrity. This is probably the most critical info theory driven feature and it saves a *lot* of clock cycles while delivering superior results. Queries against the entire system also require info theory derived algorithms and mathematics, but I'll skip over that topic here. In the simplest case, you can start out with a single empty domain, start shoving information into it, and then let the network build itself. The resulting system, while containing everything fed to it will by its nature extract all the context and information to within fairly close to the theoretical limit, with some good weighting/noise reduction thrown in. It isn't so important here, but how the data organizes itself allows queries against the system to be computed quickly and with a low predictive error rate that is pretty close to the theoretical limits. Against a system that did not organize itself in something closely approximating this manner, queries with a similar predictive efficiency would become computationally intractable. How the domains actually partition is driven to a certain extent by how the system is queried and used, as this can help define certain types of computation efficiency. The primary benefit of this kind of automatic and adaptive partitioning (other than keeping the domain spaces clean and computationally efficient) is that it allows the system to handle dirty and raucous data sources quickly and gracefully without human intervention or serious system pollution. The adaptive partitioning scheme was originally a noise reduction algorithm I developed for data-mining purposes, but has proven to be very versatile across a broad number of spaces (interestingly, it is algorithmically similar to high-end noise reduction algorithms in signal processing). I glossed over a lot, but as you can see, info theory largely governs organizational behaviors to allow high predictive efficiency in a computationally reasonable manner and to weed out garbage. Other than that, it mostly looks like yet another giant network of motivated super-neurons, albeit optimized for silicon. Cheers, -James Rogers jamesr@best.com