From: James Rogers (email@example.com)
Date: Thu Sep 11 2003 - 00:20:23 MDT
On 9/10/03 6:47 PM, "Kwame Porter-Robinson" <firstname.lastname@example.org> wrote:
> Can you give us some information on exactly what you
> are testing convergence upon? Does your data have to
> be "prepared", meaning you can't just use raw text or
> some other lowest denominator type of information.
> Have you considered applying the program to stock
> market data :)
Raw bytes, no massaging. In many cases the inputs are files and tarballs that I
downloaded off the Internet, covering many different types of data and file
formats. The efficacy is independent of the kind of data since it codes
patterns in the abstract. For the purposes of testing, the corpora I've
collected range from raw genomics data to English literature collections to
source code tarballs and most anything else I could get my hands on. I also
have some generator functions that can create arbitrary amounts of
mathematically characteristic data for testing purposes.
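As an illustration only (this is not the actual test code), generator functions for data with known mathematical properties might look like the Python sketch below. The function names, the choice of a Bernoulli source and a periodic source, and all parameters are assumptions made for the example.

```python
import math
import random


def periodic_bytes(pattern: bytes, n: int) -> bytes:
    """Return n bytes that repeat `pattern` exactly (a fully predictable source)."""
    return bytes(pattern[i % len(pattern)] for i in range(n))


def bernoulli_bytes(n: int, p: float, seed: int = 0) -> bytes:
    """Return n bytes, each ASCII '1' with probability p, otherwise ASCII '0'.

    The seed makes the output reproducible between test runs.
    """
    rng = random.Random(seed)
    return bytes(0x31 if rng.random() < p else 0x30 for _ in range(n))


def bernoulli_entropy(p: float) -> float:
    """Theoretical entropy, in bits per symbol, of the Bernoulli source above."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

Because the entropy of such a source is known in advance, the output of a model fed with `bernoulli_bytes` can be compared against `bernoulli_entropy(p)` rather than against an unknown target.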
There is only one tunable parameter (the data structure automatically adapts
itself to the nature of the data it is exposed to, so not much is needed),
and that has been locked into the same value for all the testing, mostly
because it was easier than changing it and it only has a minor impact on
efficiency anyway. That parameter was roughly set to a good value for raw
genomics data. Massaging or cleaning up the data would definitely improve
the efficiency and let the model converge faster, but that wasn't the point
of this series of tests. However, I will be testing some simple pre-filter
code shortly that I can insert into the data stream. The primary point of
pre-filtering is to improve the signal-to-noise ratio of the data stream;
many raw file formats carry a lot of extraneous crap, which is
handled just fine but hardly desirable.
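A pre-filter of the kind described could be sketched as a small stage that sits between the raw byte source and the consumer. The Python below is a hypothetical example, not the pre-filter code mentioned above: the name `keep_bytes`, the nucleotide alphabet, and the chunked-stream interface are all assumptions.

```python
from typing import Iterable, Iterator

# Hypothetical "signal" alphabet for a genomics-style stream.
NUCLEOTIDES = frozenset(b"ACGTacgt")


def keep_bytes(chunks: Iterable[bytes],
               allowed: frozenset = NUCLEOTIDES) -> Iterator[bytes]:
    """Yield each chunk with every byte outside `allowed` removed.

    This is a purely byte-level filter: it knows nothing about lines or
    headers, so header text that happens to use allowed letters would
    still pass through. Empty results are skipped.
    """
    for chunk in chunks:
        filtered = bytes(b for b in chunk if b in allowed)
        if filtered:
            yield filtered
```

Since the filter is itself an iterator over chunks, it can be inserted into or removed from the stream without changing the code on either side of it.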
And yes, there is already keen hedge fund interest based upon the
capabilities of earlier versions. Whereas I wasn't entirely satisfied with
certain aspects of those earlier generations at that time from a theoretical
standpoint, I am pretty much satisfied with the design and performance of
this system. I've been working on these problems for many years and have
gone through numerous revisions.
> And do you plan to keep this code to yourself or
> release to the world, in its developmental stage?
It won't be released in a developmental stage regardless. The code base is
quite small and clean but also very intricate; it would be very difficult to
hack unless you knew how it does what it does and why it does it in a
theoretical sense -- I wrote it and it is difficult even for me to think
through the interactions that lead to the correct emergent result. I
haven't decided precisely what to do with the code, but the formation of
a company is underway (name already selected, in fact) and it looks like I
may be able to collect a pretty interesting cast of Silicon Valley
characters for that venture. The exact nature of the commercialization is
still being decided. I've been stalling on commercialization until I could
prove this major milestone to my satisfaction.
There is still work to be done, but it is becoming more a matter of tying up
loose ends and taking care of things I've put off, as well as simply
cleaning the code up a bit and adding some scalability enhancements for very
large systems. I am inclined to put in my own VM since many existing ones don't
take well to being used like this, but the BSD VM that most of the testing
is done on is very fast and smiles the whole time.