From: Jeff Bone (jbone@jump.net)
Date: Sun Dec 30 2001 - 10:17:34 MST
Ben Goertzel wrote:
> Take for example the domain of bioinformatics.
Okay, let's... I spent about 8 months doing market research and technical due
dilly on a bioinformatics play back at the beginning of '99, so I can hopefully
discuss this without being a total 'tard. ;-) (For the record, my thoughts
about structured vs. unstructured data and costs are the result of spending the
last 12 months working on a storage resource management play that's in some
loose ways related to what I learned looking at the bioinformatics opportunity
a couple of years back. It's really been quite amazing seeing where all the
"storage" dollars flow in corporate America. And just for completeness, the
middle period there was spent working on automated text summarization,
classification, and syndication of online newsflow. I may be wrong about any or
all of this, but hopefully I'm at least not totally uninformed.) :-)
> There is a lot of useful information out there, in research papers.
There's a lot of good *summary* information there; let's be clear, though,
that the real information is in the datasets that underlie such papers. I'm
not saying unstructured information is universally better than structured
information; there's some information --- such as that found in bioinformatics
--- which is inherently structured. Data about physical processes is inherently
structured. The unstructured summaries presented in such papers are never
substitutes for the real thing. Further, the assumption that you can create
such good, structured information by reconstituting from summarized,
unstructured form is erroneous at best.
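To make that last point concrete, here is a minimal Python sketch. The two
datasets and the summarize() helper are made up for illustration; the only
claim is that a summary can't tell you which dataset produced it.

```python
from statistics import mean, pstdev


def summarize(dataset):
    """Collapse a dataset into the kind of summary a paper might report.

    Hypothetical helper: count, mean, and population standard deviation.
    """
    return {"n": len(dataset), "mean": mean(dataset), "stdev": pstdev(dataset)}


# Two made-up datasets with clearly different shapes.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]

# The summaries are identical even though the underlying data are not,
# so nothing working from the summary alone can recover the original.
same_summary = summarize(spread_out) == summarize(two_clusters)
same_data = sorted(spread_out) == sorted(two_clusters)

print(summarize(spread_out))
print(summarize(two_clusters))
print("same summary:", same_summary, "| same data:", same_data)
```

Going from dataset to summary is easy; going back is guesswork, because many
different datasets map to the same summary.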
And think about all the effort involved in building a tool to do this, even if
it were possible: it's (a) likely to be a highly domain- (or
sub-sub-sub-domain-)specific creature, and (b) certain to cost more effort than
simply obtaining the original datasets from the various researchers. I.e., the
cost of building such a system amortized across
all uses of such a system will most likely outweigh the aggregate costs of
simply sharing the original datasets, or --- as has clearly happened in the
gene sequencing world --- building better *tools* for collaboratively creating
and sharing such datasets. (Standard formats for and tools for translating
between and managing such structured datasets are without a doubt useful.)
> I don't feel you're being practical at all
Well, right back at ya. ;-) Seriously, understand what I'm saying: structure
and unstructure aren't equivalent; you can't convert freely from one to the
other without losing information in the process; like all engineering
problems, the trick is to use the right tool for the job. Unfortunately, human
beings have a tendency to overcategorize, overclassify, then endlessly argue
about and
revise such ontologies. We get so wrapped up in our model-building that we
forget that the models are just that.
I believe that bioinformatics --- like many other fields --- would be
well-served by the availability of better tools for dealing with unstructured
data. Such tools would include better citation mapping mechanisms, academic
interaction mappers, domain-sensitive search engines, text summarizers,
similarity and "semantic clustering" engines, distributed data repositories,
collaboration / syndication / publication frameworks, etc.
IMO, though, these involve little or none of the kinds of structured
markup of "knowledge" that you hear TBL and friends going on about lately.
I.e., it seems rather impractical to assume that anybody's ever going to sit
down and *agree* on a standard set of "bioinformatics" semantics tags that can
be used to impart structured meaning to unstructured bioinformatics texts. It
seems less likely that people would actually use such an ontology by manually
marking up such texts, and even less likely that any generic method (short of
human-equivalent AI) should be able to do so post facto, and less likely yet
that the results of such would be generally useful within the field.
Here're some guiding principles I've been finding useful in the last few years
that have informed this point of view:
Similarity, not ontology.
Filters / views, not containers.
Inappropriate or too much structure is bad.
Probabilities, not absolutes.
Generic interfaces, not APIs.
Composition, not inheritance.
Data flow, not control flow.
Chronology, not classification.
Free text, not markup.
Searching, not browsing.
Flat, not hierarchical.
Identities, not names.
Embrace messiness.
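
A few of these ("similarity, not ontology", "free text, not markup",
"searching, not browsing", "probabilities, not absolutes") can be sketched in
a handful of lines. This is only an illustration under stated assumptions: the
documents are made up, the function names are hypothetical, and plain
bag-of-words cosine similarity stands in for whatever a real engine would use.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Turn free text into word counts; no tags, no schema."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors, in [0, 1]."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Rank a flat list of documents by similarity to the query.

    Returns (score, document) pairs, best match first. The score is a
    graded measure of relevance, not a yes/no category membership.
    """
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up, unstructured documents in a flat collection.
DOCS = [
    "notes on gene sequencing tools and shared sequence datasets",
    "quarterly storage spending in corporate data centers",
    "automated summarization of online news text",
]

ranked = search("gene sequencing tools", DOCS)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

Nobody had to agree on a tag set or file anything into a hierarchy first; the
ranking falls out of the text itself.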
Cheers ;-)
$0.02,
jb