RE: The inevitability of death, or the death of inevitability?

From: Ben Goertzel (ben@goertzel.org)
Date: Mon Dec 31 2001 - 11:44:14 MST


> > There is a lot of useful information out there, in research papers.
>
> There's a lot of good *summary* information there; let's be
> clear, though,
> that the real information is in the datasets that underlie such
> papers.

Yes, but the real datasets are very rarely available. What industry is
more secretive than pharma and biotech?

And of course, building software to automatically extract semantic
knowledge from diversely structured raw datasets is just about as hard
as building software to automatically extract semantic knowledge from
text ;>

> Further, the assumption that you
> can create
> such good, structured information by reconstituting from summarized,
> unstructured form is erroneous at best.

It can be done to some extent; see, for instance, the work of Rzhetsky
at Columbia:
http://genome6.cpmc.columbia.edu/~andrey/

> And think about all the effort involved in building a tool to do
> this, even if
> it were possible: it's likely to (a) be a highly domain- (or
> sub-sub-sub-domain-)specific creature,

Yes, to date. Rzhetsky's system is an example of that.

> and (b) the cost (effort)
> is certain to
> outweigh the effort needed to simply obtain the original datasets from the
> various researchers.

This is an incredibly naive and ridiculous statement. How are you going
to get the original datasets out of the research divisions of pharma
companies? Even academics won't share their datasets with other
academics for fear of helping the competition.

About this aspect of the sociology of the research community, I'm afraid
you really don't have a clue.

> I.e., the cost of building such a system
> amortized across
> all uses of such a system will most likely outweigh the aggregate costs of
> simply sharing the original datasets, or --- as has clearly
> happened in the
> gene sequencing world --- building better *tools* for
> collaboratively creating
> and sharing such datasets.

There is a lot of publicly available information in the bio domain, but
a hell of a lot of secret information too. Contrast the situation with
yeast (for which there are extensive public databases on protein-protein
interaction, gene expression, and so forth) with the situation for
humans (nearly all data is private, and info is available only in
research papers).

> Seriously, understand what I'm
> saying: structure
> and unstructure aren't equivalent; you can't convert freely from
> one to the
> other without losing information in the process;

Yes, of course that's true. You also can't communicate from one mind to
another using language without losing information in the process, but we
do so quite regularly and usefully.

> I believe that bioinformatics --- like many other fields --- would be
> well-served by the availability of better tools for dealing with
> unstructured
> data. Such tools would include better citation mapping
> mechanisms, academic
> interaction mappers, domain-sensitive search engines, text summarizers,
> similarity and "semantic clustering" engines, distributed data
> repositories,
> collaboration / syndication / publication frameworks, etc.

All these are good things, but I still believe that information
extraction from NL text will be an important part of the toolkit too.

> I.e., it seems rather impractical to assume that anybody's ever
> going to sit
> down and *agree* on a standard set of "bioinformatics" semantics
> tags that can
> be used to impart structured meaning on unstructured
> bioinformatics texts.

Check out

http://www.geneontology.org/

It already exists and is incorporated in some commercial projects, e.g.
Spotfire.

ben


