>The human genome is 3.36 billion base pairs, or
>840 MBytes raw. Using an efficient compression
>algorithm it can be compressed to 21% of it's raw
>size, hence there is 180 MBytes of unique data in
>the genome.

Unfortunately the case isn't that simple. Not all DNA is
information-coding; in fact, most of it isn't. Early estimates put the
amount of protein-coding DNA at about 10%. We now know that it's closer
to 2%. But which part of the genome do you count as "information"?
Just the exons that make it to mature peptides and RNA? How about the
introns (with important intron/exon boundary and branchpoint sequences that
inform the splicing apparatus) and 5' and 3' untranslated regions? How
about promoter and enhancer sequences? What about sequences that don't
directly affect gene expression, but that get methylated (or whose histones
get acetylated) and can thus affect chromatin structure and, indirectly,
expression of nearby genes? How about AT/GC ratios? Do those count as
information? What about
the selfish DNA sequences (transposons, etc.) that make copies of themselves
and act like genomic parasites? By inserting themselves into or near genes,
they can disrupt those genes' sequence or expression.

It's all rather complex, and by the most liberal estimate perhaps 20% of the
genomic sequence carries information that is pertinent to the development and
function of H. sapiens. That's maybe 600 or 700 million base pairs (roughly 160
MB). I draw the distinction because you could have, say, a document that is
840 kb, but what if 80% of it is random, wordless strings of letters? You
wouldn't call that information.
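
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. It assumes 2 bits per base (four possible bases) and 1 MB = 10^6
bytes; the function name and constants are just illustrative, and the 20%
is the liberal estimate above, not a measured value.

```python
BITS_PER_BASE = 2  # assumption: A, C, G, T each fit in 2 bits


def raw_megabytes(base_pairs: float) -> float:
    """Size in MB (10^6 bytes) of a sequence stored at 2 bits per base."""
    return base_pairs * BITS_PER_BASE / 8 / 1e6


GENOME_BP = 3.36e9            # genome size from the quoted post
INFORMATIVE_FRACTION = 0.20   # the liberal estimate, an assumption

genome_mb = raw_megabytes(GENOME_BP)
informative_bp = GENOME_BP * INFORMATIVE_FRACTION
informative_mb = raw_megabytes(informative_bp)

print(f"whole genome:     {genome_mb:.0f} MB")
print(f"informative part: {informative_bp / 1e6:.0f} Mbp, {informative_mb:.0f} MB")
```

With those inputs the whole genome comes to 840 MB and the 20% slice to
about 672 million base pairs, or about 168 MB, which is in the same ballpark
as the figure above.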

Anyway, that's a more accurate number to use for the information content of
the human genome.

Martin Striz

