Friday, 31 May 2013

Data compression

I could have gotten maudlin and sentimental yesterday as we drove Dau.II off up to Dublin for her new life and I did . . . but not really about her: so full of hope, so full of the future. But I took the opportunity of being in Dublin to drop into my old place of work. The boss has decided to shift his traps to the Other University after 25 years man-and-boy in the same place. I bumped into my Venter Code pal as I was leaving and he said they were not taking much of the accumulated rommel to the spare, austere premises in their new workplace. He asked if I would take "The GenBank Tape" because it was likely to finish up in the dumpster if I didn't. So I brought home one of the last 2400ft magnetic tapes left in Ireland as a souvenir of how it was during the war, back in the early 1990s.

Amazingly you can buy "new" still-in-its-packing 2400ft mag tapes on eBay - a snip at $49.95 + $43.20 p&p. I am told that banks still use such legacy hardware because they are terrified of changing to something handier in case the data gets corrupted in the transition.

GenBank is the database of every DNA sequence known to humankind and for the last 30 years has been more or less following Moore's law and doubling in size every 18 months. The last release of GenBank that we received in the post may well have been Release 74 from December 1992, which contained just under 100,000 different sequences made up of 120,242,234 base pairs (bp). So effectively that's what I brought home last night. If only I had a machine that could read it. The database is/was about twice as big in bytes because for every 1000 bp of sequence there are/were about 1000 letters of plain text explaining what the sequence means. So the tape I now have propped up on top of the piano at home contains about 250 megs of data. I weighed it on the kitchen scales at 980g - it almost balanced a bag of sugar.
While I was in weighing mode I took my 4GB USB key out of my shirt-pocket and clocked that at 10g.

16x as much data.  1% of the weight.  That's compression.
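
For anyone who wants to check the kitchen-scale arithmetic, here is a rough sketch in Python. It uses only the figures quoted above; treating "megs" and "GB" as decimal units (10^6 and 10^9 bytes) is my assumption, and the variable names are purely illustrative.

```python
# Figures quoted in the post. Decimal units are an assumption:
# 1 MB = 10**6 bytes, 1 GB = 10**9 bytes.
TAPE_BYTES = 250 * 10**6   # "about 250 megs of data"
TAPE_GRAMS = 980           # kitchen-scale reading for the tape
USB_BYTES = 4 * 10**9      # 4GB USB key
USB_GRAMS = 10             # kitchen-scale reading for the key


def bytes_per_gram(n_bytes, grams):
    """Storage density: how many bytes each gram of medium carries."""
    return n_bytes / grams


capacity_ratio = USB_BYTES / TAPE_BYTES   # how much more data the key holds
weight_ratio = USB_GRAMS / TAPE_GRAMS     # key weight as a fraction of tape weight
density_gain = (bytes_per_gram(USB_BYTES, USB_GRAMS)
                / bytes_per_gram(TAPE_BYTES, TAPE_GRAMS))

print(f"capacity: {capacity_ratio:.0f}x the data")
print(f"weight:   {weight_ratio:.1%} of the tape")
print(f"density:  {density_gain:.0f}x more bytes per gram")
```

Putting the two ratios together, the key carries something like 1,500 times more bytes per gram than the tape, give or take the accuracy of a kitchen scales.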
OTOH, GenBank is now 151,178,979,155 bp in size - as near as dammit 1000x bigger than it was 20 years ago in June 1993.
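
The growth claim can be sanity-checked the same way. This is a back-of-envelope sketch, not an official GenBank statistic: it takes the Release 74 base-pair count and today's count from the text, and assumes a round 20-year gap between them.

```python
import math

# Base-pair counts quoted in the post.
BP_THEN = 120_242_234        # Release 74, December 1992
BP_NOW = 151_178_979_155     # current size quoted in the post
YEARS = 20                   # assumption: the round "20 years" used in the post

growth = BP_NOW / BP_THEN                 # fold increase over the period
doublings = math.log2(growth)             # how many doublings that represents
doubling_months = YEARS * 12 / doublings  # implied average doubling time

# What a strict 18-month doubling would have produced over the same span.
moore_growth = 2 ** (YEARS * 12 / 18)

print(f"observed growth:       {growth:,.0f}x")
print(f"implied doubling time: {doubling_months:.1f} months")
print(f"18-month doubling:     {moore_growth:,.0f}x")
```

On these two data points alone the implied doubling time comes out nearer two years than 18 months, which still fits "more or less" following Moore's law.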

  1. No, THIS is compression, 2.2 petabytes (10¹⁵) per gram!