Saturday, 9 March 2013

Digitising the OED

Here's another snippet from The Information which was quite evocative.  In 1987, the Oxford English Dictionary elected to digitise itself.  They estimated that it was about a gigabyte in size and would require 120 typists to capture the information on an IBM mainframe not a lot smaller than Charles Babbage's 1850s Analytical Engine.  That works out at about 600 elapsed hours of parallel typing - call it 4 months.  They were clearly in a hurry to get the information loaded up.
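As a back-of-envelope check, those figures imply a typing rate per typist. The sketch below is only that: it assumes a gigabyte means 10^9 characters, one keystroke per character, and a 40-hour working week, none of which is stated in the book snippet, and the function name is made up for illustration.

```python
# Rough arithmetic only; every constant marked "assumption" is mine, not the OED's.
GIGABYTE_CHARS = 10**9   # assumption: 1 gigabyte = 10^9 characters
TYPISTS = 120            # from the text
ELAPSED_HOURS = 600      # from the text
HOURS_PER_WEEK = 40      # assumption: a standard working week


def implied_rate_and_duration():
    """Return (characters per minute per typist, elapsed working weeks)."""
    chars_per_typist = GIGABYTE_CHARS / TYPISTS
    chars_per_minute = chars_per_typist / (ELAPSED_HOURS * 60)
    weeks = ELAPSED_HOURS / HOURS_PER_WEEK
    return chars_per_minute, weeks


if __name__ == "__main__":
    rate, weeks = implied_rate_and_duration()
    print(f"{rate:.0f} characters per minute per typist, about {weeks:.0f} working weeks")
```

Under those assumptions the rate comes out at a little over 230 characters a minute, and 15 working weeks is near enough the 4 months quoted.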

A tuthree years after the OED's mammoth project, I was analysing DNA sequences.  We were living in interesting times because digitisation was coming in but wasn't fully implemented.  In 1982, a database of DNA sequences had been created.  The first edition (the database is updated every two or three months) had about 600 entries and was circulated on paper.  For the last 30 years, the database has been doubling in size every 18 months and now has 150 million entries comprising 150 billion base-pairs - the As, Ts, Cs and Gs that encode all life on earth.  The scientific journals were starting to implement a policy that they would only publish sequence-based papers if the sequence itself had been submitted to the DNA database.  But occasionally a key sequence would appear in the University library as hard-copy on paper and we'd be too impatient (young thrusters that we were) to wait for the next edition of the database to pick it up.  So we'd type it in.  It is really difficult to proofread a DNA sequence, so the standard practice was to type the sequence in twice, compare the two versions and fix any discrepancies.  Just like the OED, it seemed smarter and quicker to have two people type the sequence in parallel and then run the comparison.  Which inevitably led (young thrusters that we were) to a race.  Typically the task would be a kilobase (1000 ATCGs) and I can't remember how long it took us then; I've just spent 20 minutes typing in a kilobase to see, but we must have been quicker than that.
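The comparison step might look something like the sketch below. It is not the program we used back then; it leans on Python's standard difflib module, and the function name is invented for illustration. An aligning comparison is used rather than a position-by-position one because a single dropped letter would otherwise throw every later position out of step.

```python
from difflib import SequenceMatcher


def discrepancies(first, second):
    """List the places where two typed copies of a sequence disagree.

    Each item is (kind, position_in_first, text_in_first, text_in_second),
    where kind is 'replace', 'delete' or 'insert' relative to the first copy.
    """
    # autojunk=False stops difflib from treating very common letters as
    # noise, which matters when the whole alphabet is A, T, C and G.
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (kind, i1, first[i1:i2], second[j1:j2])
        for kind, i1, i2, j1, j2 in matcher.get_opcodes()
        if kind != "equal"
    ]


if __name__ == "__main__":
    typist_one = "ATCGGA"
    typist_two = "ATCTGA"
    for item in discrepancies(typist_one, typist_two):
        print(item)
```

An empty list means the two copies agree, which is reassuring but no guarantee: both typists could have made the same slip.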

Now a 20-minute task is the kind of thing where it's probably best to just 'suck it up' and do it, rather than spend 20 minutes thinking up a niftier, more efficient way of doing it, IF the task presents infrequently.  But I did wonder, back then, if it would be quicker to stick A T C G on the numeric pad of the keyboard, type 112343132431232431222 etc etc and convert the numbers to A T C G afterwards - those letters being rather awkwardly placed on the QWERTY keyboard.  It was certainly quicker to type from dictation if you could persuade someone to read the letters over to you.
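The conversion afterwards would be the easy bit. Here is a minimal sketch, assuming the keys are assigned in the order 1=A, 2=T, 3=C, 4=G; that assignment is my guess for illustration, as is the function name.

```python
# Assumed key assignment: 1=A, 2=T, 3=C, 4=G (any fixed assignment would do).
KEYPAD_TO_BASE = str.maketrans("1234", "ATCG")


def digits_to_bases(digits):
    """Turn a string of keypad digits into a DNA sequence.

    Whitespace and line breaks are ignored; any other unexpected
    character raises ValueError so that a mistyped key is not silently lost.
    """
    cleaned = "".join(digits.split())
    stray = set(cleaned) - set("1234")
    if stray:
        raise ValueError(f"unexpected characters: {sorted(stray)}")
    return cleaned.translate(KEYPAD_TO_BASE)


if __name__ == "__main__":
    print(digits_to_bases("112343132431232431222"))
```

Rejecting stray keys like 5 or 0 outright seems safer than skipping them, since a skipped keystroke is exactly the kind of error the double-typing was meant to catch.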

