Sunday, 14 April 2013

Human genome complete! (10th Anniversary)

The Human Genome was brought to "completion" on this day ten years ago.  This bottle of champagne defined "complete" as 99.99% accurate over 99% of its 3,320,602,131 base-pair length.
I'm not talking about 15th & 16th Feb 2001 when The Human Genome was officially published in the two premier science journals Nature and Science.
And certainly I'm not talking about the absurd and hubristic raree show of 26th June 2000 when Bill Clinton and Tony Blair took a bow for their contribution to the science of genomics in a para-atlantic announcement. A selection of primary reports is available. That's the data, but there's plenty of back-story - far too much for one post, indeed.

In my 1st year chemistry class this last week we were doing TLC. The data that each pair was asked to put up on the board was a ratio:
D traveled by the chemical : D traveled by the solvent edge
As the numbers reeled off their calculators and went up for public scrutiny, I chid them all to use an appropriate level of accuracy.  If three independent estimates of the ratio for, say, aspirin are up on the board as 89% from Alice, 86% from Bob, and 91% from Chuck, then it's fatuous for Derek to write his estimate of the same ratio as 87.23% because that's what his calculator's display says.  To do so is "spurious accuracy" and shows that poor D hasn't a clue about what the ratio means, why it is useful or how reproducible and characteristic the estimate is.
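For anyone who'd rather let the computer do the chiding, here is a minimal Python sketch of the idea, not a lab standard. It assumes that the spread between independent estimates (the sample standard deviation) is a fair guide to how many digits are worth reporting; the function names retention_ratio and round_to_spread are made up for illustration, and the only data used are the three ratios from the board above.

```python
import math
import statistics


def retention_ratio(compound_distance, solvent_distance):
    """Ratio of distance travelled by the chemical to distance travelled by the solvent edge."""
    return compound_distance / solvent_distance


def round_to_spread(estimates):
    """Return (mean, spread), both rounded to the decimal place of the
    first significant digit of the spread between the estimates.

    Assumption: the sample standard deviation is a reasonable stand-in
    for the reproducibility of the measurement.
    """
    mean = statistics.mean(estimates)
    spread = statistics.stdev(estimates)  # needs at least two estimates
    if spread == 0:
        # Identical estimates tell us nothing about precision; return as-is.
        return mean, spread
    # Decimal position of the first significant digit of the spread,
    # e.g. a spread of 0.025 gives 2 decimal places.
    decimals = -math.floor(math.log10(spread))
    return round(mean, decimals), round(spread, decimals)


# Alice, Bob and Chuck's ratios for aspirin, written as fractions.
board = [0.89, 0.86, 0.91]
mean, spread = round_to_spread(board)
print(f"ratio = {mean} +/- {spread}")
```

With a spread of about 3 percentage points between Alice, Bob and Chuck, the second decimal place of the ratio is already doubtful, so Derek's 87.23% claims two digits more than the class data can support.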

So it's shoulder-slumpingly depressing for the huge tax-payer-funded genomic monolith Ensembl to continue reporting the length of the human genome as 3320602131 bp (i.e. not 3,320,602,132). It makes my job of training young scientists that much harder.  It also shows that Ensembl hasn't really grasped that there isn't one human genome (Craig Venter's?  Jim Watson's?  Ewan Birney's?) but about 7 billion of them and yours probably isn't 3,320,602,131 bp long.

Oooo, what a cross-patch! I hear you say.  My only excuse is that this is the first time I've bin bloggin' when ever-so-slightly hung-over.

So let's end on a positive note with a tribute to Jim Kent.  Clock-back: as 1999 turns into 2000 it is decided that a line needs to be drawn under the Human Genome Project because everybode kno that the project is really only just starting and it's going to go on for another generation - the 1000 Genomes Project had its Phase 1 milestone on 21st October 2012.  ANNyway, the HGP has accumulated shed-loads of sequence data, maybe 30 billion base pairs to give what we call 10x coverage of the genome.  The base-pairs are not all in a heap of 3 billion unlinked As Ts Cs and Gs but they are a loooonnng way from a crisp stack of 23 contiguous chunks; one for each of our pairs of chromosomes.  The Knobs (in probably both senses) at the top of the project hatch a cunning plan to have a press conference fronted by Blair and Clinton.  The White House (Blair lives in the kennel) comes back suggesting 26th June as a window some months away in the President's busy schedule, and the HGP knobs agree.  But it's a bit of a punt because they don't have a plan (cunning or otherwise) for how the several thousand chunks are going to be assembled into something like the correct order with a minimum of wayward duplications and unaccountable missing bits.

Meanwhile Jim Kent is a graduate student with David Haussler at UC Santa Cruz. He's a hot-shot programmer and he and Haussler are interested in analysing the human genome, once it's assembled, in all sorts of nifty and creative ways. They know about the announcement and the days are ticking off and the HGP isn't announcing the assembly and the days are ticking off.  So Jim cracks and decides that he, Jim, is going to have to step up to the plate and do the assembly.  He believes he has a protocol that might work, he just needs a) to write a few thousand lines of code and b) a big-assed computer to do his bidding.  In contrast to Celera Genomics (which is burning dollars as the alternative commercial human genome project in contra-parallel to the tax-payer-funded HGP) Haussler and Kent don't have a big-assed computer or the money to buy one.  So they tool off to the local computer store and buy 50 PCs off the shelf, which Jim claggs together into a parallel processing computer cluster (PPCC).  In May (deadline is 6 weeks away), Jim sits down to code.  He pulls the first of many all-nighters, he works all day, he eats on the hoof, he falls asleep at the terminal, he piles up lines and lines and lines of code.  He tries this, he debugs that, he cracks a niggling redundancy issue, he blasts through a logical log-jam with a precisely placed small-small stick of algorithmic dynamite.  It's a huge problem with horrendous dependencies and the game plan must be stored in Jim's head: there are 3 billion bits (that's a pile of sugar cubes 3 times bigger than our house) which come in only four colours and they have to be assembled in one precise and particular order.  A thousand piece jig-saw of a clear blue sky is as nothing to Jim's task.

On 22 June (D-day minus 4) Jim Kent calls time and delivers the goods. Craig Venter's team of hackers at Celera Genomics deliver their version 3 days later as press secretaries and publicists across two continents make whimpering noises and have fits of the vapours.

Jim Kent - wottaguy!

