Saturday, 23 March 2019

Reproducing the code

Yesterday, Friday, I got up, tea'd up and spruced up because I was Countryboy Going To Dublin for the 'annual' VIBE symposium. Wearing my 2014 VIBE organiser hat, I had sent out a call to everyone to remember their name badge, so that I had a chance of remembering who they were:
Goodo, looking forward. No registration means no name-badges so could everyone remember to bring one? from their last conference?  Just as a courtesy to confused old buffers and total newbies.
Thanks, Bob
The previous (2016) organiser added a note:
And/or a roll of blank stickers that people can just write their name onto. 
To which the 2019 organiser replied:
Great; please bring one.
Clearly some people are more invested in the Science than interpersonal signalling. And IMO, a roll of stickers is A Bad Idea because they are not designed to stick to jumpers and it invites useless contributions as shown [R]: Bubba? Who he? Who his boss? Which is his poster? What group is he affiliated with? Should I get concerned when he brings out a banjo?

The presentations were wide-ranging and all interesting, even those that were, whoof, totally over my head. Molecular sequence analysis is amazingly ambitious, seeming to keep up with the precipitous drop in the cost of generating the primary sequence data. My report of VIBE 2013 noted that analytical throughput had increased 4000-fold in a decade while the co$t per DNA base had fallen by a factor of 10 million in 20 years. In the early 00s, I was tasked with determining if certain classes of genes [those switched on in the cells of the heart, liver and lights] were clustered in the human genome, which had just been released into the public domain. I learned to program in Perl, robbed code from my fellow workers at the genomics, hacked it to do my bidding, learned how to display data graphically using the GD add-in and presented my progress regularly at the lab-meetings. It was a challenge, it was hard work, it was great fun and eventually my mates started robbing my GD code to use for their own projects. A small part of my work-flow was getting a program to work - my first binfo boss used to say "that's just a simple Fortran program, take you ten minutes" which was order-of-magnitude true: it took maybe 90 minutes. But it took the rest of the day, and sometimes the rest of the week, to be confident that the results spat out were true and meaningful. I won't say I was a programming god, but I could do the work even if I still don't feel comfortable with associative arrays.

The first talk on Friday was an over-my-head job: Research and Clinical Applications of Fully Reproducible Containerised Workflow Architecture. Dang, but that could mean anything at all to an old code-kludger like me. Even though I didn't recognise the names of any of the software tools - they having been developed over the last 7 or 8 years since I fell away from the coal-face of science and went back to teaching down the country - I followed with admiration the aspiration. Reproducibility is the key!  It's like using Latin names to talk to Hungarian ornithologists. IF you build a pipeline to integrate a wild variety of disparate but relevant data so that you can develop predictors or therapies or a cure for colon cancer THEN you really need it to be reproducible in Hungary or next year. 

I've been there, and I've been caught with my pants down. Maybe 15 years ago, I was working in St. Vincent's before it became SVUH, the University Hospital. The colon cancer people had generated a noisy multivariate dataset and asked me to see if there was any signal there. I had a word with my pal Aedin [prev] before she went all ad Astra and to Harvard on us. She was the local ADE-4 guru and with her help I beat the colon cancer data into submission. I was skeptical about the results I'd generated; not because they were wrong, but because they explained so little of the variation in the data. I sent the analysis up the line, with my skeptical warning, and Team Colon were delighted because TOP2A, a known actor in the development of colon cancer, bobbed to the top of the roiling mess of data; you could see it there, barely visible, awash with statistical noise, but top of the heap of over-expressed genes. The project wasn't written up immediately: the bloke who'd done the original analysis finished his rotation, his boss moved to a different hospital and her boss was too busy saving lives. I moved to a different institution too. Four years later, I got a call to redo the analysis because the paper was finally ready for launch and they were checking through their pipeline; and the graphics I'd generated were kinda ugly.

Well, it was a freaking nightmare. The computer on which I'd done the original analysis had been left in StVs as obsolete. ADE-4 had been totally re-written, and was reluctant to install on the new computer. I couldn't remember how to get it to work and Aedin was 5,000km away in Harvard rather than up the road in UCD. I knuckled down and did what I was asked but that's when my hair-loss started. TOP2A was still tops, which was a relief, and the paper was published. If it had been left to me, it would have been written off rather than written up - but then my scientific publication record is woeful. Even with the same people and precisely the same data, reproducibility was a hard slog.

The small world connection here is that the presenter of the Containerised Workflow Architecture talk yesterday answers to the same boss as the boss-boss-boss I was serving in 2005 and 2009. In the intervening years, the quality and esp. the quantity of the data has gone up and up. By using containers in your workflow you can let collaborators and rivals replicate your findings and run precisely comparable analyses of their own. The science will be reproducible and we'll be nearer to eliminating colon cancer.  All good until a senior scientist at Friday's VIBE asked - what happens if there is a mistake in your code; will everyone be replicating that error?

Hmmmm! As Osric has it: A hit, a very palpable hit. Later in the day, I was talking to a handful of MSc students who'd come up from Galway for the meeting but missed the first talk. They are a useful mix of biologists and coders and I said that I'd been part of a similar group when their boss had moved from mathematical physics to biological sequence analysis and helped me with my Perl scripting. I hoped they were sharing their expertise and propping each other up. And I gave them an executive summary of the first (Containerised Workflow Architecture) talk of the day, which they'd missed, and emphasised the critical "replicating that error" question. Don't do what we used to do, I said: if you are all robbing each other's code there's a good chance that errors will propagate through the lab. If we use computers [and especially fancy pipelines of interconnected software] to find out something new, nifty and important then someone else should rewrite as much of the code as possible from scratch. There are a hundred different ways of computing any reasonably complicated analysis; if you have two independently generated solutions and they agree then your confidence in the result is more than doubled. Ain't gonna happen though, everyone wants their own apple to polish.
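
For the coders among those MSc students, the two-independent-solutions idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: GC content is my stand-in for a real analysis, and the function names (gc_fraction_loop, gc_fraction_counter, cross_check) are hypothetical, not from any talk or library. The point is that two routines written separately are run on the same random inputs and any disagreement is flagged.

```python
import random
from collections import Counter


def gc_fraction_loop(seq: str) -> float:
    """First implementation (hypothetical): walk the sequence, count G and C."""
    if not seq:
        return 0.0
    hits = 0
    for base in seq.upper():
        if base in "GC":
            hits += 1
    return hits / len(seq)


def gc_fraction_counter(seq: str) -> float:
    """Second implementation, written separately: tally every base, then sum."""
    if not seq:
        return 0.0
    tally = Counter(seq.upper())
    return (tally["G"] + tally["C"]) / len(seq)


def cross_check(impl_a, impl_b, n_trials=1000, max_len=50, seed=0, tol=1e-12):
    """Feed both implementations the same random DNA strings.

    Returns the list of input sequences on which the two results differ
    by more than tol; an empty list means they agreed on every trial.
    """
    rng = random.Random(seed)  # fixed seed so the check itself is reproducible
    disagreements = []
    for _ in range(n_trials):
        length = rng.randint(0, max_len)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if abs(impl_a(seq) - impl_b(seq)) > tol:
            disagreements.append(seq)
    return disagreements


if __name__ == "__main__":
    bad = cross_check(gc_fraction_loop, gc_fraction_counter)
    print(f"{len(bad)} disagreements found")
```

Agreement on random inputs is not proof that either routine is right, but a shared bug is much less likely when neither author has robbed the other's code.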
