Monday, 13 April 2015

Not waving but drowning

. . . in statistical noise.  I've just read a nice piece, The Myth of the Little Ice Age, about finding signal in the noise, by two economists from University College Dublin (UCD), Cormac Ó Gráda and Morgan Kelly.  Yes, that Morgan Kelly, the one who predicted the bursting of the Irish property bubble in 2007 and who was invited at the time to commit suicide by the Taoiseach Bertie Ahern: "Sitting on the sidelines, cribbing and moaning is a lost opportunity. I don't know how people who engage in that don't commit suicide because frankly the only thing that motivates me is being able to actively change something".  It is a fact known to all that Europe experienced a Little Ice Age for about 400 years before the Industrial Revolution started to heat the world economy and the planet in the 19thC.  What Kelly and Ó Gráda show is that this is a bit Mark Twain: "It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so."  The human eye has an extraordinary ability to find patterns in what we see: if we didn't recognise the rosette of black spots in the jungle as part of a leopard, then we got eaten before we'd passed on our genes.  The problem is that we have a strong tendency to over-predict and find patterns when they just are not there. That was a good call back in jungly days - running away from a 'false positive' was much less costly than hanging around to confirm 'leopard'.  It's less good now - finding signal where there is just noise may have us shelling out folding money for cures or diets or services that just don't work except as anecdote.

Here above is the key figure from Kelly and Ó Gráda's post.  The figures in the top panel are 'smoothed' by taking a 25-year moving average and plotting that.  The lower panel is the raw data of Summer temperature in the Netherlands, the place for which we have the longest run of reliable data: all spikes and crevasses but no suggestion of a long trough in the middle of the graph.  In my 'umble, there is little enough sign of such a trough in the middle of the timeline even when smoothed.  There was a very good run of hot Summers in the 1780s and 1790s, for example. What economic historians have done in the past is take such data and 'over-fit' their sort of information [the price of wheat; estate rent-rolls; accounts of battles] and come up with an internally consistent story that is . . . just a story. Kelly & Ó Gráda identify these as 'anecdotal evidence' which is, in my blog, a contradiction in terms. The fact that the Thames at London froze during the Winter and supported a Fairground is put down to Old London Bridge holding back a pool of cold water rather than to days and days of brutal low temperature. They also acknowledge that there are years/decades where there are runs of bad and good weather - can you spot the cold-Summer blip caused by the eruption of Tambora in April 1815 or Krakatoa in 1883?

In their analysis these economists acknowledge an earlier one carried out by the economist and statistician Євген Євге́нович Слу́цький (Eugen Slutsky), who was born in Yaroslavl Oblast but is claimed as a son by both Russia and Ukraine. Slutsky took it into his head to process some random numbers with the tools used by economists to predict trends and effects in economic cycles.  The patterns were " . . . indistinguishable from business cycles".  When you hear a pundit on the wireless 'explaining' some fall in the price of Apple shares as being due to "profit taking" or "the ghost of Steve Jobs" you have a choice: take a huge pinch of salt or send a neatly wrapped parcel of horse-shit to the talking-head's business address.
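If you want to see Slutsky's trick for yourself, here is a minimal sketch in Python. It is my own toy version under stated assumptions, not Slutsky's actual procedure nor Kelly & Ó Gráda's code: the 2000 Gaussian random numbers, the seed, and the helper names `moving_average` and `lag1_autocorrelation` are all made up for illustration. The 25-point window matches the smoothing used in the temperature figure. The lag-1 autocorrelation measures how much each value resembles its neighbour: near zero for raw noise, and much higher after smoothing, which is what the eye reads as slow 'cycles'.

```python
import random


def moving_average(values, window):
    """Simple moving average: one output value per full window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    out = [total / window]
    # Slide the window along: add the value entering, drop the value leaving.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


def lag1_autocorrelation(values):
    """Correlation between each value and the one immediately after it."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


# Assumption: pure Gaussian noise standing in for 'random numbers'.
rng = random.Random(1815)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
smoothed = moving_average(noise, 25)

raw_r = lag1_autocorrelation(noise)
smooth_r = lag1_autocorrelation(smoothed)
print(f"lag-1 autocorrelation, raw noise:      {raw_r:.3f}")
print(f"lag-1 autocorrelation, 25-pt smoothed: {smooth_r:.3f}")
```

Plot `smoothed` and you get rolling hills and valleys that beg for a story, although nothing but noise went in.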

All this finding signal where there is none reminded me [frisson] of one of the Three Good Ideas which have been my contribution to 40 years in Science. The first substantive bit of whole genome DNA sequencing was chromosome III of baker's yeast Saccharomyces cerevisiae.  Chr.III is about 320,000 bases long and contains about 180 protein coding genes.  About 25 years ago, the European Union formed a consortium and divided up the chunks for sequencing among a number of labs across Europe. They were paying 1 Ecu per base and Ireland (TCD, Genetics) undertook to sequence 10 kilobases as their contribution to the big push.  For perspective, the human genome is almost exactly 10,000x bigger and we can now sequence a complete human genome for €1000 rather than the €3 billion it would cost if technology hadn't made things much more efficient.  I was working in a molecular evolution / bioinformatics / sequence analysis lab at the time, also in TCD Genetics.  Although my boss and I had made no contribution to the 10 kilobases, we carried out the first non-trivial evolutionary analysis (trivial being identifying where the genes were etc.) of the first chunk of genomic DNA from the first species to be so attacked.  What we did was calculate the % G+C for each of the 178 genes (G and C are two of the four bases/units that make up DNA) along the chromosome:
The panel on the left is admittedly noisy, like Summer in the Netherlands, and my contribution was to take a 15 point moving average, explicitly treating the DNA sequential data like an annual price of wheat time-series as if I was an economic historian. When you do that, the smoothed data develops two peaks, one in each arm of the chromosome.  My Gaffer was extremely skeptical (that's why he was The Boss, and why he went on to get himself an FRS a couple of years ago) and made me do the analysis again, do it backwards, look for obvious bias, check for known constraints, do an 11, 13, 17 and 19 point moving average, look again for some artifact in the data . . . and eventually agreed to write it up and submit it to Nature.  The Nature editorial board that week didn't recognise the ground-breaking first-first-first nature of our paper but it was readily accepted by N.A.R.  It made quite a stir in the yeast sequence analysis community and when the sequence of the second yeast chromosome [chr.XI; 666,500 bases] was published the graphics included a red-hot-bar indicating regions of high-medium-low %G+C.  This fluctuation in G+C content became known thereafter as the Dujon Effect after the Institut Pasteur's Bernard Dujon, the first author of that second paper.  Grrrrr?
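For anyone who wants to poke at this sort of analysis, here is a hedged sketch in Python. It is not the code we ran 25 years ago: `gc_percent`, `moving_average`, `smoothed_range` and `shuffle_p_value` are names I have invented here, the 178 values are made-up random numbers rather than real chr.III data, and the shuffle test is a check I am adding for illustration, not something from the paper. The idea is to compute %G+C per gene, smooth with several window sizes, and then ask how often a random re-ordering of the same genes gives smoothed peaks and troughs at least as dramatic as the observed ones.

```python
import random


def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(1 for b in bases if b in "GC") / len(bases)


def moving_average(values, window):
    """Simple moving average: one output value per full window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


def smoothed_range(values, window):
    """Height of the highest smoothed peak above the lowest smoothed trough."""
    smoothed = moving_average(values, window)
    return max(smoothed) - min(smoothed)


def shuffle_p_value(values, window, trials=1000, seed=0):
    """Fraction of random orderings whose smoothed range is >= the observed one."""
    rng = random.Random(seed)
    observed = smoothed_range(values, window)
    shuffled = list(values)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if smoothed_range(shuffled, window) >= observed:
            hits += 1
    return hits / trials


# A toy 'gene' (made up), just to show the per-gene calculation.
print(gc_percent("ATGGCGCCTTAA"))

# Assumption: 178 made-up %G+C values with no real positional signal.
rng = random.Random(3)
toy_gc = [rng.gauss(38.0, 3.0) for _ in range(178)]
for window in (11, 13, 15, 17, 19):
    print(window, shuffle_p_value(toy_gc, window))
```

A small fraction would suggest that the order of the genes along the chromosome matters; a large one would suggest the peaks are the sort of thing smoothing conjures out of noise. Feeding in real per-gene values in chromosomal order is left to the interested reader.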

I've been dining out on the story for the last 22 years. Now, thanks to Kelly & Ó Gráda, I'm thinking the whole thing was an over-fitted storm in a small tea-cup.  Make that Two Good Ideas in a lifetime of science?
