Science matters: Whoa Excel woes

The thing about software is that it is designed to make things easy - if it well designed. If things are easy to do then they are done often and possibly without sufficient thought. If you had to statistical analysis by hand, you'd make damned sure that you designed the experiment well so that the results of your analysis meant something.

If the statistical analysis is no more than 10 or 15 minutes making mystical passes with Excel then you'd be more inclined to have a punt at the experimental design fully prepared to do it again - properly - if the data turns out to be impossible to analyse. In my teens, I did quite a few statistical computations 'by hand' - working through columns of data calculating the square of each value, the sum of the squared values, the square of the sum of the values etc. All without a calculator because they hadn't been invented in 1971. If you made an error it would propagate through the table and become a huge distorting wen on the face of the analysis. I must be the youngest person on the planet to use Charlier's Checks for internal consistency in a mighty table set up to calculate the variance of a dataset. Charlier's Checks required additional bookkeeping work, true, but they also gave you a solid confidence in the answer. Now you just lash the numbers into Excel and demand Descriptive Statistics to obtain not only the Variance, but the coefficients of kurtosis and skewness as well [example R]. Ten years later, when calculators were cheap and widely available, I helped construct one of the last hand-cranked commercial spreadsheets before Visicalc and Lotus1-2-3, the progenitors of Excel, swept such activity into the dustbin of history.

Excel is designed for businesses rather than science ans seems unduly driven by time and date. If you are a teeny bit careless the number 12 will become 12th Jan 1900 while 19,892 will be treated as my birthday 17th June 1954. Excel has a number of such helpful features which, if you're not careful, will leave you with a red face if nit in tears. There was a paper in Genome Biology a couple of years ago by Ziemann, Eren and El-Osta, which exposed a widespread casualness about data-in-excel. Molecular biology and genome analysis are now capable of generating terabytes of data for half-nothing in money and time. Scientific papers would be unreadable if all the data were put in tables; if not unreadable, then at least many many pages long. The rules of engagement dictate that the raw data has to be made available, so that rivals, reviewers or referees can critically evaluate it. The data is, accordingly installed in Supplementary Tables somewhere on The Cloud. Z.E.E. wrote a bot which trawled through these Excel tables associated with a number of reputable Journals and looked for damn-foolishness, sloppy copy-editting, and culpable negligence. They found a great deal of errors like where the gene Septin2 aka Sept2 had been helpfully translated as 02-Sep or March1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] became 03-10-2006 because Excel decided these names looked like dates. More to point, none of the authors or editors of these tables had ever checked them through, let alone corrected the errors. Just by following this line of reasoning, they found errors in about 20 % of all the Suppl Tables they ran through their wringer. Do you have less confidence in the findings of the associated peer-reviewed papers if the data has such obvious feet-of-clay?

I found this paper through a link in MeFi pointing at a group calling itself The European Spreadsheet Risks Interest Group - EuSpRIG - (“yewsprig”) for short. They have compiled a long list of errors compounded by the casual use of Excel, which they call horror stories. Whatever about a few 'typos' in the supplementary tables [which clearly nobody has read] of some specialised genome analysis papers, the horror stories have consequence: loss of money, loss of credibility, loss of reputation and loss of jobs. It's the kind of trusting-the-software stupidity that had the HSE flagging patients as less than dead in 2017.

I was reading these horror stories with a mix of nerdiness, incredulity and ~~complacence~~ actually not me with the complacent. I never believed what I caused to be generated by writing software: I checked all the calculations by hand many times before I accepted the output. I did my best to push through some edgy data to see how robust my programs were. Just before leaving work and going home to surf the blogosphere looking for nerdnik copy, I had calculated the final marks and breakdown for the continuous assessment CA part of my human physiology course. As it is a small class and as I couldn't work out how to send this compendium through Blackboard, I emailed each student the relevant line from my summary Excel spreadsheet. It was a bit of a rush, so I didn't double-check the numbers - partly because the sky wasn't going to fall if I got things wrong. Indeed, part of the reason for sending out the marks was to solicit complaints and queries. I might have forgotten to credit a sick-note, for example. When I next checked my mail there was indeed a query/complaint that the individual marks from Oct, Dec, Jan & Feb were not consistent with the recently compiled executive summary. Ooops! There is a peculiar legacy issue with student names at The Institute. Some early kludge software had choked on apostrophes as Not Alphabetic so email aliases now come as O Reilly, O Donovan, O Beirne. But one of my students is not properly registered so I have her as O'Donovan. Excel sorts O Leary before O'Donovan and in a frenzy of cutting and pasting between Excel sheets I had managed to transpose O Leary's marks with O Donovan's. Only a Pink Face over this because it doesn't really matter but it is a lesson to be vigilant in future.

Science matters

Friday, 22 March 2019

Whoa Excel woes

No comments:

Post a Comment