If the statistical analysis is no more than 10 or 15 minutes of making mystical passes with Excel, then you'd be more inclined to have a punt at the experimental design, fully prepared to do it again - properly - if the data turns out to be impossible to analyse. In my teens, I did quite a few statistical computations 'by hand' - working through columns of data calculating the square of each value, the sum of the squared values, the square of the sum of the values etc. All without a calculator, because they hadn't been invented in 1971. If you made an error it would propagate through the table and become a huge distorting wen on the face of the analysis. I must be the youngest person on the planet to have used Charlier's Checks for internal consistency in a mighty table set up to calculate the variance of a dataset. Charlier's Checks required additional bookkeeping work, true, but they also gave you solid confidence in the answer. Now you just lash the numbers into Excel and demand Descriptive Statistics to obtain not only the Variance, but the coefficients of kurtosis and skewness as well. Ten years later, when calculators were cheap and widely available, I helped construct one of the last hand-cranked commercial spreadsheets before VisiCalc and Lotus 1-2-3, the progenitors of Excel, swept such activity into the dustbin of history.
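For the curious: Charlier's check leans on the identity that the sum of (x+1)² over the data must equal the sum of x² plus twice the sum of x plus N, so the extra column of (x+1)² values lets you verify your column totals before you trust the variance built from them. Here is a minimal sketch of the idea in Python rather than on squared paper - the function name and the toy data are my own, not anything salvaged from 1971:

    # Charlier's check: sum((x+1)^2) == sum(x^2) + 2*sum(x) + N.
    # If the identity fails, one of the hand-totted column sums is wrong.
    def variance_with_charlier_check(data):
        n = len(data)
        sum_x = sum(data)                              # column of x
        sum_x2 = sum(x * x for x in data)              # column of x^2
        sum_xp1_sq = sum((x + 1) ** 2 for x in data)   # the extra bookkeeping column

        assert sum_xp1_sq == sum_x2 + 2 * sum_x + n, "Charlier says: re-add your columns"

        # Population variance from the verified sums.
        return (sum_x2 - sum_x ** 2 / n) / n

    print(variance_with_charlier_check([2, 4, 4, 4, 5, 5, 7, 9]))   # 4.0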
Excel is designed for businesses rather than science and seems unduly driven by time and date. If you are a teeny bit careless, the number 12 will become 12th Jan 1900, while 19,892 will be treated as my birthday, 17th June 1954. Excel has a number of such helpful features which, if you're not careful, will leave you with a red face if not in tears. There was a paper in Genome Biology a couple of years ago by Ziemann, Eren and El-Osta which exposed a widespread casualness about data-in-Excel. Molecular biology and genome analysis are now capable of generating terabytes of data for half-nothing in money and time. Scientific papers would be unreadable if all the data were put in tables; if not unreadable, then at least many, many pages long. The rules of engagement dictate that the raw data has to be made available, so that rivals, reviewers or referees can critically evaluate it. The data is, accordingly, installed in Supplementary Tables somewhere on The Cloud. Z.E.E. wrote a bot which trawled through the Excel tables associated with a number of reputable Journals and looked for damn-foolishness, sloppy copy-editing, and culpable negligence. They found a great many errors: the gene Septin2, aka Sept2, had been helpfully translated as 02-Sep, and March1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] became 03-10-2006, because Excel decided these names looked like dates. More to the point, none of the authors or editors of these tables had ever checked them through, let alone corrected the errors. Following this line of reasoning, they found errors in about 20% of all the Suppl. Tables they ran through their wringer. Do you have less confidence in the findings of the associated peer-reviewed papers if the data has such obvious feet of clay?
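For anyone who hasn't been bitten: Excel's 1900 date system just counts days from the start of 1900 (and, for compatibility with Lotus 1-2-3, pretends 29th Feb 1900 existed), which is how a bare 12 or 19,892 gets dressed up as a date, and why gene symbols that begin with a month name get the same treatment. A small Python sketch of both habits - the function name and the crude gene-symbol screen are my own inventions, not Ziemann & Co.'s actual bot:

    from datetime import date, timedelta

    def excel_serial_to_date(serial):
        """Convert an Excel 1900-system serial number to a calendar date.
        Serial 1 is 1 Jan 1900, but Excel also counts a phantom 29 Feb 1900,
        so serials of 60 and above sit one day off the naive count."""
        epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
        return epoch + timedelta(days=serial)

    print(excel_serial_to_date(12))       # 1900-01-12
    print(excel_serial_to_date(19892))    # 1954-06-17

    # Gene symbols made of a month name (or abbreviation) followed by a number
    # are the ones Excel will quietly reinterpret as dates on import.
    RISKY_PREFIXES = ("JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY",
                      "JUN", "JUNE", "JUL", "JULY", "AUG", "SEP", "SEPT",
                      "OCT", "NOV", "DEC")

    def at_risk(symbol):
        s = symbol.upper()
        return any(s.startswith(p) and s[len(p):].isdigit()
                   for p in RISKY_PREFIXES)

    for symbol in ("SEPT2", "MARCH1", "DEC1", "TP53"):
        print(symbol, "at risk" if at_risk(symbol) else "safe")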
I found this paper through a link in MeFi pointing at a group calling itself The European Spreadsheet Risks Interest Group - EuSpRIG (“yewsprig”) for short. They have compiled a long list of errors compounded by the casual use of Excel, which they call horror stories. Whatever about a few 'typos' in the supplementary tables [which clearly nobody has read] of some specialised genome-analysis papers, the horror stories have consequences: loss of money, loss of credibility, loss of reputation and loss of jobs. It's the kind of trusting-the-software stupidity that had the HSE flagging patients as less than dead in 2017.
I was reading these horror stories with a mix of nerdiness, incredulity and