Friday 14 August 2020

Lumpen Excel wins

 Shortcuts make things quicker . . . until you plough into a herd of sheep. In shortcuts I include a lot of hidden assumptions made on your behalf by your everyday software. Those word-completion boo-boos when sending the Boss a txt msg.  Those phone numbers that have their leading zeroes Excelised: 087-4822501 helpfully becoming 874,822,501 . . . because it looks like a number. Excel is designed for work in offices where a date-stamp is expected on many sorts of data. But dates are a precision nightmare, at least partly because the USA uses an illogical ordering convention: MMDDYYYY. The rest of us are typically 'little-endian' DDMMYYYY or 'big-endian' YYYYMMDD. Nightmare? 06AUG20 is unambiguous (if you spik ingles) but 06-08-20 would be two months ago in Baltimore. Blogspot/Blogger is an American company so the date-stamps on my posts are 'wrong' for me.
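To make the nightmare concrete, here is a minimal Python sketch (my illustration, nothing to do with how Excel or Blogger work inside) that reads the same six digits under the three conventions, and then shows what happens to that phone number once something decides it is a number. The variable names are invented for the example.

```python
from datetime import datetime

stamp = "06-08-20"

# Same string, three conventions, three different days.
us = datetime.strptime(stamp, "%m-%d-%y")      # MMDDYY: 8 June 2020
little = datetime.strptime(stamp, "%d-%m-%y")  # DDMMYY: 6 August 2020
big = datetime.strptime(stamp, "%y-%m-%d")     # YYMMDD: 20 August 2006

for label, parsed in [("US", us), ("little-endian", little), ("big-endian", big)]:
    print(label, parsed.date().isoformat())

# A phone number only survives if it stays text.
phone = "087-4822501"
as_number = int(phone.replace("-", ""))  # the leading zero is gone for good
print(f"{as_number:,}")                  # 874,822,501
```

Nothing in the string itself says which reading was intended; that knowledge lives only in the head of whoever typed it.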

The transition between notes in a paper form and data in a computer has been fraught with trouble because the coders didn't speak to the effectives enough to know what the right questions were. Clinical notes might use <50% as shorthand for less than half. But a coder's decision based on "< looks like an HTML tag and so is safer deleted in Excel" translates a rather woolly statement into a precise-but-wrong 50%.

A few years ago I wrote about a paper by Ziemann, Eren and El-Osta which found that gene-names were commonly being over-interpreted as dates [in ~20% of the papers they checked] when foolish biologists stored a list of them in Excel. There are 23,000 protein coding genes in the human genome and, unless you know where to look, you're not going to scan through them all to see whether any of them have been converted to dates. And even if you got that covered, would you remember your Portuguese collaborators and remember to scan for gene-names like Ago1 as well as Aug1?
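Nobody is going to eyeball 23,000 names, but a few lines of Python can do the scanning before the list ever goes near a spreadsheet. What follows is a rough sketch under loud assumptions: the month tokens and the "month plus a day number" pattern are my guess at what a spreadsheet might pounce on, not Excel's actual parsing rules, and `MONTHS` and `date_like` are names made up for the illustration.

```python
import re

# Assumption: a crude stand-in for "things a spreadsheet may read as a month".
# These lists are illustrative, not Excel's real locale tables.
MONTHS = {
    "en": ["jan", "feb", "mar", "march", "apr", "may", "jun", "jul",
           "aug", "sep", "sept", "oct", "nov", "dec"],
    "pt": ["jan", "fev", "mar", "abr", "mai", "jun", "jul",
           "ago", "set", "out", "nov", "dez"],
}


def date_like(names, locales=("en",)):
    """Return the names that look like <month><day> in any of the locales."""
    tokens = sorted({m for loc in locales for m in MONTHS[loc]},
                    key=len, reverse=True)
    pattern = re.compile(r"^(?:%s)-?(\d{1,2})$" % "|".join(tokens),
                         re.IGNORECASE)
    hits = []
    for name in names:
        match = pattern.match(name)
        # Only flag names whose number could plausibly be a day of the month.
        if match and 1 <= int(match.group(1)) <= 31:
            hits.append(name)
    return hits


genes = ["SEPT1", "MARCH1", "Ago1", "Aug1", "CARS1", "DOP1A"]
print(date_like(genes))                # ['SEPT1', 'MARCH1', 'Aug1']
print(date_like(genes, ("en", "pt")))  # ['SEPT1', 'MARCH1', 'Ago1', 'Aug1']
```

The second call is the Portuguese-collaborator problem in miniature: Ago1 only gets flagged if you thought to ask about that locale.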

At least for Human Genes, at least for name/date confusion, the HGNC [HUGO Gene Nomenclature Committee] have ceded the territory to Excel and renamed the ~30 genes most commonly mangled in this manner by Excel-the-All-Powerful [TheVerge]: SEPT1 becomes SEPTIN1 while MARCH1 becomes MARCHF1. The rules [not only about Excel-woes] are paywalled at Nature Genetics. The penultimate rule is to stop another sort of confusion: CARS could be vehicles but CARS1 could only be cysteinyl-tRNA synthetase 1.
And we're being encouraged / compelled to be less cruel in naming defects. DOPEY is out; DOP1A (DOP1 leucine zipper like protein A) is in. All those neurological deficits in Drosophila: dunce, cabbage, turnip, rutabaga; or at least their human homologs, will presumably be cleaned up in their turn. I am glad [tee-hee: I R still 13] that the annotation for E. coli's Fucose-K
/note="fucK ORF (AA 1-482)"
is still hangin' in there. Despite one of the words being offensive to my dead grannie.

Finally, the discussion [Mefi -- HackerNews] about the outrageousness of Excel led me to this too-clever-by-'arf Venn Diagram:
