Tuesday 20 February 2018

One letter code

Margaret Oakley [later Margaret O. Dayhoff through marriage] was born in Philadelphia in 1925. You cannot overestimate her importance to the development of the tools for making sense of biological sequences. For Dayhoff, the same claim can be made as for Dennis "C and Unix" Ritchie: without them it would all be different. Grace Hopper, whose work led to COBOL, was another woman in the right place, at the right time, with the right mindset and toolkit and she has a pretty high profile. Margaret Dayhoff otoh really doesn't get the same press but her contributions have had more impact; not least because she kicked off the area of bioinformatics and molecular sequence analysis which has supported me for almost all my working life. Developing a whole new field is chaotic - in the sense that it is sensitive to initial conditions.

I've riffed before on Pointless - the TV quiz game where success is when you can give a correct answer which nobody else has picked. If the question is "Name a female scientist who contributed to biomedical science in the late 20thC" then Margaret Dayhoff will be a winning Pointless answer. The answer to "Which pair of scientists made the first contribution to cracking the genetic code?" is not "Crick and Watson" - they 'just' gave us the physical structure of DNA. It is rather Nirenberg and Matthaei who in 1961 determined that UUU codes for Phenylalanine. That was the first codon assignment. The rest tumbled into place over the next 4 years, revealing that 20 amino acids are the basic inventory from which all proteins - all the enzymes, all the receptors, actin & myosin, haemoglobin, oxytocin, insulin - are constructed. The trouble is that the 20 amino acids were known and named years before the genetic code was AThing. The smallest, glycine, is from γλυκύς glykys because it tastes sweet. I'm not sure about the connexion with soya Glycine max. Serine was first isolated from silk, sericum in Latin, etc.

Dayhoff's first qualification was in mathematics which she subsequently started to apply to physical chemistry including the nature of chemical bonds. From there she moved into the structure of proteins and applied her mathematical and computing toolkit to the storage, retrieval and analysis of protein sequences - of which an increasing number were coming on stream. In 1960, she was appointed associate director of the National Biomedical Research Foundation in Maryland. Back then, protein sequencing was running in parallel with, and quite a way ahead of, DNA/RNA sequencing. The first substantive piece of RNA sequencing saw RW Holley take a whole year, 1965, to work out the 80ish bases of Alanine tRNA. That would now be knocked off in a μ-second. aNNyway, Dayhoff saw that the inventory of protein sequences was growing exponentially and, albeit from a small baseline, was going to get massive. Writing down each sequence on paper wasn't going to be the answer. Accordingly, she started to record sequences on punched cards [prev] and quickly grew dissatisfied with the convention that each amino acid was represented by a three-letter abbreviation based on its first three letters in English: Phe, Gly, Ser have been mentioned above. Dayhoff realised that with only 20 AAs in the inventory, each could be uniquely identified with one of the 26 letters in the Latin alphabet.

But whoops, here are those 20 amino acids: alanine - arginine - asparagine - aspartic acid - cysteine - glutamine - glutamic acid - glycine - histidine - isoleucine - leucine - lysine - methionine - phenylalanine - proline - serine - threonine - tryptophan - tyrosine - valine - and the first thing you note is that 20% of them begin with A! So her first pass was to assign the easy [unique initial] ones:
  • C H I M S V
  • it was also easy to assign F to phenylalanine at this stage, which freed up P for proline
  • 8/20 done
the next decision was to give priority to the first in alphabetical order:
  • A = alanine; [G = glutamine]; L = leucine; T = threonine
  • that allowed K for lysine as the next unassigned letter in the alphabet.
  • 13/20 done
hmm, she thought, there are two cluttering overlaps because of the acid side-chains aspartate and glutamate and their amides asparagine and glutamine so:
  • let's reverse a bit to give G = glycine then
  • D = aspartate, then E = glutamate to fill in the early hole between C = Cys and F = Phe
  • N = asparagiNe and Q = glutamine [G looks a bit like Q] fills a similar later hole.
  • note that D precedes E because Aspartate precedes Glutamate
  • (18-1)/20 done
The rest are assigned by their second letter
  • R = aRginine; Y = tYrosine
  • and W the biggest letter is given to the largest amino acid tryptophan
  • and that's it!
  • 20/20 for Margaret Dayhoff
That now universally agreed convention was determined by the contingency that Dayhoff spoke English at home. If she'd been born in Tampere, and followed the same algorithm, then K would be assigned to cysteine [alaniini - arginiini - asparagiini - asparagiinihappo - kysteiini - glutamiini - glutamiinihappo - glysiini - histidiini - isoleusiini - leusiini - lysiini - metioniini - fenyylialaniini - proliini - seriini - treoniini - tryptofaani - tyrosiini - valiini] and all bets would have been off if she'd come from Kiev [аланін - аргінін - аспарагін - аспарагінова кислота - цистеїн - глутамін - глутамінова кислота - гліцин - гістидин - ізолейцин - лейцин - лізин - метіонін - фенілаланін - пролін - серин - треонін - триптофан - тирозин - валін].
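The first pass is easy to replay in a few lines of Python. This is only a sketch of the "unique initial" step as described above, not a reconstruction of whatever Dayhoff actually did at her desk; the function name `unique_initials` and the two lists are my own scaffolding, with the names copied from the lists in this post.

```python
from collections import Counter

ENGLISH = [
    "alanine", "arginine", "asparagine", "aspartic acid", "cysteine",
    "glutamine", "glutamic acid", "glycine", "histidine", "isoleucine",
    "leucine", "lysine", "methionine", "phenylalanine", "proline",
    "serine", "threonine", "tryptophan", "tyrosine", "valine",
]

FINNISH = [
    "alaniini", "arginiini", "asparagiini", "asparagiinihappo", "kysteiini",
    "glutamiini", "glutamiinihappo", "glysiini", "histidiini", "isoleusiini",
    "leusiini", "lysiini", "metioniini", "fenyylialaniini", "proliini",
    "seriini", "treoniini", "tryptofaani", "tyrosiini", "valiini",
]


def unique_initials(names):
    """Return {letter: name} for every name whose initial no other name shares."""
    counts = Counter(name[0].upper() for name in names)
    return {
        name[0].upper(): name
        for name in names
        if counts[name[0].upper()] == 1
    }


print(sorted(unique_initials(ENGLISH)))   # ['C', 'H', 'I', 'M', 'S', 'V']
print(unique_initials(FINNISH)["K"])      # kysteiini
```

In English the first pass yields exactly C H I M S V; in Finnish, F and P are unambiguous from the start and K goes to kysteiini, so the later steps would have played out quite differently.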

Life has gotten more complex since those idyllic simple early days: we've discovered selenocysteine Sec U and pyrrolysine Pyl O. We finally give B to aspar* and Z to glutam* as ambiguity codes because a lot of the chemical protein sequencing protocols render the acids indistinguishable from their amides. Phew! With U and O we have a full set of vowels to play with.
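For anyone who wants the whole table in one place, here it is as a Python lookup with a tiny converter. It's a sketch: the names `THREE_TO_ONE` and `to_one_letter` are mine, and I'm assuming the usual three-letter abbreviations (including Asx and Glx for the two ambiguity codes) written with hyphens between residues.

```python
# Three-letter abbreviation -> one-letter code.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    # later additions
    "Sec": "U",  # selenocysteine
    "Pyl": "O",  # pyrrolysine
    "Asx": "B",  # aspartate or asparagine
    "Glx": "Z",  # glutamate or glutamine
}


def to_one_letter(peptide):
    """Convert hyphen-separated three-letter codes, e.g. 'Phe-Gly-Ser', to one-letter code.

    Raises KeyError for an abbreviation that is not in the table.
    """
    return "".join(THREE_TO_ONE[residue] for residue in peptide.split("-"))


print(to_one_letter("Phe-Gly-Ser"))  # FGS
```

One character per residue instead of three (plus separators) is the whole point: a punched card holds about three times as much protein.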

Now the alphabet is almost full [only J and X unassigned] and we can use protein sequences to write names as a kind of geek-code. If you want to out-geek the geeks you can write your name as a peptide using Peptify, a toy developed by Nuritas to stop their employees playing solitaire on their lunch-breaks. Nuritas is the spin-off of Nora Khaldi [bloboprev], an entrepreneurial woman in science. Here's PeptoBob me:
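If you'd rather roll your own, the trick is just the one-letter table read backwards. What follows is a toy sketch of the idea and has nothing to do with how Nuritas built Peptify; `ONE_TO_THREE` and `name_to_peptide` are names I made up, and I'm treating J and X as having no residue, as above.

```python
# One-letter code -> three-letter abbreviation (J and X deliberately absent).
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "O": "Pyl", "P": "Pro",
    "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr", "U": "Sec",
    "V": "Val", "W": "Trp", "Y": "Tyr", "Z": "Glx",
}


def name_to_peptide(name):
    """Spell a name as hyphen-separated three-letter residues.

    Non-letters (spaces, hyphens, apostrophes) are skipped; a letter with
    no residue in the table raises ValueError.
    """
    residues = []
    for letter in name.upper():
        if not letter.isalpha():
            continue
        if letter not in ONE_TO_THREE:
            raise ValueError(f"no amino acid for letter {letter!r}")
        residues.append(ONE_TO_THREE[letter])
    return "-".join(residues)


print(name_to_peptide("Bob"))       # Asx-Pyl-Asx
print(name_to_peptide("Margaret"))  # Met-Ala-Arg-Gly-Ala-Arg-Glu-Thr
```

Anyone called Jax is out of luck.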


  1. I am Margaret Dayhoff's son-in-law. Perhaps you would like to know that her daughter (my wife) Ruth Dayhoff M.D. is a widely recognized pioneer in medical computing and her granddaughter Margaret Dayhoff Brannigan PhD is a molecular biologist with FDA.

  2. We can be reached at Firelaw@firelaw.us. Vincent Brannigan