Wednesday, 14 May 2014

It's in code

Whoop whoop - nerd alert.
When I wrote about the Great Western Binfo meeting I attended last November, I prefaced the piece with a too-clever-by-'arf
thinking it would serve as un p'tit amuse-cerveau for binfoes before the meat of the article. I suspect that it served, to the nearest whole number, solely as un p'tit amuse-Bob.  Last week I had the honour to host the 2014 version of the meeting and I put
in the header of the programme by way of continuing the tradition.  It came up in conversation during the afternoon coffee break when I was chatting with two physicist-turned-binfoes.  Somebody who was earwigging turned round and said "Oh I assumed that was a typo".  Well, really!  I know that The Blob is sprinkled with errurs of spelinge and I often blush when I re-read my e-mails, but if I write something that sounds like one of two trolls 'wrestling' then I do so deliberately.  Because the p-t-bs are both fightin' sharp and quick on the uptake, it didn't take them long to crack the code.  That led nicely into a discussion of the IUPAC codes for DNA/RNA bases and the amino acids that go to make proteins.

Cypherists are quite frustrated by the fact that, while there are 26 letters in the English alphabet, there are only 20 amino acids that commonly appear in protein sequences:
A Ala   C Cys   D Asp   E Glu   F Phe   G Gly   H His   I Ile   K Lys   L Leu
M Met   N Asn   P Pro   Q Gln   R Arg   S Ser   T Thr   V Val   W Trp   Y Tyr
BJOUXZ are all missing, which is a drag because the missing letters include two of the five vowels. So you can find ELVIS in the protein database but you can't find BOB. The AAs are translated from RNA read in triplets, so the 12 letters at the top could be a genetic code to represent a four-letter word. Except that it has letters in addition to the regular ATCG (DNA) or AUCG (RNA).  What am dem and why dey needum?  They are usually referred to as ambiguity codes.  The "four" bases are of two chemical types: puRines (lArGe) and pYrimidines (CUTe or small).  Sometimes you don't know, or cannot specify, which purine is present, so you write R; likewise Y for an unspecified pyrimidine.  The bases also pair with each other using either 3 hydrogen bonds (C≡G) or 2 (A=T).  The former is stronger than the latter, so C or G is denoted S (strong) and A or T is denoted W (weak). N, on the other hand, means any base.  You rarely need the others, but each of the finite number of ambiguous possibilities has its own one-letter IUPAC representation. For example, isoleucine (Ile, I) is translated from any codon that starts with AU and doesn't end in G; IUPAC-speak for that third position is H (not G), so the codon is written AUH.
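For the coders, the nucleotide ambiguity codes fit in a small lookup table. What follows is only a minimal Python sketch: the names `IUPAC_BASES`, `to_dna`, `matches` and `expand` are made up for illustration and don't come from any library, and RNA's U is simply treated as DNA's T.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the DNA bases each one allows.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT",    # puRine, pYrimidine
    "S": "CG", "W": "AT",    # Strong, Weak pairing
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",  # "not A", "not C", ...
    "N": "ACGT",             # aNy base
}


def to_dna(seq):
    """Upper-case a sequence and write RNA's U as DNA's T."""
    return seq.upper().replace("U", "T")


def matches(pattern, seq):
    """True if every base of seq is allowed by the IUPAC code at that position."""
    pattern, seq = to_dna(pattern), to_dna(seq)
    return len(pattern) == len(seq) and all(
        base in IUPAC_BASES[code] for code, base in zip(pattern, seq)
    )


def expand(pattern):
    """List every unambiguous sequence that an IUPAC pattern stands for."""
    choices = [IUPAC_BASES[code] for code in to_dna(pattern)]
    return ["".join(bases) for bases in product(*choices)]


print(expand("AUH"))          # ['ATA', 'ATC', 'ATT'], the isoleucine codons as DNA
print(matches("AUH", "AUG"))  # False: AUG is methionine, not isoleucine
```
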
Amino acids also have their ambiguity codes, and for a good chemical reason. When you wish to characterise a protein, one technique is to break it down into its component amino acids and count their relative abundance - this can be surprisingly informative about which protein you have.  But when you hydrolyse the peptide bond between each pair of AAs, you also hydrolyse the chemically very similar amide bond that differentiates aspartic acid (Asp) from asparagine (Asn) and glutamic acid (Glu) from glutamine (Gln). A complete hydrolysis with 6N HCl thus results in only 18 different components, and the ambiguity is written thus: Asp or Asn = B and Glu or Gln = Z.  Just as N represents any base in DNA, so X represents any AA in protein.
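The arithmetic of 20 amino acids collapsing to 18 components can be checked in a few lines. This is a toy sketch, assuming every residue survives hydrolysis apart from the two amide pairs merging; `HYDROLYSIS_CODE` and `hydrolysate_composition` are hypothetical names chosen for illustration.

```python
from collections import Counter

# After acid hydrolysis Asn is read as Asp and Gln as Glu, so each pair
# is reported under its ambiguity code: B (Asp or Asn), Z (Glu or Gln).
HYDROLYSIS_CODE = {"D": "B", "N": "B", "E": "Z", "Q": "Z"}


def hydrolysate_composition(protein):
    """Count residues the way a complete acid hydrolysate would report them."""
    return Counter(HYDROLYSIS_CODE.get(aa, aa) for aa in protein.upper())


# One of each of the 20 common amino acids goes in ...
composition = hydrolysate_composition("ACDEFGHIKLMNPQRSTVWY")
print(len(composition))                    # 18 distinguishable components
print(composition["B"], composition["Z"])  # 2 2
```
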

That's handy for geeky coders because it brings two more letters into the possible alphabet for writing English words in codons. In December I was well impressed by a couple of the Masters of Immunology knowing that there was a 21st amino acid, selenocysteine, which is incorporated into proteins in special circumstances in particular species by subverting UGA, one of the stop codons. Selenocysteine is interesting because it looks exactly like cysteine except that a selenium atom replaces cysteine's sulphur. And that's interesting because oxygen, sulphur and selenium are all in the same column of the periodic table of elements - i.e. they have similar chemical properties. A couple of days later one of the MScs e-mailed me to say "and let's not forget pyrrolysine".  That's another oddity, discovered in 2002 in a methyltransferase gene of Methanosarcina barkeri.  It's since been found in methyltransferases from other species, where it forms an essential part of the active site.  Like selenocysteine, it is coded by a subverted stop codon, in pyrrolysine's case UAG.  This is handy because IUPAC have elected to represent selenocysteine as Sec or U and pyrrolysine as Pyl or O, so we can write pretty much anything we want in codon-speak because the only uncodable letters are the minority-interest J and X.  But even those are sorted, because X (any AA) could be represented by the codon NNN and you could use I for J like the Romans.
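Putting that whole alphabet together, here is a hedged Python sketch of writing English in codon-speak. The table `CODON_FOR_LETTER` and the function `encode` are hypothetical names, and picking one codon per letter is an arbitrary choice made for illustration: most amino acids have several synonymous codons, and B and Z need ambiguity codes (RAY covers Asp/Asn, SAR covers Glu/Gln).

```python
# One DNA codon per letter; synonymous alternatives would work just as well.
CODON_FOR_LETTER = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTT",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
    "B": "RAY",  # Asp (GAY) or Asn (AAY)
    "Z": "SAR",  # Glu (GAR) or Gln (CAR)
    "U": "TGA",  # selenocysteine, the subverted stop codon UGA
    "O": "TAG",  # pyrrolysine, the subverted stop codon UAG
    "X": "NNN",  # any amino acid
}


def encode(message):
    """Spell a message in DNA codons, writing J as I like the Romans."""
    codons = []
    for letter in message.upper():
        letter = "I" if letter == "J" else letter
        if letter in CODON_FOR_LETTER:  # spaces and punctuation are dropped
            codons.append(CODON_FOR_LETTER[letter])
    return "".join(codons)


print(encode("ELVIS"))  # GAACTTGTTATTTCT
print(encode("BOB"))    # RAYTAGRAY
```
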

EMBOSS backtranseq could help you if you want to send a secret message to your geeky mol.bol girlfriend like

and ExPASy will help her translate the DNA back into English.  I hope you know what to do when you do get together - and no, there are more exciting things to do with her than "pull an all-nighter playing Dungeons & Dragons".
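If she'd rather decode by hand, the translation step can be sketched in Python. This is not how ExPASy works internally, just the standard genetic code laid out in the usual TCAG order, with two assumptions borrowed from this post: the stop codons TGA and TAG are read as U and O, and the ambiguous codons RAY, SAR and NNN are read as B, Z and X. The names `CODON_TABLE` and `decode` are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (* = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
# Assumptions for codon-speak: subverted stops and three ambiguous codons.
CODON_TABLE.update({"TGA": "U", "TAG": "O", "RAY": "B", "SAR": "Z", "NNN": "X"})


def decode(dna):
    """Translate DNA (or RNA) in frame 1; unknown codons become '?'."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)


print(decode("GAACTTGTTATTTCT"))  # ELVIS
print(decode("RAYTAGRAY"))        # BOB
```
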

1 comment:

  1. Yes siree Bob - there is a way to preserve the code and use all letters....................BJOUZX BZEN BAMMMMMMMMMM!!!!!