Monday 21 October 2019

Similar proteins

Proteins are made of amino acids stuck together in long chains, which come curling out of ribosomes in an order dictated ultimately by the DNA in the genes on the chromosomes in the nucleus of every cell that makes new proteins, which is every cell that is still alive. I've spent the last 30 years trying to make sense of those sequences, usually by comparing them in an alignment. Here's part of the sequence of insulin for pigs and humans:
                     *.* ******   *.** *****  ************************
[Each of the 20 amino acids found in proteins has a unique 1-letter code, and a more readable 3-letter code too.]
{For much of the 20th century, diabetics were injected with insulin extracted from thousands of pig pancreases, and it worked just fine.}

You can see that the sequences, and therefore by inference their 3-D structures, are very similar but with several significant differences: the human version has two extra amino acids, GA, for starters. If the final functioning protein is a 3-dimensional object, then the linear sequence, as it emerges from the ribosome or as represented on paper and on screen as above, can be thought of as a 1-D object. In between, biochemists talk about the secondary structure, which is made up of the intermediate building blocks or structural elements: α-helices; β-sheets; turns; ω-loops.

How similar do protein sequences have to be for us to believe that a) they have a common ancestor and b) they therefore have a similar structure and function? If you line up any pair of sequences, some of the letters will match because there are only 20 amino acids to play with. Like when I align the opening words of the first two paragraphs above:
Proteins are made of aminoacids
You can see that the sequences
3/30 = 10% of the letters appear in the same position in these random, unrelated sequences. Molecular evolutionists talk about the twilight zone, where ~15-20% of the amino acids are identical between two sequences. With more than 20% identity you can be reasonably confident in assigning similar function and inferring the same 3-dimensional structure. In the twilight zone you really need some independent information to make that call. You might naively think that the random probability of a match is 5% = 1/20 because there are 20 amino acids to play with. But because you're allowed to insert gaps in one sequence [to accommodate the extra GA in the insulin above, for example] and because some amino acids, e.g. L, A and G, are more common than others, e.g. H, W and C, the biologically meaningful / statistically significant cut-off (and with it the Twilight Zone) is pitched higher than 5%, at just under 20%. To be continued . . . when gene location becomes the added-value independent information for matching barely detectable similarities between genes in two different species.
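The letter-matching arithmetic above is easy to check by machine. What follows is only a minimal Python sketch, not anything I used at the time: the function names count_identities and chance_match_probability are invented for this illustration, spaces are not counted as matching letters, and the "skewed" frequencies are made-up round numbers rather than measured amino acid frequencies. It counts the matches between the two opening phrases without gaps, then works out the chance of a match at a single position for an even and an uneven alphabet.

```python
def count_identities(a: str, b: str) -> tuple[int, int]:
    """Return (matching letters, positions compared) for two strings
    lined up from the left with no gaps. Spaces never count as a match."""
    compared = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y and x.isalpha())
    return matches, compared


def chance_match_probability(freqs: dict[str, float]) -> float:
    """Probability that two letters drawn independently from the same
    frequency table are identical: the sum of the squared frequencies."""
    return sum(p * p for p in freqs.values())


first = "Proteins are made of aminoacids"
second = "You can see that the sequences"
matches, compared = count_identities(first, second)
print(f"{matches}/{compared} = {matches / compared:.0%}")  # prints "3/30 = 10%"

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20 amino acids equally common: the naive 1/20 case.
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}

# ASSUMPTION: illustrative frequencies only, not real data. Four "common"
# amino acids at 10% each, the other sixteen sharing the remaining 60%.
skewed = {aa: (0.10 if aa in "LAGS" else 0.0375) for aa in AMINO_ACIDS}

print(chance_match_probability(uniform))  # about 0.05
print(chance_match_probability(skewed))   # higher than the uniform case
```

Any uneven set of frequencies gives a per-position chance above 1/20, but in this toy case only modestly so. The sketch does not model the freedom to insert gaps at all, so it should not be read as a derivation of the ~20% cut-off.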

Years ago, in the late 90s, I wrote some code to deconstruct protein sequences as if they had been hydrolysed by 1M NaOH [that's caustic soda to cooks] into a soup of their component amino acids. This might be thought of as a protein's 0-dimensional structure. My program tallied the frequency of each of the 20 amino acids. Tricking about with a test dataset, I noticed that this string of 20 numbers had quite high predictive value: the AA frequencies of Bacillus subtilis recA could predictably fish out the recA in Escherichia coli etc. I got sufficiently excited about this that I presented it as my contribution to the weekly internal lunchtime research seminar. Ken Wolfe, not yet my boss, but usually the smartest chap in the room, said he'd noticed this several years earlier but hadn't thought it worth publishing. My abiding problem as a scientist is that I lack the stamina to get results, even interesting results, down on paper and through the publication process. I am thus glad to be able to publish this executive summary in The Journal of Blob Studies - they will take anything.
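For anyone who wants to play along, here is a rough Python sketch of the 0-dimensional idea. It is a reconstruction under assumptions, not the original late-90s program: the names composition, distance and closest are invented here, Euclidean distance is just one plausible way to compare two strings of 20 numbers, and the sequences in the toy database are made up rather than real recA.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard 1-letter codes


def composition(seq: str) -> list[float]:
    """Fraction of each of the 20 amino acids in seq, in AMINO_ACIDS order.
    Characters outside the 20 standard codes are ignored."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def distance(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two composition vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest(query: str, database: dict[str, str]) -> str:
    """Name of the database entry whose composition is nearest the query's."""
    q = composition(query)
    return min(database, key=lambda name: distance(q, composition(database[name])))


# HYPOTHETICAL toy data: invented strings, not real protein sequences.
query = "AAGGLLKKEE"
database = {
    "shuffled_query": "EKLGAEKLGA",  # same letters as the query, different order
    "unrelated": "WWCCHHMMFF",       # shares no letters with the query
}

print(closest(query, database))  # prints "shuffled_query"
```

The shuffled entry sits at distance zero from the query, which is the point of calling this a 0-dimensional structure: all the information about order has gone into the soup and only the tally survives.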
  Part II.
