Wednesday 14 February 2018

Measuring similarity

Sequence comparison and analysis: that's what I do. It's not my day job anymore; for the last five years I've worked in The Institute trying to make sense of science in a much more general sense with / for my students. But any credibility I have in the scientific community hangs upon my small-small contributions to revealing the pattern and process of evolution through the analysis of DNA and protein sequences. One of the key concepts is working out where genes, molecules, biochemical pathways and organisms came from . . . by comparing a bunch of related sequences.
If you can show that two sequences are more similar to each other than either is to a third one, then you have established a tree of relationships. In the simplest-possible-tree [L] A and C are closely related sister 'taxa' while B is only a cousin; and yes, B and D are sisters to each other also. Operational Taxonomic Units (OTUs), here A, B, C and D, could be individuals or species or their genes. These assignments of similarity and relatedness are based on calculating how similar the sequences are when they are aligned together. The gross differences are easy to tally up. Here is a fragment of the protein sequence for beta-haemoglobin from four mammals; two from Order Primates, two from Order Rodentia:
Mouse  KDFTPAAQAAFQKVVAGVAT
Rat    KEFTPCAQAAFQKVVAGVAS
Human  KEFTPPVQAAYQKVVAGVAN
Baboon KEFTPQVQAAYQKVVAGVAN
       *:*** .***:********.
Note that almost all the amino acids (AAs, the building blocks of all proteins, here represented by 20 different letters) are identical in all four species. You can check out the encoding here. The convention is that, when all the AAs at a given site are the same, then a * is put under the column. Next note that for the majority of the other columns, the two rodents have one variant and the two primates have another. In one place, however (the second column, outlined in red), rats look more like primates than their fellow rodents; but that's just a random blip. The easiest way of getting a final answer on who is related to whom is to tally up the number of identical AAs and divide by the total length of the sequence [here an arithmetically convenient N=20] to get a % identity and then report that in a matrix or table:
Species  Mus   Rat   Hum   Bab
Mus      100%  85%   75%   75%
Rat      85%   100%  80%   80%
Hum      75%   80%   100%  95%
Bab      75%   80%   95%   100%
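The tally behind that table can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used for the post: the function names (percent_identity, identity_matrix, conserved_columns) are made up for the example, and the sequences are the 20-column fragment copied from the alignment above. The marker-line helper only reproduces the * for fully identical columns; the : and . symbols in the original marker line (which, as far as I can tell, follow the usual Clustal-style convention for similar but non-identical residues) are not attempted here.

```python
# Alignment fragment copied from the post (beta-haemoglobin, 20 columns).
ALIGNMENT = {
    "Mus": "KDFTPAAQAAFQKVVAGVAT",
    "Rat": "KEFTPCAQAAFQKVVAGVAS",
    "Hum": "KEFTPPVQAAYQKVVAGVAN",
    "Bab": "KEFTPQVQAAYQKVVAGVAN",
}


def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns where both sequences carry the same AA."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def identity_matrix(alignment):
    """Map every (name, name) pair to its % identity."""
    names = list(alignment)
    return {
        (x, y): percent_identity(alignment[x], alignment[y])
        for x in names
        for y in names
    }


def conserved_columns(alignment):
    """Marker line with '*' under columns that are identical in all sequences."""
    return "".join(
        "*" if len(set(column)) == 1 else " "
        for column in zip(*alignment.values())
    )


if __name__ == "__main__":
    names = list(ALIGNMENT)
    matrix = identity_matrix(ALIGNMENT)
    # Print the matrix in the same row/column order as the table in the post.
    print("     " + "  ".join(f"{n:>4}" for n in names))
    for x in names:
        print(f"{x:<5}" + "  ".join(f"{matrix[(x, y)]:>3.0f}%" for y in names))
    print(conserved_columns(ALIGNMENT))
```

Run on the fragment above, the pairwise values agree with the table: three mismatches out of 20 between mouse and rat, one between human and baboon, and so on.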
This works out pretty well. If we choose a cut-off between 85% identical and 80% identical we can [correctly] sort the four species into 'same order' vs 'different order'. For big differences [mouse and baboon had a last common ancestor 100 million years ago (mya) while {rat and mouse} or {baboon and human} are "only" separated by 30 million years] identity vs difference works nicely. For relationships that are closer - ¿are chimpanzees Pan troglodytes or gorillas Gorilla gorilla our nearest relative? - it might be useful to have some gradation of difference rather than the stark black & white; 1 vs 0; same/different. One way to do that is to consult the DNA which encodes the protein: because of the redundancy of the genetic code, DNA is intrinsically more variable than proteins. In the first column of the alignment everyone has a K = Lys = Lysine. But it might turn out that the rodents encode lysine with the codon AAA, while the primates use AAG, in which case the column would lose its identity * and give us some distinguishing information. I'll tell you more about mechanisms for calculating similarity later.
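To make the lysine example concrete, here is a small Python sketch of that "it might turn out" scenario. The codon assignments per species are HYPOTHETICAL, invented to match the what-if above rather than taken from real gene sequences, and the helper names (translate, nucleotide_identity, group_by_codon) are made up for the illustration. The only hard fact relied on is that AAA and AAG both encode lysine in the standard genetic code.

```python
# Partial codon table: just the two lysine codons mentioned in the post.
# In the standard genetic code both AAA and AAG encode lysine (K).
CODON_TO_AA = {"AAA": "K", "AAG": "K"}

# HYPOTHETICAL codon usage for the first alignment column, following the
# post's what-if scenario. This is not real sequence data.
FIRST_CODON = {"Mus": "AAA", "Rat": "AAA", "Hum": "AAG", "Bab": "AAG"}


def translate(codon):
    """Translate one codon using the partial table above."""
    return CODON_TO_AA[codon]


def nucleotide_identity(codon_a, codon_b):
    """Fraction of the three codon positions that match."""
    matches = sum(a == b for a, b in zip(codon_a, codon_b))
    return matches / len(codon_a)


def group_by_codon(codon_usage):
    """Group species names by the codon they use at this site."""
    groups = {}
    for species, codon in codon_usage.items():
        groups.setdefault(codon, []).append(species)
    return groups


if __name__ == "__main__":
    # At the protein level the column is uniform: everyone has K.
    print({sp: translate(c) for sp, c in FIRST_CODON.items()})
    # At the DNA level the same column splits rodents from primates.
    print(group_by_codon(FIRST_CODON))
    print(nucleotide_identity("AAA", "AAG"))
```

Under these assumed codons the protein column stays identical for all four species, while the DNA column differs at the third codon position and separates the two rodents from the two primates.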

