Here [L, the chap with the longest hair] she is holding a hank of paper tape to show that she could handle data, and lots of it. Back in those days (1960s) input and output were largely through tree-based media: paper tape or Hollerith cards with holes punched out of them. Having dragooned the amino acids of the growing population of protein sequences into strings of letters, Dayhoff and others started to compare the strings to try to figure out a) where they came from and b) how they worked. Bizarrely, the first 3-D structures of proteins (which we think of now as being intrinsically more difficult to produce than DNA sequences) were worked out from 1958 [Kendrew's myoglobin, then Perutz's haemoglobin], before working out the 1-D sequences really got going. Except in peculiar circumstances, nobody sequences proteins directly anymore: protein sequence is inferred from the DNA codons.
A key issue with devising a single 'distance' between two aligned sequences was what to do about the mismatches. Everyone agreed that same vs different was too simplistic a model to have much utility for 'difficult' cases. I looked at two better models to cope with wrong-wrong-almost-right. You can count the number of changes in the underlying DNA and score them from easy-to-change to hard-to-change. OR you can look at the intrinsic physico-chemical properties of amino acids (size, charge, hydrophobicity) and mix them up into a theoretical similarity score. Glycine, with no side chain, is well different from phenylalanine, which has a gurt big lumpy hydrophobic side-chain; lysine is positively charged and glutamate is negatively charged, but they are both charged and they have exactly (+/- 1) the same molecular weight.
Margaret Dayhoff adopted a much more pragmatic wysiwyg approach. She gathered all the sequences in the nascent protein database and aligned them in pairs. For any pair of sequences that were at least 85% identical, she tallied up a) the number of places where the amino acid remained the same b) the nature and number of the differences.
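The tallying step can be sketched in a few lines of Python. This is only an illustration of the description above, not Dayhoff's actual procedure or code: the function names, the `-` gap character and the toy sequences are all made up for the example, and it assumes the pairs have already been aligned to equal length.

```python
from collections import Counter

def percent_identity(a, b):
    """Fraction of ungapped aligned columns where the two residues match."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    same = sum(1 for x, y in columns if x == y)
    return same / len(columns)

def tally_pairs(aligned_pairs, threshold=0.85):
    """Count residue pairs column by column, using only close relatives.

    Pairs below the identity threshold are skipped. Counts are stored
    under a sorted key, so D-vs-E and E-vs-D land in the same cell.
    """
    counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned sequences must have equal length")
        if percent_identity(a, b) < threshold:
            continue
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                continue  # gaps carry no substitution information here
            counts[tuple(sorted((x, y)))] += 1
    return counts

# Invented toy data: the first pair differs at one position (D vs E),
# the second pair is only 50% identical and should be ignored.
toy_pairs = [
    ("MKDELLAAGHMKDELLAAGH", "MKEELLAAGHMKDELLAAGH"),
    ("AAAAAAAAAA", "AAAAACCCCC"),
]
toy_counts = tally_pairs(toy_pairs)
```

The diagonal cells of the eventual matrix come from keys like `("M", "M")` and the off-diagonal cells from keys like `("D", "E")`.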
These were gathered into a big 20 x 20 matrix, and after a bit of scaling this was published as a Point Accepted Mutation (PAM) matrix. From the initial PAM 1 [1% different] matrix a series of PAM 30 . . . PAM 120 . . . PAM 250 matrices were extrapolated to serve as models for more distantly [than 85% ID] related sequences. The top left corner of the PAM30 matrix appears [R]. It was very much a heuristic [good enough] solution to the problem, and that fitted really well with evolutionary biologists, who observe nature's good enough solutions to survival on a daily basis.
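As I understand it, the extrapolation is done by treating PAM 1 as a table of mutation probabilities and multiplying it by itself: PAM 250 is PAM 1 raised to the 250th power. Here is a minimal sketch of that idea on an invented 3-letter alphabet; the numbers in `TOY_PAM1` are made up for illustration and are not Dayhoff's values, and the helper names are my own.

```python
def mat_mul(a, b):
    """Multiply two square matrices held as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Invented toy "PAM 1": each row sums to 1 and about 1% of residues change
# per step. Letters 0 and 1 behave alike (a conservative pair), so they
# swap with each other more readily than with letter 2.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
```

After 250 steps the diagonal has dropped well below its starting value, but the conservative swap (cell `[0][1]`) is still more probable than the unlike swap (cell `[0][2]`), which is the pattern the published matrices preserve at long range.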
It was also approved because it was based on actual observed changes and differences rather than what we-the-scientists thought was likely to occur. PAM matrices were hailed and adopted because they looked right and had considerable internal consistency. For example, all the cells on the main diagonal, coloured old rose in the diagram, are positive: in the majority of cases (85%+) the same AA is present in both sequences in the alignment, so they get a positive score. All the off-diagonal cells are negative because when one AA is replaced by another it can go to any one of 19 other possibilities and so each one is rare. EXCEPT the cells coloured caramel, which are 'conservative substitutions': two amino acids that look and behave like each other (same size, same charge, same hydrophobicity etc.). D aspartic acid and E glutamic acid is one such case. Such changes are tolerated by the mills of evolution and so become a point accepted mutation. PAM matrices were the de facto gold standard for sequence comparison for a generation. They have largely been replaced by BLOSUM matrices, which were invented by Henikoff and Henikoff in 1992.
Margaret Dayhoff? Definitely a nett contributor.
More women in science.


 