Science matters: Secondary structure

Part I.
When The Genome Era got started with the sequencing of baker's yeast Saccharomyces cerevisiae 25+ years ago, I noticed something peculiar about the structure of chr III, the first eukaryotic chromosome ever to be sequenced. It was one of my three [3] big ideas in a lifetime of science, and it turned out to be probably not true. But it was nevertheless a contribution to the debate, one which opened up a novel way of thinking about constraints on the sequence of genes and their equivalent proteins. As more chromosomal sequences came on stream, Ken Wolfe, in the office next door but not yet my boss, also noticed something peculiar, in his case about the order of genes. In particular he found evidence of shadowy patterns of duplicated segments. They were "shadowy" because the component genes were a) often only distantly related and b) frequently missing. Thus on one chromosome you'd have 12 genes
A - B - C - D - E - J - K - P - Q - R - S - T
and elsewhere you'd find
a - b - e - f - g - h - j - l - m - n - p - q - t

What normal people saw, if they even bothered to think in gene order terms, was that a pair of neighbouring genes A,B were distantly related to another adjacent pair a,b somewhere else in the genome. And a little further along on the same two chromosomes P,Q were sort of similar to p,q . . . probably a coincidence, nothing to see here. What Ken saw was [hypothesis to test!] that a chunk of the genome had been duplicated, and that most of the duplicated genes had been surplus to requirements and had mutated away to pseudogenes, and then to nothing-to-see-here. He was able to see the present pattern through the spectacles of evolutionary time. If you aligned the genes themselves rather than their sequences, then the signal from the pairs was boosted by the similar intervening singletons Ee, Jj and Tt:
ABCDE---JK---PQRST
ab--efghj-lmnpq--t
**  *   *    **  *
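For the coders: here is a minimal sketch of what "aligning the genes rather than their sequences" could look like. It is not Ken's method or anything from the Nature paper, just a textbook longest-common-subsequence alignment applied to gene labels. The function names `lcs_pairs` and `align_gene_orders` are made up for illustration, each gene is assumed to be a single letter, and "homologous" is assumed to mean "same letter, ignoring case".

```python
def lcs_pairs(top, bottom, same):
    """Return index pairs (i, j) of one longest common subsequence."""
    n, m = len(top), len(bottom)
    # score[i][j] = length of the best match between top[i:] and bottom[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if same(top[i], bottom[j]):
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])
    # Walk forward through the table, collecting the matched pairs.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if same(top[i], bottom[j]):
            pairs.append((i, j))
            i += 1
            j += 1
        elif score[i + 1][j] >= score[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def align_gene_orders(top, bottom, same=lambda a, b: a.lower() == b.lower()):
    """Return [top_row, bottom_row, marker_row] with '-' for missing genes."""
    rows = ["", "", ""]
    prev_i = prev_j = 0
    # A sentinel pair past both ends flushes the trailing unmatched genes.
    for i, j in lcs_pairs(top, bottom, same) + [(len(top), len(bottom))]:
        top_only, bottom_only = top[prev_i:i], bottom[prev_j:j]
        rows[0] += top_only + "-" * len(bottom_only)
        rows[1] += "-" * len(top_only) + bottom_only
        rows[2] += " " * (len(top_only) + len(bottom_only))
        if i < len(top):  # a real matched pair, not the sentinel
            rows[0] += top[i]
            rows[1] += bottom[j]
            rows[2] += "*"
        prev_i, prev_j = i + 1, j + 1
    return rows


for row in align_gene_orders("ABCDEJKPQRST", "abefghjlmnpqt"):
    print(row)
```

Fed the two toy chromosomes from this post, it prints the same three rows as the hand-made alignment above. Real genomes are messier: genes flip orientation, and "distantly related" has to be decided by a sequence search rather than by matching letters.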

Ken bounced the idea off Denis Shields, another absurdly smart friend of mine, and they published "Molecular evidence for an ancient duplication of the entire yeast genome" in Nature in 1997. That paper documented the existence of 55 duplicated regions incorporating 13% of the total gene complement. It also included Fig. 2 [part shown left], which could claim a prize for the prettiest, most data-rich and informative visual display of quantitative information [Tufte] ever to come out of Ireland. It took commendable bottle to keep digging in the gene-duplication pit when 87% of the genome was saying something else entirely.

People like me would have taken the gong from Nature and moved on to the next project. Ken kept shaking the tree, looking for, and at, these patterns in the gene order of Saccharomyces cerevisiae. Some more focused [boring, pedestrian, ordinary] scientists started shaking their heads, wondering if 'poor Ken' would soon be found crouched in a corner of the senior common room muttering to himself while blocking off imaginary fragments of the yeast genome with extravagant hand-gestures. Partly on the back of his Nature paper, he landed one of the first mega +mi££ion grants when Science Foundation Ireland (SFI) was founded at the very end of the last century. One of the first tasks he set one of the Effectives of his dream-team was to create YGOB, the Yeast Gene Order Browser. I remember him trying to explain his vision to Kevin Byrne, the recently recruited Quant who was assigned this task. Kevin could spell DNA but not desoxyribonucleicacid because he was trained as a mathematical astrophysicist . . . but ygob could that boy code! The result of that collaboration was like Fig. 2 above but with moving parts and incorporating not one but 2 dozen species [see the paper for more details]. I always thought it looked like a mighty train marshalling yard for assembling genes into their most functional order.

Like the London Tube Map, YGOB distils all the relevant data from the noise into a clear graphical roadmap, with all the evidence (protein sequence, gene sequence, annotation) only a click away. Bri'nt!

With this infrastructural support, Ken then got to know pretty much all the 5,500 genes that kept Saccharomyces cerevisiae ticking: who neighboured whom; which genes contributed to biochemical pathway X; which protein was the receptor for which ligand. It was an extraordinary feat of memory and deep knowledge, and a lot more useful than retaining the first 5,500 digits of π. As a teacher, when you know everyone in the class, you know who's missing. Ken started to notice a few cases of missing data: where every species in YGOB had genes ABCDE but Candida glabrata had only ABC-E [as L]. It turned out that in a number of such cases there was a gene in the expected position, but it had evolved so fast that it retained no sequence similarity to any other gene ~~on the planet~~ in the databases. Nevertheless, the novel protein that turned up had the right secondary structure which, coupled with the compelling synteny / genomic location data, made him confident that he could ascribe a function in these peculiar cases. These functional annotations are vital for converting (ATCG)_∞ into some sense of what genes and organisms do. He wrote it up in a (rare nowadays) single-author paper for Current Biology: "Evolutionary Genomics: Yeasts Accelerate beyond BLAST". A few years later he set one of his students to make a comprehensive search of all the YGOB species looking for similar cases. They were able to annotate a number of other proteins which had fallen through the generic gene-finding sieve when each species was sequenced and annotated. Deep knowledge and a synteny Way of Seeing were able to tidy up a number of peculiarities in the biochemical capabilities of particular species.
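The "you know who's missing" trick can be sketched in a few lines. This is a toy under loud assumptions, not how YGOB or Ken's student actually did it: the function name `synteny_gaps`, the label `orf1` for an unassigned open reading frame, and the rule "both immediate neighbours present and in the same order" are all invented here for illustration, and gene orientation is ignored.

```python
def synteny_gaps(reference, orders):
    """List (species, missing_gene, occupants) where a gene is absent
    but both of its reference neighbours are present in the same order.

    reference: list of gene labels in the expected order
    orders:    dict mapping species name -> list of gene labels
    occupants: whatever sits between the two flanking genes in that
               species, e.g. an unassigned ORF worth a closer look
    """
    hits = []
    for species, order in orders.items():
        position = {gene: idx for idx, gene in enumerate(order)}
        # Slide a window of (left neighbour, gene, right neighbour).
        for left, gene, right in zip(reference, reference[1:], reference[2:]):
            if gene in position:
                continue  # gene is annotated here, nothing missing
            if left in position and right in position and position[left] < position[right]:
                occupants = order[position[left] + 1 : position[right]]
                hits.append((species, gene, occupants))
    return hits


reference = list("ABCDE")
orders = {
    "S. cerevisiae": list("ABCDE"),
    "C. glabrata": ["A", "B", "C", "orf1", "E"],  # hypothetical unassigned ORF
}
print(synteny_gaps(reference, orders))
```

In this made-up example the only hit is gene D in C. glabrata, with `orf1` sitting in the slot between C and E. That slot is exactly where you would then go looking with a structure-aware tool rather than a plain sequence search.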

I was sent down this recent-history rabbit hole because I read a report in Chemistry World about the discovery of a 'missing gene' in Trypanosomes [prev], which are responsible for sleeping sickness, one of the most scything neglected tropical diseases. Richard Rachubinski of U Alberta was looking for novel targets to kill the parasites before they could kill their human hosts, and was surprised to find no homologue of PEX3 in the lists of trypanosome proteins. PEX is a family of proteins associated with the development and maintenance of peroxisomes: sub-cellular organelles which are vital for the processing of long chain fatty acids and reactive oxygen species. PEX3 remained elusive until Rachubinski went to a research presentation outside his own field and learned about a bioinformatic tool called HHpred, which can infer function based on secondary structure [which is composed of the intermediate building blocks or structural elements: α-helices; β-sheets; turns; ω-loops] even when the protein sequence is completely different from other members of a gene family. His team is a long way from an effective therapeutic, but the first step of identification and annotation has put them on the right path. More shadows firming up into a hard reality!


Tuesday, 22 October 2019
