Wednesday 3 October 2018

Wallflower genes

Nope: nothing to do with the genetics of Cheiranthus cheiri [R] - today we're on about metaphorical wallflowers. I tell my students that without variation there would be no genetics. If he hadn't noted that his garden peas were round/wrinkled; yellow/green; tall/short, Gregor Mendel might have stuck to his monkish day job and not spent a lifetime reflecting on a different sort of [green x yellow] cross. Mendel knew nothing at all about the molecular mechanisms of heredity because the genetic code was 100 years in the future. He could only work with variants which were in-yer-face obvious to the eye - the only tool he had besides his hands [and a trowel]. He could do nothing with invisible but vital variants a) protecting against cankers, galls and aphids; b) responding to frost and desiccation; c) tolerating good bacteria; d) metabolising amino acids; e) holding the chromosomes together.

It's a fair approximation to say that all progress in medicine is now based on genes and proteins: how they differ in health and ill-health; how they can be encouraged, unplugged, interfered with, replaced and mimicked. Over the last 40 years we have developed a toolkit for clocking these variations. But it turns out that some genes are more equal than others - in terms of the attention they attract from scientists, funders and Big Pharma. I've a neat Most Sexy Protein exercise to do with students which shows how some proteins have had more than 10x the interest of other, related, probably equally worthy, potential targets. I've been here before tribbing the work of Aled Edwards in Toronto for showing that the human genome project delivered [to a close approximation] NO new gene targets. As the HGP a) cost $3 billion of tax-payers' money and b) was billed as helping understand our molecular/medical fundamentals, that's a bit of a bust. Scientists, for all their talk of pushing the far frontiers of science, are quite risk-averse: they tend to tweak and polish the same old same old system that they worked on for their PhD when they had all their hair and fewer children. One reason for this is that they follow the money, and funders are even more risk-averse than scientists. The chaps who make the funding decisions have a bean-counting monkey on their backs whispering "Remember, adjudicator, your decisions must bear fruit for the tax-payer".

The conservatism of science got hot 'n' fizzy in September because of a new massive bibliometric study of all the genes we know about in the human genome. "Large-scale investigation of the reasons why potentially important genes are ignored" was published in PLOS Biology by a small group from Northwestern U headed by Thomas Stoeger and Luis Amaral. I gather that it was Stoeger's original idea and he mobilised Amaral the Portuguese Megaquant (who has an office in each of The Institute on Complex Systems (NICO), Dept of Chemical and Biological Engineering, Dept of Molecular Bioscience, and the Dept of Physics and Astronomy at NWU) to help crunch the numbers, which quickly became formidable.

When I launched into the waves >!plooof!< of molecular evolutionary analysis in 1989, it was fairly calm. I was assigned to a) develop useful software for analysing some classes of DNA sequence b) apply that software to human genes. Only 1064 protein coding genes had been sequenced by Christmas 1990: I know; I collected them carefully. Those genes were cherry picked because they were interesting and analyzable: they had been tracked down because a mutation in their DNA was associated with a disease state. They were not, therefore, a random selection of the 20,000 genes which we now believe exist. The NWU team have flagged 15 basic bio-chemo- physico- attributes which are strongly associated with Interesting genes. That's 15 out of 430 (!) things that you can measure/record about a gene [see R for some of them graphically displayed]. They whittled the list to those key attributes with "gradient boosting regressions with out-of-sample Monte Carlo cross-validation" whatever that is. It seems that most of these measures are derived from street-light science: they could be measured with the techniques which were then available. To get noticed, genes had to a) be expressed in bucket quantities b) across a wide range of tissues especially HeLa cells c) have signal peptides so that they were exportable from the cell for access d) tolerate non-fatal variants. If a gene scored strong in these features, it was likely to be discovered and characterised in the last century. In horse-racing parlance, those genes were racing along on the back of Eclipse: "Eclipse first, the rest nowhere". With that early start, funding-fondling and inertia [continue in its existing state of rest or uniform motion in a straight line, unless that state is changed by an external force] would kick in and a disproportionate amount of care, cash and attention would follow.
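For anyone else puzzled by "gradient boosting regressions with out-of-sample Monte Carlo cross-validation", here is a minimal toy sketch in plain Python of what the phrase usually means. It is not the NWU team's code or data: the single made-up feature, the fake target, the function names (`fit_stump`, `boost`, `monte_carlo_cv`) and all the settings are assumptions for illustration only. Gradient boosting builds a prediction by adding lots of small corrections, each fitted to what the previous rounds got wrong; Monte Carlo cross-validation repeatedly hides a random chunk of the data, trains on the rest, and scores the model on the hidden chunk.

```python
import random

def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean, mean)  # fallback if no split is possible
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=30, rate=0.2):
    """Gradient boosting for squared error: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, rate

def predict(model, x):
    base, stumps, rate = model
    return base + sum(rate * (lm if x <= t else rm) for t, lm, rm in stumps)

def monte_carlo_cv(xs, ys, splits=10, test_frac=0.25, seed=0):
    """Repeated random train/test splits; returns the out-of-sample MSE of each split."""
    rng = random.Random(seed)
    scores = []
    for _ in range(splits):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        cut = int(len(xs) * test_frac)
        test, train = idx[:cut], idx[cut:]
        model = boost([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - predict(model, xs[i])) ** 2 for i in test) / len(test)
        scores.append(mse)
    return scores

# Made-up data: one hypothetical gene attribute and a noisy "attention" score.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(60)]
ys = [0.5 * x + data_rng.gauss(0, 0.5) for x in xs]

scores = monte_carlo_cv(xs, ys)
print("mean out-of-sample MSE:", sum(scores) / len(scores))
```

The point of scoring only on the hidden chunk is that a model which merely memorised its training genes would look bad there, so attributes that survive this test are ones that genuinely help predict attention for genes the model has not seen.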

When the human genome came out in 2001, it should have levelled the playing field because the gene sequence and predicted protein were available for all 20,000 human genes, including the wall-flowers and shrinking violets. But it didn't. Obviously there are exceptions to this rule. Six johnny-come-lately genes are identified in Stoeger&Amaral Fig2a [L]. I know a lot about IFNL4 - interferon λ4 - one of those "Sexy Genes". Its existence wasn't even suspected until January 2013. One of my three useful ideas in science was to predict the existence of IFNL4 close to IFNL3 [also sexy see L] from a close analysis of that part of the genome. Our interest was piqued because a series of studies showed that variants/mutations in that region were associated with the clearance of Hepatitis C virus. The bioinformatic evidence, as I presented it at a lab meeting, was so compelling that the boss pulled a graduate student off her own project, where she was quietly minding her own business, and had her try to prove the existence and activity of the predicted gene in real cells in a test-tube. The quest consumed about 1.5 person-years and came to nothing. I left that group in December 2012 and started work at The Institute in January 2013. The same week that The Blob launched, a far better resourced research group scooped up the prize for discovering IFNL4. And while we're looking at Fig1a, check out C9orf72: it hasn't got a proper name yet [it's the 72nd Open reading frame on Chromosome 9] but it's very strongly associated with ALS [Lou Gehrig] and FTD [frontotemporal dementia], and we have only a hazy idea of the how and the why. It was invisible in the 1980s and 1990s because it's not active in HeLa cells and doesn't have a signal peptide. Nevertheless it has huge potential for making money in the development of therapeutics.

Okay that was all good fun, but what to do about this ludicrous, wasteful, boring, use of science funding? Stoeger&Amaral have some [ain't gonna happen] suggestions:
  • "In order to counter the career forces currently pushing towards conformity, there would be a need for stable, long-term support for such innovators to focus on the unknown".
  • For gawd's sake keep up the basic research - in flies, frogs and nematodes - they have been a rich seam for identifying novel ways forward in human health.
  • Reductionist science - where you control for all variables but one - is only sporadically successful in making progress through the complexity which is us and our habitat. Fund multi-gene science with interaction terms in the equation.
  • Look carefully at the NWU data, it will help you identify the wall-flowers that could do with a dollop of dollars from NIH.


  1. I also know a lot about IFNL4, as we did scoop you and got the prize for IFNL4 discovery ;). It is interesting to learn about other, almost successful efforts happening at the same time... By the way, the IFNL4 and IFNL3 gene names mentioned in the PLOS paper emerged at the same time, in our paper, but IFNL3 was renamed from IL28B, not discovered at that time. So, it was not quite correct to put these genes in the same category of papers on gene names published before/after 2010. However, IFNL4 was not just a new gene name but a new gene, and now this 2013 discovery paper has been cited by more than 600 papers...

    1. I knew that my big insight must have been shared by at least a dozen eager binfo graduate students across the world. You guys were able to carry it forward from wistful speculation to biological actuality. Hats off and a sweeping bow to you, but the hunt for IL860/IL28X was the best fun and I don't regret the time. Then again my ambition genes were shot off in the war.

    2. We actually didn't have any speculations when we started. We didn't predict the existence of this transcript but first saw something on RNA-seq and then followed it up. Anyway, it took 9 months after RNA-seq to clone all these isoforms and of course the full one, which became IFNL4, was the last one because it was the most GC-rich and very difficult to get by PCR. Functional annotation also took time as we had to develop reagents (antibodies, constructs) before we even knew it was a real thing. Until the publication, this transcript was called "Ghost", correctly representing its status. About being a "better funded lab" - I was a new tenure track investigator and most of my crew were at pre-doc level. I guess we were very very hungry and worked like there was no tomorrow. I hope I will get to experience this rush again in my career, I miss it ... Thanks for bringing these memories back :)!