Thursday, 15 February 2018

Eurogene - the map

I told y'all that you should go to Dublin on Darwinday to hear Dan Bradley talk about the Genetic Origins of the Irish. But I know that some things you can't delegate: you just have to do them yourself. Accordingly, I leapt into the Little Red Yaris at 1705hrs and drove to Dublin to hear the news from the frontiers of biogeography. But the news is always based on the olds and the most beautiful and informative picture of my 2018 [at top: far better copy] was published in 2008! I may well have been entranced by that map when it came out ten years ago, but I've since forgotten all about it. Heck, I've forgotten my car-keys and where I left my glasses as well.

That map is Fig 1 in a paper in Nature: Genes mirror geography within Europe which sampled the sequenced genomes of 3000 Europeans (and four Turks) and tallied up each person's state at 500,000 different variable sites in their DNA sequence. That's a shed-load of data and you can't make much headway by ticking off (3000 x 3000)/2 x 500,000 cases of Sean is different from Jean here but the same there, while Giovanni is different again. Well actually you can, and that's what John Novembre et al. did in 2008. They put the whole dataset into a hopper called Principal Components Analysis and gave it all a good shake and a jiggle. PCA reconciles all the internal inconsistencies, and calculates the position of each person in N-dimensional hyperspace. No, I too only have a hazy notion of what that really means but in practice it calculates how near or far each person is from each other person in the dataset. It will come as no surprise that the quartet of Turks turned out to look really similar to each other genetically and rather different from the Europeans . . . and the Irish too: like each other, quite similar to Brits and Scots and less like Poles and Greeks. A lot of that difference will smooth itself out over the next 100 years as our 200,000 resident Poles make babies with their Irish neighbours.
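For the curious, here is a minimal sketch of what the shake and jiggle amounts to. It is a toy under loud assumptions: two made-up populations, 1,000 simulated sites rather than 500,000, genotypes coded as 0, 1 or 2 copies of a variant, and plain NumPy rather than whatever software Novembre et al. actually used. The function names and every number below are invented for illustration.

```python
import numpy as np

def pca_coordinates(genotypes, n_components=2):
    """Place each person (row) on the top principal components."""
    # Centre each site so PCA looks at differences between people
    centred = genotypes - genotypes.mean(axis=0)
    # Singular value decomposition: u * s gives each person's coordinates
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def simulate_population(rng, allele_freqs, n_people):
    """Hypothetical helper: draw 0/1/2 genotypes at every site."""
    return rng.binomial(2, allele_freqs, size=(n_people, allele_freqs.size))

rng = np.random.default_rng(0)
n_sites = 1000      # assumed toy value
n_people = 40       # per population, also assumed

# Population B's allele frequencies drift a little away from A's
freqs_a = rng.uniform(0.1, 0.9, n_sites)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.2, n_sites), 0.05, 0.95)

pop_a = simulate_population(rng, freqs_a, n_people)
pop_b = simulate_population(rng, freqs_b, n_people)

coords = pca_coordinates(np.vstack([pop_a, pop_b]))
pc1_a, pc1_b = coords[:n_people, 0], coords[n_people:, 0]
print("mean PC1, population A:", pc1_a.mean())
print("mean PC1, population B:", pc1_b.mean())
```

Plot the two columns of coords against each other and the two populations should fall into separate clouds, which is the small-scale version of Turks over here and Irish over there.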

You can do this sort of study because the cost of generating the primary data has collapsed over the last 30 years. The first ever chunk of genomic DNA, yeast Saccharomyces cerevisiae chromosome III, was contracted by the EEC (=EU) 30 years ago at 320,000 ecus (=€) or €1 /base. We carried out the first non-trivial added-value analysis of that data - one of my three big ideas in science. With that stepping stone achieved, planners looked to sequencing The Human Genome: it cost €300,000,000 (10c per base) and took ten years. Now you can sequence A human genome for €1,000; it will take a day; and there is enough server power to do many genomes in parallel. So 3,000 genomes is quite affordable in a big science sort of way.

What is most striking about the distribution of genomes across the most explanatory axes of the PCA landscape is how closely it maps onto the geography of Europe. The pale blue of Greece and the Balkans is nearest to Turkey over on the right; the grey Italian peninsula runs parallel and a little more distant; and further away again is a purple peninsula of Iberian genomes. At the opposite end of the continent, the Irish intercalate with the Brits; the Scandinavians have both shared and separate identity etc. etc. If you look closely, you can see Paddy-No-Pals off on his own in the sea like a sort of Uber-Irish outlier. Maybe she is not Paddy at all but Caitlín Ní Uallacháin. Also note the five rogue ITs in the sea at bottom left of the diagram; they do indeed have Italian passports but they are actually Sardinians. There is no evidence here that the compatriots of Szilárd, Wigner, von Neumann and Teller come from Mars.

I was doing a similar analysis waaaay back in the 1980s. It took me 2 years of tramping the streets of towns and cities in New England and the Canadian Maritimes, scoring genetic variation in domestic cats Felis catus to gather a sample of 10,000 cats in 35 different populations diagnosed for 7 genetic variants. (35 x 35)/2 x 7 is quite a bit smaller than (3000 x 3000)/2 x 500,000 !! But it was all my own work. One finding was that genetic distance was correlated (highly significant statistically) with geographic distance but that relationship only explained 16% of the variability in the sample. 84% of the variation was noise - some of which could be accounted for by the history of the patterns of French, English and Dutch colonisation in the 1600s. That was what my PhD thesis concluded aNNyway.
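To put numbers on "quite a bit smaller", here is the back-of-envelope arithmetic as a short Python sketch. It uses the same rough (N x N)/2 x sites tally as above, and it assumes that "explained 16% of the variability" means an r-squared of 0.16, so that the correlation coefficient is its square root. The variable names are invented for illustration.

```python
import math

def comparison_cases(n_units, n_sites):
    """Rough tally of pairwise comparisons: (N x N)/2 x sites."""
    return n_units * n_units / 2 * n_sites

cat_cases = comparison_cases(35, 7)             # 35 cat populations, 7 variants
euro_cases = comparison_cases(3000, 500_000)    # 3000 people, 500,000 sites

print("cat study cases:     ", cat_cases)
print("European study cases:", euro_cases)
print("ratio:               ", euro_cases / cat_cases)

# Assumption: 16% of variability explained is an r-squared of 0.16
r_squared = 0.16
r = math.sqrt(r_squared)
print("implied correlation r:", r)
print("share left unexplained:", 1 - r_squared)
```

So the European dataset is several hundred million times bigger, and a correlation of about 0.4 can be highly significant while still leaving most of the variation unexplained.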

When you cough up your $100 to get your DNA sequenced, 23andMe will compare your DNA to a database like this one and place your genome on the map. Unless you are truly and incestuously descended from the Pharaohs, your genome will be a mess of fragments from the miscegenation of your ancestors. 23andMe will give you a summary sound-bite like "50% Irish; 25% English; 20% French; a toe from the Maghreb and a Neanderthal fingernail". You may take that assessment with a huge pinch of salt because the data will be inherently noisy.
