Monday 26 February 2018

Anscombe's quartet

When I started working at The Institute in January five years ago, I was given the most absurd workload. A couple of different classes required me to teach Excel - the spreadsheet programme. Having known this since I was hired in early December, one of the very last things I did in Trinity before I left was to take a half-day course in Excel. Some weeks, the only thing that kept my nose above water was the belief that it was up to the students to figure things out and that being told things in detail by me would be counter-productive. Let us all give thanks for spreadsheet software - I had a data-analysis job before such things existed and it was hard work.

One of the things that I have really taken on board is that the first thing to do with any data is to graph it. The human eye is really good at picking out patterns, and a graph will show up any trends, groupings and outliers. After eyeballing the data, you can then carry out a statistical test to see whether any of the patterns or comparisons are significant. Apart from anything else, a really good graph will have huge explanatory power in any report you write. If you do the statistics first, you may delude yourself into thinking that something is going on when in reality the data is so noisy that your 'significant' association has no explanatory power. Like my genetic and geographic distance plot.

One of the proverbs of statistics is "Correlation is not causation"; nevertheless r, the correlation coefficient between two variables, does give a quantified estimate of the strength of their association. r varies between -1 and +1, with values close to zero indicating mere noise. Doing that GenDist v GeogDist analysis for my PhD really brought home to me the value of r², which is formally the % of the variability in the dataset explained by the supposed relationship between the two variables. A relationship which has r = 0.8 will have a positive trend, but it explains only r² = 0.64, or about two-thirds, of the variability in the data.
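If you want to see r and r² in action, here is a minimal sketch in Python using numpy. The data and variable names are made up for illustration - a noisy positive trend standing in for the sort of GenDist v GeogDist scatter described above - not the original PhD numbers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a GenDist v GeogDist scatter:
# a positive trend buried in a fair amount of noise.
geog_dist = rng.uniform(0, 100, size=50)
gen_dist = 0.05 * geog_dist + rng.normal(0, 1.5, size=50)

# Pearson correlation coefficient r, and r squared:
# the proportion of the variability 'explained' by the trend.
r = np.corrcoef(geog_dist, gen_dist)[0, 1]
print(f"r  = {r:.2f}")
print(f"r2 = {r**2:.2f}  # ~{100 * r**2:.0f}% of the variability")
```

Graph it first, of course - the printout only tells you how tight the association is, not what shape it takes.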
Three weeks ago, I picked up a scrap of paper at work from a previous class that had been considering Frank Anscombe's insistence that the first thing you should do with data is to graph it. The four graphs show four datasets with highly significant positive [and suspiciously similar] associations: for each one r = 0.82 and r² = 0.67. These classic 'fudged' datasets are known as Anscombe's Quartet. They were designed by Anscombe in 1973 to show that a correlation coefficient - for all its pretensions at objective truth - can be a piss-poor summary of what is actually going on. You can reproduce the numbers yourself, as in the sketch after the list below.
  • Is there really a 'trend' in the data shown bottom right? It's the kind of picture that you'd get from plotting height against foot length for 10 mice and one cat. The outlier is driving the whole analysis.
  • The top two pictures really bring home that a correlation coefficient depends both on the slope of the relationship and on how tight it is.
  • The picture bottom left makes clear that a linear trend is a rather woolly approximation for the true nature of the relationship between the two variables.
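For the curious, here is a short Python sketch that recomputes and plots the quartet, assuming numpy and matplotlib are installed. The data values are the standard published ones from Anscombe's paper; every panel comes out with r of about 0.82 and the same least-squares line, which is exactly his point.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet: four x-y datasets with near-identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    x, y = np.array(x), np.array(y)
    r = np.corrcoef(x, y)[0, 1]                # correlation coefficient
    slope, intercept = np.polyfit(x, y, 1)     # least-squares straight line
    xs = np.array([x.min(), x.max()])
    ax.scatter(x, y)
    ax.plot(xs, slope * xs + intercept)
    ax.set_title(f"{name}: r = {r:.2f}, r\u00b2 = {r**2:.2f}")

plt.tight_layout()
plt.show()   # all four panels report r ~ 0.82, r2 ~ 0.67 - yet look utterly different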
Frank Anscombe was born in Hove, UK in 1918 but was head-hunted by Princeton and later Yale after WWII, when the US could pay salaries that seemed astronomical to benighted, rationed Brits. You can get the substance of his sermon on the value of graphs and analysis in the first page summary here. The rest of the gospel is behind a [modest] JSTOR paywall.
