One of the things that I have really taken on board is that the first thing to do with any data is to graph it. The human eye is really good at picking out patterns, and a graph will show up any trends, groupings and outliers. After eyeballing the data, you can then carry out a statistical test to see if any of the patterns or comparisons are significant. Apart from anything else, a really good graph will have huge explanatory power for any report you make. If you do the statistics first, you may delude yourself into thinking that something is going on when in reality the data is so noisy that your 'significant' association has no explanatory power, as happened with my genetic and geographic distance plot.
One of the proverbs of statistics is "Correlation is not causation". Nevertheless r, the correlation coefficient between two variables, does give a quantified estimate of the strength of their linear association. r varies between -1 and +1, with values close to zero indicating little more than noise. Doing that GenDist v GeogDist analysis from my PhD really brought home to me the value of r², which is formally the proportion of the variability in one variable explained by a straight-line relationship with the other. A relationship which has r = 0.8 will have a positive trend
Three weeks ago, I picked up a scrap of paper at work from a previous class that had been considering Frank Anscombe's insistence that the first thing you should do with data is to graph it. The four graphs show four datasets with highly significant positive [and suspiciously similar] associations: for each one r = 0.82 and r² = 0.67. These classic 'fudged' datasets are known as Anscombe's Quartet. They were published by Anscombe in 1973 to show that a correlation coefficient, for all its pretensions at objective truth, can be a piss-poor summary of what is actually going on.
- Is there really a 'trend' in the data shown bottom right? It's the kind of picture that you'd get from plotting height against foot length for 10 mice and one cat. The outlier is driving the whole analysis.
- The top two pictures really bring home that a correlation coefficient depends both on the slope of the relationship and on how tight it is.
- The picture bottom left makes clear that a linear trend is a rather woolly approximation for the true nature of the relationship between the two variables.
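To check the numbers for yourself, here is a minimal sketch in Python. It assumes the standard published values for Anscombe's four datasets, labelled I to IV as in his paper (I am not claiming which label matches which panel on my scrap of paper), and the helper `pearson_r` is my own name for a plain textbook calculation of r, not a library function.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Datasets I, II and III share the same x values (assumed: standard published data)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One r per dataset; r squared is the proportion of variability explained
results = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in results.items():
    print(f"Dataset {name}: r = {r:.2f}, r^2 = {r * r:.2f}")
```

All four lines should report the same r and r² to two decimal places, which is exactly the point: the summary statistic cannot tell the four pictures apart. Notice also that dataset IV has only one point whose x differs from 8; remove it and every x is identical, so r is not even defined (the function above would divide by zero).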