Archive for statistics

Throwing a Fit

Posted in Bad Statistics, The Universe and Stuff on February 18, 2009 by telescoper

I’ve just been to a very interesting and stimulating seminar by Subir Sarkar from Oxford, who spoke about Cosmology Beyond the Standard Model, a talk into which he packed a huge number of provocative comments and interesting arguments. His abstract is here:

Precision observations of the cosmic microwave background and of the large-scale clustering of galaxies have supposedly confirmed the indication from the Hubble diagram of Type Ia supernovae that the universe is dominated by some form of dark energy which is causing the expansion rate to accelerate. Although hailed as having established a ‘standard model’ for cosmology, this raises a profound problem for fundamental physics. I will discuss whether the observations can be equally well explained in alternative inhomogeneous cosmological models that do not require dark energy and will be tested by forthcoming observations.

He made no attempt to be balanced and objective, but it was a thoroughly enjoyable polemic making the point that it is possible that the dark energy whose presence we infer from cosmological observations might just be an artifact of using an oversimplified model to interpret the data. I actually agreed with quite a lot of what he said, and certainly think the subject needs people willing to question the somewhat shaky foundations on which the standard concordance cosmology is built.

But near the end, Subir almost spoiled the whole thing by making a comment that made me decide to make another entry in my Room 101 of statistical horrors. He was talking about the spectrum of fluctuations in the temperature of the Cosmic Microwave Background as measured by the Wilkinson Microwave Anisotropy Probe (WMAP):

[Figure: the WMAP angular power spectrum of CMB temperature fluctuations, with the best-fit concordance model shown as a solid line]
I’ve mentioned the importance of this plot in previous posts. In his talk, Subir wanted to point out that the measured spectrum isn’t actually fit all that well by the concordance cosmology prediction shown by the solid line.

A simple way of measuring goodness-of-fit is to work out the value of chi-squared, which is essentially the sum of the squares of the residuals between the data and the fit, weighted by the measurement errors. If you do this with the WMAP data you will find that the value of chi-squared is actually a bit high, so high indeed that there is only a 7 per cent chance of such a value arising in a concordance Universe. The reason is probably to do with the behaviour at low harmonics (i.e. large scales), where there are some points that do appear to lie off the model curve. This means that the best-fit concordance model isn’t a really brilliant fit, but it is acceptable at the usual 5 per cent significance level.
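Just to make the arithmetic concrete, here is a minimal sketch in Python of how a tail probability of this sort is computed. The chi-squared value and number of degrees of freedom below are invented purely for illustration (the real WMAP fit involves many more subtleties); they simply give a tail probability of a similar order to the 7 per cent quoted above.

```python
from scipy import stats

# Invented, purely illustrative numbers: suppose the best-fit model leaves a
# chi-squared of 1100 with 1030 degrees of freedom. These are NOT the real
# WMAP values.
chi2_value = 1100.0
dof = 1030

# Probability of a chi-squared at least this large *if* the model (and the
# assumption of independent Gaussian errors) is correct.
p_value = stats.chi2.sf(chi2_value, dof)
print(f"P(chi-squared >= {chi2_value:.0f} | model) = {p_value:.3f}")
```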

I won’t quibble with this number, although strictly speaking the data points aren’t entirely independent so the translation of chi-squared into a probability is not quite as easy as it may seem.  I’d also stress that I think it is valuable to show that the concordance model isn’t by any means perfect.  However, in Subir’s talk the chi-squared result morphed into a statement that the  probability of the concordance model being right is only 7 per cent.

No! The probability of chi-squared given the model is 7%, but that’s quite different to the probability of the model given the value of chi-squared…
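To spell out the difference, here is a toy Bayesian calculation, with an entirely invented alternative model and invented priors, and with tail probabilities standing in crudely for likelihoods. The point is only that the 7 per cent figure is not, and cannot become, the probability of the model:

```python
# Toy numbers only: the p-value is P(data | model); to get P(model | data)
# you need priors and an explicit alternative, via Bayes' theorem.
p_data_given_concordance = 0.07   # the chi-squared tail probability quoted above
p_data_given_alternative = 0.20   # invented figure for some rival model
prior_concordance = 0.9           # invented prior degree of belief
prior_alternative = 0.1

posterior = (p_data_given_concordance * prior_concordance) / (
    p_data_given_concordance * prior_concordance
    + p_data_given_alternative * prior_alternative
)
print(f"P(model | data) = {posterior:.2f}")  # about 0.76 with these inputs, not 0.07
```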

This is a thinly disguised example of the prosecutor’s fallacy which came up in my post about Sir Roy Meadow and his testimony in the case against Sally Clark that resulted in a wrongful conviction for the murder of her two children.

Of course the consequences of this polemicist’s fallacy aren’t so drastic. The Universe won’t go to prison. And it didn’t really spoil what was a fascinating talk. But it did confirm in my mind that statistics is like alcohol. It makes clever people say very silly things.

Misplaced Confidence

Posted in Bad Statistics, The Universe and Stuff on December 10, 2008 by telescoper

From time to time I’ve been posting items about the improper use of statistics. My colleague Ant Whitworth just showed me an astronomical example drawn from his own field of star formation and found in a recent paper by Matthew Bate from the University of Exeter.

The paper is a lengthy and complicated one involving the use of extensive numerical calculations to figure out the effect of radiative feedback on the process of star formation. The theoretical side of this subject is fiendishly difficult, to the extent that it is hard to make any progress with pencil-and-paper techniques, and Matthew is one of the leading experts in the use of computational methods to tackle problems in this area.

One of the main issues Matthew was investigating was whether radiative feedback had any effect on the initial mass function of the stars in his calculations. The key results are shown in the picture below (Figure 8 from the paper) in terms of cumulative distributions of the star masses in various different situations.

[Figure 8 from the paper: cumulative distributions of the stellar masses produced in the different calculations]

The question that arises from such data is whether these empirical distributions differ significantly from each other or whether they are consistent with the variations that would naturally arise in different samples drawn from the same distribution. The most interesting ones are the two distributions to the right of the plot that appear to lie almost on top of each other.

Because the samples are very small (only 13 and 15 objects respectively), one can’t reasonably test for goodness-of-fit using the standard chi-squared test: discreteness effects matter and not much is known about the error distribution. To do the statistics, therefore, Matthew uses a popular non-parametric method called the Kolmogorov-Smirnov (K-S) test, which uses the maximum deviation D between the two cumulative distributions as a figure of merit to decide whether they match. If D is very large then it is improbable that the two samples came from the same distribution. If it is smaller then they might have. As for what happens if it is very small, you’ll have to wait a bit.
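For anyone who hasn’t met it, here is a minimal sketch of a two-sample K-S test in Python. The “masses” are synthetic lognormal samples of the same sizes as in the paper; the real values come from Bate’s calculations and aren’t reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two small synthetic "mass" samples of 13 and 15 objects, purely
# illustrative and not the values from the paper.
masses_a = rng.lognormal(mean=-1.0, sigma=0.5, size=13)
masses_b = rng.lognormal(mean=-1.0, sigma=0.5, size=15)

# D is the maximum vertical distance between the two empirical cumulative
# distributions; the p-value is the chance of a D at least this large if
# both samples really come from the same underlying distribution.
d_statistic, p_value = stats.ks_2samp(masses_a, masses_b)
print(f"D = {d_statistic:.3f}, P(D >= observed | same distribution) = {p_value:.3f}")
```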

This is an example of a standard (frequentist) hypothesis test in which the null hypothesis is that the empirical distributions are calculated from independent samples drawn from the same underlying form. The probability of a value of D arising as large as the measured one can be calculated assuming the null is true and is then the significance level of the test. If there’s only a 1% chance of it being as large as the measured value then the significance level is 1%.

So far, so good.

But then, in describing the results of the K-S test the paper states

A Kolmogorov-Smirnov (K-S) test on the …. distributions gives a 99.97% probability that the two IMFs were drawn from the same underlying distribution (i.e. they are statistically indistinguishable).

Agh! No it doesn’t! What it gives is a 99.97% probability that the chance deviation between two independent samples drawn from the same distribution would be larger than the one actually measured. In other words, the two distributions are surprisingly close to each other. But the significance level merely specifies the probability that you would reject the null hypothesis if it were correct. It says nothing at all about the probability that the null hypothesis is correct. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution of D based on it, and hence determine the statistical power of the test. Without specifying an alternative hypothesis all you can say is that you have failed to reject the null hypothesis.
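To show what a power calculation would involve, here is the kind of rough simulation one could do under one invented alternative (a modestly shifted lognormal). The point is only the structure of the calculation, not the particular numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# How often would a two-sample K-S test at the 5 per cent level reject the
# null if samples of 13 and 15 objects really came from *different*
# distributions? The alternative here is invented purely for illustration.
n_trials = 5000
rejections = 0
for _ in range(n_trials):
    sample_a = rng.lognormal(mean=-1.0, sigma=0.5, size=13)
    sample_b = rng.lognormal(mean=-0.6, sigma=0.5, size=15)  # shifted alternative
    _, p = stats.ks_2samp(sample_a, sample_b)
    if p < 0.05:
        rejections += 1

print(f"Estimated power against this alternative: {rejections / n_trials:.2f}")
```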

Or better still, if you have an alternative hypothesis you can forget about power and significance and instead work out the relative probability of the two hypotheses using a proper Bayesian approach.

You might also reasonably ask why D might be so very small. If you find an improbably low value of chi-squared then it usually means either that somebody has cheated or that the data are not independent (independence being one of the assumptions on which the test is based). Qualitatively the same thing happens with a K-S test.
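One can get a feel for how odd a p-value as high as 0.9997 is by simulating the null hypothesis itself. A rough sketch, again with synthetic lognormal samples chosen only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# For two genuinely independent samples from the same distribution, the K-S
# p-value should only very rarely be as extreme as 0.9997, i.e. D should
# only very rarely be that small.
n_trials = 20000
count_extreme = 0
for _ in range(n_trials):
    a = rng.lognormal(mean=-1.0, sigma=0.5, size=13)
    b = rng.lognormal(mean=-1.0, sigma=0.5, size=15)
    _, p = stats.ks_2samp(a, b)
    if p >= 0.9997:
        count_extreme += 1

print(f"Fraction of independent pairs with p >= 0.9997: {count_extreme / n_trials:.5f}")
```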

In fact these two distributions can’t be thought of as independent samples anyway, because they are computed from the same initial conditions but with various knobs turned on or off to include different physics. They are not “samples” drawn from the same population but slightly different versions of the same sample. The probability emerging from the K-S machinery is therefore meaningless in this context.

So a correct statement of the result would be that the deviation between the two computed distributions is much smaller than one would expect to arise from two independent samples of the same size drawn from the same population.

That’s a much less dramatic statement than is contained in the paper, but has the advantage of not being bollocks.

The Curious Case of the Inexpert Witness

Posted in Bad Statistics on September 17, 2008 by telescoper

Although I am a cosmologist by trade, I am also interested in the fields of statistics and probability theory. I guess this derives from the fact that a lot of research in cosmology depends on inferences drawn from large data sets. By its very nature this process is limited by the fact that the information obtained in such studies is never complete. The analysis of systems based on noisy or incomplete data is exactly what probability is about.

Of course, statistics has much wider applications than in pure science and there are times when it is at the heart of controversies that explode into the public domain, particularly in medicine or jurisprudence. One of the reasons why I wrote my book From Cosmos to Chaos was a sense of exasperation at how poorly probability theory is understood even by people who really should know better. Although statistical reasoning is at the heart of a great deal of research in physics and astronomy, there are many prominent practitioners who don’t really know what they are talking about when they discuss probability. As I soon discovered when I started thinking about writing the book, the situation is even worse in other fields. I thought it might be fun to post a few examples of bad statistics from time to time, so I’ll start with this, which is accompanied by a powerpoint file of a lunchtime talk I gave at Cardiff.

I don’t have time to relate the entire story of Sally Clark and the monstrous miscarriage of justice she endured after the deaths of her two children. The wikipedia article I have linked to is pretty accurate, so I’ll refer you there for the details. In a nutshell, in 1999 she was convicted of the murder of her two children on the basis of some dubious forensic evidence and the expert testimony of a prominent paediatrician, Sir Roy Meadow. After appeal her conviction was quashed in 2003, but she died in 2007 from alcohol poisoning, having apparently taken to the bottle after three years of wrongful imprisonment.

Professor Meadow had a distinguished (if somewhat controversial) career, becoming famous for a paper on Munchausen’s Syndrome by Proxy which appeared in the Lancet in 1977. He subsequently appeared as an expert witness in many trials of parents accused of murdering their children. In the Sally Clark case he was called as a witness for the prosecution, where his testimony included an entirely bogus and now infamous argument about the probability of two sudden infant deaths happening accidentally in the same family.

The argument is basically the following. The observed ratio of live births to cot deaths in affluent non-smoking families (like Sally Clark’s) is about 8,500:1. This means that about 1 in 8,500 children born to such families die in such a way. He then argued that the probability that two such tragedies happen in the same family is this number squared, i.e. about 73,000,000:1. In the mind of the jury this became the odds against the death of Mrs Clark’s children being accidental and therefore presumably the odds against her being innocent. The jury found her guilty.
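The structure of the fallacy can be shown with a few lines of arithmetic. The only figure below taken from the case is Meadow’s 1 in 8,500; the prior for the alternative explanation is invented purely for illustration:

```python
# Illustrative calculation only: the point is the structure of the argument,
# not the exact values.
p_cot_death = 1 / 8500                 # Meadow's figure for one cot death
p_two_cot_deaths = p_cot_death ** 2    # his (dubious) independence assumption,
                                       # roughly 1 in 73 million

# The fallacy is to read this as the odds against innocence. To say anything
# about innocence you must compare it with the alternative explanation, which
# is also extremely rare. The prior below is invented for illustration.
p_double_murder = 1 / 100_000_000      # hypothetical prior for a double murder

p_innocent_given_deaths = p_two_cot_deaths / (p_two_cot_deaths + p_double_murder)
print(f"P(both deaths accidental | two deaths occurred) = {p_innocent_given_deaths:.2f}")
```

Even taking the squared figure at face value, with these (invented) numbers the two deaths come out more likely to have been accidental than not; and the independence assumption itself is dubious, because cot deaths within the same family may well share genetic or environmental risk factors.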

For reasons why this argument is completely bogus, and more technical details, look in the following powerpoint file (which involves a bit of maths):

the-inexpert-witness

It is difficult to assess how important Roy Meadow’s testimony was in the collective mind of the Jury, but it was certainly erroneous and misleading. The General Medical Council decided that he should be struck off the medical register in July 2005 on the grounds of “serious professional misconduct”. He appealed, and the decision was partly overturned in 2006, the latest judgement basically being about what level of professional misconduct should be termed “serious”.

My reaction to all this is a mixture of anger and frustration. First of all, the argument presented by Meadow is so clearly wrong that any competent statistician could have been called as a witness to rebut it. The defence were remiss in not doing so. Second, the disciplinary action taken by the GMC seemed to take no account of the consequences his testimony had for Sally Clark. He was never even at risk of prosecution or financial penalty. Sally Clark spent three years of her life in prison, on top of having lost her children, and is now herself dead. Finally, expert testimony is clearly important in many trials, but experts should testify only on those matters that they are experts about! Meadow even admitted later that he didn’t really understand statistics. So why did he include this argument in his testimony? I quote from a press release produced by the Royal Statistical Society in the aftermath of this case:

Although many scientists have some familiarity with statistical methods, statistics remains a specialised area. The Society urges the Courts to ensure that statistical evidence is presented only by appropriately qualified statistical experts, as would be the case for any other form of expert evidence.

As far as I know, the criminal justice system has yet to implement such safeguards.

How many more cases like this need to happen before the Courts recognise the dangers of bad statistics?