Archive for the Bad Statistics Category

False Positives

Posted in Bad Statistics on February 27, 2009 by telescoper

There was an interesting article on the BBC website this week that, for once, contained a reasonable discussion of statistics in the mass media. I’m indebted to my friend Anton for pointing it out to me. I’ve filed it along with examples of Bad Statistics because the issue it covers is usually poorly explained. I don’t think the article itself is bad. In fact, it’s rather good.

The question is all about cancer screening, specifically for breast cancer, but the lesson could apply to a host of other situations. In the original context, the question goes as follows:

Say that routine screening is 90% accurate. Say you have a positive test. What’s the chance that your positive test is accurate and you really have cancer?

Presumably there will be many of you who think the answer is 90%. Hands up if you think this!

If you don’t think it’s 90% then what do you think it is?

The correct answer is that you have no idea. I haven’t given you enough information.

To see why, imagine that the prevalence of cancer in the population is such that 1% of a randomly selected sample will have it. Out of a thousand people one would expect that, on average, ten would have cancer. If the test is 90% accurate then 9 of these will show positive signs and only one won’t.

However, 990 people out of the original thousand don’t have cancer. If the test is only 90% accurate then 10% of these, i.e. 99 people, will show a false positive.

Thus the total number of positive tests is 108, of which only 9 belong to individuals who actually have cancer. The chance that you have cancer, given a positive test, is therefore 9/108: only about 1 in 12.

But that depends on my assumption about the overall rate in the population. If that number is different it changes the odds. Without this information, the problem is ill-posed.

The more general way of looking at this is in terms of conditional probabilities. What you are given is that P(positive test|cancer) = P(+|C) = 0.9 and P(negative test|no cancer) = 0.9, while P(negative test|cancer) = 0.1 and P(positive test|no cancer) = P(+|N) = 0.1. What you want to know is P(cancer|positive test) = P(C|+). This can be obtained from Bayes’ Theorem, but only if you know P(cancer) = P(C) = 1 − P(N), since people either have cancer or they don’t.

The answer is given by P(C|+) = P(C)P(+|C)/[P(C)P(+|C) + P(N)P(+|N)], which for the numbers I gave above is 0.01 × 0.9/[0.01 × 0.9 + 0.99 × 0.1] = 0.009/0.108 ≈ 1/12, the same answer as before.
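For anyone who wants to play with the numbers, here is a minimal sketch of the same calculation in Python, using the illustrative figures above (1% prevalence, 90% accuracy both ways); change the assumed prevalence and the answer changes with it.

```python
# A minimal sketch of the calculation above, using the illustrative numbers
# from the post: 1% prevalence and a test that is 90% accurate both ways.
p_cancer = 0.01                # P(C): prior prevalence in the population
p_no_cancer = 1 - p_cancer     # P(N)
p_pos_given_cancer = 0.9       # P(+|C)
p_pos_given_no_cancer = 0.1    # P(+|N): the false-positive rate

# Bayes' theorem: P(C|+) = P(C)P(+|C) / [P(C)P(+|C) + P(N)P(+|N)]
evidence = p_cancer * p_pos_given_cancer + p_no_cancer * p_pos_given_no_cancer
p_cancer_given_pos = p_cancer * p_pos_given_cancer / evidence

print(f"P(cancer | positive test) = {p_cancer_given_pos:.3f}")  # about 0.083, roughly 1 in 12
```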

So the moral is that you shouldn’t panic if you get a positive result from a screening test of this type. As long as the condition being tested for is rarer than the probability of an error in the test result, the chances are high that you’ve got nothing to worry about. But of course, you should follow it up with more detailed tests.

The Bayesian way is the easy way!

Throwing a Fit

Posted in Bad Statistics, The Universe and Stuff on February 18, 2009 by telescoper

I’ve just been to a very interesting and stimulating seminar by Subir Sarkar from Oxford, who spoke about Cosmology Beyond the Standard Model, a talk into which he packed a huge number of provocative comments and interesting arguments. His abstract is here:

Precision observations of the cosmic microwave background and of the large-scale clustering of galaxies have supposedly confirmed the indication from the Hubble diagram of Type Ia supernovae that the universe is dominated by some form of dark energy which is causing the expansion rate to accelerate. Although hailed as having established a ‘standard model’ for cosmology, this raises a profound problem for fundamental physics. I will discuss whether the observations can be equally well explained in alternative inhomogeneous cosmological models that do not require dark energy and will be tested by forthcoming observations.

He made no attempt to be balanced and objective, but it was a thoroughly enjoyable polemic making the point that the dark energy whose presence we infer from cosmological observations might just be an artifact of using an oversimplified model to interpret the data. I actually agreed with quite a lot of what he said, and certainly think the subject needs people willing to question the somewhat shaky foundations on which the standard concordance cosmology is built.

But near the end, Subir almost spoiled the whole thing by making a comment that made me decide to add another entry to my Room 101 of statistical horrors. He was talking about the spectrum of fluctuations in the temperature of the Cosmic Microwave Background as measured by the Wilkinson Microwave Anisotropy Probe (WMAP):

[Figure: the WMAP angular power spectrum of CMB temperature fluctuations, with the best-fitting concordance-model curve shown as a solid line.]

I’ve mentioned the importance of this plot in previous posts. In his talk, Subir wanted to point out that the measured spectrum isn’t actually fit all that well by the concordance cosmology prediction shown by the solid line.

A simple way of measuring goodness-of-fit is to work out the value of chi-squared, which is essentially the sum of the squares of the residuals between the data and the fit, weighted by the measurement errors. If you do this with the WMAP data you will find that the value of chi-squared is actually a bit high, so high indeed that there is only a 7 per cent chance of such a value arising in a concordance Universe. The reason is probably to do with the behaviour at low harmonics (i.e. large scales), where there are some points that do appear to lie off the model curve. This means that the best-fit concordance model isn’t a really brilliant fit, but it is acceptable at the usual 5% significance level.
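For concreteness, here is a minimal sketch of how a measured chi-squared value gets turned into that kind of probability. The numbers below are placeholders, not the actual WMAP values (which aren’t quoted here), and the calculation assumes independent data points with Gaussian errors — the very caveat raised in the next paragraph.

```python
# A minimal sketch, not the actual WMAP analysis: converting a chi-squared
# value into the probability of obtaining one at least that large by chance,
# assuming independent data points with Gaussian errors.
from scipy.stats import chi2

chi2_value = 1100.0   # hypothetical measured chi-squared (placeholder)
dof = 1024            # hypothetical number of degrees of freedom (placeholder)

# The survival function gives P(chi-squared >= observed | model), i.e. the
# kind of quantity quoted as "7 per cent" above -- not the probability of the model.
p_value = chi2.sf(chi2_value, dof)
print(f"P(chi^2 >= {chi2_value:.0f} | model, {dof} dof) = {p_value:.3f}")
```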

I won’t quibble with this number, although strictly speaking the data points aren’t entirely independent so the translation of chi-squared into a probability is not quite as easy as it may seem.  I’d also stress that I think it is valuable to show that the concordance model isn’t by any means perfect.  However, in Subir’s talk the chi-squared result morphed into a statement that the  probability of the concordance model being right is only 7 per cent.

No! The probability of chi-squared given the model is 7%, but that’s quite different to the probability of the model given the value of chi-squared…

This is a thinly disguised example of the prosecutor’s fallacy which came up in my post about Sir Roy Meadow and his testimony in the case against Sally Clark that resulted in a wrongful conviction for the murder of her two children.

Of course the consequences of this polemicist’s fallacy aren’t so drastic. The Universe won’t go to prison. And it didn’t really spoil what was a fascinating talk. But it did confirm in my mind that statistics is like alcohol. It makes clever people say very silly things.

Misplaced Confidence

Posted in Bad Statistics, The Universe and Stuff on December 10, 2008 by telescoper

From time to time I’ve been posting items about the improper use of statistics. My colleague Ant Whitworth just showed me an astronomical example drawn from his own field of star formation and found in a recent paper by Matthew Bate from the University of Exeter.

The paper is a lengthy and complicated one involving the use of extensive numerical calculations to figure out the effect of radiative feedback on the process of star formation. The theoretical side of this subject is fiendishly difficult, to the extent that it is difficult to make any progress with pencil-and-paper techniques, and Matthew is one of the leading experts in the use of computational methods to tackle problems in this area.

One of the main issues Matthew was investigating was whether radiative feedback had any effect on the initial mass function of the stars in his calculations. The key results are shown in the picture below (Figure 8 from the paper) in terms of cumulative distributions of the star masses in various different situations.

[Figure 8 from the paper: cumulative distributions of the masses of the stars formed in the various calculations.]

The question that arises from such data is whether these empirical distributions differ significantly from each other or whether they are consistent with the variations that would naturally arise in different samples drawn from the same distribution. The most interesting ones are the two distributions to the right of the plot that appear to lie almost on top of each other.

Because the samples are very small (only 13 and 15 objects respectively) one can’t reasonably test for goodness-of-fit using the standard chi-squared test, because of discreteness effects and because not much is known about the error distribution. To do the statistics, therefore, Matthew uses a popular non-parametric method called the Kolmogorov-Smirnov test, which uses the maximum deviation D between the two empirical cumulative distributions as a figure of merit to decide whether they match. If D is very large then it is unlikely that the two samples were drawn from the same distribution; if it is smaller then they might have been. As for what happens if it is very small, you’ll have to wait a bit.

This is an example of a standard (frequentist) hypothesis test in which the null hypothesis is that the empirical distributions are calculated from independent samples drawn from the same underlying form. The probability of a value of D arising as large as the measured one can be calculated assuming the null is true and is then the significance level of the test. If there’s only a 1% chance of it being as large as the measured value then the significance level is 1%.

So far, so good.

But then, in describing the results of the K-S test the paper states

A Kolmogorov-Smirnov (K-S) test on the …. distributions gives a 99.97% probability that the two IMFs were drawn from the same underlying distribution (i.e. they are statistically indistinguishable).

Agh! No it doesn’t! What it gives is a probability of 99.97% that the deviation between two samples drawn from the same distribution would be larger than the one actually measured. In other words, the two distributions are surprisingly close to each other. But the significance level merely specifies the probability that you would reject the null hypothesis if it were correct. It says nothing at all about the probability that the null hypothesis is correct. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution of D based on it, and hence determine the statistical power of the test. Without specifying an alternative hypothesis all you can say is that you have failed to reject the null hypothesis.
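A minimal simulation, using made-up samples rather than the data from the paper, illustrates the point: when the null hypothesis is true by construction, the K-S p-value is spread roughly uniformly between 0 and 1, so a particular large value cannot be read as the probability that the null is true.

```python
# A sketch with made-up lognormal "masses" (an assumption, not the paper's data):
# draw many pairs of small samples from the SAME distribution and look at how
# the two-sample K-S p-values are distributed.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
n1, n2 = 13, 15                    # the sample sizes quoted above
p_values = []
for _ in range(10_000):
    a = rng.lognormal(size=n1)
    b = rng.lognormal(size=n2)
    p_values.append(ks_2samp(a, b).pvalue)

# The p-values are roughly uniform under the null, so a value of 0.9997 is not
# "a 99.97% probability that the distributions are the same"; it only says the
# observed deviation D is surprisingly small.
print("p-value quartiles under the null:", np.percentile(p_values, [25, 50, 75]))
```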

Or better still, if you have an alternative hypothesis you can forget about power and significance and instead work out the relative probability of the two hypotheses using a proper Bayesian approach.

You might also reasonably ask why D might be so very small. If you find an improbably low value of chi-squared then it usually means either that somebody has cheated or that the data are not independent (an assumption on which the test is based). Qualitatively the same thing happens with a K-S test.

In fact these two distributions can’t be thought of as independent samples anyway, as they are computed from the same initial conditions but with various knobs turned on or off to include different physics. They are not “samples” drawn from the same population but slightly different versions of the same sample. The probability emerging from the K-S machinery is therefore meaningless in this context.

So a correct statement of the result would be that the deviation between the two computed distributions is much smaller than one would expect to arise from two independent samples of the same size drawn from the same population.

That’s a much less dramatic statement than is contained in the paper, but has the advantage of not being bollocks.

Cerebral Asymmetry: is it all in the Mind?

Posted in Bad Statistics, Science Politics on November 12, 2008 by telescoper

After blogging a few days ago about the possibility that our entire Universe might be asymmetric, I found out today that a short comment of mine about a completely different form of asymmetry has been published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS).

Earlier this summer a paper by Ivanka Savic & Per Lindstrom concerning gender and sexuality differences in brain structure received widespread press coverage and the odd blog comment. They had analysed a group of 90 volunteers divided into four classes based on gender and sexual orientation: male heterosexual, male homosexual, female heterosexual and female homosexual.

They studied the brain structure of these volunteers using Magnetic Resonance Imaging and used their data to look for differences between the different classes. In particular they measured the asymmetry between left and right hemispheres for their samples. The right side of the brain for heterosexual men was found to be typically about 2% larger than the left; homosexual women also had an asymmetry, but slightly smaller than this at about 1%. Gay men and heterosexual women showed no discernible cerebral asymmetry. These claims are obviously very interesting and potentially important if they turn out to be true. It is in the nature of the scientific method that such results should be subjected to rigorous scrutiny in order to check their credibility.

As someone who knows nothing about neurobiology but one or two things about statistics, I dug out the research paper by Savic & Lindstrom and looked at the analysis it presents. I very quickly began to suspect there might be a problem. For each volunteer, the authors obtain measurements of the left and right cerebral volumes (call these L and R respectively). Each pair of measurements is then combined to form an asymmetry index (AI) as (L-R)/(L+R). There is then a set of values for AI, one for each volunteer. The claim is that these are systematically different for the different gender and orientation groups, based on a battery of tests including Analysis of Variance (ANOVA) and t-tests based on sample means.

Of course, it would be better to do this using a consistent, Bayesian, approach because this would make explicit the dependence of the results on an underlying model of the data. Sadly, the off-the-shelf statistical machinery is of the inferior frequentist type, and reaching for it is what researchers tend to do when they don’t really know what they’re doing. They also don’t bother to read the health warnings that state the assumptions behind the results.

The problem in this case is that the tests done by Savic & Lindstrom all depend on the quantity being analysed (AI) having a normal (Gaussian) distribution. This is very often a reasonable hypothesis for biometric data, but unfortunately in this case the construction of the asymmetry index is such that it is expected to have a very non-Gaussian shape, as is commonly the case for distributions of variables formed as ratios. In fact, the ratio of two normal variates has a peculiar distribution with very long tails. Many statistical analyses appeal to the Central Limit Theorem to justify the assumption of normality, but distributions with very long tails (such as the Cauchy distribution) violate the conditions of this Theorem, namely that the distribution must have finite variance. The asymmetry index is probably therefore an inappropriate choice of variable for the tests that Savic & Lindstrom perform. In particular, the significance levels (or p-values) quoted in their paper are very low (of order 0.0008, for example, in the ANOVA test), which is surprising for such small samples. These probabilities are obtained by assuming the observations have Gaussian statistics; for a distribution with longer tails, extreme values arise far more often by chance, so the true p-values would be considerably larger and the results correspondingly less significant.
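To illustrate the point about ratios, here is a minimal simulation — an illustration of the general hazard, not a model of the actual L and R measurements — comparing the tails of the ratio of two zero-mean normal variates, which is Cauchy-distributed, with those of a Gaussian.

```python
# The ratio of two zero-mean normal variates is Cauchy-distributed; its tails
# are vastly heavier than anything a Gaussian would produce.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
ratio = rng.normal(size=n) / rng.normal(size=n)   # Cauchy-distributed
gauss = rng.normal(size=n)

for k in (5, 10, 50):
    print(f"P(|x| > {k:2d}):  ratio of normals = {np.mean(np.abs(ratio) > k):.2e}, "
          f"Gaussian = {np.mean(np.abs(gauss) > k):.2e}")
```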

Being a friendly chap, I emailed Dr Savic drawing this problem to her attention and asking whether she was aware of the possible implications it might have for the analysis she had presented. If not, I offered to do an independent (private) check on the data to see how reliable the claimed statistical results actually were. I never received a reply.

Worried that the world might be jumping to all kinds of far-reaching conclusions about gay genes based on these questionable statistics, I wrote instead to the editor of PNAS, Randy Schekman, who suggested I submit a written comment to the journal. I did, it was accepted by the editorial committee, and it came out in the 11th November issue. What I didn’t realise was that Savic & Lindstrom had actually prepared a reply and that this was published alongside my comment. I find it strange that I wasn’t told about this before publication but, that aside, it is in principle quite reasonable to let the authors respond to criticisms like mine. Their response reveals that they completely missed the point about the danger of long-tailed distributions I mentioned above. They state that “when the sample size n is big the sampling distribution of the mean becomes approximately normal regardless of the distribution of the original variable“. Not if the distribution of the original variable has such a long tail it doesn’t! In fact, if the observations have a Cauchy distribution then so does the sampling distribution of the mean, whatever the size of sample. You can find this caveat spelled out in many places, including here. Savic & Lindstrom seem oblivious to this pitfall, even after I specifically pointed it out to them.
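The caveat about the Central Limit Theorem is also easy to check numerically; a minimal sketch:

```python
# For Cauchy samples the sample mean never settles down, because the mean of
# n Cauchy variates is itself Cauchy-distributed with the same width. For a
# well-behaved distribution the spread of sample means shrinks like 1/sqrt(n);
# here it does not shrink at all, however big n gets.
import numpy as np

rng = np.random.default_rng(2)
for n in (30, 300, 3000):
    means = rng.standard_cauchy(size=(5000, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    print(f"n = {n:4d}   interquartile range of sample means = {q3 - q1:.2f}")
```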

They also claim that a group size of n=30 is sufficient to be confident that the central limit theorem holds. A pity, then, that none of their groups is of that size. The overall sample is 90, but it is broken down into two groups of 20 and two of 25.

[Figure: histogram of the asymmetry index (AI) values for the whole sample, reproduced from Savic & Lindstrom’s reply. © 2008 National Academy of Sciences.]

They also say that the measured AI distribution is actually normal anyway and give a plot (above). This shows all the AI values binned into one histogram. Since they don’t give any quantitative measures of goodness of fit, it’s hard to tell whether this has a normal distribution or not. One can, however, easily identify a group of five or six individuals that seem to form a separate group with larger AI values (the small peak to the right of the large peak). Since they don’t give histograms broken down by group it is impossible to be sure, but I would hazard a guess that these few individuals might be responsible for the entire result; remember that the entire sample has n of only 90.

More alarmingly, Savic & Lindstrom state in their reply that “one outlier” is omitted from this graph. Really? On what basis was the outlier rejected? The existence of outliers could be evidence of exactly the sort of problem I am worried about! Unless there was a known mistake in the measurement, this outlier should never have been omitted. They claim that the “recalculation of the data excluding this outlier does not change the results”. I find it difficult to believe that the removal of an outlier from such a small sample could not change the p-values!

In my note I made a few constructive suggestions as to how the difficulty might be circumvented, but Savic & Lindstrom have not followed any of them. Instead they report (without giving the p-values) having done some alternative, non-parametric, tests. These are all very well, but they don’t add very much if their p-values also rest on Gaussian assumptions. A better way to do this sort of thing robustly would be to use Monte Carlo simulations.
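To show what I have in mind, here is a minimal sketch of one such Monte Carlo approach, a permutation test, applied to made-up AI values (the real data are not publicly available). It makes no assumption at all about the shape of the underlying distribution.

```python
# Permutation test sketch with hypothetical AI values -- NOT the real data.
# Compare the observed difference in group medians with the differences
# obtained after randomly reshuffling the group labels.
import numpy as np

rng = np.random.default_rng(3)
group_a = rng.normal(0.02, 0.01, size=25)   # hypothetical AI values for one group
group_b = rng.normal(0.00, 0.01, size=20)   # hypothetical AI values for another

observed = np.median(group_a) - np.median(group_b)
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 20_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = np.median(pooled[:n_a]) - np.median(pooled[n_a:])
    if abs(diff) >= abs(observed):
        count += 1

# Fraction of label reshufflings giving a difference at least as extreme as the
# one observed: a p-value that does not rely on Gaussian assumptions.
print(f"permutation p-value = {(count + 1) / (n_perm + 1):.4f}")
```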

The bottom line is that after this exchange of comments we haven’t really got anywhere and I still don’t know if the result is significant. I don’t really think it’s useful to go backwards and forwards through the journal, so I’ve emailed Dr Savic again asking for access to the numbers so I can check the statistics privately. In astronomy it is quite normal for people to make their data sets publicly available, but that doesn’t seem to be the case in neurobiology. I’m not hopeful that they will reply, especially since they branded my comments “harsh” and “inappropriate”. Scientists should know how to take constructive criticism.

Their conclusion may eventually turn out to be right, but the analysis done so far is certainly not robust and it needs further checking. In the meantime I don’t just have doubts about the claimed significance of this specific result; the episode also serves to illustrate the extremely poor level of statistical understanding displayed by large numbers of professional researchers. This was one of the things I wrote about in my book From Cosmos to Chaos. I’m very confident that a large fraction of claimed results in the biosciences are based on bogus analyses.

I’ve long thought that scientific journals that deal with subjects like this should employ panels of statisticians to do the analysis independently of the authors and also that publication of the paper should require publication of the raw data. Science advances when results are subject to open criticism and independent analysis. I sincerely hope that Savic & Lindstrom will release their data in order for their conclusions to be checked in this way.

It’s no wonder that there is so much public distrust of science, when such important claims are rushed into the public domain without proper scrutiny.

A Lop-sided Universe?

Posted in Bad Statistics, Cosmic Anomalies, The Universe and Stuff on November 9, 2008 by telescoper

Over on cosmic variance, I found an old post concerning the issue of whether there might be large-scale anomalies in the cosmic microwave background sky. I blogged about this some time ago, under the title of Is there an Elephant in the Room?, so it’s interesting to see a different take on it. Interest in this issue has been highlighted by a recent paper by Groeneboom & Eriksen that claims to have detected an asymmetry in the distribution of fluctuations in the data from the Wilkinson Microwave Anisotropy Probe (WMAP) that is inconsistent with the predictions of the standard cosmological model. If this feature is truly of primordial origin then it is an extremely important discovery, as it will (probably) require the introduction of new physics into our understanding of cosmology, and that will be exciting.

It is the job of theorists to invent new theories, and it is not at all a problem that these bits of evidence have generated a number of speculative ideas. Who knows? One of them may be right. I think it is the job of theoreticians to think as radically as possible about things like this. On the other hand, it is the observational evidence that counts in the end and we should be very conservative in how we treat that. This is what bothers me about this particular issue.

The picture on the left shows a processed version of the WMAP fluctuation pattern, designed to reveal the asymmetry, with the apparent preferred direction shown in red. This map shows the variation of the fluctuation level across the whole sky, and the claimed result is that the fluctuations are a bit larger around the red dots (which are 180 degrees apart) than in the regions at right angles to them.

It’s a slight effect, but everything in the picture is a slight effect as the CMB is extremely smooth to start with, the fluctuations in temperature being only about one part in a hundred thousand. The statistical analysis looks to me to be reasonably solid, so let’s suppose that the claim is correct.

The picture on the right (courtesy of NASA/WMAP Science Team) shows the scan strategy followed by the WMAP satellite on the same projection of the sky. The experiment maps the whole sky by spinning its detectors in such a way that they point at all possible positions. The axis of this spin is chosen in a particular way so that it is aligned with the ecliptic poles (out of the plane of the solar system). It is in the nature of this procedure that it visits some places more than others (those at the ecliptic poles are scanned more often than those at the equator), hence the variation in signal-to-noise shown in the map. You can see that effect graphically in the picture: the regions near the North and South ecliptic poles have better signal to noise than the others.

The axis found by Groeneboom & Eriksen is not perfectly aligned with the ecliptic plane but it is pretty close. It seems a reasonable (if conservative) interpretation of this that the detected CMB anomaly could be due to an unknown systematic that has something to do either with the solar system (such as an unknown source of radiation, like cold dust) or the way the satellite scans. The WMAP team have worked immensely hard to isolate any such systematics so if this is such an effect then it must be very subtle to have escaped their powerful scrutiny. They’re all clever people and it’s a fabulous experiment, but that doesn’t mean that it is impossible that they have missed something.

Many of the comments that have been posted on cosmic variance relating to this issue question the statistical nature of the result. Of course we have only one sky available, so given the “randomness” of the fluctuations it is possible that freakish configurations occur by chance. This misses the essentially probabilistic nature of all science, which I tried to describe in my book on probability, From Cosmos to Chaos. We are always limited by noise and incompleteness, but that doesn’t invalidate the scientific method. In cosmology these problems are writ large because of the nature of the subject, but there is no qualitative difference in the interplay between evidence and theory in cosmology compared with other sciences. It’s just less easy to get the evidence.

So the issue here, which is addressed only partially by Groeneboom & Eriksen, is whether a lop-sided universe is more probable than an isotropic one given the WMAP measurements. They use a properly consistent Bayesian argument to tackle this issue and form a reasonably strong conclusion that the answer is yes. As far as it goes, I think this is (probably) reasonable.

However, now imagine I don’t believe in anisotropic cosmologies but instead have an idea that this is caused by an unknown systematic relating in some way to the ecliptic plane. Following the usual Bayesian logic, I think it is clear that, although both hypotheses can account for the data, mine must be even more probable than a lop-sided universe. There is no reason why a primordial effect should align so closely with the ecliptic plane, so there is one unexplained coincidence in the lop-sided-universe model, whereas my model neatly accounts for that fact without any free parameters to adjust. Ockham’s razor is on my side.
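A toy calculation makes the Ockham penalty concrete. Suppose (an assumption for illustration; the actual alignment angle isn’t quoted above) that “closely aligned” means the axis lies within 10 degrees of the ecliptic plane. If the primordial hypothesis has nothing to say about direction, the axis is equally likely to point anywhere on the sky, and the chance of such an alignment arising by accident is just the fractional solid angle of that band:

```python
# Chance probability that a randomly oriented axis lies within `tol` degrees
# of the ecliptic plane; `tol` is an assumed illustrative tolerance.
import numpy as np

tol = 10.0  # degrees (placeholder assumption)

# Analytic answer: the band |ecliptic latitude| < tol covers a fraction
# sin(tol) of the sphere.
analytic = np.sin(np.radians(tol))

# The same thing by brute force: draw isotropic random directions and count.
rng = np.random.default_rng(4)
z = rng.uniform(-1.0, 1.0, size=1_000_000)     # cos(polar angle), uniform for isotropy
latitude = 90.0 - np.degrees(np.arccos(z))     # "ecliptic latitude" of each axis
monte_carlo = np.mean(np.abs(latitude) < tol)

print(f"chance-alignment probability: analytic = {analytic:.3f}, Monte Carlo = {monte_carlo:.3f}")
```

A factor like this multiplies down the posterior probability of any model in which the alignment is a mere coincidence, which is the sense in which Ockham’s razor favours the systematic explanation.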

So what can we do about this? The answer might be not very much. It is true that, soon, the Planck Surveyor will be launched and it will map the CMB sky again at higher resolution and sensitivity. On the other hand, it will not solve the problem that we only have one sky. The fact that it is a different experiment may yield clues to any residual systematics in the WMAP results, but if it has a similar scan strategy to WMAP, even Planck might not provide definitive answers.

I think this one may run and run!

The Curious Case of the Inexpert Witness

Posted in Bad Statistics on September 17, 2008 by telescoper

Although I am a cosmologist by trade, I am also interested in the fields of statistics and probability theory. I guess this derives from the fact that a lot of research in cosmology depends on inferences drawn from large data sets. By its very nature this process is limited by the fact that the information obtained in such studies is never complete. The analysis of systems based on noisy or incomplete data is exactly what probability is about.

Of course, statistics has much wider applications than in pure science and there are times when it is at the heart of controversies that explode into the public domain, particularly when involved in medicine or jurisprudence. One of the reasons why I wrote my book From Cosmos to Chaos was a sense of exasperation at how poorly probability theory is understood even by people who really should know better. Although statistical reasoning is at the heart of a great deal of research in physics and astronomy, there are many prominent practioners who don’t really know what they are talking about when they discuss probability. As I soon discovered when I started thinking about writing the book, the situation is even worse in other fields. I thought it might be fun to post a few examples of bad statistics from time to time, so I’ll start with this, which is accompanied by a powerpoint file of a lunchtime talk I gave at Cardiff.

I don’t have time to relate the entire story of Sally Clark and the monstrous miscarriage of justice she endured after the deaths of her two children. The wikipedia article I have linked to is pretty accurate, so I’ll refer you there for the details. In a nutshell, in 1999 she was convicted of the murder of her two children on the basis of some dubious forensic evidence and the expert testimony of a prominent paediatrician, Sir Roy Meadow. On appeal, her conviction was quashed in 2003, but she died in 2007 from alcohol poisoning, having apparently taken to the bottle after three years of wrongful imprisonment.

Professor Meadow had a distinguished (if somewhat controversial) career, becoming famous for a paper on Munchausen’s Syndrome by Proxy which appeared in the Lancet in 1977. He subsequently appeared as an expert witness in many trials of parents accused of murdering their children. In the Sally Clark case he was called as a witness for the prosecution, where his testimony included an entirely bogus and now infamous argument about the probability of two sudden infant deaths happening accidentally in the same family.

The argument is basically the following. The observed ratio of live births to cot deaths in affluent non-smoking families (like Sally Clark’s) is about 8,500:1. This means that about 1 in 8,500 children born to such families die in such a way. He then argued that the probability of two such tragedies happening in the same family is this number squared, i.e. about 73,000,000:1. In the mind of the jury this became the odds against the deaths of Mrs Clark’s children being accidental, and therefore presumably the odds against her being innocent. The jury found her guilty.
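To see the shape of the correct argument, here is a toy Bayesian comparison. The only figure taken from the text above is the 1-in-8,500 cot-death rate; the dependence factor and the double-murder rate are illustrative placeholders, not real statistics.

```python
# Toy numbers only -- the point is the structure of the argument, not the values.
p_one_cot_death = 1 / 8500

# Squaring assumes the two deaths are independent; shared genetic and
# environmental factors make a second death in the same family more likely.
# Represent that with an assumed (hypothetical) dependence factor.
dependence_factor = 10                      # placeholder assumption
p_two_cot_deaths = p_one_cot_death ** 2 * dependence_factor

# Double infant murder by a parent is itself extraordinarily rare; this prior
# is again a placeholder, not a real statistic.
p_double_murder = 1e-8                      # placeholder assumption

# If the evidence is comparably consistent with either explanation, the
# posterior odds are roughly the ratio of these prior probabilities.
odds = p_two_cot_deaths / p_double_murder
print(f"odds of accident versus murder (toy numbers): about {odds:.0f} to 1 in favour of accident")
```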

For reasons why this argument is completely bogus, and more technical details, look in the following powerpoint file (which involves a bit of maths):

the-inexpert-witness

It is difficult to assess how important Roy Meadow’s testimony was in the collective mind of the Jury, but it was certainly erroneous and misleading. The General Medical Council decided that he should be struck off the medical register in July 2005 on the grounds of “serious professional misconduct”. He appealed, and the decision was partly overturned in 2006, the latest judgement basically being about what level of professional misconduct should be termed “serious”.

My reaction to all this is a mixture of anger and frustration. First of all, the argument presented by Meadow is so clearly wrong that any competent statistician could have been called as a witness to rebut it. The defence were remiss in not doing so. Second, the disciplinary action taken by the GMC seemed to take no account of the consequences his testimony had for Sally Clark. He was never even at risk of prosecution or financial penalty. Sally Clark spent three years of her life in prison, on top of having lost her children, and now is herself dead. Finally, expert testimony is clearly important in many trials, but experts should testify only on those matters that they are experts about! Meadow even admitted later that he didn’t really understand statistics. So why did he include this argument in his testimony? I quote from a press release produced by the Royal Statistical Society in the aftermath of this case:

Although many scientists have some familiarity with statistical methods, statistics remains a specialised area. The Society urges the Courts to ensure that statistical evidence is presented only by appropriately qualified statistical experts, as would be the case for any other form of expert evidence.

As far as I know, the criminal justice system has yet to implement such safeguards.

How many more cases like this need to happen before the Courts recognise the dangers of bad statistics?