Archive for the Bad Statistics Category

DNA Profiling and the Prosecutor’s Fallacy

Posted in Bad Statistics on October 23, 2010 by telescoper

It’s been a while since I posted anything in the Bad Statistics file so I thought I’d return to the subject of one of my very first blog posts, although I’ll take a different tack this time and introduce it with a different, though related, example.

The topic is forensic statistics, which has been involved in some high-profile cases and which demonstrates how careful probabilistic reasoning is needed to understand scientific evidence. A good example is the use of DNA profiling evidence. Typically, this involves the comparison of two samples: one from an unknown source (evidence, such as blood or semen, collected at the scene of a crime) and a known or reference sample, such as a blood or saliva sample from a suspect. If the DNA profiles obtained from the two samples are indistinguishable then they are said to “match” and this evidence can be used in court as indicating that the suspect was in fact the origin of the sample.

In courtroom dramas, DNA matches are usually presented as definitive. In fact, the strength of the evidence varies widely depending on the circumstances. If the DNA profile of the suspect or evidence consists of a combination of traits that is very rare in the population at large then the evidence can be very strong that the suspect was the contributor. If the DNA profile is not so rare then it becomes more likely that both samples match simply by chance. This probabilistic aspect makes it important to follow the logic of the argument carefully.

So how does it all work? A DNA profile is not a complete map of the entire genetic code contained within the cells of an individual, which would be such an enormous amount of information that it would be impractical to use it in court. Instead, a profile consists of a few (perhaps half-a-dozen) pieces of this information called alleles. An allele is one of the possible codings of DNA of the same gene at a given position (or locus) on one of the chromosomes in a cell. A single gene may, for example, determine the colour of the blossom produced by a flower; more often genes act in concert with other genes to determine the physical properties of an organism. The overall physical appearance of an individual organism, i.e. any of its particular traits, is called the phenotype and it is controlled, at least to some extent, by the set of alleles that the individual possesses. In the simplest cases, however, a single gene controls a given attribute. The gene that controls the colour of a flower will have different versions: one might produce blue flowers, another red, and so on. These different versions of a given gene are called alleles.

Some organisms contain two copies of each gene; these are said to be diploid. These copies can either be both the same, in which case the organism is homozygous, or different, in which case it is heterozygous; in the latter case it possesses two different alleles for the same gene. Phenotypes for a given allele may be either dominant or recessive (although not all are characterized in this way). For example, suppose the dominant and recessive alleles are called A and a, respectively. If a phenotype is dominant then the presence of one associated allele in the pair is sufficient for the associated trait to be displayed, i.e. AA, aA and Aa will all show the same phenotype. If it is recessive, both alleles must be of the type associated with that phenotype so only aa will lead to the corresponding traits being visible.

Now we get to the probabilistic aspect of this. Suppose we want to know what the frequency of an allele is in the population, which translates into the probability that it is selected when a random individual is extracted. The argument that is needed is essentially statistical. During reproduction, the offspring assemble their alleles from those of their parents. Suppose that the alleles for any given individual are chosen independently. If p is the frequency of the dominant allele and q is the frequency of the recessive one, then we can immediately write:

p+q =1

Using the product law for probabilities, and assuming independence, the probability of homozygous dominant pairing (i.e. AA) is p^2, while that of the pairing aa is q^2. The probability of the heterozygotic outcome is 2pq (the two possibilities, each of probability pq, are Aa and aA). This leads to the result that

p^2 +2pq +q^2 =1

This is called the Hardy-Weinberg law. It can easily be extended to cases where there are more than two alleles, but I won’t go through the details here.

Now what we have to do is examine the DNA of a particular individual and see how it compares with what is known about the population. Suppose we take one locus to start with, and the individual turns out to be homozygotic: the two alleles at that locus are the same. In the population at large the frequency of that allele might be, say, 0.6. The probability that this combination arises “by chance” is therefore 0.6 times 0.6, or 0.36. Now move to the next locus, where the individual profile has two different alleles. The frequency of one is 0.25 and that of the other is 0.75, so the probability of the combination is “2pq”, which is 0.375. The probability of a match at both these loci is therefore 0.36 times 0.375, or 13.5%. The addition of further loci gradually refines the profile, so the corresponding probability reduces.
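The two-locus arithmetic above is easy to check by machine. What follows is only a minimal sketch under the same assumptions as the text (Hardy-Weinberg proportions at each locus and independence between loci); the function names and the way a profile is represented are invented for illustration and do not come from any forensic software.

```python
def genotype_frequency(p, q=None):
    """Hardy-Weinberg frequency of a genotype at a single locus.

    p and q are allele frequencies. If q is omitted the genotype is
    homozygous (two copies of the allele with frequency p).
    """
    if q is None:
        return p * p          # homozygous: p^2
    return 2.0 * p * q        # heterozygous: 2pq


def profile_match_probability(loci):
    """Multiply the genotype frequencies together, assuming independent loci."""
    probability = 1.0
    for alleles in loci:
        probability *= genotype_frequency(*alleles)
    return probability


# The example from the text: one homozygous locus (allele frequency 0.6)
# and one heterozygous locus (allele frequencies 0.25 and 0.75).
profile = [(0.6,), (0.25, 0.75)]
match_probability = profile_match_probability(profile)
print(f"{match_probability:.3f}")  # 0.135
```

Adding further tuples to `profile` shows how quickly the product shrinks as more loci are included.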

This is a perfectly bona fide statistical argument, provided the assumptions made about population genetics are correct. Let us suppose that a profile of 7 loci – a typical number for the kind of profiling used in the courts – leads to a probability of one in ten thousand of a match for a “randomly selected” individual. Now suppose the profile of our suspect matches that of the sample left at the crime scene. This means that either the suspect left the trace there, or an unlikely coincidence happened: that, by a 1:10,000 chance, our suspect just happened to match the evidence.

This kind of result is often quoted in the newspapers as meaning that there is only a 1 in 10,000 chance that someone other than the suspect contributed the sample or, in other words, that the odds are ten thousand to one against the suspect being innocent. Such statements are gross misrepresentations of the logic, but they have become so commonplace that they have acquired their own name: the Prosecutor’s Fallacy.

To see why this is a fallacy, i.e. why it is wrong, imagine that whatever crime we are talking about took place in a big city with 1,000,000 inhabitants. How many people in this city would have DNA that matches the profile? Answer: about 1 in 10,000 of them, which comes to 100. Our suspect is one. In the absence of any other information, the odds are therefore roughly 100:1 against him being guilty rather than 10,000:1 in favour. In realistic cases there will of course be additional evidence that excludes the other 99 potential suspects, so it is incorrect to claim that a DNA match actually provides evidence of innocence. This converse argument has been dubbed the Defence Fallacy, but nevertheless it shows that statements about probability need to be phrased very carefully if they are to be understood properly.
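The city example is just Bayes’ theorem with a flat prior, and it can be written out explicitly. The sketch below is an illustration only: it assumes, as the example does, that every inhabitant is equally likely a priori to be the source and that the true source is certain to match. The function name is made up for the purpose.

```python
def probability_suspect_is_source(population, match_probability):
    """Posterior probability that a matching suspect left the sample.

    Assumes a uniform prior over the whole population and no other
    evidence; the true source matches with probability 1.
    """
    # Expected number of people other than the source who match by chance.
    expected_chance_matches = (population - 1) * match_probability
    return 1.0 / (1.0 + expected_chance_matches)


posterior = probability_suspect_is_source(1_000_000, 1.0 / 10_000)
print(f"P(source | match) = {posterior:.4f}")                        # about 0.0099
print(f"odds against      = {(1.0 - posterior) / posterior:.0f}:1")  # about 100:1
```

The Prosecutor’s Fallacy amounts to quoting `match_probability` where `1 - posterior` is what matters; any additional evidence enters by replacing the flat prior with something narrower.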

All this brings me to the tragedy that I blogged about in 2008. In 1999, Mrs Sally Clark was tried and convicted for the murder of her two sons Christopher, who died aged 10 weeks in 1996, and Harry who was only eight weeks old when he died in 1998. Sudden infant deaths are sadly not as uncommon as one might have hoped: about one in eight thousand families experience such a nightmare. But what was unusual in this case was that after the second death in Mrs Clark’s family, the distinguished paediatrician Sir Roy Meadow was asked by the police to investigate the circumstances surrounding both her losses. Based on his report, Sally Clark was put on trial for murder. Sir Roy was called as an expert witness. Largely because of his testimony, Mrs Clark was convicted and sentenced to prison.

After much campaigning, she was released by the Court of Appeal in 2003. She was innocent all along. On top of the loss of her sons, the courts had deprived her of her liberty for more than three years. Sally Clark died in 2007 from alcohol poisoning, having apparently taken to the bottle after her years of wrongful imprisonment. The whole episode was a tragedy and a disgrace to the legal profession.

I am not going to imply that Sir Roy Meadow bears sole responsibility for this fiasco, because there were many difficulties in Mrs Clark’s trial. One of the main issues raised on Appeal was that the pathologist working with the prosecution had failed to disclose evidence that Harry was suffering from an infection at the time he died. Nevertheless, what Professor Meadow said on oath was so shockingly stupid that he fully deserves the vilification with which he was greeted after the trial. Two other women had also been imprisoned in similar circumstances, as a result of his intervention.

At the core of the prosecution’s case was a probabilistic argument that would have been torn to shreds had any competent statistician been called to the witness box. Sadly, the defence counsel seemed to believe it as much as the jury did, and it was never rebutted. Sir Roy stated, correctly, that the odds of a baby dying of sudden infant death syndrome (or “cot death”) in an affluent, non-smoking family like Sally Clark’s were about 8,543 to one against. He then presented the probability of this happening twice in a family as being this number squared, or 73 million to one against. In the minds of the jury this became the odds against Mrs Clark being innocent of a crime.

That this argument was not effectively challenged at the trial is truly staggering.

Remember that the product rule for combining probabilities

P(AB)=P(A)P(B|A)

only reduces to

P(AB)=P(A)P(B)

if the two events A and B are independent, i.e. that the occurrence of one event has no effect on the probability of the other. Nobody knows for sure what causes cot deaths, but there is every reason to believe that there might be inherited or environmental factors that might cause such deaths to be more frequent in some families than in others. In other words, sudden infant deaths might be correlated rather than independent. Furthermore, there is data about the frequency of multiple infant deaths in families. The conditional frequency of a second such event following an earlier one is not one in eight thousand or so, it’s just one in 77. This is hard evidence that should have been presented to the jury. It wasn’t.
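To see how much difference the independence assumption makes, here is the arithmetic using only the two figures quoted in this post (1 in 8,543 for a first cot death and 1 in 77 for a second given a first). It is a rough sketch rather than a proper analysis: combining those two numbers assumes they refer to comparable populations, which may well not be the case, and the variable names are purely illustrative.

```python
p_first = 1.0 / 8543              # P(A): a first cot death in the family
p_second_given_first = 1.0 / 77   # P(B|A): a second, given that a first occurred

# What was presented in court: P(AB) = P(A) P(B), i.e. independence assumed.
p_both_independent = p_first * p_first

# The general product rule: P(AB) = P(A) P(B|A).
p_both_conditional = p_first * p_second_given_first

print(f"independence assumed: 1 in {1.0 / p_both_independent:,.0f}")
print(f"using P(B|A) = 1/77:  1 in {1.0 / p_both_conditional:,.0f}")
print(f"ratio: {p_both_conditional / p_both_independent:.0f}")
```

On these figures the squared number understates the probability of two cot deaths by a factor of about a hundred. Even the corrected value is still not the probability that Mrs Clark was innocent; that is a separate question, which is where the Prosecutor’s Fallacy comes in.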

Note that this testimony counts as doubly-bad statistics. It not only deploys the Prosecutor’s Fallacy, but applies it to what was an incorrect calculation in the first place!

Defending himself, Professor Meadow tried to explain that he hadn’t really understood the statistical argument he was presenting, but was merely repeating for the benefit of the court something he had read, which turned out to have been in a report that had not even been published at the time of the trial. He said:

To me it was like I was quoting from a radiologist’s report or a piece of pathology. I was quoting the statistics, I wasn’t pretending to be a statistician.

I always thought that expert witnesses were supposed to testify about those things that they were experts about, rather than subjecting the jury to second-hand flummery. Perhaps expert witnesses enjoy their status so much that they feel they can’t make mistakes about anything.

Subsequent to Mrs Clark’s release, Sir Roy Meadow was summoned to appear in front of a disciplinary tribunal at the General Medical Council. At the end of the hearing he was found guilty of serious professional misconduct, and struck off the medical register. Since he is retired anyway, this seems to me to be scant punishment. The judges and barristers who should have been alert to this miscarriage of justice have escaped censure altogether.

Although I am pleased that Professor Meadow has been disciplined in this fashion, I also hope that the General Medical Council does not think that hanging one individual out to dry will solve this problem. In addition, I think the politicians and legal system should look very hard at what went wrong in this case (and others of its type) to see how the probabilistic arguments that are essential in the days of forensic science can be properly incorporated in a rational system of justice. At the moment there is no agreed protocol for evaluating scientific evidence before it is presented to court. It is likely that such a protocol might have prevented the case of Mrs Clark from ever coming to trial. Scientists frequently seek the opinions of lawyers when they need to, but lawyers seem happy to handle scientific arguments themselves even when they don’t understand them at all.

I end with a quote from a press release produced by the Royal Statistical Society in the aftermath of this case:

Although many scientists have some familiarity with statistical methods, statistics remains a specialised area. The Society urges the Courts to ensure that statistical evidence is presented only by appropriately qualified statistical experts, as would be the case for any other form of expert evidence.

As far as I know, the criminal justice system has yet to implement such safeguards.



Political Correlation

Posted in Bad Statistics, Politics on August 28, 2010 by telescoper

I was just thinking that it’s been a while since I posted anything in my bad statistics category when a particularly egregious example jumped up out of this week’s Times Higher and slapped me in the face. This one goes wrong before it even gets to the statistical analysis, so I’ll only give it short shrift here, but it serves to remind us all how feeble many academics’ grasp of the scientific method is, and particularly of the role of statistics within it. The perpetrator in this case is Paul Whiteley, who is Professor of Politics at the University of Essex. I’m tempted to suggest he should go and stand in the corner wearing a dunce’s cap.

Professor Whiteley argues that he has found evidence that refutes the case that increased provision of science, technology, engineering and maths (STEM) graduates is – in the words of Lord Mandelson – “crucial to securing future prosperity”. His evidence is based on data relating to 30 OECD countries: on the one hand, their average economic growth for the period 2000-8 and, on the other, the percentage of graduates in STEM subjects for each country over the same period. He finds no statistically significant correlation between these variates. The data are plotted here:

This lack of correlation is asserted to be evidence that STEM graduates are not necessary for economic growth, but in an additional comment (for which no supporting numbers are given), it is stated that growth correlates with the total number of graduates in all subjects in each country. Hence the conclusion that higher education is good, whether or not it’s in STEM areas.

So what’s wrong with this analysis? A number of things, in fact, but I’ll start with what seems to me the most important conceptual one. In order to test a hypothesis, you have to look for a measurable effect that would be expected if the hypothesis were true, measure the effect, and then decide whether the effect is there or not. If it isn’t, you have falsified the hypothesis.

Now, would anyone really expect the % of students graduating in STEM subjects to correlate with the growth rate in the economy over the same period? Does anyone really think that newly qualified STEM graduates have an immediate impact on economic growth? I’m sure even the most dedicated pro-science lobbyist would answer “no” to that question. Even the quote from Lord Mandelson included the crucial word “future”! Investment in these areas is expected to have a long-term benefit that would probably only show after many years. I would have been amazed had there been a correlation between measures relating to such a short period, so the absence of one says nothing whatsoever about the economic benefits of education in STEM areas.

And another thing. Why is the “percentage of graduates” chosen as a variate for this study? Surely a large % of STEM graduates is irrelevant if the total number is very small? I would have thought the fraction of the population with a STEM degree might be a better choice. Better still, since it is claimed that the overall number of graduates correlates with economic growth, why not show how this correlation with the total number of graduates breaks down by subject area?

I’m a bit suspicious about the reliability of the data too. Which country is it that produces less than 3% of its graduates in science subjects (the point at the bottom left of the plot)? Surely different countries also have different types of economy wherein the role of science and technology varies considerably. It’s tempting, in fact, to see two parallel lines in the above graph – I’m not the only one to have noticed this – which may either be an artefact of the small numbers involved or might indicate that some other parameter is playing a role.

This poorly framed hypothesis test, dubious choice of variables, and highly questionable conclusions strongly suggest that Professor Whiteley had made his mind up what result he wanted and simply dressed it up in a bit of flimsy statistics. Unfortunately, such pseudoscientific flummery is all that’s needed to convince a great many out there in the big wide world, especially journalists. It’s a pity that this shoddy piece of statistical gibberish was given such prominence in the Times Higher, supported by a predictably vacuous editorial, especially when the same issue features an article about the declining standards of science journalism. Perhaps we need more STEM graduates to teach the others how to do statistical tests properly.

However, before everyone accuses me of being blind to the benefits of anything other than STEM subjects, I’ll just make it clear that, while I do think that science is very important for a large number of reasons, I do accept that higher education generally is a good thing in itself, regardless of whether it’s in physics or mediaeval Latin, though I’m not sure about certain other subjects. Universities should not be judged solely by the effect they may or may not have on short-term economic growth.

Which brings me to a final point about the difference between correlation and causation. People with more disposable income probably spend more money on, e.g., books than people with less money. Buying books doesn’t make you rich, at least not in the short-term, but it’s a good thing to do for its own sake. We shouldn’t think of higher education exclusively on the cost side of the economic equation, as politicians and bureaucrats seem increasingly to be doing; it’s also one of the benefits.



Cauchy Statistics

Posted in Bad Statistics, The Universe and Stuff on June 7, 2010 by telescoper

I was attempting to restore some sort of order to my office today when I stumbled across some old jottings about the Cauchy distribution, which is perhaps more familiar to astronomers as the Lorentz distribution. I never used them in the publication they related to so I thought I’d just quickly pop the main idea on here in the hope that some amongst you might find it interesting and/or amusing.

What sparked this off is that the simplest cosmological models (including the particular one we now call the standard model) assume that the primordial density fluctuations we see imprinted in the pattern of temperature fluctuations in the cosmic microwave background and which we think gave rise to the large-scale structure of the Universe through the action of gravitational instability, were distributed according to Gaussian statistics (as predicted by the simplest versions of the inflationary universe theory).  Departures from Gaussianity would therefore, if found, yield important clues about physics beyond the standard model.

Cosmology isn’t the only place where Gaussian (normal) statistics apply. In fact they arise generically in circumstances where variation results from the linear superposition of independent influences, by virtue of the Central Limit Theorem. Noise in experimental detectors is often treated as following Gaussian statistics, for example.

The Gaussian distribution has some nice properties that make it possible to place meaningful bounds on the statistical accuracy of measurements made in the presence of Gaussian fluctuations. For example, we all know that the margin of error of the determination of the mean value of a quantity from a sample of n independent Gaussian-distributed values varies as 1/\sqrt{n}; the larger the sample, the more accurately the global mean can be known. In the cosmological context this is basically why mapping a larger volume of space can lead, for instance, to a more accurate determination of the overall mean density of matter in the Universe.

However, although the Gaussian assumption often applies it doesn’t always apply, so if we want to think about non-Gaussian effects we have to think also about how well we can do statistical inference if we don’t have Gaussianity to rely on.

That’s why I was playing around with the peculiarities of the Cauchy distribution. This comes up in a variety of real physics problems so it isn’t an artificially pathological case. Imagine you have two independent variables X and Y each of which has a Gaussian distribution with zero mean and unit variance. The ratio Z=X/Y has a probability density function of the form

p(z)=\frac{1}{\pi(1+z^2)},

which is a form of the Cauchy distribution. There’s nothing at all wrong with this as a distribution – it’s not singular anywhere and integrates to unity as a pdf should. However, it does have a peculiar property that none of its moments is finite, not even the mean value!
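For anyone who wants to check the ratio-of-Gaussians claim numerically, here is a small simulation sketch using only the Python standard library. It compares the empirical distribution of X/Y with the cumulative distribution obtained by integrating the density above, F(z) = 1/2 + arctan(z)/\pi. The sample size, the seed and the test points are arbitrary choices made for the illustration.

```python
import math
import random


def cauchy_cdf(z):
    """Cumulative distribution of the standard Cauchy: 1/2 + arctan(z)/pi."""
    return 0.5 + math.atan(z) / math.pi


rng = random.Random(1)  # fixed seed so the run is repeatable
ratios = []
while len(ratios) < 200_000:
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(0.0, 1.0)
    if y != 0.0:  # guard against an (astronomically unlikely) exact zero
        ratios.append(x / y)

# Compare the fraction of simulated ratios below z with the Cauchy CDF.
max_discrepancy = 0.0
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    empirical = sum(1 for r in ratios if r <= z) / len(ratios)
    max_discrepancy = max(max_discrepancy, abs(empirical - cauchy_cdf(z)))
    print(z, round(empirical, 3), round(cauchy_cdf(z), 3))
```

The two columns should agree to within sampling noise. In particular, half of the simulated ratios should fall between -1 and +1, since those are the quartiles of the standard Cauchy distribution.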

Following on from this property is the fact that Cauchy-distributed quantities violate the Central Limit Theorem. If we take n independent Gaussian variables then the distribution of the sum X_1+X_2+\ldots+X_n has the normal form, but this is also true (for large enough n) for the sum of n independent variables having any distribution as long as it has finite variance.

The Cauchy distribution has infinite variance so the distribution of the sum of independent Cauchy-distributed quantities Z_1+Z_2+\ldots+Z_n doesn’t tend to a Gaussian. In fact the distribution of the sum of any number of independent Cauchy variates is itself a Cauchy distribution. Moreover the distribution of the mean of a sample of size n does not depend on n for Cauchy variates. This means that making a larger sample doesn’t reduce the margin of error on the mean value!
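The contrast with the Gaussian case shows up clearly in a quick experiment. The sketch below draws many samples of size n, computes the mean of each, and measures the spread of those means. Because the Cauchy variance is infinite, the spread is measured with the interquartile range, which is a choice made for this illustration, as are the sample sizes, the number of trials and the function names.

```python
import random


def standard_cauchy(rng):
    """One Cauchy variate, generated as the ratio of two unit Gaussians."""
    while True:
        y = rng.gauss(0.0, 1.0)
        if y != 0.0:
            return rng.gauss(0.0, 1.0) / y


def standard_gaussian(rng):
    return rng.gauss(0.0, 1.0)


def interquartile_range(values):
    ordered = sorted(values)
    n = len(ordered)
    return ordered[(3 * n) // 4] - ordered[n // 4]


def spread_of_sample_mean(draw, sample_size, trials, rng):
    """Interquartile range of the sample mean over many repeated samples."""
    means = []
    for _ in range(trials):
        total = 0.0
        for _ in range(sample_size):
            total += draw(rng)
        means.append(total / sample_size)
    return interquartile_range(means)


rng = random.Random(2010)
trials = 400
gaussian_spread = {}
cauchy_spread = {}
for n in (10, 1000):
    gaussian_spread[n] = spread_of_sample_mean(standard_gaussian, n, trials, rng)
    cauchy_spread[n] = spread_of_sample_mean(standard_cauchy, n, trials, rng)
    print(n, round(gaussian_spread[n], 3), round(cauchy_spread[n], 3))
```

For the Gaussian the interquartile range of the mean should shrink roughly as 1.349/\sqrt{n}, a factor of ten between n=10 and n=1000. For the Cauchy it should stay near 2 whatever the value of n, because the mean of n standard Cauchy variates is itself a standard Cauchy variate with quartiles at -1 and +1.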

This was essentially the point I made in a previous post about the dangers of using standard statistical techniques – which usually involve the Gaussian assumption – to distributions of quantities formed as ratios.

We cosmologists should be grateful that we don’t seem to live in a Universe whose fluctuations are governed by Cauchy, rather than (nearly) Gaussian, statistics. Measuring more of the Universe wouldn’t be any use in determining its global properties as we’d always be dominated by cosmic variance.

 

Clustering in the Deep

Posted in Bad Statistics, The Universe and Stuff on May 27, 2010 by telescoper

I couldn’t resist a quick lunchtime post about the results that have come out concerning the clustering of galaxies found by the HerMES collaboration using the Herschel Telescope. There’s quite a lengthy press release accompanying the new results, and there’s not much point in repeating the details here, so I’ll just show a wonderful image showing thousands of galaxies and their far-infrared colours.

Image Credit: European Space Agency, SPIRE and HERMES consortia

According to the press release, this looks “like grains of sand”. I wonder if whoever wrote the text was deliberately referring to Genesis 22:17?

.. they shall multiply as the stars of the heaven, and as the grains of sand upon the sea shore.

However, let me take issue a little with the following excerpt from said press release:

While at a first glance the galaxies look to be scattered randomly over the image, in fact they are not. A closer look will reveal that there are regions which have more galaxies in, and regions that have fewer.

A while ago I posted an item asking what “scattered randomly” is meant to mean. It included this picture

This is what a randomly-scattered set of points actually looks like. You’ll see that it also has some regions with more galaxies in them than others. Coincidentally, I showed the same picture again this morning in one of my postgraduate lectures on statistics and a majority of the class – as I’m sure do many of you seeing it for the first time – thought it showed a clustered pattern. Whatever “randomness” means precisely, the word certainly implies some sort of variation whereas the press release implies the opposite. I think a little re-wording might be in order.
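A simple way to put numbers on this is to scatter points uniformly at random, lay a grid over them, and count how many land in each cell. The following is a toy sketch and not the analysis used for any survey: the number of points, the grid size and the seed are arbitrary, and the helper name is invented.

```python
import random


def counts_in_cells(points, grid):
    """Count the points falling in each cell of a grid x grid mesh on the unit square."""
    counts = [0] * (grid * grid)
    for x, y in points:
        counts[int(y * grid) * grid + int(x * grid)] += 1
    return counts


rng = random.Random(42)
n_points, grid = 8000, 20
points = [(rng.random(), rng.random()) for _ in range(n_points)]

counts = counts_in_cells(points, grid)
mean_count = sum(counts) / len(counts)
variance = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)
dispersion = variance / mean_count  # close to 1 for an unclustered (Poisson) pattern

print(min(counts), max(counts), round(dispersion, 2))
```

Even with no clustering at all, the emptiest and fullest cells differ substantially; for a purely random scatter the variance of the counts is roughly equal to their mean. A clustered distribution would show up as a variance-to-mean ratio significantly greater than one.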

What galaxy clustering statistics reveal is that the variation in density from place-to-place is greater than that expected in a random distribution like that shown. This has been known since the 1960s, so it’s not the result that these sources are clustered that’s so important. In fact, the preliminary clustering results from the HerMES surveys – described in a little more detail in a short paper available on the arXiv – are especially interesting because they show that some of the galaxies seen in this deep field are extremely bright (in the far-infrared), extremely distant, high-redshift objects which exhibit strong spatial correlations. The statistical form of this clustering provides very useful input for theorists trying to model the processes of galaxy formation and evolution. In particular, the brightest objects at high redshift have a propensity to appear preferentially in dense concentrations, making them even more strongly clustered than rank-and-file galaxies. This fact probably contains important information about the environmental factors responsible for driving their enormous luminosities.

The results are still preliminary, but we’re starting to see concrete evidence of the impact Herschel is going to have on extragalactic astrophysics.

General Purpose Election Blog Post

Posted in Bad Statistics, Politics on April 14, 2010 by telescoper

A dramatic new <insert name of polling organization, e.g. GALLUP> opinion poll has revealed that the <insert name of political party> lead over <insert name of political party> has WIDENED/SHRUNK/NOT CHANGED dramatically. This almost certainly means a <insert name of political party> victory or a hung parliament. This contrasts with a recent <insert name of polling organization, e.g. YOUGOV> poll which showed that the <insert name of political party> lead had WIDENED/SHRUNK/NOT CHANGED which almost certainly meant a <insert name of political party> victory or a hung parliament.

Political observers were quick to point out that we shouldn’t read too much into this poll, as tomorrow’s <insert name of polling organization> poll shows the <insert name of political party> lead over <insert name of political party> has WIDENED/SHRUNK/NOT CHANGED dramatically, almost certainly meaning a <insert name of political party> victory or a hung parliament.

(adapted, without permission, from Private Eye)

Science’s Dirtiest Secret?

Posted in Bad Statistics, The Universe and Stuff on March 19, 2010 by telescoper

My attention was drawn yesterday to an article, in a journal I never read called American Scientist, about the role of statistics in science. Since this is a theme I’ve blogged about before I had a quick look at the piece and quickly came to the conclusion that the article was excruciating drivel. However, looking at it again today, my opinion of it has changed. I still don’t think it’s very good, but it didn’t make me as cross second time around. I don’t know whether this is because I was in a particularly bad mood yesterday, or whether the piece has been edited. But although it didn’t make me want to scream, I still think it’s a poor article.

Let me start with the opening couple of paragraphs

For better or for worse, science has long been married to mathematics. Generally it has been for the better. Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.

During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

In terms of historical accuracy, the author, Tom Siegfried, gets off to a very bad start. Science didn’t get “seduced” by statistics.  As I’ve already blogged about, scientists of the calibre of Gauss and Laplace – and even Galileo – were instrumental in inventing statistics.

And what were the “modes of calculation that had long served so faithfully” anyway? Scientists have long recognized the need to understand the behaviour of experimental errors, and to incorporate the corresponding uncertainty in their analysis. Statistics isn’t a “mutant form of math”, it’s an integral part of the scientific method. It’s a perfectly sound discipline, provided you know what you’re doing…

And that’s where, despite the sloppiness of his argument,  I do have some sympathy with some of what  Siegfried says. What has happened, in my view, is that too many people use statistical methods “off the shelf” without thinking about what they’re doing. The result is that the bad use of statistics is widespread. This is particularly true in disciplines that don’t have a well developed mathematical culture, such as some elements of biosciences and medicine, although the physical sciences have their own share of horrors too.

I’ve had a run-in myself with the authors of a paper in neurobiology who based extravagant claims on an inappropriate statistical analysis.

What is wrong is therefore not the use of statistics per se, but the fact that too few people understand – or probably even think about – what they’re trying to do (other than publish papers).

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions. Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

Quite, but what does this mean for “science’s dirtiest secret”? Not that it involves statistical reasoning, but that large numbers of scientists haven’t a clue what they’re doing when they do a statistical test. And if this is the case with practising scientists, how can we possibly expect the general public to make sense of what is being said by the experts? No wonder people distrust scientists when so many results, confidently announced on the basis of totally spurious arguments, turn out to be wrong.

The problem is that the “standard” statistical methods shouldn’t be “standard”. It’s true that there are many methods that work in a wide range of situations, but simply assuming they will work in any particular one without thinking about it very carefully is a very dangerous strategy. Siegfried discusses examples where the use of “p-values” leads to incorrect results. It doesn’t surprise me that such examples can be found, as the misinterpretation of p-values is rife even in numerate disciplines, and matters get worse for those practitioners who combine p-values from different studies using meta-analysis, a method which has no mathematical motivation whatsoever and which should be banned. So indeed should a whole host of other frequentist methods which offer limitless opportunities to make a complete botch of the data arising from a research project.

Siegfried goes on

Nobody contends that all of science is wrong, or that it hasn’t compiled an impressive array of truths about the natural world. Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical.

Any single scientific study done alone is quite likely to be incorrect. Really? Well, yes, if it is done incorrectly. But the point is not that they are incorrect because they use statistics, but that they are incorrect because they are done incorrectly. Many scientists don’t even understand the statistics well enough to realise that what they’re doing is wrong.

If I had my way, scientific publications – especially in disciplines that impact directly on everyday life, such as medicine – should adopt a much more rigorous policy on statistical analysis and on the way statistical significance is reported. I favour the setting up of independent panels whose responsibility is to do the statistical data analysis on behalf of those scientists who can’t be trusted to do it correctly themselves.

Having started badly, and lost its way in the middle, the article ends disappointingly too. Having led us through a wilderness of failed frequentist analyses, he finally arrives at a discussion of the superior Bayesian methodology, in irritatingly half-hearted fashion.

But Bayesian methods introduce a confusion into the actual meaning of the mathematical concept of “probability” in the real world. Standard or “frequentist” statistics treat probabilities as objective realities; Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation. That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics….

Conflict between frequentists and Bayesians has been ongoing for two centuries. So science’s marriage to mathematics seems to entail some irreconcilable differences. Whether the future holds a fruitful reconciliation or an ugly separation may depend on forging a shared understanding of probability.

The difficulty with this piece as a whole is that it reads as an anti-science polemic: “Some science results are based on bad statistics, therefore statistics is bad and science that uses statistics is bogus.” I don’t know whether that’s what the author intended, or whether it was just badly written.

I’d say the true state of affairs is different. A lot of bad science is published, and a lot of that science is bad because it uses statistical reasoning badly. You wouldn’t, however, argue that a screwdriver is no use because some idiot tries to hammer a nail in with one.

Only a bad craftsman blames his tools.

The Seven Year Itch

Posted in Bad Statistics, Cosmic Anomalies, The Universe and Stuff on January 27, 2010 by telescoper

I was just thinking last night that it’s been a while since I posted anything in the file marked cosmic anomalies, and this morning I woke up to find a blizzard of papers on the arXiv from the Wilkinson Microwave Anisotropy Probe (WMAP) team. These relate to an analysis of the latest data accumulated now over seven years of operation; a full list of the papers is given here.

I haven’t had time to read all of them yet, but I thought it was worth drawing attention to the particular one that relates to the issue of cosmic anomalies. I’ve taken the liberty of including the abstract here:

A simple six-parameter LCDM model provides a successful fit to WMAP data, both when the data are analyzed alone and in combination with other cosmological data. Even so, it is appropriate to search for any hints of deviations from the now standard model of cosmology, which includes inflation, dark energy, dark matter, baryons, and neutrinos. The cosmological community has subjected the WMAP data to extensive and varied analyses. While there is widespread agreement as to the overall success of the six-parameter LCDM model, various “anomalies” have been reported relative to that model. In this paper we examine potential anomalies and present analyses and assessments of their significance. In most cases we find that claimed anomalies depend on posterior selection of some aspect or subset of the data. Compared with sky simulations based on the best fit model, one can select for low probability features of the WMAP data. Low probability features are expected, but it is not usually straightforward to determine whether any particular low probability feature is the result of the a posteriori selection or of non-standard cosmology. We examine in detail the properties of the power spectrum with respect to the LCDM model. We examine several potential or previously claimed anomalies in the sky maps and power spectra, including cold spots, low quadrupole power, quadrupole-octupole alignment, hemispherical or dipole power asymmetry, and quadrupole power asymmetry. We conclude that there is no compelling evidence for deviations from the LCDM model, which is generally an acceptable statistical fit to WMAP and other cosmological data.

Since I’m one of those annoying people who have been sniffing around the WMAP data for signs of departures from the standard model, I thought I’d comment on this issue.

As the abstract says, the  LCDM model does indeed provide a good fit to the data, and the fact that it does so with only 6 free parameters is particularly impressive. On the other hand, this modelling process involves the compression of an enormous amount of data into just six numbers. If we always filter everything through the standard model analysis pipeline then it is possible that some vital information about departures from this framework might be lost. My point has always been that every now and again it is worth looking in the wastebasket to see if there’s any evidence that something interesting might have been discarded.

Various potential anomalies – mentioned in the above abstract – have been identified in this way, but usually there has turned out to be less to them than meets the eye. There are two reasons not to get too carried away.

The first reason is that no experiment – not even one as brilliant as WMAP – is entirely free from systematic artefacts. Before we get too excited and start abandoning our standard model for more exotic cosmologies, we need to be absolutely sure that we’re not just seeing residual foregrounds, instrument errors, beam asymmetries or some other effect that isn’t anything to do with cosmology. Because it has performed so well, WMAP has been able to do much more science than was originally envisaged, but every experiment is ultimately limited by its own systematics and WMAP is no different. There is some (circumstantial) evidence that some of the reported anomalies may be at least partly accounted for by  glitches of this sort.

The second point relates to basic statistical theory. Generally speaking, an anomaly A (some property of the data) is flagged as such because it is deemed to be improbable given a model M (in this case the LCDM). In other words the conditional probability P(A|M) is a small number. As I’ve repeatedly ranted about in my bad statistics posts, this does not necessarily mean that P(M|A) – the probability of the model being right – is small. If you look at 1000 different properties of the data, you have a good chance of finding something that happens with a probability of 1 in a thousand. This is what the abstract means by a posteriori reasoning: it’s not the same as talking out of your posterior, but is sometimes close to it.
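To put a rough number on that “good chance”, here is a minimal Python sketch. It assumes the 1000 properties are statistically independent, which real sky-map statistics generally are not, so treat the result as an illustration only; the function name is made up for this example.

```python
def chance_of_at_least_one(p_single: float, n_checks: int) -> float:
    """Probability that at least one of n_checks independent tests
    turns up a feature that has probability p_single under the model."""
    # Complement of "none of the checks shows the feature".
    return 1.0 - (1.0 - p_single) ** n_checks


# 1000 properties examined, each "anomalous" at the 1-in-1000 level
# (independence is an assumption of this sketch).
prob = chance_of_at_least_one(1e-3, 1000)
print(f"{prob:.3f}")  # prints 0.632
```

So under these assumptions a “one in a thousand” feature turns up somewhere in the data nearly two times out of three, even when the model is exactly right.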

In order to decide how seriously to take an anomaly, you need to work out P(M|A), the probability of the model given the anomaly, which requires that you not only take into account all the other properties of the data that are explained by the model (i.e. those that aren’t anomalous), but also specify an alternative model that explains the anomaly better than the standard model. If you do this, without introducing too many free parameters, then this may be taken as compelling evidence for an alternative model. No such model exists – at least for the time being – so the message of the paper is rightly skeptical.
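For readers who like to see it written down, the relation between the two conditional probabilities is just Bayes’ theorem. The version below is a sketch that assumes only two candidate models, the standard one M and a single hypothetical alternative M′ (the symbol M′ and the prior probabilities P(M) and P(M′) are introduced here for illustration):

```latex
P(M \mid A) = \frac{P(A \mid M)\,P(M)}{P(A \mid M)\,P(M) + P(A \mid M')\,P(M')}
```

A small P(A|M) only drags P(M|A) down if the term P(A|M′)P(M′) is comparatively large, i.e. if there is an alternative that is both plausible beforehand and makes the anomaly much more likely.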

So, to summarize, I think what the WMAP team say is basically sensible, although I maintain that rummaging around in the trash is a good thing to do. Models are there to be tested and surely the best way to test them is to focus on things that look odd rather than simply congratulating oneself about the things that fit? It is extremely impressive that such intense scrutiny over the last seven years has revealed so few oddities, but that just means that we should look even harder.

Before too long, data from Planck will provide an even sterner test of the standard framework. We really do need an independent experiment to see whether there is something out there that WMAP might have missed. But we’ll have to wait a few years for that.

So far it’s WMAP 7 Planck 0, but there’s plenty of time for an upset. Unless they close us all down.

The League of Small Samples

Posted in Bad Statistics on January 14, 2010 by telescoper

This morning I was just thinking that it’s been a while since I’ve filed anything in the category marked bad statistics when I glanced at today’s copy of the Times Higher and found something that’s given me an excuse to rectify my lapse. Today saw the publication of said organ’s new Student Experience Survey which ranks  British Universities in order of the responses given by students to questions about various aspects of the teaching, social life and so  on. Here are the main results, sorted in decreasing order:

Rank University Score Respondents
1 Loughborough University 84.9 128
2 University of Cambridge, The 82.6 259
3 University of Oxford, The 82.6 197
4 University of Sheffield, The 82.3 196
5 University of East Anglia, The 82.1 122
6 University of Wales, Aberystwyth 82.1 97
7 University of Leeds, The 81.9 185
8 University of Dundee, The 80.8 75
9 University of Southampton, The 80.6 164
10 University of Glasgow, The 80.6 136
11 University of Exeter, The 80.3 160
12 University of Durham 80.3 189
13 University of Leicester, The 79.9 151
14 University of St Andrews, The 79.9 104
15 University of Essex, The 79.5 65
16 University of Warwick, The 79.5 190
17 Cardiff University 79.4 180
18 University of Central Lancashire, The 79.3 88
19 University of Nottingham, The 79.2 233
20 University of Newcastle-upon-Tyne, The 78.9 145
21 University of Bath, The 78.7 142
22 University of Wales, Bangor 78.7 43
23 University of Edinburgh, The 78.1 190
24 University of Birmingham, The 78.0 179
25 University of Surrey, The 77.8 100
26 University of Sussex, The 77.6 49
27 University of Lancaster, The 77.6 123
28 University of Stirling, The 77.6 44
29 University of Wales, Swansea 77.5 61
30 University of Kent at Canterbury, The 77.3 116
30 University of Teesside, The 77.3 127
32 University of Hull, The 77.2 87
33 Robert Gordon University, The 77.2 57
34 University of Lincoln, The 77.0 121
35 Nottingham Trent University, The 76.9 192
36 University College Falmouth 76.8 40
37 University of Gloucestershire 76.8 74
38 University of Liverpool, The 76.7 89
39 University of Keele, The 76.5 57
40 University of Northumbria at Newcastle, The 76.4 149
41 University of Plymouth, The 76.3 190
41 University of Reading, The 76.3 117
43 Queen’s University of Belfast, The 76.0 149
44 University of Aberdeen, The 75.9 84
45 University of Strathclyde, The 75.7 72
46 Staffordshire University 75.6 85
47 University of York, The 75.6 121
48 St George’s Medical School 75.4 33
49 Southampton Solent University 75.2 34
50 University of Portsmouth, The 75.2 141
51 Queen Mary, University of London 75.2 104
52 University of Manchester 75.1 221
53 Aston University 75.0 66
54 University of Derby 75.0 33
55 University College London 74.8 114
56 Sheffield Hallam University 74.8 159
57 Glasgow Caledonian University 74.6 72
58 King’s College London 74.6 101
59 Brunel University 74.4 64
60 Heriot-Watt University 74.1 35
61 Imperial College of Science, Technology & Medicine 73.9 111
62 De Montfort University 73.6 83
63 Bath Spa University 73.4 64
64 Bournemouth University 73.3 128
65 University of the West of England, Bristol 73.3 207
66 Leeds Metropolitan University 73.1 143
67 University of Chester 72.5 61
68 University of Bristol, The 72.3 145
69 Royal Holloway, University of London 72.1 59
70 Canterbury Christ Church University 71.8 78
71 University of Huddersfield, The 71.8 97
72 York St John University College 71.8 31
72 University of Wales Institute, Cardiff 71.8 41
74 University of Glamorgan 71.6 84
75 University of Salford, The 71.2 58
76 Roehampton University 71.1 47
77 Manchester Metropolitan University, The 71.1 131
78 University of Northampton 70.8 42
79 University of Sunderland, The 70.8 61
80 Kingston University 70.7 121
81 University of Bradford, The 70.6 33
82 Oxford Brookes University 70.5 99
83 University of Ulster 70.3 61
84 Coventry University 69.9 82
85 University of Brighton, The 69.4 106
86 University of Hertfordshire 68.9 138
87 University of Bedfordshire 68.6 44
88 Queen Margaret University, Edinburgh 68.5 35
89 London School of Economics and Political Science 68.4 73
90 Royal Veterinary College, The 68.2 43
91 Anglia Ruskin University 68.1 71
92 Birmingham City University 67.7 109
93 University of Wolverhampton, The 67.5 72
94 Liverpool John Moores University 67.2 103
95 Goldsmiths College 66.9 42
96 Napier University 65.5 63
97 London South Bank University 64.9 44
98 City University 64.6 44
99 University of Greenwich, The 63.9 67
100 University of the Arts London 62.8 40
101 Middlesex University 61.4 51
102 University of Westminster, The 60.4 76
103 London Metropolitan University 55.2 37
104 University of East London, The 54.2 41
Total respondents 10465

The maximum overall score is 100 and the figure in the rightmost column is the number of students from that particular University that contributed to the survey. The total number of students involved is shown at the bottom, i.e. 10465.

My current employer, Cardiff University, comes out pretty well (17th) in this league table, but some do surprisingly poorly, such as Imperial, which is 61st. No doubt University spin doctors around the country will be working themselves into a frenzy trying to work out how best to present their showing in the list, but before they get too carried away I want to dampen their enthusiasm.

Let’s take Cardiff as an example. The number of students whose responses produced the score of 79.4 was just 180. That’s by no means the smallest sample in the survey, either. Cardiff University has approximately 20,000 undergraduates. The score in this table is therefore obtained from less than 1% of the relevant student population. How representative can the results be, given that the sample is so incredibly small?

What is conspicuous by its absence from this table is any measure of the “margin-of-error” of the estimated score. What I mean by this is how much the sample score would change for Cardiff if a different set of 180 students were involved. Unless every Cardiff student gives Cardiff exactly 79.4 then the score will vary from sample to sample. The smaller the sample, the larger the resulting uncertainty.

Given a survey of this type it should be quite straightforward to calculate the spread of scores from student to student within a sample from a given University in terms of the standard deviation, σ, as well as the mean score. Unfortunately, this survey does not include this information. However, let’s suppose for the sake of argument that the standard deviation for Cardiff is quite small, say 10% of the mean value, i.e. 7.94. I imagine that it’s much larger than that, in fact, but this is just meant to be by way of an illustration.

If you have a sample size of  N then the standard error of the mean is going to be roughly (σ⁄√N) which, for Cardiff, is about 0.6. Assuming everything has a normal distribution, this would mean that the “true” score for the full population of Cardiff students has a 95% chance of being within two standard errors of the mean, i.e. between 78.2 and 80.6. This means Cardiff could really be as high as 9th place or as low as 23rd, and that’s making very conservative assumptions about how much one student differs from another within each institution.
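The arithmetic is easy to check. The Python sketch below uses the Cardiff figures from the table and the assumed standard deviation of 10% of the mean; that value of σ is the illustrative guess made above, not something the survey reports.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean for n independent responses with spread sigma."""
    return sigma / math.sqrt(n)


mean_score = 79.4          # Cardiff's score in the table
n_students = 180           # number of Cardiff respondents
sigma = 0.1 * mean_score   # ASSUMED spread: 10% of the mean, i.e. 7.94

se = standard_error(sigma, n_students)
low, high = mean_score - 2 * se, mean_score + 2 * se

print(f"standard error = {se:.2f}")                    # standard error = 0.59
print(f"approx. 95% range = {low:.1f} to {high:.1f}")  # approx. 95% range = 78.2 to 80.6
```

Swapping in the respondent count of one of the smaller samples in the table, or a larger and probably more realistic σ, widens the range accordingly.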

That example is just for illustration, and the figures may well be wrong, but my main gripe is that I don’t understand how these guys can get away with publishing results like this without listing the margin of error at all. Perhaps it’s because that would make it obvious how unreliable the rankings are? Whatever the reason, we’d never get away with publishing results without errors in a serious scientific journal.

Still, at least there’s been one improvement since last year: the 2009 results gave every score to two decimal places! My A-level physics teacher would have torn strips off me if I’d done that!

Precision, you see, is not the same as accuracy….

Dark Squib

Posted in Bad Statistics, Science Politics, The Universe and Stuff on December 19, 2009 by telescoper

After today’s lengthy pre-Christmas traipse around Cardiff in the freezing cold, I don’t think I can summon up the energy for a lengthy post today. However, today’s cryogenic temperatures did manage to remind me that I hadn’t closed the book on a previous story about rumours of a laboratory detection of dark matter by the experiment known as CDMS. The main rumour – that there was going to be a paper in Nature reporting the definite detection of dark matter particles – turned out to be false, but there was a bit of truth after all, in that they did put out a paper yesterday (18th December, the date that the original rumour suggested their paper would come out).  There’s also an executive summary of the results here.

It turns out that the experiment has seen two events that might, just might, be the Weakly Interacting Massive Particles (WIMPs) that are most theorists’ favoured candidate for cold dark matter. However, they might also be due to background events generated by other stray particles getting into the works. It’s impossible to tell at this stage whether the signal is real or not. Based on the sort of naive frequentist statistical treatment of the data that for some reason is what particle physicists seem to prefer, there’s a 23% chance of their signal being background rather than dark matter. In other words, it’s about a one-sigma detection. In fact, if you factor in the possibility of a systematic error in the background counts – these are very difficult things to calibrate precisely – then the significance of the result decreases even further. And if you do it all properly, in a Bayesian way with an appropriate prior, then the most probable result is no detection. Andrew Jaffe gives some details on his blog.

There is no universally accepted criterion for what constitutes a definite detection, but I’ve been told recently by the editor of Nature himself that if it’s less than 3-sigma (a probability of about 0.3% of it arising by chance) then they’re unlikely to publish it. If it’s 2-sigma (5%) then it’s interesting, but not conclusive, but at 1-sigma it’s not worth writing home about, never mind writing a press release.
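The rough correspondence between “sigma” levels and chance probabilities can be computed in a couple of lines of Python. This sketch assumes Gaussian errors and the two-sided convention (a fluctuation of at least that size in either direction); other conventions, such as one-sided tails, give somewhat different numbers.

```python
import math


def two_sided_tail(n_sigma: float) -> float:
    """Probability that a Gaussian variable lies at least n_sigma
    standard deviations from its mean, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"{k}-sigma: {100 * two_sided_tail(k):.2f}%")
# 1-sigma: 31.73%
# 2-sigma: 4.55%
# 3-sigma: 0.27%
```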

I should  add that none of their results has yet been subject to peer review either. I can only guess that CDMS must be undergoing a funding review pretty soon and wanted to use the media to show it was producing the goods. I can’t say I’m impressed with these antics, and I doubt if the reviewers will be either.

Unfortunately, the fact that this is all so inconclusive from a scientific point of view hasn’t stopped various organs getting hold of the wrong end of the stick and starting to beat about the bush with it. New Scientist’s Twitter feed screamed

Clear signal of dark matter detected in Minnesota!

although the article itself was a bit better informed. The Guardian ran a particularly poor story,  impressive only in the way it crammed so many misconceptions into such a short piece.

This episode takes me back to a theme I’ve touched on many times on this blog, which is that scientific results are very rarely black-and-white and they have to be treated carefully in appropriate probabilistic terms. Unfortunately, the media and the public have a great deal of difficulty understanding the subtleties of this and what gets across in the public domain can be either garbled or downright misleading. Most often in science the correct answer isn’t “true” or “false” but somewhere in between.

Of course, with more measurements, better statistics and stronger control of systematics this CDMS result may well turn into a significant detection. If it does then it will be a great scientific breakthrough and they’ll have my congratulations straight away, tempered with a certain amount of sadness that there will be no UK competitors in the race owing to our recent savage funding cuts. But we’re not there yet. So far, it’s just a definite maybe.

The Monkey Complex

Posted in Bad Statistics, The Universe and Stuff on November 15, 2009 by telescoper

There’s an old story that if you leave a set of monkeys hammering on typewriters for a sufficiently long time then they will eventually reproduce the entire text of Shakespeare’s play Hamlet. It comes up in a variety of contexts, but the particular generalisation of this parable in cosmology is to argue that if we live in an enormously big universe (or “multiverse”), in which the laws of nature (as specified by the relevant fundamental constants) vary “sort of randomly” from place to place, then there will be a domain in which they have the right properties for life to evolve. This is one way of explaining away the apparent fine-tuning of the laws of physics: they’re not finely tuned, but we just live in a place where they allowed us to evolve. Although it may seem an easy step from monkeys to the multiverse, it always seemed to me a very shaky one.

For a start, let’s go back to the monkeys. The supposition that given an infinite time the monkeys must produce everything that’s possible in a finite sequence is not necessarily true, even if one does allow an infinite time. It depends on how they type. If the monkeys were always to hit two adjoining keys at the same time then they would never produce a script for Hamlet, no matter how long they typed for, as the combinations QW or ZX do not appear anywhere in that play. To guarantee what we need, their typing has to be ergodic, a very specific requirement not possessed by all “random” sequences.

A more fundamental problem is what is meant by randomness in the first place. I’ve actually commented on this before, in a post that still seems to be collecting readers so I thought I’d develop one or two of the ideas a little.

 It is surprisingly easy to generate perfectly deterministic mathematical sequences that behave in the way we usually take to characterize indeterministic processes. As a very simple example, consider the following “iteration” scheme:

X_{j+1} = 2 X_{j} mod(1)

If you are not familiar with the notation, the term mod(1) just means “drop the integer part”.  To illustrate how this works, let us start with a (positive) number, say 0.37. To calculate the next value I double it (getting 0.74) and drop the integer part. Well, 0.74 does not have an integer part so that’s fine. This value (0.74) becomes my first iterate. The next one is obtained by putting 0.74 in the formula, i.e. doubling it (1.48) and dropping  the integer part: result 0.48. Next one is 0.96, and so on. You can carry on this process as long as you like, using each output number as the input state for the following step of the iteration.

Now to simplify things a little bit, notice that, because we drop the integer part each time, all iterates must lie in the range between 0 and 1. Suppose I divide this range into two bins, labelled “heads” for X less than ½ and “tails” for X greater than or equal to ½. In my example above the first value of X is 0.37 which is “heads”. Next is 0.74 (tails); then 0.48 (heads), 0.96 (tails), and so on.

This sequence now mimics quite accurately the tossing of a fair coin. It produces a pattern of heads and tails with roughly 50% frequency in a long run. It is also difficult to predict the next term in the series given only the classification as “heads” or “tails”.

However, given the seed number which starts off the process, and of course the algorithm, one could reproduce the entire sequence. It is not random, but in some respects  looks like it is.
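Here is a minimal Python sketch of the scheme, starting from the same seed of 0.37. The function names are made up for this illustration. It uses exact fractions rather than ordinary floating-point numbers as a deliberate choice: in binary floating point each doubling simply shifts one bit out of the number, so the iteration collapses to exactly zero after fifty-odd steps. The price of exact arithmetic is that a rational seed like 37/100 eventually falls into a repeating cycle, so this is a demonstration of the mechanics, not a practical random number generator.

```python
from fractions import Fraction


def doubling_map(seed: Fraction, n_steps: int) -> list:
    """Return the seed followed by n_steps iterates of X -> 2X mod(1)."""
    values = [seed]
    for _ in range(n_steps):
        # Double the previous value and drop the integer part.
        values.append((2 * values[-1]) % 1)
    return values


def coin_face(x: Fraction) -> str:
    """'heads' for X below one half, 'tails' otherwise."""
    return "heads" if x < Fraction(1, 2) else "tails"


sequence = doubling_map(Fraction(37, 100), 3)
for x in sequence:
    print(float(x), coin_face(x))
# 0.37 heads
# 0.74 tails
# 0.48 heads
# 0.96 tails
```

Running it twice with the same seed gives exactly the same string of heads and tails, which is the whole point: the sequence is completely determined by the seed and the rule.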

One can think of “heads” or “tails” in more general terms, as indicating the “0” or “1” states in the binary representation of a number. This method can therefore be used to generate any sequence of digits. In fact algorithms like this one are used in computers for generating what are called pseudorandom numbers. They are not precisely random because computers can only do arithmetic to a finite number of decimal places. This means that only a finite number of possible sequences can be computed, so some repetition is inevitable, but these limitations are not always important in practice.

The ability to generate  random numbers accurately and rapidly in a computer has led to an entirely new way of doing science. Instead of doing real experiments with measuring equipment and the inevitable errors, one can now do numerical experiments with pseudorandom numbers in order to investigate how an experiment might work if we could do it. If we think we know what the result would be, and what kind of noise might arise, we can do a random simulation to discover the likelihood of success with a particular measurement strategy. This is called the “Monte Carlo” approach, and it is extraordinarily powerful. Observational astronomers and particle physicists use it a great deal in order to plan complex observing programmes and convince the powers that be that their proposal is sufficiently feasible to be allocated time on expensive facilities. In the end there is no substitute for real experiments, but in the meantime the Monte Carlo method can help avoid wasting time on flawed projects:

…in real life mistakes are likely to be irrevocable. Computer simulation, however, makes it economically practical to make mistakes on purpose.

(John McLeod and John Osborne, in Natural Automata and Useful Simulations).
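To give a flavour of what such a numerical experiment looks like, here is a toy Monte Carlo sketch in Python. Everything in it is an assumption made up for illustration: a hypothetical experiment averages a number of noisy readings of a constant signal, the noise is Gaussian, and a “detection” is claimed when the average exceeds three standard errors. The simulation estimates how often that strategy would succeed.

```python
import random


def detection_rate(signal, noise_sigma, n_readings,
                   threshold_sigma=3.0, n_trials=1000, seed=1):
    """Fraction of simulated experiments in which the mean of n_readings
    noisy readings exceeds threshold_sigma standard errors."""
    rng = random.Random(seed)           # fixed seed so runs are repeatable
    std_err = noise_sigma / n_readings ** 0.5
    successes = 0
    for _ in range(n_trials):
        # One simulated experiment: true signal plus Gaussian noise.
        readings = [rng.gauss(signal, noise_sigma) for _ in range(n_readings)]
        mean = sum(readings) / n_readings
        if mean > threshold_sigma * std_err:
            successes += 1
    return successes / n_trials


# Hypothetical numbers: a weak signal (0.2) buried in unit-variance noise.
for n in (10, 100, 400):
    print(n, detection_rate(signal=0.2, noise_sigma=1.0, n_readings=n))
```

With these made-up numbers the estimated success rate climbs steeply as the number of readings increases, which is exactly the kind of information one needs before asking for time on an expensive facility.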

So is there a way to tell whether a set of numbers is really random? Consider the following sequence:

1415926535897932384626433832795028841971

Is this a random string of numbers? There doesn’t seem to be a discernible pattern, and each possible digit seems to occur with roughly the same frequency. It doesn’t look like anyone’s phone number or bank account. Is that enough to make you think it is random?

Actually this is not at all random. If I had started it with a three and a decimal point you might have cottoned on straight away. “3.1415926..” are the first few digits in the decimal representation of π. The full representation goes on forever without repeating. This is a sequence that satisfies most naïve definitions of randomness. It does, however, provide something of a hint as to how we might construct an operational definition, i.e. one that we can apply in practice to a finite set of numbers.

The key idea originates from the Russian mathematician Andrei Kolmogorov, who wrote the first truly rigorous mathematical work on probability theory in 1933. Kolmogorov’s approach was considerably ahead of its time, because it used many concepts that belong to the era of computers. In essence, what he did was to provide a definition of the complexity of an N-digit sequence in terms of the smallest amount of computer memory it would take to store a program capable of generating the sequence. Obviously one can always store the sequence itself, which means that there is always a program that occupies about as many bytes of memory as the sequence itself, but some numbers can be generated by codes much shorter than the numbers themselves. For example the sequence

111111111111111111111111111111111111

can be generated by the instruction to “print 1 35 times”, which can be stored in much less memory than the original string of digits. Such a sequence is therefore said to be algorithmically compressible.

There are many ways of calculating the digits of π numerically also, so although it may look superficially like a random string it is most definitely not random. It is algorithmically compressible.

I’m not sure how compressible Hamlet is, but it’s certainly not entirely random. When I studied it at school I certainly wished it were a little shorter…

The complexity of a sequence can be defined to be the length of the shortest program capable of generating it. If no algorithm can be found that compresses the sequence into a program shorter than itself then it is maximally complex and can suitably be defined as random. This is a very elegant description, and has good intuitive appeal.  
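The shortest possible program cannot be found by any general procedure, but an off-the-shelf compressor gives a crude upper bound on it. The Python sketch below uses the standard zlib module as a stand-in for “the cleverest possible algorithm”, which it certainly is not, to compare a highly repetitive string with a string of noisy-looking bytes.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude upper
    bound on its complexity, not the true shortest description)."""
    return len(zlib.compress(data, 9))


repetitive = b"1" * 1000                      # "print 1 a thousand times"
rng = random.Random(0)                        # seeded pseudorandom generator
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

print(len(repetitive), compressed_size(repetitive))
print(len(noisy), compressed_size(noisy))
```

The repetitive string shrinks to a few tens of bytes at most, while the noisy one does not shrink at all. Note the irony, though: the “noisy” bytes were themselves produced by a short deterministic program and a seed, so they are highly compressible in Kolmogorov’s sense even though zlib cannot see it.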

However, this still does not provide us with a way of testing rigorously whether a given finite sequence has been produced “randomly” or not.

If an algorithmic compression can be found then that means we declare the given sequence not to be  random. However we can never be sure if the next term in the sequence would fit with what our algorithm would predict. We have to argue, inferentially, that if we have fit a long sequence with a simple algorithm then it is improbable that the sequence was generated randomly.

On the other hand, if we fail to find a suitable compression that doesn’t mean it is random either. It may just mean we didn’t look hard enough or weren’t clever enough.

Human brains are good at finding patterns. When we can’t see one we usually take the easy way out and declare that none exists. We often model a complicated system as a random process because it is  too difficult to predict its behaviour accurately even if we know the relevant laws and have  powerful computers at our disposal. That’s a very reasonable thing to do when there is no practical alternative. 

It’s quite another matter, however,  to embrace randomness as a first principle to avoid looking for an explanation in the first place. For one thing, it’s lazy, taking the easy way out like that. And for another it’s a bit arrogant. Just because we can’t find an explanation within the framework of our current theories doesn’t mean more intelligent creatures than us won’t do so. We’re only monkeys, after all.