Archive for the Bad Statistics Category

What Counts as Productivity?

Posted in Bad Statistics, Science Politics, The Universe and Stuff on March 18, 2011 by telescoper

Apparently last year the United Kingdom Infra-Red Telescope (UKIRT) beat its own personal best for scientific productivity. In fact here’s a  graphic showing the number of publications resulting from UKIRT to make the point:

The plot also demonstrates that a large part of the recent burst of productivity has been associated with UKIDSS (the UKIRT Infrared Deep Sky Survey), in which a number of my colleagues are involved. Excellent chaps. Great project. Lots of hard work done very well. Take a bow, the UKIDSS team!

Now I hope I’ve made it clear that  I don’t in any way want to pour cold water on the achievements of UKIRT, and particularly not UKIDSS, but this does provide an example of how difficult it is to use bibliometric information in a meaningful way.

Take the UKIDSS papers used in the plot above. There are 226 of these listed by Steve Warren at Imperial College. But what is a “UKIDSS paper”? Steve states the criteria he adopted:

A paper is listed as a UKIDSS paper if it is already published in a journal (with one exception) and satisfies one of the following criteria:

1. It is one of the core papers describing the survey (e.g. calibration, archive, data releases). The DR2 paper is included, and is the only paper listed not published in a journal.
2. It includes science results that are derived in whole or in part from UKIDSS data directly accessed from the archive (analysis of data published in another paper does not count).
3. It contains science results from primary follow-up observations in a programme that is identifiable as a UKIDSS programme (e.g. The physical properties of four ~600K T dwarfs, presenting Spitzer spectra of cool brown dwarfs discovered with UKIDSS).
4. It includes a feasibility study of science that could be achieved using UKIDSS data (e.g. The possibility of detection of ultracool dwarfs with the UKIRT Infrared Deep Sky Survey by Deacon and Hambly).

Papers are identified by a full-text search for the string ‘UKIDSS’, and then compared against the above criteria.

That all seems to me to be quite reasonable, and it’s certainly one way of defining what a UKIDSS paper is. According to that measure, UKIDSS scores 226.

The Warren measure does, however, include a number of papers that don’t directly use UKIDSS data, and many written by people who aren’t members of the UKIDSS consortium. Being picky you might say that such papers aren’t really original UKIDSS papers, but are more like second-generation spin-offs. So how could you count UKIDSS papers differently?

I just tried one alternative way, which is to use ADS to identify all refereed papers with “UKIDSS” in the title, assuming – possibly incorrectly – that all papers written by the UKIDSS consortium would have UKIDSS in the title. The number returned by this search was 38.

Now I’m not saying that this is more reasonable than the Warren measure. It’s just different, that’s all. According to my criterion, however, UKIDSS scores 38 rather than 226. It sounds less impressive (if only because 38 is a smaller number than 226), but what does it mean about UKIDSS productivity in absolute terms?

Not very much, I think is the answer.

Yet another way you might try to judge UKIDSS using bibliometric means is to look at its citation impact. After all, any fool can churn out dozens of papers that no-one ever reads. I know that for a fact. I am that fool.

But citation data also provide another way of doing what Steve Warren was trying to measure. Presumably the authors of any paper that uses UKIDSS data in any significant way would cite the main UKIDSS survey paper led by Andy Lawrence (Lawrence et al. 2007). According to ADS, the number of times this has been cited since publication is 359. That’s higher than the Warren measure (226), and much higher than the UKIDSS-in-the-title measure (38).

So there we are, three different measures, all in my opinion perfectly reasonable measures of, er,  something or other, but each giving a very different numerical value. I am not saying any  is misleading or that any is necessarily better than the others. My point is simply that it’s not easy to assign a numerical value to something that’s intrinsically difficult to define.

Unfortunately, it’s a point few people in government seem to be prepared to acknowledge.

Andy Lawrence is 57.



A Census of the Ridiculous

Posted in Bad Statistics, History on March 12, 2011 by telescoper

My form for the 2011 Census arrived yesterday. Apparently they were all posted out on Monday, so that’s 5 days in the post. Par for the course for the Royal Mail these days. I’m slightly surprised it arrived at all.

There’s a hefty £1000 fine for not completing the Census, so I suppose I’ll fill it in, despite my feeling that it’s both intrusive and unnecessary. What’s worse is that several of the questions are so badly designed that the information resulting will be useless.

For example, according to the census guide:

Very careful consideration is given to the questions included in the census. Questions must meet the needs of a substantial number of users in order that the census is acceptable to the public and yields good quality data. The questions are selected following several rounds of consultation with:

  • central and local government
  • academia
  • health authorities
  • the business community
  • groups representing ethnic minorities and others with special interests and concerns

Hang on. The “business community”? Why should they be consulted? What do they want with my personal information? I thought the census data was for planning public services!  On the other hand, when everything is privatised maybe all our personal data will be flogged off to the private sector anyway.

The 2011 Census is the first one to include a question on health. According to the saturation advertising about the census, this question will help plan new hospitals and distribute NHS funding. So what is the new question, the answer to which will provide such valuable data? Here it is, together with the possible responses:

13. How is your health in general?

  • Very good
  • Good
  • Fair
  • Bad
  • Very Bad

And that’s it for “health”. Does anyone actually believe such a vague question is  going to be of any use at all in planning NHS services? I certainly don’t.

And then there’s the famous question about religion.

20. What is your religion?

For a start I don’t think my religion or lack of it should be any concern to the government. To be fair, however, this question is marked as “voluntary” so respondents are allowed not to answer it without getting locked up in the Tower of London. But in any case it’s a leading question and should never have been included in the census in this form anyway. “Do you have a religion and, if so, what is it?” would have been much better.

I could go on, but I’ve got better things to do today.

I’ll just say this last thing about the Census. Most of it clearly has nothing whatever to do with planning public services. In fact the government already holds most of the information about your private circumstances that the form demands. The Census is nothing more than an opportunity for the government to cross-check tax, benefit or other records in the hope of finding inaccuracies. In other words, Big Brother is watching you.

And the cost of all this snooping? A whopping £500 million, more than double the cost of the 2001 Census, and all of it  at a time of huge cuts to public services. You have to laugh, don’t you?



Bayes’ Razor

Posted in Bad Statistics, The Universe and Stuff on February 19, 2011 by telescoper

It’s been quite a while since I posted a little piece about Bayesian probability. That one and the others that followed it (here and here) proved to be surprisingly popular so I’ve been planning to add a few more posts whenever I could find the time. Today I find myself in the office after spending the morning helping out with a very busy UCAS visit day, and it’s raining, so I thought I’d take the opportunity to write something before going home. I think I’ll do a short introduction to a topic I want to give a more technical treatment of in due course.

A particularly important feature of Bayesian reasoning is that it gives precise motivation to things that we are generally taught as rules of thumb. The most important of these is Ockham’s Razor. This famous principle of intellectual economy is variously presented in Latin as Pluralitas non est ponenda sine necessitate or Entia non sunt multiplicanda praeter necessitatem. Either way, it means basically the same thing: the simplest theory which fits the data should be preferred.

William of Ockham, to whom this dictum is attributed, was an English Scholastic philosopher (probably) born at Ockham in Surrey in 1280. He joined the Franciscan order around 1300 and ended up studying theology in Oxford. He seems to have been an outspoken character, and was in fact summoned to Avignon in 1323 to account for his alleged heresies in front of the Pope, and was subsequently confined to a monastery from 1324 to 1328. He died in 1349.

In the framework of Bayesian inductive inference, it is possible to give precise reasons for adopting Ockham’s razor. To take a simple example, suppose we want to fit a curve to some data. In the presence of noise (or experimental error) which is inevitable, there is bound to be some sort of trade-off between goodness-of-fit and simplicity. If there is a lot of noise then a simple model is better: there is no point in trying to reproduce every bump and wiggle in the data with a new parameter or physical law because such features are likely to be features of the noise rather than the signal. On the other hand if there is very little noise, every feature in the data is real and your theory fails if it can’t explain it.

To go a bit further it is helpful to consider what happens when we generalize one theory by adding to it some extra parameters. Suppose we begin with a very simple theory, just involving one parameter p, but we fear it may not fit the data. We therefore add a couple more parameters, say q and r. These might be the coefficients of a polynomial fit, for example: the first model might be straight line (with fixed intercept), the second a cubic. We don’t know the appropriate numerical values for the parameters at the outset, so we must infer them by comparison with the available data.

Quantities such as p, q and r are usually called “floating” parameters; there are as many as a dozen of these in the standard Big Bang model, for example.

Obviously, having three degrees of freedom with which to describe the data should enable one to get a closer fit than is possible with just one. The greater flexibility within the general theory can be exploited to match the measurements more closely than the original. In other words, such a model can improve the likelihood, i.e. the probability  of the obtained data  arising (given the noise statistics – presumed known) if the signal is described by whatever model we have in mind.

But Bayes’ theorem tells us that there is a price to be paid for this flexibility, in that each new parameter has to have a prior probability assigned to it. This probability will generally be smeared out over a range of values where the experimental results (contained in the likelihood) subsequently show that the parameters don’t lie. Even if the extra parameters allow a better fit to the data, this dilution of the prior probability may result in the posterior probability being lower for the generalized theory than the simple one. The more parameters are involved, the bigger the space of prior possibilities for their values, and the harder it is for the improved likelihood to win out. Arbitrarily complicated theories are simply improbable. The best theory is the most probable one, i.e. the one for which the product of likelihood and prior is largest.

To give a more quantitative illustration of this consider a given model M which has a set of N floating parameters represented as a vector \underline\lambda = (\lambda_1,\ldots \lambda_N)=\lambda_i; in a sense each choice of parameters represents a different model or, more precisely, a member of the family of models labelled M.

Now assume we have some data D and can consequently form a likelihood function P(D|\underline{\lambda},M). In Bayesian reasoning we have to assign a prior probability P(\underline{\lambda}|M) to the parameters of the model which, if we’re being honest, we should do in advance of making any measurements!

The interesting thing to look at now is not the best-fitting choice of model parameters \underline{\lambda} but the extent to which the data support the model in general.  This is encoded in a sort of average of likelihood over the prior probability space:

P(D|M) = \int P(D|\underline{\lambda},M) P(\underline{\lambda}|M) d^{N}\underline{\lambda}.

This is just the normalizing constant K usually found in statements of Bayes’ theorem which, in this context, takes the form

P(\underline{\lambda}|DM) = K^{-1}P(\underline{\lambda}|M)P(D|\underline{\lambda},M).

In statistical mechanics things like K are usually called partition functions, but in this setting K is called the evidence, and it is used to form the so-called Bayes Factor, used in a technique known as Bayesian model selection of which more anon….

The  usefulness of the Bayesian evidence emerges when we ask the question whether our N  parameters are sufficient to get a reasonable fit to the data. Should we add another one to improve things a bit further? And why not another one after that? When should we stop?

The answer is that although adding an extra degree of freedom can increase the first term in the integral defining K (the likelihood), it also imposes a penalty in the second factor, the prior, because the more parameters there are the more smeared out the prior probability must be. If the improvement in fit is marginal and/or the data are noisy, then the second factor wins and the evidence for a model with N+1 parameters is lower than that for the N-parameter version. Ockham’s razor has done its job.
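
To make this concrete, here is a minimal numerical sketch of the evidence calculation in Python. Everything in it (the synthetic straight-line data, the noise level, the flat priors on [-5, 5] and the simple Monte Carlo estimator) is my own illustrative choice rather than anything from the original discussion; the point is only that P(D|M) is the likelihood averaged over the prior, so the extra prior volume of the three-parameter model acts as an automatic Ockham penalty.

```python
import numpy as np
from scipy import stats

# Synthetic data: a straight line through the origin plus Gaussian noise of known sigma.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
sigma = 0.3
y = 1.5 * x + rng.normal(0.0, sigma, size=x.size)

def likelihood(y_model):
    # P(D | lambda, M): independent Gaussian errors with known sigma
    return np.exp(stats.norm.logpdf(y - y_model, scale=sigma).sum())

def evidence(n_params, n_draws=50_000):
    # Monte Carlo estimate of P(D|M): the likelihood averaged over the prior.
    # Prior: each coefficient uniform on [-5, 5], independent.
    # Model: y = lambda_1 x + lambda_2 x^2 + ... with n_params coefficients.
    draws = rng.uniform(-5.0, 5.0, size=(n_draws, n_params))
    powers = np.arange(1, n_params + 1)
    likes = np.array([likelihood((theta * x[:, None] ** powers).sum(axis=1))
                      for theta in draws])
    return likes.mean()

print("Evidence, 1-parameter (linear) model:", evidence(1))
print("Evidence, 3-parameter (cubic) model :", evidence(3))
# The cubic can always fit at least as well, but its prior is spread over a much
# larger volume, so its evidence typically comes out lower: Ockham's razor at work.
```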

This is a satisfying result that is in nice accord with common sense. But I think it goes much further than that. Many modern-day physicists are obsessed with the idea of a “Theory of Everything” (or TOE). Such a theory would entail the unification of all physical theories – all laws of Nature, if you like – into a single principle. An equally accurate description would then be available, in a single formula, of phenomena that are currently described by distinct theories with separate sets of parameters. Instead of textbooks on mechanics, quantum theory, gravity, electromagnetism, and so on, physics students would need just one book.

The physicist Stephen Hawking has described the quest for a TOE as like trying to read the Mind of God. I think that is silly. If a TOE is ever constructed it will be the most economical available description of the Universe. Not the Mind of God. Just the best way we have of saving paper.



Deductivism and Irrationalism

Posted in Bad Statistics, The Universe and Stuff on December 11, 2010 by telescoper

Looking at my stats I find that my recent introductory post about Bayesian probability has proved surprisingly popular with readers, so I thought I’d follow it up with a brief discussion of some of the philosophical issues surrounding it.

It is ironic that the pioneers of probability theory, principally Laplace, unquestionably adopted a Bayesian rather than a frequentist interpretation of their probabilities. Frequentism arose during the nineteenth century and held sway until recently. I recall giving a conference talk about Bayesian reasoning only to be heckled by the audience with comments about “new-fangled, trendy Bayesian methods”. Nothing could have been less apt. Probability theory pre-dates the rise of sampling theory and all the frequentist-inspired techniques that modern-day statisticians like to employ.

Most disturbing of all is the influence that frequentist and other non-Bayesian views of probability have had upon the development of a philosophy of science, which I believe has a strong element of inverse reasoning or inductivism in it. The argument about whether there is a role for this type of thought in science goes back at least as far as Roger Bacon who lived in the 13th Century. Much later the brilliant Scottish empiricist philosopher and enlightenment figure David Hume argued strongly against induction. Most modern anti-inductivists can be traced back to this source. Pierre Duhem has argued that theory and experiment never meet face-to-face because in reality there are hosts of auxiliary assumptions involved in making this comparison. This is nowadays called the Quine-Duhem thesis.

Actually, for a Bayesian this doesn’t pose a logical difficulty at all. All one has to do is set up prior probability distributions for the required parameters, calculate their posterior probabilities and then integrate over those that aren’t related to measurements. This is just an expanded version of the idea of marginalization, explained here.

Rudolf Carnap, a logical positivist, attempted to construct a complete theory of inductive reasoning which bears some relationship to Bayesian thought, but he failed to apply Bayes’ theorem in the correct way. Carnap distinguished between two types or probabilities – logical and factual. Bayesians don’t – and I don’t – think this is necessary. The Bayesian definition seems to me to be quite coherent on its own.

Other philosophers of science reject the notion that inductive reasoning has any epistemological value at all. This anti-inductivist stance, often somewhat misleadingly called deductivist (irrationalist would be a better description), is evident in the thinking of three of the most influential philosophers of science of the last century: Karl Popper, Thomas Kuhn and, most recently, Paul Feyerabend. Regardless of the ferocity of their arguments with each other, these thinkers have in common that at the core of their systems of thought lies the rejection of all forms of inductive reasoning. The line of thought that ended in this intellectual cul-de-sac began, as I stated above, with the work of the Scottish empiricist philosopher David Hume. For a thorough analysis of the anti-inductivists mentioned above and their obvious debt to Hume, see David Stove’s book Popper and After: Four Modern Irrationalists. I will just make a few inflammatory remarks here.

Karl Popper really began the modern era of science philosophy with his Logik der Forschung, which was published in 1934. There isn’t really much about (Bayesian) probability theory in this book, which is strange for a work which claims to be about the logic of science. Popper also managed to, on the one hand, accept probability theory (in its frequentist form), but on the other, to reject induction. I find it therefore very hard to make sense of his work at all. It is also clear that, at least outside Britain, Popper is not really taken seriously by many people as a philosopher. Inside Britain it is very different and I’m not at all sure I understand why. Nevertheless, in my experience, most working physicists seem to subscribe to some version of Popper’s basic philosophy.

Among the things Popper has claimed is that all observations are “theory-laden” and that “sense-data, untheoretical items of observation, simply do not exist”. I don’t think it is possible to defend this view, unless one asserts that numbers do not exist. Data are numbers. They can be incorporated in the form of propositions about parameters in any theoretical framework we like. It is of course true that the possibility space is theory-laden. It is a space of theories, after all. Theory does suggest what kinds of experiment should be done and what data is likely to be useful. But data can be used to update probabilities of anything.

Popper has also insisted that science is deductive rather than inductive. Part of this claim is just a semantic confusion. It is necessary at some point to deduce what the measurable consequences of a theory might be before one does any experiments, but that doesn’t mean the whole process of science is deductive. He does, however, reject the basic application of inductive reasoning in updating probabilities in the light of measured data; he asserts that no theory ever becomes more probable when evidence is found in its favour. Every scientific theory begins infinitely improbable, and is doomed to remain so.

Now there is a grain of truth in this, or can be if the space of possibilities is infinite. Standard methods for assigning priors often spread the unit total probability over an infinite space, leading to a prior probability which is formally zero. This is the problem of improper priors. But this is not a killer blow to Bayesianism. Even if the prior is not strictly normalizable, the posterior probability can be. In any case, given sufficient relevant data the cycle of experiment-measurement-update of probability assignment usually soon leaves the prior far behind. Data usually count in the end.

The idea by which Popper is best known is the dogma of falsification. According to this doctrine, a hypothesis is only said to be scientific if it is capable of being proved false. In real science certain “falsehood” and certain “truth” are almost never achieved. Theories are simply more probable or less probable than the alternatives on the market. The idea that experimental scientists struggle through their entire life simply to prove theorists wrong is a very strange one, although I definitely know some experimentalists who chase theories like lions chase gazelles. To a Bayesian, the right criterion is not falsifiability but testability, the ability of the theory to be rendered more or less probable using further data. Nevertheless, scientific theories generally do have untestable components. Any theory has its interpretation, which is the untestable baggage that we need to supply to make it comprehensible to us. But whatever can be tested can be scientific.

Popper’s work on the philosophical ideas that ultimately led to falsificationism began in Vienna, but the approach subsequently gained enormous popularity in western Europe. The American Thomas Kuhn later took up the anti-inductivist baton in his book The Structure of Scientific Revolutions. Kuhn is undoubtedly a first-rate historian of science and this book contains many perceptive analyses of episodes in the development of physics. His view of scientific progress is cyclic. It begins with a mass of confused observations and controversial theories, moves into a quiescent phase when one theory has triumphed over the others, and lapses into chaos again when further testing exposes anomalies in the favoured theory. Kuhn adopted the word paradigm to describe the model that rules during the middle stage.

The history of science is littered with examples of this process, which is why so many scientists find Kuhn’s account in good accord with their experience. But there is a problem when attempts are made to fuse this historical observation into a philosophy based on anti-inductivism. Kuhn claims that we “have to relinquish the notion that changes of paradigm carry scientists … closer and closer to the truth.” Einstein’s theory of relativity provides a closer fit to a wider range of observations than Newtonian mechanics, but in Kuhn’s view this success counts for nothing.

Paul Feyerabend has extended this anti-inductivist streak to its logical (though irrational) extreme. His approach has been dubbed “epistemological anarchism”, and it is clear that he believed that all theories are equally wrong. He is on record as stating that normal science is a fairytale, and that equal time and resources should be spent on “astrology, acupuncture and witchcraft”. He also categorised science alongside “religion, prostitution, and so on”. His thesis is basically that science is just one of many possible internally consistent views of the world, and that the choice between which of these views to adopt can only be made on socio-political grounds.

Feyerabend’s views could only have flourished in a society deeply disillusioned with science. Of course, many bad things have been done in science’s name, and many social institutions are deeply flawed. One can’t expect anything operated by people to run perfectly. It’s also quite reasonable to argue on ethical grounds which bits of science should be funded and which should not. But the bottom line is that science does have a firm methodological basis which distinguishes it from pseudo-science, the occult and new age silliness. Science is distinguished from other belief-systems by its rigorous application of inductive reasoning and its willingness to subject itself to experimental test. Not all science is done properly, of course, and bad science is as bad as anything.

The Bayesian interpretation of probability leads to a philosophy of science which is essentially epistemological rather than ontological. Probabilities are not “out there” in external reality, but in our minds, representing our imperfect knowledge and understanding. Scientific theories are not absolute truths. Our knowledge of reality is never certain, but we are able to reason consistently about which of our theories provides the best available description of what is known at any given time. If that description fails when more data are gathered, we move on, introducing new elements or abandoning the theory for an alternative. This process could go on forever. There may never be a final theory. But although the game might have no end, at least we know the rules….



A Main Sequence for Galaxies?

Posted in Bad Statistics, The Universe and Stuff on December 2, 2010 by telescoper

Not for the first time in my life I find myself a bit of a laughing stock, after blowing my top during a seminar at Cardiff yesterday by retired Professor Mike Disney. In fact I got so angry that, much to the amusement of my colleagues, I stormed out. I don’t often lose my temper, and am not proud of having done so, but I reached a point when the red mist descended. What caused it was bad science and, in particular, bad statistics. It was all a big pity because what could have been an interesting discussion of an interesting result was ruined by too many unjustified assertions and too little attention to the underlying basis of the science. I still believe that no matter how interesting the results are, it’s  the method that really matters.

The interesting result that Mike Disney talked about emerges from a Principal Components Analysis (PCA) of the data relating to a sample of about 200 galaxies; it was actually published in Nature a couple of years ago; the arXiv version is here. It was the misleading way this was discussed in the seminar that got me so agitated, so now that I’ve calmed down I’ll give my take on it and explain what I think is going on.

In fact, Principal Component Analysis is a very simple technique and shouldn’t really be controversial at all. It is a way of simplifying the representation of multivariate data by looking for the correlations present within it. To illustrate how it works, consider the following two-dimensional (i.e. bivariate) example I took from a nice tutorial on the method.

In this example the measured variables are Pressure and Temperature. When you plot them against each other you find they are correlated, i.e. the pressure tends to increase with temperature (or vice-versa). When you do a PCA of this type of dataset you first construct the covariance matrix (or, more precisely, its normalized form, the correlation matrix). Such matrices are always symmetric and square, i.e. N×N, where N is the number of variables measured at each point; in this case N=2. What the PCA does is to determine the eigenvalues and eigenvectors of the correlation matrix.

The eigenvectors for the example above are shown in the diagram – they are basically the major and minor axes of an ellipse drawn to fit the scatter plot; these two eigenvectors (and their associated eigenvalues) define the principal components as linear combinations of the original variables. Notice that along one principal direction (v1) there is much more variation than the other (v2). This means that most of the variance in the data set is along the direction indicated by the vector v1, and relatively little in the orthogonal direction v2; the eigenvalue for the first vector is consequently larger than that for the second.

The upshot of this is that the description of this (very simple) dataset can be compressed by using the first principal component rather than the original variables, i.e. by switching from the original two variables (pressure and temperature) to one variable (v1) we have compressed our description without losing much information (only the little bit that is involved in the scatter in the v2 direction).
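
For anyone who wants to play with this, here is a minimal sketch of the bivariate case in Python. The correlated temperature and pressure data are simulated (the particular numbers are invented for illustration, not taken from the tutorial mentioned above); the PCA itself is nothing more than the eigen-decomposition of the 2×2 correlation matrix described in the text.

```python
import numpy as np

# Simulate correlated bivariate data: pressure increasing with temperature, plus scatter.
rng = np.random.default_rng(42)
temperature = rng.normal(300.0, 20.0, size=500)
pressure = 0.5 * temperature + rng.normal(0.0, 5.0, size=500)
data = np.column_stack([temperature, pressure])

# The normalized covariance matrix, i.e. the 2x2 correlation matrix.
corr = np.corrcoef(data, rowvar=False)

# PCA: eigenvalues and eigenvectors of the correlation matrix,
# ordered so that the first principal component carries the most variance.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Fraction of variance along each component:", eigenvalues / eigenvalues.sum())
print("First principal direction v1 :", eigenvectors[:, 0])
print("Second principal direction v2:", eigenvectors[:, 1])
```

With strongly correlated data the first fraction comes out close to one, which is precisely the sense in which the two original variables can be compressed into a single one.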

In the more general case of N observables there will be N principal components, corresponding to vectors in an N-dimensional space, but nothing changes qualitatively. What the PCA does is to rank the eigenvectors according to their eigenvalue (i.e. the variance associated with the direction of the eigenvector). The first principal component is the one with the largest variance, and so on down the ordered list.

Where PCA is useful with large data sets is when the variance associated with the first (or first few) principal components is very much larger than the rest. In that case one can dispense with the N variables and just use one or two.

In the cases discussed by Professor Disney yesterday the data involved six measurable parameters of each galaxy: (1) a dynamical mass estimate; (2) the mass inferred from HI emission (21cm); (3) the total luminosity; (4) radius; (5) a measure of the central concentration of the galaxy; and (6) a measure of its colour. The PCA analysis of these data reveals that about 80% of the variance in the data set is associated with the first principal component, so there is clearly a significant correlation present in the data although, to be honest, I have seen many PCA analyses with much stronger concentrations of variance in the first eigenvector so it doesn’t strike me as being particularly strong.

However, thinking as a physicist rather than a statistician there is clearly something very interesting going on. From a theoretical point of view one would imagine that the properties of an individual galaxy might be controlled by as many as six independent parameters including mass, angular momentum, baryon fraction, age and size, as well as by the accidents of its recent haphazard merger history.

Disney et al. argue that for gaseous galaxies to appear as a one-parameter set, as observed here, the theory of galaxy formation and evolution must supply at least five independent constraint equations in order to collapse everything into a single parameter.

This is all vaguely reminiscent of the Hertzsprung-Russell diagram, or at least the main sequence thereof:

 

You can see here that there’s a correlation between temperature and luminosity which constrains this particular bivariate data set to lie along a (nearly) one-dimensional track in the diagram. In fact these properties correlate with each other because there is a single parameter model relating all properties of main sequence stars to their mass. In other words, once you fix the mass of a main sequence star, it has a fixed  luminosity, temperature, and radius (apart from variations caused by age, metallicity, etc). Of course the problem is that masses of stars are difficult to determine so this parameter is largely hidden from the observer. What is really happening is that luminosity and temperature correlate with each other, because they both depend on the  hidden parameter mass.

I don’t think that the PCA result disproves the current theory of hierarchical galaxy formation (which is what Disney claims) but it will definitely be a challenge for theorists to provide a satisfactory explanation of the result! My own guess for the physical parameter that accounts for most of the variation in this data set is the mass of the dark halo within which the galaxy is embedded. In other words, it might really be just like the Hertzsprung-Russell diagram…

But back to my argument with Mike Disney. I asked what is the first principal component of the galaxy data, i.e. what does the principal eigenvector look like? He refused to answer, saying that it was impossible to tell. Of course it isn’t, as the PCA method actually requires it to be determined. Further questioning seemed to reveal a basic misunderstanding of the whole idea of PCA which made the assertion that all of modern cosmology would need to be revised somewhat difficult to swallow.  At that point of deadlock, I got very angry and stormed out.

I realise that behind the confusion was a reasonable point. The first principal component is well-defined, i.e. v1 is completely well defined in the first figure. However, along the line defined by that vector, P and T are proportional to each other so in a sense only one of them is needed to specify a position along this line. But you can’t say on the basis of this analysis alone that the fundamental variable is either pressure or temperature; they might be correlated through a third quantity you don’t know about.

Anyway, as a postscript I’ll say I did go and apologize to Mike Disney afterwards for losing my rag. He was very forgiving, although I probably now have a reputation for being a grumpy old bastard. Which I suppose I am. He also said one other thing,  that he didn’t mind me getting angry because it showed I cared about the truth. Which I suppose I do.



Doubts about the Evidence for Penrose’s Cyclic Universe

Posted in Bad Statistics, Cosmic Anomalies, The Universe and Stuff on November 28, 2010 by telescoper

A strange paper by Gurzadyan and Penrose hit the arXiv a week or so ago. It seems to have generated quite a lot of reaction in the blogosphere and has now made it onto the BBC News, so I think it merits a comment.

The authors claim to have found evidence that supports Roger Penrose’s conformal cyclic cosmology in the form of a series of (concentric) rings of unexpectedly low variance in the pattern of fluctuations in the cosmic microwave background seen by the Wilkinson Microwave Anisotropy Probe (WMAP). There’s no doubt that a real discovery of such signals in the WMAP data would point towards something radically different from the standard Big Bang cosmology.

I haven’t tried to reproduce Gurzadyan & Penrose’s result in detail, as I haven’t had time to look at it, and I’m not going to rule it out without doing a careful analysis myself. However, what I will say here is that I think you should take the statistical part of their analysis with a huge pinch of salt.

Here’s why.

The authors report a hugely significant detection of their effect (they quote a "6-σ" result); in other words, the observed feature would be expected to arise in the standard cosmological model with a probability of less than 10^{-7}. The type of signal can be seen in their Figure 2, which I reproduce here:

Sorry they’re hard to read, but these show the variance measured on concentric rings (y-axis) of varying radius (x-axis) as seen in the WMAP W (94 GHz) and V (61 GHz) frequency channels (top two panels) compared with what is seen in a simulation with purely Gaussian fluctuations generated within the framework of the standard cosmological model (lower panel). The contrast looks superficially impressive, but there’s much less to it than meets the eye.

For a start, the separate WMAP W and V channels are not the same as the cosmic microwave background. There is a great deal of galactic foreground that has to be cleaned out of these maps before the pristine primordial radiation can be isolated. The fact that similar patterns can be found in the BOOMERANG data by no means rules out a foreground contribution as a common explanation of the anomalous variance. The authors have excluded the region at low galactic latitude (|b|<20°) in order to avoid the most heavily contaminated parts of the sky, but this is by no means guaranteed to eliminate foreground contributions entirely. Here is the all-sky WMAP W-band map for example:

Moreover, these maps also contain considerable systematic effects arising from the scanning strategy of the WMAP satellite. The most obvious of these is that the signal-to-noise varies across the sky, but there are others, such as the finite size of the beam of the WMAP telescope.

Neither galactic foregrounds nor correlated noise are present in the Gaussian simulation shown in the lower panel, and the authors do not say what kind of beam smoothing is used either. The comparison of WMAP single-channel data with simple Gaussian simulations is consequently deeply flawed and the significance level quoted for the result is certainly meaningless.

Having not looked at this in detail myself I’m not going to say that the authors’ conclusions are necessarily false, but I would be very surprised if an effect this large was real given the strenuous efforts so many people have made to probe the detailed statistics of the WMAP data; see, e.g., various items in my blog category on cosmic anomalies. Cosmologists have been wrong before, of course, but then so have even eminent physicists like Roger Penrose…

Another point that I’m not sure about at all is even if the rings of low variance are real – which I doubt – do they really provide evidence of a cyclic universe? It doesn’t seem obvious to me that the model Penrose advocates would actually produce a CMB sky that had such properties anyway.

Above all, I stress that this paper has not been subjected to proper peer review. If I were the referee I’d demand a much higher level of rigour in the analysis before I would allow it to be published in a scientific journal. Until the analysis is done satisfactorily, I suggest that serious students of cosmology shouldn’t get too excited by this result.

It occurs to me that other cosmologists out there might have looked at this result in more detail than I have had time to. If so, please feel free to add your comments in the box…

IMPORTANT UPDATE: 7th December. Two papers have now appeared on the arXiv (here and here) which refute the Gurzadyan-Penrose claim. Apparently, the data behave as Gurzadyan and Penrose claim, but so do proper simulations. In other words, it’s the bottom panel of the figure that’s wrong.

ANOTHER UPDATE: 8th December. Gurzadyan and Penrose have responded with a two-page paper which makes so little sense I had better not comment at all.



Bayes and his Theorem

Posted in Bad Statistics on November 23, 2010 by telescoper

My earlier post on Bayesian probability seems to have generated quite a lot of readers, so this lunchtime I thought I’d add a little bit of background. The previous discussion started from the result

P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)

where

K=P(A|C).

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down, not by Bayes but by Laplace. What Bayes did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Jacob Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution:

P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}

where

C(n,x)= \frac{n!}{x!(n-x)!}

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or p^x. The probability of (n-x) successive failures is similarly (1-p)^{n-x}. These two factors together therefore give the probability of exactly x successes (since there must be n-x failures) occurring in one particular order. The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning. He got the correct answer, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.
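
As an aside, Bayes’s question has a neat closed-form answer: with a uniform prior on p, the posterior after x successes in n trials is a Beta(x+1, n-x+1) distribution, with mean (x+1)/(n+2). Here is a small sketch (the numbers n=10, x=7 are invented for illustration) that checks a grid-based Bayesian calculation against that analytic result.

```python
import numpy as np
from scipy import stats

n, x = 10, 7                     # illustrative: 7 successes in 10 Bernoulli trials
p_grid = np.linspace(0.0, 1.0, 1001)
dp = p_grid[1] - p_grid[0]

# Unnormalized posterior: binomial likelihood C(n,x) p^x (1-p)^(n-x) times a flat prior.
unnorm = stats.binom.pmf(x, n, p_grid)
posterior = unnorm / (unnorm.sum() * dp)          # normalize numerically on the grid

# Analytic answer for comparison: Beta(x+1, n-x+1).
analytic = stats.beta.pdf(p_grid, x + 1, n - x + 1)

print("max |numerical - analytic| :", np.abs(posterior - analytic).max())
print("posterior mean of p        :", (p_grid * posterior).sum() * dp)   # (x+1)/(n+2) = 2/3
```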

This is not the only example in science where the wrong person’s name is attached to a result or discovery. In fact, it is almost a law of Nature that any theorem that has a name has the wrong name. I propose that this observation should henceforth be known as Coles’ Law.

So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. In 1720 he was a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but despite this was elected a Fellow of the Royal Society (FRS) in 1742. Presumably he had Friends of the Right Sort. He did however write a paper on fluxions in 1736, which was published anonymously. This was probably the grounds on which he was elected an FRS.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1764.

P.S. I understand that the authenticity of the picture is open to question. Whoever it actually is, he looks  to me a bit like Laurence Olivier…



A Little Bit of Bayes

Posted in Bad Statistics, The Universe and Stuff on November 21, 2010 by telescoper

I thought I’d start a series of occasional posts about Bayesian probability. This is something I’ve touched on from time to time but it’s perhaps worth covering this relatively controversial topic in a slightly more systematic fashion, especially with regard to how it works in cosmology.

I’ll start with Bayes’ theorem which, for three logical propositions (such as statements about the values of parameters in a theory) A, B and C, can be written in the form

P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)

where

K=P(A|C).

This is (or should be!)  uncontroversial as it is simply a result of the sum and product rules for combining probabilities. Notice, however, that I’ve not restricted it to two propositions A and B as is often done, but carried throughout an extra one (C). This is to emphasize the fact that, to a Bayesian, all probabilities are conditional on something; usually, in the context of data analysis this is a background theory that furnishes the framework within which measurements are interpreted. If you say this makes everything model-dependent, then I’d agree. But every interpretation of data in terms of parameters of a model is dependent on the model. It has to be. If you think it can be otherwise then I think you’re misguided.

In the equation, P(B|C) is the probability of B being true, given that C is true. The information C need not be definitely known, but perhaps assumed for the sake of argument. The left-hand side of Bayes’ theorem denotes the probability of B given both A and C, and so on. The presence of C has not changed anything, but is just there as a reminder that it all depends on what is being assumed in the background. The equation states a theorem that can be proved to be mathematically correct so it is – or should be – uncontroversial.

Now comes the controversy. In the “frequentist” interpretation of probability, the entities A, B and C would be interpreted as “events” (e.g. the coin is heads) or “random variables” (e.g. the score on a dice, a number from 1 to 6) attached to which is their probability, indicating their propensity to occur in an imagined ensemble. These things are quite complicated mathematical objects: they don’t have specific numerical values, but are represented by a measure over the space of possibilities. They are sort of “blurred-out” in some way, the fuzziness representing the uncertainty in the precise value.

To a Bayesian, the entities A, B and C have a completely different character to what they represent for a frequentist. They are not “events” but  logical propositions which can only be either true or false. The entities themselves are not blurred out, but we may have insufficient information to decide which of the two possibilities is correct. In this interpretation, P(A|C) represents the degree of belief that it is consistent to hold in the truth of A given the information C. Probability is therefore a generalization of the “normal” deductive logic expressed by Boolean algebra: the value “0” is associated with a proposition which is false and “1” denotes one that is true. Probability theory extends  this logic to the intermediate case where there is insufficient information to be certain about the status of the proposition.

A common objection to Bayesian probability is that it is somehow arbitrary or ill-defined. “Subjective” is the word that is often bandied about. This is only fair to the extent that different individuals may have access to different information and therefore assign different probabilities. Given different information C and C′ the probabilities P(A|C) and P(A|C′) will be different. On the other hand, the same precise rules for assigning and manipulating probabilities apply as before. Identical results should therefore be obtained whether these are applied by any person, or even a robot, so that part isn’t subjective at all.

In fact I’d go further. I think one of the great strengths of the Bayesian interpretation is precisely that it does depend on what information is assumed. This means that such information has to be stated explicitly. The essential assumptions behind a result can be – and, regrettably, often are – hidden in frequentist analyses. Being a Bayesian forces you to put all your cards on the table.

To a Bayesian, probabilities are always conditional on other assumed truths. There is no such thing as an absolute probability, hence my alteration of the form of Bayes’s theorem to represent this. A probability such as P(A) has no meaning to a Bayesian: there is always conditioning information. For example, if  I blithely assign a probability of 1/6 to each face of a dice, that assignment is actually conditional on me having no information to discriminate between the appearance of the faces, and no knowledge of the rolling trajectory that would allow me to make a prediction of its eventual resting position.

In the Bayesian framework, probability theory becomes not a branch of experimental science but a branch of logic. Like any branch of mathematics it cannot be tested by experiment but only by the requirement that it be internally self-consistent. This brings me to what I think is one of the most important results of twentieth century mathematics, but which is unfortunately almost unknown in the scientific community. In 1946, Richard Cox derived the unique generalization of Boolean algebra under the assumption that such a logic must involve associating a single number with any logical proposition. The result he got is beautiful and anyone with any interest in science should make a point of reading his elegant argument. It turns out that the only way to construct a consistent logic of uncertainty incorporating this principle is by using the standard laws of probability. There is no other way to reason consistently in the face of uncertainty than probability theory. Accordingly, probability theory always applies when there is insufficient knowledge for deductive certainty. Probability is inductive logic.

This is not just a nice mathematical property. This kind of probability lies at the foundations of a consistent methodological framework that not only encapsulates many common-sense notions about how science works, but also puts at least some aspects of scientific reasoning on a rigorous quantitative footing. This is an important weapon that should be used more often in the battle against the creeping irrationalism one finds in society at large.

I posted some time ago about an alternative way of deriving the laws of probability from consistency arguments.

To see how the Bayesian approach works, let us consider a simple example. Suppose we have a hypothesis H (some theoretical idea that we think might explain some experiment or observation). We also have access to some data D, and we also adopt some prior information I (which might be the results of other experiments or simply working assumptions). What we want to know is how strongly the data D supports the hypothesis H given my background assumptions I. To keep it easy, we assume that the choice is between whether H is true or H is false. In the latter case, “not-H” or H′ (for short) is true. If our experiment is at all useful we can construct P(D|HI), the probability that the experiment would produce the data set D if both our hypothesis and the conditional information are true.

The probability P(D|HI) is called the likelihood; to construct it we need to have   some knowledge of the statistical errors produced by our measurement. Using Bayes’ theorem we can “invert” this likelihood to give P(H|DI), the probability that our hypothesis is true given the data and our assumptions. The result looks just like we had in the first two equations:

P(H|DI) = K^{-1}P(H|I)P(D|HI) .

Now we can expand the “normalising constant” K because we know that either H or H′ must be true. Thus

K=P(D|I)=P(H|I)P(D|HI)+P(H^{\prime}|I) P(D|H^{\prime}I)

The P(H|DI) on the left-hand side of the first expression is called the posterior probability; the right-hand side involves P(H|I), which is called the prior probability and the likelihood P(D|HI). The principal controversy surrounding Bayesian inductive reasoning involves the prior and how to define it, which is something I’ll comment on in a future post.
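
In code, the two-hypothesis case is almost trivial, which is rather the point. The particular numbers below (a 50/50 prior and the two likelihood values) are invented purely for illustration:

```python
# Two-hypothesis Bayesian update, with invented numbers.
prior_H = 0.5                        # P(H|I)
prior_notH = 1.0 - prior_H           # P(H'|I)
likelihood_H = 0.8                   # P(D|HI): probability of the data if H is true
likelihood_notH = 0.3                # P(D|H'I): probability of the data if H is false

# Normalizing constant K = P(D|I), expanded exactly as in the equation above.
K = prior_H * likelihood_H + prior_notH * likelihood_notH

posterior_H = prior_H * likelihood_H / K      # P(H|DI)
print("P(D|I)  =", K)                         # 0.55
print("P(H|DI) =", round(posterior_H, 4))     # 0.7273
```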

The Bayesian recipe for testing a hypothesis assigns a large posterior probability to a hypothesis for which the product of the prior probability and the likelihood is large. It can be generalized to the case where we want to pick the best of a set of competing hypotheses, say H1, …, Hn. Note that this need not be the set of all possible hypotheses, just those that we have thought about. We can only choose from what is available. The hypotheses may be relatively simple, such as that some particular parameter takes the value x, or they may be composite, involving many parameters and/or assumptions. For instance, the Big Bang model of our universe is a very complicated hypothesis, or in fact a combination of hypotheses joined together, involving at least a dozen parameters which can’t be predicted a priori but which have to be estimated from observations.

The required result for multiple hypotheses is pretty straightforward: the sum of the two alternatives involved in K above simply becomes a sum over all possible hypotheses, so that

P(H_i|DI) = K^{-1}P(H_i|I)P(D|H_iI),

and

K=P(D|I)=\sum P(H_j|I)P(D|H_jI)

If the hypothesis concerns the value of a parameter – in cosmology this might be, e.g., the mean density of the Universe expressed by the density parameter Ω0 – then the allowed space of possibilities is continuous. The sum in the denominator should then be replaced by an integral, but conceptually nothing changes. Our “best” hypothesis is the one that has the greatest posterior probability.
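
Here is a sketch of the continuous case, with everything invented for illustration (a single parameter with a flat prior on [0, 2], and ten Gaussian-noise measurements of it); the sum over hypotheses simply becomes a normalizing integral, evaluated here on a grid.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_value = 0.3
noise_sigma = 0.2
data = true_value + rng.normal(0.0, noise_sigma, size=10)   # ten noisy measurements

omega = np.linspace(0.0, 2.0, 2001)      # grid of hypothesised parameter values
d_omega = omega[1] - omega[0]

prior = np.full_like(omega, 1.0 / 2.0)   # flat prior on [0, 2]
log_like = np.array([stats.norm.logpdf(data, loc=w, scale=noise_sigma).sum()
                     for w in omega])

unnorm = prior * np.exp(log_like - log_like.max())   # rescale to avoid numerical underflow
posterior = unnorm / (unnorm.sum() * d_omega)        # K evaluated as an integral over the grid

print("most probable parameter value:", omega[np.argmax(posterior)])
```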

From a frequentist stance the procedure is often instead just to maximize the likelihood. According to this approach the best theory is the one that makes the data most probable. This can coincide with the most probable theory, but only if the prior probability is constant; in general the probability of a model given the data is not the same as the probability of the data given the model. I’m amazed how many practising scientists make this error on a regular basis.

The following figure might serve to illustrate the difference between the frequentist and Bayesian approaches. In the former case, everything is done in “data space” using likelihoods, and in the other we work throughout with probabilities of hypotheses, i.e. we think in hypothesis space. I find it interesting to note that most theorists that I know who work in cosmology are Bayesians and most observers are frequentists!


As I mentioned above, it is the presence of the prior probability in the general formula that is the most controversial aspect of the Bayesian approach. The attitude of frequentists is often that this prior information is completely arbitrary or at least “model-dependent”. Being empirically-minded people, by and large, they prefer to think that measurements can be made and interpreted without reference to theory at all.

Assuming we can assign the prior probabilities in an appropriate way what emerges from the Bayesian framework is a consistent methodology for scientific progress. The scheme starts with the hardest part – theory creation. This requires human intervention, since we have no automatic procedure for dreaming up hypotheses from thin air. Once we have a set of hypotheses, we need data against which theories can be compared using their relative probabilities. The experimental testing of a theory can happen in many stages: the posterior probability obtained after one experiment can be fed in, as prior, into the next. The order of experiments does not matter. This all happens in an endless loop, as models are tested and refined by confrontation with experimental discoveries, and are forced to compete with new theoretical ideas. Often one particular theory emerges as most probable for a while, such as in particle physics where a “standard model” has been in existence for many years. But this does not make it absolutely right; it is just the best bet amongst the alternatives. Likewise, the Big Bang model does not represent the absolute truth, but is just the best available model in the face of the manifold relevant observations we now have concerning the Universe’s origin and evolution. The crucial point about this methodology is that it is inherently inductive: all the reasoning is carried out in “hypothesis space” rather than “observation space”. The primary form of logic involved is not deduction but induction. Science is all about inverse reasoning.

For comments on induction versus deduction in another context, see here.

So what are the main differences between the Bayesian and frequentist views?

First, I think it is fair to say that the Bayesian framework is enormously more general than is allowed by the frequentist notion that probabilities must be regarded as relative frequencies in some ensemble, whether that is real or imaginary. In the latter interpretation, a proposition is at once true in some elements of the ensemble and false in others. It seems to me to be a source of great confusion to substitute a logical AND for what is really a logical OR. The Bayesian stance is also free from problems associated with the failure to incorporate in the analysis any information that can’t be expressed as a frequency. Would you really trust a doctor who said that 75% of the people she saw with your symptoms required an operation, but who did not bother to look at your own medical files?

As I mentioned above, frequentists tend to talk about “random variables”. This takes us into another semantic minefield. What does “random” mean? To a Bayesian there are no random variables, only variables whose values we do not know. A random process is simply one about which we only have sufficient information to specify probability distributions rather than definite values.

More fundamentally, since the combination rules for probabilities were derived by Cox uniquely from the requirement of logical consistency, any departure from these rules will, generally speaking, involve logical inconsistency. Many of the standard statistical data-analysis techniques – including the simple “unbiased estimator” mentioned briefly above – that are used when the data consist of repeated samples of a variable having a definite but unknown value are not equivalent to Bayesian reasoning. These methods can, of course, give good answers, but they can all be made to look completely silly by a suitable choice of dataset.

By contrast, I am not aware of any example of a paradox or contradiction that has ever been found using the correct application of Bayesian methods, although the method can of course be applied incorrectly. Furthermore, in order to deal with unique events like the weather, frequentists are forced to introduce the notion of an ensemble, a perhaps infinite collection of imaginary possibilities, to allow them to retain the notion that probability is a proportion. Provided the calculations are done correctly, the results should agree with the Bayesian answers. On the other hand, frequentists often talk about the ensemble as if it were real, and I think that is very dangerous…



DNA Profiling and the Prosecutor’s Fallacy

Posted in Bad Statistics with tags , , , , , , on October 23, 2010 by telescoper

It’s been a while since I posted anything in the Bad Statistics file, so I thought I’d return to the subject of one of my very first blog posts, although I’ll take a different tack this time and introduce it with a different, though related, example.

The topic is forensic statistics, which has been involved in some high-profile cases and which demonstrates how careful probabilistic reasoning is needed to understand scientific evidence. A good example is the use of DNA profiling evidence. Typically, this involves the comparison of two samples: one from an unknown source (evidence, such as blood or semen, collected at the scene of a crime) and a known or reference sample, such as a blood or saliva sample from a suspect. If the DNA profiles obtained from the two samples are indistinguishable then they are said to “match” and this evidence can be used in court as indicating that the suspect was in fact the origin of the sample.

In courtroom dramas, DNA matches are usually presented as being very definitive. In fact, the strength of the evidence varies very widely depending on the circumstances. If the DNA profile of the suspect or evidence consists of a combination of traits that is very rare in the population at large then the evidence can be very strong that the suspect was the contributor. If the DNA profile is not so rare then it becomes more likely that both samples match simply by chance. This probabilistic aspect makes it very important to understand the logic of the argument very carefully.

So how does it all work? A DNA profile is not a complete map of the entire genetic code contained within the cells of an individual, which would be such an enormous amount of information that it would be impractical to use it in court. Instead, a profile consists of a few (perhaps half-a-dozen) pieces of this information called alleles. An allele is one of the possible codings of DNA of the same gene at a given position (or locus) on one of the chromosomes in a cell. A single gene may, for example, determine the colour of the blossom produced by a flower; more often genes act in concert with other genes to determine the physical properties of an organism. The overall physical appearance of an individual organism, i.e. any of its particular traits, is called the phenotype and it is controlled, at least to some extent, by the set of alleles that the individual possesses. In the simplest cases, however, a single gene controls a given attribute. The gene that controls the colour of a flower will have different versions: one might produce blue flowers, another red, and so on. These different versions of a given gene are called alleles.

Some organisms contain two copies of each gene; these are said to be diploid. These copies can either be both the same, in which case the organism is homozygous, or different, in which case it is heterozygous; in the latter case it possesses two different alleles for the same gene. Phenotypes for a given allele may be either dominant or recessive (although not all are characterized in this way). For example, suppose the dominant and recessive alleles are called A and a, respectively. If a phenotype is dominant then the presence of one associated allele in the pair is sufficient for the associated trait to be displayed, i.e. AA, aA and Aa will all show the same phenotype. If it is recessive, both alleles must be of the type associated with that phenotype, so only aa will lead to the corresponding trait being visible.

Now we get to the probabilistic aspect of this. Suppose we want to know the frequency of an allele in the population, which translates into the probability that it is found when an individual is selected at random. The argument that is needed is essentially statistical. During reproduction, the offspring assemble their alleles from those of their parents. Suppose that the alleles for any given individual are chosen independently. If p is the frequency of the dominant allele and q is the frequency of the recessive one, then we can immediately write:

p + q = 1

Using the product law for probabilities, and assuming independence, the probability of the homozygous dominant pairing (i.e. AA) is p^2, while that of the pairing aa is q^2. The probability of the heterozygous outcome is 2pq (the two possibilities, each of probability pq, are Aa and aA). This leads to the result that

p^2 + 2pq + q^2 = 1

This is called the Hardy-Weinberg law. It can easily be extended to cases where there are more than two alleles, but I won’t go through the details here.

Now what we have to do is examine the DNA of a particular individual and see how it compares with what is known about the population. Suppose we take one locus to start with, and the individual turns out to be homozygous: the two alleles at that locus are the same. In the population at large the frequency of that allele might be, say, 0.6. The probability that this combination arises “by chance” is therefore 0.6 times 0.6, or 0.36. Now move to the next locus, where the individual’s profile has two different alleles. The frequency of one is 0.25 and that of the other is 0.75, so the probability of the combination is “2pq”, which is 0.375. The probability of a match at both these loci is therefore 0.36 times 0.375, or 13.5%. The addition of further loci gradually refines the profile, so the corresponding probability reduces.
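Since the argument here is just arithmetic, it is easy to reproduce. The short sketch below uses the allele frequencies quoted above (0.6 at the first, homozygous, locus; 0.25 and 0.75 at the second, heterozygous, locus) and shows how the Hardy-Weinberg factors multiply across loci; the function and its argument format are my own illustrative choices:

```python
def match_probability(loci):
    """Probability that a randomly selected person matches a given DNA profile.

    Each locus is either ("hom", p) for a homozygous pair with allele frequency p,
    or ("het", p, q) for a heterozygous pair with allele frequencies p and q.
    Assumes Hardy-Weinberg equilibrium and independence between loci.
    """
    prob = 1.0
    for locus in loci:
        if locus[0] == "hom":
            _, p = locus
            prob *= p * p          # p^2
        else:
            _, p, q = locus
            prob *= 2 * p * q      # 2pq
    return prob

# The two loci worked through in the text: 0.36 * 0.375 = 0.135, i.e. 13.5%
print(match_probability([("hom", 0.6), ("het", 0.25, 0.75)]))
```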

This is a perfectly bona fide statistical argument, provided the assumptions made about population genetics are correct. Let us suppose that a profile of 7 loci – a typical number for the kind of profiling used in the courts – leads to a probability of one in ten thousand of a match for a “randomly selected” individual. Now suppose the profile of our suspect matches that of the sample left at the crime scene. This means that either the suspect left the trace there, or an unlikely coincidence has happened: by a 1-in-10,000 chance, our suspect just happens to match the evidence.

This kind of result is often quoted in the newspapers as meaning that there is only a 1 in 10,000 chance that someone other than the suspect contributed the sample or, in other words, that the odds against the suspect being innocent are ten thousand to one against. Such statements are gross misrepresentations of the logic, but they have become so commonplace that they have acquired their own name: the Prosecutor’s Fallacy.

To see why this is a fallacy, i.e. why it is wrong, imagine that whatever crime we are talking about took place in a big city with 1,000,000 inhabitants. How many people in this city would have DNA that matches the profile? Answer: about 1 in 10,000 of them, which comes to 100. Our suspect is one. In the absence of any other information, the odds are therefore roughly 100:1 against him being guilty rather than 10,000:1 in favour. In realistic cases there will of course be additional evidence that excludes the other 99 potential suspects, so it is incorrect to claim that a DNA match actually provides evidence of innocence. That converse argument has been dubbed the Defence Fallacy, but the example nevertheless shows that statements about probability need to be phrased very carefully if they are to be understood properly.
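The counting argument is worth putting into numbers explicitly. With the figures used above – a 1 in 10,000 match probability and a city of a million people – the expected number of matching individuals, and hence the odds against any particular one of them being the source in the absence of other evidence, fall straight out; this is just a restatement of the paragraph above in code:

```python
match_prob = 1.0 / 10_000   # probability that a random person matches the profile
population = 1_000_000      # inhabitants of the city

expected_matches = match_prob * population   # about 100 people match
print("Expected number of matching people:", expected_matches)

# With no other evidence, each matching person is equally likely to be the source,
# so the probability that our particular suspect is the source is roughly:
p_suspect = 1.0 / expected_matches
print("P(suspect is the source | match, no other evidence):", p_suspect)
# i.e. odds of about 100:1 against, not 10,000:1 in favour
```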

All this brings me to the tragedy that I blogged about in 2008. In 1999, Mrs Sally Clark was tried and convicted for the murder of her two sons Christopher, who died aged 10 weeks in 1996, and Harry who was only eight weeks old when he died in 1998. Sudden infant deaths are sadly not as uncommon as one might have hoped: about one in eight thousand families experience such a nightmare. But what was unusual in this case was that after the second death in Mrs Clark’s family, the distinguished paediatrician Sir Roy Meadows was asked by the police to investigate the circumstances surrounding both her losses. Based on his report, Sally Clark was put on trial for murder. Sir Roy was called as an expert witness. Largely because of his testimony, Mrs Clark was convicted and sentenced to prison.

After much campaigning, she was released by the Court of Appeal in 2003. She was innocent all along. On top of the loss of her sons, the courts had deprived her of her liberty for four years. Sally Clark died in 2007 from alcohol poisoning, having apparently taken to the bottle after three years of wrongful imprisonment. The whole episode was a tragedy and a disgrace to the legal profession.

I am not going to imply that Sir Roy Meadows bears sole responsibility for this fiasco, because there were many difficulties in Mrs Clark’s trial. One of the main issues raised on Appeal was that the pathologist working with the prosecution had failed to disclose evidence that Harry was suffering from an infection at the time he died. Nevertheless, what Professor Meadows said on oath was so shockingly stupid that he fully deserves the vilification with which he was greeted after the trial. Two other women had also been imprisoned in similar circumstances, as a result of his intervention.

At the core of the prosecution’s case was a probabilistic argument that would have been torn to shreds had any competent statistician been called to the witness box. Sadly, the defence counsel seemed to believe it as much as the jury did, and it was never rebutted. Sir Roy stated, correctly, that the odds of a baby dying of sudden infant death syndrome (or “cot death”) in an affluent, non-smoking family like Sally Clark’s were about 8,543 to one against. He then presented the probability of this happening twice in a family as being this number squared, or 73 million to one against. In the minds of the jury this became the odds against Mrs Clark being innocent of a crime.

That this argument was not effectively challenged at the trial is truly staggering.

Remember that the product rule for combining probabilities

P(AB)=P(A)P(B|A)

only reduces to

P(AB)=P(A)P(B)

if the two events A and B are independent, i.e. if the occurrence of one event has no effect on the probability of the other. Nobody knows for sure what causes cot deaths, but there is every reason to believe that there may be inherited or environmental factors that make such deaths more frequent in some families than in others. In other words, sudden infant deaths might be correlated rather than independent. Furthermore, there are data on the frequency of multiple infant deaths in families. The conditional frequency of a second such event following an earlier one is not one in eight thousand or so; it’s just one in 77. This is hard evidence that should have been presented to the jury. It wasn’t.

Note that this testimony counts as doubly-bad statistics. It not only deploys the Prosecutor’s Fallacy, but applies it to what was an incorrect calculation in the first place!
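The difference the independence assumption makes is easy to quantify using the figures quoted above: roughly 1 in 8,543 for a first cot death in such a family, and a conditional frequency of about 1 in 77 for a second death given a first. A quick sketch of the two calculations:

```python
p_first = 1 / 8_543          # probability of one cot death in such a family
p_second_indep = 1 / 8_543   # what squaring implicitly assumes for the second death
p_second_cond = 1 / 77       # conditional frequency suggested by the data

# The figure presented in court: treat the two deaths as independent
p_both_indep = p_first * p_second_indep
print("Assuming independence: about 1 in", round(1 / p_both_indep))   # ~73 million

# Using the conditional probability P(B|A) instead
p_both_cond = p_first * p_second_cond
print("Using P(B|A):          about 1 in", round(1 / p_both_cond))    # ~658,000

# Neither number is the probability that Mrs Clark was innocent:
# turning it into one is the Prosecutor's Fallacy all over again.
```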

Defending himself, Professor Meadows tried to explain that he hadn’t really understood the statistical argument he was presenting, but was merely repeating, for the benefit of the court, something he had read – which turned out to have come from a report that had not even been published at the time of the trial. He said:

To me it was like I was quoting from a radiologist’s report or a piece of pathology. I was quoting the statistics, I wasn’t pretending to be a statistician.

I always thought that expert witnesses were supposed to testify about things they are actually expert in, rather than subjecting the jury to second-hand flummery. Perhaps expert witnesses enjoy their status so much that they feel they can’t make mistakes about anything.

Subsequent to Mrs Clark’s release, Sir Roy Meadows was summoned to appear in front of a disciplinary tribunal at the General Medical Council. At the end of the hearing he was found guilty of serious professional misconduct, and struck off the medical register. Since he is retired anyway, this seems to me to be scant punishment. The judges and barristers who should have been alert to this miscarriage of justice have escaped censure altogether.

Although I am pleased that Professor Meadows has been disciplined in this fashion, I also hope that the General Medical Council does not think that hanging one individual out to dry will solve this problem. In addition, I think politicians and the legal system should look very hard at what went wrong in this case (and others of its type) to see how the probabilistic arguments that are essential in these days of forensic science can be properly incorporated into a rational system of justice. At the moment there is no agreed protocol, nor any expert body, for evaluating scientific evidence before it is presented in court. It is likely that such a body might have prevented the case of Mrs Clark from ever coming to trial. Scientists frequently seek the opinions of lawyers when they need to, but lawyers seem happy to handle scientific arguments themselves even when they don’t understand them at all.

I end with a quote from a press release produced by the Royal Statistical Society in the aftermath of this case:

Although many scientists have some familiarity with statistical methods, statistics remains a specialised area. The Society urges the Courts to ensure that statistical evidence is presented only by appropriately qualified statistical experts, as would be the case for any other form of expert evidence.

As far as I know, the criminal justice system has yet to implement such safeguards.



Political Correlation

Posted in Bad Statistics, Politics with tags , , , , on August 28, 2010 by telescoper

I was just thinking that it’s been a while since I posted anything in my bad statistics category when a particularly egregious example jumped out of this week’s Times Higher and slapped me in the face. This one goes wrong before it even gets to the statistical analysis, so I’ll only give it short shrift here, but it serves to remind us all how feeble many academics’ grasp of the scientific method can be, particularly of the role of statistics within it. The perpetrator in this case is Paul Whiteley, who is Professor of Politics at the University of Essex. I’m tempted to suggest he should go and stand in the corner wearing a dunce’s cap.

Professor Whiteley argues that he has found evidence refuting the claim that increased provision of science, technology, engineering and maths (STEM) graduates is – in the words of Lord Mandelson – “crucial to securing future prosperity”. His evidence is based on data relating to 30 OECD countries: on the one hand, their average economic growth for the period 2000-8 and, on the other, the percentage of graduates in STEM subjects for each country over the same period. He finds no statistically significant correlation between these variates. The data are plotted here:

This lack of correlation is asserted to be evidence that STEM graduates are not necessary for economic growth, but in an additional comment (for which no supporting numbers are given), it is stated that growth correlates with the total number of graduates in all subjects in each country. Hence the conclusion that higher education is good, whether or not it’s in STEM areas.
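For what it’s worth, the test presumably underlying the “no statistically significant correlation” claim is something like a Pearson correlation coefficient with an associated p-value. A minimal sketch of such a test is below; the arrays are randomly generated stand-ins, not the actual OECD figures, and the variable names are mine:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Stand-in data for 30 countries, invented in place of the real OECD figures
stem_share = rng.uniform(3, 30, size=30)    # % of graduates in STEM subjects
growth = rng.normal(2.5, 1.0, size=30)      # average growth over 2000-8, in %

r, p_value = pearsonr(stem_share, growth)
print(f"r = {r:.2f}, p = {p_value:.2f}")
# A large p-value means the correlation is consistent with zero -- which, as argued
# below, says little about any long-term effect of STEM graduates on growth.
```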

So what’s wrong with this analysis? A number of things, in fact, but I’ll start with what seems to me the most important conceptual one. In order to test a hypothesis, you have to look for a measurable effect that would be expected if the hypothesis were true, measure the effect, and then decide whether the effect is there or not. If it isn’t, you have falsified the hypothesis.

Now, would anyone really expect the % of students graduating in STEM subjects to correlate with the growth rate in the economy over the same period? Does anyone really think that newly qualified STEM graduates have an immediate impact on economic growth? I’m sure even the most dedicated pro-science lobbyist would answer “no” to that question. Even the quote from Lord Mandelson included the crucial word “future”! Investment in these areas is expected to have a long-term benefit that would probably only show up after many years. I would have been amazed had there been a correlation between measures relating to such a short period, so the absence of one says nothing whatsoever about the economic benefits of education in STEM areas.

And another thing. Why is the “percentage of graduates” chosen as a variate for this study? Surely a large % of STEM graduates is irrelevant if the total number is very small? I would have thought the fraction of the population with a STEM degree might be a better choice. Better still, since it is claimed that the overall number of graduates correlates with economic growth, why not show how this correlation with the total number of graduates breaks down by subject area?

I’m a bit suspicious about the reliability of the data too. Which country is it that produces less than 3% of its graduates in science subjects (the point at the bottom left of the plot)? Surely different countries also have different types of economy, wherein the role of science and technology varies considerably. It’s tempting, in fact, to see two parallel lines in the above graph – I’m not the only one to have noticed this – which may either be an artefact of the small numbers involved or may indicate that some other parameter is playing a role.

This poorly framed hypothesis test, dubious choice of variables, and highly questionable conclusions strongly suggest that Professor Whiteley had already made up his mind what result he wanted and simply dressed it up in a bit of flimsy statistics. Unfortunately, such pseudoscientific flummery is all that’s needed to convince a great many people out there in the big wide world, especially journalists. It’s a pity that this shoddy piece of statistical gibberish was given such prominence in the Times Higher, supported by a predictably vacuous editorial, especially when the same issue features an article about the declining standards of science journalism. Perhaps we need more STEM graduates to teach the others how to do statistical tests properly.

However, before everyone accuses me of being blind to the benefits of anything other than STEM subjects, I’ll just make it clear that, while I do think science is very important for a large number of reasons, I also accept that higher education generally is a good thing in itself, regardless of whether it’s in physics or mediaeval Latin, though I’m not sure about certain other subjects. Universities should not be judged solely by the effect they may or may not have on short-term economic growth.

Which brings me to a final point, about the difference between correlation and causation. People with more disposable income probably spend more money on, e.g., books than people with less money. Buying books doesn’t make you rich, at least not in the short term, but it’s a good thing to do for its own sake. We shouldn’t think of higher education exclusively on the cost side of the economic equation, as politicians and bureaucrats increasingly seem to do; it’s also one of the benefits.

