Archive for probability

Bayes’ Theorem and the Search for Supersymmetry

Posted in The Universe and Stuff on May 13, 2012 by telescoper

Interesting comments about Bayes’ theorem and the prospects for detecting supersymmetry at the Large Hadron Collider. This piece explains how a non-detection isn’t always “absence of evidence” but can indeed be “evidence of absence”. It’s also worth reading the comments if you’re wondering whether what people say about Lubos Motl is actually true…

viXra log

Here’s a puzzle. There are three cups upside down on a table. Your friend tells you that a pea is hidden under one of them. Based on past experience you estimate that there is a 90% probability that this is true. You turn over two cups and don’t find the pea. What is the probability now that there is a pea underneath? You may want to think about this before reading on.

Naively you might think that two-thirds of the parameter space has been eliminated, so the probability has gone from 90% to 30%, but this is quite wrong. You can use Bayes’ Theorem to get the correct answer but let me give you a more intuitive frequentist answer. The situation can be modelled by imagining that there are thirty initial possibilities with equal probability. Nine of them have a pea under the first cup, nine more under the second and nine more under the third…
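If you want to check the answer for yourself, here is a minimal sketch of both routes to it – the Bayesian update and the counting argument quoted above – using exact fractions. The only inputs are the ones given in the puzzle: a 90% prior and three equally likely cups.

```python
from fractions import Fraction

# Bayesian route: prior of 90% that a pea is there at all, uniform over the three cups.
prior_pea = Fraction(9, 10)
p_two_empty_given_pea = Fraction(1, 3)      # pea happens to be under the unturned cup
p_two_empty_given_no_pea = Fraction(1, 1)   # no pea, so the two cups are certainly empty

posterior = (p_two_empty_given_pea * prior_pea) / (
    p_two_empty_given_pea * prior_pea + p_two_empty_given_no_pea * (1 - prior_pea)
)
print(posterior)                 # 3/4

# Frequentist count: 30 equally likely cases -- 9 per cup plus 3 with no pea at all.
# Finding two cups empty leaves the 9 "under the third cup" cases and the 3 "no pea" cases.
print(Fraction(9, 9 + 3))        # 3/4 again, i.e. 75% rather than 30%
```

Both routes give 75%, not 30%.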


Serious Brain Teaser

Posted in Cute Problems on October 28, 2011 by telescoper

This one has been doing the rounds this morning, so I couldn’t resist posting it here:

PS. If anyone knows where this originated please let me know and I’ll give proper credit!

PPS. Note that the Mike Disney option (300%) is missing…

Bayes and his Theorem

Posted in Bad Statistics on November 23, 2010 by telescoper

My earlier post on Bayesian probability seems to have generated quite a lot of readers, so this lunchtime I thought I’d add a little bit of background. The previous discussion started from the result

P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)

where

K=P(A|C).
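As a quick numerical sanity check of this identity, here is a minimal sketch using a made-up joint distribution for A and B (everything conditioned on the same background information C). The numbers are arbitrary; they are only there to verify that both sides agree.

```python
from fractions import Fraction

p = {  # p[(a, b)] = P(A=a, B=b | C); hypothetical values chosen for the check
    (True, True): Fraction(3, 10), (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10), (False, False): Fraction(4, 10),
}

P_A = p[(True, True)] + p[(True, False)]     # P(A|C) = K
P_B = p[(True, True)] + p[(False, True)]     # P(B|C)
P_A_given_BC = p[(True, True)] / P_B         # P(A|BC)

lhs = p[(True, True)] / P_A                  # P(B|AC) computed directly
rhs = P_B * P_A_given_BC / P_A               # K^{-1} P(B|C) P(A|BC)

print(lhs, rhs, lhs == rhs)                  # 3/4 3/4 True
```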

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down, not by Bayes but by Laplace. What Bayes did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Jacob Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution:

P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}

where

C(n,x)= n!/[x!(n-x)!]

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or p^x. The probability of (n-x) successive failures is similarly (1-p)^{n-x}. These two factors therefore tell us the probability that we have exactly x successes (since there must be n-x failures). The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).
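For anyone who wants to check these numbers rather than do the exercise on paper, here is a minimal sketch that evaluates the binomial distribution directly and confirms the expectation value np and variance np(1-p) for the coin-tossing example (n = 10, p = 0.5).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent Bernoulli trials with success probability p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5                                   # ten tosses of a fair coin
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * pmf[x] for x in range(n + 1))
variance = sum((x - mean)**2 * pmf[x] for x in range(n + 1))

print(sum(pmf))    # 1.0 (the distribution is normalised)
print(mean)        # 5.0 = n*p
print(variance)    # 2.5 = n*p*(1-p)
```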

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning. He got the correct answer, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.

This is not the only example in science where the wrong person’s name is attached to a result or discovery. In fact, it is almost a law of Nature that any theorem that has a name has the wrong name. I propose that this observation should henceforth be known as Coles’ Law.

So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. In 1720 he was a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but despite this was elected a Fellow of the Royal Society (FRS) in 1742. Presumably he had Friends of the Right Sort. He did however write a paper on fluxions in 1736, which was published anonymously. This was probably the grounds on which he was elected an FRS.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1764.

P.S. I understand that the authenticity of the picture is open to question. Whoever it actually is, he looks  to me a bit like Laurence Olivier…



DNA Profiling and the Prosecutor’s Fallacy

Posted in Bad Statistics on October 23, 2010 by telescoper

It’s been a while since I posted anything in the Bad Statistics file so I thought I’d return to the subject of one of my very first blog posts, although I’ll take a different tack this time and introduce it with a different, though related, example.

The topic is forensic statistics, which has been involved in some high-profile cases and which demonstrates how careful probabilistic reasoning is needed to understand scientific evidence. A good example is the use of DNA profiling evidence. Typically, this involves the comparison of two samples: one from an unknown source (evidence, such as blood or semen, collected at the scene of a crime) and a known or reference sample, such as a blood or saliva sample from a suspect. If the DNA profiles obtained from the two samples are indistinguishable then they are said to “match” and this evidence can be used in court as indicating that the suspect was in fact the origin of the sample.

In courtroom dramas, DNA matches are usually presented as being very definitive. In fact, the strength of the evidence varies very widely depending on the circumstances. If the DNA profile of the suspect or evidence consists of a combination of traits that is very rare in the population at large then the evidence can be very strong that the suspect was the contributor. If the DNA profile is not so rare then it becomes more likely that both samples match simply by chance. This probabilistic aspect makes it very important to understand the logic of the argument very carefully.

So how does it all work? A DNA profile is not a complete map of the entire genetic code contained within the cells of an individual, which would be such an enormous amount of information that it would be impractical to use it in court. Instead, a profile consists of a few (perhaps half-a-dozen) pieces of this information called alleles. An allele is one of the possible codings of DNA of the same gene at a given position (or locus) on one of the chromosomes in a cell. A single gene may, for example, determine the colour of the blossom produced by a flower; more often genes act in concert with other genes to determine the physical properties of an organism. The overall physical appearance of an individual organism, i.e. any of its particular traits, is called the phenotype and it is controlled, at least to some extent, by the set of alleles that the individual possesses. In the simplest cases, however, a single gene controls a given attribute. The gene that controls the colour of a flower will have different versions: one might produce blue flowers, another red, and so on. These different versions of a given gene are called alleles.

Some organisms contain two copies of each gene; these are said to be diploid. These copies can either be both the same, in which case the organism is homozygous, or different, in which case it is heterozygous; in the latter case it possesses two different alleles for the same gene. Phenotypes for a given allele may be either dominant or recessive (although not all are characterized in this way). For example, suppose the dominant and recessive alleles are called A and a, respectively. If a phenotype is dominant then the presence of one associated allele in the pair is sufficient for the associated trait to be displayed, i.e. AA, aA and Aa will all show the same phenotype. If it is recessive, both alleles must be of the type associated with that phenotype so only aa will lead to the corresponding traits being visible.

Now we get to the probabilistic aspect of this. Suppose we want to know what the frequency of an allele is in the population, which translates into the probability that it is selected when a random individual is extracted. The argument that is needed is essentially statistical. During reproduction, the offspring assemble their alleles from those of their parents. Suppose that the alleles for any given individual are chosen independently. If p is the frequency of the dominant gene and q is the frequency of the recessive one, then we can immediately write:

p+q =1

Using the product law for probabilities, and assuming independence, the probability of a homozygous dominant pairing (i.e. AA) is p^2, while that of the pairing aa is q^2. The probability of the heterozygotic outcome is 2pq (the two possibilities, each of probability pq, are Aa and aA). This leads to the result that

p^2 +2pq +q^2 =1

This is called the Hardy-Weinberg law. It can easily be extended to cases where there are more than two alleles, but I won’t go through the details here.

Now what we have to do is examine the DNA of a particular individual and see how it compares with what is known about the population. Suppose we take one locus to start with, and the individual turns out to be homozygotic: the two alleles at that locus are the same. In the population at large the frequency of that allele might be, say, 0.6. The probability that this combination arises “by chance” is therefore 0.6 times 0.6, or 0.36. Now move to the next locus, where the individual profile has two different alleles. The frequency of one is 0.25 and that of the other is 0.75, so the probability of the combination is “2pq”, which is 0.375. The probability of a match at both these loci is therefore 0.36 times 0.375, or 13.5%. The addition of further loci gradually refines the profile, so the corresponding probability reduces.
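Here is a minimal sketch of that calculation, using the same two hypothetical loci and Hardy-Weinberg frequencies as above; adding further loci to the list just multiplies in further factors.

```python
# Hypothetical allele frequencies for the two loci discussed above; each locus
# contributes p^2 if the profile is homozygous there, or 2pq if heterozygous.
profile = [
    ("homozygous", 0.6, 0.6),      # same allele twice, population frequency 0.6
    ("heterozygous", 0.25, 0.75),  # two different alleles, frequencies 0.25 and 0.75
]

match_probability = 1.0
for kind, p, q in profile:
    if kind == "homozygous":
        match_probability *= p * p        # Hardy-Weinberg: p^2
    else:
        match_probability *= 2 * p * q    # Hardy-Weinberg: 2pq

print(round(match_probability, 4))        # 0.135, i.e. 13.5% (0.36 times 0.375)
```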

This is a perfectly bona fide statistical argument, provided the assumptions made about population genetics are correct. Let us suppose that a profile of 7 loci – a typical number for the kind of profiling used in the courts – leads to a probability of one in ten thousand of a match for a “randomly selected” individual. Now suppose the profile of our suspect matches that of the sample left at the crime scene. This means that either the suspect left the trace there, or an unlikely coincidence happened: that, by a 1:10,000 chance, our suspect just happened to match the evidence.

This kind of result is often quoted in the newspapers as meaning that there is only a 1 in 10,000 chance that someone other than the suspect contributed the sample or, in other words, that the odds against the suspect being innocent are ten thousand to one against. Such statements are gross misrepresentations of the logic, but they have become so commonplace that they have acquired their own name: the Prosecutor’s Fallacy.

To see why this is a fallacy, i.e. why it is wrong, imagine that whatever crime we are talking about took place in a big city with 1,000,000 inhabitants. How many people in this city would have DNA that matches the profile? Answer: about 1 in 10,000 of them, which comes to 100. Our suspect is one. In the absence of any other information, the odds are therefore roughly 100:1 against him being guilty rather than 10,000:1 in favour. In realistic cases there will of course be additional evidence that excludes the other 99 potential suspects, so it is incorrect to claim that a DNA match actually provides evidence of innocence. This converse argument has been dubbed the Defence Fallacy, but nevertheless it shows that statements about probability need to be phrased very carefully if they are to be understood properly.
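The arithmetic is simple enough to write down explicitly. A minimal sketch, assuming a city of a million people, a random-match probability of 1 in 10,000 and no other evidence distinguishing the inhabitants:

```python
population = 1_000_000
match_prob = 1 / 10_000                      # random-match probability for the 7-locus profile

expected_matches = population * match_prob   # roughly 100 people in the city match the profile
p_source_given_match = 1 / expected_matches  # about 1 in 100, absent any other evidence

print(round(expected_matches))               # 100
print(round(p_source_given_match, 4))        # 0.01 -- nothing like 1 - 1/10,000
```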

All this brings me to the tragedy that I blogged about in 2008. In 1999, Mrs Sally Clark was tried and convicted for the murder of her two sons Christopher, who died aged 10 weeks in 1996, and Harry who was only eight weeks old when he died in 1998. Sudden infant deaths are sadly not as uncommon as one might have hoped: about one in eight thousand families experience such a nightmare. But what was unusual in this case was that after the second death in Mrs Clark’s family, the distinguished paediatrician Sir Roy Meadows was asked by the police to investigate the circumstances surrounding both her losses. Based on his report, Sally Clark was put on trial for murder. Sir Roy was called as an expert witness. Largely because of his testimony, Mrs Clark was convicted and sentenced to prison.

After much campaigning, she was released by the Court of Appeal in 2003. She was innocent all along. On top of the loss of her sons, the courts had deprived her of her liberty for four years. Sally Clark died in 2007 from alcohol poisoning, after having apparently taken to the bottle after three years of wrongful imprisonment. The whole episode was a tragedy and a disgrace to the legal profession.

I am not going to imply that Sir Roy Meadows bears sole responsibility for this fiasco, because there were many difficulties in Mrs Clark’s trial. One of the main issues raised on Appeal was that the pathologist working with the prosecution had failed to disclose evidence that Harry was suffering from an infection at the time he died. Nevertheless, what Professor Meadows said on oath was so shockingly stupid that he fully deserves the vilification with which he was greeted after the trial. Two other women had also been imprisoned in similar circumstances, as a result of his intervention.

At the core of the prosecution’s case was a probabilistic argument that would have been torn to shreds had any competent statistician been called to the witness box. Sadly, the defence counsel seemed to believe it as much as the jury did, and it was never rebutted. Sir Roy stated, correctly, that the odds of a baby dying of sudden infant death syndrome (or “cot death”) in an affluent, non-smoking family like Sally Clark’s were about 8,543 to one against. He then presented the probability of this happening twice in a family as being this number squared, or 73 million to one against. In the minds of the jury this became the odds against Mrs Clark being innocent of a crime.

That this argument was not effectively challenged at the trial is truly staggering.

Remember that the product rule for combining probabilities

P(AB)=P(A)P(B|A)

only reduces to

P(AB)=P(A)P(B)

if the two events A and B are independent, i.e. that the occurrence of one event has no effect on the probability of the other. Nobody knows for sure what causes cot deaths, but there is every reason to believe that there might be inherited or environmental factors that might cause such deaths to be more frequent in some families than in others. In other words, sudden infant deaths might be correlated rather than independent. Furthermore, there is data about the frequency of multiple infant deaths in families. The conditional frequency of a second such event following an earlier one is not one in eight thousand or so, it’s just one in 77. This is hard evidence that should have been presented to the jury. It wasn’t.
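To put numbers on the difference, here is a minimal sketch comparing the figure presented in court (which assumes independence) with one that uses the conditional frequency quoted above. The inputs are the figures given in the text, and of course neither output is the probability of Mrs Clark’s innocence – that would be the Prosecutor’s Fallacy again.

```python
# Figures quoted above, taken at face value for the purpose of the arithmetic.
p_first_death = 1 / 8_543          # chance of one cot death in a family like this
p_second_given_first = 1 / 77      # conditional frequency of a second death, given a first

naive = p_first_death ** 2                               # the figure presented in court
with_dependence = p_first_death * p_second_given_first   # allowing for correlation

print(f"assuming independence: 1 in {1 / naive:,.0f}")                 # 1 in 72,982,849
print(f"using the conditional rate: 1 in {1 / with_dependence:,.0f}")  # 1 in 657,811
```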

Note that this testimony counts as doubly-bad statistics. It not only deploys the Prosecutor’s Fallacy, but applies it to what was an incorrect calculation in the first place!

Defending himself, Professor Meadows tried to explain that he hadn’t really understood the statistical argument he was presenting, but was merely repeating for the benefit of the court something he had read, which turned out to have been in a report that had not even been published at the time of the trial. He said

To me it was like I was quoting from a radiologist’s report or a piece of pathology. I was quoting the statistics, I wasn’t pretending to be a statistician.

I always thought that expert witnesses were supposed to testify about those things that they were experts about, rather than subjecting the jury to second-hand flummery. Perhaps expert witnesses enjoy their status so much that they feel they can’t make mistakes about anything.

Subsequent to Mrs Clark’s release, Sir Roy Meadows was summoned to appear in front of a disciplinary tribunal at the General Medical Council. At the end of the hearing he was found guilty of serious professional misconduct, and struck off the medical register. Since he is retired anyway, this seems to me to be scant punishment. The judges and barristers who should have been alert to this miscarriage of justice have escaped censure altogether.

Although I am pleased that Professor Meadows has been disciplined in this fashion, I also hope that the General Medical Council does not think that hanging one individual out to dry will solve this problem. In addition, I think the politicians and legal system should look very hard at what went wrong in this case (and others of its type) to see how the probabilistic arguments that are essential in these days of forensic science can be properly incorporated in a rational system of justice. At the moment there is no agreed protocol for evaluating scientific evidence before it is presented to court. It is likely that such a protocol might have prevented the case of Mrs Clark from ever coming to trial. Scientists frequently seek the opinions of lawyers when they need to, but lawyers seem happy to handle scientific arguments themselves even when they don’t understand them at all.

I end with a quote from a press release produced by the Royal Statistical Society in the aftermath of this case:

Although many scientists have some familiarity with statistical methods, statistics remains a specialised area. The Society urges the Courts to ensure that statistical evidence is presented only by appropriately qualified statistical experts, as would be the case for any other form of expert evidence.

As far as I know, the criminal justice system has yet to implement such safeguards.



Spin, Entanglement and Quantum Weirdness

Posted in The Universe and Stuff on October 3, 2010 by telescoper

After writing a post about spinning cricket balls a while ago I thought it might be fun to post something about the role of spin in quantum mechanics.

Spin is a concept of fundamental importance in quantum mechanics, not least because it underlies our most basic theoretical understanding of matter. The standard model of particle physics divides elementary particles into two types, fermions and bosons, according to their spin.  One is tempted to think of  these elementary particles as little cricket balls that can be rotating clockwise or anti-clockwise as they approach an elementary batsman. But, as I hope to explain, quantum spin is not really like classical spin: batting would be even more difficult if quantum bowlers were allowed!

Take the electron, for example. The amount of spin an electron carries is quantized, so that it always has a magnitude of 1/2 (in units of Planck’s constant; all fermions have half-integer spin). In addition, according to quantum mechanics, the orientation of the spin is indeterminate until it is measured. Any particular measurement can only determine the component of spin in one direction. Let’s take as an example the case where the measuring device is sensitive to the z-component, i.e. spin in the vertical direction. The outcome of an experiment on a single electron will lead to a definite outcome which might be either “up” or “down” relative to this axis.

However, until one makes a measurement the state of the system is not specified and the outcome is consequently not predictable with certainty; there will be a probability of 50% for each possible outcome. We could write the state of the system (expressed by the spin part of its wavefunction ψ) prior to measurement in the form

|ψ> = (|↑> + |↓>)/√2

This gives me an excuse to use the rather beautiful “bra-ket” notation for the state of a quantum system, originally due to Paul Dirac. The two possibilities are “up” (↑) and “down” (↓) and they are contained within a “ket” (written |>), which is really just a shorthand for a wavefunction describing that particular aspect of the system. A “bra” would be of the form <|; for the mathematicians this represents the Hermitian conjugate of a ket. The √2 is there to ensure that the total probability of the spin being either up or down is 1, remembering that the probability is the square of the wavefunction. When we make a measurement we will get one of these two outcomes, with a 50% probability of each.

At the point of measurement the state changes: if we get “up” it becomes purely |↑>  and if the result is  “down” it becomes |↓>. Either way, the quantum state of the system has changed from a “superposition” state described by the equation above to an “eigenstate” which must be either up or down. This means that all subsequent measurements of the spin in this direction will give the same result: the wave-function has “collapsed” into one particular state. Incidentally, the general term for a two-state quantum system like this is a qubit, and it is the basis of the tentative steps that have been taken towards the construction of a quantum computer.

Notice that what is essential about this is the role of measurement. The collapse of  ψ seems to be an irreversible process, but the wavefunction itself evolves according to the Schrödinger equation, which describes reversible, Hamiltonian changes.  To understand what happens when the state of the wavefunction changes we need an extra level of interpretation beyond what the mathematics of quantum theory itself provides,  because we are generally unable to write down a wave-function that sensibly describes the system plus the measuring apparatus in a single form.

So far this all seems rather similar to the state of a fair coin: it has a 50-50 chance of being heads or tails, but the doubt is resolved when its state is actually observed. Thereafter we know for sure what it is. But this resemblance is only superficial. A coin only has heads or tails, but the spin of an electron doesn’t have to be just up or down. We could rotate our measuring apparatus by 90° and measure the spin to the left (←) or the right (→). In this case we still have to get a result which is a half-integer times Planck’s constant. It will have a 50-50 chance of being left or right that “becomes” one or the other when a measurement is made.

Now comes the real fun. Suppose we do a series of measurements on the same electron. First we start with an electron whose spin we know nothing about. In other words it is in a superposition state like that shown above. We then make a measurement in the vertical direction. Suppose we get the answer “up”. The electron is now in the eigenstate with spin “up”.

We then pass it through another measurement, but this time it measures the spin to the left or the right. The process of selecting the electron to be one with  spin in the “up” direction tells us nothing about whether the horizontal component of its spin is to the left or to the right. Theory thus predicts a 50-50 outcome of this measurement, as is observed experimentally.

Suppose we do such an experiment and establish that the electron’s spin vector is pointing to the left. Now our long-suffering electron passes into a third measurement which this time is again in the vertical direction. You might imagine that since we have already measured this component to be in the up direction, it would be in that direction again this time. In fact, this is not the case. The intervening measurement seems to “reset” the up-down component of the spin; the results of the third measurement are back at square one, with a 50-50 chance of getting up or down.
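Here is a minimal sketch of that sequence of measurements, simulating the Born rule with a random number generator. The spin-1/2 states and bases are the standard ones; the particular labels and the 20,000-trial count are just choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

up, down = np.array([1, 0], complex), np.array([0, 1], complex)   # z (vertical) basis
right = (up + down) / np.sqrt(2)                                  # x (horizontal) basis
left = (up - down) / np.sqrt(2)

def measure(state, basis):
    """Born rule: pick a basis state with probability |<b|state>|^2 and collapse onto it."""
    probs = np.array([abs(np.vdot(b, state)) ** 2 for b in basis])
    outcome = rng.choice(len(basis), p=probs / probs.sum())
    return outcome, basis[outcome]

n_selected = n_up_again = 0
for _ in range(20_000):
    psi = (up + down) / np.sqrt(2)          # electron in a superposition of up and down
    first, psi = measure(psi, [up, down])   # first vertical measurement
    if first != 0:                          # keep only the electrons found to be "up"
        continue
    n_selected += 1
    _, psi = measure(psi, [left, right])    # intervening horizontal measurement
    final, _ = measure(psi, [up, down])     # second vertical measurement
    n_up_again += (final == 0)

print(n_up_again / n_selected)              # ~0.5, not 1.0: the horizontal measurement "resets" it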

This is just one example of the kind of irreducible “randomness” that seems to be inherent in quantum theory. However, if you think this is what people mean when they say quantum mechanics is weird, you’re quite mistaken. It gets much weirder than this! So far I have focussed on what happens to the description of single particles when quantum measurements are made. Although there seem to be subtle things going on, it is not really obvious that anything happening is very different from systems in which we simply lack the microscopic information needed to make a prediction with absolute certainty.

At the simplest level, the difference is that quantum mechanics gives us a theory for the wave-function which somehow lies at a more fundamental level of description than the usual way we think of probabilities. Probabilities can be derived mathematically from the wave-function, but there is more information in ψ than there is in |ψ|^2; the wave-function is a complex entity whereas the square of its amplitude is entirely real. If one can construct a system of two particles, for example, the resulting wave-function is obtained by superimposing the wave-functions of the individual particles, and probabilities are then obtained by squaring this joint wave-function. This will not, in general, give the same probability distribution as one would get by adding the one-particle probabilities because, for complex entities A and B,

A^2 + B^2 ≠ (A+B)^2

in general. To put this another way, one can write any complex number in the form a+ib (real part plus imaginary part) or, generally more usefully in physics, as Re^{iθ}, where R is the amplitude and θ is called the phase. The square of the amplitude gives the probability associated with the wavefunction of a single particle, but in this case the phase information disappears; the truly unique character of quantum physics and how it impacts on probabilities of measurements only reveals itself when the phase information is retained. This generally requires two or more particles to be involved, as the absolute phase of a single-particle state is essentially impossible to measure.
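The point about phases is easy to see numerically. A minimal sketch, with two made-up amplitudes of equal modulus but opposite phase:

```python
import cmath

# Two hypothetical amplitudes with the same modulus but different phases.
A = cmath.rect(1.0, 0.0)          # amplitude 1, phase 0
B = cmath.rect(1.0, cmath.pi)     # amplitude 1, phase pi

print(abs(A)**2 + abs(B)**2)      # 2.0 -- add the probabilities, phases ignored
print(abs(A + B)**2)              # ~0.0 -- add the amplitudes first, and the phases interfere
```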

Finding situations where the quantum phase of a wave-function is important is not easy. It seems to be quite easy to disturb quantum systems in such a way that the phase information becomes scrambled, so testing the fundamental aspects of quantum theory requires considerable experimental ingenuity. But it has been done, and the results are astonishing.

Let us think about a very simple example of a two-component system: a pair of electrons. All we care about for the purpose of this experiment is the spin of the electrons so let us write the state of this system in terms of states such as |↑↓>, which I take to mean that the first particle has spin up and the second one has spin down. Suppose we can create this pair of electrons in a state where we know the total spin is zero. The electrons are indistinguishable from each other so until we make a measurement we don’t know which one is spinning up and which one is spinning down. The state of the two-particle system might be this:

|ψ> = (|↑↓> – |↓↑>)/√2

Squaring this up would give a 50% probability of “particle one” being up and “particle two” being down and 50% for the contrary arrangement. This doesn’t look too different from the example I discussed above, but this duplex state exhibits a bizarre phenomenon known as quantum entanglement.

Suppose we start the system out in this state and then separate the two electrons without disturbing their spin states. Before making a measurement we really can’t say what the spins of the individual particles are: they are in a mixed state that is neither up nor down but a combination of the two possibilities. When they’re up, they’re up. When they’re down, they’re down. But when they’re only half-way up they’re in an entangled state.

If one of them passes through a vertical spin-measuring device we will then know that particle is definitely spin-up or definitely spin-down. Since we know the total spin of the pair is zero, then we can immediately deduce that the other one must be spinning in the opposite direction because we’re not allowed to violate the law of conservation of angular momentum: if Particle 1 turns out to be spin-up, Particle 2  must be spin-down, and vice versa. It is known experimentally that passing two electrons through identical spin-measuring gadgets gives  results consistent with this reasoning. So far there’s nothing so very strange in this.

The problem with entanglement lies in understanding what happens in reality when a measurement is done. Suppose we have two observers, Dick and Harry, each equipped with a device that can measure the spin of an electron in any direction they choose. Particle 1 emerges from the source and travels towards Dick whereas particle 2 travels in Harry’s direction. Before any measurement, the system is in an entangled superposition state. Suppose Dick decides to measure the spin of electron 1 in the z-direction and finds it spinning up. Immediately, the wave-function for electron 2 collapses into the down direction. If Dick had instead decided to measure spin in the left-right direction and found it “left”, a similar collapse would have occurred for particle 2, but this time putting it in the “right” direction.

Whatever Dick does, the result of any corresponding measurement made by Harry has a definite outcome – the opposite to Dick’s result. So Dick’s decision whether to make a measurement up-down or left-right instantaneously transmits itself to Harry who will find a consistent answer, if he makes the same measurement as Dick.

If, on the other hand, Dick makes an up-down measurement but Harry measures left-right then Dick’s answer has no effect on Harry, who has a 50% chance of getting “left” and 50% chance of getting right. The point is that whatever Dick decides to do, it has an immediate effect on the wave-function at Harry’s position; the collapse of the wave-function induced by Dick immediately collapses the state measured by Harry. How can particle 1 and particle 2 communicate in this way?

This riddle is the core of a thought experiment by Einstein, Podolsky and Rosen in 1935 which has deep implications for the nature of the information that is supplied by quantum mechanics. The essence of the EPR paradox is that each of the two particles – even if they are separated by huge distances – seems to know exactly what the other one is doing. Einstein called this “spooky action at a distance” and went on to point out that this type of thing simply could not happen in the usual calculus of random variables. His argument was later tightened considerably by John Bell in a form now known as Bell’s theorem.

To see how Bell’s theorem works, consider the following roughly analogous situation. Suppose we have two suspects in prison, say Dick and Harry (Tom grassed them up and has been granted immunity from prosecution). The two are taken apart to separate cells for individual questioning. We can allow them to use notes, electronic organizers, tablets of stone or anything to help them remember any agreed strategy they have concocted, but they are not allowed to communicate with each other once the interrogation has started. Each question they are asked has only two possible answers – “yes” or “no” – and there are only three possible questions. We can assume the questions are asked independently and in a random order to the two suspects.

When the questioning is over, the interrogators find that whenever they asked the same question, Dick and Harry always gave the same answer, but when the question was different they only gave the same answer 25% of the time. What can the interrogators conclude?

The answer is that Dick and Harry must be cheating. Either they have seen the question list ahead of time or are able to communicate with each other without the interrogator’s knowledge. If they always give the same answer when asked the same question, they must have agreed on answers to all three questions in advance. But when they are asked different questions then, because each question has only two possible responses, by following this strategy it must turn out that at least two of the three prepared answers – and possibly all of them – must be the same for both Dick and Harry. This puts a lower limit on the probability of them giving the same answer to different questions. I’ll leave it as an exercise to the reader to show that the probability of coincident answers to different questions in this case must be at least 1/3.
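If you don’t fancy doing the exercise by hand, the bound is easy to verify by brute force. A minimal sketch that enumerates all eight possible pre-agreed answer sheets:

```python
from itertools import product

# A deterministic strategy that always agrees on identical questions amounts to a shared
# answer sheet: one yes/no answer fixed in advance for each of the three questions.
pairs_of_different_questions = [(0, 1), (0, 2), (1, 2)]

agreement_rates = [
    sum(sheet[i] == sheet[j] for i, j in pairs_of_different_questions) / 3
    for sheet in product([True, False], repeat=3)
]

print(min(agreement_rates))   # 1/3 = 0.333..., never as low as the 25% the interrogators observed
```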

This is a simple illustration of what in quantum mechanics is known as a Bell inequality. Dick and Harry can only keep the number of such false agreements down to the measured level of 25% by cheating.

This example is directly analogous to the behaviour of the entangled quantum state described above under repeated interrogations about its spin in three different directions. The result of each measurement can only be either “yes” or “no”. Each individual answer (for each particle) is equally probable in this case; the same question always produces the same answer for both particles, but the probability of agreement for two different questions is indeed ¼ and not larger as would be expected if the answers were random. For example one could ask particle 1 “are you spinning up” and particle 2 “are you spinning to the right”? The probability of both producing an answer “yes” is 25% according to quantum theory but would be higher if the particles weren’t cheating in some way.

Probably the most famous experiment of this type was done in the 1980s, by Alain Aspect and collaborators, involving entangled pairs of polarized photons (which are bosons), rather than electrons, primarily because these are easier to prepare.

The implications of quantum entanglement greatly troubled Einstein long before the EPR paradox. Indeed the interpretation of single-particle quantum measurement (which has no entanglement) was already troublesome. Just exactly how does the wave-function relate to the particle? What can one really say about the state of the particle before a measurement is made? What really happens when a wave-function collapses? These questions take us into philosophical territory that I have set foot in already; the difficult relationship between epistemological and ontological uses of probability theory.

Thanks largely to the influence of Niels Bohr, in the relatively early stages of quantum theory a standard approach to this question was adopted. In what became known as the  Copenhagen interpretation of quantum mechanics, the collapse of the wave-function as a result of measurement represents a real change in the physical state of the system. Before the measurement, an electron really is neither spinning up nor spinning down but in a kind of quantum purgatory. After a measurement it is released from limbo and becomes definitely something. What collapses the wave-function is something unspecified to do with the interaction of the particle with the measuring apparatus or, in some extreme versions of this doctrine, the intervention of human consciousness.

I find it amazing that such a view could have been held so seriously by so many highly intelligent people. Schrödinger hated this concept so much that he invented a thought-experiment of his own to poke fun at it. This is the famous “Schrödinger’s cat” paradox. I’ve sent Columbo out of the room while I describe this.

In a closed box there is a cat. Attached to the box is a device which releases poison into the box when triggered by a quantum-mechanical event, such as radiation produced by the decay of a radioactive substance. One can’t tell from the outside whether the poison has been released or not, so one doesn’t know whether the cat is alive or dead. When one opens the box, one learns the truth. Whether the cat has collapsed or not, the wave-function certainly does. At this point one is effectively making a quantum measurement so the wave-function of the cat is either “dead” or “alive” but before opening the box it must be in a superposition state. But do we really think the cat is neither dead nor alive? Isn’t it certainly one or the other, but that our lack of information prevents us from knowing which? And if this is true for a macroscopic object such as a cat, why can’t it be true for a microscopic system, such as that involving just a pair of electrons?

As I learned at a talk by the Nobel prize-winning physicist Tony Leggett – who has been collecting data on this recently – most physicists think Schrödinger’s cat is definitely alive or dead before the box is opened. However, most physicists don’t believe that an electron definitely spins either up or down before a measurement is made. But where does one draw the line between the microscopic and macroscopic descriptions of reality? If quantum mechanics works for 1 particle, does it work also for 10, 1000? Or, for that matter, 1023?

Most modern physicists eschew the Copenhagen interpretation in favour of one or other of two modern interpretations. One involves the concept of quantum decoherence, which is basically the idea that the phase information that is crucial to the underlying logic of quantum theory can be destroyed by the interaction of a microscopic system with one of larger size. In effect, this hides the quantum nature of macroscopic systems and allows us to use a more classical description for complicated objects. This certainly happens in practice, but this idea seems to me merely to defer the problem of interpretation rather than solve it. The fact that a large and complex system tends to hide its quantum nature from us does not in itself give us the right to have different interpretations of the wave-function for big things and for small things.

Another trendy way to think about quantum theory is the so-called Many-Worlds interpretation. This asserts that our Universe comprises an ensemble – sometimes called a multiverse – and  probabilities are defined over this ensemble. In effect when an electron leaves its source it travels through infinitely many paths in this ensemble of possible worlds, interfering with itself on the way. We live in just one slice of the multiverse so at the end we perceive the electron winding up at just one point on our screen. Part of this is to some extent excusable, because many scientists still believe that one has to have an ensemble in order to have a well-defined probability theory. If one adopts a more sensible interpretation of probability then this is not actually necessary; probability does not have to be interpreted in terms of frequencies. But the many-worlds brigade goes even further than this. They assert that these parallel universes are real. What this means is not completely clear, as one can never visit parallel universes other than our own …

It seems to me that none of these interpretations is at all satisfactory and, in the gap left by the failure to find a sensible way to understand “quantum reality”, there has grown a pathological industry of pseudo-scientific gobbledegook. Claims that entanglement is consistent with telepathy, that parallel universes are scientific truths, and that consciousness is a quantum phenomenon abound in the New Age sections of bookshops but have no rational foundation. Physicists may complain about this, but they have only themselves to blame.

But there is one remaining possibility for an interpretation of quantum theory that has been unfairly neglected by quantum theorists despite – or perhaps because of – the fact that it is the closest of all to commonsense. This is the view that quantum mechanics is just an incomplete theory, and that the reason it produces only a probabilistic description is that it does not provide sufficient information to make definite predictions. This line of reasoning has a distinguished pedigree, but fell out of favour after the arrival of Bell’s theorem and related issues. Early ideas on this theme revolved around the idea that particles could carry “hidden variables” whose behaviour we could not predict because our fundamental description is inadequate. In other words two apparently identical electrons are not really identical; something we cannot directly measure marks them apart. If this works then we can simply use probability theory to deal with inferences made on the basis of information that’s not sufficient for absolute certainty.

After Bell’s work, however, it became clear that these hidden variables must possess a very peculiar property if they are to describe our quantum world. The property of entanglement requires the hidden variables to be non-local. In other words, two electrons must be able to communicate their values faster than the speed of light. Putting this conclusion together with relativity leads one to deduce that the chain of cause and effect must break down: hidden variables are therefore acausal. This is such an unpalatable idea that it seems to many physicists to be even worse than the alternatives, but to me it seems entirely plausible that the causal structure of space-time must break down at some level. On the other hand, not all “incomplete” interpretations of quantum theory involve hidden variables.

One can think of this category of interpretation as involving an epistemological view of quantum mechanics. The probabilistic nature of the theory has, in some sense, a subjective origin. It represents deficiencies in our state of knowledge. The alternative Copenhagen and Many-Worlds views I discussed above differ greatly from each other, but each is characterized by the mistaken desire to put quantum mechanics – and, therefore, probability –  in the realm of ontology.

The idea that quantum mechanics might be incomplete  (or even just fundamentally “wrong”) does not seem to me to be all that radical. Although it has been very successful, there are sufficiently many problems of interpretation associated with it that perhaps it will eventually be replaced by something more fundamental, or at least different. Surprisingly, this is a somewhat heretical view among physicists: most, including several Nobel laureates, seem to think that quantum theory is unquestionably the most complete description of nature we will ever obtain. That may be true, of course. But if we never look any deeper we will certainly never know…

With the gradual re-emergence of Bayesian approaches in other branches of physics a number of important steps have been taken towards the construction of a truly inductive interpretation of quantum mechanics. This programme sets out to understand  probability in terms of the “degree of belief” that characterizes Bayesian probabilities. Recently, Christopher Fuchs, amongst others, has shown that, contrary to popular myth, the role of probability in quantum mechanics can indeed be understood in this way and, moreover, that a theory in which quantum states are states of knowledge rather than states of reality is complete and well-defined. I am not claiming that this argument is settled, but this approach seems to me by far the most compelling and it is a pity more people aren’t following it up…



The Seven Year Itch

Posted in Bad Statistics, Cosmic Anomalies, The Universe and Stuff on January 27, 2010 by telescoper

I was just thinking last night that it’s been a while since I posted anything in the file marked cosmic anomalies, and this morning I woke up to find a blizzard of papers on the arXiv from the Wilkinson Microwave Anisotropy Probe (WMAP) team. These relate to an analysis of the latest data accumulated now over seven years of operation; a full list of the papers is given here.

I haven’t had time to read all of them yet, but I thought it was worth drawing attention to the particular one that relates to the issue of cosmic anomalies. I’ve taken the liberty of including the abstract here:

A simple six-parameter LCDM model provides a successful fit to WMAP data, both when the data are analyzed alone and in combination with other cosmological data. Even so, it is appropriate to search for any hints of deviations from the now standard model of cosmology, which includes inflation, dark energy, dark matter, baryons, and neutrinos. The cosmological community has subjected the WMAP data to extensive and varied analyses. While there is widespread agreement as to the overall success of the six-parameter LCDM model, various “anomalies” have been reported relative to that model. In this paper we examine potential anomalies and present analyses and assessments of their significance. In most cases we find that claimed anomalies depend on posterior selection of some aspect or subset of the data. Compared with sky simulations based on the best fit model, one can select for low probability features of the WMAP data. Low probability features are expected, but it is not usually straightforward to determine whether any particular low probability feature is the result of the a posteriori selection or of non-standard cosmology. We examine in detail the properties of the power spectrum with respect to the LCDM model. We examine several potential or previously claimed anomalies in the sky maps and power spectra, including cold spots, low quadrupole power, quadrupole-octupole alignment, hemispherical or dipole power asymmetry, and quadrupole power asymmetry. We conclude that there is no compelling evidence for deviations from the LCDM model, which is generally an acceptable statistical fit to WMAP and other cosmological data.

Since I’m one of those annoying people who have been sniffing around the WMAP data for signs of departures from the standard model, I thought I’d comment on this issue.

As the abstract says, the  LCDM model does indeed provide a good fit to the data, and the fact that it does so with only 6 free parameters is particularly impressive. On the other hand, this modelling process involves the compression of an enormous amount of data into just six numbers. If we always filter everything through the standard model analysis pipeline then it is possible that some vital information about departures from this framework might be lost. My point has always been that every now and again it is worth looking in the wastebasket to see if there’s any evidence that something interesting might have been discarded.

Various potential anomalies – mentioned in the above abstract – have been identified in this way, but usually there has turned out to be less to them than meets the eye. There are two reasons not to get too carried away.

The first reason is that no experiment – not even one as brilliant as WMAP – is entirely free from systematic artefacts. Before we get too excited and start abandoning our standard model for more exotic cosmologies, we need to be absolutely sure that we’re not just seeing residual foregrounds, instrument errors, beam asymmetries or some other effect that isn’t anything to do with cosmology. Because it has performed so well, WMAP has been able to do much more science than was originally envisaged, but every experiment is ultimately limited by its own systematics and WMAP is no different. There is some (circumstantial) evidence that some of the reported anomalies may be at least partly accounted for by  glitches of this sort.

The second point relates to basic statistical theory. Generally speaking, an anomaly A (some property of the data) is flagged as such because it is deemed to be improbable given a model M (in this case the LCDM). In other words the conditional probability P(A|M) is a small number. As I’ve repeatedly ranted about in my bad statistics posts, this does not necessarily mean that P(M|A)- the probability of the model being right – is small. If you look at 1000 different properties of the data, you have a good chance of finding something that happens with a probability of 1 in a thousand. This is what the abstract means by a posteriori reasoning: it’s not the same as talking out of your posterior, but is sometimes close to it.
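The point is easy to demonstrate with fake data. A minimal sketch that draws a thousand p-values from a model that is correct by construction and then goes hunting for “anomalies”:

```python
import numpy as np

rng = np.random.default_rng(7)

# 1000 independent test statistics drawn from a model that is true by construction,
# expressed as p-values (uniform on [0, 1] under the model).
p_values = rng.uniform(size=1000)

print(np.sum(p_values < 1e-3))   # typically ~1 "one-in-a-thousand anomaly", with nothing wrong at all
print(p_values.min())            # the most "anomalous" statistic you could choose to report
```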

In order to decide how seriously to take an anomaly, you need to work out P(M|A), the probability of the model given the anomaly, which requires that you not only take into account all the other properties of the data that are explained by the model (i.e. those that aren’t anomalous), but also specify an alternative model that explains the anomaly better than the standard model. If you do this, without introducing too many free parameters, then this may be taken as compelling evidence for an alternative model. No such model exists – at least for the time being – so the message of the paper is rightly skeptical.

So, to summarize, I think what the WMAP team say is basically sensible, although I maintain that rummaging around in the trash is a good thing to do. Models are there to be tested and surely the best way to test them is to focus on things that look odd rather than simply congratulating oneself about the things that fit? It is extremely impressive that such intense scrutiny over the last seven years has revealed so few oddities, but that just means that we should look even harder…

Before too long, data from Planck will provide an even sterner test of the standard framework. We really do need an independent experiment to see whether there is something out there that WMAP might have missed. But we’ll have to wait a few years for that.

So far it’s WMAP 7 Planck 0, but there’s plenty of time for an upset. Unless they close us all down.

A Dutch Book

Posted in Bad Statistics on October 28, 2009 by telescoper

When I was a research student at Sussex University I lived for a time in Hove, close to the local Greyhound track. I soon discovered that going to the dogs could be both enjoyable and instructive. The card for an evening would usually consist of ten races, each involving six dogs. It didn’t take long for me to realise that it was quite boring to watch the greyhounds unless you had a bet, so I got into the habit of making small investments on each race. In fact, my usual bet would involve trying to predict both first and second place, the kind of combination bet which has longer odds and therefore generally has a better return if you happen to get it right.

[Image: the giant Tote board at the greyhound track]

The simplest way to bet is through a totalising pool system (called “The Tote”) in which the return on a successful bet  is determined by how much money has been placed on that particular outcome; the higher the amount staked, the lower the return for an individual winner. The Tote accepts very small bets, which suited me because I was an impoverished student in those days. The odds at any particular time are shown on the giant Tote Board you can see in the picture above.

However, every now and again I would place bets with one of the independent trackside bookies who set their own odds. Here the usual bet is for one particular dog to win, rather than on 1st/2nd place combinations. Sometimes these odds were much more generous than those that were showing on the Tote Board so I gave them a go. When bookies offer long odds, however, it’s probably because they know something the punters don’t and I didn’t win very often.

I often watched the bookmakers in action, chalking the odds up, sometimes lengthening them to draw in new bets or sometimes shortening them to discourage bets if they feared heavy losses. It struck me that they have to be very sharp when they change odds in this way because it’s quite easy to make a mistake that might result in a combination bet guaranteeing a win for a customer.

With six possible winners it takes a while to work out if there is such a strategy but to explain what I mean consider a race with three competitors. The bookie assigns odds as follows: (1) even money; (2) 3/1 against; and (3) 4/1 against. The quoted odds imply probabilities to win of 50% (1 in 2), 25% (1 in 4) and 20% (1 in 5) respectively.

Now suppose you place three different bets: £100 on (1) to win, £50 on (2) and £40 on (3). Your total stake is then £190. If (1) succeeds you win £100 and also get your stake back; you lose the other stakes, but you have turned £190 into £200 so are up £10 overall. If (2) wins you also come out with £200: your £50 stake plus £150 for the bet. Likewise if (3) wins. You win whatever the outcome of the race. It’s not a question of being lucky, just that the odds have been designed inconsistently.
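Here is a minimal sketch of that check, using the odds and stakes described above: the implied probabilities add up to less than one, and the bets return £200 whichever dog wins.

```python
# Odds quoted by the hypothetical bookie: "evens" pays 1/1, then 3/1 and 4/1 against.
odds = {1: 1, 2: 3, 3: 4}
stakes = {1: 100, 2: 50, 3: 40}   # the stakes (in pounds) described above

implied_probabilities = {dog: 1 / (o + 1) for dog, o in odds.items()}
print(sum(implied_probabilities.values()))   # 0.95 < 1, so a Dutch Book is possible

total_staked = sum(stakes.values())          # £190
for winner in odds:
    returned = stakes[winner] * (odds[winner] + 1)               # winnings plus the returned stake
    print(f"dog {winner} wins: profit £{returned - total_staked}")   # £10 every time
```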

I stress that I never saw a bookie actually do this. If one did, he’d soon go out of business. An inconsistent set of odds like this is called a Dutch Book, and a bet which guarantees the bettor a positive return is often called a lock. It’s also the principle behind many share-trading schemes based on the idea of arbitrage.

It was only much  later I realised that there is a nice way of turning the Dutch Book argument around to derive the laws of probability from the principle that the odds be consistent, i.e. so that they do not lead to situations where a Dutch Book arises.

To see this, I’ll just generalise the above discussion a bit. Imagine you are a gambler interested in betting on the outcome of some event. If the game is fair, you would expect to pay a stake px to win an amount x if the probability of the winning outcome is p.

Now  imagine that there are several possible outcomes, each with different probabilities, and you are allowed to bet a different amount on each of them. Clearly, the bookmaker has to be careful that there is no combination of bets that guarantees that you (the punter) will win.

Now consider a specific example. Suppose there are three possible outcomes; call them A, B, and C. Your bookie will accept the following bets: a bet on A with a payoff x_A, for which the stake is P_A x_A; a bet on B for which the return is x_B and the stake P_B x_B; and a bet on C with stake P_C x_C and payoff x_C.

Think about what happens in the special case where the events A and B are mutually exclusive (which just means that they can’t both happen) and C is just given by  A “OR” B, i.e. the event that either A or B happens. There are then three possible outcomes.

First, if A happens but B does not happen the net return to the gambler is

R=x_A(1-P_A)-x_B P_B+x_C(1-P_C).

The first term represents the difference between the stake and the return for the successful bet on A, the second is the lost stake corresponding to the failed bet on the event B, and the third term arises from the successful bet on C. The bet on C succeeds because if A happens then A”OR”B must happen too.

Alternatively, if B happens but A does not happen, the net return is

R=-x_A P_A +x_B(1-P_B)+x_C(1-P_C),

in a similar way to the previous result except that the bet on A loses, while those on B and C succeed.

Finally there is the possibility that neither A nor B succeeds: in this case the gambler does not win at all, and the return (which is bound to be negative) is

R=-x_AP_A-x_BP_B -x_C P_C.

Notice that A and B can’t both happen because I have assumed that they are mutually exclusive. The three returns above are linear in the stakes, so if the matrix of coefficients multiplying x_A, x_B and x_C were non-singular the punter could choose stakes that make all three returns positive, i.e. construct a Dutch Book. For the game to be consistent (in the sense I’ve discussed above) we therefore need to have

\textrm{det} \left( \begin{array}{ccc} 1- P_A & -P_B & 1-P_C \\ -P_A & 1-P_B & 1-P_C\\ -P_A & -P_B & -P_C \end{array} \right)=P_A+P_B-P_C=0.

This means that

P_C=P_A+P_B

so, since C is the event A “OR” B, this means that the probability of A “OR” B for two mutually exclusive events A and B is the sum of the separate probabilities of A and B. This is usually taught as one of the axioms from which the calculus of probabilities is derived, but what this discussion shows is that it can itself be derived from the principle of consistency. It is the only way to combine probabilities that is consistent from the point of view of betting behaviour. Similar logic leads to the other rules of probability, including those for events which are not mutually exclusive.
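As a quick sanity check on the algebra, here is a short symbolic computation (a sketch using the sympy library, assuming it is installed) that evaluates the determinant above and confirms it reduces to P_A + P_B - P_C.

import sympy as sp

PA, PB, PC = sp.symbols('P_A P_B P_C')

# Matrix of the coefficients of the stakes (x_A, x_B, x_C) in the three returns
M = sp.Matrix([
    [1 - PA,   -PB, 1 - PC],   # A happens, B does not
    [  -PA, 1 - PB, 1 - PC],   # B happens, A does not
    [  -PA,   -PB,   -PC]])    # neither A nor B happens

print(sp.expand(M.det()))      # prints P_A + P_B - P_C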

Notice that this kind of consistency has nothing to do with averages over a long series of repeated bets: if the rules are violated then the game itself is rigged.

A much more elegant and complete derivation of the laws of probability has been set out by Cox, but I find the Dutch Book argument a  nice practical way to illustrate the important difference between being unlucky and being irrational.

P.S. For legal reasons I should point out that, although I was a research student at the University of Sussex, I do not have a PhD. My doctorate is a DPhil.

Ergodic Means…

Posted in The Universe and Stuff with tags , , , , , , on October 19, 2009 by telescoper

The topic of this post is something I’ve been wondering about for quite a while. This afternoon I had half an hour spare after a quick lunch so I thought I’d look it up and see what I could find.

The word ergodic is one you will come across very frequently in the literature of statistical physics, and in cosmology it also appears in discussions of the analysis of the large-scale structure of the Universe. I’ve long been puzzled as to where it comes from and what it actually means. Turning to the excellent Oxford English Dictionary Online, I found the answer to the first of these questions. Well, sort of. Under etymology we have

ad. G. ergoden (L. Boltzmann 1887, in Jrnl. f. d. reine und angewandte Math. C. 208), f. Gr.

I say “sort of” because it does attribute the origin of the word to Ludwig Boltzmann, but the Greek roots (εργον and οδος) appear to suggest it means “workway” or something like that. I don’t think I follow an ergodic path on my way to work so it remains a little mysterious.

The actual definitions of ergodic given by the OED are

Of a trajectory in a confined portion of space: having the property that in the limit all points of the space will be included in the trajectory with equal frequency. Of a stochastic process: having the property that the probability of any state can be estimated from a single sufficiently extensive realization, independently of initial conditions; statistically stationary.

As I had expected, it has two meanings which are related, but which apply in different contexts. The first is to do with paths or orbits, although in physics this is usually taken to mean trajectories in phase space (including both positions and velocities) rather than just three-dimensional position space. However, I don’t think the OED has got it right in saying that the system visits all positions with equal frequency. I think an ergodic path is one that must visit all positions within a given volume of phase space rather than being confined to a lower-dimensional piece of that space. For example, the path of a planet under the inverse-square law of gravity around the Sun is confined to a one-dimensional ellipse. If the force law is modified by external perturbations then the path need not be as regular as this, in extreme cases wandering around in such a way that it never joins back on itself but eventually visits all accessible locations. As far as my understanding goes, however, it doesn’t have to visit them all with equal frequency. The ergodic property of orbits is intimately associated with the presence of chaotic dynamical behaviour.

The other definition relates to stochastic processes, i.e. processes involving some sort of random component. These could either consist of a discrete collection of random variables {X_1, …, X_n} (which may or may not be correlated with each other) or a continuously fluctuating function X(t) of some parameter such as time t, or of spatial position, or perhaps both.

Stochastic processes are quite complicated measure-valued mathematical entities because they are specified by probability distributions. What the ergodic hypothesis means in the second sense is that measurements extracted from a single realization of such a process have a definite relationship to analogous quantities defined by the probability distribution.

I always think of a stochastic process being like a kind of algorithm (whose workings we don’t know). Put it on a computer, press “go” and it spits out a sequence of numbers. The ergodic hypothesis means that by examining a sufficiently long run of the output we could learn something about the properties of the algorithm.

An alternative way of thinking about this for those of you of a frequentist disposition is that the probability average is taken over some sort of statistical ensemble of possible realizations produced by the algorithm, and this must match the appropriate long-term average taken over one realization.

This is actually quite a deep concept and it can apply (or not) in various degrees.  A simple example is to do with properties of the mean value. Given a single run of the program over some long time T we can compute the sample average

\bar{X}_T \equiv \frac{1}{T} \int_0^T x(t)\, dt

The probability average, on the other hand, is defined using the probability distribution itself, which we can call p(x):

\langle X \rangle \equiv \int x p(x) dx

If these two are equal for sufficiently long runs, i.e. as T goes to infinity, then the process is said to be ergodic in the mean. A process could, however, be ergodic in the mean but not ergodic with respect to some other property of the distribution, such as the variance. Strict ergodicity would require that the entire frequency distribution defined from a long run should match the probability distribution to some accuracy.
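To make this concrete, here is a small numerical sketch in Python (all the parameter values are arbitrary choices for illustration) comparing the time average of one long realization of a stationary AR(1) process with an average over an ensemble of independent realizations. For this process the two agree, which is just ergodicity in the mean.

import numpy as np

rng = np.random.default_rng(42)
phi, sigma = 0.9, 1.0                  # AR(1): X_{n+1} = phi * X_n + noise, true mean 0

def realization(n_steps):
    """One realization of the stationary AR(1) process."""
    x = np.empty(n_steps)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))   # start in the stationary state
    for n in range(1, n_steps):
        x[n] = phi * x[n - 1] + rng.normal(0.0, sigma)
    return x

time_avg = realization(100_000).mean()                                # one long run
ensemble_avg = np.mean([realization(500)[-1] for _ in range(2_000)])  # many short runs

print(time_avg, ensemble_avg)          # both come out close to the true mean of zero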

Now we have a problem with the OED again. According to the definition quoted above, ergodic can be taken to mean statistically stationary. Actually, that’s not true.

In the one-parameter case, “statistically stationary” means that the probability distribution controlling the process is independent of time, i.e. that p(x,t)=p(x,t+Δt) . It’s fairly straightforward to see that the ergodic property requires that a process X(t) be stationary, but the converse is not the case. Not every stationary process is necessarily ergodic. Ned Wright gives an example here. For a higher-dimensional process, such as a spatially-fluctuating random field the analogous property is statistical homogeneity, rather than stationarity, but otherwise everything carries over.
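A standard illustration of a stationary but non-ergodic process (not necessarily the one in the example linked above) is one in which a random offset is drawn once and then frozen for the whole realization: the ensemble statistics do not depend on time, so the process is stationary, but the time average of any single run converges to that run’s particular offset rather than to the ensemble mean. A minimal sketch:

import numpy as np

rng = np.random.default_rng(1)

def run(n_steps):
    """One realization: a random offset, drawn once, plus small fluctuations."""
    offset = rng.choice([-1.0, 1.0])            # frozen for the whole run
    return offset + 0.1 * rng.normal(size=n_steps)

time_avg = run(100_000).mean()                               # close to -1 or +1
ensemble_avg = np.mean([run(1)[0] for _ in range(10_000)])   # close to 0

print(time_avg, ensemble_avg)   # they disagree no matter how long the run is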

Ergodic theorems are very tricky to prove in general, but there are well-known results that rigorously establish the ergodic properties of Gaussian processes (which is another reason why theorists like myself like them so much). However, it should be mentioned that even if the ergodic assumption applies, its usefulness depends critically on the rate of convergence. In the time-dependent example I gave above, it’s no good if the averaging period required is much longer than the age of the Universe; in that case, even if the process is ergodic, it is difficult to make inferences from your sample. Likewise the ergodic hypothesis doesn’t help you analyse your galaxy redshift survey if the averaging scale needed is larger than the depth of the sample.

Moreover, it seems to me that many physicists resort to ergodicity when there are no compelling mathematical grounds for thinking that it applies. In some versions of the multiverse scenario, it is hypothesized that the fundamental constants of nature describing our low-energy physics turn out “randomly” to take on different values in different domains owing to some sort of spontaneous symmetry breaking, perhaps associated with a phase transition generating cosmic inflation. We happen to live in a patch within this structure where the constants are such as to make human life possible. There’s no need to assert that the laws of physics have been designed to make us possible if this is the case, as most of the multiverse doesn’t have the fine tuning that appears to be required to allow our existence.

As an application of the Weak Anthropic Principle, I have no objection to this argument. However, behind this idea lies the assertion that all possible vacuum configurations (and all related physical constants) do arise ergodically. I’ve never seen anything resembling a proof that this is the case. Moreover, there are many examples of physical phase transitions for which the ergodic hypothesis is known not to apply.  If there is a rigorous proof that this works out, I’d love to hear about it. In the meantime, I remain sceptical.

The Law of Unreason

Posted in Bad Statistics, The Universe and Stuff with tags , , , , on October 11, 2009 by telescoper

Not much time to post today, so I thought I’d just put up a couple of nice little quotes about the Central Limit Theorem. In case you don’t know it, this theorem explains why so many phenomena result in measurable things whose frequencies of occurrence can be described by the Normal (Gaussian) distribution, with its characteristic Bell-shaped curve. I’ve already mentioned the role that various astronomers played in the development of this bit of mathematics, so I won’t repeat the story in this post.

In fact I was asked to prove the theorem during my PhD viva, and struggled to remember how to do it, but it’s such an important thing that it was quite reasonable for my examiners  to ask the question and quite reasonable for them to have expected me to answer it! If you want to know how to do it, then I’ll give you a hint: it involves a Fourier transform!
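For anyone who wants that hint unpacked, here is a rough sketch of the argument (glossing over the technical conditions a rigorous proof requires). The Fourier transform of a probability density p(x) is its characteristic function

\phi(k) \equiv \langle e^{ikX} \rangle = \int e^{ikx}\, p(x)\, dx

and characteristic functions of independent variables simply multiply. If Z_n is the sum of n independent, identically distributed variables with zero mean and variance \sigma^2, rescaled by \sigma\sqrt{n}, then

\phi_{Z_n}(k) = \left[ \phi\!\left(\frac{k}{\sigma\sqrt{n}}\right)\right]^{n} = \left[ 1 - \frac{k^2}{2n} + O(n^{-3/2})\right]^{n} \rightarrow e^{-k^2/2}

as n tends to infinity, and the right-hand side is precisely the characteristic function of a standard Gaussian, so the distribution of the rescaled sum tends to the normal form.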

Any of you who took a peep at Joao Magueijo’s lecture that I posted about yesterday will know that the title of his talk was Anarchy and Physical Laws. The main issue he addressed was whether the existence of laws of physics requires that the Universe must have been designed or whether mathematical regularities could somehow emerge from a state of lawlessness. Why the Universe is lawful is of course one of the greatest mysteries of all, and one that, for some at least, transcends science and crosses over into the realm of theology.

In my little address at the end of Joao’s talk I drew an analogy with the Central Limit Theorem which is an example of an emergent mathematical law that describes situations which are apparently extremely chaotic. I just wanted to make the point that there are well-known examples of such things, even if the audience were sceptical about applying such notions to the entire Universe.

The quotation I picked was this one from Sir Francis Galton:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along

However, it is worth remembering also that not everything has a normal distribution: the central limit theorem requires linear, additive behaviour of the variables involved. I posted about an example where this is not the case here. Theorists love to make the Gaussian assumption when dealing with phenomena that they want to model with stochastic processes because these make many calculations tractable that otherwise would be too difficult. In cosmology, for example, we usually assume that the primordial density perturbations that seeded the formation of large-scale structure obeyed Gaussian statistics. Observers and experimentalists frequently assume Gaussian measurement errors in order to apply off-the-shelf statistical methods to their results. Often nature is kind to us but every now and again we find anomalies that are inconsistent with the normal distribution. Those exceptions usually lead to clues that something interesting is going on that violates the terms of the Central Limit Theorem. There are inklings that this may be the case in cosmology.
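To illustrate both sides of this, here is a quick numerical sketch in Python (the sample sizes are arbitrary): sums of independent uniform variates rapidly become indistinguishable from a Gaussian, whereas averages of Cauchy-distributed variates, which have no finite variance, stubbornly refuse to, however many terms are added.

import numpy as np

rng = np.random.default_rng(0)
n_terms, n_samples = 50, 100_000

# Sums of uniform variates: finite variance, so the Central Limit Theorem applies
uniform_sums = rng.uniform(-1, 1, size=(n_samples, n_terms)).sum(axis=1)
uniform_sums /= uniform_sums.std()     # rescale to unit standard deviation

# Means of Cauchy variates: no finite variance, so the theorem does not apply
cauchy_means = rng.standard_cauchy(size=(n_samples, n_terms)).mean(axis=1)

print((np.abs(uniform_sums) > 5).mean())   # essentially zero, as a Gaussian predicts
print((np.abs(cauchy_means) > 5).mean())   # roughly one in eight: the fat tails never go away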

So to balance Galton’s remarks, I add this quote by Gabriel Lippmann which I’ve taken the liberty of translating from the original French.

Everyone believes in the [normal] law of errors: the mathematicians, because they think it is an experimental fact; and the experimenters, because they suppose it is a theorem of mathematics

There are more things in heaven and earth than are described by the Gaussian distribution!

The Inductive Detective

Posted in Bad Statistics, Literature, The Universe and Stuff with tags , , , , , , , on September 4, 2009 by telescoper

I was watching an old episode of Sherlock Holmes last night – from the classic  Granada TV series featuring Jeremy Brett’s brilliant (and splendidly camp) portrayal of the eponymous detective. One of the  things that fascinates me about these and other detective stories is how often they use the word “deduction” to describe the logical methods involved in solving a crime.

As a matter of fact, what Holmes generally uses is not really deduction at all, but inference (a process which is predominantly inductive).

In deductive reasoning, one tries to tease out the logical consequences of a premise; the resulting conclusions are, generally speaking, more specific than the premise. “If these are the general rules, what are the consequences for this particular situation?” is the kind of question one can answer using deduction.

The kind of reasoning Holmes employs, however, is essentially opposite to this. The question being answered is of the form: “From a particular set of observations, what can we infer about the more general circumstances relating to them?”. The following example, from A Study in Scarlet, is exactly of this type:

From a drop of water a logician could infer the possibility of an Atlantic or a Niagara without having seen or heard of one or the other.

The word “possibility” makes it clear that no certainty is attached to the actual existence of either the Atlantic or Niagara, but the implication is that observations of (and perhaps experiments on) a single water drop could allow one to infer sufficient of the general properties of water in order to use them to deduce the possible existence of other phenomena. The fundamental process is inductive rather than deductive, although deductions do play a role once general rules have been established.

In the example quoted there is  an inductive step between the water drop and the general physical and chemical properties of water and then a deductive step that shows that these laws could describe the Atlantic Ocean. Deduction involves going from theoretical axioms to observations whereas induction  is the reverse process.

I’m probably labouring this distinction, but the main point of doing so is that a great deal of science is fundamentally inferential and, as a consequence, it entails dealing with inferences (or guesses or conjectures) that are inherently uncertain as to their application to real facts. Dealing with these uncertain aspects requires a more general kind of logic than the  simple Boolean form employed in deductive reasoning. This side of the scientific method is sadly neglected in most approaches to science education.

In physics, the attitude is usually to establish the rules (“the laws of physics”) as axioms (though perhaps giving some experimental justification). Students are then taught to solve problems which generally involve working out particular consequences of these laws. This is all deductive. I’ve got nothing against this: it is what a great deal of theoretical research in physics is actually like, and it forms an essential part of the training of a physicist.

However, one of the aims of physics – especially fundamental physics – is to try to establish what the laws of nature actually are from observations of particular outcomes. It would be simplistic to say that this was entirely inductive in character. Sometimes deduction plays an important role in scientific discoveries. For example, Albert Einstein deduced his Special Theory of Relativity from a postulate that the speed of light was constant for all observers in uniform relative motion. However, the motivation for this entire chain of reasoning arose from previous studies of electromagnetism, which involved a complicated interplay between experiment and theory that eventually led to Maxwell’s equations. Deduction and induction are both involved at some level in a kind of dialectical relationship.

The synthesis of the two approaches requires an evaluation of the evidence the data provide concerning the different theories. This evidence is rarely conclusive, so a wider range of logical possibilities than “true” or “false” needs to be accommodated. Fortunately, there is a quantitative and logically rigorous way of doing this. It is called Bayesian probability. In this way of reasoning, the probability (a number between 0 and 1 attached to a hypothesis, model, or anything else that can be described as a logical proposition of some sort) represents the extent to which a given set of data supports the given hypothesis. The calculus of probabilities only reduces to Boolean algebra when the probabilities of all hypotheses involved are either unity (certainly true) or zero (certainly false). In between “true” and “false” there are varying degrees of “uncertain”, represented by a number between 0 and 1, i.e. the probability.
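As a toy illustration of this kind of reasoning (a sketch with completely made-up numbers), suppose two rival hypotheses are initially judged equally probable, but the observed data turn out to be three times more likely under the first than under the second. Bayes’ theorem then updates the probabilities as follows.

# Two competing hypotheses, H1 and H2, initially judged equally probable
prior = {"H1": 0.5, "H2": 0.5}

# Likelihoods: how probable the observed data are under each hypothesis
# (the factor of three between them is purely illustrative)
likelihood = {"H1": 0.3, "H2": 0.1}

# Bayes' theorem: posterior is proportional to prior times likelihood
evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}

print(posterior)   # {'H1': 0.75, 'H2': 0.25}

The data make H1 more probable without making H2 impossible, which is exactly the sort of graded verdict that Boolean logic cannot express.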

Overlooking the importance of inductive reasoning has led to numerous pathological developments that have hindered the growth of science. One example is the widespread and remarkably naive devotion that many scientists have towards the philosophy of the anti-inductivist Karl Popper; his doctrine of falsifiability has led to an unhealthy neglect of  an essential fact of probabilistic reasoning, namely that data can make theories more probable. More generally, the rise of the empiricist philosophical tradition that stems from David Hume (another anti-inductivist) spawned the frequentist conception of probability, with its regrettable legacy of confusion and irrationality.

My own field of cosmology provides the largest-scale illustration of this process in action. Theorists make postulates about the contents of the Universe and the laws that describe it and try to calculate what measurable consequences their ideas might have. Observers make measurements as best they can, but these are inevitably restricted in number and accuracy by technical considerations. Over the years, theoretical cosmologists deductively explored the possible ways Einstein’s General Theory of Relativity could be applied to the cosmos at large. Eventually a family of theoretical models was constructed, each of which could, in principle, describe a universe with the same basic properties as ours. But determining which, if any, of these models applied to the real thing required more detailed data. For example, observations of the properties of individual galaxies led to the inferred presence of cosmologically important quantities of dark matter. Inference also played a key role in establishing the existence of dark energy as a major part of the overall energy budget of the Universe. The result is that we have now arrived at a standard model of cosmology which accounts pretty well for most relevant data.

Nothing is certain, of course, and this model may well turn out to be flawed in important ways. All the best detective stories have twists in which the favoured theory turns out to be wrong. But although the puzzle isn’t exactly solved, we’ve got good reasons for thinking we’re nearer to at least some of the answers than we were 20 years ago.

I think Sherlock Holmes would have approved.