Archive for the Bad Statistics Category

Irrationalism and Deductivism in Science

Posted in Bad Statistics, The Universe and Stuff on March 11, 2024 by telescoper

I thought I would use today's post to share the above reading list, which was posted on the wall at the meeting I was at this weekend; it was only two days long and has now finished. Seeing the first book on the list, however, it seems a good idea to follow this up with a brief discussion – largely inspired by David Stove's book – of some of the philosophical issues raised at the workshop.

It is ironic that the pioneers of probability theory, principally Laplace, unquestionably adopted a Bayesian rather than frequentist interpretation of probability. Frequentism arose during the nineteenth century and held sway until recently. I recall giving a conference talk about Bayesian reasoning only to be heckled by the audience with comments about "new-fangled, trendy Bayesian methods". Nothing could have been less apt. Probability theory pre-dates the rise of sampling theory and all the frequentist-inspired techniques that modern-day statisticians like to employ.

Most disturbing of all is the influence that frequentist and other non-Bayesian views of probability have had upon the development of a philosophy of science, which I believe has a strong element of inverse reasoning or inductivism in it. The argument about whether there is a role for this type of thought in science goes back at least as far as Roger Bacon, who lived in the 13th Century. Much later, the brilliant Scottish empiricist philosopher and Enlightenment figure David Hume argued strongly against induction. Most modern anti-inductivists can be traced back to this source. Pierre Duhem argued that theory and experiment never meet face-to-face because in reality there are hosts of auxiliary assumptions involved in making this comparison. This is nowadays called the Quine-Duhem thesis.

Actually, for a Bayesian this doesn't pose a logical difficulty at all. All one has to do is set up prior probability distributions for the required parameters, calculate their posterior probabilities and then integrate over those that aren't of direct interest, i.e. the nuisance parameters. This is just an expanded version of the idea of marginalization, explained here.
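To make the idea concrete, here is a minimal numerical sketch of marginalization (my own toy example, not the calculation referred to above): a grid posterior for a Gaussian mean mu with the width sigma treated as a nuisance parameter and integrated out. The data, the grids and the flat priors are all assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical data: 20 draws from a Gaussian with unknown mean and width.
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=2.0, size=20)

# Grids for the parameter of interest (mu) and a nuisance parameter (sigma).
mu = np.linspace(-3.0, 5.0, 400)
sigma = np.linspace(0.5, 6.0, 300)
dmu, dsigma = mu[1] - mu[0], sigma[1] - sigma[0]
MU, SIGMA = np.meshgrid(mu, sigma, indexing="ij")

# Gaussian log-likelihood of the data at every grid point.
resid = (data[None, None, :] - MU[..., None]) / SIGMA[..., None]
loglike = -0.5 * np.sum(resid**2, axis=-1) - data.size * np.log(SIGMA)

# Flat priors on both parameters over the grid (an assumption for illustration),
# so the joint posterior is proportional to the likelihood.
post = np.exp(loglike - loglike.max())

# Marginalization: integrate the joint posterior over the nuisance parameter sigma.
post_mu = post.sum(axis=1) * dsigma
post_mu /= post_mu.sum() * dmu            # normalize to unit area

print("posterior mean of mu:", np.sum(mu * post_mu) * dmu)
```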

Rudolf Carnap, a logical positivist, attempted to construct a complete theory of inductive reasoning which bears some relationship to Bayesian thought, but he failed to apply Bayes' theorem in the correct way. Carnap distinguished between two types of probability – logical and factual. Bayesians don't – and I don't – think this is necessary. The Bayesian definition seems to me to be quite coherent on its own.

Other philosophers of science reject the notion that inductive reasoning has any epistemological value at all. This anti-inductivist stance, often somewhat misleadingly called deductivist (irrationalist would be a better description), is evident in the thinking of three of the most influential philosophers of science of the last century: Karl Popper, Thomas Kuhn and, most recently, Paul Feyerabend. Regardless of the ferocity of their arguments with each other, these thinkers have in common that at the core of their systems of thought lies the rejection of all forms of inductive reasoning. The line of thought that ended in this intellectual cul-de-sac began, as I stated above, with the work of the Scottish empiricist philosopher David Hume. For a thorough analysis of the anti-inductivists mentioned above and their obvious debt to Hume, see David Stove's book Popper and After: Four Modern Irrationalists. I will just make a few inflammatory remarks here.

Karl Popper really began the modern era of science philosophy with his Logik der Forschung, which was published in 1934. There isn't really much about (Bayesian) probability theory in this book, which is strange for a work which claims to be about the logic of science. Popper also managed, on the one hand, to accept probability theory (in its frequentist form), but on the other, to reject induction. I find it therefore very hard to make sense of his work at all. It is also clear that, at least outside Britain, Popper is not really taken seriously by many people as a philosopher. Inside Britain it is very different, and I'm not at all sure I understand why. Nevertheless, in my experience, most working physicists seem to subscribe to some version of Popper's basic philosophy.

Among the things Popper has claimed is that all observations are “theory-laden” and that “sense-data, untheoretical items of observation, simply do not exist”. I don’t think it is possible to defend this view, unless one asserts that numbers do not exist. Data are numbers. They can be incorporated in the form of propositions about parameters in any theoretical framework we like. It is of course true that the possibility space is theory-laden. It is a space of theories, after all. Theory does suggest what kinds of experiment should be done and what data is likely to be useful. But data can be used to update probabilities of anything.

Popper has also insisted that science is deductive rather than inductive. Part of this claim is just a semantic confusion. It is necessary at some point to deduce what the measurable consequences of a theory might be before one does any experiments, but that doesn't mean the whole process of science is deductive. He does, however, reject the basic application of inductive reasoning in updating probabilities in the light of measured data; he asserts that no theory ever becomes more probable when evidence is found in its favour. On Popper's account, every scientific theory begins infinitely improbable, and is doomed to remain so.

Now there is a grain of truth in this, or can be if the space of possibilities is infinite. Standard methods for assigning priors often spread the unit total probability over an infinite space, leading to a prior probability which is formally zero. This is the problem of improper priors. But this is not a killer blow to Bayesianism. Even if the prior is not strictly normalizable, the posterior probability can be. In any case, given sufficient relevant data the cycle of experiment-measurement-update of probability assignment usually soon leaves the prior far behind. Data usually count in the end.
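As a minimal sketch of that last point (with invented numbers, and assuming Gaussian noise of known width): a flat prior on a location parameter over the whole real line is improper, but the posterior after even a modest amount of data is a perfectly proper Gaussian.

```python
import numpy as np

# Hypothetical measurements of a quantity x with known Gaussian noise.
rng = np.random.default_rng(1)
sigma_noise = 1.0
data = rng.normal(3.2, sigma_noise, size=50)   # 3.2 is the (unknown to us) true value

# With a flat (improper, non-normalizable) prior on x over the whole real line,
# the posterior is nevertheless a proper, normalizable Gaussian:
post_mean = data.mean()
post_std = sigma_noise / np.sqrt(data.size)

print(f"posterior for x: mean = {post_mean:.3f}, std = {post_std:.3f}")
# As more data arrive the posterior narrows further, and the prior is left far behind.
```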

The idea by which Popper is best known is the dogma of falsification. According to this doctrine, a hypothesis is only said to be scientific if it is capable of being proved false. In real science certain “falsehood” and certain “truth” are almost never achieved. Theories are simply more probable or less probable than the alternatives on the market. The idea that experimental scientists struggle through their entire life simply to prove theorists wrong is a very strange one, although I definitely know some experimentalists who chase theories like lions chase gazelles. To a Bayesian, the right criterion is not falsifiability but testability, the ability of the theory to be rendered more or less probable using further data. Nevertheless, scientific theories generally do have untestable components. Any theory has its interpretation, which is the untestable baggage that we need to supply to make it comprehensible to us. But whatever can be tested can be scientific.

Popper's work on the philosophical ideas that ultimately led to falsificationism began in Vienna, but the approach subsequently gained enormous popularity in western Europe. The American Thomas Kuhn later took up the anti-inductivist baton in his book The Structure of Scientific Revolutions. Kuhn is undoubtedly a first-rate historian of science and this book contains many perceptive analyses of episodes in the development of physics. His view of scientific progress is cyclic. It begins with a mass of confused observations and controversial theories, moves into a quiescent phase when one theory has triumphed over the others, and lapses into chaos again when further testing exposes anomalies in the favoured theory. Kuhn adopted the word paradigm to describe the model that rules during the middle stage.

The history of science is littered with examples of this process, which is why so many scientists find Kuhn's account in good accord with their experience. But there is a problem when attempts are made to fuse this historical observation into a philosophy based on anti-inductivism. Kuhn claims that we "have to relinquish the notion that changes of paradigm carry scientists … closer and closer to the truth." Einstein's theory of relativity provides a closer fit to a wider range of observations than Newtonian mechanics, but in Kuhn's view this success counts for nothing.

Paul Feyerabend has extended this anti-inductivist streak to its logical (though irrational) extreme. His approach has been dubbed “epistemological anarchism”, and it is clear that he believed that all theories are equally wrong. He is on record as stating that normal science is a fairytale, and that equal time and resources should be spent on “astrology, acupuncture and witchcraft”. He also categorised science alongside “religion, prostitution, and so on”. His thesis is basically that science is just one of many possible internally consistent views of the world, and that the choice between which of these views to adopt can only be made on socio-political grounds.

Feyerabend’s views could only have flourished in a society deeply disillusioned with science. Of course, many bad things have been done in science’s name, and many social institutions are deeply flawed. But one can’t expect anything operated by people to run perfectly. It’s also quite reasonable to argue on ethical grounds which bits of science should be funded and which should not. But the bottom line is that science does have a firm methodological basis which distinguishes it from pseudo-science, the occult and new age silliness. Science is distinguished from other belief-systems by its rigorous application of inductive reasoning and its willingness to subject itself to experimental test. Not all science is done properly, of course, and bad science is as bad as anything.

The Bayesian interpretation of probability leads to a philosophy of science which is essentially epistemological rather than ontological. Probabilities are not “out there” in external reality, but in our minds, representing our imperfect knowledge and understanding. Scientific theories are not absolute truths. Our knowledge of reality is never certain, but we are able to reason consistently about which of our theories provides the best available description of what is known at any given time. If that description fails when more data are gathered, we move on, introducing new elements or abandoning the theory for an alternative. This process could go on forever. There may never be a final theory. But although the game might have no end, at least we know the rules….

Broken Science Initiative

Posted in Bad Statistics on March 10, 2024 by telescoper

This weekend I find myself at an invitation-only event in Phoenix, Arizona, organized by the Broken Science Initiative and called  The Broken Science Epistemology Camp. I flew here on Thursday and will be returning on Tuesday, so it’s a flying visit to the USA.  I thank the organizers Greg Glassman and Emily Kaplan for inviting me. I wasn’t sure what to expect when I accepted the invitation to come but I welcomed the chance to attend an event that’s a bit different from the usual academic conference. There are some suggestions here for background reading which you may find interesting.

Yesterday we had a series of wide-ranging talks about subjects such as probability and statistics, the philosophy of science, the problems besetting academic research, and so on. One of the speakers was the eminent psychologist Gerd Gigerenzer, the theme of whose talk was the use of p-values in statistics, the effects of bad statistical reasoning on the reporting of research results, and the wider issues this generates. You can find a paper covering many of the points raised by Gigerenzer here (PDF).

I’ve written about this before on this blog – see here for example – and I thought it might be useful to re-iterate some of the points here.

The p-value is a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a “null hypothesis”. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05.
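As a hedged sketch of that example (the sample size and data below are invented, and scipy's pearsonr is used simply as a convenient way to get the two-sided p-value for the sample correlation coefficient under the null hypothesis of zero correlation):

```python
import numpy as np
from scipy import stats

# Hypothetical bivariate sample of size n; here the null is true by construction.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = rng.normal(size=n)                 # independent of x

r, p = stats.pearsonr(x, y)            # two-sided p-value under the null of zero correlation
print(f"sample r = {r:.3f}, p-value = {p:.3f}")

# Equivalently, simulate the null distribution of r and count how often |r| exceeds
# the observed value -- which is just the definition of the p-value given above.
r_null = np.array([stats.pearsonr(rng.normal(size=n), rng.normal(size=n))[0]
                   for _ in range(5000)])
print("Monte Carlo p-value:", np.mean(np.abs(r_null) >= abs(r)))
```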

Whatever the null hypothesis happens to be, the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data or that some other hypothesis is needed. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution based on it, and determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when the alternative hypothesis, rather than the null, is correct. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
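To make those definitions concrete, here is a hedged simulation sketch; the null and alternative means, the sample size and the use of a simple z-test are all illustrative assumptions, not anything taken from Gigerenzer's talk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, alpha, trials = 25, 0.05, 5000

# Null hypothesis: mean = 0. Assumed alternative: mean = 0.5. Known unit variance.
def reject_null(sample):
    # One-sample z-test against mean 0 with known sigma = 1, two-sided at level alpha.
    z = sample.mean() * np.sqrt(n)
    return 2 * stats.norm.sf(abs(z)) < alpha

# Type I error rate: how often we reject when the null is true (should be ~alpha).
type1 = np.mean([reject_null(rng.normal(0.0, 1.0, n)) for _ in range(trials)])

# Power: how often we reject when the alternative is true; Type II rate = 1 - power.
power = np.mean([reject_null(rng.normal(0.5, 1.0, n)) for _ in range(trials)])

print(f"Type I rate ~ {type1:.3f}, power ~ {power:.3f}, Type II rate ~ {1 - power:.3f}")
```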

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Gerd Gigerenzer gave plenty of examples of this in his talk.

A Nature piece published some time ago argues that results quoted with a p-value of 0.05 in fact turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true. A widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are also typically rather small.
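A figure of roughly 25% is easy to reproduce with a back-of-envelope false-discovery calculation, though the input numbers below are purely illustrative assumptions and are not taken from the Nature piece:

```python
# Illustrative assumptions, not figures from the Nature article:
prior_real = 0.16   # assumed fraction of tested hypotheses that are genuinely real effects
alpha = 0.05        # significance threshold
power = 0.8         # assumed probability of detecting a real effect

false_positives = (1 - prior_real) * alpha   # true nulls wrongly declared "significant"
true_positives = prior_real * power          # real effects correctly detected

# Fraction of "significant" results that are actually false:
fdr = false_positives / (false_positives + true_positives)
print(f"false discovery rate ~ {fdr:.0%}")   # about 25% with these assumed numbers
```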

The suggestion that this issue can be resolved by simply choosing a stricter criterion, i.e. a p-value of 0.005 rather than 0.05, does not help, because the p-value is an answer to a question about what the hypothesis says about the probability of the data, which is quite different from the question a scientist would really want to ask, namely what the data have to say about a given hypothesis. Frequentist hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach, which does focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis. If I had my way I'd ban p-values altogether.

The p-value is just one example of a statistical device that is too often applied mechanically, as a black box, without real understanding, and which can be manipulated through data dredging (or "p-hacking"). Gerd Gigerenzer went on to bemoan the general use of "mindless statistics" and the prevalence of "statistical rituals", and referred to much statistical reasoning as "a meaningless ordeal of pedantic computations".

Bad statistics isn’t the only thing wrong with academic research, but it is a significant factor.

The Big Ring Circus

Posted in Astrohype, Bad Statistics, The Universe and Stuff on January 15, 2024 by telescoper

At the annual AAS Meeting in New Orleans last week there was an announcement of a result that made headlines in the media (see, e.g., here and here). There is also a press release from the University of Central Lancashire.

Here is a video of the press conference:

I was busy last week, so I didn't have time to read the details and refrained from commenting on this issue at the time of the announcement. Now that I am back in circulation I have had time to look into it, but unfortunately I was unable to find even a preprint describing this "discovery". The press conference doesn't contain much detail either, so it's impossible to say anything much about the significance of the result, which is claimed (without explanation) to be 5.2σ (after "doing some statistics"). I see the "Big Ring" now has its own Wikipedia page, the only references on which are to press reports, not peer-reviewed scientific papers or even preprints.

So is this structure “so big it challenges our understanding of the universe”?

Based on the available information it is impossible to say. The large-scale structure of the Universe comprises a complex network of walls and filaments known as the cosmic web which I have written about numerous times on this blog. This structure is so vast and complicated that it is very easy to find strange shapes in it but very hard to determine whether or not they indicate anything other than an over-active imagination.

To assess the significance of the Big Ring or other structures in a proper scientific fashion, one has to calculate how probable that structure is given a model. We have a standard model that can be used for this purpose, but simulating very large structures is not straightforward because it requires a lot of computing power even to simulate just the mass distribution. In this case one also has to understand how to include the Magnesium absorbers, which may turn out to trace the mass in a very biased way. Moreover, one has to simulate the observational selection process too, so that one is making a fair comparison between observations and predictions.

I have seen no evidence that this has been done in this case. When it is, I’ll comment on the details. I’m not optimistic however, as the description given in the media accounts contains numerous falsehoods. For example, quoting the lead author:

The Cosmological Principle assumes that the part of the universe we can see is viewed as a ‘fair sample’ of what we expect the rest of the universe to be like. We expect matter to be evenly distributed everywhere in space when we view the universe on a large scale, so there should be no noticeable irregularities above a certain size.

https://www.uclan.ac.uk/news/big-ring-in-the-sky

This just isn't correct. The standard cosmology has fluctuations on all scales. Although the fluctuation amplitude decreases with scale, there is no scale at which the Universe is completely smooth. See the discussion, for example, here. We can see correlations on very large angular scales in the cosmic microwave background which would be absent if the Universe were completely smooth on those scales. The observed structure is about 400 Mpc in size, which does not seem to me to be particularly impressive.
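As a toy illustration of that point, here is a hedged sketch of the rms fluctuation sigma(R) in spheres of radius R for an assumed power-law spectrum; it is not the LambdaCDM calculation, the normalization and units are arbitrary, and the only message is that sigma(R) falls with R without ever reaching zero at any finite scale.

```python
import numpy as np

# Toy model: rms fluctuation sigma(R) for an assumed power-law spectrum P(k) ~ k**n,
# smoothed with a spherical top-hat window of radius R. Arbitrary normalization.
def sigma_R(R, n=-1.5, kmin=1e-4, kmax=1e3, nk=4000):
    k = np.logspace(np.log10(kmin), np.log10(kmax), nk)
    dlnk = np.log(k[1] / k[0])
    x = k * R
    W = 3.0 * (np.sin(x) - x * np.cos(x)) / x**3      # top-hat window function
    integrand = k**n * W**2 * k**3                    # extra k from the d(ln k) measure
    return np.sqrt(np.sum(integrand) * dlnk)

for R in (1.0, 10.0, 100.0):                          # arbitrary units
    print(f"R = {R:6.1f}   sigma(R) = {sigma_R(R):.4f}")
```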

I suspect that the 5.2σ figure mentioned above comes from some sort of comparison between the observed structure and a completely uniform background, in which case it is meaningless.

My main comment on this episode is that I think it’s very poor practice to go hunting headlines when there isn’t even a preprint describing the results. That’s not the sort of thing PhD supervisors should be allowing their PhD students to do. As I have mentioned before on this blog, there is an increasing tendency for university press offices to see themselves entirely as marketing agencies instead of informing and/or educating the public. Press releases about scientific research nowadays rarely make any attempt at accuracy – they are just designed to get the institution concerned into the headlines. In other words, research is just a marketing tool.

In the long run, this kind of media circus, driven by hype rather than science, does nobody any good.

P.S. I was going to joke that ring-like structures can be easily explained by circular reasoning, but decided not to.

How not to do data visualisation…

Posted in Bad Statistics on January 9, 2024 by telescoper

How many things are wrong about this graphic?

An Open Letter to the Times Higher World University Rankers

Posted in Bad Statistics, Education on September 20, 2023 by telescoper

Dear Rankers,

I note with interest that you have announced significant changes to the methodology deployed in the construction of this year's forthcoming league tables. I would like to ask what steps you will take to make it clear that any changes in institutional "performance" (whatever that is supposed to mean) could well be explained simply by changes in the metrics and how they are combined?

I assume, as intelligent and responsible people, that you did the obvious test for this effect, i.e. to construct and publish a parallel set of league tables, with this year’s input data but last year’s methodology, which would make it easy to isolate changes in methodology from changes in the performance indicators.  This is a simple test that anyone with any scientific training would perform.

You have not done this on any of the previous occasions on which you have introduced changes in methodology. Perhaps this lamentable failure of process was the result of multiple oversights. Had you deliberately withheld evidence of the unreliability of your conclusions you would have left yourselves open to an accusation of gross dishonesty, which I am sure would be unfair.

Happily, however, there is a very easy way to allay the fears of the global university community that the world rankings are being manipulated. All you need to do is publish a set of league tables using the 2022 methodology and the 2023 data. Any difference between this table and the one you published would then simply be an artefact and the new ranking can be ignored.

I'm sure you are as anxious as anyone else to prove that the changes this year are not simply artificially-induced "churn", and I look forward to seeing the results of this straightforward calculation published in the Times Higher as soon as possible, preferably next week when you announce this year's league tables.

I look forward to seeing your response to the above through the comments box, or elsewhere. As long as you fail to provide a calibration of the sort I have described, this year’s league tables will be even more meaningless than usual. Still, at least the Times Higher provides you with a platform from which you can apologize to the global academic community for wasting their time and that of others.

Never mind the points, look at the line!

Posted in Bad Statistics, Open Access, The Universe and Stuff on June 14, 2023 by telescoper

I was just thinking this morning that it’s been a while since I posted anything in my Bad Statistics folder when suddenly I come across this gem from a paper in Nature Astronomy entitled Could quantum gravity slow down neutrinos?

The paper itself is behind a paywall (though a preprint version is on the arXiv here). The results in the paper were deemed so important that Nature Astronomy tweeted about them, including this remarkable graph:

Understandably there has been quite a lot of reaction from scientists on Twitter to this plot, questioning how the blue line is obtained from the dots (as only one point to the right appears to be responsible for the trend), remarking on the complete absence of any error bars on either axis for any of the points, and above all wondering how this managed to get past a referee, never mind one for a “prestigious” journal such as Nature Astronomy. It wouldn’t have passed muster as an undergraduate exercise.
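To illustrate the first of those objections, here is a hedged sketch with entirely made-up numbers, showing how a single isolated point at high x can dominate an unweighted straight-line fit:

```python
import numpy as np

# Hypothetical data: five points with no trend plus one isolated point at large x.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 20.0])
y = np.array([0.1, -0.2, 0.15, 0.0, -0.1, 3.0])

slope_all, intercept_all = np.polyfit(x, y, 1)              # unweighted least-squares line
slope_rest, intercept_rest = np.polyfit(x[:-1], y[:-1], 1)  # same fit without the outlier

print(f"slope with the outlying point:    {slope_all:+.3f}")
print(f"slope without the outlying point: {slope_rest:+.3f}")
# Without error bars on the points there is no way to judge whether either slope
# is significant, which is the other objection raised above.
```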

Of course this is how a proper astronomer would do it:

Joking aside, if you look at the paper (or the preprint if you can’t afford it) you will see another graph, which shows two other points at higher energy (red triangles):

The extra two points don’t have any error-bars either, and according to the preprint these appear to be unconfirmed candidate GRB events.

The abstract of the paper is:

In addition to its implications for astrophysics, the hunt for neutrinos originating from gamma-ray bursts could also be significant in quantum-gravity research, as they are excellent probes of the microscopic fabric of spacetime. Some previous studies based on neutrinos observed by the IceCube observatory found intriguing preliminary evidence that some of them might be gamma-ray burst neutrinos whose travel times are affected by quantum properties of spacetime that would slow down some of the neutrinos while speeding up others. The IceCube collaboration recently significantly revised the estimates of the direction of observation of their neutrinos, and we here investigate how the corrected directional information affects the results of the previous quantum-spacetime-inspired analyses. We find that there is now little evidence for neutrinos being sped up by quantum spacetime properties, whereas the evidence for neutrinos being slowed down by quantum spacetime is even stronger than previously determined. Our most conservative estimates find a false-alarm probability of less than 1% for these ‘slow neutrinos’, providing motivation for future studies on larger data samples.

I agree with the last sentence where it says larger data samples are needed in future, but I'd suggest higher standards of data analysis are also called for. Not to mention refereeing. After all, it's the quality of the reviewing that you pay for, isn't it?

P.S. For those of you wondering, this paper would not have been published by the Open Journal of Astrophysics even if passed review, as it is not on the astro-ph section of arXiv (it’s on gr-qc).

Eurovision Scores and Ranks

Posted in Bad Statistics, Television on May 14, 2023 by telescoper

After last night’s Eurovision 2023 extravaganza I thought I’d work off my hangover by summarizing the voting. The vote is split into 50% jury votes and 50% televotes from audiences sitting at home, drunk. It’s perhaps worth mentioning that the juries do their scores based on the dress rehearsals on Friday so they are not based on the performances the viewers see.

Each country/jury has 58 points to award, shared among 10 countries: 1-8, 10 and 12 for the top score. Countries that didn’t make it to the final (e.g. Ireland) also get to vote. For the televotes only there is also a “rest-of-the-world” vote for non-Eurovision countries.

This system can deliver very harsh results because only 10 songs can get points from a given source. It’s possible to be judged the 11th best across the board and score nil!
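A trivial sketch of the arithmetic, just to spell out the two claims above:

```python
# Points awarded by each jury and each national televote (not official code).
points = list(range(1, 9)) + [10, 12]   # 1-8, then 10 and 12 for the top song
print(sum(points))                      # 58 points per voting source
# Only ten songs receive points from any one source, so a song ranked 11th by
# every jury and every televote would finish on zero.
```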

Here are the final scores in a table:

Rank  Country          Overall  Televotes  Jury  Diff  Rank Diff
  1   Sweden               583        243   340   +97         +1
  2   Finland              526        376   150  -226         -1
  3   Israel               362        185   177    -8         +3
  4   Italy                350        174   176    +2         +3
  5   Norway               268        216    52  -168        -14
  6   Ukraine              243        189    54  -145        -11
  7   Belgium              182         55   127   +72         +5
  8   Estonia              168         22   146  +124        +14
  9   Australia            151         21   130  +109        +14
 10   Czechia              129         35    94   +59         +7
 11   Lithuania            127         46    81   +35         +4
 12   Cyprus               126         58    68   +10         -2
 13   Croatia              123        112    11  -101        -18
 14   Armenia              122         53    69   +16         +1
 15   Austria              120         16   104   +88        +13
 16   France               104         50    54    +4         -2
 17   Spain                100          5    95   +90        +17
 18   Moldova               96         76    20   -56        -11
 19   Poland                93         81    12   -69        -16
 20   Switzerland           92         31    61   +30         +4
 21   Slovenia              78         45    33   -12         -3
 22   Albania               76         59    17   -42        -11
 23   Portugal              59         16    43   +27         +4
 24   Serbia                30         16    14    +2          0
 25   United Kingdom        24          9    22   +13          0
 26   Germany               18         15     3   -12         -2
Final Scores by country in Eurovision 2023 showing the breakdown into televotes and jury votes, together with the difference in numerical scores awarded and difference in ranking based on jury votes rather than televotes, e.g. Albania scored 42 fewer points on the jury votes and would have been 11 places higher based just on televotes than just on jury votes.

Going into the last allocation of televotes, Finland were in the lead thanks to their own huge televote, but Sweden managed to win despite a lower televote allocation because of their huge score on the jury votes. Had the scores been based on the jury votes alone, Sweden would have won by a mile; on the televotes alone, Finland would have won. Anyway, rules is rules…

There are some interestingly odd features in the above dataset. For example, Switzerland ranked 20th overall, but were ranked 18th and 14th by televotes and jury votes respectively. There are also cases in which a higher score in one set of votes leads to a lower rank, and vice-versa. Croatia were hammered by the jury votes, ranking 25th out of 26 on that basis but would have been 7th based on televotes alone; hence their -18 in the last column. A similar fate befell Norway. By contrast, Spain were last (26th) on the televotes but placed 9th in the pecking order by the juries; they ended up in 17th place.

Anyway, you can see that there are considerable differences between the scores and ranks based on the public vote and the jury votes. I have therefore deployed my vast knowledge of statistics to calculate the Spearman Rank Correlation Coefficient between the ranks based on televotes only and based on jury votes only. The result is 0.26. Using my trusty statistical tables, noting that n=26, and wearing a frequentist hat for simplicity, I find that there is no significant evidence for correlation between the two sets of ranks. I can’t say I’m surprised.
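For anyone who wants to check the arithmetic, here is a minimal sketch using the televote and jury scores transcribed from the table above; scipy's spearmanr converts the scores to ranks (averaging over ties) before computing the correlation, so the result should land close to the figure quoted, although the treatment of tied scores can shift it slightly.

```python
import numpy as np
from scipy import stats

# Televote and jury scores transcribed from the table above (same country order).
televote = np.array([243, 376, 185, 174, 216, 189, 55, 22, 21, 35, 46, 58, 112,
                     53, 16, 50, 5, 76, 81, 31, 45, 59, 16, 16, 9, 15])
jury = np.array([340, 150, 177, 176, 52, 54, 127, 146, 130, 94, 81, 68, 11,
                 69, 104, 54, 95, 20, 12, 61, 33, 17, 43, 14, 22, 3])

# Spearman rank correlation and a two-sided p-value for the null of no correlation.
rho, p = stats.spearmanr(televote, jury)
print(f"Spearman rho = {rho:.2f}, p-value = {p:.2f}")
```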

The apparent randomness of the scoring process introduces a considerable amount of churn into the system, as demonstrated by Mel Giedroyc in this, the iconic image of last night’s events.

At least I think that’s what she’s doing…

Anyway, for the record, I should say that my favourite three songs were Albania (22nd), Portugal (23rd) and Austria (15th). Maybe one day I’ll pick a song that makes it onto the left-hand half of the screen!

P.S. Eurovision 2024 will be in Sweden, which is nice because it will be the 50th anniversary of ABBA winning with Waterloo. I’ll never tire of boring people with the fact that a mere 15 years after ABBA won, I walked across the very same stage at the Brighton Centre to collect my doctorate from Sussex University…

Unknown Unknowns

Posted in Bad Statistics, History on May 2, 2023 by telescoper

I was surprised today that some students I was talking to couldn’t identify the leading American philosopher and social scientist responsible for this pithy summation of the limits of human knowledge:

Obviously it’s from before their time. How about you? Without using Google, can you identify the origin of this clear and insightful description?

Cosmological Dipole Controversy

Posted in Astrohype, Bad Statistics, The Universe and Stuff on October 11, 2022 by telescoper

I’ve just finished reading an interesting paper by Secrest et al. which has attracted some attention recently. It’s published in the Astrophysical Journal Letters but is also available on the arXiv here. I blogged about earlier work by some of these authors here.

The abstract of the current paper is:

We present the first joint analysis of catalogs of radio galaxies and quasars to determine if their sky distribution is consistent with the standard ΛCDM model of cosmology. This model is based on the cosmological principle, which asserts that the universe is statistically isotropic and homogeneous on large scales, so the observed dipole anisotropy in the cosmic microwave background (CMB) must be attributed to our local peculiar motion. We test the null hypothesis that there is a dipole anisotropy in the sky distribution of radio galaxies and quasars consistent with the motion inferred from the CMB, as is expected for cosmologically distant sources. Our two samples, constructed respectively from the NRAO VLA Sky Survey and the Wide-field Infrared Survey Explorer, are systematically independent and have no shared objects. Using a completely general statistic that accounts for correlation between the found dipole amplitude and its directional offset from the CMB dipole, the null hypothesis is independently rejected by the radio galaxy and quasar samples with p-value of 8.9×10−3 and 1.2×10−5, respectively, corresponding to 2.6σ and 4.4σ significance. The joint significance, using sample size-weighted Z-scores, is 5.1σ. We show that the radio galaxy and quasar dipoles are consistent with each other and find no evidence for any frequency dependence of the amplitude. The consistency of the two dipoles improves if we boost to the CMB frame assuming its dipole to be fully kinematic, suggesting that cosmologically distant radio galaxies and quasars may have an intrinsic anisotropy in this frame.

I can summarize the paper in the form of this well-worn meme:

My main reaction to the paper – apart from finding it interesting – is that if I were doing this I wouldn’t take the frequentist approach used by the authors as this doesn’t address the real question of whether the data prefer some alternative model over the standard cosmological model.

As was the case with a Nature piece I blogged about some time ago, this article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of 95% of lying within 2σ of the mean value.

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. In fact I feel so strongly about this that if I had my way I’d ban p-values altogether…

This is not an objection to the particular p-value threshold chosen, whether that is 0.005 rather than 0.05 or even a 5σ standard (which translates to a p-value of about 0.000001). While it is true that a stricter threshold would throw out a lot of flaky 'two-sigma' results, it doesn't alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from the one a scientist would actually want to ask, which is what the data have to say about the probability of a specific hypothesis being true, or sometimes whether the data imply one hypothesis more strongly than another. I've banged on about Bayesian methods quite enough on this blog so I won't repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.
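For reference, here is a minimal sketch of the conversion between p-values and 'sigma' significance, assuming the conventional two-sided Gaussian convention, which appears to be what the figures quoted in the abstract correspond to:

```python
from scipy import stats

# Two-sided Gaussian convention (an assumption about how the quoted numbers were made).
def p_to_sigma(p):
    return stats.norm.isf(p / 2.0)

def sigma_to_p(nsigma):
    return 2.0 * stats.norm.sf(nsigma)

for p in (8.9e-3, 1.2e-5):
    print(f"p = {p:.1e}  ->  {p_to_sigma(p):.1f} sigma")   # ~2.6 and ~4.4 sigma
print(f"5 sigma  ->  p = {sigma_to_p(5):.1e}")             # ~6e-7
```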

Not that it’s always easy to implement the (better) Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. That’s probably why many scientists prefer to accept the limitations of the frequentist approach than tackle the admittedly very challenging problems of going Bayesian.

But having indulged in that methodological rant, I certainly have an open mind about departures from isotropy on large scales. The correct scientific approach is now to reanalyze the data used in this paper to see if the result presented stands up, which it very well might.

GAA Clustering

Posted in Bad Statistics, GAA, The Universe and Stuff on July 25, 2022 by telescoper
The distribution of GAA pitches in Ireland

The above picture was doing the rounds on Twitter yesterday ahead of this year’s All-Ireland Football Final at Croke Park (won by favourites Kerry despite a valiant effort from Galway, who led for much of the game and didn’t play at all like underdogs).

The picture above shows the distribution of Gaelic Athletics Association (GAA) grounds around Ireland. In case you didn’t know, Hurling and Gaelic Football are played on the same pitch with the same goals and markings on the field. First thing you notice is that the grounds are plentiful! Obviously the distribution is clustered around major population centres – Dublin, Cork, Limerick and Galway are particularly clear – but other than that the distribution is quite uniform, though in less populated areas the grounds tend to be less densely packed.

The eye is also drawn to filamentary features, probably related to major arterial roads. People need to be able to get to the grounds, after all. Or am I reading too much into these apparent structures? The eye is notoriously keen to see patterns where none really exist, a point I’ve made repeatedly on this blog in the context of galaxy clustering.

The statistical description of clustered point patterns is a fascinating subject, because it makes contact with the way in which our eyes and brain perceive pattern. I’ve spent a large part of my research career trying to figure out efficient ways of quantifying pattern in an objective way and I can tell you it’s not easy, especially when the data are prone to systematic errors and glitches. I can only touch on the subject here, but to see what I am talking about look at the two patterns below:

You will have to take my word for it that one of these is a realization of a two-dimensional Poisson point process and the other contains correlations between the points. One therefore has a real pattern to it, and one is a realization of a completely unstructured random process.

random or non-random?

I show this example in popular talks and get the audience to vote on which one is the random one. The vast majority usually think that the one on the right is random and the one on the left is the one with structure to it. It is not hard to see why. The right-hand pattern is very smooth (what one would naively expect for a constant probability of finding a point at any position in the two-dimensional space), whereas the left-hand one seems to offer a profusion of linear, filamentary features and densely concentrated clusters.

In fact, it’s the picture on the left that was generated by a Poisson process using a  Monte Carlo random number generator. All the structure that is visually apparent is imposed by our own sensory apparatus, which has evolved to be so good at discerning patterns that it finds them when they’re not even there!

The right-hand process is also generated by a Monte Carlo technique, but the algorithm is more complicated. In this case the presence of a point at some location suppresses the probability of having other points in the vicinity. Each event has a zone of avoidance around it; the points are therefore anticorrelated. The result of this is that the pattern is much smoother than a truly random process should be. In fact, this simulation has nothing to do with galaxy clustering really. The algorithm used to generate it was meant to mimic the behaviour of glow-worms which tend to eat each other if they get  too close. That’s why they spread themselves out in space more uniformly than in the random pattern.
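Here is a hedged sketch of how one might generate two such patterns (a Poisson process and a simple zone-of-avoidance process), together with a crude smoothness diagnostic; it is not the algorithm used to make the figures above, and the point numbers and minimum separation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n_points, box = 300, 1.0

# A two-dimensional Poisson process: points scattered completely at random.
poisson = rng.uniform(0.0, box, size=(n_points, 2))

# A simple inhibition ("zone of avoidance") process: keep a trial point only if it
# lies further than r_min from every point already accepted.
r_min = 0.03                                   # minimum separation; arbitrary choice
accepted = [rng.uniform(0.0, box, size=2)]
while len(accepted) < n_points:
    trial = rng.uniform(0.0, box, size=2)
    dists = np.hypot(*(np.array(accepted) - trial).T)
    if dists.min() > r_min:
        accepted.append(trial)
inhibited = np.array(accepted)

# The anticorrelated pattern is smoother: counts in cells show sub-Poisson variance.
def variance_to_mean(points, n_cells=10):
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=n_cells, range=[[0, box], [0, box]])
    return counts.var() / counts.mean()        # ~1 for Poisson, < 1 if inhibited

print("variance/mean, Poisson pattern:  ", variance_to_mean(poisson))
print("variance/mean, inhibited pattern:", variance_to_mean(inhibited))
```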

Incidentally, I got both pictures from Stephen Jay Gould’s collection of essays Bully for Brontosaurus and used them, with appropriate credit and copyright permission, in my own book From Cosmos to Chaos.

The tendency to find things that are not there is quite well known to astronomers. The constellations which we all recognize so easily are not physical associations of stars, but are just chance alignments on the sky of things at vastly different distances in space. That is not to say that they are random, but the pattern they form is not caused by direct correlations between the stars. Galaxies form real three-dimensional physical associations through their direct gravitational effect on one another.

People are actually pretty hopeless at understanding what “really” random processes look like, probably because the word random is used so often in very imprecise ways and they don’t know what it means in a specific context like this.  The point about random processes, even simpler ones like repeated tossing of a coin, is that coincidences happen much more frequently than one might suppose.

I suppose there is an evolutionary reason why our brains like to impose order on things in a general way. More specifically scientists often use perceived patterns in order to construct hypotheses. However these hypotheses must be tested objectively and often the initial impressions turn out to be figments of the imagination, like the canals on Mars.