Archive for psychology

Broken Science Initiative

Posted in Bad Statistics on March 10, 2024 by telescoper

This weekend I find myself at an invitation-only event in Phoenix, Arizona, organized by the Broken Science Initiative and called  The Broken Science Epistemology Camp. I flew here on Thursday and will be returning on Tuesday, so it’s a flying visit to the USA.  I thank the organizers Greg Glassman and Emily Kaplan for inviting me. I wasn’t sure what to expect when I accepted the invitation to come but I welcomed the chance to attend an event that’s a bit different from the usual academic conference. There are some suggestions here for background reading which you may find interesting.

Yesterday we had a series of wide-ranging talks about subjects such as probability and statistics, the philosophy of science, the problems besetting academic research, and so on. One of the speakers was the eminent psychologist Gerd Gigerenzer, whose talk concerned the use of p-values in statistics, the effects of bad statistical reasoning on the reporting of research results, and the wider issues this generates. You can find a paper covering many of the points raised by Gigerenzer here (PDF).

I’ve written about this before on this blog – see here for example – and I thought it might be useful to re-iterate some of the points here.

The p-value is a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a “null hypothesis”. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05.
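By way of illustration, here is a minimal sketch in Python (my own, using NumPy and SciPy, not anything from the talk) of the correlation example: the p-value reported is the probability, under the null hypothesis of zero correlation, of obtaining a sample correlation at least as extreme as the one observed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = rng.normal(size=50)   # generated independently of x, so the null is true here

# pearsonr returns the sample correlation coefficient r and the two-sided
# p-value computed under the null hypothesis of no correlation
r, p = stats.pearsonr(x, y)
print(f"sample correlation r = {r:.3f}, p-value = {p:.3f}")
```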

Whatever the null hypothesis happens to be, the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data or that some other hypothesis is needed. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution based on it, and determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when the alternative hypothesis, rather than the null, is correct. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
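To make the distinction concrete, here is a rough simulation (again my own sketch, not Gigerenzer's) that estimates the Type I error rate and the power of an ordinary two-sample t-test at a significance level of 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

def rejection_rate(effect):
    """Fraction of simulated experiments in which the null is rejected at level alpha."""
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            rejections += 1
    return rejections / trials

# With no real effect, the null is rejected about 5% of the time (Type I errors);
# with a true effect of half a standard deviation, the rejection rate is the power
# (and 1 minus that is the Type II error rate).
print("Type I error rate (null true):", rejection_rate(0.0))
print("Power (effect = 0.5 sigma):   ", rejection_rate(0.5))
```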

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Gerd Gigerenzer gave plenty of examples of this in his talk.

A Nature piece published some time ago argues that, in fact, results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true. A widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are also typically rather small.
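One way to see how a figure of that order can arise is via a back-of-the-envelope calculation. The numbers below are illustrative assumptions of my own, not those used in the Nature piece, but they show how the fraction of wrong “discoveries” depends on how many of the hypotheses being tested are real effects and on the power of the tests:

```python
# Illustrative assumptions (mine, not the Nature piece's):
prior_true = 0.25   # assumed fraction of tested hypotheses that are real effects
alpha      = 0.05   # significance threshold
power      = 0.5    # assumed probability of detecting a real effect

false_positives = (1 - prior_true) * alpha   # true nulls that nevertheless pass the test
true_positives  = prior_true * power         # real effects that are detected
wrong_fraction  = false_positives / (false_positives + true_positives)
print(f"Fraction of p < 0.05 'discoveries' that are wrong: {wrong_fraction:.2f}")  # about 0.23
```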

The suggestion that this issue can be resolved simply by choosing stricter criteria, e.g. a p-value of 0.005 rather than 0.05, does not help, because the p-value answers a question about what the hypothesis says about the probability of the data, which is quite different from the question a scientist really wants to ask, namely what the data have to say about a given hypothesis. Frequentist hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach, which focuses on the probability of a hypothesis being right given the data, rather than on properties the data might have given the hypothesis. If I had my way I’d ban p-values altogether.
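To show what I mean by the Bayesian question, here is a toy example (my own, with just two simple hypotheses and equal prior probabilities assumed for illustration) in which one computes the probability of a hypothesis given the data, rather than the probability of the data given the null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(0.3, 1.0, 20)   # data drawn with true mean 0.3 and unit variance

# Likelihood of the data under each of two simple hypotheses about the mean
likelihood_null = np.prod(stats.norm.pdf(data, loc=0.0, scale=1.0))
likelihood_alt  = np.prod(stats.norm.pdf(data, loc=0.5, scale=1.0))

# Bayes' theorem with equal priors: the posterior probability of the null
posterior_null = likelihood_null / (likelihood_null + likelihood_alt)
print(f"P(null hypothesis | data) = {posterior_null:.3f}")
```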

The p-value is just one example of a statistical device that is too often applied mechanically, as a black box, without real understanding, and which can be manipulated through data dredging (or “p-hacking”). Gerd Gigerenzer went on to bemoan the general use of “mindless statistics” and the prevalence of “statistical rituals”, and referred to much statistical reasoning as “a meaningless ordeal of pedantic computations”.
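As a simple illustration of how data dredging works (my own sketch, not one of Gigerenzer's examples): correlate enough unrelated variables with an outcome that is pure noise and some of them will come out “significant” at p < 0.05 just by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
outcome = rng.normal(size=50)   # pure noise: no predictor has any real effect
n_predictors = 40

hits = 0
for _ in range(n_predictors):
    predictor = rng.normal(size=50)
    _, p = stats.pearsonr(predictor, outcome)
    if p < 0.05:
        hits += 1

# With 40 null predictors we expect about 40 * 0.05 = 2 spurious "significant" results
print(f"{hits} of {n_predictors} null predictors came out 'significant' at p < 0.05")
```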

Bad statistics isn’t the only thing wrong with academic research, but it is a significant factor.

Digit Ratio Survey

Posted in Bad Statistics, Biographical on February 9, 2015 by telescoper

I was intrigued by an article I found at the weekend which reports on a (no doubt rigorous) scientific study that claims a connection between the relative lengths of index and ring fingers and the propensity to be promiscuous. The assertion is that people whose ring finger is longer than their index finger like to play around, while those whose index finger is longer than their ring finger are inclined to fidelity. Obviously, since the study involves the University of Oxford’s Department of Experimental Psychology, there can be no doubt whatsoever about its reliability or scientific credibility, just like the dozens of other things supposed to be correlated with digit ratio. Ahem.

I do remember a similar study some time ago that claimed that men with a longer index finger (2D) than ring finger (4D) (i.e. with a 2D:4D digit ratio greater than one) were much more likely to be gay than those with a digit ratio lower than one. Taken with this new finding it proves what we all knew all along: that heterosexuals are far more likely to be promiscuous than homosexuals.

For the record, here is a photograph of my left hand (which, on reflection, is similar to my right, and which clearly shows a 2D:4D ratio greater than unity):

[Photograph of my left hand]

Inspired by the stunning application of the scientific method described in the report, I have decided to carry out a rigorous study of my own. I have heard that, at least among males, it is much more common to have a digit ratio less than one than greater than one, but I can’t say I’ve noticed it myself. Furthermore, a previously unanswered question in the literature is whether there is a connection between digit ratio and the propensity to read blogs. I will now subject this to rigorous scientific scrutiny by inviting readers of this blog to complete the following simple survey. I look forward to publishing my findings in due course in the Journal of Irreproducible Results.

PS. The actual paper on which the report was based is by Rafael Wlodarski, John Manning, and R. I. M. Dunbar,