**Amy Perfors**

In a previous post about the Gerber & Malhotra paper about publication bias in political science, I rather optimistically opined that the findings -- that there were more significant results than would be predicted by chance, and that many of those were suspiciously close to 0.05 -- were probably not deeply worrisome, at least for those fields in which experimenters could vary the number of subjects run based on the significance level achieved thus far.

Well, I now disagree with myself.

This change of mind comes as a result of reading about the Jeffreys-Lindley paradox (Lindley, 1957), a Bayes-inspired critique of significance testing in classical statistics. It says, roughly, that with a large enough sample size, a p-value can be arbitrarily close to zero even while the posterior probability of the null hypothesis is arbitrarily close to one. In other words, a classical statistical test might reject the null hypothesis at an arbitrarily low p-value even when the evidence that it should be accepted is overwhelming. [A discussion of the paradox can be found here].

When I learned about this result a few years ago, it astonished me, and I still haven't fully figured out how to deal with all of its implications. (Obviously, since I forgot about it when writing the previous post!) As I understand the paradox, the intuitive idea is this: with a larger sample size, you will naturally get some data that appear unlikely (and the more data you collect, the more likely you are to see some really unlikely data). If you forget to compare the probability of those data under the null hypothesis with their probability under the alternative hypotheses, then you might get an arbitrarily low p-value (indicating that the data are unlikely under the null hypothesis) even if the data are even *more* unlikely under every alternative. Thus, if you look *just* at the p-value, without taking into account effect size, sample size, or the comparative posterior probability of each hypothesis under consideration, you are liable to wrongly reject the null hypothesis even when it is the most probable of all the possibilities.
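To make the paradox concrete, here is a small numerical sketch of my own (not from Lindley's paper): hold the test statistic fixed at z = 2.5, so the two-sided p-value stays around 0.012 no matter the sample size, and watch what happens to the posterior probability of the null as n grows. The model is the standard textbook setup: a sample mean with known unit variance, H0: mu = 0, and under H1 a normal prior mu ~ N(0, 1). The choice of prior width (and the 50/50 prior odds) are assumptions that change the numbers but not the pattern.

```python
# Illustration of the Jeffreys-Lindley paradox: a fixed z (and hence a
# fixed p-value) corresponds to higher and higher posterior probability
# of the null as the sample size grows.
import math

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_prob_null(z, n, tau2=1.0, prior_null=0.5):
    """P(H0 | data) for x̄ ~ N(mu, 1/n), H0: mu=0, H1: mu ~ N(0, tau2)."""
    xbar = z / math.sqrt(n)                     # sample mean implied by z
    like_h0 = normal_pdf(xbar, 1.0 / n)         # marginal of x̄ under H0
    like_h1 = normal_pdf(xbar, tau2 + 1.0 / n)  # marginal of x̄ under H1
    bf01 = like_h0 / like_h1                    # Bayes factor for the null
    return prior_null * bf01 / (prior_null * bf01 + (1 - prior_null))

for n in (10, 1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}: p-value ~ 0.012, P(H0 | data) = "
          f"{posterior_prob_null(2.5, n):.3f}")
```

With these (arbitrary) settings, the posterior probability of the null climbs from well below one half at n = 10 to above 0.9 once n is in the hundreds of thousands, even though the p-value never budges from "significant."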

The tie-in with my previous post, of course, is that this implies it *isn't* necessarily "okay" practice to keep increasing the sample size until you achieve statistical significance. Of course, in practice, sample sizes rarely get larger than 24 or 32 -- at the absolute outside, 50 to 100 -- which is much smaller than infinity. Does this practical consideration, then, mean that the practice is okay? As far as I can tell, it is fairly standard (but then, so is the reliance on p-values to the exclusion of effect sizes, confidence intervals, etc., so "common" doesn't mean "okay"). Is this practice a bad idea only if your sample gets extremely large?
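Even at these realistic sample sizes, a quick simulation (my own, with arbitrary parameters) suggests the answer is no. Suppose all data come from the null (standard normal, true mean zero), and the experimenter tests after every batch of 8 subjects up to a maximum of 64, stopping as soon as a two-sided z-test gives p < .05. Every rejection here is by construction a false positive:

```python
# Simulation of the "keep adding subjects until p < .05" practice.
# True effect is zero, so the long-run rejection rate should be the
# nominal alpha = 0.05 -- but peeking after every batch inflates it.
import math
import random

def two_sided_p(values):
    """Two-sided z-test p-value for mean 0, known sigma = 1."""
    n = len(values)
    z = (sum(values) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def optional_stopping_rejects(batch=8, max_n=64, rng=random):
    """Run one experiment with interim looks; True = (falsely) significant."""
    data = []
    while len(data) < max_n:
        data.extend(rng.gauss(0, 1) for _ in range(batch))
        if two_sided_p(data) < 0.05:
            return True     # stopped early and declared a "significant" effect
    return False

random.seed(1)
trials = 20_000
false_pos = sum(optional_stopping_rejects() for _ in range(trials)) / trials
print(f"false positive rate with optional stopping: {false_pos:.3f}")
```

With eight interim looks, the realized false-positive rate comes out well above the nominal 5%, so the problem doesn't wait for astronomically large samples to show up.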

Lindley, D. V. (1957). A statistical paradox. *Biometrika*, 44, 187-192.

Posted by Amy Perfors at October 31, 2006 10:00 AM