Here is a question for you: Imagine you are asked to conduct an observational study to estimate the effect of wearing a helmet on the risk of death in motorcycle crashes. You have to choose one of two datasets for this study: either a large, rather heterogeneous sample of crashes (these happened on different roads, at different speeds, etc.) or a smaller, more homogeneous sample of crashes (let's say they all occurred on the same road). Your goal is to unearth a trustworthy estimate of the treatment effect that is as close as possible to the 'truth', i.e., the effect estimate obtained from an (unethical) experimental study on the same subject. Which sample do you prefer?

Naturally, most people tend to choose the large sample. Larger sample, smaller standard error, less uncertainty, better inference… we’ve heard it all before. Interestingly, in a recent paper entitled "Heterogeneity and Causality: Unit Heterogeneity and Design Sensitivity in Observational Studies," Paul Rosenbaum comes to the opposite conclusion. He demonstrates that heterogeneity, not sample size, is what matters for the sensitivity of your inference to hidden bias (a topic we blogged about previously here and here). He concludes that:

“In observational studies, reducing heterogeneity reduces both sampling variability and sensitivity to unobserved bias—with less heterogeneity, larger biases would need to be present to explain away the same effect. In contrast, increasing the sample size reduces sampling variability, which is, of course useful, but it does little to reduce concerns about unobserved bias.”
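To make the quoted claim concrete, here is a minimal sketch of a sensitivity analysis for the sign test on matched-pair differences. Everything in it is an illustrative assumption rather than Rosenbaum's own analysis or data: the two scenarios (400 pairs with standard deviation 1 versus 100 pairs with standard deviation 0.5, both with mean difference 0.5), the choice of the sign test, and the function names. The two scenarios are picked so that the standard error of the mean difference is identical. Instead of simulating data, the sketch plugs in the expected number of positive differences under a normal model. Under a hidden bias of magnitude Γ, the chance that a pair's difference is positive under the null is at most Γ/(1+Γ), which gives a worst-case binomial p-value.

```python
import math
from statistics import NormalDist


def binom_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for j in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log(1.0 - p)
        )
        total += math.exp(log_term)
    return total


def worst_case_p(n_pairs, n_positive, gamma):
    """Upper bound on the one-sided sign-test p-value under hidden bias gamma."""
    p_max = gamma / (1.0 + gamma)  # largest null probability of a positive sign
    return binom_upper_tail(n_pairs, n_positive, p_max)


def largest_gamma_still_significant(n_pairs, n_positive, alpha=0.05,
                                    step=0.01, cap=50.0):
    """Largest gamma on a grid starting at 1 whose worst-case p-value is <= alpha.

    Returns None if the test is not significant even without hidden bias.
    """
    best = None
    i = 0
    while 1.0 + i * step <= cap:
        gamma = 1.0 + i * step
        if worst_case_p(n_pairs, n_positive, gamma) > alpha:
            break
        best = gamma
        i += 1
    return best


# Hypothetical scenarios (assumptions for illustration, not real crash data):
# name -> (number of pairs, standard deviation of the pair differences)
MEAN_EFFECT = 0.5
scenarios = {"heterogeneous": (400, 1.0), "homogeneous": (100, 0.5)}

results = {}
for name, (n_pairs, sd) in scenarios.items():
    share_positive = NormalDist().cdf(MEAN_EFFECT / sd)  # P(difference > 0)
    n_positive = round(n_pairs * share_positive)         # expected count, rounded
    results[name] = {
        "standard_error": sd / math.sqrt(n_pairs),
        "design_sensitivity": share_positive / (1.0 - share_positive),
        "max_gamma": largest_gamma_still_significant(n_pairs, n_positive),
    }
    print(name, results[name])
```

In this setup both samples have the same standard error, yet the smaller, homogeneous sample stays significant up to a larger Γ. The design sensitivity (the limiting Γ as the number of pairs grows, which for the sign test is the odds of a positive difference) is roughly 2.2 in the heterogeneous scenario and roughly 5.3 in the homogeneous one, so adding more heterogeneous pairs would never close the gap.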

This basic insight about the role of unit heterogeneity in causal inference goes back to John Stuart Mill’s 1864 System of Logic. In this regard, Rosenbaum’s paper is a nice comparison to Jas’s view on Mill’s methods. Of course, Sir Ronald Fisher dismissed Mill's plea for unit homogeneity: in experiments, where you have randomization working for you, hidden bias is not a real concern, so you may as well go for the larger sample.

Now you may say: well, it all depends on the estimand, no? Do I care about the effect of helmets in the US as a whole or only on a single road? This point is well taken, but keep in mind that for causal inference from observational data we often care about internal validity first and not necessarily about generalizability (most experiments are also done on highly selective groups). In any case, Rosenbaum’s basic intuition remains and has real implications for the way we gather data and judge inferences. Next time you complain about a small sample size, you may want to think about heterogeneity first.

So finally back to the helmet example. Rosenbaum cites an observational study that deals with the heterogeneity issue in a clever way: “Different crashes occur on different motorcycles, at different speeds, with different forces, on highways or country roads, in dense or light traffic, encountering deer or Hummers. One would like to compare two people, one with a helmet, the other without, on the same type of motorcycle, riding at the same speed, on the same road, in the same traffic, crashing into the same object. Is this possible? It is when two people ride the same motorcycle, a driver and a passenger, one helmeted, the other not. Using data from the Fatality Analysis Reporting System, Norvell and Cummings (2002) performed such a matched pair analysis using a conditional model with numerous pair parameters, estimating approximately a 40% reduction in risk associated with helmet use.”

Posted by Jens Hainmueller at 8:30 AM