I've spent quite a bit of time in the last few weeks - probably too much - thinking about the term 'regression' and its use in statistics, and why I find it so dislikeable. I sincerely doubt any campaign I try to start will have any real effect, so let me lay down the reasons why I feel we as scientists should refer to linear modelling as just that, and not as 'regression'.

One reason is that the word has only a tenuous connection to the actual algorithm; the other is that it far too often implies a causal relationship where none exists.

As the story goes, Francis Galton took a group of tall men and measured the height of their sons, and found that on average, the sons as a group were shorter than their fathers. Drawing on similar work he had done with pea plants, he described this phenomenon as "regression to the mean," recognizing that the sample of fathers was nonrandom. A "regression coefficient" then described the estimated parameter which, when multiplied by the predictor, gives the expected value of the outcome.
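Galton's observation is easy to reproduce in simulation. The sketch below uses made-up numbers (a population mean of 175 cm and a father-son correlation of 0.5, not Galton's data) to show that the sons of an unusually tall, nonrandom sample of fathers sit, on average, closer to the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sd, r = 175.0, 7.0, 0.5  # population mean (cm), SD, father-son correlation (all invented)

fathers = rng.normal(mu, sd, 100_000)
# sons share the marginal distribution but correlate with their fathers at r
sons = mu + r * (fathers - mu) + rng.normal(0, sd * np.sqrt(1 - r**2), fathers.size)

tall = fathers > mu + sd  # a deliberately nonrandom sample: tall fathers only
print(f"tall fathers' mean height: {fathers[tall].mean():.1f} cm")
print(f"their sons' mean height:   {sons[tall].mean():.1f} cm")
```

The sons of the tall subgroup come out shorter than their fathers yet still taller than the population average: they have "regressed" part of the way back to the mean.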

I can only surmise that "determining regression coefficients by minimizing the sum of squared differences" was too verbose for Galton and his buddies, and "regression analysis" stuck. Now we have lawyerese terms like "multiple regression analysis," which really should read "multiple parameter regression analysis" since we're only running one algorithm, but we appear stuck with it.
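For concreteness, the "regression coefficient" is just the slope that minimizes the sum of squared differences between the fitted line and the data. A minimal sketch, with invented data and a true slope of 0.6:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 500)                       # predictor (toy data)
y = 2.0 + 0.6 * x + rng.normal(0, 0.5, 500)     # response with true intercept 2.0, slope 0.6

# least-squares estimates of intercept and slope via the design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept estimate: {beta[0]:.2f}, slope estimate: {beta[1]:.2f}")
```

Nothing in this computation "goes back" anywhere, which is rather the point of the complaint below.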

So what's the big deal? Nomenclature isn't an easy business, and two extra syllables in "linear model" might slow things down. But aside from my gripe with using "regress" as a transitive verb (the Latin student in me cringes), even the most generous interpretation of the word's root, and of the experiments that revealed it, leads to trouble.

"Regression" literally means "the act of going back." If we accept this definition in this context, we have to have something to which we can return. Clearly, this implies discovering the mean - but chronologically, it can only mean discovering the cause, that which came before.

Linear modelling makes no explicit assumptions about cause and effect - a major source of headaches in our discipline - but the word itself, consciously or otherwise, binds us to a causal reading.

The remedy to this is not simple; after all, I'm talking about trying to break the correlation-is-causation fallacy through words, which is both a difficult task and the sort of behaviour that will keep people from sitting with you at lunch. But we can improve things slowly and subtly in this fashion:

1) If you are confident that your analysis will unveil a causal relationship, say so. Call it "regression-to-cause", or "causal linear model", or something like that.

2) If you're not so sure, call it a (generalized) linear model, or a lin-mod, or a least-squares, or another term that does not necessarily imply causation. Resist the temptation to fall back to the word "regression" until a long time has passed.

This doesn't have to be a completely nerve-wracking exercise; just use a strike-through when necessary, to show that the term ~~regression~~'linear model' is better suited to describe what we're trying to build here.

Posted by Andrew C. Thomas at August 7, 2006 11:30 PM