# The genetics of zip codes

In political science, as in many other social sciences, researchers are paying more attention to the genetic bases of political behavior (I won't say effects, because that opens a whole other can of worms). While looking for an overview of the statistical issues involved, I came across a couple of blog posts by Cosma Shalizi at Carnegie Mellon that were both informative and amusing. An excerpt:

When we take our favorite population of organisms (e.g., last year's residents of the Morewood Gardens dorm at CMU), and measure the value of our favorite quantitative trait for each organism (e.g., their present zip code), we get a certain distribution of this trait:

| Genotype | ZIP |
| --- | --- |
| AATGAAATAAAAAAAAACGAAAATAAAAAA... | 15232 |
| AAGGCCATTAAAGTTAAAATAATGAAAGGA... | 15213 |
| AAGGCCATTAAAGTTAAAATAATGAAAGGA... | 48104 |
| CAATGATTAGGACAATAACATACAAGTTAT... | 15212 |
| GGGGTTAATTAATGGTTAGGATGGGTTTTT... | 87501 |
| CCTTCAAAGTTAATGAAAAGTTAAAATTTA... | 15217 |
| CCTTCAAAGTTAATGAAAAGTTAAAATTTA... | 15217 |
| TAAGTATTTGAAGCACAGCAACAACTAGGT... | 02474 |

(Note to our institutional review board: No undergraduates had their DNA sequenced in the writing of this essay.)

If we are limited to the tools of early 20th century statistics (in particular, if we are the great R. A. Fisher, and so simultaneously forging those tools while helping to found evolutionary genetics), we summarize the distribution with a mean and a variance. We can inquire as to where the variance in the population comes from. In particular, assuming the organisms are not all clones, it is reasonable to suppose that some of the variation goes along with differences in genes. The fraction of variance which does so is, roughly speaking, the "heritability" of the trait.

The most basic sort of analysis of variance (see also: Fisher) would make this conceptually simple, though practically unsuccessful. Simply take all the organisms in the population, and group them by their genotypes. For each group of genetically identical organisms, compute the average value of the trait. Compare the variance of these within-genotype averages (that is, the across-genotype variance) to the total population variance; this is the fraction of variation associated with genotypes. In most mammalian populations, where clones (identical twins, triplets, ...) are rare and every organism otherwise has a unique genotype, this would tell you that almost all of the variance of any trait is associated with genetic differences. On such an analysis, almost all of the variance in zip codes in my example would be "due to" genetic differences, and the same would be true of telephone numbers, social security numbers, etc.
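
In symbols, the grouping described here is the law of total variance. Writing $Y$ for the trait and $G$ for the genotype (notation chosen for this sketch, not taken from the excerpt), the decomposition can be written roughly as:

$$
\operatorname{Var}(Y)
= \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid G]\big)}_{\text{between genotypes}}
+ \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid G)\big]}_{\text{within genotypes}},
\qquad
\text{fraction associated with genotype}
= \frac{\operatorname{Var}\big(\mathbb{E}[Y \mid G]\big)}{\operatorname{Var}(Y)}.
$$

When every organism has a unique genotype, each group holds a single value, so $\operatorname{Var}(Y \mid G) = 0$ for every group and the fraction is exactly 1, whatever the trait happens to be.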

To see why, look at my table again. With one exception (the twins who live in 15213 and 48104), in this population changing zip code means changing your genotype. The vast majority (81%) of the variance in zip codes is between genotypes, not within them. With real human data, a quarter of the people wouldn't be twins living apart, and the proportion of variance in zip codes "due to" genotype would be even higher.

Naively, then, on this analysis we would say that the "heritability" of zip code, the fraction of its variance which goes along with genetic variations, is 81%. It is crucial to be clear on what this means, which is merely and exactly this: in this population, if we take a random group of genetically identical people, the variance within that group should be 19% (=100-81) of the total variance in the population.
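
For readers who want to try the grouping themselves, here is a minimal Python sketch run on the eight rows of the table. It rests on assumptions that are not in the excerpt: rows whose truncated genotype strings match are treated as genetically identical, ZIP codes are read as plain integers (so 02474 becomes 2474), variances are population variances weighted by group size, and the function name `variance_fraction_between` is made up for this illustration.

```python
from collections import defaultdict

# Rows copied from the table: (truncated genotype prefix, ZIP code).
# Assumption: identical prefixes stand for identical genotypes.
rows = [
    ("AATGAAATAAAAAAAAACGAAAATAAAAAA", 15232),
    ("AAGGCCATTAAAGTTAAAATAATGAAAGGA", 15213),
    ("AAGGCCATTAAAGTTAAAATAATGAAAGGA", 48104),
    ("CAATGATTAGGACAATAACATACAAGTTAT", 15212),
    ("GGGGTTAATTAATGGTTAGGATGGGTTTTT", 87501),
    ("CCTTCAAAGTTAATGAAAAGTTAAAATTTA", 15217),
    ("CCTTCAAAGTTAATGAAAAGTTAAAATTTA", 15217),
    ("TAAGTATTTGAAGCACAGCAACAACTAGGT", 2474),  # 02474 read as an integer
]


def variance_fraction_between(rows):
    """Return the share of total variance that lies between genotype groups."""
    values = [z for _, z in rows]
    grand_mean = sum(values) / len(values)
    ss_total = sum((z - grand_mean) ** 2 for z in values)

    # Group trait values by genotype.
    groups = defaultdict(list)
    for genotype, z in rows:
        groups[genotype].append(z)

    # Sum of squared deviations of each value from its own group mean.
    ss_within = 0.0
    for zs in groups.values():
        group_mean = sum(zs) / len(zs)
        ss_within += sum((z - group_mean) ** 2 for z in zs)

    # Between share = 1 - within share (sums of squares add up).
    return 1.0 - ss_within / ss_total


between = variance_fraction_between(rows)
print(f"between-genotype share: {between:.3f}")
print(f"within-genotype share:  {1 - between:.3f}")
```

Under these particular assumptions the between-genotype share on the eight listed rows comes out near 90% rather than the quoted 81%, so the excerpt's figure presumably rests on a different estimator or on data not shown here. The qualitative point is unchanged: the only within-genotype variation comes from the one pair of twins living apart, and a population with no such pairs would give a share of exactly 1.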

Posted by Mike Kellermann at November 28, 2007 4:20 PM