Following Gov 2000 or the equivalent course in linear regression, Gov 2001 is the second in the methods sequence for Government Department graduate and undergraduate students. While not required, most Government graduate students doing empirical work take the course. Graduate students in other departments and schools at Harvard (and in the area) also take the course. Undergraduates are especially welcome to take Gov 1002, which is taught along with this class. Non-Harvard students and others may also take this course by registering through the Harvard Extension School, for course credit or as an auditor (see course number Stat E-200).

If there are seats in the room, you're welcome to attend even if you're not formally registered, but if possible we would appreciate it if you would sign up formally (as our teaching fellows get paid more!). If you are not a Harvard student, you can easily do this via Harvard Extension School course Stat E-200 (and there is financial aid if you need it too).

If you need cross-registration papers signed, please bring them to the first class. You don't need permission from us to take the course. We observe that students who take the course for a grade participate more and get far more out of the experience (even among many of those who think or say it will be otherwise), but pass/fail and formal auditing are okay with us too.

For students who've taken a course in linear regression (such as Harvard's Gov 2000), this course gives you the tools to learn new statistical methods or build them yourself. We focus on methods practically useful in real social science research. We aim to give you two types of skills.

First, we show how to develop new approaches to research methods, data analysis, and statistical theory. More advanced statistical theory is not required when data and variables follow standard assumptions. Since this is not usually the case in most of the social sciences, we often cannot use ready-made statistical procedures developed elsewhere and for other purposes. We teach the underlying theory of inference (which, at its most fundamental, is merely using facts you know to learn about facts you don't know); once it is understood, we can easily “reinvent” known statistical solutions to accommodate social science data, learn new techniques as they are developed, or even invent original approaches when required. Students will learn how to read an original scholarly article describing a new statistical technique, implement it in computer code, estimate the model with relevant data, understand and interpret the results, and present and explain the results to someone unfamiliar with statistics.

Second, students will learn how to make novel substantive contributions to the scholarly literature. In the past, some students who completed the course published a revised version of their class paper in a scholarly journal. For most of these students, this was their first professional publication. For papers from previous years, see the Gov 2001 Dataverse.

Gov 2000, a course in linear regression (with matrices), or the equivalent. If you know what

$b = (X'X)^{-1}X'y$

means, you're probably ok.
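If you would like to check your understanding of that formula, here is a minimal sketch of it in plain Python. It is an illustration only and not course material: the course itself uses R, the helper names (`transpose`, `matmul`, `solve`, `ols`) are made up for this example, and the data are invented so that the answer is known in advance. Rather than inverting X'X explicitly, the sketch solves the equivalent normal equations (X'X)b = X'y.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("X'X is singular; the columns of X are collinear")
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients b from the normal equations (X'X) b = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Invented data: y is exactly 1 + 2x, so b should recover those coefficients.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
X = [[1.0, xi] for xi in x]          # first column of ones for the intercept
y = [1.0 + 2.0 * xi for xi in x]
b = ols(X, y)
print(b)
```

Each row of `X` is one observation and the leading column of ones gives the intercept, so `b[0]` is the intercept and `b[1]` is the slope.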

Most in-class experience will be lecture-based, but some parts are designed as a *collective* experience. This means that other students will be counting on you (and you on them), and so please come to class prepared. If you don't understand something, that's perfectly fine; we'll figure it out together and make sure no one is left behind. But if you don't put in the effort, it will hurt what everyone gets out of the class. If you have a question about one of the readings, post a question in Perusall. If you think you may know an answer to a query another student posted, or have a suggestion, please try to answer it. In fact, if you merely have an interesting idea, please contribute that as well.

The best way, and often the only way, to learn new statistical procedures is by doing. We will therefore make extensive use of a flexible (open-source and free) statistical software program called R and a companion package called Zelig (which we designed for you and those in your position). R is among the most widely used statistical software programs, and Zelig is a widely used package in R. You will learn how to program in this class, if you do not know already.

For hardware, you are welcome to use your own computers. To install R and Zelig on your computer, see zeligproject.org. You are also welcome to use the HMDC computer labs (in the concourse and 3rd floor of CGIS North-Knafel, 1737 Cambridge Street), which have computers with R already installed on them. Harvard affiliates also have the option of registering for a Research Computing Environment (RCE) account through http://hmdc.harvard.edu. An RCE account gives you access to HMDC's cluster of servers, which are fast and well-equipped to handle large data sets or time-intensive procedures. In addition, these servers supply a persistent (Linux) desktop environment that is accessible from any computer with an Internet connection.

Most of the probability and statistical theory in this class will be taught in the context of "Monte Carlo simulation" (which we do not expect you to know prior to the course). We will write computer programs to verify, or substitute for, more difficult formal mathematical proofs. This intuitive technique will make it much easier to understand and to implement new statistical methods.
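To give a flavor of what verifying a result by simulation can look like, here is a minimal sketch in plain Python. It is an illustration only and not taken from the course, which uses R: the function name `simulate_sample_means`, the seed, and the sample sizes are all arbitrary choices for this example. It checks by simulation the textbook result that the mean of n independent draws with variance σ² has variance σ²/n.

```python
import random
import statistics


def simulate_sample_means(n, reps, mu=0.0, sigma=1.0, seed=2001):
    """Draw `reps` samples of size `n` from a normal distribution and
    return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


n, sigma = 10, 1.0
means = simulate_sample_means(n=n, reps=20000, sigma=sigma)

# Variance of the simulated sample means versus the analytical value.
simulated_variance = statistics.pvariance(means)
theoretical_variance = sigma ** 2 / n

print("simulated:  ", round(simulated_variance, 4))
print("theoretical:", round(theoretical_variance, 4))
```

The two numbers should agree closely, and the agreement improves as `reps` grows. The same pattern (simulate many data sets, compute the quantity of interest in each, summarize) also works when no convenient formula is available.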

Each week, you will have reading and problem sets.

Reading assignments will be acquired and done in Perusall.com. You will also collaboratively annotate the readings, asking questions you may have, answering other students' questions, and generally engaging with the material and each other. To get started in Perusall, see Getting Started. We will explain more in Section as well.

We strongly encourage you to work together in groups on the problem sets, so long as you write up your work on your own, by yourself, without having anyone check your work before you hand it in.

Problem sets must be submitted each week by the beginning of section (Wednesday 6PM). The full solution key will be posted so you can review your answers. Because we will be posting solution keys immediately after deadlines, we can't give credit for late work. You can still turn in late work for feedback and help learning the material. The problem sets, including reviewing the solution keys, are an extremely important part of the learning process, so please keep up and let us know if you have any questions.

The main assignment is to write a research paper that replicates an existing piece of scholarship. The goal of the paper is to apply some advanced method to, or develop one for, a substantive problem in your field of study. You should aim to produce a publishable article, and, in fact, most students do publish their final paper in a scholarly journal. (I know it sounds absurdly hard, but that's only because you haven't learned some of the material we go over in class!) More information about the paper can be found at http://gking.harvard.edu/papers/.

You must find a co-author and a paper to replicate by **Wednesday, February 24, at 5pm**, by which point you should upload to Canvas a PDF copy of the paper along with a brief paragraph explaining your choice. You are also required to have one of your classmates who is not your co-author sign off on your article choice after checking that your article meets all the criteria listed in "Publication, Publication".

On **Wednesday, March 23**, you must turn in a draft of the paper with little text but with figures and tables, and a proposed table of contents for your paper, in a relatively polished form. You should also arrange to hand over all of the data and information necessary to replicate the results of your analysis and reproduce your tables and figures. (We will coordinate the exchange of files and code through Canvas and via announcements in class.) On that day, you will hand over your paper and materials to another student we assign to you, and, in exchange, you will receive a different student's paper. Your task for the following week is to replicate the other student's analysis and write a memo to this student (with a copy to us), pointing out ways to make the paper and the analysis better. You will be evaluated based on how helpful, not how destructive, you are.

The final version of the paper is due the day before Reading Period, **Wednesday, April 27, at 5pm**. You must turn in a hard copy of the paper and all data and code (on Canvas). You must also follow standard academic practice and create a permanent replication archive by uploading all your data and code to the Gov 2001 Dataverse (http://dvn.iq.harvard.edu/dvn/dv/gov2001).

If you need an extension on the replication paper, you do not need to ask permission: we will accept papers until **Thursday, May 5, at 5pm**, but since you will have had more time, papers turned in after the original deadline will be graded according to proportionately higher standards. The number of incompletes we plan to give is governed by a Poisson distribution with λ=0.01, so please plan accordingly.

Once all papers are turned in, you aren't quite finished. We will turn over your replication paper to another student and assign you a small set of replication papers to evaluate. Your last assignment for the class will be to read and comment on a fellow group's work and to grade this paper according to guidelines we will provide. Your main objective is to give feedback on what changes and improvements are needed for the paper to be published. As always, you will be evaluated based on how helpful, not how destructive, you are. Your comments on your fellow students' paper are due **Friday, May 13, at 5pm**.

One of the best ways that people learn is by teaching and collaborating with others. We facilitate collaboration in several different ways:

- In lecture we'll occasionally use Learning Catalytics to help us understand difficult questions related to the content of the lecture. Learning Catalytics will automatically assign you to small groups (chosen via algorithm) to discuss your answers to these questions. Since "teaching teaches the teacher," everyone in this setting will get some of the benefits of being a teacher.
- We encourage you to help each other out on problem sets. While the final product that you turn in must be your own individual work that you have written up in isolation from your partners, you can still seek the help of your peers if you get stuck on a particular part of the problem set. Learning from and teaching your peers is a great way to master the content of the class and foster relationships with your colleagues. Note that this does not apply to the assessment questions, which must be completed independently.
- On the replication paper, you will choose your collaborators. This will give you the chance to write a journal-quality research paper with the help of your peers.
- Please participate in the Canvas discussion forum: ask questions if you have them, post ideas if they occur to you, and answer the queries from others.

This course is being offered as part of the Harvard Extension School's Distance Education Program. The recorded class meetings that you will view are from the Harvard FAS course, Government 2001, which meets once per week throughout the term. Even though your participation will take place online, you are responsible for homework, readings, quizzes, and all other work. There will also be weekly on-campus section meetings and office hours for students who are able to attend; students who cannot attend may watch the video of the section. Please see the Harvard Extension School distance education web site for more information.

Students taking the class through the Extension School will complete a final exam instead of the replication paper. They will, however, participate in the replication assignment by replicating others' work.

All students will need to have access to the course webpage, which is operated by FAS. If you do not already have a Harvard ID, please make arrangements to get one or to set up an XID.

If you're in town, we'd love to have you in class physically, as long as there is room (and there usually is).

All reading assignments must be acquired and read at the web site Perusall (which can also be accessed through Canvas, Harvard's learning management system). Perusall will enable you to obtain answers to questions instantly and to work collaboratively with other members of class.

All readings are freely accessible to members of this class, except for the main text, [Gary King, Unifying Political Methodology: The Likelihood Theory of Statistical Inference. University of Michigan Press, 1998], which must be purchased at Perusall. In addition to the required text, we will assign a wide variety of scholarly papers.

Reading assignments will be announced at the end of every class.

Help is available when you need it. If you have any questions about the homework, your paper, or anything else related to the course, please use the class discussion forum on the Canvas site. Since all three of us and all students will be reachable via this platform, it's a very efficient way to get answers to questions that do not fit as comments on the video annotation tool sites. *Please also respond to inquiries if you happen to know the answer.* (You can control how often the platform emails you a digest of the latest Q&A.)

We will also use Canvas to post announcements regarding course logistics, including readings, video assignments, and problem sets.

Final grades will be a weighted average of the replication paper (or, for Extension School students, the final exam), weekly problem sets, annotation grades in Perusall, and *participation*. (Students who write the replication paper will not take a final exam.)

"Participation" includes preparing for class, joining in the class discussion, coming to class and section, making a serious effort to contribute to the discussion queries in Perusall, and finding other ways of helping your classmates learn more. Finally, since everyone learns more when more connections exist among students, finding ways to help build class camaraderie will also count as part of participation and be very much appreciated by us!

The timeline below gives the outline of the weekly schedule. Students are expected to:

- Prepare for lecture (before Monday): do the assigned readings and discuss them on Perusall
- Attend class (Mon. 2-4PM)
- Complete the weekly problem set (Wed. 6PM)
- Attend section (Wed. 6-7PM or 7-8PM)

Keeping up with the weekly schedule is extremely important not only for your learning but for the rest of the class as well.

After the foundational material is presented (roughly the first third of the class), I will introduce a large variety of statistical models and methods. I will choose these based on what makes sense from a pedagogical perspective at first, but as the semester goes on I will choose more and more material based on student interest and class projects.

For more information on the content of the class, see the detailed lecture notes online, which give a general outline. Here's another version of some of the material:

- What is statistics?
- What is political methodology?
- Models and a language of inference
- The role of simulation
  - to solve probability problems
  - to evaluate estimators
  - to compute features of probability distributions
  - to transform statistical results into quantities of interest
- Stochastic components (normal, log-normal, Bernoulli, Poisson, etc)
- The relationship between stochastic and systematic components and data generation processes
- Systematic components (linear, logit, etc.)
- Uncertainty and Inference
- Probability as a model of uncertainty
- Probability distributions, theory, discrete, continuous, examples
- Inference
- Inverse probability problems
- The likelihood theory of inference
- The Bayesian theory of inference
- Detailed example: Forecasting presidential elections
- Properties of maximum likelihood estimation (finite sample, asymptotic, etc.)
- Precision of likelihood estimates

We will not get to all these topics, and the list of topics we do cover will likely include others than those listed here, depending on student interest.

- Discrete regression models
- Binary variables
- Interpreting functional forms
- Ordinal variables
- Grouped uncorrelated binary variables
- Event count models: correlated and uncorrelated events; over- and underdispersion
- Basic time series models
- Basic multiple equation models, including identification
- Multinomial choice models
- Models for selection bias, censoring, and truncation
- Models for duration
- Hurdle models
- Case-control designs
- Model dependence
- Matching as nonparametric preprocessing
- Rare events
- Neural network models
- An overview of MCMC methods
- Compositional data
- Missing data (item and unit nonresponse) problems
- Ecological inference (avoiding aggregation bias)
- Models for reciprocal causation and endogeneity
- Empirical and hierarchical Bayesian analysis
- Time series cross-sectional data
- Models for interpersonal incomparability in surveys
- Text Analysis

I've written up a version of the theory of teaching behind this class in the article "How Social Science Research Can Improve Teaching". You can also watch the accompanying video at the same link. I try to develop new or improved teaching and learning tools every year, and so you'll likely see differences from this description in class.

King, Gary. 1998. *Unifying Political Methodology: The Likelihood Theory of Statistical Inference*. Ann Arbor: University of Michigan Press.

A variety of papers will be assigned as well.

It is also helpful to have access to a book on R/S programming such as

- Fox, John. 2002. *An R and S-Plus Companion to Applied Regression*. Sage Publications.
- Imai, Kosuke, Gary King, and Olivia Lau. 2016. Zelig: Everyone's Statistical Software, Manuscript.

- Pawitan, Yudi. 2001. *In All Likelihood: Statistical Modelling and Inference Using Likelihood*. Oxford University Press.
- Barnett, Vic. 1982. *Comparative Statistical Inference*. 2nd edition. Wiley.
- Chiang, Alpha. 1984. *Fundamental Methods of Mathematical Economics*. McGraw-Hill.
- DeGroot, Morris H. 1986. *Probability and Statistics*. Addison-Wesley. Or Mendenhall, William and Robert J. Beaver. 1994. *Mathematical Statistics with Applications*. Duxbury.
- Edwards, A.W.F. 1984. *Likelihood*. Cambridge University Press.
- Gelman, Andrew et al. 2004. *Bayesian Data Analysis*. Chapman and Hall.
- Gill, Jeff. 2008. *Bayesian Methods: A Social and Behavioral Sciences Approach*, 2nd ed. Chapman and Hall.
- Harvey, Andrew C. 1990. *The Econometric Analysis of Time Series*. MIT Press.
- Joreskog, Karl G. and Dag Sorbom, edited by Jay Magidson. 1979. *Advances in Factor Analysis and Structural Equation Models*. University Press of America.
- King, Gary. 1997. *A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data*. Princeton: Princeton University Press.
- Kleppner, Daniel and Norman Ramsey. *Quick Calculus*. Wiley.
- Lee J. Bain and Max Engelhardt. 1987. *Introduction to Probability and Mathematical Statistics*. Duxbury.
- McCullagh, Peter and J. A. Nelder. 1993. *Generalized Linear Models*. Chapman-Hall.
- Mills, Terence C. 1990. *Time Series Techniques for Economists*. New York: Cambridge University Press.
- Norman L. Johnson and Samuel Kotz. *Distributions in Statistics*, four volumes. John Wiley and Sons.
- Rice, John A. 1995. *Mathematical Statistics and Data Analysis*, 2nd Ed. Belmont, CA: Duxbury Press.
- Rubinstein, Reuven Y. 1981. *Simulation and the Monte Carlo Method*. New York: John Wiley.
- Schafer, Joseph L. 1997. *Analysis of Incomplete Multivariate Data*. New York: Chapman-Hall.
- Tanner, Martin A. 1996. *Tools for statistical inference: observed data and data augmentation methods*, 3rd edition. New York: Springer.