Code it!

Ok, so you have lots of cool data. Or maybe you've realized you need to collect your own data, like through an online survey. What do you need to do now? 

Code it! collects a set of resources for manipulating, creating, and analyzing data. 

We have organized the resources by software program, for programs you might already know a bit of, and by topic, for particular types of data.

Once you Code it!, you might need to Write it!

                                                                                                                                                                                


Stata

Stata is a popular statistical package for the social sciences. The current release of Stata is version 12.

Getting Started

Get Stata at Harvard: Link to the FAS downloads page for getting Stata on your computer

Introduction to Stata: An overview of Stata by Christopher Baum.

UCLA Stat Consulting Group - STATA Resource: The most extensive Stata resource on the web: includes descriptions of coding and statistical analysis techniques as well as an extensive FAQ section.

Stata Tutorial: Tutorial by German Rodriguez at Princeton, with great section on programming within Stata.

HKS Tutorial: Collection of training materials from HKS, with orientation towards public policy.

Basic Quick Start: Basic commands to get started with, from Vassar

Applied Econometrics in Stata: A great guide by Econ Department PhD student Ricardo Perez-Truglia.

Quick References

Stata 11 Quick Reference Card: Collection of common commands on an easy reference card.

Collection of Common Commands: Put together by Andres Zahler, HKS

Update Your Skills

General Questions

Top 10 Questions from Beginners: From the UCLA ATS group

Workflow/Coding Principles

What is a macro, and how do I use it in Stata?: Macros help limit the amount of code you need to write and make it easier to work with temporary files.

What is a loop, how do I use it, why is it good to use loops for programming in Stata, and what are the different types of loops? Link 1, Link 2

How do I merge data in Stata? Link 1, Link 2

Loops, Locals, Regressions: How can I combine loops and locals to estimate lots of separate regression models without having to write everything out?

Collection of programming tricks: loops, macros, globals, etc. A nice collection of examples.

Can I source a do file within another do file? Link 1, Link 2

Dealing with local directories when sharing do-files: What if I'm sharing do files with other people who use different file paths, what are some easy ways to change the project folders?

Additional Coding Tricks


How do I create fake data sets in Stata?

How do I change directories in Stata? Windows, Mac

String variable manipulation: I want to manipulate string variables. What are some ways to do this?

Regression Post-Estimation: After I estimate a regression model, can I access stored estimates somehow? How about getting the variance-covariance matrix?

File paths: What if I'm sharing do files with other people who use different file paths? What are some easy ways to have the paths set automatically?

Use the kountry package to standardize World Bank, Correlates of War, IMF country codes across datasets... don't do it by hand!

Simulation: How to simulate in R and Stata.

Favorite Tricks


Importing Data

Import Data (I): Importing data in txt or csv format into Stata

Import Data (II): Importing data in xml format into Stata

Common Types of Data Manipulation

Merging datasets is a common task in Stata. This page from the UCLA guide describes how to do so, as well as what the values of the "_merge" variable mean.

Reshaping datasets from wide to long is another common task. This page from the UCLA guide describes how to do so. A second page shows how to do so with World Bank data, and its approach applies generally to datasets with many variables.

Outputting Results for Papers

Discussion of different options for results output: resource 1, resource 2, resource 3, resource 4 (estout is one of the most commonly used). Helpful article on "rapid article writing". Outreg is another common Stata module to arrange regression output into publication-ready tables.

Other Common Questions

One-tailed test: Stata reference on how to calculate a one-tailed test after regression estimation
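The arithmetic behind a one-tailed test after a regression is short enough to sketch. The example below is written in Python rather than Stata purely for illustration. It assumes a symmetric test statistic (t or z), a null hypothesis that the coefficient equals zero, and a two-tailed p-value already taken from the regression output. The function name and its arguments are hypothetical.

```python
def one_tailed_p(two_tailed_p, coefficient, alternative="greater"):
    """Convert a two-tailed p-value into a one-tailed p-value.

    Assumes a symmetric test statistic and a null hypothesis of zero.
    `alternative` is "greater" (coefficient > 0) or "less" (coefficient < 0).
    """
    if not 0.0 <= two_tailed_p <= 1.0:
        raise ValueError("two_tailed_p must lie between 0 and 1")
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")

    # Does the estimate point in the direction the alternative hypothesis predicts?
    if alternative == "greater":
        in_direction = coefficient > 0
    else:
        in_direction = coefficient < 0

    half = two_tailed_p / 2
    # In the predicted direction, the one-tailed p is half the two-tailed p;
    # in the opposite direction, it is the complement of that half.
    return half if in_direction else 1 - half


# Illustrative numbers only, not taken from any real regression
print(one_tailed_p(0.08, coefficient=1.5, alternative="greater"))
print(one_tailed_p(0.08, coefficient=-1.5, alternative="greater"))
```

The second call shows why the sign matters: an estimate in the wrong direction gives a large one-tailed p-value, not a small one.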

Graphics

Overview

Stata has a broad array of graphics abilities. A number of pages collect great examples with code. Stata has collected a nice set of thumbnails that you can search through and see the code for.

Princeton Stata Graphics collection, collection of user-written extension packages, more packages, more packages.

Individual graphics tricks

Stata code for bar graph with numbers on top


R

R is an open-source language and environment for statistical computing and graphics.  Check out these R resources to see how to get started!   

Getting Started

Installation resources 

CRAN: Basic Installation of R from CRAN.

R Studio: An integrated R platform that you might want to use to make running R easier.

ESS: Connecting to R using emacs -- not necessary, but if you like using emacs, this is the way to connect it with R.

R Deducer: Connecting to R using this user-friendly GUI with drop-down menus.  Makes very pretty graphs. See this video.

Tutorials to Get You Started

A quick introduction to R from CRAN.

UCLA Stat Consulting Group Guide to R: Includes examples, installation tips, FAQs, and downloadable books on R.

Collection of R tutorials: A collection of tutorials, mostly for beginners, from Rocio Titiunik.



Quick References

General R References

R Reference Card:  4 pages that include 90% of the R commands you will ever need.

ESS Reference Card for S and R: Reference card for Emacs Speaks Statistics - a plugin that allows R to run in the Emacs text editor (available on the RCE).

RSeek:  A search engine just for R.

Google's R Style Guide: Basic rules of coding in R

Quick-R: Great introduction to R, as well as reference for advanced R users. Covers data input, data management, basic and advanced statistics, and graphics in R.

References to Useful R Packages

Zelig:  A useful package for running all types of statistical tests.

Update Your Skills

New Media Tutorials: Blogs, Videos, and Podcasts

R Revolutions blog:  get R updates and lots of R tips.

R Statistics blog: for more advanced users of R

Stat Methods blog: a resource for tips and R trainings around the country.

One R tip a Day:  Follow this twitter feed!

R Graph of the Week blog

R video resources: nyhackR, here (kudos to Drew Conway).

R + Applied Econometrics Blog: This blog, authored by applied econ grad student Kevin Goulding, is an excellent resource. It is a very clear overview of how to implement standard methods (e.g. hypothesis testing, fixed effects) and make nice graphs.

R-Bloggers: R news and tutorials from several hundred R bloggers.

Ports from Other Statistical Packages

Excel to R: Resources on converting to R from Excel

Matlab, Python and R:  Basic commands in all three languages

R for SPSS and SAS Users: An intro to R seen through the lens of SPSS and SAS users.

Rcpp Intro: Helpful slides on integrating R and C++ using Rcpp

Additional Coding Tricks

Simulation: How to simulate in R and Stata.

Favorite Tricks

Top Ten Questions from New R Users

1.  How do I use probability distributions in R?  Answer from RWiki.

2.  How can I speed up loops in R?  An answer from R Revolutions.

3.  How do I subset data in R?  An answer from the UCLA introductory website.  Also check out these answers about sorting data and merging data on Quick-R.

4.  What is a good package to use to manipulate strings in R?  My favorite package is stringr for working with strings in R.

5.  What are the different types of data structures in R?  Find a great answer here.

6.  How do I customize graphical parameters in R?  Quick-R has a great description of fonts and colors, legends and labeling, and combining plots in R.

7.  How do I generate random numbers in R?  An answer from R Revolutions.

8.  How do I run regressions in R?  Check out this answer from Quick-R.

9.  How do I deal with dates in R?  An answer from the UCLA tutorial on R.

10.  How do I output a table from R to LaTeX?   The library to do this is xtable.  If you want to get fancy integrating R and LaTeX, check out Sweave. Or check out a newer R package, stargazer.

Fun things to do after you learn the basics:

Basic web-scraping in R:  Useful because sometimes you want to get data directly from a table on a website, but there is no way to download the current data into a readable file.  Also check out web scraping using RCurl and readLines.

Tips for plotting maps in R.  And more on this.

TwitteR: scrape twitter using R!

Two ways to do genetic matching in R: This code snippet shows two ways of implementing genetic matching in R. The first method uses the GenMatch() function from the Matching package, and the second uses the MatchIt package. Essentially, this is a template to show how you can set your covariates and extract the treated or control data. Note that with the MatchIt package, you could implement various other matching methods by specifying e.g. "nearest" rather than "genetic."

Script for Reinstalling R Packages: Rich Nielsen's script to reinstall R packages after upgrade of R version. Also see this script for installing multiple R packages simultaneously.

Kountry: An R and Stata package for dealing with annoying country codes

Automatically create model formulas in R

R Benchmark: Package to help benchmark code in R.

Graphics

Basic R graphs: Quick-R, simple graphs with R

R Graph of the Week blog

R Graph Gallery: lots of cool graphs with code!

Even prettier graphics:  brought to you by ggplot2.  Check out the ggplot2 website, ggplot2 examples from the Cookbook for R, as well as this blog which has multiple examples with code.

Getting rid of axis value labels: Don't want things labeled?

Maps/GIS

Resources and links for using geographic information systems (GIS).

Matlab

Matlab-R Dictionary: A list of syntax conversions between Matlab and R.

Starter Tutorials: A collection of basic tutorials on using Matlab.  Here's another good one (with a Belorussian translation just in case).

Python

The best way to learn Python is to work through examples and exercises!

The Python Tutorial: This is the canonical tutorial for Python; however, it will be most useful to those with programming background.

Learn Python the Hard Way: One of the best free textbooks on learning Python.

Thinking like a Computer Scientist: Learning with Python: Fantastic introduction to Python and programming in general.  Exercises help you grasp the fundamentals of object-oriented programming.

Online Python Tutor: Input your code, and this will provide you with a visual representation of how your Python code is working. Great for learning and debugging.

Codecademy Python Tutorial:  Courses in the Python basics.

Python Challenge: A programming riddle that’s a fun way to learn Python.

PyVideo: Video clips of talks on a variety of Python functionality, mostly from PyCon.

Python Snippets: Snippets of code for doing everything from scraping all links from a webpage to stripping html tags to getting IP addresses with Python.
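Two of the tasks mentioned above, collecting every link on a page and stripping HTML tags, can be sketched with the standard library alone. This is a minimal illustration and not the code from the linked snippets. The class name, function name, and sample HTML are hypothetical, and the page is supplied as a string so that the example runs without a network connection.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collect href values from <a> tags and the text found between tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Called for the text between tags. Note that this sketch also keeps
        # the contents of <script> and <style> elements if a page has them.
        self.text_parts.append(data)


def extract_links_and_text(html):
    """Return (list of link targets, tag-free text with whitespace collapsed)."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())
    return parser.links, text


# Hypothetical page content, used only for illustration
SAMPLE_HTML = (
    '<p>See the <a href="https://example.org/data">data page</a> '
    'and <a href="/docs">docs</a>.</p>'
)
links, text = extract_links_and_text(SAMPLE_HTML)
print(links)
print(text)
```

For a live page, the HTML string would come from a download step first. Relative links such as "/docs" would still need to be joined to the page's base address.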

DiffNow: Code comparison tool. DiffNow allows for comparison of snippets of code side-by-side to identify differences. Great for debugging.

Natural Language Toolkit: This online book facilitates learning the basic aspects of Python in a natural language processing context.

Enthought Python distribution: includes a recent stable release of Python along with common packages for scientific computing

Running Python and R Together: How to access R from Python using RPy2.

Rich Nielsen's Web Scraping Example

Presentation on Web Scraping

FAS IT Slides, Basic Scraping: Excellent basic tutorial to scraping in Python. 

FAS IT Slides, Advanced Scraping: Excellent more advanced tutorial to scraping in Python.

FAS IT Slides, Parallelism and Performance:  Introduction to implementing very large jobs.

Justin Grimmer lecture slides and example code: Excellent introduction to the methods behind textual analysis, as well as useful code examples.  Working through all the lecture slides is recommended.

Andy Hall scraping in Python code: Excellent tutorial and code.

Stack Overflow: Essentially all of your Python and scraping questions have already been answered.  Google your question + stackoverflow as a first resource.

oDesk.  Python is complicated.  Some political scientists outsource their coding needs.

Sorting Large Number of Files in Python: Occasionally you have a very large number of files and want to sort them into various folders based on words or phrases in the file. This code snippet shows how to do so with Python. 
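The idea can be sketched as follows. This is an illustration under stated assumptions, not the linked snippet: it assumes plain-text files ending in .txt, a case-insensitive match, and that the first matching keyword decides the destination. The function name, the keyword-to-folder mapping, and the sample files are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def sort_files_by_keyword(source_dir, keyword_to_folder, default_folder="unsorted"):
    """Move each .txt file in source_dir into a subfolder chosen by the first keyword it contains."""
    source = Path(source_dir)
    moved = {}
    # sorted() takes a snapshot of the listing, so moving files does not disturb the loop
    for path in sorted(source.glob("*.txt")):
        contents = path.read_text(encoding="utf-8", errors="ignore").lower()
        target_name = default_folder
        for keyword, folder in keyword_to_folder.items():
            if keyword.lower() in contents:
                target_name = folder
                break
        target_dir = source / target_name
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved[path.name] = target_name
    return moved


# Demonstration on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Debate on the budget bill", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Notes on the election campaign", encoding="utf-8")
    Path(tmp, "c.txt").write_text("Nothing relevant here", encoding="utf-8")
    result = sort_files_by_keyword(tmp, {"budget": "fiscal", "election": "campaigns"})
    print(result)
```

Because files are moved rather than copied, it is worth trying a sketch like this on a copy of the data first.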

Installing Python Modules on RCE: If you're running Python on the RCE, you probably have two problems: the latest version of Python (2.7) is not available, and you don't have administrator access to install modules. Never fear, you can get around these problems by installing Python to your shared home directory and running it from there! This code snippet can be used to install and run any program from your shared home directory.

Have the RCE Send you a Text Message When Your Python Job is Done: When running large jobs that take a few days, it is nice to know when your job is done.  Add this code to the end of your Python script and the RCE will text your cellphone when the job is complete.

Java

Introduction to Java:  The Art and Science of Java by Eric Roberts

SPSS

Specific Topics

Links and resources for specific statistical or data processing topics.

Network Data analysis

Guide to social network analysis in R: from EconometricSense.

Another guide from RDataMining.

A book available online on the theory behind social network analysis.

Panel Data analysis

Physiological Data

Working with physiological data, like output from Mindware? 

Spatial Data Analysis

Spatial econometrics, etc.

Resources

Harvard PhD student Yuri Zhukov's PDF (yuri.pdf) on spatial methods, and his mini-course with examples and code.

RgoogleMaps: An R package to use Google Maps data.

Text analysis

In the past 15 years, social scientists have increasingly analyzed text--written words--using systematic computer-aided technologies. At Harvard, Gary King, Professor of Government, has made a number of innovations. Other faculty doing great work with text data include Arthur Spirling.

IQSS also supports the Program on Text Analysis, which holds conferences on text analysis and purchases text-based databases.


Gathering text

You don't want to copy and paste!

Resources for gathering text in a form that text analysis can then be run on.

Python: Flexible software programming language for doing lots of things, including parsing text. We've collected a whole page on Python.

Processing text

Once you have text ready for analysis, what can you use and do?

Overviews

Natural Language Toolkit: A basic introduction to natural language processing.

Speech and Language Processing, 2nd Edition by Daniel Jurafsky and James H. Martin also provides a comprehensive introduction.

Tools for Text: Materials from an excellent University of Washington text analysis conference in 2010, with videos of speakers and links to important papers.

Brandon Stewart and Justin Grimmer's paper on the promise and pitfalls of automated content analysis for political texts.

Specific Processing Packages and Techniques

ReadMe: Automated content analysis in R.

Wordfish: An R package to extract political positions from text documents and place documents onto a single dimension.

tm: A useful text mining R package.

Opinion Mining and Sentiment Analysis: How does what people write reflect their feelings?

Topic Modelling: Computer scientist David Blei's website, with links to topic modeling software.

lda: An R package to implement latent Dirichlet allocation topic models in R; takes data in the Blei LDA-C format.

JFreq: Easy-to-use standalone software from Will Lowe that batch uploads a folder of text files and creates LDA-C sparse term document matrices or non-sparse csv term document matrices for analysis in lda() or elsewhere.  Java-based, so creates TDMs much faster than R packages.  Also can preprocess and stem documents and provide basic content analysis if Yoshikoder dictionaries are provided.
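To make the idea of a sparse term-document matrix concrete, here is a minimal sketch of how tokenized documents could be written out in an LDA-C style layout. It assumes the layout described in the LDA-C documentation: one line per document, starting with the number of distinct terms, followed by index:count pairs, where each index is an integer position in a vocabulary. This is not JFreq's code, and the function names and toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct term to an integer index, in sorted order."""
    terms = sorted({tok for doc in tokenized_docs for tok in doc})
    return {term: index for index, term in enumerate(terms)}


def to_ldac_lines(tokenized_docs, vocab):
    """Return one sparse line per document: 'M index:count index:count ...'."""
    lines = []
    for doc in tokenized_docs:
        # Count how often each vocabulary index appears in this document
        counts = Counter(vocab[tok] for tok in doc)
        pairs = " ".join(f"{idx}:{n}" for idx, n in sorted(counts.items()))
        # An empty document has no pairs, so strip the trailing space
        lines.append(f"{len(counts)} {pairs}".strip())
    return lines


# Toy documents that have already been split into tokens
tokenized = [["tax", "budget", "tax"], ["vote", "budget"]]
vocab = build_vocabulary(tokenized)
ldac = to_ldac_lines(tokenized, vocab)
print(vocab)
for line in ldac:
    print(line)
```

Only the terms that actually occur in a document are stored, which is what keeps the representation small when the vocabulary is large.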

Yoshikoder: Text analysis with the lowest learning curve.  Standalone program with easy-to-use user interface.  Works on English and non-English languages.  Provides word frequency tables, dictionary based content analysis, and concordance tables.  However, does not preprocess texts by stemming and removing punctuation, numbers, stop words, and the most and least frequent words, which is now considered standard.
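Some of the preprocessing steps named above can be sketched in a few lines of Python before building a word frequency table. The example below lowercases the text, drops punctuation and numbers, and removes stop words. It does not stem words or trim the most and least frequent terms. The stop-word list is a tiny illustrative one, the tokenizer assumes unaccented English text, and all names are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation and numbers, and remove stop words."""
    # Keep only runs of the letters a-z, which discards punctuation and digits.
    # This also splits contractions such as "don't" and drops accented letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


def word_frequencies(documents):
    """Count preprocessed tokens across a list of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts


# Made-up sentences, used only for illustration
docs = [
    "The committee voted on the budget in 2011.",
    "Budget talks stalled, and the committee adjourned.",
]
freq = word_frequencies(docs)
print(freq.most_common())
```

Stemming would normally come from a dedicated library, so it is left out here.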

Computational Resources

The road towards Python and Java ends in computation.  Here are some more advanced resources.

Data Science Toolkit: Computer scientist Pete Warden provides open tools for data.  APIs available for download include methods to translate files to text, street addresses to coordinates, coordinates to political areas, and texts to people, among others.  He also makes it possible to grab the entire site as a self-contained, ready-to-run virtual machine.

Amazon Cloud Computing: The Amazon Elastic Compute Cloud (EC2) is a web service that allows you to rent a server for a given amount of time.  For Harvard affiliates, free alternatives for similar tasks include the Research Computing Environment and the Odyssey cluster.

Machine Learning: Materials from Stanford intro-level course.

Research Computing Environment (RCE)

For Mac

The first step is requesting an RCE account that will allow you access to the RCE server.   Account requests can be sent through the contact us form on the HMDC RCE page. They will send you an email requesting specific details about your project such as faculty adviser and size of data set.


Next, download the NX client, which you will use to access the RCE.  On the NX website, scroll down and click on NX client for Mac OSX.

For PC

The first step is requesting an RCE account that will allow you access to the RCE server.   Account requests can be sent through the contact us form on the HMDC RCE page. They will send you an email requesting specific details about your project such as faculty adviser and size of data set.

Next, download the NX client, which you will use to access the RCE.  On the NX website, scroll down and click on NX client for Windows or NX client for Linux as appropriate.

Editing Files Remotely

Use TRAMP to edit files on a server using local emacs: how to edit a file on a server using emacs without copying it to your hard drive.

Workflow

Copy and paste is your enemy!

Developing a good work flow for research is key. These resources will help you save time and generate reproducible research.

Some other collections of thoughts about workflow

  • Kieran Healy's collection of workflow suggestions, mostly involving LaTeX and R 

Database management

Are you copying and pasting spreadsheets together?

When you try to combine data sets, do you lose observations? Do you even know if you have all of your observations?

Google Refine: Power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases.
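One way to check whether observations go missing when data sets are combined is to keep every key from both sources and record where each row came from, much like the "_merge" variable in Stata. The sketch below does this with plain Python dictionaries. It assumes a one-to-one merge on a single key. The function name, the "_source" field, and the sample values are hypothetical and made up for illustration.

```python
from collections import Counter


def merge_with_report(left_rows, right_rows, key):
    """Full outer join of two lists of dicts on `key`, tagging where each row came from."""
    left_by_key = {row[key]: row for row in left_rows}
    right_by_key = {row[key]: row for row in right_rows}
    # Duplicate keys would silently overwrite each other above, so refuse them
    if len(left_by_key) != len(left_rows) or len(right_by_key) != len(right_rows):
        raise ValueError(f"duplicate values of {key!r}; this sketch assumes a one-to-one merge")

    merged = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        # Combine the fields from whichever sides contain this key
        row = {**left_by_key.get(k, {}), **right_by_key.get(k, {})}
        if k in left_by_key and k in right_by_key:
            row["_source"] = "both"
        elif k in left_by_key:
            row["_source"] = "left_only"
        else:
            row["_source"] = "right_only"
        merged.append(row)
    return merged


# Made-up values, used only for illustration
gdp = [{"country": "FRA", "gdp": 2.8}, {"country": "DEU", "gdp": 3.6}]
pop = [{"country": "FRA", "pop": 65}, {"country": "ITA", "pop": 60}]

merged = merge_with_report(gdp, pop, key="country")
tally = Counter(row["_source"] for row in merged)
print(tally)
```

Printing the tally after every merge makes unmatched observations visible immediately, so they are not discovered later in the analysis.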

Loops are your friend

Using scripts instead of point and click

Web Development

Basic web development:  Lots of tutorials!

Blogroll

Here are some blogs we like with coding tips, book reviews, cool graphs, and methods trends.  Some are even funny.

Social Science Statistics Blog: IQSS blog on social science methods, covering methodological trends, questions and comments, paper and conference announcements, applied problems, and summaries of papers from the Applied Statistics Workshop.

Statistical Modeling, Causal Inference, and Social Science: Andrew Gelman's blog.

The Statistics Forum: Blog of the American Statistical Association and CHANCE magazine.

The Endeavour: John D. Cook's blog.

Simon Jackman's Blog: On politics, statistics, and computing.

Complexity and Social Networks Blog: IQSS site on network analysis and complex systems theory.

Polls and Votes: Charles Franklin's blog on analyzing polls and political data.  He ran the blog under the name Political Arithmetik from 2005-2008, the archives of which are still interesting to check out.

Political Science Methods: A forum for issues facing the methods subfield.

Econometric Sense: A heuristic guide to econometrics, statistics, applied analytics, biometrics, data mining, machine learning, experimental design, and bioinformatics.

RDataMining: Resources on data mining in R.

Hunch: Machine learning theory.

Modern Toolmaking: Practical tools for predictive modeling, data science, machine learning, and web scraping.

'R' You Ready?: Mark Heckmann's blog on R and nice graphics.

AI and Social Science: Brendan O'Connor's blog on artificial intelligence, computation, and statistics.

Statistical Computing Matters: Suggestions and comments about obscure and useful software.

Chris Blattman's Blog: Generally about international development, but also features posts about experiments and methods papers.

Qualitative Methods

Numbers are never the only story. How can you make use of more qualitative types of information?

Q Foundation: Collection of materials on qualitative research from the Harvard Graduate School of Education.