Summary

The statistical analysis of text is a requirement in almost all academic disciplines: medical researchers need smart search algorithms to find reported links between genes; political scientists use parliamentary speeches to assess the ideological positions of representatives; historians wish to know who wrote the Federalist Papers of disputed authorship; linguists seek automatic translation algorithms for non-English texts. Though text analysis is not new, the advent of the internet has made an almost incomprehensible quantity of information available for research: approximately seven million new web pages go online every day. Methods that deal with written data thus represent the future of academic research.

With this in mind, we propose the Harvard University Program on Text Research (PTR). This will be a scientific interdisciplinary endeavor under IQSS’s purview that fosters research, cooperation and education in statistical text analysis for scholars within the University and the broader academic community. The PTR mission is to become a world leader in the creation, preservation and dissemination of text analysis knowledge. Specific plans include:

  • Teaching and Consulting: PTR will house a range of text method expertise from several disciplines and will offer classes and courses for undergraduates, graduate students and faculty researchers. The PTR preceptor will be the designated source of support for our teaching efforts. Topics for courses will include the efficient gathering of textual data for projects, the use of content analysis software and the statistical underpinnings of work in this area.
  • Conference: our fundamental model is the two-day IQSS text conference in 2009 that brought together speakers and researchers from a multitude of disciplines and departments in both this university and others. This year’s conference takes place on May 21–22.
  • Website: PTR will be served by a comprehensive website that reports information about the program and its personnel and ‘sells’ the statistical analysis of text to the outside world. It will provide significant resources to scholars:
    • Repository of tools: PTR will design and implement new software tools for the analysis of text, which will be integrated into current operations in environments such as Python and R. This software will be free and flexible: there will be no charge to the research community for its download and use, and all source code will be available.
    • Repository of data: ‘plain text’ versions of documents are typically the inputs to text methods. In cooperation with Dataverse, PTR will make thousands of texts available to researchers, and thus to statistical analysis, at the click of a button: Plato’s Republic, the Bible, the entire works of Shakespeare, the Adams-Jefferson letters, the Gettysburg Address, the Brazilian constitution and the world’s trade treaties are just a handful of examples that will be placed in one convenient location.
    • Repository of developments in text: the PTR site and mailing list will allow scholars to post messages, papers and announcements of interest to fellow researchers, thus ensuring that PTR remains at the center of ‘state-of-the-art’ developments in this statistical field.
  • Synergy: PTR seeks to inform and influence, and be informed and influenced by, scholars working with text who are essentially unaware, or perhaps wary, of the use of statistical approaches. The focus here is on qualitative work in fields such as history, English and modern languages, anthropology, theology, sociology and cultural studies. The goal is not imperialism, but the contrary: a genuine desire to bring experts in text together, allowing us to design new approaches and methods and to share exciting developments that have the potential to unite many disparate disciplines that have more in common than is perhaps currently apparent. Part of our efforts here will be to actively seek out those working on text in the University and meet with them to investigate their current practice and needs. We will demonstrate the potential of more systematic and automated approaches by personally inviting these colleagues to introductory presentations and feedback sessions at IQSS.