Questions about Free Software

Jim Greiner

This past spring at Harvard, a group of students from a variety of academic disciplines agitated for a course in C, C++, and R focusing on implementing iterative statistical algorithms such as EM, Gibbs sampling, and Metropolis-Hastings. The result was an informal summer class sponsored by IQSS and taught by recent Department of Statistics graduate Gopi Goswami. Professor Goswami created (from scratch) class notes, problem sets, and sample programs, and compiled lists of web links and other useful materials. Course participants came from, among other places, Statistics, Biostatistics, Government, Japanese Studies, the Medical School, the Kennedy School, and Health Policy. For those interested in the lecture slides and other materials Professor Goswami compiled, the link is here.

Principal among the subjects taught in the course was how to marry R's data-processing and display capabilities to an iterative inferential engine (try saying that phrase quickly three times) such as an EM or a Gibbs sampler, with the latter written in C or C++ so as to increase (vastly) the speed of runs. In other words, we learned how to have R do the front end (data manipulation, data formatting) and back end (analysis of results, graphics) of an analysis while letting a faster language do the hard work in the middle.

The course both demonstrates and facilitates a growing trend in the quantitative social sciences toward making open-source software stemming from scholarly publications freely available to the academic community. Two examples from the ever-expanding field of ecological inference are Gary King's EI program, based on a truncated bivariate normal model and implemented in GAUSS, and Kosuke Imai and Ying Lu's Dirichlet-process-based model, implemented with an R-C interface.

The trend toward freely available, model-specific software has obvious potential upsides. Previously written code can save time for a user interested in applying the model. Moreover, if the code is used often enough and bugs are reported and fixed, the software may become better than what a potential user could write on his or her own. After all, few of us interested in answers to real-world issues want to spend the rest of our lives coding in C.

Nevertheless, I confess to a certain amount of apprehension. For me at least, freely available, model-specific software provides a temptation to use models I do not fully understand. Relatedly, I often think that I do understand a model fully, that I grasp all of its strengths and weaknesses, only to discover otherwise when I sit down to program it. Finally, oversight, hubris, or a desire to make accompanying documentation readable may cause the author of the software not to describe fully the details of implementation or the compromises made therein. Thus, while I am excited by the possibilities freely available social science software holds, I worry about the potential for misuse as well.

Posted by James Greiner at December 2, 2005 6:00 AM