Wednesday, September 23, 2020
Text data have a long history in social science and education research. However, these data are notoriously high-dimensional and characterized by many nuances of language that lack plausible statistical models. As a result, analysis of text data typically involves intensive human coding tasks in which particular constructs or features of the text are first defined, and a collection of documents is then inspected and coded for the presence or absence of these constructs. While this process may be feasible in studies with smaller sample sizes, the time and resources required to train and employ multiple human coders frequently pose a challenge for large-scale efforts. In this talk, I will consider how to reliably and efficiently extract meaningful constructs from text documents in a manner that preserves human judgment, primarily for the purpose of supporting causal inferences in randomized trials where some outcomes of interest are features of text generated by the trial's participants. To illustrate how text data might be leveraged in various inferential settings, both in and out of the causal realm, I will present results from three recent studies in education, medicine, and public health.