Abstract: Big data generated from the internet has great potential for tracking and predicting large-scale social phenomena, in particular the spread of infectious diseases, where accurate real-time prediction could help public health officials make timely decisions to save lives. We introduce ARGO (AutoRegression with GOogle search data / AutoRegression with General Online data), a model that uses publicly available Google search data, with or without cloud-based electronic health records, to estimate current and near-future activity levels of influenza-like illness and/or dengue fever for the United States and five other countries. Our regularized multivariate regression model dynamically selects the most appropriate variables for prediction every week, and significantly outperforms all previous internet-based tracking models, including Google Flu Trends and Google Dengue Trends. We further extend the model to multiple geographical resolutions, tracking infectious disease not only at the national level but also at the regional level, with spatial-temporal information pooling, making it flexible, self-correcting, robust and scalable.
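The weekly variable selection described in the abstract can be illustrated with a minimal sketch. It is not the ARGO implementation: the abstract does not state the penalty, window length, or number of lags, so the sketch assumes an L1 (lasso) penalty, which is one common way for a regularized regression to drop variables, and refits it each week on a rolling window of lagged activity values plus same-week search volumes. The names `soft_threshold`, `lasso_fit` and `argo_style_nowcast`, the default parameters, and the synthetic data are all hypothetical.

```python
import math
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1.

    X is a list of rows, y a list of targets. The intercept b0 is unpenalized.
    """
    n, p = len(X), len(X[0])
    b0, beta = sum(y) / n, [0.0] * p
    resid = [yi - b0 for yi in y]  # residuals y - b0 - X beta
    for _ in range(n_iter):
        shift = sum(resid) / n  # re-center the intercept
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return b0, beta


def argo_style_nowcast(y, search, n_lags=2, window=30, lam=0.05):
    """Refit the penalized regression every week on the most recent `window` weeks.

    y[t] is the activity level in week t; search[t] is a list of search-term
    volumes for week t. Returns the estimate for each week and the indices of
    the features whose coefficients were nonzero that week.
    """
    def features(s):
        # Lagged activity (autoregressive part) followed by same-week search data.
        return [y[s - k] for k in range(1, n_lags + 1)] + list(search[s])

    estimates, selected = {}, {}
    for t in range(n_lags + window, len(y)):
        rows = range(t - window, t)  # training weeks strictly before week t
        b0, beta = lasso_fit([features(s) for s in rows], [y[s] for s in rows], lam)
        estimates[t] = b0 + sum(b * x for b, x in zip(beta, features(t)))
        selected[t] = [j for j, b in enumerate(beta) if b != 0.0]
    return estimates, selected


if __name__ == "__main__":
    random.seed(0)
    n_weeks = 60
    # Synthetic activity curve (not real surveillance data).
    y = [math.sin(2 * math.pi * t / 26) + random.gauss(0, 0.1) for t in range(n_weeks)]
    # Column 0 tracks the activity; columns 1 and 2 are irrelevant noise.
    search = [[y[t] + random.gauss(0, 0.2), random.gauss(0, 1), random.gauss(0, 1)]
              for t in range(n_weeks)]
    est, sel = argo_style_nowcast(y, search)
    for t in sorted(est)[-3:]:
        print(t, round(est[t], 3), round(y[t], 3), sel[t])
```

Because the model is refit every week, the set of features with nonzero coefficients (`selected[t]`) can change from week to week, which is the sense in which variables are selected dynamically in this sketch.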
Shihao Yang is a PhD candidate in the Statistics Department at Harvard University.