International conference on the cyberinfrastructure for historical China studies

March 14-16, 2018, Harvard China Center, Shanghai
Organizers: Peter Bol (China Biographical Database) and Donald Sturgeon (ctext.org)

The 2005 ACLS report “Our Cultural Commonwealth” called for creating a cyberinfrastructure for the social sciences and humanities, similar to that which has been successfully developed for the sciences. A cyberinfrastructure is the system of connections between the layer of base technologies (computation, storage, communication) and the layer of software, services, instruments, information and social practices applicable to specific projects and disciplines. One might think of the cyberinfrastructure as the network of software, data collections, personnel, best practices and standards independent of specific projects and disciplines, which facilitates the implementation of specific projects on general purpose base technologies.

The humanities and the less quantitative social sciences differ from the sciences in that they are necessarily embedded in language, and this creates challenges specific to cyberinfrastructure for China studies. Consider for example the increasing popularity of structured topic modeling, used to create topical surveys of large text corpora. Common methodologies assume that texts contain words and that words are divided by white spaces (as in this sentence). Word division is a major challenge in mining a language such as Chinese, which does not separate words with white spaces and in which, for historical texts, phrasemes are a more appropriate category than words. A cyberinfrastructure for China studies must take into account the language in which texts were written. It must also deal with two further impediments to communication. First, digital resources such as text databases are dispersed among many institutions and companies. Second, utilities such as dictionaries that facilitate the online analysis of digital materials are often unique to or embedded in a particular resource.
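The word-division problem described above can be illustrated with a minimal sketch. It assumes nothing about any particular toolkit: the function names and the sample strings are illustrative only, and overlapping character n-grams are shown merely as one simple fallback, not as a recommended segmentation method for historical Chinese.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on white space, as many text-mining pipelines do by default."""
    return text.split()


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams, ignoring any white space.

    This is a crude stand-in for real segmentation: it makes no attempt
    to decide where words or phrasemes actually begin and end.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# Illustrative inputs (assumptions, not drawn from any specific corpus).
english = "words are divided by white spaces"
classical = "學而時習之"

english_tokens = whitespace_tokens(english)      # one token per word
classical_tokens = whitespace_tokens(classical)  # the whole string as one token
classical_bigrams = char_ngrams(classical, 2)    # overlapping two-character units
```

With the English sentence the white-space split yields one token per word, whereas the unpunctuated Chinese string comes back as a single undivided token; the bigram fallback yields candidate units but cannot tell which of them are meaningful.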

Digital China studies have developed through the creation of independent utilities, of which the most prevalent form is the searchable text database. Here there has been tremendous progress over the last twenty years. Beginning with the ever-growing Scripta Sinica from Academia Sinica (currently over 600 million characters in diverse collections), the number of searchable text databases from public and private vendors has steadily increased (see below). The largest is Donald Sturgeon’s ctext.org (Chinese Text Project), with a corpus of over 4 billion characters and 20,000-25,000 unique daily visitors. At the same time, the interest in “digital humanities” has resulted in an increasing interest in computational utilities for analyzing data derived from texts, such as software for social network analysis; geospatial analysis and online cartography; textual markup, mining and topic modeling; and relational and object-oriented databases. Software that ten years ago was regarded as too difficult for the untrained student to use is now becoming commonplace.

The goal of creating a cyberinfrastructure for historical China studies cannot be accomplished by combining all searchable text corpora into a single giant repository, because the majority of databases are proprietary and access is subscription based. There has been some progress in metadata searching, so that the catalog of a library with many subscriptions can report on accessible digitized texts, but to date this does not include searching content across collections. Attempts have been made to implement federated search for digitized Chinese-language materials, but their utility has so far been hampered by a lack of metadata standardization.

However, with the greater use of Application Programming Interfaces (APIs) it has become possible to create links between online databases and online text programs so that the functionalities of databases devoted to particular topics (places, people, government offices, religious sites) can be brought to bear on searchable text programs. An early example of this was the API created by the China Historical GIS project. A text database can be programmed to use the CHGIS API, so that on encountering a place name users can automatically call up CHGIS data on that place and see its location on a map. A more elaborate example is the China Biographical Database (CBDB) API, which allows users to call up numerous categories of information about a person-name in a text. The MARKUS system is to date the most sophisticated example: it draws on numerous online databases to facilitate the marking up of Chinese texts and the extraction of tagged data for further research. The fact that MARKUS can ingest textual data from ctext.org in real time provides an example of what other systems can accomplish. In our view, making it possible for public and proprietary text databases to use such APIs to annotate their contents will greatly enhance their usefulness to many different research communities. The same functionality is now being developed for image collections – including art, maps, and scans of texts – that adopt the IIIF standard. Mirador, developed at Harvard and Stanford, is a utility that allows the user to create individual collections from disparate sources. Ctext.org is a model for how online text databases can make full use of APIs. This approach allows the creation of a cyberinfrastructure while recognizing the institutionally dispersed and disparate nature of the digital resources today.
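The pattern described above, in which a text program consults a topical database whenever it encounters a place name, might be sketched as follows. This is an illustration under stated assumptions and does not reproduce the actual CHGIS, CBDB or MARKUS interfaces: `SAMPLE_GAZETTEER`, `local_lookup` and `annotate_places` are hypothetical names, and the in-memory table merely stands in for a call to a remote gazetteer service.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]

# Hypothetical two-entry table standing in for a remote gazetteer API.
# A real system would send each candidate name to the service instead.
SAMPLE_GAZETTEER: Dict[str, Record] = {
    "臨安": {"modern_name": "Hangzhou"},
    "汴京": {"modern_name": "Kaifeng"},
}


def local_lookup(name: str) -> Optional[Record]:
    """Return the record for a place name, or None if it is unknown."""
    return SAMPLE_GAZETTEER.get(name)


def annotate_places(
    text: str,
    lookup: Callable[[str], Optional[Record]],
    max_len: int = 4,
) -> List[Tuple[int, str, Record]]:
    """Scan the text left to right and collect (offset, name, record) hits.

    At each position the longest candidate string is tried first, so a
    longer place name wins over a shorter one that it contains.
    """
    hits: List[Tuple[int, str, Record]] = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            record = lookup(candidate)
            if record is not None:
                hits.append((i, candidate, record))
                i += n  # continue scanning after the matched name
                break
        else:
            i += 1  # nothing matched at this position
    return hits


hits = annotate_places("自汴京至臨安", local_lookup)
```

Because the lookup is passed in as a function, the same scanning code could in principle be pointed at different services (places, persons, offices) without change, which is the kind of loose coupling between dispersed resources that the paragraph above envisions.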

We bring research centers, libraries and creators of public and private text databases together with the scholars and programmers who are creating online utilities and APIs, in order to explore this first level of a cyberinfrastructure for China studies.

The challenge of this endeavor is to show proprietary database providers how the value of their text databases, often containing thousands of titles, will be increased by participation in a cyberinfrastructure that exposes their metadata to others and facilitates communication between projects. We are now in the process of inviting participating foundations, centers, projects, databases, and libraries.

Funded by
The Harvard China Fund and the Chiang Ching-kuo Foundation
Chaoxing Group