Plan for a Conference on Cyberinfrastructure

 

The 2005 ACLS report “Our Cultural Commonwealth”[1] called for creating a cyberinfrastructure for the social sciences and humanities, similar to that which has been successfully developed for the sciences.[2] A cyberinfrastructure is the system of connections between the layer of base technologies (computation, storage, communication) and the layer of software, services, instruments, information and social practices applicable to specific projects and disciplines. One might think of the cyberinfrastructure as the network of software, data collections, personnel, best practices and standards independent of specific projects and disciplines, which facilitates the implementation of specific projects on general purpose base technologies.

 

The humanities and the less quantitative social sciences differ from the sciences in that they are necessarily embedded in language, and this creates challenges specific to cyberinfrastructure for China studies.  Consider for example the increasing popularity of structured topic modeling, used to create topical surveys of large text corpora. Common methodologies assume that texts contain words and that words are divided by white spaces (as in this sentence). Word division is a major challenge in mining a language such as Chinese which does not separate words with white spaces and, in the case of historical texts, where phrasemes are a more appropriate category than words. A cyberinfrastructure for China studies must take into account the language in which texts were written. It must also deal with two further impediments to communication. First, digital resources such as text databases are dispersed among many institutions and companies. Second, utilities such as dictionaries to facilitate the online analysis of digital materials are often unique to or embedded in a particular resource.
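To make the word-division problem concrete, the following is a minimal sketch, not a method proposed by the authors. It contrasts whitespace tokenization with overlapping character n-grams, one common fallback when no word segmenter is available. The function names and the sample line are illustrative assumptions, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Runs of characters in the basic CJK Unified Ideographs block (an assumption:
# rarer characters in the extension blocks are ignored in this sketch).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def whitespace_tokens(text):
    """Tokenize the way many topic-modeling pipelines do: split on white space."""
    return text.split()


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, never crossing punctuation."""
    grams = []
    for run in CJK_RUN.findall(text):
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


sample = "學而時習之,不亦說乎"
print(whitespace_tokens(sample))  # the whole line comes back as one "word"
print(char_ngrams(sample))        # candidate two-character units
```

Character n-grams over-generate (most bigrams are not meaningful units), so a real pipeline would still need dictionaries or a trained segmenter to filter them.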

 

Digital China studies have developed through the creation of independent utilities, of which the most prevalent form is the searchable text database. Here there has been tremendous progress over the last twenty years. Beginning with the ever-growing Scripta Sinica from Academia Sinica (currently over 600 million characters in diverse collections), the number of searchable text databases from public and private vendors has steadily increased (see below). The largest is Donald Sturgeon’s ctext.org (Chinese Text Project), with a corpus of over 4 billion characters and 20,000-25,000 unique daily visitors. At the same time, the interest in “digital humanities” has brought growing interest in computational utilities for analyzing data derived from texts, such as software for social network analysis; geospatial analysis and online cartography; textual markup, mining and topic modeling; and relational and object-oriented databases. Software that ten years ago was regarded as too difficult for the untrained student to use is now becoming commonplace.

 

The goal of creating a cyberinfrastructure for historical China studies cannot be accomplished by combining all searchable text corpora into a single giant repository because the majority of databases are proprietary and access is subscription based. There has been some progress in metadata searching, so that the catalog of a library with many subscriptions can report on accessible digitized texts, but to date this does not include searching content across collections. Attempts have been made to implement federated search for digitized Chinese-language materials,[3] but their utility has so far been hampered by a lack of metadata standardization.
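The metadata problem can be illustrated with a small sketch. It assumes two hypothetical catalogs ("catalog_a" and "catalog_b") that describe the same kinds of titles under different field names; the field maps, the records and the function names are all invented for illustration and do not describe any existing system. Records are first mapped onto one shared schema, and only then searched together.

```python
# Hypothetical per-catalog field maps: local field name -> shared field name.
FIELD_MAPS = {
    "catalog_a": {"title": "title", "creator": "author", "date": "dynasty"},
    "catalog_b": {"題名": "title", "著者": "author", "朝代": "dynasty"},
}


def normalize(record, source):
    """Map one catalog record onto the shared schema, trimming stray spaces."""
    mapping = FIELD_MAPS[source]
    normalized = {"source": source}
    for field, value in record.items():
        if field in mapping:
            normalized[mapping[field]] = value.strip()
    return normalized


def federated_search(query, catalogs):
    """Search the normalized titles of every catalog for a substring."""
    hits = []
    for source, records in catalogs.items():
        for record in records:
            normalized = normalize(record, source)
            if query in normalized.get("title", ""):
                hits.append(normalized)
    return hits


# Illustrative sample data, not drawn from any real catalog.
catalogs = {
    "catalog_a": [{"title": "資治通鑑", "creator": "司馬光", "date": "宋"}],
    "catalog_b": [
        {"題名": " 資治通鑑 ", "著者": "司馬光", "朝代": "宋"},
        {"題名": "史記", "著者": "司馬遷", "朝代": "漢"},
    ],
}
for hit in federated_search("通鑑", catalogs):
    print(hit["source"], hit["title"], hit["author"])
```

Without the shared field map, the second catalog's records would never match a search on "title", which is the practical cost of unstandardized metadata.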

 

However, with the greater use of Application Programming Interfaces (APIs) it has become possible to create links between online databases and online text programs, so that the functionalities of databases devoted to particular topics (places, people, government offices, religious sites) can be brought to bear on searchable text programs. An early example of this was the API created by the China Historical GIS project. A text database can be programmed to use the CHGIS API, so that on encountering a place name users can automatically call up CHGIS data on that place and see its location on a map. A more elaborate example is the China Biographical Database (CBDB) API, which allows users to call up numerous categories of information about a person name in a text. The MARKUS system is to date the most sophisticated example: it draws on numerous online databases to facilitate the marking up of Chinese texts and the extraction of tagged data for further research. The fact that MARKUS can ingest textual data from ctext.org in real time provides an example of what other systems can accomplish. In our view, making it possible for public and proprietary text databases to use such APIs to annotate their contents will greatly enhance their usefulness to many different research communities. The same functionality is now being developed for image collections (including art, maps, and scans of texts) that adopt the IIIF standard. Mirador, developed at Harvard and Stanford, is a utility that allows the user to create individual collections from disparate sources. Ctext.org is a model for how online text databases can make full use of APIs. This approach allows the creation of a cyberinfrastructure while recognizing the institutionally dispersed and disparate nature of digital resources today.
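The pattern described above, in which a text program hands each candidate name to an external database and attaches whatever record comes back, can be sketched as follows. This is an illustration under stated assumptions, not the actual CHGIS, CBDB or MARKUS interface: the lookup is an in-memory stand-in for a remote API call, and the gazetteer records, identifiers, function names and sample sentence are all hypothetical.

```python
# In-memory stand-in for a remote gazetteer API (hypothetical records).
SAMPLE_GAZETTEER = {
    "杭州": {"id": "sample-1", "kind": "place"},
    "蘇州": {"id": "sample-2", "kind": "place"},
}


def lookup_stub(name):
    """Pretend to query a gazetteer; return a record or None."""
    return SAMPLE_GAZETTEER.get(name)


def annotate(text, lookup, max_len=4):
    """Tag names in `text` by longest match against `lookup`.

    `lookup` is any callable taking a string and returning a record or None,
    so the stub can later be swapped for a function that calls a real API.
    """
    annotations = []
    i = 0
    while i < len(text):
        found = None
        # Try the longest candidate first; names shorter than 2 are skipped.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            record = lookup(candidate)
            if record is not None:
                found = {"start": i, "name": candidate, "record": record}
                break
        if found:
            annotations.append(found)
            i += len(found["name"])
        else:
            i += 1
    return annotations


# Made-up sample sentence for demonstration only.
for tag in annotate("蘇軾知杭州,後過蘇州", lookup_stub):
    print(tag["start"], tag["name"], tag["record"]["id"])
```

Passing the lookup in as a parameter keeps the text program independent of any one database. Against a real service one would cache results and batch requests rather than query every substring.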

 

We thus propose to bring research centers, libraries and public/private text database creators together with scholars and programmers who are creating online utilities and APIs to explore this first level of a cyberinfrastructure for China studies.

 

The challenge of this endeavor is to show proprietary database providers how the value of their text databases, often containing thousands of premodern titles, will be increased by participation in a cyberinfrastructure that exposes their metadata to others and facilitates communication between projects. In support of this, the following organizations have agreed to participate. More will be invited later.

 

Funded by

The Harvard China Fund and the Chiang Ching-kuo Foundation

 

Participating Foundations, Centers, Projects and Libraries

Canada: McGill University, Ming Qing Women’s Writings Project
China: National Library of China, 国家图书馆出版社
China: Peking University, Institute of Humanities and Social Sciences
China: Fudan University Library
China: Nanjing University Library
China: Peking University History of Ancient China Research Center
China: Peking University Library
China: Shanghai Library
China: Sun Yat-sen University Library
China: Zhejiang University Library
China: Chaoxing.com & Duxiu
China: Gujilianhe.com.cn 中華經典古籍庫
China: Souyun
China: Shuge Library
China: Yuelu Academy 岳麓书院
Europe: Leiden University, Centre for Digital Humanities
Europe: Max Planck Institute for the History of Science
Europe: British Library
Europe: National Library Berlin
Foundation (observer): American Council of Learned Societies
Foundation (observer): Henry Luce Foundation
Foundation (funder): CCK Foundation
Hong Kong: Chinese University of Hong Kong, Institute of Chinese Studies
Hong Kong: Chinese University of Hong Kong, University Library
Hong Kong: Hong Kong University of Science and Technology
Japan: Kyoto University Center for Informatics in East Asian Studies at the Institute for Research in Humanities
Japan: CJKV-English Dictionary
Japan: *Kanseki Repository 漢リポ
Japan: Center for Open Data in the Humanities in National Institute of Informatics
Japan: Kansai University
Japan: University of Tokyo
Korea: Yonsei Institute for Sinology (Yonsei University)
Taiwan: Academia Sinica, Institute of History and Philology
Taiwan: National Taiwan University, Center for Computer Science and Digital Humanities
Taiwan: Academia Sinica Center for Digital Cultures
Taiwan: Academia Sinica, Institute of Taiwan History
Taiwan: Center for GIS, RCHSS, Academia Sinica
Taiwan: National Chengchi University, MOST (科技部) Digital Humanities Project
Taiwan: National Normal University, MOST (科技部) Taiwan Biographical Database Project
Taiwan: Central Library
Taiwan: Dharma Drum Buddhist College, Taibei/CBETA
US: Fairbank Center for Chinese Studies, Harvard University
US: Stanford University, digital humanities asia@stanford
US: Yale University, Ten Thousand Rooms Project
US: Council on East Asian Libraries
US: Harvard-Yenching Library
US: Princeton University Library
US: The University of Chicago Library
US: East Asia Library at Stanford University
US: Library of Congress
US: Utah Genealogical Society, FamilySearch
US: Area Studies Collections at Penn
US: www.pleco.com
US: www.wenlin.com
US: Temple University
US: EastView.com
China: China National Knowledge Infrastructure CNKI 中國知網
China: Chinese Local Historical Sources Database 中國地方歷史文獻數據庫
China: Unihan Company 書同文, Beijing
China: South-Central University for Nationalities, School of Arts and Communication
China: Guoxue dashi 國學大師

 

We are still waiting for responses from several vendors.

 

We will hold this conference at the Harvard Shanghai Center in March 2018. A paper detailing the proposed infrastructure was presented at the 7th International Conference of Digital Archives and Digital Humanities in Taibei in December 2016. An outline is appended.

APPENDIX 

 

“A Cyberinfrastructure for Historical China Studies”

Hongsu Wang, Lik Hang Tsui, Peter K. Bol

 

for the 7th International Conference of Digital Archives and Digital Humanities, Taiwan

 

Abstract

The proliferation of databases for the study of Chinese history and the increasing numbers of researchers taking part in their development call for a cyberinfrastructure. A cyberinfrastructure can be conceived as a network of discipline-specific software applications and data collections, and also of the personnel and the set of best practices, standards, and collaborative methods they establish. This paper discusses how participants in such a cyberinfrastructure for historical China studies can share their resources and how their communication can be facilitated by various technologies and mechanisms.

 

Keywords: Cyberinfrastructure, Chinese history, digital humanities

 

 

  1. Introduction: Building a Cyberinfrastructure
  2. Ways of Sharing Resources
    1. Sharing with APIs
      • Exporting data; Sharing tools; Creating links between databases
    2. The Sharing of Files
      • Sustainability; Version control; Description of files
  3. Authorization to Share
    1. Types of Authorization
    2. Types of Data Owners
      • Libraries; Open access, non-commercial databases; Commercial databases; Closed databases
  4. Digital Tools for Inter-project Collaboration
    1. Cross-catalogue Search of Premodern Chinese Titles
    2. OCR Technologies and the Sharing of Chinese Textual Resources
      • Practices of CTEXT and CBDB
    3. Textual Markup and Visualization
      • MARKUS
    4. Code Tables
    5. API and Data Sharing Tools
  5. Communication between Members of Cyberinfrastructure
    1. Internet Communication Channels
    2. Facilitating Regular Interactions
      • Initial conference; Regular meetings; Topical discussions; Interviews; Specific Collaborations

 

 

A Cyberinfrastructure for the Study of Chinese History: A Proposal to Convene a Conference

 

Proposers: Peter K. Bol 包弼德 (China Biographical Database) and Donald Sturgeon 德龍 (Chinese Text Project)

 

 

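The American Council of Learned Societies (ACLS) report “Our Cultural Commonwealth,” published in 2005, proposed that the humanities and social sciences should have a cyberinfrastructure of their own, as the natural sciences do.[4] A cyberinfrastructure sits between the base technologies and the specific technologies applied to particular research projects, disciplines and practices.[5] Its distinctive role is to connect the software, datasets, personnel, practices, standards and modes of collaboration that are useful to a discipline, so that different projects can draw on shared technologies.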

 

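Compared with the natural sciences, the humanities and some social sciences (especially fields in which quantitative methods are less prominent) are deeply immersed in language and strongly shaped by its characteristics. Take topic modeling as an example: applying it to Chinese literary and historical research presents considerable challenges. The method ordinarily assumes that the spaces between words mark word boundaries, but that criterion cannot be applied to Chinese texts. For classical Chinese, “phrasemes” (片語) may describe the content more aptly than “words” (詞). Building a cyberinfrastructure must therefore take into account the characteristics of the language of the texts it handles. The work also faces two further challenges. First, full-text databases are scattered across many institutions and companies, which makes communication between them difficult. Second, tools for analyzing online materials are often developed for one particular body of material, or even belong to one particular system, and are not easily put to wider use.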

 

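In digital China studies, independent tools abound, above all searchable full-text databases. The past twenty years have seen enormous progress here. Beginning with Academia Sinica's continually expanding Scripta Sinica (漢籍電子文獻資料庫, currently more than 600 million characters), the number of public and private databases has grown substantially. The Chinese Text Project now contains more than 5 billion characters of text, and the site receives roughly 20,000-25,000 visitors per day. At the same time, interest in the digital humanities has grown, and the associated digital analysis tools have matured: software for social network analysis; geospatial analysis tools and online maps; tools for textual markup, mining and topic modeling; and relational and object-oriented databases. Software regarded ten years ago as too expensive has become commonplace today.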

 

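A cyberinfrastructure cannot be built by simply merging all textual materials. Many databases are operated by commercial companies, and users can obtain the data only by subscription. Recently, metadata search in some databases has improved, so that users can learn from the catalogs of multiple libraries which texts exist in electronic form. As yet, however, no one has developed a tool that searches the content of several databases at once. Some colleagues are working on federated search for Chinese electronic materials,[6] but the main obstacle to this work is inconsistency in metadata formats.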

 

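However, the spread of application programming interfaces (APIs) has made it easier to connect online databases with online text tools, so that databases devoted to particular topics (such as places, groups of people, official posts, and religious sites) can be put to better use. One example is the API of the China Historical GIS (CHGIS). Any text tool can use this API to retrieve the relevant CHGIS data automatically whenever it encounters a place name and to mark the place on a map. A more elaborate example is the API of the China Biographical Database (CBDB), which lets users call up many kinds of information about a person, such as native place, offices held, and kin. MARKUS (瑪庫斯) represents the most mature use of APIs: it retrieves data from various online databases to help users mark up Chinese texts, and lets them extract the tagged data from the texts for research. MARKUS can even fetch texts directly from the Chinese Text Project. In our view, therefore, if the major public and private full-text databases all allowed the use of APIs as a supplement to the texts themselves, their usefulness would be greatly enhanced, to the great benefit of every user community. In fact, tools already exist for integrating image materials that conform to the international IIIF standard, including paintings, maps, and scanned images of books. Mirador, developed at Harvard and Stanford, is one such tool; it lets users build personal collections from data drawn from various sources. The Chinese Text Project, for its part, shows how a database consisting mainly of texts can benefit from APIs. We must of course acknowledge that today's digital resources are highly varied and differ in their degree of openness, but building a cyberinfrastructure remains both necessary and feasible.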

 

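We therefore hope to bring together the major research centers, libraries, and the owners of public and private full-text databases to take part in this discussion, and to gather the developers of online tools and APIs together with interested scholars, in order to begin discussing the question of a cyberinfrastructure.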

 

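The main challenge in this work is to persuade commercial database providers by showing them how their full-text databases can benefit from the sharing of metadata and the interaction between projects that a cyberinfrastructure brings. A good number of institutions have already agreed to take part in this proposal, and more projects will be invited. See below.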

 

 

Funders

Harvard China Fund

Chiang Ching-kuo Foundation for International Scholarly Exchange

 

Institutions that have agreed to participate (foundations, research centers, projects, libraries, etc.)

中央研究院歷史語言研究所

美國學術團體協會

香港中文大學中國文化研究所

哈佛大學費正清中國研究中心

香港科技大學

京都大學東亞人文情報學研究中心

萊頓大學數位人文研究中心

馬克斯-普朗克科學史研究所

麥吉爾大學「明清婦女著作」數字文獻數據庫

臺灣大學數位人文研究中心

北京大學人文社會科學研究院

斯坦福大學digital humanities asia@stanford

耶魯大學「廣廈千萬間項目」

美國東亞圖書館理事會

復旦大學圖書館

哈佛燕京圖書館

北京大學圖書館

普林斯頓大學葛思德東亞圖書館

上海圖書館

浙江大學圖書館

中山大學圖書館

 

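Full-text database projects to be invited (an asterisk marks open-source projects)

Listed below are the main databases offering premodern Chinese texts, most of them from China. Some provide scanned images of the documents in addition to text. The commercial databases among them generally rely on a subscription business model. Although the holdings of many databases keep growing, their interfaces and functions have often not kept pace with technological developments such as APIs and linked data. Building a cyberinfrastructure will do much to advance them in this respect.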

 

*中華電子佛典協會

*法鼓文理學院

*漢リポ 

*書格數字圖書館

*猶他家譜學會 FamilySearch

北京方正阿帕比技术有限公司

中國地方歷史文獻數據庫

中國知網

國家圖書館

CJKV-English Dictionary

鼎秀古籍全文检索平台

讀秀中文學術搜索

北京愛如生數位化技術研究中心

國學寶典資料庫

瀚堂典藏資料庫系統

北京書同文數字化技術有限公司

Pleco

文林研究所

中華書局中華經典古籍庫

 

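We plan to hold this conference at the Harvard Shanghai Center during 2017-18. For the thinking behind this cyberinfrastructure, see the paper we submitted to the 7th International Conference of Digital Archives and Digital Humanities in Taiwan. An outline of the paper is attached here.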

 

 

Appendix:

 

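A Cyberinfrastructure in the Service of the Study of Chinese History

Hongsu Wang 王宏甦, Lik Hang Tsui 徐力恆, Peter K. Bol 包弼德

 

Paper for the 7th International Conference of Digital Archives and Digital Humanities, December 2016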

 

Abstract

 

 

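The sharp growth in the number of databases and research projects, and in the number of people engaged in digital research on Chinese literature and history, has made it necessary to build a corresponding cyberinfrastructure for the study of Chinese history. The role of a cyberinfrastructure is to connect the software, datasets, personnel, practices, standards and modes of collaboration useful to a discipline, and so to advance research. This paper discusses why a cyberinfrastructure for the study of Chinese history should be built, and how that goal can be achieved through two means: the sharing of resources and communication among members.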

 

Keywords: cyberinfrastructure, Chinese history, digital humanities

 

 

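1. Introduction: Realizing a Cyberinfrastructure for Chinese Digital Humanities

 

2. Ways of Sharing Resources

a. Sharing through APIs

  • Exporting shared data; sharing the functions of online tools; links between data

b. Sharing of files: persistence of files; version management of files; description of files

 

3. Permission to Share

a. Forms of sharing authorization

b. Data holders granting authorization:

  • Libraries; open non-profit databases; commercial databases; non-public databases

 

4. Cross-project Digital Tools

a. Cross-database catalogue search system

b. OCR technology and the opening up of Chinese textual resources

  • For example, the working practices of CBDB and CTEXT

c. Markup and visualization tools

  • For example, MARKUS

d. Code tables

e. APIs and data-sharing tools

 

5. Communication among Members

a. Internet-based channels of communication

b. Regular communication among members

  • General conference; regular meetings; topical discussions; interviews; project collaborations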
 

[2] 2003 NSF Report “Revolutionizing Science and Engineering through Cyberinfrastructure”, https://www.nsf.gov/cise/sci/reports/atkins.pdf

[3] See, for example, the German initiative: CrossAsia search http://crossasia.org/en.html. CBDB has developed a prototype for a database that can include the items in all online text databases.

[5] See the 2003 NSF report “Revolutionizing Science and Engineering through Cyberinfrastructure”: https://www.nsf.gov/cise/sci/reports/atkins.pdf

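[6] See, for example, the German project CrossAsia: http://crossasia.org/en.html. The China Biographical Database project has developed a trial version of a cross-database search of catalogue data for premodern Chinese books.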