Structure of CBDB

An Account of the Structure of the China Biographical Database (CBDB)
Michael A. Fuller
May 26, 2005


A. An Overview of the Entities in the Database

Database design uses tables to give concrete form to more abstract objects which we simply call “entities.” Since the goal of a database is to capture the relational information about entities, it remains useful to keep the abstract objects separate from the tables that represent their relation. That way, one can more easily ask the question of how the tables need to change to better stand in for the entities they represent.

The central entity that defines biography in the database is, of course:

  1. People

But since a relational database tracks the ways in which people form relations with other people, with their society (i.e., social and economic institutions), and with the physical world, we also need entities with which people interact. First, relationships with people:

  1. Kinship and Lineage
  2. Non-kinship Associations
    Within Non-kinship Associations, there are two additional sub-entities, Mourning Events and Gifts, each of which has attributes not found in the simplest Non-kinship Associations.

Next, with social and economic institutions:

  1. Status (the types of socio-economic roles a person can play in society)
  2. Modes of Entry into Government
  3. Offices and Postings to office (the bureaucratic organization of rule through time)
  4. Events reflecting the social, cultural, or political role of a person

Then, with geography:

  1. Administrative Hierarchy (defined in political terms as administrative units)
  2. Physical Places (fixed locations in space required for historical comparisons)

Finally, there are texts.

  1. Texts (including primary texts, secondary texts, and paleographic data)

B. Details of Entities:

NOTE: The database allows one to record the Source of information and to add additional Notes as seems appropriate. Every item in the database that records information on an individual has the attributes of Source, Pages, and Notes. Therefore I will not note these in the discussions below.

1. People

Biographical information for people in China begins with obvious categories: name, male or female, date of birth, and date of death.

Precise dates of birth and death often are not available, and all we have is a period of years of activity. Sometimes, not even that is available: we simply know the reign period (nianhao) or dynasty. In order to capture the level of precision in the data, the database allows the use of reign period information for all dates. One can give a specific year within the reign period, but one also can simply indicate “beginning,” “middle,” “end,” or “unspecified.” For analytic purposes, the database will algorithmically produce Western dates from the reign period information for birth, death, years of activity, and any other date given in the traditional Chinese nianhao designation, but it will preserve the vagueness in the nianhao coding.
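The conversion from reign period information to Western dates can be sketched as follows. This is only an illustration under stated assumptions, not the routine CBDB uses: the names REIGN_PERIODS and western_range are hypothetical, the table holds a single entry (Yuanyou, 1086 to 1094), the split of a reign period into thirds for “beginning,” “middle,” and “end” is my own assumption, and the offset between the Chinese and Western calendar years is ignored.

```python
from typing import Optional, Tuple

# Hypothetical lookup table: nianhao -> (first Western year, last Western year).
# One illustrative entry only; a real table would cover every reign period.
REIGN_PERIODS = {
    "Yuanyou": (1086, 1094),
}


def western_range(nianhao: str, year: Optional[int] = None,
                  position: str = "unspecified") -> Tuple[int, int]:
    """Return the (earliest, latest) Western year implied by a nianhao date."""
    first, last = REIGN_PERIODS[nianhao]
    if year is not None:
        # Year 1 of the reign period corresponds to its first Western year.
        western = first + year - 1
        if not first <= western <= last:
            raise ValueError("year falls outside the reign period")
        return (western, western)
    if position == "unspecified":
        return (first, last)
    # Assumption: divide the reign period into three roughly equal parts.
    third = max(1, (last - first + 1) // 3)
    if position == "beginning":
        low, high = first, first + third - 1
    elif position == "middle":
        low, high = first + third, last - third
    elif position == "end":
        low, high = last - third + 1, last
    else:
        raise ValueError(f"unknown position: {position}")
    if low > high:
        # Very short reign periods cannot be subdivided; keep the full span.
        low, high = first, last
    return (max(first, low), min(last, high))
```

Returning a range rather than a single year keeps the vagueness of the original coding visible to any later analysis: western_range("Yuanyou", position="middle") gives a three-year span, while western_range("Yuanyou", year=3) gives a single year.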

a. Names

Names can prove complex. A person has a surname 姓 and a personal name 名 given in infancy. A person may change his or her personal name. Men also have capping names 字 given when they come to adulthood. People may take on style names 號 or Buddhist names. They may be granted a title during their lifetimes or given a posthumous title 諡 after death. The database can record all of these designations and additional types as needed.

b. Entry into Office

The simplest interaction is with modes of Entry into Office. Hartwell reasoned that this event happens only once, if it happens at all. However, there remain important variations that could compel a rethinking. It may not prove useful to think of Entry as a singular event: some people are granted eligibility for office through yin privilege but then go on to pass the jinshi examination. Some begin through yin but are awarded the status of having passed the jinshi. Some people enter and leave service through a variety of circumstances throughout their lives. Hartwell attempts to capture these variations through his codes, but it may prove useful—depending on the data—to allow a person more than one Entry event.

c. Status

Hartwell treated status (which he called “employment”), like entry, as a broad descriptive category for a person, and a person had just one “Employment” attribute. The current version of the database has a separate table to track status to allow a more complete account of a person’s various socio-economic roles.

d. Place Associations

Hartwell had several place codes as part of the basic biographical entry for each person. This approach, however, proved insufficiently flexible because there are many types of place association that Hartwell had not considered. For example, Beverly Bossler suggests an “owned property at” category, while many have requested a “buried at” association. The list of types of Place Associations can expand to accommodate whatever the research reveals as useful.

e. Postings

Hartwell initially was interested in central offices, so his list of offices was not long. He also conflated the type of office with its location: both the Zhexi tiju and the Huainanxi tiju had their own office codes. The current version has a single office table entry for tiju, and the Posting entity for an individual has an office attribute, the beginning and ending years, and locations. Since it is possible that a posting involves combined jurisdictions, the Posting entity allows multiple locations to be associated with a single assignment.

f. Kinship and Lineage

The database tracks both agnatic (one’s own clan) and affinal (those of one’s spouse) kinship information. Since it would be redundant for all information about a lineage to reappear in the record of each member of that lineage, the database stores whatever information is known about a particular individual and builds its large Kinship networks dynamically.

Hartwell assigned three other attributes to people that are related to kinship: status, clan, and zu 族. His documentation suggests that the latter two attributes are rather speculative and of limited use, but CBDB preserves the categories in case later research can give them more rigorous meanings. Hartwell’s use of the first category approximates the idea of a choronym (a place identifier for major aristocratic lineages) with a few stray categories (farmer, merchant) mixed in. CBDB has removed the extraneous elements and renamed the attribute as Choronym.


g. Non-Kinship Associations

The role of kinship relations in Chinese history and society has been well studied, even if many questions remain, to which the CBDB will be able to contribute. The importance and systematic features of other types of hierarchical relations (teacher-student, patron-client, etc.) and, even more crucially, horizontal relations (fellow students, friends, colleagues, etc.) have been difficult to examine. CBDB records these non-kinship relations for a person through the Associations entity.

h. Writings

This entity may be too narrowly conceived. Written texts are important, but if we are to represent information about people, it may prove advisable to generalize the entity to all forms of scholarly and creative productivity. For an architect, if we know what buildings he designed, this might be the place to record that. Certainly the database should have a way to record what is extant for painters, although the list may get very extensive (and accordingly less useful) for Ming and Qing dynasty artists because it merely duplicates information available elsewhere without necessarily bringing the particular power of a relational database to bear on this creative productivity. This is, however, an empirical issue that is best determined as the database grows.

i. Activities

This is a relic of Hartwell’s original design. Hartwell used it to record a variety of heterogeneous information primarily related to a person’s official career. This information now is in the Notes associated with the main biographical entry.

j. Events

While Postings records the appointments to office, it does not record what the posting signifies: was it a promotion or a demotion? Events, which is not in Hartwell’s original database, is designed to capture important events in a person’s biography. While some are directly tied to office, others are more cultural or social. The creation of a shrine in a person’s honor, for example, was an important activity in the Daoxue community. We should have a way to capture that activity. Since this is a new entity, only when the biographical information is input will we see whether it proves to be a useful entity, or whether it should be removed.

2. Kinship

An instance of the Kinship relationship for an individual has three components (plus the source information):

person
kinship relation
kin

The building-block relations for Kinship should be (but are not yet) the 9 basic categories: e (ego), F (father), M (mother), B (brother), Z (sister), S (son), D (daughter), H (husband) and W (wife). Adoptive kinship can perhaps be represented by an asterisk (*), and it may prove useful to preserve Hartwell’s coding of older (+) and younger (-) siblings. Kinship here is strictly descriptive: there is no concern for the Chinese organization of the kinship system or for the terminology. That said, Hartwell focused primarily on reconstructing the mourning circle (5 degrees of consanguinity). However, the database probably needs to be systematized, and new routines need to be written for a more general approach to building kinship networks. Hartwell’s codes for relationship are ad hoc and should be regularized. The kinship network algorithm probably should be a finite state machine (e.g., F+F > FF, F+S > either B or e, B+B > B or e, W+H > e or WH (i.e., either an earlier marriage or a later remarriage), etc.).
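The transition examples just given can be sketched as a small rule table. This is a minimal illustration, not the CBDB routine: the names RULES, compose, and label are hypothetical, only the three ambiguous transitions named in the text are encoded, and every other pair of steps is simply concatenated.

```python
from typing import Set, Tuple

# A kinship path is a tuple of building-block codes: () stands for ego (e),
# ("F", "B") for father's brother, and so on.
Path = Tuple[str, ...]

# Key: (last step of the known path, step being added).
# Value: the possible replacements for that pair of steps.
# Only the ambiguous transitions given in the text are listed here.
RULES = {
    ("F", "S"): [("B",), ()],        # father's son: a brother, or ego
    ("B", "B"): [("B",), ()],        # brother's brother: a brother, or ego
    ("W", "H"): [(), ("W", "H")],    # wife's husband: ego, or another husband
}


def compose(path: Path, step: str) -> Set[Path]:
    """Extend a kinship path by one step, returning every possible result."""
    if path and (path[-1], step) in RULES:
        return {path[:-1] + alt for alt in RULES[(path[-1], step)]}
    return {path + (step,)}


def label(path: Path) -> str:
    """Render a path in the notation of the text, with 'e' for ego."""
    return "".join(path) or "e"
```

Because some transitions are ambiguous, compose returns a set: {label(p) for p in compose(("F",), "S")} yields both "B" and "e", and deciding between them requires the identity of the person actually recorded.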

Agnatic kinship distance can be computed as a pair of values. The first is generational distance, with +1 for S or D, -1 for F or M, and so on. The second is a “collateral branch route distance,” where the key entry is +1 for B or Z. Once the branch is formed, the distance is a measure of how far down the branch one must travel to reach the person. Thus FB = -1,1 (one generation up, one branch over), and FBS = 0,2 (same generation but 2 links along the branch). Affinal relationship distances are measured through joining pairs of agnatic distances, with HW = (0,0)(0,0) for a distance of 1 (i.e., this is e husband + e wife, but the adding of a pair in itself adds 1 to the distance). The distance to FBWF (the father-in-law of a paternal uncle) would be (-1,1)(-1,0), for a total generational distance of -2 and a total distance of 4, while the distance to FBWZH (the husband of a paternal uncle’s wife’s sister) would be (-1,1)(0,1)(0,0), with a generational distance of -1 and a total distance of 6. (It should be clear that affines of affines grow distant relatively quickly, but this would seem to be an appropriate result.) Thus one can specify a stopping condition for the search for kinship relations. There must be better ways to compute the distance, but my major concern is to make sure that some metric tells the system when to stop looking for more kinship relations.
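The pair notation can be computed mechanically from a kinship code. The sketch below is an illustration under assumptions, with hypothetical function names: a code is split at each H or W into agnatic segments, every step taken after a branch has formed counts as one more link along the branch, and the total distance is assumed to be the sum of the absolute generational value and the branch value of each pair, plus 1 for each joining of pairs. That assumed summation reproduces the totals for HW and FBWF but gives 5 rather than 6 for FBWZH, so the intended rule may count that case differently.

```python
import re
from typing import List, Tuple


def agnatic_pair(segment: str) -> Tuple[int, int]:
    """Return (generational distance, branch distance) for an agnatic segment."""
    generation, branch = 0, 0
    for step in segment:
        if step in "FM":
            generation -= 1
        elif step in "SD":
            generation += 1
        elif step not in "BZ":
            raise ValueError(f"not an agnatic step: {step}")
        # B or Z opens a branch; afterwards every step is one more link along it.
        if step in "BZ" or branch > 0:
            branch += 1
    return (generation, branch)


def kin_pairs(code: str) -> List[Tuple[int, int]]:
    """Split a kinship code at each marriage link (H or W) into agnatic pairs."""
    return [agnatic_pair(segment) for segment in re.split("[HW]", code)]


def kin_distance(code: str) -> Tuple[int, int]:
    """Return (total generational distance, total distance) for a kinship code."""
    pairs = kin_pairs(code)
    generation = sum(g for g, _ in pairs)
    # Assumed rule: |generation| + branch for each pair, plus 1 per joining.
    total = sum(abs(g) + b for g, b in pairs) + (len(pairs) - 1)
    return (generation, total)
```

A search for kin could then stop once the total distance exceeds a chosen threshold, which is the stopping condition the metric is meant to supply.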

There are complications that the system will need to accommodate. For example, how does one handle a nephew adopted as the heir for a childless couple? It may suffice to simply use two entries: F (for the person’s actual father) and F* for the adoptive father, and let all the other dynamically generated entries grow from these. For the adopted person’s children, this would generate both FF and FF*, and so on. Do we want this? My inclination is to say yes. If the adoption was between brothers’ families, the grandfather is of course the same, but other variations at greater kinship distances are possible. It would seem that, despite the above claim to descriptive neutrality, the database probably needs to handle such a situation—and particularly kinship distance—in a manner consistent with historical practice. The project manager in consultation with social historians will need to decide the best approach.

3. Non-kinship Associations

a. Simple Non-kinship Associations

These have a three-part structure: person + association + associate. The major challenge in recording the non-kinship Associations that individuals formed over their lives is to control the proliferation of categories. The current database has introduced superordinate categories and sub-categories to group associations into families. Hartwell had created codes for studying each of the classics under a teacher: these should be grouped into classical scholarship, which in turn is part of the more general teacher-student relation. Since an association, however, can have several aspects to its meaning, the database allows any particular association to be categorized as participating in three distinct category/subcategory pairs.

Because associations are between pairs of people, there must be symmetrical types of associations. That is, if {A “is the student of” B} is in the database, then {B “is the teacher of” A} also should be. In fact, the current version of the program automatically generates this second entry. Thus Associations as an entity has an internal structure:

Association type
Paired Association type
Association Categories/subcategories (3 at present)

When one creates a new Association, one must also create its converse. Mutual associations, of course, are their own converse: to record that {A “traveled with” B} is also to record that {B “traveled with” A}. Some associations, however, are not dyadic because the relation is not to a person but to a more abstract or general object. The most important association of this type is the faction. Thus we have {A “is member of Yuanyou group” ∅} (∅ here is the Null element).
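The automatic generation of the converse entry can be sketched as follows. The codes, the ASSOCIATION_TYPES table, and the record_association function are hypothetical names chosen for illustration, and None stands in for the Null element.

```python
from typing import Optional, Set, Tuple

# Hypothetical association types: code -> (description, code of the converse).
ASSOCIATION_TYPES = {
    1: ("is the student of", 2),
    2: ("is the teacher of", 1),
    3: ("traveled with", 3),                   # mutual: its own converse
    4: ("is member of Yuanyou group", None),   # not dyadic: no converse
}

Entry = Tuple[str, int, Optional[str]]  # (person, association code, associate)


def record_association(table: Set[Entry], person: str, code: int,
                       associate: Optional[str] = None) -> None:
    """Store an association and, where one exists, its converse entry."""
    table.add((person, code, associate))
    converse = ASSOCIATION_TYPES[code][1]
    if associate is not None and converse is not None:
        table.add((associate, converse, person))


associations: Set[Entry] = set()
record_association(associations, "A", 1, "B")   # also stores B "is the teacher of" A
record_association(associations, "A", 3, "B")   # also stores B "traveled with" A
record_association(associations, "A", 4)        # faction membership, no associate
```

Storing entries in a set means that recording a mutual association from either side leaves exactly one entry per direction.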

b. Mourning Associations

In China, a person can choose to participate in mourning for a teacher or a person connected through some other form of non-kinship association. Mourning has five aspects that are represented in the database:

Mourner
Mourned
Length of Mourning
Color of Mourning Robe
Date

It may prove that this category of association is most important in early China, but it should be expected that as Chinese culture transformed over time, sources for biography stressed different forms of activities and relationships. The database needs to be able to accommodate these historical shifts.

c. Gift Giving

Another type of association is created through the giving of gifts. This practice also has five aspects represented in the database:

Giver
Recipient
Gift
Value (or quantity) of gift
Date


4. Status

Hartwell defined this entity as a singular attribute for a person, a quick way to assess who he or she was. However, it became clear that if the database is to represent information on people who did not serve in office or who held other stations in life outside of government service, the idea of Status would need to be clarified and refined. The current database has a separate table to trace status history. Since the dating often is uncertain, however, the table has a field to record sequence. Moreover, since some forms of status may combine roles (a lay Buddhist scholar, or a literatus who runs a printing firm), there may be a use for types that fall under two different categories. Or it may be better to disaggregate them into separate roles. This is largely an empirical question of how often such merged roles appear and whether they seem to have been viewed as a single status rather than two. Thus refining this approach, both in the list of the types of status and in their categorization, is a task for the project manager in coordination with the coding team in Beijing. The structure of a Status datum for a person is:

Person
Status code
Status sequence
Date
Source information and notes

Status itself (as opposed to any particular person’s “status event”) is a simple entity:

Status code
Status description
Status category and subcategory 1
Status category and subcategory 2
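The two structures above might be laid out as a pair of tables, sketched here with Python's built-in sqlite3 module. The table names, column names, person ID, and category labels are all hypothetical and do not reproduce the actual CBDB schema; the two status descriptions are the combined-role examples mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (
    status_code   INTEGER PRIMARY KEY,
    description   TEXT NOT NULL,
    category_1    TEXT, subcategory_1 TEXT,
    category_2    TEXT, subcategory_2 TEXT
);
CREATE TABLE person_status (
    person_id     INTEGER NOT NULL,
    status_code   INTEGER NOT NULL REFERENCES status(status_code),
    sequence      INTEGER,   -- order of statuses when dates are unknown
    year          INTEGER,   -- NULL when the status cannot be dated
    source        TEXT,
    notes         TEXT
);
""")

# Illustrative rows only; the category labels are invented for this sketch.
conn.executemany(
    "INSERT INTO status VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "literatus who runs a printing firm",
         "scholarship", "literatus", "commerce", "printing"),
        (2, "lay Buddhist scholar",
         "religion", "lay Buddhist", "scholarship", "scholar"),
    ],
)
conn.executemany(
    "INSERT INTO person_status (person_id, status_code, sequence, year) "
    "VALUES (?, ?, ?, ?)",
    [(100, 2, 2, None), (100, 1, 1, None)],
)


def status_history(person_id: int):
    """Return a person's statuses in recorded sequence, dated or not."""
    return conn.execute(
        "SELECT ps.sequence, s.description "
        "FROM person_status AS ps "
        "JOIN status AS s ON s.status_code = ps.status_code "
        "WHERE ps.person_id = ? ORDER BY ps.sequence",
        (person_id,),
    ).fetchall()
```

Ordering by the sequence field rather than by year is what allows a status history to be reconstructed when no dates are known.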


5. Modes of Entry into Government

Hartwell lists 265 ways to enter government service. He has such types of entry as “yin followed by granted equivalent to jinshi”: the very phrasing of this type suggests that Entry is not the single event that Hartwell proposed. The current version of the database has a separate table that reorganizes Hartwell’s data to reflect its actual complexity.

Entry itself is a simple entity, just a name, a type, and a subtype. However, because different routes of entry entail different types of information, the Entry Event for an individual is more complex. If one enters government through the examination system, for example, one needs to know what type of examination and when, if available. (Colleagues have requested that the database track failed examinations as well.) If, in contrast, one enters government through the merit of someone else, that person and the relationship to that person should also be recorded, if known. Thus if Zhang Weisan entered office through yin deriving from his uncle Zhang Jingyi, the entry would be:

Person: [ID of] Zhang Weisan
Entry type: [code for] yin
Entry relation type: [code for] Uncle
Entry relation: [ID of] Zhang Jingyi

Since it is also possible that one can enter office through the yin privilege of a non-kin associate, the “entry event” will need to have a way to record the non-kinship relation. In the end, then, the entry event has many attributes, only some of which are relevant to any particular instance:

Person ID
Entry type code
Entry relation type code (for kin)
Entry associate type code (for non-kin)
Entry associate ID (used for both kin and non-kin)
Entry test date (both Western and nianhao + year (if known))
Entry test ranking
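Because only some attributes apply to any one instance, the entry event maps naturally onto a record with optional fields. The sketch below is an illustration only: the class name, field names, codes, and person IDs are hypothetical, and the check that a single event names either a kin relation or a non-kin association, but not both, is my own assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryEvent:
    person_id: int
    entry_type_code: str
    relation_type_code: Optional[str] = None    # used for kin
    associate_type_code: Optional[str] = None   # used for non-kin
    associate_id: Optional[int] = None          # used for both kin and non-kin
    test_year_western: Optional[int] = None
    test_nianhao: Optional[str] = None
    test_nianhao_year: Optional[int] = None
    test_ranking: Optional[int] = None

    def __post_init__(self) -> None:
        # Assumption: one event points to a kin OR a non-kin associate, not both.
        if self.relation_type_code and self.associate_type_code:
            raise ValueError("use either a kin relation or a non-kin association")


# Hypothetical IDs and codes for the example in the text.
ZHANG_WEISAN, ZHANG_JINGYI = 1001, 1002

yin_entry = EntryEvent(
    person_id=ZHANG_WEISAN,
    entry_type_code="yin",
    relation_type_code="uncle",
    associate_id=ZHANG_JINGYI,
)

# Allowing more than one Entry event per person is then just a list.
entries_by_person = {ZHANG_WEISAN: [yin_entry]}
```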

6. Offices

One of the design issues that need to be considered again is how much of the complexity of the Chinese imperial bureaucratic system should be captured in the database. Since Hartwell was primarily interested in civil, central government posts, and financial offices in particular, his structures for representing office-holding information were fairly limited. These have been revised in the FoxPro prototype and have been changed again in the transition to the MySQL version.

In the Chinese system from the Han through the Qing, the duties of a position may change even though the title of the office remains constant, or the duties may remain constant although the title changes. Scholars have complained that Charles Hucker, in his Dictionary of Official Titles, tried to force a continuity of function onto office names when it would have been more useful to simply acknowledge the drifts. Thus in CBDB Office Name is one entity, while Office Function is another. Most of the actual duties of an office at any particular time are not relevant to the CBDB because these details contribute little to the analytic power of the database; the attributes of function that do matter are (1) whether the office is used as an indication of salary/rank or is functional, (2) the office to which it reports, and (3) the type of the office (e.g., central military, prefectural civil, etc.). Thus the attributes of the office name entity are few:

Office code
Office name
Year created (both Western and nianhao + year)
Year abolished (both Western and nianhao + year)

The office function entity carries most of the information:

Office code
First year for this function
Last year for this function
Office category (titular, honorary, functional, etc.)
The Office code of the office to which the office reports
Office type (civil central, military circuit, etc.)
Office subtype (personnel, finance, etc.)
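Separating name from function means that the function of an office must be looked up for a particular year. A minimal sketch of that lookup follows; the class, the function name, and the sample data are hypothetical and describe no real office.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OfficeFunction:
    office_code: int
    first_year: int
    last_year: int
    category: str                 # titular, honorary, functional, etc.
    reports_to: Optional[int]     # office code of the superior office
    office_type: str              # civil central, military circuit, etc.
    subtype: str                  # personnel, finance, etc.


# Invented data: one office name whose function drifts from functional to titular.
FUNCTIONS: List[OfficeFunction] = [
    OfficeFunction(10, 1000, 1050, "functional", 5, "civil central", "finance"),
    OfficeFunction(10, 1051, 1100, "titular", None, "civil central", "finance"),
]


def function_in_year(office_code: int, year: int) -> Optional[OfficeFunction]:
    """Return the function an office had in a given year, if one is recorded."""
    for function in FUNCTIONS:
        if (function.office_code == office_code
                and function.first_year <= year <= function.last_year):
            return function
    return None
```

The same office code thus yields different answers for different years, which is the drift that a single combined record for name and function could not represent.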

Postings are entities at the intersection of people, the bureaucracy, and (since most will be away from the capital) places. A person serves in an office at a given rank in a particular place at a specified time. The attributes of the postings entity for a person capture this event:

Person ID
Office code
Salary Rank Office code
Address ID
Sequence (since often only the order of office is known with no further information about the years for any of the postings)
Year (both Western and nianhao + year)
Source, and Notes

I see no compelling reason why Buddhist and Daoist bureaucratic positions should not be added to the Office Name/Office Function/Postings entities. They have not been added simply because Hartwell had no data for biographies of priests.

7. Places

Hartwell’s inclusion of geographic codes in his biographical data was a major innovation. Since he was well ahead of his time, however, his approach to coding was a good first try, and its limitations have required that it be replaced. Based on a series of maps for the Tang through Qing dynasties, Hartwell assigned unique codes to all regions down to the level of the county, codes that reflect their hierarchical reporting status in each dynasty. Thus S000400000321 designates a Song dynasty county (xian) that is part of a prefecture S000400000300, that in turn is part of a circuit S000400000000. If the reporting structure for a county changes, it must get a new code to reflect that change. The current version of the database uses a strategy for coding places that derives from the CHGIS project and relies on two types of spatial entities: Addresses and Places.

To begin with, there are Addresses: these are specifically historical “instances” of place designation that refer to an administrative jurisdiction bounded in space with a particular name. If either the boundaries or the name changes, a new address must be created. These historical instances, however, are part of administrative hierarchies of the type that Hartwell represented in his coding: this information is preserved in a “belongs-to” table that serves the same function as the “part-of” table in CHGIS. Thus there are two tables:

Addresses
Address code
Address name
Administrative type (this probably should be in the belongs-to table)
Address first year
Address last year

Belongs to
Address code
Belongs-to Address code
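With the hierarchy moved out of the code itself and into the belongs-to table, the chain of jurisdictions above an address is recovered by following the links. The sketch below uses a plain dictionary in place of the table; the address codes and the function name are hypothetical.

```python
from typing import Dict, List

# Hypothetical belongs-to table: address code -> code of the unit it reports to.
BELONGS_TO: Dict[str, str] = {
    "county_1": "prefecture_1",
    "county_2": "prefecture_1",
    "prefecture_1": "circuit_1",
}


def hierarchy(address_code: str) -> List[str]:
    """Return an address followed by each unit above it, lowest level first."""
    chain = [address_code]
    while chain[-1] in BELONGS_TO:
        parent = BELONGS_TO[chain[-1]]
        if parent in chain:
            # Guard against a data-entry error that makes the hierarchy circular.
            raise ValueError(f"circular belongs-to entry at {parent}")
        chain.append(parent)
    return chain
```

If the reporting structure changes, only the belongs-to entry changes; the address code itself stays the same, which is the flexibility that the embedded hierarchical codes lacked.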

Since a major goal of the CBDB is to allow the examination of trends across dynastic boundaries, the database needs a way to examine what happens in a particular location over long periods of time. For this, CBDB relies on the Place entity: place designates a physical location, an x-y coordinate on the map. Since Hartwell had recorded the correspondence between his dynastic administrative codes and the 1990 county boundaries, CBDB at present simply uses those 1990 counties as its place units. The current database has a table of places that includes those 1990 counties and additional units to cover any historical county address (or any address at the end of the hierarchical structure, i.e., one that has no units that “belong to” it) for which Hartwell did not have a corresponding 1990 code. The most significant attribute of a place unit is its x-y coordinates. Thus we have the Place table:

Place code
Place name
x-coordinate
y-coordinate

CBDB has a final table to hold the Address and Place information together that needs only two entries:

Address code
Place code

It should be stressed that when the CHGIS project is completed, we hope to be able to replace the CBDB internal geo-coding system with the CHGIS tables. However, there is one important aspect of the CBDB approach that is not basic to CHGIS and will need to be confronted. That is, CHGIS does not concern itself with long-time trends and does not have tables that correspond to Place and Address_Place. It has a table that records the change from one address code to another, but this approach does not meet the needs of CBDB. Consider, for example, if the central authorities decided to combine three depopulated counties A, B, and C into one large county D and thus save administrative expenses. Suppose that thirty years later the government decided that this was a mistake and redivided D along the original boundaries into A′, B′, and C′. A “transitions” table would record A→D and D→A′, but would have no way to reconstruct the fact that A→A′. CBDB needs to track the correspondence of administrative to physical location over time in order to provide longitudinal data.

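The merged-county example can be worked through with the Address and Address_Place tables. In the sketch below every code and year is invented for illustration: places P1, P2, and P3 stand for the physical locations of the three original counties, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical addresses: code -> (first year, last year).
ADDRESSES: Dict[str, Tuple[int, int]] = {
    "A": (1000, 1049), "B": (1000, 1049), "C": (1000, 1049),
    "D": (1050, 1079),                      # the merged county, thirty years
    "A'": (1080, 1200), "B'": (1080, 1200), "C'": (1080, 1200),
}

# Address_Place table: which physical place each address covered.
ADDRESS_PLACE: List[Tuple[str, str]] = [
    ("A", "P1"), ("D", "P1"), ("A'", "P1"),
    ("B", "P2"), ("D", "P2"), ("B'", "P2"),
    ("C", "P3"), ("D", "P3"), ("C'", "P3"),
]


def addresses_for_place(place_code: str) -> List[str]:
    """Return every address that ever covered a place, in chronological order."""
    codes = [address for address, place in ADDRESS_PLACE if place == place_code]
    return sorted(codes, key=lambda code: ADDRESSES[code][0])


def address_in_year(place_code: str, year: int) -> Optional[str]:
    """Return the address that covered a place in a given year, if any."""
    for code in addresses_for_place(place_code):
        first, last = ADDRESSES[code]
        if first <= year <= last:
            return code
    return None
```

Here addresses_for_place("P1") returns A, D, and A′ in order, so the continuity from A to A′ is recovered through the shared place rather than through a chain of transitions.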
Since experience with CHGIS has taught us that relying on boundaries defined by polygons has many problems, in the end it may be best to simply create a grid of points to replace the Places table, and then determine which points were inside what county at any given time. In doing historical analysis, the granularity of the grid (which need not be evenly spaced across the whole of China) will limit the accuracy of the results, but this limitation in an important way represents the very real limits encountered in the data itself. When a funerary inscription states that someone was from a particular county, this is not necessarily an absolutely precise designation. Scholars and especially students should be reminded that place information is only fairly accurate. The Place grid would be a reminder of this limitation.
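Determining which grid points fell inside which county at a given time is a point-in-polygon test. The sketch below uses the standard ray-casting method; the county outlines, dates, and function names are invented for illustration, and real boundaries would of course be far more complex.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Polygon = List[Point]

# Invented data: (county name, first year, last year, boundary polygon).
COUNTIES: List[Tuple[str, int, int, Polygon]] = [
    ("county_east", 1000, 1100, [(2, 0), (4, 0), (4, 2), (2, 2)]),
    ("county_west", 1000, 1100, [(0, 0), (2, 0), (2, 2), (0, 2)]),
]


def point_in_polygon(x: float, y: float, polygon: Polygon) -> bool:
    """Ray casting: count boundary crossings of a ray running right from (x, y)."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def county_of(point: Point, year: int) -> Optional[str]:
    """Return the county whose boundary contained a grid point in a given year."""
    for name, first, last, polygon in COUNTIES:
        if first <= year <= last and point_in_polygon(point[0], point[1], polygon):
            return name
    return None
```

A grid point would be tested once for each period of interest, and the resulting list of counties per point would take the place of the Address_Place entries.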


8. Writings

There are three major types of writings of concern to the database: inscriptional and other paleographic material, printed primary texts, and secondary scholarship. Since a work like Huang Zongxi’s Song Yuan xue’an is both a scholarly compendium of earlier writings and a work in its own right, and since the paleographic materials also had authors who are of interest to the database, these distinctions for pre-modern texts of any sort are neither clear nor useful. CBDB accordingly treats all three types as writings. Writings have the attributes one can expect:

author(s)
title
category of writing (inscription or manuscript/printed)
genre
original publication date
original publisher
original publication location
current publication date
current publisher
current publication location

Inscriptional materials have a few additional attributes (recorded in separate tables):

alternate names
donor(s)
recipient(s)
place where discovered
date of discovery
current location
source of information

Since the texts can serve as sources for biographical information, and since source information for entries includes page numbers, CBDB records the publication information for the modern edition used. However, CBDB does not aspire to serve as a standard reference for bibliographic information. It (at least at present and in the near future) will not list all the extant editions of texts for authors nor adjudicate which are the most reliable among those extant editions. Part of the future plans for CBDB when it is available on the web is to develop links between the database and other web resources: bibliographic sites certainly will be among such links.

NOTE: A finite state machine is computerese for a system with a limited number of possible states, in which the transition from one state to another has strictly defined behavior for all combinations of initial and final states.