Our paper “Nominative Linkage of Records of Officials in the China Government Employee Dataset-Qing (CGED-Q)”, which shares our experience with nominative linkage in the CGED-Q, has been published in Historical Life Course Studies. We hope it will be useful to others who are engaged in large-scale, automated nominative linkage (disambiguation) of individuals in historical Chinese-language sources.
While the approach that we arrived at after many iterations may be specific to the CGED-Q and its contents, we think that our summary of the challenges that we encountered will be of broader interest, and our methods should at least be a roadmap for others with related projects.
Major issues we document and then address include the use of variant orthographies for the same character in different editions or sources, replacement of characters with ones that look similar but are actually completely different, replacement of characters with homophones, inconsistencies in the writing of the names of counties, and changes in boundaries that led the same county to be associated with different provinces in different sources or editions.
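As a rough illustration of how issues like these can be handled in code, the Python sketch below folds variant, look-alike, and homophone characters onto a single canonical form before comparing records. It is not the procedure described in the paper. The three lookup tables, their entries, and the names `normalize` and `linkage_key` are all hypothetical and chosen only for the example. The key also leaves out province, on the assumption that boundary changes make province less stable than county.

```python
# Illustrative sketch only: the tables below are hypothetical examples,
# not the mappings used for the CGED-Q.

# Variant orthographies of the same character.
VARIANTS = {"髙": "高", "靑": "青"}
# Characters that look alike but are actually different characters.
LOOKALIKES = {"已": "己", "巳": "己"}
# Homophones that may be substituted for one another.
HOMOPHONES = {"鐘": "鍾"}


def normalize(text, tables=(VARIANTS, LOOKALIKES, HOMOPHONES)):
    """Map each character to a canonical form, one table at a time."""
    for table in tables:
        text = "".join(table.get(ch, ch) for ch in text)
    return text


def linkage_key(surname, given_name, county):
    """Build a comparison key from the normalized primary attributes.

    Province is omitted here as an assumption of this sketch, since the
    same county may appear under different provinces across sources.
    """
    return (normalize(surname), normalize(given_name), normalize(county))


# Two hypothetical records that differ only in orthography.
record_a = linkage_key("髙", "文鐘", "靑浦")
record_b = linkage_key("高", "文鍾", "青浦")
print(record_a == record_b)
```

In a sketch like this, records sharing a key would only be candidate matches. Deciding whether they refer to the same person would still require further checks, such as the secondary attributes mentioned in the abstract.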
Here is the abstract:
We introduce our approach to the nominative linkage of records of Qing officials who were included in the China Government Employee Datasets-Qing (CGED-Q) Jinshenlu (JSL) and Examination Records (ER). We constructed these datasets by transcription of quarterly rosters of civil and military officials produced by the government and by commercial presses, and records of examination degree holders. We assess each of the primary attributes available in the original sources in terms of their usefulness for disambiguation, focusing on their diversity and potential for inconsistent recording. For officials who were not affiliated with the Eight Banners, these primary attributes include surname, given name, and province and county of origin. For the small subset of officials who were affiliated with the Bannermen, we assess the available data separately. We also assess secondary attributes available in the data that may be useful for adjudicating candidate matches. We then describe the approach that we developed that addresses the issues we identified with the primary and secondary attributes. The issues we have identified and the approach that we have developed will be of interest to researchers engaged in similar efforts to construct and link datasets based on elite males in historical China.
Here is the paper at the HLCS website.
We have also made available the complete tabulations that are the basis of the tables in the paper. These include the frequencies of surnames and given names in the CGED-Q JSL, and the frequencies of discordance across records of the same individual in the recording of surnames, characters in given names, and place of origin. The tabulations can be downloaded from the HKUST DataSpace and the Harvard Dataverse:
https://dataspace.hkust.edu.hk/dataset.xhtml?persistentId=doi:10.14711/dataset/M8HQEA
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/4OSP8V
Those not specifically interested in linkage may still be interested in the tabulations of surnames and characters in given names.