Cameron D. Campbell 康文林

Family, Social Mobility, and Inequality in China and in Comparative Perspective


Improved pipeline for nominative linkage of historical records written with Chinese characters

Posted on October 1, 2025 by camecamp

Yue (Bruce) YU is the lead author of a working paper that introduces a new pipeline for nominative linkage of historical records written with Chinese characters. He has also published a codebase on GitHub, with a tutorial on implementing the pipeline using our public data.

At present, most approaches to large-scale historical record linkage are designed for sources written in phonetic alphabets, and they are not suitable for materials written in Chinese characters. Homophones are extremely common in Chinese, and a slight modification to a character, for example the addition or removal of a few strokes, can turn it into a different character with a different meaning and a different pronunciation. Yu Yue’s approach converts Chinese characters into sequences of strokes, which are then turned into embeddings, and it has a variety of other novel features. He demonstrates it by applying it to the CGED-Q JSL, and compares its performance with the ad hoc approach described in my 2022 paper with Chen Bijia.
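To make the point about strokes concrete, here is a minimal Python sketch. It is not the pipeline from the working paper: plain edit distance over stroke codes stands in for the learned stroke embeddings, and the `STROKES` table and the function names are hypothetical, written by hand for illustration. It shows that 大 (dà), 太 (tài) and 天 (tiān) sound different but are one stroke apart, while the homophones 李 and 理 (both lǐ) are far apart as stroke sequences.

```python
# Hypothetical toy table, hand-written for this sketch (not from the paper).
# Codes use the common five-class stroke scheme:
# 1 horizontal, 2 vertical, 3 left-falling, 4 dot/right-falling, 5 turning.
STROKES = {
    "大": "134",
    "太": "1344",
    "天": "1134",
    "李": "1234521",
    "理": "11212511211",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def stroke_distance(x: str, y: str) -> int:
    """Edit distance between the stroke codes of two characters."""
    return edit_distance(STROKES[x], STROKES[y])


if __name__ == "__main__":
    for pair in [("大", "太"), ("大", "天"), ("李", "理")]:
        print(pair, stroke_distance(*pair))
```

A phonetic encoding would treat 李 and 理 as identical and 大 and 太 as unrelated, which is the opposite of what is needed to catch copying errors between visually similar characters.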

Here is the abstract:

We introduce a generic machine learning-based pipeline for nominative linkage of records within and across large-scale Chinese historical datasets. The pipeline addresses key challenges, including character variations, incomplete data, and scalability issues specific to historical datasets in which names and other attributes are recorded with Chinese characters, not just for China, but potentially for Korea, Japan and Vietnam. Techniques developed for attributes recorded in phonetic alphabets are of limited usefulness for Chinese characters not only because homonyms are common, but characters that are similar enough in appearance to be frequently mistaken for each other may sound completely different. Our approach integrates stroke-based character embeddings for efficient blocking, supervised classification with active learning for record matching, and graph-based clustering for final linkage. We demonstrate the effectiveness of this pipeline using the career records of officials in the China Government Employee Database-Qing Jinshenlu (CGED-Q JSL) as a test case. We achieve improved linkage quality compared to standard probabilistic methods, with substantially longer linked sequences of career records and fewer aberrant transitions. To validate the generalizability, we also successfully apply the pipeline to another database and a cross-database linkage task. By minimizing the need for manual tuning, our pipeline offers a more accessible and effective solution for Chinese historical data linkage.
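As a rough illustration of the last stage mentioned in the abstract, the sketch below groups records into linked careers from a list of pairwise matches. It assumes the simplest form of graph-based clustering, connected components computed with union-find; the paper may well use a different clustering method. The function name `cluster_matches` and the record identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_matches(record_ids, matched_pairs):
    """Group records into clusters: any two records joined by a chain
    of pairwise matches end up in the same cluster (connected components)."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(list)
    for r in record_ids:
        groups[find(r)].append(r)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Hypothetical record ids; the pairs are those a classifier accepted.
    records = ["r1", "r2", "r3", "r4", "r5"]
    pairs = [("r1", "r2"), ("r2", "r3")]
    print(cluster_matches(records, pairs))
```

In this example r1 and r3 are never compared directly, but both match r2, so all three fall into one linked sequence while r4 and r5 remain singletons. One known weakness of plain connected components is that a single false match merges two careers, which is one reason a real pipeline might prefer a stricter clustering rule.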

©2026 Cameron D. Campbell 康文林