While carrying out nominative linkage of records of individuals in different editions of a database we are constructing, we ran into three types of problems:
- Different variants of the same character were used when entering the same individual’s surname or given name in different editions, preventing a link from being made. This was sometimes the result of the coder’s keying, and in other cases it seemed to reflect differences in the character in the original.
- Keying mistakes by coders led to the occasional entry of a simplified character, when a traditional character should have been entered.
- A character in an individual’s given name was replaced with a homonym, preventing a link from being made.
To deal with this, we wrote code in STATA to
- Consolidate variants for the purpose of nominative linkage by replacing them with the most common variant that appears in our dataset. This isn’t necessarily the ‘right’ variant, and we only do the replacement temporarily, when we do linkage; we preserve the original entry in the dataset. Since we are linking on surname, given name, and, for anyone who wasn’t in the Banners, province and county of origin, we’re not too worried about false positives: a spurious link would require someone else with the exact same province and county of origin, identical on every other character of the surname and given name, whose name genuinely included the variant.
- Translate simplified characters to traditional characters. Again, this is only for linkage. We’re not worried about false positives because, for a spurious link to be made, the other person would need to be identical on province, county, and every other character of the surname and given name, and differ only in that the one character was genuinely written in its traditional form rather than the simplified form.
- Translate strings of Unicode characters into pinyin, with tones, and then allow matches in situations where the combination of province, county, surname, and given name was identical except for one character that was nevertheless a homophone.
In the hope that our programs and the tables used for mapping might be useful to someone, we provide them below.
Please keep in mind that these were developed for a very specific purpose, nominative linkage, where false positives were unlikely, because we were matching on additional information, and we considered it very, very unlikely that there could be two people from the same province and county whose combination of surname and given name differed by only one character, with the differing characters both variants of the same character. For our purposes, even a few false positives would not be much of a problem. However, for other purposes, the programs below might be inadequate, and some details we have swept under the rug might be crucial. If you are dealing with a situation where more precision is important, you are of course welcome to take what we have done as a starting point and come up with something better. We do hope that you will refer to this blog entry.
The other caveat is that we just don’t have time to provide any help or answer questions. For the time being, these programs are sufficient for our purposes, and we are not inclined to do much more work on them. I can’t incorporate suggestions, and if you have questions about the code or the tables, I may or may not be able to answer. The material we have made available here will be most useful to someone who knows STATA or some other programming language.
Our programs rely on mappings constructed from files we downloaded from the Unicode website: http://unicode.org/charts/unihan.html, in particular Unihan.zip from the folder http://www.unicode.org/Public/UCD/latest/ucd/. Within that zip file, Unihan_Variants.txt was the basis for our mapping of character variants and simplified/traditional variants. Unihan_Readings.txt was the basis for our mappings to pinyin. We imported these files into STATA, and then produced three different files, one for the variant mappings, one for simplified/traditional mapping, and one for pinyin. We did additional processing to narrow down the variant mappings so that they would all converge on the version of a character that was most common in our dataset.
We have a STATA .do file that defines programs to carry out these tasks. It needs the three .dta files mentioned above, which you will also have to download and place in a folder: Character variants recodes.dta, Simplified to traditional.dta, and Unicode pinyin.dta.
Before executing any of the programs, you will need to define a global macro named conversion that specifies the path to the folder in which the .dta files have been placed:
global conversion "pathname"
Replace pathname with the path to the folder, for example, C:\Users\camcam\Documents\StataFiles
Consolidating variants
The program to consolidate variants, consolidate_variants, expects three arguments:
consolidate_variants varname1 newvarname1 newvarname2
The first, varname1, is a string variable already in memory that will be the subject of the processing. The second, newvarname1, is a string variable to be created in which the output is placed. The third, newvarname2, is a string variable to be created that will contain a list of the transformations that were made.
The program dices the Unicode string into its constituent characters, converts each character to a Hex representation of its Unicode value, merges with Character variants recodes.dta to obtain a new value if available, and then reassembles everything. Looking at it, there are things I would do to speed it up and simplify it, but I don’t have time.
In the file Character variants recodes.dta, unicode is the ‘before’ version, unicode_cv is the ‘after’ version, and, for reference, the Chinese characters corresponding to the before and after Unicode values are included as original_cv and outcome_cv, respectively. The variable outcome_cv_total is the count of the number of times the outcome character appeared in our data, and was the basis for the decision about which character was ‘before’ and which was ‘after’.
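For readers who don’t use STATA, the dice–map–reassemble logic described above can be sketched in a few lines of Python. The mapping dictionary here is a hypothetical stand-in for Character variants recodes.dta (in the real program the lookup is a merge on the hex Unicode value), and the two variant pairs shown are just examples:

```python
# Hypothetical variant mapping: each key is a variant character, each value the
# most common form in the dataset (a stand-in for Character variants recodes.dta).
VARIANT_MAP = {
    "敎": "教",  # variant pair for jiao 'teach'
    "爲": "為",  # variant pair for wei 'for/do'
}

def consolidate_variants(name):
    """Dice the string into characters, map each through the variant table,
    and reassemble; also return a log of the transformations made
    (playing the role of newvarname2)."""
    out_chars = []
    log = []
    for ch in name:
        mapped = VARIANT_MAP.get(ch, ch)
        if mapped != ch:
            # Record the change as 'before>after'.
            log.append(f"{ch}>{mapped}")
        out_chars.append(mapped)
    return "".join(out_chars), "; ".join(log)
```

For example, `consolidate_variants("敎堂")` returns the consolidated string together with the log entry recording that 敎 was mapped to 教.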
For reference, here is the Excel file with the variant mappings.
Converting simplified to traditional characters
The program to transform simplified to traditional, to_traditional, is invoked the same way:
to_traditional varname1 newvarname1 newvarname2
newvarname1 will contain the traditional version of varname1, and newvarname2 will contain a list of the changes that were made.
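The underlying operation is the same per-character substitution as before. As an illustration only, here is a Python sketch with a toy three-entry table standing in for Simplified to traditional.dta:

```python
# Toy simplified-to-traditional table (a stand-in for Simplified to traditional.dta).
S2T = {"刘": "劉", "张": "張", "陈": "陳"}

def to_traditional(name):
    """Return the traditionalised string plus a log of the changes made,
    mirroring the newvarname1 / newvarname2 outputs."""
    result = "".join(S2T.get(ch, ch) for ch in name)
    changes = "; ".join(f"{ch}>{S2T[ch]}" for ch in name if ch in S2T)
    return result, changes
```

The real mapping, derived from Unihan_Variants.txt, is of course far larger, and some simplified characters map to several traditional ones; a sketch like this sidesteps that ambiguity.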
Here is the Excel file with our simplified to traditional mappings.
Converting characters to pinyin
Finally, to_pinyin creates a string of pinyin with tones. It is invoked with two arguments, the first being the name of an existing string variable, and the second the name of a string variable to be created.
to_pinyin varname1 newvarname1
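The character-to-reading lookup can be illustrated the same way. The toy table below is a hypothetical stand-in for Unicode pinyin.dta (the real one is built from Unihan_Readings.txt, which often lists more than one reading per character; this sketch assumes one):

```python
# Toy character-to-pinyin table with tone marks (a stand-in for Unicode pinyin.dta).
PINYIN = {"王": "wáng", "文": "wén", "海": "hǎi"}

def to_pinyin(name):
    """Convert each character to its pinyin reading, joined with spaces;
    characters without a reading are passed through unchanged."""
    return " ".join(PINYIN.get(ch, ch) for ch in name)
```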
Here is the Excel file with our mappings to pinyin.
Stripping out tone marks in the pinyin
We also created a program to strip tone marks out of the pinyin strings created by the code above.
pinyin_no_tone varname1 newvarname1
This is hopefully self-explanatory.
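One way to do this outside STATA, sketched here in Python: decompose each character with Unicode NFD, drop the four combining tone marks, and recompose. Note that this version deliberately keeps the diaeresis, so lǜ becomes lü rather than lu; whether that is what you want depends on your matching rules.

```python
import unicodedata

# The four pinyin tone marks as combining characters:
# macron (tone 1), acute (tone 2), caron (tone 3), grave (tone 4).
TONE_MARKS = {"\u0304", "\u0301", "\u030C", "\u0300"}

def pinyin_no_tone(s):
    """Strip tone marks from a pinyin string while keeping other diacritics,
    such as the diaeresis in lü."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", stripped)
```

For example, `pinyin_no_tone("wáng hǎi")` yields "wang hai", while `pinyin_no_tone("lǜ")` yields "lü".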