
Linking Individuals Across Historical Sources: a Fully Automated Approach

Feb 2018
Stanford King Center on Global Development Working Paper
By Ran Abramitzky, Roy Mill, Santiago Perez
Linking individuals across historical datasets relies on information such as name and age
that is both non-unique and prone to enumeration and transcription errors. These errors make
it impossible to find the correct match with certainty. We suggest a fully automated method for
linking historical datasets that enables researchers to create samples that minimize type I (false
positives) and type II (false negatives) errors. The first step of the method uses the Expectation-
Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that
each pair of observations corresponds to the same individual. The second step uses these estimated
probabilities to determine which records to use in the analysis. We provide code to implement
this method.
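
The two steps can be illustrated with a minimal Python sketch. This is not the authors' code, and it rests on assumptions that the abstract does not state: each candidate pair is summarized by binary agreement indicators (for example, "name agrees", "age agrees"), the indicators are treated as conditionally independent given match status, and the starting values and thresholds are arbitrary. The function names `em_match_probabilities` and `select_links` and the parameters `min_prob` and `max_runner_up` are hypothetical.

```python
from math import prod


def _clip(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid degenerate estimates."""
    return min(max(x, eps), 1 - eps)


def em_match_probabilities(agree, n_iter=200, tol=1e-8):
    """Step 1 (sketch): EM for a two-class mixture over candidate pairs.

    agree: list of rows, one per candidate pair, each a list of 0/1
           agreement indicators (an assumed representation).
    Returns the estimated probability that each pair is a true match.
    """
    n_fields = len(agree[0])
    p = 0.1                  # assumed starting share of true matches
    m = [0.9] * n_fields     # P(field agrees | match), starting guess
    u = [0.1] * n_fields     # P(field agrees | non-match), starting guess
    w = []
    for _ in range(n_iter):
        # E-step: posterior match probability for each pair.
        w = []
        for row in agree:
            like_m = p * prod(mk if a else 1 - mk for a, mk in zip(row, m))
            like_u = (1 - p) * prod(uk if a else 1 - uk for a, uk in zip(row, u))
            w.append(like_m / (like_m + like_u))
        # M-step: re-estimate parameters as posterior-weighted averages.
        total_m = sum(w)
        total_u = len(w) - total_m
        m = [_clip(sum(wi * row[k] for wi, row in zip(w, agree)) / total_m)
             for k in range(n_fields)]
        u = [_clip(sum((1 - wi) * row[k] for wi, row in zip(w, agree)) / total_u)
             for k in range(n_fields)]
        p_new = _clip(total_m / len(w))
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return w


def select_links(pairs, probs, min_prob=0.9, max_runner_up=0.5):
    """Step 2 (sketch): keep a link only if it is likely and unambiguous.

    pairs: list of (id_a, id_b) candidate pairs, aligned with probs.
    A record id_a is linked to its best candidate when that candidate's
    probability is at least min_prob and the second-best candidate's
    probability is at most max_runner_up (both thresholds are assumptions).
    """
    by_a = {}
    for (a, b), pr in zip(pairs, probs):
        by_a.setdefault(a, []).append((float(pr), b))
    links = {}
    for a, cands in by_a.items():
        cands.sort(reverse=True)
        best_p, best_b = cands[0]
        runner_up = cands[1][0] if len(cands) > 1 else 0.0
        if best_p >= min_prob and runner_up <= max_runner_up:
            links[a] = best_b
    return links
```

In this sketch, raising `min_prob` or lowering `max_runner_up` trades a smaller linked sample for fewer false positives, while relaxing them does the reverse. This selection rule checks ambiguity only from the side of the first dataset; a fuller version would also require each record in the second dataset to be claimed at most once.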