A Case Study in Record Linkage (PVER Conference, May 2011)



PowerPoint Presentation from May 2011 Personal Validation and Entity Resolution Conference. Presenters: Marianne Winglee, Richard Valliant, Fritz Scheuren.


    1. A Case Study in Record Linkage
       Marianne Winglee, Westat
       Richard Valliant, Survey Research Center at U Michigan and the JPSM at U Maryland
       Fritz Scheuren, NORC at the University of Chicago
    2. Today’s Talk
       - Share a case study that used model-based approaches to make linkage decisions and estimate error rates, and
       - Our thoughts on applications of these approaches in other settings
    3. The Case Study
       - The Medical Expenditure Panel Survey: both providers and household respondents reported medical events
       - The task was to combine the data for estimation
    4. Provider & HH Data
       - Some fields used for matching records included:
         - Date of event
         - Medical procedure codes
         - Condition codes, and
         - Length of hospital stay for inpatient events
    5. Data May Disagree
       - Both sources reported events for the same persons over the same period
       - But their data may disagree on any or all of the fields
    6. Probabilistic
       - The matching method followed the Fellegi and Sunter framework
       - The match weights were related to the likelihood of two records being a match
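As a rough illustration of the Fellegi-Sunter weighting idea, the sketch below computes a composite match weight as a sum of per-field log2 agreement weights. The field names and the m- and u-probabilities are hypothetical placeholders, not values from the study.

```python
import math

# Hypothetical m- and u-probabilities per match field:
#   m = P(field agrees | pair is a true match)
#   u = P(field agrees | pair is a true non-match)
FIELD_PROBS = {
    "event_date": (0.95, 0.05),
    "procedure_code": (0.90, 0.10),
    "condition_code": (0.85, 0.15),
}

def match_weight(agreements):
    """Sum log2(m/u) for agreeing fields and log2((1-m)/(1-u)) for
    disagreeing fields, giving a composite weight for the record pair."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if agreements[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

# Agreement on all fields yields a strongly positive weight;
# disagreement on all fields yields a strongly negative one.
w_all = match_weight({"event_date": True, "procedure_code": True, "condition_code": True})
w_none = match_weight({"event_date": False, "procedure_code": False, "condition_code": False})
```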
    7. Decision Rule
       - The optimal decision rule uses two selection thresholds:
         - an upper and a lower threshold
         - a clerical review zone to resolve pairs with indeterminate status
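The two-threshold rule can be sketched as follows; the cutoff values here are arbitrary placeholders, not thresholds from the study.

```python
# Hypothetical upper and lower selection thresholds.
UPPER_THRESHOLD = 10.0
LOWER_THRESHOLD = 0.0

def classify(weight, upper=UPPER_THRESHOLD, lower=LOWER_THRESHOLD):
    """Declare a match above the upper cutoff, a nonmatch below the lower
    cutoff, and send the indeterminate middle zone to clerical review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "nonmatch"
    return "clerical review"
```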
    8. In Practice
       - We often use a single threshold. Data managers are looking for a method to guide decisions and measure success
       - The question is: "What value should we use to declare a record pair a match or a nonmatch?"
    9. A Guide to Action
       - How best to estimate linkage errors at each cutoff value is a difficult question, given a limited budget and time schedule
    10. Error Estimation
       - Accurate estimation of the linkage errors depends on at least two factors:
         - the power of the match fields to unambiguously identify events that are true matches, and
         - the linkage method used
    11. Methods
       - Taken together, it is then possible, in a given setting, to:
         - specify the linkage categories
         - estimate agreement probabilities
         - determine match weights
    12. Match Weight CDFs
       - An idea in the literature is to use the cumulative distribution functions (CDFs) of the match weights to determine cutoffs
       - Newcombe and Kennedy (1962); Jaro (1989)
    13. Basic Steps
       - The basic step is to first compute the match weight for all plausible pairs for a person
       - A plausible record pair may be one where the provider and HH agreed on the date of event, but not on all the medical condition codes
    14. Weight Thresholds
       - Plot the CDFs of the weights for:
         - true matched pairs (M)
         - true unmatched pairs (U)
       - Use the weight distributions to determine thresholds that attain the desired error rates
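One way to read thresholds off the two training weight distributions is sketched below, assuming we hold lists of weights for the M pairs and the U pairs plus target error rates. The data and targets are invented for illustration.

```python
import math

def pick_thresholds(m_weights, u_weights, fn_target=0.05, fp_target=0.05):
    """Choose (lower, upper) cutoffs so that, on the training pairs, at most
    fn_target of the M weights fall below `lower` and at most fp_target of
    the U weights fall at or above `upper`."""
    m = sorted(m_weights)
    u = sorted(u_weights)
    lower = m[int(fn_target * len(m))]
    upper = u[min(len(u) - 1, math.ceil((1 - fp_target) * len(u)))]
    return lower, upper

# Illustrative overlapping weight distributions for M and U pairs.
lower, upper = pick_thresholds(list(range(0, 100)), list(range(-80, 20)))
```

Pairs scoring between the two cutoffs would fall in the clerical review zone.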
    15. Our Approaches
       - The rest of this talk is about how we applied these ideas:
         - using training sets of pairs
         - using training sets to estimate multinomial parameters and running simulations with those parameters
    16. Training Pairs M
       - Selected a sample of 500 people and prepared the matching files:
         - 2,507 events reported by household respondents, and
         - 2,804 events reported by medical providers
    17. Training M Curve
       - Manual reviewers matched the events and generated 1,501 pairs
       - We considered these the M pairs, assigned the match weights from the linkage system, and generated the CDF of these training M pairs
    18. Training U Pairs
       - Selected a sample of 500 events from each of the matching files and used a many-to-many match to generate all 250,000 possible pairs
       - For these randomly selected sets of events matched across people, the chance of there being any correctly matched pairs was negligible
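The many-to-many construction of the U pairs amounts to a full cross join of the two sampled files. The sketch below uses made-up event IDs in place of the study's actual records.

```python
from itertools import product

# Hypothetical sampled event IDs; the real files held 500 events each.
hh_events = [f"hh_{i}" for i in range(500)]
provider_events = [f"prov_{i}" for i in range(500)]

# Every household event paired with every provider event: 500 x 500 pairs.
u_pairs = list(product(hh_events, provider_events))
```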
    19. Training U Curve
       - We took the entire set and considered them the U pairs
       - Assigned the match weights from the linkage system and generated the CDF of these training U pairs
    20. Weight Curves
       [Figure: CDFs of the training weights; x-axis: assigned match weight (-10 to 20), y-axis: cumulative percentage (0 to 100)]
    21. Error Rates
       - False negative error rate:
         - Of the set of true matched pairs, what proportion would not be found? (false nonmatch)
       - False positive error rate:
         - Of the set of true unmatched pairs, what proportion would be mixed in with the found pairs? (false match)
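Given a single cutoff and the two sets of training weights, these two definitions translate directly into code; the function names are our own.

```python
def false_negative_rate(m_weights, threshold):
    """Of the true matched pairs, the proportion scoring below the cutoff
    (false nonmatches)."""
    return sum(w < threshold for w in m_weights) / len(m_weights)

def false_positive_rate(u_weights, threshold):
    """Of the true unmatched pairs, the proportion scoring at or above the
    cutoff (false matches)."""
    return sum(w >= threshold for w in u_weights) / len(u_weights)
```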
    22. A Simulation Approach
       - Training sets of true pairs are not always available
       - Another method is to generate the M pairs and the U pairs through simulation
    23. SimRate
       - Simulates the distribution of the weights using parameters from the matched pairs and the unmatched pairs
       - We covered a situation where weights are assigned to candidate pairs using the log2 formula
    24. Match Weights
    25. Simulation Model
       - This application used a multinomial distribution model with parameters m to generate the simulated M pairs
       - Used another multinomial distribution with parameters u to generate the simulated U pairs
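A minimal sketch of the multinomial simulation step, assuming two match fields and invented agreement-pattern probabilities; the real models used the estimated m and u parameters for the study's own fields.

```python
import random

random.seed(42)  # reproducible draws for this sketch

# Joint agreement patterns over two hypothetical match fields
# (1 = field agrees, 0 = field disagrees) and their multinomial
# probabilities under the match (m) and non-match (u) models.
PATTERNS = [(1, 1), (1, 0), (0, 1), (0, 0)]
M_PROBS = [0.80, 0.10, 0.08, 0.02]
U_PROBS = [0.01, 0.09, 0.10, 0.80]

def simulate_pairs(probs, n):
    """Draw n agreement patterns from the multinomial model."""
    return random.choices(PATTERNS, weights=probs, k=n)

sim_m = simulate_pairs(M_PROBS, 10_000)  # simulated M pairs
sim_u = simulate_pairs(U_PROBS, 10_000)  # simulated U pairs
```

Each simulated pattern would then be run through the matching software to obtain a weight, yielding the simulated weight curves.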
    26. SIM-M Curve
       - Estimate m parameters from a training set of real matched pairs
       - Generate many realizations of multinomial random variables
       - Run these pairs through the matching software
    27. SIM-M Curve II
       - The software computes a match weight for each record pair
       - Plot the cumulative distribution of weights for these simulated M pairs
       - The proportion of cases left of the threshold is the false negative rate
    28. SIM-U Curve
       - Estimate u parameters from a training set of real non-matched pairs
       - Generate many realizations of multinomial random variables
       - Run these pairs through the matching software
    29. SIM-U Curve II
       - The software computes a match weight for each record pair
       - Plot the reverse cumulative distribution of weights for these simulated U pairs
       - The proportion of cases right of the threshold is the false positive rate
    30. Results
       - The results were encouraging
       - The simulated weight curves followed the shape of the training weight curves in both sets
    31. Heart of SimRate
       - It provides the error rate estimates that would be obtained from repeated applications of the matching algorithm to a large number of candidate record pairs
    32. Flexible Tool
       - As long as we can generate simulated record pairs that realistically follow the observed distribution in the data, SimRate should provide suitable error rate estimates
    33. Concluding Remarks
       - The paper offers a useful heuristic that is complete in itself and serviceable (even modifiable) in other settings
    34. Remark II
       - SimRate is just a way of estimating error rates. We are simulating the way in which matching would actually be implemented. It is not a method of matching
    35. Remark III
       - The simulation method of estimating error rates could apply to any method that assigns a score to a potential pair
    36. Remark IV
       - What we offered was a strategy; what we did is far from a final solution to estimating error rates
       - We look forward to seeing applications in other settings
    37. Contact Us
       - [email_address]
       - [email_address]
       - Fritz-scheuren@NORC.UChicago.Edu
    38. Notations
    39. Assumptions
    40. Simulation Model
    41. Agreement Probabilities
    42. Joint Probability M
    43. Joint Probability U