A Case Study in Record Linkage_PVER Conf_May2011
PowerPoint Presentation from May 2011 Personal Validation and Entity Resolution Conference. Presenters: Marianne Winglee, Richard Valliant, Fritz Scheuren.

A Case Study in Record Linkage_PVER Conf_May2011: Presentation Transcript

  • A CASE STUDY IN RECORD LINKAGE Marianne Winglee Westat Richard Valliant Survey Research Center at U Michigan and the JPSM at U Maryland Fritz Scheuren NORC at University of Chicago
  • Today’s Talk
    • Share a case study that used model-based approaches to make linkage decisions and estimate error rates, and
    • Our thoughts on applications of these approaches in other settings
  • The Case Study
    • The Medical Expenditure Panel Survey: both providers and household respondents reported medical events
    • The task was to combine data for estimation
  • Provider & HH Data
    • Some fields used for matching records included:
      • Date of event
      • Medical procedure codes
      • Condition codes and
      • Length of hospital stay for inpatient events
  • Data May Disagree
    • Both sources were reporting events for the same persons over the same period
    • But their data may disagree on any or all of the fields
  • Probabilistic
    • The matching method followed the Fellegi and Sunter framework
    • The match weights are related to the likelihood of two records being a match
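The Fellegi-Sunter weight computation just described can be sketched in a few lines. The field names and m-/u-probabilities below are hypothetical illustrations, not the values used in the case study.

```python
from math import log2

# Hypothetical m- and u-probabilities per match field:
# m = P(field agrees | true match), u = P(field agrees | true non-match).
FIELDS = {
    "event_date":     (0.95, 0.05),
    "procedure_code": (0.85, 0.10),
    "condition_code": (0.80, 0.15),
}

def match_weight(agreement):
    """Sum of field-level log2 likelihood ratios (Fellegi-Sunter)."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agreement[field]:
            total += log2(m / u)              # agreement weight (positive)
        else:
            total += log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

# A pair that agrees on date and procedure but not condition code:
w = match_weight({"event_date": True, "procedure_code": True, "condition_code": False})
```

Pairs that agree on high-power fields accumulate large positive weights; disagreements pull the weight down.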
  • Decision Rule
    • The optimal decision rule uses two selection thresholds: an upper and a lower cutoff
    • Pairs with weights between the two cutoffs fall in a clerical review zone, to be resolved by manual review
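A minimal sketch of the two-threshold decision rule; the cutoff values in the usage example are illustrative only.

```python
def classify(weight, lower, upper):
    """Two-threshold Fellegi-Sunter decision rule.

    Thresholds here are hypothetical; in practice they are chosen to
    control the false match and false nonmatch rates.
    """
    if weight >= upper:
        return "link"                # declare a match
    if weight <= lower:
        return "nonlink"             # declare a nonmatch
    return "clerical review"         # indeterminate zone

# Illustrative cutoffs: lower = 0, upper = 6
decisions = [classify(w, 0, 6) for w in (10.0, 3.0, -5.0)]
```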
  • In practice
    • In practice we often use a single threshold. Data managers need a method to guide decisions and measure success
    • The question is: what value should be used to declare a record pair a match or a nonmatch?
  • A Guide to Action
    • How best to estimate linkage errors at each cutoff value is a difficult question to answer given a limited budget and time schedule
  • Error Estimation
    • Accurate estimation of the linkage errors depends on at least two factors:
      • The power of the match fields to unambiguously identify events that are true matches, and
      • The linkage method used
  • Methods
    • Taken together, it is then possible, in a given setting, to
      • Specify the linkage categories
      • Estimate agreement probabilities
      • Determine match weights
  • Match Weight CDFs
    • An idea in the literature is to use the cumulative distribution functions (CDFs) of the match weights to determine cutoffs
    • Newcombe and Kennedy, 1962; Jaro, 1989
  • Basic Steps
    • The first step is to compute the match weight for all plausible pairs for a person
    • A plausible record pair may be one where the provider and HH agreed on the date of event, but not all the medical condition codes.
  • Weight thresholds
    • Plot the CDFs of the weights for
      • true matched pairs M
      • true unmatched pairs U
    • Use the weight distribution to determine thresholds to attain the desired level of error rates
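Choosing thresholds from the empirical weight CDFs might look like the following sketch; the error-rate targets and the NumPy-based quantile approach are assumptions for illustration, not the study's procedure.

```python
import numpy as np

def thresholds_from_cdfs(m_weights, u_weights, fn_target=0.05, fp_target=0.01):
    """Pick cutoffs from empirical weight CDFs.

    The lower threshold leaves at most fn_target of the M curve to its
    left (false negatives); the upper threshold leaves at most fp_target
    of the U curve to its right (false positives). Targets are illustrative.
    """
    lower = np.quantile(m_weights, fn_target)      # FN rate = P(M weight < lower)
    upper = np.quantile(u_weights, 1 - fp_target)  # FP rate = P(U weight > upper)
    return lower, upper

# Toy well-separated weight sets: M weights high, U weights low.
lower, upper = thresholds_from_cdfs(np.arange(100), np.arange(-100, 0))
```

When the M and U curves overlap, the lower cutoff can fall below the upper one, and the gap between them is the clerical review zone.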
  • Our Approaches
    • The rest of this talk is about how we applied these ideas:
      • Using training sets of pairs
      • Use training sets to estimate multinomial parameters and do simulation with those parameters
  • Training Pairs M
    • Selected a sample of 500 people and prepared the matching files:
      • 2,507 events reported by household respondents, and
      • 2,804 events reported by medical providers
  • Training M Curve
    • Manual reviewers matched the events and generated 1,501 pairs
    • We considered these the M pairs, assigned the match weights from the linkage system, and generated the CDF of these training M pairs
  • Training U Pairs
    • Selected a sample of 500 events from each of the matching files and used many-to-many match to generate all 250,000 possible pairs
    • For these randomly selected sets of events matched across people, the chance of there being any correctly matched pairs was negligible.
  • Training U Curve
    • We took the entire set and considered them the U pairs
    • Assigned the match weights from the linkage system and generated the CDF of these training U pairs
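The many-to-many construction of the training U pairs can be sketched as below; the event records and field names are hypothetical stand-ins for the two matching files.

```python
import itertools

# Hypothetical stand-ins for the two matching files: 500 events sampled
# from the household file and 500 from the provider file, drawn across
# different people so true matches are negligible.
hh_events = [{"event_id": i, "source": "household"} for i in range(500)]
prov_events = [{"event_id": j, "source": "provider"} for j in range(500)]

# Many-to-many match: form every cross pair.
u_pairs = list(itertools.product(hh_events, prov_events))
```

The full cross product yields the 250,000 candidate pairs treated as the training U set.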
  • Weight Curves [Figure: CDFs of the training M and U match weights; x-axis: Assigned Match Weight (-10 to 20), y-axis: Cumulative Percentage (0 to 100)]
  • Error Rates
    • False negative error rate
      • Of the set of true matched pairs, what proportion would not be found? (false nonmatch)
    • False positive error rate
      • Of the set of true unmatched pairs, what proportion would be mixed in with the found pairs? (false match)
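These two error rates can be computed directly from the M and U weight sets at any candidate cutoff, as in this sketch:

```python
def false_negative_rate(m_weights, threshold):
    """Of the true matched pairs, the proportion falling below the cutoff
    (false nonmatches)."""
    return sum(w < threshold for w in m_weights) / len(m_weights)

def false_positive_rate(u_weights, threshold):
    """Of the true unmatched pairs, the proportion at or above the cutoff
    (false matches mixed in with the found pairs)."""
    return sum(w >= threshold for w in u_weights) / len(u_weights)
```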
  • A Simulation Approach
    • Training sets of true pairs are not always available
    • Another method is to generate the M pairs and the U pairs through simulation
  • SimRate
    • Simulates the distribution of the weights using parameters from the matched pairs and the unmatched pairs
    • We considered the situation where weights are assigned to candidate pairs using the log2 formula
  • Match Weights
  • Simulation Model
    • This application used multinomial distribution models with the parameters m to generate the simulated M pairs
    • Used another multinomial distribution with the parameters u to generate the simulated U pairs
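A sketch of the multinomial simulation step, assuming K = 2 match fields and hypothetical agreement-pattern probabilities; in the application these parameters were estimated from the training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint agreement-pattern probabilities for K = 2 fields.
# Patterns: agree-agree, agree-disagree, disagree-agree, disagree-disagree.
m_probs = [0.70, 0.15, 0.10, 0.05]   # parameters m: true matches
u_probs = [0.02, 0.08, 0.10, 0.80]   # parameters u: true non-matches

def simulate_patterns(probs, n):
    """Draw n simulated pairs from a multinomial over the 2^K agreement
    patterns; returns the count of pairs falling in each pattern."""
    return rng.multinomial(n, probs)

sim_m = simulate_patterns(m_probs, 10_000)  # simulated M pairs
sim_u = simulate_patterns(u_probs, 10_000)  # simulated U pairs
```

Each simulated pair's agreement pattern can then be fed to the matching software to receive a weight, giving the SIM-M and SIM-U curves.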
  • SIM-M Curve
    • Estimate m parameters from training set of real matched pairs
    • Generate many realizations of multinomial random variables
    • Run these pairs through matching software
  • SIM-M Curve II
    • Software computes match weight for each record pair
    • Plot cumulative distribution of weights for these simulated matches
    • Proportion of cases left of threshold is false negative rate
  • SIM-U Curve
    • Estimate u parameters from training set of real non-matched pairs
    • Generate many realizations of multinomial random variables
    • Run these pairs through matching software
  • SIM-U Curve II
    • Software computes match weight for each record pair
    • Plot reverse cumulative distribution of weights for these simulated non-matches
    • Proportion of cases right of threshold is false positive rate
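Putting the SIM-M and SIM-U steps together, the error-rate curves over a grid of candidate cutoffs can be sketched as below; the normally distributed toy weights are stand-ins for weights produced by the matching software.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for simulated SIM-M and SIM-U match weights.
sim_m_weights = rng.normal(loc=8.0, scale=3.0, size=5_000)
sim_u_weights = rng.normal(loc=-6.0, scale=3.0, size=5_000)

def error_curves(m_w, u_w, cutoffs):
    """False negative rate (left of cutoff on the M curve) and false
    positive rate (right of cutoff on the reverse-CDF U curve) at each
    candidate cutoff."""
    fn = [(m_w < c).mean() for c in cutoffs]
    fp = [(u_w >= c).mean() for c in cutoffs]
    return fn, fp

cutoffs = np.linspace(-10, 20, 7)
fn, fp = error_curves(sim_m_weights, sim_u_weights, cutoffs)
```

Raising the cutoff trades false positives for false negatives; the curves make that trade-off explicit at each candidate value.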
  • Results
    • The results were encouraging.
    • The simulated weight curves followed the shape of the training weight curves in both sets.
  • Heart of SimRate
    • It provides error rate estimates that would be obtained from repeated applications of the matching algorithm to a large number of candidate record pairs
  • Flexible Tool
    • As long as we can generate simulated record pairs that realistically follow the observed distribution in the data, SimRate should provide suitable error rate estimates
  • Concluding Remarks
    • What the paper does is offer a useful heuristic that is complete in itself and serviceable (modifiable even) in other settings.
  • Remark II
    • SimRate is just a way of estimating error rates. We are simulating the way in which matching would actually be implemented. It is not a method of matching.
  • Remark III
    • The simulation method of estimating error rates could apply to any method that assigns some score to a potential pair.
  • Remark IV
    • What we offered was a strategy. What we did is far from a final solution to estimating error rates.
    • We look forward to seeing applications in other settings
  • Contact Us
    • [email_address]
    • [email_address]
    • Fritz-Scheuren@NORC.UChicago.Edu
  • Notations
  • Assumptions
  • Simulation Model
  • Agreement Probabilities
  • Joint Probability M
  • Joint Probability U