Creating a Large Database Test Bed with Typographical Errors for Record Linkage Evaluation

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    Creating a Large Database Test Bed with Typographical Errors for Record Linkage Evaluation - Presentation Transcript

    1. Creating a Large Database Test Bed with Typographical Errors for Record Linkage Evaluation Nawanan Theera‐Ampornpunt, MD, Boonchai Kijsanayotin, MD, PhD, Stuart M. Speedie, PhD Health Informatics, University of Minnesota, Minneapolis, Minnesota INTRODUCTION Health information exchange across multiple organizations requires a method or algorithm to METHODS (Continued) optimally link records of the same individuals First Name Last Name Date of Birth Gender Zip Code System ID Generate male and female records in equal numbers, using demographic data. and produce a system identifier unique for each Selecting the best record linkage algorithm 88,799  record to allow algorithm evaluation 4,275 Female MN Sequence  requires an evaluation to determine its sensitivity Last Names Age Distribution  1,219 Male M/F Zip Codes Number Database Test Bed Creation and specificity. (1990 Census) of MN Population (1990 Census) 1‐N This evaluation is facilitated by a large database Split records into 2 databases with both common Probabilistic Probabilistic Probabilistic Uniform Uniform records and distinct records test bed that closely reflects a real world population and takes into account the potential Error Introduction data entry errors that unfortunately occur in real- world databases. Randomly introduce errors in each applicable variable based on the frequency of each type of errors This study investigated the synthesis of such a Large Master Database of Demographic Data specified by the user database. (N = 950,000 records) Next Steps: Record Linkage Algorithm Evaluation (Not Part of Study) OBJECTIVE Split records into 2 databases with common and distinct  Employ record linkage algorithms of interest to To enhance the existing methods in creating a records to allow algorithm evaluation produce anonymous identifiers for evaluation database test bed for record linkage evaluation by developing a PHP program to: Database A Database B Common records across the 2 databases would be Create a sufficiently large database of used to check if an algorithm produces the same demographic data that allows more robust and anonymous identifiers as it is supposed to. reliable evaluation of record linkage algorithms Introduce errors with Introduce errors with Distinct records across the 2 databases would be Model the real world distribution of key variables user specified frequencies user specified frequencies used to check if an algorithm produces different anonymous identifiers as it is supposed to. Allow users to introduce typographical errors that occur in real world due to imperfect data Evaluation of Record Linkage  Errors introduced into each database can be used to entry, with frequencies of error occurrence Database A Database B  assess robustness of the algorithm compared to the with Errors Algorithms (Not Part of Project) with Errors specified by the user ideal databases with no errors. METHODS SUMMARY OF CONCLUSIONS Master Database Creation 5 Types of Data Entry Errors A large database test bed is achieved, allowing Randomly select a combination of first names and evaluation of record linkage algorithms last names for each gender based on lists of Character Insertion Richard Ricthard names from 1990 U.S. Census publicly available Character Omission Sullivan Sulivan Acknowledgment The demographic data generated reflect real world and their frequencies of occurrence distribution Character Substitution Robert Rodert This  project  was  funded  in  part  under  grant  number  Generate date of birth based on the available age UC1 HS16155 from  the  Agency  for  Healthcare  Data entry errors were introduced to allow algorithm Character Transposition 55414 55441 Research and Quality, U.S. Department of Health and  evaluation of imperfect dataset distribution of the Minnesota population and randomly select a zip code from MN zip codes Gender Misclassification M F Human Services. using a uniform distribution

    + Nawanan Theera-AmpornpuntNawanan Theera-Ampornpunt, 2 years ago

    custom

    580 views, 0 favs, 0 embeds more stats

    A poster presented at the American Medical Informat more

    More info about this document

    CC Attribution-NonCommercial LicenseCC Attribution-NonCommercial License

    Go to text version

    • Total Views 580
      • 580 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 11
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories