Facilitating Analytics while Protecting Privacy
Presentation at the Strata Rx 2013

Transcript

  • 1. Facilitating Analytics while Protecting Individual Privacy Using Data De-identification Khaled El Emam
  • 2. Talk Outline: Two case studies where we analyzed the privacy implications of sharing health data. Overview of the methodology and risk-measurement basics. State of Louisiana Department of Health and Hospitals and Cajun Code Fest 2013. Mount Sinai School of Medicine Department of Preventive Medicine (World Trade Center Disaster Registry).
  • 3. Data Anonymization Resources: Book signing on September 26, 2013 at 10:35am with Khaled El Emam and Luk Arbuckle.
  • 4. Basic Methodology
  • 5. Direct and Indirect/Quasi-Identifiers: Examples of direct identifiers: name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number. Examples of quasi-identifiers: sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, total years of schooling, marital status, criminal history, total income, visible minority status, profession, event dates.
  • 6. Terminology
  • 7. Safe Harbor: Direct Identifiers and Quasi-identifiers
    1. Names
    2. ZIP codes (except the first three digits)
    3. All elements of dates (except year)
    4. Telephone numbers
    5. Fax numbers
    6. Electronic mail addresses
    7. Social security numbers
    8. Medical record numbers
    9. Health plan beneficiary numbers
    10. Account numbers
    11. Certificate/license numbers
    12. Vehicle identifiers and serial numbers, including license plate numbers
    13. Device identifiers and serial numbers
    14. Web Universal Resource Locators (URLs)
    15. Internet Protocol (IP) address numbers
    16. Biometric identifiers, including finger and voice prints
    17. Full face photographic images and any comparable images
    18. Any other unique identifying number, characteristic, or code
    Plus the Actual Knowledge condition: no actual knowledge that the remaining information could be used to identify an individual.
  • 8. Statistical Method: A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (I) applying such principles and methods, determines that the risk is "very small" that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information, and (II) documents the methods and results of the analysis that justify such determination.
  • 9. Equivalence Classes - I: An equivalence class is the set of records in a table that share the same values on all quasi-identifiers.
  • 10. Equivalence Classes - II

    Gender | Year of Birth (10 years) | DIN
    -------|--------------------------|--------
    Male   | 1970-1979                | 2046059
    Male   | 1980-1989                | 716839
    Male   | 1970-1979                | 2241497
    Female | 1990-1999                | 2046059
    Female | 1980-1989                | 392537
    Male   | 1990-1999                | 363766
    Male   | 1990-1999                | 544981
    Female | 1980-1989                | 293512
    Male   | 1970-1979                | 544981
    Female | 1990-1999                | 596612
    Male   | 1980-1989                | 725765
  • 11.–15. Equivalence Classes III–VII: the table from slide 10 repeated, stepping through its equivalence classes one at a time.
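The grouping in the table above can be reproduced in a few lines. A minimal sketch in Python, with the records typed in from the example table; the DIN column is excluded from the grouping since, as in the slides, only gender and year of birth act as quasi-identifiers:

```python
from collections import Counter

# Quasi-identifier values (gender, 10-year year-of-birth range) from the
# example table; the DIN column is not a quasi-identifier and is left out.
records = [
    ("Male", "1970-1979"), ("Male", "1980-1989"), ("Male", "1970-1979"),
    ("Female", "1990-1999"), ("Female", "1980-1989"), ("Male", "1990-1999"),
    ("Male", "1990-1999"), ("Female", "1980-1989"), ("Male", "1970-1979"),
    ("Female", "1990-1999"), ("Male", "1980-1989"),
]

# Records sharing the same quasi-identifier values form an equivalence class.
classes = Counter(records)
for qi_values, size in classes.items():
    print(qi_values, size)
# Five classes: ("Male", "1970-1979") has size 3, the other four have size 2.
```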
  • 16. Maximum Risk: In the example data set there are 5 equivalence classes. The largest equivalence class has size 3 and the smallest has size 2. The probability of correctly re-identifying a record is 1 divided by the size of its equivalence class, so the maximum probability in this table is 1/2 = 0.5 (50%).
  • 17. Average Risk: There were four equivalence classes of size 2 and one equivalence class of size 3. The average risk is [(8 × 0.5) + (3 × 0.33)]/11 = 5/11 ≈ 45%. This turns out to equal the number of equivalence classes divided by the number of records.
  • 18. Case Study: State of Louisiana – Cajun Code Fest
  • 19. State of Louisiana: Demonstrate how the State of Louisiana used a novel approach to improve the health of its citizens by working with the Center for Business & Information Technologies (CBIT) at the University of Louisiana to provide data for Cajun Code Fest. Discuss how, by being given realistic-looking and realistic-behaving de-identified Medicaid claims and immunization data, competitors were able to build applications supporting Louisiana's "Own Your Own Health" initiative, which encourages patients to make knowledgeable and informed decisions about their healthcare.
  • 20. Cajun Code Fest 2.0: April 24–26, 2013. 27 hours of coding put on by the Center for Business & Information Technology at the University of Louisiana at Lafayette. Teams converged to work their innovative magic, analyzing the de-identified data set to create new healthcare solutions that allow patients to become engaged in their own health.
  • 21. Why De-identified Data? The core data that served as the basis for Cajun Code Fest had to be de-identified before it could be released to the entrants in the challenge. It would not have been possible to have the coding challenge without properly de-identified data.
  • 22. Data by the Numbers 200,000 unique individuals 6,683,337 Medicaid claims 6,410,969 Medicaid prescriptions 4,085,977 Immunization records 29,951 Providers
  • 23. Data Model
  • 24. Claims Summary
  • 25. Long Tails & Truncation
  • 26. Date Shifting – Simple Noise
  • 27. Date Shifting – Fixed Shift
  • 28. Date Shifting – Randomized Generalization I
  • 29. Date Shifting - Randomized Generalization II
  • 30. Geoproxy Attacks: Patients tend to visit providers and obtain prescriptions from pharmacies close to where they live. Can we use the provider and pharmacy location information to predict where the patient lives? This is called a geoproxy attack. We can measure the probability of a correct geoproxy attack and incorporate it into our overall risk-measurement framework.
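The attack idea can be sketched as a toy calculation. This is a minimal illustration, not the method used on the Medicaid data: the coordinates, the centroid estimator, and the 5 km success radius are all hypothetical assumptions.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

# Hypothetical provider/pharmacy locations linked to one patient's claims.
visits = [(30.22, -92.02), (30.25, -92.00), (30.20, -92.06)]
true_home = (30.23, -92.03)   # hypothetical, for scoring the attack

# Adversary's estimate of the home location: centroid of visited sites.
est = (sum(p[0] for p in visits) / len(visits),
       sum(p[1] for p in visits) / len(visits))

# Count the attack as successful if the estimate lands within 5 km of home.
print(haversine_km(est, true_home) < 5.0)
```

Repeating this over all patients gives an empirical probability of a correct geoproxy attack, which is the quantity the slide proposes folding into the overall risk measurement.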
  • 31. Geoproxy Risk on Claims Data
  • 32. Case Study: Mount Sinai School of Medicine World Trade Center Disaster Registry
  • 33. Background: Over 50,000 people are estimated to have helped with the rescue and recovery efforts after 9/11, and over 27,000 of them are captured in the WTC disaster registry created by the Clinical Center of Excellence at Mount Sinai. Mount Sinai did extensive publicity and outreach, working with a variety of organizations, to recruit 9/11 workers and volunteers. Those who participated went through comprehensive examinations, including: medical questionnaires, mental-health questionnaires, exposure-assessment questionnaires, standardized physical examinations, and optional follow-up assessments every 12 to 18 months.
  • 34. Public Information
  • 35. Series of Events
  • 36. Series of Events: The visit date was used for questions specific to the date of the visit (e.g., "do you currently smoke?" creates a smoking event at the time of visit). Some questions included dates that could be used directly with the quasi-identifier and were more informative than the visit date (e.g., the answer to "when were you diagnosed with this disease?" provided the date for the disease event).
  • 37. Demographics
  • 38. Examples of Events
  • 39. Multiple Levels: Sometimes it is reasonable to assume that the adversary will not have many details about an event. For example, the adversary may know that an event occurred but not the exact date on which it occurred. In such a case we change the data to match the adversary's background knowledge, but we release more detailed data. This makes sense given the assumption: the more detailed information that is released does not give the adversary additional useful information.
  • 40. Time of Events: Ten years after the fact, it seems unlikely that an adversary would know the dates of a patient's events before 9/11. Often patients gave different years of diagnosis on follow-up visits because they themselves didn't remember what medical conditions they had! So instead of the date of the event, we used "pre-9/11" as the value. We made a distinction between childhood (under 18) and adulthood (18 and over) diagnoses, as these seemed like something an adversary could reasonably know. These generalizations were done only for measuring risk and weren't applied to the de-identified registry data.
  • 41. Covering Designs: What are the quasi-identifiers when the series of events is long? Will an adversary know all of the details in that sequence? It is reasonable to assume that an adversary will only know p events; this is the power of the adversary. But which p out of m events does the adversary know? If we look at all combinations of p out of m, we may end up with quite a large number of quasi-identifier combinations over which to measure the risk.
  • 42. Combinations of 3
  • 43. Covering Design
  • 44. Reduction in Computation
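The blow-up described on slide 41, and the reduction a covering design buys, can be illustrated with a toy greedy construction. The values of m, p, and the block size k are illustrative, and real covering designs come from published tables rather than this greedy sketch; the point is only that a modest number of size-k blocks can jointly cover every p-subset of events.

```python
from itertools import combinations

m, p, k = 10, 3, 5   # 10 events, adversary power 3, blocks of size 5 (illustrative)

# Naive approach: measure risk over every p-subset of the m events.
uncovered = set(combinations(range(m), p))
naive = len(uncovered)                       # C(10, 3) = 120 subsets

# Greedy cover: repeatedly pick the k-subset of events that covers the
# most still-uncovered p-subsets, until every p-subset is covered.
blocks = []
while uncovered:
    best = max(combinations(range(m), k),
               key=lambda b: sum(t in uncovered for t in combinations(b, p)))
    blocks.append(best)
    uncovered -= set(combinations(best, p))

print(naive, len(blocks))   # far fewer blocks than p-subsets
```

Measuring risk once per block, rather than once per p-subset, is what yields the reduction in computation shown on slide 44.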
  • 45. Contact Khaled El Emam: kelemam@privacyanalytics.ca 613.369.4313 ext 111 @PrivacyAnalytic