Sharing Health Research Data

Slides from a presentation at Johns Hopkins on de-identification and data sharing

Published in: Technology

Transcript

  • 1. SHARING HEALTH RESEARCH DATA
      De-identification: METHODS & EXPERIENCES
      Dr. Khaled El Emam, Electronic Health Information Laboratory
  • 2. Motivations for De-identification
      • Obtaining patient consent/authorization is not practical for large databases and introduces bias
      • Compliance with regulations / legislation
      • Contractual obligations
      • Maintain public / consumer / client trust
      • Costs of breach notification
      Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 7. A Balance
  • 8. Definition of De-identified Data
      Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.
  • 9. Re-identification Attacks
      • Let us settle this issue at the outset
      • There are some claims that health data is easy to re-identify
      • Examples are often used to support that argument
      • The evidence does not support these claims
          – When data are de-identified properly, the probability of a successful re-identification attack is very small
      • Let's consider a few highly publicized examples
  • 10. AOL
      • AOL releases search queries, replacing usernames with pseudonyms
      • New York Times reporters re-identify one user, 4417749
      • Her search terms: “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”
      • Thelma Arnold, a widow living in Lilburn, Ga.; she has three dogs
  • 11. AOL?
      • It is well known that a large percentage of individuals run 'vanity' searches that include their names; Thelma Arnold did
      • It is also known that location information can be determined from an individual's search queries
      • Search queries, even if the username is replaced with a pseudonym, cannot be considered de-identified
  • 12. Weld
      • Governor Weld of Massachusetts was unwell during a public appearance; the story was covered in the media
      • Semi-publicly available insurance claims data were matched with voter registration lists
      • It was possible to determine which claims records belonged to the Governor
  • 13. Weld?
      • This re-identification attack was carried out before HIPAA came into effect; the insurance claims data would not pass any of the HIPAA de-identification standards
      • A recent analysis indicated that Weld was likely re-identified because he was a famous person and a lot of information about him was already in the media (his admission date, his diagnosis, his discharge date); the voter registration list was arguably not necessary
      • The success rate for such an attack would be lower for general members of the public because the voter registration list is incomplete
  • 14. Netflix
      • Netflix publicly released movie ratings data in the context of a competition to develop a recommendation algorithm
      • Researchers re-identified a couple of records by matching with a publicly available and identifiable movie ratings database (IMDB)
      • This resulted in the cancellation of a second competition and in litigation against Netflix for exposing personal information
  • 15. Netflix?
      • The re-identifications were not actually verified by Netflix
      • The authors of the attack admit that the Netflix data was not de-identified (usernames were only replaced with pseudonyms)
      • The false positive rate of the matching was not evaluated (how many people in the IMDB database were actually in the Netflix database?)
  • 16. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071
  • 17. Attribute vs Identity Disclosure
      • Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
      • Identity disclosure: determine which record in the database belongs to a particular individual (for example, determining that record number 7 belongs to Bob Smith is identity disclosure)
      • HIPAA only cares about identity disclosure
  • 18. Attribute vs Identity Disclosure

                     NOT HPV Vaccinated   HPV Vaccinated
        Religion A            5                 40
        Religion B           40                  5

      → Statistically significant relationship (chi-square, p<0.05)
      → High risk of attribute disclosure
  • 20. Attribute vs Identity Disclosure

                     NOT HPV Vaccinated   HPV Vaccinated
        Religion A            5                  6
        Religion B            6                  5

      → After suppression
      → Not statistically significant relationship (chi-square)
      → Low risk of attribute disclosure
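The two tables on slides 18 and 20 can be checked with a short calculation. The sketch below is an illustration, not the presenter's code: it computes the Pearson chi-square statistic for a 2x2 table without a continuity correction (an assumption, since the slides do not say which variant was used) and derives the p-value for one degree of freedom from the complementary error function. The function name `chi_square_2x2` is made up for this example.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.

    Assumption: no continuity (Yates) correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: Religion A, Religion B. Columns: not vaccinated, vaccinated (counts from the slides).
before = [[5, 40], [40, 5]]
after = [[5, 6], [6, 5]]

for label, table in (("before suppression", before), ("after suppression", after)):
    stat, p = chi_square_2x2(table)
    print(f"{label}: chi2 = {stat:.2f}, p = {p:.3g}, significant at 0.05: {p < 0.05}")
```

With the slide counts, the first table is significant at the 0.05 level and the second is not, which matches the high-risk and low-risk labels on the slides.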
  • 21. Stigmatizing Analytics
  • 22. Definition of De-identified Data
      Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual
  • 23. Direct Identifiers
      • Fields that would uniquely identify individuals in a database
      • Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number
  • 24. Dealing with Direct Identifiers
      • Defensible approaches:
          – Remove those fields
          – Convert them to one-time or persistent pseudonyms
          – Randomize the values
      • These approaches will ensure, if done properly, that the probability of recovering the original value is very small
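As a rough illustration of the first two approaches on slide 24, the sketch below removes direct identifier fields and attaches a pseudonym. It uses a keyed hash (HMAC-SHA256) for persistent pseudonyms and a random token for one-time pseudonyms; both are assumptions chosen for the example, as the slides do not name a technique. The field names, the sample records, the 16-character truncation, and the helper names are all hypothetical.

```python
import hashlib
import hmac
import secrets

def persistent_pseudonym(value: str, key: bytes) -> str:
    """Keyed hash: the same value and key always yield the same pseudonym."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def one_time_pseudonym() -> str:
    """Random token with no link back to the original value."""
    return secrets.token_hex(8)

def pseudonymize(records, direct_identifiers, pseudonym_field, key):
    """Drop direct identifier fields and add a persistent pseudonym as 'pid'."""
    released = []
    for rec in records:
        cleaned = {k: v for k, v in rec.items() if k not in direct_identifiers}
        cleaned["pid"] = persistent_pseudonym(rec[pseudonym_field], key)
        released.append(cleaned)
    return released

# The key is a secret and would need to be kept apart from the released data.
key = secrets.token_bytes(32)

# Hypothetical records: two visits by the same patient.
records = [
    {"name": "Bob Smith", "mrn": "MRN-0001", "age": 55, "lab_test": "Albumin, Serum"},
    {"name": "Bob Smith", "mrn": "MRN-0001", "age": 55, "lab_test": "Creatine Kinase"},
]
released = pseudonymize(records, {"name", "mrn"}, "mrn", key)
for row in released:
    print(row)
```

A persistent pseudonym keeps the two visits linkable in the released data, while a one-time pseudonym would not. Which one is appropriate depends on whether the analysis needs to follow a patient across records.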
  • 25. Quasi-Identifiers
      • sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality
  • 33. Re-identification Risk Measurement
      • Risk measurement will depend on:
          – Granularity of quasi-identifiers
          – Region of the country we are talking about
          – Risk metric used (e.g., uniqueness or groups of 5)
          – Threshold for what is acceptable risk
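To make the two metrics named on slide 33 concrete, here is a minimal sketch that groups records by their quasi-identifier values and reports the share of records that are unique and the share that fall in groups smaller than 5. The records, field names, and the decade generalization are invented for illustration; they are not data from the presentation.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """For each record, the number of records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def proportion_unique(records, quasi_identifiers):
    sizes = class_sizes(records, quasi_identifiers)
    return sum(1 for s in sizes if s == 1) / len(sizes)

def proportion_in_small_groups(records, quasi_identifiers, k=5):
    sizes = class_sizes(records, quasi_identifiers)
    return sum(1 for s in sizes if s < k) / len(sizes)

# Hypothetical data: six similar records plus one outlier.
records = [{"sex": "M", "birth_year": 1960 + i % 3, "zip3": "112"} for i in range(6)]
records.append({"sex": "F", "birth_year": 1985, "zip3": "114"})

# Coarser granularity: replace the birth year with its decade.
by_decade = [{**r, "birth_year": r["birth_year"] // 10 * 10} for r in records]

qis = ("sex", "birth_year", "zip3")
for label, data in (("exact birth year", records), ("birth decade", by_decade)):
    print(label,
          "unique:", round(proportion_unique(data, qis), 3),
          "in groups < 5:", round(proportion_in_small_groups(data, qis), 3))
```

On this toy data the uniqueness metric is the same at both granularities, while the groups-of-5 metric changes sharply, which shows why the choice of metric and granularity both matter.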
  • 34. De-identification Standards
      • The HIPAA Privacy Rule specifies two de-identification standards (45 CFR 164.514):
          – Safe Harbor
          – Statistical method (also known as the expert statistician method)
  • 35. HIPAA Safe Harbor
      Safe Harbor Direct Identifiers and Quasi-identifiers
      1. Names
      2. ZIP Codes (except first three)
      3. All elements of dates (except year)
      4. Telephone numbers
      5. Fax numbers
      6. Electronic mail addresses
      7. Social security numbers
      8. Medical record numbers
      9. Health plan beneficiary numbers
      10. Account numbers
      11. Certificate/license numbers
      12. Vehicle identifiers and serial numbers, including license plate numbers
      13. Device identifiers and serial numbers
      14. Web Universal Resource Locators (URLs)
      15. Internet Protocol (IP) address numbers
      16. Biometric identifiers, including finger and voice prints
      17. Full face photographic images and any comparable images
      18. Any other unique identifying number, characteristic, or code
  • 36. HIPAA Safe Harbor
      Safe Harbor Direct Identifiers and Quasi-identifiers
      1. Names
      2. ZIP Codes (except first three)
      3. All elements of dates (except year)
      4. Telephone numbers
      5. Fax numbers
      6. Electronic mail addresses
      7. Social security numbers
      8. Medical record numbers
      9. Health plan beneficiary numbers
      10. Account numbers
      11. Certificate/license numbers
      12. Vehicle identifiers and serial numbers, including license plate numbers
      13. Device identifiers and serial numbers
      14. Web Universal Resource Locators (URLs)
      15. Internet Protocol (IP) address numbers
      16. Biometric identifiers, including finger and voice prints
      17. Full face photographic images and any comparable images
      18. Any other unique identifying number, characteristic, or code
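The Safe Harbor list can be read as a set of field-level transformations. The sketch below follows the simplified list on the slide only: it drops direct identifier fields, keeps the first three digits of the ZIP Code, and reduces dates to the year. The field names and the sample record are hypothetical, and the full rule in 45 CFR 164.514 has further conditions (for example on sparsely populated ZIP3 areas and on ages over 89) that this sketch leaves out.

```python
from datetime import date

# Hypothetical field names standing in for direct identifiers on the list.
DIRECT_FIELDS = {"name", "phone", "email", "ssn", "mrn"}

def safe_harbor_like(record):
    """Apply the slide's simplified Safe Harbor transformations to one record."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_FIELDS:
            continue  # items such as names, phone numbers, MRNs are removed
        if field == "zip":
            out["zip3"] = str(value)[:3]  # keep only the first three digits
        elif isinstance(value, date):
            out[field + "_year"] = value.year  # keep only the year of any date
        else:
            out[field] = value
    return out

record = {
    "name": "Bob Smith",
    "mrn": "MRN-0001",
    "zip": "11201",
    "gender": "M",
    "admission": date(2007, 3, 14),
    "lab_test": "Albumin, Serum",
}
print(safe_harbor_like(record))
```

Note that fields such as gender and the lab test pass through unchanged, so the output can still carry quasi-identifiers that the list does not cover.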
  • 37. Two Problems with Safe Harbor
      • May be removing too much information on the ZIP Code and date fields; these fields are useful for many analytical purposes
      • Does not provide adequate protection; it is easy to have a Safe Harbor compliant data set with a high risk of re-identification
  • 38. High Risk Safe Harbor Data - I
      • If the adversary knows that Bob, a 55-year-old male, is in the database:

        Gender   Age   ZIP   Lab Test
        M        55    112   Albumin, Serum
        F        53    114   Alkaline Phosphatase
        M        24    134   Creatine Kinase

      • Only the first record matches a 55-year-old male, so that record and its lab test can be attributed to Bob even though the data set complies with Safe Harbor
  • 39. High Risk Safe Harbor Data - II
      • 2.24m visits, 1.6m patients, NY discharge data for 2007
      • Compliant with Safe Harbor

        Fields                                      % of patients unique
        age, gender, ZIP3                           2.54%
        age, gender, ZIP3, LOS (length of stay)     21.49%
  • 40. Statistical Method Conditions
      • A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
          I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
          II. Documents the methods and results of the analysis that justify such determination
  • 41. Re-identification Risk Spectrum
  • 42. Overall Risk
  • 43. Overall Risk
  • 44. Overall Risk
  • 45. Overall Risk
  • 46. Overall Risk
  • 47. Overall Risk
  • 48. Managing Re-identification Risk
  • 49. Different Types of Data Releases
      • The same data set can be disclosed with different thresholds:
          – Public data set
          – Release with conditions for known data recipients, including the requirement to sign a data sharing agreement, a prohibition on re-identification, and a requirement to pass these conditions on to all sub-contractors
      • The more conditions attached to a release, the higher the quality of the released data set
  • 50. Example – CA Hospital Discharges
      • Context: data release to a researcher who will sign a data use agreement, good practices for managing sensitive health information
      • There were ~2.1m patients who had ~3m visits
      • Risk threshold = 0.2; use average risk across all patients
      • Variables:
          – Year of birth
          – Gender
          – Year of admission
          – Days since last visit
          – Length of stay
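The slide states the threshold (0.2) and the metric (average risk across all patients) but not how the average is computed. One common formulation, assumed here rather than taken from the slides, takes each record's risk as 1 divided by the size of its equivalence class on the quasi-identifiers, so the average equals the number of classes divided by the number of records. The sketch treats each record as one patient, uses synthetic data, and applies a made-up generalization step (5-year birth bands and two-level bins for the duration fields) to show the comparison against the threshold.

```python
from collections import Counter

# Field names are hypothetical stand-ins for the five variables on the slide.
QIS = ("birth_year", "gender", "admit_year", "days_since_last_visit", "length_of_stay")
THRESHOLD = 0.2  # risk threshold stated on the slide

def average_risk(records, qis=QIS):
    """Mean over records of 1 / class size, which equals classes / records."""
    counts = Counter(tuple(r[q] for q in qis) for r in records)
    return len(counts) / len(records)

def generalize(record):
    """Hypothetical generalization: coarser values for three of the variables."""
    g = dict(record)
    g["birth_year"] = record["birth_year"] // 5 * 5
    g["days_since_last_visit"] = "0-30" if record["days_since_last_visit"] <= 30 else "31+"
    g["length_of_stay"] = "0-3" if record["length_of_stay"] <= 3 else "4+"
    return g

# Synthetic records, one per patient, constructed so every raw record is unique.
records = [
    {
        "birth_year": 1950 + i % 5,
        "gender": "F" if i % 2 else "M",
        "admit_year": 2007,
        "days_since_last_visit": i % 7,
        "length_of_stay": 1 + i % 3,
    }
    for i in range(60)
]
generalized = [generalize(r) for r in records]

for label, data in (("raw", records), ("generalized", generalized)):
    risk = average_risk(data)
    print(f"{label}: average risk = {risk:.3f}, meets threshold: {risk <= THRESHOLD}")
```

In practice the generalization levels would be chosen by searching a hierarchy for the least coarse combination that meets the threshold; the fixed bins here only illustrate the check itself.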
  • 51. Risk Level
  • 52. Hierarchy
  • 53. De-identified Data
  • 54. Key Practical Considerations
      • Data warehouses: de-identification of data extracts instead of whole data warehouses results in higher quality de-identified data
      • Beware of correlated data: data in multiple medical domains are correlated, so one has to be cognizant of inference attacks on data
      • Automation: automation can detect outliers and perform selective suppression, which results in higher quality de-identified data
      • Transparency: important to ensure that methods have received peer and regulator scrutiny
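The automation point on slide 54 can be illustrated with a small sketch of selective suppression: records whose quasi-identifier combination is rare are treated as outliers, and only their quasi-identifier values are blanked, leaving the rest of the data untouched. The group size of 5, the field names, and the sample records are assumptions made for the example, not details from the presentation.

```python
from collections import Counter

def suppress_small_classes(records, quasi_identifiers, k=5):
    """Blank the quasi-identifiers of records in groups smaller than k."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    result = []
    for r in records:
        if counts[tuple(r[q] for q in quasi_identifiers)] < k:
            # Suppress only the quasi-identifiers; other fields are kept.
            r = {**r, **{q: None for q in quasi_identifiers}}
        result.append(r)
    return result

# Hypothetical data: five similar records and one outlier.
records = [{"sex": "M", "birth_decade": 1960, "diagnosis": "J45"} for _ in range(5)]
records.append({"sex": "F", "birth_decade": 1920, "diagnosis": "E11"})

cleaned = suppress_small_classes(records, ("sex", "birth_decade"))
suppressed = sum(1 for r in cleaned if r["sex"] is None)
print(f"{suppressed} of {len(cleaned)} records had quasi-identifiers suppressed")
```

Suppressing a few outlier cells in this way avoids generalizing every record, which is the sense in which selective suppression preserves data quality.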
  • 55. Contact
      kelemam@ehealthinformation.ca
      @kelemam
      www.ehealthinformation.ca