Big Data Meets Privacy:
De-identification Maturity Model for Benchmarking and
Improving De-identification Practices
Presentation at the Strata Rx 2013 Conference on Big Data and Privacy


  1. Big Data Meets Privacy: De-identification Maturity Model for Benchmarking and Improving De-identification Practices
     Nathalie Holmes, Khaled El Emam
  2. Workshop Outline
     • Big Data: Opportunities and Risks in Healthcare
     • De-identification Myths: Fact or Fiction
     • Overview of Terms Used in Anonymization
     • De-identification Maturity Model (DMM) Case Studies
     • DMM Uses and Benefits
  3. OPPORTUNITIES AND RISKS WITH BIG DATA
     How to Successfully Leverage Data While Protecting Individual Privacy
  4. Big Data Tidal Wave is Creating Unforeseen Opportunities and Risks
  5. Organizations with the Right Tools and a Skilled Team Will Come Out on Top
  6. Big Data Opportunities and Risks
     • A lot of useful data contains personal information about patients, study participants, or consumers
     • The challenge is getting access to the data, which means addressing the privacy requirements:
       - Do you have authority?
       - Is it mandatory or discretionary?
       - Do you have patient/participant consent?
       - Can you anonymize the data?
     • These are the only ways that you get access to the data
  7. Healthcare Breaches
     • Best evidence suggests at least 27% of healthcare practices have a breach every year
     • The costs for healthcare are $200 per individual for breach notification (Ponemon)
     • This applies whether you have obtained consent or authority
  8. De-identification is one piece of an enterprise privacy program that can make privacy work
     “Privacy by Design” provides helpful best practices: Proactive, Preventative, Embedded and Continuous
  9. De-Identification Facts or Fiction #1
     • True or False:
       - It’s possible to re-identify most, if not all, data.
     • False:
       - Using robust methods, evidence suggests risk can be very small.
  10. De-Identification Facts or Fiction #2
      • True or False:
        - Privacy regulations say that there must be zero chance of re-identification in order for a data set to be used for secondary purposes.
      • False:
        - HIPAA states that the risk of re-identification must be “very small”. The FTC and other regulations use a “reasonableness” standard. All of these standards take context into account.
  11. De-Identification Facts or Fiction #3
      • True or False:
        - Only covered entities should consider HIPAA as a standard for de-identification.
      • False:
        - HIPAA is a good standard to use regardless of the applicable regulations.
  12. OVERVIEW OF ANONYMIZATION
      How to Successfully Leverage Data While Protecting Individual Privacy
  13. Balancing Data Privacy Requires Evaluation of Privacy Protection and Data Utility
      PRIVACYANALYTICS.CA © 2012-2013, Privacy Analytics. All Rights Reserved
  14. Balancing Data Privacy
  15. Direct and Indirect/Quasi-Identifiers
      • Examples of direct identifiers: name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number
      • Examples of quasi-identifiers: sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, total years of schooling, marital status, criminal history, total income, visible minority status, profession, event dates
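To make the role of quasi-identifiers on slide 15 concrete, here is a minimal sketch that groups records by their quasi-identifier values and reports the smallest group. The function name, field names, and sample records are hypothetical, and using 1 / (smallest group size) as a worst-case risk figure is one common convention assumed here, not a method stated in the slides.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


# Hypothetical records: no direct identifiers, only quasi-identifiers.
records = [
    {"sex": "F", "birth_year": 1980, "zip3": "100"},
    {"sex": "F", "birth_year": 1980, "zip3": "100"},
    {"sex": "M", "birth_year": 1975, "zip3": "100"},
    {"sex": "M", "birth_year": 1975, "zip3": "100"},
    {"sex": "F", "birth_year": 1942, "zip3": "945"},
]

sizes = equivalence_class_sizes(records, ["sex", "birth_year", "zip3"])
smallest = min(sizes.values())

# Assumed convention: worst-case risk is 1 / (size of the smallest group).
max_risk = 1 / smallest

# Records that are unique on their quasi-identifiers are the easiest to single out.
unique_records = sum(1 for n in sizes.values() if n == 1)
```

In this toy data the last record is unique on sex, birth year, and three-digit ZIP, so it would stand out even though no name or MRN is present.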
  16. Terminology
  17. A process that removes the association between the identifying data and the data subject. (Source: ISO/TS 25237:2008)
  18. Reducing the risk of identifying a data subject to a very small level through the application of a set of data transformation techniques without any concern for the analytics utility of the data.
  19. Removal of fields from a data set
  20. A particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics to the data subject and one or more pseudonyms (Source: ISO/TS 25237:2008)
  21. Replacing a value in the data with a random value from a large database of possible values
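The replacement technique on slide 21 can be sketched as follows. This is an illustration only: the surrogate list stands in for the "large database of possible values", and `mask_field`, `SURROGATE_NAMES`, and the sample patients are hypothetical names, not part of any tool mentioned in the talk.

```python
import random

# Tiny stand-in for a large database of possible replacement values (assumption).
SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Patel", "Chris Wong", "Taylor Reed"]


def mask_field(records, field, surrogates, seed=None):
    """Return copies of the records with `field` replaced by a random surrogate."""
    rng = random.Random(seed)  # a seed makes the masking repeatable for testing
    return [{**r, field: rng.choice(surrogates)} for r in records]


patients = [
    {"name": "Jane Doe", "diagnosis": "asthma"},
    {"name": "John Roe", "diagnosis": "diabetes"},
]

masked = mask_field(patients, "name", SURROGATE_NAMES, seed=0)
```

The masked names look realistic, which suits software testing, but they carry no analytic meaning; the other fields are left untouched.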
  22. Data Masking
      Data Masking = No analytics on those fields
  23. Reducing the risk of identifying a data subject to a very small level through the application of a set of data transformation techniques such that the resulting data retains a very high analytics value.
  24. Reducing the precision of a value to a more general one
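Slide 24 describes what is commonly called generalization. A minimal sketch, assuming two typical cases (a full date reduced to its year, an exact age reduced to a range); the function names and the default five-year band are illustrative choices, not values from the slides.

```python
from datetime import date


def generalize_date_to_year(d):
    """Reduce a full date to its year only."""
    return d.year


def generalize_age(age, width=5):
    """Reduce an exact age to a band such as '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


birth_year = generalize_date_to_year(date(1980, 7, 14))
age_band = generalize_age(37)
wide_band = generalize_age(40, width=10)
```

Wider bands lower the precision further, trading data utility for a lower chance that a value singles someone out.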
  25. The removal of records or values (cells) in the data
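The two forms of suppression on slide 25 (removing a cell versus removing a whole record) might look like this in practice. The helper names, the sample rows, and the "rare occupation" rule are all hypothetical; how to decide what to suppress is not specified in the slides.

```python
def suppress_cells(records, field, should_suppress):
    """Blank out one field (set it to None) in every record the predicate flags."""
    return [{**r, field: None} if should_suppress(r) else dict(r) for r in records]


def suppress_records(records, should_suppress):
    """Drop every record the predicate flags."""
    return [dict(r) for r in records if not should_suppress(r)]


rows = [
    {"id": 1, "age": 34, "occupation": "teacher"},
    {"id": 2, "age": 51, "occupation": "astronaut"},
    {"id": 3, "age": 47, "occupation": "nurse"},
]

RARE_JOBS = {"astronaut"}  # assumed rule: rare values are identifying


def is_rare(record):
    return record["occupation"] in RARE_JOBS


cells_suppressed = suppress_cells(rows, "occupation", is_rare)
records_suppressed = suppress_records(rows, is_rare)
```

Cell suppression keeps the rest of the record available for analysis, while record suppression removes it entirely.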
  26. Randomly selecting a subset of records or patients from a data set
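Random subsampling as described on slide 26 can be sketched with the standard library. The function name, the 30% sampling fraction, and the patient IDs are illustrative assumptions.

```python
import random


def subsample(items, fraction, seed=None):
    """Return a random subset containing roughly `fraction` of the items."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    k = round(len(items) * fraction)
    return rng.sample(items, k)  # sampling without replacement


patient_ids = list(range(1, 11))  # ten hypothetical patients
released = subsample(patient_ids, 0.3, seed=42)
```

Because only a subset is released, someone who knows a person is in the full data set cannot be sure that person appears in the released sample.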
  27. The motives and capacity of the data recipient to re-identify the data
  28. The security and privacy practices that the data recipient has in place to manage the data received.
  29. Statistical De-identification
      De-identification = High analytical value
  30. RE-IDENTIFICATION RISKS
      Risks from Basic Demographics
  31. DE-IDENTIFICATION MATURITY MODEL
      How to Successfully Leverage Data While Protecting Individual Privacy
  32. De-identification Maturity Model (DMM)
      • Formal framework to evaluate maturity of de-identification services within an organization
      • Gauges level of an organization’s readiness and experience in relation to people, processes, technologies and consistent measurement practices
      • “DMM” used as a measurement tool; enables the enterprise to implement a grounded strategy based on facts
      • Improves compliance, facilitates access, and scales support services
  33. Three Dimensions of the DMM
      [Figure: dimensions labelled A, B, C]
  34. Practice Dimension (A)
      • DMM has five maturity levels for the de-identification practices that an organization has in place
      • Level 1 is the lowest level of maturity and level 5 is the highest level of maturity
      • Levels: 1 Ad hoc, 2 Masking, 3 Heuristic, 4 Risk Based, 5 Governance
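For anyone recording DMM assessments in code, the five Practice Dimension levels on slide 34 map naturally onto an ordered enumeration. The class name and constant names below are hypothetical; the level numbers and labels come from the slide, and the two sample ratings are the ones given in the case studies (L3 for the Safe Harbor registry, L2 for the masking-only claims processor).

```python
from enum import IntEnum


class PracticeLevel(IntEnum):
    """Practice Dimension maturity levels, lowest (1) to highest (5)."""
    AD_HOC = 1
    MASKING = 2
    HEURISTIC = 3
    RISK_BASED = 4
    GOVERNANCE = 5


# Ratings stated in the case studies.
practice_ratings = {
    "Organization A (Safe Harbor)": PracticeLevel.HEURISTIC,
    "Company B (masking)": PracticeLevel.MASKING,
}

# IntEnum members compare by level, so maturity can be ranked directly.
most_mature = max(practice_ratings, key=practice_ratings.get)
```

An `IntEnum` is used so that levels can be compared and sorted without extra code.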
  35. Case Study 1 – Safe Harbor
      • Organization A is a disease registry
      • They have lots of databases that they connect to and they do a lot of data releases to internal and external data analysts
      • Practice Dimension (what you do):
        - Their primary way of anonymizing data is following the Safe Harbor (SH) de-identification standard (L3)
      • Implementation Dimension (how well you do it):
        - There is a clear process and well-defined roles for following SH, which is well documented
        - Because it’s documented, it’s repeatable (L3)
  36. Safe Harbor
      Safe Harbor Direct Identifiers and Quasi-identifiers
      1. Names
      2. ZIP Codes (except first three)
      3. All elements of dates (except year)
      4. Telephone numbers
      5. Fax numbers
      6. Electronic mail addresses
      7. Social security numbers
      8. Medical record numbers
      9. Health plan beneficiary numbers
      10. Account numbers
      11. Certificate/license numbers
      12. Vehicle identifiers and serial numbers, including license plate numbers
      13. Device identifiers and serial numbers
      14. Web Universal Resource Locators (URLs)
      15. Internet Protocol (IP) address numbers
      16. Biometric identifiers, including finger and voice prints
      17. Full face photographic images and any comparable images
      18. Any other unique identifying number, characteristic, or code
      Actual Knowledge
  37. Case Study 1 – Safe Harbor
      • Automation Dimension (is it automated):
        - They use home-grown scripts for implementing SH
        - The scripts do not have any external validation that they work or are sufficient (L1)
      • Challenges:
        - Despite these efforts, they have missed some key items
        - There has been pressure from analysts to provide more granular data
  38. Case Study 1 – Safe Harbor
      - They have interpreted the SH regulation for dates such that they have only dealt with dates of birth rather than all dates
      - They have not reduced all ZIP codes to the first three digits, nor replaced them with 000 for regions with 20,000 or fewer people, as SH requires
      - Some identifiers were missed (such as clinical trial participant numbers)
      - They did not consider the Actual Knowledge requirement in SH
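Two of the gaps listed on slide 38 (ZIP codes and dates other than date of birth) are mechanical enough to sketch. This is a partial illustration, not a complete Safe Harbor implementation: the function names and the population table are hypothetical, real population counts would come from census data, and other rules (for example the aggregation of ages over 89, the remaining identifiers, and the Actual Knowledge requirement) are not handled here.

```python
from datetime import date

# Hypothetical population per three-digit ZIP prefix; not real census figures.
ZIP3_POPULATION = {"100": 1_500_000, "036": 12_000}


def safe_harbor_zip(zip_code, zip3_population, threshold=20_000):
    """Keep the first three ZIP digits, or return '000' for sparsely populated areas."""
    zip3 = zip_code[:3]
    # Unknown prefixes are treated as sparsely populated (a conservative assumption).
    if zip3_population.get(zip3, 0) <= threshold:
        return "000"
    return zip3


def safe_harbor_dates(record, date_fields):
    """Reduce every listed date field to its year, not only the date of birth."""
    out = dict(record)
    for field in date_fields:
        if isinstance(out.get(field), date):
            out[field] = out[field].year
    return out


visit = {
    "zip": "10027",
    "birth_date": date(1980, 7, 14),
    "admission_date": date(2013, 3, 2),
}
cleaned = safe_harbor_dates(visit, ["birth_date", "admission_date"])
cleaned["zip"] = safe_harbor_zip(cleaned["zip"], ZIP3_POPULATION)
```

Passing every date field explicitly makes the omission described in the case study (handling only dates of birth) visible at the call site.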
  39. Case Study 2 – Masking
      • Company B is a claims processor
      • They have a need for realistic data for software testing
      • Practice Dimension (what you do):
        - Their primary way of anonymizing is through data masking
        - This means they deal only with the direct identifiers (L2)
      • Implementation Dimension (how well you do it):
        - There is a clear process for doing masking and how they implement heuristics, which is well documented
        - Because it’s documented, it’s repeatable (L3)
  40. Case Study 2 – Masking
      • Automation Dimension (is it automated):
        - They use a commercial product for masking
        - This product produces consistent results (L2)
      • Challenges:
        - Despite these efforts, they have missed some key items: the quasi-identifiers
        - Some dates and ZIP codes were not addressed
        - There is no evidence that the risk of re-identification was “very small”
        - The tool vendor architect provided assurance that this was OK
  41. Case Study 3 – Governance
      • Company C is an EMR vendor
      • They have a need to provide reports to their clients on trends and benchmarks to help clients improve their businesses
      • Practice Dimension (what you do):
        - They have a risk-based approach which includes anonymizing both direct identifiers (masking) and indirect identifiers (de-identification)
      • Implementation Dimension (how well you do it):
        - There is a clear process for anonymizing the data which is well documented
        - Because it’s documented, it’s repeatable
  42. Case Study 3 – Governance
      - They have ongoing training of staff on how to do the anonymization
      - They are able to quickly produce reports and metrics documenting what they did to the data before they released it
      - They have automated data sharing agreements that specify the controls data users need to have in place
      - They have a full audit trail to demonstrate that the risk of re-identification is “very small” per HIPAA
      - They track when there is overlap between the various data sets
      - Audits are conducted on data users to confirm compliance with conditions
  43. Case Study 3 – Governance
      • Automation Dimension (is it automated):
        - They use commercial software to do masking and de-identification
        - The product produces consistent results
        - They are able to get defensible anonymization more quickly than by doing it manually
        - The product has been scrutinized by other users and peers and is upgraded on a regular basis
        - They are able to release more data sets, more quickly
  44. Benefits of DMM
      • Determines whether an organization can defensibly ensure that the risk of re-identification is “very small”
      • Provides a road map to meet regulatory and legal requirements
      • Automation and governance allow organizations to share more data for secondary purposes with fewer resources
      • A higher level of maturity results in higher quality data and greater consistency in de-identification
      • Significant improvement in ability to estimate resources and time required to de-identify data sets
  45. Key Learnings
  46. Data Anonymization Resources
      Book Signing: Sept 26, 10:35 am, Booth #107
      Khaled El Emam & Luk Arbuckle
  47. Other Conference Activities
      • Session: Facilitating Analytics While Protecting Individual Privacy Using Data De-identification
        - Khaled El Emam - Thursday, September 26 @ 4:00pm, Salon F
      • Office hours in the Sponsor Pavilion:
        - Nathalie Holmes - Thursday, September 26 @ 3:10pm, Table D
        - Khaled El Emam - Thursday, September 26 @ 6:30pm, Table D
  48. Contact
      Nathalie Holmes: nholmes@privacyanalytics.ca, 613.369.4313 ext 122
      Khaled El Emam: kelemam@ehealthinformation.ca, 613.738.4181
      @PrivacyAnalytic
      2012 Start-Up Showcase Winner
  49. Review Quiz
      • What does anonymization mean?
      • What is the difference between data masking and de-identification?
      • Why is it important to strive for balance between privacy and data utility?
      • How many levels of maturity (Practice Dimension) are there in the DMM?
      • Is it possible to be at Practice Dimension 1 (Ad hoc) and score well in the Implementation Dimension, for example by having a repeatable, defined and measurable process?
      • What are some advantages of having Standard Automation (software)?
      • What is the main difference between Practice Dimension 4 (Risk Based) and Practice Dimension 5 (Governance)?
