Facilitating Analytics while Protecting Privacy

Facilitating Analytics while Protecting Individual Privacy
Using Data De-identification
Khaled El Emam

Talk Outline
Present two case studies where we conducted an analysis of
the privacy implications associated with sharing health data.
- Overview of methodology and risk measurement basics
- State of Louisiana Department of Health and Hospitals and
  Cajun Code Fest 2013
- Mount Sinai School of Medicine Department of Preventative
  Medicine – World Trade Center Disaster Registry

Data Anonymization Resources
Book Signing:
September 26, 2013 at 10:35am
Khaled El Emam
Luk Arbuckle

Direct and Indirect/Quasi-Identifiers
Examples of direct identifiers: Name, address, telephone
number, fax number, MRN, health card number, health plan
beneficiary number, license plate number, email address,
photograph, biometrics, SSN, SIN, implanted device number
Examples of quasi-identifiers: sex, date of birth or age,
geographic locations (such as postal codes, census
geography, information about proximity to known or unique
landmarks), language spoken at home, ethnic origin, total
years of schooling, marital status, criminal history, total income,
visible minority status, profession, event dates

Safe Harbor
Safe Harbor Direct Identifiers and Quasi-identifiers
1. Names
2. ZIP Codes (except first three)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
Actual Knowledge

Statistical Method
A person with appropriate knowledge of and experience with
generally accepted statistical and scientific principles and methods for
rendering information not individually identifiable:
I. Applying such principles and methods, determines that the risk is
“very small” that the information could be used, alone or in
combination with other reasonably available information, by an
anticipated recipient to identify an individual who is a subject of
the information, and
II. Documents the methods and results of the analysis that justify
such determination

Equivalence Classes - I
An equivalence class is the set of records in a table that has the
same values for all quasi-identifiers.

Equivalence Classes - II
Gender  Year of Birth (10 years)  DIN
Male    1970-1979                 2046059
Male    1980-1989                 716839
Male    1970-1979                 2241497
Female  1990-1999                 2046059
Female  1980-1989                 392537
Male    1990-1999                 363766
Male    1990-1999                 544981
Female  1980-1989                 293512
Male    1970-1979                 544981
Female  1990-1999                 596612
Male    1980-1989                 725765
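
The grouping in this table can be reproduced with a short Python sketch. It assumes that gender and decade of birth are the quasi-identifiers and that the DIN column is not one; the slides do not state this explicitly, so treat it as an assumption. The function name equivalence_class_sizes is made up for illustration.

```python
from collections import Counter

# Records from the example table: (gender, decade of birth, DIN).
records = [
    ("Male", "1970-1979", 2046059),
    ("Male", "1980-1989", 716839),
    ("Male", "1970-1979", 2241497),
    ("Female", "1990-1999", 2046059),
    ("Female", "1980-1989", 392537),
    ("Male", "1990-1999", 363766),
    ("Male", "1990-1999", 544981),
    ("Female", "1980-1989", 293512),
    ("Male", "1970-1979", 544981),
    ("Female", "1990-1999", 596612),
    ("Male", "1980-1989", 725765),
]


def equivalence_class_sizes(rows):
    """Count records per combination of quasi-identifier values.

    Assumption: only gender and decade of birth are quasi-identifiers,
    so the DIN is ignored when forming the classes.
    """
    return Counter((gender, decade) for gender, decade, _din in rows)


class_sizes = equivalence_class_sizes(records)
for key, size in sorted(class_sizes.items()):
    print(key, size)
```

Each key printed is one equivalence class, and the count next to it is the class size used in the risk calculations.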

Maximum Risk
In the example data set we had 5 equivalence classes
The largest equivalence class had a size of 3, and the smallest
equivalence class had a size of 2
The probability of correctly re-identifying a record is 1 divided
by the size of the equivalence class
The maximum probability in this table comes from the smallest equivalence
classes (size 2): 1/2 = 0.5, or 50%

Average Risk
There were:
- Four equivalence classes of size 2
- One equivalence class of size 3
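The average risk is the mean of the per-record risks: 8 records sit in
classes of size 2 (risk 1/2 each) and 3 records in the class of size 3
(risk 1/3 each)
[(8 x 1/2) + (3 x 1/3)]/11
= (4 + 1)/11
= 5/11
This gives us an average risk of 5/11, or about 45%
This turns out to be the number of equivalence classes (5) divided by the
number of records (11)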
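
As a check on the arithmetic, here is a minimal Python sketch of both risk measures, working only from the equivalence class sizes in the example (four classes of size 2, one of size 3). The function names are made up for illustration, and exact fractions are used to avoid rounding 1/3.

```python
from fractions import Fraction

# Equivalence class sizes from the example table.
class_sizes = [3, 2, 2, 2, 2]


def maximum_risk(sizes):
    """Highest per-record risk: 1 divided by the smallest class size."""
    return Fraction(1, min(sizes))


def average_risk(sizes):
    """Mean per-record risk: each of the k records in a class has risk 1/k."""
    n_records = sum(sizes)
    total_risk = sum(k * Fraction(1, k) for k in sizes)
    return total_risk / n_records


max_risk = maximum_risk(class_sizes)
avg_risk = average_risk(class_sizes)
print(max_risk, avg_risk)  # prints: 1/2 5/11
```

Because each class contributes k x (1/k) = 1 to the sum, the average always reduces to the number of classes divided by the number of records.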

Case Study: State of Louisiana – Cajun Code Fest

State of Louisiana
Demonstrate how the State of Louisiana used a novel approach
to improve the health of its citizens by working with the Center
for Business & Information Technologies (CBIT) at the
University of Louisiana at Lafayette to provide data for Cajun Code Fest
Discuss how, given realistic looking and behaving de-identified
Medicaid claims and immunization data, competitors
were able to build applications to help Louisiana’s “Own your
Own Health” initiative – an initiative that encourages patients to
make knowledgeable and informed decisions about their
healthcare

Cajun Code Fest 2.0
April 24-26, 2013
27 hours of coding put on by the Center for Business & Information Technologies at
the University of Louisiana at Lafayette
Teams converged to analyze the de-identified data set and
create new healthcare solutions that allow patients to become engaged in their
own health

Why De-identified Data?
The core data that served as the basis for Cajun Code Fest
had to be de-identified before it could be released to the
entrants in the challenge.
It would not have been possible to have the coding challenge
without properly de-identified data.

Data by the Numbers
200,000 unique individuals
6,683,337 Medicaid claims
6,410,969 Medicaid prescriptions
4,085,977 Immunization records
29,951 Providers

Date Shifting – Simple Noise

Date Shifting – Randomized Generalization I

Date Shifting – Randomized Generalization II

Geoproxy Attacks
Patients tend to visit providers and obtain prescriptions from
pharmacies that are close to where they live
Can we use the provider and pharmacy location information to
predict where the patient lives?
This is called a geoproxy attack
We can measure the probability of a correct geoproxy attack
and incorporate that into our overall risk measurement
framework
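
One way to picture this measurement is the toy Python sketch below. Everything in it is an assumption made for illustration: the patient records are invented, regions are reduced to single letters, and the adversary's strategy (guess the region visited most often) is only one plausible choice, not necessarily the approach used in the Louisiana analysis.

```python
from collections import Counter

# Hypothetical toy data: each patient's true home region and the regions of
# the providers and pharmacies that appear in their claims.
patients = {
    "p1": {"home": "A", "visits": ["A", "A", "B"]},
    "p2": {"home": "B", "visits": ["B", "C", "B"]},
    "p3": {"home": "C", "visits": ["A", "A", "C"]},
    "p4": {"home": "A", "visits": ["A"]},
}


def guess_home(visits):
    """Assumed adversary strategy: guess the most frequently visited region."""
    return Counter(visits).most_common(1)[0][0]


def geoproxy_success_rate(data):
    """Fraction of patients whose home region the guess gets right."""
    hits = sum(1 for p in data.values() if guess_home(p["visits"]) == p["home"])
    return hits / len(data)


rate = geoproxy_success_rate(patients)
print(rate)
```

The resulting rate is an estimate of the probability that the attack succeeds, which could then be combined with the other risk measurements.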

Case Study: Mount Sinai School of Medicine
World Trade Center Disaster Registry

Background
Over 50,000 people are estimated to have helped with the rescue and
recovery efforts after 9/11, and over 27,000 of those are captured in the WTC
disaster registry created by the Clinical Center of Excellence at Mount Sinai.
Mount Sinai did extensive publicity and outreach, working with a variety of
organizations, to recruit 9/11 workers and volunteers. Those who participated
have gone through comprehensive examinations including:
- Medical questionnaires
- Mental-health questionnaires
- Exposure-assessment questionnaires
- Standardised physical examinations
- Optional follow-up assessments every 12 to 18 months

Series of Events
The visit date was used for questions that were specific to the
date on which the visit occurred (e.g., “do you currently
smoke?” would create an event for smoking at the time of the
visit).
Some questions included dates that could be used directly
with the quasi-identifier, and were more informative than the
visit date (e.g., the answer to “when were you diagnosed with
this disease?” was used to provide a date for the disease
event).

Multiple Levels
Sometimes it is reasonable to assume that the adversary will
not have a lot of details about an event
For example, the adversary may know that an event has
occurred but not know the exact date on which it occurred
In such a case we generalize the data to match the adversary’s
background knowledge when measuring risk, but we release more detailed data
This makes sense given the assumption – the more detailed
information that is released does not give the adversary
additional useful information

Time of Events
Ten years after the fact, however, it seems unlikely that an adversary
will know the dates of a patient’s events before 9/11. Often patients
gave different years of diagnosis on follow-up visits because they
themselves didn’t remember what medical conditions they had! So
instead of the date of event, we used “pre-9/11” as a value.
We made a distinction between childhood (under 18) and adulthood
(18 and over) diagnoses, since these seemed like something you could
reasonably know.
These generalizations were done only for measuring risk, and weren’t
applied to the de-identified registry data.
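
A rough Python sketch of this kind of generalization follows. It rests on several assumptions not stated in the slides: that a birth date is available for computing age at the event, that the childhood/adulthood distinction is attached to the “pre-9/11” value as a label, and that events on or after 9/11 keep their date. The function names are hypothetical.

```python
from datetime import date

SEPT_11 = date(2001, 9, 11)


def age_on(birth_date, when):
    """Age in completed years on a given date."""
    years = when.year - birth_date.year
    if (when.month, when.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def generalize_event_date(event_date, birth_date):
    """Value used for risk measurement only; released data keeps the detail."""
    if event_date >= SEPT_11:
        # Assumption: events on or after 9/11 are left at full date precision.
        return event_date.isoformat()
    stage = "childhood" if age_on(birth_date, event_date) < 18 else "adulthood"
    return f"pre-9/11 ({stage})"


print(generalize_event_date(date(1985, 3, 1), date(1970, 6, 15)))
print(generalize_event_date(date(2003, 1, 5), date(1970, 6, 15)))
```

The generalized value would stand in for the event date only while measuring risk, in line with the note that these generalizations were not applied to the released registry data.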

Covering Designs
What are the quasi-identifiers when the series of events is
long?
Will an adversary know all of the details in that sequence?
It is reasonable to assume that an adversary will only know p
events – this is the power of the adversary
But which p out of m events does the adversary know?
If we look at all combinations of p from m we may end up with
quite a large number of combinations of quasi-identifiers for which to
measure the risk
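
To get a feel for how quickly the number of combinations grows, here is a small Python sketch. The values of m and p and the event labels are arbitrary illustrations, not figures from the registry, and the sketch only counts combinations; it does not construct a covering design.

```python
from itertools import combinations
from math import comb


def count_event_subsets(m, p):
    """Number of ways an adversary could know p out of m events."""
    return comb(m, p)


# Hypothetical patient with m = 4 events and an adversary of power p = 2.
events = ["event_1", "event_2", "event_3", "event_4"]
subsets = list(combinations(events, 2))
print(len(subsets), subsets[:3])

# The count grows quickly as the series of events gets longer.
for m in (10, 20, 50):
    print(m, count_event_subsets(m, 5))
```

Each subset is a candidate set of quasi-identifiers, which is why evaluating all of them becomes impractical for long event sequences.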

Contact
Khaled El Emam:
kelemam@privacyanalytics.ca
613.369.4313 ext 111
@PrivacyAnalytic
