SlideShare a Scribd company logo
1 of 55
Data explosion
UNIT-2
INDEX
• Statistics and Lack of barriers in Collection
and Distribution of Person-specific
information
• Mathematical model for characterizing and
comparing real-world data sharing practices
and policies and for computing privacy and
risk measurements
• Demographics and Uniqueness
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hydron Collider (LHC) generates 15 PB
a year
640K ought to be
enough for anybody.
Big Data EveryWhere!
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
Background
• Large amount of person-specific data has been
collected in recent years
– Both by governments and by private entities
• Data and knowledge extracted by data mining
techniques represent a key asset to the society
– Analyzing trends and patterns.
– Formulating public policies
• Laws and regulations require that some
collected data must be made public
– For example, Census data
• Health-care datasets
– Clinical studies, hospital discharge databases …
• Genetic datasets
– $1000 genome, HapMap, deCode …
• Demographic datasets
– U.S. Census Bureau, sociology studies …
• Search logs, recommender systems, social
networks, blogs …
– AOL search data, social networks of blogging
sites, Netflix movie ratings, Amazon …
Public Data Conundrum
• First thought: anonymize the data
• How?
• Remove “personally identifying information”
(PII)
– Name, Social Security number, phone number,
email, address… what else?
– Anything that identifies the person directly
• Is this enough?
What About Privacy?
Re-identification by Linking
Name Zipcode Age Sex
Alice 47677 29 F
Bob 47983 65 M
Carol 47677 22 F
Dan 47532 23 M
Ellen 46789 43 F
Voter registration data
QID SA
Zipcode Age Sex Disease
47677 29 F Ovarian Cancer
47602 22 F Ovarian Cancer
47678 27 M Prostate Cancer
47905 43 M Flu
47909 52 F Heart Disease
47906 47 M Heart Disease
ID
Name
Alice
Betty
Charles
David
Emily
Fred
Microdata
Latanya Sweeney’s Attack (1997)
Massachusetts hospital discharge dataset
Public voter dataset
Quasi-Identifiers
• Key attributes
– Name, address, phone number - uniquely
identifying!
– Always removed before release
• Quasi-identifiers
– (5-digit ZIP code, birth date, gender) uniquely
identify 87% of the population in the U.S.
– Can be used for linking anonymized dataset with
other datasets
Classification of Attributes
• Sensitive attributes
– Medical records, salaries, etc.
– These attributes is what the researchers need, so they are
always released directly
Name DOB Gender Zipcode Disease
Andre 1/21/76 Male 53715 Heart Disease
Beth 4/13/86 Female 53715 Hepatitis
Carol 2/28/76 Male 53703 Brochitis
Dan 1/21/76 Male 53703 Broken Arm
Ellen 4/13/86 Female 53706 Flu
Eric 2/28/76 Female 53706 Hang Nail
Key Attribute Quasi-identifier Sensitive attribute
K-Anonymity: Intuition
• The information for each person contained in the
released table cannot be distinguished from at
least k-1 individuals whose information also
appears in the release
– Example: you try to identify a man in the released
table, but the only information you have is his birth
date and gender. There are k men in the table with the
same birth date and gender.
• Any quasi-identifier present in the released table
must appear in at least k records
K-Anonymity Protection Model
• Private table
• Released table: RT
• Attributes: A1, A2, …, An
• Quasi-identifier subset: Ai, …, Aj
Male Female
*
476**
47677 4767847602
2*
29 2722
ZIP code Age Sex
Generalization
• Goal of k-Anonymity
– Each record is indistinguishable from at least k-1
other records
– These k records form an equivalence class
• Generalization: replace quasi-identifiers with
less specific, but semantically consistent
values
Achieving k-Anonymity
• Generalization
– Replace specific quasi-identifiers with less specific
values until get k identical values
– Partition ordered-value domains into intervals
• Suppression
– When generalization causes too much information loss
• This is common with “outliers”
• Lots of algorithms in the literature
– Aim to produce “useful” anonymizations
… usually without any clear notion of utility
Generalization in Action
Example of a k-Anonymous Table
Name Birth Gender ZIP Race
Andre 1964 m 02135 White
Beth 1964 f 55410 Black
Carol 1964 f 90210 White
Dan 1967 m 02174 White
Ellen 1968 f 02237 White
Released table External data Source
By linking these 2 tables, you still don’t learn Andre’s problem
Example of Generalization (1)
QID SA
Zipcode Age Sex Disease
47677 29 F Ovarian Cancer
47602 22 F Ovarian Cancer
47678 27 M Prostate Cancer
47905 43 M Flu
47909 52 F Heart Disease
47906 47 M Heart Disease
QID SA
Zipcode Age Sex Disease
476**
476**
476**
2*
2*
2*
*
*
*
Ovarian Cancer
Ovarian Cancer
Prostate Cancer
4790*
4790*
4790*
[43,52]
[43,52]
[43,52]
*
*
*
Flu
Heart Disease
Heart Disease
Microdata Generalized table
Example of Generalization (2)
• Released table is 3-anonymous
• If the adversary knows Alice’s quasi-identifier
(47677, 29, F), he still does not know which of
the first 3 records corresponds to Alice’s record
!!
Curse of Dimensionality
• Generalization fundamentally relies
on spatial locality
– Each record must have k close neighbors
• Real-world datasets are very sparse
– Many attributes (dimensions)
– “Nearest neighbor” is very far
• Projection to low dimensions loses all info 
k-anonymized datasets are useless
[Aggarwal VLDB ‘05]
HIPAA Privacy Rule
“The identifiers that must be removed include direct identifiers, such as
name, street address, social security number, as well as other identifiers, such
as birth date, admission and discharge dates, and five-digit zip code. The safe
harbor requires removal of geographic subdivisions smaller than a State,
except for the initial three digits of a zip code if the geographic unit formed by
combining all zip codes with the same initial three digits contains more than
20,000 people. In addition, age, if less than 90, gender, ethnicity, and other
demographic information not listed may remain in the information. The safe
harbor is intended to provide covered entities with a simple, definitive
method that does not require much judgment by the covered entity to
determine if the information is adequately de-identified."
"Under the safe harbor method, covered entities must remove all of a list of
18 enumerated identifiers and have no actual knowledge that the information
remaining could be used, alone or in combination, to identify a subject of the
information."
Two (and a Half) Interpretations
• Membership disclosure: Attacker cannot tell
that a given person in the dataset
• Sensitive attribute disclosure: Attacker cannot
tell that a given person has a certain sensitive
attribute
• Identity disclosure: Attacker cannot tell which
record corresponds to a given person
Unsorted Matching Attack
• Problem: records appear in the same order in
the released table as in the original table
• Solution: randomize order before releasing
Complementary Release Attack
• Different releases of the same private table can be
linked together to compromise k-anonymity
Linking Independent Releases
Zipcode Age Disease
476** 2* Heart Disease
476** 2* Heart Disease
476** 2* Heart Disease
4790* ≥40 Flu
4790* ≥40 Heart Disease
4790* ≥40 Cancer
476** 3* Heart Disease
476** 3* Cancer
476** 3* Cancer
A 3-anonymous patient table
Bob
Zipcode Age
47678 27
Carl
Zipcode Age
47673 36
Homogeneity attack
Background knowledge attack
Attacks on k-Anonymity
• k-Anonymity does not provide privacy if
– Sensitive values in an equivalence class lack diversity
– The attacker has background knowledge
Caucas 787XX Flu
Caucas 787XX Shingles
Caucas 787XX Acne
Caucas 787XX Flu
Caucas 787XX Acne
Caucas 787XX Flu
Asian/AfrAm 78XXX Flu
Asian/AfrAm 78XXX Flu
Asian/AfrAm 78XXX Acne
Asian/AfrAm 78XXX Shingles
Asian/AfrAm 78XXX Acne
Asian/AfrAm 78XXX Flu
Sensitive attributes must be
“diverse” within each
quasi-identifier equivalence class
[Machanavajjhala et al. ICDE ‘06]
l-Diversity
Distinct l-Diversity
• Each equivalence class has at least l well-
represented sensitive values
• Doesn’t prevent probabilistic inference attacks
Disease
...
HIV
HIV
HIV
pneumonia
...
...
bronchitis
...
10 records
8 records have HIV
2 records have other values
Other Versions of l-Diversity
• Probabilistic l-diversity
– The frequency of the most frequent value in an
equivalence class is bounded by 1/l
• Entropy l-diversity
– The entropy of the distribution of sensitive values
in each equivalence class is at least log(l)
• Recursive (c,l)-diversity
– r1<c(rl+rl+1+…+rm) where ri is the frequency of the
ith most frequent value
– Intuition: the most frequent value does not appear
too frequently
… Cancer
… Cancer
… Cancer
… Flu
… Cancer
… Cancer
… Cancer
… Cancer
… Cancer
… Cancer
… Flu
… Flu
Original dataset
Q1 Flu
Q1 Cancer
Q1 Cancer
Q1 Cancer
Q1 Cancer
Q1 Cancer
Q2 Cancer
Q2 Cancer
Q2 Cancer
Q2 Cancer
Q2 Flu
Q2 Flu
Anonymization B
Q1 Flu
Q1 Flu
Q1 Cancer
Q1 Flu
Q1 Cancer
Q1 Cancer
Q2 Cancer
Q2 Cancer
Q2 Cancer
Q2 Cancer
Q2 Cancer
Q2 Cancer
Anonymization A
99% have cancer
50% cancer  quasi-identifier group is “diverse”
This leaks a ton of information
99% cancer  quasi-identifier group is not “diverse”
…yet anonymized database does not leak anything
Neither Necessary, Nor Sufficient
Limitations of l-Diversity
• Example: sensitive attribute is HIV+ (1%) or
HIV- (99%)
– Very different degrees of sensitivity!
• l-diversity is unnecessary
– 2-diversity is unnecessary for an equivalence class that
contains only HIV- records
• l-diversity is difficult to achieve
– Suppose there are 10000 records in total
– To have distinct 2-diversity, there can be at most
10000*1%=100 equivalence classes
Skewness Attack
• Example: sensitive attribute is HIV+ (1%) or
HIV- (99%)
• Consider an equivalence class that contains an
equal number of HIV+ and HIV- records
– Diverse, but potentially violates privacy!
• l-diversity does not differentiate:
– Equivalence class 1: 49 HIV+ and 1 HIV-
– Equivalence class 2: 1 HIV+ and 49 HIV-l-diversity does not consider overall distribution of sensitive values!
Bob
Zip Age
47678 27
Zipcode Age Salary Disease
476** 2* 20K Gastric Ulcer
476** 2* 30K Gastritis
476** 2* 40K Stomach Cancer
4790* ≥40 50K Gastritis
4790* ≥40 100K Flu
4790* ≥40 70K Bronchitis
476** 3* 60K Bronchitis
476** 3* 80K Pneumonia
476** 3* 90K Stomach Cancer
A 3-diverse patient table
Conclusion
1. Bob’s salary is in [20k,40k],
which is relatively low
2. Bob has some stomach-related
disease
l-diversity does not consider semantics of sensitive values!
Similarity attack
Sensitive Attribute Disclosure
Caucas 787XX Flu
Caucas 787XX Shingles
Caucas 787XX Acne
Caucas 787XX Flu
Caucas 787XX Acne
Caucas 787XX Flu
Asian/AfrAm 78XXX Flu
Asian/AfrAm 78XXX Flu
Asian/AfrAm 78XXX Acne
Asian/AfrAm 78XXX Shingles
Asian/AfrAm 78XXX Acne
Asian/AfrAm 78XXX Flu
[Li et al. ICDE ‘07]
Distribution of sensitive
attributes within each
quasi-identifier group should
be “close” to their distribution
in the entire original database
Trick question: Why publish
quasi-identifiers at all??
t-Closeness
Caucas 787XX HIV+ Flu
Asian/AfrAm 787XX HIV- Flu
Asian/AfrAm 787XX HIV+ Shingles
Caucas 787XX HIV- Acne
Caucas 787XX HIV- Shingles
Caucas 787XX HIV- Acne
This is k-anonymous,
l-diverse and t-close…
…so secure, right?
Anonymous, “t-Close” Dataset
Caucas 787XX HIV+ Flu
Asian/AfrAm 787XX HIV- Flu
Asian/AfrAm 787XX HIV+ Shingles
Caucas 787XX HIV- Acne
Caucas 787XX HIV- Shingles
Caucas 787XX HIV- Acne
Bob is Caucasian and
I heard he was
admitted to hospital
with flu…
This is against the rules!
“flu” is not a quasi-identifier
Yes… and this is yet another
problem with k-anonymity
What Does Attacker Know?
AOL Privacy Debacle
• In August 2006, AOL released anonymized search
query logs
– 657K users, 20M queries over 3 months (March-May)
• Opposing goals
– Analyze data for research purposes, provide better
services for users and advertisers
– Protect privacy of AOL users
• Government laws and regulations
• Search queries may reveal income, evaluations, intentions to
acquire goods and services, etc.
AOL User 4417749
• AOL query logs have the form
<AnonID, Query, QueryTime, ItemRank,
ClickURL>
– ClickURL is the truncated URL
• NY Times re-identified AnonID 4417749
– Sample queries: “numb fingers”, “60 single men”, “dog
that urinates on everything”, “landscapers in Lilburn,
GA”, several people with the last name Arnold
• Lilburn area has only 14 citizens with the last name Arnold
– NYT contacts the 14 citizens, finds out AOL User
4417749 is 62-year-old Thelma Arnold
k-Anonymity Considered Harmful
• Syntactic
– Focuses on data transformation, not on what can
be learned from the anonymized dataset
– “k-anonymous” dataset can leak sensitive
information
• “Quasi-identifier” fallacy
– Assumes a priori that attacker will not
know certain information about his target
• Relies on locality
– Destroys utility of many real-world datasets
Demographics and uniqueness
Demography—from the Greek demo, for people, and graphy, for writing,
or analysis, or field of study, is “the study of changes (such as the
number of births, deaths, marriages, and illnesses) that occur over a
period of time in human populations”, or just “science of population”
What is demography?
This is demography
This is demography
This is demography
Uniqueness Examples
Tea Party vs. Occupy Wall Street on twitterUSA (2011)
Dynamic Population Mapping Using Mobile Phone Data
France and Portugal (2014)
Senegal (2015) Predicting Population Density from Cell-Phone
Activity
Post-Earthquake Population MovementNepal (2015)
London (2013-14) Predicting Crime Hotspots from Cell Phone Data
• Binary classification task using Random
Forest
Rwanda (2010) Risk Sharing in Natural Disasters through “Mobile Money”
Some Other Scenarios
Clinical Trial Deviations
Originally Viagra was developed to lower blood pressure and treat Angina
Now its used to help newborn pulmonary hypertension and altitude sickness
Incidence Prediction
Missed 4 or more visits, twice as likely to have an asthmatic incident
Particular Cardiac monitor sine wave points to highly likelihood of heart attack
Campaigns
Social media and advertising campaigns to understand user behavior and sentiment
Patient Satisfaction
Social media and advertising campaigns to understand user behavior and sentiment
Auditing is critical component HIPAA in ensuring patient privacy
1 Billion rows+ of audit data
146 mission critical clinical applications
Comprehensive audits yield 300-500k transactions/day
HIPAA requires audit system with 20 years of data
Auditing Project
Available to community as part of Compliance SDK
Updating for SQL Server 2012, HDInsight, Power View, and MobileBI*
References
• Li, Li, Venkatasubramanian. “t-Closeness:
Privacy Beyond k-Anonymity and l-Diversity”
(ICDE 2007).
• http://www.merriam-
webster.com/dictionary/demography
• p://www.demogr.mpg.de/En/educationcareer
/what_is_demography_1908/default.htm

More Related Content

What's hot

Symmetric & Asymmetric Cryptography
Symmetric & Asymmetric CryptographySymmetric & Asymmetric Cryptography
Symmetric & Asymmetric Cryptographychauhankapil
 
Security Attacks.ppt
Security Attacks.pptSecurity Attacks.ppt
Security Attacks.pptZaheer720515
 
Symmetric encryption and message confidentiality
Symmetric encryption and message confidentialitySymmetric encryption and message confidentiality
Symmetric encryption and message confidentialityCAS
 
User authentication
User authenticationUser authentication
User authenticationCAS
 
Data Encryption Standard (DES)
Data Encryption Standard (DES)Data Encryption Standard (DES)
Data Encryption Standard (DES)Haris Ahmed
 
Application layer security protocol
Application layer security protocolApplication layer security protocol
Application layer security protocolKirti Ahirrao
 
Network security cryptographic hash function
Network security  cryptographic hash functionNetwork security  cryptographic hash function
Network security cryptographic hash functionMijanur Rahman Milon
 
Cryptography and network security Nit701
Cryptography and network security Nit701Cryptography and network security Nit701
Cryptography and network security Nit701Amit Pathak
 

What's hot (20)

8 queen problem
8 queen problem8 queen problem
8 queen problem
 
Symmetric & Asymmetric Cryptography
Symmetric & Asymmetric CryptographySymmetric & Asymmetric Cryptography
Symmetric & Asymmetric Cryptography
 
Intruders
IntrudersIntruders
Intruders
 
Diffie-hellman algorithm
Diffie-hellman algorithmDiffie-hellman algorithm
Diffie-hellman algorithm
 
Email security
Email securityEmail security
Email security
 
Elgamal &amp; schnorr digital signature scheme copy
Elgamal &amp; schnorr digital signature scheme   copyElgamal &amp; schnorr digital signature scheme   copy
Elgamal &amp; schnorr digital signature scheme copy
 
Security Attacks.ppt
Security Attacks.pptSecurity Attacks.ppt
Security Attacks.ppt
 
Hash Function
Hash Function Hash Function
Hash Function
 
Authentication techniques
Authentication techniquesAuthentication techniques
Authentication techniques
 
Digital signature
Digital signatureDigital signature
Digital signature
 
System Security-Chapter 1
System Security-Chapter 1System Security-Chapter 1
System Security-Chapter 1
 
Symmetric encryption and message confidentiality
Symmetric encryption and message confidentialitySymmetric encryption and message confidentiality
Symmetric encryption and message confidentiality
 
User authentication
User authenticationUser authentication
User authentication
 
Data Encryption Standard (DES)
Data Encryption Standard (DES)Data Encryption Standard (DES)
Data Encryption Standard (DES)
 
Application layer security protocol
Application layer security protocolApplication layer security protocol
Application layer security protocol
 
Network security cryptographic hash function
Network security  cryptographic hash functionNetwork security  cryptographic hash function
Network security cryptographic hash function
 
Cryptography and network security Nit701
Cryptography and network security Nit701Cryptography and network security Nit701
Cryptography and network security Nit701
 
Application security
Application securityApplication security
Application security
 
Encryption algorithms
Encryption algorithmsEncryption algorithms
Encryption algorithms
 
Cybersecurity 2 cyber attacks
Cybersecurity 2 cyber attacksCybersecurity 2 cyber attacks
Cybersecurity 2 cyber attacks
 

Similar to Data Explosion and Privacy Challenges

张振杰:大数据时代的隐私保护的挑战和机遇
张振杰:大数据时代的隐私保护的挑战和机遇张振杰:大数据时代的隐私保护的挑战和机遇
张振杰:大数据时代的隐私保护的挑战和机遇hdhappy001
 
Distinct l diversity anonymization of set valued data
Distinct l diversity anonymization of set valued dataDistinct l diversity anonymization of set valued data
Distinct l diversity anonymization of set valued dataRohan Khude
 
Differential privacy and applications to location privacy
Differential privacy and applications to location privacyDifferential privacy and applications to location privacy
Differential privacy and applications to location privacyPôle Systematic Paris-Region
 
Amnesia: Data anonymization made easy (8th OpenAIRE workshop)
Amnesia: Data anonymization made easy (8th OpenAIRE workshop)Amnesia: Data anonymization made easy (8th OpenAIRE workshop)
Amnesia: Data anonymization made easy (8th OpenAIRE workshop)OpenAIRE
 
Big Data for a Better World
Big Data for a Better WorldBig Data for a Better World
Big Data for a Better Worldleadinghands
 
Cadth 2015 d7 regier recruitment
Cadth 2015 d7 regier recruitmentCadth 2015 d7 regier recruitment
Cadth 2015 d7 regier recruitmentCADTH Symposium
 
Making sense of big data
Making sense of big dataMaking sense of big data
Making sense of big databis_foresight
 
12.4.15 linkages data for decision making
12.4.15 linkages data for decision making12.4.15 linkages data for decision making
12.4.15 linkages data for decision makingLINKAGES
 
NAPHSIS Keynote: Vital Records - Vital Input for Population Health Measurement
NAPHSIS Keynote: Vital Records - Vital Input for Population Health MeasurementNAPHSIS Keynote: Vital Records - Vital Input for Population Health Measurement
NAPHSIS Keynote: Vital Records - Vital Input for Population Health MeasurementPeter Speyer
 
Managing Confidential Information – Trends and Approaches
Managing Confidential Information – Trends and ApproachesManaging Confidential Information – Trends and Approaches
Managing Confidential Information – Trends and ApproachesMicah Altman
 
Finding and Using Secondary Data and Resources for Research
Finding and Using Secondary Data  and Resources for ResearchFinding and Using Secondary Data  and Resources for Research
Finding and Using Secondary Data and Resources for ResearchDr. Karen Whiteman
 
Conducting equity-focused community assessments
Conducting equity-focused community assessmentsConducting equity-focused community assessments
Conducting equity-focused community assessmentsChristina Holt
 
Annual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZ
Annual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZAnnual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZ
Annual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZjamieritchey
 
Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...
Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...
Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...Felini
 
Felini draft may 2010
Felini draft may 2010Felini draft may 2010
Felini draft may 2010Felini
 

Similar to Data Explosion and Privacy Challenges (20)

张振杰:大数据时代的隐私保护的挑战和机遇
张振杰:大数据时代的隐私保护的挑战和机遇张振杰:大数据时代的隐私保护的挑战和机遇
张振杰:大数据时代的隐私保护的挑战和机遇
 
Distinct l diversity anonymization of set valued data
Distinct l diversity anonymization of set valued dataDistinct l diversity anonymization of set valued data
Distinct l diversity anonymization of set valued data
 
Bias in AI
Bias in AIBias in AI
Bias in AI
 
Differential privacy and applications to location privacy
Differential privacy and applications to location privacyDifferential privacy and applications to location privacy
Differential privacy and applications to location privacy
 
Amnesia: Data anonymization made easy (8th OpenAIRE workshop)
Amnesia: Data anonymization made easy (8th OpenAIRE workshop)Amnesia: Data anonymization made easy (8th OpenAIRE workshop)
Amnesia: Data anonymization made easy (8th OpenAIRE workshop)
 
Big Data for a Better World
Big Data for a Better WorldBig Data for a Better World
Big Data for a Better World
 
Data Journalism 101 - Day 1 by Michael J. Berens
Data Journalism 101 - Day 1 by Michael J. BerensData Journalism 101 - Day 1 by Michael J. Berens
Data Journalism 101 - Day 1 by Michael J. Berens
 
Cadth 2015 d7 regier recruitment
Cadth 2015 d7 regier recruitmentCadth 2015 d7 regier recruitment
Cadth 2015 d7 regier recruitment
 
Making sense of big data
Making sense of big dataMaking sense of big data
Making sense of big data
 
Managing and Analyzing Global Health Data
Managing and Analyzing Global Health DataManaging and Analyzing Global Health Data
Managing and Analyzing Global Health Data
 
12.4.15 linkages data for decision making
12.4.15 linkages data for decision making12.4.15 linkages data for decision making
12.4.15 linkages data for decision making
 
NAPHSIS Keynote: Vital Records - Vital Input for Population Health Measurement
NAPHSIS Keynote: Vital Records - Vital Input for Population Health MeasurementNAPHSIS Keynote: Vital Records - Vital Input for Population Health Measurement
NAPHSIS Keynote: Vital Records - Vital Input for Population Health Measurement
 
statistics.ppt
statistics.pptstatistics.ppt
statistics.ppt
 
Managing Confidential Information – Trends and Approaches
Managing Confidential Information – Trends and ApproachesManaging Confidential Information – Trends and Approaches
Managing Confidential Information – Trends and Approaches
 
Finding and Using Secondary Data and Resources for Research
Finding and Using Secondary Data  and Resources for ResearchFinding and Using Secondary Data  and Resources for Research
Finding and Using Secondary Data and Resources for Research
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Conducting equity-focused community assessments
Conducting equity-focused community assessmentsConducting equity-focused community assessments
Conducting equity-focused community assessments
 
Annual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZ
Annual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZAnnual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZ
Annual Arizona Conference for Tribal BCCEDP Collaboration, Flagstaff, AZ
 
Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...
Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...
Dallas Police Department's Prostitute Diversion Initiative_Felini & Gonzales ...
 
Felini draft may 2010
Felini draft may 2010Felini draft may 2010
Felini draft may 2010
 

More from G Prachi

The trusted computing architecture
The trusted computing architectureThe trusted computing architecture
The trusted computing architectureG Prachi
 
Security risk management
Security risk managementSecurity risk management
Security risk managementG Prachi
 
Mobile platform security models
Mobile platform security modelsMobile platform security models
Mobile platform security modelsG Prachi
 
Malicious software and software security
Malicious software and software  securityMalicious software and software  security
Malicious software and software securityG Prachi
 
Network defenses
Network defensesNetwork defenses
Network defensesG Prachi
 
Network protocols and vulnerabilities
Network protocols and vulnerabilitiesNetwork protocols and vulnerabilities
Network protocols and vulnerabilitiesG Prachi
 
Web application security part 02
Web application security part 02Web application security part 02
Web application security part 02G Prachi
 
Web application security part 01
Web application security part 01Web application security part 01
Web application security part 01G Prachi
 
Basic web security model
Basic web security modelBasic web security model
Basic web security modelG Prachi
 
Least privilege, access control, operating system security
Least privilege, access control, operating system securityLeast privilege, access control, operating system security
Least privilege, access control, operating system securityG Prachi
 
Dealing with legacy code
Dealing with legacy codeDealing with legacy code
Dealing with legacy codeG Prachi
 
Exploitation techniques and fuzzing
Exploitation techniques and fuzzingExploitation techniques and fuzzing
Exploitation techniques and fuzzingG Prachi
 
Control hijacking
Control hijackingControl hijacking
Control hijackingG Prachi
 
Computer security concepts
Computer security conceptsComputer security concepts
Computer security conceptsG Prachi
 
Administering security
Administering securityAdministering security
Administering securityG Prachi
 
Database security and security in networks
Database security and security in networksDatabase security and security in networks
Database security and security in networksG Prachi
 
Protection in general purpose operating system
Protection in general purpose operating systemProtection in general purpose operating system
Protection in general purpose operating systemG Prachi
 
Program security
Program securityProgram security
Program securityG Prachi
 
Elementary cryptography
Elementary cryptographyElementary cryptography
Elementary cryptographyG Prachi
 
Information security introduction
Information security introductionInformation security introduction
Information security introductionG Prachi
 

More from G Prachi (20)

The trusted computing architecture
The trusted computing architectureThe trusted computing architecture
The trusted computing architecture
 
Security risk management
Security risk managementSecurity risk management
Security risk management
 
Mobile platform security models
Mobile platform security modelsMobile platform security models
Mobile platform security models
 
Malicious software and software security
Malicious software and software  securityMalicious software and software  security
Malicious software and software security
 
Network defenses
Network defensesNetwork defenses
Network defenses
 
Network protocols and vulnerabilities
Network protocols and vulnerabilitiesNetwork protocols and vulnerabilities
Network protocols and vulnerabilities
 
Web application security part 02
Web application security part 02Web application security part 02
Web application security part 02
 
Web application security part 01
Web application security part 01Web application security part 01
Web application security part 01
 
Basic web security model
Basic web security modelBasic web security model
Basic web security model
 
Least privilege, access control, operating system security
Least privilege, access control, operating system securityLeast privilege, access control, operating system security
Least privilege, access control, operating system security
 
Dealing with legacy code
Dealing with legacy codeDealing with legacy code
Dealing with legacy code
 
Exploitation techniques and fuzzing
Exploitation techniques and fuzzingExploitation techniques and fuzzing
Exploitation techniques and fuzzing
 
Control hijacking
Control hijackingControl hijacking
Control hijacking
 
Computer security concepts
Computer security conceptsComputer security concepts
Computer security concepts
 
Administering security
Administering securityAdministering security
Administering security
 
Database security and security in networks
Database security and security in networksDatabase security and security in networks
Database security and security in networks
 
Protection in general purpose operating system
Protection in general purpose operating systemProtection in general purpose operating system
Protection in general purpose operating system
 
Program security
Program securityProgram security
Program security
 
Elementary cryptography
Elementary cryptographyElementary cryptography
Elementary cryptography
 
Information security introduction
Information security introductionInformation security introduction
Information security introduction
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Data Explosion and Privacy Challenges

  • 2. INDEX • Statistics and Lack of barriers in Collection and Distribution of Person-specific information • Mathematical model for characterizing and comparing real-world data sharing practices and policies and for computing privacy and risk measurements • Demographics and Uniqueness
  • 3. How much data? • Google processes 20 PB a day (2008) • Wayback Machine has 3 PB + 100 TB/month (3/2009) • Facebook has 2.5 PB of user data + 15 TB/day (4/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • CERN’s Large Hydron Collider (LHC) generates 15 PB a year 640K ought to be enough for anybody.
  • 4. Big Data EveryWhere! • Lots of data is being collected and warehoused – Web data, e-commerce – purchases at department/ grocery stores – Bank/Credit Card transactions – Social Network
  • 5. Background • Large amount of person-specific data has been collected in recent years – Both by governments and by private entities • Data and knowledge extracted by data mining techniques represent a key asset to the society – Analyzing trends and patterns. – Formulating public policies • Laws and regulations require that some collected data must be made public – For example, Census data
  • 6. • Health-care datasets – Clinical studies, hospital discharge databases … • Genetic datasets – $1000 genome, HapMap, deCode … • Demographic datasets – U.S. Census Bureau, sociology studies … • Search logs, recommender systems, social networks, blogs … – AOL search data, social networks of blogging sites, Netflix movie ratings, Amazon … Public Data Conundrum
  • 7. • First thought: anonymize the data • How? • Remove “personally identifying information” (PII) – Name, Social Security number, phone number, email, address… what else? – Anything that identifies the person directly • Is this enough? What About Privacy?
  • 8. Re-identification by Linking Name Zipcode Age Sex Alice 47677 29 F Bob 47983 65 M Carol 47677 22 F Dan 47532 23 M Ellen 46789 43 F Voter registration data QID SA Zipcode Age Sex Disease 47677 29 F Ovarian Cancer 47602 22 F Ovarian Cancer 47678 27 M Prostate Cancer 47905 43 M Flu 47909 52 F Heart Disease 47906 47 M Heart Disease ID Name Alice Betty Charles David Emily Fred Microdata
  • 9. Latanya Sweeney’s Attack (1997) Massachusetts hospital discharge dataset Public voter dataset
  • 10. Quasi-Identifiers • Key attributes – Name, address, phone number - uniquely identifying! – Always removed before release • Quasi-identifiers – (5-digit ZIP code, birth date, gender) uniquely identify 87% of the population in the U.S. – Can be used for linking anonymized dataset with other datasets
  • 11. Classification of Attributes • Sensitive attributes – Medical records, salaries, etc. – These attributes is what the researchers need, so they are always released directly Name DOB Gender Zipcode Disease Andre 1/21/76 Male 53715 Heart Disease Beth 4/13/86 Female 53715 Hepatitis Carol 2/28/76 Male 53703 Brochitis Dan 1/21/76 Male 53703 Broken Arm Ellen 4/13/86 Female 53706 Flu Eric 2/28/76 Female 53706 Hang Nail Key Attribute Quasi-identifier Sensitive attribute
  • 12. K-Anonymity: Intuition • The information for each person contained in the released table cannot be distinguished from at least k-1 individuals whose information also appears in the release – Example: you try to identify a man in the released table, but the only information you have is his birth date and gender. There are k men in the table with the same birth date and gender. • Any quasi-identifier present in the released table must appear in at least k records
  • 13. K-Anonymity Protection Model • Private table • Released table: RT • Attributes: A1, A2, …, An • Quasi-identifier subset: Ai, …, Aj
  • 14. Male Female * 476** 47677 4767847602 2* 29 2722 ZIP code Age Sex Generalization • Goal of k-Anonymity – Each record is indistinguishable from at least k-1 other records – These k records form an equivalence class • Generalization: replace quasi-identifiers with less specific, but semantically consistent values
  • 15. Achieving k-Anonymity • Generalization – Replace specific quasi-identifiers with less specific values until get k identical values – Partition ordered-value domains into intervals • Suppression – When generalization causes too much information loss • This is common with “outliers” • Lots of algorithms in the literature – Aim to produce “useful” anonymizations … usually without any clear notion of utility
  • 17. Example of a k-Anonymous Table
  • 18. Name Birth Gender ZIP Race Andre 1964 m 02135 White Beth 1964 f 55410 Black Carol 1964 f 90210 White Dan 1967 m 02174 White Ellen 1968 f 02237 White Released table External data Source By linking these 2 tables, you still don’t learn Andre’s problem Example of Generalization (1)
  • 19. QID SA Zipcode Age Sex Disease 47677 29 F Ovarian Cancer 47602 22 F Ovarian Cancer 47678 27 M Prostate Cancer 47905 43 M Flu 47909 52 F Heart Disease 47906 47 M Heart Disease QID SA Zipcode Age Sex Disease 476** 476** 476** 2* 2* 2* * * * Ovarian Cancer Ovarian Cancer Prostate Cancer 4790* 4790* 4790* [43,52] [43,52] [43,52] * * * Flu Heart Disease Heart Disease Microdata Generalized table Example of Generalization (2) • Released table is 3-anonymous • If the adversary knows Alice’s quasi-identifier (47677, 29, F), he still does not know which of the first 3 records corresponds to Alice’s record !!
  • 20. Curse of Dimensionality • Generalization fundamentally relies on spatial locality – Each record must have k close neighbors • Real-world datasets are very sparse – Many attributes (dimensions) – “Nearest neighbor” is very far • Projection to low dimensions loses all info  k-anonymized datasets are useless [Aggarwal VLDB ‘05]
  • 21. HIPAA Privacy Rule “The identifiers that must be removed include direct identifiers, such as name, street address, social security number, as well as other identifiers, such as birth date, admission and discharge dates, and five-digit zip code. The safe harbor requires removal of geographic subdivisions smaller than a State, except for the initial three digits of a zip code if the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people. In addition, age, if less than 90, gender, ethnicity, and other demographic information not listed may remain in the information. The safe harbor is intended to provide covered entities with a simple, definitive method that does not require much judgment by the covered entity to determine if the information is adequately de-identified." "Under the safe harbor method, covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information."
  • 22. Two (and a Half) Interpretations • Membership disclosure: Attacker cannot tell that a given person in the dataset • Sensitive attribute disclosure: Attacker cannot tell that a given person has a certain sensitive attribute • Identity disclosure: Attacker cannot tell which record corresponds to a given person
  • 23. Unsorted Matching Attack • Problem: records appear in the same order in the released table as in the original table • Solution: randomize order before releasing
  • 24. Complementary Release Attack • Different releases of the same private table can be linked together to compromise k-anonymity
  • 26. Zipcode Age Disease 476** 2* Heart Disease 476** 2* Heart Disease 476** 2* Heart Disease 4790* ≥40 Flu 4790* ≥40 Heart Disease 4790* ≥40 Cancer 476** 3* Heart Disease 476** 3* Cancer 476** 3* Cancer A 3-anonymous patient table Bob Zipcode Age 47678 27 Carl Zipcode Age 47673 36 Homogeneity attack Background knowledge attack Attacks on k-Anonymity • k-Anonymity does not provide privacy if – Sensitive values in an equivalence class lack diversity – The attacker has background knowledge
  • 27. Caucas 787XX Flu Caucas 787XX Shingles Caucas 787XX Acne Caucas 787XX Flu Caucas 787XX Acne Caucas 787XX Flu Asian/AfrAm 78XXX Flu Asian/AfrAm 78XXX Flu Asian/AfrAm 78XXX Acne Asian/AfrAm 78XXX Shingles Asian/AfrAm 78XXX Acne Asian/AfrAm 78XXX Flu Sensitive attributes must be “diverse” within each quasi-identifier equivalence class [Machanavajjhala et al. ICDE ‘06] l-Diversity
  • 28. Distinct l-Diversity • Each equivalence class has at least l well- represented sensitive values • Doesn’t prevent probabilistic inference attacks Disease ... HIV HIV HIV pneumonia ... ... bronchitis ... 10 records 8 records have HIV 2 records have other values
  • 29. Other Versions of l-Diversity • Probabilistic l-diversity – The frequency of the most frequent value in an equivalence class is bounded by 1/l • Entropy l-diversity – The entropy of the distribution of sensitive values in each equivalence class is at least log(l) • Recursive (c,l)-diversity – r1<c(rl+rl+1+…+rm) where ri is the frequency of the ith most frequent value – Intuition: the most frequent value does not appear too frequently
  • 30. … Cancer … Cancer … Cancer … Flu … Cancer … Cancer … Cancer … Cancer … Cancer … Cancer … Flu … Flu Original dataset Q1 Flu Q1 Cancer Q1 Cancer Q1 Cancer Q1 Cancer Q1 Cancer Q2 Cancer Q2 Cancer Q2 Cancer Q2 Cancer Q2 Flu Q2 Flu Anonymization B Q1 Flu Q1 Flu Q1 Cancer Q1 Flu Q1 Cancer Q1 Cancer Q2 Cancer Q2 Cancer Q2 Cancer Q2 Cancer Q2 Cancer Q2 Cancer Anonymization A 99% have cancer 50% cancer  quasi-identifier group is “diverse” This leaks a ton of information 99% cancer  quasi-identifier group is not “diverse” …yet anonymized database does not leak anything Neither Necessary, Nor Sufficient
  • 31. Limitations of l-Diversity • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) – Very different degrees of sensitivity! • l-diversity is unnecessary – 2-diversity is unnecessary for an equivalence class that contains only HIV- records • l-diversity is difficult to achieve – Suppose there are 10000 records in total – To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes
  • 32. Skewness Attack • Example: sensitive attribute is HIV+ (1%) or HIV- (99%) • Consider an equivalence class that contains an equal number of HIV+ and HIV- records – Diverse, but potentially violates privacy! • l-diversity does not differentiate: – Equivalence class 1: 49 HIV+ and 1 HIV- – Equivalence class 2: 1 HIV+ and 49 HIV-l-diversity does not consider overall distribution of sensitive values!
  • 33. Bob Zip Age 47678 27 Zipcode Age Salary Disease 476** 2* 20K Gastric Ulcer 476** 2* 30K Gastritis 476** 2* 40K Stomach Cancer 4790* ≥40 50K Gastritis 4790* ≥40 100K Flu 4790* ≥40 70K Bronchitis 476** 3* 60K Bronchitis 476** 3* 80K Pneumonia 476** 3* 90K Stomach Cancer A 3-diverse patient table Conclusion 1. Bob’s salary is in [20k,40k], which is relatively low 2. Bob has some stomach-related disease l-diversity does not consider semantics of sensitive values! Similarity attack Sensitive Attribute Disclosure
  • 34. Caucas 787XX Flu Caucas 787XX Shingles Caucas 787XX Acne Caucas 787XX Flu Caucas 787XX Acne Caucas 787XX Flu Asian/AfrAm 78XXX Flu Asian/AfrAm 78XXX Flu Asian/AfrAm 78XXX Acne Asian/AfrAm 78XXX Shingles Asian/AfrAm 78XXX Acne Asian/AfrAm 78XXX Flu [Li et al. ICDE ‘07] Distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database Trick question: Why publish quasi-identifiers at all?? t-Closeness
  • 35. Caucas 787XX HIV+ Flu Asian/AfrAm 787XX HIV- Flu Asian/AfrAm 787XX HIV+ Shingles Caucas 787XX HIV- Acne Caucas 787XX HIV- Shingles Caucas 787XX HIV- Acne This is k-anonymous, l-diverse and t-close… …so secure, right? Anonymous, “t-Close” Dataset
  • 36. Caucas 787XX HIV+ Flu Asian/AfrAm 787XX HIV- Flu Asian/AfrAm 787XX HIV+ Shingles Caucas 787XX HIV- Acne Caucas 787XX HIV- Shingles Caucas 787XX HIV- Acne Bob is Caucasian and I heard he was admitted to hospital with flu… This is against the rules! “flu” is not a quasi-identifier Yes… and this is yet another problem with k-anonymity What Does Attacker Know?
  • 37. AOL Privacy Debacle • In August 2006, AOL released anonymized search query logs – 657K users, 20M queries over 3 months (March-May) • Opposing goals – Analyze data for research purposes, provide better services for users and advertisers – Protect privacy of AOL users • Government laws and regulations • Search queries may reveal income, evaluations, intentions to acquire goods and services, etc.
  • 38. AOL User 4417749 • AOL query logs have the form <AnonID, Query, QueryTime, ItemRank, ClickURL> – ClickURL is the truncated URL • NY Times re-identified AnonID 4417749 – Sample queries: “numb fingers”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, GA”, several people with the last name Arnold • Lilburn area has only 14 citizens with the last name Arnold – NYT contacts the 14 citizens, finds out AOL User 4417749 is 62-year-old Thelma Arnold
  • 39. k-Anonymity Considered Harmful • Syntactic – Focuses on data transformation, not on what can be learned from the anonymized dataset – “k-anonymous” dataset can leak sensitive information • “Quasi-identifier” fallacy – Assumes a priori that attacker will not know certain information about his target • Relies on locality – Destroys utility of many real-world datasets
  • 41. Demography—from the Greek demo, for people, and graphy, for writing, or analysis, or field of study, is “the study of changes (such as the number of births, deaths, marriages, and illnesses) that occur over a period of time in human populations”, or just “science of population” What is demography?
  • 45.
  • 46.
  • 47. Uniqueness Examples Tea Party vs. Occupy Wall Street on twitterUSA (2011)
  • 48. Dynamic Population Mapping Using Mobile Phone Data France and Portugal (2014)
  • 49. Senegal (2015) Predicting Population Density from Cell-Phone Activity
  • 51. London (2013-14) Predicting Crime Hotspots from Cell Phone Data • Binary classification task using Random Forest
  • 52. Rwanda (2010) Risk Sharing in Natural Disasters through “Mobile Money”
  • 53. Some Other Scenarios Clinical Trial Deviations Originally Viagra was developed to lower blood pressure and treat Angina Now its used to help newborn pulmonary hypertension and altitude sickness Incidence Prediction Missed 4 or more visits, twice as likely to have an asthmatic incident Particular Cardiac monitor sine wave points to highly likelihood of heart attack Campaigns Social media and advertising campaigns to understand user behavior and sentiment Patient Satisfaction Social media and advertising campaigns to understand user behavior and sentiment
  • 54. Auditing is critical component HIPAA in ensuring patient privacy 1 Billion rows+ of audit data 146 mission critical clinical applications Comprehensive audits yield 300-500k transactions/day HIPAA requires audit system with 20 years of data Auditing Project Available to community as part of Compliance SDK Updating for SQL Server 2012, HDInsight, Power View, and MobileBI*
  • 55. References • Li, Li, Venkatasubramanian. “t-Closeness: Privacy Beyond k-Anonymity and l-Diversity” (ICDE 2007). • http://www.merriam- webster.com/dictionary/demography • p://www.demogr.mpg.de/En/educationcareer /what_is_demography_1908/default.htm