Data anonymization is the process applied to data to prevent the identification of individuals, making it possible to share and analyze data securely.
Disclaimer: The material shared here is personal research and does not represent the policies of any organization.
Prof. Khaled El Emam worked on anonymizing the Heritage Health Prize data.
Are we safe?
Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to us.
The safari rules
1. If you are the lion, you just need to be faster than the slowest zebra.
2. If you are the zebra, you need to be able to escape from all the lions.
3. If you are the safari visitors, [...] to the lions & zebras without getting hurt.
Enjoyment vs. Security
Max enjoyment: Live with the lions for a week
Max security: Stay at home and watch National Geographic
The balance between the two is determined by risk appetite
How can we apply this to Data Anonymization?
The data anonymization rules
1. If you are the hacker, you just need to hack through the weakest link.
2. If you are the [...], you need to be protected from all the hackers.
3. If you are anonymizing the data, [...]
Analytical Usefulness vs. Security
Max usefulness: Raw data
Max security: Lock up data, don’t do any analysis
The balance between the two is determined by risk appetite
Donald’s Matrix
Known Knowns (important to know)
• Most users do not care.
• Not all data that can be shared should be shared.
• Data policies need updating.
• Laws, standards & regulations.
Known Unknowns (minimize risk, find out more)
• Will people abuse their access rights?
• What is the damage if data gets compromised?
• What are the motivations of hackers?
Unknown Unknowns (be prepared)
• ?
Unknown Knowns (what we should already know)
• Who has official access?
• What resources do we have?
• What is the value of the data?
• What are the identifiers?
• Sharing the data?
• Different data policies?
• Laws, standards & regulations.
What are the techniques for Data Anonymization?
‘Hard’ Methods: strong security, difficult to analyze, dangerous if cracked
• Hashing
• Encryption
‘Soft’ Methods: flexible security strength, easier to analyze, anonymized
• Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+yrs), Mask (1234****)
• Lv2: Black box, Sampling, Add noise / fake data, Shuffle
• Lv3: Breaking big data machine learning
For best results, use a combination of techniques
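As a rough illustration of the hashing entry, the sketch below replaces an ID with a keyed hash (HMAC-SHA256) from Python's standard library. The key value and the function name `pseudonymize` are assumptions made for this example and are not part of the slides. A keyed hash is used because a plain hash of a short, structured ID could be reversed simply by hashing every possible ID.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only. In practice it would be
# generated randomly and stored separately from the data.
SECRET_KEY = b"replace-with-a-random-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an identifier as a hex string."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# The same ID always maps to the same token, so records can still be joined.
token = pseudonymize("S1235930X")
```

Anyone who obtains the key can recompute the tokens for known IDs, which is one reading of "dangerous if cracked".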
Lv1: RRRM: Quick and dirty
• Remove: ID S12345739Y -> ----
• Reduce: Mr. Smith -> Mr. S; St 21, XY Road, Bedok -> Bedok
• Reclassify: 43 yrs old -> 40+; $1,029,199 income -> $1million+
• Mask: 12345678 -> 1234****
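The four RRRM operations can be sketched as small Python functions. The function names, the ten-year age bands and the handling of incomes below one million are assumptions chosen to reproduce the examples on this slide, not a prescribed implementation.

```python
def remove(value: str) -> str:
    """Remove: drop the value entirely and leave a placeholder."""
    return "----"


def reduce_name(name: str) -> str:
    """Reduce: keep the title and the first letter of the surname."""
    title, surname = name.split(" ", 1)
    return f"{title} {surname[0]}"


def reduce_address(address: str) -> str:
    """Reduce: keep only the last comma-separated part (the area)."""
    return address.split(",")[-1].strip()


def reclassify_age(age: int) -> str:
    """Reclassify: replace an exact age with a ten-year band (assumed width)."""
    return f"{age // 10 * 10}+"


def reclassify_income(income: int) -> str:
    """Reclassify: replace an exact income with a whole-million band."""
    millions = income // 1_000_000
    if millions == 0:
        return "under $1million"  # assumed label for smaller incomes
    return f"${millions}million+"


def mask(number: str, visible: int = 4) -> str:
    """Mask: keep the first `visible` characters and star out the rest."""
    return number[:visible] + "*" * (len(number) - visible)
```

For example, `mask("12345678")` gives `1234****` and `reclassify_age(43)` gives `40+`.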
But these techniques are not good enough
"There are lots of smokers in the health records, but once you
narrow it down to an anonymous male black smoker born in
1965 who presented at the emergency room with aching
joints, it's actually pretty simple to merge the "anonymous"
record with a different "anonymised" database and out pops
the near-certain identity of the patient." ~ Cory Doctorow, The Guardian
Multi-variable identification
Big Data is a double-edged sword
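Multi-variable identification can be illustrated with a toy linkage between two data sets. Everything in this sketch is made up for illustration: the records, the choice of quasi-identifiers (`birth_year`, `sex`, `postal_prefix`) and the function name `link`. It only shows the idea that a combination of harmless-looking fields can point to exactly one named person.

```python
from collections import defaultdict

# Assumed quasi-identifiers shared by both data sets.
QUASI_IDENTIFIERS = ("birth_year", "sex", "postal_prefix")

# Toy "anonymous" records: no names, but quasi-identifiers are still present.
health_records = [
    {"birth_year": 1965, "sex": "M", "postal_prefix": "428", "complaint": "aching joints"},
    {"birth_year": 1980, "sex": "F", "postal_prefix": "428", "complaint": "fever"},
    {"birth_year": 1980, "sex": "F", "postal_prefix": "428", "complaint": "cough"},
]

# Toy second data set that carries names alongside the same fields.
public_register = [
    {"name": "Adam Smith", "birth_year": 1965, "sex": "M", "postal_prefix": "428"},
    {"name": "Emma Goldman", "birth_year": 1980, "sex": "F", "postal_prefix": "428"},
    {"name": "Alice", "birth_year": 1980, "sex": "F", "postal_prefix": "428"},
]


def link(records, register, keys=QUASI_IDENTIFIERS):
    """Return (record, name) pairs whose quasi-identifiers match exactly one person."""
    index = defaultdict(list)
    for person in register:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = []
    for record in records:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique combination re-identifies the record
            matches.append((record, candidates[0]))
    return matches


reidentified = link(health_records, public_register)
```

In this toy data only the 1965 record is unique, so it is the only one linked to a name. The two 1980 records each match two people and stay ambiguous.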
Lv2: Black Box (No data visibility for data scientist)
Requests -> Black box -> Summarized results
The black box can be an algorithm, software, a system or people, in-house or 3rd party
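One way to picture the black box is a small class that holds the raw rows privately and only answers summary requests. The class name `BlackBox`, the `average` method and the minimum group size are assumptions for this sketch. The slide only says that requests go in and summarized results come out.

```python
from statistics import mean


class BlackBox:
    """Holds raw rows privately and only returns per-group averages."""

    # Assumed threshold: groups smaller than this are suppressed so that a
    # "summary" of one or two people does not leak their individual values.
    MIN_GROUP_SIZE = 3

    def __init__(self, rows):
        self._rows = list(rows)

    def average(self, value_field, group_field):
        """Return {group: mean of value_field}, skipping groups that are too small."""
        groups = {}
        for row in self._rows:
            groups.setdefault(row[group_field], []).append(row[value_field])
        return {
            key: mean(values)
            for key, values in groups.items()
            if len(values) >= self.MIN_GROUP_SIZE
        }


# Toy data: the data scientist sends a request and sees only the summary.
box = BlackBox([
    {"age_band": "40+", "spend": 10},
    {"age_band": "40+", "spend": 20},
    {"age_band": "40+", "spend": 30},
    {"age_band": "20+", "spend": 50},
])
summary = box.average("spend", "age_band")
```

Here the single "20+" row is left out of the summary because its group is below the assumed threshold.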
Lv2: Sampling (lowers accuracy)
All data -> Data collected -> Sample
Probability sampling
• Simple Random
• Systematic
• Stratified
• Probability Proportional to Size
• Cluster
Nonprobability sampling (try not to use these)
• Convenience
• Quota
• Purposive
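Three of the probability sampling methods listed above can be sketched with Python's `random` module. The function names and parameters (`step`, `fraction`, `seed`) are assumptions for this example. Probability proportional to size and cluster sampling are left out to keep the sketch short.

```python
import random


def simple_random_sample(rows, n, seed=None):
    """Simple random: every row has the same chance of being picked."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, step, seed=None):
    """Systematic: pick a random start, then take every `step`-th row."""
    start = random.Random(seed).randrange(step)
    return rows[start::step]


def stratified_sample(rows, key, fraction, seed=None):
    """Stratified: sample the same fraction from each group defined by `key`."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one row per stratum
        sample.extend(rng.sample(members, k))
    return sample
```

For instance, `stratified_sample(rows, key=lambda r: r["age_band"], fraction=0.1)` would keep roughly a tenth of the rows from every age band, assuming each row is a dict with an `age_band` field.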
Lv2 Noise, fake & shuffle within data clusters
Lv2: Add noise / fake data (lowers accuracy)
Group visits by the same person together and apply the same amount of noise.
Before:
Name: Adam Smith
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
After (noise: +5 days; fake, male Scottish name):
Name: David Hume
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Before:
Name: David Abram
Visit1: 01/02/13
Visit2: 11/02/13
After (noise: -5 days):
Name: David Abram
Visit1: 27/01/13
Visit2: 06/02/13
Affects daily/monthly pattern
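The date noise on this slide can be sketched as follows, assuming dates are stored as `DD/MM/YY` strings. The function names and the `max_days` parameter are assumptions. One offset is drawn per person so that the gaps between that person's visits stay intact.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"  # assumed format, matching the slide examples


def shift_visits(visits, days):
    """Shift every visit date in the list by the same number of days."""
    shifted = []
    for visit in visits:
        date = datetime.strptime(visit, DATE_FORMAT) + timedelta(days=days)
        shifted.append(date.strftime(DATE_FORMAT))
    return shifted


def add_noise(people, max_days=5, seed=None):
    """Apply one random non-zero offset per person to all of their visits.

    `people` maps a name to that person's list of visit dates.
    """
    rng = random.Random(seed)
    offsets = [d for d in range(-max_days, max_days + 1) if d != 0]
    return {name: shift_visits(visits, rng.choice(offsets))
            for name, visits in people.items()}
```

Calling `shift_visits(["14/04/13", "21/05/13", "01/06/13"], 5)` reproduces the +5 days example, and an offset of -5 reproduces the David Abram example.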
Lv2: Shuffle (may break data relationships but retains trend)
Before:
Name: Adam Smith
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Female Hygiene
Purchase2: Strawberry
After shuffle:
Name: Adam Smith (purchases from David)
Purchase1: Bread
Purchase2: Sushi
Name: David Abram (purchases from Adam)
Purchase1: Cabbage
Purchase2: Tomato
Name: Emma Goldman (different gender, cannot shuffle with Adam/David)
Purchase1: Female Hygiene
Purchase2: Strawberry
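A shuffle that stays inside a cluster can be sketched like this. The record layout (a `gender` field used as the cluster and a `purchases` list) and the function name are assumptions for illustration. The slide only states that purchases are swapped between people of the same gender.

```python
import random
from collections import defaultdict


def shuffle_within_clusters(records, cluster_field, shuffle_field, seed=None):
    """Shuffle `shuffle_field` among records that share the same `cluster_field`."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        clusters[record[cluster_field]].append(index)

    result = [dict(record) for record in records]  # leave the input untouched
    for indices in clusters.values():
        values = [records[i][shuffle_field] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            result[i][shuffle_field] = value
    return result


records = [
    {"name": "Adam Smith", "gender": "M", "purchases": ["Cabbage", "Tomato"]},
    {"name": "David Abram", "gender": "M", "purchases": ["Bread", "Sushi"]},
    {"name": "Emma Goldman", "gender": "F", "purchases": ["Female Hygiene", "Strawberry"]},
]
shuffled = shuffle_within_clusters(records, "gender", "purchases", seed=3)
```

Emma is alone in her cluster, so her purchases never move. Adam and David may or may not swap on a given run, since a random shuffle can leave values where they were.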
Are we safe?
Original record:
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After RRRM:
ID: -----
Name: Mr. S
Age: 40+yrs
Postal: 428***
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After noise / fake:
ID: -----
Name: Mr. H
Age: 40+yrs
Postal: 428***
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Cabbage
Purchase2: Tomato
After shuffle:
ID: -----
Name: Mr. H
Age: 40+yrs
Postal: 428***
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Bread
Purchase2: Sushi
Are we safe? Before vs. after (encrypted)
Before: the original record above
After: the record after RRRM, noise / fake and shuffle above
Not really safe - Netflix case study
Prof. Arvind Narayanan
Sparse data
Even the most prolific Netflix users have rated only a tiny fraction of Netflix’s enormous library, so most columns, each of which represents a particular movie, are empty. The chance of two or more users giving the same ratings to the same set of movies is therefore quite small, which means that a user’s set of movie ratings can almost uniquely identify that user.
Example matches (credit: Prof. Arvind Narayanan):
• Best match: David; 2nd best match: Adam
• Best match: Alice; 2nd best match: Lisa
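The matching idea can be sketched with a very simple score: count how many movies two rating sets agree on, then rank the anonymized users by that score. This is a deliberately simplified stand-in, not the scoring used in the actual study. The user IDs, movie names, ratings and function names are all made up for illustration.

```python
def similarity(known, candidate):
    """Number of movies that both dicts (movie -> rating) rate identically."""
    return sum(1 for movie, rating in known.items()
               if candidate.get(movie) == rating)


def rank_matches(known, anonymized):
    """Return anonymized user IDs ordered from best to worst match."""
    return sorted(anonymized,
                  key=lambda uid: similarity(known, anonymized[uid]),
                  reverse=True)


# A few ratings an attacker already knows about one person (toy data).
known_ratings = {"Movie A": 5, "Movie B": 2, "Movie C": 4}

# Sparse "anonymous" rating data: each user rated only a handful of movies.
anonymized_ratings = {
    "user_17": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "user_42": {"Movie A": 5, "Movie E": 3},
    "user_99": {"Movie F": 2},
}

ranking = rank_matches(known_ratings, anonymized_ratings)
```

Because the data is sparse, the best match (`user_17` here) agrees on all three known ratings while the second best agrees on only one. A large gap between the best and second-best match is what makes the identification convincing.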
Lv3: Breaking Big Data Machine Learning
Lv3 Noise, fake & shuffle across data clusters
Lv3: Add trend breaking noise / fake data
Before:
Name: David Abram
Visit1: 01/02/13 (Bought item A,B,C)
Visit2: 11/02/13 (Bought item D,E)
After:
Name: David Abram
Visit1: 26/01/13 (Bought item A,B,C,X)
Visit2: 05/02/13 (Bought item D,E)
Reorder visits, add noise to the dates and add a fake purchase (X).
Findings related to item X and to the sequence of visits will be ignored.
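A minimal sketch of this kind of trend breaking noise, under stated assumptions: visits are `(date, items)` pairs with `DD/MM/YY` dates, one date offset is applied to all visits, one fake item is added to a randomly chosen visit, and the visit order is shuffled. The function name `break_trends` and its parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"  # assumed format, matching the slide examples


def break_trends(visits, fake_items, max_days=7, seed=None):
    """Return a perturbed copy of `visits`, a list of (date_string, items) pairs."""
    rng = random.Random(seed)
    offset = timedelta(days=rng.randint(-max_days, max_days))  # may be zero

    noisy = []
    for date_string, items in visits:
        date = datetime.strptime(date_string, DATE_FORMAT) + offset
        noisy.append((date.strftime(DATE_FORMAT), list(items)))

    # Add one fake purchase to a randomly chosen visit.
    target = rng.randrange(len(noisy))
    noisy[target][1].append(rng.choice(fake_items))

    # Reorder the visits so their stored sequence no longer carries meaning.
    rng.shuffle(noisy)
    return noisy


visits = [("01/02/13", ["A", "B", "C"]), ("11/02/13", ["D", "E"])]
perturbed = break_trends(visits, fake_items=["X"], seed=1)
```

Analysts working with such data need to be told which findings to ignore, here anything involving item X or the order of visits.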
Lv3: Trend breaking shuffle
Before:
Name: Adam Smith
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Female Hygiene
Purchase2: Strawberry
After shuffle:
Name: Adam Smith
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Female Hygiene
Purchase2: Strawberry
Gender related findings will be ignored
Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to us.
[Chart: Security vs. Analytical Usefulness, with the "Point of stupidity" marked]
thiakx@gmail.com
Linkedin: Kai Xin, Thia
Interesting reads
• Anonymizing Health Data
• Data protection in the EU: the certainty of uncertainty
• Robust De-anonymization of Large Sparse Datasets
• Eccentricity Explained
• A new way to protect privacy in large-scale genome-wide
association studies
• Why 'Anonymous' Data Sometimes Isn't
• Has Big Data Made Anonymity Impossible?
• ‘Anonymous’ Netflix Prize data not so anonymous after all
• A Data Broker Offers a Peek Behind the Curtain

Lions, zebras and Big Data Anonymization

  • 1.
  • 2. Data anonymization is the process applied on data to prevent identification of individuals, making it possible to share and analyze data securely.
  • 3. Disclaimer: Stuff shared here are personal research, does not represent any organization policies. Prof Khaled El Emam worked on anonymizing heritage health prize data
  • 5. Yes we are safe, as long as the lions prefer eating fat, juicy zebras than us.
  • 6. The safari rules 1.If you are the , you just need to be faster than the slowest zebra. 2.If you are the , you need to be able to escape from all the lions. 3.If you are the safari visitors, to the lions & zebras without getting hurt.
  • 7. Enjoyment Security Max enjoyment: Live with the lions for a week Max security: Stay at home watch National Geographic Determined by risk appetite
  • 8. How can we apply this to Data Anonymization
  • 9. The data anonymization rules 1.If you are the , you just need to hack through the weakest link. 2.If you are the , you need to be protected from all the hackers. 3.If you are anonymizing the data,
  • 10. Analytical Usefulness Security Max Usefulness: Raw data Determined by risk appetite Max security: Lock up data, don’t do any analysis
  • 12. Known Knowns Most users do not care. Not all data that can be shared should be shared. Data policies needs updating. Laws, Standards & Regulations. Will people abuse their access rights? What are the damages if data got compromised? Motivations of hackers? Known Unknowns Unknown UnknownsUnknown knowns ? Minimize risk, find out more Be preparedWhat we should already know Who have official access? Resources we have? Value of data? What are the identifiers? Sharing the data? Different data policies? Laws, Standards & Regulations.
  • 13. What are the techniques for Data Anonymization
  • 14. ‘Hard’ Methods More difficult to analyze ‘Soft’ Methods Easier to analyze Hashing Encryption Lv1 Lv2 Lv3 Remove: --- Reduce: Mr. S Reclassify: 40+yrs Mask: 1234**** Black box Sampling Add noise / fake data Shuffle Breaking big data machine learning
  • 15. ‘Hard’ Methods Strong security, difficult to analyze, dangerous if cracked ‘Soft’ Methods Flexible security strength, easier to analyze, anonymized Hashing Encryption Lv1 Lv2 Lv3 Remove: --- Reduce: Mr. S Reclassify: 40+yrs Mask: 1234**** Black box Sampling Add noise / fake data Shuffle Breaking big data machine learning For best results, use a combination of techniques
  • 16. Lv1: RRRM: Quick and dirty Remove ID S12345739Y -> ---- Reduce Mr. Smith -> Mr. S, St 21, XY Road, Bedok-> Bedok Reclassify 43 yrs old -> 40+ $1,029,199 income-> $1million+ Mask 12345678->1234**** But these techniques are not good enough
  • 17. "There are lots of smokers in the health records, but once you narrow it down to an anonymous male black smoker born in 1965 who presented at the emergency room with aching joints, it's actually pretty simple to merge the "anonymous" record with a different "anonymised" database and out pops the near-certain identity of the patient." ~ Cory Doctorow, theguardian Multi variable identification Big Data is a double edge sword
  • 18. Lv2: Black Box (No data visibility for data scientist) Algorithm, Software, System or People In-house or 3rd Party Requests Summarized Results
  • 19. Lv2: Sampling (lowers accuracy) Probability Simple Random Systematic Stratified Probability Proportional to Size Cluster Nonprobability (Try not to use these) Convenience Quota Purposive
  • 20. Lv2: Sampling (lowers accuracy) Probability Simple Random Systematic Stratified Probability Proportional to Size Cluster Nonprobability (Try not to use these) Convenience Quota Purposive All data Data Collected Sample
  • 21. Lv2 Noise, fake & shuffle within data clusters
  • 22. Lv2: Add noise / fake data (lowers accuracy) Name: Adam Smith Visit1: 14/04/13 Visit2: 21/05/13 Visit3: 01/06/13 Name: David Hume Visit1: 19/04/13 Visit2: 26/05/13 Visit3: 06/06/13 Noise: +5 days Fake, male Scottish Name Group visits by same person together and apply same amount of noise Name: David Abram Visit1: 01/02/13 Visit2: 11/02/13 Name: David Abram Visit1: 27/01/13 Visit2: 06/02/13 Affects daily/ monthly pattern Noise: -5 days
  • 23. Lv2: Shuffle (may break data relationships but retains trend) Name: Adam Smith Purchase1 : Cabbage Purchase2 : Tomato Name: David Abram Purchase1 : Bread Purchase2 : Sushi Name: Emma Goldman Purchase1: Female Hygiene Purchase2 : Strawberry Shuffle Name: Adam Smith Purchase1 : Bread Purchase2 : Sushi Name: David Abram Purchase1 : Cabbage Purchase2 : Tomato Name: Emma Goldman Purchase1: Female Hygiene Purchase2 : Strawberry Different gender, cannot shuffle with Adam/David Name: Emma Goldman Purchase1: Female Hygiene Purchase2 : Strawberry From David From Adam
  • 24. Are we safe? RRRM. Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato. After: ID: -----; Name: Mr. S; Age: 40+yrs; Postal: 428***; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato.
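The before/after pair on slide 24 can be reproduced with a small function. This is a sketch under assumptions: the deck names the first two steps "Remove" and "Reduce", while treating the age and postal-code steps as rounding down to the decade and masking trailing digits is only a reading of the example. The function name and the `title` parameter are hypothetical.

```python
def rrrm(record, title="Mr."):
    """Apply the four transformations shown on the slide to one record.
    `title` is an assumed input; the source record does not store it."""
    out = dict(record)
    out["ID"] = "-----"                                   # remove the direct identifier
    surname = record["Name"].split()[-1]
    out["Name"] = f"{title} {surname[0]}"                 # reduce the name to an initial
    out["Age"] = f"{record['Age'] // 10 * 10}+yrs"        # assumed: round down to the decade
    postal = record["Postal"]
    out["Postal"] = postal[:3] + "*" * (len(postal) - 3)  # assumed: mask all but 3 digits
    return out


before = {
    "ID": "S1235930X",
    "Name": "Adam Smith",
    "Age": 45,
    "Postal": "428102",
    "Visit1": "14/04/13",
    "Purchase1": "Cabbage",
}
after = rrrm(before)
```

Visit dates and purchases pass through untouched, which is why the slide still asks "Are we safe?".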
  • 25. Are we safe? Noise / Fake. Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato. After: ID: -----; Name: Mr. H; Age: 40+yrs; Postal: 428***; Visit1: 15/04/13; Visit2: 26/05/13; Visit3: 06/06/13; Purchase1: Cabbage; Purchase2: Tomato.
  • 26. Are we safe? Shuffle. Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato. After: ID: -----; Name: Mr. H; Age: 40+yrs; Postal: 428***; Visit1: 15/04/13; Visit2: 26/05/13; Visit3: 06/06/13; Purchase1: Bread; Purchase2: Sushi.
  • 27. Are we safe? Before vs. After. Encrypted. Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato. After: ID: -----; Name: Mr. H; Age: 40+yrs; Postal: 428***; Visit1: 15/04/13; Visit2: 26/05/13; Visit3: 06/06/13; Purchase1: Bread; Purchase2: Sushi.
  • 28. Not really safe - Netflix case study + Prof. Arvind Narayanan
  • 29. Not really safe - Netflix case study + Prof. Arvind Narayanan. Sparse data: Even the most prolific Netflix users have rated only a tiny fraction of Netflix's enormous library. Thus most columns, each of which represents a particular movie, are empty. Therefore, the chance of two or more users giving the same ratings to the same set of movies is quite small, so a user's set of movie ratings can almost uniquely identify that user.
  • 30. Credit: Prof. Arvind Narayanan. Best match: David; 2nd best match: Adam. Best match: Alice; 2nd best match: Lisa.
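The "best match versus 2nd best match" idea on slides 29 and 30 can be illustrated with a deliberately simplified scoring sketch. It is not the published algorithm: the similarity measure (count of identical ratings), the threshold of 1.5, the user IDs and the movie titles are all assumptions made for the example. The underlying idea is that a match is only trusted when the best score stands far enough above the runner-up, relative to the spread of all scores.

```python
from statistics import pstdev


def similarity(known, candidate):
    """Count the movies that both rating sets score identically."""
    return sum(1 for movie, rating in known.items() if candidate.get(movie) == rating)


def best_match(known, dataset, min_eccentricity=1.5):
    """Return the best-matching user ID, or None if the match does not stand out.
    `dataset` must contain at least two users."""
    scores = sorted(
        ((similarity(known, ratings), user) for user, ratings in dataset.items()),
        reverse=True,
    )
    (best, best_user), (second, _) = scores[0], scores[1]
    spread = pstdev([score for score, _ in scores])
    if spread == 0:
        return None  # every candidate looks the same
    eccentricity = (best - second) / spread
    return best_user if eccentricity >= min_eccentricity else None


# Hypothetical "anonymized" ratings, keyed by made-up user IDs.
dataset = {
    "u1": {"Obscure Film A": 5, "Obscure Film B": 2, "Blockbuster": 4, "Other": 3},
    "u2": {"Blockbuster": 4, "Other": 1},
    "u3": {"Blockbuster": 4},
    "u4": {"Other": 5},
}
# What an attacker might know about one person from another source.
known = {"Obscure Film A": 5, "Obscure Film B": 2, "Blockbuster": 4}

match = best_match(known, dataset)
popular_only = best_match({"Blockbuster": 4}, dataset)
```

Knowing only the popular title gives no stand-out candidate, while adding two obscure titles singles out one user. This mirrors the editor's note that obscure movies are especially identifying.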
  • 31. Lv3: Breaking Big Data Machine Learning
  • 32. Lv3: Noise, fake & shuffle across data clusters
  • 33. Lv3: Add trend-breaking noise / fake data. Before: Name: David Abram; Visit1: 01/02/13 (bought items A, B, C); Visit2: 11/02/13 (bought items D, E). After: Name: David Abram; Visit1: 26/01/13 (bought items A, B, C, X); Visit2: 05/02/13 (bought items D, E). Reorder visits and add noise to the dates. Findings related to fake purchase X and to the sequence of visits will be ignored.
  • 34. Lv3: Trend-breaking shuffle. Before: Name: Adam Smith; Purchase1: Cabbage; Purchase2: Tomato. Name: David Abram; Purchase1: Bread; Purchase2: Sushi. Name: Emma Goldman; Purchase1: Female Hygiene; Purchase2: Strawberry. After shuffle: Name: Adam Smith; Purchase1: Bread; Purchase2: Sushi. Name: Emma Goldman; Purchase1: Cabbage; Purchase2: Tomato. Name: David Abram; Purchase1: Female Hygiene; Purchase2: Strawberry. Gender-related findings will be ignored.
  • 35. Analytical usefulness vs. security. Max usefulness: raw data. Max security: lock up data, don't do any analysis. The balance is determined by risk appetite.
  • 36. Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.
  • 39. Interesting reads • Anonymizing Health Data • Data protection in the EU: the certainty of uncertainty • Robust De-anonymization of Large Sparse Datasets • Eccentricity Explained • A new way to protect privacy in large-scale genome-wide association studies • Why 'Anonymous' Data Sometimes Isn't • Has Big Data Made Anonymity Impossible? • ‘Anonymous’ Netflix Prize data not so anonymous after all • A Data Broker Offers a Peek Behind the Curtain

Editor's Notes

  1. There are known knowns. These are things we know that we know. There are known unknowns. These are things that we know we don't know. But there are also unknown unknowns. These are things we don't know we don't know. - Donald Rumsfeld *He missed out the unknown knowns. These are things we forget or intentionally refuse to acknowledge that we know.
  2. These are generic examples; these paragraphs should be customized for specific domains: healthcare, cloud, IT, banks, etc. We also need this content to let management understand what they don't know they don't know, so they can maybe feel less scared.
  3. *Beware of data with multiple, related records in a time series
  4. http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous
  5. Sparse data: Even the most prolific Netflix users have rated only a tiny fraction of Netflix's enormous library. Thus most columns, each of which represents a particular movie, are empty. Therefore, the chance of two or more users giving the same ratings to the same set of movies is quite small, so a user's set of movie ratings can almost uniquely identify that user.
  6. Especially for obscure movies: http://www.cs.utexas.edu/~shmat/netflix-faq.html
  7. http://33bits.org/2008/10/03/eccentricity-explained/
  8. Ultimately, it is our responsibility as people who handle data to learn how to protect the data from the lions, the people who want to watch the world burn. It is important for us to have the skills to ensure that data analytics can continue in a spirit of sharing, learning and gaining insights from one another, and not be obstructed by fear of the bad guys.
  9. It is the responsibility of data scientists to learn data anonymization.