Big Data For A Better World
Sponsors:
Tonight’s schedule
• Panel Presentation
• Keynote
• Networking
Panel Presentation
• Leon Wilson
• David M. Holmes
• Jason Therrien
Big Data: The Promise, the
Premise and the Practice
Kambiz (come-bees) Ghazinour
Advanced Information Security and Privacy Research Lab
Kent State University
Nov 16, 2017
Big Data: The Promise, the
Premise and the Practice
Volume
Velocity
Variety
Veracity
Big, Fast, Diverse, Uncertain Data:
The Promise, the Premise and the
Practice
Google, MapReduce 2004
Big, Fast, Diverse, Inaccurate Data:
The Problem of Promising
Protection of Personal Information
in a Protected and Privacy
Preserving Platform in Practice
Phone Metadata
The Stanford Experiment: see Data and Goliath by Bruce Schneier
• Phone metadata from 500 volunteers
• One called a hospital, a medical lab, and a pharmacy, and made
several short calls to a monitoring hotline for a heart
monitor
• One called her sister at length, then called an abortion
clinic, made further calls two weeks later, and made a final call a
month later.
• a heart patient, an abortion …
• This is just metadata not content. It’s very revealing.
Who should be able to see it?
Personal Data and Privacy
No one shall be subjected to arbitrary
interference with his privacy, family, home
or correspondence, nor to attacks upon his
honor and reputation. Everyone has the
right to the protection of the law against
such interference or attacks
• Article 12 of the Universal Declaration of Human
Rights
Nothing to hide, nothing to fear?
• Many people need to control their privacy
– victims of rape or other trauma,
– people escaping abusive relationships
– people who may be discriminated against (HIV positive, previous mental illness, spent convictions,
recovering addicts)
– people at risk of “honor” violence from their families for breaking cultural norms
– Adopters, protecting their adopted children from the birth families that abused them
– witness protection, undercover police, some social workers and prison staff ….
• It is unthinking or callous to see other people’s privacy as unimportant.
• Data is forever and your circumstances or society’s attitudes may change
The Value of Big Data
• Facebook and Amazon are each valued at about $500B,
as of July 2017
• Most of Facebook’s value comes from
personal data
Anonymization
• Statistical analyses are anonymous: “70 percent of
American smokers want to quit” does not expose
personal data
• Data about individuals can be anonymous, but this
becomes very difficult when more than a few facts
are included, even if these facts are not specific and
some of them are wrong (e.g., Netflix)
3 ways to anonymize
• Suppress - omit from the released data
• Generalize - for example, replace birth date
with something less specific, like birth year
• Perturb - make changes to the data
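The three operations above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the talk: the record fields, the `spread` parameter, and the function names are all assumptions chosen for the example.

```python
import random

# Hypothetical records; the field names are illustrative, not from the talk.
records = [
    {"name": "Alice", "birth_date": "1984-03-12", "zip": "44240", "income": 52000},
    {"name": "Bob", "birth_date": "1979-11-02", "zip": "44242", "income": 61000},
]

def suppress(record, fields):
    """Suppress: omit the given fields from the released record."""
    return {k: v for k, v in record.items() if k not in fields}

def generalize_birth_date(record):
    """Generalize: replace the full birth date with the birth year only."""
    out = dict(record)
    out["birth_year"] = out.pop("birth_date")[:4]
    return out

def perturb_income(record, rng, spread=1000):
    """Perturb: add bounded random noise to a numeric value."""
    out = dict(record)
    out["income"] = out["income"] + rng.randint(-spread, spread)
    return out

def anonymize(record, rng):
    """Apply all three steps; the input record is left unchanged."""
    return perturb_income(generalize_birth_date(suppress(record, {"name"})), rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
released = [anonymize(r, rng) for r in records]
for row in released:
    print(row)
```

Each step trades usefulness for protection: the more that is suppressed, generalized, or perturbed, the less a researcher can learn from the released rows.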
Anonymization is difficult
Example 1: AOL Search Data, August 2006
• To stimulate research into the value of
search data, AOL released the
anonymized search records of 658,000
users over a three-month period from
March to May 2006
AOL anonymization
• AOL had tried to anonymize the data they released by removing the
searcher’s IP address and replacing the AOL username with a unique
random identifier that linked all the searches by any individual, so that the
data was still useful for research
• It did not take long for two journalists to identify user 4417749, who
had searched for people with the last name Arnold, “homes sold in
shadow lake subdivision Gwinnett county Georgia” and “pine straw in
Lilburn”, as Thelma Arnold, a widow living in Lilburn, Georgia
• Her other searches provide a deeply personal view of her life,
difficulties and desires
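A small sketch shows why this style of release is fragile. It assumes a log of (username, query) pairs and replaces each username with a stable random number, which is the general approach described above; the log entries, usernames, and function names are invented for illustration and are not AOL's actual pipeline.

```python
import secrets
from collections import defaultdict

# Hypothetical log entries (username, query); the usernames are invented.
search_log = [
    ("user_a", "pine straw in Lilburn"),
    ("user_b", "weather tomorrow"),
    ("user_a", "homes sold in shadow lake subdivision Gwinnett county Georgia"),
]

def pseudonymize(log):
    """Replace each username with a random ID, reusing the same ID per user."""
    ids = {}
    released = []
    for username, query in log:
        if username not in ids:
            new_id = secrets.randbelow(10**7)
            while new_id in ids.values():  # keep pseudonyms distinct
                new_id = secrets.randbelow(10**7)
            ids[username] = new_id
        released.append((ids[username], query))
    return released

def profile_by_id(released):
    """Group queries by pseudonym, as any recipient of the data could."""
    profiles = defaultdict(list)
    for pseudonym, query in released:
        profiles[pseudonym].append(query)
    return dict(profiles)

released = pseudonymize(search_log)
profiles = profile_by_id(released)
for pseudonym, queries in profiles.items():
    print(pseudonym, queries)
```

The username is gone, but every query by the same person still hangs together under one number, so the queries themselves (a surname, a subdivision, a town) do the identifying.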
AOL faced strong criticism
• The violation of privacy was widely
condemned
• AOL described their action as a “screw up”
• They took down the data, but it was too late.
The internet never forgets. Several mirror
sites had already been set up.
http://www.not-secret.com
Example 2:
The Netflix™ Prize
• In October 2006, Netflix launched a $1m prize for an
algorithm that was 10% better than its existing algorithm,
Cinematch
• Participants were given access to the contest training data
set of more than 100 million ratings from over 480
thousand randomly chosen, anonymous customers on
nearly 18 thousand movie titles.
• How much information would you need to be able to
identify customers?
Netflix
• Netflix said “to protect customer privacy, all personal
information identifying individual customers has been
removed and all customer ids have been replaced by
randomly-assigned ids. The date of each rating and the
title and year of release for each movie are provided. No
other customer or movie information is provided.”
• Two weeks after the prize was launched, Arvind
Narayanan and Vitaly Shmatikov of the University of
Texas at Austin announced that they could identify a high
proportion of the 480,000 subscribers in the training
data.
Narayanan and Shmatikov’s results
• How much does the attacker need to know about a Netflix subscriber in
order to identify her record in the dataset, and thus completely learn her
movie viewing history? Very little.
• For example, suppose the attacker learns a few random ratings and the
corresponding dates for some subscriber, perhaps from coffee-time chat.
• With 8 movie ratings (of which we allow 2 to be completely wrong) and
dates that may have a 3-day error, 96% of Netflix subscribers whose
records have been released can be uniquely identified in the dataset.
• For 64% of subscribers, knowledge of only 2 ratings and dates is sufficient
for complete deanonymization, and for 89%, 2 ratings and dates are
enough to reduce the set of records to 8 out of almost 500,000, which
can then be inspected for further deanonymization.
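The idea of matching a few approximately known ratings against a released dataset can be sketched as follows. This is a deliberately simplified illustration and not Narayanan and Shmatikov's actual scoring algorithm: the toy dataset, the `matches` and `candidates` helpers, and their parameters are assumptions made for the example.

```python
from datetime import date

# Hypothetical released dataset: pseudonymous ID -> {movie: (rating, date)}.
dataset = {
    101: {"A": (5, date(2005, 1, 10)), "B": (3, date(2005, 2, 1)), "C": (4, date(2005, 3, 5))},
    102: {"A": (5, date(2005, 6, 20)), "B": (1, date(2005, 2, 2)), "D": (2, date(2005, 4, 9))},
    103: {"B": (3, date(2005, 7, 15)), "C": (4, date(2005, 3, 30)), "D": (5, date(2005, 1, 1))},
}

def matches(record, aux, date_tolerance_days=3, allowed_wrong=0):
    """True if enough of the attacker's (movie, rating, date) facts fit the record."""
    hits = 0
    for movie, rating, when in aux:
        if movie in record:
            known_rating, known_date = record[movie]
            close_in_time = abs((known_date - when).days) <= date_tolerance_days
            if known_rating == rating and close_in_time:
                hits += 1
    return hits >= len(aux) - allowed_wrong

def candidates(dataset, aux, **kwargs):
    """Return the IDs of all records consistent with the attacker's knowledge."""
    return [rid for rid, rec in dataset.items() if matches(rec, aux, **kwargs)]

# What the attacker overheard: two ratings with approximate dates.
aux = [("A", 5, date(2005, 1, 12)), ("B", 3, date(2005, 1, 30))]
print(candidates(dataset, aux))  # prints [101]
```

When the candidate list has exactly one entry, the subscriber is uniquely identified and the rest of that record is exposed. Sparse data makes this likely, because few people share even two ratings on nearly the same dates.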
Why are Narayanan and Shmatikov’s
results important?
1. They were results from probability theory, so they apply to all
sparse datasets. (They tested the results later, using the
Internet Movie Database (IMDb) as a source of data.)
2. Psychologists at Cambridge University have shown that a small
number of seemingly innocuous Facebook Likes can be used to
automatically and accurately predict a range of highly sensitive
personal attributes, including sexual orientation, ethnicity,
religious and political views, personality traits, intelligence,
happiness, use of addictive substances, parental separation,
age, and gender.
“Security” Attack Scenario
The Attack Scenario
The Usefulness Challenge
The Attack Scenario - Anonymization
Re-identification by linking
Re-identification by linking (example)
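The slide's own example is not preserved in this transcript, so the following is a generic sketch of a linking attack under stated assumptions: a release with names removed is joined to a public list on shared attributes (quasi-identifiers). All tables, names, and the choice of `zip`, `birth_year`, and `sex` as quasi-identifiers are hypothetical.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
medical = [
    {"zip": "44240", "birth_year": 1961, "sex": "F", "diagnosis": "heart disease"},
    {"zip": "44242", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
]

# Hypothetical public list that carries names alongside the same attributes.
public = [
    {"name": "Pat Example", "zip": "44240", "birth_year": 1961, "sex": "F"},
    {"name": "Sam Sample", "zip": "44242", "birth_year": 1975, "sex": "M"},
    {"name": "Lee Sample", "zip": "44242", "birth_year": 1975, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def key(row):
    """Build the join key from the quasi-identifier columns."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)

def link(released, known):
    """Attach a name to every released row that matches exactly one known person."""
    by_key = {}
    for person in known:
        by_key.setdefault(key(person), []).append(person["name"])
    reidentified = {}
    for row in released:
        names = by_key.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the row
            reidentified[names[0]] = row["diagnosis"]
    return reidentified

print(link(medical, public))  # prints {'Pat Example': 'heart disease'}
```

The second released row is not re-identified here because two people in the public list share its quasi-identifiers, which is the intuition behind hiding each person in a group.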
Anonymization in Data Systems
K-anonymity
K-Anonymity
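The slide content for this heading is not preserved here. As a reminder of the standard definition, a table is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. A minimal check, assuming hypothetical rows that have already been generalized, might look like this:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical generalized table: ZIP codes truncated, ages bucketed.
rows = [
    {"zip": "442**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "442**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "442**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "442**", "age": "40-49", "diagnosis": "diabetes"},
]

print(k_anonymity(rows, ["zip", "age"]))  # prints 2
```

The result depends entirely on which columns are treated as quasi-identifiers; adding `diagnosis` to the list drops this table to k = 1.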
Output Perturbation
Example of suppression and
generalization
Classification of Attributes
Classification of Attributes
Example
Finding similar instances
• A Fast Approximate Nearest Neighbor Search
Algorithm in the Hamming Space
– Locality sensitive hashing (LSH)
– Error Weighted Hashing (EWH)
– Etc.
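For readers unfamiliar with locality sensitive hashing in Hamming space, here is a minimal sketch using bit sampling, a classic LSH family for Hamming distance. It is not the specific algorithm from the talk, and Error Weighted Hashing is not shown; the function names, table count, and sample size are assumptions for illustration.

```python
import random
from collections import defaultdict

def make_hash(num_bits, sample_size, rng):
    """Bit sampling: hash a bit vector by reading a fixed random subset of positions."""
    positions = rng.sample(range(num_bits), sample_size)
    return lambda bits: tuple(bits[p] for p in positions)

def build_index(vectors, num_bits, num_tables=8, sample_size=4, seed=0):
    """Build several hash tables, each with its own random bit positions."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        h = make_hash(num_bits, sample_size, rng)
        buckets = defaultdict(list)
        for idx, v in enumerate(vectors):
            buckets[h(v)].append(idx)
        tables.append((h, buckets))
    return tables

def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def query(tables, vectors, q):
    """Return the index of the closest colliding vector, or None if nothing collides."""
    candidates = set()
    for h, buckets in tables:
        candidates.update(buckets.get(h(q), []))
    return min(candidates, key=lambda i: hamming(vectors[i], q), default=None)

vectors = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (1, 1, 1, 1, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1, 1),
]
tables = build_index(vectors, num_bits=8)
print(query(tables, vectors, (1, 1, 1, 1, 1, 1, 1, 0)))
```

Vectors that differ in few bits are likely to agree on a random sample of positions, so they tend to land in the same bucket and only those candidates need an exact distance check. The search is approximate: a true nearest neighbor can be missed if it never collides with the query.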
Big, Fast, Diverse, Inaccurate Data:
The Problem of Promising
Protection of Personal Information
in a Protected and Privacy
Preserving Platform in Practice
Thank you!
• Questions?
Kambiz Ghazinour
kghazino@kent.edu
@DrGhazinour
Advanced Information Security and Privacy Lab