1© Cloudera, Inc. All rights reserved.
Sean Owen | Director, Data Science
Fuzzy Data Leaks

(Not Fuzzy) Data Leak:
In 2007, hackers stole details of 45.7 million credit and debit cards from TJ Maxx and Marshalls. Possibly the "largest hack ever."

Fuzzy Data Leak:
ACME Co. shares this non-personally-identifiable data with a partner for market research. Where's the leak?
City,Gender,DOB,…
Duluth,M,18-Sep-1975,…
Reno,F,04-Feb-1954,…
Tacoma,F,01-Mar-1944,…
Austin,M,09-Oct-1980,…
Spokane,M,09-Feb-1970,…
Reno,M,10-Oct-1980,…
Boise,F,01-Jan-1970,…
...

"About half of the U.S. population … are likely to be uniquely identified by only place, gender, date of birth …"
Uniqueness of Simple Demographics in the U.S. Population, by Latanya Sweeney
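
A rough way to check a data set for this kind of leak is to count how many records are unique on the quasi-identifying columns. The sketch below is plain Scala over the sample rows from the earlier slide; the `Person` case class and the `uniqueFraction` helper are hypothetical names introduced for illustration, not part of the talk.

```scala
// Hypothetical record type mirroring the City,Gender,DOB columns on the slide.
case class Person(city: String, gender: String, dob: String)

// Sample rows copied from the slide; a real check would scan the full data set.
val sample = Seq(
  Person("Duluth", "M", "18-Sep-1975"),
  Person("Reno", "F", "04-Feb-1954"),
  Person("Tacoma", "F", "01-Mar-1944"),
  Person("Austin", "M", "09-Oct-1980"),
  Person("Spokane", "M", "09-Feb-1970"),
  Person("Reno", "M", "10-Oct-1980"),
  Person("Boise", "F", "01-Jan-1970")
)

// Fraction of records whose (city, gender, dob) combination occurs exactly once.
def uniqueFraction(people: Seq[Person]): Double = {
  val groupSizes = people.groupBy(identity).values.map(_.size)
  groupSizes.count(_ == 1).toDouble / people.size
}

val fraction = uniqueFraction(sample)
println(f"$fraction%.2f of sample records are unique on (city, gender, DOB)")
```

Any record that is unique on these columns can potentially be joined against an outside source, such as a voter roll, that carries a name alongside the same three fields.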

AOL Search Data Leak
Raw user input is sensitive

In 2006, AOL released 20 million raw search queries for 658,000 users for research purposes, with users identified only by number.

User 4417749: Thelma Arnold
www.nytimes.com/imagepages/2006/08/08/business/09aol-graphic.html

Secure Raw User Input, Even Internally
[Diagram: Landing Zone, Data Hub, Tokenization, …]

Netflix Prize Data Set
A little noise is not enough

In 2006, Netflix released 100 million anonymized ratings from 480,189 users of 17,770 movies and offered $1M for the best recommender built from this data.

"... all customer identifying information has been removed ... only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation."
Netflix | http://www.netflixprize.com/faq

"Using [IMDB] …, we successfully identified the Netflix records of known users, uncovering their apparent political preferences …"
Robust De-anonymization of Large Sparse Datasets, by Arvind Narayanan and Vitaly Shmatikov

[Diagram: Netflix, Netflix Prize Data, IMDB]

Obscure movies carry orders of magnitude more information about identity.
https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
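
One rough way to see why: the fewer people who rated a title, the more a single rating of it narrows down who the user could be. The sketch below measures that as self-information, log2(total users / users who rated the title). The user total comes from the Netflix slide; the two per-movie rating counts are made-up placeholders, and this is a simplification of the intuition rather than the scoring function used in the paper.

```scala
// Total users in the Netflix Prize data set, as quoted on the earlier slide.
val totalUsers = 480189

// Bits of information gained by learning that a user rated a title
// that `ratedBy` users rated: log2(totalUsers / ratedBy).
def surprisalBits(ratedBy: Int): Double =
  math.log(totalUsers.toDouble / ratedBy) / math.log(2)

// Hypothetical rating counts, chosen only for illustration.
val blockbusterRaters = 200000
val obscureRaters = 50

println(f"Blockbuster: ${surprisalBits(blockbusterRaters)}%.1f bits")
println(f"Obscure:     ${surprisalBits(obscureRaters)}%.1f bits")
```

Each extra bit halves the pool of candidate users, so under these assumed counts a single rating of the obscure title shrinks the pool from 480,189 to 50, while the blockbuster barely narrows it at all.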

Perturbation doesn't hurt the ability to model much. A simple Apache Spark implementation, adding a fake movie rating for every real one, increases RMSE by only ~0.1.
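import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse comma-separated lines (item, user, rating, one ignored field) into Ratings
val dataSet = sc.textFile(".../training_set.csv").map { line =>
  val Array(item, user, rating, _) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}
val allItems = dataSet.map(_.product).distinct.collect
// For every real rating, also emit a fake rating of a random item by the same user
val perturbedData = dataSet.mapPartitions { partition =>
  val r = new scala.util.Random()
  partition.flatMap { case rating @ Rating(user, _, value) =>
    val fakeItem = allItems(r.nextInt(allItems.size))
    Seq(rating, Rating(user, fakeItem, value))
  }
}
// Note: as shown on the slide, the split uses dataSet;
// substitute perturbedData to evaluate the perturbed case
val Array(trainSet, testSet) =
  dataSet.randomSplit(Array(0.99, 0.01), 42)
...
// ALS.train arguments: rank = 100, iterations = 20, lambda = 0.05
val model = ALS.train(trainSet, 100, 20, 0.05)
...
println(s"RMSE = ${math.sqrt(MSE)}")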

But, why did Netflix even release movie names?
"Other datasets provide it. Cinematch doesn’t currently use this data. Use it if you want."
(HT Matthew Wright)

NYC Taxi Ride Data Set
There's a wrong way and a right way to obfuscate

The New York City Taxi & Limousine Commission retains detailed records of taxi rides in the city. In 2014 these records were obtained by request under the Freedom of Information Law (FOIL).
upload.wikimedia.org/wikipedia/commons/8/84/NYC_Hybrid_Taxi.JPG

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
2Y92? 5002921?

"Why, then the world's mine oyster,
Which I with sword will open."
.hashCode = 1844173304

MD5("") = D41D8CD98F00B204E9800998ECF8427E
tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1
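
The point of the two slides above is that a hash is a fixed, public function: the same input always gives the same output, for anyone who computes it. A minimal sketch using the JDK's `MessageDigest`; the `md5Hex` helper is a hypothetical name introduced here, not something from the talk.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper: MD5 digest of a string, as upper-case hex.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes(StandardCharsets.UTF_8))
    .map("%02X".format(_))
    .mkString

// The empty string always hashes to the value shown on the slide.
val emptyDigest = md5Hex("")
println(emptyDigest) // D41D8CD98F00B204E9800998ECF8427E

// Anyone can recompute the digest of a guessed input and compare.
val first = md5Hex("3N49")
val second = md5Hex("3N49")
println(first == second) // true
```

A hashed column that decodes to the empty-string digest is a strong hint that plain, unsalted MD5 was used.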

3N49        26,000
FY985    + 676,000
BXJ761   + 17,576,000
         = 18,278,000
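
The totals above follow from the three medallion patterns suggested by the examples: digit-letter-digit-digit, two letters plus three digits, and three letters plus three digits. A short sketch of the arithmetic, with variable names chosen here for illustration:

```scala
val digits = 10   // '0' to '9'
val letters = 26  // 'A' to 'Z'

// Pattern like 3N49: digit, letter, digit, digit
val nxnn = digits * letters * digits * digits
// Pattern like FY985: two letters, three digits
val xxnnn = letters * letters * digits * digits * digits
// Pattern like BXJ761: three letters, three digits
val xxxnnn = letters * letters * letters * digits * digits * digits

val total = nxnn + xxnnn + xxxnnn
println(s"$nxnn + $xxnnn + $xxxnnn = $total") // 26000 + 676000 + 17576000 = 18278000
```

About 18 million candidates is a tiny space compared with the 2^128 possible MD5 outputs, which is why every candidate can simply be hashed and compared.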

Time to MD5 hash all 18,278,000 possible medallions on a 6-node Apache Spark cluster: 2 seconds
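
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Helper the slide assumes: upper-case hex, matching the data set's format
def toHex(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString

val digits = '0' to '9'
val alpha = 'A' to 'Z'
// Pattern 1: digit, letter, digit, digit (e.g. 3N49)
val nxnn = sc.parallelize(digits).flatMap(a =>
  for (b <- alpha; c <- digits; d <- digits)
    yield new String(Array(a, b, c, d)))
val xx = for (a <- alpha; b <- alpha) yield Array(a, b)
// Pattern 2: two letters, three digits (e.g. FY985)
val xxnnn = sc.parallelize(xx).flatMap(prefix =>
  for (c <- digits; d <- digits; e <- digits)
    yield new String(prefix ++ Array(c, d, e)))
// Pattern 3: three letters, three digits (e.g. BXJ761)
val xxxnnn = sc.parallelize(xx).flatMap(prefix =>
  for (c <- alpha; d <- digits; e <- digits; f <- digits)
    yield new String(prefix ++ Array(c, d, e, f)))
// Hash every candidate and collect a digest -> medallion lookup table
sc.union(nxnn, xxnnn, xxxnnn).mapPartitions { medallions =>
  val md5 = MessageDigest.getInstance("MD5")
  medallions.map { medallion =>
    val bytes = medallion.getBytes(StandardCharsets.UTF_8)
    (toHex(md5.digest(bytes)), medallion)
  }
}.collectAsMap()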
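
The same attack does not need a cluster. As a minimal sketch in plain Scala, restricted to the smallest pattern (digit, letter, digit, digit) so it runs in well under a second, the lookup table can be built and used to reverse a hashed medallion. The `md5Hex` helper and the variable names are assumptions for illustration; "3N49" is the example medallion from the earlier slide.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper: MD5 digest of a string, as upper-case hex.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes(StandardCharsets.UTF_8))
    .map("%02X".format(_))
    .mkString

val digits = '0' to '9'
val alpha = 'A' to 'Z'

// Enumerate all 26,000 medallions of the form digit-letter-digit-digit.
val candidates: Seq[String] =
  for (a <- digits; b <- alpha; c <- digits; d <- digits)
    yield new String(Array(a, b, c, d))

// Digest -> medallion lookup table.
val lookup: Map[String, String] =
  candidates.map(m => md5Hex(m) -> m).toMap

// Pretend this digest was read from the published file.
val leaked = md5Hex("3N49")
val recovered = lookup.get(leaked)
println(recovered) // Some(3N49)
```

Extending `candidates` with the other two patterns covers the full 18,278,000-entry space at the cost of more time and memory.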

"In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip."
research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

Solution 1: Random Mapping
• Make an arbitrary mapping from inputs to random, unique, obfuscated identifiers
• A0AA = 348599828
• A0AB = 118201402
• …
• As secure as it is random
• Must construct complete mapping
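
A minimal sketch of Solution 1, assuming identifiers are non-negative 32-bit integers like the examples above. The `randomMapping` function is a hypothetical name, and "A0AC" is an extra input added here only to have a third row.

```scala
import java.security.SecureRandom
import scala.collection.mutable

// Assign each distinct input its own random, unique, non-negative identifier.
def randomMapping(inputs: Seq[String]): Map[String, Int] = {
  val rng = new SecureRandom()
  val distinctInputs = inputs.distinct
  val ids = mutable.LinkedHashSet.empty[Int]
  // Keep drawing until there is one unique ID per distinct input.
  while (ids.size < distinctInputs.size) ids += rng.nextInt(Int.MaxValue)
  distinctInputs.zip(ids.toSeq).toMap
}

val mapping = randomMapping(Seq("A0AA", "A0AB", "A0AC"))
mapping.foreach { case (medallion, id) => println(s"$medallion = $id") }
```

The mapping itself becomes the secret: it has to be built for every input and stored privately, because nothing but the table connects an identifier back to its input.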
Solution 2: Salting
• Hash some private transformation of the input, like: input + "salt"
• Easy, same characteristics
• MD5("3N49") is reversible since the set of possible preimages is small and obvious
• MD5("3N49"+"salt") is not reversible since the preimage is unknown without knowing "salt"
A Little Salt Goes a Long Way
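
A minimal sketch of Solution 2, following the slide's input + salt construction. The `md5Hex`, `newSalt` and `saltedToken` names are assumptions for illustration, and the literal "salt" on the slide stands in for a long random secret. In practice a keyed construction such as HMAC is the more standard way to do this; MD5 is kept here only to match the slide.

```scala
import java.nio.charset.StandardCharsets
import java.security.{MessageDigest, SecureRandom}

// Hypothetical helper: MD5 digest of a string, as upper-case hex.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes(StandardCharsets.UTF_8))
    .map("%02X".format(_))
    .mkString

// Generate a random 128-bit salt once; it must be kept private.
def newSalt(): String = {
  val bytes = new Array[Byte](16)
  new SecureRandom().nextBytes(bytes)
  bytes.map("%02x".format(_)).mkString
}

// Hash the input together with the secret salt, as on the slide.
def saltedToken(input: String, salt: String): String = md5Hex(input + salt)

val salt = newSalt()
val unsalted = md5Hex("3N49")
val salted = saltedToken("3N49", salt)
println(s"unsalted: $unsalted")
println(s"salted:   $salted")
```

The same medallion still maps to the same token under one salt, so records can be joined, but a table of hashes over the 18,278,000 plain medallions no longer matches anything.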

Tokenize, Perturb & Salt

Thank you
@sean_r_owen
sowen@cloudera.com

 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 

Fuzzy Data Leaks

  • 1. 1© Cloudera, Inc. All rights reserved. Sean Owen | Director, Data Science Fuzzy Data Leaks
  • 2. (Not Fuzzy) Data Leak: In 2007, hackers stole details of 45.7 million credit and debit cards from TJ Maxx and Marshalls. Possibly the "largest hack ever."
  • 3. Fuzzy Data Leak: ACME Co. shares this non-personally-identifiable data with a partner for market research. Where's the leak?
      City,Gender,DOB,…
      Duluth,M,18-Sep-1975,…
      Reno,F,04-Feb-1954,…
      Tacoma,F,01-Mar-1944,…
      Austin,M,09-Oct-1980,…
      Spokane,M,09-Feb-1970,…
      Reno,M,10-Oct-1980,…
      Boise,F,01-Jan-1970,…
      ...
  • 4. "About half of the U.S. population … are likely to be uniquely identified by only place, gender, date of birth …" Uniqueness of Simple Demographics in the U.S. Population, by Latanya Sweeney
  • 5. AOL Search Data Leak: Raw user input is sensitive
  • 6. In 2006, AOL released 20 million raw search queries for 658,000 users for research purposes, with users identified only by number
  • 7. User 4417749: Thelma Arnold www.nytimes.com/imagepages/2006/08/08/business/09aol-graphic.html
  • 8. Secure Raw User Input, Even Internally (diagram labels: Landing Zone, Data Hub, Tokenization, …)
  • 9. Netflix Prize Data Set: A little noise is not enough
  • 10. In 2006, Netflix released 100 million anonymized ratings of 17,770 movies from 480,189 users and offered $1M for the best recommender built from this data
  • 11. "... all customer identifying information has been removed ... only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation." Netflix | http://www.netflixprize.com/faq
  • 12. "Using [IMDB] …, we successfully identified the Netflix records of known users, uncovering their apparent political preferences …" Robust De-anonymization of Large Sparse Datasets, by Arvind Narayanan and Vitaly Shmatikov
  • 13. (diagram labels: Netflix, Netflix Prize Data, IMDB)
  • 14. Obscure movies carry orders of magnitude more information about identity. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  • 15. Perturbation doesn't hurt the ability to model much. A simple Apache Spark implementation, adding a fake movie rating for every real one, increases RMSE by ~0.1
      // Parse "item,user,rating,date" lines into MLlib Rating objects
      val dataSet = sc.textFile(".../training_set.csv").map { line =>
        val Array(item, user, rating, _) = line.split(",")
        Rating(user.toInt, item.toInt, rating.toDouble)
      }
      val allItems = dataSet.map(_.product).distinct.collect
      // For every real rating, emit it plus a fake rating of a random movie
      val perturbedData = dataSet.mapPartitions { partition =>
        val r = new scala.util.Random()
        partition.flatMap { case rating@Rating(user, _, value) =>
          val fakeItem = allItems(r.nextInt(allItems.size))
          Seq(rating, new Rating(user, fakeItem, value))
        }
      }
      // Split the perturbed data (not the original) into train and test sets
      val Array(trainSet, testSet) = perturbedData.randomSplit(Array(0.99, 0.01), 42)
      ...
      val model = ALS.train(trainSet, 100, 20, 0.05)
      ...
      println(s"RMSE = ${math.sqrt(MSE)}")
  • 16. But, why did Netflix even release movie names? "Other datasets provide it. Cinematch doesn’t currently use this data. Use it if you want." (HT Matthew Wright)
  • 17. NYC Taxi Ride Data Set: There's a wrong and right way to obfuscate
  • 18. The New York City Taxi & Limousine Commission retains detailed records of taxi rides in the city. In 2014 these records were obtained by request under the Freedom of Information Law (FOIL). Image: upload.wikimedia.org/wikipedia/commons/8/84/NYC_Hybrid_Taxi.JPG
  • 19. Sample record from the data set:
      medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
      89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
      2Y92? 5002921?
  • 20. "Why, then the world's mine oyster, Which I with sword will open." .hashCode = 1844173304
  • 21. MD5("") = D41D8CD98F00B204E9800998ECF8427E tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1
  • 22. Medallion formats: 3N49, FY985, BXJ761. 26,000 + 676,000 + 17,576,000 = 18,278,000
  • 23. Time to MD5 hash all 18,278,000 possible medallions on a 6-node Apache Spark cluster: 2 seconds
  • 24. Enumerating and hashing every possible medallion:
      val digits = '0' to '9'
      val alpha = 'A' to 'Z'
      // Format 1: digit, letter, digit, digit (e.g. 3N49)
      val nxnn = sc.parallelize(digits).flatMap(a =>
        for (b <- alpha; c <- digits; d <- digits) yield new String(Array(a, b, c, d)))
      val xx = for (a <- alpha; b <- alpha) yield Array(a, b)
      // Format 2: two letters, three digits (e.g. FY985)
      val xxnnn = sc.parallelize(xx).flatMap(prefix =>
        for (c <- digits; d <- digits; e <- digits) yield new String(prefix ++ Array(c, d, e)))
      // Format 3: three letters, three digits (e.g. BXJ761)
      val xxxnnn = sc.parallelize(xx).flatMap(prefix =>
        for (c <- alpha; d <- digits; e <- digits; f <- digits) yield new String(prefix ++ Array(c, d, e, f)))
      // Build a map from MD5 hash to medallion; toHex is defined in the Editor's Notes
      sc.union(nxnn, xxnnn, xxxnnn).mapPartitions { medallions =>
        val md5 = java.security.MessageDigest.getInstance("MD5")
        medallions.map { medallion =>
          val bytes = medallion.getBytes(java.nio.charset.StandardCharsets.UTF_8)
          (toHex(md5.digest(bytes)), medallion)
        }
      }.collectAsMap()
  • 25. "In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip." research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/
  • 26. A Little Salt Goes a Long Way
      Solution 1: Random Mapping
        • Make an arbitrary mapping from inputs to random, unique, obfuscated identifiers
          • A0AA = 348599828
          • A0AB = 118201402
          • …
        • As secure as it is random
        • Must construct complete mapping
      Solution 2: Salting
        • Hash some private transformation of the input, like: input + "salt"
        • Easy, same characteristics
        • MD5("3N49") reversible since preimage was small, obvious
        • MD5("3N49"+"salt") not reversible since preimage is unknown without knowing "salt"
  • 27. Tokenize, Perturb & Salt
  • 28. Thank you @sean_r_owen sowen@cloudera.com
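
The salting argument on slide 26 can be sketched in plain Scala, without Spark. This is a minimal illustration rather than the talk's own code: the object name `SaltDemo`, the helper `md5Hex`, and the salt value are hypothetical, and only the 26,000 medallions of the 3N49 form are enumerated to keep it quick.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object SaltDemo {
  // Uppercase hex MD5 of a string, the same format as the hashes in the taxi data.
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xFF}%02X")
      .mkString

  // All medallions of the digit-letter-digit-digit form, e.g. 3N49.
  val candidates: Seq[String] =
    for (a <- '0' to '9'; b <- 'A' to 'Z'; c <- '0' to '9'; d <- '0' to '9')
      yield s"$a$b$c$d"

  // The attacker's lookup table: hash of every candidate -> candidate.
  val table: Map[String, String] = candidates.map(m => md5Hex(m) -> m).toMap

  // Hypothetical secret; a real salt would be long, random and never published.
  val salt = "hypothetical-secret-salt"

  def main(args: Array[String]): Unit = {
    // The unsalted hash is found in the table, so the medallion is recovered.
    println(table.get(md5Hex("3N49")))
    // The salted hash is not in the table: the attacker cannot enumerate
    // preimages without knowing the salt.
    println(table.get(md5Hex("3N49" + salt)))
  }
}
```

The salt only helps while it stays private; anyone who learns it can rebuild the table in the same way.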

Editor's Notes

  1. http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/ http://www.nbcnews.com/id/17871485/ns/technology_and_science-security/t/tj-maxx-theft-believed-largest-hack-ever/
  2. Uniqueness of Simple Demographics in the U.S. Population LIDAP-WP4 Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA: 2000 (1000) by Latanya Sweeney http://www.citeulike.org/user/burd/article/5822736
  3. http://www.nytimes.com/2006/08/09/technology/09aol.html http://www.nytimes.com/imagepages/2006/08/08/business/09aol-graphic.html
  4. www.netflixprize.com
  5. Also: small perturbation in dates, ratings. Some ratings dropped.
  6. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  7. A little perturbation isn't enough
  8. www.netflixprize.com
  9. Full perturbation experiment:
      import org.apache.spark.mllib.recommendation._
      val dataSet = sc.textFile("/user/sowen/DataSets/Netflix/download/training_set/training_set.csv")
        .map { line =>
          val Array(item, user, rating, _) = line.split(",")
          Rating(user.toInt, item.toInt, rating.toDouble)
        }
      val allItems = dataSet.map(_.product).distinct.collect
      val perturbedData = dataSet.mapPartitions { partition =>
        val r = new scala.util.Random()
        partition.flatMap { case rating @ Rating(user, _, value) =>
          val fakeItem = allItems(r.nextInt(allItems.size))
          Seq(rating, new Rating(user, fakeItem, value))
        }
      }
      val Array(trainSet, testSet) = perturbedData.randomSplit(Array(0.99, 0.01), 42)
      trainSet.cache()
      testSet.cache()
      val model = ALS.train(trainSet, 100, 20, 0.05)
      val testTuples = testSet.map { case Rating(user, product, rate) => ((user, product), rate) }
      val predictions = model.predict(testTuples.keys).map { case Rating(user, product, rate) =>
        ((user, product), rate)
      }
      val MSE = testTuples.join(predictions).values.map { case (r1, r2) =>
        val err = r1 - r2
        err * err
      }.mean()
      println(s"RMSE = ${math.sqrt(MSE)}")
  10. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  11. http://chriswhong.com/open-data/foil_nyc_taxi/ https://upload.wikimedia.org/wikipedia/commons/8/84/NYC_Hybrid_Taxi.JPG
  12. https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1
  13. https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1
  14. https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1
  15. https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1
  16. Full medallion enumeration:
      val digits = '0' to '9'
      val alpha = 'A' to 'Z'
      val nxnn = sc.parallelize(digits).flatMap(a =>
        for (b <- alpha; c <- digits; d <- digits) yield new String(Array(a, b, c, d)))
      val xx = for (a <- alpha; b <- alpha) yield Array(a, b)
      val xxnnn = sc.parallelize(xx).flatMap(prefix =>
        for (c <- digits; d <- digits; e <- digits) yield new String(prefix ++ Array(c, d, e)))
      val xxxnnn = sc.parallelize(xx).flatMap(prefix =>
        for (c <- alpha; d <- digits; e <- digits; f <- digits) yield new String(prefix ++ Array(c, d, e, f)))
      def toHex(bytes: Array[Byte]) = bytes.map { b =>
        val u = b & 0xFF
        if (u < 16) "0" + u.toHexString else u.toHexString
      }.mkString.toUpperCase
      sc.union(nxnn, xxnnn, xxxnnn).mapPartitions { medallions =>
        val md5 = java.security.MessageDigest.getInstance("MD5")
        medallions.map { medallion =>
          val bytes = medallion.getBytes(java.nio.charset.StandardCharsets.UTF_8)
          (toHex(md5.digest(bytes)), medallion)
        }
      }.collectAsMap()
  17. https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/
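
The question posed on slide 3 ("Where's the leak?") can be made concrete with a small check of how many records are unique on city, gender and date of birth. The sketch below is an illustration in plain Scala, not part of the talk; the names `Uniqueness`, `Person` and `uniqueFraction` are hypothetical, and the rows are the sample rows shown on the slide.

```scala
object Uniqueness {
  // The three quasi-identifier columns from slide 3.
  final case class Person(city: String, gender: String, dob: String)

  // Sample rows copied from slide 3.
  val rows: Seq[Person] = Seq(
    Person("Duluth", "M", "18-Sep-1975"),
    Person("Reno", "F", "04-Feb-1954"),
    Person("Tacoma", "F", "01-Mar-1944"),
    Person("Austin", "M", "09-Oct-1980"),
    Person("Spokane", "M", "09-Feb-1970"),
    Person("Reno", "M", "10-Oct-1980"),
    Person("Boise", "F", "01-Jan-1970")
  )

  // Number of records sharing each (city, gender, dob) combination.
  def groupSizes(people: Seq[Person]): Map[Person, Int] =
    people.groupBy(identity).map { case (key, group) => key -> group.size }

  // Fraction of records whose combination appears exactly once,
  // i.e. records that the three columns alone single out.
  def uniqueFraction(people: Seq[Person]): Double =
    if (people.isEmpty) 0.0
    else {
      val sizes = groupSizes(people)
      people.count(p => sizes(p) == 1).toDouble / people.size
    }

  def main(args: Array[String]): Unit =
    println(uniqueFraction(rows))
}
```

Every sample row is unique on these three columns, which is exactly what lets an outside party join the "non-personally-identifiable" data against another source that carries names.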