Can Privacy Exist With
Machine Learning?
Steve Touw, Chief Technology Officer, Immuta - Gartner Cool Vendor 2018
“Data can be either useful or
perfectly anonymous but never both.”
Paul Ohm,
Broken Promises of Privacy,
57 UCLA Law Review 1701 (2010)
I know stuff about
Judd and Leslie
Judd Apatow & Leslie Mann
Photo Credit: PacificCoastNews.com
© 2017 Immuta All Rights Reserved. 3
New York Taxi &
Limousine Commission
• Data was released containing taxi pickups,
dropoffs, location, time, amount, and tip
amount, among others
• This seems pretty harmless?
Well, Judd and Leslie May
Not Think It’s Harmless
This photo was geotagged (with time), so
by simply querying by medallion and time,
we know how much Judd and Leslie tip!
This is an example of a “link attack”
Medallion & Photo Time
Medallion & Pickup Time
New York
Taxi Data
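A minimal sketch of the link attack described above, assuming the released trips carry a readable medallion and a pickup time. Every record, field name, and value here is made up for illustration; nothing comes from the real data set.

```python
from datetime import datetime, timedelta

# Made-up stand-in for the released taxi data (hypothetical field names).
trips = [
    {"medallion": "9Y99", "pickup_time": datetime(2013, 7, 8, 19, 34), "fare": 10.50, "tip": 2.10},
    {"medallion": "9Y99", "pickup_time": datetime(2013, 7, 8, 22, 5), "fare": 7.00, "tip": 0.00},
    {"medallion": "3F38", "pickup_time": datetime(2013, 7, 8, 19, 35), "fare": 22.00, "tip": 4.40},
]

# Side information read off a public photo: the cab's medallion and the time taken.
photo = {"medallion": "9Y99", "taken_at": datetime(2013, 7, 8, 19, 36)}


def link_photo_to_trips(photo, trips, window=timedelta(minutes=5)):
    """Return the trips whose medallion matches and whose pickup is near the photo time."""
    return [
        trip for trip in trips
        if trip["medallion"] == photo["medallion"]
        and abs(trip["pickup_time"] - photo["taken_at"]) <= window
    ]


matches = link_photo_to_trips(photo, trips)
for trip in matches:
    print(trip["fare"], trip["tip"])
```

The five-minute window is an arbitrary choice; the point is only that two shared fields are enough to pick one row out of the release.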
New York Actually Tried to Anonymize the Data
By hashing the medallion
But that didn’t matter…
New York
Taxi Data
Medallion & Photo Time
Pickup Time & Pickup Loc
Pickup Loc & Dropoff Loc
Dropoff Loc & Dropoff Time
Dropoff Time & Receipt
Medallion & Pickup Time
Pickup Time & Pickup Loc
Pickup Loc & Dropoff Loc
Dropoff Loc & Dropoff Time
Dropoff Time & Amount
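The chain above can be sketched without touching the medallion at all: a sighting with a time and a place is matched against pickup time and pickup location. This is an illustration under assumptions, with made-up records, hypothetical field names, and arbitrary matching thresholds.

```python
from datetime import datetime, timedelta

# Made-up released trips: the medallion is hashed, but time and place are not.
trips = [
    {"medallion_hash": "hash_A", "pickup_time": datetime(2013, 7, 8, 19, 34),
     "pickup": (40.7580, -73.9855), "dropoff": (40.7794, -73.9632), "amount": 12.60},
    {"medallion_hash": "hash_B", "pickup_time": datetime(2013, 7, 8, 19, 35),
     "pickup": (40.7061, -74.0087), "dropoff": (40.7484, -73.9857), "amount": 18.20},
    {"medallion_hash": "hash_A", "pickup_time": datetime(2013, 7, 8, 21, 10),
     "pickup": (40.7581, -73.9856), "dropoff": (40.6892, -74.0445), "amount": 31.00},
]

# Side information: someone was seen getting into a cab at this time and place.
sighting = {"time": datetime(2013, 7, 8, 19, 35), "place": (40.7581, -73.9854)}


def near(a, b, max_degrees=0.001):
    """Crude proximity check on (latitude, longitude) pairs."""
    return abs(a[0] - b[0]) <= max_degrees and abs(a[1] - b[1]) <= max_degrees


def candidate_trips(sighting, trips, window=timedelta(minutes=3)):
    """Trips whose pickup time and pickup location both match the sighting."""
    return [
        trip for trip in trips
        if abs(trip["pickup_time"] - sighting["time"]) <= window
        and near(trip["pickup"], sighting["place"])
    ]


candidates = candidate_trips(sighting, trips)
for trip in candidates:
    print(trip["dropoff"], trip["amount"])
```

Hashing one column leaves the remaining columns free to act as identifiers, which is why the hashed medallion did not stop the link.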
Remember!
Data can be either useful or perfectly
anonymous but never both.
In fact
“...just three data points were enough to
identify an even larger percentage of
people in the data set. That means that
someone with copies of just three of
your recent receipts — or one receipt,
one Instagram photo of you having
coffee with friends, and one tweet about
the phone you just bought — would have
a 94 percent chance of extracting your
credit card records from those of a
million other people”
“...one Instagram photo of you having
coffee with friends, and one tweet
about the phone you just bought…”
More data is available to us than ever, which
means link attacks become increasingly simple
It’s very easy to build profiles of individuals...
The European Union responds
General Data Protection Regulation (GDPR)
Effective May 25, 2018
Fines up to 4 percent of global annual revenue or €20 million, whichever is higher
Applies to any company collecting data on people in the EU
GDPR Article 4(1):
'personal data' means any information relating to an identified or identifiable
natural person ('data subject'); an identifiable natural person is one who can be
identified, directly or indirectly, in particular by reference to an identifier such as
a name, an identification number, location data, an online identifier or to one or
more factors specific to the physical, physiological, genetic, mental, economic,
cultural or social identity of that natural person;
In Q3 alone, we’ve seen a huge uptick in interest
from regulators in regulating data, including:
• California Consumer Privacy Act was passed in June 2018, and will take effect in 2020.
• Vermont became the first state in the nation to regulate data brokers.
• In September 2018, the Trump administration, acting through the National Telecommunications and
Information Administration, released a “Request for Comments on Developing the Administration’s
Approach to Consumer Privacy.”
• This is the first concrete illustration that a national-level privacy regulation like the GDPR is coming to the US.
• Immuta prediction: By 2020, no major economic zone will be free of an overarching data protection law.
PRIVACY
MACHINE
LEARNING
MACHINE LEARNING WILL
CHANGE THE ECONOMY AS WE
KNOW IT
It’s all about the data!
What Amazon Teaches Us About the Future
Responding to data is at the core of what Amazon does… and
why organizations across verticals need to follow its lead
• Supply chain optimization: optimize distribution, storage, routes, schedules, products
• Pricing and profit optimization: elastically tailor pricing to products and consumers
• Customer segmentation: real-time analysis to boost marketing/advertising efficiency
• Software/hardware system analytics: optimizing use and distribution of IT infrastructure globally
• Competitive analysis: automatically process billions of data points about the company, its
competitors, and new trends to create daily / hourly / real-time, automated analyses
The Newer Guys Have the Upper Hand
Low technical debt
• Futuristic software
architectures
Centralized Data
• No data silos
• Specific problem set drove data schemas
Fewer Regulatory
Controls
• Not for long!!
They are Data Agile
The Three Pillars to Data Agility
• Centralized Policy Enforcement (focus on this today)
• Rapid Access to Data
• Frictionless to Data Analysts
Centralized Policy Enforcement
Old World
• Policies managed uniquely
at each data source
• Use ETL to create “safe” versions of
data
• IT interprets legal
guidance themselves
• Audit logs are disjointed/inconsistent
New World
• Consistent layer for creating data policies
• Policies are enforced dynamically
• Plain-English policy builder usable by any
author and understandable by all
• An unprecedented list of policy logic
at your fingertips
• All actions monitored granularly and
consistently
Introducing
Immuta
Privacy Preserving Techniques
(we do a bunch, I’m only going to touch on a few here)
Right To Privacy?
• Early on photography was expensive
• Near the turn of the 20th century the masses
had general use of photography
• "instantaneous photographs and newspaper
enterprise have invaded the sacred precincts of
private and domestic life." - Samuel Warren and
Louis Brandeis (U.S. Supreme Court Justice)
• Proposed right to “be let alone”
• We generally accept being observed,
but rarely accept being identified
The End of Privacy
[as we know it]?
• Rise of technology and data science
has killed privacy as we know it
• Instead of focusing on how and
when our data is gathered...
• Privacy should now be about
how our data is being used.
Immuta can do this
The GDPR understands this!
• A cornerstone of GDPR is consent
• You should only process data for the purposes for
which your data subjects have explicitly consented
• In other words: you must consider analytical
context as a guide to what data you can see
• This is very different from role-based access controls
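The difference from role-based access can be sketched as a check keyed on the declared purpose of the analysis rather than on who is asking. This is a hypothetical illustration, not Immuta's API: the `Column` type, the purpose names, and `visible_columns` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    consented_purposes: frozenset  # purposes the data subjects agreed to


# Made-up schema with made-up consent purposes.
columns = [
    Column("trip_distance", frozenset({"billing", "route_planning"})),
    Column("card_number", frozenset({"billing"})),
    Column("pickup_location", frozenset({"route_planning"})),
]


def visible_columns(columns, declared_purpose):
    """Names of the columns that may be processed for the declared purpose."""
    return [c.name for c in columns if declared_purpose in c.consented_purposes]


# The same analyst sees different data depending on the purpose they declare.
print(visible_columns(columns, "billing"))
print(visible_columns(columns, "route_planning"))
print(visible_columns(columns, "marketing"))
```

Under a role-based model the analyst's role would return the same columns for every project; here the answer changes with the analytical context.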
Towards Practical Differential Privacy for SQL Queries
Johnson, Near, Song, Aug 2017
The Internal study
of queries at Uber
• SQL queries written by
employees at Uber
• 8.1 million queries executed
between March 2013 and
August 2016
• Broad range of sensitive data
including rider and driver
information, trip logs, and
customer support data
34% of Uber Data Science
Queries are aggregates
Statistical queries matter!
IF WE CONSIDER STATISTICAL QUERIES USEFUL, THIS CAN BE A LIE:
“Data can be either useful or perfectly
anonymous but never both.”
How?
Let’s play a game
• Think of a number between 1 and 6
• Now I’m going to ask you a question you
probably don’t want to answer in public
• Do you hide spending from your spouse?
• Now raise your hand if you thought of
a 3 OR answered yes to the above
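The arithmetic behind the game can be sketched as a simulation. It assumes everyone picks a number uniformly at random (in practice a die roll would be safer than a mental pick) and uses a made-up true rate of 30 percent; the function names are illustrative.

```python
import random


def simulate_show_of_hands(true_rate, n_people, rng):
    """Fraction of the room that raises a hand under the rules of the game."""
    raised = 0
    for _ in range(n_people):
        hides_spending = rng.random() < true_rate  # the private truth
        thought_of_three = rng.randint(1, 6) == 3  # the random cover story
        if thought_of_three or hides_spending:
            raised += 1
    return raised / n_people


def estimate_true_rate(raised_fraction, p_three=1 / 6):
    """Invert P(raise) = p_three + (1 - p_three) * true_rate."""
    return (raised_fraction - p_three) / (1 - p_three)


rng = random.Random(0)
raised_fraction = simulate_show_of_hands(true_rate=0.30, n_people=100_000, rng=rng)
estimate = estimate_true_rate(raised_fraction)
print(round(raised_fraction, 3), round(estimate, 3))
```

Nobody in the room can tell why a given hand went up, yet the aggregate rate is recoverable. In this simple version only a raised hand is deniable; a lowered hand reveals a "no", so it illustrates the idea rather than providing a formal guarantee.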
This is Differential Privacy
• I protected your privacy by providing plausible deniability
• But I can also estimate the percentage of people who hide spending from their
spouse because I know the probability of you selecting a 3
• Differential Privacy is restricted to only statistical queries and adds the appropriate
amount of noise based on the sensitivity of the question
• ‘Differential privacy formalizes the idea that a "private" computation should not reveal
whether any one person participated in the input or not, much less what their data are.’
- [Frank McSherry] (https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
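One standard way to add noise scaled to sensitivity is the Laplace mechanism, sketched here for a counting query. The slides do not say which mechanism Immuta uses, so treat this as an assumption; the data, the `epsilon` value, and the function names are made up.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, predicate, epsilon, rng):
    """Answer a counting query with noise calibrated to its sensitivity."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Made-up tip amounts; the query asks how many riders tipped nothing.
tips = [0.0, 2.1, 0.0, 4.4, 1.0, 0.0, 3.5, 0.0]
rng = random.Random(0)
answer = noisy_count(tips, lambda tip: tip == 0.0, epsilon=0.5, rng=rng)
print(round(answer, 2))
```

A smaller `epsilon` means more noise and stronger privacy; a query with higher sensitivity (a sum of tips, say) would need proportionally more noise than a count.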
How Could NYT Have Done it?
Localized Sensitivity
How do we do it?
Simple… In plain English everyone can understand
Can Privacy and
Machine Learning
Exist Together?
We believe they can;
data agility is what
you need
Questions
steve@immuta.com
@steve_touw
www.immuta.com
Come visit our Booth #729!
