GDPR and Hadoop
The elephant in the room
Janosch Woschitz
2017-09-27
Agenda
• GDPR Overview
• Rights of the data subject
• Challenges within Hadoop ecosystem
• Technical considerations
Disclaimer
Take it with a grain of salt
• Complex and detailed topic
• This is NOT legal advice
• There are a lot of opinions and interpretations about GDPR
• This talk does not cover all aspects of GDPR
• Process matters, documentation is your friend
General Data Protection Regulation
GDP what?
“Regulation (EU) 2016/679 of the European Parliament [...] on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)”
- Official title of the GDPR, http://eur-lex.europa.eu/eli/reg/2016/679/oj
• Establishes data protection as a fundamental right
• Creates unified data protection law for all EU member states
• Enables EU citizens to be in control of their personal data
General Data Protection Regulation
Who is affected?
• Applies if the data controller or processor (organization) or the data subject (person) is based in the EU
• Applies to organizations based outside the European Union if they process or monitor personal data of individuals in the EU
• Employees might be EU citizens as well
General Data Protection Regulation
When does it happen?
• Officially published on May 4th 2016
• Applicable from May 25th 2018 across the EU (including UK)
• “Regulation” instead of “Directive” → no need for national implementing legislation, directly applicable to all EU countries
• To be evaluated and reviewed by May 25th 2020
General Data Protection Regulation
Why should I care?
• Better data protection and portability for consumers
• Fines for non-compliance will be
– up to €10M or 2% of worldwide annual turnover, whichever is higher, for minor violations
– up to €20M or 4% of worldwide annual turnover, whichever is higher, for major violations
• Any individual has the right to raise a complaint against any organisation (Art. 77)
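As a rough worked example of the fine ceilings, assuming the upper limit is whichever is higher of the fixed amount and the share of worldwide annual turnover (Art. 83). The function name and the sample turnover figures are illustrative only:

```python
def max_fine_eur(annual_turnover_eur: int, major: bool) -> int:
    """Upper bound of a fine: the higher of a fixed cap and a turnover share."""
    # Assumption: 20M / 4% for "major" and 10M / 2% for "minor" violations.
    fixed_cap, percent = (20_000_000, 4) if major else (10_000_000, 2)
    return max(fixed_cap, annual_turnover_eur * percent // 100)


# Hypothetical company with EUR 2 billion turnover: the percentage dominates.
print(max_fine_eur(2_000_000_000, major=True))   # 80000000
print(max_fine_eur(2_000_000_000, major=False))  # 40000000
# Hypothetical company with EUR 5 million turnover: the fixed cap dominates.
print(max_fine_eur(5_000_000, major=True))       # 20000000
```

These are ceilings, not expected fines; the actual amount is set case by case by the supervisory authority.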
Privacy by design
Better data protection, you said?
• Privacy by design and by default, essential data protection
• Breach notification within 72 hours
• Data minimization and access limitation
• Data Protection Officer (DPO) and Data Protection Impact Assessments (DPIAs)
• Active, specific and unambiguous consent
“the controller shall [...] implement appropriate technical and organisational measures [...] in an effective manner [...] in order to meet the requirements of this Regulation and protect the rights of data subjects.” - Article 25, GDPR
Personal data?
https://pixabay.com/en/family-drawing-children-cat-paper-879432/
Personal data (examples)
It all depends on context
• Location or web surfing data
• Video surveillance and images
• Personal interests or behavioural patterns
• A child's drawing depicting its family
• Publication of x-ray plates together with the patient's first name
• Damage caused by graffiti in public transportation
• X1234 drinks a glass of wine more than 3 times a week, drives a Bentley and has a Windows 10 phone
Data citizen rights
Rights of the data subject
• Right of access and data portability
– free of charge
– structured, commonly used and machine readable
• Right to erasure
– “without undue delay”
• Right to object, to restrict, to rectify, ...
Source: Facebook
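A minimal sketch of what a "structured, commonly used and machine readable" export could look like, assuming JSON as the target format. The function name, the record layout and the second sample record are made up for illustration:

```python
import json


def export_subject_data(records, user_id):
    """Collect all records of one data subject and serialise them as JSON."""
    own = [r for r in records if r.get("userId") == user_id]
    return json.dumps({"userId": user_id, "records": own}, indent=2, sort_keys=True)


# Sample data; the second record is invented for the example.
records = [
    {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"},
    {"userId": 456, "firstName": "Alex", "dateOfBirth": "1990-05-17"},
]
print(export_subject_data(records, 123))
```

On a real cluster the hard part is not the serialisation but finding every copy of the subject's data across snapshots, topics and derived data sets.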
GDPR and Hadoop
Hadoop ecosystem & beyond
The known Hadoopverse (excerpt)
and much more ...
Data processing on Hadoop
Bird’s eye view
• Various data sources and ingestion tools
• Diverse input formats, structured & unstructured
• Diverse processing tools
• Liberal data access, local data science
• Write-append and immutable data structures
• Redundant data
Ingest → Process → Access
Challenges by example
• Customer data from RDBMS to HDFS
• Streaming device location data to Kafka
Challenges by example
Ingest table from RDBMS
[Diagram: daily import (e.g. via sqoop) of a customer table; every snapshot (today, -1 day, -2 days) holds another copy of the same record. Labels: Big Data, Smaller Data]
“userId”: 123
“firstName”: “Janosch”
“dateOfBirth”: “1984-01-01”
Problems & Solution approaches
Problems
• Right to be forgotten
• Access limitation
• Bound to consent
• ...
Solution approaches
• Anonymization
• Hashing
• Encryption
• ...
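To make the hashing approach concrete, here is a minimal sketch that replaces an identifier with a keyed hash (HMAC-SHA256) and coarsens the date of birth. The secret, the function name and the choice of fields are assumptions for illustration, not something the slides prescribe:

```python
import hashlib
import hmac

# Assumption: the secret lives in a key store outside the cluster.
SECRET_KEY = b"replace-with-a-secret-from-a-key-store"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash (HMAC-SHA256) of an identifier."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


record = {"userId": 123, "firstName": "Janosch", "dateOfBirth": "1984-01-01"}
masked = {
    "userId": pseudonymize(str(record["userId"])),  # still joinable across data sets
    "birthYear": record["dateOfBirth"][:4],         # coarsened instead of exact date
}
print(masked)
```

Note that a keyed hash is pseudonymization rather than anonymization: whoever holds the key can recompute the mapping, so the result may still count as personal data.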
Challenges by example
Encrypt, a.k.a. Lost Key Pattern
[Diagram: daily import (e.g. via sqoop) into snapshots for today, -1 day and -2 days, with a key labelled 123]
Plain record:
“userId”: 123
“firstName”: “Janosch”
“dateOfBirth”: “1984-01-01”
Encrypted record:
“userId”: 123
“firstName”: “54DCF13E4...”
“dateOfBirth”: “D3DFBCE...”
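The lost key pattern can be sketched as follows: each user gets an own key, personal fields are encrypted before they land in immutable snapshots, and an erasure request deletes only the key. Everything here is an assumption for illustration: the `KeyStore` class is hypothetical (in practice an external KMS or vault), and the SHAKE-256 keystream is a toy stand-in for a vetted cipher such as AES-GCM.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical per-user key store; stands in for an external KMS or vault."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create the key on first use, reuse it afterwards.
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        self._keys.pop(user_id, None)


def _keystream(key, nonce, length):
    # Toy keystream for the demo only, NOT a production cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    cipher = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    return (nonce + cipher).hex()


def decrypt(key, token):
    raw = bytes.fromhex(token)
    nonce, cipher = raw[:16], raw[16:]
    plain = bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher))))
    return plain.decode("utf-8")


store = KeyStore()
snapshot = {"userId": 123, "firstName": encrypt(store.key_for(123), "Janosch")}
print(decrypt(store.get(123), snapshot["firstName"]))  # Janosch

store.forget(123)      # erasure request: only the key is deleted
print(store.get(123))  # None, the snapshot can no longer be decrypted
```

The snapshots themselves are never rewritten, which is the appeal on write-append storage. The trade-off is that the key store becomes the critical system, including its own backups.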
Challenges by example
Deletion in log based systems
[Diagram: Edge device → Kafka topic → Consumer]
Edge device, deviceId: 123, pushes data to Kafka topic:
“deviceId”: 123
“lat”: 52.510781
“lon”: 13.371735
Kafka topic (offset: key, value):
2: 456, A
3: 123, B
4: 123, C
5: 123, D
6: 123, ∅
Consumer: B, C, D, ∅
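The ∅ record can be read as a tombstone: a message with a key and an empty value. Assuming the topic uses log compaction (an assumption, the slide does not say so), the effect can be simulated in a few lines. This is a plain Python model of the idea, not Kafka client code:

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is a tombstone."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = [(off, key, val) for key, (off, val) in latest.items() if val is not None]
    return sorted(survivors)


# (offset, key, value); None models the tombstone shown as ∅ on the slide.
log = [(2, 456, "A"), (3, 123, "B"), (4, 123, "C"), (5, 123, "D"), (6, 123, None)]
print(compact(log))  # [(2, 456, 'A')]
```

In Kafka itself compaction runs asynchronously and tombstones are retained for a while (`delete.retention.ms`), so a consumer can still read B, C and D before the cleaner has run, and anything already copied downstream is unaffected.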
Challenges by example
Encrypt on write
[Diagram: Edge device → Kafka topic → Consumer]
Edge device, deviceId: 123, pushes data to Kafka topic:
“deviceId”: 123
“lat”: 52.510781
“lon”: 13.371735
Kafka topic, offsets 1 to 5 (key, encrypted value): (123, D4), (123, Z3), (456, T3), (123, 6H), (123, N7)
Consumer: A, B, C, D
Key for 123: ?
Vendor recommendations
Distributions to the rescue!
• Hortonworks - "GDPR: The Good, Bad and Ugly", Jun 20 2017
• Cloudera - "Simplify your response to GDPR", Aug 24 2017
• GDPR compliance via partner solutions
• Only partial answers
Source: Cloudera Inc.
GDPR recommendations simplified
Kudu
Sentry
Navigator
Data Science Workbench
HDFS / ...
Ranger
Atlas
Zeppelin
+ lots of partner solutions
Data privacy and open source
Pragmatic considerations
• Secured cluster
• Raw data in encryption zones with very limited access
• Anonymize for further processing wherever possible
• Define proper retention policies, batch delete requests and perform regular clean-ups
• Integrate with Atlas and Ranger → tagging, filtering and masking
• Custom solutions for glue and missing pieces
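Retention policies and batched delete requests could be sketched like this, assuming daily snapshot partitions and a queue of user ids that asked for erasure. All function names, the 30-day retention and the sample dates are hypothetical:

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, today, retention_days):
    """Return the snapshot dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in partition_dates if d < cutoff]


def apply_erasure_batch(snapshot, erase_ids):
    """Rewrite one snapshot without the records of users who asked for erasure."""
    erase_ids = set(erase_ids)
    kept = [r for r in snapshot if r["userId"] not in erase_ids]
    return kept, len(snapshot) - len(kept)


partitions = [date(2017, 9, 27), date(2017, 9, 26), date(2017, 8, 1)]
print(expired_partitions(partitions, today=date(2017, 9, 27), retention_days=30))
# [datetime.date(2017, 8, 1)]

snapshot = [{"userId": 123, "firstName": "Janosch"}, {"userId": 456, "firstName": "Alex"}]
kept, removed = apply_erasure_batch(snapshot, erase_ids=[123])
print(kept, removed)  # [{'userId': 456, 'firstName': 'Alex'}] 1
```

Batching matters because immutable files have to be rewritten as a whole: collecting requests and rewriting each remaining partition once per cycle is far cheaper than one rewrite per request.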
Summary
The road ahead
• No comprehensive open-source solution available
• Proprietary services target specific problem domains, integration still necessary
• It will take some time until the legal dust has settled
• Idea: Avro (logical types) + Vault (or similar) + Ranger + Atlas?
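One possible reading of the Avro part of that idea is to mark personal fields in the schema itself, so that ingestion code knows what to encrypt or mask. The sketch below assumes a made-up logical type name `pii-string`; it is not a standard Avro logical type, and Avro implementations are expected to ignore unknown logical types and fall back to the underlying type:

```python
# Avro schema as a Python dict; "pii-string" is a hypothetical marker.
SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "userId", "type": "long"},
        {"name": "firstName", "type": {"type": "string", "logicalType": "pii-string"}},
        {"name": "dateOfBirth", "type": {"type": "string", "logicalType": "pii-string"}},
    ],
}


def pii_fields(schema, marker="pii-string"):
    """Names of top-level fields whose type carries the given logicalType."""
    return [
        f["name"]
        for f in schema["fields"]
        if isinstance(f["type"], dict) and f["type"].get("logicalType") == marker
    ]


print(pii_fields(SCHEMA))  # ['firstName', 'dateOfBirth']
```

The returned field names could then drive per-user encryption with keys from Vault (or similar), while Atlas tags and Ranger policies cover access to the same fields.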
© 2017 Teradata
Hadoop Security Primer
In just one slide
• Authentication - Kerberos
• Authorization - Ranger, Sentry, ACLs
• Auditing / Monitoring - Ranger, Navigator, ...
• Encryption of data in motion - TLS/SSL, SASL, ...
• Encryption of data at rest - Encryption zones (KMS), Navigator, SEDs, ...
Further reading:
• Hadoop Security (Ben Spivey, Joey Echeverria)
• Hadoop and Kerberos: The Madness beyond the Gate
Personal data
According to GDPR
“any information relating to an identified or identifiable natural person (‘data
subject’);
An identifiable natural person is one who can be identified, directly or indirectly,
in particular by reference to an identifier such as a name, an identification
number, location data, an online identifier or to one or more factors specific to
the physical, physiological, genetic, mental, economic, cultural or social identity
of that natural person.”
- Article 4, GDPR
