How to Implement a GDPR Solution in a Cloudera Architecture
Cloudera + StreamSets Data Collector

Introduction
Since the GDPR came into force, data processors around the world have struggled to
become compliant while also dealing with a new reality in Big Data: data is
constantly drifting and mutating.
The approach in this presentation is based on:
● Cloudera architecture
● No additional financial cost
● Masking & Encrypting

Architecture
This architecture enables the following:
● Discover Data - ETL
● Data Governance - Policies
● Protect - Anonymization

Assumptions
1. Cluster hostname: cm1.localdomain
2. StreamSets Data Collector Linux user: sdc
a. Kerberos principal: sdc/cm1.localdomain@DOMAIN.COM
3. Cluster Cloudera Manager: Cloudera Manager 5.12.2
4. StreamSets Data Collector: installed
5. Cluster authentication (pre-installed): Kerberos
a. Kerberos realm: DOMAIN.COM
6. StreamSets Data Collector version: 3.5.0
7. StreamSets Data Collector authentication: Kerberos

Base of Knowledge
Under the GDPR it is difficult to find the right balance
between protecting data and using it. With the solution
in this presentation you not only run the ETL on your data
lake but also protect all the sensitive data.
To do this, you must know your data and its purpose, as
well as the architecture of your system environment, and
then choose one or more ways to handle it.

JDBC Connections
JDBC connections in StreamSets origins require additional drivers.
Hive and Impala JDBC origin connections need the Cloudera drivers (HiveJDBC41.jar, ImpalaJDBC41.jar); an Oracle connection needs
ojdbc8.jar.
JDBC Connection | Security | URL
Hive Metadata | None | jdbc:hive2://server:10000/db_name
Hive Metadata | Kerberos (static) | jdbc:hive2://server:10000/db_name;principal=hive/server@REALM.COM
Hive (Origin) | Kerberos (user) | jdbc:impala://server:10000;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=hive
Impala (Origin) | None | jdbc:impala://server:21050;AuthMech=0
Impala (Origin) | Kerberos | jdbc:impala://server:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=impala
Oracle (Origin) | Static | jdbc:oracle:thin:@server:listener_port:db
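
The URL patterns listed here can be assembled from their parts. The following Python sketch is only an illustration: the function names (hive_url, impala_url) and their parameters are hypothetical helpers, not part of StreamSets or the Cloudera drivers, and only the URL layouts come from this slide.

```python
def hive_url(host, port=10000, database="default", principal=None):
    """Build a HiveServer2 JDBC URL (hypothetical helper).

    When a Kerberos service principal is given, it is appended as
    ';principal=...', matching the 'Kerberos (static)' pattern.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if principal:
        url += f";principal={principal}"
    return url


def impala_url(host, port=21050, realm=None, host_fqdn=None, service="impala"):
    """Build an Impala JDBC URL (hypothetical helper).

    Without a realm the URL uses AuthMech=0 (no authentication);
    with a realm it uses AuthMech=1 (Kerberos) plus the Krb* properties.
    """
    base = f"jdbc:impala://{host}:{port}"
    if realm is None:
        return f"{base};AuthMech=0"
    return (
        f"{base};AuthMech=1;KrbRealm={realm};"
        f"KrbHostFQDN={host_fqdn or host};KrbServiceName={service}"
    )


if __name__ == "__main__":
    print(hive_url("server", database="db_name",
                   principal="hive/server@REALM.COM"))
    print(impala_url("server", realm="REALM.COM",
                     host_fqdn="server.example.com"))
```

Keeping the security-specific properties in one place makes it easier to switch a pipeline between an unsecured test cluster and a Kerberized one without retyping the whole URL.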

Use Case
Ingest and Mask Data from Local File System to Hive Metadata.

Data Description
Sample Data
Synopsis
● Dataset: title.basics.tsv from the IMDb datasets (https://www.imdb.com/interfaces/)
● Read the tab-delimited data file from the local file system
● Mask the field primaryTitle with a regular expression
● Create a Hive external table and write to HDFS
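
As a rough illustration of the read and table-creation steps, the Python sketch below parses tab-delimited text and generates a Hive external table statement from the header. In the presentation these steps are handled by the StreamSets pipeline; here the sample row is made up, only three columns are included, and the table name, HDFS location, all-STRING column types and text storage format are assumptions.

```python
import csv
import io

# Made-up sample in the tab-delimited layout of the file (a subset of columns).
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt0000000\tshort\tExample Title\n"
)


def read_rows(text):
    """Parse tab-delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in titles as plain text
    )
    return list(reader)


def external_table_ddl(table, columns, location):
    """Generate a CREATE EXTERNAL TABLE statement with every column as STRING."""
    cols = ",\n  ".join(f"`{name}` STRING" for name in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    rows = read_rows(SAMPLE)
    # Hypothetical table name and HDFS path.
    print(external_table_ddl("gdpr.title_basics", rows[0].keys(),
                             "/data/gdpr/title_basics"))
```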

Data Result
Sample Data
Synopsis
● Regular expression: (.*)(.{4})
● Irreversible
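
To show how the expression (.*)(.{4}) splits a value, here is a minimal Python sketch. The slide gives only the expression and states that the mask is irreversible; which group stays visible and which mask character is used are assumptions here (the last four characters are kept and everything before them is replaced with "x"). The function name mask_title is hypothetical.

```python
import re

# Group 1 captures everything except the last four characters,
# group 2 captures the last four characters.
PATTERN = re.compile(r"(.*)(.{4})", re.DOTALL)


def mask_title(value, mask_char="x"):
    """Replace group 1 with mask characters and keep group 2 (assumed choice)."""
    match = PATTERN.fullmatch(value)
    if match is None:
        # Fewer than four characters: nothing matches, so mask the whole value.
        return mask_char * len(value)
    hidden, shown = match.groups()
    return mask_char * len(hidden) + shown


if __name__ == "__main__":
    print(mask_title("Carmencita"))  # xxxxxxcita
```

The original characters are discarded rather than transformed, so the stored value cannot be turned back into the title, which is what makes this kind of mask irreversible.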

Use Case
Ingest and Encrypt Data from Local File System to Hive Metadata.

Data Result
Sample Data
Synopsis
● Regular expression: (.*)(.{4})
● Reversible

Thanks
Big Data Engineer
Tiago Simões
