Implement a GDPR-compliant data solution using Cloudera and StreamSets
1. How to implement a GDPR solution in a Cloudera architecture
Cloudera + StreamSets Data Collector
2. Introduction
Since the GDPR took effect, data processors around the world have struggled to become GDPR compliant while also dealing with the new reality of Big Data: data is constantly drifting and mutating.
In this presentation, the approach is based on:
● Cloudera architecture
● No additional financial cost
● Masking & Encrypting
4. Pre-Assumptions
1. Cluster hostname: cm1.localdomain
2. StreamSets Data Collector Linux user: sdc
a. Kerberos principal: sdc/cm1.localdomain@DOMAIN.COM
3. Cloudera Manager version: 5.12.2
4. StreamSets Data Collector: installed
5. Cluster authentication (pre-installed): Kerberos
a. Kerberos realm: DOMAIN.COM
6. StreamSets Data Collector version: 3.5.0
7. StreamSets Data Collector authentication: Kerberos
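As a minimal sketch, the assumptions above can be kept in one place and the Kerberos principal derived from them. The class and method names are hypothetical, chosen only for illustration; the values are the ones listed on this slide, and the principal follows the usual primary/instance@REALM form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterAssumptions:
    """Environment values assumed in this presentation (hypothetical holder class)."""
    hostname: str = "cm1.localdomain"
    sdc_user: str = "sdc"
    kerberos_realm: str = "DOMAIN.COM"
    cloudera_manager_version: str = "5.12.2"
    sdc_version: str = "3.5.0"

    def sdc_principal(self) -> str:
        # A Kerberos service principal has the form primary/instance@REALM.
        return f"{self.sdc_user}/{self.hostname}@{self.kerberos_realm}"


env = ClusterAssumptions()
print(env.sdc_principal())
```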
5. Base of Knowledge
With the GDPR it is difficult to find the right balance between protecting data and using it. With the solution in this presentation you will not only run ETL on your data lake but also protect all the sensitive data.
To do that, you must know your data, its purpose, and the architecture of your system environment, and then choose one or more ways to handle it.
6. JDBC Connections
JDBC connections in StreamSets origins require additional drivers.
For a Hive or Impala JDBC origin connection you need the Cloudera drivers (HiveJDBC41.jar, ImpalaJDBC41.jar); for an Oracle connection you need ojdbc8.jar.
| JDBC Conn | Security | URL |
|---|---|---|
| Hive Metadata | None | jdbc:hive2://server:10000/db_name |
| Hive Metadata | Kerberos (static) | jdbc:hive2://server:10000/db_name;principal=hive/server@REALM.COM |
| Hive (Origin) | Kerberos (user) | jdbc:impala://server:10000;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=hive |
| Impala (Origin) | None | jdbc:impala://server:21050;authMech=0 |
| Impala (Origin) | Kerberos | jdbc:impala://server:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=server.example.com;KrbServiceName=impala |
| Oracle (Origin) | Static | jdbc:oracle:thin:@server:listener_port:db |
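The URL patterns above can be assembled programmatically, which helps avoid typos when switching between secured and unsecured clusters. This is only a sketch: the function names and parameters are hypothetical, and the property spellings are copied from the URLs listed on this slide rather than taken from the driver documentation.

```python
def hive_url(server, db_name, port=10000, principal=None):
    """Hive metadata URL; pass `principal` for the Kerberos (static) variant."""
    url = f"jdbc:hive2://{server}:{port}/{db_name}"
    if principal:
        url += f";principal={principal}"
    return url


def impala_url(server, port=21050, realm=None, host_fqdn=None, service="impala"):
    """Impala-driver URL; without `realm` it falls back to the no-auth form."""
    base = f"jdbc:impala://{server}:{port}"
    if realm is None:
        # Spelled as on the slide for the unsecured case.
        return f"{base};authMech=0"
    return (
        f"{base};AuthMech=1;KrbRealm={realm}"
        f";KrbHostFQDN={host_fqdn};KrbServiceName={service}"
    )


def oracle_url(server, listener_port, db):
    """Oracle thin-driver URL in host:port:db form."""
    return f"jdbc:oracle:thin:@{server}:{listener_port}:{db}"


print(hive_url("server", "db_name", principal="hive/server@REALM.COM"))
print(impala_url("server", realm="REALM.COM", host_fqdn="server.example.com"))
print(oracle_url("server", 1521, "db"))
```

The Hive (Origin) entry from the list can be reproduced with `impala_url("server", port=10000, realm="REALM.COM", host_fqdn="server.example.com", service="hive")`. The Oracle port 1521 in the usage lines is an assumed example value, not one given in this presentation.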
8. Data Description
Sample Data
Synopsis
● Dataset: title.basics.tsv from the IMDb datasets (https://www.imdb.com/interfaces/)
● Read the tab-delimited data file from the local file system
● Mask the field primaryTitle with a regular expression
● Create a Hive External Table and Write to HDFS
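The masking step in the synopsis can be illustrated outside StreamSets with a few lines of Python. This is a sketch under stated assumptions: the presentation does not give the regular expression at this point, so the rule used here (replace every letter, digit, or underscore with "x") is a placeholder, the helper names are hypothetical, and the sample row is made up and abbreviated to four columns.

```python
import io
import re

# Assumed masking rule (not from the presentation): every word character becomes "x".
MASK_CHAR_PATTERN = re.compile(r"\w")


def mask_value(value: str) -> str:
    """Mask one field value, leaving spaces and punctuation untouched."""
    return MASK_CHAR_PATTERN.sub("x", value)


def mask_tsv(lines, field="primaryTitle"):
    """Yield tab-delimited lines with `field` masked; the first line is the header."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    idx = header.index(field)
    yield "\t".join(header)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        cols[idx] = mask_value(cols[idx])
        yield "\t".join(cols)


# Hypothetical sample row; the real file has more columns.
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\n"
    "tt0000000\tshort\tExample Title\tExample Title\n"
)
for row in mask_tsv(sample):
    print(row)
```

Only primaryTitle is masked here, as in the synopsis. Other columns such as originalTitle pass through unchanged, so they would need their own rule if they also carry sensitive values.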