1 © Hortonworks Inc. 2011–2018. All rights reserved
GDPR Compliance Application
Architecture and Implementation
using Hadoop and Streaming
Saurabh Mishra and Arun Thangamani
System Architects, Professional Services, Hortonworks
Who are We?
Saurabh Mishra
Systems Architect, Hortonworks Professional Services
@draftsperson
Arun Thangamani
Systems Architect, Hortonworks Professional Services
@ArunThangamani
GDPR Overview
GDPR Regulation from a Technical Standpoint
[Diagram] Actors around Personal Data: Data Subject, Controller, Processor, Supervisory Authority, and Users / 3rd Parties.
Technical obligations: Authenticate, Authorize, Audit Access; Enforce Processing with Specific Purpose(s).
GDPR - Quick Summary
• Data subject's rights over their own data
  • Access, Rectification, Erasure, Portability, Objection
• Data subject's data types
  • Includes identifiers, biometric and genetic data
• Usage of data subject's data
  • Enforce processing of data with specific purpose(s)
• Data subject's data – other specifications
  • Low overhead to correct data
  • Erasing data even from immutable systems
• Data subject data protection, audit and reporting
  • Report any breaches within 72 hours
• Authentication and authorization rules
  • Minimize anonymous access
  • Prevent unauthorized access
  • Audit any access to data
• Tracking and copy prevention
  • Prevent copies and transfer of data
  • Track personal data movement
• Enforced consent seeking
  • Enforced consent during authorization
  • Even in the case of cross-sell, upsell and data mining
GDPR – Application Design
Company XYZ – Efficient GDPR Implementation using Hadoop
[Diagram] Before: Data Center - 1 … Data Center - N each run RDBMS silos with a schema per customer (Schema for Customer-1 … Schema for Customer-N) holding GDPR-regulated data: User Content Transaction Tables (UCTT) and Audit Tables (AT) serving user applications. The audit content records 1) who did what and 2) what was done, at volumes in the tens of petabytes retained from a few years to a few decades.
After: each data center runs RDBMS plus HDP and HDF; NiFi and Kafka stream the Audit Tables out of the RDBMS into HDP.
Company XYZ – GDPR Applications Design per Data Center
[Diagram] GDPR Audit Table flows, Data Center - 1: RDBMS Silo-1 … Silo-N publish to Kafka; NiFi/Spark applications land the events in Hive staging tables; purge/merge logic running on YARN in the HDP/HDF cluster maintains the audit tables; HiveServer2 applications serve queries.
Use cases:
• Ingest - GDPR Audit Data Ingestion
• Update - GDPR Right to be Forgotten
• Report - GDPR Audit Reports
Data Ingestion Internals - Kafka
[Diagram] Kafka Producers 1-3 write to Kafka Brokers 1-3 and Kafka Receivers 1-3 consume from them; brokers serve both paths through the OS page cache. Topic partitions (T1 P3, T1 P4, T1 P6, T2 P1, T4 P5, T4 P8, T4 P12, T4 P15) are spread across the brokers. Zookeeper Nodes 1-N coordinate cluster state under the znodes /controller, /topics, /admin, /consumer, /broker and /kafka-acl.
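The partition spread shown above (T1 P3, T4 P5, …) comes from the producer mapping each record key to a partition. A minimal Python sketch of the idea follows; Kafka's real default partitioner uses murmur2 hashing, and this stand-in uses `hashlib` only to make the mapping deterministic, so it is purely illustrative:

```python
# Simplified stand-in for Kafka's keyed partitioner (real clients use murmur2).
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a record key to one of the topic's partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records sharing a key (e.g. one data subject's ID) always land in the same
# partition, which preserves per-key ordering across brokers.
p = choose_partition("customer-42", 16)
assert p == choose_partition("customer-42", 16)
assert 0 <= p < 16
```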
Data Ingestion Internals - Nifi
[Diagram] Kafka receivers (KR-1, KR-2, KR-3) feed a Merge Content stage, then PHS processors (PHS 1 … PHS N) stream data into Hive (metastore backed by SQL Server) and on to HDFS (NameNode, Data Node 1 … Data Node N). Each NiFi node keeps an in-memory flow file cache and three on-disk repositories: Flow File Repository (RAID-10, 4 disks), Content File Repository (RAID-10, 6 disks) and Provenance Repository (RAID-10, 4 disks).
Content File Repo details:
• Ordered, pushed and retrieved to/from disk
• Batch size default 10,000
• Copy on write
• Pass by reference
• Flow files copied/moved to relevant queues
• Content stored in container/section files addressed by offsets; a Content Claim points into a Resource Claim
Provenance Repository details:
• Partitioned logs compacted into a merged log
• Indexed by Lucene shards (1 … N)
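The "pass by reference" and offset-addressed container files above can be sketched as follows. This is an illustrative toy, not NiFi's actual classes: many payloads are appended into one resource-claim file, and each flow file carries only an (offset, length) content claim into it, so moving a flow file between queues never copies bytes.

```python
class ResourceClaim:
    """One append-only container file; flow files reference slices of it."""

    def __init__(self):
        self.data = bytearray()

    def append(self, payload: bytes):
        # Returns a "content claim": (offset, length) into this resource claim.
        offset = len(self.data)
        self.data.extend(payload)
        return (offset, len(payload))

    def read(self, claim):
        offset, length = claim
        return bytes(self.data[offset:offset + length])

repo = ResourceClaim()
claim_a = repo.append(b"audit-event-1")
claim_b = repo.append(b"audit-event-2")
# Queues would hold claim_a / claim_b (references), never the bytes themselves.
assert repo.read(claim_a) == b"audit-event-1"
assert claim_b == (13, 13)
```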
Efficient “Right to be Forgotten” - Hive ACID - Merged Reads
[Diagram] Table A is partitioned by date (Partition=2018-03-15 … Partition=2018-03-18); each partition has buckets 1-5, each with base files.
• Merge on Write: Updates => Delete & Re-Create the base files. Simple reads, heavy writes.
• Merge on Read: Updates => Base & Delta files => Merged reads. Simple writes, heavier reads.
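A toy model of the merge-on-read side, under the simplifying assumption of a row_id → value table: the base file stays immutable, and accumulated delta records are applied on the fly at read time.

```python
def merged_read(base: dict, deltas: list) -> dict:
    """base: row_id -> value; deltas: list of ('upsert'|'delete', row_id, value)."""
    view = dict(base)   # the base file itself is never modified
    for op, row_id, value in deltas:
        if op == "delete":
            view.pop(row_id, None)
        else:
            view[row_id] = value
    return view

base = {1: "alice", 2: "bob", 3: "carol"}
deltas = [("delete", 2, None), ("upsert", 3, "carol-corrected")]
assert merged_read(base, deltas) == {1: "alice", 3: "carol-corrected"}
assert base[2] == "bob"   # base untouched until compaction rewrites it
```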
Efficient Data Lifecycle Internals – Hive ACID with Merged Reads
[Diagram] Life of Table A, single partition, buckets 1-5:
• NIFI ingest using Hive ACID => tall base and short delta files
• Purge & update requests using Hive Merge => tall base and many short delta files
• Compaction process => tall NEW base files after compaction
• Reporting using Hive LLAP
• Merge on Read processing within containers: each mapper (Mapper-1 … Mapper-6) works within one bucket, reading both the base ORC file stripes and the delta ORC files to provide the answer.
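The one-mapper-per-bucket parallelism above can be sketched with a hypothetical `bucket_of` helper. Hive actually buckets rows by hashing the clustering column; plain modulo is used here only for illustration.

```python
def bucket_of(row_id: int, num_buckets: int = 5) -> int:
    # Illustrative stand-in for Hive's bucketing hash.
    return row_id % num_buckets

rows = [(i, f"v{i}") for i in range(12)]
buckets = {}
for row_id, value in rows:
    buckets.setdefault(bucket_of(row_id), []).append((row_id, value))

# Mapper-1 can now merge bucket 0's base stripes and deltas, Mapper-2
# bucket 1's, and so on, with no coordination between mappers.
assert sorted(buckets) == [0, 1, 2, 3, 4]
assert buckets[0] == [(0, "v0"), (5, "v5"), (10, "v10")]
```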
Efficient “Right to be Forgotten” using Hive ACID Merge
[Diagram] IDs needing insert, update and delete land in Staging Table B; the merge process applies them to Table A across its date partitions, one mapper per bucket (Mapper-1 … Mapper-N).

HIVE MERGE STATEMENT
MERGE INTO audit.A AS T
USING audit.staging_table AS S
ON T.ID = S.ID AND T.tran_date = S.tran_date
WHEN MATCHED AND (T.TranValue != S.TranValue AND S.TranValue IS NOT NULL)
  THEN UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update'
WHEN MATCHED AND S.TranValue IS NULL THEN DELETE
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);

• Hive ACID - writes and reads can obtain shared locks at the same time
• Hive ACID - writes on the same table and same partition are chained
• Performance: ~50 tables with a total of 256 TB of data; Hive ACID merge on all tables and partitions completed in approximately 4 hours
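The three WHEN clauses of the MERGE statement can be simulated row by row. This is a hedged sketch using a dict keyed by (ID, tran_date) to stand in for the table, not the actual Hive execution:

```python
def hive_merge(target: dict, staging: dict) -> dict:
    """Keys are (ID, tran_date) tuples; values stand in for TranValue."""
    result = dict(target)
    for key, staged in staging.items():
        if key in result:
            if staged is None:
                del result[key]        # WHEN MATCHED AND S.TranValue IS NULL THEN DELETE
            elif result[key] != staged:
                result[key] = staged   # WHEN MATCHED AND values differ THEN UPDATE
        else:
            result[key] = staged       # WHEN NOT MATCHED THEN INSERT
    return result

target = {("id1", "2018-03-18"): 100, ("id2", "2018-03-18"): 200}
staging = {("id1", "2018-03-18"): 150,   # rectification -> update
           ("id2", "2018-03-18"): None,  # erasure request -> delete
           ("id3", "2018-03-18"): 300}   # new audit row -> insert
assert hive_merge(target, staging) == {("id1", "2018-03-18"): 150,
                                       ("id3", "2018-03-18"): 300}
```

The delete branch is what backs the "right to be forgotten": an erasure request is staged as a matched row with a NULL value, and the merge removes it even though the underlying ORC files are immutable, because the delete is recorded in a delta file and folded in at read/compaction time.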
Platform
GDPR Requirement for Global Deployment Across the World
Cluster Deployment
• N data centers around the world
• Preview and Production environments
• Each environment consists of HDF and HDP clusters
• 60 Ambari-managed clusters
• T-shirt sized data centers: Large, Medium, Small
HDP Cluster
Cluster Blueprint Design
HDF Cluster
Master & Management Nodes (5 Master v1) – Including Standby
Master 1
Namenode (Active)
Zk Failover Controller
Yarn (Active)
Master 2 Master 3
Hive Metastore (Compaction)
Zookeeper1
Journal Node1
Namenode (Standby)
Zk Failover Controller
Yarn (Standby)
Master 4
Knox Instance1
Zookeeper2
Journal Node2
Master 5
Knox Instance2
Zookeeper3
Journal Node3
Hive Metastore Instance2
HiveServer2 Instance2
Webhcat Instance2
Kerberos Client
HST Agent
LOGSEARCH_LOGFEEDER and
Metric Monitor
Hive Metastore Instance1
HiveServer2 Instance1
Webhcat Instance1
Kerberos Client
HST Agent
LOGSEARCH_LOGFEEDER and
Metric Monitor
Yarn ATS
MR2 History Server
Slider
Kerberos Client
HST Agent
LOGSEARCH_LOGFEEDER and
Metric Monitor
Infra Solr1
Logsearch Server
Ambari Server
Ranger(Active)
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Infra Solr2
HST Server
Activity Analyzer/Explorer
Ranger(Standby)
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Server
Client
Security
Slave Nodes
HDFS & YARN Slaves
Datanode
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Node Manager
Slave Nodes(2) with Ambari
Metrics
HDFS & YARN Slaves
Datanode
Distributed Ambari Metric
Collector
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Node Manager
Master & Management Nodes (2 Master) – Including Ambari
Master v1
Ambari Infra
Logsearch Server
Logsearch UI
Ranger (Active)
Infra Solr Instance
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Master v2
Ambari Server
Metrics Collector
Metrics Grafana
Ranger (Standby)
Infra Solr Instance
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Nifi
Nifi Node
Nifi
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Kafka Brokers
Kafka Broker
Broker
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Zookeeper
Kafka Brokers
Kafka Broker
Broker
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Platform Security – GDPR Compliant
[Diagram] All traffic enters over HTTPS through the system gateway (Knox-1 / Knox-2). Knox authenticates users against AD/LDAP (the user directory) and the KDC (the token provider): the user obtains a TGT (U-TGT) and Knox uses a Knox TGT to proxy requests on the user's behalf (user authentication sample process). Ranger-1 / Ranger-2, backed by SQL Server, enforce user authorization; Ambari handles administration. The application services (NameNode, ResourceManager, HiveServer2, Oozie, Spark History Server, Yarn Timeline Server, and Zeppelin Server on EdgeNode-1 … EdgeNode-N) and every DataNode hold service keytabs (KT); services use delegation tokens for efficiency. Hive and Oozie metadata are stored in SQL Server.
Automated Deployment
Cluster Deployment
• Auto Deploy OS
  • OS best practices for HDP & HDF
• Auto Deploy HDF and HDP
  • Ambari APIs
  • Ambari Blueprints
• Auto Validate Deployment
  • Ambari smoke tests using APIs
  • Ambari Blueprint comparison against the standard
• Auto Deploy GDPR Application
  • Kafka producer deployment
  • Kafka topics creation
  • NiFi flow deployment
  • Hive DB and HDFS directory creation
  • Custom Ambari alerts deployment
• Auto Validate GDPR Application
Performance Tuning
Data Ingestion – Operating System Tunables
• Disable Transparent Huge Pages
  • echo never > /sys/kernel/mm/transparent_hugepage/defrag (and the same for .../enabled)
• Disable swap
• Configure VM cache flushing
• Configure IO scheduler as deadline
• Disks: JBOD, ext4
  • Mount options: inode_readahead_blks=128,data=writeback,noatime,nodiratime
• Network
  • Dual bonded 10 Gbps
  • rx-checksumming: on, tx-checksumming: on, scatter-gather: on, tcp-segmentation-offload: on
Data Ingestion – Tunables
• Data Tunables
1. Buckets
I. Audit Type 1 Logs - table buckets
II. Audit Type 2 Logs - table buckets
2. Kafka Events
I. Audit Type 1 Logs Events Per Kafka Event
II. Audit Type 2 Logs Events Per Kafka Event
3. Ingested Data Volume
I. Audit Type 1 Logs - table data ingested
II. Audit Type 2 Logs - table data ingested
III. Ingested Partitions(Days) Per Table
• Platform Tunables
1. Nifi PHS Transaction Settings
I. Transactions Per Batch
II. Rows per transaction
III. Connections per process
IV. Nifi Insert Interval to Hive
2. Nifi Merge Content Setting
I. Merge Content Size
II. Merge Processes
III. Merge Process Threads
3. Nifi Settings
i. Queue Size
ii. Nifi concurrent Threads
iii. Nifi memory per Node
4. Hive Settings
I. ORC Stripe Size
Data Ingestion Tuning in Iterations
• Issues/bottlenecks identified and resolved per iteration
• Iteration 1 - Nifi Merge Process
  • 1 GB from the Merge Process was too high
  • 256 MB gave better stability and throughput
  • 21 MB/sec  24 MB/sec
• Iteration 2 - Nifi Disk Changes
  • Content Repository high usage with disks
  • 24 MB/sec  29 MB/sec
• Iteration 3 - Nifi Parallelism PHS Processor
  • Multiple PHS instances
  • Working around ORC creation bottlenecks
  • 29 MB/sec  35 MB/sec
• Iteration 4 - Nifi Hive Changes
  • Transactions per batch
  • Rows per transaction
  • Hive buckets tuning
  • 35 MB/sec  51 MB/sec
[Chart: Data Ingest – Nifi Hive Streaming – throughput in MB/sec: 21 at Iteration 0 (baseline), 24 after Nifi Merge Process (Iteration 1), 29 after Nifi Repository Disk Changes (Iteration 2), 35 after Nifi PHS Parallelism (Iteration 3), 51 after Hive Streaming Changes (Iteration 4)]
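For reference, the per-step and end-to-end gains implied by the reported throughput numbers:

```python
# Throughput after tuning iterations 0..4, in MB/sec, from the chart above.
throughput = [21, 24, 29, 35, 51]
step_gains = [round(b / a, 2) for a, b in zip(throughput, throughput[1:])]
overall = round(throughput[-1] / throughput[0], 2)
assert step_gains == [1.14, 1.21, 1.21, 1.46]
assert overall == 2.43  # roughly 2.4x end to end
```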
Reporting Queries on Audit Tables
[Chart: Reporting Response Times and Cluster IO across Test Set 1 – Test Set 4; series: Total Queries (Users * Queries Per User), Queries Returning Result within SLA, Average Time for Status Ready, Average Time to Read Results]
Operations
Getting Around Limitations of HDF Metrics (SPOF)
• Export HDF Metrics (SPOF HBase) into HDP HBase
• Distributed Ambari Metrics
[Screenshots: HDF Metrics, HDP Metrics]
Ambari LogSearch
Log Monitoring
• Monitor Log Errors for Services
• HDFS
• YARN
• Hive
• Nifi
• Kafka
• Zookeeper
Centralized Global Eye
Hortonworks DPS Operational App - in progress
• High Level Cluster Status
• Key Metrics
• Top-N Alerts
Questions?
Thank you

More Related Content

What's hot

Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
DataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 

What's hot (20)

Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
 

Similar to GDPR compliance application architecture and implementation using Hadoop and Streaming

Similar to GDPR compliance application architecture and implementation using Hadoop and Streaming (20)

Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and Superset
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Is 12 Factor App Right About Logging
Is 12 Factor App Right About LoggingIs 12 Factor App Right About Logging
Is 12 Factor App Right About Logging
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Recently uploaded (20)

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 

GDPR compliance application architecture and implementation using Hadoop and Streaming

  • 1. GDPR Compliance Application Architecture and Implementation using Hadoop and Streaming. Saurabh Mishra and Arun Thangamani, System Architects, Professional Services, Hortonworks
  • 2. Who are We? Saurabh Mishra, Systems Architect, Hortonworks Professional Services (@draftsperson); Arun Thangamani, Systems Architect, Hortonworks Professional Services (@ArunThangamani)
  • 3. GDPR Overview
  • 4. GDPR Regulation from a Technical Standpoint (diagram): the actors are the Data Subject, the Controller, the Processor, the Supervisory Authority, and users/3rd parties. Access to personal data must be authenticated, authorized, and audited, and processing must be enforced with specific purpose(s).
  • 5. GDPR - Quick Summary
    • Data subject’s rights to self data: access, rectification, erasure, portability, objection
    • Data subject’s data types: includes identifiers, biometric and genetic data
    • Data subject’s usage of data: enforce processing of data with specific purpose(s)
    • Data subject’s data, other specifications: low overhead to correct data; erasing data even from immutable systems
    • Data subject data protection, audit and reporting: tracking and copy prevention; enforced consent seeking
    • Authentication and authorization rules: minimize anonymous access; prevent unauthorized access; audit any access to data
    • Report any breaches within 72 hours
    • Prevent copies and transfer of data; track personal data movement
    • Enforced consent during authorization, even for cross-sell, upsell and data mining
  • 6. GDPR – Application Design
  • 7. Company XYZ – Efficient GDPR Implementation using Hadoop (architecture diagram): data centers 1..N each hold RDBMS silos with User Content Transaction Tables (UCTT) and Audit Tables (AT) in per-customer schemas (Customer-1 .. Customer-N), serving the user applications; the GDPR-regulated data runs to tens of petabytes spanning a few years to a few decades. The audit tables answer: 1) who did what? 2) what was done? In the target design, each data center combines the RDBMS with HDP and HDF: NiFi and Kafka carry the audit data out of the RDBMS silos into Hive on the Hadoop cluster.
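The audit records carried from the RDBMS silos into Kafka can be sketched as plain JSON events answering "who did what?" and "what was done?". The field names below are illustrative assumptions, not the deck's actual schema, and the Kafka publish step is noted only in a comment.

```python
import json
from datetime import datetime, timezone

def build_audit_event(user, action, table, row_id, datacenter):
    """Assemble one audit record: who did what, and what was done."""
    return {
        "event_time": datetime.now(timezone.utc).isoformat(),
        "datacenter": datacenter,
        "user": user,          # who did it
        "action": action,      # what was done (e.g. READ, UPDATE, DELETE)
        "table": table,        # which UCTT table was touched
        "row_id": row_id,
    }

# Serialize to JSON bytes, as a Kafka producer's value_serializer would,
# before publishing to the per-data-center audit topic
# (e.g. kafka-python: producer.send("audit-events", payload)).
event = build_audit_event("svc_billing", "UPDATE", "uctt.invoices", 42, "dc-1")
payload = json.dumps(event, sort_keys=True).encode("utf-8")
```

Keeping the event self-describing (data center, table, row id) is what later lets the reporting and right-to-be-forgotten flows locate a subject's audit rows.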
  • 8. Company XYZ – GDPR Applications Design per Data Center (diagram): GDPR audit table flows run from RDBMS silos 1..N into Kafka, then through NiFi/Spark applications into staging tables and audit tables in Hive on the YARN-managed HDP/HDF cluster, with purge and merge logic and HiveServer2 applications on top. Three use-case families: GDPR audit data ingestion (ingest use cases), GDPR right to be forgotten (update use cases), and GDPR audit reports (report use cases).
  • 9. Data Ingestion Internals - Kafka (diagram): Kafka producers 1..3 write through the OS page cache to brokers 1..3, each hosting topic partitions (e.g. T1-P3, T1-P4, T2-P1, T4-P5, T1-P6, T4-P12, T4-P15, T4-P8); Kafka receivers 1..3 consume them. A ZooKeeper ensemble (nodes 1..N) coordinates the brokers via znodes such as /controller, /topics, /admin, /consumer, /broker, and /kafka-acl.
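How a keyed record lands on one of those topic partitions can be sketched as hash-mod-partition routing. Kafka's default partitioner actually uses a murmur2 hash of the key; md5 stands in here purely for illustration.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Kafka-style routing sketch: hash(key) mod partition count.
    (Kafka's default partitioner uses murmur2; md5 is a stand-in here.)"""
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return h % num_partitions

# Records sharing a key (e.g. one data subject) always land on the same
# partition, which preserves per-subject ordering in the audit stream.
p1 = partition_for(b"subject-123", 6)
p2 = partition_for(b"subject-123", 6)
```

Choosing the data-subject identifier as the record key is one way to keep each subject's audit trail ordered within a single partition.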
  • 10. Data Ingestion Internals - Nifi (diagram): Kafka receivers KR-1..3 feed a MergeContent processor, then PutHiveStreaming (PHS) instances 1..N write to Hive through the NameNode and data nodes 1..N. Each NiFi node keeps three repositories on separate RAID-10 volumes: the FlowFile repository (4 disks, with an in-memory FlowFile cache), the content repository (6 disks), and the provenance repository (4 disks). Content repository details: entries are ordered, pushed to and retrieved from disk; default batch size is 10,000; copy-on-write and pass-by-reference semantics; flow files are copied/moved to the relevant queues, with container-section files addressed by offsets (content claims within resource claims). Provenance repository details: partitioned logs are combined into a merged log and indexed as Lucene shards. The Hive backend database is SQL Server.
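The MergeContent step above can be sketched as size-threshold batching: many small audit records are bundled into larger payloads before the Hive write, so the streaming sink sees fewer, bigger transactions. The structure and threshold below are illustrative, not NiFi's internal implementation.

```python
def merge_by_size(records, max_bytes):
    """Group records into bins no larger than max_bytes, preserving order,
    the way NiFi's MergeContent bundles small flow files before a write."""
    bins, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > max_bytes:
            bins.append(b"".join(current))   # flush the full bin
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        bins.append(b"".join(current))       # flush the remainder
    return bins

# Ten 40-byte records with a 100-byte cap merge into five 80-byte bundles.
batches = merge_by_size([b"x" * 40] * 10, max_bytes=100)
```

The bundle size is a real tuning lever: the deck's own iterations later report that shrinking the merge size from 1 GB to 256 MB improved stability and throughput.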
  • 11. Efficient “Right to be Forgotten” - Hive ACID - Merged Reads (diagram): Table A is partitioned by date (e.g. Partition=2018-03-15 .. Partition=2018-03-18), each partition holding base files in buckets 1..5. Merge on write: updates delete and re-create the base files (simple reads, heavy writes). Merge on read: updates land in base and delta files, merged at read time (simple writes, heavier reads).
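The merge-on-read side can be sketched in a few lines: the base files stay immutable while readers overlay upserts and deletes from the delta files at query time. This is a conceptual model, not Hive's actual ORC-level implementation.

```python
def merged_read(base, deltas):
    """Merge-on-read sketch: base is {row_id: value}; each delta is a
    (row_id, value) upsert, with value=None meaning delete. The base is
    never rewritten; readers overlay the deltas at query time."""
    view = dict(base)
    for row_id, value in deltas:
        if value is None:
            view.pop(row_id, None)   # tombstone: the row is forgotten
        else:
            view[row_id] = value     # insert or update
    return view

base = {1: "a", 2: "b", 3: "c"}
deltas = [(2, "B"), (3, None), (4, "d")]
view = merged_read(base, deltas)
```

This is why erasure from an effectively immutable store is cheap to write: deleting a subject's row only appends a small tombstone, and the read path hides the original value.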
  • 12. Efficient Data Lifecycle Internals – Hive ACID with Merged Reads (diagram): NiFi ingests using Hive ACID, purge and update requests use Hive MERGE, and reporting uses Hive LLAP. Life of a single partition of Table A: buckets 1..5 start with tall base files and short delta files, accumulate many short delta files, and a compaction process rewrites them into tall new base files. Within one bucket, merge-on-read processing happens inside the containers: each mapper (1..6) reads both the source ORC stripes of the base file and the delta ORC files to provide the answer.
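The compaction step in that lifecycle can be sketched the same way: deltas pile up per transaction batch, and once enough accumulate, a compaction folds them (oldest first) into a new tall base so later reads stop paying the merge cost. The threshold and data structures below are illustrative, not Hive's real compaction policy knobs.

```python
def maybe_compact(base_rows, delta_files, threshold=10):
    """Life of one bucket: below the threshold, keep reading base + deltas;
    at or above it, fold the deltas into a new tall base and retire them."""
    if len(delta_files) < threshold:
        return base_rows, delta_files          # merge-on-read continues
    merged = dict(base_rows)
    for delta in delta_files:                  # apply oldest first
        for row_id, value in delta:
            if value is None:
                merged.pop(row_id, None)       # tombstoned rows drop out
            else:
                merged[row_id] = value
    return merged, []                          # new base, empty delta list

base = {1: "a", 2: "b"}
deltas = [[(2, "b2")], [(1, None)], [(3, "c")]]
new_base, remaining = maybe_compact(base, deltas, threshold=3)
```

Note that compaction also physically removes tombstoned rows from the new base files, which is what ultimately makes a "forgotten" row unrecoverable from the table.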
  • 13. Efficient “Right to be Forgotten” using Hive ACID Merge (diagram): IDs needing insert, update, and delete are loaded into staging Table B; a merge process applies them across the partitions of Table A (mappers 1..N, working against the buckets of base files in each partition, e.g. Partition=2018-03-18). The Hive MERGE statement:
    MERGE INTO audit.A AS T USING audit.staging_table AS S ON T.ID = S.ID AND T.tran_date = S.tran_date WHEN MATCHED AND (T.TranValue != S.TranValue AND S.TranValue IS NOT NULL) THEN UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update' WHEN MATCHED AND S.TranValue IS NULL THEN DELETE WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);
    Hive ACID lets writes and reads obtain shared locks at the same time; writes on the same table and partition are chained. Performance: ~50 tables with a total of 256 TB of data; a Hive ACID merge on all tables and partitions completed in approximately 4 hours.
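The three branches of that MERGE statement can be paraphrased as plain dictionary logic, which makes the right-to-be-forgotten path visible: a matched row whose staging value is NULL gets deleted. This mirrors the statement's semantics only; it is not how Hive executes the merge.

```python
def hive_merge(target, staging):
    """Paraphrase of the slide's MERGE, with rows keyed by (ID, tran_date).
    Matched + differing non-null value -> UPDATE ('merge_update');
    matched + null value -> DELETE (right to be forgotten);
    not matched -> INSERT ('merge_insert')."""
    result = dict(target)
    for key, tran_value in staging.items():
        if key in result:
            if tran_value is None:
                del result[key]                             # WHEN MATCHED ... DELETE
            elif result[key][0] != tran_value:
                result[key] = (tran_value, "merge_update")  # WHEN MATCHED ... UPDATE
        else:
            result[key] = (tran_value, "merge_insert")      # WHEN NOT MATCHED ... INSERT
    return result

target = {(1, "2018-03-18"): ("old", "ingest"),
          (5, "2018-03-17"): ("gone-soon", "ingest")}
staging = {(1, "2018-03-18"): "new",       # update
           (5, "2018-03-17"): None,        # delete
           (9, "2018-03-18"): "fresh"}     # insert
merged = hive_merge(target, staging)
```

Staging a NULL TranValue for every row belonging to a data subject is thus the whole erasure mechanism: one MERGE pass deletes them across all matched partitions.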
  • 14. Platform
  • 15. GDPR Requirement for Global Deployment Across the World. Cluster deployment: N data centers around the world; preview and production environments; each environment consists of HDF and HDP clusters; 60 Ambari-managed clusters; T-shirt-sized data centers (large, medium, small).
  • 16. Cluster Blueprint Design (diagram). HDP cluster: five master and management nodes (including standby masters) hosting the active and standby NameNodes with ZKFailoverControllers, active and standby YARN ResourceManagers, a Hive Metastore dedicated to compaction plus Hive Metastore/HiveServer2/WebHCat instances 1 and 2, three ZooKeeper and JournalNode instances, two Knox instances, YARN ATS, the MR2 History Server, Slider, Infra Solr 1 and 2, the Logsearch server, Ambari Server, HST Server, Activity Analyzer/Explorer, and Ranger in active/standby. Slave nodes run DataNode and NodeManager; two of them also host the distributed Ambari Metrics Collector. HDF cluster: two master and management nodes (Ambari Server, Ambari Infra, Logsearch server and UI, Metrics Collector, Metrics Grafana, Infra Solr instances, and Ranger in active/standby), NiFi nodes, and Kafka broker nodes with ZooKeeper. Every node also carries the Kerberos client, Ambari agent, HST agent, Logsearch log feeder, and metrics monitor.
  • 17. Platform Security – GDPR Compliant (diagram). A gateway of two Knox instances fronts the cluster over HTTPS; AD/LDAP serves as the user directory and a KDC as the token provider. In the sample user-authentication process, Knox obtains a Knox TGT and acts as a proxy user, forwarding authenticated requests to the application services (HiveServer2, Oozie, Zeppelin Server, Spark History Server, YARN Timeline Server, and others) on the edge nodes; services use delegation tokens for efficiency. Two Ranger instances, each backed by SQL Server, form the user-authorization system across the NameNode, ResourceManager, Ambari, and the data nodes.
  • 18. Automated Cluster Deployment: auto-deploy the OS (with OS best practices for HDP and HDF); auto-deploy HDF and HDP via Ambari APIs and Ambari Blueprints; auto-validate the deployment (Ambari smoke tests via the APIs, blueprint comparison against the standard); auto-deploy the GDPR application (Kafka producer deployment, Kafka topic creation, NiFi flow deployment, Hive DB and HDFS directory creation, custom Ambari alerts deployment); auto-validate the GDPR application.
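A minimal Ambari blueprint body, of the kind registered through the Ambari REST API before cluster creation, might look like the following. The blueprint name, stack version, and host-group contents are illustrative stand-ins, not the deck's production blueprint.

```python
import json

# Minimal Ambari blueprint sketch (illustrative host groups, not the full
# production blueprint). A blueprint like this is registered via Ambari's
# REST API, e.g. POST /api/v1/blueprints/gdpr-hdp with an X-Requested-By
# header, then instantiated against a host mapping to create the cluster.
blueprint = {
    "Blueprints": {"blueprint_name": "gdpr-hdp",
                   "stack_name": "HDP",
                   "stack_version": "2.6"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"},
                        {"name": "HIVE_SERVER"}, {"name": "ZOOKEEPER_SERVER"}]},
        {"name": "worker", "cardinality": "1+",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}
body = json.dumps(blueprint, indent=2)
```

Because the blueprint is plain JSON, the "auto-validate" step above can diff a running cluster's exported blueprint against this standard programmatically.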
  • 19. Performance Tuning
  • 20. Data Ingestion – Operating System Tunables: disable transparent huge pages (echo never > defrag and > enabled); disable swap; configure VM cache flushing; set the IO scheduler to deadline; disks as JBOD with ext4 (mount options: inode_readahead_blks=128, data=writeback, noatime, nodiratime); network: dual bonded 10 Gbps with rx-checksumming, tx-checksumming, scatter-gather, and tcp-segmentation-offload on.
  • 21. Data Ingestion – Tunables. Data tunables: (1) buckets: table buckets for audit type 1 and audit type 2 logs; (2) Kafka events: events per Kafka event for audit type 1 and audit type 2 logs; (3) ingested data volume: table data ingested for audit type 1 and audit type 2 logs, and ingested partitions (days) per table. Platform tunables: (1) NiFi PHS transaction settings: transactions per batch, rows per transaction, connections per process, NiFi insert interval to Hive; (2) NiFi MergeContent settings: merge content size, merge processes, merge process threads; (3) NiFi settings: queue size, concurrent threads, memory per node; (4) Hive settings: ORC stripe size.
  • 22. Data Ingestion Tuning in Iterations. Issues/bottlenecks identified and resolved: Iteration 1 - NiFi merge process: 1 GB from the merge process was too high; 256 MB gave more stability and throughput (21 MB/sec to 24 MB/sec). Iteration 2 - NiFi disk changes: high content-repository disk usage (24 MB/sec to 29 MB/sec). Iteration 3 - NiFi PHS processor parallelism: multiple PHS instances, working around ORC creation bottlenecks (29 MB/sec to 35 MB/sec). Iteration 4 - NiFi Hive changes: transactions per batch, rows per transaction, Hive bucket tuning (35 MB/sec to 51 MB/sec). (Chart: Data Ingest – NiFi Hive Streaming throughput in MB/sec across iterations 0-4: 21, 24, 29, 35, 51.)
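Those iteration figures work out to roughly a 2.4x end-to-end throughput gain; as a quick calculation:

```python
# Throughput after each tuning iteration (from the slide), in MB/sec.
throughput = [21, 24, 29, 35, 51]

# Per-iteration percentage gains, and the cumulative gain over the baseline.
step_gains = [round((b / a - 1) * 100, 1)
              for a, b in zip(throughput, throughput[1:])]
total_gain = round((throughput[-1] / throughput[0] - 1) * 100, 1)
```

The final Hive-side changes (iteration 4) contribute the single largest step, a point worth noting when deciding where to spend tuning effort first.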
  • 23. Reporting Queries on Audit Tables (chart): reporting response times and cluster IO across test sets 1-4, tracking total queries (users * queries per user), queries returning a result within the SLA, average time for status ready, and average time to read results.
  • 24. Operations
  • 25. Getting Around Limitations of HDF Metrics (SPOF): export the HDF metrics (single-point-of-failure HBase) into the distributed Ambari Metrics HBase on HDP, alongside the HDP metrics.
  • 26. Ambari LogSearch Log Monitoring: monitor log errors for the services HDFS, YARN, Hive, NiFi, Kafka, and ZooKeeper.
  • 27. Centralized Global Eye: the Hortonworks DPS operational app (in progress), showing high-level cluster status, key metrics, and top-N alerts.
  • 28. Questions?
  • 29. Thank you

Editor's Notes

  1. Hello, good evening. As we all know, the GDPR requirement hit the industry on May 25th. We got an opportunity to optimize an existing application to provide new functionality related to GDPR audit logging. We will describe the details of that implementation, the different design patterns used, and the various optimizations we made at several layers to achieve our performance goals.
  2. I am Saurabh, a Systems Architect with Hortonworks Professional Services since 2012 and an active Hadoop implementer over the last decade. With me is Arun.
  3. Now let's understand GDPR from a technical standpoint. We have a system in which a data subject's personal data exists; the owner of the system is the controller. With GDPR, the data subject should be able to access, rectify, and erase their data, move it to another system, and object to its use. Anyone accessing the system should be forced to authenticate their identity and be authorized, and all actions should be available in logs as an audit trail. The processor is the party who works with the controller and uses the personal data; the processor must do all three A's. Along with that, the processor should be held to the specific purpose for which it accesses the personal data, which is enforced consent seeking. There should be no copies of the data, or any copies should be auditable, with low overhead for correcting data. Even if the system is immutable, erasure of data should be possible. Most important: any breach, if it happens, should be reported within 72 hours.
  4. A quick summary. One of the most important aspects of all this is the auditability of each and every action happening on personal data. The right to be forgotten and the reporting requirements apply to the audit data as well. Now let's look at our implementation.
  5. Audit data from the application and database is ingested by Kafka producers into NiFi and then into Hive ACID tables using PutHiveStreaming. Audit reporting and the right to be forgotten (update and delete) are made possible using Hive LLAP and the ACID features of the system via asynchronous HiveServer2 access.