1 © Hortonworks Inc. 2011–2018. All rights reserved
GDPR Compliance Application
Architecture and Implementation
using Hadoop and Streaming
Saurabh Mishra and Arun Thangamani
System Architects, Professional Services, Hortonworks
Who are We?
Saurabh Mishra
Systems Architect, Hortonworks Professional Services
@draftsperson
Arun Thangamani
Systems Architect, Hortonworks Professional Services
@ArunThangamani
GDPR Overview
GDPR Regulation from a Technical Standpoint
[Diagram] Actors around Personal Data: Data Subject, Controller, Processor, Supervisory Authority, and Users / 3rd Parties.
Technical obligations: Authenticate, Authorize, Audit Access; Enforce Processing with Specific Purpose(s).
GDPR - Quick Summary
• Data subject's rights over their own data
  • Access, Rectification, Erasure, Portability, Objection
• Data subject's data types
  • Includes identifiers, biometric and genetic data
• Usage of data subject's data
  • Enforce processing of data with specific purpose(s)
• Data subject's data – other specifications
  • Low overhead to correct data
  • Erasing data even from immutable systems
• Data subject data protection, audit and reporting
  • Report any breaches within 72 hours
• Authentication and authorization rules
  • Minimize anonymous access
  • Prevent unauthorized access
  • Audit any access to data
• Tracking and copy prevention
  • Prevent copies and transfer of data
  • Track personal data movement
• Enforced consent seeking
  • Enforced consent during authorization
  • Even in the case of cross-sell, upsell and data mining
GDPR – Application Design
Company XYZ – Efficient GDPR Implementation using Hadoop
[Diagram] Before: Data Center - 1 … Data Center - N each run RDBMS silos with a schema per customer (Schema for Customer-1 … Schema for Customer-N) holding GDPR-regulated data: User Content Transaction Tables (UCTT) and Audit Tables (AT) serving user applications. The audit content records 1) who did what and 2) what was done, at volumes in the tens of petabytes retained from a few years to a few decades.
After: each data center runs RDBMS plus HDP and HDF; NiFi and Kafka stream the Audit Tables out of the RDBMS into HDP.
Company XYZ – GDPR Applications Design per Data Center
[Diagram] GDPR Audit Table flows, Data Center - 1: RDBMS Silo-1 … Silo-N publish to Kafka; NiFi/Spark applications land the events in Hive staging tables; purge/merge logic running on YARN in the HDP/HDF cluster maintains the audit tables; HiveServer2 applications serve queries.
Use cases:
• Ingest - GDPR Audit Data Ingestion
• Update - GDPR Right to be Forgotten
• Report - GDPR Audit Reports
Data Ingestion Internals - Kafka
[Diagram] Kafka Producers 1-3 write to Kafka Brokers 1-3 and Kafka Receivers 1-3 consume from them; brokers serve both paths through the OS page cache. Topic partitions (T1 P3, T1 P4, T1 P6, T2 P1, T4 P5, T4 P8, T4 P12, T4 P15) are spread across the brokers. Zookeeper Nodes 1-N coordinate cluster state under the znodes /controller, /topics, /admin, /consumer, /broker and /kafka-acl.
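The partition spread shown above (T1 P3, T4 P5, …) comes from the producer mapping each record key to a partition. A minimal Python sketch of the idea follows; Kafka's real default partitioner uses murmur2 hashing, and this stand-in uses `hashlib` only to make the mapping deterministic, so it is purely illustrative:

```python
# Simplified stand-in for Kafka's keyed partitioner (real clients use murmur2).
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a record key to one of the topic's partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records sharing a key (e.g. one data subject's ID) always land in the same
# partition, which preserves per-key ordering across brokers.
p = choose_partition("customer-42", 16)
assert p == choose_partition("customer-42", 16)
assert 0 <= p < 16
```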
Data Ingestion Internals - Nifi
[Diagram] Kafka receivers (KR-1, KR-2, KR-3) feed a Merge Content stage, then PHS processors (PHS 1 … PHS N) stream data into Hive (metastore backed by SQL Server) and on to HDFS (NameNode, Data Node 1 … Data Node N). Each NiFi node keeps an in-memory flow file cache and three on-disk repositories: Flow File Repository (RAID-10, 4 disks), Content File Repository (RAID-10, 6 disks) and Provenance Repository (RAID-10, 4 disks).
Content File Repo details:
• Ordered, pushed and retrieved to/from disk
• Batch size default 10,000
• Copy on write
• Pass by reference
• Flow files copied/moved to relevant queues
• Content stored in container/section files addressed by offsets; a Content Claim points into a Resource Claim
Provenance Repository details:
• Partitioned logs compacted into a merged log
• Indexed by Lucene shards (1 … N)
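The "pass by reference" and offset-addressed container files above can be sketched as follows. This is an illustrative toy, not NiFi's actual classes: many payloads are appended into one resource-claim file, and each flow file carries only an (offset, length) content claim into it, so moving a flow file between queues never copies bytes.

```python
class ResourceClaim:
    """One append-only container file; flow files reference slices of it."""

    def __init__(self):
        self.data = bytearray()

    def append(self, payload: bytes):
        # Returns a "content claim": (offset, length) into this resource claim.
        offset = len(self.data)
        self.data.extend(payload)
        return (offset, len(payload))

    def read(self, claim):
        offset, length = claim
        return bytes(self.data[offset:offset + length])

repo = ResourceClaim()
claim_a = repo.append(b"audit-event-1")
claim_b = repo.append(b"audit-event-2")
# Queues would hold claim_a / claim_b (references), never the bytes themselves.
assert repo.read(claim_a) == b"audit-event-1"
assert claim_b == (13, 13)
```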
Efficient “Right to be Forgotten” - Hive ACID - Merged Reads
[Diagram] Table A is partitioned by date (Partition=2018-03-15 … Partition=2018-03-18); each partition has buckets 1-5, each with base files.
• Merge on Write: Updates => Delete & Re-Create the base files. Simple reads, heavy writes.
• Merge on Read: Updates => Base & Delta files => Merged reads. Simple writes, heavier reads.
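A toy model of the merge-on-read side, under the simplifying assumption of a row_id → value table: the base file stays immutable, and accumulated delta records are applied on the fly at read time.

```python
def merged_read(base: dict, deltas: list) -> dict:
    """base: row_id -> value; deltas: list of ('upsert'|'delete', row_id, value)."""
    view = dict(base)   # the base file itself is never modified
    for op, row_id, value in deltas:
        if op == "delete":
            view.pop(row_id, None)
        else:
            view[row_id] = value
    return view

base = {1: "alice", 2: "bob", 3: "carol"}
deltas = [("delete", 2, None), ("upsert", 3, "carol-corrected")]
assert merged_read(base, deltas) == {1: "alice", 3: "carol-corrected"}
assert base[2] == "bob"   # base untouched until compaction rewrites it
```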
Efficient Data Lifecycle Internals – Hive ACID with Merged Reads
[Diagram] Life of Table A, single partition, buckets 1-5:
• NIFI ingest using Hive ACID => tall base and short delta files
• Purge & update requests using Hive Merge => tall base and many short delta files
• Compaction process => tall NEW base files after compaction
• Reporting using Hive LLAP
• Merge on Read processing within containers: each mapper (Mapper-1 … Mapper-6) works within one bucket, reading both the base ORC file stripes and the delta ORC files to provide the answer.
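The one-mapper-per-bucket parallelism above can be sketched with a hypothetical `bucket_of` helper. Hive actually buckets rows by hashing the clustering column; plain modulo is used here only for illustration.

```python
def bucket_of(row_id: int, num_buckets: int = 5) -> int:
    # Illustrative stand-in for Hive's bucketing hash.
    return row_id % num_buckets

rows = [(i, f"v{i}") for i in range(12)]
buckets = {}
for row_id, value in rows:
    buckets.setdefault(bucket_of(row_id), []).append((row_id, value))

# Mapper-1 can now merge bucket 0's base stripes and deltas, Mapper-2
# bucket 1's, and so on, with no coordination between mappers.
assert sorted(buckets) == [0, 1, 2, 3, 4]
assert buckets[0] == [(0, "v0"), (5, "v5"), (10, "v10")]
```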
Efficient “Right to be Forgotten” using Hive ACID Merge
[Diagram] IDs needing insert, update and delete land in Staging Table B; the merge process applies them to Table A across its date partitions, one mapper per bucket (Mapper-1 … Mapper-N).

HIVE MERGE STATEMENT
MERGE INTO audit.A AS T
USING audit.staging_table AS S
ON T.ID = S.ID AND T.tran_date = S.tran_date
WHEN MATCHED AND (T.TranValue != S.TranValue AND S.TranValue IS NOT NULL)
  THEN UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update'
WHEN MATCHED AND S.TranValue IS NULL THEN DELETE
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);

• Hive ACID - writes and reads can obtain shared locks at the same time
• Hive ACID - writes on the same table and same partition are chained
• Performance: ~50 tables with a total of 256 TB of data; Hive ACID merge on all tables and partitions completed in approximately 4 hours
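The three WHEN clauses of the MERGE statement can be simulated row by row. This is a hedged sketch using a dict keyed by (ID, tran_date) to stand in for the table, not the actual Hive execution:

```python
def hive_merge(target: dict, staging: dict) -> dict:
    """Keys are (ID, tran_date) tuples; values stand in for TranValue."""
    result = dict(target)
    for key, staged in staging.items():
        if key in result:
            if staged is None:
                del result[key]        # WHEN MATCHED AND S.TranValue IS NULL THEN DELETE
            elif result[key] != staged:
                result[key] = staged   # WHEN MATCHED AND values differ THEN UPDATE
        else:
            result[key] = staged       # WHEN NOT MATCHED THEN INSERT
    return result

target = {("id1", "2018-03-18"): 100, ("id2", "2018-03-18"): 200}
staging = {("id1", "2018-03-18"): 150,   # rectification -> update
           ("id2", "2018-03-18"): None,  # erasure request -> delete
           ("id3", "2018-03-18"): 300}   # new audit row -> insert
assert hive_merge(target, staging) == {("id1", "2018-03-18"): 150,
                                       ("id3", "2018-03-18"): 300}
```

The delete branch is what backs the "right to be forgotten": an erasure request is staged as a matched row with a NULL value, and the merge removes it even though the underlying ORC files are immutable, because the delete is recorded in a delta file and folded in at read/compaction time.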
Platform
GDPR Requirement for Global Deployment Across the World
Cluster Deployment
• N data centers around the world
• Preview and Production environments
• Each environment consists of HDF and HDP clusters
• 60 Ambari-managed clusters
• T-shirt sized data centers: Large, Medium, Small
HDP Cluster
Cluster Blueprint Design
HDF Cluster
Master & Management Nodes (5 Master v1) – Including Standby
Master 1
Namenode (Active)
Zk Failover Controller
Yarn (Active)
Master 2 Master 3
Hive Metastore (Compaction)
Zookeeper1
Journal Node1
Namenode (Standby)
Zk Failover Controller
Yarn (Standby)
Master 4
Knox Instance1
Zookeeper2
Journal Node2
Master 5
Knox Instance2
Zookeeper3
Journal Node3
Hive Metastore Instance2
HiveServer2 Instance2
Webhcat Instance2
Kerberos Client
HST Agent
LOGSEARCH_LOGFEEDER and
Metric Monitor
Hive Metastore Instance1
HiveServer2 Instance1
Webhcat Instance1
Kerberos Client
HST Agent
LOGSEARCH_LOGFEEDER and
Metric Monitor
Yarn ATS
MR2 History Server
Slider
Kerberos Client
HST Agent
LOGSEARCH_LOGFEEDER and
Metric Monitor
Infra Solr1
Logsearch Server
Ambari Server
Ranger(Active)
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Infra Solr2
HST Server
Activity Analyzer/Explorer
Ranger(Standby)
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Server
Client
Security
Slave Nodes
HDFS & YARN Slaves
Datanode
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Node Manager
Slave Nodes(2) with Ambari
Metrics
HDFS & YARN Slaves
Datanode
Distributed Ambari Metric
Collector
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Node Manager
Master & Management Nodes (2 Master) – Including Ambari
Master v1
Ambari Infra
Logsearch Server
Logsearch UI
Ranger (Active)
Infra Solr Instance
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Master v2
Ambari Server
Metrics Collector
Metrics Grafana
Ranger (Standby)
Infra Solr Instance
Kerberos Client, HST Agent ,
LOGSEARCH_LOGFEEDER and Metric Monitor
Nifi
Nifi Node
Nifi
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Kafka Brokers
Kafka Broker
Broker
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Zookeeper
Kafka Brokers
Kafka Broker
Broker
Metric Monitor
HST Agent,
LOGSEARCH_LOGFEEDER
Kerberos Client
Ambari Agent
Platform Security – GDPR Compliant
[Diagram] All traffic enters over HTTPS through the system gateway (Knox-1 / Knox-2). Knox authenticates users against AD/LDAP (the user directory) and the KDC (the token provider): the user obtains a TGT (U-TGT) and Knox uses a Knox TGT to proxy requests on the user's behalf (user authentication sample process). Ranger-1 / Ranger-2, backed by SQL Server, enforce user authorization; Ambari handles administration. The application services (NameNode, ResourceManager, HiveServer2, Oozie, Spark History Server, Yarn Timeline Server, and Zeppelin Server on EdgeNode-1 … EdgeNode-N) and every DataNode hold service keytabs (KT); services use delegation tokens for efficiency. Hive and Oozie metadata are stored in SQL Server.
Automated Deployment
Cluster Deployment
• Auto Deploy OS
  • OS best practices for HDP & HDF
• Auto Deploy HDF and HDP
  • Ambari APIs
  • Ambari Blueprints
• Auto Validate Deployment
  • Ambari smoke tests using APIs
  • Ambari Blueprint comparison against the standard
• Auto Deploy GDPR Application
  • Kafka producer deployment
  • Kafka topics creation
  • NiFi flow deployment
  • Hive DB and HDFS directory creation
  • Custom Ambari alerts deployment
• Auto Validate GDPR Application
Performance Tuning
Data Ingestion – Operating System Tunables
• Disable Transparent Huge Pages
  • echo never > /sys/kernel/mm/transparent_hugepage/defrag (and the same for .../enabled)
• Disable swap
• Configure VM cache flushing
• Configure IO scheduler as deadline
• Disks: JBOD, ext4
  • Mount options: inode_readahead_blks=128,data=writeback,noatime,nodiratime
• Network
  • Dual bonded 10 Gbps
  • rx-checksumming: on, tx-checksumming: on, scatter-gather: on, tcp-segmentation-offload: on
Data Ingestion – Tunables
• Data Tunables
1. Buckets
I. Audit Type 1 Logs - table buckets
II. Audit Type 2 Logs - table buckets
2. Kafka Events
I. Audit Type 1 Logs Events Per Kafka Event
II. Audit Type 2 Logs Events Per Kafka Event
3. Ingested Data Volume
I. Audit Type 1 Logs - table data ingested
II. Audit Type 2 Logs - table data ingested
III. Ingested Partitions(Days) Per Table
• Platform Tunables
1. Nifi PHS Transaction Settings
I. Transactions Per Batch
II. Rows per transaction
III. Connections per process
IV. Nifi Insert Interval to Hive
2. Nifi Merge Content Setting
I. Merge Content Size
II. Merge Processes
III. Merge Process Threads
3. Nifi Settings
i. Queue Size
ii. Nifi concurrent Threads
iii. Nifi memory per Node
4. Hive Settings
I. ORC Stripe Size
Data Ingestion Tuning in Iterations
• Issues/bottlenecks identified and resolved per iteration
• Iteration 1 - Nifi Merge Process
  • 1 GB from the Merge Process was too high
  • 256 MB gave better stability and throughput
  • 21 MB/sec  24 MB/sec
• Iteration 2 - Nifi Disk Changes
  • Content Repository high usage with disks
  • 24 MB/sec  29 MB/sec
• Iteration 3 - Nifi Parallelism PHS Processor
  • Multiple PHS instances
  • Working around ORC creation bottlenecks
  • 29 MB/sec  35 MB/sec
• Iteration 4 - Nifi Hive Changes
  • Transactions per batch
  • Rows per transaction
  • Hive buckets tuning
  • 35 MB/sec  51 MB/sec
[Chart: Data Ingest – Nifi Hive Streaming – throughput in MB/sec: 21 at Iteration 0 (baseline), 24 after Nifi Merge Process (Iteration 1), 29 after Nifi Repository Disk Changes (Iteration 2), 35 after Nifi PHS Parallelism (Iteration 3), 51 after Hive Streaming Changes (Iteration 4)]
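For reference, the per-step and end-to-end gains implied by the reported throughput numbers:

```python
# Throughput after tuning iterations 0..4, in MB/sec, from the chart above.
throughput = [21, 24, 29, 35, 51]
step_gains = [round(b / a, 2) for a, b in zip(throughput, throughput[1:])]
overall = round(throughput[-1] / throughput[0], 2)
assert step_gains == [1.14, 1.21, 1.21, 1.46]
assert overall == 2.43  # roughly 2.4x end to end
```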
Reporting Queries on Audit Tables
[Chart: Reporting Response Times and Cluster IO across Test Set 1 – Test Set 4; series: Total Queries (Users * Queries Per User), Queries Returning Result within SLA, Average Time for Status Ready, Average Time to Read Results]
Operations
Getting Around Limitations of HDF Metrics (SPOF)
• Export HDF Metrics (SPOF HBase) into HDP HBase
• Distributed Ambari Metrics
[Screenshots: HDF Metrics, HDP Metrics]
Ambari LogSearch
Log Monitoring
• Monitor Log Errors for Services
• HDFS
• YARN
• Hive
• Nifi
• Kafka
• Zookeeper
Centralized Global Eye
Hortonworks DPS Operational App - in progress
• High Level Cluster Status
• Key Metrics
• Top-N Alerts
Questions?
Thank you

More Related Content

What's hot

Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
DataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 

What's hot (20)

Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
 

Similar to GDPR compliance application architecture and implementation using Hadoop and Streaming

Similar to GDPR compliance application architecture and implementation using Hadoop and Streaming (20)

Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and Superset
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Is 12 Factor App Right About Logging
Is 12 Factor App Right About LoggingIs 12 Factor App Right About Logging
Is 12 Factor App Right About Logging
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Recently uploaded (20)

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 

GDPR compliance application architecture and implementation using Hadoop and Streaming

  • 1. GDPR Compliance Application Architecture and Implementation using Hadoop and Streaming. Saurabh Mishra and Arun Thangamani, System Architects, Professional Services, Hortonworks
  • 2. Who are We? Saurabh Mishra, Systems Architect, Hortonworks Professional Services (@draftsperson); Arun Thangamani, Systems Architect, Hortonworks Professional Services (@ArunThangamani)
  • 3. GDPR Overview
  • 4. GDPR Regulation from a Technical Standpoint (diagram): the actors are the Data Subject, the Controller, the Processor, the Supervisory Authority, and users/3rd parties. Access to personal data must be authenticated, authorized, and audited, and processing must be enforced with specific purpose(s).
  • 5. GDPR - Quick Summary
    • Data subject’s rights to self data: access, rectification, erasure, portability, objection
    • Data subject’s data types: includes identifiers, biometric and genetic data
    • Data subject’s usage of data: enforce processing of data with specific purpose(s)
    • Data subject’s data, other specifications: low overhead to correct data; erasing data even from immutable systems
    • Data subject data protection, audit and reporting: tracking and copy prevention; enforced consent seeking
    • Authentication and authorization rules: minimize anonymous access; prevent unauthorized access; audit any access to data
    • Report any breaches within 72 hours
    • Prevent copies and transfer of data; track personal data movement
    • Enforced consent during authorization, even for cross-sell, upsell and data mining
  • 6. GDPR – Application Design
  • 7. Company XYZ – Efficient GDPR Implementation using Hadoop (architecture diagram): data centers 1..N each hold RDBMS silos with User Content Transaction Tables (UCTT) and Audit Tables (AT) in per-customer schemas (Customer-1 .. Customer-N), serving the user applications; the GDPR-regulated data runs to tens of petabytes spanning a few years to a few decades. The audit tables answer: 1) who did what? 2) what was done? In the target design, each data center combines the RDBMS with HDP and HDF: NiFi and Kafka carry the audit data out of the RDBMS silos into Hive on the Hadoop cluster.
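The audit records carried from the RDBMS silos into Kafka can be sketched as plain JSON events answering "who did what?" and "what was done?". The field names below are illustrative assumptions, not the deck's actual schema, and the Kafka publish step is noted only in a comment.

```python
import json
from datetime import datetime, timezone

def build_audit_event(user, action, table, row_id, datacenter):
    """Assemble one audit record: who did what, and what was done."""
    return {
        "event_time": datetime.now(timezone.utc).isoformat(),
        "datacenter": datacenter,
        "user": user,          # who did it
        "action": action,      # what was done (e.g. READ, UPDATE, DELETE)
        "table": table,        # which UCTT table was touched
        "row_id": row_id,
    }

# Serialize to JSON bytes, as a Kafka producer's value_serializer would,
# before publishing to the per-data-center audit topic
# (e.g. kafka-python: producer.send("audit-events", payload)).
event = build_audit_event("svc_billing", "UPDATE", "uctt.invoices", 42, "dc-1")
payload = json.dumps(event, sort_keys=True).encode("utf-8")
```

Keeping the event self-describing (data center, table, row id) is what later lets the reporting and right-to-be-forgotten flows locate a subject's audit rows.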
  • 8. Company XYZ – GDPR Applications Design per Data Center (diagram): GDPR audit table flows run from RDBMS silos 1..N into Kafka, then through NiFi/Spark applications into staging tables and audit tables in Hive on the YARN-managed HDP/HDF cluster, with purge and merge logic and HiveServer2 applications on top. Three use-case families: GDPR audit data ingestion (ingest use cases), GDPR right to be forgotten (update use cases), and GDPR audit reports (report use cases).
  • 9. Data Ingestion Internals - Kafka (diagram): Kafka producers 1..3 write through the OS page cache to brokers 1..3, each hosting topic partitions (e.g. T1-P3, T1-P4, T2-P1, T4-P5, T1-P6, T4-P12, T4-P15, T4-P8); Kafka receivers 1..3 consume them. A ZooKeeper ensemble (nodes 1..N) coordinates the brokers via znodes such as /controller, /topics, /admin, /consumer, /broker, and /kafka-acl.
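How a keyed record lands on one of those topic partitions can be sketched as hash-mod-partition routing. Kafka's default partitioner actually uses a murmur2 hash of the key; md5 stands in here purely for illustration.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Kafka-style routing sketch: hash(key) mod partition count.
    (Kafka's default partitioner uses murmur2; md5 is a stand-in here.)"""
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return h % num_partitions

# Records sharing a key (e.g. one data subject) always land on the same
# partition, which preserves per-subject ordering in the audit stream.
p1 = partition_for(b"subject-123", 6)
p2 = partition_for(b"subject-123", 6)
```

Choosing the data-subject identifier as the record key is one way to keep each subject's audit trail ordered within a single partition.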
  • 10. Data Ingestion Internals - Nifi (diagram): Kafka receivers KR-1..3 feed a MergeContent processor, then PutHiveStreaming (PHS) instances 1..N write to Hive through the NameNode and data nodes 1..N. Each NiFi node keeps three repositories on separate RAID-10 volumes: the FlowFile repository (4 disks, with an in-memory FlowFile cache), the content repository (6 disks), and the provenance repository (4 disks). Content repository details: entries are ordered, pushed to and retrieved from disk; default batch size is 10,000; copy-on-write and pass-by-reference semantics; flow files are copied/moved to the relevant queues, with container-section files addressed by offsets (content claims within resource claims). Provenance repository details: partitioned logs are combined into a merged log and indexed as Lucene shards. The Hive backend database is SQL Server.
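The MergeContent step above can be sketched as size-threshold batching: many small audit records are bundled into larger payloads before the Hive write, so the streaming sink sees fewer, bigger transactions. The structure and threshold below are illustrative, not NiFi's internal implementation.

```python
def merge_by_size(records, max_bytes):
    """Group records into bins no larger than max_bytes, preserving order,
    the way NiFi's MergeContent bundles small flow files before a write."""
    bins, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > max_bytes:
            bins.append(b"".join(current))   # flush the full bin
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        bins.append(b"".join(current))       # flush the remainder
    return bins

# Ten 40-byte records with a 100-byte cap merge into five 80-byte bundles.
batches = merge_by_size([b"x" * 40] * 10, max_bytes=100)
```

The bundle size is a real tuning lever: the deck's own iterations later report that shrinking the merge size from 1 GB to 256 MB improved stability and throughput.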
  • 11. Efficient “Right to be Forgotten” - Hive ACID - Merged Reads (diagram): Table A is partitioned by date (e.g. Partition=2018-03-15 .. Partition=2018-03-18), each partition holding base files in buckets 1..5. Merge on write: updates delete and re-create the base files (simple reads, heavy writes). Merge on read: updates land in base and delta files, merged at read time (simple writes, heavier reads).
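The merge-on-read side can be sketched in a few lines: the base files stay immutable while readers overlay upserts and deletes from the delta files at query time. This is a conceptual model, not Hive's actual ORC-level implementation.

```python
def merged_read(base, deltas):
    """Merge-on-read sketch: base is {row_id: value}; each delta is a
    (row_id, value) upsert, with value=None meaning delete. The base is
    never rewritten; readers overlay the deltas at query time."""
    view = dict(base)
    for row_id, value in deltas:
        if value is None:
            view.pop(row_id, None)   # tombstone: the row is forgotten
        else:
            view[row_id] = value     # insert or update
    return view

base = {1: "a", 2: "b", 3: "c"}
deltas = [(2, "B"), (3, None), (4, "d")]
view = merged_read(base, deltas)
```

This is why erasure from an effectively immutable store is cheap to write: deleting a subject's row only appends a small tombstone, and the read path hides the original value.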
  • 12. Efficient Data Lifecycle Internals – Hive ACID with Merged Reads (diagram): NiFi ingests using Hive ACID, purge and update requests use Hive MERGE, and reporting uses Hive LLAP. Life of a single partition of Table A: buckets 1..5 start with tall base files and short delta files, accumulate many short delta files, and a compaction process rewrites them into tall new base files. Within one bucket, merge-on-read processing happens inside the containers: each mapper (1..6) reads both the source ORC stripes of the base file and the delta ORC files to provide the answer.
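The compaction step in that lifecycle can be sketched the same way: deltas pile up per transaction batch, and once enough accumulate, a compaction folds them (oldest first) into a new tall base so later reads stop paying the merge cost. The threshold and data structures below are illustrative, not Hive's real compaction policy knobs.

```python
def maybe_compact(base_rows, delta_files, threshold=10):
    """Life of one bucket: below the threshold, keep reading base + deltas;
    at or above it, fold the deltas into a new tall base and retire them."""
    if len(delta_files) < threshold:
        return base_rows, delta_files          # merge-on-read continues
    merged = dict(base_rows)
    for delta in delta_files:                  # apply oldest first
        for row_id, value in delta:
            if value is None:
                merged.pop(row_id, None)       # tombstoned rows drop out
            else:
                merged[row_id] = value
    return merged, []                          # new base, empty delta list

base = {1: "a", 2: "b"}
deltas = [[(2, "b2")], [(1, None)], [(3, "c")]]
new_base, remaining = maybe_compact(base, deltas, threshold=3)
```

Note that compaction also physically removes tombstoned rows from the new base files, which is what ultimately makes a "forgotten" row unrecoverable from the table.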
  • 13. Efficient “Right to be Forgotten” using Hive ACID Merge (diagram): IDs needing insert, update, and delete are loaded into staging Table B; a merge process applies them across the partitions of Table A (mappers 1..N, working against the buckets of base files in each partition, e.g. Partition=2018-03-18). The Hive MERGE statement:
    MERGE INTO audit.A AS T USING audit.staging_table AS S ON T.ID = S.ID AND T.tran_date = S.tran_date WHEN MATCHED AND (T.TranValue != S.TranValue AND S.TranValue IS NOT NULL) THEN UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update' WHEN MATCHED AND S.TranValue IS NULL THEN DELETE WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);
    Hive ACID lets writes and reads obtain shared locks at the same time; writes on the same table and partition are chained. Performance: ~50 tables with a total of 256 TB of data; a Hive ACID merge on all tables and partitions completed in approximately 4 hours.
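The three branches of that MERGE statement can be paraphrased as plain dictionary logic, which makes the right-to-be-forgotten path visible: a matched row whose staging value is NULL gets deleted. This mirrors the statement's semantics only; it is not how Hive executes the merge.

```python
def hive_merge(target, staging):
    """Paraphrase of the slide's MERGE, with rows keyed by (ID, tran_date).
    Matched + differing non-null value -> UPDATE ('merge_update');
    matched + null value -> DELETE (right to be forgotten);
    not matched -> INSERT ('merge_insert')."""
    result = dict(target)
    for key, tran_value in staging.items():
        if key in result:
            if tran_value is None:
                del result[key]                             # WHEN MATCHED ... DELETE
            elif result[key][0] != tran_value:
                result[key] = (tran_value, "merge_update")  # WHEN MATCHED ... UPDATE
        else:
            result[key] = (tran_value, "merge_insert")      # WHEN NOT MATCHED ... INSERT
    return result

target = {(1, "2018-03-18"): ("old", "ingest"),
          (5, "2018-03-17"): ("gone-soon", "ingest")}
staging = {(1, "2018-03-18"): "new",       # update
           (5, "2018-03-17"): None,        # delete
           (9, "2018-03-18"): "fresh"}     # insert
merged = hive_merge(target, staging)
```

Staging a NULL TranValue for every row belonging to a data subject is thus the whole erasure mechanism: one MERGE pass deletes them across all matched partitions.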
  • 14. Platform
  • 15. GDPR Requirement for Global Deployment Across the World. Cluster deployment: N data centers around the world; preview and production environments; each environment consists of HDF and HDP clusters; 60 Ambari-managed clusters; T-shirt-sized data centers (large, medium, small).
  • 16. Cluster Blueprint Design (diagram). HDP cluster: five master and management nodes (including standby masters) hosting the active and standby NameNodes with ZKFailoverControllers, active and standby YARN ResourceManagers, a Hive Metastore dedicated to compaction plus Hive Metastore/HiveServer2/WebHCat instances 1 and 2, three ZooKeeper and JournalNode instances, two Knox instances, YARN ATS, the MR2 History Server, Slider, Infra Solr 1 and 2, the Logsearch server, Ambari Server, HST Server, Activity Analyzer/Explorer, and Ranger in active/standby. Slave nodes run DataNode and NodeManager; two of them also host the distributed Ambari Metrics Collector. HDF cluster: two master and management nodes (Ambari Server, Ambari Infra, Logsearch server and UI, Metrics Collector, Metrics Grafana, Infra Solr instances, and Ranger in active/standby), NiFi nodes, and Kafka broker nodes with ZooKeeper. Every node also carries the Kerberos client, Ambari agent, HST agent, Logsearch log feeder, and metrics monitor.
  • 17. Platform Security – GDPR Compliant (diagram). A gateway of two Knox instances fronts the cluster over HTTPS; AD/LDAP serves as the user directory and a KDC as the token provider. In the sample user-authentication process, Knox obtains a Knox TGT and acts as a proxy user, forwarding authenticated requests to the application services (HiveServer2, Oozie, Zeppelin Server, Spark History Server, YARN Timeline Server, and others) on the edge nodes; services use delegation tokens for efficiency. Two Ranger instances, each backed by SQL Server, form the user-authorization system across the NameNode, ResourceManager, Ambari, and the data nodes.
  • 18. Automated Cluster Deployment: auto-deploy the OS (with OS best practices for HDP and HDF); auto-deploy HDF and HDP via Ambari APIs and Ambari Blueprints; auto-validate the deployment (Ambari smoke tests via the APIs, blueprint comparison against the standard); auto-deploy the GDPR application (Kafka producer deployment, Kafka topic creation, NiFi flow deployment, Hive DB and HDFS directory creation, custom Ambari alerts deployment); auto-validate the GDPR application.
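A minimal Ambari blueprint body, of the kind registered through the Ambari REST API before cluster creation, might look like the following. The blueprint name, stack version, and host-group contents are illustrative stand-ins, not the deck's production blueprint.

```python
import json

# Minimal Ambari blueprint sketch (illustrative host groups, not the full
# production blueprint). A blueprint like this is registered via Ambari's
# REST API, e.g. POST /api/v1/blueprints/gdpr-hdp with an X-Requested-By
# header, then instantiated against a host mapping to create the cluster.
blueprint = {
    "Blueprints": {"blueprint_name": "gdpr-hdp",
                   "stack_name": "HDP",
                   "stack_version": "2.6"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"},
                        {"name": "HIVE_SERVER"}, {"name": "ZOOKEEPER_SERVER"}]},
        {"name": "worker", "cardinality": "1+",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}
body = json.dumps(blueprint, indent=2)
```

Because the blueprint is plain JSON, the "auto-validate" step above can diff a running cluster's exported blueprint against this standard programmatically.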
  • 19. Performance Tuning
  • 20. Data Ingestion – Operating System Tunables: disable transparent huge pages (echo never > defrag and > enabled); disable swap; configure VM cache flushing; set the IO scheduler to deadline; disks as JBOD with ext4 (mount options: inode_readahead_blks=128, data=writeback, noatime, nodiratime); network: dual bonded 10 Gbps with rx-checksumming, tx-checksumming, scatter-gather, and tcp-segmentation-offload on.
  • 21. Data Ingestion – Tunables. Data tunables: (1) buckets: table buckets for audit type 1 and audit type 2 logs; (2) Kafka events: events per Kafka event for audit type 1 and audit type 2 logs; (3) ingested data volume: table data ingested for audit type 1 and audit type 2 logs, and ingested partitions (days) per table. Platform tunables: (1) NiFi PHS transaction settings: transactions per batch, rows per transaction, connections per process, NiFi insert interval to Hive; (2) NiFi MergeContent settings: merge content size, merge processes, merge process threads; (3) NiFi settings: queue size, concurrent threads, memory per node; (4) Hive settings: ORC stripe size.
  • 22. Data Ingestion Tuning in Iterations. Issues/bottlenecks identified and resolved: Iteration 1 - NiFi merge process: 1 GB from the merge process was too high; 256 MB gave more stability and throughput (21 MB/sec to 24 MB/sec). Iteration 2 - NiFi disk changes: high content-repository disk usage (24 MB/sec to 29 MB/sec). Iteration 3 - NiFi PHS processor parallelism: multiple PHS instances, working around ORC creation bottlenecks (29 MB/sec to 35 MB/sec). Iteration 4 - NiFi Hive changes: transactions per batch, rows per transaction, Hive bucket tuning (35 MB/sec to 51 MB/sec). (Chart: Data Ingest – NiFi Hive Streaming throughput in MB/sec across iterations 0-4: 21, 24, 29, 35, 51.)
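Those iteration figures work out to roughly a 2.4x end-to-end throughput gain; as a quick calculation:

```python
# Throughput after each tuning iteration (from the slide), in MB/sec.
throughput = [21, 24, 29, 35, 51]

# Per-iteration percentage gains, and the cumulative gain over the baseline.
step_gains = [round((b / a - 1) * 100, 1)
              for a, b in zip(throughput, throughput[1:])]
total_gain = round((throughput[-1] / throughput[0] - 1) * 100, 1)
```

The final Hive-side changes (iteration 4) contribute the single largest step, a point worth noting when deciding where to spend tuning effort first.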
  • 23. Reporting Queries on Audit Tables (chart): reporting response times and cluster IO across test sets 1-4, tracking total queries (users * queries per user), queries returning a result within the SLA, average time for status ready, and average time to read results.
  • 24. Operations
  • 25. Getting Around Limitations of HDF Metrics (SPOF): export the HDF metrics (single-point-of-failure HBase) into the distributed Ambari Metrics HBase on HDP, alongside the HDP metrics.
  • 26. Ambari LogSearch Log Monitoring: monitor log errors for the services HDFS, YARN, Hive, NiFi, Kafka, and ZooKeeper.
  • 27. Centralized Global Eye: the Hortonworks DPS operational app (in progress), showing high-level cluster status, key metrics, and top-N alerts.
  • 28. Questions?
  • 29. Thank you

Editor's Notes

  1. Hello, good evening. As we all know, the GDPR requirement hit the industry on May 25th. We got an opportunity to optimize an existing application to provide new functionality related to GDPR audit logging. We will describe the details of that implementation, the different design patterns used, and the various optimizations we made at several layers to achieve our performance goals.
  2. I am Saurabh, a Systems Architect with Hortonworks Professional Services since 2012 and an active Hadoop implementer over the last decade. With me is Arun.
  3. Now let's understand GDPR from a technical standpoint. We have a system in which a data subject's personal data exists; the owner of the system is the controller. With GDPR, the data subject should be able to access, rectify, and erase their data, move it to another system, and object to its use. Anyone accessing the system should be forced to authenticate their identity and be authorized, and all actions should be available in logs as an audit trail. The processor is the party who works with the controller and uses the personal data; the processor must do all three A's. Along with that, the processor should be held to the specific purpose for which it accesses the personal data, which is enforced consent seeking. There should be no copies of the data, or any copies should be auditable, with low overhead for correcting data. Even if the system is immutable, erasure of data should be possible. Most important: any breach, if it happens, should be reported within 72 hours.
  4. A quick summary. One of the most important aspects of all this is the auditability of each and every action happening on personal data. The right to be forgotten and the reporting requirements apply to the audit data as well. Now let's look at our implementation.
  5. Audit data from the application and database is ingested by Kafka producers into NiFi and then into Hive ACID tables using PutHiveStreaming. Audit reporting and the right to be forgotten (update and delete) are made possible using Hive LLAP and the ACID features of the system via asynchronous HiveServer2 access.