Securing Data in Hadoop at Uber
Mohammad Islam
Wei Han
Speaker Intro
• Mohammad Islam
– Staff Engineer @Uber
– Apache Hadoop contributor, PMC member for Apache Oozie & Tez
– Co-Authored O’Reilly book about Apache Oozie
• Wei Han
– Technical Manager @ Uber
– Lead Hadoop Security team
What is (NOT) covered?
• Securing Hadoop data lake at Uber
• Focus on technologies
– Open source + internal tools
• NOT covering all aspects of data security
• NOT legal advice or guidance
Data Security in Hadoop
What is Data Security?
• Prevent unauthorized access to data.
• Technical focus area in data lake:
AAAA
Authentication
Authorization
Auditing
Anonymization
4 Pillars of Data Security (1/2)
1. Authentication (AuthN)
Verify identity of a user
2. Authorization (AuthZ)
Access control of data
3. Auditing
• Post-mortem
• Anomaly detection
4. Anonymization
• tokenization
• Masking etc.
4 Pillars of Data Security (2/2)
Design Considerations
• Secure all access paths to HDFS
• Enforcement at the lowest level
• User/group based (AD) access control
• Centralized policy store
Not at the cost of infrastructure flexibility
Internal Services
BI Tools
Workbench
ETL
Ingestion
Machine learning
Metadata
Hive
Presto
Spark
WF Scheduler
Flink
Pinot
YARN
Mesos
At a Glance ..
HDFS
• Outer layer verifies user identity
• Middle layers securely pass the identity
• Innermost layer enforces access control
Identity verification
Pass identity
Access enforcement
Authentication
Authentication Overview
Hadoop Ecosystem w/ Kerberos
Hive, Presto, Spark, Flink etc
Non-Hadoop Services
Interactive Batch
Gateway
Machine
Apache KNOX
Kerberos
Custom Token
Web Authentication
Kinit
Kerberos
Kerberos/DT
• Kerberos and delegation token (DT) for Hadoop
• Custom S2S authN for non-Hadoop Services
AuthN Protocol Translation - Knox
• Why?
– Seamless integration among AuthN protocols
• Translate custom AuthN protocols to Kerberos
• Contributed to Apache Knox
– Pluggable AuthN validator for any custom AuthN protocol (KNOX-861, KNOX-869)
– Improved monitoring (KNOX-940)
Impersonation/Delegation
• Why?
– Hadoop already supports impersonation or doAs
• Work on-behalf-of others
– Internal authN mechanism doesn’t support it
• How?
– Utilize Apache Knox
– Whitelist the impersonated services using config
– Idea borrowed from Hadoop core-site.xml
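The whitelist idea can be sketched roughly as follows. Hadoop's core-site.xml expresses it with `hadoop.proxyuser.<service>.users` and `hadoop.proxyuser.<service>.hosts` style properties. The dictionary layout, the service names, and the `may_impersonate` function are hypothetical illustrations, not the Knox configuration described in the talk.

```python
from typing import Dict, List

# Hypothetical whitelist, modeled on Hadoop's proxy-user properties
# (hadoop.proxyuser.<service>.users / .hosts) in core-site.xml.
PROXY_RULES: Dict[str, Dict[str, List[str]]] = {
    "workbench": {"users": ["*"], "hosts": ["10.0.0.5"]},
    "etl-service": {"users": ["alice", "bob"], "hosts": ["*"]},
}


def _matches(allowed: List[str], value: str) -> bool:
    """A '*' entry allows everything; otherwise require an exact match."""
    return "*" in allowed or value in allowed


def may_impersonate(service: str, target_user: str, remote_host: str,
                    rules: Dict[str, Dict[str, List[str]]] = PROXY_RULES) -> bool:
    """Return True only if `service` is whitelisted to act on behalf of
    `target_user` when calling from `remote_host`."""
    rule = rules.get(service)
    if rule is None:
        return False  # services that are not listed can never impersonate
    return (_matches(rule.get("users", []), target_user)
            and _matches(rule.get("hosts", []), remote_host))
```

Denying by default for unlisted services keeps the whitelist the only way to gain doAs rights.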
Off-label Usage : Delegation Token
• Delegation token (DT) is a Hadoop concept
• Used DT for authentication only when:
– Other protocols don't work or are not ready
– DT for HDFS is already available
– The service is an HTTP REST service
• Added support in Presto and a few other internal services
Off-label Usage : Delegation Token :
Client
Service
Service
NameNode
1. Get DT w/ Kerb
2. DT as HTTP header
3. Verify DT
Summary
• (+) Quick and easy to implement
• (-) Extra load on NN (caching can address it)
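A minimal sketch of the service side of this flow, including the caching that the summary mentions as a way to reduce NameNode load. The header name `X-Delegation-Token`, the class name, and the injected `verify_with_namenode` callable are assumptions made for illustration. The talk does not say how the services actually check a token against the NameNode.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachingTokenVerifier:
    """Verifies a delegation token sent as an HTTP header, caching results."""

    def __init__(self,
                 verify_with_namenode: Callable[[str], Optional[str]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # verify_with_namenode is a hypothetical hook: it returns the user
        # name for a valid token, or None if the NameNode rejects it.
        self._verify = verify_with_namenode
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def authenticate(self, headers: Dict[str, str]) -> Optional[str]:
        token = headers.get("X-Delegation-Token")  # hypothetical header name
        if not token:
            return None
        now = self._clock()
        cached = self._cache.get(token)
        if cached is not None and cached[1] > now:
            return cached[0]  # cache hit: no NameNode round trip
        user = self._verify(token)
        if user is not None:
            self._cache[token] = (user, now + self._ttl)
        return user
```

The trade-off in this sketch is that a cancelled token may still be accepted until its cache entry expires, so the TTL should stay short.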
Authorization
Authorization
Hive/
Presto
Spark/
Flink
Other
Engines
user
Policy Enforcement
admin
Self-serve
Platform
Policy
Store
HDFS
Extended ACLs
Policy
Policy
management
RBAC Policy to ACLs
Design Options
1. Use HDFS plugin (Sentry/Ranger)
2. Set the ACLs directly in HDFS
Why Option 2?
1. No overhead (Mem/latency) to NN
2. NN Scale/stability is a concern
3. Challenge : Keep in sync
admin
Policy
Propagator
HDFS
<Table, permission, groups>
setacl <data_dir, permission, groups>
Self-serve
Platform
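The translation from a `<Table, permission, groups>` policy to an ACL update on the table's data directory might look like the sketch below. `hdfs dfs -setfacl -R -m` is the standard HDFS command for modifying ACL entries recursively. The `TablePolicy` type, the table-to-directory mapping, and the example path are hypothetical.

```python
from typing import Dict, List, NamedTuple


class TablePolicy(NamedTuple):
    table: str
    permission: str   # e.g. "r-x"
    groups: List[str]


def to_setfacl_command(policy: TablePolicy,
                       table_dirs: Dict[str, str]) -> List[str]:
    """Build the argv for one `hdfs dfs -setfacl` call for a table policy.

    table_dirs maps a table name to its HDFS data directory (assumed to
    come from the metastore; the lookup itself is out of scope here).
    """
    data_dir = table_dirs[policy.table]
    acl_spec = ",".join(f"group:{g}:{policy.permission}" for g in policy.groups)
    return ["hdfs", "dfs", "-setfacl", "-R", "-m", acl_spec, data_dir]
```

A propagator could pass the returned list to `subprocess.run`. Returning the argv rather than running it keeps the mapping easy to test.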
Partition-based Access Control (1/2)
• Policy defined at partition level
– Usual access control is table-level
– Access can change based on time or geography
• Example use case:
– “events” table is partitioned by date
– Policies
• By default, employees can access only recent events records
• Only authorized groups can access events records older than X days
Partition-based Access Control (2/2)
/events/date=2018.06.06
group: employee, authorized_group:r-x
/events/date=2017.02.06
group: authorized_group:r-x
Table    Privilege   Group              Time restriction
events   read        employee           X days
events   read        authorized_group   none
admin
Self-serve
Platform
Policy
Propagator
HDFS
Set ACL
Refresh ACL
(periodically)
Sample Policies
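The sample policies can be turned into per-partition ACL entries along these lines. The directory layout follows the slide (`/events/date=2018.06.06`). The `PartitionPolicy` type, the function name, and the choice of 30 days for X are assumptions for illustration only.

```python
from datetime import date, datetime
from typing import List, NamedTuple, Optional


class PartitionPolicy(NamedTuple):
    group: str
    permission: str
    max_age_days: Optional[int]  # None means no time restriction


def partition_acl(partition_dir: str, today: date,
                  policies: List[PartitionPolicy]) -> List[str]:
    """Return the ACL entries that should apply to one date partition."""
    # Parse the partition date from a path such as /events/date=2018.06.06
    day = datetime.strptime(partition_dir.rsplit("date=", 1)[1], "%Y.%m.%d").date()
    age_days = (today - day).days
    return [f"group:{p.group}:{p.permission}"
            for p in policies
            if p.max_age_days is None or age_days <= p.max_age_days]


# X = 30 days is a made-up value; the talk leaves X unspecified.
POLICIES = [PartitionPolicy("employee", "r-x", 30),
            PartitionPolicy("authorized_group", "r-x", None)]
```

Because a partition's age changes every day, the result has to be recomputed and reapplied periodically, which matches the "Refresh ACL (periodically)" step in the diagram.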
Auditing
Auditing
Audit
Collection
Log Analysis
and
Monitoring
Kafka
File access logs Query logs Service access logs
HDFS
Query Engines
(Hive/Presto)
Services
Producers
admin
HDFS
Query Engines
Real-time
monitoring
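As a rough illustration of the monitoring step, the sketch below scans audit events and flags users with repeated denied accesses. It assumes each Kafka message is a JSON object with `user` and `allowed` fields. That schema, the function name, and the threshold are hypothetical, and reading from Kafka itself is left out.

```python
import json
from collections import Counter
from typing import Iterable, List


def users_with_many_denials(messages: Iterable[str],
                            threshold: int = 3) -> List[str]:
    """Return users (sorted) with at least `threshold` denied accesses."""
    denied: Counter = Counter()
    for raw in messages:
        event = json.loads(raw)
        # Events without an "allowed" field are treated as allowed.
        if not event.get("allowed", True):
            denied[event["user"]] += 1
    return sorted(user for user, count in denied.items() if count >= threshold)
```

A real-time monitor would apply the same counting over a sliding time window rather than a finite batch.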
Anonymization
Anonymization
• Transform data into an unidentifiable form
• Loosely used to refer to:
– Removal
– Redaction
– Masking
– Tokenization
Next : Enforce AuthZ through Encryption
Column-level Access Control
• Why?
– When only some columns in a table are sensitive and need special access control
– Finer grained access control based on level of sensitivity
Column              C1   C2   C3   …   C15   …   C34   …   C100
Sensitivity level   0    8    0    0   6     0   9     0   0
• Challenges
– Enforce on common access paths
– HDFS has no notion of columns
Approach
• Enforce at the data format (Parquet) level
• Encrypt only sensitive columns in HDFS
• Access controlled through encryption key management
• Different columns can have different keys for encryption/decryption
• Open source activities: PARQUET-1178 and PARQUET-1325
Key
Management Parquet lib
Reader
(Hive, Spark, etc)
C1: plain text
C2E: encrypted
C1: plain text
C2: plain data (has access)
error/masked (no access)
Writer
(Hive, Spark, etc)
Parquet lib
HDFS
C1: plain text (non-sensitive)
C2: plain text (sensitive)
C1: plain text
C2E: encrypted
Column-level Access Control
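The writer/reader flow in this diagram can be modeled in a few lines. This is a sketch only: the cipher is a toy XOR stream that stands in for real encryption and is NOT secure, and `KeyManager`, `write_row`, and `read_row` are hypothetical names rather than the Parquet library API from PARQUET-1178.

```python
import hashlib
from typing import Dict, Optional, Set

MASKED = "<masked>"


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher for illustration only; NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


class KeyManager:
    """Hypothetical key service: one key per sensitive column."""

    def __init__(self, column_keys: Dict[str, bytes],
                 readers: Dict[str, Set[str]]) -> None:
        self._keys = column_keys
        self._readers = readers  # column -> groups allowed to decrypt

    def is_sensitive(self, column: str) -> bool:
        return column in self._keys

    def writer_key(self, column: str) -> bytes:
        return self._keys[column]

    def reader_key(self, column: str, groups: Set[str]) -> Optional[bytes]:
        if self._readers.get(column, set()) & groups:
            return self._keys[column]
        return None


def write_row(row: Dict[str, str], km: KeyManager) -> Dict[str, bytes]:
    """Encrypt only the sensitive columns before the row is stored."""
    stored = {}
    for column, value in row.items():
        raw = value.encode("utf-8")
        stored[column] = (_toy_cipher(km.writer_key(column), raw)
                          if km.is_sensitive(column) else raw)
    return stored


def read_row(stored: Dict[str, bytes], km: KeyManager,
             groups: Set[str]) -> Dict[str, str]:
    """Decrypt columns the caller has keys for; mask the rest."""
    row = {}
    for column, raw in stored.items():
        if not km.is_sensitive(column):
            row[column] = raw.decode("utf-8")
            continue
        key = km.reader_key(column, groups)
        row[column] = (_toy_cipher(key, raw).decode("utf-8")
                       if key is not None else MASKED)
    return row
```

Because access is decided by whether the key service releases a column's key, the same stored bytes yield plain data for one group and a masked value for another.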
Conclusion
Take Away
1. Security scope within Hadoop is expanding
• Conventional thinking is being challenged
• Need significant changes in the architecture
2. Finer-grained security is a must for big data
• Column/Partition/Row level access control
3. Security by design is critical
• Retrofitting is very hard
Q&A

More Related Content

What's hot

Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
 
Secure your app with keycloak
Secure your app with keycloakSecure your app with keycloak
Secure your app with keycloak
Guy Marom
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with Keycloak
Julien Pivotto
 
Azure Identity and access management
Azure   Identity and access managementAzure   Identity and access management
Azure Identity and access management
Dinusha Kumarasiri
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to Kibana
Vineet .
 
Understanding container security
Understanding container securityUnderstanding container security
Understanding container security
John Kinsella
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
Flink Forward
 
Getting started with Apache Camel presentation at BarcelonaJUG, january 2014
Getting started with Apache Camel presentation at BarcelonaJUG, january 2014Getting started with Apache Camel presentation at BarcelonaJUG, january 2014
Getting started with Apache Camel presentation at BarcelonaJUG, january 2014
Claus Ibsen
 
WordPress + NGINX Best Practices with EasyEngine
WordPress + NGINX Best Practices with EasyEngineWordPress + NGINX Best Practices with EasyEngine
WordPress + NGINX Best Practices with EasyEngine
NGINX, Inc.
 
Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic Protection
Amazon Web Services
 
Building secure applications with keycloak
Building secure applications with keycloak Building secure applications with keycloak
Building secure applications with keycloak
Abhishek Koserwal
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & Operators
SIGHUP
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
Flink Forward
 
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
Amazon Web Services
 
How Liberty Mutual Moves toward Real-Time Financial Closing
How Liberty Mutual Moves toward Real-Time Financial ClosingHow Liberty Mutual Moves toward Real-Time Financial Closing
How Liberty Mutual Moves toward Real-Time Financial Closing
Amazon Web Services
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
Amazon Web Services
 
Zuul @ Netflix SpringOne Platform
Zuul @ Netflix SpringOne PlatformZuul @ Netflix SpringOne Platform
Zuul @ Netflix SpringOne Platform
Mikey Cohen - Hiring Amazing Engineers
 
Gotchas using Terraform in a secure delivery pipeline
Gotchas using Terraform in a secure delivery pipelineGotchas using Terraform in a secure delivery pipeline
Gotchas using Terraform in a secure delivery pipeline
Anton Babenko
 
Secret Management with Hashicorp’s Vault
Secret Management with Hashicorp’s VaultSecret Management with Hashicorp’s Vault
Secret Management with Hashicorp’s Vault
AWS Germany
 

What's hot (20)

Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Secure your app with keycloak
Secure your app with keycloakSecure your app with keycloak
Secure your app with keycloak
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with Keycloak
 
Azure Identity and access management
Azure   Identity and access managementAzure   Identity and access management
Azure Identity and access management
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to Kibana
 
Understanding container security
Understanding container securityUnderstanding container security
Understanding container security
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Getting started with Apache Camel presentation at BarcelonaJUG, january 2014
Getting started with Apache Camel presentation at BarcelonaJUG, january 2014Getting started with Apache Camel presentation at BarcelonaJUG, january 2014
Getting started with Apache Camel presentation at BarcelonaJUG, january 2014
 
WordPress + NGINX Best Practices with EasyEngine
WordPress + NGINX Best Practices with EasyEngineWordPress + NGINX Best Practices with EasyEngine
WordPress + NGINX Best Practices with EasyEngine
 
Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic Protection
 
Building secure applications with keycloak
Building secure applications with keycloak Building secure applications with keycloak
Building secure applications with keycloak
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & Operators
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
 
How Liberty Mutual Moves toward Real-Time Financial Closing
How Liberty Mutual Moves toward Real-Time Financial ClosingHow Liberty Mutual Moves toward Real-Time Financial Closing
How Liberty Mutual Moves toward Real-Time Financial Closing
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
 
Zuul @ Netflix SpringOne Platform
Zuul @ Netflix SpringOne PlatformZuul @ Netflix SpringOne Platform
Zuul @ Netflix SpringOne Platform
 
Gotchas using Terraform in a secure delivery pipeline
Gotchas using Terraform in a secure delivery pipelineGotchas using Terraform in a secure delivery pipeline
Gotchas using Terraform in a secure delivery pipeline
 
Secret Management with Hashicorp’s Vault
Secret Management with Hashicorp’s VaultSecret Management with Hashicorp’s Vault
Secret Management with Hashicorp’s Vault
 

Similar to Securing Data in Hadoop at Uber

Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop Ecosystem
DataWorks Summit
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
Chris Nauroth
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
DataWorks Summit
 
Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
DataWorks Summit
 
TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Ranger
trihug
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
Cloudera, Inc.
 
BigData Security - A Point of View
BigData Security - A Point of ViewBigData Security - A Point of View
BigData Security - A Point of View
Karan Alang
 
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by ClouderaBig Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Caserta
 
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
Great Wide Open
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
Rommel Garcia
 
Curb your insecurity with HDP - Tips for a Secure Cluster
Curb your insecurity with HDP - Tips for a Secure ClusterCurb your insecurity with HDP - Tips for a Secure Cluster
Curb your insecurity with HDP - Tips for a Secure Cluster
ahortonworks
 
BSides SG Practical Red Teaming Workshop
BSides SG Practical Red Teaming WorkshopBSides SG Practical Red Teaming Workshop
BSides SG Practical Red Teaming Workshop
Ajay Choudhary
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
DataWorks Summit
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
Adam Muise
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
Uwe Printz
 
Hadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster AccessHadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster Access
Cloudera, Inc.
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
DataWorks Summit
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
Hadoop security
Hadoop securityHadoop security
Hadoop security
Shivaji Dutta
 

Similar to Securing Data in Hadoop at Uber (20)

Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop Ecosystem
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
 
TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Ranger
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
 
BigData Security - A Point of View
BigData Security - A Point of ViewBigData Security - A Point of View
BigData Security - A Point of View
 
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by ClouderaBig Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera
 
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
Curb your insecurity with HDP - Tips for a Secure Cluster
Curb your insecurity with HDP - Tips for a Secure ClusterCurb your insecurity with HDP - Tips for a Secure Cluster
Curb your insecurity with HDP - Tips for a Secure Cluster
 
BSides SG Practical Red Teaming Workshop
BSides SG Practical Red Teaming WorkshopBSides SG Practical Red Teaming Workshop
BSides SG Practical Red Teaming Workshop
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Hadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster AccessHadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster Access
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
HDF Cloud Services
 
Hadoop security
Hadoop securityHadoop security
Hadoop security
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 



Securing Data in Hadoop at Uber

  • 1. Securing Data in Hadoop at Uber Mohammad Islam, Wei Han
  • 2. Speaker Intro • Mohammad Islam – Staff Engineer @Uber – Apache Hadoop contributor, PMC in Oozie & TEZ – Co-Authored O’Reilly book about Apache Oozie • Wei Han – Technical Manager @ Uber – Lead Hadoop Security team
  • 3. What is (NOT) covered? • Securing Hadoop data lake at Uber • Focus on technologies – Open source + internal tools • NOT covering all aspects of data security • NOT legal advice or guidance
  • 5. What is Data Security?  Prevent unauthorized access to data.  Technical focus area in data lake: AAAA Authentication Authorization Auditing Anonymization
  • 6. 4 Pillars of Data Security (1/2) 1. Authentication (AuthN) Verify identity of a user 2. Authorization (AuthZ) Access control of data
  • 7. 4 Pillars of Data Security (2/2) 3. Auditing • Post-mortem • Anomaly detection 4. Anonymization • Tokenization • Masking etc.
  • 8. Design Considerations • Secure all access paths to HDFS • Enforcement at the lowest-level • User/group based (AD) access control • Centralized policy store Not at the cost of infrastructure flexibility
  • 9. At a Glance .. Internal Services, BI Tools, Workbench, ETL, Ingestion, Machine learning, Metadata, Hive, Presto, Spark, WF Scheduler, Flink, Pinot, YARN, Mesos, HDFS • Outer layer verifies user identity • Middle layers securely pass the identity • Innermost layer enforces access control (Identity verification, Pass identity, Access enforcement)
  • 11. Authentication Overview Hadoop Ecosystem w/ Kerberos: Hive, Presto, Spark, Flink etc. Non-Hadoop Services, Interactive, Batch, Gateway Machine, Apache KNOX, Kerberos, Custom Token, Web Authentication, Kinit, Kerberos, Kerberos/DT • Kerberos and delegation token (DT) for Hadoop • Custom S2S authN for non-Hadoop Services
  • 12. AuthN Protocol Translation - Knox • Why? – Seamless integration among AuthN protocols • Translate custom AuthN protocols to Kerberos • Contributed to Apache Knox – Pluggable AuthN validator for any custom AuthN protocol (KNOX-861, KNOX-869) – Improved monitoring (KNOX-940)
  • 13. Impersonation/Delegation • Why? – Hadoop already supports impersonation or doAs • Work on-behalf-of others – Internal authN mechanism doesn’t support it • How? – Utilize Apache Knox – Whitelist the impersonated services using config – Idea borrowed from Hadoop core-site.xml
  • 14. Off-label Usage : Delegation Token • Delegation token (DT) is a Hadoop concept • Used DT for authentication only when:  Other protocol doesn’t work or is not ready  DT for HDFS is already available  HTTP REST service • Added support in Presto and a few other internal services
  • 15. Off-label Usage : Delegation Token : Client Service Service NameNode 1. Get DT w/ Kerb 2. DT as HTTP header 3. Verify DT Summary • (+) Quick and easy to implement • (-) Extra load on NN (caching can address it)
  • 18. RBAC Policy to ACLs Design Options 1. Use HDFS plugin (Sentry/Ranger) 2. Set the ACLs directly in HDFS Why Option 2? 1. No overhead (Mem/latency) to NN 2. NN Scale/stability is a concern 3. Challenge: keep in sync Diagram: admin, Self-serve Platform, Policy Propagator <Table, permission, groups>, setacl <data_dir, permission, groups>, HDFS
  • 19. Partition-based Access Control (1/2) • Policy defined at partition level – Usual access control is table-level – Access can change based on time or geography • Example use case: – “events” table is partitioned by date – Policies • By default, employee can only access new events records • Only authorized groups can access events records older than X days
  • 20. Partition-based Access Control (2/2)
      Sample Policies
      Table   Privilege  Group             Time restriction
      events  read       employee          X days
      events  read       authorized_group  none
      Resulting ACLs
      /events/date=2018.06.06  group: employee, authorized_group:r-x
      /events/date=2017.02.06  group: authorized_group:r-x
      Diagram: admin, Self-serve Platform, Policy Propagator, HDFS (Set ACL; Refresh ACL periodically)
  • 22. Auditing Audit Collection Log Analysis and Monitoring Kafka File access logs Query logs Service access logs HDFS Query Engines (Hive/Presto) Services Producers admin HDFS Query Engines Real-time monitoring
  • 24. Anonymization • Transform any data into unidentifiable form • Loosely used to mention: – Removal – Redaction – Masking – Tokenization Next : Enforce AuthZ through Encryption
  • 25. Column-level Access Control • Why? – When only some columns in a table are sensitive and need special access control – Finer grained access control based on level of sensitivity Column C1 C2 C3 …. C15 … C34 … C100 Sensitivity Level 0 8 0 0 6 0 9 0 0 • Challenges – Enforce on common access paths – HDFS doesn’t understand column
  • 26. Approach • Enforce at the data format (Parquet) level • Encrypt only sensitive columns in HDFS • Access controlled through encryption key management • Different columns can have different keys for encryption/decryption • Open source activities: PARQUET-1178 and PARQUET-1325
  • 27. Key Management Parquet lib Reader (Hive, Spark, etc) C1: plain text C2E: encrypted C1: plain text C2: plain data(has access) error/masked (no access) Writer (Hive, Spark, etc) Parquet lib HDFS C1: plain text (non-sensitive) C2: plain text (sensitive) C1: plain text C2E: encrypted Column-level Access Control
  • 29. Take Away 1. Security scope within Hadoop is expanding • Conventional thinking is being challenged • Need significant changes in the architecture 2. Finer-grained security is a must for big data • Column/Partition/Row level access control 3. Security by design is critical • Retrofitting is very hard
  • 30. Q&A
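
The policy-to-ACL propagation on slides 18 to 20 can be sketched roughly as follows. This is a minimal illustration, not Uber's Policy Propagator: the policy tuples, the 30-day value standing in for "X days", the privilege-to-permission mapping, and the base `user::`/`group::`/`other::` entries are all assumptions made for the example. It only builds `hdfs dfs -setfacl` command lines and does not run them.

```python
from datetime import date

# Hypothetical policy tuples: (table, privilege, group, max_age_days or None).
POLICIES = [
    ("events", "read", "employee", 30),            # 30 stands in for "X days"
    ("events", "read", "authorized_group", None),  # no time restriction
]

# Assumed mapping from policy privilege to HDFS permission bits.
PERMS = {"read": "r-x", "write": "rwx"}

# Assumed base entries; `--set` needs user, group and other entries in the spec.
BASE_ENTRIES = ["user::rwx", "group::---", "other::---"]


def partition_acl_spec(table, partition_date, today, policies=POLICIES):
    """Build the ACL spec for one date partition from the matching policies."""
    age_days = (today - partition_date).days
    entries = list(BASE_ENTRIES)
    for pol_table, privilege, group, max_age_days in policies:
        if pol_table != table:
            continue
        # A group keeps access while the partition is within its time limit.
        if max_age_days is None or age_days <= max_age_days:
            entries.append(f"group:{group}:{PERMS[privilege]}")
    return ",".join(entries)


def setfacl_command(table, partition_date, today):
    """Return the command that replaces the ACL on one partition directory."""
    path = f"/{table}/date={partition_date:%Y.%m.%d}"
    spec = partition_acl_spec(table, partition_date, today)
    return ["hdfs", "dfs", "-setfacl", "-R", "--set", spec, path]


if __name__ == "__main__":
    today = date(2018, 6, 10)
    for part in (date(2018, 6, 6), date(2017, 2, 6)):
        print(" ".join(setfacl_command("events", part, today)))
```

Re-running the same function on a schedule corresponds to the periodic "Refresh ACL" step: once a partition ages past the limit, the recomputed spec no longer contains the time-restricted group, and `--set` replaces the old ACL.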

Editor's Notes

  1. This slide is the easiest one. For the sake of time, let me skip it.
  2. Before going into the details, let’s set the stage first: what we will discuss and what we will not. At Uber, Hadoop is the backbone of our data lake. In this presentation, we will discuss how we achieve security in Hadoop. In particular, we will explain how we merge open source technologies with our in-house tools and services. There are multiple ways we address data security; today we will talk about one such approach. Also, please take the presentation with a grain of salt.
  3. Source : https://techdifferences.com/difference-between-authentication-and-authorization.html We will discuss more details in later slides.
  4. Source : https://techdifferences.com/difference-between-authentication-and-authorization.html The meanings are close enough to confuse some people.
  5. https://theprivacyblog.com/anonymity/data-anonymization-is-hard-this-time-shown-with-nyc-taxi-data/ Deterrent: applies to internal admins and approved users as well, to prevent misuse.
  6. Authentication is about verifying who you are. Authorization is about checking whether you’re allowed to access the data.
  7. This diagram shows the overall architecture of how authZ works. Speaking of authZ, there are two important parties involved: admin and user. The admin is the one who decides which data can be accessed by whom; we normally call this kind of rule a policy. This diagram shows the admin flow. Basically, we built a self-serve platform that allows admins to manage the policies. Eventually the policy is stored in a policy store and, more importantly, policies are converted to extended ACLs in HDFS. Enforcement happens at HDFS, which guarantees 100% coverage of access control. No matter what type of access (whether it's Hive, Spark, or some newly adopted engine), it all comes down to HDFS.
  8. Mohammad: bring up the RBAC concept first. Tell: RBAC is an abstract layer; HDFS is the physical layer (a set of dirs); we convert from abstract to physical. Emphasize that we took/implemented #2. Emphasize that NN is a sensitive service to us, especially as policies grow. #2 has no extra latency/memory overhead because it’s offline. Now let's look at how we convert the policy to ACLs in HDFS. Again, let's bring up this diagram. As admin, you can define a policy, which is essentially a tuple of table, permission type (read or write), and the group that gets the access. Then a policy propagator propagates the policy to all HDFS files, which essentially means calling setACL on all the table's files. In order to utilize ACLs for enforcement, we had two options. The first is to utilize an HDFS plugin. The second is what we presented here: we set the ACLs directly. Eventually we chose the second option. The primary reason is that option 1 adds overhead to the NN, because it requires the NN to cache all tables' metadata and policies in memory, which can be a big deal when your tables grow. On the other hand, NN scalability is a big and ongoing challenge, so we don't want to add extra overhead to the NN. However, in practice option 2 also has challenges. The most challenging part is keeping the policy in sync between the policy store and HDFS. We have to build a lot of monitoring, and even reconciliation, to keep them in sync.
  9. Talk about examples (time, geo). So far we have only talked about table-level access control. Now let's look at partition-level access control. Usually access control happens at the table level. However, in some scenarios, access can change based on time or geography. Let's look at a concrete example.
  10. Let’s look at a realistic example of how we achieve this (time-based refresh can be shorter). Same as before, a policy has a table, a privilege, and a group. However, in this scenario, the admin can add a time restriction. This is the same policy we talked about. The employees' access has a time restriction of X days, meaning employees can only access events that happened in the past X days. The second policy indicates that the legal team doesn't have any time restriction; they can access all data. Following the same flow, the policy propagator now sets different ACLs for different partitions, depending on how old the partition is. So in this example, for partition 2018.06.06, the ACL indicates that both employee and legal have read access. However, for the older partition 2017.02.06, only the legal team has access. Also note that we have to keep refreshing the ACLs according to the policy as partitions become older.
  11. (reduce time) Flink job
  12. Source: https://simplicable.com/new/data-anonymization Don’t need to mention all techniques. Today we’ll focus only on using encryption to achieve anonymization, covered in the next slide.
  13. Collaborating with IBM and other companies through the Apache community.
  14. <shorter READ part> Concrete example: here, c1 is non-sensitive and c2 is sensitive. When the writer writes the data to HDFS (whether it's Hive, Spark, or whatever), for c1 it writes plain text to HDFS. For c2, it writes encrypted data to HDFS. Now look at the reader part. Whenever any reader reads the data, c1 will always be plain text. But for c2, on the Parquet reader side, we check whether the user has access to the sensitive column c2 or not. If so, we retrieve the encryption key and decrypt the data. Otherwise, if the user doesn't have access, we return an error or mask the data.
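
The write and read paths described in note 14 can be modeled with a small sketch. Everything here is hypothetical: the key store, the group names, the mask value, and especially the cipher, which is a toy XOR keystream used only so the example is self-contained. It is not the Parquet encryption mechanism and must not be used to protect real data.

```python
import hashlib

# Hypothetical key store: column name -> (key bytes, groups allowed to fetch the key).
KEY_STORE = {"c2": (b"demo-key-for-c2", {"authorized_group"})}


def _toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR with a SHA-256-derived keystream. Placeholder only, NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    # XOR is symmetric, so the same function encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, stream))


def write_row(row):
    """Writer side: sensitive columns are stored encrypted, others as plain text."""
    stored = {}
    for col, value in row.items():
        if col in KEY_STORE:
            key, _allowed = KEY_STORE[col]
            stored[col] = _toy_cipher(value.encode(), key)
        else:
            stored[col] = value
    return stored


def read_row(stored, user_groups, mask="****"):
    """Reader side: decrypt a sensitive column only if the user may get its key."""
    result = {}
    for col, value in stored.items():
        if col not in KEY_STORE:
            result[col] = value  # non-sensitive: always plain text
            continue
        key, allowed = KEY_STORE[col]
        if allowed & set(user_groups):
            result[col] = _toy_cipher(value, key).decode()
        else:
            result[col] = mask  # no key access: masked (could raise instead)
    return result


if __name__ == "__main__":
    stored = write_row({"c1": "trip-42", "c2": "rider@example.com"})
    print(read_row(stored, ["authorized_group"]))
    print(read_row(stored, ["employee"]))
```

The point of the design is that access control reduces to key access: whoever cannot obtain the column key sees only ciphertext, regardless of which engine reads the file.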