Companies are increasingly moving to the cloud to store and process data. In this talk, we will walk through how companies can use tag-based policies in Apache Ranger to protect access to data in on-premises environments as well as in AWS-based cloud environments. We will go into the details of how tag-based policies work and how they integrate with Apache Atlas and various services. We will also cover how companies can leverage Ranger's policies to anonymize or tokenize data while moving it into the cloud, and de-anonymize it dynamically using Apache Kafka, Apache Hive, Apache Spark, or plain old ETL using MapReduce. We will then deep-dive into Ranger's proposed integration with S3 and other cloud-native systems. We will wrap up with an end-to-end demo showing how tags and tag-based masking policies can be used to anonymize sensitive data, how tags are propagated within the system, and how sensitive data can be protected using tag-based policies.
Speakers
Don Bosco Durai, Chief Security Architect, Privacera
Madhan Neethiraj, Sr. Director of Engineering, Hortonworks
Securing data in hybrid environments using Apache Ranger
1. SECURING DATA IN HYBRID ENVIRONMENTS
USING APACHE RANGER
Madhan Neethiraj , Hortonworks
Apache Ranger PMC, Apache Atlas PMC
Don Bosco Durai, Privacera
Apache Ranger PMC
2. Disclaimer
‣ This document may contain product features and technology directions that are under
development, may be under development in the future or may ultimately not be developed.
‣ Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from
inception to release through Apache, however, technical feasibility, market demand, user feedback
and the overarching Apache Software Foundation community development process can all affect
timing and final delivery.
‣ This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in any
generally available product.
‣ Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
‣ Since this document contains an outline of general product development plans, customers should
not rely upon it when making purchasing decisions.
3. Agenda
• Hybrid Environment Use Cases
• Security Requirements for Cloud
• Implementation Challenges
• Hybrid Environment Security Implementation Flow
• Demo
• Roadmap
• Q & A
4. HYBRID ENVIRONMENT
On Premise:
- Predictable workload
- ETL and data wrangling
- BI and DW use cases
- Multiple Hadoop-enabled services and tools
Cloud:
- On-demand processing units
- Analytical and other services from Cloud provider
- Share data with 3rd-party vendors
- Micro-services
DATA is shared across both environments
5. PREFERRED SECURITY REQUIREMENTS FOR CLOUD
1. Access permissions should be consistent in all environments
2. Protect personal and sensitive data by anonymizing or tokenizing
6. IMPLEMENTATION CHALLENGES
● The permission models of Hadoop and the cloud are not the same
● Users/groups in the two systems might not be the same
● Data constantly moves within the cluster, and permissions might change along with it
7. Hybrid Environment: overview
Components: Ranger, Atlas, Discovery (Privacera, Hortonworks DSS), Hive, S3, ACL Sync
Flow: 1. Ingest raw data → 2. Scan → 3. Classify → 4. ETL → 5. Metadata, Lineage → 6. Scan → 7. Export to S3

On-premises data (clear), accessed by users such as Sally and Mark:
Name       | Street       | City   | State
John Doe   | 345 First St | Eureka | CA
Jane Smith | 876 Main St  | Newark | NJ

Data exported to S3 (anonymized):
Name   | Street     | City   | State
BXPHDE | YNiIkjoiTH | Eureka | CA
HNEQON | WNDUHNd    | Newark | NJ
8. DISCOVERY AND CLASSIFICATION
● Centrally store classifications in Apache Atlas
● Classify resources via
○ Apache Atlas UI
○ Apache Atlas API
● Auto classify using discovery tools like Privacera,
Hortonworks Data Steward Studio
● Different types of tags
○ Security (SSN, CC, EMAIL, NAME, ADDRESS, etc.)
○ Business (SALES, HR, FINANCE, MARKETING, etc.)
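The "classify via the Apache Atlas API" path can be sketched as follows. This is a minimal illustration, not the talk's implementation: the host/port, entity GUID, and tag names are placeholders, and the endpoint path follows the Atlas v2 REST API for adding classifications to an entity.

```python
# Hedged sketch: build the request that associates Atlas classifications
# (tags) with an entity via the Atlas v2 REST API. The base URL and GUID
# below are placeholders for illustration only.
import json

def build_classification_request(atlas_base_url, entity_guid, tag_names):
    """Return the (url, body) pair for POSTing classifications onto an entity."""
    url = f"{atlas_base_url}/api/atlas/v2/entity/guid/{entity_guid}/classifications"
    # Each classification is an object with at least a typeName;
    # attribute-bearing tags could also carry an "attributes" map.
    body = [{"typeName": name} for name in tag_names]
    return url, json.dumps(body)

url, body = build_classification_request(
    "http://atlas-host:21000", "a1b2c3d4", ["SSN", "SALES"])
# A client would then POST `body` to `url` with basic auth, e.g. via requests.
```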
13. DATA UPLOAD TO S3
● Use Hive Export feature
● Dynamic Anonymization based on ETL User (e.g. s3_etl)
● Classification-based access authorization policies to block copying of highly restricted data
● Row level filter
● Ranger policy to restrict access to S3 from Hive
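The dynamic anonymization above is driven by a Ranger tag-based masking policy. A minimal sketch of such a policy body, modeled on Ranger's public REST API (`POST /service/public/v2/api/policy`): the service name "tags" and the mask type are illustrative assumptions; the user s3_etl is the example ETL user from the slide.

```python
# Hedged sketch of a Ranger tag-based data-masking policy payload.
# Service name, policy name, and mask type are assumptions for illustration.

def build_tag_masking_policy(service, tag, users, mask_type="MASK_HASH"):
    """Build a policyType=1 (data masking) policy keyed on a classification tag."""
    return {
        "service": service,              # the tag-based service, e.g. "tags"
        "name": f"mask-{tag.lower()}",
        "policyType": 1,                 # 0=access, 1=masking, 2=row-level filter
        "resources": {"tag": {"values": [tag], "isExclusive": False}},
        "dataMaskPolicyItems": [{
            "dataMaskInfo": {"dataMaskType": mask_type},
            "users": users,
            "accesses": [{"type": "hive:select", "isAllowed": True}],
        }],
    }

# Mask NAME-tagged columns for the ETL user exporting data to S3.
policy = build_tag_masking_policy("tags", "NAME", ["s3_etl"])
```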
15. ACL SYNC TO S3
● S3 permission model:
  ○ Bucket Policies (ACL Sync sets bucket policies)
  ○ User Policies
  ○ Object ACLs
16. ACL SYNC TO S3
Flow: 1. Kafka notification → ACL Sync → 2. Retrieve tag ACLs from Ranger → 3. Set ACLs in S3

Example entity in the notification:
entity_type   | aws_s3_pseudo_dir
qualifiedName | s3a://dws2018-demo/sales/sales_may_2018
tags          | SALES
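Step 3 ("Set ACLs in S3") can be sketched as a bucket-policy builder: users permitted by the tag's Ranger ACLs are granted read access to the matching S3 prefix. The bucket and prefix come from the example entity above; the AWS account id and user ARN are made-up placeholders, and a real sync service would apply the document with boto3's `put_bucket_policy`.

```python
# Illustrative sketch, not the actual ACL Sync implementation: translate the
# principals allowed by a Ranger tag policy into an S3 bucket policy statement.
import json

def build_bucket_policy(bucket, prefix, allowed_user_arns):
    """Grant s3:GetObject on bucket/prefix/* to the given IAM principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "RangerTagSync" + prefix.replace("/", "-").replace("_", ""),
            "Effect": "Allow",
            "Principal": {"AWS": allowed_user_arns},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    }

doc = build_bucket_policy(
    "dws2018-demo", "sales/sales_may_2018",
    ["arn:aws:iam::111122223333:user/sally"])  # placeholder account/user
# A sync service would then apply it, e.g.:
# boto3.client("s3").put_bucket_policy(Bucket="dws2018-demo",
#                                      Policy=json.dumps(doc))
```

Note the 20 KB bucket-policy size limit mentioned later in the deck, which constrains how many per-prefix statements a single bucket policy can hold.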
18. DEMO - SALES DATA
User             | Access | View
Sally (Sales)    | Y      | Clear/Raw
Mark (Marketing) | Y      | Anonymized
Henry (HR)       | X      | X

Item Id | Amount | Email    | Name       | Street       | City   | State
435     | 439.34 | jd@y.com | John Doe   | 345 First St | Eureka | CA
894     | 592.02 | js@m.com | Jane Smith | 876 Main St  | Newark | NJ
19. DEMO FLOW
Step 1: Copy sales data into HDFS
Step 2: Create Hive table from sales data
  1. Atlas will create table metadata and lineage
  2. Privacera will scan the table and send tags to Atlas
Step 3: Export Hive table to S3
  1. Privacera will anonymize data
  2. Atlas will create lineage to S3
  3. Ranger-S3 Policy Sync will set permissions on S3 for the resource
20. TOOLS USED IN DEMO
Metadata, Lineage, Classification | Apache Atlas
Auto Discovery                    | Privacera
Tag-Based Policies                | Apache Ranger
Data Transfer                     | Apache Hive
Anonymization/Tokenization        | Privacera
Ranger to S3 Policy Sync          | Privacera
21. OTHER DATA MOVEMENT TOOLS
● Kafka Connect - Transformer Plugin
● Apache NiFi - Processor Plugin
● Apache Spark SQL - UDF
● Java API integrated with Ranger & Atlas
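The Spark SQL UDF option above can be sketched with a deterministic, keyed tokenization function registered as a UDF, so ETL jobs tokenize tagged columns in flight. This is an illustrative scheme only: the HMAC approach, key, and function name are assumptions, and a real deployment would fetch the key from a key-management service.

```python
# Hedged sketch of a tokenization UDF: deterministic keyed hashing, so the
# same input always yields the same token (joins on tokenized columns work),
# while the clear value cannot be recovered without the key.
import hashlib
import hmac

SECRET_KEY = b"demo-only-key"  # placeholder; never hard-code keys in practice

def tokenize(value: str) -> str:
    """Tokenize a value with HMAC-SHA256, truncated to a 12-char token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12].upper()

# With a SparkSession available, the function could be registered and used as:
# spark.udf.register("tokenize", tokenize)
# spark.sql("SELECT tokenize(name), city, state FROM sales")
```

Because the scheme is deterministic, re-running the ETL produces stable tokens; de-anonymization for authorized users would instead require a reversible tokenization vault rather than a hash.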
22. PENDING TASKS/OPEN ISSUES
● Support for mapping Ranger permissions to S3 resources
● Support for advanced Ranger policies, such as wildcards
● Monitor permission changes on the cloud side and take actions (e.g. disallow them)
● Mapping of Ranger group-level permissions to S3 ACLs
● AWS S3 bucket policies have a maximum size of 20 KB
23. APACHE JIRAS
https://issues.apache.org/jira/browse/RANGER-1974 - Ranger Authorizer and
Audits for AWS S3
https://issues.apache.org/jira/browse/ATLAS-2708 - AWS S3 data lake typedefs
for Atlas (BARBARA ECKMAN)
https://issues.apache.org/jira/browse/ATLAS-2760 - Atlas Hive hook updates to
create lineage between Hive table and S3 entities