Companies are increasingly moving to the cloud to store and process data. In this talk, we will walk through how companies can use tag-based policies in Apache Ranger to protect access to data in on-premises environments as well as in AWS-based cloud environments. We will go into the details of how tag-based policies work and how they integrate with Apache Atlas and various services. We will also cover how companies can leverage Ranger's policies to anonymize or tokenize data while moving it into the cloud, and de-anonymize it dynamically using Apache Kafka, Apache Hive, Apache Spark, or plain old ETL using MapReduce. We will then deep-dive into Ranger's proposed integration with S3 and other cloud-native systems. We will wrap up with an end-to-end demo showing how tags and tag-based masking policies can be used to anonymize sensitive data, how tags are propagated within the system, and how sensitive data can be protected using tag-based policies.
Speakers
Don Bosco Durai, Chief Security Architect, Privacera
Madhan Neethiraj, Sr. Director of Engineering, Hortonworks
Securing data in hybrid environments using Apache Ranger
1. SECURING DATA IN HYBRID ENVIRONMENTS
USING APACHE RANGER
Madhan Neethiraj , Hortonworks
Apache Ranger PMC, Apache Atlas PMC
Don Bosco Durai, Privacera
Apache Ranger PMC
2. Disclaimer
‣ This document may contain product features and technology directions that are under
development, may be under development in the future or may ultimately not be developed.
‣ Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from
inception to release through Apache, however, technical feasibility, market demand, user feedback
and the overarching Apache Software Foundation community development process can all affect
timing and final delivery.
‣ This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in any
generally available product.
‣ Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
‣ Since this document contains an outline of general product development plans, customers should
not rely upon it when making purchasing decisions.
3. Agenda
• Hybrid Environment Use Cases
• Security Requirements for Cloud
• Implementation Challenges
• Hybrid Environment Security Implementation Flow
• Demo
• Roadmap
• Q & A
4. HYBRID ENVIRONMENT
On Premise:
- Predictable workload
- ETL and data wrangling
- BI and DW use cases
- Multiple Hadoop-enabled services and tools
Cloud:
- On-demand processing units
- Analytical and other services from Cloud provider
- Share data with 3rd-party vendors
- Micro-services
DATA is shared across both environments
5. PREFERRED SECURITY REQUIREMENTS FOR CLOUD
1. Access permissions should be consistent in all environments
2. Protect personal and sensitive data by anonymizing or tokenizing
6. IMPLEMENTATION CHALLENGES
● The permission models of Hadoop and the cloud are not the same
● Users/groups in the two systems might not be the same
● Data constantly moves within the cluster, and permissions might change along with it
7. Hybrid Environment: overview
Components: Ranger, Atlas, Discovery (Privacera, Hortonworks DSS), Hive, S3, ACL Sync
Flow: 1. Ingest raw data → 2. Scan → 3. Classify → 4. ETL → 5. Metadata, Lineage → 6. Scan → 7. Export to S3

On-premises data (clear), accessed by users such as Sally and Mark:
Name       | Street       | City   | State
John Doe   | 345 First St | Eureka | CA
Jane Smith | 876 Main St  | Newark | NJ

Data exported to S3 (anonymized):
Name   | Street     | City   | State
BXPHDE | YNiIkjoiTH | Eureka | CA
HNEQON | WNDUHNd    | Newark | NJ
8. DISCOVERY AND CLASSIFICATION
● Centrally store classifications in Apache Atlas
● Classify resources via
○ Apache Atlas UI
○ Apache Atlas API
● Auto classify using discovery tools like Privacera,
Hortonworks Data Steward Studio
● Different types of tags
○ Security (SSN, CC, EMAIL, NAME, ADDRESS, etc.)
○ Business (SALES, HR, FINANCE, MARKETING, etc.)
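The "classify via the Apache Atlas API" path can be sketched as follows. This is a minimal illustration, not the talk's implementation: the host/port, entity GUID, and tag names are placeholders, and the endpoint path follows the Atlas v2 REST API for adding classifications to an entity.

```python
# Hedged sketch: build the request that associates Atlas classifications
# (tags) with an entity via the Atlas v2 REST API. The base URL and GUID
# below are placeholders for illustration only.
import json

def build_classification_request(atlas_base_url, entity_guid, tag_names):
    """Return the (url, body) pair for POSTing classifications onto an entity."""
    url = f"{atlas_base_url}/api/atlas/v2/entity/guid/{entity_guid}/classifications"
    # Each classification is an object with at least a typeName;
    # attribute-bearing tags could also carry an "attributes" map.
    body = [{"typeName": name} for name in tag_names]
    return url, json.dumps(body)

url, body = build_classification_request(
    "http://atlas-host:21000", "a1b2c3d4", ["SSN", "SALES"])
# A client would then POST `body` to `url` with basic auth, e.g. via requests.
```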
13. DATA UPLOAD TO S3
● Use Hive Export feature
● Dynamic Anonymization based on ETL User (e.g. s3_etl)
● Classification-based access authorization policies to block copying of highly restricted data
● Row level filter
● Ranger policy to restrict access to S3 from Hive
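The dynamic anonymization above is driven by a Ranger tag-based masking policy. A minimal sketch of such a policy body, modeled on Ranger's public REST API (`POST /service/public/v2/api/policy`): the service name "tags" and the mask type are illustrative assumptions; the user s3_etl is the example ETL user from the slide.

```python
# Hedged sketch of a Ranger tag-based data-masking policy payload.
# Service name, policy name, and mask type are assumptions for illustration.

def build_tag_masking_policy(service, tag, users, mask_type="MASK_HASH"):
    """Build a policyType=1 (data masking) policy keyed on a classification tag."""
    return {
        "service": service,              # the tag-based service, e.g. "tags"
        "name": f"mask-{tag.lower()}",
        "policyType": 1,                 # 0=access, 1=masking, 2=row-level filter
        "resources": {"tag": {"values": [tag], "isExclusive": False}},
        "dataMaskPolicyItems": [{
            "dataMaskInfo": {"dataMaskType": mask_type},
            "users": users,
            "accesses": [{"type": "hive:select", "isAllowed": True}],
        }],
    }

# Mask NAME-tagged columns for the ETL user exporting data to S3.
policy = build_tag_masking_policy("tags", "NAME", ["s3_etl"])
```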
15. ACL SYNC TO S3
● S3 permission model:
  ○ Bucket Policies (ACL Sync sets bucket policies)
  ○ User Policies
  ○ Object ACLs
16. ACL SYNC TO S3
Flow: 1. Kafka notification → ACL Sync → 2. Retrieve tag ACLs from Ranger → 3. Set ACLs in S3

Example entity in the notification:
entity_type   | aws_s3_pseudo_dir
qualifiedName | s3a://dws2018-demo/sales/sales_may_2018
tags          | SALES
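Step 3 ("Set ACLs in S3") can be sketched as a bucket-policy builder: users permitted by the tag's Ranger ACLs are granted read access to the matching S3 prefix. The bucket and prefix come from the example entity above; the AWS account id and user ARN are made-up placeholders, and a real sync service would apply the document with boto3's `put_bucket_policy`.

```python
# Illustrative sketch, not the actual ACL Sync implementation: translate the
# principals allowed by a Ranger tag policy into an S3 bucket policy statement.
import json

def build_bucket_policy(bucket, prefix, allowed_user_arns):
    """Grant s3:GetObject on bucket/prefix/* to the given IAM principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "RangerTagSync" + prefix.replace("/", "-").replace("_", ""),
            "Effect": "Allow",
            "Principal": {"AWS": allowed_user_arns},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    }

doc = build_bucket_policy(
    "dws2018-demo", "sales/sales_may_2018",
    ["arn:aws:iam::111122223333:user/sally"])  # placeholder account/user
# A sync service would then apply it, e.g.:
# boto3.client("s3").put_bucket_policy(Bucket="dws2018-demo",
#                                      Policy=json.dumps(doc))
```

Note the 20 KB bucket-policy size limit mentioned later in the deck, which constrains how many per-prefix statements a single bucket policy can hold.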
18. DEMO - SALES DATA
User             | Access | View
Sally (Sales)    | Y      | Clear/Raw
Mark (Marketing) | Y      | Anonymized
Henry (HR)       | X      | X

Item Id | Amount | Email    | Name       | Street       | City   | State
435     | 439.34 | jd@y.com | John Doe   | 345 First St | Eureka | CA
894     | 592.02 | js@m.com | Jane Smith | 876 Main St  | Newark | NJ
19. DEMO FLOW
Step 1: Copy sales data into HDFS
Step 2: Create Hive table from sales data
  1. Atlas will create table metadata and lineage
  2. Privacera will scan the table and send tags to Atlas
Step 3: Export Hive table to S3
  1. Privacera will anonymize data
  2. Atlas will create lineage to S3
  3. Ranger-S3 Policy Sync will set permissions on S3 for the resource
20. TOOLS USED IN DEMO
Metadata, Lineage, Classification | Apache Atlas
Auto Discovery                    | Privacera
Tag-Based Policies                | Apache Ranger
Data Transfer                     | Apache Hive
Anonymization/Tokenization        | Privacera
Ranger to S3 Policy Sync          | Privacera
21. OTHER DATA MOVEMENT TOOLS
● Kafka Connect - Transformer Plugin
● Apache NiFi - Processor Plugin
● Apache Spark SQL - UDF
● Java API integrated with Ranger & Atlas
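The Spark SQL UDF option above can be sketched with a deterministic, keyed tokenization function registered as a UDF, so ETL jobs tokenize tagged columns in flight. This is an illustrative scheme only: the HMAC approach, key, and function name are assumptions, and a real deployment would fetch the key from a key-management service.

```python
# Hedged sketch of a tokenization UDF: deterministic keyed hashing, so the
# same input always yields the same token (joins on tokenized columns work),
# while the clear value cannot be recovered without the key.
import hashlib
import hmac

SECRET_KEY = b"demo-only-key"  # placeholder; never hard-code keys in practice

def tokenize(value: str) -> str:
    """Tokenize a value with HMAC-SHA256, truncated to a 12-char token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12].upper()

# With a SparkSession available, the function could be registered and used as:
# spark.udf.register("tokenize", tokenize)
# spark.sql("SELECT tokenize(name), city, state FROM sales")
```

Because the scheme is deterministic, re-running the ETL produces stable tokens; de-anonymization for authorized users would instead require a reversible tokenization vault rather than a hash.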
22. PENDING TASKS/OPEN ISSUES
● Support for mapping Ranger permissions to S3 resources
● Support for advanced Ranger policies, such as wildcards
● Monitor permission changes on the cloud side and take actions (e.g. disallow them)
● Mapping of Ranger group-level permissions to S3 ACLs
● AWS S3 bucket policies have a maximum size of 20 KB
23. APACHE JIRAS
https://issues.apache.org/jira/browse/RANGER-1974 - Ranger Authorizer and
Audits for AWS S3
https://issues.apache.org/jira/browse/ATLAS-2708 - AWS S3 data lake typedefs
for Atlas (BARBARA ECKMAN)
https://issues.apache.org/jira/browse/ATLAS-2760 - Atlas Hive hook updates to
create lineage between Hive table and S3 entities