1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Bridle Your Flying Islands And Castles In The Sky:
Built-in Governance And Security For The Cloud
DataWorks Summit San Jose
June 2017
Presenters
Jeff Sposetti
Senior Director of Product Management, Cloud
Hortonworks Data Cloud, Cloudbreak
Srikanth Venkat
Senior Director of Product Management, Security & Governance
Apache Ranger, Apache Atlas, Apache Knox
Agenda
• Introduction
• Security & Governance: Apache Ranger, Atlas and Knox
• Bringing It Together: Cloud and Data Lake Security
• Demo
• Wrap Up
• Q & A
Background: Ephemeral Workloads + Cloud Storage
[Diagram labels: CLOUD STORAGE (S3); WORKLOAD CLUSTERS; Durable; Ephemeral]
Introducing Hortonworks Data Cloud
• Focuses on business agility, rather than infinite configurability and cluster management
• Addresses prescriptive, ephemeral use cases around Apache Spark + Apache Hive
• Pre-tuned and configured for use with Amazon S3
Learn more: http://hortonworks.com/products/cloud/aws/
Quick demo…
CLOUD, DATA LAKE, SECURITY
Security & Governance:
Apache Ranger, Atlas and Knox
Protecting the Elephant in the Castle…
[Diagram labels: Kerberos, Wire Encryption; HDFS Encryption; Apache Ranger; Network Segmentation, Firewalls; LDAP/AD; Apache Knox]
Apache Knox
Proxying Services
★ Provide access to Hadoop via proxying of HTTP resources
★ Ecosystem APIs and UIs + Hadoop-oriented dispatching for Kerberos + doAs (impersonation), etc.
Authentication Services
★ REST API access, WebSSO flow for UIs
★ LDAP/AD, header-based PreAuth
★ Kerberos, SAML, OAuth
Client DSL/SDK Services
★ Scripting through DSL
★ Using Knox Shell classes directly as SDK
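To make the "REST API access" point concrete, here is a minimal sketch of how a client might address WebHDFS through a Knox gateway using HTTP Basic credentials that Knox checks against LDAP/AD. The host name, the `default` topology, port 8443, and the user are assumptions for illustration and do not come from the slides; the code only builds the request pieces and does not contact a server.

```python
import base64
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op="LISTSTATUS", port=8443):
    """Build a WebHDFS URL that is routed through a Knox gateway.

    Knox exposes proxied services under /gateway/<topology>/<service>.
    Host, port and topology are deployment-specific assumptions.
    """
    path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op})
    return f"https://{gateway_host}:{port}/gateway/{topology}/webhdfs/v1/{path}?{query}"


def basic_auth_header(user, password):
    """Return an HTTP Basic Authorization header for the gateway to validate."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical gateway host and credentials.
url = knox_webhdfs_url("knox.example.com", "default", "/tmp")
headers = basic_auth_header("alice", "secret")
print(url)
```

The client only ever sees the gateway endpoint; Kerberos and impersonation toward the cluster are handled behind it.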
Apache Ranger
Comprehensive and Extensible Security Model
– Centralized platform to define, administer and manage security policies across Hadoop components (HDFS, Hive, HBase, YARN, Kafka, Solr, Storm, Knox, NiFi, Atlas)
– Extensible architecture with the ability to add custom policy conditions and user context enrichers
Centralized Auditing
– Central audit location for all access requests
– Supports multiple audit destinations (HDFS, Solr, etc.)
– Real-time visual query interface
Fine-Grained Authorization
– Data access control by Database, Table, Column, LDAP Groups & Specific Users
Advanced Security
• Dynamic Security Policies: Tag (Atlas), Prohibition, Time, and Location
• Dynamic Column Masking & Row Filtering
[Diagram labels: OPERATIONS, SECURITY, GOVERNANCE, STORAGE; workloads: Machine Learning, Batch, Streaming, Interactive, Search]
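As an illustration of fine-grained authorization at the database, table, and column level, the sketch below builds a policy document in the general shape of Ranger's policy JSON (the kind of body submitted to Ranger's public REST API). The service name, database, table, columns, and group are hypothetical, and the field layout should be checked against the Ranger version in use; nothing here is taken from the slides.

```python
import json


def hive_select_policy(service, database, table, columns, groups):
    """Build a Ranger-style policy granting SELECT on specific Hive columns.

    The field layout mirrors Ranger's policy JSON model; treat it as a
    sketch and verify it against the Ranger version you run.
    """
    return {
        "service": service,
        "name": f"{database}.{table} select for {'+'.join(groups)}",
        "isEnabled": True,
        "isAuditEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "groups": list(groups),
                "users": [],
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


policy = hive_select_policy(
    service="datalake_hive",         # hypothetical Ranger service name
    database="sales",                # hypothetical database
    table="orders",                  # hypothetical table
    columns=["order_id", "amount"],  # only these columns are readable
    groups=["analysts"],             # hypothetical LDAP/AD group
)
print(json.dumps(policy, indent=2))
```

Because the grant names a group rather than individual users, membership changes in LDAP/AD take effect without editing the policy.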
© Hortonworks Inc. and Dataguise Inc. 2011 – 2017. All Rights Reserved
Atlas: Metadata & Governance
[Diagram labels: ATLAS METADATA; STRUCTURED (TRADITIONAL RDBMS, MPP APPLIANCES) METADATA; Kafka; Storm; Sqoop; Hive; Falcon; RANGER; Custom; Partners]
Metadata-driven governance services for Hadoop and enterprise big data ecosystems
Data Lineage/Provenance
• Along the entire data lifecycle, with integrated cross-component lineage
Data Classification
• Supports classification of data assets using tags (e.g. PII, PHI, PCI) and attributes
Metadata Catalog Search
• Free text search on metadata
• Advanced search using DSL
Integrations across the Hadoop ecosystem, through a common metadata store
• Out-of-the-box real-time metadata and lineage ingestion with Hive, Sqoop, Storm/Kafka
• APIs for custom metadata ingestion
• Apache Ranger integration for classification-based security
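The "advanced search using DSL" capability can be sketched as a call to Atlas's v2 DSL search endpoint. The example below only builds the request URL; the Atlas host, the port, and the query text (Hive tables carrying a `PII` classification) are assumptions for illustration, and the exact DSL grammar should be confirmed against the Atlas release in use.

```python
from urllib.parse import urlencode


def atlas_dsl_search_url(base_url, dsl_query, limit=25):
    """Build a GET URL for the Atlas v2 DSL search endpoint."""
    params = urlencode({"query": dsl_query, "limit": limit})
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/dsl?{params}"


# Hypothetical query: Hive tables that carry a "PII" classification.
url = atlas_dsl_search_url("http://atlas.example.com:21000", "hive_table isa PII")
print(url)
```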
Key Benefits:
Modern Data Lakes need new ways to govern because:
• Cost – Traditional staff ratio to data size not possible
• Diversity – Only way to manage velocity of new datasets
• Agility – Quick change based on tags / taxonomy
HDP – Security & Governance
Industry First: Dynamic Tag-based Security Policies
[Diagram labels]
Ranger: Manage Access Policies and Audit Logs. Policies (Classification, Prohibition, Time, Location); PDP; Resource Cache.
Atlas: Track Metadata and Lineage. Metastore holding Tags, Assets, Entities.
Atlas Client: subscribes to topic, gets metadata updates.
Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables.
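The idea behind tag-based policies is that the decision depends on the classification attached to a resource rather than on the resource's name. The following is a deliberately simplified sketch of such a decision point with an optional time condition. It is not Ranger's implementation; the class, function, tag, group, and resource names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TagPolicy:
    tag: str                                # classification, e.g. "PII"
    allowed_groups: FrozenSet[str]          # groups permitted to read tagged data
    valid_until: Optional[datetime] = None  # optional time-based condition


def is_allowed(resource, user_groups, resource_tags, policies, now):
    """Decide access from the tags on a resource rather than its name.

    Every tag on the resource must be covered by an unexpired policy that
    names one of the user's groups. Untagged resources are denied in this
    sketch; a real deployment would fall back to resource-based policies.
    """
    tags = resource_tags.get(resource, set())
    if not tags:
        return False
    for tag in tags:
        granted = any(
            p.tag == tag
            and bool(p.allowed_groups & user_groups)
            and (p.valid_until is None or now <= p.valid_until)
            for p in policies
        )
        if not granted:
            return False
    return True


# Tag cache as it might be populated from catalog metadata updates.
resource_tags = {"hive:sales.customers.ssn": {"PII"}}
policies = [TagPolicy("PII", frozenset({"compliance"}), datetime(2017, 12, 31))]
now = datetime(2017, 6, 14)

print(is_allowed("hive:sales.customers.ssn", {"compliance"}, resource_tags, policies, now))
print(is_allowed("hive:sales.customers.ssn", {"analysts"}, resource_tags, policies, now))
```

Tagging a new column as `PII` in the catalog is then enough to bring it under the existing policy, with no new policy to write.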
Bringing It Together:
Cloud and Data Lake Services for Enterprise Ephemeral Workloads
Ephemeral Workloads: Basic -> Advanced -> Enterprise
                  Basic Ephemeral   Advanced Ephemeral        Enterprise Ephemeral
Tuned and Optimized Infrastructure
Simplified, Automated Operations
S3 Integration
Protected Network Access
Schema            -                 Shared (Hive Metastore)   Shared (Hive Metastore)
Authentication    Single-user       Single-user               Multi-User (LDAP/AD)
Authorization     -                 -                         Security Policies (Ranger)
Audit             -                 -                         Audit (Ranger)
Ephemeral Workloads for the Cloud
Secure access to workload clusters via a Protected Gateway enabled for AuthN and HTTPS.
Define your data schema, security policies, and metadata catalog once for your ephemeral and always-on workloads.
[Diagram labels: CLOUD STORAGE (S3); WORKLOADS; Durable; Ephemeral; Long Running; SHARED DATA LAKE SERVICES: Metastore (SCHEMA), Atlas (CATALOG), Ranger (POLICY)]
Key Components for an Enterprise Ephemeral Deployment
SCHEMA
WHAT: Provides Hive schema (tables, views, etc.).
WHY: If you have 2+ workloads accessing the same data, you need to share schema across those workloads.
HOW: Externalize the Hive Metastore for schema definition.
POLICY
WHAT: Defines security policies around Hive schema.
WHY: If you have 2+ users accessing the same data, you need policies to be consistently available and enforced.
HOW: Externalize and share Ranger across workloads, and store policies externally.
AUDIT
WHAT: Audit user access.
WHY: Capture data access activity.
HOW: Externalize and share Ranger across workloads; leverage cloud storage for audit data.
GATEWAY
WHAT: Provide a single endpoint that can be protected with SSL and enabled for authentication to access cluster resources.
WHY: Avoid opening many ports, some potentially without authentication or SSL protection.
HOW: Deploy a centralized protected gateway automatically.
DIRECTORY
WHAT: Users and groups.
WHY: Provide an authentication source for users and an authorization source for groups.
HOW: Leverage external LDAP or Active Directory.
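One way to picture "externalizing" schema and audit is as a handful of configuration properties that point each cluster at stores living outside it. The sketch below assumes a PostgreSQL database on RDS and an S3 bucket for audit data; the host, database, and bucket names are hypothetical, and the property names are the standard Hive Metastore JDBC settings and Ranger's HDFS audit destination settings as I understand them, so verify them against your distribution before relying on them.

```python
def shared_data_lake_properties(rds_host, metastore_db, audit_bucket):
    """Return properties that point a cluster at external schema and audit stores.

    Assumes a PostgreSQL RDS instance; host, database and bucket names are
    placeholders supplied by the caller.
    """
    return {
        # Hive Metastore backing database lives in RDS, not on the cluster.
        "javax.jdo.option.ConnectionURL": f"jdbc:postgresql://{rds_host}:5432/{metastore_db}",
        "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
        # Ranger audit records are written to cloud storage.
        "xasecure.audit.destination.hdfs": "true",
        "xasecure.audit.destination.hdfs.dir": f"s3a://{audit_bucket}/ranger/audit",
    }


props = shared_data_lake_properties(
    rds_host="metastore.example.rds.amazonaws.com",  # hypothetical RDS endpoint
    metastore_db="hive",                             # hypothetical database name
    audit_bucket="example-datalake-audit",           # hypothetical bucket
)
for key, value in sorted(props.items()):
    print(f"{key}={value}")
```

Since nothing in these settings refers to cluster-local state, a cluster can be terminated and recreated without losing schema or audit history.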
Architecture of an Enterprise Ephemeral Deployment
Access your cluster components through the protected gateway via SSL on port 443 open on the controller security group.
[Diagram labels: USER ACCESS; CONTROLLER with PROTECTED GATEWAY; HIVE LLAP / SPARK WORKLOADS (Zeppelin, Hive, Spark); SHARED DATA LAKE SERVICES: Ranger, Hive Metastore, POLICY (RDS), AUDIT (S3), SCHEMA (RDS), DIRECTORY (LDAP/AD)]
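The "only port 443 is open" rule lends itself to an automated check. The sketch below inspects ingress rules shaped like the `IpPermissions` entries that the EC2 API returns for a security group and reports whether the gateway port is the only one reachable from anywhere. The sample rules are made up, and the check considers IPv4 ranges only.

```python
def exposed_ports(ingress_rules):
    """Return the set of ports reachable from anywhere (0.0.0.0/0)."""
    ports = set()
    for rule in ingress_rules:
        open_to_world = any(
            r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])
        )
        if not open_to_world:
            continue
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            ports.update(range(0, 65536))
        elif protocol == "tcp":
            ports.update(range(rule["FromPort"], rule["ToPort"] + 1))
    return ports


def only_gateway_exposed(ingress_rules, gateway_port=443):
    """True when the gateway port is the only publicly reachable port."""
    return exposed_ports(ingress_rules) == {gateway_port}


# Hypothetical controller security group: HTTPS is public, SSH is internal only.
controller_rules = [
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
]
print(only_gateway_exposed(controller_rules))
```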
Hortonworks Data Cloud + Shared Data Lake Services
1. Register an Authentication Source (i.e. LDAP/AD).
2. Create a “Shared Data Lake”, specify S3 Bucket & RDS.
3. When you create a cluster, “attach” to the Shared Data Lake Services:
• for Multi-User AuthN (LDAP/AD)
• for AuthZ + Audit (Ranger)
• for Schema (Hive Metastore)
PREREQUISITES
• LDAP/AD
• S3 Bucket
• Amazon RDS Instance
Longer demo…
Guidelines for Shared Data Lake Services
• Stay ephemeral. Keep all of your data in S3 and all of your metadata in RDS; do not create tables or files in the local HDFS.
• The Hive warehouse is set up on S3 for data lakes. Create tables in this location instead of in individual S3 buckets; it makes them easier to manage.
• Use Hive “external tables” for tables that are outside this warehouse, typically when the data is ingested through some path outside of Hadoop.
• Create S3 bucket policies that exactly match usage so that you can spin up clusters with the least privilege.
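As a sketch of the least-privilege guideline, the function below builds an AWS policy document that limits a cluster to one prefix of one bucket, read-only unless write access is requested. It is written in the form of a policy attached to a cluster's role (no `Principal` element), which is an assumption about how the policy would be applied; the bucket and prefix names are hypothetical.

```python
import json


def least_privilege_s3_policy(bucket, prefix, writable=False):
    """Build a policy document limited to a single bucket prefix."""
    prefix = prefix.strip("/")
    object_actions = ["s3:GetObject"]
    if writable:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only under the chosen prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object access is limited to keys under the prefix.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and warehouse prefix for a read-only reporting cluster.
read_only = least_privilege_s3_policy("example-datalake", "warehouse/sales")
print(json.dumps(read_only, indent=2))
```

A cluster that only runs reports gets the read-only form, while an ingest cluster would pass `writable=True` for the prefix it owns.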
Wrap Up
Bringing Big Data + Cloud + Security Together: The Takeaways
• Cloud driving more ephemeral data processing use cases
• Ephemeral workloads leverage cloud storage
• This pattern is driving an architectural approach for “shared data lake services”
• To provide shared security services for Authentication, Authorization and Audit
THINK WORKLOADS: Deploy “what you need”… Right tool, right job!
THINK SHARED: Shared “security policies”… End-to-end security for workloads!**
THINK EPHEMERAL: Run “when you need”… Workloads will come and go!
** Building blocks are Apache Ranger, Apache Atlas and Apache Knox
Learn More
• Try Hortonworks Data Cloud 2.0 Technical Preview
– http://hortonworks.github.io/hdp-aws/
• BREAKOUT SESSIONS
– Wednesday, June 14 @ 5:50p, Don’t Let Spark Burn Your House: Perspectives on Securing Spark
• CRASH COURSE
– Thursday, June 15 @ 3:00p – 6:00p, Apache Spark and Apache Hive processing on the Cloud
• BIRDS OF A FEATHER
– Thursday, June 15 @ 5:00p, Security and Governance
– Thursday, June 15 @ 5:00p, Cloud and Operations
Thank You
https://hortonworks.com/products/cloud/aws/
https://hortonworks.com/apache/ranger/
https://hortonworks.com/apache/atlas/
