1 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Running Enterprise
Workloads in the Cloud
DataWorks Summit - San Jose
June 2018
2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Presenters
Jeff Sposetti
Product Manager @ Hortonworks
Attila Kanto
Principal Engineer @ Hortonworks
3 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Agenda
 Introduction
 Cloudbreak
 Demo #1: Flyover
 Advanced Topics
 Demo #2: Deeper Dive
 Lessons Learned in the Cloud
 Wrap Up
4 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Why Big Data Workloads in the Cloud?
No Upfront HW Costs
Unlimited Elastic Scale
Ephemeral & Long-Running
IT & Business Agility
5 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Cloudbreak: Harness the agility of cloud with ease
Cloudbreak
• Declarative workload provisioning
across cloud providers
• Flexible topologies and security
configuration options
• DevOps friendly, easy setup and
simple to automate
• Built-in elasticity and auto-scaling
• Prescriptive integration with cloud
services
AWS: Ambari, HDP + HDF
Azure: Ambari, HDP + HDF
6 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Deploy on Public or
Private Clouds
Dynamically configure and
manage clusters on public or
private clouds (Amazon Web
Services, Microsoft Azure,
Google Cloud Platform and
OpenStack)
Automated Scaling
Seamlessly manage elasticity
requirements as cluster
workloads change
Secured Cluster Access
Supports configuration
defining network boundaries,
configuring security groups,
gateway perimeter security
and enabling Kerberos
7 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Cloudbreak Building Blocks
• Cloud Credentials
• Ambari Blueprints
• Auto Scaling
• Custom Recipes
• Custom Images
• Network
• Gateway
• Kerberos Security
• Dynamic Blueprints
• Cloud Storage
Simple and Flexible Prescriptive Secure
8 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Demo #1
Flyover
9 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Advanced Topics
10 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Cloudbreak Building Blocks: Advanced Topics
• Cloud Credentials
• Ambari Blueprints
• Auto Scaling
• Custom Recipes
• Custom Images
• Network
• Gateway
• Kerberos Security
• Dynamic Blueprints
• Cloud Storage
Simple and Flexible Prescriptive Secure
Bringing it all together: Data Lake Shared Services
11 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Custom Images
12 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Background: Cloudbreak
1. Cloudbreak creates VM instances using a default base image.
2. Cloudbreak installs Ambari on a VM instance.
3. Cloudbreak instructs Ambari to install a cluster on the remaining VM instances.
[Diagram: Cloudbreak provisions a set of Node VMs that together form the Cluster]
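To make step 3 concrete, here is a minimal sketch of the kind of Ambari Blueprint that drives the cluster install; the host group names, cardinalities, and component lists are illustrative, not Cloudbreak defaults:
{
  "Blueprints": {
    "blueprint_name": "example-cluster",
    "stack_name": "HDP",
    "stack_version": "2.6"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [ { "name": "NAMENODE" }, { "name": "RESOURCEMANAGER" } ]
    },
    {
      "name": "worker",
      "cardinality": "3",
      "components": [ { "name": "DATANODE" }, { "name": "NODEMANAGER" } ]
    }
  ]
}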
13 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Custom Images Overview
1. Create the Custom Image
2. Register the Custom Image
3. Use the Custom Image when Creating a Cluster
14 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Kerberos Security
15 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Background: Kerberos
 Strongly authenticating and establishing a user’s identity is the basis for secure access in
Hadoop. Users need to be able to reliably “identify” themselves and then have that
identity propagated throughout the Hadoop cluster.
 Once this is done, those users can access resources (such as files or directories) or
interact with the cluster (like running MapReduce jobs).
 Besides users, Hadoop cluster resources themselves (such as Hosts and Services) need
to authenticate with each other to avoid potentially malicious systems or daemons
“posing as” trusted components of the cluster to gain access to data.
16 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Background: Hadoop + Kerberos
[Diagram: a Hadoop Cluster of Service Components (A, B, C, D, X, …), each with its own keytab, alongside the KDC]
Kerberos is used to secure the Components in the cluster. Kerberos identities are
managed via “keytabs” on the Component hosts. Principals for the cluster are
managed in the KDC.
17 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Background: Ambari Kerberos Support
 Ambari provides automated options for working w/ existing MIT KDC or Active Directory
 Can be highly customized to fit many enterprise requirements
– Templating for customizable principals
– Control of Kerberos Client install and krb5.conf configuration
– Highly-configurable service principal identity naming
 These options are available via Ambari UI as well as via Ambari Blueprints
– Blueprints can include “Kerberos Descriptor” for kerberos-env and krb5-conf
https://cwiki.apache.org/confluence/display/AMBARI/Automated+Kerberizaton
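As a hedged illustration of that descriptor support, a Blueprint fragment could carry kerberos-env and krb5-conf configuration roughly like the following; kerberos-env and krb5-conf are standard Ambari config types, but the realm, KDC hosts, and the exact set of required fields shown here are placeholders that depend on your environment and Ambari version:
"Blueprints": {
  "blueprint_name": "kerberized-cluster",
  "stack_name": "HDP",
  "stack_version": "2.6",
  "security": { "type": "KERBEROS" }
},
"configurations": [
  { "kerberos-env": {
      "properties": {
        "realm": "EXAMPLE.COM",
        "kdc_type": "mit-kdc",
        "kdc_hosts": "kdc.example.com",
        "admin_server_host": "kdc.example.com"
      }
  } },
  { "krb5-conf": {
      "properties": {
        "domains": ".example.com",
        "manage_krb5_conf": "true"
      }
  } }
]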
18 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Cloudbreak: Support for Enabling Kerberos
Goal
Provide a way for Cloudbreak users to create clusters that
are Kerberos-enabled
Approach
Ambari exposes a lot of Kerberos options
Leverage Ambari Kerberos options and avoid re-creating the
Ambari Kerberos experience
Be pragmatic about the prescriptive options layered on top
19 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Cloudbreak: Enable Kerberos Security
 Create Cluster > Security > Advanced
 [ ] Enable Kerberos Security
20 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Options: Use Existing KDC or Use Test KDC
Use Test KDC
- Not for production use. For testing and evaluation purposes only.
- Installs and configures an MIT KDC on the master node.
- Configures the cluster to leverage that KDC.
Use Existing KDC (Basic)
- Provide basic information about your existing KDC.
- Ambari Kerberos descriptors are generated automatically.
Use Existing KDC (Advanced)
- Provide basic information about your existing KDC.
- Provide your own Ambari Kerberos descriptors.
21 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Dynamic Blueprints
22 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Dynamic Blueprints: RDBMS and LDAP/AD
 Background:
– Cluster configuration often includes external databases (for Hive, Ranger, etc.) and LDAP/AD configs
– Users often have to create 1+ versions of the same Blueprint to handle different component
configurations for these external systems
– It’s a challenge to know the different Blueprint configuration choices per service across the stack
 Dynamic Blueprints:
– Ability to manage External Sources (e.g. RDBMS and LDAP/AD) outside of your Blueprint
– Cloudbreak will inject the configurations into your Blueprint
– Simplifies reuse of cluster configurations -> for external sources (RDBMS and LDAP/AD)
– Simplifies your Blueprints -> don’t have to know all the configurations for each component
23 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Dynamic Blueprints: RDBMS
Create an External Source -> Select it during Create Cluster
24 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Dynamic Blueprints: RDBMS
 External Sources > Database Configurations
 Built-In Types: Ambari, Druid, Hive, Oozie, Ranger, Superset
 Ability to set “other” type (for variable replacement)
Decision flow: Are JDBC properties already in the Blueprint for the Component?
- Yes: use the Blueprint as-is; no Component configuration property injection.
- No: inject the Component configuration properties.
PROPERTY VARIABLES
rds.[type].connectionString
rds.[type].connectionDriver
rds.[type].connectionUserName
rds.[type].connectionPassword
rds.[type].databaseName
rds.[type].host
rds.[type].hostWithPort
rds.[type].databaseType
where
type=[ambari,druid,hive,oozie,ranger,superset]**
** the “other” type=[other-name]
Then: perform property variable replacement.
25 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Dynamic Blueprints: RDBMS
 PostgreSQL, MySQL or Oracle
26 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Example #1: Injecting type=“hive” configuration properties
Property Variable -> Example Value
rds.hive.connectionString -> jdbc:postgresql://hive.test.eu-west-1:5432/hive
rds.hive.connectionDriver -> org.postgresql.Driver
rds.hive.connectionUserName -> mydatabaseuser
rds.hive.connectionPassword -> Hadoop123!
rds.hive.fancyName -> PostgreSQL, MySQL / MariaDB, Oracle
rds.hive.databaseType -> postgres, mysql, oracle
"hive-site": {
"properties": {
"javax.jdo.option.ConnectionURL": "{{{ rds.hive.connectionString }}}",
"javax.jdo.option.ConnectionDriverName": "{{{ rds.hive.connectionDriver }}}",
"javax.jdo.option.ConnectionUserName": "{{{ rds.hive.connectionUserName }}}",
"javax.jdo.option.ConnectionPassword": "{{{ rds.hive.connectionPassword }}}"
}
},
"hive-env" : {
"properties" : {
"hive_database" : "Existing {{{ rds.hive.fancyName}}} Database",
"hive_database_type" : "{{{ rds.hive.databaseType }}}"
}
}
rds.hive.connectionString, rds.hive.connectionUserName, rds.hive.connectionPassword ([type] = hive)
In this scenario, PROPERTIES WILL BE INJECTED INTO THE BLUEPRINT.
27 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Example #2: Setting type=“other” property variables
Property Variable -> Example Value
rds.test.connectionString -> db.test.eu-west-1:5432/sometest
rds.test.connectionDriver -> org.postgresql.Driver
rds.test.connectionUserName -> mydatabaseuser
rds.test.connectionPassword -> Hadoop123!
rds.test.subprotocol -> postgres
rds.test.databaseEngine -> POSTGRES
rds.test.connectionString, rds.test.connectionUserName, rds.test.connectionPassword ([type] = test)
In this scenario, PROPERTY VARIABLES WILL BE REPLACED IN THE BLUEPRINT (not injected).
• You must include the property variables in your Blueprint
for replacement. Use Mustache template syntax. For
example:
"test-site": {
"properties": {
"javax.jdo.option.ConnectionURL":"{{rds.test.connectionString}}"
}
• Cloudbreak will perform property variable replacement in
your Blueprint.
28 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Dynamic Blueprints: LDAP/AD
 External Sources > Authentication Configurations
 Built-In Components:
– Atlas, Hadoop, Hive LLAP, Ranger Admin, Ranger UserSync
Decision flow: Are LDAP properties already in the Blueprint for the Component?
- Yes: use the Blueprint as-is; no Component configuration property injection.
- No: inject the Component configuration properties.
PROPERTY VARIABLES
ldap.connectionURL
ldap.domain
ldap.bindDn
ldap.bindPassword
ldap.userSearchBase
ldap.userObjectClass
ldap.userNameAttribute
ldap.groupSearchBase
ldap.groupObjectClass
ldap.groupNameAttribute
ldap.groupMemberAttribute
ldap.directoryType
ldap.directoryTypeShort
Then: perform property variable replacement.
29 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
LDAP/AD Property Variable -> Mapping
ldap.connectionURL
ldap.directoryType
ldap.directoryTypeShort
ldap.bindDn
ldap.bindPassword
ldap.userSearchBase
ldap.userNameAttribute
ldap.userObjectClass
ldap.groupSearchBase
ldap.groupNameAttribute
ldap.groupObjectClass
ldap.groupMemberAttribute
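The target properties of this mapping are not captured above, so the following is a hedged sketch only: using the same Mustache syntax as the rds.* examples, these variables could be referenced from a Blueprint, for instance in Ranger UserSync configuration. The ranger-ugsync-site keys below are standard Ranger property names, but the exact variable-to-property pairing Cloudbreak applies is an assumption here:
"ranger-ugsync-site": {
  "properties": {
    "ranger.usersync.ldap.url": "{{{ ldap.connectionURL }}}",
    "ranger.usersync.ldap.binddn": "{{{ ldap.bindDn }}}",
    "ranger.usersync.ldap.ldapbindpassword": "{{{ ldap.bindPassword }}}",
    "ranger.usersync.ldap.user.searchbase": "{{{ ldap.userSearchBase }}}",
    "ranger.usersync.ldap.user.objectclass": "{{{ ldap.userObjectClass }}}",
    "ranger.usersync.ldap.user.nameattribute": "{{{ ldap.userNameAttribute }}}",
    "ranger.usersync.group.searchbase": "{{{ ldap.groupSearchBase }}}",
    "ranger.usersync.group.objectclass": "{{{ ldap.groupObjectClass }}}",
    "ranger.usersync.group.nameattribute": "{{{ ldap.groupNameAttribute }}}",
    "ranger.usersync.group.memberattributename": "{{{ ldap.groupMemberAttribute }}}"
  }
}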
30 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Demo #2
Deeper Dive
31 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Data Lake Shared
Services
Bringing It All Together
32 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Why Data Lake Shared Services
 Customers have a need to secure ephemeral workload clusters
 Customers need a single metadata repository for Hive schema
 Customers want a single pane of glass to define users, groups and authorization policies
TECHNICAL PREVIEW
33 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Ephemeral Workloads: Basic -> Advanced -> Enterprise
Tier comparison: Basic Ephemeral | Advanced Ephemeral | Enterprise Spark & Hive
Tuned and Optimized Infrastructure
Simplified, Automated Operations
Cloud Storage Integration
Protected Gateway
Schema: - | Shared (Hive Metastore) | Shared (Hive Metastore)
Authentication: Single-user | Single-user | Single or Multi-User (LDAP/AD)
Authorization: - | - | Security Policies (Ranger)
Cloud Storage Audit: - | - | Audit (Ranger)
TECHNICAL PREVIEW
34 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Data Lake: The Technical Ingredients
SCHEMA
WHAT: Provides Hive schema (tables, views, etc).
WHY: If you have 2+ workloads accessing the same schema, you need to share it across workloads.
HOW: Externalize the Hive Metastore for schema definition.
POLICY
WHAT: Defines security policies around the Hive schema.
WHY: If you have 2+ users accessing the same data, policies need to be consistently available and applied.
HOW: Externalize and share Ranger across workloads and store policies externally.
AUDIT
WHAT: Audit user access.
WHY: Capture data access activity.
HOW: Externalize and share Ranger across workloads; leverage cloud storage for audit data.
DIRECTORY
WHAT: Users and groups.
WHY: Provide a multi-user authentication source for users and the definition of groups.
HOW: Leverage external LDAP/AD.
GATEWAY
WHAT: Provide a single endpoint that can be protected with SSL and enabled for authentication to access cluster resources.
WHY: Avoid opening many ports, some potentially without authentication or SSL protection.
HOW: Deploy a centralized protected gateway automatically.
TECHNICAL PREVIEW
35 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Data Lake: Flyover
[Diagram: LDAP/AD, a Hive Database, a Ranger Database, and Cloud Storage back the Data Lake (running Ranger and the Hive Metastore); Workload Cluster(s) running Hive, Spark, and Zeppelin attach to it]
TECHNICAL PREVIEW
36 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Demo #2 (again)
Deeper Dive++
37 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lessons Learned in the Cloud
38 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lessons Learned Topics
 Performance
 Costs
 Reliability
 Security
39 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lesson 1: Performance / Cost
 Know your cloud provider
– Cloudbreak offers a uniform API
– Similarities in basic concepts: compute, network, storage volumes, etc.
– Differences: performance, cloud connector, functionality
 Compute
– Instance types for your workload
– Different families: general purpose, compute, memory, storage optimized, GPU
– Network bandwidth
 Storage
– Speed, reliability, cost
– Aggregated: ephemeral (fixed size and number)
– Disaggregated:
• block storage (HDFS)
• cloud object stores (connector architecture)
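As an illustration of the connector architecture, cloud object stores are typically wired in through a handful of core-site properties. A minimal sketch for the S3A connector follows; the endpoint and key values are placeholders, and in practice credentials would come from instance roles or a credential provider rather than plain text:
"core-site": {
  "properties": {
    "fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com",
    "fs.s3a.access.key": "EXAMPLE_ACCESS_KEY",
    "fs.s3a.secret.key": "EXAMPLE_SECRET_KEY"
  }
}
Workloads then address data as s3a://bucket/path instead of an HDFS path.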
40 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lesson 2: Performance / Costs
 Capacity planning
– Workload type (batch / interactive)
– Allocate / release resources on demand
– Experimenting is cheap
 Flexible cluster shapes and sizes
– No one size fits all: security, HA, cluster topologies
– Cluster size is a variable, not a constant
– Spot/Preemptible VMs
 Automation
– DevOps mentality
– No manual configuration or fine-tuning
41 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lesson 3: Reliability / Fault tolerance
 Network
– Fault domain / Availability Zones
– Rack awareness (think where your instances are running)
– Topologies for HA scenarios
 Externalize states:
– All your files, notebooks, schema, policies
– Ambari, Ranger, Hive Metastore, etc. databases
42 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lesson 4: Security
Design your deployment to be secure from the beginning
Data protection
Authorization
Authentication
Perimeter Level Security
43 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Lesson 4: Security
 Perimeter level Security
– Private VPC/VNet deployments
– Inbound connectivity: security groups, ports
– Outbound: proxy / no internet
– Protected Gateway topology (Knox)
 Authentication:
– LDAP / AD
– Kerberos
 Authorization:
– Consistent authorization control across all HDP components (Ranger)
– Cloud provider specific (IAM roles)
 Data protection:
– At rest and in motion (e.g. Ranger KMS, cloud provider-specific disk encryption)
44 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Key takeaways
 Know your cloud provider
 Secure your cluster
45 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Wrap Up
46 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Learn More
 Try Cloudbreak 2.7
– http://docs.hortonworks.com
 Join Birds of a Feather
– Wednesday, June 20 @ 5:40p, Cloud and Operations
– Wednesday, June 20 @ 5:40p, Security and Governance
 Visit Breakout Sessions
– Thursday, June 21 @ 10:20a, Performance Analysis of AWS EC2 Instance Types, Michael Young
47 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Thank You