A deep dive into running data analytic workloads in the cloud

© Cloudera, Inc. All rights reserved.
Strata San Jose 2018
Jason Wang | Altus Engineering
Aishwarya Venkataraman | Altus Engineering
Stefan Salandy | Systems Engineering
Mala Ramakrishnan | Senior Director, Altus Product & Marketing

Who we are
Jason Stefan
Aishwarya Mala

Agenda
- Introduction
- Cloudera Altus
- Introducing today’s lab
- Hands-on data pipeline
- Running analytic database as a PaaS
- Workload Analytics
- Conclusion

Introduction

The Big Shift
In 2017
58% on-premises
11% private cloud
25% public cloud
By 2019
38% on-premises
15% private cloud
41% public cloud
Source: 451 Research, Voice of the Enterprise: Workloads and Key Projects, Cloud Transformation, 2017.

The cloud has redefined our world
Role of VP of Data Management
Old Job
- Buy databases in bulk and rent back to departments
- Load data into and out of individual data silos as needed
- Add storage to each platform as needed
New Job
- Departments buy their own databases
- Safe, collaborative environment for every department to access centralized, shared data
- Departments rent their storage needs
The New World
SDX, COMPUTE, STORAGE
Most deployments are a hybrid of the old and new

The market is diverging toward 4 distinct
environments
¼ PaaS
¼ Public Cloud / IaaS
¼ Private Cloud
¼ Non-Cloud

Perfectly valid reasons for each environment
I want to maximize
• Non-Cloud: Cost-efficiency
• Private Cloud: Control, elasticity, and convenience
• Public Cloud / IaaS: Control, elasticity, and convenience
• PaaS: Agility
I want to minimize
• Non-Cloud: Dependence on unproven technology
• Private Cloud: Resource contention between departments
• Public Cloud / IaaS: Dependence on data center floor space
• PaaS: Dependence on IT (and therefore need it to be as simple as possible)
I want to standardize
• Non-Cloud: On whatever provides the best ROI
• Private Cloud: On a single environment for the entire data center
• Public Cloud / IaaS: On a single cloud provider for all infrastructure needs
• PaaS: On whatever is easiest to use
I want to store my data
• Non-Cloud: On premises because cheaper and/or more secure
• Private Cloud: On premises due to company / government mandate
• Public Cloud / IaaS: In the cloud because easier
• PaaS: In the cloud because easier

Which environment do you want?
Non-Cloud
• “I need huge scale in a single cluster”
• “I have a ton of cold data”
• “My existing cluster utilization is 90%”
Private Cloud
• “I want to separate compute and storage”
• “I have unmet demand for ad hoc workloads”
• “Bare metal is not an option and I’m not allowed to move to the cloud”
Public Cloud / IaaS
• “I want to configure and troubleshoot my environment”
• “We’ve already done a scan of AWS and that’s where we’re moving”
• “My annual chargeback per server is outrageous”
PaaS
• “I’m done hiring my own admins”
• “My team has limited skills”
• “I get no love from central IT”

● The modern platform for machine
learning and analytics
● with multiple deployment options
● and one shared data experience

The modern platform for machine learning and analytics optimized for the cloud
Cloudera Enterprise
DATA CATALOG, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT, INGEST & REPLICATION
EXTENSIBLE SERVICES
CORE SERVICES: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
STORAGE SERVICES: S3, ADLS, HDFS, KUDU
DEPLOYMENT OPTIONS: PRIVATE CLOUD, BARE METAL, INFRASTRUCTURE SERVICES

Who is this tutorial for?
The Data Management Infrastructure Model
https://www.gartner.com/doc/3817571/solve-data-challenges-data-management

Who is this tutorial for?
Data Management Infrastructure Model Roles and Skills
https://www.gartner.com/doc/3817571/solve-data-challenges-data-management

Traditional on-premises workloads generally share a cluster
HDFS

Cloud workloads: Separation of storage and compute
Object Store (S3, ADLS)
Dedicated
compute
Shared
data

Technology drivers for workloads in the cloud
1. Scalable and cost-effective storage in a single repository
2. Access to utility-based compute
3. Open and modular architectures
Amazon EC2, Azure Virtual Machine, Amazon S3, Azure Data Lake Storage

Types of clusters
lifecycle: transient or permanent
tenancy: single tenant or multi tenant
Data Engineering Pipeline, Analytics Cluster
authorization, configuration, performance, troubleshooting, upgrade, metadata

Data Engineering in the Cloud
Transient Batch
Spin up clusters as needed.
● On-demand/spot instances
● Usage-based pricing
● Sized for workload
● Cluster per tenant/user
Long-running Batch
Persistent clusters for frequent ETL.
● Reserved instances
● Node-based pricing
● Grow/shrink
● Cluster per tenant group
Persistent Batch on HDFS
Top performance for frequent ETL.
● Shared across tenant groups
● Lift-and-shift
Diagram labels: Hyperscale Cloud Storage, Batch Cluster, Persistent Cluster, HDFS, PaaS
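The pricing trade-off between transient and long-running batch clusters can be sketched in a few lines of Python. This is only an illustration: the function names and the hourly rates in the usage note are hypothetical placeholders, not Cloudera or cloud-provider prices.

```python
def transient_cost(nodes, hours_per_run, runs_per_month, on_demand_rate):
    """Usage-based pricing: pay only while the cluster exists."""
    return nodes * hours_per_run * runs_per_month * on_demand_rate


def persistent_cost(nodes, reserved_rate, hours_in_month=730):
    """Node-based pricing: pay for every hour of the month."""
    return nodes * reserved_rate * hours_in_month


def cheaper_option(nodes, hours_per_run, runs_per_month,
                   on_demand_rate, reserved_rate):
    """Return the cheaper model and its monthly cost (rates are assumptions)."""
    t = transient_cost(nodes, hours_per_run, runs_per_month, on_demand_rate)
    p = persistent_cost(nodes, reserved_rate)
    return ("transient", t) if t <= p else ("persistent", p)
```

With made-up rates of 0.50 per node-hour on demand and 0.25 reserved, a 10-node cluster running 2 hours a day is cheaper as a transient cluster, while the same cluster running 20 hours a day is cheaper as a persistent one.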

Analytics in the cloud
Transient Analytics (infrequent usage)
Spin up clusters when needed
● On-demand instances
● Grow/shrink
● Cluster per tenant or user
Persistent Analytics (regular usage)
Persistent clusters for BI any time
● Grow/shrink
● Cluster per tenant group
Persistent Analytics with Local Storage (fastest)
Max speed for more regular workloads
● Less frequent grow/shrink
● Shared cluster for shared local data
Diagram labels: Object Storage, Transient Cluster, Persistent Cluster, HDFS and/or Kudu, PaaS

Primary analytic workloads in the cloud
scale, agility, and cost-efficiencies
Shared, Open Storage
ETL / Data Preparation
Only pay for what you need, when you need it
• Transient workloads
• Contention-free isolation
• Cloud-native integration
BI / SQL Analytics
Self-service flexibility at any scale
• Elastic scale
• Multi-tenant isolation
• Cloud-native or local

Introduction to Cloudera Altus

Cloudera Altus
Multi-cloud Platform-as-a-Service (PaaS) offering
Built to analyze and process data at scale in public cloud infrastructure
EXTENSIBLE SERVICES
ALTUS SERVICES: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE

Altus Data Engineering (DE)
What is it?
- Short-lived
- Single tenant
- Hive, Spark, or MapReduce Cluster
Used for things like
- ETL jobs
- batch processing
- with data living in S3 or ADLS
- Provides fast and easy job submission without cluster management
Available on AWS and Azure

Altus Analytic Database (ADB)
What is it?
- Long-lived
- Multi tenant
- Impala Cluster
Used for things like
- data warehousing
- analytics
- with data living in S3 or ADLS
- Provides fast and easy analytics without cluster management
Available on AWS

Cloudera Shared Data Experience (SDX)
What is it?
- Cloud native shared metadata store with metadata living in S3 or ADLS
- Shared cataloging to define and preserve structure and business context of data
- Provides unified security across transient and recurring workloads
- Enables consistent governance across all data to increase compliance
Diagram: multiple DATA ENGINEERING and ANALYTIC DATABASE clusters over S3 or ADLS

Altus Features
Focus on the workload, not the infrastructure.
Let Altus do the heavy lifting.
Low cost
• Per-node/per-hour pricing
• Create clusters as needed
• Terminate clusters when
they’re not in use
End-user focused
• Manages your cluster so you
don’t have to
• Submit Jobs via the UI/CLI/API
• Built in workload
troubleshooting and analytics
Easy to use
• Self-service for end-users
• Built on your familiar cloud
infrastructure
• Cluster provisioning in
minutes
Cloud-native
• Runs on AWS and Azure
• Read/Write against ADLS and
S3
• Decouple storage from compute
Integrated Platform
• Same Cloudera platform on-
premises and in the cloud
• Many different services like
DE and ADB
• Share metadata across
clusters with SDX
Secure
• Integrated with Azure and
AWS security models
• Cloudera NEVER has access
to your data
• Backed by native cloud
storage

Altus Workflow

Altus Workflow: create environment

Altus Workflow: create cluster

Altus Workflow: run a job

Altus Data Engineering Workflow: short-lived

Altus Analytic Database Workflow: long-lived

What is an Environment?
An Environment is an encapsulation of the cloud provider resources and the cross-account trust needed to deploy Cloudera clusters.
What are Clusters?
A Cluster is a Cloudera cluster (CM + Master + Worker nodes) optimized for running specific workloads.
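One way to picture the relationship between the two concepts is as a pair of small Python data classes. This is a minimal sketch for illustration only: the field names, the service-type labels, and the assumption of exactly one CM node and one master node are hypothetical, not the Altus data model.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Cloud provider resources plus cross-account trust (hypothetical fields)."""
    name: str
    provider: str      # "aws" or "azure"
    region: str
    network_id: str    # e.g. a VPC or virtual network identifier


@dataclass
class Cluster:
    """A Cloudera cluster deployed into an Environment for one workload type."""
    name: str
    environment: Environment
    service_type: str  # illustrative labels such as "SPARK", "HIVE", "IMPALA"
    worker_count: int

    def node_count(self) -> int:
        # Assumption: one Cloudera Manager node and one master node,
        # plus the requested number of workers.
        return 2 + self.worker_count
```

Several clusters can point at the same Environment object, which mirrors the idea that an environment is set up once and then reused for every cluster created in it.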

AWS vs. Azure
1. Security Model for Delegated Access
2. Networking
3. Cloud Storage Data Access

AWS Model for Delegated Access: IAM Roles
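Delegated access through IAM roles rests on a trust policy attached to a role in your account. The sketch below builds such a policy document in Python. The policy structure is the standard IAM trust-policy format; the account ID, the external ID, and the choice to require an external ID are placeholders and assumptions, not values or requirements taken from Altus.

```python
import json


def cross_account_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build an IAM trust policy that lets another AWS account assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume the role (placeholder).
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Assumption: an external ID is required as a shared secret.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


policy = cross_account_trust_policy("123456789012", "example-external-id")
print(json.dumps(policy, indent=2))
```

The permissions the role grants are defined separately in a permissions policy; the trust policy only controls who may assume the role.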

Azure Model for Delegated Access: Service Principal

AWS Networking

Azure Networking

AWS S3 Data Access: Instance Profile

Azure ADLS Data Access: MSI

Today’s Lab:
Solving a Business Need With Cloudera Altus

Setting the Scene
- We work for an outdoor clothing retail company and website
sales are struggling
- We need to figure out whether sales orders correlate with
website visits and what steps to take to improve sales
- We’ll use Altus DE and Altus ADB to solve this

Already Set Up: Raw Data Ingestion
Sales Orders Raw Logs

Part One: Data Engineering
Sales Orders Raw Logs Tokenized logs
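Turning raw logs into tokenized logs means splitting each line into named fields. The following Python sketch shows the idea, assuming the raw logs follow the Apache common log format; the lab itself runs this step on an Altus Data Engineering cluster, and the actual log layout and field names may differ.

```python
import re

# Assumed layout: Apache common log format (not confirmed by the slides).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def tokenize(line):
    """Split one raw log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # numeric status is easier to filter on
    return fields


sample = ('203.0.113.7 - - [10/Mar/2018:13:55:36 -0800] '
          '"GET /products/jacket HTTP/1.1" 200 2326')
print(tokenize(sample))
```

Lines that do not match the pattern return None, so a pipeline can count or quarantine them instead of failing.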

Sales Orders Raw Logs Tokenized logs
Part Two: Analytics
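The business question from the lab setup, whether sales orders correlate with website visits, comes down to a correlation coefficient over two aligned series. Here is a minimal Python sketch of the Pearson coefficient; the per-product visit and order counts in the usage note are hypothetical, and in the lab the analysis runs against Altus Analytic Database rather than in local Python.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

For example, `pearson(visits_per_product, orders_per_product)` close to 1 would suggest that heavily visited products also sell well, while a value near 0 would point to products that attract visits without converting.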

What this will look like in today’s lab
1
2
3
4

Hands-on Data Pipeline

But first, go get the handout
https://tinyurl.com/y9zxxzkm

Handout overview
When you see this hand icon, look at your handout for a hands-on task.

Log in to Altus
1
console.altus.cloudera.com

Create Altus clusters
2
Create one cluster for Data Engineering and one cluster for Analytic Database. While these clusters are being created, take a break!

Perform ETL using Altus Data Engineering
3

Altus Analytic Database

Altus Analytic DB Architecture
● Impala running on EC2 nodes
● Data stored in S3
● Data can be accessed by multiple clusters
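Because the data stays in S3, a cluster only needs a table definition that points at the bucket. The Python helper below generates an Impala-style external table statement as a sketch of that idea; the table name, columns, bucket path, and file format are hypothetical and not taken from the lab.

```python
def external_table_ddl(table, columns, s3_location, file_format="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement over data that lives in S3."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{s3_location}'"
    )


ddl = external_table_ddl(
    "tokenized_logs",                                  # hypothetical table name
    [("ip", "STRING"), ("path", "STRING"), ("status", "INT")],
    "s3a://example-bucket/tokenized_logs/",            # hypothetical bucket
)
print(ddl)
```

Dropping an external table removes only the metadata, so other clusters that define the same location keep reading the same files.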

Explore data using Altus Analytic Database
4

Altus Workload Analytics

Altus Workload Analytics
● Get insight into causes of job failure
● Size clusters and optimize job performance
● Identify issues even when they don’t show up as errors

Hive invalid query
Troubleshooting failed jobs
5

Spark Out of Memory issue
Troubleshooting failed jobs
5

Example: Skewed join
- WA lists outlier tasks that have a long wait before they start
Optimize Performance
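Flagging outlier tasks like the ones in the skewed-join example can be approximated with a simple rule: compare each task's wait time with the median. The Python sketch below does that. It is an illustrative heuristic only, not the algorithm Workload Analytics uses, and the threshold factor and task IDs are assumptions.

```python
from statistics import median


def outlier_tasks(wait_seconds, factor=3.0):
    """Return the IDs of tasks whose wait time exceeds `factor` times the median."""
    if not wait_seconds:
        return []
    threshold = factor * median(wait_seconds.values())
    return sorted(task for task, wait in wait_seconds.items() if wait > threshold)


waits = {"task-1": 2.0, "task-2": 3.0, "task-3": 2.5, "task-4": 40.0}
print(outlier_tasks(waits))
```

The median is used instead of the mean so that the slow task itself does not raise the baseline it is compared against.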

● Track history of recurring workloads over time
● Performance trends of each individual stage
● Automatic detection of abnormal behavior of recurring workloads (too fast or too slow)
● Drilling down can show differences between data input / output size
● Group by jobs
DEMO
Track history
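Detecting a recurring workload that ran too fast or too slow can be illustrated with a z-score against the run history. This Python sketch is a generic statistical check under stated assumptions (durations in seconds, a threshold of two standard deviations); it does not describe how Workload Analytics computes its baselines.

```python
from statistics import mean, stdev


def classify_run(history, latest, z_threshold=2.0):
    """Label the latest run duration relative to earlier runs of the same workload."""
    if len(history) < 2:
        return "unknown"  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All earlier runs took exactly the same time.
        if latest == mu:
            return "normal"
        return "too slow" if latest > mu else "too fast"
    z = (latest - mu) / sigma
    if z > z_threshold:
        return "too slow"
    if z < -z_threshold:
        return "too fast"
    return "normal"
```

A run flagged as too fast deserves the same attention as a slow one, since it may mean the job processed far less input than usual.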

- Number of Map/Reduce jobs generated
- Log files for each individual task
- Metrics for each stage
- Browse and search configuration properties
DEMO
Execution details of a job

Conclusion

Key benefits of PaaS
Spin up working environments ad hoc
Bring your own data and tools
Adjust resources on-demand
Pay for your actual consumption of resources

Thank you
cloudera.com/altus

The key benefits of a modern analytic database
High-performance BI and SQL analytics
Flexibility for data and use case variety
Cost-effective scale for today and tomorrow
Go beyond SQL with an open architecture

Advantages of a modern approach
decoupled for cloud and on-premises
Go Beyond SQL
• Consolidate data silos with
an open architecture
• Shared data across SQL
and non-SQL workloads
Data Flexibility
• Iterative modeling and self-
service accessibility
• Portability: No proprietary
formats or storage lock-in
Cost-Effective Scalability
• Elastic scale in any
environment
• Cloud-native integration for
optimized pay-per-use costs
• Proven at massive scale
Hybrid
• Runs across multi-cloud &
on-prem for zero lock-in
• Multi-storage over S3,
ADLS, HDFS, Kudu, Isilon,
etc.
Shared Data

Key benefits translated for the cloud
- High-performance BI and SQL analytics: Same SQL engine native across any cloud and on-prem
- Flexibility for data and use case variety: Self-service access directly on object stores, without the silos
- Cost-effective scale for today and tomorrow: Elasticity on-demand through decoupled compute and object storage
- Go beyond SQL with an open architecture: Converge workloads over shared data, with zero lock-in
