Cloud-Native Machine Learning: Emerging Trends and the Road Ahead

CLOUD-NATIVE MACHINE LEARNING
Emerging Trends and the Road Ahead
Tristan Zajonc, CTO for Machine Learning @ Cloudera

2 © Cloudera, Inc. All rights reserved.
WHO AM I?
Tristan Zajonc
Cloudera
@tristanzajonc
CTO for Machine Learning at Cloudera
Previously lead ML engineering team at Cloudera
Former CEO & cofounder of Sense (acquired by Cloudera in 2016)
Originally Bayesian statistician and economist

DISCLAIMER
The information in this document is proprietary to Cloudera. No part of this document may be reproduced or
disclosed without the express prior written permission of Cloudera.
This document is not binding upon Cloudera in any way (including with respect to Cloudera’s course of
business, product strategy, future releases, or development commitments), and is not a part of or subject to
any license agreement or other agreement you may have with Cloudera. Cloudera makes no commitments
about any future developments or functionality in any Cloudera product. The development, release, and timing
of release of any software features or functionality described in this document remains at the discretion of
Cloudera and you should not rely on any statements about development plans or anticipated functionality in
this document when making any purchasing decisions.
Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant the
accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect or
consequential damages that may result from the use of these materials.”

4 © Cloudera, Inc. All rights reserved. 4
enable the end-to-end machine learning lifecycle for teams
building thousands of applications across petabytes of data
with an enterprise-ready platform optimized for the cloud
Rapid provisioning and
experimentation
DevOps mindset
From development to
production
Team
collaboration
and shared
context
Any
framework,
your data, your
models
Secure by default
CPUs, GPUS, TPUs,
and beyond
Massive, diverse
data
Automation Runs anywhere

Confidential-Restricted – For Discussion Purposes
Only
ACCELERATING THREE STAGES OF MACHINE LEARNING
Manage pipelines + models
Deploy models
Automate pipelines
Monitor performance
DEPLOYDEVELOP
Make teams more productive
Explore data
Develop reports, pipelines, models
Collaborate with peers
TRAIN
Scale resources efficiently
Train models
Tune parameters
Track performance
End-to-end machine learning infrastructure for teams at scale
MANAGE
Run anywhere with a common architecture
Manage access and resources
Scale cost with usage

TODAY: CLOUDERA DATA SCIENCE WORKBENCH
Accelerate Machine Learning from Research to Production on CDH/HDP
For data scientists
• Experiment faster
Use R, Python, or Scala with
on-demand compute and
secure CDH data access
• Work together
Share reproducible research
with your whole team
• Deploy with confidence
Get to production repeatably
and without recoding
For IT professionals
• Bring data science to the data
Give your data science team
more freedom while reducing
the risk and cost of silos
• Secure by default
Leverage common security and
governance across workloads
• Run anywhere
On-premises or in the cloud

+

CLOUDERA DATA SCIENCE WORKBENCH FOR HDP/CDP
Enterprise data science and ML powered by Kubernetes
• Built with Docker and Kubernetes
• Isolated, reproducible user environments
• Supports both big and small data
• Local Python, R, Scala runtimes
• Connect to any data source
• Schedule & share GPU resources
• Run Spark, Impala, and other CDH services
• Secure and governed by default
• Easy, audited access to Kerberized clusters
• Leverages SDX platform services
CDH/HDP CDH/HDP
Cloudera Manager / Ambari
gateway node(s) CDH/HDP nodes
Hive, HDFS, ...
CDSW CDSW
Master
...
Engine
EngineEngine
EngineEngine
K8S K8S

SOME LESSONS LEARNED

LESSON 1: ENTERPRISE ML REQUIRES BIGGER PICTURE
Streaming
Ingest
Batch Ingest
Machine
Learning Tools
BI Tools and
SQL Editors
Data Products
DATA, METADATA, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT
MACHINE
LEARNING
DATA
ENGINEERING
DATA
WAREHOUSE
OPERATIONAL
DATABASE

UBER
Michelangelo: Uber’s Machine Learning Platform
Source: https://eng.uber.com/michelangelo/
DATA CENTER
PUBLIC CLOUD

NETFLIX
Netflix recommendation platform
Source: https://medium.com/netflix-techblog/netflix-at-spark-ai-summit-2018-5304749ed7fa
PUBLIC CLOUD

LESSON 2: EMERGING CONSENSUS ON MODERN ML STACK
Big Data
Platform
ML/AI
Frameworks
Container
Infrastructure

Data scientists want to use every library under the sun, platform should support that
Workbench
(interactive)
ANALYZ
E
Experiments
(batch, ad-hoc)
TRAIN
Reports
(shared output)
SHARE
Jobs
(batch, scheduled)
REPEAT
Models
(RPC APIs)
SERVE
LOOSELY COUPLED, ML/AI FRAMEWORK AGNOSTIC
LESSON 3: FOCUS ON WORKFLOWS NOT FRAMEWORKS

Cloudera ML is built around one Operator and four “CRDs”:
• Session (Interactive)
• Experiment (Batch)
• Job (Scheduled)
• Model (Online API)
No long-lived operators for particular libraries: TFJob, SparkJob, etc
Kubernetes is not exposed to data scientist. Goal is serverless experience.
WHAT DOES THIS LOOK LIKE IN K8S?
Let’s talk operators and CRDs….

Confidential-Restricted – For Discussion Purposes
Only
LESSON 4: NO SINGLE BEST INFRASTRUCTURE FOR CUSTOMERS
Non-Cloud Private Cloud Public Cloud PaaS
I want to
maximize
• Cost-efficiency • Control, elasticity,
and convenience
• Control, elasticity,
and convenience
• Agility
I want to
minimize
• Dependence on
unproven technology
• Resource contention
between departments
• Dependence on data
center floor space
• Dependence on IT
and therefore need
as simple as possible
I want to
standardize
• On whatever
provides the best
ROI
• On a single
environment for the
entire data center
• On a single cloud
provider for all
infrastructure needs
• On whatever is
easiest to use
I want to
store my data
• On premises
because cheaper
and/or more secure
• On premises due to
company /
government mandate
• In the cloud because
easier
• In the cloud because
easier

ROAD AHEAD

TODAY: CURRENT CDSW ARCHITECTURE
Containerized gateway to CDH/HDP cluster for data scientists
• Benefits
• Easy to attach for self-service data science
• Seamless connectivity to existing clusters
• Isolated, reproducible user environments
• Tradeoffs
• Requires sizing, managing two clusters
• Requires dependency coordination
• Not optimized for distributed training
• Complex, inelastic setup in public cloud
CDH/HDP CDH/HDP
Hive, HDFS, ...
CDSW CDSW
...
Master
...
Engine
EngineEngine
EngineEngine
K8S K8S

def risk_simulation():
import numpy
import tensorflow
...
results = sc.parallelize(xrange(0, 1000))
.map(risk_simulation).collect()

cldr mlx run --gpus 3 --cmd “python train.py”

CLOUDERA ML-X FOR CDP
Cloud-native enterprise machine learning platform
DATA SCIENCE DATA ENGINEERING MODEL OPERATIONS
CLOUDERA ML RUNTIME
Python/R, Spark, TensorFlow, CPU/GPU-Optimized
Interactive Development Batch Pipelines Predictive APIs
Full capability of CDSW
Rapid cloud provisioning
and elastic autoscaling
Unified data engineering and
ML with seamless
dependency management
Multi-cloud portability
powered by Kubernetes
Connects to HDFS or
cloud object storage
and shared metadata
Accelerated deep learning
with distributed GPU training
As-as-service or in private
cloud
CONTAINER CLOUD
EKS, AKS, GKE, OpenShift
DATA CATALOG
GOVERNANCESECURITY LIFECYCLEWORKLOAD XM
STORAGE Amazon
S3
Microsoft
ADLS
HDFS KUDU

Requires and extends CDH/HDP, pushing
distributed compute to the cluster
CDH / HDP CDH / HDP
Hive, HDFS, ...
CDSW CDSW
...
Master
...
Engine
EngineEngine
EngineEngine
Cloud optimized, elastic distributed compute;
can optionally use CDH / HDP
Kubernetes
(EKS, GKE, AKS,
OpenShift)
CDH / HDP CDH / HDP
Hive, HDFS,
HMS, Sentry,
Navigator
MLX MLX
...
Master
...
Engine
EngineEngine
EngineEngine
Object
Store
Altus Director / CloudBreak
OPTIONAL
DATA SCIENCE WORKBENCH vs. CLOUDERA ML-X
Cloud ML
APIs

• Simple dependency management, particularly for ML workloads
• python: pip install, R: install.packages, etc
• Unified workflow for distributed data engineering and machine learning
• Elastic compute (CPU / GPU) in the cloud
• Portable form-factor optimized for cloud and private cloud
• Long-term potential to unify resource management across workloads
WHY SPARK ON KUBERNETES?

SPARK ON KUBERNETES
• Spark has native integration with Kubernetes since 2.3+
• Runs Spark fully containerized on K8s
• Allows Spark to add / remove executors with user specified Spark configuration
• Able to leverages Kubernetes features to facilitate launching Spark
(e.g: Node Affinities, RBAC, Namespaces, etc).
• But still many limits…

Kubernetes
CMLCML
Engine
Spark
Driver
(Client
Mode)
Spark
Executor
Fluentd Fluentd
Project
Files
Project
Files
Spark
Executor
Project
Files
Hive, HDFS,
HMS, Sentry,
Navigator- Create user service account and namespace
for security and isolation
- Configure Spark driver / executor
configuration to connect to k8s API server.
- Also populate kerberos related information
for accessing HDFS and other CDH services.
- Populate dependency management related
configurations using pod template: volume,
image, user information, etc.
- Fluentd configured to pick up logging and
metric information to external storage
MAKING SPARK EASY IN ML-X

Development: NFS/EFS shared filesystem gives laptop like developer experience
Production: Built in source-to-image (S2I) pipeline gives PaaS experience
• Provides declarative pathway from version control to experiment or model
• Provides easy deployment from other environments, e.g. laptops, webhooks
RUN
• Stage 3: Run image
SNAPSHOT
• Stage 1: Git snapshot
HOW DEPENDENCY MANAGEMENT WORKS
Simplifying dependency management from development to production
BUILD
• Stage 2: Docker build

DEMO

• Spark
• Dynamic allocation and shuffle service
• GPU support
• Scale testing and performance improvements
• K8s
• Advanced scheduling (FairScheduler, Hierarchical Queues, etc)
• Fast/lazy image distribution at scale
• Other Interesting Frameworks
• Ray
• Distributed Tensorflow & Pytorch (Horovod)
WHAT’S NEXT

Relevant Sigs
sig-big-data, sig-ml, sig-scheduling

Cloud-Native Machine Learning: Emerging Trends and the Road Ahead

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Cloud-Native Machine Learning: Emerging Trends and the Road Ahead

Similar to Cloud-Native Machine Learning: Emerging Trends and the Road Ahead (20)

More from DataWorks Summit

More from DataWorks Summit (20)

Recently uploaded

Recently uploaded (20)

Cloud-Native Machine Learning: Emerging Trends and the Road Ahead