Introducing Cloudera Data Science Workbench for HDP 2.12.19

MACHINE LEARNING IN THE ENTERPRISE
SAUMITRA BURAGOHAIN | VICE PRESIDENT, PRODUCT MANAGEMENT
VIDYA RAMAN | DIRECTOR, PRODUCT MANAGEMENT
MICHAEL GREGORY | MACHINE LEARNING FIELD ENGINEERING LEAD

© Cloudera, Inc. All rights reserved.
The Industry’s First Enterprise Data Cloud
From the Edge to AI
Data Engineering
Data Warehouse
IoT, Ingest & Streaming
AI, ML & Data Science
Enterprise Data Cloud

OUR APPROACH TO MACHINE LEARNING
Modern enterprise platform, tools, and expert guidance to help you unlock business value with ML/AI
• Agile platform to build, train, and deploy many scalable ML applications
• Enterprise data science tools to accelerate team productivity
• Expert guidance, services & training to fast track value & scale

PLATFORM

HDP MARKET DRIVERS
BIGGER
• Infinitely Scalable (Billions of files, Exabytes)
• Low TCO (Less Storage Overhead)
TRUSTED
• Data Swamp -> Data Lake
SMARTER
• Deep Learning frameworks (TensorFlow, Caffe)
• GPU Pooling/Isolation
FASTER
• Faster time to deployment (Containerized Micro-Services)
HYBRID
• S3, ADLS/WASB, GCS with Truly Incremental Replication

DATA PLATFORM: HDP
Enterprise Grade Data Management: Hybrid, Secure and Scalable
● On-Premises and Multi-Cloud
● Enterprise Security & Data Lineage
● Choice of Deep Learning Support (Co-locate TensorFlow/GPU/Data)
● Dockerized for Packaging/Isolation

THE CHALLENGE
Balance these needs
DATA SCIENCE
• Access to granular data
• Flexibility
• Preferred open source tools
• Elastic provisioning
  • Compute
  • Storage
• Reproducible research
• Path to production
DATA MANAGEMENT
• Security
• Governance
• Standards
• Low maintenance
• Low cost
• Self-service access

CLOUDERA DATA SCIENCE WORKBENCH

A PLATFORM FOR MACHINE LEARNING
• Open platform
• Complete lifecycle
• Team collaboration
• Enterprise ready
• Runs anywhere
[Diagram: platform layers]
APPS: SOLUTIONS | USE CASES
TOOLS: SELF-SERVICE
ALGORITHMS: OPEN SOURCE ECOSYSTEM
COMPUTE: LOCAL | SPARK | HIVE
DEPLOYMENT: RESEARCH | PRODUCTION
SHARED CONTEXT: CATALOG | SECURITY | GOVERNANCE
S3 | ADLS | HDFS | KUDU
CLOUD | ON-PREMISES

THE TYPICAL SOLUTION
“If I can’t use my favorite tools, I’ll…”
• Copy data to my laptop
• Copy data to a data science appliance
• Copy data to a cloud service
Why this is a problem:
• Complicates security
• Breaks data governance
• Adds latency to process
• Makes collaboration more difficult
• Complicates model management and deployment
• Creates infrastructure silos

Accelerate Machine Learning from Research to Production
For data scientists
• Experiment faster
  Use R, Python, or Scala with on-demand compute and secure CDH data access
• Work together
  Share reproducible research with your whole team
• Deploy with confidence
  Get to production repeatably and without recoding
For IT professionals
• Bring data science to the data
  Give your data science team more freedom while reducing the risk and cost of silos
• Secure by default
  Leverage common security and governance across workloads
• Run anywhere
  On-premises or in the cloud

A MODERN DATA SCIENCE ARCHITECTURE
Containerized environments with scalable, on-demand compute
• Built with Docker and Kubernetes
• Isolated, reproducible user environments
• Supports both big and small data
• Local Python, R, Scala runtimes
• Schedule & share GPU resources
• Run Spark, Impala, and other CDH services
• Secure and governed by default
• Easy, audited access to Kerberized clusters
• Leverages Ranger for Security and Atlas for Governance
[Diagram: CDSW (Master, Engines) on gateway node(s); HDP nodes (Hive, HDFS, ...); Ambari]

ACCELERATED DEEP LEARNING WITH GPUS
Multi-tenant GPU support on-premises or cloud
• Extend CDSW to deep learning
• Schedule & share GPU resources
• Train on GPUs, deploy on CPUs
• Works on-premises or cloud
[Diagram: CDSW (GPU, CPU): single-node training; HDP (CPU): distributed training, scoring]
“Our data scientists want GPUs, but we need multi-tenancy. If they go to the cloud on their own, it’s expensive and we lose governance.”
GPU Available on HDP 3.x

Accelerate and simplify machine learning from research to production
ANALYZE DATA
• Explore data securely and share insights with the team
TRAIN MODELS
• Run, track, and compare reproducible experiments
DEPLOY APIs
• Deploy and monitor models as APIs to serve predictions
MANAGE SHARED RESOURCES
• Provide a secure, collaborative, self-service platform for your data science teams

INTRODUCING EXPERIMENTS
Versioned model training runs for evaluation and reproducibility
Data scientists can now...
• Create a snapshot of model code, dependencies, and configuration necessary to train the model
• Build and execute the training run in an isolated container
• Track specified model metrics, performance, and model artifacts
• Inspect, compare, or deploy prior models
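To make the idea of a versioned training run concrete, here is a minimal sketch in plain Python. It does not use the CDSW experiments feature or any CDSW library; `run_experiment`, `best_run`, the `registry` list, and the configuration keys are all hypothetical names chosen for illustration. It assumes a run is identified by a hash of its configuration and records metrics and artifacts next to that identifier.

```python
import hashlib
import json
import random


def run_experiment(config, registry):
    """Run one training run and append a versioned record to `registry`.

    Hypothetical helper: illustrates snapshot + metrics + artifacts only.
    """
    # Snapshot: a stable hash of the configuration identifies the run.
    snapshot = json.dumps(config, sort_keys=True)
    run_id = hashlib.sha256(snapshot.encode("utf-8")).hexdigest()[:12]

    # Synthetic data from a seeded generator, so reruns are reproducible.
    rng = random.Random(config["seed"])
    xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

    # "Training": fit y = w * x by gradient descent on mean squared error.
    w = 0.0
    for _ in range(config["epochs"]):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= config["learning_rate"] * grad

    # Track the metric and the trained parameter alongside the snapshot.
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    record = {
        "run_id": run_id,
        "config": config,
        "metrics": {"mse": mse},
        "artifacts": {"w": w},
    }
    registry.append(record)
    return record


def best_run(registry, metric="mse"):
    """Compare prior runs and return the one with the lowest metric."""
    return min(registry, key=lambda r: r["metrics"][metric])


registry = []
run_experiment({"seed": 7, "epochs": 5, "learning_rate": 0.1}, registry)
run_experiment({"seed": 7, "epochs": 200, "learning_rate": 0.1}, registry)
print(best_run(registry)["config"])
```

Because the data generator is seeded from the configuration, rerunning the same configuration yields the same `run_id` and the same metric, which is the property that makes prior runs comparable.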

INTRODUCING MODELS
Machine learning models as one-click microservices (REST APIs)
1. Choose file, e.g. score.py
2. Choose function, e.g. forecast
   import pickle

   # Load the trained model once, when the file is first run
   with open('model.pk', 'rb') as f:
       model = pickle.load(f)

   def forecast(data):
       return model.predict(data)
3. Choose resources
4. Deploy!
Running model containers also have access to CDH for data lookups.
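The following sketch shows how a scoring function like the one in the steps above might behave once it sits behind a REST API, where the input arrives as JSON and the output must be JSON-serializable. Several things here are assumptions rather than part of the slide: `MeanModel` is a hypothetical stand-in for the object loaded from `model.pk`, the request key `"data"` is invented for illustration, and the function takes a decoded dictionary instead of raw data. No CDSW API is called; the request and response are simulated with the standard `json` module.

```python
import json


class MeanModel:
    """Hypothetical stand-in for a trained model loaded from model.pk."""

    def __init__(self, history):
        self.mean = sum(history) / len(history)

    def predict(self, data):
        # Return the historical mean once per requested input row.
        return [self.mean for _ in data]


model = MeanModel([10.0, 12.0, 14.0])


def forecast(args):
    """Take a JSON-decoded dict and return a JSON-serializable dict."""
    data = args.get("data")  # "data" is an assumed key name
    if not isinstance(data, list) or not data:
        return {"error": "expected a non-empty list under 'data'"}
    return {"forecast": model.predict(data)}


# Simulate one request/response round trip through JSON.
request_body = json.dumps({"data": [1, 2, 3]})
response_body = json.dumps(forecast(json.loads(request_body)))
print(response_body)
```

Validating the input inside the function keeps a malformed request from surfacing as an unhandled exception in the serving container.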

MODEL MANAGEMENT
View, test, monitor, and update models by team or project

DEMO

CLOUDERA DATA SCIENCE WORKBENCH 1.5 on HDP 3.1.0 & 2.6.5
Accelerate and simplify machine learning from research to production for the HDP install base

CDSW TRAINING
• Self-paced video training
• Learn end-to-end ML workflows in CDSW
• Code examples
• Python and R tracks
• Customers can register at university.cloudera.com

DISCLAIMER
The information in this document is proprietary to Cloudera. No part of this document may be reproduced,
copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.
This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.
Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.

MACHINE LEARNING AT CLOUDERA
Our philosophy
We empower our customers to
run their business on data with an
open platform:
● Your data
● Open algorithms
● Running anywhere
We accelerate enterprise data science.
