Data Science and CDSW

1© Cloudera, Inc. All rights reserved.
Cloudera Data Engineering and Data
Science

We are in the age of machine learning
Data has never been
more plentiful
Open source data science and
machine learning libraries are
rapidly evolving
Flexible commodity storage and
compute make scalable
production machine learning
affordable
Data Analytics Deployment

But there are practical challenges
Most data science done at
small scale, individually,
and is difficult to replicate
Very few models
reach production
Teams have different,
conflicting requests for
languages & libraries
Native formats and
security can make data
access challenging
Data Analytics Deployment

Open data science in the enterprise
IT
drive adoption while maintaining compliance
Data Scientist
explore, experiment, iterate

• Team: Data scientists and analysts
• Goal: Understand data, develop and improve models,
share insights
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data
wrangling/discovery tools, …
• End State: Reports, dashboards, PDF, MS Office
• Team: Data engineers, developers, SREs
• Goal: Build and maintain applications, improve
model performance, manage models in production
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous
integration, source control, …
• End State: Online/production applications
Types of data science
Exploratory
(discover and quantify opportunities)
Operational
(deploy production systems)

Typical data science workflow
Data Engineering Data Science (Exploratory) Production (Operational)
Data Wrangling
Visualization
and Analysis
Model Training
& Testing
Production
Data Pipelines Batch Scoring
Online Scoring
Serving
Data GovernanceGovernance
Processing
Acquisition

Apache Spark
Apache Spark is at the
core of our data science
experience
• Libraries for common
machine learning
• Trusted in production
by our customers
• Delivered with expert
support and training
• A requirement for our
Data Science
Workbench
Apache Spark is a huge
driver for machine
learning
• Native language
development tools
• Reliable operation at
big data scale
• Native access to
Hadoop data for
testing and training
Spark 2.0 is here
• Separate parcel for
easy implementation
for multiple Spark
instances
• Better Streaming
Performance
• Machine Learning
Persistence

Spark Use Cases
Top Use Cases Data Processing (55%), Real-Time Stream Processing
(44%), Exploratory Data Science (33%) and Machine Learning (33%).
3 out of 8 are employing Spark in data science research

Deep learning in Cloudera with Apache Spark
• Two packages:
• CaffeOnSpark
• TensorFlowOnSpark
• Developed by Yahoo
• Python and Scala APIs
• All DL architectures
• Integrated pipeline
• Run on existing clusters
• Training and inference
• Open source DL library
• Developed by Skymind
• Built on JVMs
• Supports CPUs and
GPUs
• Java, Scala, Python APIs
• Training and inference
• Imports models from:
• TensorFlow
• Caffe
• Torch
• Theano
• Runs on existing clusters
• Deep learning framework
• Developed by Intel
• Supports CPUs only
• Leverages Intel MKL
• Scala, Python APIs
• Imports models from:
• TensorFlow
• Caffe
• Torch
• Runs on existing clusters
Spark Packages DL4J BigDL

Solving Data Science is a Full-Stack Problem
• Leverage Big Data
• Enable real-time use cases
• Provide sufficient toolset for the Data Analysts
• Provide sufficient toolset for the Data Scientists
+ Data Engineers
• Provide standard data governance capabilities
• Provide standard security across the stack
• Provide flexible deployment options
• Integrate with partner tools
• Provide management tools that make it easy
for IT to deploy/maintain
✓Hadoop
✓Kafka, Spark Streaming
✓Spark, Hive, Hue
✓Data Science Workbench
✓Navigator + Partners
✓Kerberos, Sentry, Record Service, KMS/KTS
✓Cloudera Director
✓Rich Ecosystem
✓Cloudera Manager/Director

Data Science Workbench
Self-service data science for the enterprise

Our Goal: Bring more data science users to Hadoop
Help more data scientists
use the power of Hadoop
Use a powerful, familiar
environment with direct access to
Hadoop data and compute
Data Scientist
Data Engineer
Make it easy and secure to
add new users, use cases
Offer secure self-service analytics
and a faster path to production on
common, affordable infrastructure
Enterprise Architect
Hadoop Admin

Accelerates data science from
development to production with:
●Secure self-service data access
●On-demand compute
●Support for Python, R, and Scala
●Project dependency isolation for
multiple library versions
●Workflow automation, version
control, collaboration and sharing
Cloudera Data Science Workbench
Self-service data science for the enterprise

A modern data science architecture
CDH CDH
Cloudera Manager
gateway nodes CDH nodes
●Built on Docker and Kubernetes
●Runs on dedicated gateway nodes
●User sessions run in isolated
“engine” containers which:
○Host Kerberos-authenticated
Python/R/Scala runtimes
○Interact with Spark via YARN
client mode (Driver runs in
container, workers on CDH)
●Single-cluster only (for now)
Hive, HDFS, ...
CDSW CDSW
...
Master
...
Engine
EngineEngine
EngineEngine

“Our data scientists want GPUs, but
we can’t find a way to deliver multi-
tenancy.
If they go to the cloud on their own, it’s
expensive and we lose governance.”
●Extend existing CDSW benefits to
GPU-optimized deep learning tools
●Schedule & share GPU resources
●Train on GPUs, deploy on CPUs
●Works on-premises or cloud
Accelerated deep learning on-demand with GPUs
Data Science Workbench
GPUCPU
CDH
CPU
CDH
CPU
single-node
training
distributed
training, scoring
Multi-tenant GPU support on-premises or
cloud

What is the Data Science Workbench?
Workbench
Integrated development environment for
Python, R, and Scala with support for
Spark 2 and connectivity to secured
Hadoop clusters.
Projects
Collaborative hub for enterprise data
science with isolated projects, secure
collaboration, and simple dependency
management.
Jobs
Lightweight job and pipeline system for
data science workloads that supports real-
time monitoring, results tracking, and
email alerting.

Demo

Key Benefits
How is Cloudera Data Science different?
Works with fully secured clusters
One tool for multiple languages (Python, R, Scala)
Multi-tenant Architecture
Common Platform

Thank you
Jason Hubbard
jason.hubbard@cloudera.com

Data Science and CDSW

More Related Content

What's hot

Similar to Data Science and CDSW

Recently uploaded

Data Science and CDSW