The data-driven innovations of data scientists can be game-changing. But the siloed efforts and custom-crafted prototypes of individual data scientists can be difficult to scale, reproduce, and share across multiple users.
What works for an ad-hoc model in development may not work in production; what works as a one-off prototype on a laptop might not work as a consistent and repeatable process for large-scale distributed data science. What’s needed is an approach that brings the agility, automation, and collaboration of DevOps to your data science teams.
In this presentation at our Big-Data-as-a-Service (BDaaS) meetup in Santa Clara, Nanda Vijaydev (data scientist and director of solutions management at BlueData) discussed how to bring DevOps agility to large-scale distributed data science operations — with Big-Data-as-a-Service, powered by Docker containers.
Nanda’s presentation covered how to:
• Create on-demand environments with your users’ preferred data science tools (e.g. Spark, Zeppelin, Jupyter, RStudio)
• Deploy environments once and run them anywhere – on-premises or in the public cloud – using Docker
• Improve agility, collaboration, and productivity for your data science and engineering teams
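As a rough illustration of the "deploy once, run anywhere" idea in the bullets above (this is a generic sketch, not BlueData's actual images; the base image and package choices are assumptions), a containerized notebook environment might be defined like this:

```dockerfile
# Hypothetical image for an on-demand data science environment.
# Base image and versions are illustrative only.
FROM python:3.10-slim

# Pin the toolchain so every user gets an identical environment.
RUN pip install --no-cache-dir jupyterlab pyspark

# Notebooks live under a mounted volume so work survives the container.
WORKDIR /home/work
EXPOSE 8888
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```

The same image runs unchanged on a laptop, an on-premises host, or a cloud VM, which is exactly the portability property the list describes.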
More info on this BDaaS meetup can be found at: https://www.meetup.com/Big-Data-as-a-Service/events/239936007
9. Evolution of Data Science Operations
[Diagram] Traditional Data Science & Analytics: Sampling → Modeling & Tuning → Reports (e.g. credit card offer)
Distributed Data Science & Analytics: Distributed Systems → Acquire Data → Model → Tune → Deploy
10. What Often Happens …
Faulty Assumptions
• IT team thinks they understand requirements and use cases
• Assumes infrastructure & systems will work for most use cases
• Assumes all data scientists will use similar toolsets
• Build the infrastructure first, then onboard the data scientists …
11. A More Realistic Journey
Onboarding data scientists, with continuous infrastructure provisioning:
• Established use cases: Base-R, SQL, Python, Java
• Need to analyze higher data volumes: SparkR, PySpark, spark-sql, H2O, Zeppelin
• Additional modules for Python users: NumPy, SciPy, NLTK, JupyterHub with Spark
• R user base is adopting more Big Data: RStudio, Shiny Server with Spark + H2O
Use cases, requirements, & tools will continue to evolve over time
12. DevOps for Data Science Operations
Source: Rob Nendorf, Allstate, “DevOps for Data Science”
13. What Data Science Teams Need:
• Access to data with full fidelity
• Ability to quickly iterate & validate findings
• Access to necessary tools and models
• Ability to scale environments on-demand
• Ability to share models and code
• Ability to deploy and integrate the solution
14. Data Science – Usage Scenarios
1. End-to-end analysis on local laptops / workstations using RStudio, Jupyter
2. Preprocess on Hadoop/Spark, download and analyze locally using RStudio/Jupyter
3. Preprocess and analyze on Hadoop/Spark
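As a toy illustration of scenario 2 (the data and column names here are invented; in practice the extract would be pulled from HDFS and analyzed in RStudio or Jupyter), the local-analysis step might look like:

```python
import csv
import io
import statistics

# Pretend this small CSV is an aggregate downloaded from Hadoop/Spark.
extract = io.StringIO(
    "region,spend\n"
    "west,120.0\n"
    "west,80.0\n"
    "east,200.0\n"
)

# Local, in-memory analysis of the preprocessed extract.
rows = list(csv.DictReader(extract))
spend = [float(r["spend"]) for r in rows]
print(statistics.mean(spend))  # mean spend across the extract
```

The heavy lifting (joins, filtering, partitioning) stays on the cluster; only the small aggregate travels to the laptop.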
16. Accessing Aggregate Data from HDFS
• Preprocessed or partitioned data can be stored in HDFS
• Can be accessed directly from RStudio/Jupyter using the RHadoop client for aggregation/modeling
22. Environments: Scale Up vs. Scale Out
[Diagram, layered as UI / Notebooks, Frameworks, Compute, Data]
Scale Up: R packages, Python, SQL; Spark (SQL, Scala, Python, Java, MLlib) + H2O; local compute (laptops)
Scale Out: RStudio + R + Spark + sparklyr; Jupyter + Python + Spark; Zeppelin + R + Python + Spark; Spark (SQL, SparkR, Scala, Java, MLlib) + H2O; many distributed Spark nodes
23. Scalable Data Science: Challenges
• How do you keep up with the constant evolution of new versions and tools?
– The data science ecosystem is evolving very quickly (e.g. the rapid pace of new Spark versions)
– Related tools (e.g. RStudio, Jupyter, Zeppelin) have to keep pace to support new features
– New versions of Spark and other tools require different versions of libraries and packages
• One monolithic cluster won’t cut it … how do you support the variations?
– Different use cases & users need different options, versions, packages
– Workloads change … adding new packages or scaling clusters up and down is cumbersome
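A common generic remedy for this version matrix (not specific to BlueData; the image names and versions below are invented) is to pin each use case's stack in its own container image, so teams on different Spark versions can coexist on the same hosts:

```dockerfile
# Hypothetical per-team image: one team stays on an older PySpark,
# another moves ahead, and both run side by side as containers.
FROM python:3.10-slim
ARG PYSPARK_VERSION=3.4.1
RUN pip install --no-cache-dir "pyspark==${PYSPARK_VERSION}" jupyterlab

# Build one image per required version:
#   docker build --build-arg PYSPARK_VERSION=3.5.1 -t ds-spark35 .
#   docker build --build-arg PYSPARK_VERSION=3.4.1 -t ds-spark34 .
```

Each environment carries its own libraries and packages, so upgrading one team never breaks another.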
24. Scalable Data Science: Challenges
• How do you make it easy for your data scientists to get what they need?
– Data scientists are comfortable with their desktop tools, not distributed computing
– They need on-demand environments with instant access to their preferred tools and data
• How do you manage user access for IDEs / notebooks and data sources?
– Given the different layers of the stack, this can be complex and challenging for enterprises
• And more … repeatability, elasticity, scalability, security, performance ...
34. Distributed Data Science: Takeaways
• Operationalizing distributed data science is hard work
– Unique requirements for access to data, models, tools, etc.
• Need to bring a DevOps approach to data science operations
– Support for fast, iterative prototyping and reproducibility
– Requires ultimate flexibility as tools evolve and new options emerge
• Leverage a turnkey purpose-built platform (e.g. BlueData EPIC)
– Bring DevOps agility to distributed data science, powered by Docker
– Provide ability to share code, models, & data with secure multi-tenancy
– Enable on-demand environments with a choice of data science tools
35. Thank You
TRY BLUEDATA EPIC ON AWS
For more information:
www.bluedata.com
sales@bluedata.com
www.bluedata.com/aws
Nanda Vijaydev, Director of Solutions, BlueData
Nanda has more than 10 years of experience in Data Management and Data Science. At BlueData, Nanda works with Hadoop, Spark, and related technologies to build software solutions for Big Data analytics use cases. She has worked on multiple Data Science and Big Data projects for large enterprises in the healthcare, media, telecommunications, and other industries.
Prior to BlueData, she was a principal solutions architect at Silicon Valley Data Science and director of solutions engineering at Karmasphere. She has an in-depth understanding of Data Management tools including data integration, ETL, data warehousing, reporting, Hadoop, and Spark.
The role of data has changed. Once an asset critical to monitoring and managing business operations (think the world of BI and reports), it is now a source of competitive advantage and a strategic enterprise asset that will drive new products and services for all organizations.
This change, combined with Big Data technologies and the ability to use more types of data to refine analytics, means the traditional waterfall model has quickly evolved into an iterative, closed-loop cycle with an emphasis on continuous improvement. The faster organizations can bring data and teams together and iterate, the more likely they are to gain competitive advantage and create disproportionate value.
So how does one achieve this analytics agility and velocity?
If you want to train a statistical model on very large amounts of data, you'll need three things: a storage platform capable of holding all of the training data, a computational platform capable of efficiently performing the heavy-duty mathematical computations required, and a statistical computing language with algorithms that can take advantage of the storage and computation power.
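As a tiny, stdlib-only sketch of that third ingredient (in practice this would be MLlib or H2O running over HDFS; everything here is illustrative), here is a statistic computed one chunk at a time, so the full training set never has to fit in memory at once:

```python
# Running mean over chunked data: the statistic is updated chunk by
# chunk, which is the property that lets an algorithm exploit
# distributed storage and compute.
def chunked_mean(chunks):
    total, count = 0.0, 0
    for chunk in chunks:          # each chunk might be one HDFS block
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Simulate a dataset that arrives as chunks from distributed storage.
data_chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(chunked_mean(data_chunks))  # 3.0
```

Real distributed algorithms follow the same shape: partial results per partition, combined into a global answer.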
Waterfall/Slow(er) → Iterative/Ongoing
Small(ish) Data → Big/Fast Data
Single Server → Multiple Servers
Static Results → Results are ‘Big’
Assumes there is an inflection point after which the need for the previous infrastructure goes away
In reality, user requirements are ongoing and the infrastructure need is continuous
Data Science is not a one-time process
Build and evaluate in a sandbox, then evaluate and deploy at scale
Minimize recoding of the models – it’s not sustainable for continuous evaluation
Run environments should mimic build environments at a larger scale
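One way to read "minimize recoding" (a generic sketch, not BlueData's API; the names here are invented) is to write the model logic once against an interface, so the sandbox run and the at-scale run differ only in the engine passed in:

```python
from typing import Callable, Iterable, List

# The model logic is written once, against a generic "map" engine.
def score(records: Iterable[float], mapper: Callable) -> List[float]:
    # mapper can be the builtin map (sandbox on a laptop) or a
    # cluster-backed parallel map with the same signature
    # (deployment); the model code itself does not change.
    return list(mapper(lambda x: x * 2.0, records))

# Sandbox run: plain builtin map.
print(score([1.0, 2.0], map))  # [2.0, 4.0]
```

At scale you would pass an engine with the same `(func, iterable)` signature backed by distributed workers, which is how the run environment can mimic the build environment without rewriting the model.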
RStudio cluster provisioned on Spark 2.1 in BlueData
Our vision for BlueData EPIC is to provide a single software platform for Big-Data-as-a-Service that supports both on-prem and cloud deployments for Big Data.
Rapid provisioning of data science environments at scale (Cloud or On-Prem)
Sharing of data, code, and results (this is a big deal!)
Easily customize environments
Reproducibility (environments, results)
Leverage a turnkey DevOps platform for Data Science (e.g. BlueData EPIC)