The data-driven innovations of data scientists can be game-changing. But the siloed efforts and custom-crafted prototypes of individual data scientists can be difficult to scale, reproduce, and share across multiple users.
What works for an ad-hoc model in development may not work in production; what works as a one-off prototype on a laptop might not work as a consistent and repeatable process for large-scale distributed data science. What’s needed is an approach that brings the agility, automation, and collaboration of DevOps to your data science teams.
In this presentation at our Big-Data-as-a-Service (BDaaS) meetup in Santa Clara, Nanda Vijaydev (data scientist and director of solutions management at BlueData) discussed how to bring DevOps agility to large-scale distributed data science operations — with Big-Data-as-a-Service, powered by Docker containers.
Nanda’s presentation covered how to:
• Create on-demand environments with your users’ preferred data science tools (e.g. Spark, Zeppelin, Jupyter, RStudio)
• Deploy environments once and run them anywhere – on-premises or in the public cloud – using Docker
• Improve agility, collaboration, and productivity for your data science and engineering teams
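As a rough illustration of the "deploy once, run anywhere" idea in the bullets above (this is a generic sketch, not BlueData's actual images; the base image and package choices are assumptions), a containerized notebook environment might be defined like this:

```dockerfile
# Hypothetical image for an on-demand data science environment.
# Base image and versions are illustrative only.
FROM python:3.10-slim

# Pin the toolchain so every user gets an identical environment.
RUN pip install --no-cache-dir jupyterlab pyspark

# Notebooks live under a mounted volume so work survives the container.
WORKDIR /home/work
EXPOSE 8888
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```

The same image runs unchanged on a laptop, an on-premises host, or a cloud VM, which is exactly the portability property the list describes.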
More info on this BDaaS meetup can be found at: https://www.meetup.com/Big-Data-as-a-Service/events/239936007
9. Evolution of Data Science Operations
[Diagram] Traditional Data Science & Analytics: Sampling → Modeling & Tuning → Reports (e.g. credit card offer)
Distributed Data Science & Analytics: Distributed Systems → Acquire Data → Model → Tune → Deploy
10. What Often Happens …
Faulty Assumptions
• IT team thinks they understand requirements and use cases
• Assumes infrastructure & systems will work for most use cases
• Assumes all data scientists will use similar toolsets
• Build the infrastructure first, then onboard the data scientists …
11. A More Realistic Journey
Onboarding data scientists, with continuous infrastructure provisioning:
• Established use cases: Base-R, SQL, Python, Java
• Need to analyze higher data volumes: SparkR, PySpark, spark-sql, H2O, Zeppelin
• Additional modules for Python users: NumPy, SciPy, NLTK, JupyterHub with Spark
• R user base is adopting more Big Data: RStudio, Shiny Server with Spark + H2O
Use cases, requirements, & tools will continue to evolve over time
12. DevOps for Data Science Operations
Source: Rob Nendorf, Allstate, “DevOps for Data Science”
13. What Data Science Teams Need:
• Access to data with full fidelity
• Ability to quickly iterate & validate findings
• Access to necessary tools and models
• Ability to scale environments on-demand
• Ability to share models and code
• Ability to deploy and integrate the solution
14. Data Science – Usage Scenarios
1. End-to-end analysis on local laptops / workstations using RStudio, Jupyter
2. Preprocess on Hadoop/Spark, download and analyze locally using RStudio/Jupyter
3. Preprocess and analyze on Hadoop/Spark
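As a toy illustration of scenario 2 (the data and column names here are invented; in practice the extract would be pulled from HDFS and analyzed in RStudio or Jupyter), the local-analysis step might look like:

```python
import csv
import io
import statistics

# Pretend this small CSV is an aggregate downloaded from Hadoop/Spark.
extract = io.StringIO(
    "region,spend\n"
    "west,120.0\n"
    "west,80.0\n"
    "east,200.0\n"
)

# Local, in-memory analysis of the preprocessed extract.
rows = list(csv.DictReader(extract))
spend = [float(r["spend"]) for r in rows]
print(statistics.mean(spend))  # mean spend across the extract
```

The heavy lifting (joins, filtering, partitioning) stays on the cluster; only the small aggregate travels to the laptop.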
16. Accessing Aggregate Data from HDFS
• Preprocessed or partitioned data can be stored in HDFS
• Can be accessed directly from RStudio/Jupyter using the RHadoop client for aggregation/modeling
22. Environments: Scale Up vs. Scale Out
[Diagram, layered as UI / Notebooks, Frameworks, Compute, Data]
Scale Up: R packages, Python, SQL; Spark (SQL, Scala, Python, Java, MLlib) + H2O; local compute (laptops)
Scale Out: RStudio + R + Spark + sparklyr; Jupyter + Python + Spark; Zeppelin + R + Python + Spark; Spark (SQL, SparkR, Scala, Java, MLlib) + H2O; many distributed Spark nodes
23. Scalable Data Science: Challenges
• How do you keep up with the constant evolution of new versions and tools?
– The data science ecosystem is evolving very quickly (e.g. the rapid pace of new Spark versions)
– Related tools (e.g. RStudio, Jupyter, Zeppelin) have to keep pace to support new features
– New versions of Spark and other tools require different versions of libraries and packages
• One monolithic cluster won’t cut it … how do you support the variations?
– Different use cases & users need different options, versions, packages
– Workloads change … adding new packages or scaling clusters up and down is cumbersome
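A common generic remedy for this version matrix (not specific to BlueData; the image names and versions below are invented) is to pin each use case's stack in its own container image, so teams on different Spark versions can coexist on the same hosts:

```dockerfile
# Hypothetical per-team image: one team stays on an older PySpark,
# another moves ahead, and both run side by side as containers.
FROM python:3.10-slim
ARG PYSPARK_VERSION=3.4.1
RUN pip install --no-cache-dir "pyspark==${PYSPARK_VERSION}" jupyterlab

# Build one image per required version:
#   docker build --build-arg PYSPARK_VERSION=3.5.1 -t ds-spark35 .
#   docker build --build-arg PYSPARK_VERSION=3.4.1 -t ds-spark34 .
```

Each environment carries its own libraries and packages, so upgrading one team never breaks another.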
24. Scalable Data Science: Challenges
• How do you make it easy for your data scientists to get what they need?
– Data scientists are comfortable with their desktop tools, not distributed computing
– They need on-demand environments with instant access to their preferred tools and data
• How do you manage user access for IDEs / notebooks and data sources?
– Given the different layers of the stack, this can be complex and challenging for enterprises
• And more … repeatability, elasticity, scalability, security, performance ...
34. Distributed Data Science: Takeaways
• Operationalizing distributed data science is hard work
– Unique requirements for access to data, models, tools, etc.
• Need to bring a DevOps approach to data science operations
– Support for fast, iterative prototyping and reproducibility
– Requires ultimate flexibility as tools evolve and new options emerge
• Leverage a turnkey purpose-built platform (e.g. BlueData EPIC)
– Bring DevOps agility to distributed data science, powered by Docker
– Provide ability to share code, models, & data with secure multi-tenancy
– Enable on-demand environments with a choice of data science tools
35. Thank You
TRY BLUEDATA EPIC ON AWS
For more information:
www.bluedata.com
sales@bluedata.com
www.bluedata.com/aws
Nanda Vijaydev, Director of Solutions, BlueData
Nanda has more than 10 years of experience in Data Management and Data Science. At BlueData, Nanda works with Hadoop, Spark, and related technologies to build software solutions for Big Data analytics use cases. She has worked on multiple Data Science and Big Data projects for large enterprises in the healthcare, media, telecommunications, and other industries.
Prior to BlueData, she was a principal solutions architect at Silicon Valley Data Science and director of solutions engineering at Karmasphere. She has an in-depth understanding of Data Management tools including data integration, ETL, data warehousing, reporting, Hadoop, and Spark.
The role of data has changed. Once an asset critical to monitoring and managing business operations (think the world of BI and reports), it is now a source of competitive advantage and a strategic enterprise asset that will drive new products and services for all organizations.
This change, combined with Big Data technologies and the ability to use more types of data to refine analytics, means the traditional waterfall model has quickly evolved into an iterative, closed-loop cycle with an emphasis on continuous improvement. The faster organizations can bring data and teams together and iterate, the more likely they are to gain competitive advantage and create disproportionate value.
So how does one achieve this analytics agility and velocity?
If you want to train a statistical model on very large amounts of data, you'll need three things: a storage platform capable of holding all of the training data, a computational platform capable of efficiently performing the heavy-duty mathematical computations required, and a statistical computing language with algorithms that can take advantage of the storage and computation power.
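As a tiny, stdlib-only sketch of that third ingredient (in practice this would be MLlib or H2O running over HDFS; everything here is illustrative), here is a statistic computed one chunk at a time, so the full training set never has to fit in memory at once:

```python
# Running mean over chunked data: the statistic is updated chunk by
# chunk, which is the property that lets an algorithm exploit
# distributed storage and compute.
def chunked_mean(chunks):
    total, count = 0.0, 0
    for chunk in chunks:          # each chunk might be one HDFS block
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Simulate a dataset that arrives as chunks from distributed storage.
data_chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(chunked_mean(data_chunks))  # 3.0
```

Real distributed algorithms follow the same shape: partial results per partition, combined into a global answer.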
Waterfall/Slow(er) → Iterative/Ongoing
Small(ish) Data → Big/Fast Data
Single Server → Multiple Servers
Static Results → Results are ‘Big’
Assumes there is an inflection point after which the need for the previous infrastructure goes away
In reality, user requirements are ongoing and the infrastructure need is continuous
Data Science is not a one-time process
Build and evaluate in a sandbox, then evaluate and deploy at scale
Minimize recoding of the models – it’s not sustainable for continuous evaluation
Run environments should mimic build environments at a larger scale
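One way to read "minimize recoding" (a generic sketch, not BlueData's API; the names here are invented) is to write the model logic once against an interface, so the sandbox run and the at-scale run differ only in the engine passed in:

```python
from typing import Callable, Iterable, List

# The model logic is written once, against a generic "map" engine.
def score(records: Iterable[float], mapper: Callable) -> List[float]:
    # mapper can be the builtin map (sandbox on a laptop) or a
    # cluster-backed parallel map with the same signature
    # (deployment); the model code itself does not change.
    return list(mapper(lambda x: x * 2.0, records))

# Sandbox run: plain builtin map.
print(score([1.0, 2.0], map))  # [2.0, 4.0]
```

At scale you would pass an engine with the same `(func, iterable)` signature backed by distributed workers, which is how the run environment can mimic the build environment without rewriting the model.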
RStudio cluster provisioned on Spark 2.1 in BlueData
Our vision for BlueData EPIC is to provide a single software platform for Big-Data-as-a-Service that supports both on-prem and cloud deployments for Big Data.
Rapid provisioning of data science environments at scale (Cloud or On-Prem)
Sharing of data, code, and results (this is a big deal!)
Easily customize environments
Reproducibility (environments, results)
Leverage a turnkey DevOps platform for Data Science (e.g. BlueData EPIC)