This session was recorded in San Francisco on February 5th, 2019 and can be viewed here: https://youtu.be/CgoxjmdyMiU
This session will discuss how to get up and running quickly with containerized H2O environments (H2O Flow, Sparkling Water, and Driverless AI) at scale, in a multi-tenant architecture with a shared pool of resources using CPUs and/or GPUs. See how you can spin up (and tear down) your H2O environments on demand, with just a few mouse clicks. Find out how to enable quota management of GPU resources for greater efficiency, and easily connect your compute to your datasets for large-scale distributed machine learning. Learn how to operationalize your machine learning pipelines and deliver faster time-to-value for your AI initiative — while ensuring enterprise-grade security and high performance.
Bio: Nanda Vijaydev is senior director of solutions at BlueData (now HPE) - where she leverages technologies like Hadoop, Spark, and TensorFlow to build solutions for enterprise analytics and machine learning use cases. Nanda has 10 years of experience in data management and data science. Previously, she worked on data science and big data projects in multiple industries, including healthcare and media; was a principal solutions architect at Silicon Valley Data Science; and served as director of solutions engineering at Karmasphere. Nanda has an in-depth understanding of the data analytics and data management space, particularly in the areas of data integration, ETL, warehousing, reporting, and machine learning.
Nanda Vijaydev, BlueData - Deploying H2O in Large Scale Distributed Environments using Containers - H2O World San Francisco
1. Deploying H2O in Large-Scale Distributed Environments using Containers
Nanda Vijaydev
Lead Data Scientist and Senior Director of Solutions
BlueData (now part of HPE)
www.bluedata.ai @NandaVijaydev @BlueData
#H2OWORLD
2. Hype versus Reality …
Source: https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
3. AI / ML Stack: Tools + Infrastructure
Multiple ML / DL frameworks, notebooks, and other tools
[Diagram: data science notebooks, analytics & BI tools, and pipeline tools serving data scientists, developers, data engineers, and decision makers; role-based access control; tools for the distributed AI / ML / DL pipeline, including model storage and workflow management; data frameworks (RDBMS, HDFS, streams, Spark) providing access to data and storage]
4. Distributed AI / ML / DL – Key Requirements
• Access to valuable data: small, big, or both
• Choices of modeling techniques: each problem is different
• Ability to build on datasets, validate on other datasets, iterate, and improve
• Access to GPUs (and CPUs)
• Scale easily on real datasets
• Ability to operationalize in production
Source: https://rohitnarurkar.wordpress.com/2013/11/02/cuda-matrix-multiplication
5. Distributed AI / ML / DL – Challenges
• Scalability, repeatability, complexity, and reproducibility across environments
• Sharing data, not duplicating data
• Deploying distributed platforms, libraries, applications, and versions
• Efficiently sharing expensive resources like GPUs
• Agility to scale compute resources up and down
• Providing a future-proof solution
• Ensuring compatible NVIDIA device kernel module installation
[Diagram: laptop, on-prem cluster, and off-prem cluster]
7. Docker Containers
Docker is a computer program that performs operating-system-level virtualization, also known as containerization. Containerization allows the existence of multiple isolated user-space instances.
Source: https://en.wikipedia.org/wiki/docker_(software)
8. Container-Based Platform for AI / ML
[Diagram: the BlueData EPIC™ software platform with H2O for AI / ML. Data scientists, developers, data engineers, and data analysts use data science tools, ML/DL & Big Data tools, and bring-your-own BI/analytics tools. ElasticPlane™ provides self-service, multi-tenant containerized environments; IOBoost™ provides extreme performance and enterprise-grade scalability; DataTap™ provides in-place access to data on-prem or in the cloud. Compute (CPUs, GPUs) and storage (NFS, HDFS) can run on-premises or in the public cloud.]
9. Accelerated AI / ML Deployment
With H2O + BlueData, customers now have:
• Pre-built Docker H2O images with CUDA, and automated cluster creation for the entire stack
• The appropriate NVIDIA kernel module surfaced automatically to the containers
• Easy access to the resources required (e.g. single node, single GPU, or multi-node, multi-GPU combinations)
• UI, CLI, and API access (notebooks, web, SSH)
• NFS mounts surfaced as local drives for sharing assets
10. Example of an H2O Pipeline on Containers
[Diagram: H2O Driverless AI imports, validates, and exports over a shared data access layer connected to multiple data sources]
11. Deploy H2O from Pre-Built Images in the BlueData EPIC App Store
• Docker images for multiple applications and versions
• Ability to create and add new images, and save or restore tested combinations on demand
12. Multi-Tenant, with Quotas for GPU Resources
• Support for multi-tenancy, with the ability to define a quota per tenant
• Define ‘flavor’ types used to launch Docker containers
13. Spin Up Multiple Environments
• Quick-launch templates for one-click cluster creation
• Run multiple clusters, with different versions or combinations of tools, side by side
14. On-Demand Cluster Creation
• Pick from a list of pre-built and tested images
• Assign specific resources (GPUs, CPUs) to the cluster, depending on the use case (e.g. for Driverless AI)
• Define the number of nodes, here for H2O and Sparkling Water
15. H2O Driverless AI with GPUs
• 2 Docker containers running different versions of Driverless AI, with 1 GPU (Tesla P100) each
• NVIDIA device kernel module (driver version: 390.46)
• NVIDIA CUDA (9.x) and cuDNN libraries, including:
  1. libcudnn7-7.4.2.24-1.cuda9.0.x86_64.rpm
  2. libcudnn7-devel-7.4.2.24-1.cuda9.0.x86_64.rpm
Source: https://medium.com/linagora-engineering/making-image-classification-simple-with-spark-deep-learning-f654a8b876b8
16. Run Driverless AI on Containers with GPUs
• The user authenticates on Driverless AI
• Import datasets from BlueData DataTap with the DataTap (dtap) connector; access is optimized with BlueData IOBoost
• Analyze the data
• Run experiments
• Build models and save them
• Validate against other datasets from DataTap
• Export the model for production
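The steps above can be sketched with the Driverless AI Python client (`driverlessai`). The server address, credentials, dataset paths, and target column below are placeholders, and the exact method arguments are assumptions based on the client documentation rather than anything shown in this session:

```python
# Sketch of the Driverless AI steps above. Everything is wrapped in a
# helper so nothing connects on import; `dai` is assumed to be a
# connected driverlessai.Client instance.

def run_pipeline(dai):
    # Import a training dataset from a BlueData DataTap path
    train = dai.datasets.create(
        data="dtap://TenantStorage/telco/train.csv",  # placeholder path
        data_source="dtap",
    )
    # Run an experiment to build a model
    exp = dai.experiments.create(
        train_dataset=train,
        target_column="churn",        # placeholder target
        task="classification",
    )
    # Validate against another dataset from DataTap
    holdout = dai.datasets.create(
        data="dtap://TenantStorage/telco/holdout.csv",
        data_source="dtap",
    )
    exp.predict(holdout)
    # Export the model artifacts for production scoring
    exp.artifacts.download(dst_dir=".")
    return exp

# Usage (requires a reachable Driverless AI server):
#   import driverlessai
#   dai = driverlessai.Client(address="https://dai-host:12345",
#                             username="user", password="pass")
#   run_pipeline(dai)
```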
17. Work with a Sparkling Water Cluster and HDFS
• Optionally initialize Sparkling Water against an existing H2O cluster created previously [external backend]
• Pass Sparkling Water the appropriate jar to use for HDFS connectivity
• Work on your dataset using the HDFS connectivity
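As a sketch, the external-backend setup above boils down to a handful of Spark properties. The property names follow the Sparkling Water documentation, but the host, port, and cloud name here are placeholders, and the commented usage is an assumption about the PySparkling API rather than something demonstrated in the session:

```python
# Spark properties for attaching Sparkling Water to an H2O cluster
# that already exists (the "external backend"), instead of starting
# H2O inside the Spark executors. Host/port and cloud name are
# placeholders for the previously created H2O cluster.
external_backend_conf = {
    "spark.ext.h2o.backend.cluster.mode": "external",
    "spark.ext.h2o.external.start.mode": "manual",
    # One node of the existing H2O cloud, and that cloud's name:
    "spark.ext.h2o.cloud.representative": "10.1.0.5:54321",
    "spark.ext.h2o.cloud.name": "h2o-cluster",
}

# Usage (inside a PySpark session, with the matching HDFS client jar
# on the classpath as noted above):
#   from pysparkling import H2OConf, H2OContext
#   conf = H2OConf(spark)
#   for key, value in external_backend_conf.items():
#       conf.set(key, value)
#   hc = H2OContext.getOrCreate(spark, conf)
#   df = spark.read.csv("hdfs://namenode:8020/data/train.csv", header=True)
```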
18. Simplify H2O Deployments at Scale in Minutes
• BlueData EPIC automatically deploys the environments
  • Using persistent containers
  • Providing true multi-tenancy
  • With access to shared resources (CPU, RAM, GPUs, storage)
• Pre-built H2O images in the BlueData EPIC App Store
• Enterprise-grade security (integration with AD / LDAP / TDE)
19. Enable Compute / Storage Separation
BlueData DataTap and BlueData IOBoost connect the clusters to different datasets without copying the data, and with performance optimized
20. Integrate H2O with the Production Environment
From the BlueData EPIC App Store, deploy more application clusters to connect to H2O
21. Lessons Learned – H2O on Containers
• Infrastructure for distributed ML / DL is complex (CPUs, GPUs, data …)
• This complexity can be abstracted away from data science teams with self-service provisioning and automation, using containers
• GPUs can be used effectively by a containerized application, then released for other applications and users
• For a flexible and scalable solution, data resources should be decoupled from compute
• H2O, Driverless AI, and Sparkling Water can be deployed at scale on containers – whether on-premises, on any public cloud, or hybrid
• BlueData + H2O is proven in production with Global 2000 enterprises
Deep learning uses general-purpose learning algorithms that build up the layers of an artificial neural network from training data. Processing this training data requires a great deal of computation, dominated by matrix multiplications.
The #1 challenge in bringing the DevOps mindset to Big Data is scalability, reproducibility, and repeatability.
It’s easy enough for developers to work on their laptops. Data scientists sometimes prototype the entire pipeline on a powerful laptop with a “whatever it takes, make it work” mentality. You can take a single-node VM, install a bunch of libraries, and work on smallish data sets.
But will that same program successfully deploy and work in a real environment that uses multi-node clusters, potentially different versions and libraries, and (more importantly) significantly larger volumes of data? This last aspect is unique to Big Data, and it is one of the single biggest reasons that data teams are unable to iterate rapidly.
ML / DL local:
• Single-node VM
• Local libraries
• Limited data (10s of GB)
• “It works on my laptop”

ML / DL at scale:
• Multi-node environments
• Different versions
• Different environment variables
• Libraries and dependencies must exist on all nodes
• Big Data (TBs of data)
The BlueData EPIC software platform leverages Docker container technology – together with patented innovations – to deliver self-service, speed, security, and efficiency for Big Data Analytics, Data Science, and AI / ML / DL environments.
The key components of the BlueData EPIC (which stands for Elastic Private Instant Clusters) platform are:
• ElasticPlane™ enables users to spin up virtual clusters on-demand in a secure, multi-tenant environment.
• IOBoost™ ensures performance on par with bare-metal, with the agility and simplicity of Docker containers.
• DataTap™ accelerates time-to-value for Big Data by eliminating time-consuming data movement.
Our software platform can be deployed on any infrastructure – whether on-premises, in the public cloud (e.g. AWS, and now Azure and GCP), or in a hybrid architecture.