IBM Spectrum Conductor can manage H2O Driverless AI instances at scale across multiple nodes in an enterprise data center. Key benefits include the ability to run multiple Driverless AI instances on the same host using GPUs, failover capabilities if an instance fails, and role-based access control for users. The integration improves productivity by providing a shared file system, workload management, and allowing easy start/stop of Driverless AI instances.
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI World London 2018
1. Scaling out Driverless AI in Enterprise Data Centers with IBM Spectrum Conductor
Kevin Doyle
Lead Architect IBM Spectrum Conductor
IBM
LinkedIn: https://www.linkedin.com/in/kevin-doyle-675a4031/
2. Benefits of managing H2O with IBM Spectrum Conductor
• H2O Driverless AI can scale across compute nodes for multiple instances, with each instance allocated to one host
• In a future IBM Spectrum Conductor release, the integration improves at the GPU level: you will be able to run multiple Driverless AI instances on the same host, with each instance allocated to an assigned GPU
• Shared file system for data and logs
• Failover to another host if Driverless AI goes down: IBM Spectrum Conductor starts it up on another host (if resources are available)
• Easily start and stop H2O Driverless AI and maintain instances for each user or group of users through role-based access control (RBAC) and consumer association, along with all other workloads in one shared compute cluster
• H2O Driverless AI and IBM POWER9 GPU systems bring together best-of-breed AI innovation. To handle the increasingly complex workloads of AI you need an integrated system of software and hardware:
• POWER9 supports nearly 2.6x more RAM and 9.5x more I/O bandwidth than comparable systems
• Nearly 2x the data ingest speed and over 50% faster feature engineering
• GPU-accelerated machine learning delivers nearly 30x speedup on model building
• Support for up to 6 V100 GPUs in a single system
3. What is IBM® Spectrum Conductor?
• IBM Spectrum Conductor lets you confidently deploy modern computing frameworks and services in a multitenant enterprise environment, both on-premises and in the cloud
• Provides multitenancy through application instances and Spark instance groups. You can deploy modern computing frameworks and services, such as Spark, Anaconda, Driverless AI, and H2O Sparkling Water, efficiently and effectively, supporting multiple versions and instances of each framework and service
• Increases performance and scale through granular and dynamic resource allocation for application instances and Spark instance groups that share a resource pool
• Maximizes usage of resources and eliminates silos of resources that would otherwise each be tied to separate application implementations
• Provides flexible and efficient data management for shared storage and high availability by connecting to existing storage infrastructure, such as NFS mounts to a file system or IBM Spectrum Scale™
4. Traditional vs Conductor Management
[Diagram: a virtualized view of compute, network, and storage resources]
• Application examples: simulation, analysis, design, big data
• IT constrained: long wait times, low utilization, data access bottlenecks, IT sprawl
• Traditional: distinct resources for compute and storage, repeated for many apps and groups
• IBM Spectrum Conductor (IBM Software Defined Infrastructure): makes multiple computers look like one, with prioritized matching of supply and demand; converged compute and storage; long-running services; shared by big data, simulation and modeling, and analytics workloads
• Benefits: high utilization, throughput, performance, prioritization, reduced cost; faster results with fewer resources
5. IBM Systems
Shared Services Model for Spark, Machine Learning, and Deep Learning
• Physical view: IBM Spectrum Conductor installed on each Linux server
• Logical view: Users (groups) have their own Spark cluster (optional) that is isolated, protected, and
secured by Spark instance groups or application instances – Managed by SLA
[Diagram: an administrator and multiple users (LOB, marketing, fraud detection, data scientists, researchers) each with their own instance (#1–#4, including Driverless AI) on a shared cluster of Linux compute nodes (x86 systems); Spectrum Conductor data connectors link to Spectrum Scale and Cloud Object Storage (COS)]
6. IBM Spectrum Conductor
The most complete enterprise-grade solution for Data Science
• Anaconda Distributions
The solution supports multiple distributions of Anaconda running concurrently.
Users can add/remove Conda packages.
• Notebooks Integration
Out-of-the-box notebooks available: Jupyter, Zeppelin, RStudio, H2O
Sparkling Water. Other notebooks and distributed frameworks can be quickly
integrated.
• Spark Distributions
The solution supports multiple versions of Spark running concurrently.
• Workload Management / Scheduling
A proven workload scheduling engine that enhances the Spark master
scheduling logic to enable multi-tenancy.
• Services Management
Management of other long running application services on the same grid.
Spark applications commonly have dependencies on other services that can
now be managed as a single solution.
• Resource Management & Orchestration
Proven architecture at scale. Resources are dynamically allocated to Spark
workload with fine grain sharing across applications.
• IBM services and support
A single point of contact for your services and support needs.
[Diagram: monitoring & reporting, workload management / scheduling, resource management & orchestration, services management, notebooks, and services and support, layered on Red Hat Linux / x86 hardware]
9. PowerAI Enterprise ML/DL - Data Science Stack
[Stack diagram: open source frameworks distribution, delivered as IBM PowerAI Enterprise]
• Data science apps and ML/DL UI and flow: IBM Spectrum Conductor Deep Learning Impact, PowerAI Vision, Watson Studio
• DL frameworks: TensorFlow, Caffe, PyTorch, Chainer, plus Distributed Deep Learning (DDL) and Elastic Distributed Training (EDT)
• ML libraries: MLlib, GraphX, scikit-learn, R, xgboost
• Value-add tools: data prep / parallel training / model tuning / model evaluation / inference services…
• Runtimes, resource and workload managers: IBM Spectrum Conductor (GPU support / distributed / BYOF / session scheduler / MPI / containers…), Anaconda, Python, Spark
• Data layer: IBM Spectrum Scale, IBM Cloud Object Store
10. Key concepts of IBM Spectrum Conductor
• Application instances
• Customizable feature to support running any long-running service within the cluster
• Application templates (yaml) are created to define the processes (services) that you
want to run in the cluster
• Driverless AI integration is done through application instances
• Spark instance groups
• A Spark instance group is an installation of Apache Spark that can run Spark core services (master, shuffle, and history), Anaconda distribution instances, and notebooks as configured
• You can create and run multiple Spark instance groups, associating each instance
group with different Spark/Anaconda/notebook version packages as required
• H2O Sparkling Water integration is treated as a notebook within your Spark instance
groups
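
The slide above notes that application templates are YAML files defining the services to run. A minimal sketch of what such a template could look like follows; every field name here is illustrative, not the actual IBM Spectrum Conductor template schema, so consult the product documentation for the real format:

```yaml
# Hypothetical application template for one user's Driverless AI service.
# All keys below are made up for illustration; the real Spectrum Conductor
# schema differs.
name: driverless-ai-alice
consumer: /marketing/data-science     # consumer association for RBAC
services:
  - name: dai-server
    command: ./run_dai.sh
    resources:
      slots: 1
      gpus: all      # current integration takes the whole GPU host
    env:
      DAI_PORT: "12345"   # configuration passed as environment variables
```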
11. Key concepts of IBM Spectrum Conductor (continued)
• Resource groups
• Provide a simple way of organizing and grouping resources (hosts)
• Define how to divide the hosts in the group into slots
• Slots are used to decide whether a host is available for new workload
• Consumers
• A way to map organizations/teams to resources they are allowed to use
• Resource planning uses consumers to determine advanced policies for when
to borrow/lend resources to other consumers
• Resource groups map to consumers, so users adding application
instances or Spark instance groups can use only those resource groups
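
The slot model above can be illustrated with a toy placement loop. This is a sketch only; the real scheduler also accounts for resource plans, consumer ownership, and borrow/lend policies:

```python
# Toy model of slot-based placement (illustrative; not the actual
# IBM Spectrum Conductor scheduler). A resource group divides its hosts
# into slots; a host can accept new workload only while it has free slots.

class Host:
    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots  # slot count defined by the resource group

def place_workload(hosts, needed_slots=1):
    """Return the name of the first host with enough free slots, or None."""
    for host in hosts:
        if host.free_slots >= needed_slots:
            host.free_slots -= needed_slots
            return host.name
    return None

hosts = [Host("nodeA", slots=2), Host("nodeB", slots=4)]
print(place_workload(hosts))                  # nodeA
print(place_workload(hosts, needed_slots=2))  # nodeB (nodeA has 1 slot left)
```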
12. Role-based access control
• Permissions are assigned to roles
• Roles are assigned to users
• Most permissions are based on a consumer
• Users will have the permissions/role assigned but only for the consumers they
have access to
• Lets you restrict users to access and control only what they should
• Example: each user sees only their own Driverless AI instances
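
The consumer-scoped permission model described above can be sketched in a few lines. The role and permission names below are made up for illustration, not the product's actual roles:

```python
# Sketch of consumer-scoped RBAC: permissions attach to roles, roles are
# granted to users per consumer, so a user can act only within the
# consumers they have access to. Names are illustrative.

ROLE_PERMISSIONS = {
    "dai_user": {"start_instance", "view_instance"},
    "admin": {"start_instance", "view_instance", "stop_any"},
}

# user -> {consumer: role granted for that consumer}
USER_GRANTS = {
    "alice": {"/marketing": "dai_user"},
    "bob": {"/fraud": "dai_user", "/marketing": "admin"},
}

def allowed(user, consumer, permission):
    """True only if the user holds a role with this permission on this consumer."""
    role = USER_GRANTS.get(user, {}).get(consumer)
    return role is not None and permission in ROLE_PERMISSIONS[role]

print(allowed("alice", "/marketing", "view_instance"))  # True
print(allowed("alice", "/fraud", "view_instance"))      # False
```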
13. How does the integration work?
• H2O Driverless AI is launched on a single host
• The host can have either GPUs or just run with CPUs
• If using GPUs the entire host is taken (with current integration)
• An application instance is created for each user of Driverless AI
• Maintains security for the data this user has access to
• Environment variables through parameters are used to configure Driverless AI
• H2O Sparkling Water runs as a notebook in a Spark instance group
• When the notebook is started up it forms a mini cluster of executors
• These executors stay alive for the entire duration of the notebook
• IBM Spectrum Conductor disables preemption so that these hosts are not reclaimed
• Multiple users can share a Sparkling Water notebook instance or have
dedicated ones per user
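
As a sketch of the environment-variable configuration mentioned above, a launcher could assemble overrides for the Driverless AI process like this. The variable names are assumptions for illustration; check the Driverless AI documentation for the real option names:

```python
# Sketch of configuring a Driverless AI process through environment
# variables, as the integration does via application instance parameters.
# The variable names below are assumed, not verified against the docs.
import os
import subprocess

def build_dai_env(port, data_dir, extra=None):
    """Build environment-variable overrides for one Driverless AI instance."""
    env = {
        "DRIVERLESS_AI_PORT": str(port),           # assumed option name
        "DRIVERLESS_AI_DATA_DIRECTORY": data_dir,  # assumed option name
    }
    env.update(extra or {})
    return env

def launch_dai(command, port, data_dir):
    """Merge overrides into the inherited environment and start the process."""
    env = dict(os.environ)
    env.update(build_dai_env(port, data_dir))
    # The application instance's service definition would run a command
    # like this on the host the scheduler selected:
    return subprocess.Popen(command, env=env)
```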
14. Current Integration
[Architecture diagram: Resource, Cluster, and Service Management (K8s/EGO) provides multi-tenancy, security, GPU and container acceleration, data connectors, and report/log management; on top, a batch scheduler and session schedulers drive Spark instance groups (#1, #2, #5, …), notebooks (Spark, ELK, Python), application instances (marketing, fraud), Elastic Distributed Training (EDT), and other apps]
16. Future Plans (short term)
• Log retrieval from IBM Spectrum Conductor web UI
• Ability to deploy Driverless AI with IBM Spectrum Conductor instead
of installing on all systems (new application template)
• Ability to modify application instance outputs more effectively
• Enhance job monitor to check when Driverless AI is up
17. Future Plans (longer term)
• Improved port management
• Today you can specify the ports to use; however, you don't know whether they are already in use on existing hosts
• The ports might work at first but fail later if something else starts using them
• Improve handling of running Driverless AI with a subset of GPUs on
hosts in the cluster
• Integrate Driverless AI authentication with IBM Spectrum Conductor
authentication/authorization for easier setup
• Look at supporting Driverless AI to run across multiple machines
• Investigate the best approaches to connect to data sources
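
The port problem described above can be probed with a simple bind test. This is a sketch, and it is inherently racy: another process can grab the port between the check and the real bind, which is why proper port management needs scheduler support:

```python
# Probe whether a TCP port is currently free on this host by trying to
# bind it. A successful bind (then release) suggests the port is free;
# the result can be stale by the time the real service binds.
import socket

def port_free(port, host=""):
    """Return True if we could bind the TCP port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```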
18. Long term architecture vision for Driverless AI
integrated with IBM Spectrum Conductor
[Diagram: a cluster of Linux hosts managed by IBM Spectrum Conductor]
(1) The batch scheduler starts H2O Driverless AI
(2) The session scheduler finds a host to run Driverless AI
(3) Driverless AI runs a workload (training, experiment, etc.)
(4) The session scheduler finds hosts to run the workload on to speed up execution
19. It’s available now
• Contact Richard Shedrick ( rshedrick@us.ibm.com ) to get access to
the integration and learn more
• Future announcements and contact points on the integration at:
• IBM Spectrum Conductor Blog:
http://ibm.biz/ConductorBlogs
• IBM Spectrum Conductor’s Slack channel:
http://ibm.biz/ConductorSlack
20. Top Reasons to Choose PowerAI Enterprise
• Simplicity: an integrated platform that just works; curates, tests, and supports fast-moving open source; provides an enterprise distribution on Red Hat; an easy-to-deploy enterprise AI platform
• Ease of use and unique capabilities: faster model training time; large data and model support due to NVLink; acceleration of analytics and ML; AutoML (PowerAI Vision); elastic training to scale GPUs as required
• Performance: faster training times in a single server; scalability to hundreds of servers (cluster-level integration); leads to faster insights and better economics
• Ecosystem: a platform that partners can build on; software partners (H2O, IBM, Anaconda); SIs, solution vendors, and accelerator partners; an open AI platform with ecosystem partners
[Stack diagram: POWER9 CPU and GPU → PowerAI → IBM SW and ISV SW → solutions and SIs]