Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop

Enabling a hardware accelerated deep
learning data science experience for
Apache Spark and Hadoop
Indrajit (I.P) Poddar
Senior Technical Staff Member
IBM Cloud and Cognitive Systems
June 2018

Safe Harbor Statement and Disclaimer
• Copyright IBM Corporation 2018. All rights reserved. U.S. Government Users Restricted Rights - use, duplication, or disclosure restricted
by GSA ADP Schedule Contract with IBM Corporation.
• IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United
States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information
was published. Such trademarks make also be registered or common law trademarks in other countries. A current list of IBM trademarks
is available on the Web at “Copyright and trademark information at : ibm.com/legal/copytrade/shtml.
• The information contained in this presentation is provided for informational purpose only. While efforts were made to verify the
completeness and accuracy of the information contained in this presentation, it is provided “as is” without warranty of any kind, expressed
or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other
documentation.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material,
code or functionality. Information about potential future products may not be incorporated into any contract. Nothing contained in this
presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM (or its suppliers or licensors),
or altering the terms and conditions of any agreement or license governing the use of IBM products and/or software.
• Any statements of performance are based on measurements and projections using standard IBM benchmarks in a controlled environment.
The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such
as the amount of multi-programming in the user’s job stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated.
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making
a purchasing decision.”

AI, Deep Learning, Machine Learning
02
Data Science Experience
03
Hardware Acceleration
04
Demo
Agenda 01

4
Artificial
Intelligence
Mimic Humans
Machine
Learning
Learn with
Experience
Deep Learning
(Neural Networks)
Self-Learn with
More Data

Deep Learning Has
Revolutionized Machine Learning
5
Data
Accuracy
Deep
Learning
Traditional
Machine
Learning
100
80
60
40
20
0
Deep Learning Popularity
Growing Exponentially
Source: Google Trends. Search term “Deep Learning”
2011 2012 2013 2014 2015 2016 2017

6
Machine Learning
Deep Learning
Input
Deep Neural Network
OutputFeature Extraction
& Classification
Input Feature
Extraction
Features Classification Output
Machine Learning
Algorithms

7
Transform & Prep
Data (ETL)
AI Infrastructure Stack
Applications
Cognitive APIs
(Eg: Watson)
In-House APIs
Machine & Deep Learning
Libraries & Frameworks
Distributed Computing
Data Lake & Data Stores
Segment Specific:
Finance, Retail, Healthcare
Speech, Vision,
NLP, Sentiment
TensorFlow, Caffe,
SparkML
Kubernetes, Spark, MPI
Hadoop HDFS,
NoSQL DBs
Accelerated
Infrastructure
Accelerated Servers Storage

8
Open Source Frameworks:
Supported Distribution
Developer Ease-of-Use Tools
Faster Training Times via
HW & SW Performance Optimizations
Integrated & Supported AI Platform
Higher Productivity for Data Scientists
Enable non-Data Scientists to use AI
Integrated
software and
hardware for
AI

Data Science Teams
Phase
Team
Tasks &
Pain points
Leader
concerns
Getting Started Modeling
Experimentation
Developing Apps
Developing Dashboards
Deployment
Monitoring
Support
• Defining projects
• Finding corporate data
• Connecting to data
sources
• Understanding the data
• Cleaning data
• Building models
Measuring accuracy
• Finding more data
• Building repeatable data
pipelines
• Integration with
engineering
• Machinery management
• QA
• Accuracy monitoring
• Scalability
• Models robustness with
new data
• Integration with
infrastructure
• (reuse of old models)
• Hiring, getting skills
• Data security
(breaches)
• Data security
• Productivity of a
very expensive &
rare skill
• Skill inconsistency
• Data security
• Productivity of a very
expensive & rare skill
• Knowledge loss due to
high employee turnover
• Meeting customer
expectations with timely
support
• Productivity of a very
expensive & rare skill
• Knowledge loss due to
high employee turnover
Data Scientist
Happiness

Teams getting started
• Learn
• Connect to Enterprise
data sources easily
• Collaborate
• Working on cluster
safer than desktops for
leader
• Safe behind the firewall
Big SQL, Db2 (warehouse/z/LUW), Hive
for HDP, HDFS for HDP
Hive for Cloudera (CDH)
HDFS for Cloudera (CDH)
Informix, Netezza, Oracle

Teams in modeling experimentation phase
• DSX Local simplifies distribution of team
work based on skills
• DSX Local increases knowledge sharing and
knowledge retention
• Currently based on open source notebooks,
productivity tools in the future
• DSX Local simplifies cluster management for
teams

Teams in applications building phase
• Facilitate creation of machine learning
models
• Facilitate deployment of models as API end-points
• Automation of Batch Scoring, Training and Evaluation
scripts as schedulable jobs
• GIT integration to collaborate with engineers in their
favorite environment
• Publish content to others in pdf / html / R-Shiny app

Teams in model deployment, monitoring and support phase
• Monitor models through a dashboard
• Model versioning, evaluation history
• Publish versions of models, supporting
dev/stage/production paradigm
• Monitor scalability through cluster dashboard
• Adapt scalability by redistributing
compute/memory/disk resources

Software Architecture Best Practices
Run as a collection of “dockerized” services which are managed by Kubernetes
Kubernetes handles the service orchestration by providing
• Service monitoring and administration
• High availability / service failure detection and automatic restart
• Dynamically adds or removes nodes
• Online upgrades
Services running in Kubernetes include:
• UI services built with Node.js frameworks for browsers to connect to
• User authentication services
• Project services for user collaboration and data sharing
• Notebook services with enhanced access to Jupyter notebooks
• A Spark service with access to sophisticated analytics libraries
• Pipeline and model building services
• Data connection building service for access to external data
• Various internal management services

Specialized Runtime environments for containers with GPUs
16
• Create microservices using
nvidia-docker images
• Add AI frameworks which
transparently exploit GPUs
such as Tensorflow to the
docker image
• Deploy image and allocate
GPUs in a cluster using
kubernetes

Connect to Spark and Hadoop cluster for larger data sets and
access to shared resources
or YARN

19
Faster Data Communication with Unique
CPU-GPU NVLink High-Speed Connection
1 TB
Memory
CPU
GPU GPU
170GB/s
NVLink
150 GB/s
1 TB
Memory
CPU
GPU GPU
170GB/s
NVLink
150 GB/s
Deep Learning Server (4-GPU Config)
Store Large Models
in System Memory
Operate on One
Layer at a Time
Fast Transfer
via NVLink

ResponseTime(sec)
Accelerate Data Scientist productivity and drive faster insights with DSX Local on
Power AC922
Power AC922 (Power 9 with NVIDIA GPUs) completes running GPU accelerated K-means clustering with 15GB
data in half of the time than tested x86 systems (Skylake 6150 with NVIDIA GPUs)
20
1. Based on IBM internal testing of the core computational step to form 5 clusters using a 5270410 x 301 float64 data set (15 GB) running the K-means algorithm using Apache Python ® and Tensorflow, Results valid as of 6/13/2018 and conducted under
laboratory condition with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems, individual results can vary based on workload size, use of storage subsystems & other conditions.
2. IBM Power AC922 (2x20-core/3.78 GHz and 4 NVIDIA® Tesla® V100 GPUs with NVLink) using 2x 960 GB SSD, 10 GbE two-port, RHEL 7.5 LE for Power9. Total 1 TB memory, each user assigned 180 GB in DSXL.
3. Competitive stack: 2-socket Intel Xeon SP (Skylake and 4 NVIDIA® Tesla® V100 GPUs) Gold 6150 (2x18-core/2.7 GHz) using 2x 960 GB SSD, 10 GbE two-port and RHEL 7.5. Total 768 GB memory, each user assigned 180 GB in DSXL.
4. Apache®, Apache Python®, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
• Power AC922 delivers 2x FASTER insights for GPU
accelerated K-means clustering workload than Intel
Xeon SP Gold 6150 based servers
• Power Systems Cluster with LC 922 (CPU optimized)
and AC922 (GPU accelerated) provides an optimized
infrastructure for IBM Data Science Experience Local
(DSX Local)
Cognitive Systems / June 13, 2018 / © 2018 IBM Corporation
0
120
240
360
480
600
720
840
960
1080
1200
1320
1440
0 1 2 3 4 5
Concurrent Users
K-means Clustering with Tensorflow
15 GB Data on GPUs
AC922 Skylake 6150
Lowerisbetter

Accelerate Data Scientist productivity and drive faster insights with DSX Local on
Power LC922
Power LC922 running K-means clustering with 1GB data scales to 2X the users than tested x86 systems
21
2. IBM Power LC922 (2x22-core/2.6 GHz/512 GB memory) using 10 x 4TB HDD, 10 GbE two-port, RHEL 7.5 LE for Power9
3. Competitive stack: 2-socket Intel Xeon SP (Skylake) Gold 6140 (2x18-core/2.4 GHz/512 GB memory) using 10 x 4TB HDD, 10 GbE two-port and RHEL 7.5
• Power LC922 supports 2X the number of users at
FASTER RESPONSE TIME than Intel Xeon SP Gold
6140 based servers
• Power LC922 delivers over 41% FASTER insights
for the same number of users (4-8 users)
Lowerisbetter
0
60
120
180
240
300
360
420
480
540
600
660
0 2 4 6 8 10
ResponseTime(sec)
Concurrent Users
K-Means Clustering with Tensorflow
1 GB Data on CPUs
Intel Xeon Gold 6140 Power LC9222
2x users with better
response time
Cognitive Systems / April 26, 2018 / © 2018 IBM Corporation

POC Development and Deployment Configuration Kubernetes
cluster
IBM Cloud © 2018 IBM Corporation
Deploy Deploy
Control
Storage
Compute
Control
Storage
Compute
Control
Storage
Compute
IBM POWER Systems LC 922
LPAR/VM LPAR/VM LPAR/VM
LPAR/VM LPAR/VM

Mixed Development and Deployment Configuration
Deploy Deploy
Control
Storage
Compute
Control
Storage
Compute
Control
Storage
Compute ComputeCompute
IBM POWER Systems LC 922 IBM POWER Systems AC 922

Separate Development and Deployment Configuration
Control
Storage
Compute
Control
Storage
Compute
Control
Storage
Compute Compute
Deploy Deploy
IBM POWER Systems LC 922 IBM POWER Systems AC 922

Train Large AI Models
Faster
Servers with NVLink to GPUs
25
3.1 Hours
49 Mins
0
2000
4000
6000
8000
10000
12000
Xeon x86 2640v4 w/
4x V100 GPUs
Power AC922 w/ 4x
V100 GPUs
Time(secs)
Caffe with LMS (Large Model Support)
Runtime of 1000 Iterations
3.8x Faster
GoogleNet model on Enlarged
ImageNet Dataset (2240x2240)
More details:
https://developer.ibm.com/linuxonpower/perfco
l/perfcol-mldl/

Distributed Deep
Learning (DDL)
26
Deep learning training takes
days to weeks
Distributed learning enables
scaling to 100s of servers
connected with Mellanox IB
1 System 64 Systems
16 Days Down to 7 Hours
58x Faster
16 Days
7 Hours
Near Ideal Scaling to 256 GPUs
ResNet-101, ImageNet-22K
1
2
4
8
16
32
64
128
256
4 16 64 256
Speedup
Number of GPUs
Ideal Scaling
DDL Actual Scaling
95%Scaling
with 256 GPUS
Caffe with PowerAI DDL, Running on Minsky (S822Lc) Power System
ResNet-50, ImageNet-1K

27
Network Switch
GPU
Memory
POWER
CPU
DDR4
GPU
Storage Network IB, Eth
PCle
DDR4POWER
CPU
GPU GPU GPU
GPU
Memory
GPU
Memory
GPU
Memory
NVLinkNVLink
NVLink
NVLink
GPU
Memory
POWER
CPU
DDR4
GPU
PCle
DDR4POWER
CPU
GPU GPU GPU
GPU
Memory
GPU
Memory
GPU
Memory
NVLinkNVLink
NVLink
NVLink
COMMUNICATION PATHS
DDL: Fully utilize bandwidth for links within each node and across all nodes
à Learners communicate as efficiently as possible
Storage
Mellanox IB Network Switch
GPU
Memory
POWER
CPU
DDR4
GPU
Network IB, Eth
PCle
DDR4POWER
CPU
GPU GPU GPU
GPU
Memory
GPU
Memory
GPU
Memory
NVLinkNVLink
NVLink
NVLink
GPU
Memory
POWER
CPU
DDR4
GPU
PCle
DDR4POWER
CPU
GPU GPU GPU
GPU
Memory
GPU
Memory
GPU
Memory
NVLinkNVLink
NVLink
NVLink

Auto Hyper-Parameter Tuning
Hyper-parameters
– Learning rate
– Decay rate
– Batch size
– Optimizer
• GradientDecedent,
Adadelta, Momentum,
RMSProp …
– Momentum (for some
optimizers)
– LSTM hidden unit size
Random
Tree-based Parzen
Estimator (TPE)
Bayesian
Multi-tenant Spark Cluster
IBM Spectrum Conductor with Spark
Spark search jobs are generated dynamically and executed in parallel
28

29
libGLM (C++ / CUDA
Optimized Primitive Lib)
Distributed Training
Logistic Regression Linear Regression
Support Vector
Machines (SVM)
Distributed Hyper-
Parameter Optimization
More Coming Soon
APIs for Popular ML
Frameworks
Snap ML
Distributed GPU-Accelerated Machine Learning Library
(coming
soon)
Snap Machine Learning (ML) Library

46x faster than previous
record set by Google
Workload: Click-through rate
prediction for advertising
Logistic Regression Classifier in
Snap ML using GPUs vs
TensorFlow using CPU-only
30
Snap ML: Training Time Goes
From An Hour to Minutes
Logistic Regression in Snap ML (with
GPUs) vs TensorFlow (CPU-only)
1.1 Hours
1.53
Minutes
0
20
40
60
80
Google
CPU-only
Snap ML
Power + GPU
Runtime(Minutes)
46x Faster
Dataset: Criteo Terabyte Click Logs
(http://labs.criteo.com/2013/12/download-terabyte-click-logs/)
4 billion training examples, 1 million features
Model: Logistic Regression: TensorFlow vs Snap ML
Test LogLoss: 0.1293 (Google using Tensorflow), 0.1292 (Snap ML)
Platform: 89 CPU-only machines in Google using Tensorflow versus
4 AC922 servers (each 2 Power9 CPUs + 4 V100 GPUs) for Snap ML
Google data from this Google blog

PowerAI Vision: ”Point-and-Click” AI for Images & Video
31
Label Image or
Video Data
Auto-Train AI
Model
Package & Deploy
AI Model

Semi-Automatic Labeling using PowerAI Vision
32
Train DL Model
Define Labels
Manually Label Some
Images / Video Frames
Manually Label
Use Trained DL
Model
Run Trained DL Model
on Entire Input Data
to Generate Labels
Correct Labels
on Some Data
Manually Correct
Labels on Some Data
Repeat Till Labels Achieve
Desired Accuracy

33
IBM Power LC922
(44-core, 512GB)
Intel Xeon SP based
2-socket server
(36-core, 512GB)
Server price 2,3,4
3-year warranty & OS
$35,618 $45,390
Performance 1
Models completed per hour at 8
concurrent users
84.6 49.8
Performance per
K$
2.4 per $K 1.1 per $K
2.2X
Better Price-performance
22%
Lower Server Cost
70%
More Performance
per Hour
DSX Local on Power LC922 Server: Improved Price-Performance for
Clients
Increased model completion and lower cost running K-means Clustering than tested Intel Xeon SP servers
2. IBM Power LC922 (2x22-core/2.6 GHz/512 GB memory) using 10 x 4TB HDD, 10 GbE two-port, RHEL 7.5 LE for Power9
4. Pricing is based on Power LC922 http://www-03.ibm.com/systems/power/hardware/linux-lc.html publicly available x86 pricing.
Cognitive Systems / April 26, 2018 / © 2018 IBM Corporation

DSX Local on Power LC922 Server: Improved Price-Performance for
Clients
Increased model completion and lower cost running K-means Clustering than tested Intel Xeon SP servers
Intel Xeon SP Gold
6140 server:
578 seconds
Power LC922
340 seconds
Power LC922
$35,618
Intel Xeon SP Gold
6140 server:
$45,390
Power LC922
Versus
Intel Xeon SP
Gold 6140 server
1. Based on IBM internal testing of the core computational step for 8 users to form 5 clusters using a 350694 x 301 float64 data set (1 GB) running the K-means algorithm using Apache Python ® and Tensorflow, Results valid as of 4/21/18 and conducted
under laboratory condition with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems, individual results can vary based on workload size, use of storage subsystems & other conditions.
2. IBM Power LC922 2x22-core/2.6 GHz/512 GB memory) using 10 x 4TB HDD, 10 GbE two-port, RHEL 7.5 LE for Power9
4. Pricing is based on Power LC922 http://www-03.ibm.com/systems/power/hardware/linux-lc.html and publicly available x86 pricing.
41%
FASTER
Insights1
22%
LOWER
Price2,3,4

Thank You
36
Recorded demo: https://ibm.box.com/s/pj0hs07x1fqlp1z576rsnb6odwtf57k5

Notice and disclaimers
37
Copyright © 2018 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without
written permission from IBM.
U.S. Government Users Restricted Rights — use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of
initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is
distributed “as is” without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including
but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and
conditions of the agreements under which they are provided.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our
warranty terms apply.”
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those
customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries
in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and
discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or
their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and
interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with
such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.

Notice and disclaimers cont.
38
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied
warranties of merchantability and fitness for a particular, purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, AIX, BigInsights, Bluemix, CICS, Easy Tier, FlashCopy, FlashSystem, GDPS, GPFS, Guardium, HyperSwap, IBM
Cloud Managed Services, IBM Elastic Storage, IBM FlashCore, IBM FlashSystem, IBM MobileFirst, IBM Power Systems, IBM PureSystems, IBM
Spectrum, IBM Spectrum Accelerate, IBM Spectrum Archive, IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Scale, IBM Spectrum
Storage, IBM Spectrum Virtualize, IBM Watson, IBM Z, IBM z Systems, IBM z13, IMS, InfoSphere, Linear Tape File System, OMEGAMON,
OpenPower, Parallel Sysplex, Power, POWER, POWER4, POWER7, POWER8, Power Series, Power Systems, Power Systems Software, PowerHA,
PowerLinux, PowerVM, PureApplica- tion, RACF, Real-time Compression, Redbooks, RMF, SPSS, Storwize, Symphony, SystemMirror, System
Storage, Tivoli, WebSphere, XIV, z Systems, z/OS, z/VM, z/VSE, zEnterprise and zSecure are trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A
current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are
trademarks or registered trademarks of Oracle and/or its affiliates.

Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop

Similar to Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop (20)

More from Indrajit Poddar

More from Indrajit Poddar (8)

Recently uploaded

Recently uploaded (20)

Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop