Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop
1. Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop
Indrajit (I.P) Poddar
Senior Technical Staff Member
IBM Cloud and Cognitive Systems
June 2018
2. Safe Harbor Statement and Disclaimer
• Copyright IBM Corporation 2018. All rights reserved. U.S. Government Users Restricted Rights - use, duplication, or disclosure restricted
by GSA ADP Schedule Contract with IBM Corporation.
• IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United
States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information
was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks
is available on the Web at “Copyright and trademark information” at: ibm.com/legal/copytrade.shtml.
• The information contained in this presentation is provided for informational purposes only. While efforts were made to verify the
completeness and accuracy of the information contained in this presentation, it is provided “as is” without warranty of any kind, expressed
or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other
documentation.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material,
code or functionality. Information about potential future products may not be incorporated into any contract. Nothing contained in this
presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM (or its suppliers or licensors),
or altering the terms and conditions of any agreement or license governing the use of IBM products and/or software.
• Any statements of performance are based on measurements and projections using standard IBM benchmarks in a controlled environment.
The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such
as the amount of multi-programming in the user’s job stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated.
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making
a purchasing decision.
3. Agenda
01 AI, Deep Learning, Machine Learning
02 Data Science Experience
03 Hardware Acceleration
04 Demo
5. Deep Learning Has Revolutionized Machine Learning
[Chart: accuracy vs. amount of data. Deep learning keeps improving with more data, while traditional machine learning plateaus]
[Chart: deep learning popularity growing exponentially, 2011-2017. Source: Google Trends, search term “Deep Learning”]
6. Machine Learning vs. Deep Learning
Machine Learning: Input → Feature Extraction → Features → Classification (machine learning algorithms) → Output
Deep Learning: Input → Deep Neural Network (feature extraction & classification combined) → Output
7. AI Infrastructure Stack
• Applications (segment specific: Finance, Retail, Healthcare)
• Cognitive APIs (e.g., Watson) and In-House APIs: Speech, Vision, NLP, Sentiment
• Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, SparkML
• Distributed Computing: Kubernetes, Spark, MPI
• Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs
• Transform & Prep Data (ETL)
• Accelerated Infrastructure: accelerated servers and storage
8. Integrated software and hardware for AI
• Open source frameworks: supported distribution
• Developer ease-of-use tools
• Faster training times via HW & SW performance optimizations
• Integrated & supported AI platform
• Higher productivity for data scientists
• Enables non-data scientists to use AI
9. Agenda
01 AI, Deep Learning, Machine Learning
02 Data Science Experience
03 Hardware Acceleration
04 Demo
10. Data Science Teams
Phase: Getting Started
• Tasks & pain points: defining projects; finding corporate data; connecting to data sources; understanding the data
• Leader concerns: hiring and getting skills; data security (breaches)
Phase: Modeling & Experimentation
• Tasks & pain points: cleaning data; building models; measuring accuracy; finding more data
• Leader concerns: data security; productivity of a very expensive & rare skill; skill inconsistency
Phase: Developing Apps & Dashboards
• Tasks & pain points: building repeatable data pipelines; integration with engineering; machinery management; QA
• Leader concerns: data security; productivity of a very expensive & rare skill; knowledge loss due to high employee turnover
Phase: Deployment, Monitoring & Support
• Tasks & pain points: accuracy monitoring; scalability; model robustness with new data; integration with infrastructure; (reuse of old models)
• Leader concerns: meeting customer expectations with timely support; productivity of a very expensive & rare skill; knowledge loss due to high employee turnover
Team: Data Scientist Happiness
11. Teams getting started
• Learn
• Connect to enterprise data sources easily
• Collaborate
• Working on a cluster is safer than on desktops, which reassures team leaders
• Safe behind the firewall
Supported data sources: Big SQL; Db2 (warehouse/z/LUW); Hive and HDFS for HDP; Hive and HDFS for Cloudera (CDH); Informix; Netezza; Oracle
12. Teams in the modeling and experimentation phase
• DSX Local simplifies distribution of team work based on skills
• DSX Local increases knowledge sharing and knowledge retention
• Currently based on open source notebooks, with productivity tools planned for the future
• DSX Local simplifies cluster management for teams
13. Teams in the application building phase
• Facilitate creation of machine learning models
• Facilitate deployment of models as API endpoints
• Automate batch scoring, training, and evaluation scripts as schedulable jobs
• Git integration to collaborate with engineers in their favorite environment
• Publish content to others as PDF, HTML, or an R Shiny app
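As one illustration of the "models as API endpoints" item, a client might score a deployed model over HTTP roughly as follows. The URL, bearer-token handling, and JSON field names here are hypothetical placeholders; the actual DSX Local scoring API may differ in your deployment.

```python
import json
from urllib import request

# Hypothetical scoring endpoint; replace with your deployment's URL.
SCORING_URL = "https://dsx.example.com/v1/models/churn/score"

def build_payload(rows, fields):
    """Serialize feature rows into a JSON request body."""
    return json.dumps({"fields": fields, "values": rows}).encode("utf-8")

def score(rows, fields, url=SCORING_URL, token="API_TOKEN"):
    """POST feature rows to the model endpoint and return its predictions."""
    req = request.Request(
        url,
        data=build_payload(rows, fields),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

# Offline check of the payload shape (no network call):
body = json.loads(build_payload([[41, 2, 0.3]], ["age", "tenure", "usage"]))
print(body["fields"])  # ['age', 'tenure', 'usage']
```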
14. Teams in the model deployment, monitoring and support phase
• Monitor models through a dashboard
• Model versioning and evaluation history
• Publish versions of models, supporting the dev/stage/production paradigm
• Monitor scalability through a cluster dashboard
• Adapt scalability by redistributing compute, memory, and disk resources
15. Software Architecture Best Practices
DSX Local runs as a collection of “dockerized” services managed by Kubernetes.
Kubernetes handles the service orchestration by providing:
• Service monitoring and administration
• High availability: service failure detection and automatic restart
• Dynamic addition and removal of nodes
• Online upgrades
Services running in Kubernetes include:
• UI services built with Node.js frameworks for browsers to connect to
• User authentication services
• Project services for user collaboration and data sharing
• Notebook services with enhanced access to Jupyter notebooks
• A Spark service with access to sophisticated analytics libraries
• Pipeline and model building services
• A data connection building service for access to external data
• Various internal management services
16. Specialized runtime environments for containers with GPUs
• Create microservices using nvidia-docker images
• Add AI frameworks that transparently exploit GPUs, such as TensorFlow, to the Docker image
• Deploy the image and allocate GPUs in a cluster using Kubernetes
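The three steps above might look roughly like the following Kubernetes pod specification. The image name and namespace are placeholders, and the cluster is assumed to run the NVIDIA device plugin so that `nvidia.com/gpu` is a schedulable resource:

```yaml
# Minimal sketch: schedule one GPU to a TensorFlow training container.
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu-train
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/tensorflow:18.04-py3   # placeholder nvidia-docker image
    command: ["python", "train.py"]              # placeholder training script
    resources:
      limits:
        nvidia.com/gpu: 1   # Kubernetes allocates one GPU to this container
```

Because the GPU is requested through the resource limits, Kubernetes handles placement: the pod only lands on a node with a free GPU, and the device is made visible inside the container automatically.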
17. Connect to a Spark and Hadoop cluster for larger data sets and access to shared resources
[Diagram: DSX Local connecting to a Spark and Hadoop cluster, scheduled via Spark or YARN]
18. Agenda
01 AI, Deep Learning, Machine Learning
02 Data Science Experience
03 Hardware Acceleration
04 Demo
19. Faster Data Communication with Unique CPU-GPU NVLink High-Speed Connection
[Diagram: deep learning server, 4-GPU configuration. Each POWER CPU has 1 TB of system memory at 170 GB/s and connects to two GPUs over NVLink at 150 GB/s]
• Store large models in system memory
• Operate on one layer at a time
• Fast transfer via NVLink
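The bandwidth figures above translate directly into layer-transfer times, which is why the layer-at-a-time scheme is practical over NVLink. A back-of-envelope sketch; the 2 GB layer size and the ~16 GB/s nominal PCIe Gen3 x16 figure are illustrative assumptions, not from the slide:

```python
# Time to stream one layer from system memory to GPU memory
# over NVLink (150 GB/s, per the slide) vs. a nominal PCIe Gen3 x16 link.
def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Transfer time in milliseconds at a given link bandwidth."""
    return size_gb / bandwidth_gb_per_s * 1000.0

layer_gb = 2.0                                  # assumed layer size
nvlink = transfer_time_ms(layer_gb, 150.0)
pcie = transfer_time_ms(layer_gb, 16.0)
print(round(nvlink, 1), "ms over NVLink vs", round(pcie, 1), "ms over PCIe")
# 13.3 ms over NVLink vs 125.0 ms over PCIe
```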
25. Train Large AI Models Faster: Servers with NVLink to GPUs
[Bar chart: Caffe with LMS (Large Model Support), runtime of 1000 iterations. Xeon x86 2640v4 with 4x V100 GPUs: 3.1 hours; Power AC922 with 4x V100 GPUs: 49 minutes. 3.8x faster]
GoogleNet model on enlarged ImageNet dataset (2240x2240)
More details: https://developer.ibm.com/linuxonpower/perfcol/perfcol-mldl/
26. Distributed Deep Learning (DDL)
• Deep learning training takes days to weeks
• Distributed learning enables scaling to 100s of servers connected with Mellanox InfiniBand
16 days on 1 system down to 7 hours on 64 systems: 58x faster (ResNet-50, ImageNet-1K; Caffe with PowerAI DDL, running on Minsky (S822Lc) Power System)
[Chart: near-ideal scaling to 256 GPUs, with 95% scaling efficiency at 256 GPUs. DDL actual scaling vs. ideal scaling over 4 to 256 GPUs. ResNet-101, ImageNet-22K]
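Scaling efficiency, as reported in the chart above, is simply achieved speedup divided by GPU count. A quick check of what 95% efficiency at 256 GPUs implies:

```python
# Scaling efficiency = actual speedup / number of GPUs.
def scaling_efficiency(speedup, n_gpus):
    return speedup / n_gpus

# 95% efficiency at 256 GPUs implies a speedup of about 243x over one GPU:
implied_speedup = 0.95 * 256
print(round(implied_speedup, 1))  # 243.2
```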
27. DDL Communication Paths
[Diagram: multiple Power systems, each with two POWER CPUs (DDR4 system memory), four GPUs with GPU memory interconnected by NVLink, PCIe-attached storage, and InfiniBand/Ethernet network adapters, all connected through a Mellanox InfiniBand network switch]
DDL fully utilizes bandwidth for links within each node and across all nodes, so learners communicate as efficiently as possible.
28. Auto Hyper-Parameter Tuning
Hyper-parameters:
• Learning rate
• Decay rate
• Batch size
• Optimizer (gradient descent, Adadelta, Momentum, RMSProp, …)
• Momentum (for some optimizers)
• LSTM hidden unit size
Search strategies: random search, Tree-structured Parzen Estimator (TPE), Bayesian optimization
Spark search jobs are generated dynamically and executed in parallel on a multi-tenant Spark cluster (IBM Spectrum Conductor with Spark).
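The simplest of the three strategies, random search, can be sketched as follows. Each configuration is sampled and evaluated independently, which is exactly what makes the trials easy to fan out as parallel Spark jobs. The search space and toy objective below are illustrative, not the product's:

```python
import random

# Sampling functions for each hyper-parameter in the search space.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),  # log-uniform
    "batch_size":    lambda: random.choice([32, 64, 128, 256]),
    "optimizer":     lambda: random.choice(["momentum", "adadelta", "rmsprop"]),
}

def sample_config():
    """Draw one random hyper-parameter configuration."""
    return {name: draw() for name, draw in SPACE.items()}

def toy_objective(cfg):
    """Stand-in for a validation-loss measurement; lower is better."""
    return abs(cfg["learning_rate"] - 0.01) + cfg["batch_size"] / 10000.0

random.seed(0)
trials = [sample_config() for _ in range(20)]  # generated dynamically...
best = min(trials, key=toy_objective)          # ...and evaluated independently
print(sorted(best))  # ['batch_size', 'learning_rate', 'optimizer']
```

TPE and Bayesian optimization replace the independent sampling with a model of past trial results, but the execution pattern (many independent evaluations) stays the same.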
29. Snap Machine Learning (ML) Library
Distributed GPU-accelerated machine learning library (coming soon)
• APIs for popular ML frameworks
• Distributed training and distributed hyper-parameter optimization
• libGLM: C++/CUDA optimized primitive library
• Models: logistic regression, linear regression, support vector machines (SVM), with more coming soon
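To make concrete what libGLM accelerates, here is the kind of gradient-descent loop at the heart of a logistic regression trainer. Snap ML runs optimized, distributed C++/CUDA versions of this primitive; the NumPy sketch, toy data, and hyper-parameters below are purely illustrative:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Fit logistic regression weights by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the mean log-loss
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable toy labels
w = fit_logistic(X, y)
acc = np.mean((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y)
print("separates toy data:", acc > 0.9)
```

The matrix-vector products inside the loop are exactly the operations that map well onto GPUs, which is where the library's speedups come from.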
30. Snap ML: Training Time Goes From An Hour to Minutes
46x faster than the previous record set by Google
Workload: click-through rate prediction for advertising. Logistic regression classifier in Snap ML using GPUs vs. TensorFlow using CPUs only.
[Bar chart: runtime in minutes. Google, CPU-only: 1.1 hours; Snap ML, Power + GPU: 1.53 minutes. 46x faster]
Dataset: Criteo Terabyte Click Logs (http://labs.criteo.com/2013/12/download-terabyte-click-logs/), 4 billion training examples, 1 million features
Model: logistic regression, TensorFlow vs. Snap ML
Test log loss: 0.1293 (Google using TensorFlow), 0.1292 (Snap ML)
Platform: 89 CPU-only machines in Google using TensorFlow versus 4 AC922 servers (each with 2 POWER9 CPUs + 4 V100 GPUs) for Snap ML
Google data from this Google blog
32. Semi-Automatic Labeling using PowerAI Vision
1. Define labels
2. Manually label some images / video frames
3. Train DL model
4. Run the trained DL model on the entire input data to generate labels
5. Manually correct labels on some data
6. Repeat from step 3 until labels achieve the desired accuracy
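The labeling loop above can be sketched as follows. The `train`, `predict`, and `review` functions are stubs standing in for PowerAI Vision's training, inference, and human-review steps; the toy "model" below simply memorizes corrections:

```python
def semi_automatic_label(frames, seed_labels, train, predict, review, target=0.95):
    """Iterate train -> auto-label -> manual correction until accuracy is met."""
    labels = dict(seed_labels)                          # manually labeled seed set
    while True:
        model = train(labels)                           # train DL model on current labels
        labels = {f: predict(model, f) for f in frames} # label the entire input data
        accuracy, corrections = review(labels)          # human spot-checks a sample
        labels.update(corrections)                      # fold corrections back in
        if accuracy >= target:                          # repeat until labels are good
            return labels

# Toy run: review fixes at most one wrong frame per pass.
truth = {"f1": "cat", "f2": "dog", "f3": "cat"}
def train(labels): return dict(labels)
def predict(model, f): return model.get(f, "cat")
def review(labels):
    wrong = {f: t for f, t in truth.items() if labels.get(f) != t}
    return 1 - len(wrong) / len(truth), dict(list(wrong.items())[:1])

final = semi_automatic_label(truth.keys(), {"f1": "cat"}, train, predict, review)
print(final == truth)  # True
```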
34. DSX Local on Power LC922 Server: Improved Price-Performance for Clients
Faster model completion and lower cost running K-means clustering than the tested Intel Xeon SP servers:
• 41% faster insights(1): Power LC922, 340 seconds, vs. Intel Xeon SP Gold 6140 server, 578 seconds
• 22% lower price(2,3,4): Power LC922, $35,618, vs. Intel Xeon SP Gold 6140 server, $45,390
1. Based on IBM internal testing of the core computational step for 8 users to form 5 clusters using a 350694 x 301 float64 data set (1 GB) running the K-means algorithm using Apache Python® and TensorFlow. Results valid as of 4/21/18 and conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems; individual results can vary based on workload size, use of storage subsystems, and other conditions.
2. IBM Power LC922 (2x22-core/2.6 GHz/512 GB memory) using 10 x 4TB HDD, 10 GbE two-port, RHEL 7.5 LE for POWER9
3. Competitive stack: 2-socket Intel Xeon SP (Skylake) Gold 6140 (2x18-core/2.4 GHz/512 GB memory) using 10 x 4TB HDD, 10 GbE two-port and RHEL 7.5
4. Pricing is based on Power LC922 http://www-03.ibm.com/systems/power/hardware/linux-lc.html and publicly available x86 pricing.
5. Apache®, Apache Python®, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
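The headline percentages follow directly from the footnoted measurements; a quick check:

```python
# 578 s (Xeon) vs. 340 s (LC922) runtime; $45,390 vs. $35,618 list price.
def pct_lower(baseline, value):
    """How much lower `value` is than `baseline`, in percent."""
    return (baseline - value) / baseline * 100

faster = pct_lower(578, 340)        # runtime reduction
cheaper = pct_lower(45390, 35618)   # price reduction
print(round(faster), round(cheaper))  # 41 22
```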
35. Agenda
01 AI, Deep Learning, Machine Learning
02 Data Science Experience
03 Hardware Acceleration
04 Demo
38. Notice and disclaimers cont.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied
warranties of merchantability and fitness for a particular purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, AIX, BigInsights, Bluemix, CICS, Easy Tier, FlashCopy, FlashSystem, GDPS, GPFS, Guardium, HyperSwap, IBM
Cloud Managed Services, IBM Elastic Storage, IBM FlashCore, IBM FlashSystem, IBM MobileFirst, IBM Power Systems, IBM PureSystems, IBM
Spectrum, IBM Spectrum Accelerate, IBM Spectrum Archive, IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Scale, IBM Spectrum
Storage, IBM Spectrum Virtualize, IBM Watson, IBM Z, IBM z Systems, IBM z13, IMS, InfoSphere, Linear Tape File System, OMEGAMON,
OpenPower, Parallel Sysplex, Power, POWER, POWER4, POWER7, POWER8, Power Series, Power Systems, Power Systems Software, PowerHA,
PowerLinux, PowerVM, PureApplica- tion, RACF, Real-time Compression, Redbooks, RMF, SPSS, Storwize, Symphony, SystemMirror, System
Storage, Tivoli, WebSphere, XIV, z Systems, z/OS, z/VM, z/VSE, zEnterprise and zSecure are trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A
current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are
trademarks or registered trademarks of Oracle and/or its affiliates.