This document discusses optimizing Apache Spark machine learning workloads on OpenPOWER platforms. It provides an overview of Spark, machine learning, and deep learning. It then discusses how OpenPOWER systems are well-suited for these workloads due to features like high memory bandwidth, large caches, and GPU support. The document outlines various techniques for tuning Spark performance on OpenPOWER, such as configuration of executors, cores, memory, and storage levels. It also presents examples analyzing the performance of a matrix factorization machine learning application under different Spark configurations.
5. Machine Learning and Deep Learning (ML/DL)
What you and I (our brains) do without even thinking about it... we recognize a bicycle
Apr 7, 2017
(c) International Business Machines (IBM) 2017
6. Now machines are learning the way we learn…
From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal.
Artificial Neural Networks
7. But training needs a lot of computational resources
Easy scale-out frameworks exist for data processing, but model training is not easy to distribute:
• training can take hours, days, or weeks
• input data and model sizes are larger than ever (e.g. video input, billions of features)
• Moore's law is dying
Resulting in a need for real-time analytics with:
• whole-system optimization
• offloaded computation
• accelerators, and
• higher memory bandwidth systems
9. OpenPOWER: open hardware for high performance
Systems designed for big data analytics and superior cloud economics. Up to:
• 12 cores per CPU
• 96 hardware threads per CPU (SMT8)
• 1 TB RAM
• 7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming…
[Chart: OpenPOWER vs. traditional Intel x86]
http://www.softlayer.com/POWER-SERVERS
https://mc.jarvice.com/
12. Introducing Minsky, the S822LC OpenPOWER system for HPC
The first custom-built GPU accelerator server with NVLink
2.5x faster CPU-GPU data communication via NVLink: the POWER8 NVLink server connects CPUs and NVIDIA P100 Pascal GPUs over NVLink at 80 GB/s, while x86 servers have no NVLink between CPU and GPU and connect over PCIe at 32 GB/s, a PCIe bottleneck.
• Custom-built GPU accelerator server
• High-speed NVLink connections between CPUs & GPUs and among GPUs
• Features the novel NVIDIA P100 Pascal GPU accelerator
M. Gschwind, "Bringing the Deep Learning Revolution into the Enterprise"
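The quoted link speeds translate directly into staging times for training data. A quick back-of-the-envelope sketch (the 80 GB/s and 32 GB/s figures come from the slide; the 16 GB transfer size is an arbitrary illustration):

```python
# Peak CPU-GPU link bandwidth, as quoted on the slide (GB/s)
NVLINK_GBS = 80.0   # POWER8 <-> P100 over NVLink
PCIE_GBS = 32.0     # x86 CPU <-> GPU over PCIe

speedup = NVLINK_GBS / PCIE_GBS
print(f"NVLink vs. PCIe speedup: {speedup:.1f}x")   # 2.5x, matching the slide

# Time to stage a hypothetical 16 GB batch of training data onto the GPU
transfer_gb = 16.0
print(f"PCIe:   {transfer_gb / PCIE_GBS * 1000:.0f} ms")    # 500 ms
print(f"NVLink: {transfer_gb / NVLINK_GBS * 1000:.0f} ms")  # 200 ms
```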
13. Deep Learning on OpenPOWER with GPUs
Transparent acceleration without code changes
14. Introducing PowerAI: Get started fast with Deep Learning
• Package of pre-compiled major deep learning frameworks
• Easy to install and get started with
• Deep learning with enterprise-class support
• Optimized for performance to take advantage of NVLink
• Enabled by high-performance computing infrastructure
https://www.ibm.com/ms-en/marketplace/deep-learning-platform
15. Machine and Deep Learning analytics on OpenPOWER
no code changes needed!!
ATLAS (Automatically Tuned Linear Algebra Software)
https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/DeepLearning4J_Deep_Learning_with_Java_Spark_and_Power?lang=en
17. Enabling Accelerators/GPUs in the cloud stack
[Stack diagram: deep learning training + inference running on clustering frameworks, containers and images, and accelerators]
18. TensorFlow on Tesla P100: PowerAI is 30% faster
IBM S822LC: 20 cores @ 2.86 GHz, 512 GB memory, 4x NVIDIA Tesla P100 GPUs, Ubuntu 16.04, CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.12.0, Inception v3 benchmark (64-image minibatch)
Intel Broadwell E5-2640 v4: 20 cores @ 2.6 GHz, 512 GB memory, 4x NVIDIA Tesla P100 GPUs, Ubuntu 16.04, CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.12.0, Inception v3 benchmark (64-image minibatch)
Larger value is better.
19. PowerAI vs. DGX-1: 1.6x TensorFlow throughput per dollar
▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage of the full capabilities of NVLink
▪ For image classification and analysis this means a 1.6x price-performance advantage relative to the NVIDIA DGX-1
System                                 | Images/Second | List Price | $ / Image / Second
NVIDIA DGX-1 (8 P100 GPUs, 512 GB Mem) | 330           | $129,000   | $390
PowerAI (4 P100 GPUs, 512 GB Mem)      | 273           | $67,000    | $241
Lower cost is better
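The 1.6x headline follows directly from the table. A small sketch reproducing the arithmetic (the slide rounds the per-unit costs slightly differently):

```python
# List price and measured throughput from the table above
dgx1 = {"images_per_s": 330, "list_price": 129_000}
powerai = {"images_per_s": 273, "list_price": 67_000}

def cost_per_image_per_s(system):
    """Dollars of list price per image/second of throughput."""
    return system["list_price"] / system["images_per_s"]

dgx1_cost = cost_per_image_per_s(dgx1)        # ~$391 per image/s
powerai_cost = cost_per_image_per_s(powerai)  # ~$245 per image/s

ratio = dgx1_cost / powerai_cost
print(f"Price/performance advantage: {ratio:.1f}x")  # 1.6x
```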
20. NVLink and P100 advantage
• NVLink reduces communication time and overhead
• Incorporates the fastest GPU for deep learning
• Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times
[Chart: ImageNet / AlexNet, minibatch size = 128: x86-based GPU system 170 ms vs. POWER8 + Tesla P100 + NVLink 78 ms]
IBM advantage: data communication and GPU performance
21. Spark Machine Learning performance tuning on OpenPOWER
What knobs can you tweak?
27. Performance Tuning Tips for a Machine Learning Workload
Methodology: Alternating Least Squares (ALS) based matrix factorization application

Top-down approach:
• Application
• Large number of Spark tunables: Spark executors, Spark cores, …
• Default configurations, out-of-the-box performance

Optimization process:
• Spark executor instances
• Spark executor cores
• Spark executor memory
• Spark shuffle location and manager
• RDD persistence storage level

Bottom-up approach:
• System hardware
• Characterizing the workload through resource monitoring
• Custom Spark tunables from configuration sweeps
• Roofline performance
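As a concrete illustration of these knobs, here is a minimal PySpark configuration sketch for an ALS-based matrix factorization job. All values (executor counts, memory sizes, the input path, and the column names) are illustrative placeholders, not tuned recommendations:

```python
from pyspark import StorageLevel
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

# Executor instances / cores / memory: the first three knobs in the sweep.
# On YARN these are normally passed at submit time (spark-submit --num-executors ...).
spark = (SparkSession.builder
         .appName("ALSMatrixFactorization")
         .config("spark.executor.instances", "16")
         .config("spark.executor.cores", "8")
         .config("spark.executor.memory", "48g")
         # Shuffle location: point spark.local.dir at fast local storage
         .config("spark.local.dir", "/nvme/spark-tmp")
         # Shuffle manager ("sort" is the default in Spark 2.x)
         .config("spark.shuffle.manager", "sort")
         .getOrCreate())

# Hypothetical ratings table with (user, item, rating) columns
ratings = spark.read.parquet("hdfs:///data/ratings")

# RDD/DataFrame persistence storage level: the last knob in the list
ratings.persist(StorageLevel.MEMORY_AND_DISK)

als = ALS(rank=20, maxIter=10, regParam=0.05,
          userCol="user", itemCol="item", ratingCol="rating")
model = als.fit(ratings)
```

Sweeping executor instances against cores per executor while holding total cores fixed is one common way to run the configuration sweeps the slide describes.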
36. IO Optimizations
How to take advantage of faster networks?
37. THE GAP – HIGH-PERFORMANCE NETWORKS
Comparing 1, 10, and 40 Gbps networks
[Chart: runtime (secs) across network speeds]
38. THE PERFORMANCE LOSS IN THE BIG-DATA STACK
High-performance I/O devices lose throughput to:
• Data copies
• Context switches
• Cache pollution
• Deep call-stacks
• Legacy I/O interfaces
39. The Crail Architecture WWW.CRAIL.IO
A high-performance data fabric for the Apache data processing stack
• Relies on the principles of user-level I/O
• Separation between control path and data path
• User-space direct-access I/O architecture / layer cut-through
• Builds on a distributed, shared data store
• No changes to the overall data processing framework
• Optimized to serve short-lived data sharing and staging
[Diagram: Spark / Flink / Storm … over HDFS and the Crail store on a high-performance RDMA network, with zero-copy, Spark-specific shuffle and broadcast]
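Because Crail plugs into Spark's existing shuffle and broadcast interfaces, enabling it is a configuration change rather than a code change. The property values below are the class names published by the crail-spark-io project; check them against the current documentation at www.crail.io before use:

```
spark.shuffle.manager    org.apache.spark.shuffle.crail.CrailShuffleManager
spark.broadcast.factory  org.apache.spark.broadcast.CrailBroadcastFactory
```

With these set in spark-defaults.conf, unmodified Spark applications route shuffle and broadcast data through the Crail store.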
40. EVALUATION - TERASORT
[Chart: runtime (seconds) for a 12.8 TB TeraSort, map and reduce phases, Spark vs. Spark/Crail]
128 nodes OpenPOWER cluster
• 2 x IBM POWER8 10-core @ 2.9 GHz
• DRAM: 512GB DDR4
• 4 x 1.2 TB NVMe SSD
• 100GbE Mellanox ConnectX-4 EN (RoCE)
• Ubuntu 16.04 (kernel 4.4.0-31)
• Spark 2.0.2
Performance gain: 6x
• Most gain from reduce phase:
• Crail shuffler much faster than Spark's built-in shuffler
• Dramatically reduced CPU involvement
• Dramatically improved network usage
• Map phase: all activity local
• Still faster than vanilla Spark
41. EVALUATION – TERASORT: NETWORK IO
• Vanilla Spark runs on 100GbE
• Spark/Crail runs on 100Gb RoCE/RDMA
• Vanilla Spark peaks at ~10Gb/s
• Spark/Crail shuffle delivers ~70Gb/s per node
42. EVALUATION – TERASORT CPU EFFICIENCY
• Spark/Crail completes much faster despite comparable CPU load
• Spark/Crail CPU efficiency is close to the 2016 sorting benchmark winner: 3.13 vs. 4.4 GB/min/core
• 2016 winner runs native C code!
System        | Size (TB) | Time (s) | Cores | Nodes | NW (Gb/s) | Rate (TB/min) | Rate/core (GB/min)
Spark + Crail | 12.8      | 98       | 2560  | 128   | 100       | 7.8           | 3.13
Spark 2.0.2   | 12.8      | 527      | 2560  | 128   | 100       | 1.4           | 0.58
Winner 2014   | 100       | 1406     | 6592  | 206   | 10        | 4.27          | 0.66
Winner 2016   | 100       | 98.6     | 10240 | 512   | 100       | 44.78         | 4.4
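The rate rows can be reproduced from the size, time, and core counts above (small differences from the slide's figures are rounding in the source):

```python
def sort_rate(size_tb, time_s, cores):
    """Return (TB/min, GB/min/core) for a sorting run."""
    tb_per_min = size_tb * 60.0 / time_s
    gb_per_min_core = tb_per_min * 1000.0 / cores
    return tb_per_min, gb_per_min_core

crail_rate, crail_core = sort_rate(12.8, 98, 2560)   # ~7.8 TB/min, ~3.1 GB/min/core
spark_rate, spark_core = sort_rate(12.8, 527, 2560)  # ~1.5 TB/min, ~0.57 GB/min/core

print(f"Spark/Crail: {crail_rate:.1f} TB/min, {crail_core:.2f} GB/min/core")
print(f"Spark 2.0.2: {spark_rate:.1f} TB/min, {spark_core:.2f} GB/min/core")
print(f"End-to-end speedup: {527 / 98:.1f}x")  # ~5.4x overall; most of the gain
                                               # comes from the reduce phase
```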
43. CRAIL WITH THE HORTONWORKS STACK
[Diagram: user interfaces, compute frameworks, and a resource manager over HDFS (scalable, fault-tolerant, cost-efficient storage), with Crail plugins for broadcast, shuffle, RPCs, caching, and a key-value store on the high-performance Crail fabric]
47. NOTICES AND DISCLAIMERS, CONT'D.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those
products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of
non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business
Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®,
OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®,
Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and
System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or
other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.