This document discusses optimizing Apache Spark machine learning workloads on OpenPOWER platforms. It provides an overview of Spark, machine learning, and deep learning. It then discusses how OpenPOWER systems are well-suited for these workloads due to features like high memory bandwidth, large caches, and GPU support. The document outlines various techniques for tuning Spark performance on OpenPOWER, such as configuration of executors, cores, memory, and storage levels. It also presents examples analyzing the performance of a matrix factorization machine learning application under different Spark configurations.
5. Machine Learning and Deep Learning (ML/DL)
What you and I (our brains) do without even thinking about it... we recognize a bicycle
Apr 7, 2017
(c) International Business Machines (IBM) 2017
6. Now machines are learning the way we learn…
From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal.
Artificial Neural Networks
7. But training needs a lot of computational resources
Easy scale-out frameworks exist for data processing, but model training is not easy to distribute:
• training can take hours, days, or weeks
• input data and model sizes are larger than ever (e.g. video input, billions of features)
• Moore's law is dying
Resulting in a need for real-time analytics with:
• whole-system optimization
• offloaded computation
• accelerators, and
• higher memory bandwidth systems
9. OpenPOWER: open hardware for high performance
Systems designed for big data analytics and superior cloud economics. Up to:
• 12 cores per CPU
• 96 hardware threads per CPU (SMT8)
• 1 TB RAM
• 7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming…
[Chart: OpenPOWER vs. traditional Intel x86]
http://www.softlayer.com/POWER-SERVERS
https://mc.jarvice.com/
12. Introducing Minsky, the S822LC OpenPOWER system for HPC
The first custom-built GPU accelerator server with NVLink
2.5x faster CPU-GPU data communication via NVLink: the POWER8 NVLink server connects CPUs and NVIDIA P100 Pascal GPUs over NVLink at 80 GB/s, while x86 servers have no NVLink between CPU and GPU and connect over PCIe at 32 GB/s, a PCIe bottleneck.
• Custom-built GPU accelerator server
• High-speed NVLink connections between CPUs & GPUs and among GPUs
• Features the novel NVIDIA P100 Pascal GPU accelerator
M. Gschwind, "Bringing the Deep Learning Revolution into the Enterprise"
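The quoted link speeds translate directly into staging times for training data. A quick back-of-the-envelope sketch (the 80 GB/s and 32 GB/s figures come from the slide; the 16 GB transfer size is an arbitrary illustration):

```python
# Peak CPU-GPU link bandwidth, as quoted on the slide (GB/s)
NVLINK_GBS = 80.0   # POWER8 <-> P100 over NVLink
PCIE_GBS = 32.0     # x86 CPU <-> GPU over PCIe

speedup = NVLINK_GBS / PCIE_GBS
print(f"NVLink vs. PCIe speedup: {speedup:.1f}x")   # 2.5x, matching the slide

# Time to stage a hypothetical 16 GB batch of training data onto the GPU
transfer_gb = 16.0
print(f"PCIe:   {transfer_gb / PCIE_GBS * 1000:.0f} ms")    # 500 ms
print(f"NVLink: {transfer_gb / NVLINK_GBS * 1000:.0f} ms")  # 200 ms
```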
13. Deep Learning on OpenPOWER with GPUs
Transparent acceleration without code changes
14. Introducing PowerAI: Get started fast with Deep Learning
• Package of pre-compiled major deep learning frameworks
• Easy to install and get started with
• Deep learning with enterprise-class support
• Optimized for performance to take advantage of NVLink
• Enabled by high-performance computing infrastructure
https://www.ibm.com/ms-en/marketplace/deep-learning-platform
15. Machine and Deep Learning analytics on OpenPOWER
no code changes needed!!
ATLAS (Automatically Tuned Linear Algebra Software)
https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/DeepLearning4J_Deep_Learning_with_Java_Spark_and_Power?lang=en
17. Enabling Accelerators/GPUs in the cloud stack
[Stack diagram: deep learning training + inference running on clustering frameworks, containers and images, and accelerators]
18. TensorFlow on Tesla P100: PowerAI is 30% faster
IBM S822LC: 20 cores @ 2.86 GHz, 512 GB memory, 4x NVIDIA Tesla P100 GPUs, Ubuntu 16.04, CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.12.0, Inception v3 benchmark (64-image minibatch)
Intel Broadwell E5-2640 v4: 20 cores @ 2.6 GHz, 512 GB memory, 4x NVIDIA Tesla P100 GPUs, Ubuntu 16.04, CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.12.0, Inception v3 benchmark (64-image minibatch)
Larger value is better.
19. PowerAI vs. DGX-1: 1.6x TensorFlow throughput per dollar
▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage of the full capabilities of NVLink
▪ For image classification and analysis this means a 1.6x price-performance advantage relative to the NVIDIA DGX-1
System                                 | Images/Second | List Price | $ / Image / Second
NVIDIA DGX-1 (8 P100 GPUs, 512 GB Mem) | 330           | $129,000   | $390
PowerAI (4 P100 GPUs, 512 GB Mem)      | 273           | $67,000    | $241
Lower cost is better
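The 1.6x headline follows directly from the table. A small sketch reproducing the arithmetic (the slide rounds the per-unit costs slightly differently):

```python
# List price and measured throughput from the table above
dgx1 = {"images_per_s": 330, "list_price": 129_000}
powerai = {"images_per_s": 273, "list_price": 67_000}

def cost_per_image_per_s(system):
    """Dollars of list price per image/second of throughput."""
    return system["list_price"] / system["images_per_s"]

dgx1_cost = cost_per_image_per_s(dgx1)        # ~$391 per image/s
powerai_cost = cost_per_image_per_s(powerai)  # ~$245 per image/s

ratio = dgx1_cost / powerai_cost
print(f"Price/performance advantage: {ratio:.1f}x")  # 1.6x
```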
20. NVLink and P100 advantage
• NVLink reduces communication time and overhead
• Incorporates the fastest GPU for deep learning
• Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times
[Chart: ImageNet / AlexNet, minibatch size = 128: x86-based GPU system 170 ms vs. POWER8 + Tesla P100 + NVLink 78 ms]
IBM advantage: data communication and GPU performance
21. Spark Machine Learning performance tuning on OpenPOWER
What knobs can you tweak?
27. Performance Tuning Tips for a Machine Learning Workload
Methodology: Alternating Least Squares (ALS) based matrix factorization application

Top-down approach:
• Application
• Large number of Spark tunables: Spark executors, Spark cores, …
• Default configurations, out-of-the-box performance

Optimization process:
• Spark executor instances
• Spark executor cores
• Spark executor memory
• Spark shuffle location and manager
• RDD persistence storage level

Bottom-up approach:
• System hardware
• Characterizing the workload through resource monitoring
• Custom Spark tunables from configuration sweeps
• Roofline performance
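As a concrete illustration of these knobs, here is a minimal PySpark configuration sketch for an ALS-based matrix factorization job. All values (executor counts, memory sizes, the input path, and the column names) are illustrative placeholders, not tuned recommendations:

```python
from pyspark import StorageLevel
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

# Executor instances / cores / memory: the first three knobs in the sweep.
# On YARN these are normally passed at submit time (spark-submit --num-executors ...).
spark = (SparkSession.builder
         .appName("ALSMatrixFactorization")
         .config("spark.executor.instances", "16")
         .config("spark.executor.cores", "8")
         .config("spark.executor.memory", "48g")
         # Shuffle location: point spark.local.dir at fast local storage
         .config("spark.local.dir", "/nvme/spark-tmp")
         # Shuffle manager ("sort" is the default in Spark 2.x)
         .config("spark.shuffle.manager", "sort")
         .getOrCreate())

# Hypothetical ratings table with (user, item, rating) columns
ratings = spark.read.parquet("hdfs:///data/ratings")

# RDD/DataFrame persistence storage level: the last knob in the list
ratings.persist(StorageLevel.MEMORY_AND_DISK)

als = ALS(rank=20, maxIter=10, regParam=0.05,
          userCol="user", itemCol="item", ratingCol="rating")
model = als.fit(ratings)
```

Sweeping executor instances against cores per executor while holding total cores fixed is one common way to run the configuration sweeps the slide describes.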
36. IO Optimizations
How to take advantage of faster networks?
37. THE GAP – HIGH-PERFORMANCE NETWORKS
Comparing 1, 10, and 40 Gbps networks
[Chart: runtime (secs) across network speeds]
38. THE PERFORMANCE LOSS IN THE BIG-DATA STACK
High-performance I/O devices lose throughput to:
• Data copies
• Context switches
• Cache pollution
• Deep call-stacks
• Legacy I/O interfaces
39. The Crail Architecture WWW.CRAIL.IO
A high-performance data fabric for the Apache data processing stack
• Relies on the principles of user-level I/O
• Separation between control path and data path
• User-space direct-access I/O architecture / layer cut-through
• Builds on a distributed, shared data store
• No changes to the overall data processing framework
• Optimized to serve short-lived data sharing and staging
[Diagram: Spark / Flink / Storm … over HDFS and the Crail store on a high-performance RDMA network, with zero-copy, Spark-specific shuffle and broadcast]
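Because Crail plugs into Spark's existing shuffle and broadcast interfaces, enabling it is a configuration change rather than a code change. The property values below are the class names published by the crail-spark-io project; check them against the current documentation at www.crail.io before use:

```
spark.shuffle.manager    org.apache.spark.shuffle.crail.CrailShuffleManager
spark.broadcast.factory  org.apache.spark.broadcast.CrailBroadcastFactory
```

With these set in spark-defaults.conf, unmodified Spark applications route shuffle and broadcast data through the Crail store.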
40. EVALUATION - TERASORT
[Chart: runtime (seconds) for a 12.8 TB TeraSort, map and reduce phases, Spark vs. Spark/Crail]
128 nodes OpenPOWER cluster
• 2 x IBM POWER8 10-core @ 2.9 GHz
• DRAM: 512GB DDR4
• 4 x 1.2 TB NVMe SSD
• 100GbE Mellanox ConnectX-4 EN (RoCE)
• Ubuntu 16.04 (kernel 4.4.0-31)
• Spark 2.0.2
Performance gain: 6x
• Most gain from reduce phase:
• Crail shuffler much faster than Spark's built-in shuffler
• Dramatically reduced CPU involvement
• Dramatically improved network usage
• Map phase: all activity local
• Still faster than vanilla Spark
41. EVALUATION – TERASORT: NETWORK IO
• Vanilla Spark runs on 100GbE
• Spark/Crail runs on 100Gb RoCE/RDMA
• Vanilla Spark peaks at ~10Gb/s
• Spark/Crail shuffle delivers ~70Gb/s per node
42. EVALUATION – TERASORT CPU EFFICIENCY
• Spark/Crail completes much faster despite comparable CPU load
• Spark/Crail CPU efficiency is close to the 2016 sorting benchmark winner: 3.13 vs. 4.4 GB/min/core
• 2016 winner runs native C code!
System        | Size (TB) | Time (s) | Cores | Nodes | NW (Gb/s) | Rate (TB/min) | Rate/core (GB/min)
Spark + Crail | 12.8      | 98       | 2560  | 128   | 100       | 7.8           | 3.13
Spark 2.0.2   | 12.8      | 527      | 2560  | 128   | 100       | 1.4           | 0.58
Winner 2014   | 100       | 1406     | 6592  | 206   | 10        | 4.27          | 0.66
Winner 2016   | 100       | 98.6     | 10240 | 512   | 100       | 44.78         | 4.4
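The rate rows can be reproduced from the size, time, and core counts above (small differences from the slide's figures are rounding in the source):

```python
def sort_rate(size_tb, time_s, cores):
    """Return (TB/min, GB/min/core) for a sorting run."""
    tb_per_min = size_tb * 60.0 / time_s
    gb_per_min_core = tb_per_min * 1000.0 / cores
    return tb_per_min, gb_per_min_core

crail_rate, crail_core = sort_rate(12.8, 98, 2560)   # ~7.8 TB/min, ~3.1 GB/min/core
spark_rate, spark_core = sort_rate(12.8, 527, 2560)  # ~1.5 TB/min, ~0.57 GB/min/core

print(f"Spark/Crail: {crail_rate:.1f} TB/min, {crail_core:.2f} GB/min/core")
print(f"Spark 2.0.2: {spark_rate:.1f} TB/min, {spark_core:.2f} GB/min/core")
print(f"End-to-end speedup: {527 / 98:.1f}x")  # ~5.4x overall; most of the gain
                                               # comes from the reduce phase
```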
43. CRAIL WITH THE HORTONWORKS STACK
[Diagram: user interfaces, compute frameworks, and a resource manager over HDFS (scalable, fault-tolerant, cost-efficient storage), with Crail plugins for broadcast, shuffle, RPCs, caching, and a key-value store on the high-performance Crail fabric]
47. NOTICES AND DISCLAIMERS, CONT'D.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those
products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of
non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business
Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®,
OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®,
Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and
System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or
other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.