Optimizing Hortonworks Apache Spark machine learning workloads for contemporary Open Platforms
Raj Krishnamurthy, Indrajit Poddar (I.P), IBM Systems
Animesh Trivedi, Bernard Metzler, IBM Research
© International Business Machines (IBM) 2017
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s
sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be
relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver
any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole
discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The
actual throughput or performance that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
Agenda
Spark, Machine Learning and Deep Learning Overview
Why OpenPOWER?
Deep Learning with OpenPOWER GPUs
Spark Machine Learning performance tuning with OpenPOWER CPUs
IO Optimization for Spark TeraSort benchmark
What is Apache Spark?
• Unified Analytics Platform
– Combines streaming, graph, machine learning, and SQL analytics on a single platform
– Simplified, multi-language programming model
– Interactive and batch
• In-Memory Design
– Pipelines multiple iterations over a single copy of data in memory
– Superior performance
– Natural successor to MapReduce
A fast and general engine for large-scale data processing: language bindings for R, Scala, SQL, Python, and Java sit on the Spark Core API, alongside the Spark SQL, Streaming, MLlib, and GraphX libraries.
Machine Learning and Deep Learning (ML/DL)
What you and I (our brains) do without even thinking about it… we recognize a bicycle.
Apr 7, 2017
Now machines are learning the way we learn…
Artificial Neural Networks (illustration from "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal).
But training needs a lot of computational resources
Easy scale-out with cluster frameworks — but model training is not easy to distribute.
Training can take hours, days, or weeks. Input data and model sizes are becoming larger than ever (e.g., video input, billions of features). Moore's law is dying.
Resulting in a need for real-time analytics with:
• whole-system optimization
• offloaded computation
• accelerators, and
• higher memory-bandwidth systems
Today’s challenges demand whole system innovation
Chart: data growth from 2010 to 2020 — unstructured data dwarfing structured data on the way to 44 zettabytes ("you are here") — against flattening processor price/performance (Moore’s Law, 2000–2020).
Data holds competitive value. Full system and stack open innovation is required: processor technology, firmware / OS, accelerators, software, storage, and network.
OpenPOWER: open hardware for high performance
Systems designed for big data analytics and superior cloud economics.
Up to:
12 cores per CPU
96 hardware threads per CPU
1 TB RAM
7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming…
Diagram: OpenPOWER vs. traditional Intel x86 systems. Try OpenPOWER in the cloud:
http://www.softlayer.com/POWER-SERVERS
https://mc.jarvice.com/
OpenPOWER Ecosystem – Members
POWER8 Processor - Design

Diagram: the POWER8 chip — cores, memory interface control, memory buffers, IBM & partner devices attached over CAPI/PCIe, DMI links, and SMP interconnect. 22nm SOI with eDRAM, 15 metal layers, 650 mm².

Cores
• 12 cores / 8 threads per core
• TDP: 130W and 190W
• 64 KB data cache, 32 KB instruction cache
Accelerators
• Crypto & memory expansion
• Transactional Memory
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
Memory Subsystem
• Memory buffers with 128 MB cache
• ~70 ns latency to memory
Bus Interfaces
• Durable Memory attach Interface (DMI)
• Integrated PCIe Gen3
• SMP interconnect for up to 4 sockets

Coherent Accelerator Processor Interface (CAPI)
Virtual Addressing
• The accelerator can work with the same memory addresses that the processors use
• Pointers are de-referenced the same as in the host application
• Removes OS & device-driver overhead
Hardware-Managed Cache Coherence
• Enables the accelerator to participate in “locks” as a normal thread
• Lowers latency over an IO communication model
Ecosystem
• 6 hardware partners developing with CAPI
• Over 20 CAPI solutions, all listed at http://ibm.biz/powercapi
• Examples of available CAPI solutions: IBM Data Engine for NoSQL, DRC Graphfind analytics, Erasure Code Acceleration for Hadoop

Newly announced OpenPOWER systems and solutions:
http://openpowerfoundation.org/wp-content/uploads/2016/04/HardwareRevealFlyerFinal.pdf
Introducing Minsky: the S822LC OpenPOWER system for HPC
The first custom-built GPU accelerator server with NVLink
2.5x faster CPU–GPU data communication via NVLink: on the POWER8 NVLink server, P8 CPUs and GPUs are linked by NVLink at 80 GB/s, while x86 servers have no NVLink between CPU & GPU and are bottlenecked by PCIe at 32 GB/s.
• Custom-built GPU accelerator server
• High-speed NVLink connections between CPUs & GPUs and among GPUs
• Features the new NVIDIA P100 Pascal GPU accelerator
M.Gschwind, Bringing the Deep Learning Revolution into the Enterprise
Deep Learning on OpenPOWER with GPUs
Transparent acceleration without code changes
Introducing PowerAI: Get started fast with Deep Learning
• Package of pre-compiled major deep learning frameworks
• Easy to install & get started with deep learning, with enterprise-class support
• Enabled by high-performance computing infrastructure, tuned for performance to take advantage of NVLink
https://www.ibm.com/ms-en/marketplace/deep-learning-platform
Machine and Deep Learning analytics on OpenPOWER — no code changes needed!
ATLAS (Automatically Tuned Linear Algebra Software)
https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/DeepLearning4J_Deep_Learning_with_Java_Spark_and_Power?lang=en
OpenPOWER: GPU support
Credit: Kevin Klaues, Mesosphere
Mesos supports GPU scheduling
Huge speed-ups with GPUs and OpenPOWER!
Enabling Accelerators/GPUs in the cloud stack
Diagram: the cloud stack for deep learning training + inference — containers and images, clustering frameworks, and accelerators.
TensorFlow on Tesla P100: PowerAI is 30% faster
IBM S822LC: 20 cores, 2.86 GHz, 512 GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 benchmark (64-image minibatch)
Intel Broadwell E5-2640v4: 20 cores, 2.6 GHz, 512 GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 benchmark (64-image minibatch)
Larger value is better
PowerAI vs. DGX-1: 1.6x TensorFlow throughput per dollar
▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage
of the full capabilities of NVLink
▪ For image classification and analysis this means a 1.6X price
performance advantage relative to the NVIDIA DGX-1
| System | Images / second | List price | $ / image / second |
|---|---|---|---|
| NVIDIA DGX-1 (8 P100 GPUs, 512 GB mem) | 330 | $129,000 | $390 |
| PowerAI (4 P100 GPUs, 512 GB mem) | 273 | $67,000 | $241 |
Lower cost is better
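The price/performance figures in the table above follow from a few lines of arithmetic (throughput and list prices are the slide's numbers; the slide's $241 presumably reflects rounding or a slightly different list price):

```python
# Derive $ per image/second and the price/performance ratio
# from the throughput and list-price numbers on this slide.
dgx1_images_per_s, dgx1_price = 330, 129_000
powerai_images_per_s, powerai_price = 273, 67_000

dgx1_cost = dgx1_price / dgx1_images_per_s          # ~$391 per image/second
powerai_cost = powerai_price / powerai_images_per_s  # ~$245 per image/second

advantage = dgx1_cost / powerai_cost
print(f"{advantage:.1f}x")  # ~1.6x throughput per dollar for PowerAI
```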
NVLink and P100 advantage
• NVLink reduces communication time and overhead
• Incorporating the fastest GPU for deep learning
• Data gets from GPU-GPU, Memory-GPU faster, for shorter training times
ImageNet / AlexNet (minibatch size = 128): 170 ms on an x86-based GPU system vs. 78 ms on POWER8 + Tesla P100 with NVLink.
IBM advantage: data communication
and GPU performance
Spark Machine Learning performance tuning on OpenPOWER
What knobs can you tweak?
Spark on OpenPOWER
• Streaming and SQL benefit from high thread density and concurrency
• Processing multiple packets of a stream and different stages of a message-stream pipeline
• Processing multiple rows from a query
• Machine Learning benefits from Large Caches and Memory Bandwidth
• Iterative Algorithms on the same data
• Fewer core pipeline stalls and overall higher throughput
• Graph algorithms also benefit from large caches, memory bandwidth, and higher thread strength
• Flexibility to go from 8 SMT threads per core to 4 or 2
• Manage the balance between per-thread performance and throughput
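As a concrete (hypothetical) illustration of that SMT flexibility — core counts vary by system; a 2-socket, 10-core-per-socket POWER8 node is assumed here:

```python
# Hardware threads exposed to Spark at each POWER8 SMT mode, for an
# assumed 2-socket x 10-core node (adjust CORES for your machine).
CORES = 2 * 10

for smt in (8, 4, 2):
    threads = CORES * smt
    print(f"SMT-{smt}: {threads} hardware threads available to executors")
```

Dropping from SMT-8 to SMT-4 or SMT-2 trades aggregate throughput for stronger per-thread performance, which is the balance the bullet above refers to.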
• Headroom
• Balanced resource utilization, more efficient scale-out
• Multi-tenant deployments
Roofline SPARK Performance Model
Spark performance moves from "out of box," through "good enough," toward the "roofline" as tunables are explored. Roofline navigation uses system-resource workload characterization and analysis to look for fundamental inefficiencies.

The naive search space is nested:
FOR 1 … MAX WORKERS
  FOR 1 … MAX CPUS PER NODE
    FOR 1 … MAX THREADS PER CPU
      FOR 1 … MAX PARTITIONS

An exhaustive sweep is unwieldy & complicated (with some respite in ML workloads from data sampling), which motivates a performance navigation automation script.
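The nested FOR loops above can be sketched directly; the bounds here are illustrative placeholders, chosen only to show how quickly the space grows:

```python
from itertools import product

# Illustrative bounds for the four sweep dimensions named above.
MAX_WORKERS, MAX_CPUS_PER_NODE, MAX_THREADS_PER_CPU, PARTITION_CHOICES = 6, 2, 8, 4

configs = list(product(range(1, MAX_WORKERS + 1),
                       range(1, MAX_CPUS_PER_NODE + 1),
                       range(1, MAX_THREADS_PER_CPU + 1),
                       range(1, PARTITION_CHOICES + 1)))

# 6 * 2 * 8 * 4 = 384 runs even at these tiny bounds -- hence the need
# for an automation script (and for data sampling in ML workloads).
print(len(configs))
```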
Performance Tuning Tips for a Machine Learning Workload
Methodology: an Alternating Least Squares-based matrix factorization application.

Top-down approach (application):
• Large number of Spark tunables — Spark executors, Spark cores, …
• Default configurations give "out of box" performance
• Optimization process: Spark executor instances, Spark executor cores, Spark executor memory, Spark shuffle location and manager, RDD persistence storage level

Bottom-up approach (system hardware):
• Characterize the workload through resource monitoring
• Derive custom Spark tunables from configuration sweeps
• Approach "roofline" performance
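As a sketch, the five knobs in the optimization process map onto a spark-submit invocation roughly as follows (the values and the `mf-app.jar` name are placeholders, not the tuned settings from this study; `spark.shuffle.manager` is a Spark 1.x property):

```shell
# Executor instances, cores, and memory are submit-time flags;
# shuffle location and manager are Spark configuration properties.
spark-submit \
  --num-executors 6 \
  --executor-cores 40 \
  --executor-memory 480g \
  --conf spark.local.dir=/data/shuffle \
  --conf spark.shuffle.manager=sort \
  mf-app.jar
# The RDD persistence storage level is chosen in application code, e.g.
#   rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```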
WorkFlow
• Matrix Factorization from SPARKBENCH - https://github.com/SparkTC/spark-bench
• Training
• Validation
• Prediction
With permission - Raj Krishnamurthy STRATA NYC 2016
Parameters used for data generation in MF application
Matrix Factorization with Alternating Least Squares
Data generation parameters:

| Parameter | Value |
|---|---|
| Rows in data matrix | 62000 |
| Columns in data matrix | 62000 |
| Data set size | 100 GB |

Spark parameters for MF:

| Parameter | Value |
|---|---|
| Master node | 1 |
| Worker nodes | 6 |
| Executors per node | 1 |
| Executor cores | 80 / 40 / 24 |
| Executor memory | 480 GB |
| Shuffle location | HDDs |
| Input storage | HDFS |
| Job | Function | Description / API called |
|---|---|---|
| 7 | Mean at MFApp.java | AbstractJavaRDDLike.map, MatrixFactorizationModel.predict, JavaDoubleRDD.mean |
| 6 | Aggregate at MFModel.scala | MatrixFactorizationModel.predict, MatrixFactorizationModel.countApproxDistinctUserProduct |
| 5 | First at MFModel.scala | ml.recommendation.ALS.computeFactors |
| 4 | First at MFModel.scala | ml.recommendation.ALS.computeFactors |
| 3 | Count at ALS.scala | ALS.train and ALS.initialize |
| 2 | Count at ALS.scala | ALS.train |
| 1 | Count at ALS.scala | ALS.train |
| 0 | Count at ALS.scala | ALS.train |
Analyzing SPARK Configuration Sweep
Various configurations tried in optimizing the MF application on Spark:

| Config | Executor cores | GC options | RDD compression | Storage level | Partitions | Shuffle manager | Runtime (min) |
|---|---|---|---|---|---|---|---|
| 1 | 80 | Default | TRUE | memory_and_disk | 1000 | Sort-based | 40 |
| 2 | 80 | Default | FALSE | memory_only | 1000 | Sort-based | 34 |
| 3 | 40 | Default | FALSE | memory_only | 1000 | Sort-based | 26 |
| 4 | 40 | ParallelGCThreads=40 | FALSE | memory_only | 1000 | Sort-based | 24 |
| 5 | 40 | ParallelGCThreads=40 | TRUE | memory_and_disk_ser | 1000 | Sort-based | 20 |
| 6 | 40 | ParallelGCThreads=40 | TRUE | memory_only_ser | 1000 | Sort-based | 25 |
| 7 | 40 | ParallelGCThreads=40 | FALSE | memory_only | 800 | Sort-based | 26 |
| 8 | 40 | ParallelGCThreads=40 | FALSE | memory_only | 1200 | Sort-based | 27 |
| 9 | 24 | ParallelGCThreads=24 | FALSE | memory_and_disk_ser | 1000 | Sort-based | 21 |
| 10 | 24 | ParallelGCThreads=24 | FALSE | memory_and_disk_ser | 1000 | Tungsten-sort | 19 |
| 11 | 24 | Default | FALSE | memory_and_disk_ser | 1000 | Tungsten-sort | 18 |
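For example, configuration 5 in the sweep corresponds roughly to the following `spark-defaults.conf` entries (property names as of Spark 1.x, the era of this study — `spark.shuffle.manager` was removed in Spark 2.0; the `-XX:+UseParallelGC` flag is an assumption about which collector the GC-thread setting targets):

```shell
spark.executor.cores             40
spark.executor.extraJavaOptions  -XX:+UseParallelGC -XX:ParallelGCThreads=40
spark.rdd.compress               true
spark.default.parallelism        1000
spark.shuffle.manager            sort
# The storage level (MEMORY_AND_DISK_SER for configuration 5) is set in
# application code via rdd.persist(StorageLevel.MEMORY_AND_DISK_SER).
```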
GC and Memory Footprint
| Configuration | Run time of last stage | GC time of last stage |
|---|---|---|
| 1 | 12 min | 4.4 min |
| 4 | 4.4 min | 1.8 min |
| 9 | 3.5 min | 1.6 min |
| 11 | 47 s | 16 s |

Run time and GC time of Stage 68 for different configurations
Last Stage Analysis
Characterizing Configuration #1
Figure: CPU utilization on a worker node (configuration 1).
Figure: memory utilization on a worker node (configuration 1).
Characterizing Configuration #1 and Configuration #11
Memory footprint of configuration 11
Summary - How to Optimize Closer to Roofline Performance Faster?
• Classify workload into CPU, memory, IO or mixed (CPU, memory, IO) intensive
• Characterize “out-of-the-box” workload to understand CPU, Memory, IO and Network performance
characteristics
• Floorplan cluster resources
• Tune the “out-of-the-box” workload to navigate the “roofline” performance space in the above-named dimensions
– If the workload is memory/IO/network bound, tune Spark to increase operational intensity (operations/byte) as much as possible, to make it CPU bound
• Divide search space into regions and perform exhaustive search
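The first classification step can be mocked up as a small helper; the 80% saturation threshold is an illustrative assumption, not a figure from the deck:

```python
def classify_workload(cpu, memory, io, network, threshold=0.8):
    """Given utilization fractions (0..1) from resource monitoring,
    name the bound resource, or 'mixed' if several saturate."""
    utils = {"cpu": cpu, "memory": memory, "io": io, "network": network}
    bound = [name for name, u in utils.items() if u >= threshold]
    if not bound:
        return "underutilized"
    return bound[0] if len(bound) == 1 else "mixed"

print(classify_workload(0.95, 0.40, 0.20, 0.10))  # cpu
print(classify_workload(0.85, 0.90, 0.30, 0.10))  # mixed
```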
IO Optimizations
How to take advantage of faster networks?
THE GAP – HIGH-PERFORMANCE NETWORKS
Chart: Spark runtime in seconds over 1, 10, and 40 Gbps networks.
THE PERFORMANCE LOSS IN THE BIG-DATA STACK
High-Performance
I/O devices
• Data copies
• Context switches
• Cache pollution
• Deep call-stacks
• Legacy I/O interfaces
The Crail Architecture — www.crail.io
 A high-performance data fabric for the Apache Data
Processing Stack
 Relies on the principles of user level IO
 Separation between control path and data path
 User-space direct-access I/O architecture/layer cut-through
 Builds on a distributed, shared data store
 No changes to overall data processing framework
 Is optimized to serve short-lived data sharing and staging
Diagram: Spark / Flink / Storm … run unmodified over HDFS and the Crail Store, which sits on a high-performance RDMA network with zero-copy transfers and Spark-specific shuffle and broadcast plugins.
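Consistent with "no changes to the overall data processing framework," Crail is wired into Spark through configuration rather than code. The property values below follow the crail-spark-io project's documented plugin classes — treat the exact class names and paths as assumptions to verify against the release you deploy:

```shell
# spark-defaults.conf: swap in the Crail shuffle and broadcast plugins.
spark.shuffle.manager          org.apache.spark.shuffle.crail.CrailShuffleManager
spark.broadcast.factory        org.apache.spark.broadcast.CrailBroadcastFactory
spark.driver.extraClassPath    $CRAIL_HOME/jars/*
spark.executor.extraClassPath  $CRAIL_HOME/jars/*
```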
EVALUATION - TERASORT
Chart: TeraSort runtime in seconds (0–600) for a 12.8 TB data set, Spark vs. Spark/Crail, broken into map and reduce phases.
128-node OpenPOWER cluster
• 2 x IBM POWER8 10-core @ 2.9 GHz
• DRAM: 512 GB DDR4
• 4 x 1.2 TB NVMe SSD
• 100 GbE Mellanox ConnectX-4 EN (RoCE)
• Ubuntu 16.04 (kernel 4.4.0-31)
• Spark 2.0.2
Performance gain: 6x
• Most gain from the reduce phase:
• The Crail shuffler is much faster than Spark’s built-in shuffler
• Dramatically reduced CPU involvement
• Dramatically improved network usage
• Map phase: all activity is local
• Still faster than vanilla Spark
EVALUATION – TERASORT: NETWORK IO
• Vanilla Spark runs on 100GbE
• Spark/Crail runs on 100Gb RoCE/RDMA
• Vanilla Spark peaks at ~10Gb/s
• Spark/Crail shuffle delivers ~70Gb/s per node
EVALUATION – TERASORT CPU EFFICIENCY
• Spark/Crail completes much faster despite comparable CPU load
• Spark/Crail CPU efficiency is close to 2016 sorting benchmark winner: 3.13
vs. 4.4 GB/min/core
• 2016 winner runs native C code!
| | Spark + Crail | Spark 2.0.2 | Winner 2014 | Winner 2016 |
|---|---|---|---|---|
| Size (TB) | 12.8 | 12.8 | 100 | 100 |
| Time (sec) | 98 | 527 | 1406 | 98.6 |
| Cores | 2560 | 2560 | 6592 | 10240 |
| Nodes | 128 | 128 | 206 | 512 |
| Network (Gb/s) | 100 | 100 | 10 | 100 |
| Rate (TB/min) | 7.8 | 1.4 | 4.27 | 44.78 |
| Rate/core (GB/min) | 3.13 | 0.58 | 0.66 | 4.4 |
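The rate rows follow directly from size, time, and core count (using 1 TB = 1024 GB, which matches the slide's figures):

```python
size_gb = 12.8 * 1024   # 12.8 TB TeraSort data set
time_s = 98             # Spark/Crail runtime
cores = 2560

rate_tb_per_min = size_gb / 1024 / time_s * 60
rate_gb_per_min_per_core = size_gb / time_s * 60 / cores

print(round(rate_tb_per_min, 1))           # 7.8
print(round(rate_gb_per_min_per_core, 2))  # 3.13
```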
CRAIL WITH THE HORTONWORKS STACK
Diagram: the Hortonworks stack — user interfaces, compute frameworks, a resource manager, and scalable, fault-tolerant, cost-efficient storage — with Crail plugins for shuffle, broadcast, RPCs, caching, a key-value store, an HDFS plugin, and more, all running over the high-performance Crail fabric.
Roadmap
Where is OpenPOWER headed?
Accelerator Technology
| | 2015 | 2016 | 2017 |
|---|---|---|---|
| IBM CPUs | POWER8 (OpenPOWER, CAPI interface) | POWER8 with NVLink | POWER9 (enhanced CAPI & NVLink) |
| Interconnect | Connect-IB: FDR InfiniBand, PCIe Gen3 | ConnectX-4: EDR InfiniBand, CAPI over PCIe Gen3 | ConnectX-5: next-gen InfiniBand, enhanced CAPI over PCIe Gen4 |
| NVIDIA GPUs | Kepler: PCIe Gen3 | Pascal: NVLink | Volta: enhanced NVLink |
NOTICES AND DISCLAIMERS
Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial
publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS"
WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT
LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the
agreements under which they are provided.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms
apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used
IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM
operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions
are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as
to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any
actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services
or products will ensure that the customer is in compliance with any law.
NOTICES AND DISCLAIMERS, CONT’D.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those
products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of
non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business
Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®,
OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®,
Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and
System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or
other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Q & A
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 

How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors

  • 1. Optimizing Hortonworks Apache Spark machine learning workloads for contemporary Open Platforms Raj Krishnamurthy, Indrajit Poddar (I.P), IBM Systems Animesh Trivedi, Bernard Metzler, IBM Research © International Business Machines (IBM) 2017
  • 2. Please Note: • IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion. • Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. • The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. • The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. • Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. 2 © International Business Machines (IBM) 2017
  • 3. Agenda Spark, Machine Learning and Deep Learning Overview Why OpenPOWER? Deep Learning with OpenPOWER GPUs Spark Machine Learning performance tuning with OpenPOWER CPUs IO Optimization for Spark TeraSort benchmark 3 © International Business Machines (IBM) 2017
  • 4. What is Apache Spark • Unified Analytics Platform – Combine streaming, graph, machine learning and SQL analytics on a single platform – Simplified, multi-language programming model – Interactive and Batch • In-Memory Design – Pipelines multiple iterations on a single copy of data in memory – Superior Performance – Natural Successor to MapReduce Fast and general engine for large-scale data processing Spark Core API R Scala SQL Python Java Spark SQL Streaming MLlib GraphX 4 © International Business Machines (IBM) 2017
  • 5. Machine Learning and Deep Learning (ML/DL) What you and I (our brains) do without even thinking about it… we recognize a bicycle Apr 7, 2017 (c) International Business Machines (IBM) 2017 5
  • 6. 6 Now machines are learning the way we learn…. From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. Artificial Neural Networks Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 7. But training needs a lot of computational resources. Easy scale-out exists, but model training is not easy to distribute. Training can take hours, days or weeks. Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.). Moore’s law is dying. Real-time analytics results in a need for: • whole system optimization • offloaded computation • accelerators, and • higher memory bandwidth systems Apr 7, 2017 (c) International Business Machines (IBM) 2017 7
  • 8. Today’s challenges demand whole system innovation. Data holds competitive value; full system and stack open innovation is required. (chart: data growth to 44 zettabytes by 2020, dominated by unstructured data; processor price/performance flattening from 2000 to 2020 as Moore’s Law slows; innovation spans firmware/OS, accelerators, software, storage, network) 8 © International Business Machines (IBM) 2017
  • 9. 9 OpenPOWER: open hardware for high performance Systems designed for big data analytics and superior cloud economics Up to: 12 cores per CPU, 96 hardware threads per CPU, 1 TB RAM, 7.6 Tb/s combined I/O bandwidth. GPUs and FPGAs coming… OpenPOWER Traditional Intel x86 http://www.softlayer.com/POWER-SERVERS https://mc.jarvice.com/ Apr 7, 2017 (c) International Business Machines (IBM) 2017
  • 10. 10 OpenPower Ecosystem – Members (c) International Business Machines (IBM) 2017 Apr 7, 2017
  • 11. Memory Interface Control Memory IBM & Partner Devices CAPI/PCI DMI Cores • 12 cores / 8 threads per core • TDP: 130W and 190W • 64K data cache, 32K instruction cache Accelerators • Crypto & memory expansion • Transactional Memory Caches • 512 KB SRAM L2 / core • 96 MB eDRAM shared L3 Memory Subsystem • Memory buffers with 128MB Cache • ~70ns latency to memory Bus Interfaces • Durable Memory attach Interface (DMI) • Integrated PCIe Gen3 • SMP Interconnect for up to 4 sockets Virtual Addressing • Accelerator can work with same memory addresses that the processors use • Pointers de-referenced same as the host application • Removes OS & device driver overhead Hardware Managed Cache Coherence • Enables the accelerator to participate in “Locks” as a normal thread • Lowers Latency over IO communication model 6 Hardware Partners developing with CAPI Over 20 CAPI Solutions • All listed here http://ibm.biz/powercapi Examples of Available CAPI Solutions • IBM Data Engine for NoSQL • DRC Graphfind analytics • Erasure Code Acceleration for Hadoop Coherent Accelerator Processor Interface (CAPI) 22nm SOI, eDRAM, 15 ML 650mm2 SMP http://openpowerfoundation.org/wp-content/uploads/2016/04/HardwareRevealFlyerFinal.pdf Newly Announced OpenPOWER systems and solutions: POWER8 Processor - Design 11 © International Business Machines (IBM) 2017
  • 12. Introducing Minsky S822LC OpenPOWER system for HPC first custom-built GPU accelerator server with NVLink | 12 2.5x Faster CPU-GPU Data Communication via NVLink NVLink 80 GB/s GPU P8 GPU GPU P8 GPU PCIe 32 GB/s GPU x86 GPU GPU x86 GPU No NVLink between CPU & GPU for x86 Servers: PCIe Bottleneck NVIDIA P100 Pascal GPU POWER8 NVLink Server x86 Servers with PCIe • Custom-built GPU Accelerator Server • High-Speed NVLink Connections between CPUs & GPUs and among GPUs • Features novel NVIDIA P100 Pascal GPU accelerator M.Gschwind, Bringing the Deep Learning Revolution into the Enterprise
  • 13. Deep Learning on OpenPOWER with GPUs Transparent acceleration without code changes | 13 Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 14. Introducing PowerAI: Get started fast with Deep Learning 14 Enabled by High Performance Computing Infrastructure Package of Pre-Compiled Major Deep Learning Frameworks Easy to install & get started with Deep Learning with Enterprise-Class Support for Performance To Take Advantage of NVLink https://www.ibm.com/ms-en/marketplace/deep-learning-platform
  • 15. Machine and Deep Learning analytics on OpenPOWER no code changes needed!! 15 ATLAS (Automatically Tuned Linear Algebra Software) https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/DeepLearning4J_Deep_Learning_with_Java_Spark_and_Power?lang=en
  • 16. OpenPOWER: GPU support 16 Credit: Kevin Klaues, Mesosphere Mesos supports GPU scheduling Huge speed-ups with GPUs and OpenPOWER!
  • 17. Enabling Accelerators/GPUs in the cloud stack 17 Deep Learning Training + Inference Containers and images Accelerators Clustering frameworks
  • 18. TensorFlow on Tesla P100: PowerAI is 30% faster 18 IBM S822LC 20-cores 2.86GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 Benchmark (64 image minibatch) Intel Broadwell E5-2640v4 20-core 2.6 GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 Benchmark (64 image minibatch) Larger value is better
  • 19. PowerAI vs DGX-1: 1.6x TensorFlow throughput / dollar 19 ▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage of the full capabilities of NVLink ▪ For image classification and analysis this means a 1.6x price/performance advantage relative to the NVIDIA DGX-1. Lower cost is better:

    System                                 Images/sec   List price   $ / (image/sec)
    NVIDIA DGX-1 (8 P100 GPU, 512GB Mem)   330          $129,000     $390
    PowerAI (4 P100 GPU, 512 GB Mem)       273          $67,000      $241
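The 1.6x claim on this slide can be reproduced from the table's own figures. A minimal sketch (throughputs and list prices are taken from the slide; the division into dollars per unit of sustained throughput is ours, and rounding differs slightly from the slide's $241 figure):

```python
# Price/performance: list price divided by images/second gives
# dollars per unit of sustained training throughput.
systems = {
    "NVIDIA DGX-1 (8x P100, 512GB)": {"images_per_sec": 330, "list_price": 129_000},
    "PowerAI (4x P100, 512GB)":      {"images_per_sec": 273, "list_price": 67_000},
}

cost_per_ips = {
    name: s["list_price"] / s["images_per_sec"] for name, s in systems.items()
}
# DGX-1: ~$391 per image/sec; PowerAI: ~$245 per image/sec
advantage = (cost_per_ips["NVIDIA DGX-1 (8x P100, 512GB)"]
             / cost_per_ips["PowerAI (4x P100, 512GB)"])
print(f"price/performance advantage: {advantage:.2f}x")
```

Using the slide's already-rounded $390 and $241 figures the ratio comes out at 1.62x, which the slide reports as 1.6x.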
  • 20. NVLink and P100 advantage | 20 • NVLink reduces communication time and overhead • Incorporating the fastest GPU for deep learning • Data gets from GPU-GPU, Memory-GPU faster, for shorter training times x86 based GPU system POWER8 + Tesla P100+NVLink ImageNet / Alexnet: Minibatch size = 128 170 ms 78 ms IBM advantage: data communication and GPU performance
  • 21. Spark Machine Learning performance tuning on OpenPOWER What knobs can you tweak? | 21 Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 22. Spark on OpenPOWER • Streaming and SQL benefit from High Thread Density and Concurrency • Processing multiple packets of a stream and different stages of a message stream pipeline • Processing multiple rows from a query 22 © International Business Machines (IBM) 2017
  • 23. • Machine Learning benefits from Large Caches and Memory Bandwidth • Iterative Algorithms on the same data • Fewer core pipeline stalls and overall higher throughput 23 Spark on OpenPOWER © International Business Machines (IBM) 2017
  • 24. • Graph algorithms also benefit from Large Caches, Memory Bandwidth and Higher Thread Strength • Flexibility to go from 8 SMT threads per core to 4 or 2 • Manage Balance between thread performance and throughput 24 Spark on OpenPOWER © International Business Machines (IBM) 2017
  • 25. • Headroom • Balanced resource utilization, more efficient scale-out • Multi-tenant deployments 25 Spark on OpenPOWER © International Business Machines (IBM) 2017
  • 26. Roofline SPARK Performance Model 26 Spark Tunables Spark Performance “Roofline” Performance Navigation uses system resource workload characterization and analysis to look for fundamental inefficiencies “Roofline” Good Enough “Out of Box” FOR 1 … MAX WORKERS FOR 1 … MAX CPU PER NODE FOR 1 … MAX THREADS PER CPU FOR 1 … MAX PARTITIONS Unwieldy & Complicated (some respite in ML workloads from data sampling) Performance Navigation Automation Script © International Business Machines (IBM) 2017
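The nested FOR loops on the slide amount to enumerating a configuration grid, which is why the slide calls the exhaustive search unwieldy. A minimal illustration of why the space explodes (the dimension ranges below are ours, chosen only for the example; the benchmark-runner function is a placeholder for the automation script the slide mentions):

```python
import itertools

# Sweep dimensions named on the slide: workers, CPUs per node,
# threads per CPU (SMT mode), and RDD partitions.
workers       = [2, 4, 6]
cpus_per_node = [1, 2]
smt_threads   = [2, 4, 8]          # POWER8 SMT modes
partitions    = [800, 1000, 1200]

grid = list(itertools.product(workers, cpus_per_node, smt_threads, partitions))
print(len(grid))  # 54 configurations, even for four small dimensions

def run_benchmark(cfg):
    """Placeholder: launch the Spark job with this configuration and
    record its runtime. In practice this is the automation script's job."""
    ...
```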
  • 27. Performance Tuning Tips for a Machine Learning Workload 27 Top-Down Approach Methodology: Alternating Least Squares based Matrix Factorization application Optimization Process: Spark executor instances, Spark executor cores, Spark executor memory, Spark shuffle location and manager, RDD persistence storage level Application Large number of Spark tunables - Spark executors and Spark cores … Default Configurations Out-of-Box Performance Bottom-Up Approach System Hardware Characterizing the Workload Through Resource Monitoring Custom SPARK Tunables from Configuration Sweeps Roofline Performance
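The knobs in the optimization process map onto spark-submit flags and configuration properties. A hypothetical invocation, assembled in Python purely for illustration: the class name, jar, and shuffle directories are placeholders, the values are not the study's winning configuration, and `spark.shuffle.manager` applies to the Spark 1.x releases this tuning used (it was removed in Spark 2.0):

```python
# Hypothetical spark-submit command covering the tunables from the slide.
argv = [
    "spark-submit",
    "--class", "MFApp",                      # placeholder application class
    "--num-executors", "6",                  # Spark executor instances
    "--executor-cores", "24",                # Spark executor cores
    "--executor-memory", "480g",             # Spark executor memory
    "--conf", "spark.shuffle.manager=tungsten-sort",  # shuffle manager (Spark 1.x)
    "--conf", "spark.local.dir=/mnt/hdd1,/mnt/hdd2",  # shuffle location (HDDs)
    "--conf", "spark.rdd.compress=false",
    "--conf", "spark.executor.extraJavaOptions=-XX:ParallelGCThreads=24",
    "mf-app.jar",                            # placeholder jar
]
print(" ".join(argv))
```

The remaining knob, RDD persistence storage level, is chosen in application code rather than on the command line, e.g. `rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)`.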
  • 28. Workflow 28 • Matrix Factorization from SPARKBENCH - https://github.com/SparkTC/spark-bench • Training • Validation • Prediction With permission - Raj Krishnamurthy STRATA NYC 2016 © International Business Machines (IBM) 2017
  • 29. Matrix Factorization with Alternating Least Squares 29
    Parameters used for data generation in the MF application:
      Rows in data matrix: 62000; Columns in data matrix: 62000; Data set size: 100 GB
    Spark parameters for MF:
      Master node: 1; Worker nodes: 6; Executors per node: 1; Executor cores: 80 / 40 / 24; Executor memory: 480 GB; Shuffle location: HDDs; Input storage: HDFS
    Jobs (function / API called):
      7 Mean at MFApp.java: AbstractJavaRDDLike.map, MatrixFactorizationModel.predict, JavaDoubleRDD.mean
      6 Aggregate at MFModel.scala: MatrixFactorizationModel.predict, MatrixFactorizationModel.countApproxDistinctUserProduct
      5 First at MFModel.scala: ml.recommendation.ALS.computeFactors
      4 First at MFModel.scala: ml.recommendation.ALS.computeFactors
      3 Count at ALS.scala: ALS.train and ALS.initialize
      2 Count at ALS.scala: ALS.train
      1 Count at ALS.scala: ALS.train
      0 Count at ALS.scala: ALS.train
    © International Business Machines (IBM) 2017
  • 30. Analyzing SPARK Configuration Sweep 30
    Various configurations tried in optimizing the MF application on Spark:

    Cfg  Executor cores  GC options            RDD compression  Storage level        Partitions  Shuffle manager  Runtime (min)
    1    80              Default               TRUE             memory_and_disk      1000        Sort-based       40
    2    80              Default               FALSE            memory_only          1000        Sort-based       34
    3    40              Default               FALSE            memory_only          1000        Sort-based       26
    4    40              ParallelGCThreads=40  FALSE            memory_only          1000        Sort-based       24
    5    40              ParallelGCThreads=40  TRUE             memory_and_disk_ser  1000        Sort-based       20
    6    40              ParallelGCThreads=40  TRUE             memory_only_ser      1000        Sort-based       25
    7    40              ParallelGCThreads=40  FALSE            memory_only          800         Sort-based       26
    8    40              ParallelGCThreads=40  FALSE            memory_only          1200        Sort-based       27
    9    24              ParallelGCThreads=24  FALSE            memory_and_disk_ser  1000        Sort-based       21
    10   24              ParallelGCThreads=24  FALSE            memory_and_disk_ser  1000        Tungsten-sort    19
    11   24              Default               FALSE            memory_and_disk_ser  1000        Tungsten-sort    18

    © International Business Machines (IBM) 2017
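Reading the sweep results programmatically makes the headline visible: tuning alone moves the runtime from 40 down to 18 minutes with no hardware change. A small sketch over the runtimes in the table above:

```python
# End-to-end MF runtime in minutes per sweep configuration (from the slide).
runtimes_min = {1: 40, 2: 34, 3: 26, 4: 24, 5: 20, 6: 25,
                7: 26, 8: 27, 9: 21, 10: 19, 11: 18}

best = min(runtimes_min, key=runtimes_min.get)
speedup = runtimes_min[1] / runtimes_min[best]
print(best, f"{speedup:.1f}x")  # config 11 is 2.2x faster than out-of-box config 1
```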
  • 31. GC and Memory Footprint 31
    Run time and GC time of Stage 68 for different configurations:

    Configuration  Run time of last stage  GC time of last stage
    1              12 min                  4.4 min
    4              4.4 min                 1.8 min
    9              3.5 min                 1.6 min
    11             47 s                    16 s

    © International Business Machines (IBM) 2017
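Computed from the stage-68 table, GC consistently accounts for roughly a third to nearly half of the stage's run time across configurations, while the absolute times collapse by about 15x from the out-of-box configuration to the best one:

```python
# (runtime, gc_time) of stage 68, in seconds, per configuration (from the slide).
stage68 = {1: (12 * 60, 4.4 * 60), 4: (4.4 * 60, 1.8 * 60),
           9: (3.5 * 60, 1.6 * 60), 11: (47, 16)}

for cfg, (run_s, gc_s) in stage68.items():
    print(cfg, f"GC fraction: {gc_s / run_s:.0%}")
# config 1 -> 37%, config 4 -> 41%, config 9 -> 46%, config 11 -> 34%

speedup = stage68[1][0] / stage68[11][0]  # 720 s -> 47 s, ~15x on this stage
```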
  • 32. Last Stage Analysis 32© International Business Machines (IBM) 2017
  • 33. Characterizing Configuration #1 33 CPU utilization on a worker node (configuration 1 ) Memory utilization on a worker node ( configuration 1) © International Business Machines (IBM) 2017
  • 34. Characterizing Configuration #1 and Configuration #11 34 Memory footprint of configuration 11 © International Business Machines (IBM) 2017
  • 35. Summary - How to Optimize Closer to Roofline Performance Faster? • Classify workload into CPU, memory, IO or mixed (CPU, memory, IO) intensive • Characterize “out-of-the-box” workload to understand CPU, Memory, IO and Network performance characteristics • Floorplan cluster resources • Tune “out-of-the-box” workload to navigate “Roofline” performance space in the above named dimensions – If workload is memory/IO/Network bound then tune SPARK to increase operational intensity (operations/byte) as much as possible to make it CPU bound • Divide search space into regions and perform exhaustive search 35 © International Business Machines (IBM) 2017
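The roofline reasoning behind these bullets can be made concrete: attainable throughput is bounded by the lower of the compute roof and the memory roof, so raising operations/byte moves a memory-bound workload toward the CPU-bound regime. A generic sketch of the classic roofline bound; the peak and bandwidth numbers are illustrative, not measured POWER8 figures:

```python
def attainable_gflops(op_intensity, peak_gflops=500.0, mem_bw_gbs=200.0):
    """Classic roofline bound: min of the compute roof (peak GFLOP/s)
    and the memory roof (operational intensity x bandwidth)."""
    return min(peak_gflops, op_intensity * mem_bw_gbs)

ridge = 500.0 / 200.0  # operational intensity where the two roofs meet (2.5 ops/byte)
print(attainable_gflops(1.0))  # 200.0 -> memory-bound below the ridge
print(attainable_gflops(4.0))  # 500.0 -> CPU-bound above the ridge
```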
  • 36. IO Optimizations How to take advantage of faster networks? 36Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 37. THE GAP – HIGH-PERFORMANCE NETWORKS The networks – 1, 10, and 40 Gbps networks (chart: runtime in seconds) 37 Apr 7, 2017 (c) International Business Machines (IBM) 2017
  • 38. THE PERFORMANCE LOSS IN THE BIG-DATA STACK High-Performance I/O devices • Data copies • Context switches • Cache pollution • Deep call-stacks • Legacy I/O interfaces 38Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 39. The Crail Architecture WWW.CRAIL.IO  A high-performance data fabric for the Apache Data Processing Stack  Relies on the principles of user level IO  Separation between control path and data path  User-space direct-access I/O architecture/layer cut-through  Builds on a distributed, shared data store  No changes to overall data processing framework  Is optimized to serve short-lived data sharing and staging spark / flink / storm … HDFS Crail Store High Performance RDMA Network zerocopy spark specific shuffle broadcast 39Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 40. EVALUATION - TERASORT (chart: runtime in seconds, vanilla Spark vs. Spark/Crail) 12.8 TB data set, TeraSort reduce map 128 nodes OpenPOWER cluster • 2 x IBM POWER8 10-core @ 2.9 GHz • DRAM: 512GB DDR4 • 4 x 1.2 TB NVMe SSD • 100GbE Mellanox ConnectX-4 EN (RoCE) • Ubuntu 16.04 (kernel 4.4.0-31) • Spark 2.0.2 Performance gain: 6x • Most gain from reduce phase: • Crail shuffler much faster than Spark built-in • Dramatically reduced CPU involvement • Dramatically improved network usage • Map phase: all activity local • Still faster than vanilla Spark 40 Apr 7, 2017 (c) International Business Machines (IBM) 2017
  • 41. EVALUATION – TERASORT: NETWORK IO • Vanilla Spark runs on 100GbE • Spark/Crail runs on 100Gb RoCE/RDMA • Vanilla Spark peaks at ~10Gb/s • Spark/Crail shuffle delivers ~70Gb/s per node 41Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 42. EVALUATION – TERASORT CPU EFFICIENCY • Spark/Crail completes much faster despite comparable CPU load • Spark/Crail CPU efficiency is close to the 2016 sorting benchmark winner: 3.13 vs. 4.4 GB/min/core • The 2016 winner runs native C code!

                        Spark + Crail  Spark 2.0.2  Winner 2014  Winner 2016
    Size (TB)           12.8           12.8         100          100
    Time (sec)          98             527          1406         98.6
    Cores               2560           2560         6592         10240
    Nodes               128            128          206          512
    Network (Gb/s)      100            100          10           100
    Rate (TB/min)       7.8            1.4          4.27         44.78
    Rate/core (GB/min)  3.13           0.58         0.66         4.4

    42 Apr 7, 2017 (c) International Business Machines (IBM) 2017
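The rate columns follow directly from size, time and core count; the per-core figures match the slide when 1 TB is converted as 1024 GB. Recomputing the published numbers:

```python
def rate_tb_per_min(size_tb, time_s):
    """Aggregate sort rate in TB/min."""
    return size_tb * 60 / time_s

def rate_per_core_gb_min(size_tb, time_s, cores):
    """Per-core sort rate in GB/min, using 1 TB = 1024 GB as the slide does."""
    return size_tb * 1024 * 60 / (time_s * cores)

print(round(rate_tb_per_min(12.8, 98), 1))              # 7.8  (Spark + Crail)
print(round(rate_per_core_gb_min(12.8, 98, 2560), 2))   # 3.13 (Spark + Crail)
print(round(rate_per_core_gb_min(12.8, 527, 2560), 2))  # 0.58 (vanilla Spark)
print(round(rate_per_core_gb_min(100, 1406, 6592), 2))  # 0.66 (2014 winner)
```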
  • 43. CRAIL WITH THE HORTONWORKS STACK scalable, fault-tolerant, cost-efficient storage resource manager compute frameworks user interfaces broadcast HDFS plugin RPCs shuffle caching key-value store ... High-performance Crail fabric 43 Apr 7, 2017 (c) International Business Machines (IBM) 2017
  • 44. Roadmap Where is OpenPOWER headed? | 44 Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 45. Accelerator Technology 2015 2016 2017 POWER8 POWER8 with NVLink POWER9 OpenPower CAPI Interface Enhanced CAPI & NVLink Connect-IB FDR Infiniband PCIe Gen3 ConnectX-4 EDR Infiniband CAPI over PCIe Gen3 ConnectX-5 Next-Gen Infiniband Enhanced CAPI over PCIe Gen4 IBM CPUs Kepler PCIe Gen3 Volta Enhanced NVLink Pascal NVLink 45© International Business Machines (IBM) 2017
  • 46. NOTICES AND DISCLAIMERS 46 Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.” Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. 
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law © 2016 International Business Machines C
  • 47. NOTICES AND DISCLAIMERS CON’T. 47 Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services®, Global Technology Services®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.