Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads

1
Micro-architectural Characterization of
Apache Spark on Batch and Stream
Processing Workloads
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson (KTH), Eduard Ayguade (UPC and BSC),
Vladimir Vlassov (KTH)

2
Motivation
Why should we care about architecture support?
*Taken from Babak's slides
Data Growing Faster Than Technology

3
Motivation
Cont...
Our Goal
Improve the node level performance
through architecture support
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix++, Metis, Ostrich, etc.
Hadoop, Spark, Flink, etc.

4
Our Approach
● Performance characterization of in-memory data analytics on a
modern cloud server, in 5th International IEEE Conference on Big
Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server in 6th International Workshop on Big Data
Benchmarks, Performance Optimization and Emerging Hardware
(BpoE), held in conjunction with VLDB 2015, Hawaii, USA
– Limited to batch processing workloads only
– Does not consider the velocity aspect of big data
– Experiments are based on an older version of Spark.
What are the major performance
bottlenecks?

5
Our Approach
● Does micro-architectural performance remain consistent
across batch and stream processing workloads?
● How do DataFrames compare to RDDs at the micro-architectural level?
● How does data velocity affect micro-architectural performance?
What are the remaining questions?

6
Progress Meeting 12-12-14
Which Scale-out Framework ?
[Picture Courtesy: Amir H. Payberah]
● Tuning of Spark internal Parameters
● Tuning of JVM Parameters (Heap size etc..)
● Micro-architecture Level Analysis using Hardware Performance
Counters.
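A minimal sketch of how raw hardware counter readings can be turned into the stall categories used on the later slides (front-end bound, back-end bound, retiring). The slides do not say how the categories were derived; the formulas below follow the commonly used top-down level-1 breakdown for a 4-wide Intel core and are an assumption here. The function name and the counter readings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumption: 4 issue slots per cycle, as on Ivy Bridge-class cores.
PIPELINE_WIDTH = 4


@dataclass
class TopDownLevel1:
    frontend_bound: float
    bad_speculation: float
    retiring: float
    backend_bound: float


def top_down_level1(cycles, uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles):
    """Split pipeline slots into the four top-down level-1 categories.

    Arguments correspond to these events (names as used on Intel cores):
      cycles              CPU_CLK_UNHALTED.THREAD
      uops_not_delivered  IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued         UOPS_ISSUED.ANY
      uops_retired_slots  UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles     INT_MISC.RECOVERY_CYCLES
    """
    slots = PIPELINE_WIDTH * cycles
    if slots <= 0:
        raise ValueError("cycles must be positive")
    frontend = uops_not_delivered / slots
    bad_spec = (uops_issued - uops_retired_slots
                + PIPELINE_WIDTH * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left over is attributed to the back end.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return TopDownLevel1(frontend, bad_spec, retiring, backend)


# Hypothetical counter readings, not measurements from the study.
breakdown = top_down_level1(
    cycles=1_000_000,
    uops_not_delivered=800_000,
    uops_issued=1_700_000,
    uops_retired_slots=1_600_000,
    recovery_cycles=25_000,
)
print(breakdown)
```

The four fractions always sum to 1, so a drop in one category (for example front-end bound stalls) must show up as a rise in another (for example retiring).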

7
Our Approach
Which benchmarks?

8
Our Hardware Configuration
Which Machine ?
Hyper Threading and Turbo-boost are disabled
Intel's Ivy Bridge Server

9
Does micro-architectural performance remain
consistent?
Stream processing is micro-architecturally similar to batch processing in Spark
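One way to read this result: when the streaming version of a workload differs from the batch version only by micro-batching, both run the same per-record code. The sketch below illustrates that idea in plain Python, without Spark; the function names and sample records are hypothetical, and it models only the computation, not Spark's scheduler or memory behavior.

```python
from collections import Counter


def word_count(lines):
    """Batch version: count words over the whole input at once."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(lines, batch_size):
    """Cut the input into consecutive micro-batches of batch_size records."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def streaming_word_count(lines, batch_size):
    """Streaming version: apply the same word_count to each micro-batch
    and merge the partial results."""
    total = Counter()
    for batch in micro_batches(lines, batch_size):
        total.update(word_count(batch))
    return total


# Hypothetical input records.
records = ["spark batch stream", "stream spark", "batch", "spark"]
batch_result = word_count(records)
stream_result = streaming_word_count(records, batch_size=2)
print(batch_result == stream_result)
```

The inner loop is identical in both versions; only the amount of data handed to it per invocation changes.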

10
Cont..
Stream processing is micro-architecturally similar to batch processing in Spark

11
Cont..
Streaming workloads with similar Spark transformations have different
micro-architectural behavior

12
Cont..

13
Cont..

14
Cont..
Workload | Spark transformations                     | Input data rate | Window size (s) | Working set with 2 s sampling interval
WWc      | FlatMap, Map, ReduceByKeyAndWindow        | 10^4            | 30              | 15 x 10^4
CSpc     | FlatMap, Map, CountByValueAndWindow       | 10^4            | 10              | 5 x 10^4
CErpz    | FlatMap, Map, Window, GroupByKey          | 10^4            | 30              | 15 x 10^4
CAuC     | FlatMap, Map, Window, GroupByKey, Count   | 10^4            | 10              | 5 x 10^4
Tpt      | FlatMap, ReduceByKeyAndWindow, Transform  | 10^1            | 60              | 30 x 10^1
Micro-batch size determines the micro-architectural behavior of stream processing
workloads with similar Spark transformations
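The working-set column in the table is consistent with input data rate x window size / sampling interval. That relation is inferred from the table values and is not stated on the slide, and the unit of the input data rate is not given either. A short check under that assumption (the function name is hypothetical):

```python
def working_set(input_rate, window_s, sampling_interval_s=2):
    """Assumed relation: rate * window / sampling interval.

    Inferred from the table values; the slide does not state the formula.
    """
    return input_rate * window_s // sampling_interval_s


# (workload, input data rate, window size in seconds) taken from the table.
rows = [
    ("WWc", 10**4, 30),
    ("CSpc", 10**4, 10),
    ("CErpz", 10**4, 30),
    ("CAuC", 10**4, 10),
    ("Tpt", 10**1, 60),
]

working_sets = {name: working_set(rate, window) for name, rate, window in rows}
for name, size in working_sets.items():
    print(name, size)
```

Under this reading, WWc and CErpz share one working-set size and CSpc and CAuC share another, which matches the pairing by window size in the table.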

15
Do DataFrames perform better than RDDs at the
micro-architectural level?
DataFrames exhibit:
● 25% fewer back-end bound stalls
● 64% fewer DRAM-bound stalled cycles
● 25% lower bandwidth consumption
● 10% less starvation of execution resources
DataFrames have better micro-architectural performance than RDDs

16
How does data velocity affect micro-architectural
performance?
Better CPU utilization at higher data velocity

17
Cont..
● Higher instruction retirement at higher data velocity
● Higher L1-bound stalls at higher data velocity
● Less starvation at higher data velocity
● Higher bandwidth consumption at higher data velocity

18
Our Approach
Conclusion
● Batch processing and stream processing have the same micro-architectural
behavior in Spark if the only difference between the two implementations is
micro-batching.
● Spark workloads using DataFrames have improved instruction
retirement over workloads using RDDs.
● If the input data rates are low, stream processing workloads are
front-end bound. However, the front-end bound stalls are reduced at
higher input data rates and instruction retirement is improved.

20
Our Approach
List of Papers
● Performance characterization of in-memory data analytics on a
modern cloud server, in 5th International IEEE Conference on Big Data
and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a Scale-up
Server, in 6th International Workshop on Big Data Benchmarks,
Performance Optimization and Emerging Hardware (BpoE), held in
conjunction with VLDB 2015, Hawaii, USA.
● Micro-architectural Characterization of Apache Spark on Batch and
Stream Processing Workloads (accepted to BDCloud 2016).
● Node Architecture Implications for In-Memory Data Analytics in
Scale-in Clusters (accepted to IEEE BDCAT 2016).
● Implications of In-Memory Data Analytics with Apache Spark on Near
Data Computing Architectures (under submission).
