1
Performance Characterization of In-Memory
Data Analytics on a Modern Cloud Server
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson(KTH), Eduard Ayguade(UPC and BSC),
Vladimir Vlassov(KTH)
2
Motivation
Why should we care?
*Source: SGI
3
Motivation
Cont...
Our Focus
Improve single-node performance
in the scale-out configuration
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix ++,
Metis, Ostrich,
etc..
Hadoop, Spark,
Flink, etc..
4
Progress Meeting 12-12-14
Which Scale-out Framework ?
[Picture Courtesy: Amir H. Payberah]
5
Our Approach
● A three-fold analysis method at the application, thread, and
micro-architectural levels
– Tuning of Spark internal parameters
– Tuning of JVM parameters (heap size, etc.)
– Thread-level analysis using Intel VTune
– Micro-architecture-level analysis using hardware
performance counters
Methodology
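The Spark and JVM tuning knobs listed above can be sketched as a small configuration dictionary. The parameter names are real Spark settings, but the values here are illustrative only, not the ones used in the study:

```python
# Illustrative Spark/JVM tuning parameters (values are examples,
# NOT the settings used in the study).
spark_conf = {
    "spark.default.parallelism": "12",  # threads in the executor pool
    "spark.executor.memory": "24g",     # executor JVM heap size
    "spark.shuffle.compress": "true",   # compress map-side shuffle output
    "spark.shuffle.spill": "true",      # allow spilling during aggregation
}

def as_submit_args(conf):
    """Render the settings as spark-submit --conf arguments."""
    return " ".join(f"--conf {k}={v}" for k, v in conf.items())

print(as_submit_args(spark_conf))
```

Rendering the dictionary as `--conf` flags keeps one source of truth for a parameter sweep across runs.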
6
Our Approach
Benchmarks
3 GB datasets of raw Wikipedia text, Amazon Movie Reviews, and
numerical records are used
7
Our Hardware Configuration
Machine Details
Hyper-Threading and Turbo Boost are disabled
8
Our Approach
System Configuration
9
Multicore Scalability of Spark
Application Level Performance
Spark scales poorly in the scale-up configuration
10
Multicore Scalability of Spark
Stage Level Performance
● Shuffle map stages don't scale beyond 12 threads
across the different workloads
● The number of concurrent files open during map-side shuffling is
C * R, where C is the number of threads in the executor pool and
R is the number of reduce tasks
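The C * R relation above is simple to check numerically; the thread and task counts below are hypothetical, chosen only to show how quickly the file count grows:

```python
def open_shuffle_files(executor_threads, reduce_tasks):
    """Map-side shuffle without file consolidation keeps one file open
    per (map thread, reduce task) pair, i.e. C * R files concurrently."""
    return executor_threads * reduce_tasks

# With 12 executor threads and 96 reduce tasks (hypothetical values):
print(open_shuffle_files(12, 96))  # 1152 concurrent open files
```

Doubling either factor doubles the number of open files, which is why shuffle stages stop scaling well beyond a modest thread count.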
11
Multicore Scalability of Spark
Task Level Performance
Percentage increase in the area under the curve compared to the 1-thread case
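As a sketch of how such a metric can be computed (the actual task-level curves from the study are not reproduced here), the area under a uniformly sampled curve and its percentage increase over the 1-thread baseline follow from the trapezoidal rule:

```python
def auc(samples, dt=1.0):
    """Trapezoidal area under a uniformly sampled curve."""
    return sum((a + b) / 2.0 * dt for a, b in zip(samples, samples[1:]))

def pct_increase(curve, baseline_curve):
    """Percentage increase in AUC relative to the 1-thread baseline."""
    base = auc(baseline_curve)
    return 100.0 * (auc(curve) - base) / base

# Hypothetical sampled curves: a run whose curve sits at 2x the baseline
# has a 100% larger area under it.
print(pct_increase([2, 2, 2], [1, 1, 1]))  # 100.0
```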
12
Is there thread-level load imbalance?
13
CPU utilization does not
scale with performance
14
Is there any work-time
inflation?
15
How does the micro-architecture contribute to
work-time inflation?
16
Cont...
17
Cont...
18
Is memory bandwidth a bottleneck?
19
Key Findings
● Using more than 12 threads in an executor pool does not yield
significant performance gains
● The Spark runtime system needs to be improved to provide better
load balancing and avoid work-time inflation
● Work-time inflation and load imbalance across threads are the
scalability bottlenecks
● Removing the bottlenecks in the front-end of the processor would
not remove more than 20% of the stalls
● Effort should be focused on removing memory-bound stalls, since
they account for up to 72% of the stalled pipeline slots
● The memory bandwidth of current processors is sufficient for
in-memory data analytics
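The stall percentages above come from a top-down-style breakdown of pipeline slots. A minimal sketch of that arithmetic, assuming a 4-wide issue core and entirely hypothetical counter values:

```python
SLOT_WIDTH = 4  # issue slots per cycle on a typical 4-wide core

def topdown(slots_total, frontend_stall_slots, retired_slots, bad_spec_slots):
    """Split pipeline slots into the four top-down categories.
    Backend-bound is computed as the remainder."""
    frontend = frontend_stall_slots / slots_total
    retiring = retired_slots / slots_total
    bad_spec = bad_spec_slots / slots_total
    backend = 1.0 - frontend - retiring - bad_spec
    return {"frontend": frontend, "retiring": retiring,
            "bad_speculation": bad_spec, "backend": backend}

# Hypothetical counter readings, not measurements from the study:
breakdown = topdown(slots_total=1000, frontend_stall_slots=150,
                    retired_slots=300, bad_spec_slots=50)
print(breakdown)  # backend-bound = 0.5
```

In this made-up example only 15% of slots are front-end bound, so even a perfect front-end would leave the backend (memory) stalls untouched, which mirrors the finding above.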
20
Future Directions
NUMA-Aware Task Scheduling
Cache-Aware Transformations
Exploiting Processing-in-Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures
21
Publications
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade,
“Performance Characterization of In-Memory Data Analytics on a
Modern Cloud Server,” in 5th International IEEE Conference on
Big Data and Cloud Computing, 2015.
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “How
Data Volume Affects Spark Based Data Analytics on a Scale-up
Server,” in 6th International Workshop on Big Data
Benchmarking, Performance Optimization and Emerging
Hardware (BPOE), held in conjunction with VLDB, 2015.
What are the major bottlenecks?
22
Motivation
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
23
Motivation
Do Spark-based data analytics benefit from using larger
scale-up servers?
24
Motivation
Is GC detrimental to the scalability of Spark applications?
25
Motivation
How does performance scale with data volume?
26
Motivation
Does GC time scale linearly with data volume?
27
Motivation
How does CPU utilization scale with data volume?
28
Motivation
Is file I/O detrimental to performance?
29
Motivation
How does data size affect micro-architectural
performance?
30
Motivation
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
31
Motivation
Cont..
32
Motivation
Cont..
33
Key Findings
● Spark workloads do not benefit significantly from executors with
more than 12 cores.
● The performance of Spark workloads degrades with large volumes
of data due to a substantial increase in garbage collection and file
I/O time.
● Without any tuning, the Parallel Scavenge garbage collection scheme
outperforms the Concurrent Mark Sweep and G1 garbage collectors
for Spark workloads.
● Spark workloads exhibit improved instruction retirement due to
lower L1 cache misses and better utilization of the functional units
inside the cores at large volumes of data.
● Memory bandwidth utilization of the Spark benchmarks decreases
with large volumes of data and is 3x lower than the available
off-chip bandwidth on our test machine.
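The bandwidth-headroom claim can be sanity-checked with simple arithmetic. The peak figure below assumes a hypothetical 4-channel DDR3-1866 memory configuration, not necessarily the study's test machine:

```python
def peak_bandwidth_gbs(channels, mt_per_sec, bytes_per_transfer=8):
    """Theoretical peak off-chip bandwidth in GB/s:
    channels * transfer rate (MT/s) * bytes per transfer."""
    return channels * mt_per_sec * bytes_per_transfer / 1e3

def utilization(measured_gbs, peak_gbs):
    """Fraction of the theoretical peak actually sustained."""
    return measured_gbs / peak_gbs

# Hypothetical 4-channel DDR3-1866 system:
peak = peak_bandwidth_gbs(channels=4, mt_per_sec=1866)  # ~59.7 GB/s
# A workload sustaining ~20 GB/s would use only about a third of the
# peak, i.e. roughly 3x lower than the available off-chip bandwidth.
print(peak, utilization(20.0, peak))
```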
34
Publications
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade,
“Performance Characterization of In-Memory Data Analytics on a
Modern Cloud Server,” in 5th International IEEE Conference on
Big Data and Cloud Computing, 2015.
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “How
Data Volume Affects Spark Based Data Analytics on a Scale-up
Server,” in 6th International Workshop on Big Data
Benchmarking, Performance Optimization and Emerging
Hardware (BPOE), held in conjunction with VLDB, 2015.
What are the major bottlenecks?
35
Future Directions
NUMA-Aware Task Scheduling
Cache-Aware Transformations
Exploiting Processing-in-Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures
