1
Performance Characterization of In-Memory
Data Analytics on a Modern Cloud Server
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson(KTH), Eduard Ayguade(UPC and BSC),
Vladimir Vlassov(KTH)
2
Motivation
Why should we care?
*Source: SGI
3
Motivation
Cont...
Our Focus
Improve single-node performance
in the scale-out configuration
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix ++,
Metis, Ostrich,
etc..
Hadoop, Spark,
Flink, etc..
4
Progress Meeting 12-12-14
Which Scale-out Framework ?
[Picture Courtesy: Amir H. Payberah]
5
Our Approach
● A three-fold analysis method at the application, thread, and
micro-architectural levels
– Tuning of Spark internal parameters
– Tuning of JVM parameters (heap size, etc.)
– Thread-level analysis using Intel VTune
– Micro-architecture-level analysis using hardware
performance counters
Methodology
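The Spark and JVM tuning knobs listed above can be sketched as a small configuration dictionary. The parameter names are real Spark settings, but the values here are illustrative only, not the ones used in the study:

```python
# Illustrative Spark/JVM tuning parameters (values are examples,
# NOT the settings used in the study).
spark_conf = {
    "spark.default.parallelism": "12",  # threads in the executor pool
    "spark.executor.memory": "24g",     # executor JVM heap size
    "spark.shuffle.compress": "true",   # compress map-side shuffle output
    "spark.shuffle.spill": "true",      # allow spilling during aggregation
}

def as_submit_args(conf):
    """Render the settings as spark-submit --conf arguments."""
    return " ".join(f"--conf {k}={v}" for k, v in conf.items())

print(as_submit_args(spark_conf))
```

Rendering the dictionary as `--conf` flags keeps one source of truth for a parameter sweep across runs.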
6
Our Approach
Benchmarks
3 GB datasets of raw Wikipedia text, Amazon Movie Reviews, and
numerical records are used
7
Our Hardware Configuration
Machine Details
Hyper-Threading and Turbo Boost are disabled
8
Our Approach
System Configuration
9
Multicore Scalability of Spark
Application Level Performance
Spark scales poorly in the scale-up configuration
10
Multicore Scalability of Spark
Stage Level Performance
● Shuffle map stages don't scale beyond 12 threads
across the different workloads
● The number of concurrent files open during map-side shuffling is
C * R, where C is the number of threads in the executor pool and
R is the number of reduce tasks
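The C * R relation above is simple to check numerically; the thread and task counts below are hypothetical, chosen only to show how quickly the file count grows:

```python
def open_shuffle_files(executor_threads, reduce_tasks):
    """Map-side shuffle without file consolidation keeps one file open
    per (map thread, reduce task) pair, i.e. C * R files concurrently."""
    return executor_threads * reduce_tasks

# With 12 executor threads and 96 reduce tasks (hypothetical values):
print(open_shuffle_files(12, 96))  # 1152 concurrent open files
```

Doubling either factor doubles the number of open files, which is why shuffle stages stop scaling well beyond a modest thread count.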
11
Multicore Scalability of Spark
Task Level Performance
Percentage increase in the area under the curve compared to the 1-thread case
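As a sketch of how such a metric can be computed (the actual task-level curves from the study are not reproduced here), the area under a uniformly sampled curve and its percentage increase over the 1-thread baseline follow from the trapezoidal rule:

```python
def auc(samples, dt=1.0):
    """Trapezoidal area under a uniformly sampled curve."""
    return sum((a + b) / 2.0 * dt for a, b in zip(samples, samples[1:]))

def pct_increase(curve, baseline_curve):
    """Percentage increase in AUC relative to the 1-thread baseline."""
    base = auc(baseline_curve)
    return 100.0 * (auc(curve) - base) / base

# Hypothetical sampled curves: a run whose curve sits at 2x the baseline
# has a 100% larger area under it.
print(pct_increase([2, 2, 2], [1, 1, 1]))  # 100.0
```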
12
Is there thread-level load imbalance?
13
CPU utilization does not
scale with performance
14
Is there any work-time
inflation?
15
How does the micro-architecture contribute to
work-time inflation?
16
Cont...
17
Cont...
18
Is memory bandwidth a bottleneck?
19
Key Findings
● Using more than 12 threads in an executor pool does not yield
significant performance gains
● The Spark runtime system needs to be improved to provide better
load balancing and avoid work-time inflation
● Work-time inflation and load imbalance across threads are the
scalability bottlenecks
● Removing the bottlenecks in the front-end of the processor would
not remove more than 20% of the stalls
● Effort should be focused on removing memory-bound stalls, since
they account for up to 72% of the stalled pipeline slots
● The memory bandwidth of current processors is sufficient for
in-memory data analytics
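The stall percentages above come from a top-down-style breakdown of pipeline slots. A minimal sketch of that arithmetic, assuming a 4-wide issue core and entirely hypothetical counter values:

```python
SLOT_WIDTH = 4  # issue slots per cycle on a typical 4-wide core

def topdown(slots_total, frontend_stall_slots, retired_slots, bad_spec_slots):
    """Split pipeline slots into the four top-down categories.
    Backend-bound is computed as the remainder."""
    frontend = frontend_stall_slots / slots_total
    retiring = retired_slots / slots_total
    bad_spec = bad_spec_slots / slots_total
    backend = 1.0 - frontend - retiring - bad_spec
    return {"frontend": frontend, "retiring": retiring,
            "bad_speculation": bad_spec, "backend": backend}

# Hypothetical counter readings, not measurements from the study:
breakdown = topdown(slots_total=1000, frontend_stall_slots=150,
                    retired_slots=300, bad_spec_slots=50)
print(breakdown)  # backend-bound = 0.5
```

In this made-up example only 15% of slots are front-end bound, so even a perfect front-end would leave the backend (memory) stalls untouched, which mirrors the finding above.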
20
Future Directions
NUMA-Aware Task Scheduling
Cache-Aware Transformations
Exploiting Processing-in-Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures
21
Publications
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade,
“Performance Characterization of In-Memory Data Analytics on a
Modern Cloud Server,” in 5th International IEEE Conference on
Big Data and Cloud Computing, 2015.
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “How
Data Volume Affects Spark Based Data Analytics on a Scale-up
Server,” in 6th International Workshop on Big Data
Benchmarking, Performance Optimization and Emerging
Hardware (BPOE), held in conjunction with VLDB, 2015.
What are the major bottlenecks?
22
Motivation
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
23
Motivation
Do Spark-based data analytics benefit from using larger
scale-up servers?
24
Motivation
Is GC detrimental to the scalability of Spark applications?
25
Motivation
How does performance scale with data volume?
26
Motivation
Does GC time scale linearly with data volume?
27
Motivation
How does CPU utilization scale with data volume?
28
Motivation
Is file I/O detrimental to performance?
29
Motivation
How does data size affect micro-architectural
performance?
30
Motivation
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
31
Motivation
Cont..
32
Motivation
Cont..
33
Key Findings
● Spark workloads do not benefit significantly from executors with
more than 12 cores.
● The performance of Spark workloads degrades with large volumes
of data due to a substantial increase in garbage collection and file
I/O time.
● Without any tuning, the Parallel Scavenge garbage collection scheme
outperforms the Concurrent Mark Sweep and G1 garbage collectors
for Spark workloads.
● Spark workloads exhibit improved instruction retirement due to
lower L1 cache misses and better utilization of the functional units
inside the cores at large volumes of data.
● Memory bandwidth utilization of the Spark benchmarks decreases
with large volumes of data and is 3x lower than the available
off-chip bandwidth on our test machine.
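The bandwidth-headroom claim can be sanity-checked with simple arithmetic. The peak figure below assumes a hypothetical 4-channel DDR3-1866 memory configuration, not necessarily the study's test machine:

```python
def peak_bandwidth_gbs(channels, mt_per_sec, bytes_per_transfer=8):
    """Theoretical peak off-chip bandwidth in GB/s:
    channels * transfer rate (MT/s) * bytes per transfer."""
    return channels * mt_per_sec * bytes_per_transfer / 1e3

def utilization(measured_gbs, peak_gbs):
    """Fraction of the theoretical peak actually sustained."""
    return measured_gbs / peak_gbs

# Hypothetical 4-channel DDR3-1866 system:
peak = peak_bandwidth_gbs(channels=4, mt_per_sec=1866)  # ~59.7 GB/s
# A workload sustaining ~20 GB/s would use only about a third of the
# peak, i.e. roughly 3x lower than the available off-chip bandwidth.
print(peak, utilization(20.0, peak))
```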
34
Publications
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade,
“Performance Characterization of In-Memory Data Analytics on a
Modern Cloud Server,” in 5th International IEEE Conference on
Big Data and Cloud Computing, 2015.
● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “How
Data Volume Affects Spark Based Data Analytics on a Scale-up
Server,” in 6th International Workshop on Big Data
Benchmarking, Performance Optimization and Emerging
Hardware (BPOE), held in conjunction with VLDB, 2015.
What are the major bottlenecks?
35
Future Directions
NUMA-Aware Task Scheduling
Cache-Aware Transformations
Exploiting Processing-in-Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures
