1
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson(KTH), Vladimir Vlassov(KTH) and Eduard
Ayguade(UPC and BSC),
2
Motivation
Why should we care about architecture support?
*Source: SGI
Data Growing Faster Than Technology
4
Motivation
Cont...
Our Focus
Improve the node level performance
through architecture support
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix++, Metis, Ostrich, etc.
Hadoop, Spark, Flink, etc.
5
Motivation
Cont...
● A mismatch between the characteristics of emerging workloads and the underlying
hardware.
– M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads
on modern hardware,” in ASPLOS 2012.
– Z. Jia et al., “Characterizing data analysis workloads in data centers,” in IISWC
2013.
– Z. Jia et al., “Characterizing and subsetting big data workloads,” in IISWC 2014.
– A. Yasin et al., “Deep-dive analysis of the data analytics workload in CloudSuite,” in
IISWC 2014.
– T. Jiang et al., “Understanding the behavior of in-memory computing workloads,” in
IISWC 2014.
Existing studies lack a quantitative analysis of the bottlenecks of
scale-out frameworks on a single node
6
Progress Meeting 12-12-14
Which Scale-out Framework?
[Picture Courtesy: Amir H. Payberah]
7
Our Approach
● “Performance characterization of in-memory data analytics on a
modern cloud server,” in 5th International IEEE Conference on
Big Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
What are the major bottlenecks?
Focus of this talk
8
Our Approach
● Do Spark based data analytics benefit from using scale-up
servers?
● How severe is the impact of garbage collection on performance
of Spark based data analytics?
● Is file I/O detrimental to Spark based data analytics
performance?
● How does data size affect the micro-architecture performance
of Spark based data analytics?
What are the remaining questions?
9
Our Approach
● We evaluate the impact of data volume on the performance of
Spark based data analytics running on a scale-up server.
● We quantify the limitations of using Spark on a scale-up server
with large volumes of data.
● We quantify the variations in micro-architectural performance of
applications across different data volumes.
What are the contributions?
10
Our Approach
● Use a subset of benchmarks from BigDataBench
● Use Big Data Generator Suite (BDGS) to generate synthetic
datasets of 6 GB, 12 GB and 24 GB.
● Configure Spark in local mode and tune its internal parameters.
● Rely on GC logs to collect garbage collection times.
● Use Spark logs to gather execution time of benchmarks.
● Use Concurrency Analysis in Intel VTune to collect wait time and CPU
time of executor pool threads.
● Use General Micro-architectural Exploration in Intel VTune to analyze
the impact of data volume on micro-architectural characteristics.
Methodology
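The GC-log step of the methodology can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes the JVM was started with flags such as `-XX:+PrintGCDetails`, and that each collection record ends with a pause of the form `, 0.0150000 secs]` (the JDK 7/8 style layout). The sample log and the 10-second execution time are hypothetical.

```python
import re

# Pause duration that closes each collection record, e.g. ", 0.0150000 secs]".
# Assumed log layout: JDK 7/8 style output of -XX:+PrintGCDetails.
# The trailing "[Times: ... real=0.02 secs]" part is not matched, because the
# pattern requires a number directly after the comma and space.
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")


def total_gc_seconds(log_text):
    """Sum the pause durations of all collection records in a GC log."""
    return sum(float(m.group(1)) for m in PAUSE_RE.finditer(log_text))


def gc_fraction(log_text, execution_seconds):
    """Share of the total execution time spent in garbage collection."""
    return total_gc_seconds(log_text) / execution_seconds


# Hypothetical log excerpt, not taken from the study.
SAMPLE_LOG = """\
1.234: [GC [PSYoungGen: 65536K->10720K(76288K)] 65536K->10728K(251392K), 0.0150000 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
5.678: [Full GC [PSYoungGen: 10720K->0K(76288K)] [ParOldGen: 8K->10500K(175104K)] 10728K->10500K(251392K), 0.2350000 secs] [Times: user=0.60 sys=0.02, real=0.24 secs]
"""

if __name__ == "__main__":
    print("GC seconds:", total_gc_seconds(SAMPLE_LOG))
    print("GC fraction of a 10 s run:", gc_fraction(SAMPLE_LOG, 10.0))
```

The execution time passed to `gc_fraction` would come from the Spark logs, as described in the list above; other collectors and newer JDKs print different layouts, so the pattern would need adjusting.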
11
Our Approach
What are the characteristics of benchmarks?
12
Our Hardware Configuration
System Details
13
Our Hardware Configuration
Machine Details
Hyper Threading and Turbo-boost are disabled
14
Our Approach
Software Parameters
15
Motivation
Do Spark based data analytics benefit from using larger
scale-up servers?
Spark applications do not benefit significantly from executors with more than 12 cores
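A core-count sweep of this kind could be driven by a small script such as the one below. It is only a sketch under stated assumptions: the slides do not show the launch commands, so the heap size, jar name, main class and the list of core counts are hypothetical placeholders. The `spark-submit` options (`--master local[N]`, `--driver-memory`, `--driver-java-options`, `--class`) and the JVM flags are standard ones; in local mode the executor threads run inside the driver JVM, which is why the GC flags are passed as driver options.

```python
def spark_submit_command(cores, app_jar, main_class,
                         heap="50g", gc_flag="-XX:+UseParallelGC"):
    """Build a spark-submit argument list for a local-mode run.

    heap, app_jar and main_class are illustrative placeholders,
    not values reported in the slides.
    """
    jvm_opts = f"{gc_flag} -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
    return [
        "spark-submit",
        "--master", f"local[{cores}]",       # N executor threads in one JVM
        "--driver-memory", heap,             # heap of the single local JVM
        "--driver-java-options", jvm_opts,   # collector choice and GC logging
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical sweep over core counts.
    for n in (1, 6, 12, 18, 24):
        print(" ".join(spark_submit_command(n, "bench.jar", "WordCount")))
```

Swapping `gc_flag` for `-XX:+UseG1GC` or `-XX:+UseConcMarkSweepGC` would give the other two collectors compared later in the talk.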
16
Motivation
Is GC detrimental to scalability of Spark applications?
The proportion of GC time increases with the number of cores
17
Motivation
Does performance remain consistent as we enlarge the data
size?
Decrease in data processed per second (DPS) ranges from 11% to 93% (Parallel Scavenge)
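The metric on this slide can be illustrated with a short calculation. The sketch below assumes data processed per second (DPS) is the input size divided by the execution time, and that the decrease is measured relative to the smallest dataset. The execution times are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from the study.

```python
def data_processed_per_second(input_gb, execution_seconds):
    """DPS, assumed here to be input size (GB) over execution time (s)."""
    return input_gb / execution_seconds


def percent_decrease(baseline, value):
    """Relative drop of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


# Hypothetical execution times (seconds) for the 6, 12 and 24 GB datasets.
RUNS = {6: 60.0, 12: 150.0, 24: 400.0}

if __name__ == "__main__":
    baseline = data_processed_per_second(6, RUNS[6])
    for size_gb, seconds in RUNS.items():
        dps = data_processed_per_second(size_gb, seconds)
        drop = percent_decrease(baseline, dps)
        print(f"{size_gb} GB: {dps:.3f} GB/s, {drop:.1f}% below the 6 GB run")
```

With perfectly linear scaling the execution time would double with the data and DPS would stay flat, so any drop indicates that time grows faster than the input.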
18
Motivation
Does the choice of garbage collector impact the data
processing capability of the system?
Improvement in DPS ranges from 1.4x to 3.7x on average
with Parallel Scavenge as compared to G1
19
Motivation
How does GC affect the data processing capability of
the system?
GC time does not scale linearly with data size.
20
Motivation
How does CPU utilization scale with data volume?
CPU utilization decreases as the input data size increases
21
Motivation
Is file I/O detrimental to performance?
Fraction of file I/O increases by 6x, 18x and 25x for Word Count,
Naive Bayes and Sort respectively when input data is increased by 4x
22
Motivation
How does data size affect micro-architectural
performance?
5 to 10% better instruction retirement as we enlarge the data size
23
Motivation
Cont..
Execution units inside the core exhibit improved utilization at larger data sets
24
Motivation
Cont..
Increase in L1 Bound Stalls implies better utilization of L1 Caches
25
Motivation
Cont..
Spark benchmarks exhibit reduced memory bandwidth utilization
26
Key Findings
● Spark workloads do not benefit significantly from executors with
more than 12 cores.
● The performance of Spark workloads degrades with large volumes
of data due to substantial increase in garbage collection and file
I/O time.
● Without any tuning, the Parallel Scavenge garbage collection scheme
outperforms Concurrent Mark Sweep and G1 garbage collectors
for Spark workloads.
● Spark workloads exhibit improved instruction retirement due to
lower L1 cache misses and better utilization of functional units
inside cores at large volumes of data.
● Memory bandwidth utilization of Spark benchmarks decreases
with large volumes of data and is 3x lower than the available off-
chip bandwidth on our test machine
27
Motivation
Future Directions
NUMA Aware Task Scheduling
Cache Aware Transformations
Exploiting Processing In Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures

More Related Content

What's hot

An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEWShiyong Lu
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitGanesan Narayanasamy
 
MSc Big Data: Connectomics Talk
MSc Big Data: Connectomics TalkMSc Big Data: Connectomics Talk
MSc Big Data: Connectomics TalkJohn Houston
 
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark Summit
 
Big Data and its emergence
Big Data and its emergenceBig Data and its emergence
Big Data and its emergencekoolkalpz
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...CitiusTech
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudIJAAS Team
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Cloudera, Inc.
 
IBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsIBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsthinkASG
 
Performance and Energy evaluation
Performance and Energy evaluationPerformance and Energy evaluation
Performance and Energy evaluationGIORGOS STAMELOS
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
From hadoop to spark
From hadoop to sparkFrom hadoop to spark
From hadoop to sparksteccami
 
Real time big data analytical architecture for remote sensing application
Real time big data analytical architecture for remote sensing applicationReal time big data analytical architecture for remote sensing application
Real time big data analytical architecture for remote sensing applicationLeMeniz Infotech
 
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...ijcsit
 
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabasePowering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabaseKinetica
 

What's hot (20)

An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
 
MSc Big Data: Connectomics Talk
MSc Big Data: Connectomics TalkMSc Big Data: Connectomics Talk
MSc Big Data: Connectomics Talk
 
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
 
Big Data and its emergence
Big Data and its emergenceBig Data and its emergence
Big Data and its emergence
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
CDSS
CDSSCDSS
CDSS
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
 
IBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsIBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deployments
 
Performance and Energy evaluation
Performance and Energy evaluationPerformance and Energy evaluation
Performance and Energy evaluation
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
From hadoop to spark
From hadoop to sparkFrom hadoop to spark
From hadoop to spark
 
Real time big data analytical architecture for remote sensing application
Real time big data analytical architecture for remote sensing applicationReal time big data analytical architecture for remote sensing application
Real time big data analytical architecture for remote sensing application
 
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
CYBER INFRASTRUCTURE AS A SERVICE TO EMPOWER MULTIDISCIPLINARY, DATA-DRIVEN S...
 
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabasePowering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
 

Viewers also liked

China desulfurization equipment industry market research and investment forec...
China desulfurization equipment industry market research and investment forec...China desulfurization equipment industry market research and investment forec...
China desulfurization equipment industry market research and investment forec...Qianzhan Intelligence
 
China investment attracting pattern and regional promotion planning report, 2...
China investment attracting pattern and regional promotion planning report, 2...China investment attracting pattern and regional promotion planning report, 2...
China investment attracting pattern and regional promotion planning report, 2...Qianzhan Intelligence
 
China pharmaceutical excipients industry indepth research and investment stra...
China pharmaceutical excipients industry indepth research and investment stra...China pharmaceutical excipients industry indepth research and investment stra...
China pharmaceutical excipients industry indepth research and investment stra...Qianzhan Intelligence
 
We Need More Legal Hackers Now!
We Need More Legal Hackers Now!We Need More Legal Hackers Now!
We Need More Legal Hackers Now!RightBrainLaw
 
CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA Y LA IDENTIDAD INSTITUCIONAL
CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA  Y LA IDENTIDAD INSTITUCIONAL CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA  Y LA IDENTIDAD INSTITUCIONAL
CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA Y LA IDENTIDAD INSTITUCIONAL martha calderon
 
China coal industry development trend and investment strategic decision repor...
China coal industry development trend and investment strategic decision repor...China coal industry development trend and investment strategic decision repor...
China coal industry development trend and investment strategic decision repor...Qianzhan Intelligence
 
China banking industry market research and prospect forecast report
China banking industry market research and prospect forecast reportChina banking industry market research and prospect forecast report
China banking industry market research and prospect forecast reportQianzhan Intelligence
 
China high end equipment manufacturing park development pattern and investmen...
China high end equipment manufacturing park development pattern and investmen...China high end equipment manufacturing park development pattern and investmen...
China high end equipment manufacturing park development pattern and investmen...Qianzhan Intelligence
 
China automated warehouse industry investment demand and development prospect...
China automated warehouse industry investment demand and development prospect...China automated warehouse industry investment demand and development prospect...
China automated warehouse industry investment demand and development prospect...Qianzhan Intelligence
 
COMO TRABAJO Y APLICO MIS COMPETENCIAS
COMO TRABAJO Y APLICO MIS COMPETENCIAS COMO TRABAJO Y APLICO MIS COMPETENCIAS
COMO TRABAJO Y APLICO MIS COMPETENCIAS martha calderon
 
Open source as a convivial and democratic mode of production
Open source as a convivial and democratic mode of productionOpen source as a convivial and democratic mode of production
Open source as a convivial and democratic mode of productionLouis Florin
 

Viewers also liked (20)

China desulfurization equipment industry market research and investment forec...
China desulfurization equipment industry market research and investment forec...China desulfurization equipment industry market research and investment forec...
China desulfurization equipment industry market research and investment forec...
 
Sonic
Sonic Sonic
Sonic
 
China investment attracting pattern and regional promotion planning report, 2...
China investment attracting pattern and regional promotion planning report, 2...China investment attracting pattern and regional promotion planning report, 2...
China investment attracting pattern and regional promotion planning report, 2...
 
China pharmaceutical excipients industry indepth research and investment stra...
China pharmaceutical excipients industry indepth research and investment stra...China pharmaceutical excipients industry indepth research and investment stra...
China pharmaceutical excipients industry indepth research and investment stra...
 
We Need More Legal Hackers Now!
We Need More Legal Hackers Now!We Need More Legal Hackers Now!
We Need More Legal Hackers Now!
 
CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA Y LA IDENTIDAD INSTITUCIONAL
CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA  Y LA IDENTIDAD INSTITUCIONAL CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA  Y LA IDENTIDAD INSTITUCIONAL
CAMAPAÑA PARA DIFUNDIR EL PACTO DE CONVIVENCIA Y LA IDENTIDAD INSTITUCIONAL
 
China coal industry development trend and investment strategic decision repor...
China coal industry development trend and investment strategic decision repor...China coal industry development trend and investment strategic decision repor...
China coal industry development trend and investment strategic decision repor...
 
China banking industry market research and prospect forecast report
China banking industry market research and prospect forecast reportChina banking industry market research and prospect forecast report
China banking industry market research and prospect forecast report
 
China high end equipment manufacturing park development pattern and investmen...
China high end equipment manufacturing park development pattern and investmen...China high end equipment manufacturing park development pattern and investmen...
China high end equipment manufacturing park development pattern and investmen...
 
Business plan
Business planBusiness plan
Business plan
 
Problem 1
Problem 1Problem 1
Problem 1
 
Genre research
Genre researchGenre research
Genre research
 
AppsNgen
AppsNgenAppsNgen
AppsNgen
 
China automated warehouse industry investment demand and development prospect...
China automated warehouse industry investment demand and development prospect...China automated warehouse industry investment demand and development prospect...
China automated warehouse industry investment demand and development prospect...
 
Tugas B.Inggris Pekan 1
Tugas B.Inggris Pekan 1Tugas B.Inggris Pekan 1
Tugas B.Inggris Pekan 1
 
COMO TRABAJO Y APLICO MIS COMPETENCIAS
COMO TRABAJO Y APLICO MIS COMPETENCIAS COMO TRABAJO Y APLICO MIS COMPETENCIAS
COMO TRABAJO Y APLICO MIS COMPETENCIAS
 
Welcome to 5th grade
Welcome to 5th gradeWelcome to 5th grade
Welcome to 5th grade
 
Open source as a convivial and democratic mode of production
Open source as a convivial and democratic mode of productionOpen source as a convivial and democratic mode of production
Open source as a convivial and democratic mode of production
 
South Korea
South KoreaSouth Korea
South Korea
 
Docker^3
Docker^3Docker^3
Docker^3
 

Similar to How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Ahsan Javed Awan
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLDESMOND YUEN
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...Facultad de Informática UCM
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET Journal
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!TigerGraph
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
Graph Data Science at Scale
Graph Data Science at ScaleGraph Data Science at Scale
Graph Data Science at ScaleNeo4j
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systemshdhappy001
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systemshdhappy001
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesJun Liu
 

Similar to How Data Volume Affects Spark Based Data Analytics on a Scale-up Server (20)

Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop Framework
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
Graph Data Science at Scale
Graph Data Science at ScaleGraph Data Science at Scale
Graph Data Science at Scale
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 Slides
 

Recently uploaded

一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单ewymefz
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .NABLAS株式会社
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

  • 1. How Data Volume Affects Spark Based Data Analytics on a Scale-up Server. Ahsan Javed Awan, EMJD-DC (KTH-UPC) (https://www.kth.se/profile/ajawan/); Mats Brorsson (KTH), Vladimir Vlassov (KTH) and Eduard Ayguade (UPC and BSC)
  • 2. Motivation: Why should we care about architecture support? Data Growing Faster Than Technology (*Source: SGI)
  • 3. Motivation (cont.): Our focus: improve the node-level performance through architecture support. Phoenix++, Metis, Ostrich, etc.; Hadoop, Spark, Flink, etc. (*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/)
  • 4. Motivation (cont.): ● A mismatch between the characteristics of emerging workloads and the underlying hardware. – M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in ASPLOS 2012. – Z. Jia et al., “Characterizing data analysis workloads in data centers,” in IISWC 2013. – Z. Jia et al., “Characterizing and subsetting big data workloads,” in IISWC 2014. – A. Yasin et al., “Deep-dive analysis of the data analytics workload in cloudsuite,” in IISWC 2014. – T. Jiang et al., “Understanding the behavior of in-memory computing workloads,” in IISWC 2014. Existing studies lack quantitative analysis of the bottlenecks of scale-out frameworks on a single node.
  • 5. Progress Meeting 12-12-14: Which Scale-out Framework? [Picture Courtesy: Amir H. Payberah]
  • 6. Our Approach: ● “Performance characterization of in-memory data analytics on a modern cloud server,” in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award). What are the major bottlenecks? ● How Data Volume Affects Spark Based Data Analytics on a Scale-up Server (focus of this talk).
  • 7. Our Approach: What are the remaining questions? ● Do Spark based data analytics benefit from using scale-up servers? ● How severe is the impact of garbage collection on performance of Spark based data analytics? ● Is file I/O detrimental to Spark based data analytics performance? ● How does data size affect the micro-architecture performance of Spark based data analytics?
  • 8. Our Approach: What are the contributions? ● We evaluate the impact of data volume on the performance of Spark based data analytics running on a scale-up server. ● We quantify the limitations of using Spark on a scale-up server with large volumes of data. ● We quantify the variations in micro-architectural performance of applications across different data volumes.
  • 9. Our Approach: Methodology ● Use a subset of benchmarks from BigDataBench. ● Use the Big Data Generator Suite (BDGS) to generate synthetic datasets of 6 GB, 12 GB and 24 GB. ● Configure Spark in local mode and tune its internal parameters. ● Rely on GC logs to collect garbage collection times. ● Use Spark logs to gather the execution time of benchmarks. ● Use Concurrency Analysis in Intel VTune to collect wait time and CPU time of executor pool threads. ● Use General Micro-architectural Exploration in Intel VTune to analyze the impact of data volume on micro-architecture characteristics.
  • 10. Our Approach: What are the characteristics of benchmarks?
  • 12. Our Hardware Configuration: Machine Details. Hyper-Threading and Turbo Boost are disabled.
  • 14. Motivation: Do Spark based data analytics benefit from using larger scale-up servers? Spark applications do not benefit significantly from using executors with more than 12 cores.
  • 15. Motivation: Is GC detrimental to scalability of Spark applications? The proportion of GC time increases with the number of cores.
  • 16. Motivation: Does performance remain consistent as we enlarge the data size? The decrease in data processed per second (DPS) ranges from 11% to 93% (Parallel Scavenge).
  • 17. Motivation: Does the choice of garbage collector impact the data processing capability of the system? Improvement in DPS ranges from 1.4x to 3.7x on average with Parallel Scavenge as compared to G1.
  • 18. Motivation: How does GC affect the data processing capability of the system? GC time does not scale linearly with data size.
  • 19. Motivation: How does CPU utilization scale with data volume? CPU utilization decreases as input data size increases.
  • 20. Motivation: Is file I/O detrimental to performance? The fraction of file I/O increases by 6x, 18x and 25x for Word Count, Naive Bayes and Sort respectively when input data is increased by 4x.
  • 21. Motivation: How does data size affect micro-architectural performance? 5 to 10% better instruction retirement as we enlarge the data size.
  • 22. Motivation (cont.): Execution units inside the core exhibit improved utilization at larger data sets.
  • 23. Motivation (cont.): Increase in L1 Bound stalls implies better utilization of L1 caches.
  • 24. Motivation (cont.): Spark benchmarks exhibit reduced memory bandwidth utilization.
  • 25. Key Findings: ● Spark workloads do not benefit significantly from executors with more than 12 cores. ● The performance of Spark workloads degrades with large volumes of data due to a substantial increase in garbage collection and file I/O time. ● Without any tuning, the Parallel Scavenge garbage collection scheme outperforms the Concurrent Mark Sweep and G1 garbage collectors for Spark workloads. ● Spark workloads exhibit improved instruction retirement due to lower L1 cache misses and better utilization of functional units inside cores at large volumes of data. ● Memory bandwidth utilization of Spark benchmarks decreases with large volumes of data and is 3x lower than the available off-chip bandwidth on our test machine.
  • 26. Future Directions: NUMA Aware Task Scheduling; Cache Aware Transformations; Exploiting Processing In Memory Architectures; HW/SW Data Prefetching; Rethinking Memory Architectures
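
The results slides report data processed per second (DPS) and the proportion of time spent in garbage collection. The sketch below shows one plausible way to derive these metrics from execution times (taken from Spark logs) and GC times (taken from GC logs). It assumes DPS is input size divided by execution time, which the slides do not define explicitly. The function names and all measurements are hypothetical placeholders, not values from the talk.

```python
def data_processed_per_second(input_gb: float, exec_seconds: float) -> float:
    """Assumed definition of DPS: input volume divided by execution time (GB/s)."""
    if exec_seconds <= 0:
        raise ValueError("execution time must be positive")
    return input_gb / exec_seconds


def percent_decrease(baseline: float, value: float) -> float:
    """Relative drop of `value` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - value) / baseline


def gc_share(gc_seconds: float, exec_seconds: float) -> float:
    """Fraction of the execution time spent in garbage collection."""
    if exec_seconds <= 0:
        raise ValueError("execution time must be positive")
    return gc_seconds / exec_seconds


# Hypothetical measurements for one benchmark, keyed by input size in GB:
# (execution time in seconds, total GC time in seconds).
runs = {
    6: (60.0, 6.0),
    12: (150.0, 30.0),
    24: (480.0, 192.0),
}

dps = {size: data_processed_per_second(size, t) for size, (t, _) in runs.items()}
gc = {size: gc_share(g, t) for size, (t, g) in runs.items()}
drop_6_to_24 = percent_decrease(dps[6], dps[24])

if __name__ == "__main__":
    for size in sorted(runs):
        print(f"{size:>2} GB: DPS = {dps[size]:.3f} GB/s, GC share = {gc[size]:.0%}")
    print(f"DPS decrease from 6 GB to 24 GB: {drop_6_to_24:.1f}%")
```

With real measurements, the same three functions could be applied per benchmark and per garbage collector; the ratio of two DPS values then gives a speedup figure of the kind quoted for Parallel Scavenge versus G1.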