Identifying the Potential of Near
Data Processing for Apache Spark
Ahsan J. Awan (KTH), Eduard Ayguade (BSC), Mats Brorsson (KTH), Moriyoshi Ohara (IBM),
Kazuaki Ishizaki (IBM), and Vladimir Vlassov (KTH)
KTH Royal Institute of Technology, Sweden
BSC Barcelona Super Computing Center, Spain
IBM Research Tokyo, Japan
MEMSYS 2017, Oct 2-5, 2017, Washington DC, VA
Motivation / "Big Picture"
Identifying The Potential Of Near Data Processing For Apache Spark, MEMSYS 2017, Washington DC, VA, USA
Previous Work / Further Reading
•  Performance characterization of in-memory data analytics on a
modern cloud server, 5th IEEE Conference on Big Data and Cloud
Computing, 2015 (Best Paper Award).
•  How Data Volume Affects Spark Based Data Analytics on a Scale-up
Server, 6th Workshop on Big Data Benchmarks, Performance
Optimization and Emerging Hardware (BPOE), held in conjunction with
VLDB 2015, Hawaii, USA.
•  Micro-architectural Characterization of Apache Spark on Batch and
Stream Processing Workloads, 6th IEEE Conference on Big Data and
Cloud Computing, 2016.
•  Node Architecture Implications for In-Memory Data Analytics in
Scale-in Clusters, 3rd IEEE/ACM Conference in Big Data Computing,
Applications and Technologies, 2016.
Spark
A fast and general engine for large-scale data processing (https://spark.apache.org/).
Resilient Distributed Datasets (RDDs)
•  immutable collections of objects spread across a cluster
Data-frames
Higher-order user-defined functions
•  Transformations (RDD → RDD)
•  Actions (RDD → non-RDD)
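The difference between transformations and actions on this slide can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's API: the `ToyRDD` class and its methods are hypothetical stand-ins, assuming a single process and an in-memory list as input. Transformations return a new immutable collection and compute nothing; actions return a plain value and trigger the computation.

```python
from typing import Callable, Iterable, Iterator, List


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, source: Callable[[], Iterator]):
        # Keep a recipe that produces the elements, not the elements themselves.
        self._source = source

    @staticmethod
    def from_list(items: Iterable) -> "ToyRDD":
        data = tuple(items)  # immutable snapshot of the input
        return ToyRDD(lambda: iter(data))

    # Transformations: RDD -> RDD. They only build a new recipe.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: RDD -> non-RDD. They run the recipe and return a plain value.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


lines = ToyRDD.from_list(["spark ndp", "pim isp", "spark pim"])
with_spark = lines.filter(lambda s: "spark" in s)  # transformation, lazy
lengths = with_spark.map(len)                      # transformation, lazy
print(lengths.collect())                           # action, evaluates the chain
print(with_spark.count())                          # action
```

Because each transformation wraps the previous recipe instead of mutating it, `lines` stays unchanged and can be reused by other chains, which mirrors the immutability noted above.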
Exploiting Near Data Processing (NDP)
1.  Processing-In-Memory (PIM)
2.  In-Storage Processing (ISP)
Improve the performance by reducing costly data movements back and forth between the CPUs and memories
G. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts, M. Meswani, D. Zhang, and M. Ignatowski. A processing in memory taxonomy and a case
for studying fixed-function PIM. In Workshop on Near-Data Processing (WoNDP), 2013.
3D-stacked PIM for Data Analytics
for Map-Reduce
•  perform Map operations on simple processing cores in the logic
layer of 3D-stacked memory devices
for Machine Learning
•  offload atomic operations onto logic layers in 3D stacked
memories
for Graph Analytics
•  offload the graph property calculations to HMC
for SQL queries
•  Joins can benefit from 3D-stacked PIM
Expected benefits of NDP for big-data analytics
•  PIM for DRAM-bound applications, e.g., map-reduce, graph- and stream-processing, ML
•  ISP for I/O-bound (non-iterative) applications, e.g., SQL
•  PIM + ISP for phasic applications that are both memory- and I/O-bound, e.g., clustering (k-means), some graph processing
Can Spark workloads benefit from NDP?
Methodology
Identifying the potential of NDP to boost the performance of Spark workloads
by matching the characteristics of the workloads to different forms of NDP
(2D integrated PIM, 3D Stacked PIM, ISP)
Representative Spark workloads (most from BigDataBench)
•  Batch, SQL, stream-, graph-processing, ML
•  should cover a diverse set of Spark transformations and actions
•  should be common among available big-data benchmark suites
•  should have been used in the evaluation of MapReduce frameworks
Workloads (1/2)
Workloads (2/2)
System Configuration
•  Hyper-Threading and Turbo Boost are disabled
•  Spark in local mode: driver and executor are in the same JVM
•  HotSpot JDK version 7u71 in server mode
Measurement Tools and Metrics
•  iotop to measure the total disk bandwidth
•  top to measure %usr and %io
•  Intel VTune Amplifier to collect hardware performance counters
Metrics for Top-Down Analysis of Workloads
The case for Processing-In-Memory:
2D Integrated PIM instead of 3D Stacked PIM
M. Radulovic et al. Another Trip to the Wall: How Much Will Stacked DRAM Benefit
HPC? In MEMSYS ’15.
The case for In-Storage-Processing
Figure panels: Grep (Gp), K-means (Km), Windowed Word Count (Wwc)
A refined hypothesis based on workload characterization
•  Non-iterative Spark workloads with a high ratio of I/O wait time to CPU time, e.g., join, aggregation, filter, word count and sort, are ideal candidates for ISP.
•  Spark workloads with a low ratio of I/O wait time to CPU time, e.g., stream processing and iterative graph processing, are bound by the latency of frequent accesses to DRAM and are ideal candidates for 2D integrated PIM.
•  Iterative Spark workloads with a moderate ratio of I/O wait time to CPU time, e.g., K-means, have both I/O-bound and memory-bound phases and hence will benefit from hybrid 2D integrated PIM and ISP.
•  To satisfy the varying compute demands of Spark workloads, we envision an NDC architecture with programmable-logic-based hybrid ISP and 2D integrated PIM.
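The hypothesis above can be read as a small decision rule over the ratio of I/O wait time to CPU time. The following is a minimal sketch of that rule, not the authors' tooling. The slides give no numeric thresholds, so `HIGH_RATIO` and `LOW_RATIO` are hypothetical cut-offs, and `WorkloadProfile`, `io_to_cpu_ratio` and `suggest_ndp` are illustrative names. The sample numbers are made up and are not measurements.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: the slides only say "high", "moderate" and "low".
HIGH_RATIO = 1.0
LOW_RATIO = 0.1


@dataclass(frozen=True)
class WorkloadProfile:
    name: str
    io_wait_time: float  # time spent waiting on I/O, in any consistent unit
    cpu_time: float      # time spent on the CPU, in the same unit
    iterative: bool


def io_to_cpu_ratio(profile: WorkloadProfile) -> float:
    """Return I/O wait time divided by CPU time."""
    if profile.cpu_time <= 0:
        raise ValueError("cpu_time must be positive")
    return profile.io_wait_time / profile.cpu_time


def suggest_ndp(profile: WorkloadProfile) -> str:
    """Map a profile to the form of NDP named in the refined hypothesis."""
    ratio = io_to_cpu_ratio(profile)
    if ratio <= LOW_RATIO:
        return "2D integrated PIM"
    if ratio >= HIGH_RATIO and not profile.iterative:
        return "ISP"
    if LOW_RATIO < ratio < HIGH_RATIO and profile.iterative:
        return "hybrid 2D integrated PIM + ISP"
    # Combinations the hypothesis does not address are reported as such.
    return "not covered by the hypothesis"


# Made-up profiles, for illustration only.
samples = [
    WorkloadProfile("non-iterative, I/O heavy", 30.0, 10.0, iterative=False),
    WorkloadProfile("streaming, DRAM bound", 0.5, 50.0, iterative=False),
    WorkloadProfile("iterative, mixed phases", 10.0, 25.0, iterative=True),
]
for sample in samples:
    print(f"{sample.name}: {suggest_ndp(sample)}")
```

Cases the slides do not address, such as an iterative workload with a high ratio, are returned as "not covered by the hypothesis" instead of being forced into one of the three classes.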
Contact
Ahsan Javed Awan
KTH Royal Institute of Technology, Stockholm, Sweden
Email: ajawn@kth.se
Profile: www.kth.se/profile/ajawan/
 
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
 
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
 
9953330565 Low Rate Call Girls In Jahangirpuri Delhi NCR
9953330565 Low Rate Call Girls In Jahangirpuri  Delhi NCR9953330565 Low Rate Call Girls In Jahangirpuri  Delhi NCR
9953330565 Low Rate Call Girls In Jahangirpuri Delhi NCR
 

Potential of NDP for Apache Spark

  • 1. Identifying the Potential of Near Data Processing for Apache Spark. Ahsan J. Awan (KTH), Eduard Ayguade (BSC), Mats Brorsson (KTH), Moriyoshi Ohara (IBM), Kazuaki Ishizaki (IBM), and Vladimir Vlassov (KTH). KTH Royal Institute of Technology, Sweden; BSC Barcelona Supercomputing Center, Spain; IBM Research Tokyo, Japan. MEMSYS 2017, Oct 2-5, 2017, Washington DC, VA
  • 2. Motivation / "Big Picture". Identifying The Potential Of Near Data Processing For Apache Spark, MEMSYS 2017, Washington DC, VA, USA
  • 3. Previous Work / Further Reading
      • Performance characterization of in-memory data analytics on a modern cloud server, 5th IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
      • How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, 6th Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BpoE), held in conjunction with VLDB 2015, Hawaii, USA.
      • Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads, 6th IEEE Conference on Big Data and Cloud Computing, 2016.
      • Node Architecture Implications for In-Memory Data Analytics in Scale-in Clusters, 3rd IEEE/ACM Conference in Big Data Computing, Applications and Technologies, 2016.
  • 4. Spark: a fast and general engine for large-scale data processing (https://spark.apache.org/).
      • Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster
      • Data-frames
      • Higher-order user-defined functions
          • Transformations (RDD → RDD)
          • Actions (RDDs → non-RDD)
  • 5. Exploiting Near Data Processing (NDP)
      1. Processing-In-Memory (PIM)
      2. In-Storage Processing (ISP)
      Improve performance by reducing costly data movements back and forth between the CPUs and memories.
      G. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts, M. Meswani, D. Zhang, and M. Ignatowski. A processing in memory taxonomy and a case for studying fixed-function PIM. In Workshop on Near-Data Processing (WoNDP), 2013.
  • 6. 3D-stacked PIM for Data Analytics
      • for Map-Reduce: perform Map operations on simple processing cores in the logic layer of 3D-stacked memory devices
      • for Machine Learning: offload atomic operations onto logic layers in 3D-stacked memories
      • for Graph Analytics: offload the graph property calculations to HMC
      • for SQL queries: joins can benefit from 3D-stacked PIM
  • 7. Expected benefits of NDP for big-data analytics
      • PIM for DRAM-bound applications, e.g., map-reduce, graph- and stream-processing, ML
      • ISP for I/O-bound (non-iterative) applications, e.g., SQL
      • PIM + ISP for phasic applications that are both memory- and I/O-bound, e.g., clustering (k-means), some graph processing
  • 8. Can Spark workloads benefit from NDP?
  • 9. Methodology
      Identify the potential of NDP to boost the performance of Spark workloads by matching the characteristics of the workloads to different forms of NDP (2D integrated PIM, 3D-stacked PIM, ISP).
      Representative Spark workloads (most from BigDataBench):
      • Batch, SQL, stream-processing, graph-processing, ML
      • should cover a diverse set of Spark transformations and actions
      • should be common among available big-data benchmark suites
      • should have been used in the evaluation of MR frameworks
  • 10. Workloads (1/2)
  • 11. Workloads (2/2)
  • 12. System Configuration
      • Hyper-Threading and Turbo Boost are disabled
      • Spark in local mode: driver and executor are in the same JVM
      • HotSpot JDK version 7u71 in server mode
  • 13. Measurement Tools and Metrics
      • iotop to measure the total disk bandwidth
      • top to measure %usr and %io
      • Intel VTune Amplifier to collect hardware performance counters
      Table: Metrics for Top-Down Analysis of Workloads
  • 14. The case for Processing-In-Memory: 2D Integrated PIM instead of 3D Stacked PIM
      M. Radulovic et al. Another Trip to the Wall: How Much Will Stacked DRAM Benefit HPC? In MEMSYS '15.
  • 15. The case for In-Storage-Processing
      Panels: Grep (Gp), K-means (Km), Windowed Word Count (Wwc)
  • 16. A refined hypothesis based on workload characterization
      • Non-iterative Spark workloads with a high ratio of I/O wait time / CPU time, e.g., join, aggregation, filter, word count and sort, are ideal candidates for ISP.
      • Spark workloads with a low ratio of I/O wait time / CPU time, e.g., stream processing and iterative graph processing, are bound by the latency of frequent accesses to DRAM and are ideal candidates for 2D integrated PIM.
      • Iterative Spark workloads with a moderate ratio of I/O wait time / CPU time, e.g., K-means, have both I/O-bound and memory-bound phases and hence will benefit from hybrid 2D integrated PIM and ISP.
      • In order to satisfy the varying compute demands of Spark workloads, we envision an NDC architecture with programmable-logic-based hybrid ISP and 2D integrated PIM.
  • 18. Contact
      Ahsan Javed Awan, KTH Royal Institute of Technology, Stockholm, Sweden
      Email: ajawn@kth.se
      Profile: www.kth.se/profile/ajawan/
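
The refined hypothesis on slide 16 can be read as a small decision rule over two workload properties: the ratio of I/O wait time to CPU time, and whether the workload is iterative. The sketch below is one possible encoding of that rule, not the authors' tooling. The cut-offs `LOW_RATIO` and `HIGH_RATIO`, the `Workload` type, the function name `suggest_ndp`, and the sample numbers are all hypothetical, since the slides only speak of "low", "moderate" and "high" ratios without giving values.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: the slides only say "low", "moderate" and "high".
LOW_RATIO = 0.1
HIGH_RATIO = 1.0


@dataclass
class Workload:
    name: str
    io_wait_time: float  # time spent waiting on I/O, in any consistent unit
    cpu_time: float      # time spent executing on the CPU, same unit
    iterative: bool      # True for iterative workloads such as K-means


def suggest_ndp(w: Workload) -> str:
    """Map a workload to the form of NDP named in the refined hypothesis."""
    if w.cpu_time <= 0:
        raise ValueError("cpu_time must be positive")
    ratio = w.io_wait_time / w.cpu_time
    if ratio < LOW_RATIO:
        # Low ratio: bound by DRAM access latency, iterative or not.
        return "2D integrated PIM"
    if ratio >= HIGH_RATIO:
        # High ratio: the hypothesis only covers non-iterative workloads.
        return "ISP" if not w.iterative else "not covered by the hypothesis"
    # Moderate ratio: the hypothesis only covers iterative workloads.
    if w.iterative:
        return "hybrid 2D integrated PIM + ISP"
    return "not covered by the hypothesis"


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the talk.
    samples = [
        Workload("grep", io_wait_time=30.0, cpu_time=10.0, iterative=False),
        Workload("windowed word count", io_wait_time=0.5, cpu_time=10.0, iterative=False),
        Workload("k-means", io_wait_time=4.0, cpu_time=10.0, iterative=True),
    ]
    for w in samples:
        print(f"{w.name}: {suggest_ndp(w)}")
```

Combinations that the hypothesis does not mention, such as an iterative workload with a high ratio, are reported as not covered rather than guessed. In practice the two times could be derived from the %io and %usr figures that `top` reports, as listed on slide 13.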