SlideShare a Scribd company logo
Boosting Spark Performance
on Many-Core Machines
Qifan Pu
Sameer Agarwal (Databricks)
Reynold Xin (Databricks)
Ion Stoica
UC	
  BERKELEY	
  
Me
Ph.D. Student in AMPLab, advised by Prof. Ion Stoica
- Spark-related research projects
- e.g., how to run Spark in a geo-distributed fashion
- Big data storage (e.g., Alluxio)
- how to do memory management for multiple users
Intern at Databricks in the past summer
- Spark SQL team: aggregates, shuffle
This project
Spark performance on many-core machines
- Ongoing research, feedbacks are welcome
-  Focus of this talk:
•  understand shuffle performance
•  Investigate and implement In-memory shuffle
Moving beyond research
-  Hope is to get into Spark (but no guarantee yet)
Why do we care about many-core machines?
rdd.groupBy(…)…	
  
…	
  
…	
  
…	
  
…	
  
Spark started as a cluster computing framework
•  Spark’s one big success has been high scalability
•  Largest cluster known is 8000
•  Winner of 2014 Daytona GraySort Benchmark
•  Typical cluster sizes in practice:
“Our observation is that companies typically experiment with
cluster size of under 100 nodes and expand to 200 or more
nodes in the production stages.”
-- Susheel Kaushik (Senior Director at Pivotal)
Spark on cluster
Increasingly powerful single node machines
•  More cores packed on single chip
•  Intel Xeon Phi: 64-72 cores
•  Berkeley FireBox Project: ~100 cores
•  Larger instances in the Cloud
•  Various 32-core instances on EC2, Azure & GCE
•  EC2 X-Instance with 128 cores, 2TB (May 2016)
Memory (GB) vCPUs Hourly Cost ($) Cost/100GB Cost/8vCPU
x1.32xlarge 1952 128 13.338 0.68 0.83
g2.2xlarge 15 8 0.65 4.33 0.65
i2.2xlarge 61 8 1.705 2.80 1.70
m3.2xlarge 30 8 0.532 1.78 0.53
c3.2xlarge 15 8 0.42 2.80 0.42
Cost of many-core nodes
1 x1.32xlarge instance is a small cluster
(with more memory, fast inter-core network)
Spark’s design was based on many nodes
•  Data communication (a.k.a. shuffle)
•  Store intermediate data on disk
•  Serialization/deserialization needed across nodes
•  Now: much memory to spare, intra-node shuffle
•  Resource management
•  Designed to handle moderate amount on each node
•  Now: 1 executor for 100 cores + 2TB memory?
Focus of this talk
Ongoing work
Can we improve shuffle on single, multi-core
machines?
1, Memory is fast
2, We can use memory for shuffle
3, Therefore,
Shuffle will be fast
Will this “common sense” work?
Put in practice…
•  Spark: write to a file stream, and save all bytes to disk
•  Solution: …, …. to memory (bytes on heap)
spark.range(16M).repartition(1).selectExpr("sum(id)").collect()
0
 1
 2
 3
 4
 5
Vanilla Spark
Attempt 1
runtime (seconds)
Why zero improvement?
Attempt 1
Vanilla Spark
flushBuffer(),	
  4.81%	
  of	
  4me	
  
En4re	
  dura4on	
  of	
  a	
  job	
  (100%)	
  
Why zero improvement?
1, I/O throughput is not the bottleneck (in this job)
2, Buffer cache:
memory is being exploited by disk-based shuffle
Why zero improvement?
Understanding Shuffle Performance
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Queue	
  Input	
  
Filestreams	
  
flush	
  
flush	
  
flush	
  
Mappers	
  
Understanding Shuffle Performance
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Results	
  or	
  another	
  shuffle	
  
Reducers	
  
More complications
•  Sort vs. hash-based shuffle
•  Spill when memory runs out
•  Clear data after shuffle
…
Case 1
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Queue	
  Input	
  
Filestreams	
  
flush	
  
flush	
  
flush	
  
Thread	
  1	
  
Thread	
  2	
  
Case 1
Queue	
  Input	
  
flush	
  
flush	
  
flush	
  
Thread	
  1	
  
Thread	
  2	
  Case	
  1:	
  thread	
  1	
  slower	
  than	
  thread	
  2	
  
Filestreams	
  
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Case 2
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Queue	
  Input	
  
Filestreams	
  
flush	
  
flush	
  
flush	
  
Case 2
Queue	
  Input	
  
flush	
  
flush	
  
flush	
  
Case	
  2:	
  wri4ng/reading	
  file	
  streams	
  is	
  slow	
  
Filestreams	
  
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Case study 2
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Results	
  or	
  another	
  shuffle	
  
Reducers	
  
Case 3
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Queue	
  Input	
  
Filestreams	
  
flush	
  
flush	
  
flush	
  
Case 3
Queue	
  Input	
  
flush	
  
flush	
  
flush	
  
Case	
  3:	
  I/O	
  conten4ons	
  	
  
Filestreams	
  
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Our	
  first	
  aTempt	
  should	
  work	
  in	
  this	
  case!	
  
Can we improve case 2?
Queue	
  Input	
  
flush	
  
flush	
  
flush	
  
Filestreams	
  
Generated	
  Iterator	
  
OP1	
   OP2	
   OP3	
  
Previous	
  example	
  is	
  case	
  2.	
  
Attempt 2: get rid of ser/deser, etc
•  Attempt 2: create NxM queues (N=mappers, M=reducers)
push corresponding records into queues
mappers reducers
•  No serialization
•  No copy (data structure
shared by both sides)
Attempt 2: get rid of ser/deser, etc
•  Attempt 2: create NxM queues (N=mappers, M=reducers)
push corresponding records into queues
0
 2
 4
 6
 8
 10
 12
 14
 16
 18
Vanilla Spark
Attempt 1
Attempt 2
runtime (seconds)
3x	
  worse	
  
0
 2
 4
 6
 8
 10
 12
 14
 16
 18
Vanilla Spark
Attempt 1
Attempt 2
runtime (seconds)
Processing (s)
GC (s)
Attempt 2: get rid of ser/deser, etc
•  Attempt 2: create NxM queues (N=mappers, M=reducers)
push corresponding records into queues
•  Attempt 3: instead of queue, copy records to memory pages
Number of objects: ~records à ~pages
NxM pages, or alternatively, one page per reducer
Attempt 3: avoiding GC
record1	
   recor..	
  
..d2	
   record3	
  
…	
  
…	
  
…	
  
•  Unsafe Row (a buffer-backed row format):
•  row.pointTo(buffer)
Spark SQL
record1	
   recor..	
  
..d2	
   record3	
  
…	
  
…	
  
…	
  
Instantaneous creation of
unsafe rows by pointing to
different offsets In the
page
Attempt 3: avoiding GC
•  Attempt 3: copy records onto large memory pages
0
 2
 4
 6
 8
 10
 12
 14
 16
 18
Vanilla Spark
Attempt 1
Attempt 2
Attempt 3
runtime (seconds)
Improvement	
  from	
  avoid	
  ser/deser,	
  copy,	
  
I/O	
  conten4on,	
  expensive	
  code	
  path	
  	
  
Consistent improvement with varying size
spark.range(N).repar44on().selectExpr("sum(id)").collect()	
  
Use	
  N	
  from	
  2^	
  20	
  to	
  2^27 	
  	
  
0
50
100
150
200
250
300
Disk-based
 In-memory Shuffle
nanoseconds/row
2^20
2^21
2^22
2^23
2^24
2^25
2^26
2^27
TPC-DS performance (single node)
0
10
20
30
40
50
60
QueryRuntime(s)
In-memory Shuffle
Vanilla-Spark
27/33 queries improve with a median of 31%
Extending to multiple nodes
•  Implementation
•  All data goes to memory
•  For remote transfer, copy from memory to network buffer
•  A more memory-preserving way…
•  Local transfer goes to memory
•  Remote transfer goes to disk
•  Cons1: have to enforce stricter locality on reducers
•  Cons2: cannot avoid I/O contentions
spark.range(N).repar44on().selectExpr("sum(id)").collect()	
  
Simple shuffle job
0
100
200
300
400
500
600
1
 2
 3
 4
 1
 2
 3
 4
ns/row
Reduce Stage
Map Stage
Vanilla-Spark
 In-memory Shuffle
Map:
Consistent improvement
Reduce:
Improvement decreases with
more nodes
TPC-DS performance (x1.xlarge32)
0
20
40
60
80
q13
 q20
 q18
 q11
 q3
QueryRuntime(s)
In-memory Shuffle
Vanilla-Spark
•  SF=100
•  Pick top 5 queries
from single node
experiment
•  Best of 10 runs
Many other performance bottlenecks need investigation!
Summary
•  Spark on many-core requires many architectural changes
•  In-memory shuffle
•  How to improve shuffle performance with memory
•  31% improvement over Spark
•  On-going research
•  Identify other performance bottlenecks
Thank you
Qifan Pu
qifan@cs.berkeley.edu

More Related Content

What's hot

Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Databricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
Databricks
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
Spark Summit
 
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Spark Summit
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Spark Summit
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
Kazuaki Ishizaki
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
Spark Summit
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
Databricks
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
Jaroslav Bachorik
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
Spark Summit
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Spark Summit
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
Spark Summit
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
Databricks
 

What's hot (20)

Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 

Viewers also liked

Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
Spark Summit
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
Spark Summit
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
Spark Summit
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
Maksud Ibrahimov
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
Jen Aman
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Vasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python ProfilingVasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python Profiling
Sergey Arkhipov
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
Piotr Przymus
 
Denis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python Performance
Sergey Arkhipov
 
The High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian OzsvaldThe High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian Ozsvald
PyData
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
GlobalLogic Ukraine
 
Spark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance TuningSpark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance Tuning
晨揚 施
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
Spark Summit
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
Spark Summit
 
Python profiling
Python profilingPython profiling
Python profiling
dreampuf
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Data Con LA
 

Viewers also liked (20)

Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Vasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python ProfilingVasiliy Litvinov - Python Profiling
Vasiliy Litvinov - Python Profiling
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
Denis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python Performance
 
The High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian OzsvaldThe High Performance Python Landscape by Ian Ozsvald
The High Performance Python Landscape by Ian Ozsvald
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
 
Spark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance TuningSpark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance Tuning
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
Python profiling
Python profilingPython profiling
Python profiling
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 

Similar to Spark Summit EU talk by Qifan Pu

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
inside-BigData.com
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
Databricks
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
andrew311
 
Hoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdfHoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdf
AshutoshKumar437302
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
Alexandre Rafalovitch
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Lucidworks
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Databricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
markgrover
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Tokyo Institute of Technology
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous Speed
J On The Beach
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: Sharding
MongoDB
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
JAXLondon2014
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
DataWorks Summit/Hadoop Summit
 

Similar to Spark Summit EU talk by Qifan Pu (20)

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Hoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdfHoard_2022AIM1001.pptx.pdf
Hoard_2022AIM1001.pptx.pdf
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous Speed
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: Sharding
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
birajmohan012
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
huseindihon
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
vrvipin164
 
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
44annissa
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
kinni singh$A17
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
fatima shekh$A17
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
harendmgr
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
huseindihon
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
palanisamyiiiier
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
LINAT
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
sharonblush
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
uapta
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
ginni singh$A17
 
Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)
Alireza Kamrani
 
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
kuldeepsharmaks8120
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
Joel Ngushwai
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
45unexpected
 
all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...
palaniappancse
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 

Recently uploaded (20)

Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
 
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 
Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)
 
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
 
all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 

Spark Summit EU talk by Qifan Pu

  • 1. Boosting Spark Performance on Many-Core Machines Qifan Pu Sameer Agarwal (Databricks) Reynold Xin (Databricks) Ion Stoica UC  BERKELEY  
  • 2. Me Ph.D. Student in AMPLab, advised by Prof. Ion Stoica - Spark-related research projects - e.g., how to run Spark in a geo-distributed fashion - Big data storage (e.g., Alluxio) - how to do memory management for multiple users Intern at Databricks in the past summer - Spark SQL team: aggregates, shuffle
  • 3. This project Spark performance on many-core machines - Ongoing research, feedbacks are welcome -  Focus of this talk: •  understand shuffle performance •  Investigate and implement In-memory shuffle Moving beyond research -  Hope is to get into Spark (but no guarantee yet)
  • 4. Why do we care about many-core machines? rdd.groupBy(…)…   …   …   …   …   Spark started as a cluster computing framework
  • 5. •  Spark’s one big success has been high scalability •  Largest cluster known is 8000 •  Winner of 2014 Daytona GraySort Benchmark •  Typical cluster sizes in practice: “Our observation is that companies typically experiment with cluster size of under 100 nodes and expand to 200 or more nodes in the production stages.” -- Susheel Kaushik (Senior Director at Pivotal) Spark on cluster
  • 6. Increasingly powerful single node machines •  More cores packed on single chip •  Intel Xeon Phi: 64-72 cores •  Berkeley FireBox Project: ~100 cores •  Larger instances in the Cloud •  Various 32-core instances on EC2, Azure & GCE •  EC2 X-Instance with 128 cores, 2TB (May 2016)
  • 7. Memory (GB) vCPUs Hourly Cost ($) Cost/100GB Cost/8vCPU x1.32xlarge 1952 128 13.338 0.68 0.83 g2.2xlarge 15 8 0.65 4.33 0.65 i2.2xlarge 61 8 1.705 2.80 1.70 m3.2xlarge 30 8 0.532 1.78 0.53 c3.2xlarge 15 8 0.42 2.80 0.42 Cost of many-core nodes 1 x1.32xlarge instance is a small cluster (with more memory, fast inter-core network)
  • 8. Spark’s design was based on many nodes •  Data communication (a.k.a. shuffle) •  Store intermediate data on disk •  Serialization/deserialization needed across nodes •  Now: much memory to spare, intra-node shuffle •  Resource management •  Designed to handle moderate amount on each node •  Now: 1 executor for 100 cores + 2TB memory? Focus of this talk Ongoing work
  • 9. Can we improve shuffle on single, multi-core machines? 1, Memory is fast 2, We can use memory for shuffle 3, Therefore, Shuffle will be fast Will this “common sense” work?
  • 10. Put in practice… •  Spark: write to a file stream, and save all bytes to disk •  Solution: …, …. to memory (bytes on heap) spark.range(16M).repartition(1).selectExpr("sum(id)").collect() 0 1 2 3 4 5 Vanilla Spark Attempt 1 runtime (seconds) Why zero improvement? Attempt 1 Vanilla Spark
  • 11. flushBuffer(),  4.81%  of  4me   En4re  dura4on  of  a  job  (100%)   Why zero improvement?
  • 12. 1, I/O throughput is not the bottleneck (in this job) 2, Buffer cache: memory is being exploited by disk-based shuffle Why zero improvement?
  • 13. Understanding Shuffle Performance Generated  Iterator   OP1   OP2   OP3   Queue  Input   Filestreams   flush   flush   flush   Mappers  
  • 14. Understanding Shuffle Performance Generated  Iterator   OP1   OP2   OP3   Results  or  another  shuffle   Reducers  
  • 15. More complications •  Sort vs. hash-based shuffle •  Spill when memory runs out •  Clear data after shuffle …
  • 16. Case 1 Generated  Iterator   OP1   OP2   OP3   Queue  Input   Filestreams   flush   flush   flush   Thread  1   Thread  2  
  • 17. Case 1 Queue  Input   flush   flush   flush   Thread  1   Thread  2  Case  1:  thread  1  slower  than  thread  2   Filestreams   Generated  Iterator   OP1   OP2   OP3  
  • 18. Case 2 Generated  Iterator   OP1   OP2   OP3   Queue  Input   Filestreams   flush   flush   flush  
  • 19. Case 2 Queue  Input   flush   flush   flush   Case  2:  wri4ng/reading  file  streams  is  slow   Filestreams   Generated  Iterator   OP1   OP2   OP3  
  • 20. Case study 2 Generated  Iterator   OP1   OP2   OP3   Results  or  another  shuffle   Reducers  
  • 21. Case 3 Generated  Iterator   OP1   OP2   OP3   Queue  Input   Filestreams   flush   flush   flush  
  • 22. Case 3 Queue  Input   flush   flush   flush   Case  3:  I/O  conten4ons     Filestreams   Generated  Iterator   OP1   OP2   OP3   Our  first  aTempt  should  work  in  this  case!  
  • 23. Can we improve case 2? Queue  Input   flush   flush   flush   Filestreams   Generated  Iterator   OP1   OP2   OP3   Previous  example  is  case  2.  
  • 24. Attempt 2: get rid of ser/deser, etc •  Attempt 2: create NxM queues (N=mappers, M=reducers) push corresponding records into queues mappers reducers •  No serialization •  No copy (data structure shared by both sides)
  • 25. Attempt 2: get rid of ser/deser, etc •  Attempt 2: create NxM queues (N=mappers, M=reducers) push corresponding records into queues 0 2 4 6 8 10 12 14 16 18 Vanilla Spark Attempt 1 Attempt 2 runtime (seconds) 3x  worse  
  • 26. 0 2 4 6 8 10 12 14 16 18 Vanilla Spark Attempt 1 Attempt 2 runtime (seconds) Processing (s) GC (s) Attempt 2: get rid of ser/deser, etc •  Attempt 2: create NxM queues (N=mappers, M=reducers) push corresponding records into queues
  • 27. •  Attempt 3: instead of queue, copy records to memory pages Number of objects: ~records à ~pages NxM pages, or alternatively, one page per reducer Attempt 3: avoiding GC record1   recor..   ..d2   record3   …   …   …  
  • 28. •  Unsafe Row (a buffer-backed row format): •  row.pointTo(buffer) Spark SQL record1   recor..   ..d2   record3   …   …   …   Instantaneous creation of unsafe rows by pointing to different offsets In the page
  • 29. Attempt 3: avoiding GC •  Attempt 3: copy records onto large memory pages 0 2 4 6 8 10 12 14 16 18 Vanilla Spark Attempt 1 Attempt 2 Attempt 3 runtime (seconds) Improvement  from  avoid  ser/deser,  copy,   I/O  conten4on,  expensive  code  path    
  • 30. Consistent improvement with varying size spark.range(N).repar44on().selectExpr("sum(id)").collect()   Use  N  from  2^  20  to  2^27     0 50 100 150 200 250 300 Disk-based In-memory Shuffle nanoseconds/row 2^20 2^21 2^22 2^23 2^24 2^25 2^26 2^27
  • 31. TPC-DS performance (single node) 0 10 20 30 40 50 60 QueryRuntime(s) In-memory Shuffle Vanilla-Spark 27/33 queries improve with a median of 31%
  • 32. Extending to multiple nodes •  Implementation •  All data goes to memory •  For remote transfer, copy from memory to network buffer •  A more memory-preserving way… •  Local transfer goes to memory •  Remote transfer goes to disk •  Cons1: have to enforce stricter locality on reducers •  Cons2: cannot avoid I/O contentions
  • 33. spark.range(N).repar44on().selectExpr("sum(id)").collect()   Simple shuffle job 0 100 200 300 400 500 600 1 2 3 4 1 2 3 4 ns/row Reduce Stage Map Stage Vanilla-Spark In-memory Shuffle Map: Consistent improvement Reduce: Improvement decreases with more nodes
  • 34. TPC-DS performance (x1.xlarge32) 0 20 40 60 80 q13 q20 q18 q11 q3 QueryRuntime(s) In-memory Shuffle Vanilla-Spark •  SF=100 •  Pick top 5 queries from single node experiment •  Best of 10 runs Many other performance bottlenecks need investigation!
  • 35. Summary •  Spark on many-core requires many architectural changes •  In-memory shuffle •  How to improve shuffle performance with memory •  31% improvement over Spark •  On-going research •  Identify other performance bottlenecks