Boosting Spark Performance
on Many-Core Machines
Qifan Pu
Sameer Agarwal (Databricks)
Reynold Xin (Databricks)
Ion Stoica
UC Berkeley
Me
Ph.D. student in the AMPLab, advised by Prof. Ion Stoica
- Spark-related research projects
- e.g., how to run Spark in a geo-distributed fashion
- Big data storage (e.g., Alluxio)
- e.g., how to do memory management for multiple users
Interned at Databricks this past summer
- Spark SQL team: aggregates, shuffle
This project
Spark performance on many-core machines
- Ongoing research; feedback is welcome
- Focus of this talk:
•  understand shuffle performance
•  investigate and implement in-memory shuffle
Moving beyond research
- The hope is to get this into Spark (but no guarantee yet)
Why do we care about many-core machines?
rdd.groupBy(…)
[Diagram: a groupBy operation shuffling data across the nodes of a cluster]
Spark started as a cluster computing framework
•  One of Spark's big successes has been high scalability
•  Largest known cluster is 8,000 nodes
•  Winner of the 2014 Daytona GraySort benchmark
•  Typical cluster sizes in practice:
“Our observation is that companies typically experiment with
cluster size of under 100 nodes and expand to 200 or more
nodes in the production stages.”
-- Susheel Kaushik (Senior Director at Pivotal)
Spark on cluster
Increasingly powerful single-node machines
•  More cores packed on a single chip
•  Intel Xeon Phi: 64–72 cores
•  Berkeley FireBox project: ~100 cores
•  Larger instances in the cloud
•  Various 32-core instances on EC2, Azure & GCE
•  EC2 X1 instance with 128 vCPUs and 2 TB of memory (May 2016)
Instance     Memory (GB)  vCPUs  Hourly Cost ($)  Cost/100GB ($)  Cost/8vCPU ($)
x1.32xlarge  1952         128    13.338           0.68             0.83
g2.2xlarge   15           8      0.65             4.33             0.65
i2.2xlarge   61           8      1.705            2.80             1.70
m3.2xlarge   30           8      0.532            1.78             0.53
c3.2xlarge   15           8      0.42             2.80             0.42
Cost of many-core nodes
One x1.32xlarge instance is effectively a small cluster
(with more memory and a fast "inter-core network")
Spark's design was based on many nodes
•  Data communication (a.k.a. shuffle)
•  Stores intermediate data on disk
•  Serialization/deserialization is needed across nodes
•  Now: plenty of memory to spare, intra-node shuffle
•  Resource management
•  Designed to handle a moderate amount of resources on each node
•  Now: one executor for 100 cores + 2 TB of memory?
Focus of this talk
Ongoing work
Can we improve shuffle on single, multi-core machines?
1. Memory is fast.
2. We can use memory for shuffle.
3. Therefore, shuffle will be fast.
Will this “common sense” work?
Put in practice…
•  Spark: writes to a file stream and saves all bytes to disk
•  Attempt 1: write to the same kind of stream, but keep all bytes in memory (bytes on heap)
spark.range(16M).repartition(1).selectExpr("sum(id)").collect()
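A minimal sketch of the idea behind Attempt 1, assuming the shuffle writer's output stream can be swapped out; the class and method names here are illustrative, not Spark's actual internals:

  import java.io.{ByteArrayOutputStream, OutputStream}

  // Hypothetical sketch of Attempt 1: back the shuffle writer's stream
  // with heap memory instead of a file on disk.
  class InMemoryShuffleSink {
    // Grows on the heap; stands in for the FileOutputStream Spark would flush to.
    private val buffer = new ByteArrayOutputStream(1 << 20)
    def stream: OutputStream = buffer           // the shuffle writer flushes here
    def bytes: Array[Byte] = buffer.toByteArray // reducers read the raw bytes back
  }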
[Bar chart: runtime (seconds), scale 0–5 — Vanilla Spark and Attempt 1 run in essentially the same time]
Why zero improvement?
[Timeline: flushBuffer() accounts for only 4.81% of the entire duration of the job (100%)]
Why zero improvement?
1. I/O throughput is not the bottleneck (in this job): flushing buffers is under 5% of the total time.
2. Buffer cache: the OS page cache keeps the shuffle files in memory, so the disk-based shuffle is already exploiting memory.
Understanding Shuffle Performance
[Diagram, mapper side: a generated iterator (OP1 → OP2 → OP3) fills an input queue, which is flushed to file streams]
Understanding Shuffle Performance
[Diagram, reducer side: a generated iterator (OP1 → OP2 → OP3) produces results or feeds another shuffle]
More complications
•  Sort- vs. hash-based shuffle
•  Spilling when memory runs out
•  Clearing data after the shuffle
…
Case 1
[Diagram: thread 1 runs the generated iterator (OP1 → OP2 → OP3) and fills the input queue; thread 2 flushes the queue to file streams]
Case 1: thread 1 is slower than thread 2
Case 2
[Diagram: the same pipeline — generated iterator (OP1 → OP2 → OP3), input queue, flushes to file streams]
Case 2: writing/reading the file streams is slow
Case study 2
[Diagram, reducer side: the generated iterator (OP1 → OP2 → OP3) produces results or feeds another shuffle]
Case 3
[Diagram: the same pipeline, with flushes to file streams contending for disk I/O]
Case 3: I/O contentions
Our first attempt should work in this case!
Can we improve case 2?
[Diagram: the same pipeline — generated iterator (OP1 → OP2 → OP3), input queue, flushes to file streams]
The previous example is an instance of case 2.
Attempt 2: get rid of ser/deser, etc.
•  Create N×M queues (N = #mappers, M = #reducers); each mapper pushes its records into the queue of the corresponding reducer
•  No serialization
•  No copying (the data structure is shared by both sides)
[Diagram: N mappers connected to M reducers through per-pair queues; see the sketch below]
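A minimal sketch of this design, assuming ordinary concurrent queues; the names are hypothetical and illustrate the experiment, not Spark's shuffle API:

  import java.util.concurrent.ConcurrentLinkedQueue

  // Hypothetical sketch of Attempt 2: one shared queue per (mapper, reducer)
  // pair, so records cross sides with no serialization and no copying.
  class QueueShuffle[T](numMappers: Int, numReducers: Int) {
    private val queues =
      Array.fill(numMappers, numReducers)(new ConcurrentLinkedQueue[T]())

    // Mapper side: push a record into the queue of its target reducer.
    def write(mapperId: Int, reducerId: Int, record: T): Unit =
      queues(mapperId)(reducerId).add(record)

    // Reducer side: drain this reducer's queue from every mapper
    // (assumes the mappers have already finished).
    def read(reducerId: Int): Iterator[T] =
      (0 until numMappers).iterator.flatMap { m =>
        Iterator.continually(queues(m)(reducerId).poll()).takeWhile(_ != null)
      }
  }

Note that every record now lives as a long-lived object on the JVM heap, which is exactly what the GC numbers on the next slide punish.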
Attempt 2: get rid of ser/deser, etc.
•  N×M queues, mappers pushing records into the corresponding queues
[Bar chart: runtime (seconds), scale 0–18 — Attempt 2 is 3× worse than Vanilla Spark and Attempt 1]
[Bar chart: the same runtimes split into Processing (s) and GC (s) — Attempt 2's extra time is dominated by GC]
Attempt 3: avoiding GC
•  Instead of queues, copy records into large memory pages
•  Number of objects: ~#records → ~#pages
•  N×M pages, or alternatively, one page per reducer (sketched below)
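A minimal sketch of the page idea, with hypothetical names (Spark manages real pages through its memory manager):

  // Hypothetical sketch of Attempt 3: copy each record's bytes into a large
  // page, so the GC tracks on the order of #pages objects instead of #records.
  final class Page(sizeBytes: Int) {
    val data = new Array[Byte](sizeBytes)
    private var cursor = 0
    // Append a record; returns its offset in the page, or -1 if it doesn't fit.
    def append(record: Array[Byte]): Int = {
      if (cursor + record.length > data.length) return -1
      val offset = cursor
      System.arraycopy(record, 0, data, offset, record.length)
      cursor += record.length
      offset
    }
  }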
[Diagram: records (record1, record2, record3, …) packed back-to-back into pages; record2 straddles a page boundary]
Spark SQL
•  UnsafeRow (a buffer-backed row format)
•  row.pointTo(buffer)
[Diagram: the same page of packed records]
Unsafe rows are created instantaneously by pointing them at different offsets in the page (see the sketch below).
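A rough sketch of reading rows back out of a page this way, assuming records were written in UnsafeRow's binary format and we tracked each record's (offset, length); exact UnsafeRow signatures vary across Spark versions:

  import org.apache.spark.sql.catalyst.expressions.UnsafeRow
  import org.apache.spark.unsafe.Platform

  // Sketch: iterate the records in a page with zero deserialization.
  // numFields comes from the schema; `offsets` is (offset, sizeInBytes) per record.
  def rowsInPage(page: Array[Byte], offsets: Seq[(Int, Int)],
                 numFields: Int): Iterator[UnsafeRow] = {
    val row = new UnsafeRow(numFields) // one mutable row, reused for every record
    offsets.iterator.map { case (offset, size) =>
      // pointTo turns `row` into a view over the page bytes: no copy, no deser.
      row.pointTo(page, Platform.BYTE_ARRAY_OFFSET + offset, size)
      row
    }
  }

Reusing a single mutable row is the usual Spark SQL pattern; a caller must copy() a row if it needs to hold onto it across iterations.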
Attempt 3: avoiding GC
•  Attempt 3: copy records onto large memory pages
[Bar chart: runtime (seconds), scale 0–18 — Attempt 3 beats Vanilla Spark, Attempt 1, and Attempt 2]
The improvement comes from avoiding ser/deser, copying, I/O contention, and an expensive code path.
Consistent improvement with varying size
spark.range(N).repartition().selectExpr("sum(id)").collect()
with N ranging from 2^20 to 2^27
[Bar chart: nanoseconds/row for disk-based vs. in-memory shuffle at N = 2^20 through 2^27 — in-memory shuffle is consistently faster]
TPC-DS performance (single node)
[Bar chart: query runtime (s) across TPC-DS queries, In-memory Shuffle vs. Vanilla Spark]
27 of 33 queries improve, with a median improvement of 31%
Extending to multiple nodes
•  Implementation
•  All data goes to memory
•  For remote transfers, copy from memory to a network buffer
•  A more memory-preserving way (sketched below):
•  Local transfers go to memory
•  Remote transfers go to disk
•  Con 1: have to enforce stricter locality on reducers
•  Con 2: cannot avoid I/O contention
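A toy sketch of that routing decision (all names hypothetical; the real logic would live inside Spark's block and shuffle managers):

  import java.io.{File, FileOutputStream}
  import scala.collection.mutable

  // Hypothetical sketch: route each shuffle block by destination locality.
  class HybridShuffleWriter(localHost: String, diskDir: File) {
    val memStore = mutable.Map.empty[String, Array[Byte]]

    def writeBlock(blockId: String, block: Array[Byte], destHost: String): Unit =
      if (destHost == localHost) {
        memStore(blockId) = block // local transfer: stays in memory
      } else {
        val out = new FileOutputStream(new File(diskDir, blockId))
        try out.write(block) finally out.close() // remote transfer: goes to disk
      }
  }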
Simple shuffle job
spark.range(N).repartition().selectExpr("sum(id)").collect()
[Bar chart: ns/row for the Map Stage and Reduce Stage on 1–4 nodes, Vanilla Spark vs. In-memory Shuffle]
Map: consistent improvement
Reduce: improvement decreases with more nodes
TPC-DS performance (x1.32xlarge)
[Bar chart: query runtime (s) for q13, q20, q18, q11, and q3 — In-memory Shuffle vs. Vanilla Spark]
•  SF = 100
•  Top 5 queries picked from the single-node experiment
•  Best of 10 runs
Many other performance bottlenecks need investigation!
Summary
•  Spark on many-core machines requires many architectural changes
•  In-memory shuffle
•  How to improve shuffle performance with memory
•  31% median improvement over vanilla Spark
•  Ongoing research
•  Identifying other performance bottlenecks
Thank you
Qifan Pu
qifan@cs.berkeley.edu
