Tiark Rompf
Purdue University
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire!
Scala: Martin Odersky + the Scala team
Rompf, Iulian Dragos, Adriaan Moors, Gilles Dubochet, Philipp Haller, Lukas Rytz, Ingo Maier, Antonio Cunei, Donna Malayeri, Miguel Garcia, Hubert Plociniczak, Aleksandar Prokopec
past: Geoffrey Washburn, Stéphane Micheloud, Lex Spoon, Sean McDirmid, Burak Emir, Nikolay Mihaylov, Philippe Altherr, Vincent Cremet, Michel Schinz, Erik Stenman, Matthias Zenger
external/visiting contributors: Paul Phillips, Miles Sabin, Stepan Koltsov and others
Spark SQL
[Architecture diagram: user programs (Java, Scala, Python, R) and SQL (JDBC, console) enter through the DataFrame API; the Catalyst optimizer plans queries and code generation compiles them onto Spark's Resilient Distributed Datasets.]
How Fast Is Spark?
Demo
Spark Architecture
Flare: a New Back-end for Spark
[Architecture diagram, four configurations:
(a) Spark SQL: user programs (Java, Scala, Python, R) and SQL (JDBC, console) → DataFrame API → Catalyst optimizer → code generation → Spark's Resilient Distributed Datasets.
(b) Flare Level 1: the query plan is exported from Catalyst and compiled by Flare's code generation.
(c) Flare Level 2: Flare's code generation targets Flare's native runtime, invoked via JNI.
(d) Flare Level 3: a front-end over Delite's back-end (OptiQL, OptiML, OptiGraph DSLs, the DMLL intermediate language, LMS code generation), producing optimized Scala and C that runs as native code on Delite's runtime.]
Results
Single-Core Running Time: TPC-H
Absolute running time in milliseconds (ms) for PostgreSQL, Spark, HyPer, and Flare at SF10.
[Log-scale chart: running time (ms), 1 to 1×10^6, across Q1–Q22; series: PostgreSQL, Spark, HyPer, Flare.]
Apache Parquet Format
[Log-scale chart: speedup (1–1000×) per query (Q1–Q21); series: Spark CSV, Spark Parquet, Flare CSV, Flare Parquet.]

Running time (ms) per query:

              Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9     Q10    Q11
Spark CSV     16762  12244  21730  19836  19316  12278  24484  17726  30050  29533  5224
Spark Parquet 3728   13520  9099   6083   8706   535    13555  5512   19413  21822  3926
Flare CSV     641    168    757    698    758    568    788    875    1417   854    128
Flare Parquet 187    17     125    127    151    99     183    160    698    309    9

              Q12    Q13    Q14    Q15    Q16    Q17    Q18    Q19    Q20    Q21    Q22
Spark CSV     21688  8554   12962  26721  12941  24690  27012  12409  19369  57330  7050
Spark Parquet 5570   7034   719    4506   21834  5176   6757   2681   8562   25089  5295
Flare CSV     701    388    573    551    150    1426   1229   605    792    1868   178
Flare Parquet 133    246    86     88     66     264    181    178    165    324    22
What about parallelism?
Parallel Scaling Experiment
Scaling up Flare and Spark SQL at SF20
[Speedup (up to 20×) vs. # cores (1–32) for Flare and Spark: Q6 (aggregate), Q13 (outer join), Q14 (join), Q22 (semi/anti join); plus running-time panels (ms) vs. # cores for each of these queries, comparing Spark SQL and Flare Level 2.]
Hardware: Single NUMA machine with 4 sockets, 12 Xeon E5-4657L cores per socket, and
256GB RAM per socket (1 TB total).
NUMA Optimization
[Diagram: cores 0–15 accessing columnar data in memory.]
[Running time (ms) vs. # cores (1, 18, 36, 72) for Q1 and Q6 with threads pinned to one, two, or four sockets; bars annotated with speedups up to 46× (Q1) and 58× (Q6).]
Scaling up Flare at SF100 with NUMA optimizations under different configurations: threads pinned to one, two, and four sockets.
• Q6 performs better when threads are spread across multiple sockets.
• Q1 is computation-bound, so NUMA placement has little effect.
• Scaling Q1 and Q6 to all 72 cores (the full capacity of the machine) yields maximum speedups of 46× and 58×, respectively.
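Socket pinning of this kind can be reproduced with standard Linux tooling; a sketch using `numactl` (the binary name `./flare_query` is a hypothetical stand-in for a compiled query):

```shell
# Pin execution and memory allocation to socket 0 only
numactl --cpunodebind=0 --membind=0 ./flare_query

# Spread memory pages round-robin across all sockets,
# which can help queries that benefit from multiple sockets (e.g. Q6)
numactl --interleave=all ./flare_query

# Inspect the NUMA topology of the machine
numactl --hardware
```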
Heterogeneous Workloads:
UDFs and ML Kernels
Example: k-Means Clustering
untilconverged(mu, tol) { mu =>
  // calculate distances to current centroids
  // and assign each point to the nearest one
  val c = (0::m) { i =>
    val allDistances = mu mapRows { centroid =>
      dist(x(i), centroid)
    }
    allDistances.minIndex
  }
  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k, *) { i =>
    val (weightedpoints, points) = sum(0, m) { j =>
      if (c(j) == i) (x(j), 1)
    }
    if (points == 0) Vector.zeros(n)
    else weightedpoints / points
  }
  newMu
}
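The OptiML-style sketch above is Lloyd's algorithm for k-means. As a cross-check of the logic, here is a minimal, self-contained Python version of the same two phases (assignment, then centroid update); a plain illustration, not Flare's generated code:

```python
# Minimal k-means (Lloyd's algorithm): the same two phases as the
# OptiML sketch above. Plain Python, no external dependencies.

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, mu, iters=10):
    k, n = len(mu), len(points[0])
    for _ in range(iters):
        # Phase 1: assign each point to its nearest centroid.
        c = [min(range(k), key=lambda j: dist2(p, mu[j])) for p in points]
        # Phase 2: move each centroid to the mean of its assigned points.
        new_mu = []
        for i in range(k):
            members = [p for p, ci in zip(points, c) if ci == i]
            if not members:
                new_mu.append([0.0] * n)  # empty cluster -> zero vector
            else:
                new_mu.append([sum(col) / len(members) for col in zip(*members)])
        if new_mu == mu:  # converged
            break
        mu = new_mu
    return mu

# Two well-separated 1-D clusters around 0 and 10:
centroids = kmeans([[0.0], [1.0], [10.0], [11.0]], [[0.0], [10.0]])
# centroids -> [[0.5], [10.5]]
```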
Flare Level 3: ML Performance
[Speedup (up to 50×) vs. # cores (1, 12, 24, 48) for four ML kernels: GDA, Gene, k-means, LogReg; series: Spark, C++ (no pinning), C++ (pinned), C++ (NUMA-aware).]
Level 3: machine learning kernels scaling on shared-memory NUMA with thread pinning and data partitioning.
[Speedup over Spark (up to 8×) for LogReg (3.4 GB and 17 GB inputs) and k-means (1.7 GB and 17 GB inputs); series: Spark, Delite-CPU, Delite-GPU; plus a GPU-cluster panel for k-means and LogReg.]
Level 3: machine learning kernels run on a 20-node Amazon cluster (left, center) and on a 4-node GPU cluster connected within a single rack.
TensorFlow -> TensorFlare
Relational + ML
/* TensorFlow inference as UDF */
val q = spark.sql("""
  select ... from data
  where class = findNearestCluster(...)
  group by class""")
flare(q).show
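The `findNearestCluster(...)` UDF above amounts to a nearest-centroid lookup. A minimal plain-Python sketch of such a function (the centroid values are illustrative stand-ins for a trained model, and in Spark it would need to be registered for SQL use):

```python
# Nearest-centroid lookup: the core of a findNearestCluster UDF.
# The centroids below are hypothetical, standing in for a trained model.
CENTROIDS = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (10.0, 0.0)}

def find_nearest_cluster(x, y):
    """Return the id of the centroid closest to point (x, y)."""
    def dist2(c):
        cx, cy = CENTROIDS[c]
        return (x - cx) ** 2 + (y - cy) ** 2
    return min(CENTROIDS, key=dist2)

# In (Py)Spark this would be registered for use from SQL, e.g.:
#   spark.udf.register("findNearestCluster", find_nearest_cluster)
print(find_nearest_cluster(4.0, 4.5))  # -> 1
```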
flaredata.github.io
Grégory Essertel, Ruby Tahboub, James Decker
FLARE TEAM
Thank You.
Web: flaredata.github.io
Twitter: @flaredata