Kazuaki Ishizaki
IBM Research – Tokyo
Analyzing and Optimizing
Java Code Generation
for Apache Spark Query Plan
What is Apache Spark?
▪ Framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations
▪ e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures
▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset
– SQL-based data types are supported
– Scala is the primary language for programming on Spark
[Architecture diagram: a Driver running the user program (val ds = ...; val ds1 = ...) sends tasks to Executors and receives results. The Spark Runtime (written in Java and Scala) provides Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine learning) on top of the Java Virtual Machine. Executors read Data from a Data Source (HDFS, DB, File, etc.).]
Open source: http://spark.apache.org/
The latest version is 2.4, released in November 2018.
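For concreteness, a minimal sketch of this programming model (assuming a SparkSession named spark is in scope and its implicits are imported; the data is illustrative):

import spark.implicits._

val ds  = Seq(1, -2, 3, -4).toDS()   // distributed immutable Dataset[Int]
val ds1 = ds.filter(x => x > 0)      // parallel transformation
            .map(x => x * 2)         // another parallel transformation
val sum = ds1.reduce(_ + _)          // parallel action; yields 8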
This talk focuses on
executor behavior
How Is Code Generated on Each Executor?
▪ A program written in the embedded DSL is translated to Java code through
analysis and optimizations in Spark
val ds: Dataset[Array[Int]] = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0).map(a => a)
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
ArrayData a = row.getArray(0);
…
}
[Pipeline diagram: a Spark Program (DataFrame/Dataset) passes through the SQL Analyzer, the Rule-based Optimizer, and the Code Generator, producing the generated Java code that runs on the Java virtual machine. The cached data is stored column-wise: Column 0 holds Row 0 = (0, 2) and Row 1 = (1, 3).]
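To inspect what this pipeline emits, Spark exposes a debug hook; a minimal sketch (assuming Spark 2.x and the Dataset ds1 from above):

import org.apache.spark.sql.execution.debug._

ds1.explain(true)    // prints the analyzed, optimized, and physical plans
ds1.debugCodegen()   // prints the Java code generated for each whole stage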
Motivating Example
▪ A simple Spark program that performs filter and map operations
val ds = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
[Columnar cache: Column 0 holds Row 0 = (0, 2) and Row 1 = (1, 3)]
Motivating Example
▪ Complicated code is generated from a simple Spark program
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
Iterator inputIterator = ...;
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
while (inputIterator.hasNext()) {
Row inputRow = (Row) inputIterator.next();
ArrayData a = inputRow.getArray(0);
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj0[i] = new Integer(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
Object[] obj1 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj1[i] = new Integer(a.getInt(i));
ArrayData array_map = new GenericArrayData(obj1);
int[] input_map = array_map.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
ArrayData value = new GenericArrayData(mvalue);
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Note: The actually generated code is even more complicated
Performance Issues in Generated Code
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
final class GeneratedIterator {
Iterator inputIterator = ...;
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
while (inputIterator.hasNext()) {
Row inputRow = (Row) inputIterator.next();
ArrayData a = inputRow.getArray(0);
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj0[i] = Integer.valueOf(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
Object[] obj1 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj1[i] = Integer.valueOf(a.getInt(i));
ArrayData array_map = new GenericArrayData(obj1);
int[] input_map = array_map.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
ArrayData value = new GenericArrayData(mvalue);
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Annotations on the code above: P1 at the row-based iterator (copy from columnar to row-oriented storage); P3 (boxing) at each loop that fills an Object[]; P2 at each GenericArrayData allocation; P3 (unboxing) at each toIntArray() call.
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
Our Contributions
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs
▪ Merged these optimizations into Spark 2.3 and later versions
These optimizations reduce the path length of handling data.
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Basic Compilation Strategy of an Operator
▪ Use Volcano style [Graefe93]
– Connect operators using an iterator, which makes it easy to add new operators (a sketch of this contract follows the code below)
val ds = Seq((Array(0, 2), 10, "Tokyo"))
val ds1 = ds.filter(a => a.int_ > 0)
.map(a => a)
[Diagram: each Operator pulls rows of schema (Array, Int, String), e.g. ((0, 2), 10, "Tokyo"), from its child through a row-based iterator.]
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
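A minimal sketch of the Volcano-style contract behind these loops (hypothetical Operator and Row interfaces; Spark's actual classes differ):

// Each operator pulls rows from its child one at a time.
interface Operator {
  void open();   // initialize this operator and its child
  Row next();    // return the next row, or null when exhausted
  void close();  // release resources
}

// filter as a Volcano operator: keeps pulling until a row qualifies
class FilterOperator implements Operator {
  private final Operator child;
  FilterOperator(Operator child) { this.child = child; }
  public void open() { child.open(); }
  public Row next() {
    Row row;
    while ((row = child.next()) != null) {
      if (row.getInteger(1) > 0) return row;  // the filter predicate
    }
    return null;
  }
  public void close() { child.close(); }
}

Adding a new operator only requires implementing this interface, which is exactly the extensibility the Volcano style buys, at the cost of one iterator call per row per operator.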
Overview of Generated Code in Spark
▪ Put multiple operators into one loop [Neumann11] when possible
– Avoids the overhead of iterators
– Encourages compiler optimizations within a loop
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
// map(...)
...
}
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
Volcano style (the two separate loops above) → Whole-stage code generation (the single fused loop)
Columnar Storage to Generated Code
▪ While the source uses columnar storage, the generated code requires
data in row-based storage
Columnar Storage → Row-based iterator → Operators (two operations in a loop)
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
...
...
}
val ds = Seq((Array(0, 2), 10, "Tokyo"),
(Array(1, 3), 20, "Mumbay")).toDS.cache
[Columnar storage: columns Array = {(0, 2), (1, 3)}, Int = {10, 20}, String = {"Tokyo", "Mumbay"}; the iterator materializes rows such as ((0, 2), 10, "Tokyo").]
Problem: Data Copy From Columnar Storage
▪ A copy from a set of columns to a row occurs whenever the iterator is used
Columnar Storage → (data copy) → Row-based iterator → Operators
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
...
...
}
val ds = Seq((Array(0, 2), 10, "Tokyo"),
(Array(1, 3), 20, "Mumbay")).toDS.cache
[Columnar storage as above; each iterator call copies one row, such as ((0, 2), 10, "Tokyo"), out of the columns.]
Solution: Generate Optimized Code
▪ If the new analysis identifies that the source is columnar storage,
– Use a counter-based loop without a row-based iterator
▪ The loop index identifies the row position in the columnar storage
– Get data from the columnar storage directly
Columnar Storage → Operators (no row-based iterator)
Column column1 = df1.getColumn(1);
int sum = 0;
for (int i = 0; i < column1.numRows(); i++) {
int x = column1.getInteger(i);
...
...
}
[Columnar storage accessed directly: columns Array = {(0, 2), (1, 3)}, Int = {10, 20}, String = {"Tokyo", "Mumbay"}]
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Overview of Old Internal Array Representation
▪ Use an Object array whose elements can hold the SQL NULL value in
addition to a primitive value (e.g. 0 or 2); a sketch of this representation follows the diagram below
☺ Easy to represent NULL
☹ Uses a boxed object (e.g. an Integer object) to hold a primitive value (e.g. an int)
val ds = Seq(Array(0, 2, NULL))
[Diagram: Object array with len = 3; [0] → Integer 0, [1] → Integer 2, [2] → NULL]
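A minimal sketch of this Object[]-based representation (hypothetical class name; Spark's actual GenericArrayData differs in detail):

// NULL is represented simply as a null entry in the Object array.
class ObjectArrayData {
  private final Object[] array;
  ObjectArrayData(Object[] array) { this.array = array; }
  int numElements() { return array.length; }
  boolean isNullAt(int i) { return array[i] == null; }  // trivial NULL check
}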
Problem: Boxing and Unboxing Occur
▪ Boxing (i.e. creating an object) is caused by a setter method (e.g. setInt(1, 2))
▪ Unboxing is caused by a getter method (e.g. getInt(0))
▪ Memory footprint increases because each element holds a pointer to an object
int getInt(int i) {
  return ((Integer) array[i]).intValue();
}
void setInt(int i, int v) {
  array[i] = Integer.valueOf(v);
}
Unboxing
from Integer object to int value
Boxing
from int value to Integer object
Solution: Use Primitive Type When Possible
▪ Keep values in a primitive array when possible, based on analysis
▪ Keep NULL flags in a separate bit field
☺ Avoids boxing and unboxing
☺ Reduces memory footprint
[Diagram: len = 3; bit field = {NonNull, NonNull, Null}; int array = {0, 2, 0}]
int getInt(int i) {
return array[i];
}
void setInt(int i, int v) {
array[i] = v;
}
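A minimal sketch of this primitive-backed representation with a separate null bit field (the name IntArrayData matches the one used in the optimized generated code later in this deck, but the body here is an illustrative assumption):

import java.util.BitSet;

// Values live in a primitive int[]; NULLs are tracked in a separate bit set,
// so getInt/setInt touch no boxed objects at all.
class IntArrayData {
  private final int[] values;
  private final BitSet nulls;  // bit i set => element i is NULL

  IntArrayData(int length) {
    values = new int[length];
    nulls = new BitSet(length);
  }
  boolean isNullAt(int i) { return nulls.get(i); }
  void setNullAt(int i) { nulls.set(i); }
  int getInt(int i) { return values[i]; }                       // no unboxing
  void setInt(int i, int v) { values[i] = v; nulls.clear(i); }  // no boxing
}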
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Problem: Boxing Occurs
▪ When the data representation is converted from the Spark internal format to
Java objects, boxing occurs in the generated code
☺ Easy to handle the NULL value
ArrayData a = …;
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj0[i] = new Integer(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
Boxing
Solution: Use Primitive Type When Possible
▪ When the analysis identifies that the array is a primitive-type array without
NULL, generate code using a primitive array.
ArrayData a = …;
int[] array = a.toIntArray();
Note: data-representation optimization improves efficiency of toIntArray()
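A sketch of how the two optimizations interact inside toIntArray() (hypothetical method bodies for the two alternative backing representations):

// Object[]-backed representation: one unboxing per element
int[] toIntArray() {
  int[] result = new int[array.length];
  for (int i = 0; i < array.length; i++)
    result[i] = ((Integer) array[i]).intValue();  // unboxing
  return result;
}

// Primitive int[]-backed representation: a single bulk copy, no per-element work
int[] toIntArray() {
  return java.util.Arrays.copyOf(values, values.length);
}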
Generated Code without Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
Iterator inputIterator = ...;
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
while (inputIterator.hasNext()) {
Row inputRow = (Row) inputIterator.next(); // P1
ArrayData a = inputRow.getArray(0);
Object[] obj0 = new Object[a.length]; // P3
for (int i = 0; i < a.length; i++)
obj0[i] = new Integer(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0); // P2
int[] input_filter = array_filter.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
Object[] obj1 = new Object[a.length]; // P3
for (int i = 0; i < a.length; i++)
obj1[i] = new Integer(a.getInt(i));
ArrayData array_map = new GenericArrayData(obj1); // P2
int[] input_map = array_map.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
ArrayData value = new GenericArrayData(mvalue); // P2
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Generated Code with Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
Column column0 = ...getColumn(0);
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
for (int i = 0; i < column0.numRows(); i++) {
// eliminated data copy (P1)
ArrayData a = column0.getArray(i);
// eliminated data conversion (P3)
int[] input_filter = a.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
// eliminated data conversion (P3)
int[] input_map = a.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
// use efficient data representation (P2)
ArrayData value = new IntArrayData(mvalue);
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Performance Evaluation Methodology
▪ Measured the performance improvement of two types of applications with
our optimizations
– Database: TPC-H
– Machine learning: Logistic regression and k-means
▪ Experimental environment
– Five machines, each with a 16-core Intel Xeon E5-2683 v4 CPU (2.1 GHz) and
128 GB of RAM
▪ One for the driver and four for executors
– Spark 2.2
– OpenJDK 1.8.0_181 with a 96 GB heap using the default garbage-collection policy
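For reference, one plausible way to launch a job on such a cluster (the flags and values shown are assumptions inferred from the stated configuration, not taken from the paper):

spark-submit \
  --master <cluster manager URL> \
  --executor-memory 96g \
  --conf spark.executor.instances=4 \
  --class <benchmark main class> benchmark.jar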
Performance Improvements of TPC-H queries
▪ Achieved up to a 1.41x performance improvement
– 1.10x on the geometric mean
▪ Accomplished by the data-copy optimization alone
– No arrays are used in TPC-H
[Bar chart: performance improvement over no optimization (higher is better) for TPC-H queries Q1-Q22 with the data-copy optimization; y-axis from 1.0 to 1.5; scale factor = 10]
Performance Improvements of ML applications
▪ Achieved up to a 1.42x performance improvement for logistic regression
– 1.21x on the geometric mean
▪ Accomplished mainly by the optimizations for array representation
– The columnar-storage (data-copy) optimization contributed only slightly
[Bar chart: performance improvement over no optimization (higher is better) for k-means (5M data points with 200 dimensions) and logistic regression (32M data points with 200 dimensions), comparing the data-representation and data-conversion optimizations against all optimizations; y-axis from 1.0 to 1.5]
Cycle Breakdown for Logistic Regression
▪ Data conversion took 28% of the cycles without the data-representation and
data-conversion optimizations
Percentage of consumed cycles:
                    w/o optimizations   w/ optimizations
  Data conversion        28.1%                 0%
  Computation            60.1%                87.6%
  Others                 11.8%                12.4%
Conclusion
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs: logistic regression and k-means
▪ Merged these optimizations into Spark 2.3 and later versions
Acknowledgments
▪ Thanks to the Apache Spark community for suggestions on merging our
optimizations into Apache Spark, especially
– Wenchen Fan, Herman van Hovell, Liang-Chi Hsieh, Takuya Ueshin, Sameer
Agarwal, Andrew Or, Davies Liu, Nong Li, and Reynold Xin

More Related Content

What's hot

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
Databricks
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
Spark Summit
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
Databricks
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
Roger Rafanell Mas
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Spark Summit
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly
 
Data stax academy
Data stax academyData stax academy
Data stax academy
Duyhai Doan
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the Cloud
Daniel Zivkovic
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
Holden Karau
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Lillian Pierson
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Databricks
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
Spark Summit
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
StampedeCon
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
Wayne Chen
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly
 

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the Cloud
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 

Similar to icpe2019_ishizaki_public

Demystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki IshizakiDemystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki Ishizaki
Databricks
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
Gerger
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
WalmirCouto3
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
Leonardo Gamas
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
Albert Bifet
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
scalaconfjp
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Roger Huang
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 

Similar to icpe2019_ishizaki_public (20)

Demystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki IshizakiDemystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki Ishizaki
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
 
Scala 20140715
Scala 20140715Scala 20140715
Scala 20140715
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 

More from Kazuaki Ishizaki

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf
Kazuaki Ishizaki
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf
Kazuaki Ishizaki
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0
Kazuaki Ishizaki
 
SparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizakiSparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizaki
Kazuaki Ishizaki
 
hscj2019_ishizaki_public
hscj2019_ishizaki_publichscj2019_ishizaki_public
hscj2019_ishizaki_public
Kazuaki Ishizaki
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public
Kazuaki Ishizaki
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public
Kazuaki Ishizaki
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for Java
Kazuaki Ishizaki
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to Use
Kazuaki Ishizaki
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki public
Kazuaki Ishizaki
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
Kazuaki Ishizaki
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
Kazuaki Ishizaki
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
Kazuaki Ishizaki
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public
Kazuaki Ishizaki
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_publicKazuaki Ishizaki
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-Timeコンパイラ
Kazuaki Ishizaki
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
Kazuaki Ishizaki
 

More from Kazuaki Ishizaki (17)

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0
 
SparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizakiSparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizaki
 
hscj2019_ishizaki_public
hscj2019_ishizaki_publichscj2019_ishizaki_public
hscj2019_ishizaki_public
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for Java
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to Use
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki public
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-Timeコンパイラ
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
 

Recently uploaded

Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 

Recently uploaded (20)

Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 

icpe2019_ishizaki_public

  • 1. Kazuaki Ishizaki IBM Research – Tokyo Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan 1
  • 2. What is Apache Spark? ▪ Framework that processes distributed computing by transforming distributed immutable memory structure using set of parallel operations ▪ e.g. map(), filter(), reduce(), … – Distributed immutable in-memory structures ▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset – SQL-based data types are supported – Scala is primary language for programming on Spark Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki Spark Runtime (written in Java and Scala) Spark Streaming (real-time) GraphX (graph) SparkSQL (SQL) MLlib (machine learning) Java Virtual Machine tasks Executor Driver Executor results Executor Data Data Data Open source: http://spark.apache.org/ Data Source (HDFS, DB, File, etc.) Latest version is 2.4 released in 2018/11 2 val ds = ... val ds1 = ...
  • 3. What is Apache Spark? ▪ Framework that processes distributed computing by transforming distributed immutable memory structure using a set of parallel operations ▪ e.g. map(), filter(), reduce(), … – Distributed immutable in-memory structures ▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset – SQL-based data types are supported – Scala is primary language for programming on Spark Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki Spark Runtime (written in Java and Scala) Spark Streaming (real-time) GraphX (graph) SparkSQL (SQL) MLlib (machine learning) Java Virtual Machine tasks Executor Driver Executor results Executor Data Data Data Open source: http://spark.apache.org/ Data Source (HDFS, DB, File, etc.) Latest version is 2.4 released in 2018/11 3 val ds = ... val ds1 = ... This talk focuses on executor behavior
  • 4. How Code on Each Executor is Generated? ▪ The program written as embedded DSL is translated to Java code thru analysis and optimizations in Spark Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki4 val ds: Dataset[Array[Int]] = Seq(Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds.filter(a => a(0) > 0).map(a => a) while (rowIterator.hasNext()) { Row row = rowIterator.next; ArrayData a = row.getArray(0); … } SQL Analyzer Rule-based Optimizer Code Generator DataFrame Dataset (0, 2) (1, 3) Column 0 Row 0 Row 1 Java virtual machine Spark Program Generated Java code
  • 5. Motivating Example ▪ A simple Spark program that performs filter and map operations Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki5 val ds = Seq(Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds.filter(a => a(0) > 0) .map(a => a) (0, 2) (1, 3) Column 0 Row 0 Row 1
  • 6. Motivating Example ▪ Generate complicated code from a simple Spark program Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki6 val ds = Seq( Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds .filter(a => a(0) > 0) .map(a => a) final class GeneratedIterator { Iterator inputIterator = ...; Row projectRow = new Row(1); RowWriter rowWriter = new RowWriter(projectRow); protected void processNext() { while (inputIterator.hasNext()) { Row inputRow = (Row) inputIterator.next(); ArrayData a = inputRow.getArray(0); Object[] obj0 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj0[i] = new Integer(a.getInt(i)); ArrayData array_filter = new GenericArrayData(obj0); int[] input_filter = array_filter.toIntArray(); boolean fvalue = (Boolean)filter_func.apply(input_filter); if (!fvalue) continue; Object[] obj1 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj1[i] = new Intger(a.getInt(i)); ArrayData array_map = new GenericArrayData(obj1); int[] input_map = array_map.toIntArray(); int[] mvalue = (double[])map_func.apply(input_map); ArrayData value = new GenericArrayData(mvalue); rowWriter.write(0, value); appendRow(projectRow); } } } Note: Actually generated code is more complicated
  • 7. Performance Issues in Generated Code ▪ P1: Unnecessary data copy ▪ P2: Inefficient data representation ▪ P3: Unnecessary data conversions Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki7 final class GeneratedIterator { Iterator inputIterator = ...; Row projectRow = new Row(1); RowWriter rowWriter = new RowWriter(projectRow); protected void processNext() { while (inputIterator.hasNext()) { Row inputRow = (Row) inputIterator.next(); ArrayData a = inputRow.getArray(0); Object[] obj0 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj0[i] = Integer.valueOf(a.getInt(i)); ArrayData array_filter = new GenericArrayData(obj0); int[] input_filter = array_filter.toIntArray(); boolean fvalue = (Boolean)filter_func.apply(input_filter); if (!fvalue) continue; Object[] obj1 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj1[i] = Intger.valueOf(a.getInt(i)); ArrayData array_map = new GenericArrayData(obj1); int[] input_map = array_map.toIntArray(); int[] mvalue = (double[])map_func.apply(input_map); ArrayData value = new GenericArrayData(mvalue); rowWriter.write(0, value); appendRow(projectRow); } } } P1 (from columnar to row-oriented) P3 (Boxing) P2 P3 (Unboxing) P3 (Boxing) P2 P3 (Unboxing) P2 val ds = Seq( Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds .filter(a => a(0) > 0) .map(a => a)
  • 8. Our Contributions ▪ Revealed performance issues in generated code from a Spark program ▪ Devised three optimizations – to eliminate unnecessary data copy (Data-copy) – to improve efficiency of data representation (Data-representation) – to eliminate unnecessary data conversion (Data-conversion) ▪ Achieved up to 1.4x performance improvements – 22 TPC-H queries – Two machine learning programs ▪ Merged these optimizations into Spark 2.3 and later versions Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki8 These optimizations reduce path length of handling data.
  • 9. Outline ▪ Problems ▪ Eliminate unnecessary data copy (Data-copy) ▪ Improve efficiency of data representation (Data-representation) ▪ Eliminate unnecessary data conversion (Data-conversion) ▪ Experiments Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki9
  • 10. Basic Compilation Strategy of an Operator ▪ Use Volcano style [Graefe93] – Connect operations using an iterator for easy adding of new operators Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki10 Row-based iterator Operator val ds = Seq((Array(0, 2), 10, “Tokyo”)) val ds1 = ds.filter(a => a.int_ > 0) .map(a => a) Row-based iterator Operator Array Int String (0, 2) 10 “Tokyo” while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // map(...) ... } while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // filter(...) ... }
• 11. Overview of Generated Code in Spark
▪ Put multiple operators into one loop [Neumann11] when possible
– Can avoid the overhead of iterators
– Encourages compiler optimizations within a single loop

val ds1 = ds.filter(a => a(0) > 0)
  .map(a => a)

Volcano style:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  // filter(...)
  ...
}
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  // map(...)
  ...
}

Whole-stage code generation:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  // filter(...)
  ...
  // map(...)
  ...
}
• 12. Columnar Storage to Generated Code
▪ While the source uses columnar storage, the generated code requires data in row-based storage

val ds = Seq((Array(0, 2), 10, "Tokyo"),
             (Array(1, 3), 20, "Mumbai")).toDS.cache

Columnar storage:
        Array    Int   String
Row 0:  (0, 2)   10    "Tokyo"
Row 1:  (1, 3)   20    "Mumbai"

Columnar storage → row-based iterator → operators (two operations in a loop):
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...
  ...
}
• 13. Problem: Data Copy From Columnar Storage
▪ A copy of a set of columns into a row occurs whenever the iterator is used (see the sketch below)

val ds = Seq((Array(0, 2), 10, "Tokyo"),
             (Array(1, 3), 20, "Mumbai")).toDS.cache

Columnar storage:
        Array    Int   String
Row 0:  (0, 2)   10    "Tokyo"
Row 1:  (1, 3)   20    "Mumbai"

Columnar storage → (data copy) → row-based iterator → operators:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...
  ...
}
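As a minimal sketch of why the row-based iterator implies a copy (ColumnVector and GenericRow here are hypothetical types, not Spark's exact classes), every call to next() materializes a fresh row by copying each column's value out of the columnar vectors:

import java.util.Iterator;

class ColumnarToRowSketch {
  // Hypothetical columnar vector holding one column's values.
  interface ColumnVector { Object get(int rowId); int numRows(); }

  // Hypothetical row filled by copying values out of the column vectors.
  static class GenericRow {
    final Object[] values;
    GenericRow(int n) { values = new Object[n]; }
  }

  // Every next() performs a column-to-row data copy (problem P1).
  static Iterator<GenericRow> rowIterator(ColumnVector[] columns) {
    return new Iterator<GenericRow>() {
      private int rowId = 0;
      public boolean hasNext() { return rowId < columns[0].numRows(); }
      public GenericRow next() {
        GenericRow row = new GenericRow(columns.length);
        for (int c = 0; c < columns.length; c++)
          row.values[c] = columns[c].get(rowId);  // copy each column's value
        rowId++;
        return row;
      }
    };
  }
}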
• 14. Solution: Generate Optimized Code
▪ If the new analysis identifies that the source is columnar storage,
– Use a counter-based loop without a row-based iterator
▪ An index identifies a row position in the columnar storage
– Get data from the columnar storage directly

Columnar storage:
        Array    Int   String
Row 0:  (0, 2)   10    "Tokyo"
Row 1:  (1, 3)   20    "Mumbai"

Column column1 = df1.getColumn(1);
int sum = 0;
for (int i = 0; i < column1.numRows(); i++) {
  int x = column1.getInteger(i);
  ...
  ...
}
• 15. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
• 16. Overview of Old Internal Array Representation
▪ Use an Object array so that each element can hold the SQL NULL value in addition to a primitive value (e.g. 0 or 2)
☺ Easy to represent NULL
☹ Uses a boxed object (e.g. an Integer object) to hold a primitive value (e.g. an int)

val ds = Seq(Array(0, 2, NULL))

Object array (len = 3): [0] → Integer 0, [1] → Integer 2, [2] → NULL
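A minimal sketch of such an Object[]-backed array class (a hypothetical simplification; Spark's real GenericArrayData has more machinery) shows why NULL is easy to represent but every int must be boxed:

// Hypothetical Object[]-backed array: each slot is a boxed value or null.
class ObjectBackedArrayData {
  private final Object[] array;
  ObjectBackedArrayData(Object[] array) { this.array = array; }
  int numElements() { return array.length; }
  boolean isNullAt(int i) { return array[i] == null; }           // NULL is trivial
  int getInt(int i) { return ((Integer) array[i]).intValue(); }  // unboxing on read
  void setInt(int i, int v) { array[i] = Integer.valueOf(v); }   // boxing on write
}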
• 17. Problem: Boxing and Unboxing Occur
▪ Boxing (i.e. creating an object) is caused by a setter method (e.g. setInt(1, 2))
▪ Unboxing is caused by a getter method (e.g. getInt(0))
▪ Memory footprint increases because each element holds a pointer to an object

int getInt(int i) {
  // unboxing from an Integer object to an int value
  return ((Integer) array[i]).intValue();
}
void setInt(int i, int v) {
  // boxing from an int value to an Integer object
  array[i] = new Integer(v);
}

Object array (len = 3): [0] → Integer 0, [1] → Integer 2, [2] → NULL
• 18. Solution: Use Primitive Type When Possible
▪ Keep a value in a primitive field when possible, based on analysis
▪ Keep NULL in a separate bit field (see the sketch below)
☺ Avoids boxing and unboxing
☺ Reduces memory footprint

int getInt(int i) {
  return array[i];
}
void setInt(int i, int v) {
  array[i] = v;
}

int array (len = 3):  [0] = 0,        [1] = 2,        [2] = 0
bit field:            [0] = Non Null, [1] = Non Null, [2] = Null
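A minimal sketch of this layout, using a hypothetical IntBackedArrayData class (Spark's actual representation packs values and null bits into a flat buffer): values live in an int[], and NULLs are tracked with one bit per element.

class IntBackedArrayData {
  private final int[] values;     // primitive storage: no boxed objects
  private final long[] nullBits;  // one bit per element: 1 means SQL NULL

  IntBackedArrayData(int numElements) {
    values = new int[numElements];
    nullBits = new long[(numElements + 63) / 64];
  }
  boolean isNullAt(int i) { return (nullBits[i >> 6] & (1L << (i & 63))) != 0; }
  void setNullAt(int i) { nullBits[i >> 6] |= 1L << (i & 63); }
  int getInt(int i) { return values[i]; }   // no unboxing
  void setInt(int i, int v) {               // no boxing
    values[i] = v;
    nullBits[i >> 6] &= ~(1L << (i & 63));  // clear the null bit
  }
  int numElements() { return values.length; }
}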
• 19. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
• 20. Problem: Boxing Occurs
▪ When the data representation is converted from Spark's internal format to Java objects, boxing occurs in the generated code
☺ Easy to handle the NULL value

ArrayData a = …;
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
  obj0[i] = new Integer(a.getInt(i)); // boxing
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
• 21. Solution: Use Primitive Type When Possible
▪ When the analysis identifies that the array is a primitive-type array without NULL, generate code that uses a primitive array (see the sketch below)

ArrayData a = …;
int[] array = a.toIntArray();

Note: the data-representation optimization improves the efficiency of toIntArray()
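To see why the note holds, here is a hedged sketch of toIntArray() under the two representations (method names are assumptions, simplified from Spark's ArrayData subclasses): the Object[]-backed version must unbox every element, while the primitive-backed version is a single bulk copy.

// Object[]-backed representation: per-element unboxing.
static int[] toIntArrayFromObjects(Object[] array) {
  int[] result = new int[array.length];
  for (int i = 0; i < array.length; i++)
    result[i] = ((Integer) array[i]).intValue();  // unboxing for every element
  return result;
}

// int[]-backed representation: one bulk copy, no boxing or unboxing.
static int[] toIntArrayFromPrimitives(int[] values) {
  int[] result = new int[values.length];
  System.arraycopy(values, 0, result, 0, values.length);
  return result;
}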
• 22. Generated Code without Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions

val ds = Seq(Array(0, 2), Array(1, 3))
  .toDS.cache
val ds1 = ds
  .filter(a => a(0) > 0)
  .map(a => a)

final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next(); // P1
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length]; // P3
      for (int i = 0; i < a.length; i++)
        obj0[i] = new Integer(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0); // P2
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length]; // P3
      for (int i = 0; i < a.length; i++)
        obj1[i] = new Integer(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1); // P2
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue); // P2
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
• 23. Generated Code with Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions

val ds = Seq(Array(0, 2), Array(1, 3))
  .toDS.cache
val ds1 = ds
  .filter(a => a(0) > 0)
  .map(a => a)

final class GeneratedIterator {
  Column column0 = ...getColumn(0);
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    // eliminated data copy (P1)
    for (int i = 0; i < column0.numRows(); i++) {
      ArrayData a = column0.getArray(i);
      // eliminated data conversion (P3)
      int[] input_filter = a.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      // eliminated data conversion (P3)
      int[] input_map = a.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      // use efficient data representation (P2)
      ArrayData value = new IntArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
• 24. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
• 25. Performance Evaluation Methodology
▪ Measured the performance improvement from our optimizations for two types of applications
– Database: TPC-H
– Machine learning: logistic regression and k-means
▪ Experimental environment
– Five machines, each with a 16-core Intel Xeon E5-2683 v4 CPU (2.1 GHz) and 128 GB of RAM
▪ One for the driver and four for executors
– Spark 2.2
– OpenJDK 1.8.0_181 with a 96 GB heap and the default garbage-collection policy
• 26. Performance Improvements of TPC-H Queries
▪ Achieved up to 1.41x performance improvement
– 1.10x on geometric mean
▪ Accomplished by the data-copy optimization alone
– No array is used in TPC-H

[Bar chart: performance improvement over no optimization for queries Q1–Q22 with the data-copy optimization; scale factor = 10; higher is better]
• 27. Performance Improvements of ML Applications
▪ Achieved up to 1.42x performance improvement for logistic regression
– 1.21x on geometric mean
▪ Accomplished by the optimizations for array representation
– The columnar-storage optimization contributed only slightly

[Bar chart: performance improvement over no optimization for k-means (5M data points with 200 dimensions) and logistic regression (32M data points with 200 dimensions), comparing "data-representation and data-conversion optimizations" with "all optimizations"; higher is better]
• 28. Cycle Breakdown for Logistic Regression
▪ Data conversion consumed 28% of the cycles without the data-representation and data-conversion optimizations

Percentage of consumed cycles:
                    Data conversion   Computation   Others
w/o optimizations        28.1%           60.1%       11.8%
w/ optimizations          0.0%           87.6%       12.4%
• 29. Conclusion
▪ Revealed performance issues in the code generated from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs: logistic regression and k-means
▪ Merged these optimizations into Spark 2.3 and later versions
• 30. Acknowledgments
▪ Thanks to the Apache Spark community for suggestions on merging our optimizations into Apache Spark, especially
– Wenchen Fan, Herman van Hovell, Liang-Chi Hsieh, Takuya Ueshin, Sameer Agarwal, Andrew Or, Davies Liu, Nong Li, and Reynold Xin