Kazuaki Ishizaki
IBM Research – Tokyo
@kiszk
In-Memory Storage
Evolution in Apache Spark
#UnifiedAnalytics #SparkAISummit
About Me – Kazuaki Ishizaki
• Researcher at IBM Research in compiler optimizations
• Working on the IBM Java virtual machine for over 20 years
– In particular, the just-in-time compiler
• Committer of Apache Spark (SQL package) since 2018
• ACM Distinguished Member
• Homepage: http://ibm.biz/ishizaki
GitHub: https://github.com/kiszk  Twitter: @kiszk
https://slideshare.net/ishizaki
Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a first-class-citizen columnar format
– Achieve a better compression ratio for the table cache
[Figure: row format vs. column format — the same three records (Spark, 2.0, 1), (AI, 1.9, 2), (Summit, 5000.0, 3) laid out in memory row by row (Row 0, Row 1, Row 2) versus column by column (Column x, Column y, Column z)]
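To make the two layouts concrete, here is a minimal sketch (not from the original slides) that assumes the pyarrow package is installed; the same three records are held row by row as Python tuples and column by column as Arrow arrays:

import pyarrow as pa

# Row format: the fields of each record sit next to each other.
rows = [("Spark", 2.0, 1), ("AI", 1.9, 2), ("Summit", 5000.0, 3)]

# Column format: one contiguous array per column, as Parquet and Arrow store data.
table = pa.table({
    "x": ["Spark", "AI", "Summit"],
    "y": [2.0, 1.9, 5000.0],
    "z": [1, 2, 3],
})
print(table.column("y"))  # all values of column y are stored contiguously

Scanning or compressing a single column touches one contiguous array instead of striding over whole rows, which is why the columnar form compresses better for the table cache.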
What I Will Talk about
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage, from Spark 2.3,
– improves the performance of PySpark with Pandas UDF by using Arrow
– can be connected to other external columnar storage through the public class "ColumnVector"
How Columnar Storage is Used
• Four paths read data through columnar storage: table cache, Parquet, ORC, and Pandas UDF
– Table cache
df = ...
df.cache()
df1 = df.selectExpr("y + 1.2")

– Parquet
df = spark.read.parquet("c")
df1 = df.selectExpr("y + 1.2")

– ORC
df = spark.read.format("orc").load("c")
df1 = df.selectExpr("y + 1.2")

– Pandas UDF
@pandas_udf('double')
def plus(v):
    return v + 1.2
df1 = df.withColumn('yy', plus(df.y))
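The snippets above are fragments; below is a runnable sketch of the table cache path (the session setup, the data, and the column name y are illustrative assumptions, not part of the slides). df.explain() prints the physical plan, which typically contains an InMemoryTableScan node once the DataFrame is cached:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-paths").getOrCreate()

# A small DataFrame with a double column y, cached in Spark's in-memory columnar format
df = spark.range(0, 1000).selectExpr("cast(id as double) as y")
df.cache()

df1 = df.selectExpr("y + 1.2")
df1.explain()       # the plan typically shows an InMemoryTableScan over the cached data
print(df1.count())  # triggers the cache to be built and read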
Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
[Chart: relative elapsed time of df.filter("i % 16 == 0").count() on a cached DataFrame for Spark 2.0, 2.3, and 2.4 — shorter is better]
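A hedged sketch of how such a comparison can be reproduced; the data size, the column name i, and the wall-clock timing below are illustrative choices, and the same script would be run on each Spark version being compared:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-bench").getOrCreate()

# Cache a DataFrame with an integer column i, then time the filter-and-count query
df = spark.range(0, 10 * 1000 * 1000).selectExpr("id % 256 as i")
df.cache()
df.count()  # materialize the in-memory table cache first

start = time.time()
matched = df.filter("i % 16 == 0").count()
print(matched, "rows,", time.time() - start, "seconds")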
How This Improvement is Achieved
• Structure of columnar storage
• Generated code to access columnar storage
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
In-Memory Storage Evolution (1/2)
[Timeline figure, by Spark version]
– up to 1.3: RDD table cache — Java objects
– 1.4 to 1.6: Table cache — own memory layout introduced by Project Tungsten for the table cache
– 2.0 to 2.2: Table cache and Parquet vectorized reader — Parquet also has its own memory layout, but in a different class from the table cache
In-Memory Storage Evolution (2/2)
[Timeline figure, by Spark version]
– 2.3: Table cache, Parquet vectorized reader, and Pandas UDF with Arrow — ColumnVector becomes a public class
– 2.4: ORC vectorized reader
The ColumnVector class becomes a public class in Spark 2.3; the table cache, Parquet, ORC, and Arrow use the common ColumnVector class.
Implementation in Spark 1.4 to 1.6
• The table cache uses CachedBatch, which is not accessed directly from generated code
case class CachedBatch(
  buffers: Array[Array[Byte]],
  stats: Row)

[Figure: the cached column data (Spark/AI, 2.0/1.9) is held in CachedBatch.buffers as byte arrays]
Implementation in Spark 2.0
• Parquet uses the ColumnVector class, which has well-defined methods that can be called from generated code
public abstract class ColumnVector {
  float getFloat(…) …
  UTF8String getUTF8String(…) …
  …
}

public final class OnHeapColumnVector
    extends ColumnVector {
  private byte[] byteData;
  …
  private float[] floatData;
  …
}

[Figure: column data (Spark/AI, 2.0/1.9) is copied into ColumnVector instances, which are grouped in a ColumnarBatch]
Implementation in Spark 2.3
• The table cache, Parquet, and Arrow also use ColumnVector
• ColumnVector becomes a public class that defines the APIs
/**
 * An interface representing in-memory columnar data in Spark. This interface defines the main APIs
 * to access the data, as well as their batched versions. The batched versions are considered to be
 * faster and preferable whenever possible.
 */
@Evolving
public abstract class ColumnVector … {
  float getFloat(…) …
  UTF8String getUTF8String(…) …
  …
}
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

public final class OnHeapColumnVector
    extends ColumnVector {
  // Array for each type.
  private byte[] byteData;
  …
  private float[] floatData;
  …
}

public final class ArrowColumnVector
    extends ColumnVector {
  …
}

[Figure: the table cache and the Parquet vectorized reader use OnHeapColumnVector; Pandas UDF with Arrow uses ArrowColumnVector; both extend the ColumnVector class defined in ColumnVector.java]
ColumnVector for Your Columnar
• Developers can write an own class, which extends
ColumnVector, to support a new columnar or to exchange
data with other formats
[Figure: a columnar data source connected to Spark through MyColumnarClass extends ColumnVector]
Implementation in Spark 2.4
• ORC also uses ColumnVector
/**
 * An interface representing in-memory columnar data in Spark. This interface defines the main APIs
 * to access the data, as well as their batched versions. The batched versions are considered to be
 * faster and preferable whenever possible.
 */
@Evolving
public abstract class ColumnVector … {
  float getFloat(…) …
  UTF8String getUTF8String(…) …
  …
}
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

public final class OnHeapColumnVector
    extends ColumnVector {
  // Array for each type.
  private byte[] byteData;
  …
  private float[] floatData;
  …
}

public final class ArrowColumnVector
    extends ColumnVector {
  …
}

[Figure: the table cache and the Parquet and ORC vectorized readers use OnHeapColumnVector; Pandas UDF with Arrow uses ArrowColumnVector; both extend the ColumnVector class defined in ColumnVector.java]
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
How Is a Spark Program Executed?
• A Spark program is translated into Java code to be executed
Spark program:
df = ...
df.cache()
df1 = df.selectExpr("y + 1.2")

Catalyst translates it into Java code that runs on the Java virtual machine:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  …
}

Source: Armbrust et al., Spark SQL: Relational Data Processing in Spark, SIGMOD '15
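The Java code that Catalyst generates can be dumped and inspected; a minimal sketch follows (the table name t and the expression are illustrative), using the EXPLAIN CODEGEN statement provided by Spark SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codegen-inspect").getOrCreate()

df = spark.range(0, 100).selectExpr("cast(id as double) as y")
df.createOrReplaceTempView("t")

# Print the whole-stage generated Java code for this query
spark.sql("EXPLAIN CODEGEN SELECT y + 1.2 FROM t").show(truncate=False)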
Access Columnar Storage (before 2.0)
• Although columnar storage is used, the generated code gets data from row storage
– Data conversion is required
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  float y = row.getFloat(1);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
[Figure: Catalyst generates row-based code — df is held in columnar storage (CachedBatch), and each access converts the columnar data into rows before the generated code reads it]
Access Columnar Storage (from 2.0)
• When columnar storage is used, reading a data element directly accesses the columnar storage
– Removed the copy for Parquet in 2.0 and for the table cache in 2.3
ColumnVector column1 = …
int i = 0;
while (i < numRows) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
  i++;
}
df1 = df.selectExpr("y + 1.2")
[Figure: Catalyst generates code that reads y directly from the ColumnVector (i = 0 → 2.0, i = 1 → 1.9) and computes f (3.2, 3.1) without converting to rows]
Access Columnar Storage (from 2.3)
• Generate this code pattern for all cases that use ColumnVector
• Use a for-loop to encourage compiler optimizations
– The HotSpot compiler applies loop optimizations to a well-formed loop
ColumnVector column1 = …
for (int i = 0; i < numRows; i++) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
}
[Figure: for df1 = df.selectExpr("y + 1.2"), Catalyst generates a loop that reads y from the ColumnVector (i = 0 → 2.0, i = 1 → 1.9) and computes f (3.2, 3.1)]
How Columnar Storage Is Used in PySpark
• Data is shared between the columnar storage of Spark and Pandas
– No serialization or deserialization
– 3x to 100x performance improvements
[Figure: Spark's ColumnVector data is exchanged with Pandas through Arrow]
For details, see "Apache Arrow and Pandas UDF on Apache Spark" by Takuya Ueshin
Source: "Introducing Pandas UDF for PySpark", Databricks blog
@pandas_udf('double')
def plus(v):
    return v + 1.2
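A runnable sketch of this Pandas UDF path (the session setup and the data are illustrative; it assumes Spark 2.3 or 2.4 with pandas and PyArrow installed on the Python side):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
# Optional: also use Arrow for toPandas()/createDataFrame(); Pandas UDFs use Arrow regardless
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(0, 1000).selectExpr("cast(id as double) as y")

@pandas_udf('double')
def plus(v):
    # v is a pandas.Series backed by Arrow columnar data, processed in batches
    return v + 1.2

df1 = df.withColumn('yy', plus(df.y))
df1.show(3)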
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
Next Steps
• Short-term
– support an array type in ColumnVector for table cache
– support additional external columnar storage
• Middle-term
– exploit SIMD instructions to process multiple rows in a column
in generated code
• Extension of SPARK-25728 (Tungsten IR)
Integrate Spark with Others
• Frameworks: DL/ML frameworks
• SPARK-24579
• SPARK-26413
• Resources: GPU, FPGA, ..
• SPARK-27396
• SAIS 2019: "Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators"
[Images: GPU and FPGA, from rapids.ai]
Takeaway
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage, from Spark 2.3,
– improves the performance of PySpark with Pandas UDF by using Arrow
– can be connected to other external columnar storage through the public class "ColumnVector"
Thanks to the Spark Community
• Especially, @andrewor14, @bryanCutler, @cloud-fan,
@dongjoon-hyun, @gatorsmile, @hvanhovell, @mgaido91,
@ueshin, @viirya