Presentation slide for "In-Memory Storage Evolution in Apache Spark" at Spark+AI Summit 2019
https://databricks.com/session/in-memory-storage-evolution-in-apache-spark
1. Kazuaki Ishizaki
IBM Research – Tokyo
@kiszk
In-Memory Storage
Evolution in Apache Spark
#UnifiedAnalytics #SparkAISummit
2. About Me – Kazuaki Ishizaki
• Researcher at IBM Research in compiler optimizations
• Working on the IBM Java virtual machine for over 20 years
– In particular, just-in-time compiler
• Committer of Apache Spark (SQL package) since 2018
• ACM Distinguished Member
• Homepage: http://ibm.biz/ishizaki
GitHub: https://github.com/kiszk Twitter: @kiszk
https://slideshare.net/ishizaki
In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki
3. Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a first-class-citizen columnar format
– Achieve a better compression ratio for the table cache
[Figure: row format vs. column format for a table with rows (Spark, 2.0, 1), (AI, 1.9, 2), (Summit, 5000.0, 3). The row format places the values of each row at adjacent memory addresses (Row 0, Row 1, Row 2); the column format places the values of each column at adjacent addresses (Column x: Spark/AI/Summit, Column y: 2.0/1.9/5000.0, Column z: 1/2/3).]
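The two layouts in the figure can be sketched in plain Python (a simplification: Spark's real storage is byte-based, not Python lists):

```python
# Sketch of row vs. column layouts for the sample table
# (Spark, 2.0), (AI, 1.9), (Summit, 5000.0).
rows = [("Spark", 2.0), ("AI", 1.9), ("Summit", 5000.0)]

# Row format: all values of one row sit next to each other,
# which is good for reading a whole row at once.
def get_row(i):
    return rows[i]

# Column format: all values of one column sit next to each other,
# which is good for scanning a single column and compresses better.
col_x = ["Spark", "AI", "Summit"]
col_y = [2.0, 1.9, 5000.0]

def get_column_value(col, i):
    return col[i]

assert get_row(1) == ("AI", 1.9)
assert get_column_value(col_y, 2) == 5000.0
```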
4. What I Will Talk about
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage from Spark 2.3
– improves performance of PySpark with Pandas UDF using
Arrow
– can be connected with other external columnar storages by
using the public class "ColumnVector"
6. Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
[Chart: performance comparison among Spark versions for df.filter("i % 16 == 0").count; relative elapsed time for Spark 2.0, 2.3, and 2.4, shorter is better]
7. How This Improvement is Achieved
• Structure of columnar storage
• Generated code to access columnar storage
8. Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
9. In-Memory Storage Evolution (1/2)
[Figure: in-memory storage timeline by Spark version]
– up to 1.3: RDD table cache, stored as Java objects
– 1.4 to 1.6: table cache with its own memory layout, built by Project Tungsten
– 2.0 to 2.2: Parquet vectorized reader with its own memory layout, but in a different class from the table cache
10. In-Memory Storage Evolution (2/2)
[Figure: in-memory storage timeline, continued]
– 2.3: Pandas UDF with Arrow; ColumnVector becomes a public class
– 2.4: ORC vectorized reader
From Spark 2.3, the table cache, Parquet, ORC, and Arrow all use the common ColumnVector class.
11. Implementation in Spark 1.4 to 1.6
• The table cache uses CachedBatch, which cannot be accessed
directly from generated code
case class CachedBatch(
buffers: Array[Array[Byte]],
stats: Row)
[Figure: the column values 2.0 and 1.9 are serialized into CachedBatch.buffers]
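A stdlib-only sketch of why this layout blocks direct access (an assumed simplification, not Spark's actual encoding): each column lives in one serialized byte buffer, so every read must first decode the buffer back into values.

```python
import struct

# CachedBatch-style storage: one serialized byte buffer per column.
# Generated code cannot index the buffer directly by row.
column_y = [2.0, 1.9, 5000.0]
buffer = struct.pack(f"{len(column_y)}d", *column_y)  # column -> bytes

# Reading the cache needs an explicit decode step -- this is the
# data-conversion cost paid on every access.
decoded = list(struct.unpack(f"{len(column_y)}d", buffer))
assert decoded == column_y
```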
12. Implementation in Spark 2.0
• Parquet uses the ColumnVector class, which has well-defined
methods that can be called from generated code
public abstract class ColumnVector {
float getFloat(…) …
UTF8String getUTF8String(…) …
…
}
public final class OnHeapColumnVector
extends ColumnVector {
private byte[] byteData;
…
private float[] floatData;
…
}
[Figure: Parquet data (2.0, 1.9) is copied into a ColumnVector held by a ColumnarBatch]
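The OnHeapColumnVector idea can be sketched in plain Python (names Pythonized; the real class is Java and keeps one primitive array per type):

```python
from array import array

# One contiguous primitive array per type, plus typed getters that
# generated code can call directly -- no per-row decode step as in
# the CachedBatch byte buffers.
class OnHeapColumn:
    def __init__(self, values):
        self._float_data = array("d", values)  # contiguous primitive storage

    def get_float(self, row_id):
        return self._float_data[row_id]

col = OnHeapColumn([2.0, 1.9, 5000.0])
assert col.get_float(1) == 1.9
```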
13. Implementation in Spark 2.3
• Table cache, Parquet, and Arrow also use ColumnVector
• ColumnVector becomes a public class to define APIs
/**
 * An interface representing in-memory columnar data in Spark.
 * This interface defines the main APIs to access the data, as well
 * as their batched versions. The batched versions are considered
 * to be faster and preferable whenever possible.
 */
@Evolving
public abstract class ColumnVector … {
float getFloat(…) …
UTF8String getUTF8String(…) …
…
}
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java
public final class OnHeapColumnVector
extends ColumnVector {
// Array for each type.
private byte[] byteData;
…
private float[] floatData;
…
}
public final class ArrowColumnVector
extends ColumnVector {
…
}
[Figure: the table cache, the Parquet vectorized reader, and Pandas UDF with Arrow all build on ColumnVector.java]
14. ColumnVector for Your Columnar
• Developers can write their own class that extends
ColumnVector to support a new columnar format or to exchange
data with other formats
[Figure: MyColumnarClass extends ColumnVector to wrap an external columnar data source]
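A minimal sketch of this adapter idea (names hypothetical, mirroring the MyColumnarClass on the slide): a class that exposes an external columnar buffer through ColumnVector-style typed getters, so the data can be read in place without copying.

```python
# Hypothetical adapter over an external columnar data source.
class MyColumnarClass:
    """Answers ColumnVector-style typed getters over external data."""

    def __init__(self, external_column):
        self._data = external_column  # e.g. an Arrow buffer in practice

    def get_float(self, row_id):
        # Read directly from the external buffer -- no copy into Spark.
        return self._data[row_id]

vec = MyColumnarClass([3.2, 3.1])
assert vec.get_float(0) == 3.2
```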
15. Implementation in Spark 2.4
• ORC also uses ColumnVector
(Same ColumnVector definition and implementing classes as on the previous slide.)
[Figure: the table cache, the Parquet and ORC vectorized readers, and Pandas UDF with Arrow all build on ColumnVector.java]
16. Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
17. How Is a Spark Program Executed?
• A Spark program is translated into Java code to be executed
Source: Michael Armbrust et al., "Spark SQL: Relational Data Processing in Spark", SIGMOD '15
[Figure: Catalyst translates the Spark program into generated Java code, which runs on the Java virtual machine]
Spark program:
df = ...
df.cache
df1 = df.selectExpr("y + 1.2")
Generated code:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  …
}
18. Access Columnar Storage (before 2.0)
• Even when columnar storage is used, the generated code gets
data from row storage, so data conversion is required
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  float y = row.getFloat(1);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
[Figure: the CachedBatch columnar values 2.0 and 1.9 are converted into row storage before the generated code reads each row]
19. Access Columnar Storage (from 2.0)
• When columnar storage is used, the generated code reads data
elements directly from columnar storage
– The copy was removed for Parquet in 2.0 and for the table cache in 2.3
ColumnVector column1 = …
int i = 0;
while (i < numRows) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
  i++;
}
df1 = df.selectExpr("y + 1.2")
[Figure: the generated code reads 2.0 (i = 0) and 1.9 (i = 1) directly from the ColumnVector; y + 1.2 gives f = 3.2 and 3.1]
20. Access Columnar Storage (from 2.3)
• Generate this loop pattern for all cases that use a ColumnVector
• Use a for-loop to encourage compiler optimizations
– The HotSpot compiler applies loop optimizations to a well-formed loop
ColumnVector column1 = …
for (int i = 0; i < numRows; i++) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
[Figure: the for-loop reads 2.0 (i = 0) and 1.9 (i = 1) directly from the ColumnVector, producing f = 3.2 and 3.1]
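The two generated-code shapes, a row iterator versus a counted loop over the column, can be contrasted in a plain-Python sketch (the HotSpot-specific benefits, such as bounds-check elimination, do not apply to Python itself):

```python
# Row-iterator shape (before 2.0): a whole Row object is touched
# just to read one field.
rows = [("x", 2.0), ("y", 1.9)]   # row storage: one tuple per row
out_rows = [row[1] + 1.2 for row in rows]

# Counted-loop shape (from 2.3): the field is read directly from
# the column by index, with no Row objects in between.
column1 = [2.0, 1.9]              # columnar storage: one list per column
out_cols = []
for i in range(len(column1)):
    out_cols.append(column1[i] + 1.2)

# Both shapes compute the same result; only the access pattern differs.
assert out_rows == out_cols
```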
21. How Columnar Storage Is Used in PySpark
• Share data between the columnar storages of Spark and Pandas
– No serialization or deserialization
– 3-100x performance improvements
[Figure: a ColumnVector shared between Spark and Pandas]
Details in "Apache Arrow and Pandas UDF on Apache Spark" by Takuya Ueshin
Source: "Introducing Pandas UDF for PySpark", Databricks blog
@pandas_udf('double')
def plus(v):
    return v + 1.2
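The Pandas UDF above is vectorized: it is called once per column batch, not once per value. A stdlib-only sketch of that calling convention (in real PySpark, v is a pandas.Series backed by Arrow memory shared with the JVM):

```python
from array import array

# Vectorized `plus`: receives a whole column batch, returns a batch,
# instead of being invoked once per scalar value.
def plus(v):
    return array("d", (x + 1.2 for x in v))

batch = array("d", [2.0, 1.9])   # one column batch
result = plus(batch)             # one call processes the whole batch
assert abs(result[0] - 3.2) < 1e-9
```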
22. Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
23. Next Steps
• Short-term
– support array types in ColumnVector for the table cache
– support additional external columnar storages
• Middle-term
– exploit SIMD instructions to process multiple rows in a column
in generated code
• Extension of SPARK-25728 (Tungsten IR)
24. Integrate Spark with Others
• Frameworks: DL/ML frameworks
• SPARK-24579
• SPARK-26413
• Resources: GPU, FPGA, ..
• SPARK-27396
• SAIS2019: “Apache Arrow-Based
Unified Data Sharing and
Transferring Format Among
CPU and Accelerators”
[Image: GPU and FPGA accelerators; GPU image from rapids.ai]
25. Takeaway
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage from Spark 2.3
– improves performance of PySpark with Pandas UDF using
Arrow
– can be connected with other external columnar storages by
using the public class "ColumnVector"