Kazuaki Ishizaki (石崎 一明)
IBM Research – Tokyo (日本アイ・ビー・エム(株)東京基礎研究所)
@kiszk
Looking back at Spark 2.x and forward to 3.0
1
About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research - Tokyo https://ibm.biz/ishizaki
– Compiler optimization
– Language runtime
– Parallel processing
▪ Working for IBM Java virtual machine (now OpenJ9) from over 20 years
– In particular, just-in-time compiler
▪ Apache Spark committer for the SQL package (since 2018/9)
– My first PR was merged in 2015/12
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– Slideshare: https://www.slideshare.net/ishizaki
2 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Today’s Talk
▪ I will not talk about the distributed framework
– You are more familiar with it than I am
▪ I will not talk about SQL, machine learning, and other libraries
– I expect @maropu will talk about SQL in the next session
▪ I will talk about how a program is executed on an executor at a node
4 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Outline
▪ How is a DataFrame/Dataset program executed?
▪ What are problems in Spark 2.x?
▪ What’s new in Spark 3.0?
▪ Why was I appointed as a committer?
5 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
An Apache Spark Program Written by a User
▪ This DataFrame program is written in Scala
6 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
df: DataFrame[int] = (1 to 100).toDF
df.selectExpr("value + 1")
.selectExpr("value + 2")
.show
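For readers who want to try this outside a slide, a minimal runnable sketch follows; the SparkSession setup, the spark.implicits._ import, and the AS value alias (added so the second selectExpr can still resolve the column by name) are my additions, not part of the original slide.

import org.apache.spark.sql.SparkSession

object SelectExprExample {
  def main(args: Array[String]): Unit = {
    // A local session only for illustration; any existing SparkSession works as well
    val spark = SparkSession.builder().appName("selectExpr-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // (1 to 100).toDF creates a one-column DataFrame whose column is named "value"
    val df = (1 to 100).toDF
    df.selectExpr("value + 1 AS value")  // keep the column name so the next selectExpr can resolve it
      .selectExpr("value + 2")
      .show(3)

    spark.stop()
  }
}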
Java Code Is Actually Executed
▪ A DataFrame/Dataset program is translated into a Java program that is
actually executed
– An optimizer combines two arithmetic operations into one
– Whole-stage codegen puts multiple operations (read, selectExpr, and
projection) into one loop
7
while (itr.hasNext()) { // execute a row
// get a value from a row in DF
int value =((Row)itr.next()).getInt(0);
// compute a new value
int mapValue = value + 3;
// store a new value to a row in DF
outRow.write(0, mapValue);
append(outRow);
}
df: DataFrame[int] = …
df.selectExpr("value + 1")
.selectExpr("value + 2")
.show
Code
generation
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Input: 1, 2, 3, 4, … held as Unsafe data (on heap)
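One way to see this generated loop yourself is the debugCodegen helper that also appears on a later slide; a short sketch, assuming a spark-shell session where spark and its implicits are available (the AS value alias is my adjustment so the chained selectExpr resolves):

import spark.implicits._
import org.apache.spark.sql.execution.debug._  // adds debugCodegen to Dataset/DataFrame

val df = (1 to 100).toDF
// Prints the whole-stage generated Java code for this plan to stdout
df.selectExpr("value + 1 AS value").selectExpr("value + 2").debugCodegen()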
How a Program is Translated to Java Code
8 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
From Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
Who is More Familiar with Each Module
▪ Four Japanese committers are in this room
9 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Project Tungsten
Major Items in Spark 2.x to Me
▪ Improve performance
– by improving data representation
– by eliminating serialization/deserialization (ser/de)
– by improving generated code
▪ Stable code generation
– No more Java exceptions when a program has a large number of columns (>1000)
10 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Array Internal Representation
▪ Before Spark 2.1, an array (UnsafeArrayData) was internally represented
using a sparse/indirect structure
– Good for small memory consumption if an array is sparse
▪ After Spark 2.1, the array representation is dense/contiguous
– Good for performance
11 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
– Sparse/indirect layout (before 2.1): len = 2 | offset[0] | offset[1] | 7 | 8   (a[0] and a[1] are reached via offsets)
– Dense/contiguous layout (since 2.1): len = 2 | non-null | non-null | 7 | 8   (a[0] and a[1] are stored in place)
SPARK-15962 improves this representation
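To make the dense layout concrete, the internal helper used later in this deck (UnsafeArrayData.fromPrimitiveArray) can be called directly from a spark-shell; this is an internal Catalyst API, and the package name below is from my memory of the Spark 2.x source tree, so treat the sketch as illustrative only:

import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

// Builds the dense/contiguous representation introduced by SPARK-15962
val arr = UnsafeArrayData.fromPrimitiveArray(Array(7, 8))
println(arr.numElements())   // 2
println(arr.getInt(0))       // 7
println(arr.getSizeInBytes)  // header + null bits + two packed 4-byte values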
This Was the First Tough PR for Me
▪ It took three months and 270 conversations
12 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
A Simple Dataset Program with Array
▪ Read an integer array in a row
▪ Create a new array from the first element
13 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
ds: DataSet[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))
Weird Generated Pseudo Code with DataSet
▪ Data conversion is too slow
– Between internal representation (Tungsten) and Java object format (Object[])
▪ Element-wise data copy is too slow
14 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
ArrayData inArray;
while (itr.hasNext()) {
inArray = ((Row)itr.next()).getArray(0);
append(outRow);
}
ds: DataSet[Array[Int]] =
Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))
[Diagram annotations: code generation wraps the user lambda, which becomes
int[] mapArray = new int[] { a[0] };
De (deserialization) performs data conversion plus an element-wise copy of each element with a null check,
and Ser (serialization) performs data conversion (a copy with Java object creation) plus another element-wise data copy.]
Generated Source Java Code
▪ Data conversion is done by boxing or unboxing
▪ Element-wise data copy is done by a for-loop
15 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
ds: DataSet[Array[Int]] =
Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))
Code generation (the generated code below contains the data conversion and the element-wise data copy)
ArrayData inArray;
while (itr.hasNext()) {
inArray = ((Row)itr.next()).getArray(0);
Object[] tmp = new Object[inArray.numElements()];
for (int i = 0; i < tmp.length; i ++) {
tmp[i] = (inArray.isNullAt(i)) ?
null : inArray.getInt(i);
}
ArrayData array =
new GenericIntArrayData(tmp);
int[] javaArray = array.toIntArray();
int[] mapArray = (int[])map_func.apply(javaArray);
outArray = new GenericArrayData(mapArray);
for (int i = 0; i < outArray.numElements(); i++) {
if (outArray.isNullAt(i)) {
arrayWriter.setNullInt(i);
} else {
arrayWriter.write(i, outArray.getInt(i));
}
}
append(outRow);
}
(In the code above, everything before map_func.apply is the deserialization (De); everything after it is the serialization (Ser).)
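The Ser/De boundary discussed here is also visible in the query plan; a small sketch, assuming a spark-shell session (the node names DeserializeToObject, MapElements, and SerializeFromObject are what Spark 2.x prints, as far as I recall):

import spark.implicits._

val ds = Seq(Array(7, 8)).toDS
// explain(true) prints the analyzed and optimized plans; the map sits between a
// DeserializeToObject node (Tungsten -> Java objects) and a SerializeFromObject
// node (Java objects -> Tungsten), which is where the Ser/De cost comes from
ds.map(a => Array(a(0))).explain(true)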
Too Long Actually-Generated Java Code (Spark 2.0)
▪ Too long to read
16 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
(Highlighted regions in the original slide mark the data conversion and the element-wise data copy.)
final int[] mapelements_value = mapelements_isNull ?
null : (int[]) mapelements_value1.apply(deserializetoobject_value);
mapelements_isNull = mapelements_value == null;
final boolean serializefromobject_isNull = mapelements_isNull;
final ArrayData serializefromobject_value = serializefromobject_isNull ?
null : new GenericArrayData(mapelements_value);
serializefromobject_holder.reset();
serializefromobject_rowWriter.zeroOutNullBytes();
if (serializefromobject_isNull) {
serializefromobject_rowWriter.setNullAt(0);
} else {
final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
if (serializefromobject_value instanceof UnsafeArrayData) {
final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
serializefromobject_holder.grow(serializefromobject_sizeInBytes);
((UnsafeArrayData) serializefromobject_value).writeToMemory(
serializefromobject_holder.buffer, serializefromobject_holder.cursor);
serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
} else {
final int serializefromobject_numElements = serializefromobject_value.numElements();
serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements;
serializefromobject_index++) {
if (serializefromobject_value.isNullAt(serializefromobject_index)) {
serializefromobject_arrayWriter.setNullAt(serializefromobject_index);
} else {
final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index);
serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
}
}
}
serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor,
serializefromobject_holder.cursor - serializefromobject_tmpCursor);
serializefromobject_rowWriter.alignToWords(serializefromobject_holder.cursor - serializefromobject_tmpCursor);
}
serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
append(serializefromobject_result);
if (shouldStop()) return;
}
}
protected void processNext() throws java.io.IOException {
while (inputadapter_input.hasNext()) {
InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
ArrayData inputadapter_value = inputadapter_isNull ?
null : (inputadapter_row.getArray(0));
boolean deserializetoobject_isNull1 = inputadapter_isNull;
ArrayData deserializetoobject_value1 = null;
if (!inputadapter_isNull) {
final int deserializetoobject_n = inputadapter_value.numElements();
final Object[] deserializetoobject_values = new Object[deserializetoobject_n];
for (int deserializetoobject_j = 0;
deserializetoobject_j < deserializetoobject_n; deserializetoobject_j ++) {
if (inputadapter_value.isNullAt(deserializetoobject_j)) {
deserializetoobject_values[deserializetoobject_j] = null;
} else {
boolean deserializetoobject_feNull = false;
int deserializetoobject_fePrim =
inputadapter_value.getInt(deserializetoobject_j);
boolean deserializetoobject_teNull = deserializetoobject_feNull;
int deserializetoobject_tePrim = -1;
if (!deserializetoobject_feNull) {
deserializetoobject_tePrim = deserializetoobject_fePrim;
}
if (deserializetoobject_teNull) {
deserializetoobject_values[deserializetoobject_j] = null;
} else {
deserializetoobject_values[deserializetoobject_j] = deserializetoobject_tePrim;
}
}
}
deserializetoobject_value1 = new GenericArrayData(deserializetoobject_values);
}
boolean deserializetoobject_isNull = deserializetoobject_isNull1;
final int[] deserializetoobject_value = deserializetoobject_isNull ?
null : (int[]) deserializetoobject_value1.toIntArray();
deserializetoobject_isNull = deserializetoobject_value == null;
Object mapelements_obj = ((Expression) references[0]).eval(null);
scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj;
boolean mapelements_isNull = false || deserializetoobject_isNull;
ds.map(a => Array(a(0))).debugCodegen
Simple Generated Code for Array on Spark 2.2
▪ Data conversion and element-wise copy are not used
▪ Bulk copy is faster than element-wise data copy
17
ds: DataSet[Array[Int]] =
Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))
Bulk data copy: copy the whole array using memcpy()
while (itr.hasNext()) {
inArray = ((Row)itr.next()).getArray(0);
int[] mapArray = (int[]) map.apply(javaArray);
append(outRow);
}
[Diagram annotations: the elided steps are bulk data copies, and the user lambda becomes int[] mapArray = new int[] { a[0] };]
SPARK-15985 and SPARK-17490 simplify Ser/De by using bulk data copy
Code
generation
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Simple Generated Java Code for Array
▪ Data conversion and element-wise copy are not used
▪ Bulk copy is faster than element-wise data copy
18
ds: DataSet[Array[Int]] =
Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))
Bulk data copy: copy the whole array using memcpy()
SPARK-15985 and SPARK-17490 simplify Ser/De by using bulk data copy
while (itr.hasNext()) {
inArray =((Row)itr.next()).getArray(0);
int[] javaArray = inArray.toIntArray();
int[] mapArray = (int[])mapFunc.apply(javaArray);
outArray = UnsafeArrayData
.fromPrimitiveArray(mapArray);
outArray.writeToMemory(outRow);
append(outRow);
}
Code
generation
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
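Spelled out as hand-written Scala, one iteration of the generated loop above performs the following round trip; the helper name mapOneRow is mine, and UnsafeArrayData/ArrayData are internal Catalyst APIs whose package names I give from memory:

import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
import org.apache.spark.sql.catalyst.util.ArrayData

// What one iteration of the generated loop does for ds.map(a => Array(a(0)))
def mapOneRow(inArray: ArrayData): UnsafeArrayData = {
  val javaArray = inArray.toIntArray()              // bulk copy: Tungsten -> int[]
  val mapArray  = Array(javaArray(0))               // the user lambda: a => Array(a(0))
  UnsafeArrayData.fromPrimitiveArray(mapArray)      // bulk copy: int[] -> Tungsten
}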
Dataset for Array Is Not Extremely Slow
▪ Good news: 4.5x faster than Spark 2.0
▪ Bad news: still 12x slower than DataFrame
19
[Chart: relative execution time over DataFrame, shorter is better]
– Dataset: ds = Seq(Array(…), Array(…), …).toDS.cache; ds.map(a => Array(a(0)))
4.5x faster than on Spark 2.0, but still 12x slower than the DataFrame version below
– DataFrame: df = Seq(Array(…), Array(…), …).toDF("a").cache; df.selectExpr("Array(a[0])")
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
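The numbers above are the author's measurements; a rough sketch for reproducing the shape of the comparison is below. The data size, the caching strategy, and the use of spark.time are my choices, so absolute numbers will differ:

import spark.implicits._

val n = 100000
val ds = Seq.fill(n)(Array(7, 8)).toDS.cache()
val df = Seq.fill(n)(Array(7, 8)).toDF("a").cache()
ds.count(); df.count()   // materialize both caches before timing

// spark.time prints the wall-clock time of the enclosed action
spark.time { ds.map(a => Array(a(0))).count() }        // Dataset path: goes through Ser/De
spark.time { df.selectExpr("array(a[0])").count() }    // DataFrame path: stays in Tungsten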
Spark 2.4 Supports Array Built-in Functions
▪ These built-in functions operate on array elements without writing a loop
– Single array input: array_min, array_max, array_position, ...
– Two-array input: array_intersect, array_union, array_except, ...
▪ Before Spark 2.4, users had to write such a function using Dataset or a UDF
20
SPARK-23899 is an umbrella entry
ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => a.min)
df: DataFrame = Seq(Array(7, 8)).toDF("a")
df.selectExpr("array_min(a)")
@ueshin co-wrote a blog entry at
https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
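For completeness, a small usage sketch of these Spark 2.4 functions; the column functions are imported from org.apache.spark.sql.functions, and a spark-shell session is assumed:

import spark.implicits._
import org.apache.spark.sql.functions.{array_min, array_max, array_position}

val df = Seq(Array(7, 8)).toDF("a")

// No Dataset map or UDF needed: the built-in functions work on the internal representation
df.select(array_min($"a"), array_max($"a"), array_position($"a", 8)).show()

// The SQL form works as well
df.selectExpr("array_min(a)", "array_max(a)").show()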
Pre-Spark 2.3 Throws a Java Exception with Many Columns
▪ A Java class file has multiple limitations
– The bytecode of a method must be smaller than 64KB
– The constant pool (e.g. for symbol names) can hold at most 64K entries
21
df.groupBy("id").agg(max("c1"), sum("c2"), …, min("c4000"))
01:11:11.123 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to
compile: org.codehaus.janino.JaninoRuntimeException: Code of method
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions
/UnsafeRow;" of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows
beyond 64 KB
...
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Generated a huge method whose bytecode size is
more than 64KB
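The failing query can be reproduced programmatically; a sketch under the assumption of a synthetic DataFrame with columns id and c1..c4000 (my construction, not from the slide). On a pre-2.3 build this kind of query triggers the 64KB error, while 2.3+ splits the generated methods instead:

import org.apache.spark.sql.functions.{col, max}

// A wide DataFrame: an "id" column plus c1..c4000 integer columns
val wide = spark.range(10).toDF("id")
  .select((col("id") +: (1 to 4000).map(i => (col("id") + i).as(s"c$i"))): _*)

// One aggregate per column, like the slide's agg(max("c1"), sum("c2"), …, min("c4000"))
val aggs = (1 to 4000).map(i => max(col(s"c$i")).as(s"m$i"))
wide.groupBy("id").agg(aggs.head, aggs.tail: _*).count()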
Spark 2.3 Fixes Java Exception with Large Columns
▪ Conservatively generate small methods when potentially large Java code
would be generated
– Apply this policy at multiple places in the code generators
22
SPARK-22150 is an umbrella entry that has 25 sub-tasks
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Major Items in Spark 3.0
▪ JDK11 and Scala 2.12 support (available in master branch)
– SPARK-24417
▪ Tungsten intermediate representation (IR)
– Easy to restructure generated code
▪ SPARK-25728 (under proposal)
▪ DataSource V2 API
– SPARK-25528
▪ …
24 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Motivation of Tungsten IR
▪ It is not easy to restructure Java code after generating the code
– Code generation is done by string concatenation
25
int i = ...
func1(i + 1, i * 2);
...
func1(i + 500, i * 2);
func1(i + 501, i * 2);
func1(i + 1000, i * 2);
Hard to split here into two parts
without parsing Java code
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
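To make the problem concrete, here is a deliberately toy code generator in the string-concatenation style; nothing below is Spark's actual generator, it only illustrates why a flat String is hard to split once it has been produced:

// Toy string-based "code generator": once the calls are flattened into one
// String, splitting the body into two 64KB-safe methods requires re-parsing Java
def generateBody(n: Int): String = {
  val sb = new StringBuilder("int i = input;\n")
  for (k <- 1 to n) {
    sb.append(s"func1(i + $k, i * 2);\n")  // appended as opaque text
  }
  sb.toString  // a single flat string with no structure left to cut at statement 500
}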
Structured IR Allows Us to Restructure Generated Code
▪ Ease of code restructuring (shown in blue in the original diagram)
▪ Ease of rebuilding an expression (shown in green in the original diagram)
26
[Diagram: an IR tree in which a Method node holds Invoke nodes whose arguments are
expression trees such as Add(load i, 500) and Mul(load i, 2); a second Invoke holds
Add(load i, 501). Splitting the method into two parts between Invoke nodes is easy,
and an expression is rebuilt by replacing a node such as the constant 500 with 501.]
Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
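A correspondingly toy "structured IR" sketch is below: because the statements stay as a list of tree nodes until the very end, a splitter can cut between nodes and emit two methods without ever parsing Java. This only illustrates the idea behind SPARK-25728, not its actual design or API:

// Minimal illustrative IR: expressions and statements as case classes
sealed trait Expr
case class Load(name: String) extends Expr
case class Const(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr
case class Invoke(method: String, args: Seq[Expr])

def emit(e: Expr): String = e match {
  case Load(n)   => n
  case Const(v)  => v.toString
  case Add(l, r) => s"(${emit(l)} + ${emit(r)})"
  case Mul(l, r) => s"(${emit(l)} * ${emit(r)})"
}

def emitMethod(name: String, body: Seq[Invoke]): String =
  s"private void $name(int i) {\n" +
    body.map(s => s"  ${s.method}(${s.args.map(emit).mkString(", ")});").mkString("\n") +
    "\n}"

// 1000 calls kept as structured nodes, not as one string
val stmts = (1 to 1000).map(k =>
  Invoke("func1", Seq(Add(Load("i"), Const(k)), Mul(Load("i"), Const(2)))))

// Restructuring is now a list operation: split into two methods of 500 calls each
val (first, second) = stmts.splitAt(500)
println(emitMethod("part1", first))
println(emitMethod("part2", second))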
Other Major Items in Spark 2.x
▪ PySpark Performance Improvement
– Using Pandas UDFs with Apache Arrow can drastically improve the performance
of PySpark
▪ https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
▪ Project Hydrogen
– Barrier execution mode for integrating ML/DL frameworks with Spark
▪ https://databricks.com/blog/2018/07/25/bay-area-apache-spark-meetup-summary-databricks-hq.html
27 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
What I am Interested in
▪ Tungsten IR in Spark
– ease of code restructuring
– (In the future) apply multiple optimizations
▪ Improvement of generated code in Spark
– for Parquet reader
– data representation for array table cache
▪ Integration of Spark with DL/ML frameworks (TensorFlow…) and others
28 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
Possible Integration of Spark through Arrow (My View)
▪ Frameworks: DL/ML frameworks (TensorFlow…)
▪ Resource: GPU, …
– RAPIDS (by NVIDIA) may help integrate with GPU
29 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
From rapids.ai: [diagram of in-memory columnar data shared between frameworks]
Why Was I Appointed as a Committer?
▪ Continued to make contributions to a certain component (SQL)
▪ Reviewed many pull requests
▪ Shared knowledge in the community based on my expertise
– Compiler and Java virtual machine
▪ Met committers and contributors in person
– Hadoop Source Code Reading, Hadoop Spark Conference Japan,
Spark Summit, other meetups
30 Looking back at Spark 2.x and forward to 3.0 - Kazuaki Ishizaki
31
Please contribute to open source!
"Let's jump into the Spark community!" (Sparkコミュニティに飛び込もう!)
by Apache Spark committer Sarutak-san (猿田さん)
https://www.slideshare.net/hadoopxnttdata/apache-spark-commnity-nttdata-sarutak