Kazuaki Ishizaki
IBM Research – Tokyo
@kiszk
Demystifying DataFrame and Dataset
#SFdev20
About Me – Kazuaki Ishizaki
• Researcher at IBM Research, working on compiler optimizations
• Working on the IBM Java virtual machine for over 20 years
– In particular, the just-in-time compiler
• Contributor to the SQL package in Apache Spark
• Committer of GPUEnabler
– An Apache Spark plug-in to execute GPU code on Spark
• https://github.com/IBMSparkGPU/GPUEnabler
• Homepage: http://ibm.biz/ishizaki
• GitHub: https://github.com/kiszk, Twitter: @kiszk
Spark 2.2 Makes Dataset Faster

Compared to Spark 2.0 (also 2.1):
• 1.6x faster for map() with a scalar variable
• 4.5x faster for map() with a primitive array

ds = (1 to ...).toDS.cache
ds.map(value => value + 1)

ds = Seq(Array(…), Array(…), …).toDS.cache
ds.map(array => Array(array(0)))

These speedups come from enhancements in the Catalyst optimizer and code generation.
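The numbers above can be reproduced with a small driver program. A minimal sketch, not from the deck: the dataset sizes, the timing harness, and the object name are my assumptions.

// DatasetBench.scala: rough timing of the two map() cases above
import org.apache.spark.sql.SparkSession

object DatasetBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ds-bench").master("local[1]").getOrCreate()
    import spark.implicits._

    def time[T](label: String)(f: => T): T = {
      val t0 = System.nanoTime; val r = f
      println(s"$label: ${(System.nanoTime - t0) / 1e6} ms"); r
    }

    val ds = (1 to 10000000).toDS.cache
    ds.count() // materialize the cache before timing
    time("map with scalar int") { ds.map(value => value + 1).count() }

    val dsArr = Seq.fill(100000)(Array.fill(16)(1)).toDS.cache
    dsArr.count()
    time("map with primitive array") { dsArr.map(a => Array(a(0))).count() }

    spark.stop()
  }
}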
Why Does It Matter?

• Application programmers
– ML pipelines will become faster
• Library developers
– You will want to use Dataset instead of RDD
• SQL package developers
– You will learn the details of how Dataset works
DataFrame, Dataset, and RDD

DataFrame (DF) – based on SQL operations:
df = (1 to 100).toDF.cache
df.selectExpr("value + 1")

Dataset (DS) – based on lambda expressions:
ds = (1 to 100).toDS.cache
ds.map(value => value + 1)

RDD – based on lambda expressions:
rdd = sc.parallelize(
  (1 to 100)).cache
rdd.map(value => value + 1)
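These snippets assume a SparkSession whose implicits are in scope; spark-shell provides spark and sc automatically. A minimal setup, for reference:

// In spark-shell; in an application, create the SparkSession first
import spark.implicits._   // enables toDF / toDS

val df  = (1 to 100).toDF.cache   // single column named "value"
val ds  = (1 to 100).toDS.cache
val rdd = sc.parallelize(1 to 100).cache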
How DF, DS, and RDD Work

• We expect the same performance on DF and DS
– Both DF and DS queries go through Catalyst, which generates the same style of Java code
(Figure from "Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming" by Michael Armbrust)

// generated Java code
// for DF and DS
while (itr.hasNext()) {
  Row r = (Row)itr.next();
  int v = r.getInt(0);
  …
}
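You can inspect this generated code yourself. A hedged tip, assuming Spark 2.x: the debug helpers live in an internal package and may change between versions.

import org.apache.spark.sql.execution.debug._   // developer API

val df = (1 to 100).toDF.cache
df.selectExpr("value + 1").debugCodegen()       // prints the generated Java source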
Dataset for Int Value is Slow on Spark 2.0

• 1.7x slower for addition to a scalar int value

df.selectExpr("value + 1")
ds.map(value => value + 1)

[Chart: relative execution time over DataFrame (shorter is better); Dataset is about 1.7x DataFrame]

All experiments are done using a 16-core Intel E2667-v3 3.2GHz machine with 384GB memory, Ubuntu 16.04, OpenJDK 1.8.0u212-b13 with a 256GB heap, Spark master=local[1], Spark branch-2.0 and branch-2.2; ds/df/rdd are cached.
Dataset for Int Array is Extremely Slow on Spark 2.0

• 54x slower for creating a new array from a primitive int array

df = Seq(Array(…), Array(…), …)
  .toDF("a").cache
df.selectExpr("Array(a[0])")

ds = Seq(Array(…), Array(…), …)
  .toDS.cache
ds.map(a => Array(a(0)))

[Chart: relative execution time over DataFrame (shorter is better); Dataset is about 54x DataFrame]

SPARK-16070 pointed out this performance problem.
Outline

• How are DataFrame and Dataset executed?
• Deep dive into two cases
– Why Dataset was slow
– How Spark 2.2 improves performance
• Deep dive into another case
– Why Dataset is still slow
– How future Spark will improve performance
• How is Dataset used in ML pipelines?
• Next Steps
Simple Generated Code for int with DF

• Generated code performs `+` on an int
– Uses a strongly typed int variable

// df: DataFrame for int …
df.selectExpr("value + 1")

// Catalyst generates this Java code
while (itr.hasNext()) {        // execute a row
  // get a value from a row in DF
  int value = ((Row)itr.next()).getInt(0);
  // compute a new value
  int mapValue = value + 1;
  // store the new value to a row in DF
  outRow.write(0, mapValue);
  append(outRow);
}
Weird Generated Code with DS

• Generated code performs four data conversions per row
– Uses the standard API with the generic method apply(Object)

// ds: Dataset[Int] …
ds.map(value => value + 1)

// Catalyst generates this Java code
while (itr.hasNext()) {
  int value = ((Row)itr.next()).getInt(0);
  Object objValue = new Integer(value);    // box
  int mapValue = ((Integer)
    mapFunc.apply(objValue)).intValue();   // generic call, then unbox
  outRow.write(0, mapValue);
  append(outRow);
}

// scalac generates these functions from the lambda expression
Object apply(Object obj) {
  int value = ((Integer)obj).intValue();   // unbox
  return new Integer(apply$II(value));     // box
}
int apply$II(int value) { return value + 1; }
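A side note on where apply$II comes from (a detail not spelled out on the slide; in real scalac output the specialized method is named apply$mcII$sp, which the slides abbreviate): Scala's Function1 is @specialized on Int, so a lambda of type Int => Int compiles to a class with both a generic bridge method and a primitive one.

// What scalac emits for this lambda, roughly (names abbreviated as on the slides)
val f: Int => Int = value => value + 1
// class ...$anonfun$1 extends scala.runtime.AbstractFunction1$mcII$sp {
//   public int apply$II(int value) { return value + 1; }   // specialized
//   public Object apply(Object obj) {                      // generic bridge
//     return Integer.valueOf(apply$II(((Integer)obj).intValue()));
//   }
// }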
Simple Generated Code with DS on Spark 2.2

• SPARK-19008 enables the generated code to use the int value directly
– Uses a non-standard API with the type-specialized method apply$II(int)

// ds: Dataset[Int] …
ds.map(value => value + 1)

// Catalyst generates this Java code
while (itr.hasNext()) {
  int value = ((Row)itr.next()).getInt(0);
  int mapValue = mapFunc.apply$II(value);
  outRow.write(0, mapValue);
  append(outRow);
}

// scalac generates this function from the lambda expression
int apply$II(int value) { return value + 1; }
Dataset is Not Slow on Spark 2.2

• Performance improves by 1.6x compared to Spark 2.0
– DataFrame is only 9% faster than Dataset

df.selectExpr("value + 1")
ds.map(value => value + 1)

[Chart: relative execution time over DataFrame (shorter is better); Spark 2.2 Dataset is 1.6x faster than on Spark 2.0]
RDD/DF/DS Achieve Close Performance on Spark 2.2

• RDD is 5% faster than DataFrame; DataFrame is 9% faster than Dataset

rdd.map(value => value + 1)
df.selectExpr("value + 1")
ds.map(value => value + 1)

[Chart: relative execution time over DataFrame (shorter is better)]
Simple Generated Code for Array with DF

• All operations access data in ArrayData, the internal data format for arrays (devised in Project Tungsten)

// df: DataFrame for Array(Int) …
df.selectExpr("Array(a[0])")

// Catalyst generates this Java code; inArray.getInt(0) and
// outArray.setInt(0, …) work directly on the internal ArrayData
ArrayData inArray, outArray;
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  int value = inArray.getInt(0);      // a[0]
  outArray = new UnsafeArrayData(); ...
  outArray.setInt(0, value);          // Array(a[0])
  outArray.writeToMemory(outRow);
  append(outRow);
}
Lambda Expression with DS Requires a Java Object

• Needs serialization/deserialization (Ser/De) between the internal data format and a Java object for the lambda expression

// ds: Dataset[Array(Int)] …
ds.map(a => Array(a(0)))

// Catalyst generates this Java code: deserialize the internal ArrayData
// to a Java object, run the lambda (Java bytecode), then serialize back
ArrayData inArray, outArray;
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  // deserialization: internal data format -> Java object
  int[] mapArray = Array(a(0));   // Java bytecode for the lambda expression
  // serialization: Java object -> internal data format
  append(outRow);
}
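This Ser/De logic is described by the Dataset's Encoder. A hedged way to look at the conversion expressions, assuming Spark 2.x (ExpressionEncoder is an internal API and may differ by version):

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val enc = ExpressionEncoder[Array[Int]]()
println(enc.serializer)    // expressions that build the internal ArrayData
println(enc.deserializer)  // expressions that rebuild the Java int[]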
Extremely Weird Code for Array with DS on Spark 2.0

• Data conversion is too slow (it copies with Java object creation)
• Element-wise copy is slow (it copies each element with a null check)

// ds: Dataset[Array(Int)] …
ds.map(a => Array(a(0)))

// Catalyst generates this Java code; the lambda itself is only the
// map_func.apply() call, everything else is Ser/De
ArrayData inArray, outArray;
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  // deserialization: data conversion + element-wise data copy
  Object[] tmp = new Object[inArray.numElements()];
  for (int i = 0; i < tmp.length; i++) {
    tmp[i] = (inArray.isNullAt(i)) ? null : inArray.getInt(i);
  }
  ArrayData array = new GenericIntArrayData(tmp);
  int[] javaArray = array.toIntArray();
  // the lambda expression: int[] mapArray = Array(a(0));
  int[] mapArray = (int[])map_func.apply(javaArray);
  // serialization: data conversion + element-wise data copy
  outArray = new GenericArrayData(mapArray);
  for (int i = 0; i < outArray.numElements(); i++) {
    if (outArray.isNullAt(i)) {
      arrayWriter.setNullInt(i);
    } else {
      arrayWriter.write(i, outArray.getInt(i));
    }
  }
  append(outRow);
}
Simple Generated Code for Array on Spark 2.2

• SPARK-15985, SPARK-17490, and SPARK-15962 simplify Ser/De by using bulk data copy
– Data conversion and element-wise copy are no longer used
– Bulk copy (copying the whole array, e.g. with memcpy()) is faster

// ds: Dataset[Array(Int)] …
ds.map(a => Array(a(0)))

// Catalyst generates this Java code; toIntArray() and
// fromPrimitiveArray() are bulk data copies
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  int[] javaArray = inArray.toIntArray();            // bulk copy
  int[] mapArray = (int[])mapFunc.apply(javaArray);  // the lambda: Array(a(0))
  outArray = UnsafeArrayData
    .fromPrimitiveArray(mapArray);                   // bulk copy
  outArray.writeToMemory(outRow);
  append(outRow);
}
Dataset for Array is No Longer Extremely Slow over DataFrame

• Performance improves by 4.5x compared to Spark 2.0
– DataFrame is still 12x faster than Dataset

df = Seq(Array(…), Array(…), …)
  .toDF("a").cache
df.selectExpr("Array(a[0])")

ds = Seq(Array(…), Array(…), …)
  .toDS.cache
ds.map(a => Array(a(0)))

[Chart: relative execution time over DataFrame (shorter is better); Spark 2.2 Dataset is 4.5x faster than on Spark 2.0 but still about 12x DataFrame]

Would this overhead be negligible if map() were much more computation-intensive?
RDD and DataFrame for Array Achieve Close Performance

• DataFrame is 1.2x faster than RDD, and 12x faster than Dataset

rdd = sc.parallelize(Seq(
  Array(…), Array(…), …)).cache
rdd.map(a => Array(a(0)))

df = Seq(Array(…), Array(…), …)
  .toDF("a").cache
df.selectExpr("Array(a[0])")

ds = Seq(Array(…), Array(…), …)
  .toDS.cache
ds.map(a => Array(a(0)))

[Chart: relative execution time over DataFrame (shorter is better); RDD ≈ 1.2, Dataset ≈ 12]
Enable Lambda Expression to Use Internal Data Format

• Our prototype modifies the Java bytecode of the lambda expression so that it accesses the internal data format directly
– This improves performance by avoiding Ser/De: the generic apply(Object) is replaced by applyTungsten(ArrayData)

// ds: Dataset[Array(Int)] …
ds.map(a => Array(a(0)))

// Catalyst plus our prototype generate this Java code;
// inArray stays in the internal data format end to end
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  outArray = mapFunc.applyTungsten(inArray);
  outArray.writeToMemory(outRow);
  append(outRow);
}

"Accelerating Spark Datasets by inlining deserialization", our academic paper at IPDPS2017
Outline

• How are DataFrame and Dataset executed?
• Deep dive into two cases
– Why Dataset was slow
– How Spark 2.2 improves performance
• Deep dive into another case
– Why Dataset is still slow
• How is Dataset used in ML pipelines?
• Next Steps
Dataset for Two Scalar Adds is Slow on Spark 2.2

• DataFrame and Dataset are faster than RDD ☺
• Dataset is still 1.1x slower than DataFrame ☹

rdd.map(value => value + 1)
   .map(value => value + 2)

df.selectExpr("value + 1")
  .selectExpr("value + 2")

ds.map(value => value + 1)
  .map(value => value + 2)

[Chart: relative execution time over DataFrame (shorter is better); Dataset ≈ 1.1, RDD ≈ 1.4]
Adds are Combined with DF

• Spark 2.2 understands the operations in selectExpr() and combines the two adds into one add, 'value + 3'

// df: DataFrame for Int …
df.selectExpr("value + 1")
  .selectExpr("value + 2")

// Catalyst generates this Java code
while (itr.hasNext()) {
  Row row = (Row)itr.next();
  int value = row.getInt(0);
  int mapValue = value + 3;
  projectRow.write(0, mapValue);
  append(projectRow);
}
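You can confirm the fusion in the optimized plan with explain(). A small sketch, with spark.implicits._ in scope; note that selectExpr("value + 1") renames the output column, so the chain needs an alias to actually run (the alias is my addition, not the slide's):

val df = (1 to 100).toDF   // column "value"
df.selectExpr("value + 1 as value")
  .selectExpr("value + 2")
  .explain(true)           // the optimized plan shows the combined add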
Two Method Calls for Adds Remain with DS

• Spark 2.2 cannot understand lambda expressions, which are stored as Java bytecode

// ds: Dataset[Int] …
ds.map(value => value + 1)
  .map(value => value + 2)

// Catalyst generates this Java code: two calls remain
while (itr.hasNext()) {
  int value = ((Row)itr.next()).getInt(0);
  int mapValue = mapFunc1.apply$II(value);
  mapValue = mapFunc2.apply$II(mapValue);
  outRow.write(0, mapValue);
  append(outRow);
}

// scalac generates these functions from the lambda expressions
class MapFunc1 {
  int apply$II(int value) { return value + 1; }}
class MapFunc2 {
  int apply$II(int value) { return value + 2; }}
Adds Will Be Combined in a Future Spark 2.x

• SPARK-14083 will allow future Spark to understand the Java bytecode of lambda expressions and to combine them

// ds: Dataset[Int] …
ds.map(value => value + 1)
  .map(value => value + 2)

// Catalyst would generate this Java code
while (itr.hasNext()) {
  int value = ((Row)itr.next()).getInt(0);
  int mapValue = value + 3;
  outRow.write(0, mapValue);
  append(outRow);
}

Dataset will become faster.
Outline

• How are DataFrame and Dataset executed?
• Deep dive into two cases
– Why Dataset was slow
– How Spark 2.2 improves performance
• Deep dive into another case
– Why Dataset is still slow
• How is Dataset used in ML pipelines?
• Next Steps
How Dataset is Used in an ML Pipeline

• The ML pipeline is constructed by using DataFrame/Dataset
• The algorithm still operates on data in an RDD
– Conversion overhead between Dataset and RDD

// ML pipeline
val trainingDS: Dataset = ...
val tokenizer = new Tokenizer()
val hashingTF = new HashingTF()
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(
  Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDS)

// Machine learning algorithm (LogisticRegression)
def fit(ds: Dataset[_]) = {
  ...
  train(ds)
}
def train(ds: Dataset[_]): LogisticRegressionModel = {
  val instances: RDD[Instance] =
    ds.select(...).rdd.map { // convert DS to RDD
      case Row(...) => ...
    }
  instances.persist(...)
  instances.treeAggregate(...)
}

Derived from https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
If We Use Dataset for the Machine Learning Algorithm

• The ML pipeline is constructed by using DataFrame/Dataset
• The algorithm also operates on data in DataFrame/Dataset
– No conversion between Dataset and RDD
– Optimizations can be applied by Catalyst
– Control flows are easy to write

// ML pipeline
val trainingDataset: Dataset = ...
val tokenizer = new Tokenizer()
val hashingTF = new HashingTF()
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(
  Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDataset)

// Machine learning algorithm (LogisticRegression)
def train(ds: Dataset[_]): LogisticRegressionModel = {
  val instances: Dataset[Instance] =
    ds.select(...).map { // stay in Dataset; no .rdd conversion
      case Row(...) => ...
    }
  instances.persist(...)
  instances.treeAggregate(...)
}
Next Steps

• Achieve further performance improvements for Dataset
– Implement Java bytecode analysis and modification
• Exploit optimizations in Catalyst (SPARK-14083)
• Reduce the overhead of data conversion (our paper)
• Exploit SIMD/GPU in Spark by using the simple generated code
– http://www.spark.tc/simd-and-gpu/
Takeaway

• Dataset on Spark 2.2 is much faster than on Spark 2.0/2.1
• We saw how Dataset is executed and how it was improved:
– Data conversion
– Serialization/deserialization
– Java bytecode of the lambda expression
• We continue to improve the performance of Dataset
• Let's start using Dataset today!
Thank you
@kiszk on twitter
@kiszk on github
http://ibm.biz/ishizaki
• @sameeragarwal, @hvanhovell, @gatorsmile,
@cloud-fan, @ueshin, @davies, @rxin,
@joshrosen, @maropu, @viirya
• Spark Technology Center