SlideShare a Scribd company logo
Apache Arrow and
Pandas UDF on Apache Spark
Takuya UESHIN
2018-12-08, Apache Arrow Tokyo Meetup 2018
2
About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin
3
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
4
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Physical Operators
• Work In Progress
• Follow-up Events
5
Apache Spark and PySpark
“Apache Spark™ is a unified analytics engine for large-scale data
processing.”
https://spark.apache.org/
• The latest release:
2.4.0 (2018/11/02)
• PySpark is a Python API
• SparkR is an R API
6
PySpark and Pandas
“pandas is an open source, BSD-licensed library providing
high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.”
• https://pandas.pydata.org/
• The latest release: v0.23.4 Final (2018/08/03)
• PySpark supports Pandas >= "0.19.2"
7
PySpark and Pandas
PySpark can convert data between PySpark DataFrame and
Pandas DataFrame.
• pdf = df.toPandas()
• df = spark.createDataFrame(pdf)
We can use Arrow as an intermediate format by setting config:
“spark.sql.execution.arrow.enabled” to “true” (“false” by default).
8
Python UDF and Pandas UDF
• UDF: User Defined Function
• Python UDF
• Serialize/Deserialize data with Pickle
• Fetch data block, but invoke UDF row by row
• Pandas UDF
• Serialize/Deserialize data with Arrow
• Fetch data block, and invoke UDF block by block
• PandasUDFType: SCALAR, GROUPED_MAP, GROUPED_AGG
We don’t need any config, but the declaration is different.
9
Python UDF and Pandas UDF
@udf(’double’)
def plus_one(v):
return v + 1
@pandas_udf(’double’, PandasUDFType.SCALAR)
def pandas_plus_one(v):
return v + 1
10
Python UDF and Pandas UDF
• SCALAR
• A transformation: One or more Pandas Series -> One Pandas Series
• The length of the returned Pandas Series must be of the same as the
input Pandas Series
• GROUPED_MAP
• A transformation: One Pandas DataFrame -> One Pandas DataFrame
• The length of the returned Pandas DataFrame can be arbitrary
• GROUPED_AGG
• A transformation: One or more Pandas Series -> One scalar
• The returned value type should be a primitive data type
11
Performance: Python UDF vs Pandas UDF
From a blog post: Introducing Pandas UDF for PySpark
• Plus One
• Cumulative Probability
• Subtract Mean
“Pandas UDFs perform much
better than Python UDFs,
ranging from 3x to over 100x.”
12
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
13
Apache Arrow
“A cross-language development platform for in-memory data”
https://arrow.apache.org/
• The latest release
- 0.11.0 (2018/10/08)
• Columnar In-Memory
• docs/memory_layout.html
PySpark supports Arrow >= "0.8.0"
• "0.10.0" is recommended
14
Apache Arrow and Pandas UDF
• Use Arrow to Serialize/Deserialize data
• Streaming format for Interprocess messaging / communication (IPC)
• ArrowWriter and ArrowColumnVector
• Communicate JVM and Python worker via Socket
• ArrowPythonRunner
• worker.py
• Physical Operators for each PythonUDFType
• ArrowEvalPythonExec
• FlatMapGroupsInPandasExec
• AggregateInPandasExec
15
Overview of Pandas UDF execution
Invoke UDF
Pandas
Pandas
RecordBatches
RecordBatches
Arrow
ArrowPythonRunner
PhysicalOperator
groups of rows
ColumnarBatches
ArrowColumnVectors
ArrowWriter ArrowStreamPandasSerializer
ArrowStreamPandasSerializer
16
Arrow IPC format and Converters
Invoke UDF
Pandas
Pandas
RecordBatches
RecordBatches
Arrow
ArrowPythonRunner
PhysicalOperator
groups of rows
ColumnarBatches
ArrowColumnVectors
ArrowWriter ArrowStreamPandasSerializer
ArrowStreamPandasSerializer
17
Encapsulated message format
• https://arrow.apache.org/docs/ipc.html
• Messages
• Schema, RecordBatch, DictionaryBatch, Tensor
• Formats
• Streaming format
– Schema + (DictionaryBatch + RecordBatch)+
• File format
– header + (Streaming format) + footer
Pandas UDFs use Streaming format.
18
Arrow Converters in Spark
in Java/Scala
• ArrowWriter [src]
• A wrapper for writing VectorSchemaRoot and ValueVectors
• ArrowColumnVector [src]
• A wrapper for reading ValueVectors, works with ColumnarBatch
in Python
• ArrowStreamPandasSerializer [src]
• A wrapper for RecordBatchReader and RecordBatchWriter
19
Handling Communication
Invoke UDF
Pandas
Pandas
RecordBatches
RecordBatches
Arrow
ArrowPythonRunner
PhysicalOperator
groups of rows
ColumnarBatches
ArrowColumnVectors
ArrowWriter ArrowStreamPandasSerializer
ArrowStreamPandasSerializer
20
Handling Communication
ArrowPythonRunner [src]
• Handle the communication between JVM and the Python
worker
• Create or reuse a Python worker
• Open a Socket to communicate
• Write data to the socket with ArrowWriter in a separate thread
• Read data from the socket
• Return an iterator of ColumnarBatch of ArrowColumnVectors
21
Physical Operators
Invoke UDF
Pandas
Pandas
RecordBatches
RecordBatches
Arrow
ArrowPythonRunner
PhysicalOperator
ArrowColumnVectors
ArrowWriter
groups of rows
ColumnarBatches
ArrowStreamPandasSerializer
ArrowStreamPandasSerializer
22
Physical Operators
Create a RDD to execute the UDF.
• There are several operators for each PythonUDFType
• Group input data and pass to ArrowPythonRunner
• SCALAR: every configured number of rows
– “spark.sql.execution.arrow.maxRecordsPerBatch” (10,000 by default)
• GROUP_XXX: every group
• Read the result iterator of ColumnarBatch
• Return the iterator of rows over ColumnarBatches
23
Python worker
Invoke UDF
Pandas
Pandas
RecordBatches
RecordBatches
Arrow
ArrowPythonRunner
PhysicalOperator
groups of rows
ColumnarBatches
ArrowColumnVectors
ArrowWriter ArrowStreamPandasSerializer
ArrowStreamPandasSerializer
24
Python worker
worker.py [src]
• Open a Socket to communicate
• Set up a UDF execution for each PythonUDFType
• Create a map function
– prepare the arguments
– invoke the UDF
– check and return the result
• Execute the map function over the input iterator of Pandas
DataFrame
• Write back the results
25
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
26
Work In Progress
We can track issues related to Pandas UDF.
• [SPARK-22216] Improving PySpark/Pandas interoperability
• 37 subtasks in total
• 3 subtasks are in progress
• 4 subtasks are open
27
Work In Progress
• Window Pandas UDF
• [SPARK-24561] User-defined window functions with pandas udf
(bounded window)
• Performance Improvement of toPandas -> merged!
• [SPARK-25274] Improve toPandas with Arrow by sending out-of-order
record batches
• SparkR
• [SPARK-25981] Arrow optimization for conversion from R DataFrame
to Spark DataFrame
28
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
29
Follow-up Events
Spark Developers Meetup
• 2018/12/15 (Sat) 10:00-18:00
• @ Yahoo! LODGE
• https://passmarket.yahoo.co.jp/event/show/detail/01a98dzxf
auj.html
30
Follow-up Events
Hadoop/Spark Conference Japan 2019
• 2019/03/14 (Thu)
• @ Oi-machi
• http://hadoop.apache.jp/
31
Follow-up Events
Spark+AI Summit 2019
• 2019/04/23 (Tue) - 04/25 (Thu)
• @ Moscone West Convention Center, San Francisco
• https://databricks.com/sparkaisummit/north-america
Thank you!
33
Appendix
How to contribute?
• See: Contributing to Spark
• Open an issue on JIRA
• Send a pull-request at GitHub
• Communicate with committers and reviewers
• Congratulations!
Thanks for your contributions!
34
Appendix
• PySpark Usage Guide for Pandas with Apache Arrow
• https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.
html
• Vectorized UDF: Scalable Analysis with Python and PySpark
• https://databricks.com/session/vectorized-udf-scalable-analysis-with-
python-and-pyspark
• Demo for Apache Arrow Tokyo Meetup 2018
• https://databricks-prod-cloudfront.cloud.databricks.com/public/4027
ec902e239c93eaaa8714f173bcfc/142158605138935/354623205913920
1/7497868276316206/latest.html

More Related Content

What's hot

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Databricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
Li Jin
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Spark Summit
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar
Timothy Spann
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Databricks
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 

What's hot (20)

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 

Similar to Apache Arrow and Pandas UDF on Apache Spark

Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
Databricks
 
Spark7
Spark7Spark7
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Databricks
 
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li JinVectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Databricks
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
Databricks
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Xiao Li
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
boxu42
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
Databricks
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
Databricks
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
bartzon
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
DataStax Academy
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Juan Pedro Moreno
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
tieleman
 
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductLarge-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Sourabh Bajaj
 
Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...
Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...
Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...
aiuy
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
Amazon Web Services
 

Similar to Apache Arrow and Pandas UDF on Apache Spark (20)

Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
Spark7
Spark7Spark7
Spark7
 
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li JinVectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and DataductLarge-Scale ETL Data Flows With Data Pipeline and Dataduct
Large-Scale ETL Data Flows With Data Pipeline and Dataduct
 
Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...
Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...
Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-a...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 

More from Takuya UESHIN

Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)
Takuya UESHIN
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 
2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning
2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning
2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning
Takuya UESHIN
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
Takuya UESHIN
 
Deep Dive into Spark SQL with Advanced Performance Tuning
Deep Dive into Spark SQL with Advanced Performance TuningDeep Dive into Spark SQL with Advanced Performance Tuning
Deep Dive into Spark SQL with Advanced Performance Tuning
Takuya UESHIN
 
Failing gracefully
Failing gracefullyFailing gracefully
Failing gracefully
Takuya UESHIN
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
Takuya UESHIN
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & Catalyst
Takuya UESHIN
 
20110616 HBase勉強会(第二回)
20110616 HBase勉強会(第二回)20110616 HBase勉強会(第二回)
20110616 HBase勉強会(第二回)
Takuya UESHIN
 
20100724 HBaseプログラミング
20100724 HBaseプログラミング20100724 HBaseプログラミング
20100724 HBaseプログラミング
Takuya UESHIN
 

More from Takuya UESHIN (11)

Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning
2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning
2019.03.19 Deep Dive into Spark SQL with Advanced Performance Tuning
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
 
Deep Dive into Spark SQL with Advanced Performance Tuning
Deep Dive into Spark SQL with Advanced Performance TuningDeep Dive into Spark SQL with Advanced Performance Tuning
Deep Dive into Spark SQL with Advanced Performance Tuning
 
Failing gracefully
Failing gracefullyFailing gracefully
Failing gracefully
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & Catalyst
 
20110616 HBase勉強会(第二回)
20110616 HBase勉強会(第二回)20110616 HBase勉強会(第二回)
20110616 HBase勉強会(第二回)
 
20100724 HBaseプログラミング
20100724 HBaseプログラミング20100724 HBaseプログラミング
20100724 HBaseプログラミング
 

Recently uploaded

Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
Massimo Artizzu
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
gapen1
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
Project Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdfProject Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdf
Karya Keeper
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
kalichargn70th171
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
Peter Muessig
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
GohKiangHock
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
sjcobrien
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 

Recently uploaded (20)

Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
Project Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdfProject Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdf
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 

Apache Arrow and Pandas UDF on Apache Spark

  • 1. Apache Arrow and Pandas UDF on Apache Spark Takuya UESHIN 2018-12-08, Apache Arrow Tokyo Meetup 2018
  • 2. 2 About Me - Software Engineer @databricks - Apache Spark Committer - Twitter: @ueshin - GitHub: github.com/ueshin
  • 3. 3 Agenda • Apache Spark and PySpark • PySpark and Pandas • Python UDF and Pandas UDF • Pandas UDF and Apache Arrow • Arrow IPC format and Converters • Handling Communication • Physical Operators • Python worker • Work In Progress • Follow-up Events
  • 4. 4 Agenda • Apache Spark and PySpark • PySpark and Pandas • Python UDF and Pandas UDF • Pandas UDF and Apache Arrow • Arrow IPC format and Converters • Handling Communication • Physical Operators • Physical Operators • Work In Progress • Follow-up Events
  • 5. 5 Apache Spark and PySpark “Apache Spark™ is a unified analytics engine for large-scale data processing.” https://spark.apache.org/ • The latest release: 2.4.0 (2018/11/02) • PySpark is a Python API • SparkR is an R API
  • 6. 6 PySpark and Pandas “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.” • https://pandas.pydata.org/ • The latest release: v0.23.4 Final (2018/08/03) • PySpark supports Pandas >= "0.19.2"
  • 7. 7 PySpark and Pandas PySpark can convert data between PySpark DataFrame and Pandas DataFrame. • pdf = df.toPandas() • df = spark.createDataFrame(pdf) We can use Arrow as an intermediate format by setting config: “spark.sql.execution.arrow.enabled” to “true” (“false” by default).
  • 8. 8 Python UDF and Pandas UDF • UDF: User Defined Function • Python UDF • Serialize/Deserialize data with Pickle • Fetch data block, but invoke UDF row by row • Pandas UDF • Serialize/Deserialize data with Arrow • Fetch data block, and invoke UDF block by block • PandasUDFType: SCALAR, GROUPED_MAP, GROUPED_AGG We don’t need any config, but the declaration is different.
  • 9. 9 Python UDF and Pandas UDF @udf(’double’) def plus_one(v): return v + 1 @pandas_udf(’double’, PandasUDFType.SCALAR) def pandas_plus_one(v): return v + 1
  • 10. 10 Python UDF and Pandas UDF • SCALAR • A transformation: One or more Pandas Series -> One Pandas Series • The length of the returned Pandas Series must be of the same as the input Pandas Series • GROUPED_MAP • A transformation: One Pandas DataFrame -> One Pandas DataFrame • The length of the returned Pandas DataFrame can be arbitrary • GROUPED_AGG • A transformation: One or more Pandas Series -> One scalar • The returned value type should be a primitive data type
  • 11. 11 Performance: Python UDF vs Pandas UDF From a blog post: Introducing Pandas UDF for PySpark • Plus One • Cumulative Probability • Subtract Mean “Pandas UDFs perform much better than Python UDFs, ranging from 3x to over 100x.”
  • 12. 12 Agenda • Apache Spark and PySpark • PySpark and Pandas • Python UDF and Pandas UDF • Pandas UDF and Apache Arrow • Arrow IPC format and Converters • Handling Communication • Physical Operators • Python worker • Work In Progress • Follow-up Events
  • 13. 13 Apache Arrow “A cross-language development platform for in-memory data” https://arrow.apache.org/ • The latest release - 0.11.0 (2018/10/08) • Columnar In-Memory • docs/memory_layout.html PySpark supports Arrow >= "0.8.0" • "0.10.0" is recommended
  • 14. 14 Apache Arrow and Pandas UDF • Use Arrow to Serialize/Deserialize data • Streaming format for Interprocess messaging / communication (IPC) • ArrowWriter and ArrowColumnVector • Communicate JVM and Python worker via Socket • ArrowPythonRunner • worker.py • Physical Operators for each PythonUDFType • ArrowEvalPythonExec • FlatMapGroupsInPandasExec • AggregateInPandasExec
  • 15. 15 Overview of Pandas UDF execution Invoke UDF Pandas Pandas RecordBatches RecordBatches Arrow ArrowPythonRunner PhysicalOperator groups of rows ColumnarBatches ArrowColumnVectors ArrowWriter ArrowStreamPandasSerializer ArrowStreamPandasSerializer
  • 16. 16 Arrow IPC format and Converters Invoke UDF Pandas Pandas RecordBatches RecordBatches Arrow ArrowPythonRunner PhysicalOperator groups of rows ColumnarBatches ArrowColumnVectors ArrowWriter ArrowStreamPandasSerializer ArrowStreamPandasSerializer
  • 17. 17 Encapsulated message format • https://arrow.apache.org/docs/ipc.html • Messages • Schema, RecordBatch, DictionaryBatch, Tensor • Formats • Streaming format – Schema + (DictionaryBatch + RecordBatch)+ • File format – header + (Streaming format) + footer Pandas UDFs use Streaming format.
  • 18. 18 Arrow Converters in Spark in Java/Scala • ArrowWriter [src] • A wrapper for writing VectorSchemaRoot and ValueVectors • ArrowColumnVector [src] • A wrapper for reading ValueVectors, works with ColumnarBatch in Python • ArrowStreamPandasSerializer [src] • A wrapper for RecordBatchReader and RecordBatchWriter
  • 19. 19 Handling Communication Invoke UDF Pandas Pandas RecordBatches RecordBatches Arrow ArrowPythonRunner PhysicalOperator groups of rows ColumnarBatches ArrowColumnVectors ArrowWriter ArrowStreamPandasSerializer ArrowStreamPandasSerializer
  • 20. 20 Handling Communication ArrowPythonRunner [src] • Handle the communication between JVM and the Python worker • Create or reuse a Python worker • Open a Socket to communicate • Write data to the socket with ArrowWriter in a separate thread • Read data from the socket • Return an iterator of ColumnarBatch of ArrowColumnVectors
  • 22. 22 Physical Operators Create a RDD to execute the UDF. • There are several operators for each PythonUDFType • Group input data and pass to ArrowPythonRunner • SCALAR: every configured number of rows – “spark.sql.execution.arrow.maxRecordsPerBatch” (10,000 by default) • GROUP_XXX: every group • Read the result iterator of ColumnarBatch • Return the iterator of rows over ColumnarBatches
  • 23. 23 Python worker Invoke UDF Pandas Pandas RecordBatches RecordBatches Arrow ArrowPythonRunner PhysicalOperator groups of rows ColumnarBatches ArrowColumnVectors ArrowWriter ArrowStreamPandasSerializer ArrowStreamPandasSerializer
  • 24. 24 Python worker worker.py [src] • Open a Socket to communicate • Set up a UDF execution for each PythonUDFType • Create a map function – prepare the arguments – invoke the UDF – check and return the result • Execute the map function over the input iterator of Pandas DataFrame • Write back the results
  • 25. 25 Agenda • Apache Spark and PySpark • PySpark and Pandas • Python UDF and Pandas UDF • Pandas UDF and Apache Arrow • Arrow IPC format and Converters • Handling Communication • Physical Operators • Python worker • Work In Progress • Follow-up Events
  • 26. 26 Work In Progress We can track issues related to Pandas UDF. • [SPARK-22216] Improving PySpark/Pandas interoperability • 37 subtasks in total • 3 subtasks are in progress • 4 subtasks are open
  • 27. 27 Work In Progress • Window Pandas UDF • [SPARK-24561] User-defined window functions with pandas udf (bounded window) • Performance Improvement of toPandas -> merged! • [SPARK-25274] Improve toPandas with Arrow by sending out-of-order record batches • SparkR • [SPARK-25981] Arrow optimization for conversion from R DataFrame to Spark DataFrame
  • 28. 28 Agenda • Apache Spark and PySpark • PySpark and Pandas • Python UDF and Pandas UDF • Pandas UDF and Apache Arrow • Arrow IPC format and Converters • Handling Communication • Physical Operators • Python worker • Work In Progress • Follow-up Events
  • 29. 29 Follow-up Events Spark Developers Meetup • 2018/12/15 (Sat) 10:00-18:00 • @ Yahoo! LODGE • https://passmarket.yahoo.co.jp/event/show/detail/01a98dzxf auj.html
  • 30. 30 Follow-up Events Hadoop/Spark Conference Japan 2019 • 2019/03/14 (Thu) • @ Oi-machi • http://hadoop.apache.jp/
  • 31. 31 Follow-up Events Spark+AI Summit 2019 • 2019/04/23 (Tue) - 04/25 (Thu) • @ Moscone West Convention Center, San Francisco • https://databricks.com/sparkaisummit/north-america
  • 33. 33 Appendix How to contribute? • See: Contributing to Spark • Open an issue on JIRA • Send a pull-request at GitHub • Communicate with committers and reviewers • Congratulations! Thanks for your contributions!
  • 34. 34 Appendix • PySpark Usage Guide for Pandas with Apache Arrow • https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow. html • Vectorized UDF: Scalable Analysis with Python and PySpark • https://databricks.com/session/vectorized-udf-scalable-analysis-with- python-and-pyspark • Demo for Apache Arrow Tokyo Meetup 2018 • https://databricks-prod-cloudfront.cloud.databricks.com/public/4027 ec902e239c93eaaa8714f173bcfc/142158605138935/354623205913920 1/7497868276316206/latest.html