Apache Arrow and Pandas UDF on Apache Spark
Takuya UESHIN
2018-12-08, Apache Arrow Tokyo Meetup 2018
2
About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin
3
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
5
Apache Spark and PySpark
“Apache Spark™ is a unified analytics engine for large-scale data
processing.”
https://spark.apache.org/
• The latest release:
2.4.0 (2018/11/02)
• PySpark is a Python API
• SparkR is an R API
6
PySpark and Pandas
“pandas is an open source, BSD-licensed library providing
high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.”
• https://pandas.pydata.org/
• The latest release: v0.23.4 Final (2018/08/03)
• PySpark supports Pandas >= "0.19.2"
7
PySpark and Pandas
PySpark can convert data between PySpark DataFrame and
Pandas DataFrame.
• pdf = df.toPandas()
• df = spark.createDataFrame(pdf)
We can use Arrow as an intermediate format by setting config:
“spark.sql.execution.arrow.enabled” to “true” (“false” by default).
8
Python UDF and Pandas UDF
• UDF: User Defined Function
• Python UDF
• Serialize/Deserialize data with Pickle
• Fetch data block, but invoke UDF row by row
• Pandas UDF
• Serialize/Deserialize data with Arrow
• Fetch data block, and invoke UDF block by block
• PandasUDFType: SCALAR, GROUPED_MAP, GROUPED_AGG
We don’t need any config, but the declaration is different.
9
Python UDF and Pandas UDF
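from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

# Python UDF: invoked once per row with a single value
@udf('double')
def plus_one(v):
    return v + 1

# Pandas UDF: invoked once per block with a pandas.Series
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1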
10
Python UDF and Pandas UDF
• SCALAR
• A transformation: One or more Pandas Series -> One Pandas Series
• The length of the returned Pandas Series must be the same as that of
the input Pandas Series
• GROUPED_MAP
• A transformation: One Pandas DataFrame -> One Pandas DataFrame
• The length of the returned Pandas DataFrame can be arbitrary
• GROUPED_AGG
• A transformation: One or more Pandas Series -> One scalar
• The returned value type should be a primitive data type
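The three input/output shapes above can be sketched with plain pandas functions, without a Spark session. This is only an illustration of the contracts: the function names and the sample data are hypothetical, and in PySpark each function would additionally be declared with pandas_udf and the matching PandasUDFType.

```python
import pandas as pd

# SCALAR: one or more Series in, one Series of the same length out.
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

# GROUPED_MAP: one DataFrame (a single group) in, one DataFrame out.
# The output may have any number of rows.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# GROUPED_AGG: one or more Series in, one scalar out.
def mean_of(v: pd.Series) -> float:
    return float(v.mean())

# Hypothetical sample data standing in for one group of a Spark DataFrame.
group = pd.DataFrame({"id": [1, 1, 1], "v": [1.0, 2.0, 6.0]})

scalar_out = plus_one(group["v"])
grouped_map_out = subtract_mean(group)
grouped_agg_out = mean_of(group["v"])
```

Only the SCALAR function has to preserve the input length; the other two are free to change the number of rows or collapse them to a single value.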
11
Performance: Python UDF vs Pandas UDF
From a blog post: Introducing Pandas UDF for PySpark
• Plus One
• Cumulative Probability
• Subtract Mean
“Pandas UDFs perform much
better than Python UDFs,
ranging from 3x to over 100x.”
13
Apache Arrow
“A cross-language development platform for in-memory data”
https://arrow.apache.org/
• The latest release
- 0.11.0 (2018/10/08)
• Columnar In-Memory
• docs/memory_layout.html
PySpark supports Arrow >= "0.8.0"
• "0.10.0" is recommended
14
Apache Arrow and Pandas UDF
• Use Arrow to Serialize/Deserialize data
• Streaming format for Interprocess messaging / communication (IPC)
• ArrowWriter and ArrowColumnVector
• Communicate between the JVM and the Python worker via a socket
• ArrowPythonRunner
• worker.py
• Physical Operators for each PandasUDFType
• ArrowEvalPythonExec
• FlatMapGroupsInPandasExec
• AggregateInPandasExec
15
Overview of Pandas UDF execution
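[Diagram: the PhysicalOperator passes groups of rows to ArrowPythonRunner, which writes them through ArrowWriter as Arrow RecordBatches. In the Python worker, ArrowStreamPandasSerializer converts the RecordBatches to Pandas, the UDF is invoked, and ArrowStreamPandasSerializer converts the Pandas results back to RecordBatches. The JVM side reads them as ColumnarBatches of ArrowColumnVectors.]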
16
Arrow IPC format and Converters
17
Encapsulated message format
• https://arrow.apache.org/docs/ipc.html
• Messages
• Schema, RecordBatch, DictionaryBatch, Tensor
• Formats
• Streaming format
– Schema + (DictionaryBatch + RecordBatch)+
• File format
– header + (Streaming format) + footer
Pandas UDFs use Streaming format.
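The streaming layout above can be read as a small grammar over message types. The checker below is a minimal sketch of that grammar only: it does not parse Arrow bytes, the function name is hypothetical, and it assumes that each RecordBatch may be preceded by zero or more DictionaryBatch messages.

```python
import re

# One-letter codes for the message types that can appear in a stream.
_CODES = {"Schema": "S", "DictionaryBatch": "D", "RecordBatch": "R"}

# Schema first, then one or more RecordBatches, each optionally
# preceded by DictionaryBatches (an assumption of this sketch).
_STREAM = re.compile(r"S(D*R)+")

def is_valid_stream(messages):
    """Return True if the sequence of message type names fits the layout."""
    try:
        encoded = "".join(_CODES[m] for m in messages)
    except KeyError:
        return False  # a message type that does not belong in a stream
    return _STREAM.fullmatch(encoded) is not None
```

For example, is_valid_stream(["Schema", "RecordBatch", "RecordBatch"]) is accepted, while a sequence that does not start with Schema is rejected.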
18
Arrow Converters in Spark
in Java/Scala
• ArrowWriter [src]
• A wrapper for writing VectorSchemaRoot and ValueVectors
• ArrowColumnVector [src]
• A wrapper for reading ValueVectors, works with ColumnarBatch
in Python
• ArrowStreamPandasSerializer [src]
• A wrapper for RecordBatchReader and RecordBatchWriter
20
Handling Communication
ArrowPythonRunner [src]
• Handle the communication between JVM and the Python
worker
• Create or reuse a Python worker
• Open a Socket to communicate
• Write data to the socket with ArrowWriter in a separate thread
• Read data from the socket
• Return an iterator of ColumnarBatch of ArrowColumnVectors
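The write-in-a-separate-thread pattern above can be sketched with the Python standard library alone. This is not Spark's implementation: JSON lines stand in for Arrow RecordBatches, a thread stands in for the Python worker process, and the names worker and run are hypothetical.

```python
import json
import socket
import threading

def worker(sock, func):
    """Stand-in for the Python worker: read batches, apply func, write back."""
    with sock, sock.makefile("r") as reader, sock.makefile("w") as writer:
        for line in reader:
            batch = json.loads(line)
            writer.write(json.dumps([func(x) for x in batch]) + "\n")
            writer.flush()

def run(batches, func):
    """Stand-in for the runner: write in one thread, read results lazily."""
    jvm_side, worker_side = socket.socketpair()
    threading.Thread(
        target=worker, args=(worker_side, func), daemon=True
    ).start()

    def write_all():
        with jvm_side.makefile("w") as writer:
            for batch in batches:
                writer.write(json.dumps(batch) + "\n")
        # Signal end of input while keeping the read direction open.
        jvm_side.shutdown(socket.SHUT_WR)

    # Writing in a separate thread lets the caller read results while
    # input is still being sent, so neither side blocks on a full buffer.
    threading.Thread(target=write_all, daemon=True).start()

    with jvm_side, jvm_side.makefile("r") as reader:
        for line in reader:
            yield json.loads(line)
```

The generator returned by run plays the role of the result iterator: each item is one processed batch, in the order the batches were written.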
22
Physical Operators
Create an RDD to execute the UDF.
• There is a dedicated operator for each PandasUDFType
• Group input data and pass to ArrowPythonRunner
• SCALAR: every configured number of rows
– “spark.sql.execution.arrow.maxRecordsPerBatch” (10,000 by default)
• GROUPED_XXX: every group
• Read the result iterator of ColumnarBatch
• Return the iterator of rows over ColumnarBatches
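The two grouping strategies above can be sketched in plain Python. This is an illustration rather than the operators' actual code: the function names are hypothetical, the default batch size mirrors the 10,000 default quoted above, and group_rows assumes the rows already arrive clustered by key.

```python
from itertools import groupby, islice

def batch_rows(rows, max_records_per_batch=10_000):
    """SCALAR style: yield lists of at most max_records_per_batch rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, max_records_per_batch))
        if not batch:
            return
        yield batch

def group_rows(rows, key):
    """GROUPED style: yield one list per run of rows sharing the same key."""
    for _, group in groupby(rows, key=key):
        yield list(group)
```

Both functions are lazy, so a downstream consumer can start sending the first batch or group before the whole input has been read.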
24
Python worker
worker.py [src]
• Open a Socket to communicate
• Set up a UDF execution for each PandasUDFType
• Create a map function
– prepare the arguments
– invoke the UDF
– check and return the result
• Execute the map function over the input iterator of Pandas
DataFrames
• Write back the results
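The map-function steps above can be sketched as follows for a scalar-style UDF. This is a simplified stand-in, not the code in worker.py: plain lists take the place of Pandas Series, and make_scalar_mapper, run_worker, and arg_offsets are hypothetical names.

```python
def make_scalar_mapper(udf, arg_offsets):
    """Build a function that applies udf to selected columns of a batch."""
    def mapper(columns):
        # Prepare the arguments: pick the input columns by position.
        args = [columns[i] for i in arg_offsets]
        # Invoke the UDF once for the whole batch.
        result = udf(*args)
        # Check the result: a scalar-style UDF must preserve the length.
        if len(result) != len(args[0]):
            raise ValueError(
                "result length %d does not match input length %d"
                % (len(result), len(args[0]))
            )
        return result
    return mapper

def run_worker(batches, mapper):
    """Execute the map function over the input iterator of batches."""
    return map(mapper, batches)
```

Checking the length inside the mapper means a misbehaving UDF fails on the batch that triggered the problem, before any result is written back.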
26
Work In Progress
We can track issues related to Pandas UDF.
• [SPARK-22216] Improving PySpark/Pandas interoperability
• 37 subtasks in total
• 3 subtasks are in progress
• 4 subtasks are open
27
Work In Progress
• Window Pandas UDF
• [SPARK-24561] User-defined window functions with pandas udf
(bounded window)
• Performance Improvement of toPandas -> merged!
• [SPARK-25274] Improve toPandas with Arrow by sending out-of-order
record batches
• SparkR
• [SPARK-25981] Arrow optimization for conversion from R DataFrame
to Spark DataFrame
29
Follow-up Events
Spark Developers Meetup
• 2018/12/15 (Sat) 10:00-18:00
• @ Yahoo! LODGE
• https://passmarket.yahoo.co.jp/event/show/detail/01a98dzxfauj.html
30
Follow-up Events
Hadoop/Spark Conference Japan 2019
• 2019/03/14 (Thu)
• @ Oi-machi
• http://hadoop.apache.jp/
31
Follow-up Events
Spark+AI Summit 2019
• 2019/04/23 (Tue) - 04/25 (Thu)
• @ Moscone West Convention Center, San Francisco
• https://databricks.com/sparkaisummit/north-america
Thank you!
33
Appendix
How to contribute?
• See: Contributing to Spark
• Open an issue on JIRA
• Send a pull-request at GitHub
• Communicate with committers and reviewers
• Congratulations!
Thanks for your contributions!
34
Appendix
• PySpark Usage Guide for Pandas with Apache Arrow
• https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html
• Vectorized UDF: Scalable Analysis with Python and PySpark
• https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark
• Demo for Apache Arrow Tokyo Meetup 2018
• https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/142158605138935/3546232059139201/7497868276316206/latest.html
