Improving Pandas and PySpark interoperability with Apache Arrow
Li Jin
PyData NYC
November 2017
IMPORTANT LEGAL INFORMATION
• The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved.
About Me
• Li Jin (@icexelloss)
• Software Engineer @ Two Sigma Investments
• Apache Arrow Committer
• Analytics Tools Smith
• Other Open Source Projects:
• Flint: A Time Series Library on Spark
• Cook: A Fair Scheduler on Mesos
This Talk
• PySpark Overview
• PySpark UDF: current state and limitations
• Apache Arrow Overview
• Improvements to PySpark UDF with Apache Arrow
• Future Roadmap
PySpark Overview
Apache Spark
• A tool for distributed data analysis
• Apache project
• JVM-based with a Python interface (PySpark)
• Functionality:
• Relational: join, group, aggregate …
• Stats and ML: Spark MLlib
• Streaming
• …
Why Spark
• Bigger data:
• Pandas: ~10 GB
• Spark: ~1000 GB
• Better parallelism:
• Pandas: single core
• Spark: hundreds of cores
PySpark Overview
• Python interface for Spark
• API front end for built-in Spark functions
• df.withColumn('v2', df.v1 + 1)
• Translated to Java code, running in the JVM
• Interface for native Python code (user-defined functions)
• df.withColumn('v2', udf(lambda x: x + 1, 'double')(df.v1))
• Runs in the Python runtime
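A minimal sketch contrasting the two paths above. The SparkSession setup and column values are illustrative additions, not from the slides; the built-in expression stays entirely in the JVM, while the udf version ships each value to a Python worker.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()

    # Hypothetical toy DataFrame with a single double column "v1"
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v1"])

    # Built-in Spark expression: compiled to JVM code, no Python involved per row
    df_builtin = df.withColumn("v2", df.v1 + 1)

    # Python UDF: every value is pickled to a Python worker and back
    plus_one = udf(lambda x: x + 1, "double")
    df_udf = df.withColumn("v2", plus_one(df.v1))

    df_builtin.show()
    df_udf.show()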
PySpark UDF: Current state and limitations
PySpark User-Defined Function (UDF)
• PySpark's interface for interacting with other Python libraries
• Types of UDFs:
• Row UDF
• Group UDF
Row UDF: Current
• Operates on a row-by-row basis
• Similar to the `map` operator
• Examples:
• String processing
• Timestamp processing
• Poor performance
• 1-2 orders of magnitude slower compared to alternatives (built-in Spark functions or vectorized operations)
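As an illustration of the string-processing case above, here is a hedged sketch of a row UDF; the column name and normalization logic are made up for the example and are not from the talk.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  alice  ",), ("BOB",)], ["name"])

    # Hypothetical row UDF: normalize a free-text column one row at a time.
    # Every value crosses the JVM/Python boundary, which is why this is slow.
    normalize_name = udf(lambda s: s.strip().title() if s is not None else None,
                         StringType())

    df.withColumn("clean_name", normalize_name(df.name)).show()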
Group UDF: Current
• A UDF that operates on multiple rows
• Similar to `groupBy` followed by the `map` operator
• Example:
• Monthly weighted mean
• Not supported out of the box
• Poor performance
Group UDF: Example
• (values - values.mean()) / values.std()
Group UDF: Example (continued)
(Code screenshots of the pre-Arrow implementation; callouts: roughly 80% of the code is boilerplate, and it is slow.)
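The slides show the pre-Arrow code only as screenshots. As a stand-in, here is a hedged sketch of one common workaround from that era: collect each group into a Python list with collect_list, normalize it in a regular UDF, then explode the result back into rows. All names and data here are illustrative, not the talk's actual code.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import collect_list, explode, udf
    from pyspark.sql.types import ArrayType, DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5.0)], ["key", "v"])

    # Normalize one group's values in plain Python
    def normalize(values):
        mean = sum(values) / len(values)
        std = (sum((x - mean) ** 2 for x in values) / (len(values) - 1)) ** 0.5
        return [(x - mean) / std for x in values]

    normalize_udf = udf(normalize, ArrayType(DoubleType()))

    # Collect each group into a list, apply the UDF, explode back into rows
    result = (df.groupBy("key")
                .agg(collect_list("v").alias("vs"))
                .withColumn("v_norm", explode(normalize_udf("vs")))
                .drop("vs"))
    result.show()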
UDF Issues
• Inefficient data movement between Java and Python (serialization / deserialization)
• Scalar computation model
Apache Arrow
Apache Arrow
• In-memory columnar format
• Building on the success of Parquet
• A standard from the start:
• Developers from 13+ major open source projects involved, including Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, and R
• Benefits:
• Share the effort
• Create an ecosystem
High Performance Sharing & Interchange
(Diagram comparing data interchange between systems before Arrow and with Arrow as a shared in-memory format.)
Columnar Data Format
persons = [{
    name: 'Joe',
    age: 18,
    phones: ['555-111-1111', '555-222-2222']
}, {
    name: 'Jack',
    age: 37,
    phones: ['555-333-3333']
}]
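A hedged sketch of how these records could be laid out as Arrow columns using pyarrow. The calls are standard pyarrow APIs, but the construction is an illustrative addition; the talk's diagram itself is not in the extracted text.

    import pyarrow as pa

    # One Arrow array per column; the nested phones field becomes a list<string> array
    names = pa.array(["Joe", "Jack"])
    ages = pa.array([18, 37])
    phones = pa.array([["555-111-1111", "555-222-2222"], ["555-333-3333"]])

    batch = pa.RecordBatch.from_arrays([names, ages, phones],
                                       names=["name", "age", "phones"])
    print(batch.schema)
    print(batch.num_rows)   # 2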
Record Batch Construction
(Diagram: a schema, a dictionary batch, and a sequence of record batches built from the example record above; each record batch carries a data header describing offsets into the data, plus per-column vectors (validity bitmap, offsets, data) for name, age, and phones.)
Each box (vector) is contiguous memory.
The entire record batch is contiguous on the wire.
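To make the vector layout concrete, here is a small pyarrow probe (an illustrative addition, not from the slides) showing that a string array is backed by a validity bitmap, an offsets buffer, and a data buffer.

    import pyarrow as pa

    names = pa.array(["Joe", "Jack", None])

    # A string array is stored as three contiguous buffers:
    # validity bitmap, int32 offsets, and the concatenated UTF-8 data.
    for i, buf in enumerate(names.buffers()):
        print(i, None if buf is None else buf.size)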
In-Memory Columnar Format for Speed
• Maximize CPU throughput:
• Pipelining
• SIMD
• Cache locality
• Scatter/gather I/O
Results
• PySpark "toPandas" improvement: 53x speedup
• Streaming Arrow performance: 7.75 GB/s data movement
• Arrow Parquet C++ integration: 4 GB/s reads
• Pandas integration: 9.71 GB/s
Read more at http://arrow.apache.org/blog/
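The toPandas speedup above comes from enabling Arrow-based conversion. A minimal sketch, assuming the Spark 2.3-era configuration key spark.sql.execution.arrow.enabled:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Opt in to Arrow-based columnar transfer for toPandas (off by default in Spark 2.3)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    df = spark.range(1000000).selectExpr("id", "id * 2 AS doubled")
    pdf = df.toPandas()   # rows move as Arrow record batches instead of being pickled one by one
    print(pdf.head())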
Improving PySpark UDF
Vectorizing Row UDF
How PySpark UDF works
(Diagram: the Spark executor pickles rows and sends them to a Python worker, which applies a Row -> Row UDF and pickles the resulting rows back to the executor.)
Recap: Current issues with UDF
• Inefficient data movement (serialization / deserialization)
• Scalar computation model
Profiling `lambda x: x + 1` as a row UDF: throughput of 8 Mb/s, with 91.8% of the time spent in serialization/deserialization.
Vectorized UDF
(Diagram: the executor converts rows to Arrow record batches (RB), the Python worker applies a pd.DataFrame -> pd.DataFrame UDF, and the results are converted from record batches back to rows.)
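A hedged sketch of a scalar vectorized UDF using the pandas_udf API added for Spark 2.3 (per the Databricks post referenced later in this deck); the function receives and returns pandas Series, one Arrow record batch at a time. Data and column names are illustrative.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v1"])

    # Vectorized (scalar) UDF: operates on a whole pandas Series per batch
    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v: pd.Series) -> pd.Series:
        return v + 1

    df.withColumn("v2", plus_one(df.v1)).show()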
Row UDF vs. Vectorized UDF
20x speedup, profiler overhead adjusted (actual runtime for the row UDF is 2s without profiling).
Row UDF vs. Vectorized UDF
Serialization/deserialization overhead removed.
Row UDF vs. Vectorized UDF
Fewer system calls, faster I/O.
Improving Group UDF
Introduce Group UDF
• Split-apply-combine:
• Break a problem into smaller pieces
• Operate on each piece independently
• Put all pieces back together
• A common pattern supported in SQL, Spark, Pandas, R … (see the pandas sketch below)
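For reference, the same split-apply-combine pattern in plain pandas; the column names and data are an illustrative addition.

    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b", "b"], "v": [1.0, 2.0, 3.0, 5.0]})

    # Split by key, apply a per-group normalization, combine back into one column
    df["v_norm"] = df.groupby("key")["v"].transform(lambda v: (v - v.mean()) / v.std())
    print(df)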
Split-Apply-Combine (UDF)
• Split: groupBy
• Apply: UDF (pd.DataFrame -> pd.DataFrame)
• Combine: inherently done by Spark
Introduce groupBy().apply()
(Diagram: groupBy shuffles rows into groups; each group is passed to the UDF as a pd.DataFrame, and the returned pd.DataFrames are combined into the result.)
Previous Example
• (values - values.mean()) / values.std()
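A hedged sketch of the previous normalization example written with groupBy().apply(); the decorator and type names follow the Spark 2.3 API described in the Databricks post linked below, and the data and column names are illustrative.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5.0)], ["key", "v"])

    # Grouped map UDF: each group arrives as a pandas DataFrame, and the
    # returned DataFrame must match the declared schema.
    @pandas_udf("key string, v double", PandasUDFType.GROUPED_MAP)
    def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
        v = pdf.v
        return pdf.assign(v=(v - v.mean()) / v.std())

    df.groupby("key").apply(normalize).show()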
Group UDF: Before and After
(Code screenshots comparing the old boilerplate version and the new groupBy().apply() version.)
For the updated API, see: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Performance
Reference: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Try It!
• Available in the upcoming Apache Spark 2.3 release
• Try it with the Databricks community version:
• https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Future Roadmap
• Improving PySpark/Pandas interoperability (SPARK-22216)
• Working towards the Arrow 1.0 release
• More Arrow integration
Get involved
• dev@spark.apache.org
• dev@arrow.apache.org
Collaborators
Bryan Cutler
Hyukjin Kwon
Jeff Reback
Leif Walsh
Li Jin
Liang-Chi Hsieh
Reynold Xin
Takuya Ueshin
Wenchen Fan
Wes McKinney
Xiao Li
Questions?
