
Improving Python and Spark Performance and Interoperability with Apache Arrow
Apache Arrow-based interconnection between various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently.

Improving Python and Spark Performance
and Interoperability with Apache Arrow

Julien Le Dem, Principal Architect, Dremio
Li Jin, Software Engineer, Two Sigma Investments

© 2017 Dremio Corporation, Two Sigma Investments, LP

About Us

Julien Le Dem (@J_)
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet

Li Jin (@icexelloss)
• Software Engineer at Two Sigma Investments
• Building a Python-based analytics platform with PySpark
• Other open source projects:
  – Flint: a time series library on Spark
  – Cook: a fair share scheduler on Mesos

Agenda
• Current state and limitations of PySpark UDFs
• Apache Arrow overview
• Improvements realized
• Future roadmap

Current State and Limitations of PySpark UDFs

Why do we need User Defined Functions?
• Some computation is more easily expressed with Python than with Spark built-in functions.
• Examples:
  – weighted mean
  – weighted correlation
  – exponential moving average

What is a PySpark UDF?
• A PySpark UDF is a user-defined function executed in a Python runtime.
• Two types:
  – Row UDF:
    • lambda x: x + 1
    • lambda date1, date2: (date1 - date2).years
  – Group UDF (subject of this presentation):
    • lambda values: np.mean(np.array(values))

Row UDF
• Operates on a row-by-row basis
  – Similar to the `map` operator
• Example:
  df.withColumn('v2', udf(lambda x: x + 1, DoubleType())(df.v1))
• Performance:
  – 60x slower than built-in functions for a simple case

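Row-at-a-time semantics can be illustrated outside Spark: the UDF is called once per value, exactly like `map` over the column. This is a pure-Python sketch of the semantics, not Spark's actual execution path:

```python
# A row UDF sees one scalar per call, like `map` over the column values.
v1 = [1.0, 2.0, 3.0]
row_udf = lambda x: x + 1
v2 = list(map(row_udf, v1))
print(v2)  # [2.0, 3.0, 4.0]
```

Every element crosses the Python interpreter individually, which is where the 60x overhead versus built-in functions comes from.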
Group UDF
• A UDF that operates on more than one row
  – Similar to `groupBy` followed by the `map` operator
• Example:
  – Compute a weighted mean by month

Group UDF
• Not supported out of the box:
  – Boilerplate code is needed to pack/unpack multiple rows into a nested row
• Poor performance
  – Groups are materialized and then converted to Python data structures

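The pack/unpack boilerplate amounts to materializing every group as a nested structure before any Python code can touch it. A pure-Python sketch of the `collect_list`-style pattern (illustrative data, not Spark's implementation):

```python
from collections import defaultdict

rows = [("jan", 1.0), ("jan", 3.0), ("feb", 10.0)]

# groupBy().agg(collect_list()): each group becomes one nested row.
groups = defaultdict(list)
for month, v in rows:
    groups[month].append(v)

# Only after full materialization can a Python UDF run, e.g. a per-group mean.
means = {month: sum(vs) / len(vs) for month, vs in groups.items()}
print(means)  # {'jan': 2.0, 'feb': 10.0}
```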
Example: Data Normalization

(values - values.mean()) / values.std()

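In pandas the same normalization is a single vectorized expression. A minimal sketch with a made-up `values` Series:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Z-score normalization: subtract the mean, divide by the (sample) std.
normalized = (values - values.mean()) / values.std()

# The result has mean 0 and standard deviation 1.
print(round(normalized.mean(), 10))  # 0.0
print(round(normalized.std(), 10))   # 1.0
```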
Example: Data Normalization
(code screenshot)

Example: Monthly Data Normalization
(code screenshot; callout: "Useful bits")

Example: Monthly Data Normalization
(code screenshot; callouts: "Boilerplate")

Example: Monthly Data Normalization
• Poor performance: 16x slower than the baseline groupBy().agg(collect_list())

Problems
• Packing/unpacking nested rows
• Inefficient data movement (serialization/deserialization)
• Scalar computation model: object boxing and interpreter overhead

Apache Arrow

Arrow: An open source standard
• Common need for an in-memory columnar format
• Building on the success of Parquet
• Top-level Apache project
• A standard from the start
  – Developers from 13+ major open source projects involved
• Benefits:
  – Share the effort
  – Create an ecosystem
• Involved projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPUs
• Embeddable
  – In execution engines, storage layers, etc.
• Interoperable

High Performance Sharing & Interchange

Before:
• Each system has its own internal memory format
• 70-80% of CPU is wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions

With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)

Columnar data

persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: ['555-333-3333']
}]

Record Batch Construction
(diagram) A stream consists of schema negotiation, dictionary batches, then record batches. Each column is a set of vectors: name (bitmap, offset, data), age (bitmap, data), phones (bitmap, list offset, offset, data). A data header describes the offsets into the data. Each vector is contiguous memory, and the entire record batch is contiguous on the wire.

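The offset/data vector layout for the variable-length `phones` column can be mimicked in pure Python. This is an illustration of the idea only, not Arrow's actual buffers (which also carry validity bitmaps and are binary):

```python
persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]

# Fixed-width column: one flat array.
ages = [p["age"] for p in persons]

# List column: offsets[i]..offsets[i+1] delimit row i's phones in flat data.
phone_offsets, phone_data = [0], []
for p in persons:
    phone_data.extend(p["phones"])
    phone_offsets.append(len(phone_data))

print(ages)                                           # [18, 37]
print(phone_offsets)                                  # [0, 2, 3]
print(phone_data[phone_offsets[0]:phone_offsets[1]])  # Joe's two phones
```

Because each column lives in one contiguous buffer, a record batch can be sent over the wire without any per-row serialization.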
In-memory columnar format for speed
• Maximize CPU throughput
  – Pipelining
  – SIMD
  – Cache locality
• Scatter/gather I/O

Results
• PySpark integration: 53x speedup (IBM Spark work on SPARK-13534) — http://s.apache.org/arrowresult1
• Streaming Arrow performance: 7.75 GB/s data movement — http://s.apache.org/arrowresult2
• Arrow Parquet C++ integration: 4 GB/s reads — http://s.apache.org/arrowresult3
• Pandas integration: 9.71 GB/s — http://s.apache.org/arrowresult4

Arrow Releases
(chart of changes per release and days between releases: 178, 195, 311, 85; 237, 131, 76, 17)

Improvements to PySpark with Arrow

How a PySpark UDF works
(diagram) The executor sends batched rows to a Python worker, the UDF (scalar -> scalar) is applied, and batched rows are sent back.

Current Issues with UDF
• Serialize/deserialize in Python
• Scalar computation model (Python for loop)

Profile: lambda x: x + 1
(profiler output: 8 Mb/s, 91.8%; actual runtime is 2 s without profiling)

Vectorize the Row UDF
(diagram) The executor converts rows to Arrow RecordBatches on the way to the Python worker, the UDF (pd.DataFrame -> pd.DataFrame) is applied, and RecordBatches are converted back to rows on the way out.

Why pandas.DataFrame
• Fast, feature-rich, widely used by Python users
• Already exists in PySpark (toPandas)
• Compatible with popular Python libraries: NumPy, StatsModels, SciPy, scikit-learn…
• Zero copy to/from Arrow

Scalar vs Vectorized UDF
• 20x speedup (actual runtime is 2 s without profiling)

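The gap comes from replacing a per-element Python loop with one operation over the whole batch. A small NumPy illustration of the two computation models (the 20x figure is from the slide's benchmark, not this snippet):

```python
import numpy as np

xs = np.arange(1_000_000, dtype=np.float64)

# Scalar model: a Python-level loop, one boxed float per element.
scalar = [x + 1 for x in xs]

# Vectorized model: a single NumPy operation over the entire batch.
vectorized = xs + 1

print(scalar[:3])      # [1.0, 2.0, 3.0]
print(vectorized[:3])  # [1. 2. 3.]
```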
Scalar vs Vectorized UDF
(profile: overhead removed)

Scalar vs Vectorized UDF
(fewer system calls, faster I/O)

Scalar vs Vectorized UDF
• 4.5x speedup

Support Group UDF
• Split-apply-combine:
  – Break a problem into smaller pieces
  – Operate on each piece independently
  – Put all pieces back together
• A common pattern supported in SQL, Spark, pandas, R, …

Split-Apply-Combine (Current)
• Split: groupBy, window, …
• Apply: mean, stddev, collect_list, rank, …
• Combine: inherently done by Spark

Split-Apply-Combine (with Group UDF)
• Split: groupBy, window, …
• Apply: UDF
• Combine: inherently done by Spark

Introduce groupBy().apply()
• UDF: pd.DataFrame -> pd.DataFrame
  – Treat each group as a pandas DataFrame
  – Apply the UDF on each group
  – Assemble the results as a PySpark DataFrame

Introduce groupBy().apply()
(diagram) groupBy shuffles rows into groups; each group is then processed as pd.DataFrame -> pd.DataFrame.

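The per-group semantics of `groupBy().apply()` can be sketched in plain pandas with hypothetical data; in Spark, the same UDF would be applied to each group's Arrow-backed DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["jan", "jan", "feb", "feb"],
    "v": [1.0, 3.0, 10.0, 30.0],
})

# The group UDF: pd.DataFrame in, pd.DataFrame out, run once per group.
def normalize(pdf):
    return (pdf - pdf.mean()) / pdf.std()

result = df.groupby("month", group_keys=False)[["v"]].apply(normalize)
print(result.loc[0, "v"])  # -0.707... (jan's first value, normalized)
```

Each group is normalized against its own mean and standard deviation, which is exactly the monthly normalization example from earlier.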
Previous Example: Data Normalization

(values - values.mean()) / values.std()

Previous Example: Data Normalization
• 5x speedup (Group UDF vs. the current approach)

Limitations
• Requires Spark Row <-> Arrow RecordBatch conversion
  – Incompatible memory layout (row vs. column)
• (groupBy) No local aggregation
  – Difficult due to how PySpark works; see https://issues.apache.org/jira/browse/SPARK-10915

Future Roadmap

What's Next (Arrow)
• Arrow RPC/REST
• Arrow IPC
• Apache {Spark, Drill, Kudu} to Arrow integration
  – Faster UDFs, storage interfaces

What's Next (PySpark UDF)
• Continue working on SPARK-20396
• Support Pandas UDFs with more PySpark functions:
  – groupBy().agg()
  – window

What's Next (PySpark UDF)
(code screenshot)

Get Involved
• Watch SPARK-20396
• Join the Arrow community
  – dev@arrow.apache.org
  – Slack: https://apachearrowslackin.herokuapp.com/
  – http://arrow.apache.org
  – Follow @ApacheArrow

Thank You
• Bryan Cutler (IBM) and Wes McKinney (Two Sigma Investments) for helping build this feature
• The Apache Arrow community
• The Spark Summit organizers
• Two Sigma and Dremio for supporting this work

This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, "Two Sigma"). Such views reflect significant assumptions and subjective views of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made as to the accuracy of such information, and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
