Improving Python and Spark Performance and Interoperability with Apache Arrow with Julien Le Dem and Li Jin

Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up data processing. In many use cases, though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user's Python environment and the Spark master.

Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables them to be used together seamlessly and efficiently, without overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs.

  1. Improving Python and Spark Performance and Interoperability with Apache Arrow. Julien Le Dem, Principal Architect, Dremio. Li Jin, Software Engineer, Two Sigma Investments.
  2. About Us. Julien Le Dem (@J_): Architect at @DremioHQ; formerly Tech Lead at Twitter on data platforms; creator of Parquet; Apache member; Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet. Li Jin (@icexelloss): Software Engineer at Two Sigma Investments; building a Python-based analytics platform with PySpark; other open source projects: Flint (a time series library on Spark) and Cook (a fair-share scheduler on Mesos).
  3. Agenda: current state and limitations of PySpark UDFs; Apache Arrow overview; improvements realized; future roadmap.
  4. Current state and limitations of PySpark UDFs
  5. Why do we need user-defined functions? Some computations are more easily expressed with Python than with Spark built-in functions. Examples: weighted mean, weighted correlation, exponential moving average.
  6. What is a PySpark UDF? A PySpark UDF is a user-defined function executed in the Python runtime. Two types: Row UDF, e.g. lambda x: x + 1 or lambda date1, date2: (date1 - date2).years; and Group UDF (the subject of this presentation), e.g. lambda values: np.mean(np.array(values)).
  7. Row UDF: operates on a row-by-row basis, similar to the `map` operator. Example: df.withColumn('v2', udf(lambda x: x + 1, DoubleType())(df.v1)). Performance: about 60x slower than built-in functions for this simple case.
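For concreteness, a minimal runnable version of that example might look like this; the SparkSession setup and sample data are illustrative, not from the slides:

```python
# A minimal, runnable sketch of the row UDF above; the sample data is made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("row-udf-demo").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v1"])

add_one = udf(lambda x: x + 1.0, DoubleType())  # executed row by row in a Python worker
df.withColumn("v2", add_one(df.v1)).show()
```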
  8. Group UDF: a UDF that operates on more than one row, similar to `groupBy` followed by the `map` operator. Example: compute a weighted mean by month.
  9. Group UDF: not supported out of the box; boilerplate code is needed to pack/unpack multiple rows into a nested row, and performance is poor because groups are materialized and then converted to Python data structures.
  10. Example: Data Normalization. (values - values.mean()) / values.std()
  11. Example: Data Normalization (code screenshot)
  12. Example: Monthly Data Normalization (code screenshot; annotation: "Useful bits")
  13. Example: Monthly Data Normalization (code screenshot; annotations: "Boilerplate")
  14. Example: Monthly Data Normalization. Poor performance: 16x slower than the baseline groupBy().agg(collect_list()).
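To make the boilerplate concrete, here is a hedged sketch of that pre-Arrow approach: pack each group with collect_list, run a plain Python UDF over the packed values, and explode the result back into rows. The DataFrame df and the column names key/v are illustrative, not from the slides:

```python
# Hypothetical pre-Arrow group UDF via pack/apply/unpack; assumes a DataFrame
# `df` with a grouping column `key` and a numeric column `v`.
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

def normalize(values):
    arr = np.array(values)
    return ((arr - arr.mean()) / arr.std()).tolist()

normalize_udf = F.udf(normalize, ArrayType(DoubleType()))

result = (
    df.groupBy("key")
      .agg(F.collect_list("v").alias("vs"))        # pack: materialize each group
      .withColumn("vs_norm", normalize_udf("vs"))  # apply: plain Python loop over objects
      .select("key", F.explode("vs_norm").alias("v_norm"))  # unpack back into rows
)
```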
  15. Problems: packing/unpacking nested rows; inefficient data movement (serialization/deserialization); a scalar computation model with object boxing and interpreter overhead.
  16. Apache Arrow
  17. Arrow: an open source standard. There is a common need for in-memory columnar data, and Arrow builds on the success of Parquet. It is a top-level Apache project and has been a standard from the start, with developers from 13+ major open source projects involved, including Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, and R. Benefits: share the effort and create an ecosystem.
  18. Arrow goals: well-documented and cross-language compatible; designed to take advantage of modern CPUs; embeddable in execution engines, storage layers, etc.; interoperable.
  19. High Performance Sharing & Interchange. Before: each system has its own internal memory format, 70-80% of CPU is wasted on serialization and deserialization, and functionality is duplicated with unnecessary conversions. With Arrow: all systems use the same memory format, there is no overhead for cross-system communication, and projects can share functionality (e.g. a Parquet-to-Arrow reader). (Diagram: before, each pair of systems among Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark needs its own copy-and-convert step; with Arrow, they all share Arrow memory.)
  20. Columnar data. persons = [{name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222']}, {name: 'Jack', age: 37, phones: ['555-333-3333']}]. (Diagram of the column layouts: Name is an offset buffer [0, 3, 7] into the character values "JoeJack"; Age is the values [18, 37]; Phones is a list-offset buffer [0, 2, 3] over string offsets [0, 12, 24, 36] into the concatenated phone digits.)
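The same layout can be built directly with pyarrow. A small sketch using the slide's data (the snippet is for illustration and is not from the talk):

```python
import pyarrow as pa

names = pa.array(["Joe", "Jack"])                     # string column: offsets + values
ages = pa.array([18, 37])                             # fixed-width column
phones = pa.array([["555-111-1111", "555-222-2222"],  # list column: list offsets over
                   ["555-333-3333"]])                 # a string child array

batch = pa.RecordBatch.from_arrays([names, ages, phones], ["name", "age", "phones"])
print(batch.schema)
print(names.buffers())  # validity bitmap, offsets buffer, values buffer
```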
  21. Record Batch Construction. The stream starts with schema negotiation, followed by an optional dictionary batch and then record batches. Each record batch carries a data header that describes offsets into the data, followed by the per-column buffers: validity bitmap, offset, and data vectors for name, age, and phones. Each box (vector) is contiguous memory, and the entire record batch is contiguous on the wire. (Diagram uses the record { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }.)
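A hedged sketch of that wire format using pyarrow's IPC stream API (modern pyarrow spelling; the 2017-era API differed slightly), reusing the batch from the previous snippet:

```python
import pyarrow as pa

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:  # schema is written first
    writer.write_batch(batch)                          # each batch is contiguous on the wire

buf = sink.getvalue()
reader = pa.ipc.open_stream(buf)
for received in reader:                                # read batches back from the stream
    print(received.num_rows)
```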
  22. In-memory columnar format for speed: maximize CPU throughput through pipelining, SIMD, and cache locality; use scatter/gather I/O.
  23. Results. PySpark integration: 53x speedup (IBM Spark work on SPARK-13534), http://s.apache.org/arrowresult1. Streaming Arrow performance: 7.75 GB/s data movement, http://s.apache.org/arrowresult2. Arrow Parquet C++ integration: 4 GB/s reads, http://s.apache.org/arrowresult3. Pandas integration: 9.71 GB/s, http://s.apache.org/arrowresult4.
  24. Arrow releases (changes and days between releases): 0.1.0, October 10, 2016 (178 changes, 237 days); 0.2.0, February 18, 2017 (195 changes, 131 days); 0.3.0, May 5, 2017 (311 changes, 76 days); 0.4.0, May 22, 2017 (85 changes, 17 days).
  25. Improvements to PySpark with Arrow
  26. How a PySpark UDF works. (Diagram: the executor streams batched rows to a Python worker, the UDF maps scalar -> scalar, and batched rows are streamed back.)
  27. Current issues with UDFs: serialization/deserialization in Python, and a scalar computation model (a Python for loop).
  28. Profile of lambda x: x + 1. (Profiler screenshot: throughput of 8 MB/s, with 91.8% of time in the highlighted calls; actual runtime is 2 s without profiling.)
  29. Vectorize the Row UDF. (Diagram: the executor converts rows to Arrow record batches (Rows -> RB) and ships immutable Arrow batches to the Python worker, where the UDF maps pd.DataFrame -> pd.DataFrame; results return as immutable Arrow batches and are converted back to rows (RB -> Rows).)
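For reference, this vectorized row UDF eventually shipped in Spark 2.3 as pandas_udf; the talk predates the final API, so this is a sketch in the released form, with the column names carried over from the earlier example:

```python
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def add_one(v):
    # `v` is a pandas Series backed by an Arrow batch, not a single scalar
    return v + 1

df.withColumn("v2", add_one(df.v1))
```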
  30. Why pandas.DataFrame? It is fast, feature-rich, and widely used by Python users; it already exists in PySpark (toPandas); it is compatible with popular Python libraries such as NumPy, StatsModels, SciPy, and scikit-learn; and it converts to/from Arrow with zero copy.
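A small illustration of that hand-off using pyarrow (an illustrative snippet, not from the slides; in practice zero-copy holds under restrictions, e.g. numeric columns without nulls):

```python
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"v": [1.0, 2.0, 3.0]})
table = pa.Table.from_pandas(pdf)  # pandas -> Arrow
back = table.to_pandas()           # Arrow -> pandas; can reuse buffers without copying
```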
  31. Scalar vs. vectorized UDF. (Profiler screenshot: 20x speedup; actual runtime is 2 s without profiling.)
  32. Scalar vs. vectorized UDF. (Profiler screenshot: overhead removed.)
  33. Scalar vs. vectorized UDF. (Profiler screenshot: fewer system calls, faster I/O.)
  34. Scalar vs. vectorized UDF. (Benchmark: 4.5x end-to-end speedup.)
  35. Support Group UDFs. Split-apply-combine: break a problem into smaller pieces, operate on each piece independently, then put all the pieces back together. This is a common pattern supported in SQL, Spark, pandas, R, and elsewhere; a plain pandas rendition follows below.
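The same pattern in plain pandas, to pin down the semantics (the data here is illustrative):

```python
import pandas as pd

pdf = pd.DataFrame({"key": ["a", "a", "b", "b"], "v": [1.0, 2.0, 3.0, 5.0]})

# split on key, apply a function to each piece, combine the aligned results
normalized = pdf.groupby("key")["v"].transform(lambda v: (v - v.mean()) / v.std())
```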
  36. Split-apply-combine (current). Split: groupBy, window, etc. Apply: mean, stddev, collect_list, rank, etc. Combine: done inherently by Spark.
  37. Split-apply-combine (with Group UDF). Split: groupBy, window, etc. Apply: a UDF. Combine: done inherently by Spark.
  38. Introduce groupBy().apply(). The UDF maps pd.DataFrame -> pd.DataFrame: treat each group as a pandas DataFrame, apply the UDF to each group, and assemble the results as a PySpark DataFrame.
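As with the vectorized row UDF, this API later shipped in Spark 2.3 as a grouped-map pandas UDF. A sketch in the released form, with illustrative column names key/v (Spark 3.x spells the same thing df.groupBy("key").applyInPandas(normalize, schema)):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("key string, v double, v_norm double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    # pdf is one group, materialized as a pandas DataFrame
    return pdf.assign(v_norm=(pdf.v - pdf.v.mean()) / pdf.v.std())

df.groupBy("key").apply(normalize)
```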
  39. Introduce groupBy().apply(). (Diagram: groupBy splits rows into groups, each group is processed as pd.DataFrame -> pd.DataFrame, and the resulting groups are combined back into a DataFrame.)
  40. Previous example: Data Normalization. (values - values.mean()) / values.std()
  41. Previous example: Data Normalization. (Code comparison screenshot, current vs. Group UDF: 5x speedup.)
  42. Limitations. Requires Spark Row <-> Arrow RecordBatch conversion, because the memory layouts are incompatible (row vs. column). No local aggregation for groupBy, which is difficult due to how PySpark works; see https://issues.apache.org/jira/browse/SPARK-10915.
  43. Future Roadmap
  44. What's next (Arrow): Arrow RPC/REST; Arrow IPC; Apache {Spark, Drill, Kudu} to Arrow integration for faster UDFs and storage interfaces.
  45. What's next (PySpark UDF): continue working on SPARK-20396; support pandas UDFs with more PySpark functions, such as groupBy().agg() and window.
  46. What's next (PySpark UDF). (Code screenshot.)
  47. Get involved: watch SPARK-20396; join the Arrow community at dev@arrow.apache.org; Slack: https://apachearrowslackin.herokuapp.com/; http://arrow.apache.org; follow @ApacheArrow.
  48. Thank you. Bryan Cutler (IBM) and Wes McKinney (Two Sigma Investments) for helping build this feature; the Apache Arrow community; the Spark Summit organizers; Two Sigma and Dremio for supporting this work.
