Future of Pandas - Jeff Reback



Architecture overview for the future of the Python Pandas data analytics library.

  1. Future of Pandas, Jeff Reback, PyData NYC, November 2017
  2. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  3. @jreback ● Former quant ● Senior Engineer at Two Sigma, working on holistic approaches to modeling ● Core committer to pandas for the last 5 years ● Managed pandas since 2013
  4. Kudos!
  5. Kudos! Complaints!
  6. Overview ● State of the Pandas ○ The Good ○ The Bad ○ The Ugly
  7. Overview ● State of the Pandas ○ The Good ○ The Bad ○ The Ugly ● The Present
  8. Overview ● State of the Pandas ○ The Good ○ The Bad ○ The Ugly ● The Present ● The Future
  9. The Good
  10. pandas’s role in the Python Data Ecosystem (diagram labels: pandas, Numerical Computing, IO / Data Access, Data Visualization, Statistics + Machine Learning, Libraries, Users)
  11. The Bad
  12. @wesm "10 Things I Hate About pandas"
  13. DataTypes - what are we Missing?
  14. DataTypes - Missing values can cause dtype changes
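
The dtype change named on this slide can be reproduced in a few lines. This is a minimal sketch, assuming the default NumPy-backed integer dtype; the series contents and labels are made up for illustration.

```python
import pandas as pd

# An integer column with no missing values keeps its integer dtype.
s = pd.Series([1, 2, 3], dtype="int64")

# Reindexing adds a label with no data. The gap is filled with NaN, and
# because NaN is a float, the whole column is upcast to float64.
r = s.reindex([0, 1, 2, 3])

print(s.dtype)  # int64
print(r.dtype)  # float64
```

The values are unchanged, but integer-only guarantees (exact large values, use as keys) are lost once the column becomes floating point.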
  15. API ● Complex groupby operations awkward and slow ● Copy Semantics
  16. API - Opaque UDFs
  17. API - Groupby Performance
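
The slides only name these topics, so the following is one plausible reading rather than the speaker's own example: a built-in aggregation and a user-defined function (UDF) give the same answer, but pandas cannot look inside the UDF. The frame and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1.0, 2.0, 3.0, 4.0]})
grouped = df.groupby("key")["val"]

# Built-in aggregation: pandas knows what "sum" means and can dispatch to
# its own grouped implementation.
builtin = grouped.sum()

# User-defined function: pandas only sees an arbitrary Python callable, so
# it calls it once per group and cannot inspect or optimize it.
udf = grouped.apply(lambda values: values.sum())

print(builtin.to_dict())
print(udf.to_dict())
```

On a frame with many groups, the per-group Python call is typically where the time goes, which is one way the "opaque UDF" and "groupby performance" complaints connect.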
  18. API - Aggregation Syntax
  19. API - Copy Semantics
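
The slide does not spell out the copy-semantics complaint. Assuming it refers to the ambiguity over whether a selection is a view or a copy, the sketch below shows the two unambiguous spellings: an explicit `.copy()` and a single `.loc` assignment. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
mask = df["a"] > 1

# Explicit copy: changes to `subset` never reach `df`.
subset = df[mask].copy()
subset["b"] = 0

# Explicit in-place update: one .loc call writes into the target frame.
updated = df.copy()
updated.loc[mask, "b"] = 0

print(df["b"].tolist())       # original is untouched
print(subset["b"].tolist())
print(updated["b"].tolist())
```

Chained forms such as `df[mask]["b"] = 0` are deliberately left out, since whether they modify `df` has depended on the pandas version and on the frame's internal layout.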
  20. Performance ● In-memory format that is custom ● Eager evaluation model, no query planning ● "Slow", limited multicore algorithms for large datasets
  21. Data Tooling Spectrum ● Small Data: < 5 GB ● “Medium” Data: 5-100 GB ● “Big” Data: > 100 GB ● pandas starts to fail as an effective tool somewhere around the 10 GB mark
  22. Block Storage
  23. Block Storage ● Block Manager: Axes, Blocks ● Axes: Index RangeIndex(0, 4, 1); Columns Index [‘A’, ‘B’, ‘C’] ● Blocks: Float [1.0 2.0 1.0 2.0 1.0 2.0 1.0 2.0]; Int [1 2 3 4]
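
The block layout on this slide can be imitated with plain NumPy. This is a toy sketch, not pandas's internal BlockManager: the `consolidate` helper is hypothetical, and the assignment of columns A and B to the float block and C to the int block is an assumption read off the slide's values.

```python
import numpy as np

# Hypothetical columns matching the slide's axes: four rows, columns A, B, C.
columns = {
    "A": np.array([1.0, 2.0, 1.0, 2.0]),
    "B": np.array([1.0, 2.0, 1.0, 2.0]),
    "C": np.array([1, 2, 3, 4], dtype="int64"),
}


def consolidate(cols):
    """Group same-dtype columns into one 2-D array per dtype (a toy block manager)."""
    by_dtype = {}
    for name, values in cols.items():
        by_dtype.setdefault(values.dtype.name, []).append((name, values))
    blocks = {}
    for dtype_name, members in by_dtype.items():
        names = [name for name, _ in members]
        # One row per column, so a block has shape (n_columns, n_rows).
        blocks[dtype_name] = (names, np.vstack([values for _, values in members]))
    return blocks


blocks = consolidate(columns)
for dtype_name, (names, data) in blocks.items():
    print(dtype_name, names, data.shape)
```

Storing same-typed columns together makes whole-block operations cheap, but adding or retyping a single column can force a block to be rebuilt and copied.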
  24. The Ugly
  25. Big Data Unfriendly ● Each system has its own internal memory format ● 70-80% computation wasted on serialization and deserialization ● Similar functionality implemented in multiple projects
  26. The Present
  27. In current pandas ● CategoricalDtype: efficient memory & first class Categoricals ● efficient IO ● out-of-core and multi-core
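
A short sketch of what `CategoricalDtype` offers, using made-up data: the dtype is declared once, independent of any particular column, and the categorical column stores small integer codes instead of repeated strings.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sizes = pd.Series(["small", "large", "medium"] * 1000)

# Declare the dtype once, including the order of its categories.
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
as_category = sizes.astype(size_dtype)

# Compare the footprint of the string column and the categorical column.
string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)

# Ordered categories support comparisons against a category label.
at_least_medium = int((as_category >= "medium").sum())
print(at_least_medium)
```

The exact byte counts depend on the pandas version and string backend, so only the relative difference is meaningful here.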
  28. The Future
  29. pandas2 architecture ● Ibis: Python user API, User-defined functions; Logical Data Frame Expression Graphs ● Apache Arrow: Arrow-optimized data connectors; Arrow in-memory format; Parallel Dataflow Execution Engine ● pandas2: DataFrame semantics & compatibility
  30. Ibis ● Python user API, User-defined functions ● Logical Data Frame Expression Graphs
  31. Ibis in a nutshell: Ibis Python code → Abstract Syntax Tree → Compiler → Back End compiled code
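
The pipeline on this slide (Python code, syntax tree, compiler, back-end code) can be illustrated with a toy expression compiler. None of the classes below are Ibis's API; `Column`, `Literal`, `Greater`, `Filter`, and `compile_sql` are hypothetical names chosen only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Greater:
    left: object
    right: object


@dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, other):
        # Comparing a column builds a tree node instead of computing a result.
        return Greater(self, Literal(other))


@dataclass(frozen=True)
class Filter:
    table: str
    predicate: object


def compile_sql(node):
    """Walk the expression tree and emit a SQL string for a hypothetical back end."""
    if isinstance(node, Column):
        return node.name
    if isinstance(node, Literal):
        return repr(node.value)
    if isinstance(node, Greater):
        return f"{compile_sql(node.left)} > {compile_sql(node.right)}"
    if isinstance(node, Filter):
        return f"SELECT * FROM {node.table} WHERE {compile_sql(node.predicate)}"
    raise TypeError(f"unsupported node: {node!r}")


expr = Filter("trades", Column("price") > 100)
sql = compile_sql(expr)
print(sql)
```

Because the user's Python code only builds a tree, the same expression could in principle be handed to a different compiler for a different back end.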
  32. Apache Arrow ● Arrow-optimized data connectors ● Arrow in-memory format ● Parallel Dataflow Execution Engine
  33. Apache Arrow project ● Fast: Arrow supports zero-copy reads and is optimized for data locality. ● Flexible: Arrow acts as a new high-performance interface between various systems. ● Standard: Apache Arrow is backed by key developers of 13 major open source projects
  34. Big Data friendly ● All systems utilize the same memory format ● No overhead for cross-system communication ● Projects can share functionality (e.g., Parquet-to-Arrow reader)
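
The difference between a serialization round trip and a shared memory format can be shown with the standard library alone. This is an analogy using Python's buffer protocol, not Arrow itself, and the producer/consumer framing is an assumption made for illustration.

```python
import pickle
from array import array

# The "producer" holds four 64-bit floats in one contiguous buffer.
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization path: the consumer gets an independent copy, paid for with
# an encode step and a decode step.
copied = pickle.loads(pickle.dumps(values))

# Shared-format path: the consumer reads the producer's buffer directly.
shared = memoryview(values)

values[0] = 99.0
print(copied[0])  # the copy does not see the update
print(shared[0])  # the view reads the same memory
```

When both sides agree on the layout of the bytes, the second path needs no conversion work at all, which is the property the slide attributes to a common memory format.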
  35. DataFrame
  36. Computation ● Kernel functions: atomic units of computation ● Operator nodes: input/output types, operator parallelism properties
  37. Physical Operator Graphs: (a + b).log().sum() as a graph of inputs a, b and operators Add, Log, Sum
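
The expression on this slide can be modeled as a deferred graph of operator nodes, each backed by a kernel function. This is a minimal sketch under assumed names (`Node`, `KERNELS`, `evaluate`); it uses Python lists where a real engine would use typed arrays.

```python
import math


class Node:
    """One operator node; building it performs no computation."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __add__(self, other):
        return Node("add", (self, other))

    def log(self):
        return Node("log", (self,))

    def sum(self):
        return Node("sum", (self,))


# Kernel functions: the atomic units of computation, one per operator.
KERNELS = {
    "add": lambda x, y: [p + q for p, q in zip(x, y)],
    "log": lambda x: [math.log(v) for v in x],
    "sum": lambda x: math.fsum(x),
}


def evaluate(node):
    """Evaluate the graph bottom-up by applying each node's kernel to its inputs."""
    if node.op == "source":
        return node.value
    return KERNELS[node.op](*(evaluate(child) for child in node.inputs))


a = Node("source", value=[1.0, 2.0])
b = Node("source", value=[0.0, 1.0])

expr = (a + b).log().sum()  # builds Add -> Log -> Sum, nothing runs yet
result = evaluate(expr)
print(expr.op, result)
```

Separating graph construction from evaluation is what leaves room for planning steps such as fusing operators or choosing how to parallelize them.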
  38. Parallel Evaluation of Operator Graphs (diagram labels: inputs a, b; intermediates tmp, tmp 2; output out; operators Add, Log, Sum)
  39. Parallel Evaluation of Operator Graphs
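
The slides do not describe how the graph is parallelized. One common approach, assumed here, is to split the inputs into chunks, run the Add and Log steps plus a partial Sum on each chunk independently, and then combine the partial sums. The helper names and chunk count are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def add_log_sum(chunk):
    """Run Add -> Log -> partial Sum on one pair of input chunks."""
    a_chunk, b_chunk = chunk
    return math.fsum(math.log(x + y) for x, y in zip(a_chunk, b_chunk))


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


a = [float(i) for i in range(1, 1001)]
b = [1.0] * 1000

# Each worker handles one chunk; Sum is a reduction, so partial results
# can be combined at the end.
chunks = list(zip(split(a, 4), split(b, 4)))
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(add_log_sum, chunks))
out = math.fsum(partial_sums)

serial = math.fsum(math.log(x + y) for x, y in zip(a, b))
print(out, serial)
```

In CPython, threads running pure-Python kernels will not speed this up because of the global interpreter lock; the sketch shows the structure of the evaluation, which pays off when the kernels are native code that releases the lock.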
  40. Status & Links ● pandas2 ● Ibis 0.12.0 released in October ● Arrow 0.7.1 released in October
  41. What can the community do? ● We love contributions. ● We love donations (to NumFOCUS).
  42. Thanks