Future of pandas

Published on

Architecture overview for the future of the Python Pandas data analytics library.

Published in: Data & Analytics

  1. Future of Pandas, Jeff Reback, PyData NYC, November 2017
  2. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved.
  3. @jreback ● Former quant ● Senior Engineer at Two Sigma, working on holistic approaches to modeling ● Core committer to pandas for last 5 years ● Managed pandas since 2013
  4. Kudos!
  5. Kudos! Complaints!
  6. Overview ● State of the Pandas ○ The Good ○ The Bad ○ The Ugly
  7. Overview ● State of the Pandas ○ The Good ○ The Bad ○ The Ugly ● The Present
  8. Overview ● State of the Pandas ○ The Good ○ The Bad ○ The Ugly ● The Present ● The Future
  9. The Good
  10. pandas’s role in the Python Data Ecosystem (diagram labels): pandas; Numerical Computing; IO / Data Access; Data Visualization; Statistics + Machine Learning; Libraries; Users
  11. The Bad
  12. http://wesmckinney.com/blog/apache-arrow-pandas-internals/ @wesm "10 Things I Hate About pandas"
  13. DataTypes - what are we Missing?
  14. DataTypes - Missing values can cause dtype changes
  15. API ● Complex groupby operations awkward and slow ● Copy Semantics
  16. API - Opaque UDFs
  17. API - Opaque UDFs
  18. API - Opaque UDFs
  19. API - Aggregation Syntax
  20. API - Copy Semantics
  21. Performance ● In-memory format that is custom ● Eager evaluation model, no query planning ● "Slow", limited multicore algorithms for large datasets
  22. Data Tooling Spectrum ● Small Data: < 5 GB ● “Medium” Data: 5-100 GB ● “Big” Data: > 100 GB ● pandas starts to fail as an effective tool somewhere around the 10 GB mark
  23. Block Storage
  24. Block Storage (diagram): Block Manager with Axes (Index: RangeIndex(0, 4, 1); Columns: Index([‘A’, ‘B’, ‘C’])) and Blocks (Float: 1.0 2.0 1.0 2.0 1.0 2.0 1.0 2.0; Int: 1 2 3 4)
  25. The Ugly
  26. Big Data Unfriendly ● Each system has its own internal memory format ● 70-80% computation wasted on serialization and deserialization ● Similar functionality implemented in multiple projects
  27. The Present
  28. In current pandas ● CategoricalDtype: efficient memory & first class Categoricals ● efficient IO ● out-of-core and multi-core
  29. The Future
  30. pandas2 architecture (diagram labels): pandas2; Arrow-optimized data connectors; Arrow in-memory format; Parallel Dataflow Execution Engine; Apache Arrow; Python user API, User-defined functions; Logical Data Frame Expression Graphs; Ibis; DataFrame semantics & compatibility
  31. Ibis in a nutshell (diagram labels): Ibis Python code; Compiler; Back End compiled code; Abstract Syntax Tree
  32. pandas2 architecture (diagram labels): pandas2; Arrow-optimized data connectors; Arrow in-memory format; Parallel Dataflow Execution Engine; Apache Arrow; Python user API, User-defined functions; Logical Data Frame Expression Graphs; Ibis; DataFrame semantics & compatibility
  33. Apache Arrow project ● Fast: Arrow supports zero-copy reads and is optimized for data locality. ● Flexible: Arrow acts as a new high-performance interface between various systems. ● Standard: Apache Arrow is backed by many key open source projects.
  34. Big Data friendly ● All systems utilize the same memory format ● No overhead for cross-system communication ● Projects can share functionality (e.g., Parquet-to-Arrow reader)
  35. DataFrame
  36. Computation ● Kernel functions: atomic units of computation ● Operator nodes: input/output types, operator parallelism properties
  37. Physical Operator Graphs: (a + b).log().sum() (diagram: a, b → Add → Log → Sum)
  38. Parallel Evaluation of Operator Graphs (diagram: a, b → Add → tmp → Log → tmp 2 → Sum → out)
  39. pandas2 architecture (diagram labels): pandas2; Arrow-optimized data connectors; Arrow in-memory format; Parallel Dataflow Execution Engine; Apache Arrow; Python user API, User-defined functions; Logical Data Frame Expression Graphs; Ibis; DataFrame semantics & compatibility
  40. Status & Links ● pandas2: https://github.com/pandas-dev/pandas2 ● Ibis: https://github.com/ibis-project/ibis (0.12.0 released in October) ● Arrow: https://github.com/apache/arrow (0.7.1 released in October)
  41. What can the community do? http://pandas.pydata.org/
  42. Thanks https://www.slideshare.net/secret/sUXFArGxQ1RFX7 @jreback
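
Slide 14 notes that missing values can cause dtype changes, and slide 28 lists CategoricalDtype among the features of current pandas. The slides themselves carry no code, so the following is only a minimal sketch of both behaviours, assuming pandas 0.21 or later and the default NumPy-backed integer dtype. The variable names and sample values are illustrative and do not come from the talk.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A NumPy-backed integer Series has no representation for a missing value.
ints = pd.Series([1, 2, 3])

# Reindexing to a label that does not exist introduces a NaN, and the whole
# column is upcast from an integer dtype to float64 to hold it.
with_missing = ints.reindex([0, 1, 2, 3])

# CategoricalDtype describes the categories (and their order) separately
# from any data, so it can be passed to astype like any other dtype.
# The category labels here are made up for illustration.
size_dtype = CategoricalDtype(categories=["small", "medium", "big"], ordered=True)
sizes = pd.Series(["small", "big", "small", "medium"]).astype(size_dtype)

print(ints.dtype, "->", with_missing.dtype)
print(sizes.cat.codes.tolist())
```

Each categorical value is stored as a small integer code pointing into the category list, which is where the memory saving on repetitive string columns comes from.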

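Slides 36 to 38 describe computation as kernel functions wired into operator graphs, with (a + b).log().sum() as the running example. The sketch below is a toy illustration of that idea in plain Python, not the pandas2 or Ibis implementation. The class names (Node, Column, Op) and the evaluate and describe methods are hypothetical, and evaluation here is sequential rather than parallel.

```python
import math


class Node:
    """A node in the operator graph. Building an expression only records it."""

    def __add__(self, other):
        return Op("Add", lambda x, y: [p + q for p, q in zip(x, y)], self, other)

    def log(self):
        return Op("Log", lambda x: [math.log(v) for v in x], self)

    def sum(self):
        return Op("Sum", math.fsum, self)


class Column(Node):
    """Leaf node: a named input column."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, data):
        return data[self.name]

    def describe(self):
        return self.name


class Op(Node):
    """Operator node: a kernel function applied to the results of its inputs."""

    def __init__(self, label, kernel, *inputs):
        self.label = label
        self.kernel = kernel
        self.inputs = inputs

    def evaluate(self, data):
        # Evaluate the input nodes first, then run this node's kernel on them.
        return self.kernel(*(node.evaluate(data) for node in self.inputs))

    def describe(self):
        return f"{self.label}({', '.join(n.describe() for n in self.inputs)})"


a, b = Column("a"), Column("b")
expr = (a + b).log().sum()   # builds the graph; no kernel has run yet
plan = expr.describe()       # textual view of the graph
result = expr.evaluate({"a": [1.0, 2.0], "b": [0.0, 1.0]})
print(plan, result)
```

Because the whole graph exists before any kernel runs, an engine could inspect it to plan the work, for example by running independent branches in parallel. This sketch leaves that planning step out.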