Extending Pandas with Custom Types - Will Ayd

pandas 0.23 introduced a new extension interface that lets you define data types beyond those in NumPy's type system. This talk explains what that means in more detail and gives practical examples of how the new interface can be used to drastically improve your reporting.

Published in: Technology

  1. Extending Pandas
     Will Ayd, PyData Los Angeles, October 23, 2018
  2. Motivation
     • Pandas has historically been bound to the NumPy type system and its limitations:
       • Integer and bool types cannot store missing data
       • Non-numeric types (e.g. Categorical, Datetime w/ TZ) are not natively supported
       • Custom types required extensive updates to pandas internals
  3. Important Concepts
     • Extension Type
       • Description of the data type; similar to a NumPy dtype
       • Can be registered for creation via string (e.g. …astype("foo"))
     • Extension Array
       • Class which does the actual "heavy lifting"
       • No restrictions on construction, though it must be convertible to a NumPy array
       • Limited to one dimension, though may be backed by 0..n arrays
     • "A Series is a container for an 'array-like' thing" - Tom Augspurger
  4. Integer NA Demo
  5. IntegerNA Illustrated
     • Existing implementation
       • Internals: values 1.0, NA, 2.0
       • Sum: 3.0, Min: 1.0, Max: 2.0
     • Extension implementation
       • Internals: values 1, ?, 2 and mask 0, 1, 0
       • Sum: 3, Min: 1, Max: 2
  6. Getting Started with IntNA
  7. Memory Reductions
     • 196_608 / 524_288 = 37.5% of original memory, by preventing the implicit cast to float64
  8. CyberPandas Demo
  9. Series as a Container
  10. Accessor and Custom Attrs
  11. BitArray-Backed Int Demo
      * Under development (GH22238)
  12. Overview
      • IntegerNA
        • Internals: values 1, ?, 2 and mask 0, 1, 0
        • Sum: 3, Min: 1, Max: 2
      • IntegerNA with BitArray backing
        • Internals: values 1, ?, 2 and mask 010
        • Sum: 3, Min: 1, Max: 2
  13. Behind the Scenes
      • Instead of giving each bool its own addressable unit (a byte), bitarray packs bool values into bit sequences
      • Serialization between bitarray and NumPy allows the EA to fit into the pandas paradigm
      * Under development - actual implementation may differ
  14. Potential Benefits
      • Memory footprint of the mask potentially reduced by up to a factor of 8 (using bits instead of bytes)
      • A theoretical bitarray-backed BoolNA implementation could reduce footprint by a factor of 16 (both values and mask go from bytes to bits)
  15. Closing Notes
  16. Further Research
      • Extending Pandas in the pandas documentation
      • Extension Arrays for Pandas by Tom Augspurger
      • Extending Pandas using Apache Arrow and Numba by Uwe Korn
  17. Contact Me!
      • GitHub - https://github.com/willayd
      • Email - will_ayd@innobi.io
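
The values-plus-mask layout from slides 5 and 7 can be sketched with plain NumPy. This is a minimal illustration rather than the pandas implementation. The 65,536-element length and the int16 dtype are assumptions, chosen only because they reproduce the 196_608 and 524_288 byte counts on slide 7, and all variable names are hypothetical.

```python
import numpy as np

# Slide 5 example: values 1, ?, 2 with mask 0, 1, 0 (True marks a missing slot).
values = np.array([1, 0, 2], dtype=np.int64)  # 0 is an arbitrary placeholder
mask = np.array([False, True, False])

valid = values[~mask]  # reductions skip the masked slots
total, lowest, highest = int(valid.sum()), int(valid.min()), int(valid.max())
print(total, lowest, highest)  # results stay integers instead of floats

# Slide 7 arithmetic, under the assumed length and dtype.
n = 65_536
as_float64 = np.zeros(n, dtype=np.float64)  # result of an implicit cast to float64
as_int16 = np.zeros(n, dtype=np.int16)      # assumed integer storage
byte_mask = np.zeros(n, dtype=bool)         # NumPy bools take one byte each

masked_bytes = as_int16.nbytes + byte_mask.nbytes
ratio = masked_bytes / as_float64.nbytes
print(masked_bytes, as_float64.nbytes, ratio)
```

Keeping the mask in a separate array is what lets the values keep their integer dtype; the cost is one extra byte per element for the mask.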

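The bit-packing idea from slides 13 and 14 can be approximated without the bitarray library. The sketch below swaps in NumPy's `packbits` and `unpackbits` as a stand-in, so it shows the byte-to-bit saving for the mask only and says nothing about how the work in progress actually serializes its data. The array length is again an assumption.

```python
import numpy as np

# Slide 12 mask: 0, 1, 0 stored as one byte per flag.
mask = np.array([False, True, False])

# Pack eight flags into each byte; the three bits are zero-padded on the right.
packed = np.packbits(mask)
print(packed)

# Unpack and trim the padding to recover the original mask.
restored = np.unpackbits(packed)[: mask.size].astype(bool)
print(restored)

# Footprint comparison for a larger mask (length is an assumed example).
n = 65_536
byte_mask = np.zeros(n, dtype=bool)
bit_mask = np.packbits(byte_mask)
factor = byte_mask.nbytes // bit_mask.nbytes
print(byte_mask.nbytes, bit_mask.nbytes, factor)
```

The round trip through `unpackbits` mirrors the point on slide 13: a packed mask has to be converted back to a regular NumPy array before it can take part in ordinary boolean indexing.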