Extending Pandas
Will Ayd
PyData Los Angeles
October 23, 2018
Motivation
• Pandas was historically bound to the NumPy type system and its limitations:
  • Integer and bool types cannot store missing data
  • Non-numeric types (e.g. Categorical, Datetime with time zone) are not natively supported
  • Custom types required extensive updates to pandas internals
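A small illustration of the first limitation, assuming a recent pandas with default settings: adding a single missing value to integer data silently changes the dtype to float64.

```python
import numpy as np
import pandas as pd

# Integer data with no missing values keeps an integer dtype.
ints = pd.Series([1, 2])

# One missing value forces the NumPy-backed Series to float64,
# because NumPy integers have no way to represent "missing".
with_missing = pd.Series([1, None, 2])

print(ints.dtype)
print(with_missing.dtype)
print(with_missing.tolist())
```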
Important Concepts
• Extension Type
  • Description of the data type; similar to a NumPy dtype
  • Can be registered for creation via string (e.g. …astype("foo"))
• Extension Array
  • Class which does the actual “heavy lifting”
  • No restrictions on construction, though it must be convertible to a NumPy array
  • Limited to one dimension, though may be backed by 0..n arrays
• “A Series is a container for an ‘array-like’ thing” - Tom Augspurger
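These concepts can be seen on the nullable integer type that ships with pandas. The sketch below assumes pandas 1.0 or later (for `to_numpy(na_value=...)`) and uses the built-in "Int64" string alias rather than a custom type.

```python
import numpy as np
import pandas as pd

# "Int64" (capital I) is the string alias registered for the
# nullable integer extension type.
s = pd.Series([1, None, 2], dtype="Int64")

# The Series is the container; the extension array holds the data.
arr = s.array

print(type(s.dtype).__name__)  # the extension type describing the data
print(type(arr).__name__)      # the extension array doing the heavy lifting
print(arr.ndim)                # extension arrays are one-dimensional

# Conversion to NumPy must always be possible; here missing becomes NaN.
as_numpy = s.to_numpy(dtype="float64", na_value=np.nan)
print(as_numpy)
```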
Integer NA Demo
IntegerNA Illustrated
Existing Implementation
• Internals: Values = [1.0, NA, 2.0]
• Sum: 3.0, Min: 1.0, Max: 2.0
Extension Implementation
• Internals: Values = [1, ?, 2], Mask = [0, 1, 0]
• Sum: 3, Min: 1, Max: 2
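The values-plus-mask idea can be sketched in plain Python. `MaskedInts` is a hypothetical stand-in written for this illustration, not the pandas class; it only shows how reductions skip masked slots while the values stay integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MaskedInts:
    """Hypothetical nullable integer container (illustration only)."""

    values: List[int]  # a slot whose mask entry is True holds arbitrary filler
    mask: List[bool]   # True marks a missing entry

    def _valid(self) -> List[int]:
        # Keep only the entries that are not masked out.
        return [v for v, missing in zip(self.values, self.mask) if not missing]

    def sum(self) -> int:
        return sum(self._valid())

    def min(self) -> int:
        return min(self._valid())

    def max(self) -> int:
        return max(self._valid())


# Values [1, ?, 2] with mask [0, 1, 0]; the filler under the mask is ignored.
arr = MaskedInts(values=[1, 0, 2], mask=[False, True, False])
print(arr.sum(), arr.min(), arr.max())
```

Because missing entries live in the mask, the results remain Python ints (3, 1, 2) instead of being promoted to floats.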
Getting Started with IntNA
Memory Reductions
196_608 / 524_288 = 37.5% of original memory by preventing implicit cast to float64
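The slide does not state the array length or integer width. The figures are consistent with 65,536 elements stored as 16-bit integers plus a one-byte-per-element mask, compared against 8-byte float64 values; under that assumption the arithmetic works out as follows.

```python
# Assumptions (not stated on the slide): 65,536 elements, 16-bit integers.
N_ELEMENTS = 65_536
FLOAT64_BYTES = 8  # bytes per element after an implicit cast to float64
INT16_BYTES = 2    # bytes per element for the integer values
MASK_BYTES = 1     # bytes per element for the boolean mask

float_total = N_ELEMENTS * FLOAT64_BYTES
masked_total = N_ELEMENTS * (INT16_BYTES + MASK_BYTES)
ratio = masked_total / float_total

print(f"{masked_total:_} / {float_total:_} = {ratio:.1%}")
```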
CyberPandas Demo
Series as a Container
Accessor and Custom Attrs
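A minimal sketch of a custom accessor, using the public `pd.api.extensions.register_series_accessor` decorator. The accessor name "demo" and the `n_missing` attribute are made up for this illustration and are not taken from CyberPandas.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("demo")  # "demo" is a hypothetical name
class DemoAccessor:
    def __init__(self, series: pd.Series) -> None:
        # pandas passes the Series the accessor is attached to.
        self._series = series

    @property
    def n_missing(self) -> int:
        """Custom attribute: number of missing entries in the Series."""
        return int(self._series.isna().sum())


s = pd.Series([1, None, 2], dtype="Int64")
print(s.demo.n_missing)
```

Once registered, the attribute is available on every Series as `series.demo.n_missing`.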
BitArray Backed Int Demo
* Under Development (GH22238)
Overview
IntegerNA
• Internals: Values = [1, ?, 2], Mask = [0, 1, 0]
• Sum: 3, Min: 1, Max: 2
IntegerNA with BitArray backing
• Internals: Values = [1, ?, 2], Mask = 010 (packed bits)
• Sum: 3, Min: 1, Max: 2
Behind the Scenes
• Instead of storing each bool in its own addressable unit (a byte), bitarray packs bool values into bit sequences
• Serialization between bitarray and NumPy allows the extension array (EA) to fit into the pandas paradigm
* Under Development - actual implementation may differ
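The packing idea can be sketched with the standard library alone. This is not the bitarray package or the implementation under development; `pack_bits` and `unpack_bits` are hypothetical helpers that store eight flags per byte, first flag in the highest bit.

```python
from typing import List


def pack_bits(flags: List[bool]) -> bytes:
    """Pack booleans into bytes, eight flags per byte, first flag in the high bit."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def unpack_bits(packed: bytes, count: int) -> List[bool]:
    """Recover the first `count` flags from a packed byte string."""
    return [bool(packed[i // 8] & (0x80 >> (i % 8))) for i in range(count)]


mask = [False, True, False]      # the 0, 1, 0 mask from the overview
packed = pack_bits(mask)
print(f"{packed[0]:08b}")        # bit pattern of the single packed byte
print(unpack_bits(packed, len(mask)))

big_mask = [False] * 64
print(len(big_mask), "flags ->", len(pack_bits(big_mask)), "bytes")
```

A one-byte-per-flag mask of 64 entries needs 64 bytes, while the packed form needs 8.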
Potential Benefits
• Memory footprint of the mask potentially reduced by up to a factor of 8 (using bits
instead of bytes)
• Theoretical bitarray backed BoolNA implementation could reduce footprint
by a factor of 16 (both values and mask go from bytes to bits)
Closing Notes
Further Research
• Extending Pandas in the pandas documentation
• Extension Arrays for Pandas by Tom Augspurger
• Extending Pandas using Apache Arrow and Numba by Uwe Korn
Contact Me!
• GitHub - https://github.com/willayd
• Email - will_ayd@innobi.io

Extending Pandas with Custom Types - Will Ayd