Extending Pandas using Apache Arrow and Numba

1
PyData Berlin 2018
Uwe L. Korn
Extending Pandas
using Apache Arrow and Numba

3
Strings, Strings, please give me Strings!

4
About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with a heavy focus on Pandas
xhochy
mail@uwekorn.com

5
Agenda
1. Shortcomings of Pandas
2. ExtensionArrays
3. Arrow for storage
4. Numba for compute
5. All the stuff

6
Pandas Series
• Payload stored in a numpy.ndarray
• Index for data alignment
• Rich analytical API
• Accessors like .dt or .str
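
The four points above can be illustrated with a short sketch. It assumes pandas and NumPy are installed; all values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Payload: the values of a numeric Series live in a NumPy array.
prices = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
payload = prices.values

# Index alignment: values are matched by label, not by position.
discount = pd.Series([0.5, 1.0], index=["b", "c"])
net = prices - discount  # label "a" has no partner and becomes NaN

# Rich analytical API: reductions, grouping, joins and so on.
mean_price = prices.mean()

# Accessors: .str and .dt expose type-specific, vectorised methods.
upper = pd.Series(["arrow", "numba"]).str.upper()
years = pd.Series(pd.to_datetime(["2018-07-01"])).dt.year
```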

7
Shortcomings
• Limited to NumPy data types; everything else is stored as object
• NumPy's focus is numerical data and tensors
• Pandas performs well when NumPy performs well
• Most common complaints:
  • no native variable-length strings
  • integers are non-nullable
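
Both complaints can be reproduced in a few lines. This is a minimal sketch, assuming NumPy and pandas are installed; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy strings are fixed-width: every slot is as wide as the longest value.
fixed = np.array(["a", "abcdef"])

# The alternative is an array of pointers to boxed Python objects.
boxed = np.array(["a", "abcdef", None], dtype=object)

# NumPy integers have no representation for "missing".
ints = np.array([1, 2, 3], dtype=np.int64)
try:
    ints[0] = np.nan
    nan_accepted = True
except ValueError:
    nan_accepted = False

# An integer Series therefore falls back to float64 once a value is missing.
counts = pd.Series([1, 2, 3])
with_gap = counts.reindex([0, 1, 2, 3])  # label 3 does not exist
```

Here the short string "a" occupies as much space as "abcdef", and the integer column silently changes its type as soon as a gap appears.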

10
Why are objects bad?
Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016
https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html

11
Extending Pandas (0.23+)
• Two new interfaces:
  • ExtensionDtype
    • What type of scalars?
  • ExtensionArray
    • Implement basic array ops
    • Pandas provides algorithms on top

13
Extending Pandas (0.23+)
• _from_sequence
• _from_factorized
• __getitem__
• __len__
• dtype
• nbytes
• isna
• copy
• _concat_same_type
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
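
To make the list concrete, here is a minimal sketch of an array that implements exactly the methods named on this slide. `OptionalTextDtype` and `OptionalTextArray` are hypothetical names invented for this example, the storage is a plain Python list rather than anything efficient, and the keyword arguments on `_from_sequence` follow newer pandas releases than 0.23.

```python
import numpy as np
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class OptionalTextDtype(ExtensionDtype):
    """Hypothetical dtype: answers "what type of scalars?"."""

    name = "optional_text"
    type = str

    @classmethod
    def construct_array_type(cls):
        return OptionalTextArray


class OptionalTextArray(ExtensionArray):
    """Hypothetical array of str-or-None, backed by a Python list."""

    def __init__(self, values):
        self._data = list(values)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]  # scalar access
        if isinstance(item, slice):
            return type(self)(self._data[item])
        # Otherwise assume integer positions or a boolean mask.
        idx = np.asarray(item)
        if idx.dtype == bool:
            idx = np.flatnonzero(idx)
        return type(self)([self._data[i] for i in idx])

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return OptionalTextDtype()

    @property
    def nbytes(self):
        # Count only the UTF-8 payload, not Python object overhead.
        return sum(len(v.encode("utf-8")) for v in self._data if v is not None)

    def isna(self):
        return np.array([v is None for v in self._data], dtype=bool)

    def copy(self):
        return type(self)(self._data)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls([v for arr in to_concat for v in arr._data])


arr = OptionalTextArray._from_sequence(["foo", None, "bär"])
both = OptionalTextArray._concat_same_type([arr, arr[:1]])
```

The pandas documentation linked above also lists `take` among the required methods; it is left out here to mirror the slide, so this sketch is not a complete implementation.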

14
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, MATLAB and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

15
Nice properties
• More native datatypes: string, date, nullable int, list of X, …
• Everything is nullable
• Memory can be chunked
• Zero-copy to other ecosystems like Java / R
• Highly efficient I/O

16
Not so nice properties
• Still a young project
• Not much analytics on top (yet!)
• Core is in modern C++
• Extremely fast but hard to extend in Python

17
Writing Algorithms in Python is easy!
but slow

18
Photo by Matthew Brodeur on Unsplash

20
Anatomy of an Arrow StringArray
• 3 memory buffers
  • bitmap to indicate valid (non-null) entries
  • int32 array of offsets: "where does the string start"
  • uint8 array of characters (UTF-8 encoded)
• int64 offset
  • allows zero-copy slicing
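
The layout above can be rebuilt by hand in pure Python. This is an illustrative sketch only, assuming 32-bit little-endian offsets and least-significant-bit-first ordering in the bitmap; `build_string_buffers` and `get_item` are hypothetical helpers, not part of any Arrow API.

```python
import struct


def build_string_buffers(values):
    """Return (validity bitmap, offsets, character data) for a list of str/None."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)  # mark entry i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))  # a null entry repeats the previous offset
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), offsets_buf, bytes(data)


def get_item(validity, offsets_buf, data, i, offset=0):
    """Read entry i of a view that starts `offset` entries into the buffers."""
    j = offset + i
    if not (validity[j // 8] >> (j % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * j)
    return data[start:end].decode("utf-8")


validity, offsets_buf, data = build_string_buffers(["foo", None, "bär"])
# "Slicing off" the first entry only changes the offset; no buffer is copied.
second = get_item(validity, offsets_buf, data, 0, offset=1)
third = get_item(validity, offsets_buf, data, 1, offset=1)
```

Note that there is one more offset than there are strings, so the end of the last string is always known.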

23 Photo by Niklas Tidbury on Unsplash

24
Fletcher
https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
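
As a rough sketch of the idea (not fletcher's actual code), a Numba-compiled loop can work directly on the raw buffers of a string array. The function name `string_lengths` and the sample buffers are assumptions made for this example, and the `try`/`except` only exists so the snippet still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(func):
        # Fallback: run the plain Python function if Numba is missing.
        return func


@njit
def string_lengths(validity, offsets, out):
    """Write the UTF-8 byte length of each string into out, -1 for nulls."""
    for i in range(len(out)):
        if (validity[i >> 3] >> (i & 7)) & 1:
            out[i] = offsets[i + 1] - offsets[i]
        else:
            out[i] = -1


# Buffers for the three entries "foo", null, "bär" (4 bytes in UTF-8).
validity = np.array([0b00000101], dtype=np.uint8)
offsets = np.array([0, 3, 3, 7], dtype=np.int32)
lengths = np.empty(3, dtype=np.int64)
string_lengths(validity, offsets, lengths)
```

The loop reads like ordinary Python but never creates a Python string object, which is what makes compiling it worthwhile.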

30
ExtensionArray Implementations
• IPArray: https://github.com/ContinuumIO/cyberpandas
• GeometryArray (PR): https://github.com/geopandas/geopandas
• Apache Arrow + Numba backed arrays (WIP): https://github.com/xhochy/fletcher

31 Photo by Israel Sundseth on Unsplash
pip install fletcher

32
By JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
24-26 October
+ 2 days of sprints (27-28 October)
ZKM Karlsruhe, DE
Call for Participation opens next week.

33
I’m Uwe Korn
Twitter: @xhochy
https://github.com/xhochy
Thank you!
