1
PyData Berlin 2018
Uwe L. Korn
Extending Pandas
using Apache Arrow and Numba
3
Strings, Strings, please give me Strings!
4
About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with a strong focus on Pandas
xhochy
mail@uwekorn.com
5
Agenda
1. Shortcomings of Pandas
2. ExtensionArrays
3. Arrow for storage
4. Numba for compute
5. All the stuff
6
Pandas Series
• Payload stored in a numpy.ndarray
• Index for data alignment
• Rich analytical API
• Accessors like .dt or .str
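The four points above can be sketched with a few lines of pandas. The values and labels are made up for illustration; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Payload: the values of a numeric Series are held in a NumPy array.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
payload = a.to_numpy()
print(type(payload))

# Index: arithmetic aligns on labels, not on position.
b = pd.Series([10, 20], index=["y", "z"])
total = a + b  # "x" has no partner in b, so the result there is NaN
print(total)

# Accessors expose type-specific methods.
years = pd.Series(pd.to_datetime(["2018-07-06", "2018-07-08"])).dt.year
upper = pd.Series(["arrow", "numba"]).str.upper()
print(years.tolist(), upper.tolist())
```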
7
Shortcomings
• Limited to NumPy data types, otherwise object
• NumPy’s focus is numerical data and tensors
• Pandas performs well when NumPy performs well
• Most common complaints:
• no native variable-length strings
• integers are non-nullable
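A small NumPy sketch of the two complaints, with made-up values. It reflects the classic NumPy dtypes that were available at the time of the talk; newer NumPy releases may offer additional options.

```python
import numpy as np

# Strings: the classic choices are fixed-width unicode (every slot is as wide
# as the longest string) or generic Python objects.
fixed = np.array(["a", "a much longer string"])
print(fixed.dtype, fixed.dtype.itemsize)  # every element reserves the full width

with_missing = np.array(["a", None])
print(with_missing.dtype)  # falls back to object

# Integers: there is no bit pattern for "missing", so NaN forces a float array.
ints = np.array([1, 2, 3])
with_nan = np.array([1, 2, np.nan])
print(ints.dtype, with_nan.dtype)
```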
8
What’s the problem?
10
Why are objects bad?
Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016
https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
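The cited chapter makes the point that a Python integer is a full object rather than a bare machine integer. A rough way to see the overhead with the standard library alone (exact sizes depend on the interpreter, so none are assumed here):

```python
import sys
from array import array

values = list(range(1000, 2000))

# A list stores pointers; every element is a separate, complete Python object
# with its own header (reference count, type pointer, ...).
per_object = sys.getsizeof(values[0])

# A typed array stores the raw 64-bit integers back to back.
packed = array("q", values)

print("bytes per Python int object:", per_object)
print("bytes per packed int64:     ", packed.itemsize)
```

An object-dtype column behaves like the list: the array itself only holds pointers, and each access has to follow one and check the type.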
11
Extending Pandas (0.23+)
• Two new interfaces:
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
12
10x !!1
13
Extending Pandas (0.23+)
• _from_sequence
• _from_factorized
• __getitem__
• __len__
• dtype
• nbytes
• isna
• copy
• _concat_same_type
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
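A minimal sketch of the two interfaces, backed by a plain object array rather than Arrow. `ToyStringDtype` and `ToyStringArray` are hypothetical names invented for this illustration. The sketch also defines `take`, which the pandas documentation lists as required although it is not on the slide. Newer pandas versions may expect further methods for full functionality.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype, take


class ToyStringDtype(ExtensionDtype):
    """Hypothetical dtype: answers 'what type of scalars?'."""
    name = "toy_string"
    type = str
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return ToyStringArray


class ToyStringArray(ExtensionArray):
    """Hypothetical array: the basic operations pandas builds on."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]          # scalar access
        return type(self)(self._data[item])  # slice, mask or index array

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return ToyStringDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.array([v is None for v in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value
        result = take(self._data, indices, allow_fill=allow_fill,
                      fill_value=fill_value)
        return type(self)(result)

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([a._data for a in to_concat]))


arr = ToyStringArray._from_sequence(["a", None, "ccc"])
series = pd.Series(arr)
print(series.dtype, series.isna().tolist())
```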
14
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (GLib),
Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
15
Nice properties
• More native datatypes: string, date, nullable int, list of X, …
• Everything is nullable
• Memory can be chunked
• Zero-copy to other ecosystems like Java / R
• Highly efficient I/O
16
Not so nice properties
• Still a young project
• Not much analytics on top (yet!)
• Core is in modern C++
• Extremely fast but hard to extend in Python
17
Writing Algorithms in Python is easy!
but slow
18
Photo by Matthew Brodeur on Unsplash
19
Fast for-loops with Numba
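A minimal sketch of the idea: an explicit Python loop, compiled by Numba's `njit` decorator. The function and data are made up for illustration. The fallback decorator is only there so the sketch still runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: plain Python, no speed-up
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # A plain for-loop over array elements; Numba compiles it to machine code.
    total = 0.0
    for i in range(len(values)):
        total += values[i] * values[i]
    return total


data = np.arange(5, dtype=np.float64)
print(sum_of_squares(data))  # 0 + 1 + 4 + 9 + 16 = 30.0
```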
20
Anatomy of an Arrow StringArray
• 3 memory buffers
• bitmap to indicate valid (non-null) entries
• int32 array of offsets: "where does each string start"
• uint8 array of characters (UTF-8 encoded)
• int64 offset
• allows zero-copy slicing
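The layout can be reproduced in pure Python to make the three buffers and the array offset concrete. This is an illustrative sketch, not Arrow's implementation. The helper names are hypothetical, and it uses signed 32-bit offsets and least-significant-bit-first validity bits, following the Arrow format specification.

```python
import struct


def build_string_buffers(values):
    """Encode a list of str/None into (validity bitmap, offsets, data)."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark entry i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))           # a null repeats the previous offset
    packed_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(bitmap), packed_offsets, bytes(data)


def get_string(buffers, i, offset=0):
    """Read logical element i of an array starting at physical position offset."""
    bitmap, offsets, data = buffers
    j = offset + i
    if not (bitmap[j // 8] >> (j % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets, 4 * j)
    return data[start:end].decode("utf-8")


buffers = build_string_buffers(["a", None, "ccc", "dé"])
print([get_string(buffers, i) for i in range(4)])

# A slice [2:4] reuses the same buffers; only the array offset changes.
print([get_string(buffers, i, offset=2) for i in range(2)])
```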
21
Numba @jitclass
22
Numba @jitclass
23
Photo by Niklas Tidbury on Unsplash
24
Fletcher
https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
25
Demo
26
Fletcher Demo
27
Fletcher Demo
28
Fletcher Demo
29
Fletcher Demo
30
ExtensionArray Implementations
• IPArray: https://github.com/ContinuumIO/cyberpandas
• GeometryArray (PR): https://github.com/geopandas/geopandas
• Apache Arrow + Numba backed arrays (WIP): https://github.com/xhochy/fletcher
31
Photo by Israel Sundseth on Unsplash
pip install fletcher
32
By JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
24-26 October
+ 2 days of sprints (27-28 October)
ZKM Karlsruhe, DE
Call for Participation opens next week.
33
I’m Uwe Korn
Twitter: @xhochy
https://github.com/xhochy
Thank you!
