Extending Pandas using Apache Arrow and Numba

Uwe Korn
Uwe KornML / Data Engineer
1
PyData Berlin 2018
Uwe L. Korn
Extending Pandas
using Apache Arrow and Numba
2
PyData Berlin 2018
Uwe L. Korn
Extending Pandas
using Apache Arrow and Numba
3
PyData Berlin 2018
Uwe L. Korn
Strings, Strings, please give me Strings!
4
• Senior Data Scientist at Blue Yonder
(@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy
focus around Pandas
About me
xhochy
mail@uwekorn.com
5
1. Shortcomings of Pandas
2. ExtensionArrays
3. Arrow for storage
4. Numba for compute
5. All the stuff
Agenda
6
Pandas Series
• Payload stored in a numpy.ndarray
• Index for data alignment
• Rich analytical API
• Accessors like .dt or .str
7
Shortcomings
• Limited to NumPy data types, otherwise object
• NumPy’s focus is numerical data and tensors
• Pandas performs well when NumPy performs well
• Most popular:
• no native variable-length strings
• integers are non-nullable
8
What’s the problem?
9
What’s the problem?
10
Why are objects bad?
Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016
https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
11
Extending Pandas (0.23+)
• Two new interfaces:
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
10x !!112
13
Extending Pandas (0.23+)
• _from_sequence
• _from_factorized
• __getitem__
• __len__
• dtype
• nbytes
• isna
• copy
• _concat_same_type
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
13
14
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ..)
• Exchange data without conversion between Python, C++, C(glib),
Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
15
Nice properties
• More native datatypes: string, date, nullable int, list of X, …
• Everything is nullable
• Memory can be chunked
• Zero-copy to other ecosystems like Java / R
• Highly efficient I/O
16
Not so nice properties
• Still a young project
• Not much analytic on top (yet!)
• Core is in modern C++
• Extremely fast but hard to extend in Python
17
Writing Algorithms in Python is easy!
but slow
18
Photo by Matthew Brodeur on Unsplash
19
Fast for-loops with Numba
20
Anatomy of an Arrow StringArray
• 3 memory buffers
• bitmap to indicate valid (non-null) entries
• uint32 array of offsets:„where does the string start“
• uint8 array of characters (UTF-8 encoded)
• int64 offset
• allows zero-copy slicing
21
Numba @jitclass
22
Numba @jitclass
23 Photo by Niklas Tidbury on Unsplash
24
Fletcher
https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytic on top
Demo25
26
Fletcher Demo
27
Fletcher Demo
28
Fletcher Demo
29
Fletcher Demo
30
ExtensionArray Implementations
https://github.com/ContinuumIO/cyberpandas
IPArray
(PR) https://github.com/geopandas/geopandas
GeometryArray
(WIP) https://github.com/xhochy/fletcher
Apache Arrow + Numba backed Arrays
31 Photo by Israel Sundseth on Unsplash
pip install fletcher
32
By JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
By JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
24. - 26. October
+ 2 days of sprints (27/28.10.)
ZKM Karlsruhe, DEKarlsruhe
Call for Participation opens next week.
33
I’m Uwe Korn
Twitter: @xhochy
https://github.com/xhochy
Thank you!
1 of 33

Recommended

Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy by
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyUwe Korn
1.7K views22 slides
Apache Arrow: Cross-language Development Platform for In-memory Data by
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
6.6K views30 slides
Future of pandas by
Future of pandasFuture of pandas
Future of pandasJeff Reback
5.4K views53 slides
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par... by
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...Uwe Korn
1.4K views20 slides
Improving data interoperability in Python and R by
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
2.6K views14 slides
Ursa Labs and Apache Arrow in 2019 by
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
4.2K views30 slides

More Related Content

What's hot

Apache Arrow -- Cross-language development platform for in-memory data by
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
2.9K views23 slides
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P... by
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
103.9K views48 slides
Data Analysis and Statistics in Python using pandas and statsmodels by
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsWes McKinney
19.8K views29 slides
DataFrames: The Extended Cut by
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
8.5K views34 slides
Apache Arrow at DataEngConf Barcelona 2018 by
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
2K views37 slides
PyCon Singapore 2013 Keynote by
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynoteWes McKinney
94.6K views19 slides

What's hot(20)

Apache Arrow -- Cross-language development platform for in-memory data by Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney2.9K views
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P... by Wes McKinney
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney103.9K views
Data Analysis and Statistics in Python using pandas and statsmodels by Wes McKinney
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
Wes McKinney19.8K views
DataFrames: The Extended Cut by Wes McKinney
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney8.5K views
Apache Arrow at DataEngConf Barcelona 2018 by Wes McKinney
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney2K views
PyCon Singapore 2013 Keynote by Wes McKinney
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney94.6K views
Python for Financial Data Analysis with pandas by Wes McKinney
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney61.8K views
Rust is for "Big Data" by Andy Grove
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"
Andy Grove2.6K views
pandas: Powerful data analysis tools for Python by Wes McKinney
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney9.8K views
Apache Arrow Workshop at VLDB 2019 / BOSS Session by Wes McKinney
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney2.5K views
ACM TechTalks : Apache Arrow and the Future of Data Frames by Wes McKinney
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney2K views
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie... by Uwe Korn
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Uwe Korn757 views
What's new in pandas and the SciPy stack for financial users by Wes McKinney
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney11.8K views
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by Wes McKinney
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney2.1K views
Memory Interoperability in Analytics and Machine Learning by Wes McKinney
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney5.6K views
DataFrames: The Good, Bad, and Ugly by Wes McKinney
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
Wes McKinney12.9K views
Data Science Languages and Industry Analytics by Wes McKinney
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney5.5K views
Apache Arrow Flight: A New Gold Standard for Data Transport by Wes McKinney
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney2.2K views

Similar to Extending Pandas using Apache Arrow and Numba

Data Science meets Software Development by
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
1.1K views60 slides
Session 2 by
Session 2Session 2
Session 2HarithaAshok3
57 views48 slides
3 python packages by
3 python packages3 python packages
3 python packagesFEG
57 views13 slides
Kaggle tokyo 2018 by
Kaggle tokyo 2018Kaggle tokyo 2018
Kaggle tokyo 2018Cournapeau David
130 views25 slides
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems by
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsUwe Korn
422 views35 slides
Python indroduction by
Python indroductionPython indroduction
Python indroductionFEG
96 views35 slides

Similar to Extending Pandas using Apache Arrow and Numba(20)

Data Science meets Software Development by Alexis Seigneurin
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
Alexis Seigneurin1.1K views
3 python packages by FEG
3 python packages3 python packages
3 python packages
FEG57 views
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems by Uwe Korn
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
Uwe Korn422 views
Python indroduction by FEG
Python indroductionPython indroduction
Python indroduction
FEG96 views
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx by kalai75
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptxQ-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
kalai7510 views
Data Science at Scale: Using Apache Spark for Data Science at Bitly by Sarah Guido
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido5.4K views
Data Science at Scale by Sarah Guido by Spark Summit
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit1.9K views
The Background Noise of the Internet by Andrew Morris
The Background Noise of the InternetThe Background Noise of the Internet
The Background Noise of the Internet
Andrew Morris7.8K views
PyData: The Next Generation | Data Day Texas 2015 by Cloudera, Inc.
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.1.8K views
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca... by Uwe Korn
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
Uwe Korn407 views
Игорь Фесенко "Direction of C# as a High-Performance Language" by Fwdays
Игорь Фесенко "Direction of C# as a High-Performance Language"Игорь Фесенко "Direction of C# as a High-Performance Language"
Игорь Фесенко "Direction of C# as a High-Performance Language"
Fwdays1.1K views
The Joy of SciPy by kammeyer
The Joy of SciPyThe Joy of SciPy
The Joy of SciPy
kammeyer4K views
Travis Oliphant "Python for Speed, Scale, and Science" by Fwdays
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"
Fwdays218 views

Recently uploaded

CRIJ4385_Death Penalty_F23.pptx by
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
7 views24 slides
Data Journeys Hard Talk workshop final.pptx by
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptxinfo828217
10 views18 slides
[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptx by
[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptx[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptx
[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptxDataScienceConferenc1
5 views15 slides
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptxayeshabaig2004
7 views30 slides
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx by
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptxDataScienceConferenc1
6 views16 slides
LIVE OAK MEMORIAL PARK.pptx by
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptxms2332always
7 views6 slides

Recently uploaded(20)

CRIJ4385_Death Penalty_F23.pptx by yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views
Data Journeys Hard Talk workshop final.pptx by info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821710 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20047 views
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx by DataScienceConferenc1
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
LIVE OAK MEMORIAL PARK.pptx by ms2332always
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptx
ms2332always7 views
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ... by DataScienceConferenc1
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
Organic Shopping in Google Analytics 4.pdf by GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials16 views
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ... by DataScienceConferenc1
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... by DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
PRIVACY AWRE PERSONAL DATA STORAGE by antony420421
PRIVACY AWRE PERSONAL DATA STORAGEPRIVACY AWRE PERSONAL DATA STORAGE
PRIVACY AWRE PERSONAL DATA STORAGE
antony4204215 views
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra17 views
UNEP FI CRS Climate Risk Results.pptx by pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M... by DataScienceConferenc1
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...

Extending Pandas using Apache Arrow and Numba