Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

PyCon.DE Karlsruhe 2018
Uwe L. Korn
2
About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy focus around Pandas
xhochy
mail@uwekorn.com
3
What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, MATLAB and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
4
February 2016: Birth of Apache Arrow
Just a goal…
5
Data Science Workflow in 2018
[Diagram: SQL Engine; pre-processing with pandas; Python machine learning model; probability density function (PDF)]
6
Looks simple?
• It isn’t.
• "Data" is a very heterogeneous landscape
• Most common setup:
  • Java/Scala, i.e. JVM, for data processing
  • Python for machine learning
7
Data Science Workflow in 2018
[Diagram: SQL Engine; JDBC Driver; JayDeBeApi; pre-processing with pandas; Python machine learning model. Labels: "JDBC rows", "Python rows"]
8
org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Do conversion of rows to columns in the JVM
• Data is stored "off-heap", i.e.:
  • not managed by the JVM
  • native memory layout, same as in pyarrow
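The row-to-column conversion on this slide can be pictured with a small pure-Python sketch. It is not the adapter's implementation (that runs in the JVM and writes into off-heap Arrow buffers); the function name `rows_to_columns` and the sample rows are made up for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple


def rows_to_columns(
    names: Sequence[str], rows: Sequence[Tuple[Any, ...]]
) -> Dict[str, List[Any]]:
    """Pivot row tuples (as a row-oriented driver yields them) into one list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name in names}
    for row in rows:
        if len(row) != len(names):
            raise ValueError("row width does not match the number of column names")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# Hypothetical result set: (id, price) rows as a row-oriented driver would return them.
rows = [(1, 9.5), (2, 7.25), (3, 12.0)]
columns = rows_to_columns(["id", "price"], rows)
print(columns)
```

Doing this pivot once, next to the driver, means the consumer receives whole columns instead of having to touch every row object again.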
9
Workflow in 2018 with Arrow
[Diagram: SQL Engine; JDBC Driver; org.apache.arrow.adapter.jdbc; pre-processing with pandas; Python machine learning model. Labels: "JDBC rows", "Arrow", "?"]
10
So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
  • It’s fast
  • Still involves a copy over the network
• Arrow → pandas conversion is tuned but still a copy
11
pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translation of the helper objects
• Actually passes memory addresses around
No copy between the JVM and Python!
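The idea of "passing memory addresses around" can be sketched with the standard library alone. This is only an illustration of the principle, not how pyarrow.jvm is implemented: a stdlib `array` stands in for an off-heap Arrow buffer, and the names `producer` and `view` are made up.

```python
import array
import ctypes

# Producer side: a contiguous buffer of 64-bit integers
# (a stand-in for an Arrow data buffer that lives outside the consumer).
producer = array.array("q", [10, 20, 30, 40])
address, length = producer.buffer_info()  # (memory address, number of items)

# Consumer side: wrap the same memory by address; no bytes are copied.
# The producer must stay alive for as long as the view is used.
Int64Array = ctypes.c_int64 * length
view = Int64Array.from_address(address)

# Both names refer to the same memory, so a write through one
# is visible through the other.
view[0] = 99
print(producer[0])
print(list(view))
```

Only the address and the length cross the boundary; the small wrapper objects on each side are what the slide calls "helper objects".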
12
NumPy & the BlockManager
Photo by Susan Holt Simpson on Unsplash
13
Pandas Shortcomings
• Limited to NumPy data types, otherwise object
• Columns are not separate, grouped by type
• Nullability is not type-safe (yet)
→ Arrow memory does not match Pandas memory
→ Copy 😢
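One place where the two layouts differ is the representation of missing values. The sketch below contrasts a NaN sentinel, which forces an integer column to become floating point, with an Arrow-style validity bitmap that keeps the integers and tracks nulls separately. It uses only the standard library; the helper names are hypothetical and the real libraries handle many more details.

```python
import array
import math
from typing import List, Optional, Tuple


def to_nan_sentinel(values: List[Optional[int]]) -> array.array:
    """NumPy-style: a missing value is NaN, so the whole column becomes float64."""
    return array.array("d", [math.nan if v is None else float(v) for v in values])


def to_validity_bitmap(values: List[Optional[int]]) -> Tuple[array.array, bytearray]:
    """Arrow-style: int64 data plus a separate bitmap; bit i set means slot i is valid."""
    data = array.array("q", [0 if v is None else v for v in values])
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return data, bitmap


def is_valid(bitmap: bytearray, i: int) -> bool:
    """Check whether slot i holds a value (True) or a null (False)."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


column = [1, None, 3]
as_float = to_nan_sentinel(column)
data, bitmap = to_validity_bitmap(column)
print(as_float.typecode, data.typecode, [is_valid(bitmap, i) for i in range(len(column))])
```

Because the two representations store different bytes, moving between them requires writing a new buffer, which is the copy the slide refers to.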
14
Pandas ExtensionArrays
• New interfaces introduced in pandas 0.23
  • ExtensionDtype
    • What type of scalars?
  • ExtensionArray
    • Implement basic array ops
    • Pandas provides algorithms on top
• Still experimental; wait for 0.24
15
Photo by Niklas Tidbury on Unsplash
16
fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!
17
Workflow in 2018 with Arrow
[Diagram: SQL Engine; JDBC Driver; org.apache.arrow.adapter.jdbc; pyarrow.jvm / fletcher; pre-processing with pandas; Python machine learning model. Labels: "JDBC rows", "Arrow"]
18
???
Does it work?
21
Make your best decision today.
blueyonder.ai/en/careers
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway, Suite 250
Plano, Texas 75024, USA
22
Get Involved!
Apache Arrow: Cross language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• Github mirror: https://github.com/apache/arrow
Apache Parquet: Famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ Github mirror: https://github.com/apache/parquet-cpp