Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Keynote from PyData Berlin 2016-05-21
1.
© Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
2.
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
3.
In process: Python for Data Analysis: 2nd Edition
Coming early 2017
4.
Building open source communities
5.
Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.
(Wikipedia)
6.
Step 1: Be open and transparent
7.
Step 2: Reach out to others
8.
Step 3: Strive for consensus
9.
Step 4: Value contributions extending beyond lines of code
10.
Step 5: Make things harder for bad actors
11.
12.
Handling problems carefully
13.
http://numfocus.org
http://apache.org
14.
Python packaging
15.
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
16.
Reflecting on the past
17.
18.
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools

    conda config --add channels conda-forge
19.
What’s important to me right now?
20.
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
21.
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Diagram: memory working set; three consumers]
22.
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
23.
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
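As a small illustration of the "strided, homogeneous buffer" and "dtypes" points, the sketch below inspects an ndarray's metadata. The array shape and variable names are chosen only for this example and are not from the talk.

```python
import numpy as np

# A 3x4 array of 64-bit integers: one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The dtype is the shared metadata: every element is an 8-byte integer.
itemsize = a.dtype.itemsize

# Strides give the byte step along each axis (row-major layout here).
row_stride, col_stride = a.strides

# A transposed view reuses the same buffer with swapped strides; no copy is made.
t = a.T
shares_buffer = np.shares_memory(a, t)

# Any consumer of the Python buffer protocol sees the same bytes and shape.
view = memoryview(a)
```

Because shape, strides, and dtype fully describe the buffer, libraries can exchange ndarrays without copying. How ownership of that memory is handed over is still left to each library, which is the gap the last bullet points at.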
24.
Problems NumPy doesn’t solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
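A few of these gaps can be seen directly in NumPy. The values below are made up for illustration; the snippet only shows which dtype NumPy falls back to in each case.

```python
import numpy as np

# Integers have no slot for "missing". Using NaN as a sentinel
# silently turns the data into float64.
ints = np.array([1, 2, 3], dtype=np.int64)
with_missing = np.array([1, np.nan, 3])

# Mixing in None gives a generic object array of Python pointers.
as_objects = np.array([1, None, 3])

# Nested, JSON-like records also end up as object dtype.
nested = np.array([{"a": [1, 2]}, {"a": []}], dtype=object)

# Fixed-width string dtype pads every element to the longest value
# (6 characters here, 4 bytes per character).
strings = np.array(["a", "pandas", "arrow"])
```

Object arrays lose the homogeneous-buffer property from the previous slide, so they cannot be shared with other systems as raw memory.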
25.
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
26.
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
27.
Focus on CPU Efficiency

        session_id   timestamp         source_ip
Row 1   1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2   1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3   1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4   1331261196   3/8/2012 6:46PM   76.102.156.138

Traditional Memory Buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values laid out column by column (session_id, timestamp, source_ip)

• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
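The difference between the two layouts can be sketched with plain Python containers, using the values from the slide. This is only an analogy: it does not reproduce Arrow's actual buffers, and the container choices are assumptions made for this example.

```python
from array import array

# Row-oriented: each record's fields are stored together.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented: each field's values are stored together.
columns = {
    "session_id": array("q", (r[0] for r in rows)),  # packed 64-bit integers
    "timestamp": [r[1] for r in rows],
    "source_ip": [r[2] for r in rows],
}

# Scanning one field touches only that column's contiguous buffer...
max_session_columnar = max(columns["session_id"])
# ...whereas the row layout visits every record to pull out one field.
max_session_rows = max(r[0] for r in rows)
```

Only the `session_id` column here is a truly packed buffer. The string columns are ordinary Python lists, which a real columnar format would also store contiguously.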
28.
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
29.
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
30.
Real World Example: Feather File Format for Python and R

R
    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python
    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
31.
More on Feather
[Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library sits between the file and the language bindings: Rcpp for R data.frame, Cython for pandas DataFrame.]
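The "arrays first, metadata last" layout in the diagram can be illustrated with a toy container. This is not the Feather format: the JSON footer, the 4-byte length suffix, and the function names `write_toy_file` and `read_toy_column` are all hypothetical and exist only for this sketch.

```python
import io
import json
import struct
from array import array

def write_toy_file(columns):
    """Write column buffers back to back, then a metadata footer."""
    buf = io.BytesIO()
    meta = []
    for name, col in columns.items():
        data = col.tobytes()
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": buf.tell(), "nbytes": len(data)})
        buf.write(data)
    footer = json.dumps(meta).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length goes last
    return buf.getvalue()

def read_toy_column(blob, name):
    """Locate one column through the footer and read only its bytes."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    meta = json.loads(blob[-4 - footer_len:-4])
    entry = next(m for m in meta if m["name"] == name)
    col = array(entry["typecode"])
    col.frombytes(blob[entry["offset"]:entry["offset"] + entry["nbytes"]])
    return col

blob = write_toy_file({"x": array("q", [1, 2, 3]),
                       "y": array("d", [0.5, 1.5, 2.5])})
y = read_toy_column(blob, "y")
```

Putting the metadata at the end lets a writer stream the arrays out first and record their offsets afterwards, and lets a reader fetch a single column without scanning the others.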
32.
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
33.
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
34.
Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
35.
Real World Example: Python With Spark, Drill, Impala
[Diagram: the SQL engine passes in partition 0 … in partition n - 1 as input to Python functions running user-supplied Python code; their output goes back to the SQL engine as out partition 0 … out partition n - 1.]
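The partition-in, partition-out flow in the diagram can be sketched in pure Python. No real engine is involved: `run_python_udf`, `add_total`, and the sample rows are hypothetical stand-ins for the engine and the user-supplied code.

```python
from typing import Callable, Iterable, List

Partition = List[dict]

def run_python_udf(in_partitions: Iterable[Partition],
                   func: Callable[[Partition], Partition]) -> List[Partition]:
    """Apply user-supplied code to each input partition independently."""
    return [func(part) for part in in_partitions]

def add_total(part: Partition) -> Partition:
    # Hypothetical user function: derive one column from two others.
    return [{**row, "total": row["price"] * row["qty"]} for row in part]

in_partitions = [
    [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 2}],
    [{"price": 4.0, "qty": 1}],
]
out_partitions = run_python_udf(in_partitions, add_total)
```

In a real deployment each partition crosses a process boundary on the way in and on the way out, which is where a shared memory format would remove the serialization cost described earlier in the talk.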
36.
Get Involved in Arrow
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
37.
Thank you
Wes McKinney @wesmckinn
Views are my own