Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
Keynote from PyData Berlin 2016-05-21
1.
© Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
2.
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
3.
In process: Python for Data Analysis, 2nd Edition
Coming early 2017
4.
Building open source communities
5.
"Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals."
Wikipedia
6.
Step 1: Be open and transparent
7.
Step 2: Reach out to others
8.
Step 3: Strive for consensus
9.
Step 4: Value contributions extending beyond lines of code
10.
Step 5: Make things harder for bad actors
12.
Handling problems carefully
13.
http://numfocus.org
http://apache.org
14.
Python packaging
15.
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
16.
Reflecting on the past
18.
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools

    conda config --add channels conda-forge
19.
What's important to me right now?
20.
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
21.
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Diagram: a memory working set shared by three consumers]
22.
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
23.
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray: strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
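As a rough illustration of what "common memory representation plus common metadata" means, the sketch below uses only the standard library's buffer protocol (`array` and `memoryview`), which is the same protocol NumPy arrays expose. It is not NumPy itself, and the 2x3 shape and the values are made up for the example.

```python
from array import array

# One flat, homogeneous buffer of six C doubles.
flat = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Reinterpret the same memory as a 2x3 table without copying.
# memoryview.cast only reshapes to or from a byte format, hence the two steps.
grid = memoryview(flat).cast("B").cast("d", shape=[2, 3])

# The metadata that lets any consumer interpret the raw bytes.
layout = {
    "format": grid.format,      # element type code ("d" = C double)
    "itemsize": grid.itemsize,  # bytes per element
    "shape": grid.shape,        # logical dimensions
    "strides": grid.strides,    # bytes to step along each dimension
}
print(layout)

# The view shares memory with the original buffer: a write through one
# is visible through the other.
flat[5] = 50.0
print(grid[1, 2])
```

The last bullet on the slide is visible here too: the view works only because both sides live in one process and agree informally on who owns `flat`.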
24.
Problems NumPy doesn't solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
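One way to represent missing values without reserving a sentinel value is to keep a separate validity bitmap next to the values buffer. The sketch below is a toy pure-Python version of that idea. The helper names (`build_column`, `is_valid`) are hypothetical, and the least-significant-bit ordering is an assumption chosen to resemble what columnar formats such as Apache Arrow do; none of this is taken from the slide.

```python
from array import array

def build_column(values):
    """Pack optional ints into a (data buffer, validity bitmap) pair."""
    data = array("q", [0] * len(values))        # int64 slots; null slots hold 0
    bitmap = bytearray((len(values) + 7) // 8)  # one validity bit per slot
    for i, v in enumerate(values):
        if v is not None:
            data[i] = v
            bitmap[i // 8] |= 1 << (i % 8)      # mark slot i as valid
    return data, bitmap

def is_valid(bitmap, i):
    """Return True if slot i holds a real value rather than NULL."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))

data, bitmap = build_column([7, None, 42, None, 5])

# Aggregations consult the bitmap instead of testing for a magic value.
total = sum(data[i] for i in range(len(data)) if is_valid(bitmap, i))
null_count = sum(not is_valid(bitmap, i) for i in range(len(data)))
print(total, null_count)
```

The integer buffer keeps its integer type even though two entries are missing, which is the property that a NaN sentinel cannot provide.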
25.
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
26.
Arrow in a Slide
• New top-level Apache Software Foundation project
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world's data will be processed through Arrow!
Projects shown: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
27.
Focus on CPU Efficiency

         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138

[Diagram: the same table laid out in a Traditional Memory Buffer (row by row) and in an Arrow Memory Buffer (column by column)]
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
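The difference between the two buffers can be sketched with the standard library alone. The example below packs the slide's session_id and source_ip values once row by row and once column by column. Encoding each IP address as a 32-bit integer and the record format `"<qI"` are assumptions made for the illustration, not Arrow's actual layout.

```python
import ipaddress
import struct
from array import array

session_ids = [1331246660, 1331246351, 1331244570, 1331261196]
source_ips = ["99.155.155.225", "65.87.165.114", "71.10.106.181", "76.102.156.138"]
ip_ints = [int(ipaddress.IPv4Address(ip)) for ip in source_ips]

# Row-oriented: the fields of each record sit next to each other.
record = struct.Struct("<qI")  # int64 session_id, uint32 source_ip
row_buffer = b"".join(record.pack(s, ip) for s, ip in zip(session_ids, ip_ints))

# Column-oriented: each field is its own contiguous buffer.
session_col = array("q", session_ids)
ip_col = array("I", ip_ints)

# Scanning one field in the row layout must step over the unrelated field.
row_stride = record.size  # bytes between consecutive session_id values
ids_from_rows = [
    record.unpack_from(row_buffer, i * row_stride)[0]
    for i in range(len(session_ids))
]

# In the column layout the values are back to back.
col_stride = session_col.itemsize
print(row_stride, col_stride)
print(max(ids_from_rows) == max(session_col))
```

A scan over session_id touches every byte of the row buffer but only the session_id bytes of the column buffer, which is where the cache-locality and vectorization bullets come from.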
28.
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
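A single-process analogy for the "Today" and "With Arrow" columns: handing data over as a serialized copy versus handing over a view of the same bytes. This is only a sketch using `pickle` and `memoryview` from the standard library. Real cross-system sharing would need shared memory or an IPC transport, which is not shown, and the 1,000-element column is made up.

```python
import pickle
from array import array

column = array("d", range(1_000))

# "Today": the consumer receives a serialized copy it must decode.
payload = pickle.dumps(column)
received_copy = pickle.loads(payload)

# Shared-format hand-off: the consumer gets a view of the same bytes,
# with no encode or decode step.
received_view = memoryview(column)

# An update by the producer shows which consumer shares memory with it.
column[0] = 123.0
print(received_copy[0])  # taken from the independent copy
print(received_view[0])  # taken from the producer's own buffer
```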
29.
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
30.
Real World Example: Feather File Format for Python and R

R
    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python
    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
31.
More on Feather
[Diagram: a Feather file containing array 0, array 1, array 2, ..., array n - 1 and METADATA; the libfeather C++ library connects the file to an R data.frame via Rcpp and to a pandas DataFrame via Cython]
32.
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory "proprietary" data structures (R / Python data frames)
33.
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
34.
Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr's in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
35.
Real World Example: Python With Spark, Drill, Impala
[Diagram: a SQL engine supplies input partitions (in partition 0 ... in partition n - 1); each is the input to a Python function running user-supplied Python code; each function's output becomes an output partition (out partition 0 ... out partition n - 1) handed back to the SQL engine]
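The data flow in the diagram can be sketched in a few lines of plain Python. Everything here is hypothetical: `run_udf` stands in for the SQL engine's dispatch logic, each partition is modeled as a dict of column name to list of values, and `double_x` plays the role of the user-supplied code. No actual Spark, Drill, or Impala API is used.

```python
def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(part) for part in partitions]

# Hypothetical input: two partitions of one column each.
in_partitions = [
    {"x": [1, 2, 3]},
    {"x": [4, 5]},
]

def double_x(part):
    # User-supplied Python code: consumes one partition, emits one partition.
    return {"x2": [v * 2 for v in part["x"]]}

out_partitions = run_udf(in_partitions, double_x)
print(out_partitions)
```

In a real engine, moving each partition into and out of the Python function is where serialization cost arises, which is the motivation for a shared columnar memory format.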
36.
Get Involved in Arrow
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
37.
Thank you
Wes McKinney @wesmckinn
Views are my own