Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Apr. 6, 2016
Given at SF Big Analytics Meetup 4/5/2016
1. Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney, @wesmckinn
SF Big Analytics Meetup, 2016-04-05
© Cloudera, Inc. All rights reserved.
2. Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
3. In process: Python for Data Analysis, 2nd Edition
Coming late 2016 / early 2017
4. Python + Big Data: The State of Things
• See “Python and Apache Hadoop: A State of the Union” from February 17
• Areas where much more work is needed
  • Binary file format read/write support (e.g. Parquet files)
  • File system libraries (HDFS, S3, etc.)
  • Client drivers (Spark, Hive, Impala, Kudu)
  • Compute system integration (Spark, Impala, etc.)
5. Apache Arrow
Many slides here are from my joint talk with Jacques Nadeau, VP Apache Arrow
6. Arrow in a Slide
• New top-level Apache Software Foundation project
  • Announced Feb 17, 2016
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
  • A significant % of the world’s data will be processed through Arrow!
Projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
7. Apache Arrow: What is it?
• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • High-performance in-memory analytics implementations (think “pandas internals”)
  • Cheap data interchange amongst systems, with little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
8. Focus on CPU Efficiency
[Figure: the same four records shown twice, stored row by row in a traditional memory buffer and column by column in an Arrow memory buffer]
         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
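The difference between the two buffers can be illustrated with a small Python sketch. It is not Arrow itself: it assumes the standard-library `array` type as a stand-in for a fixed-width columnar buffer, and the variable names are illustrative only.

```python
import sys
from array import array

# The four sample records from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented: the values of one record sit together, so scanning one
# field means stepping over every other field of every record.
max_session_row_wise = max(record[0] for record in rows)

# Column-oriented: each field is stored contiguously. The integer column
# becomes a single typed buffer of fixed-width 64-bit values.
session_id = array("q", (record[0] for record in rows))
timestamp = [record[1] for record in rows]
source_ip = [record[2] for record in rows]
max_session_columnar = max(session_id)

# Constant-time value access in a fixed-width column:
# the byte offset of value i is simply i * itemsize.
raw = memoryview(session_id).cast("B")
i = 2
offset = i * session_id.itemsize
third_session = int.from_bytes(
    raw[offset:offset + session_id.itemsize], sys.byteorder
)
```

Because `session_id` is one contiguous run of same-sized values, a scan touches only the bytes it needs, which is the cache-locality point made on the slide.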
9. High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
[Figure: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data through "Copy & Convert" steps; with Arrow, the same systems share Arrow memory]
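As a rough analogy for the two halves of this slide, the sketch below contrasts a serialize/deserialize hand-off with a consumer reading the producer's buffer in place. It uses only the Python standard library (`pickle`, `array`, `memoryview`) as stand-ins; none of this is Arrow API.

```python
import pickle
from array import array

# A producer-side column of doubles.
column = array("d", [1.5, 2.5, 3.5, 4.5])

# "Today": each hand-off serializes and deserializes, producing a new copy
# in the consumer's own format.
copied = pickle.loads(pickle.dumps(column))

# "With Arrow" (illustrated here with memoryview): the consumer reads the
# producer's buffer directly, with no conversion step.
shared = memoryview(column)

# A later write by the producer is visible through the shared view,
# but not in the serialized copy.
column[0] = 10.0
copy_first = copied[0]
shared_first = shared[0]
```

The copy costs CPU time and memory proportional to the data size on every hand-off; the shared view costs neither, which is the motivation for agreeing on one memory format.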
10. Big Data Systems: Poor Python IO performance
http://wesmckinney.com/blog/pandas-and-apache-arrow/
11. Real World Example: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
[Figure: a Feather file contains Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory) and Feather metadata (Google flatbuffers)]
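The idea of "raw column buffers plus a small metadata block" can be sketched with a toy container. This is not the Feather format: it assumes JSON in place of Flatbuffers for the metadata, only handles float64 columns, and the function names `write_toy_file` and `read_toy_column` are hypothetical.

```python
import io
import json
import struct
from array import array

def write_toy_file(columns):
    """Write column buffers back to back, then a JSON footer describing them."""
    out = io.BytesIO()
    meta = []
    for name, values in columns.items():
        meta.append({"name": name, "offset": out.tell(), "length": len(values)})
        out.write(array("d", values).tobytes())
    footer = json.dumps(meta).encode("utf-8")
    out.write(footer)
    # The footer size goes last so a reader can locate the metadata.
    out.write(struct.pack("<I", len(footer)))
    return out.getvalue()

def read_toy_column(buf, name):
    """Locate one column through the footer and load only that column's bytes."""
    (footer_len,) = struct.unpack("<I", buf[-4:])
    meta = json.loads(buf[-4 - footer_len:-4].decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    values = array("d")
    nbytes = entry["length"] * values.itemsize
    values.frombytes(buf[entry["offset"]:entry["offset"] + nbytes])
    return list(values)

blob = write_toy_file({"x": [1.0, 2.0, 3.0], "y": [0.5, 0.25]})
y = read_toy_column(blob, "y")
```

Since the column bytes are stored in their in-memory layout, reading a column is a bulk copy with no per-value parsing, which is why read speed can approach disk IO speed.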
12. Real World Example: Feather File Format for Python and R
R
    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)
Python
    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
13. Apache Parquet: Binary columnar storage format
• I just became a Parquet committer!
• github.com/apache/parquet-cpp
• Python users will soon be able to read Parquet files via PyArrow
• parquet-cpp <-> PyArrow <-> pandas
14. Language Bindings
• Target languages
  • Java (beta)
  • CPP (underway)
  • Python & Pandas (underway)
  • R
  • Julia
• Initial focus
  • Read a structure
  • Write a structure
  • Manage memory
15. pandas and Arrow in context
16. RPC & IPC: Moving Data Between Systems
RPC
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored IO
  • Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory-mapped files
  • Moving data between Python and Drill
• Working on shared allocation approach
  • Shared reference counting and well-defined ownership semantics
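The memory-mapped-file approach to IPC can be sketched in a few lines of standard-library Python. This is only an illustration of the mechanism, not Arrow's alpha implementation: a temporary file stands in for the shared region, and both "processes" run in the same script for brevity.

```python
import mmap
import os
import tempfile
from array import array

values = array("q", [10, 20, 30, 40])

# Producer: write the raw column buffer to a file that both sides can map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(values.tobytes())

# Consumer: map the file and reinterpret the bytes as 64-bit integers,
# without parsing or deserializing them.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # The views are released (in reverse order) before the map is closed.
        with memoryview(mapped) as base, base.cast("q") as view:
            total = sum(view)
            second = view[1]

os.remove(path)
```

The consumer never builds its own copy of the column; it reads the mapped pages directly. Deciding who may free or reuse such a region is the ownership question the slide's last bullets refer to.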
17. Executing data science languages in the compute layer
[Figure: layered stack]
• UI: Ibis, SQL, Spark API, …
• Compute: Analytic SQL, Spark, MapReduce
• Storage: HDFS, Kudu, HBase
• Python, R, Julia, …?
18. Real World Example: Python With Spark, Drill, Impala
[Figure: a SQL engine passes input partitions 0 … n - 1 to user-supplied Python code as Python function input; the function output becomes output partitions 0 … n - 1, which go back to a SQL engine]
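The data flow in this figure can be sketched as a function applied once per partition. The names `run_udf` and `normalize` are hypothetical, and plain Python lists stand in for the columnar batches an engine would hand over.

```python
def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(partition) for partition in partitions]

def normalize(column):
    """Example user code: scale a column of numbers by its maximum."""
    peak = max(column)
    return [value / peak for value in column]

# Input partitions 0 … n - 1, as the engine would deliver them.
in_partitions = [[1.0, 2.0, 4.0], [5.0, 10.0]]

# Output partitions 0 … n - 1, returned to the engine.
out_partitions = run_udf(in_partitions, normalize)
```

In this sketch the function sees a whole partition at a time rather than one row at a time, so any per-call overhead is paid once per partition.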
19. What’s Next
• Parquet for Python & C++
  • Using Arrow as intermediary
• Available IPC implementation
• Spark, Drill integration
  • Faster UDFs, storage interfaces
20. Apache Arrow in practice
21. Get Involved
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
22. Thank you
Wes McKinney, @wesmckinn
Views are my own