Apache Arrow and Python: The latest

Talk from Data Science Summit 2016 in San Francisco

  1. Apache Arrow and Python in context
     Wes McKinney @wesmckinn
     Data Science Summit, 2016-07-12
     © Cloudera, Inc. All rights reserved.
  2. Me
     • Data Science Tools at Cloudera
     • Creator of pandas
     • Wrote Python for Data Analysis, 2012 (2nd ed. coming 2017)
     • Open source projects
       • Python {pandas, Ibis, statsmodels}
       • Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly work in Python and Cython/C/C++
  3. WrangleConf - July 28 in San Francisco
     http://wrangleconf.com
     Storytelling from real-world data science work (and BBQ, of course)
  4. Python + Big Data: The State of Things
     • See “Python and Apache Hadoop: A State of the Union” from February 17
     • Areas where much more work is needed
       • Binary file format read/write support (e.g. Parquet files)
       • File system libraries (HDFS, S3, etc.)
       • Client drivers (Spark, Hive, Impala, Kudu)
       • Compute system integration (Spark, Impala, etc.)
  5. Apache Arrow
     Many slides here are from my joint talk with Jacques Nadeau, VP Apache Arrow
  6. Arrow in a Slide
     • New top-level Apache Software Foundation project
       • Announced Feb 17, 2016
     • Focused on columnar in-memory analytics
       1. 10-100x speedup on many workloads
       2. Common data layer enables companies to choose best-of-breed systems
       3. Designed to work with any programming language
       4. Support for both relational and complex data as-is
     • Developers from 13+ major open source projects involved:
       Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  7. High Performance Sharing & Interchange
     Today
     • Each system has its own internal memory format
     • 70-80% CPU wasted on serialization and deserialization
     • Similar functionality implemented in multiple projects
     With Arrow
     • All systems utilize the same memory format
     • No overhead for cross-system communication
     • Projects can share functionality (e.g., Parquet-to-Arrow reader)
  8. Apache Arrow: What is it?
     • http://arrow.apache.org
     • Specification matters more than implementation
     • A standardized in-memory representation for columnar data
     • Enables
       • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
       • Cheap data interchange amongst systems, little or no serialization
       • Flexible support for complex JSON-like data
     • Targets: Impala, Kudu, Parquet, Spark
  9. Focus on CPU Efficiency
     [Figure: Traditional Memory Buffer vs. Arrow Memory Buffer]
     • Cache locality
     • Super-scalar & vectorized operation
     • Minimal structure overhead
     • Constant value access
       • With minimal structure overhead
     • Operate directly on columnar compressed data
  10. Example: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • Written by Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
  11. Real World Example: Feather File Format for Python and R
      R:
          library(feather)
          path <- "my_data.feather"
          write_feather(df, path)
          df <- read_feather(path)
      Python:
          import feather
          path = 'my_data.feather'
          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  12. In progress: Parquet on HDFS for pandas users
      [Diagram of the software stack. Labels: pandas; pyarrow; libarrow; libarrow_io; parquet-cpp; Arrow-Parquet adapter; Native libhdfs, other filesystem interfaces; Raw filesystem interface; Data structures; Python wrapper classes; Python + C extensions; C++ libraries; Parquet files in HDFS / filesystems]
  13. Language Bindings
      • Target languages
        • Java (beta)
        • C++ (underway)
        • Python & pandas (underway)
        • R
        • Julia
      • Initial focus
        • Read a structure
        • Write a structure
        • Manage memory
  14. RPC & IPC: Moving Data Between Systems
      RPC
      • Avoid serialization & deserialization
      • Layer TBD: focused on supporting vectored IO
      • Scatter/gather reads/writes against socket
      IPC
      • Alpha implementation using memory-mapped files
      • Moving data between Python and Drill
      • Working on shared allocation approach
      • Shared reference counting and well-defined ownership semantics
  15. Executing data science languages in the compute layer
  16. Real World Example: Python With Spark, Drill, Impala
  17. What’s on the horizon
      • Parquet for Python & C++
        • Using Arrow as intermediary
      • IPC implementation + Java/C++ interop
      • Spark, Drill integration
      • Faster UDFs, storage interfaces
  18. Get Involved
      • Join the community
        • dev@arrow.apache.org
        • Slack: https://apachearrowslackin.herokuapp.com/
        • http://arrow.apache.org
        • @ApacheArrow
  19. Thank you
      Wes McKinney @wesmckinn
      Views are my own
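
Slide 7 contrasts systems that serialize data for each other with systems that share one memory format. The sketch below is only an analogy using the Python standard library, not Arrow itself: `pickle` stands in for a serialize/deserialize hop, and a `memoryview` over an `array` stands in for a consumer that already understands the producer's layout. The names `roundtrip_copy` and `shared_view` are hypothetical.

```python
import pickle
from array import array

# A single float64 column held by the "producer".
column = array("d", [1.0, 2.0, 3.0, 4.0])


def roundtrip_copy(col):
    """'Today': encode to bytes, ship, decode. The consumer gets a copy."""
    return pickle.loads(pickle.dumps(col))


def shared_view(col):
    """'With a shared format': the consumer views the same bytes, no copy."""
    return memoryview(col)


copied = roundtrip_copy(column)
copied[0] = 99.0   # only the copy changes; column[0] is still 1.0

view = shared_view(column)
view[1] = -2.0     # writes through to the producer's buffer
```

The copy costs an encode and a decode pass over the data; the view costs neither, which is the kind of overhead the slide says a common format is meant to remove.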
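
Slide 9 lists cache locality as a benefit of a columnar buffer. As a rough illustration (standard library only, not the Arrow format itself), the sketch below stores the same three hypothetical records once as interleaved rows and once as per-field columns, then sums one field from each layout. The record fields and function names are assumptions made for the example.

```python
import struct
from array import array

# Three hypothetical records: (id, price, quantity).
rows = [(1, 9.5, 3), (2, 12.0, 1), (3, 7.25, 4)]

# Row-oriented buffer: the fields of each record are interleaved.
ROW = struct.Struct("<qdq")  # int64, float64, int64 = 24 bytes per record
row_buffer = b"".join(ROW.pack(*r) for r in rows)

# Column-oriented buffers: each field is stored contiguously.
ids = array("q", (r[0] for r in rows))
prices = array("d", (r[1] for r in rows))
quantities = array("q", (r[2] for r in rows))


def total_price_rows(buf):
    """Step over every 24-byte record to pick out its price field."""
    return sum(ROW.unpack_from(buf, off)[1] for off in range(0, len(buf), ROW.size))


def total_price_columns(col):
    """Read one contiguous run of float64 values."""
    return sum(col)
```

Both functions return the same total, but the row version touches all 72 bytes to use 24 of them, while the column version reads only the 24 bytes it needs.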
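
Slide 14 mentions an alpha IPC implementation using memory-mapped files. The following is a minimal sketch of that general idea with the standard library, not the Arrow IPC implementation: a producer writes one float64 column to a file, and a consumer maps the file and reads the values in place. It assumes both sides run on the same machine with the same native byte order; `write_column`, `sum_column`, and the file name are hypothetical.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Producer: write float64 values as one contiguous native-endian buffer."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def sum_column(path):
    """Consumer: map the file and read the values without decoding them."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reinterpret the mapped bytes as float64 values; no copy is made.
            with memoryview(mapped) as raw, raw.cast("d") as view:
                return sum(view)


path = os.path.join(tempfile.mkdtemp(), "prices.bin")
write_column(path, [1.5, 2.5, 4.0])
total = sum_column(path)
```

A real interchange format also has to carry a schema, lengths, and null information alongside the raw values; this sketch leaves all of that out.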