Uwe L. Korn
PyData Paris 14th June 2016
How Apache Arrow and Parquet
boost cross-language interop

How Apache Arrow and Parquet boost cross-language interoperability



A talk given at PyData Paris 2016 on the importance of Apache Arrow and Apache Parquet and on recent developments on their Python side.

Published in: Data & Analytics


  1. Uwe L. Korn • PyData Paris, 14th June 2016 • How Apache Arrow and Parquet boost cross-language interop
  2. About me • Data Scientist at Blue Yonder (@BlueYonderTech) • We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics • Contributor to Apache {Arrow, Parquet} • Work in Python, Cython, C++11 and SQL
  3. Agenda • The Problem • Arrow • Parquet • Outlook
  4. Why is columnar better? Image source: https://arrow.apache.org/img/simd.png (https://arrow.apache.org/)
  5. Different Systems - Varying Python Support • Various levels of Python support • Built in Python • Python API • No Python at all • Each tool/algorithm works on columnar data • Separate conversion routines for each pair • causes overhead • there’s no one-size-fits-all solution Image source: https://arrow.apache.org/img/copy2.png (https://arrow.apache.org/)
  6. Apache Arrow • Specification for in-memory columnar data layout • No overhead for cross-system / cross-language communication • Designed for efficiency (exploits SIMD, cache locality, ...) • Supports nested data structures Image source: https://arrow.apache.org/img/shared2.png (https://arrow.apache.org/)
  7. Apache Arrow - The Impact • An example: Retrieve a dataset from an MPP database and analyze it in Pandas • Run a query in the DB • Pass it in columnar form to the DB driver • The ODBC layer transforms it into row-wise form • Pandas makes it columnar again • Ugly real-life solution: export as CSV, bypass ODBC • In the future: Use Arrow as the interface between the DB and Pandas
  8. Apache Arrow • Top-level Apache project from the beginning • Not only a specification: also includes C++ / Java / Python / ... code • Arrow structures / classes • RPC (upcoming) & IPC (alpha) support • Conversion code for Parquet, Pandas, ... • Combined effort from developers of over 13 major OSS projects • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ... • Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
  9. Arrow in Action: Feather • Language-agnostic file format for binary data frame storage • Read performance close to raw disk I/O • by Wes McKinney (Python) and Hadley Wickham (R) • Julia support in progress • Diagram labels: Arrow Arrays, Feather Metadata (flatbuffers)
  10. Apache Parquet
  11. Apache Parquet • Binary file format for nested columnar data • Inspired by the Google Dremel paper • space and query efficient • multiple encodings • predicate pushdown • column-wise compression • many tools use Parquet as the default input format • very popular in the JVM/Hadoop-based world
  12. The Basics • 1 File, includes metadata • Several row groups • all with the same number of column chunks • n pages per column chunk • Benefits: • pre-partitioned for fast distributed access • statistics in the metadata for predicate pushdown • Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet • Diagram labels: File, Row Group, Column Chunk, Page
  13. Using Parquet in Python • You can already use it from Python today: • sqlContext.read.parquet("..").toPandas() • Needs to pass through Spark, very slow • Native Python support on its way: • Parquet I/O to Arrow • Arrow provides NumPy conversion
  14. State of Arrow & Parquet • Arrow (in-memory spec for columnar data): Java (beta), C++ (in progress), Python (in progress); planned: Julia, R • Parquet (columnar on-disk storage): Java (mature), C++ (in progress), Python (in progress); planned: Julia, R
  15. Upcoming • Parquet <-Arrow-> Pandas • IPC on its way • alpha implementation using memory mapped files • JVM <-> native with shared reference counting
  16. Get Involved! • dev@arrow.apache.org & dev@parquet.apache.org • https://apachearrowslackin.herokuapp.com/ • https://arrow.apache.org/ • https://parquet.apache.org/ • @ApacheArrow & @ApacheParquet
  17. Questions?!
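
The round trip described on the "Apache Arrow - The Impact" slide (columnar in the database, row-wise in the driver layer, columnar again in the data frame) can be sketched in plain Python. This is only an illustration under assumptions: the column names, the values, and the helpers `to_rows` and `to_columns` are hypothetical and stand in for what a database, a row-oriented driver, and a data frame library would do. It is not Arrow, ODBC, or Pandas code.

```python
from array import array

# Hypothetical columnar result set: one typed, contiguous buffer per column.
columns = {
    "item_id": array("q", [1, 2, 3]),
    "price": array("d", [9.99, 4.50, 12.00]),
}


def to_rows(cols):
    """Transpose columns into row tuples, as a row-oriented driver layer would."""
    names = list(cols)
    return names, list(zip(*(cols[name] for name in names)))


def to_columns(names, rows, typecodes):
    """Rebuild one typed buffer per column, as a data frame library has to."""
    return {
        name: array(typecodes[name], (row[i] for row in rows))
        for i, name in enumerate(names)
    }


# Two full copies of the data are made only to end up where we started.
names, rows = to_rows(columns)
rebuilt = to_columns(names, rows, {"item_id": "q", "price": "d"})
```

Both conversions touch every value once, which is the overhead a shared in-memory columnar layout is meant to avoid.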
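
The "statistics in the metadata for predicate pushdown" benefit from "The Basics" slide can be illustrated with a toy model. The sketch below assumes a single integer column split into fixed-size row groups, each carrying min/max statistics. The dictionary layout and the functions `make_row_groups` and `scan_greater_than` are hypothetical and do not reflect the actual Parquet metadata structures or any Parquet library API.

```python
def make_row_groups(values, rows_per_group):
    """Split one column into row groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"values": chunk, "min": min(chunk), "max": max(chunk)})
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold, reading only row groups that can contain a match."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # the statistics rule this group out, so its data is never read
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


prices = [1, 3, 2, 10, 12, 11, 4, 5, 6]
groups = make_row_groups(prices, rows_per_group=3)
matches, groups_read = scan_greater_than(groups, threshold=9)
```

In this example only one of the three row groups is read, because the other two have a maximum below the threshold.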
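
The "Upcoming" slide mentions an alpha IPC implementation using memory mapped files. The general idea, a reader interpreting a column buffer in place instead of parsing and copying it, can be sketched with the Python standard library. This is an assumption-laden illustration: the file holds nothing but raw native-endian float64 values, the file name `column.bin` is made up, and none of this is the Arrow IPC format or API.

```python
import mmap
import os
import tempfile
from array import array

# Hypothetical column of float64 values produced by the writing process.
values = array("d", [1.5, 2.5, 4.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")

    # Writer side: dump the raw column buffer to a file.
    with open(path, "wb") as f:
        f.write(values.tobytes())

    # Reader side: map the file and reinterpret the bytes as float64 in place.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as raw, raw.cast("d") as view:
            count = len(view)   # number of float64 values visible through the mapping
            total = sum(view)   # values are read straight from the mapped pages
```

A real interchange format additionally needs metadata (types, lengths, null bitmaps) so that readers in other languages can interpret the buffers; that is the part a specification such as Arrow supplies.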
