How Apache Arrow and Parquet boost cross-language interoperability
1. Uwe L. Korn
PyData Paris 14th June 2016
How Apache Arrow and Parquet boost cross-language interop
2. About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL
5. Different Systems - Varying Python Support
• Various levels of Python support:
• Built in Python
• Python API
• No Python at all
• Each tool/algorithm works on columnar data
• Separate conversion routines for each pair of systems
• this causes overhead (see the sketch after this slide)
• there's no one-size-fits-all solution
Image source: https://arrow.apache.org/img/copy2.png (https://arrow.apache.org/)
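To make that overhead concrete: with n systems, dedicated pairwise converters grow quadratically, while a shared format needs only one adapter per system. A back-of-the-envelope sketch in Python; the system list is illustrative, not exhaustive:

    # Pairwise converters vs. one adapter per system for a shared format.
    # The system names below are illustrative, not an exhaustive list.
    systems = ["Pandas", "Spark", "Impala", "Kudu", "Drill", "Cassandra"]

    n = len(systems)
    pairwise = n * (n - 1)  # a dedicated converter for each ordered pair
    shared = n              # one adapter per system to/from Arrow

    print("%d systems: %d pairwise converters vs. %d Arrow adapters"
          % (n, pairwise, shared))
    # 6 systems: 30 pairwise converters vs. 6 Arrow adapters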
6. Apache Arrow
• Specification for an in-memory columnar data layout (sketch after this slide)
• No overhead for cross-system / cross-language communication
• Designed for efficiency (exploits SIMD, cache locality, ..)
• Supports nested data structures
Image source: https://arrow.apache.org/img/shared2.png (https://arrow.apache.org/)
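A minimal illustration of that layout, using today's pyarrow package (still under development at the time of this talk): an Arrow array is a validity bitmap plus a contiguous values buffer, which is what makes zero-copy sharing across systems possible.

    import pyarrow as pa

    # An Arrow array: a validity bitmap plus a contiguous values buffer.
    arr = pa.array([1, 2, None, 4], type=pa.int64())

    print(arr.null_count)  # 1 (the None above)
    print(arr.buffers())   # [validity bitmap buffer, int64 values buffer]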
7. Apache Arrow - The Impact
• An example: retrieve a dataset from an MPP database and analyze it in Pandas (round trip sketched below)
• Run a query in the DB
• Pass the result in columnar form to the DB driver
• The ODBC layer transforms it into row-wise form
• Pandas makes it columnar again
• Ugly real-life workaround: export as CSV, bypassing ODBC
• In the future: use Arrow as the interface between the DB and Pandas
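A sketch of that round trip; the DSN, table, and query are placeholders. Every row-wise step in the middle is exactly the overhead Arrow is meant to remove:

    import pyodbc
    import pandas as pd

    # Placeholder connection string and query, for illustration only.
    conn = pyodbc.connect("DSN=my_mpp_db")

    # The database computes a columnar result, ODBC delivers it row by
    # row, and Pandas transposes it back into columns - two conversions
    # that Arrow as a common interface would eliminate.
    df = pd.read_sql("SELECT * FROM sales", conn)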
8. Apache Arrow
• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java / Python / .. code
• Arrow structures / classes
• RPC (upcoming) & IPC (alpha) support
• Conversion code for Parquet, Pandas, ..
• Combined effort from developers of over 13 major OSS projects
• Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
9. Arrow in Action: Feather
• Language-agnostic file format for binary data frame storage (round trip sketched below)
• Read performance close to raw disk I/O
• by Wes McKinney (Python) and Hadley Wickham (R)
• Julia support in progress
Diagram: a Feather file = Arrow arrays + Feather metadata (Flatbuffers)
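A minimal Feather round trip with the feather-format package as it shipped around the time of the talk; file name and data are made up:

    import feather
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # The frame is stored as Arrow arrays plus Feather metadata.
    feather.write_dataframe(df, "example.feather")
    df2 = feather.read_dataframe("example.feather")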
11. Apache Parquet
• Binary file format for nested columnar data
• Inspired by Google's Dremel paper
• space- and query-efficient
• multiple encodings
• predicate pushdown
• column-wise compression (pushdown and compression sketched below)
• many tools use Parquet as their default input format
• very popular in the JVM/Hadoop-based world
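A sketch of two of those features with today's pyarrow.parquet API (which post-dates this talk); the file name and data are made up:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"year": [2014, 2015, 2016],
                      "sales": [10.0, 12.5, 9.8]})

    # Column-wise compression: every column chunk is encoded and
    # compressed independently.
    pq.write_table(table, "sales.parquet", compression="snappy")

    # Predicate pushdown: row groups whose statistics rule out the
    # filter are skipped without being decoded.
    result = pq.read_table("sales.parquet", filters=[("year", ">=", 2015)])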
12. The Basics
• 1 file, including metadata
• several row groups
• all with the same number of column chunks
• n pages per column chunk
• Benefits:
• pre-partitioned for fast distributed access
• statistics in the metadata for predicate pushdown (inspected in the sketch below)
Blog post by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Diagram: File → Row Group → Column Chunk → Page
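That hierarchy can be inspected directly from the file's footer metadata; a sketch with today's pyarrow.parquet, reusing the hypothetical sales.parquet file from the previous slide:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("sales.parquet").metadata

    print(meta.num_row_groups)          # row groups in the file
    rg = meta.row_group(0)
    print(rg.num_columns, rg.num_rows)  # column chunks in this row group
    col = rg.column(0)
    # Per-chunk statistics: what predicate pushdown reads to skip data.
    print(col.statistics.min, col.statistics.max)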
13. Using Parquet in Python
• You can already use it from Python today:
• sqlContext.read.parquet("..").toPandas()
• needs to pass through Spark, very slow
• Native Python support is on its way (sketch below):
• Parquet I/O to Arrow
• Arrow provides NumPy conversion
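The native path sketched on this slide has since landed in pyarrow; a one-line sketch, assuming the sales.parquet file from earlier:

    import pyarrow.parquet as pq

    # Parquet -> Arrow -> Pandas, no Spark round trip.
    df = pq.read_table("sales.parquet").to_pandas()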
14. State of Arrow & Parquet
Arrow
in-memory spec for columnar data
• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Parquet
columnar on-disk storage
• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
15. Upcoming
• Parquet <-Arrow-> Pandas
• IPC on its way
• alpha implementation using memory-mapped files (sketch below)
• JVM <-> native with shared reference counting
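A sketch of the memory-mapped IPC path using today's pa.ipc API (the implementation was alpha at the time of the talk): one process writes record batches to a file, another maps the same file and reads the arrays in place, without copying them.

    import pyarrow as pa

    batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})

    # Writer side: record batches go into the Arrow IPC file format.
    with pa.OSFile("batches.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            writer.write_batch(batch)

    # Reader side: memory-map the file; the arrays are used in place
    # instead of being copied into the reader's heap.
    with pa.memory_map("batches.arrow") as source:
        table = pa.ipc.open_file(source).read_all()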