Apache Arrow (Strata-Hadoop World San Jose 2016)

Wes McKinney
DREMIO
Faster conclusions using in-memory columnar SQL and machine learning
Strata San Jose - March 30, 2016
Apache Arrow
Who
Wes McKinney
•  Engineer at Cloudera, formerly DataPad CEO/founder
•  Wrote bestseller Python for Data Analysis (2012)
•  Open source projects
–  Python {pandas, Ibis, statsmodels}
–  Apache {Arrow, Parquet, Kudu (incubating)}
•  Mostly work in Python and Cython/C/C++
Jacques Nadeau
•  CTO & Co-Founder at Dremio, formerly Architect at MapR
•  Open Source projects
–  Apache {Arrow, Parquet, Calcite, Drill, HBase, Phoenix}
•  Mostly work in Java
Arrow in a Slide
•  New Top-level Apache Software Foundation project
–  Announced Feb 17, 2016
•  Focused on Columnar In-Memory Analytics
1.  10-100x speedup on many workloads
2.  Common data layer enables companies to choose best of breed systems
3.  Designed to work with any programming language
4.  Support for both relational and complex data as-is
•  Developers from 13+ major open source projects involved
–  A significant % of the world’s data will be processed through Arrow!
–  Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
Agenda
•  Purpose
•  Memory Representation
•  Language Bindings
•  IPC & RPC
•  Example Integrations
Purpose
Overview
•  A high speed in-memory representation
•  Well-documented and cross language compatible
•  Designed to take advantage of modern CPU characteristics
•  Embeddable in execution engines, storage layers, etc.
Focus on CPU Efficiency
            session_id    timestamp          source_ip
Row 1       1331246660    3/8/2012 2:44PM    99.155.155.225
Row 2       1331246351    3/8/2012 2:38PM    65.87.165.114
Row 3       1331244570    3/8/2012 2:09PM    71.10.106.181
Row 4       1331261196    3/8/2012 6:46PM    76.102.156.138
Traditional Memory Buffer: Row 1, Row 2, Row 3, Row 4 stored one after another
Arrow Memory Buffer: session_id, timestamp, source_ip columns each stored contiguously
•  Cache Locality
•  Super-scalar & vectorized operation
•  Minimal Structure Overhead
•  Constant value access
–  With minimal structure overhead
•  Operate directly on columnar compressed data
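The two buffer layouts on this slide can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not Arrow's implementation: the names `rows`, `columns`, `max_session_id_rows`, and `max_session_id_columns` are hypothetical, and `array.array` merely stands in for a packed Arrow buffer.

```python
from array import array

# The four records from the slide, stored row by row ("traditional" layout).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# The same records stored column by column. The numeric column is one
# contiguous buffer of packed 64-bit integers (a stand-in for an Arrow buffer).
columns = {
    "session_id": array("q", (r[0] for r in rows)),
    "timestamp": [r[1] for r in rows],
    "source_ip": [r[2] for r in rows],
}


def max_session_id_rows(records):
    # Row layout: every record is touched even though one field is needed.
    return max(record[0] for record in records)


def max_session_id_columns(cols):
    # Columnar layout: a single contiguous buffer is scanned.
    return max(cols["session_id"])


# Constant-time value access: value i starts at byte i * itemsize.
session_ids = columns["session_id"]
third_value_offset = 2 * session_ids.itemsize
```

Both functions return the same answer; the difference is how much unrelated data sits between the values being scanned.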
High Performance Sharing & Interchange
Today
•  Each system has its own internal memory format
•  70-80% CPU wasted on serialization and deserialization
•  Similar functionality implemented in multiple projects
[Diagram: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark linked to one another by "Copy & Convert" steps]
With Arrow
•  All systems utilize the same memory format
•  No overhead for cross-system communication
•  Projects can share functionality (e.g., Parquet-to-Arrow reader)
[Diagram: the same systems all attached to a shared Arrow Memory]
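The contrast between "Copy & Convert" and a shared memory format can be sketched with the Python standard library. This is a toy illustration, assuming JSON as the stand-in wire format and `memoryview` as the stand-in for shared Arrow memory; none of the variable names come from the talk.

```python
import json
from array import array

# Producer-side column of packed 64-bit integers.
source = array("q", [2, 3, 4, 5, 6])

# "Today": the consumer obtains its own copy through a serialize/parse
# round trip, paying CPU time for both directions.
wire_text = json.dumps(source.tolist())
converted_copy = array("q", json.loads(wire_text))

# "With Arrow" (sketched): the consumer reads the producer's buffer in place.
shared_view = memoryview(source)

# A later write by the producer is visible through the shared view,
# which shows that no copy was made. The converted copy is unaffected.
source[0] = 99
```

After the final assignment, `shared_view[0]` reads 99 while `converted_copy[0]` still holds the original value.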
Shared Need > Open Source Opportunity
•  Columnar is Complex
•  Shredded Columnar is even more complex
•  We all need to go to the same place
•  Take Advantage of Open Source approach
•  Once we pick a shared solution, we get interchange for “free”
“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” - Impala Team
“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” – Spark Team
In Memory Representation
Columnar data
persons = [{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}, {
  name: 'joe',
  iq: 100,
  addresses: [
    {number: 4, street: 'ccc'},
    {number: 5, street: 'dddd'},
    {number: 6, street: 'f'}
  ]
}]
Simple Example: persons.iq
person.iq
180
100
Simple Example: persons.addresses.number
person.addresses (offset): 0, 2, 5
person.addresses.number: 2, 3, 4, 5, 6
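How the nested `addresses` lists turn into an offsets array plus a flat values array can be sketched as follows. This is a minimal sketch, not Arrow code: `shred_list_column` and `numbers_of` are hypothetical helper names, and the last address number is taken as 6 to match the column values shown on the slide.

```python
persons = [
    {"name": "wes", "iq": 180,
     "addresses": [{"number": 2, "street": "a"},
                   {"number": 3, "street": "bb"}]},
    {"name": "joe", "iq": 100,
     "addresses": [{"number": 4, "street": "ccc"},
                   {"number": 5, "street": "dddd"},
                   {"number": 6, "street": "f"}]},
]


def shred_list_column(records, list_field, value_field):
    """Flatten records[i][list_field][j][value_field] into offsets + values."""
    offsets = [0]
    values = []
    for record in records:
        for item in record[list_field]:
            values.append(item[value_field])
        # Each record contributes one offset marking where its list ends.
        offsets.append(len(values))
    return offsets, values


def numbers_of(person_index, offsets, values):
    """Recover one person's list by slicing between two adjacent offsets."""
    return values[offsets[person_index]:offsets[person_index + 1]]


address_offsets, address_numbers = shred_list_column(
    persons, "addresses", "number"
)
```

With this data the offsets come out as 0, 2, 5 and the values as 2, 3, 4, 5, 6, and `numbers_of(1, address_offsets, address_numbers)` slices out the second person's three numbers without touching the first person's.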
Columnar data
person.addresses (offset): 0, 2, 5
person.addresses.street (offset): 0, 1, 3, 6, 10
person.addresses.street (data): a b b c c c d d d d f
person.addresses.number: 2, 3, 4, 5, 6
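The street column adds a second level of offsets, because each string is itself variable-length. A minimal sketch of that encoding, with hypothetical helper names `encode_strings` and `decode_string`: the slide lists the starting offsets 0, 1, 3, 6, 10, and this sketch also appends the final end offset (11) so that every value is bounded by two adjacent offsets.

```python
def encode_strings(strings):
    """Pack strings into one contiguous byte buffer plus an offsets list."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this string, start of the next
    return offsets, bytes(data)


def decode_string(i, offsets, data):
    """Read string i by slicing the data buffer between two offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


# The five street values from the example, in flattened order.
streets = ["a", "bb", "ccc", "dddd", "f"]
street_offsets, street_data = encode_strings(streets)
```

Reading a street for a given person then takes two lookups: the list offsets select a range of address indices, and the string offsets select the bytes for each address.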
Language Bindings
Language Bindings
•  Target Languages
–  Java (beta)
–  CPP (underway)
–  Python & Pandas (underway)
–  R
–  Julia
•  Initial Focus
–  Read a structure
–  Write a structure
–  Manage Memory
Java: Creating Dynamic Off-heap Structures
Programmatic Construction
FieldWriter w = getWriter();
w.varChar("name").write("Wes");
w.integer("iq").write(180);
ListWriter list = w.list("addresses");
list.startList();
  MapWriter map = list.map();
  map.start();
    map.integer("number").writeInt(2);
    map.varChar("street").write("a");
  map.end();
  map.start();
    map.integer("number").writeInt(3);
    map.varChar("street").write("bb");
  map.end();
list.endList();
JSON Representation
{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}
Java: Memory Management (& NVMe)
•  Chunk-based managed allocator
–  Built on top of Netty’s JEMalloc implementation
•  Create a tree of allocators
–  Limit and transfer semantics across allocators
–  Leak detection and location accounting
•  Wrap native memory from other applications
•  New support for integration with Intel’s Persistent Memory library via Apache Mnemonic
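The idea of a tree of allocators with limits, transfers, and leak detection can be sketched as plain bookkeeping. The `Allocator` class below is hypothetical and written in Python for illustration; it tracks byte counts only, allocates no real memory, and is not the Java API described on the slide.

```python
class Allocator:
    """Hypothetical accounting-only allocator: tracks bytes, not memory."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit
        self.parent = parent
        self.allocated = 0

    def new_child(self, name, limit):
        return Allocator(name, limit, parent=self)

    def allocate(self, nbytes):
        # Check this allocator and every ancestor before reserving anything,
        # so a failed request leaves all counters unchanged.
        node = self
        while node is not None:
            if node.allocated + nbytes > node.limit:
                raise MemoryError(f"{node.name}: limit {node.limit} exceeded")
            node = node.parent
        node = self
        while node is not None:
            node.allocated += nbytes
            node = node.parent

    def release(self, nbytes):
        node = self
        while node is not None:
            node.allocated -= nbytes
            node = node.parent

    def transfer_to(self, other, nbytes):
        # Move accounting for nbytes to another allocator; roll back on failure.
        self.release(nbytes)
        try:
            other.allocate(nbytes)
        except MemoryError:
            self.allocate(nbytes)
            raise

    def close(self):
        # Leak detection: closing with outstanding bytes is an error.
        if self.allocated != 0:
            raise RuntimeError(f"{self.name}: leaked {self.allocated} bytes")


root = Allocator("root", 1024)
query = root.new_child("query", 512)
scan = query.new_child("scan", 256)
sort = query.new_child("sort", 256)
scan.allocate(200)
scan.transfer_to(sort, 200)
```

After the transfer, `scan` can be closed cleanly, while closing `sort` before it releases its 200 bytes would report a leak by name.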
RPC & IPC
Common Message Pattern
•  Schema Negotiation
–  Logical Description of structure
–  Identification of dictionary encoded Nodes
•  Dictionary Batch
–  Dictionary ID, Values
•  Record Batch
–  Batches of records up to 64K
–  Leaf nodes up to 2B values
[Diagram: Schema Negotiation, then 0..N Dictionary Batches, then 1..N Record Batches]
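The message sequence above (schema first, then dictionary batches, then record batches) can be sketched with a few plain Python data classes. All class and function names here are hypothetical, and the sketch models only the ordering and the dictionary lookup, not Arrow's actual wire format.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    fields: dict             # field name -> logical type name
    dictionary_fields: dict  # field name -> dictionary ID


@dataclass
class DictionaryBatch:
    dictionary_id: int
    values: list


@dataclass
class RecordBatch:
    columns: dict  # field name -> values (indices if dictionary encoded)


def build_stream(names, iqs):
    """Producer: emit schema, one dictionary batch, one record batch."""
    distinct = sorted(set(names))
    index_of = {value: i for i, value in enumerate(distinct)}
    yield Schema({"name": "varchar", "iq": "int"}, {"name": 0})
    yield DictionaryBatch(0, distinct)
    yield RecordBatch({"name": [index_of[n] for n in names], "iq": list(iqs)})


def read_stream(messages):
    """Consumer: resolve dictionary indices and return rows as tuples."""
    messages = iter(messages)
    schema = next(messages)  # the schema always arrives first
    dictionaries = {}
    rows = []
    for message in messages:
        if isinstance(message, DictionaryBatch):
            dictionaries[message.dictionary_id] = message.values
            continue
        cols = dict(message.columns)
        for field, dict_id in schema.dictionary_fields.items():
            cols[field] = [dictionaries[dict_id][i] for i in cols[field]]
        rows.extend(zip(*(cols[f] for f in schema.fields)))
    return rows


decoded_rows = read_stream(build_stream(["wes", "joe", "wes"], [180, 100, 180]))
```

The repeated name is sent once in the dictionary batch and referenced by index in the record batch.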
Record Batch Construction
[Diagram: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch; one Record Batch is expanded below]
Record Batch layout for the record {name: 'wes', iq: 180, addresses: [{number: 2, street: 'a'}, {number: 3, street: 'bb'}]}:
data header (describes offsets into data)
name (bitmap), name (offset), name (data)
iq (bitmap), iq (data)
addresses (bitmap), addresses (list offset)
addresses.number (bitmap), addresses.number
addresses.street (bitmap), addresses.street (offset), addresses.street (data)
Each box is contiguous memory, entirely contiguous on wire
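A record batch as "a header describing offsets into one contiguous block of buffers" can be sketched with the `struct` module. The header format below (a buffer count followed by offset/length pairs) is an assumption made for this illustration and is not Arrow's real metadata encoding; alignment and padding are ignored, and only the `name` and `iq` buffers of the 'wes' record are packed.

```python
import struct


def pack_bitmap(valid_flags):
    """Pack validity flags into bytes, least-significant bit first."""
    out = bytearray((len(valid_flags) + 7) // 8)
    for i, valid in enumerate(valid_flags):
        if valid:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def pack_record_batch(buffers):
    """Concatenate buffers behind a header of (offset, length) pairs."""
    header_size = 4 + 8 * len(buffers)
    header = struct.pack("=I", len(buffers))
    body = b""
    for buf in buffers:
        header += struct.pack("=II", header_size + len(body), len(buf))
        body += buf
    return header + body


def unpack_buffer(batch, index):
    """Locate buffer `index` through the header and return it without copying."""
    offset, length = struct.unpack_from("=II", batch, 4 + 8 * index)
    return memoryview(batch)[offset:offset + length]


# Buffers for one record: name = 'wes', iq = 180.
name_buffers = [pack_bitmap([True]), struct.pack("=2i", 0, 3), b"wes"]
iq_buffers = [pack_bitmap([True]), struct.pack("=i", 180)]
batch = pack_record_batch(name_buffers + iq_buffers)
```

Because the whole batch is one `bytes` object, it could be written to a socket or file in a single call, and a reader needs only the header to find each buffer.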
RPC & IPC: Moving Data Between Systems
RPC
•  Avoid Serialization & Deserialization
•  Layer TBD: Focused on supporting vectored I/O
–  Scatter/gather reads/writes against socket
IPC
•  Alpha implementation using memory mapped files
–  Moving data between Python and Drill
•  Working on shared allocation approach
–  Shared reference counting and well-defined ownership semantics
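The memory-mapped-file approach to IPC can be sketched with Python's `mmap` module. This is a single-process illustration under simple assumptions: the file holds one packed int64 column in native byte order, and `write_column` and `read_column_sum` are hypothetical names. In a real setup the producer and consumer would be separate processes agreeing on the file location and layout.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Producer: write a packed int64 column to a file."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def read_column_sum(path):
    """Consumer: map the file and read the int64 values in place."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reinterpret the mapped bytes as int64 values without copying.
            # Both views are released before the mapping is closed.
            with memoryview(mapped) as raw, raw.cast("q") as view:
                return sum(view)


path = os.path.join(tempfile.mkdtemp(), "iq.bin")
write_column(path, [180, 100])
total = read_column_sum(path)
```

The consumer never parses or converts the data; it reads the same bytes the producer wrote.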
Real World Examples
Real World Example: Python With Spark or Drill
[Diagram: a SQL engine passes input partitions 0 … n - 1 to user-supplied Python code; each Python function takes one partition as input and produces output, and the resulting out partitions 0 … n - 1 return to the SQL engine]
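The partition-by-partition flow in the diagram can be sketched as follows. This is a toy model, not Spark or Drill code: `run_udf` and `add_ten` are hypothetical names, and `array.array` stands in for the columnar memory an engine would hand to Python.

```python
from array import array


def run_udf(partitions, udf):
    """Apply a user-supplied function to each input partition independently."""
    return [udf(partition) for partition in partitions]


def add_ten(iq_column):
    # Hypothetical user-supplied Python code: it receives a whole column
    # for one partition and returns a new column of the same length.
    return array("q", (value + 10 for value in iq_column))


in_partitions = [array("q", [180, 100]), array("q", [120])]
out_partitions = run_udf(in_partitions, add_ten)
```

The function is called once per partition on a whole column rather than once per row, which is the part of the design that a shared columnar format is meant to make cheap.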
Real World Example: Feather File Format for Python and R
•  Problem: fast, language-agnostic binary data frame file format
•  Written by Wes McKinney (Python) and Hadley Wickham (R)
•  Read speeds close to disk IO performance
[Diagram: a Feather file holds Arrow array 0, Arrow array 1, …, Arrow array n plus Feather metadata; the arrays use Apache Arrow memory and the metadata uses Google flatbuffers]
Real World Example: Feather File Format for Python and R
R
library(feather)

path <- "my_data.feather"
write_feather(df, path)

df <- read_feather(path)
Python
import feather

path = 'my_data.feather'

feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
What’s Next
•  Parquet for Python & C++
– Using Arrow Representation
•  Available IPC Implementation
•  Spark, Drill Integration
– Faster UDFs, Storage interfaces
Get Involved
•  Join the community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow, @wesmckinn, @intjesus
Innovation & Entrepreneurship strategies in Dairy Industry
PervaizDar139 views
"Package management in monorepos", Zoltan Kochan by Fwdays
"Package management in monorepos", Zoltan Kochan"Package management in monorepos", Zoltan Kochan
"Package management in monorepos", Zoltan Kochan
Fwdays37 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10152 views
Mobile Core Solutions & Successful Cases.pdf by IPLOOK Networks
Mobile Core Solutions & Successful Cases.pdfMobile Core Solutions & Successful Cases.pdf
Mobile Core Solutions & Successful Cases.pdf
IPLOOK Networks16 views
What is Authentication Active Directory_.pptx by HeenaMehta35
What is Authentication Active Directory_.pptxWhat is Authentication Active Directory_.pptx
What is Authentication Active Directory_.pptx
HeenaMehta3515 views
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell by Fwdays
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell
Fwdays14 views
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」 by PC Cluster Consortium
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」
Optimizing Communication to Optimize Human Behavior - LCBM by Yaman Kumar
Optimizing Communication to Optimize Human Behavior - LCBMOptimizing Communication to Optimize Human Behavior - LCBM
Optimizing Communication to Optimize Human Behavior - LCBM
Yaman Kumar39 views
Initiating and Advancing Your Strategic GIS Governance Strategy by Safe Software
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance Strategy
Safe Software198 views
The Power of Generative AI in Accelerating No Code Adoption.pdf by Saeed Al Dhaheri
The Power of Generative AI in Accelerating No Code Adoption.pdfThe Power of Generative AI in Accelerating No Code Adoption.pdf
The Power of Generative AI in Accelerating No Code Adoption.pdf
Saeed Al Dhaheri44 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE85 views
Deep Tech and the Amplified Organisation: Core Concepts by Holonomics
Deep Tech and the Amplified Organisation: Core ConceptsDeep Tech and the Amplified Organisation: Core Concepts
Deep Tech and the Amplified Organisation: Core Concepts
Holonomics17 views

Apache Arrow (Strata-Hadoop World San Jose 2016)

  • 1. DREMIO Apache Arrow: Faster conclusions using in-memory columnar SQL and machine learning. Strata San Jose - March 30, 2016
  • 2. Who
      Wes McKinney
      • Engineer at Cloudera, formerly DataPad CEO/founder
      • Wrote bestseller Python for Data Analysis (2012)
      • Open source projects
          – Python {pandas, Ibis, statsmodels}
          – Apache {Arrow, Parquet, Kudu (incubating)}
      • Mostly work in Python and Cython/C/C++
      Jacques Nadeau
      • CTO & Co-Founder at Dremio, formerly Architect at MapR
      • Open Source projects
          – Apache {Arrow, Parquet, Calcite, Drill, HBase, Phoenix}
      • Mostly work in Java
  • 3. Arrow in a Slide
      • New Top-level Apache Software Foundation project
          – Announced Feb 17, 2016
      • Focused on Columnar In-Memory Analytics
          1. 10-100x speedup on many workloads
          2. Common data layer enables companies to choose best of breed systems
          3. Designed to work with any programming language
          4. Support for both relational and complex data as-is
      • Developers from 13+ major open source projects involved
          – A significant % of the world’s data will be processed through Arrow!
          – Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  • 4. Agenda
      • Purpose
      • Memory Representation
      • Language Bindings
      • IPC & RPC
      • Example Integrations
  • 6. Overview
      • A high speed in-memory representation
      • Well-documented and cross language compatible
      • Designed to take advantage of modern CPU characteristics
      • Embeddable in execution engines, storage layers, etc.
  • 7. Focus on CPU Efficiency
      Example table:
          session_id    timestamp          source_ip
          1331246660    3/8/2012 2:44PM    99.155.155.225
          1331246351    3/8/2012 2:38PM    65.87.165.114
          1331244570    3/8/2012 2:09PM    71.10.106.181
          1331261196    3/8/2012 6:46PM    76.102.156.138
      [Figure: the same four rows (Row 1 to Row 4) laid out row by row in a Traditional Memory Buffer and column by column in an Arrow Memory Buffer.]
      • Cache Locality
      • Super-scalar & vectorized operation
      • Minimal Structure Overhead
      • Constant value access
          – With minimal structure overhead
      • Operate directly on columnar compressed data
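
The row-versus-column layout on this slide can be sketched with the standard library alone. This is a minimal illustration, not Arrow's actual implementation: the variable names are made up for the example, and a Python `array` of 64-bit integers merely stands in for a contiguous column buffer.

```python
from array import array

# The four rows shown on the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Columnar layout: one sequence per column instead of one tuple per row.
session_id = array("q", (r[0] for r in rows))  # packed 64-bit integers
timestamp = [r[1] for r in rows]
source_ip = [r[2] for r in rows]

# Scanning a single column touches only that column's values.
max_session = max(session_id)

# Constant-time access to the i-th value of a column.
third_ip = source_ip[2]
```

In the row layout, the same scan would have to step over every timestamp and IP address to reach the next session id, which is the cache-locality cost the slide refers to.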
  • 8. High Performance Sharing & Interchange
      Today
      • Each system has its own internal memory format
      • 70-80% CPU wasted on serialization and deserialization
      • Similar functionality implemented in multiple projects
      With Arrow
      • All systems utilize the same memory format
      • No overhead for cross-system communication
      • Projects can share functionality (e.g., Parquet-to-Arrow reader)
      [Diagram: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data through "Copy & Convert" steps; with Arrow, the same systems share Arrow Memory.]
  • 9. Shared Need > Open Source Opportunity
      • Columnar is Complex
      • Shredded Columnar is even more complex
      • We all need to go to same place
      • Take Advantage of Open Source approach
      • Once we pick a shared solution, we get interchange for “free”
      “We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” - Impala Team
      “A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” - Spark Team
  • 11. Columnar data
          persons = [{
            name: 'wes',
            iq: 180,
            addresses: [
              {number: 2, street: 'a'},
              {number: 3, street: 'bb'}
            ]
          }, {
            name: 'joe',
            iq: 100,
            addresses: [
              {number: 4, street: 'ccc'},
              {number: 5, street: 'dddd'},
              {number: 2, street: 'f'}
            ]
          }]
  • 16. Language Bindings
      • Target Languages
          – Java (beta)
          – C++ (underway)
          – Python & Pandas (underway)
          – R
          – Julia
      • Initial Focus
          – Read a structure
          – Write a structure
          – Manage Memory
  • 17. Java: Creating Dynamic Off-heap Structures
      Programmatic Construction:
          FieldWriter w = getWriter();
          // Scalar fields
          w.varChar("name").write("Wes");
          w.integer("iq").write(180);
          // List of address structs
          ListWriter list = w.list("addresses");
          list.startList();
            MapWriter map = list.map();
            map.start();
              map.integer("number").writeInt(2);
              map.varChar("street").write("a");
            map.end();
            map.start();
              map.integer("number").writeInt(3);
              map.varChar("street").write("bb");
            map.end();
          list.endList();
      JSON Representation:
          {
            name: 'wes',
            iq: 180,
            addresses: [
              {number: 2, street: 'a'},
              {number: 3, street: 'bb'}
            ]
          }
  • 18. Java: Memory Management (& NVMe)
      • Chunk-based managed allocator
          – Built on top of Netty’s JEMalloc implementation
      • Create a tree of allocators
          – Limit and transfer semantics across allocators
          – Leak detection and location accounting
      • Wrap native memory from other applications
      • New support for integration with Intel’s Persistent Memory library via Apache Mnemonic
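
The "tree of allocators" idea (per-allocator limits, ownership transfer, leak detection) can be illustrated with a small accounting-only sketch. Everything here is hypothetical: the `Allocator` class, its method names, and the byte counts are invented for illustration and are not the Arrow Java allocator API. No real memory is allocated; only the bookkeeping is shown.

```python
class Allocator:
    """Hypothetical accounting-only allocator node (not the Arrow Java API)."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit        # maximum bytes this node may account for
        self.parent = parent
        self.allocated = 0        # bytes currently charged to this node

    def new_child(self, name, limit):
        return Allocator(name, limit, parent=self)

    def _chain(self):
        # Yield this allocator and every ancestor up to the root.
        node = self
        while node is not None:
            yield node
            node = node.parent

    def allocate(self, nbytes):
        # Check every limit first so a failure leaves all counters unchanged.
        for node in self._chain():
            if node.allocated + nbytes > node.limit:
                raise MemoryError(f"{node.name}: limit of {node.limit} bytes exceeded")
        for node in self._chain():
            node.allocated += nbytes

    def release(self, nbytes):
        for node in self._chain():
            node.allocated -= nbytes

    def transfer(self, nbytes, target):
        # Move ownership of already-accounted bytes to another allocator.
        self.release(nbytes)
        try:
            target.allocate(nbytes)
        except MemoryError:
            self.allocate(nbytes)  # roll back so nothing is lost
            raise

    def close(self):
        # Leak detection: closing with outstanding bytes is an error.
        if self.allocated != 0:
            raise RuntimeError(f"{self.name}: leaked {self.allocated} bytes")


root = Allocator("root", limit=1024)
query = root.new_child("query", limit=512)
operator = root.new_child("operator", limit=256)

query.allocate(200)
query.transfer(200, operator)  # ownership moves; the root total is unchanged
query.close()                  # fine: nothing outstanding on this node
```

Charging each allocation to every ancestor is what lets a parent cap the combined usage of its children while each child still enforces its own limit.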
  • 20. Common Message Pattern
      • Schema Negotiation
          – Logical Description of structure
          – Identification of dictionary encoded Nodes
      • Dictionary Batch
          – Dictionary ID, Values
      • Record Batch
          – Batches of records up to 64K
          – Leaf nodes up to 2B values
      [Diagram: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch; labels "1..N Batches" and "0..N Batches".]
  • 21. Record Batch Construction
      [Diagram: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch. One Record Batch is expanded into the boxes listed below.]
      • data header (describes offsets into data)
      • name (bitmap)
      • name (offset)
      • name (data)
      • iq (bitmap)
      • iq (data)
      • addresses (bitmap)
      • addresses (list offset)
      • addresses.number (bitmap)
      • addresses.number
      • addresses.street (bitmap)
      • addresses.street (offset)
      • addresses.street (data)
      Example record:
          {
            name: 'wes',
            iq: 180,
            addresses: [
              {number: 2, street: 'a'},
              {number: 3, street: 'bb'}
            ]
          }
      Each box is contiguous memory, entirely contiguous on wire
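
The "offset", "data", and "bitmap" boxes can be made concrete with a standard-library sketch. It uses the two-person dataset from the "Columnar data" slide so that the list offsets are non-trivial. The helper names (`string_column`, `validity_bitmap`), the 32-bit offset width, and the least-significant-bit-first bitmap order are assumptions made for illustration; the slide itself does not specify them.

```python
from array import array

persons = [
    {"name": "wes", "iq": 180,
     "addresses": [{"number": 2, "street": "a"},
                   {"number": 3, "street": "bb"}]},
    {"name": "joe", "iq": 100,
     "addresses": [{"number": 4, "street": "ccc"},
                   {"number": 5, "street": "dddd"},
                   {"number": 2, "street": "f"}]},
]


def string_column(values):
    """Return (offsets, data) buffers for a variable-width string column."""
    offsets = array("i", [0])
    data = bytearray()
    for v in values:
        data += v.encode("utf-8")
        offsets.append(len(data))  # end of this value, start of the next
    return offsets, bytes(data)


def validity_bitmap(values):
    """Pack one bit per value (1 = present, 0 = null), lowest bit first."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# name: offset + data + bitmap
name_offsets, name_data = string_column([p["name"] for p in persons])
name_bitmap = validity_bitmap([p["name"] for p in persons])

# iq: fixed-width data
iq_data = array("i", [p["iq"] for p in persons])

# addresses: list offsets index into the flattened child columns
list_offsets = array("i", [0])
flat = []
for p in persons:
    flat.extend(p["addresses"])
    list_offsets.append(len(flat))

number_data = array("i", [a["number"] for a in flat])
street_offsets, street_data = string_column([a["street"] for a in flat])
```

With this layout, the addresses of person `i` are the child rows `list_offsets[i]` up to (not including) `list_offsets[i + 1]`, and street `j` is the byte range `street_offsets[j]` to `street_offsets[j + 1]` of `street_data`.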
  • 22. RPC & IPC: Moving Data Between Systems
      RPC
      • Avoid Serialization & Deserialization
      • Layer TBD: Focused on supporting vectored I/O
          – Scatter/gather reads/writes against socket
      IPC
      • Alpha implementation using memory mapped files
          – Moving data between Python and Drill
      • Working on shared allocation approach
          – Shared reference counting and well-defined ownership semantics
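
The memory-mapped-file approach to IPC can be sketched with the standard library: a producer writes a raw column buffer to a file, and a consumer maps the file and reads the values in place, with no parsing step. This is only an illustration of the general idea under simplifying assumptions (a single 32-bit integer column, both sides in one process, the same machine byte order, a made-up file name `iq.buf`); it is not the Arrow IPC format.

```python
import mmap
import os
import tempfile
from array import array

values = array("i", [180, 100, 95])

# Producer: write the raw column buffer to a file.
path = os.path.join(tempfile.mkdtemp(), "iq.buf")
with open(path, "wb") as f:
    f.write(values.tobytes())

# Consumer: map the file and view its bytes as 32-bit integers in place.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped, \
            memoryview(mapped) as raw, raw.cast("i") as column:
        total = sum(column)   # reads directly from the mapped pages
        second = column[1]    # constant-time access by index
```

Because the on-disk bytes already match the in-memory layout, the consumer only reinterprets them; this is the property that makes a shared columnar format attractive for moving data between processes.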
  • 24. Real World Example: Python With Spark or Drill
      [Diagram: a SQL Engine feeds input partitions 0 … n - 1 to user-supplied Python code (a Python function per partition, input to output); output partitions 0 … n - 1 are returned to a SQL Engine.]
  • 25. Real World Example: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • Written by Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
      [Diagram: a Feather file holds Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory) and Feather metadata (Google flatbuffers).]
  • 26. Real World Example: Feather File Format for Python and R
      R:
          library(feather)

          path <- "my_data.feather"
          write_feather(df, path)

          df <- read_feather(path)
      Python:
          import feather

          path = 'my_data.feather'

          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  • 27. What’s Next
      • Parquet for Python & C++
          – Using Arrow Representation
      • Available IPC Implementation
      • Spark, Drill Integration
          – Faster UDFs, Storage interfaces
  • 28. Get Involved
      • Join the community
          – dev@arrow.apache.org
          – Slack: https://apachearrowslackin.herokuapp.com/
          – http://arrow.apache.org
          – @ApacheArrow, @wesmckinn, @intjesus