Strata London 2016
The future of column oriented data processing with Arrow and Parquet
Julien Le Dem
Principal Architect, Dremio
VP Apache Parquet, Apache Arrow PMC
© 2016 Dremio Corporation – @DremioHQ

• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Incubator, Pig, Parquet
Julien Le Dem – @J_

Agenda
• Benefits of Columnar formats
  – On disk (Apache Parquet)
  – In memory (Apache Arrow)
• Community Driven Standard
• Interoperability and Ecosystem

Benefits of Columnar formats
(image credit: @EmrgencyKittens)

Columnar layout
[Figure: a logical table representation, its row layout, and its column layout]

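To make the two layouts concrete, here is a minimal plain-Java sketch (the table and field names are made up): the same three (a, b, c) records stored row-wise as objects and column-wise as one array per field.

// A minimal sketch, not from the talk: one logical table, two physical layouts.
public class LayoutSketch {
  static final class Row {
    int a, b, c;
    Row(int a, int b, int c) { this.a = a; this.b = b; this.c = c; }
  }

  public static void main(String[] args) {
    // Row layout: the fields of one record are adjacent in memory.
    Row[] rows = { new Row(1, 10, 100), new Row(2, 20, 200), new Row(3, 30, 300) };

    // Column layout: the values of one column are adjacent in memory.
    int[] a = {1, 2, 3};
    int[] b = {10, 20, 30};
    int[] c = {100, 200, 300};

    // A query that only needs column a scans one contiguous array.
    long sum = 0;
    for (int v : a) sum += v;
    System.out.println(sum); // 6
  }
}
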
On Disk and in Memory
• Different trade-offs
  – On disk: Storage.
    • Accessed by multiple queries.
    • Priority to I/O reduction (but still needs good CPU throughput).
    • Mostly streaming access.
  – In memory: Transient.
    • Specific to one query execution.
    • Priority to CPU throughput (but still needs good I/O).
    • Streaming and random access.

Parquet on disk columnar format
• Nested data structures
• Compact format:
  – type-aware encodings
  – better compression
• Optimized I/O:
  – Projection push down (column pruning)
  – Predicate push down (filters based on stats)

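As an illustration of both optimizations, here is a hedged sketch using parquet-mr's Avro bindings (the file name, column names, and schema are hypothetical, and the exact builder methods vary somewhat across parquet-mr versions): the requested projection limits which columns are read from disk, and the filter predicate lets whole row groups be skipped based on their statistics.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetReader;

import static org.apache.parquet.filter2.predicate.FilterApi.gt;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

public class PushdownSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Projection push down: only columns a and b are materialized from disk.
    Schema projection = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"t\",\"fields\":["
      + "{\"name\":\"a\",\"type\":\"int\"},{\"name\":\"b\",\"type\":\"int\"}]}");
    AvroReadSupport.setRequestedProjection(conf, projection);

    // Predicate push down: row groups whose statistics prove max(a) <= 100 are skipped.
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("t.parquet"))
                 .withConf(conf)
                 .withFilter(FilterCompat.get(gt(intColumn("a"), 100)))
                 .build()) {
      for (GenericRecord rec = reader.read(); rec != null; rec = reader.read()) {
        System.out.println(rec.get("a") + " " + rec.get("b"));
      }
    }
  }
}
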
Access only the data you need
[Figure: a table with columns a, b, c and five rows; column pruning + columnar statistics = read only the data you need!]

Parquet nested representation
[Figure: schema tree of Document { DocId; Links { Backward, Forward }; Name { Language { Code, Country }, Url } }]
Columns:
  docid
  links.backward
  links.forward
  name.language.code
  name.language.country
  name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet

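For reference, the Document schema above can be written in Parquet's schema syntax and parsed with parquet-mr. This is a sketch rather than something shown in the talk; the string fields are left as plain binary.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DremelSchemaSketch {
  public static void main(String[] args) {
    // The Document schema from the Dremel paper in Parquet schema syntax.
    MessageType document = MessageTypeParser.parseMessageType(
        "message Document {\n"
      + "  required int64 DocId;\n"
      + "  optional group Links {\n"
      + "    repeated int64 Backward;\n"
      + "    repeated int64 Forward;\n"
      + "  }\n"
      + "  repeated group Name {\n"
      + "    repeated group Language {\n"
      + "      required binary Code;\n"
      + "      optional binary Country;\n"
      + "    }\n"
      + "    optional binary Url;\n"
      + "  }\n"
      + "}");
    // Each leaf field becomes one column: DocId, Links.Backward, Links.Forward,
    // Name.Language.Code, Name.Language.Country, Name.Url.
    System.out.println(document);
  }
}
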
Arrow in memory columnar format
• Nested data structures
• Maximize CPU throughput
  – Pipelining
  – SIMD
  – Cache locality
• Scatter/gather I/O

CPU pipeline

Minimize CPU cache misses
A cache miss costs tens to hundreds of cycles, depending on the cache level.

Focus on CPU Efficiency
[Figure: the same four records (session_id, timestamp, source_ip) laid out row by row in a traditional memory buffer vs. column by column in an Arrow memory buffer]
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
  – With minimal structure overhead
• Operate directly on columnar compressed data

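A small plain-Java sketch of the idea (the session table is hypothetical): summing session_id over a row-wise array of objects drags every field of every record through the cache, while the columnar version scans one contiguous array that the CPU can prefetch and the JIT can unroll or vectorize.

// A sketch, not from the talk: row-wise vs. column-wise scan of one field.
public class ColumnScanSketch {
  static final class Session { long sessionId; long timestamp; int sourceIp; }

  public static void main(String[] args) {
    int n = 1_000_000;

    // Row-wise: each record is a separate heap object; scanning session_id also
    // pulls the unrelated timestamp and source_ip fields into the cache.
    Session[] rows = new Session[n];
    for (int i = 0; i < n; i++) { rows[i] = new Session(); rows[i].sessionId = i; }
    long rowSum = 0;
    for (Session s : rows) rowSum += s.sessionId;

    // Column-wise: session_id values form one contiguous array; the scan is
    // sequential, prefetch-friendly, and easy for the JIT to unroll or vectorize.
    long[] sessionId = new long[n];
    for (int i = 0; i < n; i++) sessionId[i] = i;
    long colSum = 0;
    for (long v : sessionId) colSum += v;

    System.out.println(rowSum == colSum);
  }
}
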
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
[Figure: the columnar encoding of these records: Name as an offsets buffer (0, 3, 7) over the values "JoeJack"; Age as the values 18, 37; Phones as list offsets (0, 2, 3) over string offsets (0, 12, 24, 36) into one values buffer]

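The buffers in this figure can be mirrored with plain Java arrays. The sketch below uses the same values and offsets as the slide (list offsets 0, 2, 3 and string offsets 0, 12, 24, 36) and shows how one value is reconstructed from them.

import java.nio.charset.StandardCharsets;

public class ColumnarPersonsSketch {
  public static void main(String[] args) {
    // Two records: {Joe, 18, [555-111-1111, 555-222-2222]} and {Jack, 37, [555-333-3333]}.

    // Name column: one contiguous UTF-8 buffer plus offsets (start of each value).
    byte[] nameValues = "JoeJack".getBytes(StandardCharsets.UTF_8);
    int[] nameOffsets = {0, 3, 7};          // Joe = [0,3), Jack = [3,7)

    // Age column: fixed width, values only.
    int[] ageValues = {18, 37};

    // Phones column (list of strings): list offsets into a string column,
    // which itself has string offsets into one values buffer.
    int[] phoneListOffsets = {0, 2, 3};     // Joe owns phones [0,2), Jack owns [2,3)
    byte[] phoneValues = "555-111-1111555-222-2222555-333-3333"
        .getBytes(StandardCharsets.UTF_8);
    int[] phoneStrOffsets = {0, 12, 24, 36};

    // Reconstruct Jack's first phone number from the buffers:
    int phone = phoneListOffsets[1];        // first phone of record 1
    String s = new String(phoneValues, phoneStrOffsets[phone],
        phoneStrOffsets[phone + 1] - phoneStrOffsets[phone], StandardCharsets.UTF_8);
    System.out.println(s);                  // 555-333-3333
  }
}
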
Java: Memory Management
• Chunk-based managed allocator
  – Built on top of Netty's jemalloc implementation
• Create a tree of allocators
  – Limit and transfer semantics across allocators
  – Leak detection and location accounting
• Wrap native memory from other applications

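A minimal sketch of the allocator tree using the Arrow Java API (class names are from a recent Arrow release, which postdates this 2016 talk; the allocator names and limits are made up):

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class AllocatorTreeSketch {
  public static void main(String[] args) {
    try (BufferAllocator root = new RootAllocator(1L << 30);                      // global limit: 1 GB
         BufferAllocator query = root.newChildAllocator("query-1", 0, 1L << 26);  // per-query limit
         BufferAllocator scan = query.newChildAllocator("scan", 0, 1L << 24)) {   // per-operator limit

      // Memory for this vector is accounted against the "scan" child allocator.
      try (IntVector v = new IntVector("a", scan)) {
        v.allocateNew(1024);
        v.setSafe(0, 42);
        v.setValueCount(1);
        System.out.println("scan allocated: " + scan.getAllocatedMemory() + " bytes");
      }
    } // closing an allocator that still owns buffers throws, which is how leaks are located
  }
}
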
Community Driven Standard

An open source standard
• Arrow: common need for in-memory columnar.
• Benefits:
  – Share the effort
  – Create an ecosystem
• Building on the success of Parquet.
• Standard from the start

Shared Need > Open Source Opportunity
"We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions" – Impala Team
"A large fraction of the CPU time is spent waiting for data to be fetched from main memory… we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work" – Spark Team
"Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models and allows efficient processing of it without need to flatten or materialize." – Drill Team

Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
• Interoperable

The Apache Arrow Project
• New top-level Apache Software Foundation project
  – Announced Feb 17, 2016
• Focused on columnar in-memory analytics
  1. 10–100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
  – A significant % of the world's data will be processed through Arrow!
Involved projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Interoperability and Ecosystem

High Performance Sharing & Interchange
Today:
• Each system has its own internal memory format
• 70–80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)
[Figure: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark each copy & convert between formats; with Arrow, they all share one Arrow memory representation]

Language Bindings
Parquet
• Target languages
  – Java
  – C++ (underway)
  – Python & Pandas (underway)
Arrow
• Target languages
  – Java (beta)
  – C++ (underway)
  – Python & Pandas (underway)
  – R
  – Julia
• Initial focus
  – Read a structure
  – Write a structure
  – Manage memory

RPC & IPC

Common Message Pattern
• Schema negotiation
  – Logical description of the structure
  – Identification of dictionary-encoded nodes
• Dictionary batch
  – Dictionary ID, values
• Record batch
  – Batches of up to 64K records
  – Leaf nodes hold up to 2B values
[Figure: a stream is one Schema Negotiation, then 0..N Dictionary Batches, then 1..N Record Batches]

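A sketch of the same message pattern with the Arrow Java streaming API (this API postdates the talk; the schema and values are made up): the writer emits the schema once, then one message per record batch. Passing a DictionaryProvider instead of null would add dictionary batches to the stream.

import java.io.ByteArrayOutputStream;
import java.util.Collections;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class MessagePatternSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Collections.singletonList(
        Field.nullable("a", new ArrowType.Int(32, true))));
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
         // null dictionary provider: no dictionary batches in this stream
         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out)) {

      writer.start();                                  // schema message first
      IntVector a = (IntVector) root.getVector("a");
      for (int batch = 0; batch < 3; batch++) {        // then 1..N record batches
        root.allocateNew();
        for (int i = 0; i < 4; i++) a.setSafe(i, batch * 4 + i);
        root.setRowCount(4);
        writer.writeBatch();
      }
      writer.end();                                    // end-of-stream marker
    }
    System.out.println(out.size() + " bytes written to the stream");
  }
}
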
Record Batch Construction
[Figure: the record batch for { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }: a data header (describes offsets into the data), followed by the buffers name (bitmap), name (offset), name (data), age (bitmap), age (data), phones (bitmap), phones (list offset), phones (offset), phones (data)]
Each box (vector) is contiguous memory.
The entire record batch is contiguous on the wire.

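A sketch with the Arrow Java API (again a later API than the one available in 2016; the phones list vector is left out for brevity): each vector is backed by a handful of contiguous buffers (validity bitmap, offsets for variable-width types, and data), and the record batch's data header describes where each buffer lives.

import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

public class VectorBuffersSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VarCharVector name = new VarCharVector("name", allocator);
         IntVector age = new IntVector("age", allocator)) {

      name.allocateNew();
      name.setSafe(0, "Joe".getBytes(StandardCharsets.UTF_8));
      name.setValueCount(1);

      age.allocateNew(1);
      age.setSafe(0, 18);
      age.setValueCount(1);

      // Each vector is a few contiguous buffers; the record batch's data header
      // records where each buffer starts and how long it is.
      System.out.println("name validity bitmap: " + name.getValidityBuffer().capacity() + " bytes");
      System.out.println("name offsets:         " + name.getOffsetBuffer().capacity() + " bytes");
      System.out.println("name data:            " + name.getDataBuffer().capacity() + " bytes");
      System.out.println("age validity bitmap:  " + age.getValidityBuffer().capacity() + " bytes");
      System.out.println("age data:             " + age.getDataBuffer().capacity() + " bytes");
    }
  }
}
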
Moving Data Between Systems
RPC
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored I/O
  – Scatter/gather reads/writes against the socket
IPC
• Alpha implementation using memory-mapped files
  – Moving data between Python and Drill
• Working on a shared allocation approach
  – Shared reference counting and well-defined ownership semantics

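A hedged sketch of the file-based IPC path in Java (the file name is hypothetical and the reader API shown postdates the talk; this version reads through a FileChannel rather than an explicit memory map, but the on-disk message layout is the same): the schema is read from the file footer and record batches are loaded one at a time.

import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class IpcReadSketch {
  public static void main(String[] args) throws Exception {
    // "batches.arrow" is a hypothetical Arrow IPC file written by another process.
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         FileChannel channel = FileChannel.open(Paths.get("batches.arrow"), StandardOpenOption.READ);
         ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {

      VectorSchemaRoot root = reader.getVectorSchemaRoot();  // schema comes from the file footer
      while (reader.loadNextBatch()) {                       // load record batches one by one
        System.out.println("loaded a batch with " + root.getRowCount() + " rows");
      }
    }
  }
}
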
Example data exchanges:

RPC: Query execution
SELECT SUM(a) FROM t GROUP BY b
[Figure: scanners read Parquet files with projection push down (read only a and b) and emit immutable Arrow batches; partial aggregates feed a shuffle of Arrow batches into final aggregates, which deliver the result]
The memory representation is sent over the wire: no serialization overhead.

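What a scanner feeding a partial aggregate does can be sketched with plain arrays standing in for the a and b columns of one Arrow batch (the values are made up); the partial sums per group would then be shuffled on b and merged by the final aggregation.

import java.util.HashMap;
import java.util.Map;

public class PartialAggSketch {
  // One incoming batch: only columns a and b were read (projection push down).
  static Map<Integer, Long> partialSum(int[] a, int[] b) {
    Map<Integer, Long> sums = new HashMap<>();
    for (int i = 0; i < a.length; i++) {
      sums.merge(b[i], (long) a[i], Long::sum);   // SUM(a) ... GROUP BY b
    }
    return sums;
  }

  public static void main(String[] args) {
    int[] a = {5, 7, 1, 2};
    int[] b = {1, 2, 1, 2};
    // Per-batch partial results would then be shuffled by b and merged.
    System.out.println(partialSum(a, b));          // {1=6, 2=9}
  }
}
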
IPC: Python with Spark or Drill
[Figure: a SQL engine (SQL operator 1 and SQL operator 2) and a Python process hosting a user-defined function exchange immutable Arrow batches; each side reads the shared batches directly]

What's Next
• Parquet – Arrow conversion for Python & C++
• Arrow IPC implementation
• Apache {Spark, Drill} to Arrow integration
  – Faster UDFs, storage interfaces
• Support for integration with Intel's Persistent Memory library via Apache Mnemonic

Get Involved
• Join the community
  – dev@arrow.apache.org, dev@parquet.apache.org
  – Slack: https://apachearrowslackin.herokuapp.com/
  – http://arrow.apache.org – http://parquet.apache.org
  – Follow @ApacheParquet, @ApacheArrow