© 2016 Dremio Corporation @DremioHQ
The future of column-oriented data processing with Arrow and Parquet
Julien Le Dem
Principal Architect, Dremio
VP Apache Parquet, Apache Arrow PMC
Julien Le Dem (@J_Julien)
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Incubator, Pig, Parquet
Agenda
• Benefits of columnar formats
  – On disk (Apache Parquet)
  – In memory (Apache Arrow)
• Community-driven standard
• Interoperability and ecosystem
Benefits of columnar formats
(image credit: @EmrgencyKittens)
Columnar layout
(diagram: the same logical table representation stored in a row layout vs. a column layout)
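The distinction above can be sketched in plain Python: the same logical table stored row-wise (a list of records) versus column-wise (one sequence per column). A minimal illustration of the two layouts, not tied to any particular format:

```python
# Toy illustration: the same logical table in row layout vs. column layout.
rows = [
    {"a": "a1", "b": "b1", "c": "c1"},
    {"a": "a2", "b": "b2", "c": "c2"},
    {"a": "a3", "b": "b3", "c": "c3"},
]

# Column layout: one contiguous sequence per column.
columns = {
    name: [row[name] for row in rows]
    for name in ("a", "b", "c")
}

# Reading a single column touches one sequence instead of every record.
assert columns["b"] == ["b1", "b2", "b3"]
```

Scanning one column in the row layout forces a visit to every record; in the column layout it is a single sequential pass.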
On Disk and in Memory
• Different trade-offs
  – On disk: storage.
    • Accessed by multiple queries.
    • Priority to I/O reduction (but still needs good CPU throughput).
    • Mostly streaming access.
  – In memory: transient.
    • Specific to one query execution.
    • Priority to CPU throughput (but still needs good I/O).
    • Streaming and random access.
Parquet: on-disk columnar format
Parquet: on-disk columnar format
• Nested data structures
• Compact format:
  – type-aware encodings
  – better compression
• Optimized I/O:
  – Projection push down (column pruning)
  – Predicate push down (filters based on stats)
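Both I/O optimizations can be sketched in plain Python. Assume a file split into row groups, each carrying min/max statistics per column (hypothetical structures, not the real Parquet APIs): projection push down reads only the requested columns, and predicate push down skips whole row groups whose statistics rule out a match.

```python
# Hypothetical row groups: column chunks plus per-column (min, max) statistics.
row_groups = [
    {"cols": {"a": [1, 3, 5],    "b": [10, 11, 12]}, "stats": {"a": (1, 5)}},
    {"cols": {"a": [20, 25, 30], "b": [13, 14, 15]}, "stats": {"a": (20, 30)}},
]

def scan(groups, columns, a_min):
    """Read only `columns`, skipping groups whose stats prove a < a_min everywhere."""
    out = []
    for g in groups:
        lo, hi = g["stats"]["a"]
        if hi < a_min:          # predicate push down: skip the whole group via stats
            continue
        out.append({c: g["cols"][c] for c in columns})  # projection push down
    return out

result = scan(row_groups, columns=["b"], a_min=10)
assert result == [{"b": [13, 14, 15]}]
```

The first row group is never read: its max for `a` is 5, so no row can satisfy `a >= 10`.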
Access only the data you need
(diagram: column pruning + columnar statistics = read only the data you need)
Parquet nested representation
Schema: Document → DocId, Links (Backward, Forward), Name (Language (Code, Country), Url)
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper.
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
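The columns listed above are the leaf paths of the schema. A simplified sketch of how a nested record shreds into per-path columns (this ignores Dremel's repetition and definition levels, which the real format needs in order to reconstruct the nesting):

```python
def shred(record, prefix=""):
    """Flatten a nested record into (column_path, value) pairs (simplified)."""
    pairs = []
    for key, value in record.items():
        path = f"{prefix}{key}".lower()
        if isinstance(value, dict):
            pairs.extend(shred(value, path + "."))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    pairs.extend(shred(item, path + "."))
                else:
                    pairs.append((path, item))
        else:
            pairs.append((path, value))
    return pairs

doc = {"DocId": 10, "Links": {"Forward": [20, 40]},
       "Name": [{"Language": [{"Code": "en-us", "Country": "us"}], "Url": "http://A"}]}
cols = shred(doc)
# Every value lands in a column named by its path, e.g. ("name.language.code", "en-us").
```

Each leaf path becomes one column; repeated fields simply contribute multiple values to that column.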
Arrow: in-memory columnar format
Arrow: in-memory columnar format
• Nested data structures
• Maximize CPU throughput
  – Pipelining
  – SIMD
  – Cache locality
• Scatter/gather I/O
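The CPU-throughput argument can be illustrated with Python's `array` module: a column of primitives lives in one contiguous buffer, so a scan walks memory sequentially, while a row-of-objects layout chases a pointer per field access. Python hides the machine-level effects, so this toy comparison only shows the access pattern, not real pipelining or SIMD:

```python
from array import array

# Row layout: one object per record, one pointer chase per field access.
rows = [{"a": i, "b": i * 2} for i in range(1000)]
row_sum = sum(r["a"] for r in rows)

# Column layout: the column is a single contiguous buffer of machine-width ints.
col_a = array("q", range(1000))
col_sum = sum(col_a)  # sequential scan over contiguous memory

assert row_sum == col_sum == 499500
```

The contiguous buffer is the layout that hardware prefetchers and vector units are built to exploit.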
CPU pipeline
(diagram of an instruction pipeline)
Minimize CPU cache misses
A cache miss costs tens to hundreds of cycles, depending on the cache level.
Focus on CPU Efficiency
(diagram: the same session_id, timestamp, and source_ip values laid out row by row in a traditional memory buffer vs. column by column in an Arrow memory buffer)
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant-time value access, with minimal structure overhead
• Operate directly on columnar compressed data
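"Operate directly on columnar compressed data" can be sketched with dictionary encoding: a filter resolves its predicate against the dictionary once, then scans only the small integer codes, never materializing the underlying strings. A hypothetical example:

```python
# Dictionary-encoded column: each distinct value stored once, plus a code per row.
dictionary = ["DE", "FR", "US"]
codes = [2, 0, 2, 1, 2, 0]  # row values: US, DE, US, FR, US, DE

# Filter `country == "US"`: resolve the predicate against the dictionary once...
target = dictionary.index("US")

# ...then scan only the integer codes, without decoding a single string.
matching_rows = [i for i, c in enumerate(codes) if c == target]
assert matching_rows == [0, 2, 4]
```

The scan compares machine integers instead of strings, which is both cheaper per row and friendlier to vectorization.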
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
(diagram: Name stored as an offsets buffer [0, 3, 7] over the values 'JoeJack'; Age as a plain values buffer [18, 37]; Phones as a list-offsets buffer over string offsets and character values)
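The Name buffers above (offsets 0, 3, 7 over the values 'JoeJack') can be reproduced in a few lines. This is a sketch of variable-length string encoding in general, not Arrow's actual implementation:

```python
def encode_strings(strings):
    """Encode variable-length strings as an offsets buffer plus one values buffer."""
    offsets, values = [0], ""
    for s in strings:
        values += s
        offsets.append(len(values))
    return offsets, values

offsets, values = encode_strings(["Joe", "Jack"])
assert offsets == [0, 3, 7] and values == "JoeJack"

# Random access to row i is two offset reads plus a slice -- no per-row parsing.
assert values[offsets[1]:offsets[2]] == "Jack"
```

The same trick composes: a list column (like `phones`) adds one more offsets buffer on top of the string offsets.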
Java: Memory Management
• Chunk-based managed allocator
  – Built on top of Netty's jemalloc implementation
• Create a tree of allocators
  – Limit and transfer semantics across allocators
  – Leak detection and location accounting
• Wrap native memory from other applications
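The tree-of-allocators idea (child allocators charging their ancestors, limits enforced at every level, accounting that exposes leaks at shutdown) can be sketched as a toy class. The names and semantics here are illustrative only, not Arrow's Java API:

```python
class Allocator:
    """Toy tree allocator: children charge their ancestors; limits are enforced."""
    def __init__(self, limit, parent=None):
        self.limit, self.used, self.parent = limit, 0, parent

    def child(self, limit):
        return Allocator(limit, parent=self)

    def allocate(self, size):
        node = self
        while node is not None:            # check every level before charging any
            if node.used + size > node.limit:
                raise MemoryError("limit exceeded")
            node = node.parent
        node = self
        while node is not None:            # charge self and every ancestor
            node.used += size
            node = node.parent

    def release(self, size):
        node = self
        while node is not None:
            node.used -= size
            node = node.parent

root = Allocator(limit=1024)
op = root.child(limit=256)
op.allocate(200)
assert (op.used, root.used) == (200, 200)
op.release(200)
assert root.used == 0  # a nonzero total at shutdown would indicate a leak
```

Transfer semantics would move a charge from one child to another without touching the shared ancestors; leak detection falls out of the accounting.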
Community Driven Standard
An open source standard
• Arrow: a common need for in-memory columnar.
• Benefits:
  – Share the effort
  – Create an ecosystem
• Building on the success of Parquet.
• A standard from the start.
Shared Need > Open Source Opportunity
"We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions." – Impala Team
"A large fraction of the CPU time is spent waiting for data to be fetched from main memory… we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work." – Spark Team
"Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models and allows efficient processing of it without need to flatten or materialize." – Drill Team
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
• Interoperable
The Apache Arrow Project
• New top-level Apache Software Foundation project
  – Announced Feb 17, 2016
• Focused on columnar in-memory analytics:
  1. 10-100x speedups on many workloads
  2. A common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
  – A significant % of the world's data will be processed through Arrow!
Involved projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, R, Spark, Storm
Interoperability and Ecosystem
High Performance Sharing & Interchange
Today:
• Each system has its own internal memory format
• 70-80% of CPU time is wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)
(diagram: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark copying and converting between every pair today vs. all sharing one Arrow memory format)
Language Bindings
Parquet target languages:
  – Java
  – C++ (underway)
  – Python & Pandas (underway)
Arrow target languages:
  – Java (beta)
  – C++ (underway)
  – Python & Pandas (underway)
  – R
  – Julia
Arrow initial focus:
  – Read a structure
  – Write a structure
  – Manage memory
RPC & IPC
Common Message Pattern
• Schema negotiation
  – Logical description of the structure
  – Identification of dictionary-encoded nodes
• Dictionary batch
  – Dictionary ID, values
• Record batch
  – Batches of up to 64K records
  – Leaf nodes of up to 2B values
(message sequence: schema negotiation, then 0..N dictionary batches, then 1..N record batches)
Record Batch Construction
Example record:
{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}
A data header describes the offsets into the data: name (bitmap, offset, data), age (bitmap, data), phones (bitmap, list offset, offset, data).
Each box (vector) is contiguous memory.
The entire record batch is contiguous on the wire.
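The per-field bitmaps in the header track which values are null, one bit per row. A sketch of building and reading such a validity bitmap (least-significant-bit ordering within each byte, which is what Arrow specifies; the rest is simplified):

```python
def build_bitmap(values):
    """Pack one validity bit per row (1 = value present), LSB-first in each byte."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def is_valid(bitmap, i):
    """Read row i's validity bit."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))

ages = [18, None, 37]
bitmap = build_bitmap(ages)
assert [is_valid(bitmap, i) for i in range(3)] == [True, False, True]
```

Because the bitmap is a separate contiguous buffer, null checks are cheap bit tests and the values buffer stays fixed-width.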
Moving Data Between Systems
RPC:
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored I/O
  – Scatter/gather reads/writes against the socket
IPC:
• Alpha implementation using memory-mapped files
  – Moving data between Python and Drill
• Working on a shared allocation approach
  – Shared reference counting and well-defined ownership semantics
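The memory-mapped-file approach can be sketched with Python's `mmap`: one process writes a buffer into a file, another maps the same file and reads the values in place, with no copy through a socket and no deserialization step. A minimal single-process illustration (the producer and consumer would normally be different processes):

```python
import mmap
import os
import struct
import tempfile

# "Producer": write a fixed-width column of four 64-bit ints into a file.
payload = struct.pack("<4q", 10, 20, 30, 40)
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

# "Consumer": map the file and read the values in place -- no deserialization.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        values = struct.unpack_from("<4q", m, 0)

os.unlink(path)
assert values == (10, 20, 30, 40)
```

Because the on-wire layout and the in-memory layout are the same, "reading" is just interpreting bytes where they already sit.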
Example data exchanges:
RPC: Query execution
SELECT SUM(a) FROM t GROUP BY b
(diagram: scanners read Parquet files with projection push down, reading only columns a and b; partial aggregations feed Arrow batches through a shuffle into final aggregations, which send the result to the client)
The memory representation is sent over the wire: no serialization overhead.
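The plan above (partial aggregation per scanner, shuffle by key, final aggregation) can be sketched for SELECT SUM(a) FROM t GROUP BY b. Plain dicts of rows stand in for the Arrow batches:

```python
from collections import defaultdict

def partial_agg(batch):
    """Per-scanner partial SUM(a) GROUP BY b over one batch of rows."""
    acc = defaultdict(int)
    for row in batch:
        acc[row["b"]] += row["a"]
    return acc

def final_agg(partials):
    """Merge the per-scanner partial results after the shuffle by key."""
    out = defaultdict(int)
    for p in partials:
        for key, s in p.items():
            out[key] += s
    return dict(out)

batches = [
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    [{"a": 3, "b": "x"}],
]
result = final_agg(partial_agg(b) for b in batches)
assert result == {"x": 4, "y": 2}
```

Partial aggregation shrinks the data before the shuffle, so only the small per-key sums cross the wire as Arrow batches.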
IPC: Python with Spark or Drill
(diagram: a SQL engine's operators and a Python process running a user-defined function both read the same immutable Arrow batch)
What's Next
• Parquet-to-Arrow conversion for Python & C++
• Arrow IPC implementation
• Apache {Spark, Drill} to Arrow integration
  – Faster UDFs, storage interfaces
• Support for integration with Intel's Persistent Memory library via Apache Mnemonic
Get Involved
• Join the community
  – dev@arrow.apache.org, dev@parquet.apache.org
  – Slack: https://apachearrowslackin.herokuapp.com/
  – http://arrow.apache.org, http://parquet.apache.org
  – Follow @ApacheParquet, @ApacheArrow
Parquet Twitter Seattle open house
Julien Le Dem
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
Julien Le Dem
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
Julien Le Dem
 

More from Julien Le Dem (12)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
 

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 

Strata London 2016: The future of column oriented data processing with Arrow and Parquet

  • 1. The future of column oriented data processing with Arrow and Parquet
Julien Le Dem, Principal Architect, Dremio
VP Apache Parquet, Apache Arrow PMC
© 2016 Dremio Corporation @DremioHQ
  • 2. Julien Le Dem (@J_)
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Incubator, Pig, Parquet
  • 3. Agenda
• Benefits of Columnar formats
– On disk (Apache Parquet)
– In memory (Apache Arrow)
• Community Driven Standard
• Interoperability and Ecosystem
  • 4. Benefits of Columnar formats (image credit: @EmrgencyKittens)
  • 5. Columnar layout (diagram: the same logical table representation stored in a row layout vs. a column layout)
  • 6. On Disk and in Memory
• Different trade-offs
– On disk: Storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly streaming access.
– In memory: Transient.
• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and random access.
  • 7. Parquet: on-disk columnar format
  • 8. Parquet: on-disk columnar format
• Nested data structures
• Compact format:
– type-aware encodings
– better compression
• Optimized I/O:
– Projection push down (column pruning)
– Predicate push down (filters based on stats)
  • 9. Access only the data you need
(Diagram: the same table shown three times; column pruning selects only columns a and b, and columnar statistics let whole chunks be skipped. Column pruning + columnar statistics = read only the data you need!)
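The combination of column pruning and statistics can be sketched in a few lines (a toy model for illustration, not Parquet's actual reader): each column chunk carries min/max statistics, and any chunk whose range cannot satisfy the predicate is skipped without being read or decoded.

```python
# Toy model of predicate push down with per-chunk min/max statistics.
# chunks: list of (stats, rows) where stats = (min, max) of the column.
def scan(chunks, predicate_min):
    out = []
    for (lo, hi), rows in chunks:
        if hi < predicate_min:   # statistics prove no row in this chunk matches
            continue             # skip: no I/O, no decoding
        out.extend(r for r in rows if r >= predicate_min)
    return out

chunks = [((1, 5), [1, 3, 5]), ((6, 9), [6, 7, 9])]
# scan(chunks, 6) touches only the second chunk and returns [6, 7, 9]
```

The same idea generalizes to any monotone predicate: the reader only needs to compare the filter bound against the stored statistics before deciding to fetch a chunk.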
  • 10. Parquet nested representation
Schema: Document { DocId, Links { Backward, Forward }, Name { Language { Code, Country }, Url } }
Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url
Borrowed from the Google Dremel paper: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
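The mapping from the nested Document schema to flat column paths can be sketched as follows (an illustrative shredder only; the definition and repetition levels that the Dremel paper uses to encode nesting and repetition are omitted here):

```python
# Sketch: shred a nested record into flat dotted column paths,
# as in the slide's docid / links.backward / name.language.code example.
def shred(record, prefix=""):
    columns = {}
    for key, value in record.items():
        path = f"{prefix}{key}".lower()
        if isinstance(value, dict):
            columns.update(shred(value, path + "."))  # recurse into groups
        else:
            columns[path] = value                     # leaf becomes a column
    return columns

doc = {"DocId": 10,
       "Links": {"Backward": [], "Forward": [20]},
       "Name": {"Url": "http://A", "Language": {"Code": "en", "Country": "us"}}}
# shred(doc) yields the column paths:
#   docid, links.backward, links.forward, name.url,
#   name.language.code, name.language.country
```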
  • 11. Arrow: in-memory columnar format
  • 12. Arrow: in-memory columnar format
• Nested data structures
• Maximize CPU throughput
– Pipelining
– SIMD
– Cache locality
• Scatter/gather I/O
  • 13. CPU pipeline (diagram)
  • 14. Minimize CPU cache misses: a cache miss costs 10 to 100s of cycles depending on the level.
  • 15. Focus on CPU Efficiency
(Diagram: the same session_id / timestamp / source_ip data laid out row by row in a traditional memory buffer vs. column by column in an Arrow memory buffer.)
• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access (with minimal structure overhead)
• Operate directly on columnar compressed data
  • 16. Columnar data
persons = [
  { name: 'Joe',  age: 18, phones: ['555-111-1111', '555-222-2222'] },
  { name: 'Jack', age: 37, phones: ['555-333-3333'] }
]
Name: offsets [0, 3, 7] into the values buffer "JoeJack"
Age: values [18, 37]
Phones: list offsets [0, 2, 3] selecting string offsets [0, 12, 24, 36] into the concatenated phone digits
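The offsets-plus-values layout on this slide can be sketched directly (plain Python for illustration, not Arrow's buffer implementation): variable-length strings are concatenated into one contiguous values buffer, and an offsets buffer marks where each element starts.

```python
# Sketch: variable-length strings as one values buffer + one offsets buffer.
def encode(strings):
    offsets, values = [0], ""
    for s in strings:
        values += s
        offsets.append(len(values))  # end of element i = start of element i+1
    return offsets, values

def get(offsets, values, i):
    # Random access is two offset loads and one contiguous slice.
    return values[offsets[i]:offsets[i + 1]]

offsets, values = encode(["Joe", "Jack"])
# offsets == [0, 3, 7], values == "JoeJack", matching the slide
```

Nesting composes the same trick: the phones column adds a list-offsets buffer on top of the string-offsets buffer.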
  • 17. Java: Memory Management
• Chunk-based managed allocator
– Built on top of Netty's jemalloc implementation
• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting
• Wrap native memory from other applications
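The tree-of-allocators idea with limit semantics can be sketched as below. This is a loose illustration of the concept described on the slide, not Arrow's Java allocator API: each allocator enforces its own limit, and every allocation is also accounted against all of its ancestors.

```python
# Illustrative allocator tree: limits are checked and usage is accounted
# along the path from the requesting allocator up to the root.
class Allocator:
    def __init__(self, limit, parent=None):
        self.limit, self.used, self.parent = limit, 0, parent

    def child(self, limit):
        return Allocator(limit, parent=self)

    def allocate(self, size):
        node = self
        while node:                      # first pass: check every ancestor's limit
            if node.used + size > node.limit:
                raise MemoryError("allocator limit exceeded")
            node = node.parent
        node = self
        while node:                      # second pass: commit the accounting
            node.used += size
            node = node.parent

    def release(self, size):
        node = self
        while node:
            node.used -= size
            node = node.parent

root = Allocator(limit=1024)
child = root.child(limit=256)
child.allocate(200)  # accounted in both child and root
```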
  • 18. Community Driven Standard
  • 19. An open source standard
• Arrow: common need for in-memory columnar.
• Benefits:
– Share the effort
– Create an ecosystem
• Building on the success of Parquet.
• Standard from the start
  • 20. Shared Need > Open Source Opportunity
"We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions" (Impala team)
"A large fraction of the CPU time is spent waiting for data to be fetched from main memory… we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work" (Spark team)
"Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models and allows efficient processing of it without need to flatten or materialize." (Drill team)
  • 21. Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
• Interoperable
  • 22. The Apache Arrow Project
• New top-level Apache Software Foundation project, announced Feb 17, 2016
• Focused on columnar in-memory analytics:
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved (Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, R, Spark, Storm)
– A significant % of the world's data will be processed through Arrow!
  • 23. Interoperability and Ecosystem
  • 24. High Performance Sharing & Interchange
Today:
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)
(Diagram: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark each copy & convert between formats today; with Arrow they all share one Arrow memory format.)
  • 25. Language Bindings
Parquet target languages:
– Java
– C++ (underway)
– Python & Pandas (underway)
Arrow target languages:
– Java (beta)
– C++ (underway)
– Python & Pandas (underway)
– R
– Julia
Initial focus: read a structure, write a structure, manage memory.
  • 26. RPC & IPC
  • 27. Common Message Pattern
• Schema negotiation
– Logical description of structure
– Identification of dictionary-encoded nodes
• Dictionary batch
– Dictionary ID, values
• Record batch
– Batches of up to 64K records
– Leaf nodes of up to 2B values
(Flow: schema negotiation, then 0..N dictionary batches, then 1..N record batches.)
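The dictionary batch relies on dictionary encoding, which can be illustrated with a toy encoder (plain Python for illustration, not Arrow's wire format): repeated values are replaced by small integer ids into a dictionary that is transmitted once, so the record batches that follow only carry the ids.

```python
# Toy dictionary encoder: values -> (dictionary sent once, per-row ids).
def dictionary_encode(values):
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)  # assign the next id
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

dictionary, ids = dictionary_encode(["FR", "US", "FR", "FR", "US"])
# dictionary == ["FR", "US"], ids == [0, 1, 0, 0, 1]
```

For low-cardinality columns this shrinks the data dramatically and lets operators compare small integers instead of full values.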
  • 28. Record Batch Construction
{ name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }
A data header describes offsets into the data; it is followed by one buffer per node: name (bitmap, offset, data), age (bitmap, data), phones (bitmap, list offset, offset, data).
Each box (vector) is contiguous memory, and the entire record batch is contiguous on the wire.
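The per-node bitmap mentioned above is a validity bitmap: bit i is set when row i is non-null. A minimal sketch (illustrative only; Arrow's actual buffers are padded and aligned):

```python
# Sketch of a validity bitmap: one bit per row, set when the row is non-null.
def build_bitmap(values):
    bitmap = bytearray((len(values) + 7) // 8)  # one byte covers 8 rows
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)      # least-significant-bit order
    return bytes(bitmap)

def is_valid(bitmap, i):
    return bool(bitmap[i // 8] & (1 << (i % 8)))

bm = build_bitmap([18, None, 37])
# rows 0 and 2 are valid -> first byte is 0b00000101
```

Keeping nulls in a separate bit-packed buffer is what lets the data buffers stay fixed-width and contiguous.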
  • 29. Moving Data Between Systems
RPC:
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored I/O
– Scatter/gather reads/writes against a socket
IPC:
• Alpha implementation using memory-mapped files
– Moving data between Python and Drill
• Working on a shared allocation approach
– Shared reference counting and well-defined ownership semantics
  • 30. Example data exchanges:
  • 31. RPC: Query execution (SELECT SUM(a) FROM t GROUP BY b)
(Diagram: scanners read Parquet files with projection push down, reading only columns a and b, and produce immutable Arrow batches; partial aggregations shuffle Arrow batches to final aggregations, which send the result to the client.)
The memory representation is sent over the wire: no serialization overhead.
  • 32. IPC: Python with Spark or Drill
(Diagram: between two SQL operators in the engine, both the SQL engine and a Python process running a user defined function read the same immutable Arrow batch; no copy is made.)
  • 33. What's Next
• Parquet-to-Arrow conversion for Python & C++
• Arrow IPC implementation
• Apache {Spark, Drill} to Arrow integration
– Faster UDFs, storage interfaces
• Support for integration with Intel's Persistent Memory library via Apache Mnemonic
  • 34. Get Involved
• Join the community
– dev@arrow.apache.org, dev@parquet.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org and http://parquet.apache.org
– Follow @ApacheParquet, @ApacheArrow