Parquet
Columnar storage for the people
Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
Nong Li (nong@cloudera.com), software engineer, Cloudera Impala
http://parquet.io
Outline
• Context from various companies
• Results in production and benchmarks
• Format deep-dive
http://parquet.io
Twitter Context
• Twitter’s data:
  • 230M+ monthly active users generating and consuming 500M+ tweets a day.
  • 100TB+ a day of compressed data
  • Scale is huge: instrumentation, user graph, derived data, ...
• Analytics infrastructure:
  • Several 1K+ node Hadoop clusters
  • Log collection pipeline
  • Processing tools

(Slide image: “The Parquet Planers”, Gustave Caillebotte)
http://parquet.io
Twitter’s use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.
struct LogEvent {
  1: optional logbase.LogBase log_base
  2: optional i64 event_value
  3: optional string context
  4: optional string referring_event
  ...
  18: optional EventNamespace event_namespace
  19: optional list<Item> items
  20: optional map<AssociationType,Association> associations
  21: optional MobileDetails mobile_details
  22: optional WidgetDetails widget_details
  23: optional map<ExternalService,string> external_ids
}

struct LogBase {
  1: string transaction_id,
  2: string ip_address,
  ...
  15: optional string country,
  16: optional string pid,
}

http://parquet.io
Goal

To have state-of-the-art columnar storage available across the Hadoop platform:
• Hadoop is very reliable for big, long-running queries, but it is also IO heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any framework in particular.
http://parquet.io
Columnar Storage
• Limits IO to data actually needed: loads only the columns that need to be accessed.
• Saves space:
  • Columnar layout compresses better.
  • Type-specific encodings.
• Enables vectorized execution engines.

(Slide image: @EmrgencyKittens)
http://parquet.io
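
A toy sketch of why the columnar layout compresses better (an illustration of the idea, not the Parquet format itself): the same records serialized row-wise and column-wise, then compressed with zlib. Grouping similar values together is the effect Parquet exploits.

import random
import zlib

random.seed(0)
rows = [(i, random.choice(["US", "FR", "JP"]), random.randint(0, 9))
        for i in range(100_000)]

# Row-oriented layout: the fields of each record are interleaved.
row_wise = b"".join(f"{i},{c},{v};".encode() for i, c, v in rows)

# Column-oriented layout: all values of one column stored contiguously.
col_wise = b"".join((
    ",".join(str(i) for i, _, _ in rows).encode(),
    ",".join(c for _, c, _ in rows).encode(),
    ",".join(str(v) for _, _, v in rows).encode(),
))

print("row-wise compressed:", len(zlib.compress(row_wise)))  # typically larger
print("col-wise compressed:", len(zlib.compress(col_wise)))  # typically smaller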
Collaboration between Twitter and Cloudera:
• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala
http://parquet.io
Results in Impala

(Chart: on-disk size in GB of the TPC-H lineitem table @ 1TB scale factor, by format)
http://parquet.io
(Chart: Impala query times on TPC-DS, seconds (wall clock, 0 to 500), for Text, Seq w/ Snappy, RC w/ Snappy and Parquet w/ Snappy, over queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79 and Q96)
http://parquet.io
Criteo: The Context
• Billions of new events per day
• ~60 columns per log
• Heavy analytic workload
• BI analysts using Hive and RCFile
• Frequent schema modifications
• Perfect use case for Parquet + Hive!
http://parquet.io
Parquet + Hive: Basic Reqs
• MapRed compatibility due to Hive.
• Correctly handle evolving schemas across Parquet files.
• Read only the columns used by the query to minimize the data read.
• Interoperability with other execution engines (e.g. Pig, Impala, etc.)
http://parquet.io
Performance of Hive 0.11 with Parquet vs ORC

Size relative to text:
• orc-snappy: 35%
• parquet-snappy: 33%

Setup: TPC-DS at scale factor 100; all jobs calibrated to run ~50 mappers; nodes: 2 x 6 cores, 96 GB RAM, 14 x 3TB disks.

(Chart: total CPU seconds, 0 to 20000, orc-snappy vs parquet-snappy, over queries q7, q8, q19, q34, q42, q43, q46, q52, q55, q59, q63, q65, q68, q73, q79, q89 and q98)
http://parquet.io
Twitter: production results

Data converted: similar to access logs, 30 columns.
Original format: Thrift binary in block-compressed files (LZO)
New format: Parquet (LZO)

(Charts: Space, Thrift vs Parquet; Scan time, Thrift vs Parquet, for 1 column vs all 30 columns)

• Space saving: 30% using the same compression algorithm.
• Scan + assembly time compared to the original:
  • One column: 10%
  • All columns: 110%
http://parquet.io
Production savings at Twitter
• Petabytes of storage saved.
• Example jobs taking advantage of projection push down:
  • Job 1 (Pig): reading 32% less data => 20% task time saving.
  • Job 2 (Scalding): reading 14 out of 35 columns, i.e. 80% less data => 66% task time saving.
• Terabytes of scanning saved every day.
http://parquet.io

Format

• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50MB < row group < 1GB
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8KB < page < 1MB

(Diagram: a row group laid out as column chunks for columns a, b and c, each chunk split into pages)

http://parquet.io
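
As a rough illustration of these knobs, here is a sketch using pyarrow, a later Python implementation of the format, purely as an example reader/writer (assumes pyarrow is installed; parquet-mr and Impala expose the same concepts through their own APIs):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": list(range(1_000_000)),
                  "b": [i % 7 for i in range(1_000_000)]})

# The writer buffers up to row_group_size rows before flushing a row group.
pq.write_table(table, "example.parquet", row_group_size=200_000)

md = pq.ParquetFile("example.parquet").metadata   # parsed from the footer
print(md.num_row_groups)                          # 5
print(md.row_group(0).num_rows)                   # 200000
print(md.row_group(0).column(0).total_compressed_size)  # one column chunk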
Format

• Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
• Language independent: well-defined format, with Hadoop and Cloudera Impala support.

http://parquet.io
Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO.
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bound by the depth of the schema: stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Max levels per column:
Column          Max rep. level   Max def. level
DocId           0                0
Links.Backward  1                2
Links.Forward   1                2

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

Resulting columns:
Column          Value   R   D
DocId           20      0   0
Links.Backward  10      0   2
Links.Backward  30      1   2
Links.Forward   80      0   2

http://parquet.io
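
A minimal Python sketch of this shredding for the Document schema above (the helper and column names are mine for illustration, not the parquet-mr API):

columns = {"DocId": [], "Links.Backward": [], "Links.Forward": []}

def shred_document(record):
    # DocId is required at the root: max rep. level 0, max def. level 0.
    columns["DocId"].append((0, 0, record["DocId"]))
    links = record.get("Links")
    for path, field in (("Links.Backward", "Backward"),
                        ("Links.Forward", "Forward")):
        values = (links or {}).get(field, [])
        if not values:
            # Absent value: the definition level records how deep the
            # path was actually defined (1 if Links exists, else 0).
            columns[path].append((0, 1 if links is not None else 0, None))
        else:
            for i, v in enumerate(values):
                # Repetition level 0 starts a new record; 1 means
                # "repeating at the Links level within the same record".
                columns[path].append((0 if i == 0 else 1, 2, v))

shred_document({"DocId": 20,
                "Links": {"Backward": [10, 30], "Forward": [80]}})
print(columns)
# {'DocId': [(0, 0, 20)],
#  'Links.Backward': [(0, 2, 10), (1, 2, 30)],
#  'Links.Forward': [(0, 2, 80)]}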
Repetition level

Schema:
message nestedLists {
  repeated group level1 {
    repeated string level2;
  }
}

Records:
[[a, b, c], [d, e, f, g]]
[[h], [i, j]]

Repetition levels while writing the nested lists:
Value      R   Meaning
level2: a  0   new record
level2: b  2   new level2 entry
level2: c  2   new level2 entry
level2: d  1   new level1 entry
level2: e  2   new level2 entry
level2: f  2   new level2 entry
level2: g  2   new level2 entry
level2: h  0   new record
level2: i  1   new level1 entry
level2: j  2   new level2 entry

Columns:
Level: 0,2,2,1,2,2,2,0,1,2
Data:  a,b,c,d,e,f,g,h,i,j

more details: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

http://parquet.io
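
A short sketch of record assembly from these repetition levels (every value is defined here, so definition levels are omitted): rep 0 starts a record, 1 starts a new level1 list, 2 appends to the current level2 list.

levels = [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
data = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

records = []
for rep, value in zip(levels, data):
    if rep == 0:                 # new record
        records.append([[value]])
    elif rep == 1:               # new level1 entry in the current record
        records[-1].append([value])
    else:                        # rep == 2: new level2 entry
        records[-1][-1].append(value)

print(records)
# [[['a', 'b', 'c'], ['d', 'e', 'f', 'g']], [['h'], ['i', 'j']]]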
Differences of Parquet and ORC nesting support
• Parquet:
  Repetition/definition levels capture the structure
  => one column per leaf in the schema.
  Array<int> is one column.
  Nullity/repetition of an inner node is stored in each of its children
  => one column independently of nesting, with some redundancy.
• ORC:
  An extra column for each Map or List to record their size
  => one column per node in the schema.
  Array<int> is two columns: array size and content.
  => An extra column per nesting level.

(Diagram: the Document schema tree with DocId and Links {Backward, Forward})

http://parquet.io
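
A toy sketch of the column counts the two models imply for the Document schema (the schema encoding here is mine, for illustration only):

schema = ("Document", [
    ("DocId", []),
    ("Links", [("Backward", []), ("Forward", [])]),
])

def leaves(node):
    """Parquet: one column per leaf field."""
    name, children = node
    return 1 if not children else sum(leaves(c) for c in children)

def nodes(node):
    """ORC: one column per schema node."""
    name, children = node
    return 1 + sum(nodes(c) for c in children)

print("Parquet columns:", leaves(schema))  # 3: DocId, Backward, Forward
print("ORC columns:", nodes(schema))       # 5: one per node in the tree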
Reading assembled records
• Record-level API to integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles projections for any subset of the columns: only those are loaded from disk.

(Diagrams: interleaved column values a1..a3, b1..b3 assembled back into rows; the Document record assembled from the full schema and from projections keeping only DocId, only Links.Backward, or only Links.Forward)

http://parquet.io
Projection push down
• Automated in Pig and Hive: based on the query being executed, only the columns for the fields accessed will be fetched.
• Explicit in MapReduce, Scalding and Cascading using a globbing syntax (a parsing sketch follows below).

Example: field1;field2/**;field4/{subfield1,subfield2}
Will return:
• field1
• all the columns under field2
• subfield1 and subfield2 under field4
• but not field3

http://parquet.io
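
A sketch of the globbing semantics above (my own parser, for illustration; not the actual parquet-mr implementation): ';' separates patterns, '/' descends, '{a,b}' lists alternatives, '**' matches a whole subtree.

def matches(pattern, path):
    """True if a column path like 'field4/subfield1' satisfies one pattern."""
    pat_parts, path_parts = pattern.split("/"), path.split("/")
    for i, pat in enumerate(pat_parts):
        if pat == "**":
            return True                       # everything below matches
        if i >= len(path_parts):
            return False
        alternatives = pat[1:-1].split(",") if pat.startswith("{") else [pat]
        if path_parts[i] not in alternatives:
            return False
    return len(pat_parts) == len(path_parts)

def project(glob, columns):
    patterns = glob.split(";")
    return [c for c in columns if any(matches(p, c) for p in patterns)]

cols = ["field1", "field2/x", "field2/y/z", "field3",
        "field4/subfield1", "field4/subfield2", "field4/subfield3"]
print(project("field1;field2/**;field4/{subfield1,subfield2}", cols))
# ['field1', 'field2/x', 'field2/y/z', 'field4/subfield1', 'field4/subfield2']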
Reading columns

To implement a column-based execution engine:
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.

Example: one column with max definition level 1, read as triplets (R, D, V):
R   D   V      Row
0   1   A      0 (new record)
1   1   B      0
1   1   C      0
0   0   null   1
0   1   D      2
0   0   null   3

R = 1 => same row
D < 1 => null

http://parquet.io
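
A sketch of consuming the triplets above: repetition level 0 starts a new row, and a definition level below the maximum signals a null.

MAX_DEF = 1
triplets = [(0, 1, "A"), (1, 1, "B"), (1, 1, "C"),
            (0, 0, None), (0, 1, "D"), (0, 0, None)]

rows, row = [], None
for r, d, v in triplets:
    if r == 0:                 # repetition level 0: a new record begins
        row = []
        rows.append(row)
    if d == MAX_DEF:           # fully defined: a real value
        row.append(v)          # otherwise: null, append nothing

print(rows)  # [['A', 'B', 'C'], [], ['D'], []]  (rows 1 and 3 are null)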
Integration APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however Impala, Pig, Hive, Thrift, Avro and Protocol Buffers do.
  • Event-based, SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive
  • Spark
http://parquet.io
Encodings
• Bit packing:
  • Small integers encoded in the minimum bits required.
  • Example: the values 1,3,2,0,0,2,2,0 packed 2 bits each: 01|11|10|00 00|10|10|00
  • Useful for repetition levels, definition levels and dictionary keys.
• Run Length Encoding (RLE):
  • Cheap compression, used in combination with bit packing.
  • Example: a run of eight 1s becomes the pair (8, 1).
  • Works well for the definition levels of sparse columns.
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values.
  • When applicable, compresses better and faster than heavyweight algorithms (gzip, lzo, snappy).
• Extensible: defining new encodings is supported by the format.
http://parquet.io
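
Minimal sketches of the two level encodings (illustrative only; the actual Parquet hybrid RLE/bit-packing format interleaves runs and bit-packed groups in one byte stream, with little-endian bit order):

def bit_pack(values, width):
    """Pack small ints into bytes, 'width' bits each, high bits first."""
    bits = "".join(format(v, f"0{width}b") for v in values)
    bits += "0" * (-len(bits) % 8)          # pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def rle(values):
    """Collapse runs into (count, value) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return [(count, v) for v, count in out]

print(bit_pack([1, 3, 2, 0, 0, 2, 2, 0], 2).hex())  # '7828' = 01111000 00101000
print(rle([1, 1, 1, 1, 1, 1, 1, 1]))                # [(8, 1)]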
Parquet 2.0
• More encodings: compact storage without heavyweight compression.
  • Delta encodings: for integers, strings and sorted dictionaries.
  • Improved encodings for strings and booleans.
• Statistics: to be used by query planners and for predicate pushdown.
• New page format: to facilitate skipping ahead at a more granular level.
http://parquet.io
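
The idea behind delta encoding for sorted or clustered integers, as a sketch (Parquet 2.0's actual DELTA_BINARY_PACKED encoding adds block structure and bit-packing on top of this):

def delta_encode(values):
    """Store the first value, then only the (small) differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

timestamps = [1_000_000, 1_000_003, 1_000_004, 1_000_009]
print(delta_encode(timestamps))   # [1000000, 3, 1, 5]
assert delta_decode(delta_encode(timestamps)) == timestamps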
Main contributors
Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
Tom White (Cloudera): Avro integration
Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
Matt Massie (Berkeley AMP lab): predicate and projection push down
David Chen (LinkedIn): Avro integration improvements

http://parquet.io
How to contribute
Questions? Ideas? Want to contribute?
Contribute at: github.com/Parquet
Come talk to us: Cloudera, Criteo, Twitter.
http://parquet.io
