SlideShare a Scribd company logo
Submit Search
Upload
Login
Signup
File Format Benchmark - Avro, JSON, ORC & Parquet
Report
DataWorks Summit/Hadoop Summit
Follow
DataWorks Summit/Hadoop Summit
Jul. 11, 2016
•
0 likes
•
34,190 views
1
of
38
File Format Benchmark - Avro, JSON, ORC & Parquet
Jul. 11, 2016
•
0 likes
•
34,190 views
Download Now
Download to read offline
Report
Technology
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
Follow
DataWorks Summit/Hadoop Summit
Recommended
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
101.2K views
•
40 slides
The Apache Spark File Format Ecosystem
Databricks
1.9K views
•
44 slides
Time-Series Apache HBase
HBaseCon
5.5K views
•
17 slides
ORC File - Optimizing Your Big Data
DataWorks Summit
11.5K views
•
26 slides
Hive 3 - a new horizon
Thejas Nair
2.6K views
•
50 slides
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
129.3K views
•
44 slides
More Related Content
What's hot
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
4.3K views
•
40 slides
Parquet and AVRO
airisData
8.9K views
•
29 slides
ORC Files
Owen O'Malley
50.9K views
•
29 slides
Hive + Tez: A Performance Deep Dive
DataWorks Summit
57.5K views
•
41 slides
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
18.3K views
•
39 slides
Hive+Tez: A performance deep dive
t3rmin4t0r
9.6K views
•
41 slides
What's hot
(20)
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
•
4.3K views
Parquet and AVRO
airisData
•
8.9K views
ORC Files
Owen O'Malley
•
50.9K views
Hive + Tez: A Performance Deep Dive
DataWorks Summit
•
57.5K views
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
•
18.3K views
Hive+Tez: A performance deep dive
t3rmin4t0r
•
9.6K views
Apache hive introduction
Mahmood Reza Esmaili Zand
•
1.1K views
Introduction to Apache ZooKeeper
Saurav Haloi
•
127.1K views
Introduction to Redis
Dvir Volk
•
119.9K views
Side by Side with Elasticsearch & Solr, Part 2
Sematext Group, Inc.
•
18.1K views
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
•
60.7K views
ORC File and Vectorization - Hadoop Summit 2013
Owen O'Malley
•
18.4K views
Room 1 - 3 - Lê Anh Tuấn - Build a High Performance Identification at GHTK wi...
Vietnam Open Infrastructure User Group
•
158 views
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
•
31.7K views
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
•
1.1K views
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
•
1.5K views
ORC improvement in Apache Spark 2.3
DataWorks Summit
•
7.7K views
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
•
6.4K views
Facebook Messages & HBase
强 王
•
39K views
Druid deep dive
Kashif Khan
•
3K views
Viewers also liked
Avro introduction
Nanda8904648951
3K views
•
22 slides
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty
2.1K views
•
29 slides
Data Aggregation At Scale Using Apache Flume
Arvind Prabhakar
5.8K views
•
31 slides
Moving to a data-centric architecture: Toronto Data Unconference 2015
Adam Muise
2.2K views
•
28 slides
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
1.9K views
•
35 slides
大型电商的数据服务的要点和难点
Chao Zhu
1.4K views
•
29 slides
Viewers also liked
(10)
Avro introduction
Nanda8904648951
•
3K views
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty
•
2.1K views
Data Aggregation At Scale Using Apache Flume
Arvind Prabhakar
•
5.8K views
Moving to a data-centric architecture: Toronto Data Unconference 2015
Adam Muise
•
2.2K views
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
•
1.9K views
大型电商的数据服务的要点和难点
Chao Zhu
•
1.4K views
Implementing and running a secure datalake from the trenches
DataWorks Summit
•
1.5K views
Parquet overview
Julien Le Dem
•
7.2K views
Paytm labs soyouwanttodatascience
Adam Muise
•
3.8K views
Flume vs. kafka
Omid Vahdaty
•
6.5K views
Similar to File Format Benchmark - Avro, JSON, ORC & Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
1.5K views
•
43 slides
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
3.1K views
•
40 slides
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
1.1K views
•
45 slides
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
2.5K views
•
45 slides
Druid Scaling Realtime Analytics
Aaron Brooks
315 views
•
38 slides
Hive acid and_2.x new_features
Alberto Romero
1.5K views
•
34 slides
Similar to File Format Benchmark - Avro, JSON, ORC & Parquet
(20)
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
•
1.5K views
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
•
3.1K views
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
•
1.1K views
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
•
2.5K views
Druid Scaling Realtime Analytics
Aaron Brooks
•
315 views
Hive acid and_2.x new_features
Alberto Romero
•
1.5K views
Hive edw-dataworks summit-eu-april-2017
alanfgates
•
1.7K views
An Apache Hive Based Data Warehouse
DataWorks Summit
•
1.3K views
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
•
2.7K views
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
•
851 views
An Overview on Optimization in Apache Hive: Past, Present Future
DataWorks Summit/Hadoop Summit
•
1.5K views
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
•
11.6K views
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
•
3.8K views
What's new in apache hive
DataWorks Summit
•
3.4K views
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
•
3.2K views
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
•
1.6K views
You Can't Search Without Data
Bryan Bende
•
666 views
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Bryan Bende
•
2.5K views
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Josh Elser
•
36.9K views
Apache Phoenix + Apache HBase
DataWorks Summit/Hadoop Summit
•
7.3K views
More from DataWorks Summit/Hadoop Summit
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
9.6K views
•
28 slides
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
3.2K views
•
25 slides
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
6.8K views
•
33 slides
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
1.4K views
•
10 slides
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
2.1K views
•
28 slides
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
1K views
•
28 slides
More from DataWorks Summit/Hadoop Summit
(20)
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
•
9.6K views
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
•
3.2K views
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
•
6.8K views
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
•
1.4K views
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
•
2.1K views
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
•
1K views
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
•
4.2K views
Data Science Crash Course
DataWorks Summit/Hadoop Summit
•
2.5K views
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
•
3.1K views
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
•
8.8K views
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
•
5.4K views
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
•
2.1K views
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
•
10K views
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
•
1.5K views
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
•
764 views
HBase in Practice
DataWorks Summit/Hadoop Summit
•
5.4K views
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
•
568 views
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
•
963 views
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
•
1.4K views
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
•
1.2K views
Recently uploaded
Webinar: Discover the Power of SpiraTeam - A Jira Alternative To Revolutioniz...
Inflectra
33 views
•
35 slides
Navigating the Future
OnBoard
24 views
•
48 slides
info_session_gdsc_tmsl .pptx
NikitaSingh741518
16 views
•
22 slides
UiPath Tips and Techniques for Error Handling - Session 2
DianaGray10
24 views
•
9 slides
Supplier Sourcing_Cathy.pptx
CatarinaTorrenuevaMa
15 views
•
25 slides
Connecting Africa.docx
Eric Annan
13 views
•
6 slides
Recently uploaded
(20)
Webinar: Discover the Power of SpiraTeam - A Jira Alternative To Revolutioniz...
Inflectra
•
33 views
Navigating the Future
OnBoard
•
24 views
info_session_gdsc_tmsl .pptx
NikitaSingh741518
•
16 views
UiPath Tips and Techniques for Error Handling - Session 2
DianaGray10
•
24 views
Supplier Sourcing_Cathy.pptx
CatarinaTorrenuevaMa
•
15 views
Connecting Africa.docx
Eric Annan
•
13 views
web test repair.pptx
YuanzhangLin
•
25 views
Announcing InfluxDB Clustered
InfluxData
•
54 views
Mitigating Third-Party Risks: Best Practices for CISOs in Ensuring Robust Sec...
TrustArc
•
42 views
Future of Skills
Alison B. Lowndes
•
38 views
h2 meet pdf test.pdf
JohnLee971654
•
52 views
Roottoo Innovation V24_CP.pdf
roottooinnovation
•
25 views
BuilderAI Proposal_Malesniak
Michael Lesniak
•
85 views
GDSC Cloud Lead Presentation.pptx
AbhinavNautiyal8
•
13 views
Data Formats: Reading and writing JSON – XML - YAML
CSUC - Consorci de Serveis Universitaris de Catalunya
•
54 views
Getting your enterprise ready for Microsoft 365 Copilot
Vignesh Ganesan I Microsoft MVP
•
89 views
LLaMA 2.pptx
RkRahul16
•
19 views
Experts Live Europe 2023 - Ensure your compliance in Microsoft Teams with Mic...
Jasper Oosterveld
•
56 views
sap.pptx
SAP
•
14 views
Advancing Equity and Inclusion for Deaf Students in Higher Education
3Play Media
•
143 views
File Format Benchmark - Avro, JSON, ORC & Parquet
1.
File Format Benchmark
- Avro, JSON, ORC, & Parquet Owen O’Malley owen@hortonworks.com @owen_omalley June 2016
2.
2 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Who Am I? Worked on Hadoop since Jan 2006 MapReduce, Security, Hive, and ORC Worked on different file formats –Sequence File, RCFile, ORC File, T-File, and Avro requirements
3.
3 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Goal Seeking to discover unknowns –How do the different formats perform? –What could they do better? –Best part of open source is looking inside! Use real & diverse data sets –Over-reliance on similar datasets leads to weakness Open & reviewed benchmarks
4.
4 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved The File Formats
5.
5 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Avro Cross-language file format for Hadoop Schema evolution was primary goal Schema segregated from data –Unlike Protobuf and Thrift Row major format
6.
6 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved JSON Serialization format for HTTP & Javascript Text-format with MANY parsers Schema completely integrated with data Row major format Compression applied on top
7.
7 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved ORC Originally part of Hive to replace RCFile –Now top-level project Schema segregated into footer Column major format with stripes Rich type model, stored top-down Integrated compression, indexes, & stats
8.
8 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Parquet Design based on Google’s Dremel paper Schema segregated into footer Column major format with stripes Simpler type-model with logical types All data pushed to leaves of the tree Integrated compression and indexes
9.
9 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Data Sets
10.
10 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved NYC Taxi Data Every taxi cab ride in NYC from 2009 –Publically available –http://tinyurl.com/nyc-taxi-analysis 18 columns with no null values –Doubles, integers, decimals, & strings 2 months of data – 22.7 million rows
11.
11 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
12.
12 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Github Logs All actions on Github public repositories –Publically available –https://www.githubarchive.org/ 704 columns with a lot of structure & nulls –Pretty much the kitchen sink 1/2 month of data – 10.5 million rows
13.
13 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Finding the Github Schema The data is all in JSON. No schema for the data is published. We wrote a JSON schema discoverer. –Scans the document and figures out the type Available at https://github.com/hortonworks/hive-json Schema is huge (12k)
14.
14 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Sales Generated data –Real schema from a production Hive deployment –Random data based on the data statistics 55 columns with lots of nulls –A little structure –Timestamps, strings, longs, booleans, list, & struct 25 million rows
15.
15 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Storage costs
16.
16 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Compression Data size matters! –Hadoop stores all your data, but requires hardware –Is one factor in read speed ORC and Parquet use RLE & Dictionaries All the formats have general compression –ZLIB (GZip) – tight compression, slower –Snappy – some compression, faster
17.
17 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
18.
18 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Taxi Size Analysis Don’t use JSON Use either Snappy or Zlib compression Avro’s small compression window hurts Parquet Zlib is smaller than ORC –Group the column sizes by type
19.
19 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
20.
20 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
21.
21 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Taxi Size Analysis ORC did better than expected –String columns have small cardinality –Lots of timestamp columns –No doubles Need to revalidate results with original –Improve random data generator –Add non-smooth distributions
22.
22 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
23.
23 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Github Size Analysis Surprising win for JSON and Avro –Worst when uncompressed –Best with zlib Many partially shared strings –ORC and Parquet don’t compress across columns Need to investigate shared dictionaries.
24.
24 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Use Cases
25.
25 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Full Table Scans Read all columns & rows All formats except JSON are splitable –Different workers do different parts of file
26.
26 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
27.
27 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Taxi Read Performance Analysis JSON is very slow to read –Large storage size for this data set –Needs to do a LOT of string parsing Tradeoff between space & time –Less compression is sometimes faster
28.
28 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
29.
29 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Sales Read Performance Analysis Read performance is dominated by format –Compression matters less for this data set –Straight ordering: ORC, Avro, Parquet, & JSON Garbage collection is important –ORC 0.3 to 1.4% of time –Avro < 0.1% of time –Parquet 4 to 8% of time
30.
30 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved
31.
31 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Github Read Performance Analysis Garbage collection is critical –ORC 2.1 to 3.4% of time –Avro 0.1% of time –Parquet 11.4 to 12.8% of time A lot of columns needs more space –Suspect that we need bigger stripes –Rows/stripe - ORC: 18.6k, Parquet: 88.1k
32.
32 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Column Projection Often just need a few columns –Only ORC & Parquet are columnar –Only read, decompress, & deserialize some columns Dataset format compression us/row projection Percent time github orc zlib 21.319 0.185 0.87% github parquet zlib 72.494 0.585 0.81% sales orc zlib 1.866 0.056 3.00% sales parquet zlib 12.893 0.329 2.55% taxi orc zlib 2.766 0.063 2.28% taxi parquet zlib 3.496 0.718 20.54%
33.
33 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Projection & Predicate Pushdown Sometimes have a filter predicate on table –Select a superset of rows that match –Selective filters have a huge impact Improves data layout options –Better than partition pruning with sorting ORC has added optional bloom filters
34.
34 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Metadata Access ORC & Parquet store metadata –Stored in file footer –File schema –Number of records –Min, max, count of each column Provides O(1) Access
35.
35 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Conclusions
36.
36 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Recomendations Disclaimer – Everything changes! –Both these benchmarks and the formats will change. For complex tables with common strings –Avro with Snappy is a good fit (w/o projection) For other tables –ORC with Zlib is a good fit Experiment with the benchmarks
37.
37 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Fun Stuff Built open benchmark suite for files Built pieces of a tool to convert files –Avro, CSV, JSON, ORC, & Parquet Built a random parameterized generator –Easy to model arbitrary tables –Can write to Avro, ORC, or Parquet
38.
38 © Hortonworks
Inc. 2011 – 2016. All Rights Reserved Thank you! Twitter: @owen_omalley Email: owen@hortonworks.com