File Format Benchmarks - Avro, JSON, ORC, & Parquet

Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn't fast or easy to do so.

The use cases that we've examined are:
  * reading all of the columns
  * reading a few of the columns
  * filtering using a filter predicate
  * writing the data

Furthermore, it is important to benchmark on real data rather than synthetic data. We used the Github logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.

  1. File Format Benchmark - Avro, JSON, ORC, & Parquet
     Owen O'Malley
     owen@hortonworks.com
     @owen_omalley
     September 2016
  2. Who Am I?
     Worked on Hadoop since Jan 2006
       – MapReduce, Security, Hive, and ORC
     Worked on different file formats
       – Sequence File, RCFile, ORC File, T-File, and Avro requirements
  3. Goal
     Seeking to discover unknowns
       – How do the different formats perform?
       – What could they do better?
       – Best part of open source is looking inside!
     Use real & diverse data sets
       – Over-reliance on similar datasets leads to weakness
     Open & reviewed benchmarks
  4. The File Formats
  5. Avro
     Cross-language file format for Hadoop
     Schema evolution was primary goal
     Schema segregated from data
       – Unlike Protobuf and Thrift
     Row major format
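
As a concrete illustration of the row-major, schema-segregated layout, here is a minimal sketch of writing an Avro data file with the Java GenericRecord API. The two-field `Ride` schema is a made-up stand-in, not one of the benchmark schemas.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // The schema is defined separately from the data and stored once in the file header.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Ride\",\"fields\":["
      + "{\"name\":\"vendor_id\",\"type\":\"string\"},"
      + "{\"name\":\"passenger_count\",\"type\":\"int\"}]}");

    GenericRecord row = new GenericData.Record(schema);
    row.put("vendor_id", "CMT");
    row.put("passenger_count", 2);

    // Rows are appended one after another (row-major layout).
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());   // block-level general compression
      writer.create(schema, new File("rides.avro"));
      writer.append(row);
    }
  }
}
```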
  6. JSON
     Serialization format for HTTP & Javascript
     Text format with MANY parsers
     Schema completely integrated with data
     Row major format
     Compression applied on top
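
For comparison, a minimal sketch of the JSON path: one record per line written through Jackson, with gzip layered on top since the format has no built-in compression. The field names are the same hypothetical ones as in the Avro sketch; every record repeats them, which is part of why the text format is large.

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonWriteExample {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // One JSON document per line; field names travel with every record,
    // so the "schema" is completely integrated with the data.
    try (PrintWriter out = new PrintWriter(new OutputStreamWriter(
        new GZIPOutputStream(new FileOutputStream("rides.json.gz")),
        StandardCharsets.UTF_8))) {
      Map<String, Object> row = new LinkedHashMap<>();
      row.put("vendor_id", "CMT");
      row.put("passenger_count", 2);
      out.println(mapper.writeValueAsString(row));
    }
  }
}
```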
  7. ORC
     Originally part of Hive to replace RCFile
       – Now top-level project
     Schema segregated into footer
     Column major format with stripes
     Rich type model, stored top-down
     Integrated compression, indexes, & stats
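
A minimal sketch of writing the same hypothetical two-column table with the org.apache.orc core writer. Rows are buffered in a vectorized, column-at-a-time batch, and the general-purpose codec (zlib here) sits on top of ORC's own RLE and dictionary encodings.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteExample {
  public static void main(String[] args) throws Exception {
    TypeDescription schema =
        TypeDescription.fromString("struct<vendor_id:string,passenger_count:int>");
    Writer writer = OrcFile.createWriter(new Path("rides.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .compress(CompressionKind.ZLIB));  // general compression on top of RLE/dictionaries

    // Rows are accumulated column by column in a vectorized batch.
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector vendor = (BytesColumnVector) batch.cols[0];
    LongColumnVector count = (LongColumnVector) batch.cols[1];
    int row = batch.size++;
    vendor.setVal(row, "CMT".getBytes(StandardCharsets.UTF_8));
    count.vector[row] = 2;

    writer.addRowBatch(batch);
    writer.close();   // writes the footer: schema, stripe index, statistics
  }
}
```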
  8. Parquet
     Design based on Google's Dremel paper
     Schema segregated into footer
     Column major format with stripes
     Simpler type model with logical types
     All data pushed to leaves of the tree
     Integrated compression and indexes
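
A comparable Parquet sketch, using the Avro object model through AvroParquetWriter (one of several bindings) with Snappy compression; the schema is again the hypothetical Ride record, not one of the benchmark schemas.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Ride\",\"fields\":["
      + "{\"name\":\"vendor_id\",\"type\":\"string\"},"
      + "{\"name\":\"passenger_count\",\"type\":\"int\"}]}");

    GenericRecord row = new GenericData.Record(schema);
    row.put("vendor_id", "CMT");
    row.put("passenger_count", 2);

    // Rows are re-organized into per-column chunks inside each row group.
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("rides.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .build()) {
      writer.write(row);
    }
  }
}
```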
  9. Data Sets
  10. NYC Taxi Data
      Every taxi cab ride in NYC from 2009
        – Publicly available
        – http://tinyurl.com/nyc-taxi-analysis
      18 columns with no null values
        – Doubles, integers, decimals, & strings
      2 months of data – 22.7 million rows
  11. (figure slide)
  12. Github Logs
      All actions on Github public repositories
        – Publicly available
        – https://www.githubarchive.org/
      704 columns with a lot of structure & nulls
        – Pretty much the kitchen sink
      1/2 month of data – 10.5 million rows
  13. Finding the Github Schema
      The data is all in JSON.
      No schema for the data is published.
      We wrote a JSON schema discoverer.
        – Scans the document and figures out the type
      Available at https://github.com/hortonworks/hive-json
      Schema is huge (12k)
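
The actual discoverer is the hive-json tool linked above. Purely as an illustration of the idea, a minimal sketch of inferring a Hive-style type from a single JSON document with Jackson might look like the following; the real tool also merges types across millions of documents and handles nulls, unions, and conflicting types.

```java
import java.util.Iterator;
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSchemaSketch {
  // Infer a Hive-style type string for a single JSON value.
  static String inferType(JsonNode node) {
    if (node.isObject()) {
      StringBuilder sb = new StringBuilder("struct<");
      Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
      while (fields.hasNext()) {
        Map.Entry<String, JsonNode> f = fields.next();
        sb.append(f.getKey()).append(':').append(inferType(f.getValue()));
        if (fields.hasNext()) sb.append(',');
      }
      return sb.append('>').toString();
    } else if (node.isArray()) {
      // Use the first element as a representative; a real tool merges all elements.
      String elem = node.size() > 0 ? inferType(node.get(0)) : "string";
      return "array<" + elem + ">";
    } else if (node.isIntegralNumber()) {
      return "bigint";
    } else if (node.isNumber()) {
      return "double";
    } else if (node.isBoolean()) {
      return "boolean";
    } else {
      return "string";   // text and nulls default to string here
    }
  }

  public static void main(String[] args) throws Exception {
    JsonNode doc = new ObjectMapper().readTree(
        "{\"actor\":{\"login\":\"octocat\",\"id\":1},\"public\":true}");
    System.out.println(inferType(doc));
    // prints: struct<actor:struct<login:string,id:bigint>,public:boolean>
  }
}
```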
  14. Sales
      Generated data
        – Real schema from a production Hive deployment
        – Random data based on the data statistics
      55 columns with lots of nulls
        – A little structure
        – Timestamps, strings, longs, booleans, list, & struct
      25 million rows
  15. Storage costs
  16. Compression
      Data size matters!
        – Hadoop stores all your data, but requires hardware
        – Is one factor in read speed
      ORC and Parquet use RLE & dictionaries
      All the formats have general compression
        – ZLIB (GZip) – tight compression, slower
        – Snappy – some compression, faster
  17. (figure slide)
  18. Taxi Size Analysis
      Don't use JSON
      Use either Snappy or Zlib compression
      Avro's small compression window hurts
      Parquet Zlib is smaller than ORC
        – Group the column sizes by type
  19. (figure slide)
  20. (figure slide)
  21. Sales Size Analysis
      ORC did better than expected
        – String columns have small cardinality
        – Lots of timestamp columns
        – No doubles
      Need to revalidate results with original
        – Improve random data generator
        – Add non-smooth distributions
  22. (figure slide)
  23. Github Size Analysis
      Surprising win for JSON and Avro
        – Worst when uncompressed
        – Best with zlib
      Many partially shared strings
        – ORC and Parquet don't compress across columns
      Need to investigate shared dictionaries.
  24. Use Cases
  25. Full Table Scans
      Read all columns & rows
      All formats except JSON are splittable
        – Different workers do different parts of file
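
A minimal sketch of the full-table-scan case using the org.apache.orc core reader: open the file and iterate vectorized batches over every column and row. The file name is the hypothetical one from the write sketch above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcFullScanExample {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(new Configuration()));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    RecordReader rows = reader.rows();   // all columns, all rows
    long count = 0;
    while (rows.nextBatch(batch)) {
      count += batch.size;               // process every column of every row here
    }
    rows.close();
    System.out.println("rows read: " + count);
  }
}
```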
  26. (figure slide)
  27. Taxi Read Performance Analysis
      JSON is very slow to read
        – Large storage size for this data set
        – Needs to do a LOT of string parsing
      Tradeoff between space & time
        – Less compression is sometimes faster
  28. (figure slide)
  29. Sales Read Performance Analysis
      Read performance is dominated by format
        – Compression matters less for this data set
        – Straight ordering: ORC, Avro, Parquet, & JSON
      Garbage collection is important
        – ORC: 0.3 to 1.4% of time
        – Avro: < 0.1% of time
        – Parquet: 4 to 8% of time
  30. (figure slide)
  31. Github Read Performance Analysis
      Garbage collection is critical
        – ORC: 2.1 to 3.4% of time
        – Avro: 0.1% of time
        – Parquet: 11.4 to 12.8% of time
      A lot of columns needs more space
        – Suspect that we need bigger stripes
        – Rows/stripe – ORC: 18.6k, Parquet: 88.1k
  32. Column Projection
      Often just need a few columns
        – Only ORC & Parquet are columnar
        – Only read, decompress, & deserialize some columns

      Dataset   Format    Compression   Full read (µs/row)   Projection (µs/row)   % of full-read time
      github    orc       zlib          21.319               0.185                 0.87%
      github    parquet   zlib          72.494               0.585                 0.81%
      sales     orc       zlib          1.866                0.056                 3.00%
      sales     parquet   zlib          12.893               0.329                 2.55%
      taxi      orc       zlib          2.766                0.063                 2.28%
      taxi      parquet   zlib          3.496                0.718                 20.54%
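
A minimal sketch of column projection with the ORC reader: an include[] bitmap selects which columns are read, decompressed, and deserialized. The column ids follow the hypothetical two-column schema from the earlier write sketch (0 = root struct, 2 = passenger_count); Parquet offers an equivalent mechanism via its requested read schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcProjectionExample {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(new Configuration()));
    // include[] is indexed by flattened column id; 0 is the root struct.
    boolean[] include = new boolean[reader.getSchema().getMaximumId() + 1];
    include[0] = true;   // root struct
    include[2] = true;   // passenger_count only; vendor_id is never read
    RecordReader rows = reader.rows(reader.options().include(include));

    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    long sum = 0;
    while (rows.nextBatch(batch)) {
      // batch.cols is ordered by field position: 0 = vendor_id, 1 = passenger_count.
      LongColumnVector count = (LongColumnVector) batch.cols[1];
      for (int r = 0; r < batch.size; ++r) {
        sum += count.isRepeating ? count.vector[0] : count.vector[r];
      }
    }
    rows.close();
    System.out.println("total passengers: " + sum);
  }
}
```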
  33. Projection & Predicate Pushdown
      Sometimes have a filter predicate on table
        – Select a superset of rows that match
        – Selective filters have a huge impact
      Improves data layout options
        – Better than partition pruning with sorting
      ORC has added optional bloom filters
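
A minimal sketch of predicate pushdown with an ORC SearchArgument, expressing a hypothetical passenger_count > 4 filter as NOT(passenger_count <= 4). As the slide notes, the reader only skips stripes and row groups whose statistics or bloom filters rule out all matches, so the surviving rows are a superset and still need a final row-level check.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcPredicatePushdownExample {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(new Configuration()));

    // "passenger_count > 4", written as NOT(passenger_count <= 4).
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startNot()
          .lessThanEquals("passenger_count", PredicateLeaf.Type.LONG, 4L)
        .end()
        .build();

    // Stripes/row groups whose min/max (or bloom filter) cannot match are skipped.
    RecordReader rows = reader.rows(
        reader.options().searchArgument(sarg, new String[]{"passenger_count"}));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    while (rows.nextBatch(batch)) {
      // Re-check the predicate per row here: the returned rows are a superset.
    }
    rows.close();
  }
}
```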
  34. Metadata Access
      ORC & Parquet store metadata
        – Stored in file footer
        – File schema
        – Number of records
        – Min, max, count of each column
      Provides O(1) access
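
A minimal sketch of footer-only metadata access with the ORC reader; everything printed below comes from the file footer without scanning any row data, which is what makes the access effectively O(1).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcMetadataExample {
  public static void main(String[] args) throws Exception {
    // Opening the reader only fetches the footer, not the stripes.
    Reader reader = OrcFile.createReader(new Path("rides.orc"),
        OrcFile.readerOptions(new Configuration()));
    System.out.println("schema: " + reader.getSchema());
    System.out.println("rows:   " + reader.getNumberOfRows());
    ColumnStatistics[] stats = reader.getStatistics();   // min, max, count per column
    for (int i = 0; i < stats.length; ++i) {
      System.out.println("column " + i + ": " + stats[i]);
    }
  }
}
```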
  35. Conclusions
  36. Recommendations
      Disclaimer – everything changes!
        – Both these benchmarks and the formats will change.
      Don't use JSON for processing.
      If your use case needs column projection or predicate push down:
        – ORC or Parquet
  37. Recommendations
      For complex tables with common strings
        – Avro with Snappy is a good fit (w/o projection)
      For other tables
        – ORC with Zlib or Snappy is a good fit
      Tweet benchmark suggestions to @owen_omalley
  38. Fun Stuff
      Built open benchmark suite for files
      Built pieces of a tool to convert files
        – Avro, CSV, JSON, ORC, & Parquet
      Built a random parameterized generator
        – Easy to model arbitrary tables
        – Can write to Avro, ORC, or Parquet
  39. Remaining work
      Extend benchmark with LZO and LZ4.
      Finish predicate pushdown benchmark.
      Add C++ reader for ORC, Parquet, & Avro.
      Add Presto ORC reader.
  40. Thank you!
      Twitter: @owen_omalley
      Email: owen@hortonworks.com
