Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015

105,712 views

Published on

At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.

Published in: Technology

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015

  1. 1. Choosing an HDFS data storage format: Avro vs. Parquet and more Stephen O’Sullivan | @steveos
  2. 2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
  3. 3. 3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • Prioritize for highest business value when using emerging technology • Design with outcomes in mind • Be agile: deliver initial results quickly, then adapt and iterate • Collaborate constantly with our customers and partners OUR PHILOSOPHY
  4. 4. 4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience AGENDA Introduction Data formats How to choose Schema evolution Summary Questions
  5. 5. Introduction
  6. 6. Data formats
  7. 7. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • Storage formats • What they do DATA FORMATS
  8. 8. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Data format • Storage Format • Text • Sequence File • Avro • Parquet • Optimized Row Columnar (ORC)
  9. 9. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Text • More specifically text = csv, tsv, json records… • Convenient format to use to exchange with other applications or scripts that produce or read delimited files • Human readable and parsable • Data stores is bulky and not as efficient to query • Do not support block compression
  10. 10. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Sequence File • Provides a persistent data structure for binary key- value pairs • Row based • Commonly used to transfer data between Map Reduce jobs • Can be used as an archive to pack small files in Hadoop • Support splitting even when the data is compressed
  11. 11. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Avro • Widely used as a serialization platform • Row-based, offers a compact and fast binary format • Schema is encoded on the file so the data can be untagged • Files support block compression and are splittable • Supports schema evolution
  12. 12. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Parquet • Column-oriented binary file format • Uses the record shredding and assembly algorithm described in the Dremel paper • Each data file contains the values for a set of rows • Efficient in terms of disk I/O when specific columns need to be queried
  13. 13. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Optimized Row Columnar • Considered the evolution of the RCFile • Stores collections of rows and within the collection the row data is stored in columnar format • Introduces a lightweight indexing that enables skipping of irrelevant blocks of rows • Splittable: allows parallel processing of row collections • It comes with basic statistics on columns (min ,max, sum, and count)
  14. 14. How to choose
  15. 15. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • ..for write • ..for read HOW TO CHOOSE
  16. 16. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Functional Requirements: • What type of data do you have? • Is the data format compatible with your processing and querying tools? • What are your file sizes? • Do you have schemas that evolve over time?
  17. 17. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Speed Concerns • Parquet and ORC usually needs some additional parsing to format the data which increases the overall read time • Avro as a data serialization format: works well from system to system, handles schema evolution (more on that later) • Text is bulky and inefficient but easily readable and parsable
  18. 18. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 20 40 60 80 100 120 140 160 TimeinSeconds Narrow – Hortonworks (Hive 0.14 ) 0 500 1000 1500 2000 2500 TimeinSeconds Wide – Hortonworks (Hive 0.14) Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns
  19. 19. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 10 20 30 40 50 60 70 TimeinSeconds Narrow - hive-1.1.0+cdh5.4.2 0 100 200 300 400 500 600 700 TimeinSeconds Wide - hive-1.1.0+cdh5.4.2 Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns
  20. 20. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns 0 10 20 30 40 50 60 70 Text Avro Parquet TimeinSeconds Narrow - Spark 1.3 0 200 400 600 800 1000 1200 Text Avro Parquet TimeinSeconds Wide - Spark 1.3
  21. 21. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 200 400 600 800 1000 1200 1400 Megabytes File sizes for narrow dataset 0 2000 4000 6000 8000 10000 12000 Megabytes File sizes for wide dataset
  22. 22. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Use case • Avro – Event data that can change over time • Sequence File – Datasets shared between MR jobs • Text – Adding large amounts of data to HDFS quickly
  23. 23. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Types of queries: • Column specific queries, or few groups of columns -> Use columnar format like Parquet or ORC • Compression of the file regardless the format increases query speed times • Text is really slow to read • Parquet and ORC optimize read performance at the expense of write performance
  24. 24. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Set up: • Narrow dataset: • 10 million rows, 10 columns • Wide dataset: • 4 million rows, 1000 columns • Compression: • Snappy, except for Avro which is deflate
  25. 25. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 10 20 30 40 50 60 70 Query 2 (5 conditions) Query 3 (10 conditions) TimeinSeconds Narrow Dataset - Hortonworks Hive 0.14.0.2.2.4.2 Text Avro Parquet Sequence ORC
  26. 26. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 100 200 300 400 500 600 700 800 Query 2 (5 conditions) Query 3 (10 conditions) Query 4 (20 conditions) TimeinSeconds Wide Dataset - Hortonworks Hive 0.14.0.2.2.4.2 Text Avro Parquet Sequence ORC
  27. 27. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 10 20 30 40 50 60 70 Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions) TimeinSeconds Narrow Dataset - CDH hive-1.1.0+cdh5.4.2 Text Avro Parquet Sequence ORC
  28. 28. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 50 100 150 200 250 Query 1 (no conditions) Query 2 (5 conditions) Query 3 (10 conditions) Query 4 (20 conditions) TimeinSeconds Wide Dataset - CDH hive-1.1.0+cdh5.4.2 Text Avro Parquet Sequence ORC
  29. 29. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 1 2 3 4 5 6 7 8 Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions) TimeinSeconds Narrow Dataset - CDH Impala Text Avro Parquet Sequence
  30. 30. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 5 10 15 20 25 30 Query 1 (0 filters) Query 2 (5 filters) Query 3 (10 filters) Query 4 (20 filters) TimeinSeconds Wide Dataset - CDH Impala Text Avro Parquet Sequence
  31. 31. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read Ran 4 queries (using Impala) over 4 Million rows (70GB raw), and 1000 columns (wide table) 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 Query 1 (0 filters) Query 2 (5 filters) Query 3 (10 filters) Query 4 (20 filters) Seconds Query times for different data formats Avro uncompress Avro Snappy Avro Deflate Parquet Seq uncompressed Seq Snappy Text Snappy Text uncompressed
  32. 32. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Use case • Avro – Query datasets that have changed over time • Parquet – Query a few columns on a wide table
  33. 33. Schema Evolution
  34. 34. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • What is schema evolution? • Data formats that evolve • Examples • Use cases SCHEMA EVOLUTION
  35. 35. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • What is schema evolution? • Adding columns • Renaming columns • Removing columns • Why do we need it?
  36. 36. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Data formats that can evolve • Avro • Parquet • Can only add columns at the end • ORC • It’s coming (That’s what they tell me ;) )…
  37. 37. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • The data – Dr Who episodes • Original Dr Who & new Dr Who • http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel- information-is-beautiful • Avro schema for the original Dr Who {"namespace": "drwho.avro", "type": "record", "name": "drwho", "fields": [ {"name": "doctor_who_season", "type": "string"}, {"name": "doctor_actor", "type": "string"}, {"name": "episode_no", "type": "string"}, {"name": "episode_title", "type": "string"}, {"name": "date_from", "type": "string"}, {"name": "date_to", "type": "string"}, {"name": "estimated", "type": "string"}, {"name": "planet", "type": "string"}, {"name": "sub_location", "type": "string"}, {"name": "main_location", "type": "string"} ]}
  38. 38. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • Original Dr Who data doctor_who_ season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location 3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other 3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus 3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus 3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire 3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs 3 Pertwee 63 The Mutants 1972 2900 Solos 3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis 3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis 3 Pertwee 66 Carnival of Monsters 1972 1928 n Indian Ocean; Planet Inter Minor Ocean; alien planet 3 Pertwee 67 Frontier in Space 1972 2540 n Planet Draconia; Orgon Planet alien planets 3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet 3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales 3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle 3 Pertwee 71 Invasion of The Dinosaurs 1200 1974 y Earth UK London
  39. 39. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • Lets add, rename, and delete some columns • Avro schema for the new Dr Who {"namespace": "drwho.avro", "type": "record", "name": "drwho", "fields": [ {"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]}, {"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]}, {"name": "episode_no", "type": ["null","string"]}, {"name": "episode_title", "type": ["null","string"]}, {"name": "date_from", "type": ["null","string"]}, {"name": "date_to", "type": ["null","string"]}, {"name": "estimated", "type": "string"}, {"name": "planet", "type": ["null","string"]}, {"name": "sub_location", "type": ["null","string"]}, {"name": "main_location", "type": ["null","string"]}, {"name": "hd", "type": "string", "default": "no"} ]}
  40. 40. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • Original & New Dr Who data drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd 10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes 10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland Torchwood house; Near Balmoral yes 10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes 10 Tennant 204 the Girl in the Fireplace 1727 1744 Earth France Paris yes 3 Pertwee 51 Spearhead from Space 1970 1990 Earth England London and other no 3 Pertwee 55 Terror of the Autons 1971 1971 Earth England Luigi Rossini's Circus no 3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus no 3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no 3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs no
  41. 41. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Use cases • New data added to an event stream • Need to see historic data with new data (and the schema has changed a lot) • Business has changed the field/column name
  42. 42. Summary
  43. 43. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience QUESTIONS?
  44. 44. 44 Yes, we’re hiring! info@svds.com THANK YOU Stephen O’Sullivan stephen@svds.com @steveos Demo code is here: github.com/silicon-valley-data- science/stampedecon-2015

×