Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015

Choosing an HDFS data storage format: Avro vs.
Parquet and more
Stephen O’Sullivan | @steveos
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
• Prioritize for highest
business value when using
emerging technology
• Design with outcomes in
mind
• Be agile: deliver initial
results quickly, then adapt
and iterate
• Collaborate constantly with
our customers and partners
OUR PHILOSOPHY
4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
AGENDA
Introduction
Data formats
How to choose
Schema evolution
Summary
Questions
Introduction
Data formats
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
• Storage formats
• What they do
DATA FORMATS
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Data format
• Storage Format
• Text
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Text
• More specifically text = csv, tsv, json records…
• Convenient format to use to exchange with other
applications or scripts that produce or read
delimited files
• Human readable and parsable
• Data stores is bulky and not as efficient to query
• Do not support block compression
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Sequence File
• Provides a persistent data structure for binary key-
value pairs
• Row based
• Commonly used to transfer data between Map
Reduce jobs
• Can be used as an archive to pack small files in
Hadoop
• Support splitting even when the data is
compressed
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Avro
• Widely used as a serialization platform
• Row-based, offers a compact and fast binary
format
• Schema is encoded on the file so the data can be
untagged
• Files support block compression and are splittable
• Supports schema evolution
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Parquet
• Column-oriented binary file format
• Uses the record shredding and assembly algorithm
described in the Dremel paper
• Each data file contains the values for a set of rows
• Efficient in terms of disk I/O when specific columns
need to be queried
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Optimized Row Columnar
• Considered the evolution of the RCFile
• Stores collections of rows and within the collection
the row data is stored in columnar format
• Introduces a lightweight indexing that enables
skipping of irrelevant blocks of rows
• Splittable: allows parallel processing of row
collections
• It comes with basic statistics on columns (min ,max,
sum, and count)
How to choose
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
• ..for write
• ..for read
HOW TO CHOOSE
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
• Functional Requirements:
• What type of data do you have?
• Is the data format compatible with your
processing and querying tools?
• What are your file sizes?
• Do you have schemas that evolve over time?
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
• Speed Concerns
• Parquet and ORC usually needs some
additional parsing to format the data which
increases the overall read time
• Avro as a data serialization format: works well from
system to system, handles schema evolution (more
on that later)
• Text is bulky and inefficient but easily readable and
parsable
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
0
20
40
60
80
100
120
140
160
TimeinSeconds
Narrow – Hortonworks (Hive 0.14 )
0
500
1000
1500
2000
2500
TimeinSeconds
Wide – Hortonworks (Hive 0.14)
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
0
10
20
30
40
50
60
70
TimeinSeconds
Narrow - hive-1.1.0+cdh5.4.2
0
100
200
300
400
500
600
700
TimeinSeconds
Wide - hive-1.1.0+cdh5.4.2
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
0
10
20
30
40
50
60
70
Text Avro Parquet
TimeinSeconds
Narrow - Spark 1.3
0
200
400
600
800
1000
1200
Text Avro Parquet
TimeinSeconds
Wide - Spark 1.3
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
0
200
400
600
800
1000
1200
1400
Megabytes
File sizes for narrow dataset
0
2000
4000
6000
8000
10000
12000
Megabytes
File sizes for wide dataset
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
• Use case
• Avro – Event data that can change over time
• Sequence File – Datasets shared between MR
jobs
• Text – Adding large amounts of data to HDFS
quickly
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
• Types of queries:
• Column specific queries, or few groups of
columns -> Use columnar format like Parquet or
ORC
• Compression of the file regardless the format
increases query speed times
• Text is really slow to read
• Parquet and ORC optimize read performance at
the expense of write performance
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
• Set up:
• Narrow dataset:
• 10 million rows, 10 columns
• Wide dataset:
• 4 million rows, 1000 columns
• Compression:
• Snappy, except for Avro which is deflate
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
10
20
30
40
50
60
70
Query 2 (5 conditions) Query 3 (10 conditions)
TimeinSeconds
Narrow Dataset - Hortonworks Hive 0.14.0.2.2.4.2
Text
Avro
Parquet
Sequence
ORC
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
100
200
300
400
500
600
700
800
Query 2 (5 conditions) Query 3 (10 conditions) Query 4 (20 conditions)
TimeinSeconds
Wide Dataset - Hortonworks Hive 0.14.0.2.2.4.2
Text
Avro
Parquet
Sequence
ORC
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
10
20
30
40
50
60
70
Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions)
TimeinSeconds
Narrow Dataset - CDH hive-1.1.0+cdh5.4.2
Text
Avro
Parquet
Sequence
ORC
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
50
100
150
200
250
Query 1 (no
conditions)
Query 2 (5
conditions)
Query 3 (10
conditions)
Query 4 (20
conditions)
TimeinSeconds
Wide Dataset - CDH hive-1.1.0+cdh5.4.2
Text
Avro
Parquet
Sequence
ORC
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
1
2
3
4
5
6
7
8
Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions)
TimeinSeconds
Narrow Dataset - CDH Impala
Text
Avro
Parquet
Sequence
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
5
10
15
20
25
30
Query 1 (0 filters) Query 2 (5 filters) Query 3 (10 filters) Query 4 (20 filters)
TimeinSeconds
Wide Dataset - CDH Impala
Text
Avro
Parquet
Sequence
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
Ran 4 queries (using Impala)
over 4 Million rows (70GB raw),
and 1000 columns (wide table)
0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
Query 1 (0
filters)
Query 2 (5
filters)
Query 3 (10
filters)
Query 4 (20
filters)
Seconds
Query times for different data formats
Avro uncompress
Avro Snappy
Avro Deflate
Parquet
Seq uncompressed
Seq Snappy
Text Snappy
Text uncompressed
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
• Use case
• Avro – Query datasets that have changed over
time
• Parquet – Query a few columns on a wide
table
Schema Evolution
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
• What is schema
evolution?
• Data formats that evolve
• Examples
• Use cases
SCHEMA
EVOLUTION
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• What is schema evolution?
• Adding columns
• Renaming columns
• Removing columns
• Why do we need it?
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Data formats that can evolve
• Avro
• Parquet
• Can only add columns at the end
• ORC
• It’s coming (That’s what they tell me ;) )…
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• The data – Dr Who episodes
• Original Dr Who & new Dr Who
• http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-
information-is-beautiful
• Avro schema for the original Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "doctor_who_season", "type": "string"}, {"name": "doctor_actor", "type": "string"},
{"name": "episode_no", "type": "string"}, {"name": "episode_title", "type": "string"},
{"name": "date_from", "type": "string"}, {"name": "date_to", "type": "string"},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": "string"},
{"name": "sub_location", "type": "string"}, {"name": "main_location", "type": "string"}
]}
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• Original Dr Who data
doctor_who_
season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location
3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other
3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus
3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus
3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England
Auderly House and
environs
3 Pertwee 63 The Mutants 1972 2900 Solos
3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis
3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis
3 Pertwee 66 Carnival of Monsters 1972 1928 n
Indian Ocean;
Planet Inter Minor Ocean; alien planet
3 Pertwee 67 Frontier in Space 1972 2540 n
Planet Draconia;
Orgon Planet alien planets
3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet
3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales
3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle
3 Pertwee 71
Invasion of The
Dinosaurs 1200 1974 y Earth UK London
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• Lets add, rename, and delete some columns
• Avro schema for the new Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]},
{"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]},
{"name": "episode_no", "type": ["null","string"]}, {"name": "episode_title", "type": ["null","string"]},
{"name": "date_from", "type": ["null","string"]}, {"name": "date_to", "type": ["null","string"]},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": ["null","string"]},
{"name": "sub_location", "type": ["null","string"]}, {"name": "main_location", "type": ["null","string"]},
{"name": "hd", "type": "string", "default": "no"}
]}
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• Original & New Dr Who data
drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd
10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes
10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland
Torchwood house;
Near Balmoral yes
10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes
10 Tennant 204
the Girl in the
Fireplace 1727 1744 Earth France Paris yes
3 Pertwee 51
Spearhead from
Space 1970 1990 Earth England London and other no
3 Pertwee 55
Terror of the
Autons 1971 1971 Earth England Luigi Rossini's Circus no
3 Pertwee 58 Colony in Space 1971 2472
planet
Uxarieus no
3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England
Auderly House and
environs no
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Use cases
• New data added to an event stream
• Need to see historic data with new data (and
the schema has changed a lot)
• Business has changed the field/column name
Summary
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
QUESTIONS?
44
Yes, we’re hiring!
info@svds.com
THANK YOU
Stephen O’Sullivan
stephen@svds.com
@steveos
Demo code is here:
github.com/silicon-valley-data-
science/stampedecon-2015
1 of 44

Recommended

Parquet Hadoop Summit 2013 by
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
26K views25 slides
Parquet performance tuning: the missing guide by
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
40.6K views44 slides
Hive + Tez: A Performance Deep Dive by
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
57.6K views41 slides
Spark shuffle introduction by
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
50.7K views33 slides
The Parquet Format and Performance Optimization Opportunities by
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
8.2K views32 slides
A Thorough Comparison of Delta Lake, Iceberg and Hudi by
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
11.1K views27 slides

More Related Content

What's hot

Introduction SQL Analytics on Lakehouse Architecture by
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
5.8K views52 slides
ORC File - Optimizing Your Big Data by
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
11.6K views26 slides
Productizing Structured Streaming Jobs by
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
3.2K views63 slides
Iceberg: A modern table format for big data (Strata NY 2018) by
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
2K views34 slides
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ... by
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...DataWorks Summit
1.4K views19 slides
Achieving Lakehouse Models with Spark 3.0 by
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
622 views25 slides

What's hot(20)

Introduction SQL Analytics on Lakehouse Architecture by Databricks
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks5.8K views
ORC File - Optimizing Your Big Data by DataWorks Summit
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit11.6K views
Productizing Structured Streaming Jobs by Databricks
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks3.2K views
Iceberg: A modern table format for big data (Strata NY 2018) by Ryan Blue
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue2K views
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ... by DataWorks Summit
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
Extending Apache Ranger Authorization Beyond Hadoop: Review of Apache Ranger ...
DataWorks Summit1.4K views
Achieving Lakehouse Models with Spark 3.0 by Databricks
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks622 views
Apache Iceberg: An Architectural Look Under the Covers by ScyllaDB
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB1.5K views
Common Strategies for Improving Performance on Your Delta Lakehouse by Databricks
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks718 views
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai by Databricks
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks10.9K views
Iceberg: a fast table format for S3 by DataWorks Summit
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit7.5K views
Efficient Data Storage for Analytics with Apache Parquet 2.0 by Cloudera, Inc.
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.158.6K views
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... by HostedbyConfluent
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent4.3K views
Top 5 Mistakes to Avoid When Writing Apache Spark Applications by Cloudera, Inc.
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.127.8K views
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... by Databricks
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks684 views
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... by Databricks
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks1K views
Hive+Tez: A performance deep dive by t3rmin4t0r
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r9.6K views
Processing Large Data with Apache Spark -- HasGeek by Venkata Naga Ravi
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi12.4K views

Viewers also liked

The Impala Cookbook by
The Impala CookbookThe Impala Cookbook
The Impala CookbookCloudera, Inc.
90.6K views87 slides
Hadoop and Data Virtualization - A Case Study by VHA by
Hadoop and Data Virtualization - A Case Study by VHAHadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHAHortonworks
4.5K views36 slides
Parquet and impala overview external by
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview externalmattlieber
8.9K views16 slides
File Format Benchmarks - Avro, JSON, ORC, & Parquet by
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
101.8K views40 slides
HBase 0.20.0 Performance Evaluation by
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationSchubert Zhang
3.8K views7 slides
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster by
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster Cloudera, Inc.
7.5K views33 slides

Viewers also liked(19)

Hadoop and Data Virtualization - A Case Study by VHA by Hortonworks
Hadoop and Data Virtualization - A Case Study by VHAHadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHA
Hortonworks4.5K views
Parquet and impala overview external by mattlieber
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
mattlieber8.9K views
File Format Benchmarks - Avro, JSON, ORC, & Parquet by Owen O'Malley
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley101.8K views
HBase 0.20.0 Performance Evaluation by Schubert Zhang
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance Evaluation
Schubert Zhang3.8K views
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster by Cloudera, Inc.
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
Cloudera, Inc.7.5K views
HBaseCon 2013: A Developer’s Guide to Coprocessors by Cloudera, Inc.
HBaseCon 2013: A Developer’s Guide to CoprocessorsHBaseCon 2013: A Developer’s Guide to Coprocessors
HBaseCon 2013: A Developer’s Guide to Coprocessors
Cloudera, Inc.8K views
Moving to a data-centric architecture: Toronto Data Unconference 2015 by Adam Muise
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
Adam Muise2.2K views
Implementing and running a secure datalake from the trenches by DataWorks Summit
Implementing and running a secure datalake from the trenches Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches
DataWorks Summit1.5K views
ApacheCon-Flume-Kafka-2016 by Jayesh Thakrar
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar1.9K views
大型电商的数据服务的要点和难点 by Chao Zhu
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点
Chao Zhu1.4K views
Data Aggregation At Scale Using Apache Flume by Arvind Prabhakar
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
Arvind Prabhakar5.8K views
Introduction to streaming and messaging flume,kafka,SQS,kinesis by Omid Vahdaty
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty2.1K views
Parquet and AVRO by airisData
Parquet and AVROParquet and AVRO
Parquet and AVRO
airisData8.9K views
Paytm labs soyouwanttodatascience by Adam Muise
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
Adam Muise3.8K views
HBase Application Performance Improvement by Biju Nair
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
Biju Nair23.5K views
Building Streaming Data Applications Using Apache Kafka by Slim Baltagi
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache Kafka
Slim Baltagi7.9K views

Similar to Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015

Format Wars: from VHS and Beta to Avro and Parquet by
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetDataWorks Summit
2K views44 slides
Data Governance - Atlas 7.12.2015 by
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Hortonworks
12.2K views37 slides
Building data pipelines with kite by
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
5.7K views55 slides
What's New in Apache Hive 3.0? by
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
270 views22 slides
What's New in Apache Hive 3.0 - Tokyo by
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
945 views22 slides
HDP Next: Governance by
HDP Next: GovernanceHDP Next: Governance
HDP Next: GovernanceDataWorks Summit
1.1K views32 slides

Similar to Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015(20)

Format Wars: from VHS and Beta to Avro and Parquet by DataWorks Summit
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
DataWorks Summit2K views
Data Governance - Atlas 7.12.2015 by Hortonworks
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
Hortonworks12.2K views
Building data pipelines with kite by Joey Echeverria
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria5.7K views
What's New in Apache Hive 3.0 - Tokyo by DataWorks Summit
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit945 views
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A... by Big Data Spain
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain496 views
Big data spain keynote nov 2016 by alanfgates
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates719 views
Big Data, Ingeniería de datos, y Data Lakes en AWS by javier ramirez
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez546 views
Unlocking big data with Hadoop + MySQL by Ricky Setyawan
Unlocking big data with Hadoop + MySQLUnlocking big data with Hadoop + MySQL
Unlocking big data with Hadoop + MySQL
Ricky Setyawan334 views
Hive edw-dataworks summit-eu-april-2017 by alanfgates
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates1.7K views
An Apache Hive Based Data Warehouse by DataWorks Summit
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit1.3K views
Michael Hausenblas- Scalable time series and stream processing for IoT applic... by WithTheBest
Michael Hausenblas- Scalable time series and stream processing for IoT applic...Michael Hausenblas- Scalable time series and stream processing for IoT applic...
Michael Hausenblas- Scalable time series and stream processing for IoT applic...
WithTheBest133 views
Colin Carter - LSPs and APIs by sconul
Colin Carter  - LSPs and APIsColin Carter  - LSPs and APIs
Colin Carter - LSPs and APIs
sconul1.6K views
Classification based security in Hadoop by Madhan Neethiraj
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
Madhan Neethiraj543 views
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins by sparkflows
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minsSparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
sparkflows521 views

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo... by
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
2.6K views34 slides
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017 by
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
638 views51 slides
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017 by
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
397 views19 slides
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam... by
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
417 views19 slides
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017 by
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
394 views32 slides
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017 by
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
1.4K views62 slides

More from StampedeCon(20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo... by StampedeCon
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
StampedeCon2.6K views
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017 by StampedeCon
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
StampedeCon638 views
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017 by StampedeCon
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
StampedeCon397 views
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam... by StampedeCon
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
StampedeCon417 views
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017 by StampedeCon
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
StampedeCon394 views
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017 by StampedeCon
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon1.4K views
Foundations of Machine Learning - StampedeCon AI Summit 2017 by StampedeCon
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
StampedeCon574 views
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem... by StampedeCon
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
StampedeCon392 views
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti... by StampedeCon
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
StampedeCon221 views
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017 by StampedeCon
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
StampedeCon574 views
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017 by StampedeCon
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
StampedeCon563 views
A Different Data Science Approach - StampedeCon AI Summit 2017 by StampedeCon
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
StampedeCon278 views
Graph in Customer 360 - StampedeCon Big Data Conference 2017 by StampedeCon
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
StampedeCon1.2K views
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017 by StampedeCon
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
StampedeCon1.8K views
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017 by StampedeCon
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
StampedeCon120 views
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu... by StampedeCon
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
StampedeCon169 views
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz... by StampedeCon
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
StampedeCon666 views
Innovation in the Data Warehouse - StampedeCon 2016 by StampedeCon
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon914 views
Creating a Data Driven Organization - StampedeCon 2016 by StampedeCon
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
StampedeCon1.1K views
Using The Internet of Things for Population Health Management - StampedeCon 2016 by StampedeCon
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
StampedeCon1.1K views

Recently uploaded

Optimizing Communication to Optimize Human Behavior - LCBM by
Optimizing Communication to Optimize Human Behavior - LCBMOptimizing Communication to Optimize Human Behavior - LCBM
Optimizing Communication to Optimize Human Behavior - LCBMYaman Kumar
38 views49 slides
LLMs in Production: Tooling, Process, and Team Structure by
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureAggregage
57 views77 slides
State of the Union - Rohit Yadav - Apache CloudStack by
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStackShapeBlue
303 views53 slides
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc
176 views29 slides
Initiating and Advancing Your Strategic GIS Governance Strategy by
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance StrategySafe Software
184 views68 slides
Generative AI: Shifting the AI Landscape by
Generative AI: Shifting the AI LandscapeGenerative AI: Shifting the AI Landscape
Generative AI: Shifting the AI LandscapeDeakin University
67 views55 slides

Recently uploaded(20)

Optimizing Communication to Optimize Human Behavior - LCBM by Yaman Kumar
Optimizing Communication to Optimize Human Behavior - LCBMOptimizing Communication to Optimize Human Behavior - LCBM
Optimizing Communication to Optimize Human Behavior - LCBM
Yaman Kumar38 views
LLMs in Production: Tooling, Process, and Team Structure by Aggregage
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
Aggregage57 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue303 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc176 views
Initiating and Advancing Your Strategic GIS Governance Strategy by Safe Software
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance Strategy
Safe Software184 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue207 views
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And... by ShapeBlue
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
ShapeBlue108 views
"Package management in monorepos", Zoltan Kochan by Fwdays
"Package management in monorepos", Zoltan Kochan"Package management in monorepos", Zoltan Kochan
"Package management in monorepos", Zoltan Kochan
Fwdays34 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays36 views
The Power of Generative AI in Accelerating No Code Adoption.pdf by Saeed Al Dhaheri
The Power of Generative AI in Accelerating No Code Adoption.pdfThe Power of Generative AI in Accelerating No Code Adoption.pdf
The Power of Generative AI in Accelerating No Code Adoption.pdf
Saeed Al Dhaheri39 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays58 views
Transcript: Redefining the book supply chain: A glimpse into the future - Tec... by BookNet Canada
Transcript: Redefining the book supply chain: A glimpse into the future - Tec...Transcript: Redefining the book supply chain: A glimpse into the future - Tec...
Transcript: Redefining the book supply chain: A glimpse into the future - Tec...
BookNet Canada41 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue224 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue196 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue178 views

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015

  • 1. Choosing an HDFS data storage format: Avro vs. Parquet and more Stephen O’Sullivan | @steveos
  • 2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
  • 3. 3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • Prioritize for highest business value when using emerging technology • Design with outcomes in mind • Be agile: deliver initial results quickly, then adapt and iterate • Collaborate constantly with our customers and partners OUR PHILOSOPHY
  • 4. 4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience AGENDA Introduction Data formats How to choose Schema evolution Summary Questions
  • 7. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • Storage formats • What they do DATA FORMATS
  • 8. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Data format • Storage Format • Text • Sequence File • Avro • Parquet • Optimized Row Columnar (ORC)
  • 9. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Text • More specifically text = csv, tsv, json records… • Convenient format to use to exchange with other applications or scripts that produce or read delimited files • Human readable and parsable • Data stores is bulky and not as efficient to query • Do not support block compression
  • 10. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Sequence File • Provides a persistent data structure for binary key- value pairs • Row based • Commonly used to transfer data between Map Reduce jobs • Can be used as an archive to pack small files in Hadoop • Support splitting even when the data is compressed
  • 11. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Avro • Widely used as a serialization platform • Row-based, offers a compact and fast binary format • Schema is encoded on the file so the data can be untagged • Files support block compression and are splittable • Supports schema evolution
  • 12. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Parquet • Column-oriented binary file format • Uses the record shredding and assembly algorithm described in the Dremel paper • Each data file contains the values for a set of rows • Efficient in terms of disk I/O when specific columns need to be queried
  • 13. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Optimized Row Columnar • Considered the evolution of the RCFile • Stores collections of rows and within the collection the row data is stored in columnar format • Introduces a lightweight indexing that enables skipping of irrelevant blocks of rows • Splittable: allows parallel processing of row collections • It comes with basic statistics on columns (min ,max, sum, and count)
  • 15. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • ..for write • ..for read HOW TO CHOOSE
  • 16. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Functional Requirements: • What type of data do you have? • Is the data format compatible with your processing and querying tools? • What are your file sizes? • Do you have schemas that evolve over time?
  • 17. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Speed Concerns • Parquet and ORC usually needs some additional parsing to format the data which increases the overall read time • Avro as a data serialization format: works well from system to system, handles schema evolution (more on that later) • Text is bulky and inefficient but easily readable and parsable
  • 18. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 20 40 60 80 100 120 140 160 TimeinSeconds Narrow – Hortonworks (Hive 0.14 ) 0 500 1000 1500 2000 2500 TimeinSeconds Wide – Hortonworks (Hive 0.14) Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns
  • 19. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 10 20 30 40 50 60 70 TimeinSeconds Narrow - hive-1.1.0+cdh5.4.2 0 100 200 300 400 500 600 700 TimeinSeconds Wide - hive-1.1.0+cdh5.4.2 Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns
  • 20. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns 0 10 20 30 40 50 60 70 Text Avro Parquet TimeinSeconds Narrow - Spark 1.3 0 200 400 600 800 1000 1200 Text Avro Parquet TimeinSeconds Wide - Spark 1.3
  • 21. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 200 400 600 800 1000 1200 1400 Megabytes File sizes for narrow dataset 0 2000 4000 6000 8000 10000 12000 Megabytes File sizes for wide dataset
  • 22. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Use case • Avro – Event data that can change over time • Sequence File – Datasets shared between MR jobs • Text – Adding large amounts of data to HDFS quickly
  • 23. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Types of queries: • Column specific queries, or few groups of columns -> Use columnar format like Parquet or ORC • Compression of the file regardless the format increases query speed times • Text is really slow to read • Parquet and ORC optimize read performance at the expense of write performance
  • 24. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Set up: • Narrow dataset: • 10 million rows, 10 columns • Wide dataset: • 4 million rows, 1000 columns • Compression: • Snappy, except for Avro which is deflate
  • 25. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 10 20 30 40 50 60 70 Query 2 (5 conditions) Query 3 (10 conditions) TimeinSeconds Narrow Dataset - Hortonworks Hive 0.14.0.2.2.4.2 Text Avro Parquet Sequence ORC
  • 26. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 100 200 300 400 500 600 700 800 Query 2 (5 conditions) Query 3 (10 conditions) Query 4 (20 conditions) TimeinSeconds Wide Dataset - Hortonworks Hive 0.14.0.2.2.4.2 Text Avro Parquet Sequence ORC
  • 27. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 10 20 30 40 50 60 70 Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions) TimeinSeconds Narrow Dataset - CDH hive-1.1.0+cdh5.4.2 Text Avro Parquet Sequence ORC
  • 28. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 50 100 150 200 250 Query 1 (no conditions) Query 2 (5 conditions) Query 3 (10 conditions) Query 4 (20 conditions) TimeinSeconds Wide Dataset - CDH hive-1.1.0+cdh5.4.2 Text Avro Parquet Sequence ORC
  • 29. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 1 2 3 4 5 6 7 8 Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions) TimeinSeconds Narrow Dataset - CDH Impala Text Avro Parquet Sequence
  • 30. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read 0 5 10 15 20 25 30 Query 1 (0 filters) Query 2 (5 filters) Query 3 (10 filters) Query 4 (20 filters) TimeinSeconds Wide Dataset - CDH Impala Text Avro Parquet Sequence
  • 31. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read Ran 4 queries (using Impala) over 4 Million rows (70GB raw), and 1000 columns (wide table) 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 Query 1 (0 filters) Query 2 (5 filters) Query 3 (10 filters) Query 4 (20 filters) Seconds Query times for different data formats Avro uncompress Avro Snappy Avro Deflate Parquet Seq uncompressed Seq Snappy Text Snappy Text uncompressed
  • 32. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Use case • Avro – Query datasets that have changed over time • Parquet – Query a few columns on a wide table
  • 34. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • What is schema evolution? • Data formats that evolve • Examples • Use cases SCHEMA EVOLUTION
  • 35. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • What is schema evolution? • Adding columns • Renaming columns • Removing columns • Why do we need it?
  • 36. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Data formats that can evolve • Avro • Parquet • Can only add columns at the end • ORC • It’s coming (That’s what they tell me ;) )…
  • 37. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • The data – Dr Who episodes • Original Dr Who & new Dr Who • http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel- information-is-beautiful • Avro schema for the original Dr Who {"namespace": "drwho.avro", "type": "record", "name": "drwho", "fields": [ {"name": "doctor_who_season", "type": "string"}, {"name": "doctor_actor", "type": "string"}, {"name": "episode_no", "type": "string"}, {"name": "episode_title", "type": "string"}, {"name": "date_from", "type": "string"}, {"name": "date_to", "type": "string"}, {"name": "estimated", "type": "string"}, {"name": "planet", "type": "string"}, {"name": "sub_location", "type": "string"}, {"name": "main_location", "type": "string"} ]}
  • 38. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • Original Dr Who data doctor_who_ season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location 3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other 3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus 3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus 3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire 3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs 3 Pertwee 63 The Mutants 1972 2900 Solos 3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis 3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis 3 Pertwee 66 Carnival of Monsters 1972 1928 n Indian Ocean; Planet Inter Minor Ocean; alien planet 3 Pertwee 67 Frontier in Space 1972 2540 n Planet Draconia; Orgon Planet alien planets 3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet 3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales 3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle 3 Pertwee 71 Invasion of The Dinosaurs 1200 1974 y Earth UK London
  • 39. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • Lets add, rename, and delete some columns • Avro schema for the new Dr Who {"namespace": "drwho.avro", "type": "record", "name": "drwho", "fields": [ {"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]}, {"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]}, {"name": "episode_no", "type": ["null","string"]}, {"name": "episode_title", "type": ["null","string"]}, {"name": "date_from", "type": ["null","string"]}, {"name": "date_to", "type": ["null","string"]}, {"name": "estimated", "type": "string"}, {"name": "planet", "type": ["null","string"]}, {"name": "sub_location", "type": ["null","string"]}, {"name": "main_location", "type": ["null","string"]}, {"name": "hd", "type": "string", "default": "no"} ]}
  • 40. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Avro Example • Original & New Dr Who data drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd 10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes 10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland Torchwood house; Near Balmoral yes 10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes 10 Tennant 204 the Girl in the Fireplace 1727 1744 Earth France Paris yes 3 Pertwee 51 Spearhead from Space 1970 1990 Earth England London and other no 3 Pertwee 55 Terror of the Autons 1971 1971 Earth England Luigi Rossini's Circus no 3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus no 3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no 3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs no
  • 41. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Schema evolution • Use cases • New data added to an event stream • Need to see historic data with new data (and the schema has changed a lot) • Business has changed the field/column name
  • 43. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience QUESTIONS?
  • 44. 44 Yes, we’re hiring! info@svds.com THANK YOU Stephen O’Sullivan stephen@svds.com @steveos Demo code is here: github.com/silicon-valley-data- science/stampedecon-2015

Editor's Notes

  1. Description You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.
  2. Do not support block compression Once they are compressed they are not splittable anymore increasing read performance cost
  3. Each data file contains the values for a set of rows Within a data file, the values from each column are organized so that they are adjacent, enabling good compression values
  4. No results query 1 (which is count no conditions). This is because stinger is has meta data about the amount of data in the table (only when it’s an internal table).