Data Storage Formats in HDFS
Evaluation Criteria
- The processing tools
  - e.g., Cloudera (Impala) has historically had limited ORC support
- Whether the data (and its schema) changes over time
- Splittability
  - XML is not splittable
- Compression
  - Speeds up I/O operations
  - Saves storage
  - Increases processing time, because data must be decompressed
- The data size
- Processing and query performance
Common File Formats
- All file formats
  - Standard
    - Sequence data
    - Structured data
  - Columnar
    - Parquet
    - ORC
  - Serialization
    - Avro
Summary of some file formats’ features

Data Format    | Type of Format | Splittable | Changing schema | Compression | Metadata
JSON, XML      | Standard       | -          | +               | -           | +
CSV files      | Standard       | +          | -               | -           | -
JSON records   | Standard       | +          | +               | -           | +
Sequence files | Standard       | +          | -               | +           | -
Avro files     | Serialization  | +          | +               | +           | +
ORC files      | Columnar       | +          | +               | +           | +
Parquet files  | Columnar       | +          | +               | +           | +
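
The "Splittable" column can be made concrete with a small sketch. The Python below is not tied to any Hadoop API. It only illustrates, under the usual rule for line-oriented input (a split owns every record that starts inside its byte range), why newline-delimited JSON records can be processed in independent pieces while a single JSON or XML document cannot. The function name `read_split`, the sample rows, and the cut offsets are hypothetical.

```python
import json

def read_split(data: bytes, start: int, end: int) -> list:
    """Return the records whose first byte lies in [start, end)."""
    pos = start
    # A split that begins in the middle of a record skips to the next record start.
    if start > 0 and data[start - 1:start] != b"\n":
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        line = data[pos:nl]  # may extend past `end`; the record still belongs here
        if line:
            records.append(json.loads(line))
        pos = nl + 1
    return records

rows = [{"id": i, "name": f"user{i}"} for i in range(10)]
data = "".join(json.dumps(r) + "\n" for r in rows).encode()

# Cut the file at arbitrary byte offsets, as a block boundary would.
cuts = [0, 37, 101, len(data)]
parts = [read_split(data, a, b) for a, b in zip(cuts, cuts[1:])]
merged = [r for part in parts for r in part]
```

Each split can be read by a different worker, and concatenating the results gives back every record exactly once. A single large JSON or XML document has no such record boundary to resynchronize on, which is why it is marked as not splittable.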
Sequence File
- A good solution for storing many small files
- Stores data as <key, value> pairs
- Supports compression
  - Record level
  - Block level
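
The difference between record and block compression can be sketched with plain `zlib`. This is only an illustration of the idea, not the real SequenceFile on-disk format: record compression compresses each value on its own, while block compression gathers the keys and values of many records and compresses them together. The file names, contents, and the `block_records` parameter are made up for the example.

```python
import zlib

# Hypothetical small files packed as <key, value> pairs: file name -> content.
records = [
    (f"log-{i:04d}.txt".encode(), b"status=OK host=node-1 latency_ms=12\n" * 3)
    for i in range(200)
]

def record_compressed_size(pairs) -> int:
    """Record compression: each value is compressed alone; keys stay raw."""
    return sum(len(k) + len(zlib.compress(v)) for k, v in pairs)

def block_compressed_size(pairs, block_records: int = 50) -> int:
    """Block compression: keys and values of many records are compressed together."""
    total = 0
    for i in range(0, len(pairs), block_records):
        block = pairs[i:i + block_records]
        keys = b"".join(k for k, _ in block)
        values = b"".join(v for _, v in block)
        total += len(zlib.compress(keys)) + len(zlib.compress(values))
    return total

raw_size = sum(len(k) + len(v) for k, v in records)
record_size = record_compressed_size(records)
block_size = block_compressed_size(records)
```

On repetitive data like this, `block_size` comes out smaller than `record_size`, because the compressor can exploit redundancy across records rather than only within one value.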
Parquet
- Optimized for Impala
- Used by Twitter
- Data Structure
  - Data is partitioned into row groups, and each row group is stored column by column
  - Pages can be compressed
Parquet
- Data Structure
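
A rough picture of this layout, as a minimal sketch: rows are cut into row groups, each row group stores one chunk per column, and each chunk is divided into pages that are compressed individually. The code below is a toy model in plain Python (JSON plus `zlib`), not the Parquet encoding itself; `to_row_groups`, `read_column`, and the tiny group and page sizes are assumptions chosen for readability.

```python
import json
import zlib

def to_row_groups(rows, columns, rows_per_group=4, values_per_page=2):
    """Lay out rows as row groups -> column chunks -> compressed pages."""
    groups = []
    for g in range(0, len(rows), rows_per_group):
        group_rows = rows[g:g + rows_per_group]
        chunks = {}
        for col in columns:
            values = [r[col] for r in group_rows]
            # Each page holds a few values of ONE column and is compressed alone.
            chunks[col] = [
                zlib.compress(json.dumps(values[p:p + values_per_page]).encode())
                for p in range(0, len(values), values_per_page)
            ]
        groups.append(chunks)
    return groups

def read_column(groups, col):
    """Read a single column by decompressing only that column's pages."""
    out = []
    for chunks in groups:
        for page in chunks[col]:
            out.extend(json.loads(zlib.decompress(page)))
    return out

rows = [{"id": i, "city": "A" if i % 2 else "B", "amount": i * 1.5}
        for i in range(10)]
groups = to_row_groups(rows, ["id", "city", "amount"])
amounts = read_column(groups, "amount")
```

Reading `amount` never touches the pages of `id` or `city`, which is the property that makes columnar formats attractive for analytical queries.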
ORC
- Optimized for Hive and Presto
- Data Structure
  - Indexes contain basic statistics
  - File footer contains the list of stripes
  - Postscript holds the compression parameters
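
The value of keeping basic statistics in the index can be shown with a short sketch. Assuming each stripe records the minimum and maximum of a column, a reader can skip any stripe whose statistics prove that no row can match the filter. The `Stripe` class and both functions are hypothetical and only model the idea, not the ORC file layout.

```python
from dataclasses import dataclass

@dataclass
class Stripe:
    rows: list        # the stripe's row data (a single integer column here)
    min_value: int    # basic statistics kept alongside the data
    max_value: int

def build_stripes(values, stripe_size=5):
    """Group values into stripes and record min/max statistics for each."""
    stripes = []
    for i in range(0, len(values), stripe_size):
        chunk = values[i:i + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes

def scan_greater_than(stripes, threshold):
    """Return matching values and how many stripes were actually read."""
    matches, stripes_read = [], 0
    for s in stripes:
        if s.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the stripe
        stripes_read += 1
        matches.extend(v for v in s.rows if v > threshold)
    return matches, stripes_read

stripes = build_stripes(list(range(20)))
matches, stripes_read = scan_greater_than(stripes, 16)
```

With four stripes covering 0 to 19, the filter `value > 16` reads only the last stripe; the other three are ruled out by their maximum alone.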
Avro
- Row-based storage
- Commonly used with Apache Kafka
- Robust support for changing schemas
- Data Structure
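
Support for changing schemas rests on resolving the schema the data was written with against the schema the reader expects. The sketch below mimics two of those rules in plain Python, without the Avro library: a field the reader added is filled from its default, and a field the reader no longer declares is ignored. The `resolve` function, the schemas, and the record are illustrative assumptions.

```python
def resolve(record: dict, reader_schema: list) -> dict:
    """Project a record written with an older schema onto the reader's schema."""
    out = {}
    for field in reader_schema:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out  # fields the reader does not declare are dropped

# Hypothetical reader schema: v2 adds "country" with a default and drops "fax".
reader_schema_v2 = [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "country", "type": "string", "default": "unknown"},
]
old_record = {"id": 7, "email": "a@example.com", "fax": "n/a"}
new_view = resolve(old_record, reader_schema_v2)
```

Old data stays readable after the schema changes, as long as every newly added field carries a default.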
Avro vs Parquet
- Avro is ideal for ETL
- Parquet is ideal for analytical queries
- Read performance is better in Parquet
- Write performance is better in Avro
- Avro supports full schema evolution
- Parquet supports only appending new columns
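
The read and write trade-off follows from the layout. A minimal sketch, using JSON as a stand-in for both encodings: in a row-oriented store, appending a record is a single write, but aggregating one field means decoding every record in full; in a column-oriented store, the same aggregate reads only that column's bytes. All names and data here are hypothetical.

```python
import json

rows = [{"id": i, "name": f"user{i}", "amount": i * 2} for i in range(1000)]

# Row-oriented layout: one serialized record after another (cheap to append).
row_store = [json.dumps(r).encode() for r in rows]

# Column-oriented layout: all values of a column are stored together.
column_store = {
    col: json.dumps([r[col] for r in rows]).encode()
    for col in ("id", "name", "amount")
}

def sum_amount_row_store(store):
    """Every record must be decoded in full to reach one field."""
    bytes_read = sum(len(b) for b in store)
    total = sum(json.loads(b)["amount"] for b in store)
    return total, bytes_read

def sum_amount_column_store(store):
    """Only the 'amount' column is decoded; other columns are never touched."""
    blob = store["amount"]
    return sum(json.loads(blob)), len(blob)

row_total, row_bytes = sum_amount_row_store(row_store)
col_total, col_bytes = sum_amount_column_store(column_store)
```

Both paths return the same total, but the columnar path reads far fewer bytes, which matches the claim that Parquet suits query workloads and Avro suits record-at-a-time ETL.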
Parquet vs ORC
- Parquet is better for nested data
- ORC compresses more efficiently
Uber Use Case
The End
