© Vigen Sahakyan 2016
Hadoop Tutorial
MapReduce Data Types
and Formats
© Vigen Sahakyan 2016
Agenda
● Data Types
● In/Out Format class Hierarchy
● File Formats
● XML, JSON and Sequence File
● Avro
● Parquet
● Custom Formats (e.g. CSV)
© Vigen Sahakyan 2016
Data Types
Basically, Hadoop (MapReduce) needs data types that support both serialization (for efficient
reads and writes) and comparability (to sort keys during the sort and shuffle phase). For these purposes
Hadoop provides the WritableComparable<T> interface, which extends Writable (a serializable object
that implements a simple, efficient serialization protocol) and Comparable<T>. You can see
some of these implementations below:
● Data Types: ByteWritable, IntWritable, LongWritable, FloatWritable, DoubleWritable, Text,
BooleanWritable, VIntWritable, VLongWritable
● Data Structures: BytesWritable, ArrayWritable, MapWritable and SortedMapWritable (the
latter two extend AbstractMapWritable, which implements Configurable).
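For illustration, here is a minimal sketch of a custom WritableComparable; the composite key and its field names are invented for this slide and are not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: write()/readFields() provide serialization,
// compareTo() provides the ordering used in the sort and shuffle phase.
public class YearTemperaturePair implements WritableComparable<YearTemperaturePair> {
    private int year;
    private float temperature;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeFloat(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readFloat();
    }

    @Override
    public int compareTo(YearTemperaturePair other) {
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Float.compare(temperature, other.temperature);
    }
}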
© Vigen Sahakyan 2016
In/Out Format class Hierarchy
● The MapReduce framework uses InputFormat<K,V> and RecordReader<K,V> to read
records separated by a delimiter and deliver them directly to your
map() function. You don't need to use them directly, because the framework
takes care of all of this for you. Each map() call receives a single
record, and your only task is to handle that record and nothing else.
● For the reduce() side there are similar classes and interfaces
(OutputFormat<K,V>, RecordWriter<K,V>, FileOutputFormat<K,V>).
● For everything to work properly, you should set the input and output format classes in the job configuration (JobConf in the old API, Job in the new one), as shown in the driver sketch below.
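A minimal driver sketch (new MapReduce API; the class name and paths are illustrative) showing where the format classes are set:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDriver.class);

        // Tell the framework how to read input splits and write output records.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}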
© Vigen Sahakyan 2016
In/Out Format class Hierarchy
class CustomInputFormat extends FileInputFormat<K,V> {
    List<InputSplit> getSplits(JobContext context) {
        // read and return the list of splits (one per HDFS block by default) for the given file
    }
    RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // return a CustomRecordReader instance for each split and set the RecordReader's delimiter
    }
}
abstract class FileInputFormat<K,V> extends InputFormat<K,V> { ... }
abstract class InputFormat<K,V> { // base class declaring getSplits and createRecordReader
    abstract List<InputSplit> getSplits(JobContext context);
    abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context);
}
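As a concrete counterpart to the sketch above, a minimal custom InputFormat that inherits getSplits() from FileInputFormat and only supplies a RecordReader; the class name is invented, and it simply reuses Hadoop's LineRecordReader:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative: getSplits() is inherited from FileInputFormat,
// so only the RecordReader factory method has to be written.
public class SimpleLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader(); // records delimited by newlines
    }
}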
© Vigen Sahakyan 2016
In/Out Format class Hierarchy
class CustomRecordReader extends RecordReader<K,V> {
    void initialize(InputSplit split, TaskAttemptContext context) {
        // find and open the HDFS file for this split, seek to the split's start offset and create a LineReader to read records
    }
    boolean nextKeyValue() {
        // physically read the next record from the current position, respecting the delimiter
    }
    K getCurrentKey() { ... } // getter for the current key
    V getCurrentValue() { ... } // getter for the current value
}
abstract class RecordReader<K,V> { // base class; subclasses implement all four methods
    abstract void initialize(InputSplit split, TaskAttemptContext context);
    abstract boolean nextKeyValue();
    abstract K getCurrentKey();
    abstract V getCurrentValue();
}
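To see how the framework drives these methods, here is a simplified paraphrase (not the exact Hadoop source) of the default Mapper.run() loop; the context it receives is backed by the RecordReader that the InputFormat created for the split:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DefaultStyleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Simplified paraphrase of the default Mapper.run().
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {            // delegates to RecordReader.nextKeyValue()
            map(context.getCurrentKey(),            // delegates to RecordReader.getCurrentKey()
                context.getCurrentValue(), context); // ... and RecordReader.getCurrentValue()
        }
        cleanup(context);
    }
}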
© Vigen Sahakyan 2016
File Formats
● XML (one of the most common file formats)
● JSON (as widespread as XML, but richer)
● SequenceFile (the native MapReduce key/value format)
● Avro (created within the Hadoop project; record-oriented rather than key/value)
● Parquet (a column-oriented data storage format)
● Thrift (not used directly)
● Protocol Buffers (MapReduce reads them via Elephant Bird)
● Other custom formats (e.g. CSV)
© Vigen Sahakyan 2016
XML, JSON and Sequence File
● XML - MapReduce doesn’t have native XML support. If you want to work with large XML files
in MapReduce and be able to split and process them in parallel, you can use Mahout’s
XMLInputFormat to work with XML files in HDFS with MapReduce. It reads records that are
delimited by specific XML begin and end tags.
● JSON - there are two problems: MapReduce doesn’t come with an InputFormat that works
with JSON, and how does one even go about splitting JSON?
If you want to work with JSON inputs in MapReduce, you can use Elephant Bird’s
LzoJsonInputFormat as a basis for creating an input format class that works
with JSON elements.
● Sequence File - created specifically for MapReduce tasks, it is a row-oriented
key/value file format that is well supported across Hadoop ecosystem projects (Hive, Pig,
etc.). A short read/write sketch follows below.
None of the aforementioned formats support code generation or schema evolution.
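For the SequenceFile case, a small sketch of writing and reading key/value records with the native API (the path and record contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq"); // illustrative path

        // Write key/value records.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }

        // Read them back in order.
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}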
© Vigen Sahakyan 2016
Avro
Avro is an RPC and data serialization framework developed within the Hadoop project to
improve data interchange, interoperability, and versioning in MapReduce. Natively, Avro
is a record-based rather than a key/value file format.
● Supports code generation and schema evolution.
● Uses JSON to define schemas (see the sketch after this list).
● There are 3 ways to use Avro in MapReduce (mixed, record-based, and
key/value-based modes):
○ Mixed - for cases where you have non-Avro input and generate Avro output, or vice versa, in
which case the Avro mapper and reducer classes aren’t suitable. Use the AvroWrapper class.
○ Record-based - here Avro is used end-to-end. Since Avro isn’t a key/value format, you
should use the dedicated Mapper (AvroMapper) and Reducer (AvroReducer) classes.
○ Key/Value-based - if you want to use Avro as a native key/value format, use the AvroKeyValue,
AvroKey, and AvroValue classes to work with Avro key/value data.
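A minimal sketch of defining an Avro schema in JSON and building a record against it with the generic API; the schema and field names are invented for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaDemo {
    // Illustrative schema: Avro schemas are plain JSON documents.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);
        System.out.println(user); // prints the record's fields as text
    }
}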
© Vigen Sahakyan 2016
Parquet
Parquet is a newer columnar storage format (yes, a storage format; I didn’t say file format).
Parquet doesn’t have its own object model (in-memory representation); instead it has object
model converters (to represent data as Avro, Thrift, Protocol Buffers, Pig, Hive, etc.).
● Parquet physically stores data column by column instead of row by row (as, e.g., Avro does).
For that reason it’s called columnar storage.
● When you often need a projection onto a few columns, or run operations (avg,
max, min, etc.) only on specific columns, columnar storage is more efficient,
because accessing that data is faster than with row storage.
● It supports schema evolution but doesn’t support code generation.
● You can use AvroParquetInputFormat and AvroParquetOutputFormat in MapReduce,
and the AvroParquetWriter and AvroParquetReader classes in a plain Java app (sketched below).
● It is well supported by many Hadoop projects (Pig, Hive, Spark, Impala, HBase, etc.)
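A hedged sketch of the plain-Java path mentioned above, assuming the parquet-avro artifact is on the classpath; builder entry points vary slightly between Parquet versions, and the path and schema are illustrative:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Point\",\"fields\":["
            + "{\"name\":\"x\",\"type\":\"int\"},{\"name\":\"y\",\"type\":\"int\"}]}");
        Path path = new Path("/tmp/points.parquet"); // illustrative path

        // Write records; Parquet lays them out column by column on disk.
        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(path).withSchema(schema).build()) {
            GenericRecord point = new GenericData.Record(schema);
            point.put("x", 1);
            point.put("y", 2);
            writer.write(point);
        }

        // Read them back through the Avro object-model converter.
        try (ParquetReader<GenericRecord> reader =
                AvroParquetReader.<GenericRecord>builder(path).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}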
© Vigen Sahakyan 2016
Custom Formats (CSV)
You can read custom file formats with the TextInputFormat class and implement the
parsing and writing in your map and reduce tasks accordingly. But if you want to
write reusable and convenient code for a specific file format (e.g. CSV), you need to
implement your own:
● CSVInputFormat, which extends FileInputFormat
● CSVRecordReader, which extends RecordReader
● CSVOutputFormat, which extends TextOutputFormat
● CSVRecordWriter, which extends RecordWriter
Then use these as the input and output classes in your MapReduce job by setting them
in the job configuration; the input half is sketched below.
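A hedged sketch of the input half (a CSVInputFormat plus a record reader that splits each line on commas); it leans on Hadoop's LineRecordReader, and the naive split deliberately ignores quoting and escaping:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Keys are byte offsets; values are lines with fields re-joined by tabs.
public class CSVInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new CSVRecordReader();
    }

    public static class CSVRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context); // opens the file, seeks to the split start
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) return false;
            // Naive split: a production reader would honor quoting and escapes.
            String[] fields = lineReader.getCurrentValue().toString().split(",");
            value.set(String.join("\t", fields));
            return true;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return lineReader.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}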
© Vigen Sahakyan 2016
References
1. Hadoop in Practice 2nd Edition by Alex Holmes
http://www.amazon.com/Hadoop-Practice-Alex-Holmes/dp/1617292222
2. Hadoop: The Definitive Guide by Tom White
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White-ebook/dp/B00V7B1IZC
Thanks!
© Vigen Sahakyan 2016