XML Parsing with Map Reduce

www.edureka.co/big-data-and-hadoop
Hadoop the ultimate data storage
And processing Together

Objectives
Analyze different use-cases where MapReduce is used
Differentiate between Traditional way and MapReduce way
Learn about Hadoop 2.x MapReduce architecture and components
Understand execution flow of YARN MapReduce application
Implement basic MapReduce concepts
Run a MapReduce Program
At the end of this module, you will be able to

Where MapReduce is Used?
Weather Forecasting
HealthCare
 Problem Statement:
» De-identify personal health information.
 Problem Statement:
» Finding Maximum temperature recorded in a year.

Where MapReduce is Used?
MapReduce
FeaturesLarge Scale
Distributed Model
Used in
Function
Design Pattern
Parallel
Programming
A Program Model
Classification
Analytics
Recommendation
Index and Search
Map
Reduce
Classification
Eg: Top N records
Analytics
Eg: Join, Selection
Recommendation
Eg: Sort
Summarization
Eg: Inverted Index
Implemented
Google
Apache Hadoop
HDFS
Pig
Hive
HBase
For

The Traditional Way
Very
Big
Data
Split Data matches
All
matches
grep
grep
grep cat
grep
:
matches
matches
matches
Split Data
Split Data
Split Data

MapReduce Way
Very
Big
Data
Split Data
All
matches
:
Split Data
Split Data
Split Data
M
A
P
R
E
D
U
C
E
MapReduce Framework

MapReduce Paradigm
The Overall MapReduce Word Count Process
Input Splitting Mapping Shuffling Reducing Final Result
List(K3,V3)
Deer Bear River
Dear Bear River
Car Car River
Deer Car Bear
Bear, 2
Car, 3
Deer, 2
River, 2
Deer, 1
Bear, 1
River, 1
Car, 1
Car, 1
River, 1
Deer, 1
Car, 1
Bear, 1
K2,List(V2)List(K2,V2)
K1,V1
Car Car River
Deer Car Bear
Bear, 2
Car, 3
Deer, 2
River, 2
Bear, (1,1)
Car, (1,1,1)
Deer, (1,1)
River, (1,1)

Anatomy of a MapReduce Program
MapReduce
Map:
Reduce:
(K1, V1) List (K2, V2)
(K2, list (V2)) List (K3, V3)
Key Value

Why MapReduce?
Two biggest Advantages:
» Taking processing to the data
» Processing data in parallel
a
b
c
Map Task
HDFS Block
Data Center
Rack
Node

 ApplicationMaster
» One per application
» Short life
» Coordinates and Manages MapReduce Jobs
» Negotiates with Resource Manager to
schedule tasks
» The tasks are started by NodeManager(s)
 Job History Server
» Maintains information about submitted
MapReduce jobs after their ApplicationMaster
terminates
 Client
» Submits a MapReduce Job
 Resource Manager
» Cluster Level resource manager
» Long Life, High Quality Hardware
 Node Manager
» One per Data Node
» Monitors resources on Data Node
Hadoop 2.x MapReduce Components
 Container
» Created by NM when requested
» Allocates certain amount of resources
(memory, CPU etc.) on a slave node

BATCH
(MapReduce)
INTERACTIVE
(Text)
ONLINE
(HBase)
STREAMING
(Storm, S4, …)
GRAPH
(Giraph)
IN-MEMORY
(Spark)
HPC MPI
(OpenMPI)
OTHER
(Search)
(Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN – Moving beyond MapReduce

MapReduce Application Execution
Executing MapReduce Application on YARN

YARN MR Application Execution Flow
MapReduce Job Execution
» Job Submission
» Job Initialization
» Tasks Assignment
» Memory Assignment
» Status Updates
» Failure Recovery

HDFS
Application Job Object
Client JVM
Client
Resource
Manager
Management Node
Run Job
2. Get New Application ID
4. Submit Application Context
3. Prepare the
Application submit
context
3.1 App Jar
3.2 Job Resources(Block
locations)
3.3 User Information
1. Notify Start Application

HDFS
3. Prepare the
Application submit
context
3.1 App Jar
locations)
Node Manager
5. Start AppMaster container /
Allocate Context for AppMaster
App Master
6.Alloate
Container for
AppMaster
7.Request
Resources
8.Notify with resources
Availability
Data Node
Application Job Object
Client JVM
Client
Resource
Manager
Management Node
Run Job
2. Get New Application ID
4. Submit Application Context

HDFS
Resource
Manager
3. Prepare the Application
submit context
3.1 App Jar
locations)
Management Node
Node Manager
5. Start AppMaster container / Allocate
Context for AppMaster
App Master
6. Allocate
Container for
AppMaster
7.Request
Resources
8.Notify with resources
Availability
Data Node
Client
Node Manager
Data node-1
Node Manager
Map Block
9.Start Container
in the worker node
Data node-2
Node Manager
Map Block
10.NM allocate
Container
10.NM allocate
Container
2. Get New Application
4. Submit Application
9.Start Container
in the worker
node

11.Task get Executed.
12.If any reducer in a Job Reducer, again AppMaster Request the Node Manager to start the and Allocate
Container
13.Output of All the Maps given to reducer and Reducer get executed
14.Once Job finished, Application Master notify the Resource Manager and Client Library
15.Application Master closed.

Hadoop 2.x : YARN Workflow
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Node Manager
Container 1.2
Container 1.1
Container 2.1
Container 2.2
Container 2.3
App
Master 2
App
Master 1
Scheduler
Applications
Manager (AsM)
Resource
Manager

Summary: Application Workflow
Execution Sequence :
1. Client submits an application Client RM NM AM
1

1. Client submits an application
2. RM allocates a container to start AM
Client RM NM AM
1
2

3. AM registers with RM
Client RM NM AM
1
2
3

4. AM asks containers from RM
Client RM NM AM
1
2
3
4

5. AM notifies NM to launch containers
Client RM NM AM
1
2
3
4
5

6. Application code is executed in container
Client RM NM AM
1
2
3
4
5
6

7. Client contacts RM/AM to monitor application’s status
Client RM NM AM
1
2
3
4
5
7 6

7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM
Client RM NM AM
1
2
3
4
5
7
8
6

Input Splits
INPUT DATA
Physical
Division
Logical
Division
HDFS
Blocks
Input
Splits

Relation Between Input Splits and HDFS Blocks
1 2 3 4 5 6 7 8 9 10 11
 Logical records do not fit neatly into the HDFS blocks.
 Logical records are lines that cross the boundary of the blocks.
 First split contains line 5 although it spans across blocks.
File
Lines
Block
Boundary
Block
Boundary
Block
Boundary
Block
Boundary
Split Split Split

MapReduce Job Submission Flow
Input data is distributed to nodes
Node 1 Node 2
INPUT DATA

Each map task works on a “split” of data
Map
Node 1
Map
Node 2
INPUT DATA

Mapper outputs intermediate data
Map
Node 1
Map
Node 2
INPUT DATA

Data exchange between nodes in a “shuffle” process
Map
Node 1
Map
Node 2
Node 1 Node 2
INPUT DATA

Intermediate data of the same key goes to the same reducer
Map
Node 1
Map
Node 2
Reduce
Node 1
Reduce
Node 2
INPUT DATA

Intermediate data of the same key goes to the same reducer
Reducer output is stored
Map
Node 1
Map
Node 2
Reduce
Node 1
Reduce
Node 2
INPUT DATA

Combiner
Combiner
Reducer
(B,1)
(C,1)
(D,1)
(E,1)
(D,1)
(B,1)
(D,1)
(A,1)
(A,1)
(C,1)
(B,1)
(D,1)
(B,2)
(C,1)
(D,2)
(E,1)
(D,2)
(A,2)
(C,1)
(B,1)
(A, [2])
(B, [2,1])
(C, [1,1])
(D, [2,2])
(E, [1])
(A,2)
(B,3)
(C,2)
(D,4)
(E,1)
Shuffle
CombinerMapper
Mapper
B
C
D
E
D
B
D
A
A
C
B
D
Block1Block2

Partitioner – Redirecting Output from Mapper
Map
Map
Map
Reducer
Reducer
Reducer
Partitioner
Partitioner
Partitioner

Getting Data to the Mapper
Input File Input File
Input split Input split Input split Input split
RecordReader RecordReader RecordReader RecordReader
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)

Partition and Shuffle
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Partitioner Partitioner Partitioner Partitioner
(intermediates) (intermediates) (intermediates)
Reducer Reducer Reducer

Demo of Word Count Program
To illustrate Default Input Format
(Text Input Format)
Demo

Input file
Input Split Input Split Input Split
Record
Reader
Record
Reader
Record
Reader
Mapper Mapper Mapper
(Intermediates) (Intermediates) (Intermediates)
InputFormat
Input Split
Record
Reader
Mapper
Input file
(Intermediates)
Input Format

Combine File
Input Format<K,V>
Text Input Format
Key Value Text
Input Format
Nline Input Format
Sequence File
Input Format<K,V>
File Input Format
<K,V>
Input Format<K,V>
org.apache.hadoop.mapreduce
<<interface>>
Composable
Input Format
<K,V>
Composite Input Format
<K,V>
DB Input
Format<T>
Sequence File As
Binary Input Format
Sequence File As
Text Input Format
Sequence File Input
Filter<K,V>
Input Format – Class Hierarchy

Reducer
RecordWriter
Output file
Reducer
RecordWriter
Output file
Reducer
RecordWriter
Output file
OutputFormat
Output Format

Text Output Format
<K,V>
Sequence File
Output Format<K,V>
Output Format <K,V>
org.apache.hadoop.mapreduce
DB Output Format
<K,V>
File Output Format
<K,V>
Null Output Format
<K,V>
Filter Output Format
<K,V>
Sequence File As Binary
Output Format
Lazy Output Format
<K,V>
Output Format – Class Hierarchy

Demo
Demo: Custom Input Format

XML Parsing with Map Reduce

More Related Content

What's hot

Viewers also liked

Similar to XML Parsing with Map Reduce

More from Edureka!

Recently uploaded

XML Parsing with Map Reduce