ANATOMY OF CLASSIC MAPREDUCE IN HADOOP
RAJESH_1290K@YAHOO.COM
FIRST, A JOB HAS TO BE SUBMITTED TO THE HADOOP
CLUSTER. LET’S SEE HOW JOB SUBMISSION
HAPPENS IN HADOOP.
[Slide diagram: job submission — in the client JVM, the MR Program calls submit() on Job; Job uses JobSubmitter.submitJobInternal(), which calls getNewJobId() and submitJob() on the JobTracker (in the Job Tracker JVM) and copies the job JAR, configuration files and computed input splits into a folder named after the job ID in HDFS.]
• The client calls the submit() (or waitForCompletion()) method on Job.
• Job creates a new instance of JobSubmitter, which has a method called submitJobInternal().
• Job.submit() then calls JobSubmitter.submitJobInternal(), which does the following:
  • Invokes JobTracker.getNewJobId() to generate a unique ID for the job.
  • Checks the output specification. If the output path is not specified, or if it already exists, an exception is thrown.
  • Checks the input specification. If the input path is invalid or the splits cannot be computed, an exception is thrown; otherwise the input splits are computed.
  • Creates a directory named after the job ID in HDFS.
  • Copies the job JAR, configuration files and computed input splits to this directory. The job JAR is stored with a high replication factor, configured by the ‘mapred.submit.replication’ property.
  • Informs the JobTracker that the job is ready to be executed by calling submitJob() on the JobTracker.
• Instead of Job.submit(), the MapReduce program can call Job.waitForCompletion() (a minimal driver sketch follows this list). The difference is that the latter waits until the job is complete, polling the JobTracker every second. Once the job completes successfully, the counters are printed to the console; if the job fails, the exception is printed instead. The submit() method, in contrast, returns as soon as job submission is done.
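
For context, here is a minimal, self-contained driver sketch showing the two submission styles discussed above. The word-count job, class names and paths are illustrative and not taken from the slides; Job.getInstance() assumes a reasonably recent Hadoop release (older ones use new Job(conf, ...)).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

  // Minimal mapper: emits (word, 1) for every whitespace-separated token.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Minimal reducer: sums the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");      // Job creates a JobSubmitter internally
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

    // waitForCompletion(true) submits the job, then polls every second until it finishes,
    // printing progress and (on success) the counters to the console.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    // job.submit() would instead return as soon as the submission itself is done.
  }
}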
NEXT THE SUBMITTED JOB WILL BE INITIALIZED.
NOW LET’S SEE HOW JOB INITIALIZATION
HAPPENS IN HADOOP.
[Slide diagram: job initialization — submitJob() hands the job (J1) to the job scheduler, which reads the input splits (S1, S2, S3) stored in the job-ID directory in HDFS and creates map tasks (T1, T2, T3), reduce tasks, and the other tasks: job setup (JS) and job cleanup (JC), together with the bookkeeping info for the job.]
• When JobTracker.submitJob() is called, the JobTracker adds a token to an internal queue.
• A job scheduler picks it up and creates an object representation of the job. This representation encapsulates the job’s tasks and bookkeeping information (used to track the status and progress of the job’s tasks).
• The job scheduler then reads the input splits from HDFS.
• The scheduler creates one map task for each input split.
• The ‘mapred.reduce.tasks’ property holds an integer. The job scheduler reads this property and creates that many reduce tasks (see the snippet after this list). Let’s assume the property’s value is 1.
• Job setup (JS) and job cleanup (JC) are the other two tasks created. The job setup task runs before any map task, and the job cleanup task runs after all reduce tasks are complete.
• Each job has an OutputCommitter (a Java abstract class) associated with it; the default implementation is FileOutputCommitter. The OutputCommitter defines what the setup and cleanup tasks (for both job and task) should do.
• In the case of FileOutputCommitter, the job setup task creates the final output directory for the job and the temporary working space for task output, and the cleanup task deletes the temporary working space for task output. Refer to the API documentation for more details.
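
A small sketch (wrapped in a runnable class with an illustrative name) of two equivalent ways the number of reduce tasks described above can be set before submission; the job scheduler creates exactly this many reduce tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceTaskCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Old-style property name used on the slide; value 1 means a single reduce task.
    conf.setInt("mapred.reduce.tasks", 1);

    Job job = Job.getInstance(conf, "reduce task count example");
    // Equivalent, higher-level API call on the Job itself.
    job.setNumReduceTasks(1);

    System.out.println("Reduce tasks requested: " + job.getNumReduceTasks());
  }
}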
THE INITIALIZED JOB IS SPLIT INTO TASKS, AND THE
JOB TRACKER ASSIGNS TASKS TO TASK TRACKERS.
NOW LET’S SEE HOW TASK ASSIGNMENT
HAPPENS IN HADOOP.
[Slide diagram: task assignment — jobs J1, J2, J3 wait in the Job Tracker’s internal queue; the Task Tracker (with its configured map slots and reduce slots) sends a heartbeat call to the Job Tracker carrying task status and empty slots, and the Job Tracker allocates a map or reduce task to the Task Tracker as the heartbeat’s return value.]
• Based on factors like available memory and available cores, a fixed number of map and reduce slots is configured on each Task Tracker (see the sketch after this list).
• When a Task Tracker is up and running, it sends a heartbeat call (every 5 seconds) to the Job Tracker. This is a two-way communication between Task Tracker and Job Tracker: the Task Tracker uses the call to tell the Job Tracker that it is alive, to report the status of its tasks, and to report which map and reduce slots are free.
• The Job Tracker may have many jobs queued to run. It uses one of its scheduling algorithms to pick a job from the queue; once a job is picked, its tasks are picked.
• When the Job Tracker learns (from the heartbeat call) that there are empty slots, it allocates tasks to the Task Tracker via the heartbeat’s return value.
• The Job Tracker considers data locality when allocating map tasks; i.e., it tries to allocate a map task to a Task Tracker where the input block is available (data-local). If that is not possible, it tries to allocate the map task to a map slot on a node in the same rack (rack-local). There are no such considerations for reduce tasks.
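
Slot counts are cluster-side settings rather than per-job ones; the sketch below (illustrative class name) only prints the MRv1 property names that govern them and their usual defaults. On a real cluster these would be set in mapred-site.xml on each Task Tracker node.

import org.apache.hadoop.conf.Configuration;

public class SlotConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // These MRv1 properties are normally set in mapred-site.xml on every Task Tracker;
    // the defaults are 2 map slots and 2 reduce slots per node.
    int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    System.out.println("Map slots per TaskTracker: " + mapSlots);
    System.out.println("Reduce slots per TaskTracker: " + reduceSlots);
  }
}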
NOW TASKS ARE ASSIGNED TO THE TASK TRACKER,
WHICH FOLLOWS A SERIES OF STEPS TO
EXECUTE A TASK. LET’S SEE HOW TASKS ARE
EXECUTED IN THE TASK TRACKER.
[Slide diagram: task execution — the Task Tracker copies the job JAR from HDFS and required files from the Distributed Cache, un-jars the JAR contents into a folder created on the Task Tracker’s local disk, and a TaskRunner launches a child JVM in which a child process runs the map or reduce task.]
• The Task Tracker has now been assigned a task.
• The Task Tracker copies the job JAR from HDFS to its local file system. It also copies any required files from the Distributed Cache.
• The Task Tracker creates a new folder on its local file system.
• The job JAR content is un-jarred into the new folder.
• The Task Tracker creates a new instance of TaskRunner.
• TaskRunner launches a new JVM to run each task, so that any bug in the user-defined map and reduce functions does not affect the Task Tracker (see the configuration sketch after this list).
• The child process communicates with its parent through the umbilical interface. It informs the parent of the task’s progress every few seconds until the task is complete.
• Setup and cleanup tasks are executed in the same JVM as the task itself. The OutputCommitter implementation associated with the job determines what actions are taken during setup and cleanup.
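
As a small illustration of the child JVM described above, the fragment below (illustrative class name and values) sets two classic MRv1 properties that govern it: the JVM options each child process is launched with, and whether a JVM may be reused for more than one task.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChildJvmConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JVM options passed to each task's child process (heap size here is illustrative).
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // 1 (the default) launches a fresh JVM per task; -1 lets a JVM run any number of
    // tasks of the same job.
    conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);

    Job job = Job.getInstance(conf, "child jvm config example");
    System.out.println(job.getConfiguration().get("mapred.child.java.opts"));
  }
}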
SINCE TASKS ARE EXECUTED IN A DISTRIBUTED
ENVIRONMENT, TRACKING THE PROGRESS AND
STATUS OF A JOB IS TRICKY. LET’S SEE HOW
PROGRESS AND STATUS UPDATES ARE HANDLED
IN HADOOP.
• MapReduce jobs are long-running batch jobs, taking anywhere from a minute to hours to run, so the user needs feedback on how the job is progressing.
• Each job and each task has a status, which comprises the following:
  • Status of the job or task
  • Progress of maps and reduces
  • Values of the job’s counters
  • Status description

• Status of job/task: possible values are RUNNING, SUCCESSFULLY COMPLETED and FAILED. The Job Tracker and Task Trackers set this value as the job or task progresses.
• Each task tracks its progress, i.e., the proportion of the task completed (a reporting sketch follows this list).
  • A map task’s progress is measured by the proportion of its input processed so far.
  • Measuring a reduce task’s progress is a little trickier. The system does it by dividing the total progress into three parts, corresponding to the three phases of the shuffle (copy, sort and reduce).
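
To make the progress mechanism concrete, here is a sketch of a mapper (names are illustrative) that explicitly reports a status message and progress to the framework; ordinarily a map task’s progress is simply derived from the proportion of its input split consumed, as described above.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LongRunningMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ... some expensive per-record processing would happen here ...

    // Status description that shows up alongside the task's status.
    context.setStatus("processing record at offset " + key.get());
    // Signals that the task is alive and making progress.
    context.progress();

    context.write(value, NullWritable.get());
  }
}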
[Slide diagram: progress and status updates — the map/reduce task in the child JVM reports to the Task Tracker, which reports to the JobTracker; the MR Program’s Job object calls getStatus() on the JobTracker to read the combined view (example job “SFO Crime”, status “Running”, task statuses). Framework-defined counters such as “Bytes Read: 29”, “Bytes Written: 29” and “Map output records: 5” appear alongside user-defined counters such as “Number of crimes: 10”.]

• Each task has a set of counters that count various events as the task runs.
• Most of these counters are built into the framework. We can also define our own; these are called user-defined counters (see the mapper sketch after this list).
• As a task progresses, it sets a flag to indicate that it is time to send the progress to the Task Tracker. A separate thread checks this flag every second and, if set, notifies the Task Tracker.
• Every 5 seconds, the Task Tracker sends a heartbeat to the Job Tracker. The status of all its tasks is sent to the Job Tracker along with the heartbeat call.
• Counters are sent less frequently than every 5 seconds, because they can be relatively high-bandwidth.
• The Job Tracker combines these updates to produce a global view of the status of all jobs and their tasks.
• Clients invoke Job.getStatus() every second to get the status from the Job Tracker.
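
A sketch of a user-defined counter; the enum, the CSV field layout and the “crime” example echo the slide’s “Number of crimes” counter but are otherwise illustrative. The counter values travel back to the client along with the status updates described above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  // A user-defined counter; the framework groups counters by the enum's class name.
  public enum CrimeCounters { NUMBER_OF_CRIMES }

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assume a CSV line whose second field is the crime category (illustrative layout).
    String[] fields = value.toString().split(",");
    if (fields.length > 1 && !fields[1].isEmpty()) {
      context.getCounter(CrimeCounters.NUMBER_OF_CRIMES).increment(1);
      context.write(new Text(fields[1]), ONE);
    }
  }
}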
THIS EXECUTION PROCESS CONTINUES TILL ALL
THE TASKS ARE COMPLETED. ONCE THE LAST
TASK IS COMPLETED, MR FRAMEWORK ENTERS
THE LAST PHASE CALLED JOB COMPLETION.
• When the Job Tracker receives notification from the last task that it is complete, it changes the job status to “successfully completed”.
• When the Job next polls for status, it learns that the job has completed, so it prints the counters and other job statistics to the console.
• If the ‘job.end.notification.url’ property is set, the Job Tracker sends an HTTP job notification to the client (see the sketch after this list).
• The Job Tracker cleans up its working state for the job and instructs the Task Trackers to do the same.
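
A minimal sketch of the completion notification described above (the URL and class name are illustrative). If present, the $jobId and $jobStatus placeholders are substituted by the framework before the HTTP call is made.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobNotificationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The Job Tracker issues an HTTP request to this URL when the job finishes.
    conf.set("job.end.notification.url",
             "http://example.com/jobdone?id=$jobId&status=$jobStatus");

    Job job = Job.getInstance(conf, "job with completion callback");
    // ... mapper, reducer, input and output paths would be configured here ...
  }
}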
THE END

SORRY FOR MY POOR ENGLISH. 
PLEASE SEND YOUR VALUABLE FEEDBACK TO
RAJESH_1290K@YAHOO.COM
