Effective and Efficient Job Scheduling in Grid Computing
NOVEL SCHEDULING ALGORITHM IN DFS
B. Sudheer Kumar
IT Department, ACE Engg College, Hyderabad, India.
sudheer.itdict@gmail.com

Kota Ankita
IT Department, ACE Engg College, Hyderabad, India.
ankikota@gmail.com

Myaka Mounika
IT Department, ACE Engg College, Hyderabad, India.
m.mouniika@gmail.com

Tigulla Sandhya
IT Department, ACE Engg College, Hyderabad, India.
tsandhyareddy890@gmail.com

Y. Sowjanya
IT Department, ACE Engg College, Hyderabad, India.
sowjanyayelmati@gmail.com
Abstract— Scheduling has become a necessity for managing big data environments. Scheduling is the method by which threads, processes, or data flows are given access to system resources. The need for a scheduling algorithm arises from the requirement of most modern systems to perform multitasking and multiplexing. To overcome the problems in existing schedulers, we present preemption- and priority-based scheduling in Hadoop. This mechanism allows the scheduler to make more efficient decisions about high-priority jobs and provides the ability to preempt low-priority jobs.
Keywords: scheduler, priority, preemption, job weight, DFS
I. INTRODUCTION
Several scheduling algorithms have been developed for Hadoop, a fast-developing tool. In Hadoop, the JobTracker initiates and coordinates the work of the TaskTrackers. The scheduler resides in the JobTracker and allocates TaskTracker resources to the running tasks [8]. The scheduler comes into play when two or more jobs are present: when job slots become free, the scheduler decides which tasks to allocate to those slots.
First Come First Serve (FCFS) and Processor Sharing (PS) are the simplest and most widely used scheduling algorithms; the FIFO and FAIR schedulers in Hadoop are inspired by them. The FCFS approach schedules jobs in order of submission, so delay is higher: long jobs keep small jobs on hold until they complete. In PS, resources are divided evenly and every active job keeps progressing, but each additional job delays the completion of all the others. Facebook and Yahoo contributed significant work in developing schedulers, namely the Fair Scheduler and the Capacity Scheduler respectively, which were subsequently released to the Hadoop community. Shortest Remaining Processing Time (SRPT) prioritizes the jobs that have the least work left to complete. The main problem in SRPT is starvation: larger jobs cannot be scheduled if smaller jobs are submitted frequently.
Job aging is the solution to the starvation problem: it virtually decreases the size of jobs waiting in the queue, so that all jobs are processed evenly.
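The job-aging idea can be sketched as follows. This is a minimal illustration under SRPT-style scheduling, not any scheduler's actual implementation; `AGING_RATE`, the job fields, and `pick_next` are assumptions chosen for the example.

```python
# Hypothetical sketch of job aging: a queued job's effective (virtual)
# size shrinks the longer it waits, so a large job that has waited long
# enough can outrank freshly submitted small jobs.
AGING_RATE = 0.1  # illustrative: virtual-size reduction per unit of wait time

def virtual_size(remaining_work, wait_time):
    """Effective job size after aging; clamped at zero."""
    return max(0.0, remaining_work - AGING_RATE * wait_time)

def pick_next(jobs, now):
    """Pick the job with the smallest aged size.

    jobs: list of dicts with 'id', 'remaining', 'submitted'.
    """
    return min(jobs, key=lambda j: virtual_size(j["remaining"], now - j["submitted"]))
```

Without aging, plain SRPT would always prefer the 5-unit job; with aging, the 100-unit job that has waited since time 0 eventually wins.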
Size-based scheduling is also used in Hadoop. It requires the size of a job before execution, but this is generally not knowable in advance, so the size is estimated roughly from characteristics of the job, such as the number of tasks it contains. After the first task has executed, the total time is estimated and then updated based on observed running time. The estimation component is designed to minimize response time rather than to estimate the exact length of the job.
The benefits of size-based scheduling are that job size information is not necessary for the proper functioning of the scheduler, the starvation problem is solved, and job response times are distributed in its favour. It is simple to configure and allows resource "pools" to be consolidated, because workload diversity is intrinsically accounted for by size-based scheduling.
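The estimate-then-refine step described above can be sketched like this; `AVG_TASK_SECONDS` and the function names are assumptions for illustration, not Hadoop's actual estimator.

```python
# Illustrative sketch: before execution a job's size is guessed from its
# task count; after the first task completes, the estimate is rescaled
# using the observed task running time.
AVG_TASK_SECONDS = 30.0  # assumed prior duration for an unseen task

def initial_estimate(num_tasks):
    """Rough job size before any task has run."""
    return num_tasks * AVG_TASK_SECONDS

def refined_estimate(num_tasks, observed_task_seconds):
    """Updated job size once the first task's runtime is known."""
    return num_tasks * observed_task_seconds
```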
Dynamic proportional share scheduling in Hadoop: this is one of the parallel task schedulers in Hadoop [3]. It allows users to adjust their spending over time to control their allocated capacity, and it allows the scheduler to make efficient decisions about how to prioritize users and jobs, giving users tools to match their allocations to the requirements of their jobs.
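The proportional-share idea can be illustrated with a short sketch: each user's allocated capacity is proportional to their current spending, so raising a bid raises the share. The function and its arguments are assumptions for illustration, not the scheduler's real API.

```python
# Illustrative sketch of dynamic proportional share: slots are split in
# proportion to each user's current spending, so a user can adjust
# spending over time to control allocated capacity.
def allocate(total_slots, spending):
    """Split total_slots in proportion to each user's spending."""
    total = sum(spending.values())
    return {user: total_slots * s / total for user, s in spending.items()}
```

For example, with 100 slots and spending of 1.0 vs. 3.0, the two users receive 25 and 75 slots respectively.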
Fig. 1. Elements of a Hadoop cluster
The namenode is the node that stores the file system metadata, i.e. which file maps to which block locations and which blocks are stored on which datanode [2]. The namenode maintains two in-memory tables: one that maps blocks to datanodes (one block maps to three datanodes for a replication value of 3) and one that maps each datanode to its blocks.
The datanode is where the actual data resides. When a datanode stores a block of information, it maintains a checksum for it as well. The datanodes update the namenode with block information periodically and verify the checksums before updating. If the checksum is incorrect for a particular block, i.e. there is disk-level corruption for that block, the datanode skips that block while reporting the block information to the namenode. In this way, the namenode becomes aware of disk-level corruption on that datanode and takes steps accordingly.
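The skip-corrupt-blocks behaviour described above can be modelled in a few lines. This is a toy sketch: HDFS actually keeps per-chunk checksums in separate metadata files, and CRC32 is used here purely for illustration.

```python
# Toy model of the datanode behaviour: a block whose stored checksum no
# longer matches its data (disk-level corruption) is skipped when
# reporting block information to the namenode.
import zlib

def make_block(data: bytes):
    """Store a block together with a checksum of its contents."""
    return {"data": data, "checksum": zlib.crc32(data)}

def blocks_to_report(blocks):
    """Keep only blocks whose checksum still verifies."""
    return [b for b in blocks if zlib.crc32(b["data"]) == b["checksum"]]
```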
II. DIFFERENT SCHEDULING STRATEGIES FOR
HADOOP DISTRIBUTED FILE SYSTEM
A. FIFO scheduler
FIFO stands for "first in, first out". It is the default scheduler in Hadoop [5], and the original scheduling algorithm integrated within the JobTracker. In FIFO scheduling, the oldest job is pulled first by the JobTracker from the work queue. This scheduler is not concerned with the priority or size of a job, but the approach is simple to implement and efficient.
Example 1: When client1 with priority and client2 without any priority compete for resources, the FIFO scheduler schedules them according to their submission order.
Example 2: When both client1 and client2 have no priority and compete for resources, the FIFO scheduler schedules them in first come, first serve order.
The main disadvantage of this scheduler is that priority is not considered: small jobs get stuck while large jobs are being processed, and job size is not considered either.
Fig. 2: FIFO scheduler
From the above figure (Fig. 2) we can see that the FIFO scheduler schedules jobs according to their submission order, i.e. first come, first serve. Here four clients, namely c1, c2, c3, and c4, are requesting resources (kept in a queue) as shown in the figure. Following the FIFO policy, c1 is submitted first to the namenode.
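The FIFO policy just described can be sketched with a plain queue; the function names are illustrative, not JobTracker API.

```python
# Minimal sketch of FIFO scheduling: jobs are dispatched strictly in
# submission order, ignoring both priority and size.
from collections import deque

work_queue = deque()

def submit(job_id):
    """Place a job at the back of the queue on submission."""
    work_queue.append(job_id)

def next_job():
    """The oldest submitted job is pulled first."""
    return work_queue.popleft() if work_queue else None
```

Submitting c1 through c4 and then draining the queue returns them in exactly the submission order of the figure.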
B. Fair scheduler
The fair scheduler assigns resources to jobs such that they get an equal share of the resources on average over time. Jobs which need less time are executed first, so that jobs which need more time can still find enough execution time on the CPU. [10] The implementation is based on creating a set of pools. All pools have equal shares by default, though shares can be assigned explicitly. Each user is assigned to a pool to achieve fairness; in this way, if one user submits more jobs, he or she still receives the same share of cluster resources as all other users. The number of jobs active at one time can also be constrained, if desired, to minimize congestion and allow work to finish in a timely manner. This scheduler was developed by Facebook.
ALGORITHM
When a slot becomes available, the scheduler allocates it to the job with the largest jobDeficit. The system updates most of the related information, including jobDeficit, jobWeight, minSlots, and jobFairShare [1].
1. Calculate jobWeight: jobWeight is determined by the priority factor of the job by default, or it can also be determined by the size and time of the job. In addition, users can supply a weightAdjuster to adjust jobWeight.
2. Update jobWeight: each running job updates its jobWeight by multiplying by poolWeight over poolRunningJobsWeightSum.
3. Calculate deficit: set mapDeficit and reduceDeficit to zero for each job.
4. Update minSlots: in each pool, the scheduler distributes the available slots to jobs based on their jobWeight. After that, it distributes open slots to the jobs that still need resources. If there are still open slots after that, they are shared with the other pools.
The detailed steps are as follows. First, set the minMaps or minReduces of all running jobs in the pool to zero. Second, repeat the following steps until the remaining slots reach zero: calculate jobinfo.minMaps or jobinfo.minReduces; calculate minSlots as slotsLeft times jobWeight over poolLowJobsWeightSum; adjust this number according to the number of available slots in the pool; return these slotsToGive as the minMaps or minReduces of the associated job; and decrease slotsLeft by slotsToGive. If slotsLeft stays unchanged during this loop, the remaining slots are shared among all jobs by sorting jobs by weight and deficit, calculating minSlots, giving slotsToGive to jobs, and updating slotsLeft. If open slots still remain after all of this, they are shared with other pools.
5. Update jobFairShare: first, distribute available slots by jobWeight. If minSlots is larger than jobFairShare, meet minSlots first, then update the available slots, repeating the steps above until all minSlots are equal to or smaller than jobFairShare. Finally, all jobs share the remaining open slots equally.
6. Update deficit: jobDeficit is increased by the difference between the job's fair share and its running tasks, multiplied by timeDelta. The map and reduce parts update their deficits separately.
7. Allocate resources: when there is an available slot in the system, allocate it to the job with the largest deficit.
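The deficit mechanism at the heart of steps 6 and 7 can be sketched briefly. This is an illustration of the idea, not the fair scheduler's actual code; the field names (`fair_share`, `running`, `deficit`) are assumptions.

```python
# Hedged sketch of the deficit idea: each job's deficit accumulates the
# gap between its fair share and the slots it actually runs, and a free
# slot goes to the job with the largest deficit.
def update_deficits(jobs, time_delta):
    """Step 6: grow each job's deficit by (fair share - running) * timeDelta."""
    for j in jobs:
        j["deficit"] += (j["fair_share"] - j["running"]) * time_delta

def assign_free_slot(jobs):
    """Step 7: give an open slot to the job with the largest deficit."""
    winner = max(jobs, key=lambda j: j["deficit"])
    winner["running"] += 1
    return winner["id"]
```

A job running at its fair share accumulates no deficit, while a starved job's deficit grows until it wins the next free slot.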
The advantage of the fair scheduler is that it works well when both small and large clusters are used by the same organization with a limited number of workloads. Irrespective of the shares assigned to pools, if the system is not loaded, jobs receive the shares that would otherwise go unused [9]. The scheduler implementation keeps track of the computation time for each job in the system. Periodically, it examines jobs to compute the difference between the computation time a job actually received and the time it should have received under an ideal scheduler. The result determines the deficit for the task; the scheduler then ensures that the task with the highest deficit is scheduled next.
Fig. 3: Fair scheduler
From the above figure we can see that the fair scheduler assigns equal shares of resources. Here four clients, namely c1, c2, c3, and c4, are requesting resources (in a queue). The fair scheduler schedules all four clients to the namenode with an equal amount of share (25% each by default), as shown in Fig. 3.
Example 1: When client1 with priority and client2 with no priority compete for resources, the fair scheduler gives an equal share to both tasks and then schedules them.
Example 2: When both client1 and client2 have no priority and compete for resources, the fair scheduler gives an equal share to both tasks and schedules them.
The disadvantage of the fair scheduler is that it does not consider the job weight of each node, resulting in unbalanced performance across nodes.
C. Capacity scheduler
Capacity scheduling is based on queues [7]. Each queue has its own assigned resources and uses a FIFO strategy internally. To prevent users from taking too many resources in one queue, the scheduler can limit the resources available to each user's jobs. During scheduling, if a queue does not use its allocated capacity, the unused capacity is assigned to other queues. Jobs with a higher priority can access resources faster than lower-priority jobs. [6] The capacity scheduler can be configured through multiple Hadoop configuration files. It was developed by Yahoo. In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots, and each queue is also assigned a guaranteed capacity.
ALGORITHM
When there are open slots on some TaskTracker, the scheduler chooses a queue, then a job in that queue, then a task in that job, and finally gives the slot to that task. The detailed steps are described below [4].
1. Choose a queue: sort all queues by numSlotsOccupied/capacity, then consider them one by one until a proper job is found.
2. Choose a job: sort all jobs in the selected queue by submission time and job priority. The scheduler then considers jobs one by one and selects a job such that its user has not reached his or her resource limit and there is enough memory, on the node where the TaskTracker runs, for the tasks in that job.
3. Choose a task: call obtainNewMapTask() or obtainNewReduceTask() in JobInProgress to choose a task, based on locality and resource conditions.
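The queue-selection rule in step 1 can be sketched in a few lines; the dictionary layout is an assumption for illustration.

```python
# Sketch of step 1: queues are considered in increasing order of
# numSlotsOccupied / capacity, so the queue that is least loaded relative
# to its guaranteed capacity is tried first.
def order_queues(queues):
    """queues: list of dicts with 'name', 'occupied', 'capacity'."""
    return sorted(queues, key=lambda q: q["occupied"] / q["capacity"])
```

A queue using 2 of its 10 guaranteed slots is tried before one using 8 of 10, which is how under-used capacity flows back to waiting jobs.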
The advantage of the capacity scheduler is that when you are running a large Hadoop cluster with multiple clients and different types and priorities of jobs, it is the right choice to ensure guaranteed access, with the potential to reuse unused capacity and to prioritize jobs within queues.
Example 1: When client1 with low priority is under execution and client2 with high priority is competing for resources, the capacity scheduler does not preempt client1 until it completes, which results in starvation of client2.
Example 2: When both client1 and client2 have no priority and compete for resources, the capacity scheduler behaves like the FIFO scheduler and schedules them in first come, first serve order.
The disadvantage of the capacity scheduler is that it is the most complex of the three: users need to know the system well to set up configurations and choose proper queues. The Hadoop road map includes a desire to support preemption, but this functionality has not yet been implemented.
III. COMPARISON TABLE

FIFO/FCFS SCHEDULER
Scheduling methodology: The FCFS approach schedules jobs according to their submission order.
Benefits: 1. FIFO scheduling is simple to implement. 2. Efficient.
Limitations/drawbacks: 1. Priority is not considered (i.e. it is non-preemptive), so small jobs get stuck while a large job is being processed, and job size is not considered. 2. It can hurt overall throughput.
When to use: The FIFO scheduler should be chosen when we have a small number of jobs with no priority.
Behavior with priority tasks: When client1 with priority and client2 without priority compete for resources, the FIFO scheduler schedules the clients according to their submission order.
Behavior with no-priority tasks: When both client1 and client2 have no priority, the FIFO scheduler schedules both clients in first come, first serve order.

FAIR SCHEDULER
Scheduling methodology: Fair scheduling assigns resources to jobs such that all jobs get, on average, an equal share of resources.
Benefits: 1. Allows assigning guaranteed minimum shares to queues, which is useful for ensuring that certain users, groups, or production applications always get sufficient resources. 2. Less complex. 3. Works well when both small and large clusters are used by the same organization with a limited number of workloads.
Limitations/drawbacks: 1. Does not consider the job weight of each node, resulting in unbalanced performance across nodes. 2. Works well only with a limited workload.
When to use: The fair scheduler is chosen in the presence of diverse jobs, because it can provide fast response times for small jobs mixed with larger jobs.
Behavior with priority tasks: When client1 with priority and client2 with no priority compete for resources, the fair scheduler gives an equal share to both tasks and then schedules them.
Behavior with no-priority tasks: When client1 and client2 both have no priority and compete for resources, the fair scheduler gives an equal share to both tasks and schedules them.

CAPACITY SCHEDULER
Scheduling methodology: The capacity scheduler is designed to allow sharing a large cluster while giving each organization capacity guarantees.
Benefits: Ensures guaranteed access, with the potential to reuse unused capacity and to prioritize jobs within queues, over large clusters.
Limitations/drawbacks: The most complex of the three schedulers.
When to use: In a large Hadoop (MapReduce) cluster with multiple clients and different types and priorities of jobs, the capacity scheduler is the right choice to ensure guaranteed access with the potential to reuse unused capacity and prioritize jobs within queues.
Behavior with priority tasks: When client1 with low priority is under execution and client2 with high priority is competing for resources, the capacity scheduler does not preempt client1 until it completes, resulting in starvation of client2.
Behavior with no-priority tasks: When both client1 and client2 have no priority and compete for resources, the capacity scheduler behaves like the FIFO scheduler and schedules them in first come, first serve order.
IV. NOVEL SCHEDULER
The proposed algorithm schedules requests efficiently. In this algorithm we schedule tasks and prioritize them based on the frequency of requests sent: the more frequently a client sends requests, the higher its probability of being scheduled. This algorithm is advantageous because of its simplicity and because it minimizes the average waiting time. The starvation problem is solved, as jobs are not made to wait in the queue for a long time. All jobs are scheduled based on the novel algorithm below, in which the weight of a job is calculated from the requests sent and the priority is assigned from the job weight relative to the number of days of scheduling.
Using this scheduling algorithm we can overcome the following drawbacks:
1. A process with no priority need not wait for a long time.
2. Both priority and preemption are achieved.
NOVEL ALGORITHM
A key issue related to scheduling is when to make scheduling decisions; there are a variety of situations in which scheduling is done. In this algorithm, the number of requests sent by a client from a certain time (the day the request is first sent) until the present is monitored, and the clients are scheduled as follows.
STEP 1: Calculate the job weight (K)
The job weights of all clients are calculated from the number of requests sent. Let the number of requests sent be R; then
job weight (K) = R
STEP 2: Calculate the priority (Pi)
The priority of a job is calculated from the client's job weight up to the date of scheduling. Suppose the client's job weight is K over n days; then
Pi = (job weight) K / (number of days) n, where i = 1 to n
If two clients, say client1 and client2 with priorities p1 and p2 respectively, are competing for a resource, schedule the jobs accordingly.
STEP 3: Schedule the clients
1. If p1 > p2, the job sent by client1 is scheduled first, followed by client2.
2. If p1 < p2, the job sent by client2 is scheduled first, followed by client1.
3. If p1 = p2, the jobs are scheduled based on FIFO scheduling.
STEP 4: Preempt the jobs
The clients' jobs are preempted based on certain conditions:
1. A task may be preempted only if its completion is < 50%.
2. If two tasks (say client1's and client2's) are competing for resources, the task with no priority is preempted.
3. If two clients having the same priority are competing for a resource, go to STEP 3.
STEP 5: Repeat steps 1 to 4 until all jobs are scheduled.
Example 1: Initially, when only one client is requesting resources, that client is scheduled.
Example 2: When three clients, namely client1, client2, and client3, compete for resources, we first calculate the job weights using the novel algorithm (Step 1 above). Suppose client1 sends one request in one day (i.e. client1 is a new client), client2 sends five requests in 10 days, and client3 sends ten requests in 5 days. Then the job weight of client1 is 1, that of client2 is 5, and that of client3 is 10.
Next we calculate the priorities (Step 2 above):
Client1 (P1) = 1/1 = 1
Client2 (P2) = 5/10 = 0.5
Client3 (P3) = 10/5 = 2
Therefore, according to priority, client3 is scheduled first, followed by client1 and then client2.
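The worked example can be reproduced with a short sketch of Steps 1 to 3; the tuple layout is an assumption chosen for illustration.

```python
# Sketch of Steps 1-3 of the novel algorithm: job weight K equals the
# number of requests R, priority P = K / n (days), and clients are served
# in decreasing priority, falling back to FIFO order on ties.
def priority(requests, days):
    """Step 2: P = job weight (= request count) / number of days."""
    return requests / days

def schedule(clients):
    """clients: list of (name, requests, days) tuples in submission order."""
    # sorted() is stable, so equal priorities keep submission (FIFO) order,
    # matching rule 3 of STEP 3
    return [name for name, r, d in
            sorted(clients, key=lambda c: -priority(c[1], c[2]))]
```

Running it on the three clients above yields the order client3, client1, client2, matching the priorities 2, 1, and 0.5.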
V. CONCLUSION
This paper has given an account of the current working strategies and the drawbacks of various scheduling algorithms. It has also shown the importance of efficient scheduling and presented an effective algorithm to overcome the existing problems. A reasonable approach to tackling the issue is the novel scheduling algorithm. This research has raised many questions concerning the implementation of the novel system for better and more effective retrieval of data. In the present strategy, scheduling is based on the number of requests sent, i.e. the frequency of requests relative to the total number of days over which the requests were made. There is therefore every chance of overcoming the existing drawbacks by scheduling jobs based on priority and preempting low-priority jobs, which is achieved by novel scheduling.
VI. REFERENCES
[1] Donghe, "Hadoop Scheduler: Fair Scheduler."
[2] M. Tim Jones, "Scheduling in Hadoop: An introduction to the pluggable scheduler framework."
[3] Thomas Sandholm and Kevin Lai, "Dynamic Proportional Share Scheduling in Hadoop," Social Computing Lab, Hewlett-Packard Labs, Palo Alto, CA 94304, USA.
[4] Donghe, "Hadoop Scheduler: Capacity Scheduler."
[5] Donghe, "Hadoop Scheduler: FIFO Scheduler."
[6] Bincy P. Andrews and Binu A., "Survey on Job Schedulers in Hadoop Cluster," IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 15, Issue 1 (Sep.-Oct. 2013), pp. 46-50.
[7] Amr Awadallah, "Job Scheduling in Apache Hadoop."
[8] B. Thirumala Rao, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments."
[9] Daniel Colin Vanderster, "Resource Allocation and Scheduling Strategies on Computational Grids," Ph.D. Thesis, University of Victoria, 2008.
[10] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, 2nd edition, 2010.
AUTHORS PROFILE
Battula Sudheer Kumar received his M.Tech in Software Engineering from Jawaharlal Nehru Technological University (JNTU), Hyderabad in 2012 and his B.Tech from JNTU, Hyderabad in 2009. He has two years of research experience and is presently working as an Assistant Professor in the IT department at ACE Engineering College, Hyderabad. His areas of interest include Distributed Systems and Cloud Computing.

Kota Ankita is pursuing her graduation in Information Technology from ACE Engineering College, Ghatkesar. Her research interests include Distributed Computing and Cloud Computing.

Myaka Mounika Natasha is pursuing her graduation in Information Technology from ACE Engineering College, Ghatkesar. Her research interests include Distributed Computing and Cloud Computing.

Tigullah Sandhya is pursuing her graduation in Information Technology from ACE Engineering College, Ghatkesar. Her research interests include Distributed Computing and Cloud Computing.

Y. Sowjanya is pursuing her graduation in Information Technology from ACE Engineering College, Ghatkesar. Her research interests include Distributed Computing and Cloud Computing.