Transcript of "Cloud schedulers and Scheduling in Hadoop"
PALLAV JHA (10-1-5-023)
PRABHAKAR BARUA (10-1-5-017)
PRABODH HEND (10-1-5-053)
JUGAL ASSUDANI (10-1-5-068)
PREM CHANDRA (09-1-5-062)
SCHEDULING IN HADOOP
When a node has an empty task slot, Hadoop chooses a task for it from
one of three categories. First, if any task has failed, it is given highest
priority. This is done to detect when a task fails repeatedly due to a bug
and stop the job. Second, unscheduled tasks are considered. For
maps, tasks with data local to the node are chosen first. Finally, Hadoop
looks for a task to speculate on.
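The three-category choice described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual code: the `Task` class, `pick_task`, and the stand-in `pick_speculative` (which simply returns the running task with the lowest progress) are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    kind: str                       # "map" or "reduce"
    failed: bool = False            # a previous attempt of this task failed
    running: bool = False           # an attempt is currently running
    progress: float = 0.0           # progress score in [0, 1]
    local_nodes: List[str] = field(default_factory=list)  # nodes holding the input


def pick_speculative(tasks: List[Task]) -> Optional[Task]:
    # Hypothetical stand-in: speculate on the running task with the lowest progress.
    running = [t for t in tasks if t.running]
    return min(running, key=lambda t: t.progress, default=None)


def pick_task(tasks: List[Task], node: str) -> Optional[Task]:
    # 1. Failed tasks get the highest priority.
    for t in tasks:
        if t.failed:
            return t
    # 2. Unscheduled tasks; maps with data local to the node are chosen first.
    pending = [t for t in tasks if not t.running]
    for t in pending:
        if t.kind == "map" and node in t.local_nodes:
            return t
    if pending:
        return pending[0]
    # 3. Nothing left to schedule: look for a task to speculate on.
    return pick_speculative(tasks)
```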
To select speculative tasks, Hadoop monitors task progress using a
progress score, which is a number from 0 to 1. For a map, the score is
the fraction of input data read. For a reduce task, the execution is
divided into three phases, each of which accounts for 1/3 of the score:
1. The copy phase, when the task is copying outputs
of all maps. In this phase, the score is the fraction of
maps whose output has been copied.
2. The sort phase, when map outputs are sorted by key.
Here the score is the fraction of data merged.
3. The reduce phase, when a user-defined function is
applied to the map outputs. Here the score is the
fraction of data passed through the reduce function.
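The progress score described above can be written out as a small sketch. The function and parameter names are assumptions made for illustration; only the scoring rule (fraction of input read for maps, three phases worth 1/3 each for reduces) comes from the text.

```python
def map_progress(bytes_read: int, total_bytes: int) -> float:
    """Progress score of a map task: fraction of input data read."""
    return bytes_read / total_bytes if total_bytes else 1.0


def reduce_progress(maps_copied: int, total_maps: int,
                    sort_fraction: float, reduce_fraction: float) -> float:
    """Progress score of a reduce task: copy, sort and reduce each count for 1/3.

    The phases run one after another, so sort_fraction and reduce_fraction
    stay at 0.0 until the earlier phases are complete.
    """
    copy_fraction = maps_copied / total_maps if total_maps else 1.0
    return (copy_fraction + sort_fraction + reduce_fraction) / 3.0
```

For example, a reduce task that has finished copying and has merged half of its data has a score of (1 + 0.5 + 0) / 3 = 0.5.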
ASSUMPTIONS IN HADOOP’S SCHEDULER
1. Nodes can perform work at roughly the same rate.
2. Tasks progress at a constant rate throughout time.
3. There is no cost to launching a speculative task on a
node that would otherwise have an idle slot.
4. A task’s progress score is roughly equal to the fraction of its
total work that it has done. Specifically,
in a reduce task, the copy, sort and reduce phases
each take 1/3 of the total time.
5. Tasks tend to finish in waves, so a task with a low
progress score is likely a slow task.
6. Different tasks of the same category (map or reduce)
require roughly the same amount of work.
An RDD is a read-only, partitioned collection of records. RDDs can only
be created through deterministic operations on either
(1) data in stable storage or
(2) other RDDs. We call these operations transformations to
differentiate them from other operations on RDDs. Examples of
transformations include map, filter, and join.
RDDs do not need to be materialized at all times. Instead, an RDD has
enough information about how it was
derived from other datasets (its lineage) to compute its
partitions from data in stable storage. This is a powerful property: in
essence, a program cannot reference an
RDD that it cannot reconstruct after a failure.
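The lineage idea can be illustrated with a toy model. This is not Spark's API: `ToyRDD` is a hypothetical class that ignores partitioning and distribution, and only shows how remembering the parent dataset and the transformation is enough to rebuild lost data from stable storage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores how it was derived, not only its data."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source          # callable that re-reads stable storage
        self.parent = parent          # the dataset this one was derived from
        self.transform = transform    # deterministic function: list -> list
        self.cache = None             # materialized records, if any

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        # Rebuild from lineage whenever the records are not materialized.
        if self.cache is None:
            if self.parent is None:
                self.cache = self.source()
            else:
                self.cache = self.transform(self.parent.compute())
        return self.cache

    def lose(self):
        # Simulate losing the in-memory copy after a failure.
        self.cache = None
```

Calling `lose()` on a derived dataset and then `compute()` again returns the same records, because every step is deterministic and the chain ends in stable storage.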
APPLICATION OF RDD: “LOG MINING”
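// Base RDD backed by a file in HDFS (path elided in the source)
val lines = spark.textFile("hdfs://...")
// Keep only the lines that start with "ERROR"
val errors = lines.filter(_.startsWith("ERROR"))
// Extract the third tab-separated field of each error line
val messages = errors.map(_.split('\t')(2))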
RDD FAULT TOLERANCE
RDDs track the series of transformations used to
build them (their lineage) to recompute lost data, e.g.:
messages = textFile(...).filter(_.contains("error"))
NAIVE FAIR SHARING ALGORITHM
when a heartbeat is received from node n:
  if n has a free slot then
    sort jobs in increasing order of number of running tasks
    for j in jobs do
      if j has unlaunched task t with data on n then
        launch t on n
      else if j has unlaunched task t then
        launch t on n
      end if
    end for
  end if
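The naive fair sharing algorithm above can be sketched in Python. The `Job` class and `naive_fair_assign` are hypothetical names, and the sketch assumes each call handles a single free slot, so it returns after the first launch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    name: str
    running: int = 0
    # unlaunched task name -> nodes that hold the task's input data
    pending: Dict[str, List[str]] = field(default_factory=dict)


def naive_fair_assign(jobs: List[Job], node: str) -> Optional[Tuple[str, str, bool]]:
    """Handle one free slot on `node`; return (job, task, is_local) or None."""
    # Jobs with the fewest running tasks are considered first.
    for job in sorted(jobs, key=lambda j: j.running):
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if node in nodes]
        # Prefer a task with data on this node, otherwise launch any task.
        task = local[0] if local else next(iter(job.pending))
        del job.pending[task]
        job.running += 1
        return job.name, task, bool(local)
    return None
```

Note that the job at the head of the sorted list always gets the slot, whether or not it has local data there.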
DELAY SCHEDULING ALGORITHM
Initialize j.skipcount to 0 for all jobs j.
when a heartbeat is received from node n:
  if n has a free slot then
    sort jobs in increasing order of number of running tasks
    for j in jobs do
      if j has unlaunched task t with data on n then
        launch t on n
        set j.skipcount = 0
      else if j has unlaunched task t then
        if j.skipcount >= D then
          launch t on n
        else
          set j.skipcount = j.skipcount + 1
        end if
      end if
    end for
  end if
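The skip-count variant can be sketched the same way. As before, `Job` and `delay_assign` are hypothetical names, and each call is assumed to handle a single free slot; `D` is the maximum number of times a job may be skipped before it is allowed to launch a non-local task.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    name: str
    running: int = 0
    skipcount: int = 0
    # unlaunched task name -> nodes that hold the task's input data
    pending: Dict[str, List[str]] = field(default_factory=dict)


def delay_assign(jobs: List[Job], node: str, D: int) -> Optional[Tuple[str, str, bool]]:
    """Handle one free slot on `node`; return (job, task, is_local) or None."""
    for job in sorted(jobs, key=lambda j: j.running):
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if node in nodes]
        if local:
            task = local[0]
            job.skipcount = 0                  # local launch resets the counter
        elif job.skipcount >= D:
            task = next(iter(job.pending))     # waited long enough: go non-local
        else:
            job.skipcount += 1                 # skip this job and try the next one
            continue
        del job.pending[task]
        job.running += 1
        return job.name, task, bool(local)
    return None
```

With `D = 0` the test `skipcount >= D` always succeeds, so the sketch behaves like naive fair sharing.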
LATE (LONGEST APPROXIMATE TIME TO END) SCHEDULER
The LATE algorithm works as follows:
• If a task slot becomes available and there are fewer
than SpeculativeCap speculative tasks running:
– Ignore the request if the node’s total progress
is below SlowNodeThreshold.
– Rank currently running tasks that are not being speculated by estimated time
left.
– Launch a copy of the highest-ranked task with
progress rate below SlowTaskThreshold.
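A minimal sketch of these steps follows. It assumes the estimate used in the LATE paper (progress rate = progress score / elapsed time, time left = (1 - progress score) / progress rate) and, for simplicity, treats both thresholds as absolute values rather than percentiles. `RunningTask` and `late_pick` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningTask:
    name: str
    progress: float           # progress score in [0, 1]
    elapsed: float            # seconds since the task started
    speculated: bool = False  # True if a speculative copy is already running

    @property
    def rate(self) -> float:
        return self.progress / self.elapsed if self.elapsed > 0 else 0.0

    @property
    def time_left(self) -> float:
        return (1.0 - self.progress) / self.rate if self.rate > 0 else float("inf")


def late_pick(tasks: List[RunningTask], node_progress: float,
              running_speculative: int, speculative_cap: int,
              slow_node_threshold: float,
              slow_task_threshold: float) -> Optional[RunningTask]:
    """Return the task to speculate on for a free slot, or None."""
    if running_speculative >= speculative_cap:
        return None
    if node_progress < slow_node_threshold:
        return None           # do not speculate on a slow node
    # Rank tasks that are not being speculated by estimated time left, longest first.
    ranked = sorted((t for t in tasks if not t.speculated),
                    key=lambda t: t.time_left, reverse=True)
    for t in ranked:
        if t.rate < slow_task_threshold:
            return t
    return None
```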
Locality Problems with Fair Sharing
The main aspect of MapReduce that complicates
scheduling is the need to place tasks near their
input data. Locality increases throughput
because network bandwidth in a large cluster is
much lower than the total bandwidth of the
cluster’s disks. Running on a node that contains the
data (node locality) is most efficient, but when
this is not possible, running on the same rack
(rack locality) is faster than running off-rack.
But in fair share scheduling we only consider
LIMITATIONS : FSS
Head of line scheduling:
The first locality problem occurs in small jobs (jobs that
have small input files and hence have a small number of
data blocks to read). The problem is that whenever a job
reaches the head of the sorted list in the fair share algorithm (i.e.
has the fewest running tasks), one of its tasks is launched on
the next slot that becomes free, no matter which node this slot
is on. If the head-of-line job is small, it is unlikely to have
data on the node that is given to it. For example, a job with
data on 10% of nodes will only achieve 10% locality.
The second locality problem is that there is a tendency for a job to be
assigned the same slot repeatedly.
Suppose that job j’s fractional share of the cluster is f. Then
for any given block b, the probability that none of j’s slots are
on a node with a copy of b is $(1-f)^{RL}$: there are R replicas of
b, each replica is on a node with L slots, and the probability
that a slot does not belong to j is $1-f$. Therefore, j is expected
to achieve at most $1-(1-f)^{RL}$ locality.
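The bound is easy to evaluate numerically. The sketch below is only an illustration of the formula above; the function name and the example values (R = 3 replicas, L = 6 slots per node) are assumptions, not figures from the source.

```python
def locality_bound(f: float, R: int, L: int) -> float:
    """Upper bound on locality for a job holding a fraction f of the cluster.

    R is the number of replicas per block and L the number of slots per node.
    """
    return 1.0 - (1.0 - f) ** (R * L)


if __name__ == "__main__":
    # Hypothetical cluster: 3 replicas per block, 6 slots per node.
    for share in (0.02, 0.1, 0.25):
        print(f"f = {share:.2f}: at most {locality_bound(share, 3, 6):.2f} locality")
```

Under these assumed values, a job holding 10% of the cluster is bounded at roughly 0.85 locality, and the bound drops quickly for smaller shares.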
LIMITATIONS : DELAY SCHEDULING
Long Task Balancing:
To lower the chance that a node fills with long tasks, we can
spread long tasks throughout the cluster by changing the
locality test in the delay scheduling algorithm to prevent jobs
with long tasks from launching tasks on nodes that are running
a higher-than-average number of long tasks. Although we do not
know in advance which jobs have long tasks, we can treat new jobs
as long-task jobs, and mark them as short-task jobs if their
tasks finish quickly.
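The modified locality test can be sketched as below. All names are hypothetical, and `short_threshold` (the duration under which a task counts as finishing quickly) is an assumed parameter that the source does not specify.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobInfo:
    name: str
    long_task: bool = True    # new jobs are treated as long-task jobs


def record_task_finish(job: JobInfo, duration: float, short_threshold: float) -> None:
    # Mark the job as a short-task job once one of its tasks finishes quickly.
    if duration <= short_threshold:
        job.long_task = False


def passes_locality_test(job: JobInfo, node: str, has_local_data: bool,
                         long_tasks_per_node: Dict[str, int]) -> bool:
    """Locality test extended with long-task balancing."""
    if not has_local_data:
        return False
    if job.long_task and long_tasks_per_node:
        average = sum(long_tasks_per_node.values()) / len(long_tasks_per_node)
        # Keep long-task jobs off nodes with an above-average number of long tasks.
        if long_tasks_per_node.get(node, 0) > average:
            return False
    return True
```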
LIMITATIONS : DELAY SCHEDULING
Hotspots are only likely to occur if multiple jobs
need to read the same data file, and that file is
small enough that copies of its blocks are only
present on a small fraction of nodes. In this
case, no scheduling algorithm can achieve high
locality without excessive queueing delays.
If a commodity machine is running behind its peers,
this scheduler marks it as a straggler instead of trying
to find out why it is behaving this way. This causes
serious complications: the scheduler does not check
whether the defect is temporary or permanently
crippling, and the node is given no
more tasks during the entire duration of
We have tried to create a scheduler that may be able to circumvent the
limitations described earlier. In this scheduler, each task is enqueued into the
priority queues of the nodes where the data for that task is available.
1. Retrieve the list of local nodes from the arriving task.
2. Set n = REPLICATION.FACTOR.
3. Create n instances of the task, one in the priority queue of each
of the n task trackers where the data is local for that task, each
with a different priority value.
4. Tasks are executed in order of priority. A task instance whose
priority is other than 1 has to skip that many tasks, if and only
if those tasks have a higher priority and arrived later than that
task.
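Steps 1 to 3 can be sketched with one heap per task tracker. This is a rough illustration under several assumptions that the source does not state: priority 1 is the highest, instances receive priorities 1..n in the order of the local node list, ties are broken by arrival time, and the remaining instances of a task are discarded once one of them has run. The skipping rule of step 4 is not modelled. `enqueue_task` and `next_task` are hypothetical names.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

# node name -> heap of (priority, arrival time, task id)
Queues = Dict[str, List[Tuple[int, int, str]]]


def enqueue_task(task_id: str, local_nodes: List[str], queues: Queues,
                 arrival: int, replication_factor: int) -> None:
    """Steps 1-3: put one instance of the task in each local node's queue."""
    n = replication_factor
    for priority, node in enumerate(local_nodes[:n], start=1):
        heapq.heappush(queues.setdefault(node, []), (priority, arrival, task_id))


def next_task(node: str, queues: Queues, done: Set[str]) -> Optional[str]:
    """Pop the best instance on `node`, skipping tasks that already ran elsewhere."""
    queue = queues.get(node, [])
    while queue:
        _priority, _arrival, task_id = heapq.heappop(queue)
        if task_id not in done:       # assumption: other instances are discarded
            done.add(task_id)
            return task_id
    return None
```

Because every instance sits on a node that holds the task's data, whichever instance runs first is data-local.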