Scheduling scheme for Hadoop clusters

A RESEARCH ON
SCHEDULING SCHEME FOR
HADOOP CLUSTERS

Guided by
Neetha K N
Dept of CSE

Presented by
Amjith B
S7 CSE

Hadoop

MapReduce and
HDFS

AREAS OF SEMINAR

Hadoop cluster

TERMINOLOGY REVIEW
[Figure: a Hadoop cluster organized into racks (Rack 1, Rack 2, ..., Rack n), each holding nodes (Node 1, Node 2, ..., Node n)]

• Hadoop is an open-source software framework for
distributed processing of large datasets across
large clusters of computers
• Two components: the MapReduce engine and a
distributed file system (HDFS)

INTRODUCTION

• MapReduce engine
  • Programming model developed by Google
  • Computation component of Hadoop
  • Consists of Map and Reduce functions
• HDFS
  • Storage component of Hadoop
  • Splits the data into blocks and distributes them
  • Fault tolerant and self-healing

COMPONENTS
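The Map and Reduce functions mentioned above can be illustrated with a word count. This is a minimal single-process sketch, not Hadoop's own API: the names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical, and the grouping step that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    intermediate = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines. In a real cluster the map calls would run on the nodes that hold the input blocks.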

• MapReduce node
  • JobTracker
  • TaskTracker
• HDFS node
  • NameNode
  • DataNode

• HDFS node
  • NameNode – Maintains metadata about files (1 per cluster).
  • DataNode – Stores data blocks and carries out block replication; installed on each slave node (1 to many per cluster).
• MapReduce node
  • JobTracker – Schedules job execution and keeps track of cluster-wide job status (1 per cluster).
  • TaskTracker – Receives tasks from the JobTracker and runs on compute nodes alongside a DataNode (1 to many per cluster).
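The split between NameNode metadata and DataNode storage can be sketched as follows. This is an illustration under stated assumptions, not HDFS code: `split_into_blocks` and `place_blocks` are hypothetical names, the block size is tiny for readability, and replicas are placed round-robin, whereas real HDFS placement also takes racks into account.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, datanodes, replication=3):
    """Return NameNode-style metadata: block index -> DataNodes holding a replica.

    Assumption: simple round-robin placement, capped at the number of DataNodes.
    """
    n = len(datanodes)
    copies = min(replication, n)
    return {
        block: [datanodes[(block + r) % n] for r in range(copies)]
        for block in range(num_blocks)
    }
```

For example, `place_blocks(3, ["dn1", "dn2", "dn3", "dn4"], replication=2)` assigns each of three blocks to two different DataNodes. The mapping it returns is the kind of metadata the NameNode keeps, while the blocks themselves live on the DataNodes.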

SYSTEM | FEATURES | DISADVANTAGES | REFERENCE
Hadoop FIFO scheduling | Follows the FIFO principle | Cannot assign priorities to jobs | REF [6]
Facebook’s Fair scheduler | Even allocation of resources | No preemption support for large tasks | REF [4]
Yahoo’s Capacity scheduler | FIFO scheduler based on priority | Problem in assigning priorities | REF [6]

LITERATURE SURVEY
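The difference between the three scheduling policies surveyed above can be shown with a toy model. This is a simplified sketch, not the code of any of these schedulers: the `Job` class, its `priority` field (larger means more urgent), and the slot counts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int
    priority: int = 0  # hypothetical: larger value = more urgent


def fifo_order(jobs):
    """Plain FIFO: jobs run strictly in order of submission."""
    return [j.name for j in sorted(jobs, key=lambda j: j.submit_time)]


def priority_fifo_order(jobs):
    """FIFO within priority levels: higher priority first, then submission order."""
    return [j.name for j in sorted(jobs, key=lambda j: (-j.priority, j.submit_time))]


def fair_share(total_slots, jobs):
    """Even allocation: split slots equally, leftovers go to the earliest jobs."""
    if not jobs:
        return {}
    ordered = sorted(jobs, key=lambda j: j.submit_time)
    base, extra = divmod(total_slots, len(ordered))
    return {j.name: base + (1 if i < extra else 0) for i, j in enumerate(ordered)}
```

With jobs `a`, `b`, `c` submitted in that order and `b` given the highest priority, `fifo_order` keeps the submission order while `priority_fifo_order` moves `b` to the front. `fair_share` instead gives every job some slots at once.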

• Underutilization of the CPU
• Not flexible
• Interaction between the master node and the slave nodes

EXISTING SYSTEM
(disadvantages)

• Analyze the system for CPU and IO underutilization
• Use a predictive scheduler for predicting the appropriate
TaskTracker
• Couple the scheduler with a prefetching mechanism to
improve the system performance

PROPOSED SYSTEM

• Flexible task scheduler
• Predicts the most appropriate task trackers to assign
future tasks
• Allows DataNodes to exploit underutilized disk
bandwidth
• Seeks stragglers and predicts candidate data blocks

PREDICTIVE SCHEDULER
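One way to predict the most appropriate TaskTracker is to estimate which one will free a slot soonest. The sketch below is an assumption about how such a prediction could work, not the method of the surveyed scheduler: it extrapolates each running task's remaining time from its elapsed time and progress, and the names `remaining_time` and `predict_next_tracker` are hypothetical.

```python
def remaining_time(elapsed, progress):
    """Estimate seconds left for a task from elapsed seconds and progress in (0, 1].

    Assumption: the task keeps advancing at its average rate so far.
    """
    if progress <= 0:
        return float("inf")  # no progress yet, so no estimate is possible
    return elapsed * (1.0 - progress) / progress


def predict_next_tracker(trackers):
    """Pick the TaskTracker expected to have a free slot first.

    trackers maps a tracker name to a list of (elapsed_seconds, progress)
    tuples, one per running task. An idle tracker is available immediately.
    """
    def next_free(tasks):
        if not tasks:
            return 0.0
        return min(remaining_time(elapsed, progress) for elapsed, progress in tasks)

    return min(trackers, key=lambda name: next_free(trackers[name]))
```

The same `remaining_time` estimate could also be used to flag stragglers, for instance tasks whose estimate is far above the others, although the threshold for that would be a further assumption.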

• Integrate with predictive scheduler
• Multiple worker threads
• Monitor status of worker threads and coordinate
prefetching process

PREFETCHING MODULE
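A prefetching module with multiple worker threads and status monitoring might look like the sketch below. It uses Python's standard `concurrent.futures` thread pool. The `prefetch` function and the `read_block` callback are hypothetical stand-ins for reading a block from a DataNode, and the in-memory dictionary stands in for whatever buffer the real module would use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def prefetch(block_ids, read_block, workers=4):
    """Load blocks into a cache with a pool of worker threads.

    read_block is a caller-supplied function (hypothetical here) that returns
    the contents of one block. Returns (cache, status), where status records
    for each block whether its worker finished or failed.
    """
    cache, status = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(read_block, bid): bid for bid in block_ids}
        # Monitor the workers: handle each result as soon as it completes.
        for future in as_completed(futures):
            bid = futures[future]
            try:
                cache[bid] = future.result()
                status[bid] = "done"
            except Exception:
                status[bid] = "failed"
    return cache, status
```

A coordinator could retry the blocks marked `"failed"` or report them to the scheduler. That policy is left out because the slides do not specify it.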

• Copying the job from HDFS to the TaskTracker
• Creation of a local working directory for the task
• Creation of a TaskTracker instance

STEPS FOR LAUNCHING TASKS

ISSUES IN PREFETCHING MODULE

• When to prefetch
• What to prefetch
• How much to prefetch
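The three questions above can be made concrete with a simple budget rule. This is only one possible policy, written under loud assumptions: prefetch while the disk is expected to be idle, take blocks in the order they are expected to be needed, and stop when either the idle disk bandwidth or the free buffer space runs out. The function `plan_prefetch` and all of its parameters are hypothetical.

```python
def plan_prefetch(candidate_blocks, idle_seconds, disk_mb_per_s, free_buffer_mb):
    """Choose which blocks to prefetch now.

    candidate_blocks: list of (block_id, size_mb), ordered by expected need.
    idle_seconds:     how long the disk is expected to stay idle (when).
    disk_mb_per_s:    assumed sequential read bandwidth of the disk.
    free_buffer_mb:   memory available for prefetched data (how much).
    """
    budget_mb = min(idle_seconds * disk_mb_per_s, free_buffer_mb)
    chosen = []
    for block_id, size_mb in candidate_blocks:
        if size_mb > budget_mb:
            break  # the next block would not fit in the remaining budget
        chosen.append(block_id)
        budget_mb -= size_mb
    return chosen
```

With three 64 MB candidate blocks, 2 idle seconds, 80 MB/s of bandwidth, and 256 MB of free buffer, the budget is 160 MB, so only the first two blocks are selected.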

• Avoidance of I/O stalls
• Maximising CPU utilisation
• Helps the smooth functioning of Hadoop
• Flexible

ADVANTAGES

EXISTING SYSTEM | PROPOSED SYSTEM
Low I/O performance | High I/O performance
CPU underutilised | Proper CPU utilisation
Less flexible | Additional overhead of prefetching to master

COMPARISON

• Hadoop on demand (HOD)
• A scheduler in heterogeneous environment

FUTURE SCOPE

1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI ’04, pages 137–150, 2004.
2. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI ’08: 8th USENIX Symposium on Operating Systems Design and Implementation, October 2008.
3. R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. SIGOPS Oper. Syst. Rev., 29:79–95, December 1995.
4. Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, et al. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. In Proceedings of the 11th IEEE International Conference on Cluster Computing, pages 16–20. ACM, 2009.
5. Tom White. Hadoop: The Definitive Guide. O’Reilly, 2009.
6. Mark Yong, Nitin Garegrat, and Shiwali Mohan. Towards a Resource Aware Scheduler in Hadoop.

REFERENCES
