A RESEARCH ON
SCHEDULING SCHEME FOR
HADOOP CLUSTERS

Guided by
Neetha K N
Dept of CSE

Presented by
Amjith B
S7 CSE
Hadoop

MapReduce and
HDFS

AREAS OF SEMINAR
Hadoop cluster

TERMINOLOGY REVIEW
[Figure: a Hadoop cluster made up of Rack 1, Rack 2, ..., Rack n, each rack holding Node 1, Node 2, ..., Node n]

• Hadoop is an open-source software framework for
distributed processing of large datasets across
large clusters of computers
• Two components
  - MapReduce engine
  - Distributed file system

INTRODUCTION
• MapReduce engine
  - Programming model developed by Google
  - Computation component of Hadoop
  - Consists of Map and Reduce functions
• HDFS
  - Storage component of Hadoop
  - Splits the data into blocks and distributes them
  - Fault tolerant and self-healing

COMPONENTS
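
The Map and Reduce functions named above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's Java API; `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the grouping loop stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map, group by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            groups[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


# Two input lines stand in for two input splits.
result = run_job(["Hadoop stores data", "Hadoop processes data"])
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, which is why block placement matters to the scheduler discussed later.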
MapReduce node
• JobTracker
• TaskTracker

HDFS node
• NameNode
• DataNode
• HDFS node
  - NameNode – Maintains metadata about files
    (1 per cluster).
  - DataNode – Stores data blocks and replicates them as
    directed by the NameNode; installed on each slave
    node (1 to many per cluster).
• MapReduce node
  - JobTracker – Schedules job execution and keeps
    track of cluster-wide job status (1 per cluster).
  - TaskTracker – Receives tasks from the JobTracker.
    Runs on compute nodes alongside a DataNode
    (1 to many per cluster).
SYSTEM                      FEATURES                          DISADVANTAGES                          REFERENCE
Hadoop FIFO scheduling      Schedules jobs in FIFO order      Cannot assign priorities to jobs       REF [6]
Facebook’s Fair scheduler   Even allocation of resources      No preemption support for large tasks  REF [4]
Yahoo’s Capacity scheduler  FIFO scheduler based on priority  Problem in assigning priorities        REF [6]

LITERATURE SURVEY
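
The difference between FIFO ordering and an even allocation of resources can be made concrete with a small calculation. The sketch below is an idealised model under stated assumptions, not the actual Hadoop schedulers: the cluster is treated as one unit of capacity, all jobs arrive at time 0, each job is described only by its amount of work, and the function names are hypothetical.

```python
def fifo_completion_times(work):
    """Jobs run one after another in arrival order."""
    elapsed = 0
    finish = []
    for amount in work:
        elapsed += amount
        finish.append(elapsed)
    return finish


def fair_completion_times(work):
    """Every unfinished job receives an equal share of the capacity."""
    remaining = {i: float(amount) for i, amount in enumerate(work)}
    finish = [0.0] * len(work)
    elapsed = 0.0
    while remaining:
        active = len(remaining)
        smallest = min(remaining.values())
        # With 1/active of the capacity each, the smallest job needs
        # smallest * active time units to finish.
        elapsed += smallest * active
        for job in list(remaining):
            remaining[job] -= smallest
            if remaining[job] <= 1e-12:
                finish[job] = elapsed
                del remaining[job]
    return finish
```

For a long job of 10 units followed by a short job of 2 units, FIFO finishes them at times 10 and 12, while the even split finishes the short job at time 4 and the long one at time 12. The short job no longer waits behind the long one, which is the behaviour the Fair scheduler row describes.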
EXISTING SYSTEM
• Underutilization of the CPU
• Not flexible
• Interaction between the master node and slave nodes

EXISTING SYSTEM
(disadvantage)
• Analyze the system for CPU and I/O underutilization
• Use a predictive scheduler to predict the appropriate
TaskTracker
• Couple the scheduler with a prefetching mechanism to
improve system performance

PROPOSED SYSTEM
• Flexible task scheduler
• Predicts the most appropriate TaskTrackers for
future tasks
• Allows DataNodes to explore underutilization of disk
bandwidth
• Seeks stragglers and predicts candidate data blocks

PREDICTIVE SCHEDULER
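
One way to picture the prediction step is to estimate when each TaskTracker will next have a free slot and to pair upcoming data blocks with the trackers that free up first. The sketch below is only an assumption about how such a prediction could work; the slides do not give the algorithm. The `TaskTracker` class here is a hypothetical model, not Hadoop's own class, and each running task is described by an assumed progress fraction and a positive progress rate.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    name: str
    # Each running task is (fraction_done, fraction_per_second), rate > 0.
    running: list = field(default_factory=list)

    def seconds_until_free(self):
        """Estimated time until one of this tracker's tasks finishes."""
        if not self.running:
            return 0.0
        return min((1.0 - done) / rate for done, rate in self.running)


def predict_next_tracker(trackers):
    """Pick the tracker expected to free a slot soonest."""
    return min(trackers, key=lambda t: t.seconds_until_free())


def plan_assignments(trackers, pending_blocks):
    """Pair pending data blocks with trackers, soonest-free first."""
    ordered = sorted(trackers, key=lambda t: t.seconds_until_free())
    return [(t.name, block) for t, block in zip(ordered, pending_blocks)]
```

The resulting (tracker, block) pairs are what a prefetching mechanism would need: they say which blocks to load and on which node, before the task is actually assigned.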
• Integrates with the predictive scheduler
• Uses multiple worker threads
• Monitors the status of worker threads and coordinates
the prefetching process

PREFETCHING MODULE
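
The worker-thread arrangement described above can be sketched with a thread pool. This is an illustrative model under assumptions, not the module from the slides: `Prefetcher` is a hypothetical class, `read_block` is a caller-supplied function standing in for a real block read, and the cache is a plain in-memory dictionary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


class Prefetcher:
    def __init__(self, read_block, workers=4):
        self._read_block = read_block  # hypothetical block reader
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache = {}
        self._lock = threading.Lock()

    def _fetch(self, block_id):
        """Run by a worker thread: read one block and cache it."""
        data = self._read_block(block_id)
        with self._lock:
            self._cache[block_id] = data

    def prefetch(self, block_ids):
        """Hand blocks to the workers, monitor them, return failed ids."""
        futures = {self._pool.submit(self._fetch, b): b for b in block_ids}
        failed = []
        for future in as_completed(futures):
            if future.exception() is not None:
                failed.append(futures[future])
        return failed

    def get(self, block_id):
        """Return the cached block, or None if it was not prefetched."""
        with self._lock:
            return self._cache.get(block_id)

    def close(self):
        self._pool.shutdown(wait=True)
```

The coordinating thread only watches the futures, so a slow or failed read on one worker does not block the others; failed block ids are returned so the caller can fall back to an ordinary read.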
1. Copying the job from HDFS to the TaskTracker
2. Creation of a local working directory for the task
3. Creation of a TaskTracker instance

STEPS FOR LAUNCHING TASKS
ISSUES IN PREFETCHING MODULE

• When to prefetch
• What to prefetch
• How much to prefetch
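
The three questions above can be read as three checks in a single planning function. The policy below is an assumed example for illustration only; the slides do not specify thresholds or a cache model, so `plan_prefetch`, `busy_threshold`, and the megabyte-based budget are all hypothetical.

```python
def plan_prefetch(disk_utilisation, predicted_blocks, cached_blocks,
                  block_size_mb, free_cache_mb, busy_threshold=0.7):
    """Return the block ids to prefetch now, in priority order."""
    # When: only while the disk has spare bandwidth.
    if disk_utilisation >= busy_threshold:
        return []
    # What: predicted blocks that are not cached yet, in predicted order.
    candidates = [b for b in predicted_blocks if b not in cached_blocks]
    # How much: as many whole blocks as the free cache space can hold.
    budget = int(free_cache_mb // block_size_mb)
    return candidates[:budget]
```

With a disk at 30% utilisation, predicted blocks a, b, c, d, block b already cached, 64 MB blocks, and 150 MB of free cache, the function selects a and c; at 90% utilisation it selects nothing.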

• Avoidance of I/O stalls
• Maximising CPU utilisation
• Helps the smooth functioning of Hadoop
• Flexible

ADVANTAGES
EXISTING SYSTEM       PROPOSED SYSTEM
Low I/O performance   High I/O performance
CPU underutilised     Proper CPU utilisation
Less flexible         Additional overhead of prefetching to master

COMPARISON
• Hadoop On Demand (HOD)
• A scheduler for heterogeneous environments

FUTURE SCOPE
1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on
large clusters. In OSDI ’04, pages 137–150, 2004.
2. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving
MapReduce performance in heterogeneous environments. In OSDI ’08: 8th
USENIX Symposium on Operating Systems Design and Implementation,
October 2008.
3. R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka.
Informed prefetching and caching. SIGOPS Oper. Syst. Rev., 29:79–95,
December 1995.
4. Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, et al. HPMR:
Prefetching and pre-shuffling in shared MapReduce computation
environment. In Proceedings of 11th IEEE International Conference on
Cluster Computing, pages 16–20. ACM, 2009.
5. Tom White. Hadoop: The Definitive Guide. O’Reilly, 2009.
6. Mark Yong, Nitin Garegrat, and Shiwali Mohan. Towards a Resource
Aware Scheduler in Hadoop.

REFERENCES
THANK YOU!!!!!!
QUESTIONS??


Editor's Notes

  • #4 TERMINOLOGY REVIEW
  • #23 Existing system vs. proposed system:
      1. Low I/O performance vs. high I/O performance
      2. CPU workload underutilised vs. proper utilisation of CPU workload
      3. No overhead to master vs. additional overhead of prefetching to master
      4. Suited for real-time solutions vs. not suited for real-time solutions