This presentation discusses a suggested algorithm to improve Hadoop's performance. The algorithm comes from a research paper presented at IEEE.
Suggested Algorithm to improve Hadoop's performance.
1. Research on Scheduling Scheme for Hadoop Clusters
By: Jiong Xie, FanJun Meng, HaiLong Wang, HongFang Pan, JinHong Cheng, Xiao Qin
2. Outlines
• What is Hadoop?
• Hadoop Characteristics
• Hadoop Objectives
• Big Data Challenges
• Hadoop Architecture
• What is the predictive schedule and prefetching mechanism?
• Hadoop Issues
• Hadoop Scheduler
• PSP Scheduler
• Conclusion
3. Goal
• Designing a prefetching mechanism to solve the data-movement problem in MapReduce and to improve performance.
4. What is Hadoop?
• Hadoop is an open source software framework used to handle large amounts of data and to process them on clusters of commodity hardware.
5. Characteristics
• It is a framework of tools
  - Not a particular program, as some people think
• Open source tools
• Distributed under the Apache license
• Linux-based tools
• It works on a distributed model
  - Not one big powerful computer, but numerous low-cost computers
6. Objectives
• Hadoop supports running applications on Big Data.
• Therefore, Hadoop addresses Big Data challenges.
8. Why Do We Need Hadoop?
• A powerful computer can process data only up to the point where the quantity of data becomes larger than the computer's capacity.
• At that point, we need a tool such as Hadoop.
• Hadoop uses a different strategy to deal with data.
9. Hadoop Functionality
• Hadoop breaks the data into smaller pieces and distributes them equally across different nodes to be processed at the same time.
• Similarly, Hadoop divides the computation equally across the nodes.
• The results are combined and then sent back to the application.
10. Hadoop Functionality
[Diagram: the input data is divided equally across the nodes, each node runs its computation, and the partial results are combined and returned to the application.]
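To make this divide-compute-combine flow concrete, here is the classic WordCount job written against Hadoop's MapReduce Java API (essentially the standard Hadoop tutorial example, shown as a sketch rather than as part of the paper):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each node processes its own piece of the input data.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1)
      }
    }
  }

  // Reduce phase: the partial results are combined into the final answer.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would typically be packaged into a jar and run with a command along the lines of: hadoop jar wordcount.jar WordCount /input /output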
11. Architecture
• Hadoop consists of two main components:
  - MapReduce: divides the workload into smaller pieces
  - File System (HDFS): accounts for component failure and keeps a directory of all the tasks
• Other projects provide additional functionality:
  - Pig
  - Hive
  - HBase
  - Flume
  - Mahout
  - Oozie
  - Sqoop
12. Architecture
• Slave computers consist of two components:
  - Task Tracker: processes the given task; it represents the MapReduce component.
  - Data Node: manages the piece of data for the task given to the Task Tracker; it represents HDFS.
13. Architecture
• The master computer consists of four components:
  - Job Tracker: works under the MapReduce component; it breaks the job into smaller tasks and divides them equally among the Task Trackers.
  - Task Tracker: processes the given task.
  - Name Node: responsible for keeping an index of all the tasks.
  - Data Node: manages the piece of data for the task given to the Task Tracker.
15. Fault Tolerance for Data
• Hadoop keeps three copies of each file, and each copy is stored on a different node.
• If any Task Tracker fails, the Job Tracker detects the failure and asks another Task Tracker to take over that job.
• The tables in the Name Node are also backed up on a different computer; this is why the enterprise version of Hadoop keeps two masters: one working master and one backup master.
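The three-copies policy corresponds to HDFS's replication factor, which defaults to 3 and can be set cluster-wide in hdfs-site.xml; a minimal sketch:

<!-- hdfs-site.xml: number of copies HDFS keeps of each block (default is 3) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>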
16. Scalability Cost
• The scalability cost is always linear: if you want to increase the speed, increase the number of computers.
17. Predictive Schedule and Prefetching
• The idea is to implement a predictive schedule and prefetching (PSP) mechanism on top of Hadoop to improve performance.
• Predictive scheduler:
  - A flexible task scheduler that predicts the most appropriate task trackers for the next data.
• Prefetching module:
  - The part responsible for forcing the preload worker threads to start loading data into the node's main memory before the current task finishes. It depends on the estimated finish time.
18. PSP
• Factors that make PSP possible:
  - Underutilization of the CPU
  - The importance of MapReduce performance
  - The storage availability in HDFS
  - The interaction between the nodes
19. Hadoop's Issue
• In the current MapReduce model, all the tasks are managed by the master node, so the computation nodes ask the master node to assign them the next task to be processed.
• The master node tells the computing nodes what the next task is and where it is located.
• Some CPU time is wasted while the computation node communicates with the master node.
20. Hadoop's Issue
• The original Hadoop assigns tasks randomly, and the required data is read from a local or remote disk by the computation node only when it is needed.
• The CPU of a computing node cannot start processing until all the input data is loaded into main memory.
• This affects Hadoop's performance negatively.
21. Prefetching
• It forces the preload worker threads to start loading data from the local disk into the node's main memory before the current task finishes.
• The waiting time is reduced, so the next task can be processed on time.
• This improves the performance of the MapReduce system.
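The slides do not include the implementation, but the preload-worker idea can be sketched as a background thread that pulls the predicted next block into an in-memory cache while the current task is still running (all class and method names below are illustrative assumptions, not the paper's code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical preload worker: reads the next task's input block from the
// local disk into an in-memory cache before the current task finishes.
public class PreloadWorker {
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();

  // Called by the (hypothetical) prediction module shortly before the
  // current task is expected to finish.
  public void prefetch(String blockPath) {
    pool.submit(() -> {
      try {
        Path p = Paths.get(blockPath);
        cache.put(blockPath, Files.readAllBytes(p)); // load the block into main memory
      } catch (IOException e) {
        // On failure the next task simply falls back to reading from disk.
      }
    });
  }

  // The next task asks for its input; returns null if prefetching did not finish.
  public byte[] getIfCached(String blockPath) {
    return cache.get(blockPath);
  }
}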
22. Hadoop Scheduler
• In the original Hadoop scheduler, the Job Tracker includes a task scheduler module that assigns tasks to the different Task Trackers.
• The Task Trackers periodically send a heartbeat to the Job Tracker.
• The Job Tracker checks the heartbeats and sends tasks to the available Task Trackers.
• The scheduler assigns tasks to the nodes at random via the same heartbeat message protocol.
• Because it assigns tasks randomly, it mispredicts stragglers in many cases.
23. Predictive Scheduler
• A predictive scheduler is built by designing a prediction algorithm and integrating it with the original Hadoop.
• The predictive scheduler predicts stragglers and finds the appropriate data blocks.
• The prediction decisions are made by a prediction module during the prefetching stage.
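The slides do not spell out how stragglers and finish times are predicted. A common progress-based estimate, similar in spirit to Hadoop's speculative execution, is sketched below; the formula and the straggler threshold are assumptions for illustration, not taken from the paper:

// Hypothetical progress-based prediction: estimate each running task's
// remaining time from its progress score, then flag tasks that are expected
// to run much longer than a typical task as likely stragglers.
public class FinishTimePredictor {

  // progress is in [0, 1]; elapsedMs is the time since the task attempt started.
  public static double estimatedRemainingMs(double progress, long elapsedMs) {
    if (progress <= 0.0 || elapsedMs <= 0) {
      return Double.POSITIVE_INFINITY;            // no progress yet, cannot estimate
    }
    double progressRate = progress / elapsedMs;   // progress per millisecond
    return (1.0 - progress) / progressRate;       // time left at the current rate
  }

  // Illustrative straggler rule: a task is a straggler if its predicted
  // remaining time is well above the median of its peers.
  public static boolean isStraggler(double remainingMs, double medianRemainingMs) {
    return remainingMs > 1.5 * medianRemainingMs;
  }
}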
25. Launching Process
• Three basic steps to launch a task:
  - Copying the job from the shared file system to the Job Tracker's file system, along with all the required files.
  - Creating a local directory for the task and un-jarring the content of the jar into that directory.
  - Copying the task to the Task Tracker to be processed.
26. Launching Process
• In PSP, all of the above steps are monitored by the prediction module, which predicts three events:
  - The finish time of the currently processed task.
  - The tasks that are going to be assigned to the Task Trackers.
  - The launch time of the pending tasks.
27. Prefetching
• Three issues must be addressed:
  - When to prefetch
  - What to prefetch
  - How much to prefetch
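One way to picture how these three questions could be answered in code; the thresholds, names, and memory cap below are illustrative assumptions, not the paper's actual policy:

// Hypothetical prefetching policy answering "when", "what", and "how much".
public class PrefetchPolicy {
  private final long maxPrefetchBytes;   // "how much": bounded by available memory

  public PrefetchPolicy(long maxPrefetchBytes) {
    this.maxPrefetchBytes = maxPrefetchBytes;
  }

  // "When": start prefetching once the predicted remaining time of the current
  // task drops below the time needed to load the next block from disk.
  public boolean shouldPrefetch(double remainingMs, double estimatedLoadMs) {
    return remainingMs <= estimatedLoadMs;
  }

  // "What" and "how much": prefetch the block predicted for this tracker,
  // but only if it still fits within the memory budget.
  public boolean canPrefetch(long blockSizeBytes, long alreadyPrefetchedBytes) {
    return alreadyPrefetchedBytes + blockSizeBytes <= maxPrefetchBytes;
  }
}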
28. Conclusion
• A predictive scheduling and prefetching mechanism (PSP) is proposed to enhance Hadoop's performance.
• The prediction module predicts the data blocks that will be accessed by the computing nodes in a cluster.
• The prefetching module preloads this future set of data into the cache of the nodes.
• Applied on 10 nodes, PSP reduces the execution time by up to 28%, with an average reduction of 19%.
• It increases the overall throughput and the I/O utilization.