This work aims to give a high-level summary of Big Data, the challenges characterized by the four V's, and how data stored in HDFS under various configuration parameters can be processed by setting up Hadoop, Pig and Hive to retrieve useful information from bulky data sets.
2. Introduction
Big Data: Information is becoming both more accessible and more affordable to process by computers.
A large portion of the world's flow of Big Data is unmanaged content such as words, pictures and video on the Web, along with floods of sensor data.
This is called unstructured information, and it is not readily handled by conventional databases.
Big Data must first be acquired, then processed and analyzed.
3. Characteristics of Big Data
Volume: the sheer amount of information.
Velocity (speed): the rapid rate at which information arrives.
Variety (mixture): unstructured and semi-structured data types.
Value: information has inherent value, but that value must be discovered.
4. Hadoop: A framework of open-source tools, libraries
and techniques for big data analysis.
Hadoop combines two essential components: the
Hadoop Distributed File System (HDFS), which
provides storage and management, and MapReduce,
which provides processing capability.
Why Hadoop? An adaptable framework of hardware and
software with built-in application-level fault tolerance.
5. Why Use Hadoop?
Hadoop is reliable and simple.
Used to store extensive information, offering effectively
unlimited storage capacity in an adaptable framework.
It is a 100% open-source package.
Quick recovery from system failures.
Able to handle large amounts of information quickly,
in parallel.
Information written once to HDFS can be read
many times.
6. Contents of Hadoop
1. Hadoop Distributed File System
(HDFS):
HDFS follows a master/slave architecture; the
client first interacts with the master through
the NameNode (assisted by a Secondary
NameNode).
Every slave runs both a DataNode and a
TaskTracker daemon, which communicate with
and receive instructions from the master node.
The TaskTracker daemon is the slave to the
JobTracker, and the DataNode daemon is the
slave to the NameNode.
7. 2. MapReduce: A simple programming model
built around the use of key-value pairs.
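As an illustration not taken from the original slides, the map/shuffle/reduce flow over key-value pairs can be mimicked with standard Unix tools, where `tr` plays the mapper, `sort` the shuffle, and `uniq -c` the reducer of a word count job:

```shell
# Word count as a map -> shuffle -> reduce pipeline:
#   map:     tr emits one (word) record per line
#   shuffle: sort groups identical keys together
#   reduce:  uniq -c outputs (count, word) per key group
printf 'big data big hadoop\n' |
  tr ' ' '\n' |
  sort |
  uniq -c
```

Hadoop Streaming applies the same pattern at cluster scale, with mapper and reducer scripts standing in for `tr` and `uniq -c`.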
8. Scheduling in Hadoop
Scheduling is a policy used to determine when a
job executes its tasks. Nodes communicate with each
other, and across clusters, over the TCP/IP suite
through the distributed file system.
Scheduling must account for factors such as processing
time and the communication cost of data transmission
given the available bandwidth.
The need for scheduling stems from multiplexing
and multitasking.
A fresh Hadoop setup uses only the First-In
First-Out (FIFO) scheduler.
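For reference, in the Hadoop 1.x releases used later in this deck, this FIFO behaviour comes from the default task scheduler; the corresponding mapred-site.xml property is sketched below (it normally need not be set explicitly, since it is the default):

```xml
<!-- mapred-site.xml: the default (FIFO) task scheduler in Hadoop 1.x -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
</property>
```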
9. Pig
A scripting language intended to handle any sort of data set,
including unstructured information.
Pig comprises two parts: the language itself, called
Pig Latin, and the Pig runtime environment in which
Pig Latin programs are executed.
Pig has two execution modes: local mode (pig -x local) and
MapReduce mode (pig or pig -x mapreduce), offering an ad-hoc way
of creating and executing MapReduce jobs.
Pig Latin: the high-level scripting language.
Requires no metadata or schema.
Statements are translated into a series of MapReduce
jobs.
Grunt: the interactive shell.
Piggybank: a shared repository of User Defined Functions (UDFs).
10. Hive
A data warehouse system for Hadoop.
Provides tools for effortless extract/transform/load
(ETL) of records stored directly in HDFS or in other
storage systems such as HBase.
Contains metadata describing how to access
information in HDFS and HBase, not the information
itself.
Uses a straightforward SQL-like query language
called HiveQL.
Queries are executed through MapReduce.
11. By optimizing schedulers, the framework
allows jobs to complete in a timely manner while
letting users who submit queries get results back in a
reasonable time. Users thus have more freedom to
adopt the most appropriate scheduler or other
techniques according to their requirements.
12. Problem Statement
First in first out (FIFO) scheduler approach allows one
job to take all task slots within the cluster, i.e., no other jobs
can utilize the cluster until the current one completes.
Consequently, jobs that arrive later or with a lower
priority are blocked by those ahead in the queue. When
the total number of jobs is large, the FIFO scheduler
causes significant delay.
13. Proposed System
The proposed system is a novel framework for the optimization approach.
1. Fair Scheduler
The Fair Scheduler assigns smaller jobs to any slots that become
available while other tasks are executing.
Merits of the Fair Scheduler
Allows small jobs to run quickly even when the cluster is shared
with large jobs.
Unlike the default FIFO scheduler, fair scheduling picks a job from the
queue to run a task whenever a slot is free, without starving large or small jobs.
Provides multiple pools with guaranteed levels of job execution
in a shared cluster, and is simple to configure and administer.
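With the Hadoop 1.0.3 release listed in the specification, the Fair Scheduler is typically enabled through mapred-site.xml along the following lines (a sketch; the allocation-file path is an example, and pools are defined in that separate file):

```xml
<!-- mapred-site.xml: switch the JobTracker to the Fair Scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- example path to the pool allocation file -->
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/local/hadoop/conf/fair-scheduler.xml</value>
</property>
```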
14. 2. Capacity Scheduler
The core concept provided by the Capacity Scheduler is
scheduling queues.
It is designed for large clusters, guaranteeing each queue
a minimum capacity while the cluster is shared among
multiple users.
Merits of the Capacity Scheduler
Capacity guarantees: multiple queues support simultaneous
execution of the jobs users submit.
Elasticity: jobs can be assigned beyond a queue's capacity when
resources are freely available, which improves overall cluster utilization.
Multi-tenancy: system resources are provided to user-created
queues, which link to the JobTracker to execute jobs.
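Again as a sketch for Hadoop 1.x: the Capacity Scheduler is selected in mapred-site.xml, and the queues (queueA and queueB, as used in the job commands later) receive their shares in capacity-scheduler.xml; the 50/50 split shown here is an example value:

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>queueA,queueB</value>
</property>

<!-- capacity-scheduler.xml: per-queue capacity (example 50/50 split) -->
<property>
  <name>mapred.capacity-scheduler.queue.queueA.capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.queueB.capacity</name>
  <value>50</value>
</property>
```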
15. Objective
To parallelize job execution across standalone mode
or a cluster.
To estimate the processing time of parallel
applications, such as small or long-running jobs, by
distributing tasks to the schedulers.
16. Methodology
The HDFS framework works on block size. A general parallel file
system supports block sizes of 16 KB to 4 MB with a default block
size of 256 KB, whereas HDFS uses a default block size of 64 MB.
20. 2. Distinctive Block Size
dfs.block.size: file system block size, default 67108864 bytes (64 MB).
E.g. if input data size = 1 GB and dfs.block.size = 64 MB,
then the minimum number of maps is (1*1024)/64 = 16
maps.
E.g. if input data size = 1 GB and dfs.block.size = 128
MB (134217728 bytes), then the minimum number of maps is
(1*1024)/128 = 8 maps.
File Size   Time for Moving Data to HDFS   Total Number of Blocks Created
            64 MB      | 128 MB            64 MB | 128 MB
1.3 GB      0.48 sec   | 0.38 sec          21    | 11
2.7 GB      1.38 sec   | 1.17 sec          41    | 21
4.0 GB      2.32 sec   | 1.58 sec          61    | 31
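The block counts and minimum map counts above follow from a ceiling division of input size by block size; a small sketch (sizes in MB, function name illustrative):

```shell
# Minimum number of map tasks (= number of HDFS blocks)
# is ceil(input_size / block_size).
min_maps() {
  input_mb=$1
  block_mb=$2
  echo $(( (input_mb + block_mb - 1) / block_mb ))
}

min_maps 1024 64    # 1 GB with 64 MB blocks    -> 16
min_maps 1024 128   # 1 GB with 128 MB blocks   -> 8
min_maps 1331 64    # ~1.3 GB with 64 MB blocks -> 21, matching the table
```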
21. S/W and H/W Specification
Software Specification
Operating System: Ubuntu 12.04 LTS.
Java (Jdk 6.1) is required to run *.jar files.
Hadoop Version-hadoop-1.0.3.tar.gz Stable Release.
Pig: apache- pig-0.14.0.tar.gz.
Hive: apache-hive-0.13.1-bin.tar.gz.
Hardware Specification
Main processor: P4, 2 GHz or higher
Secondary memory: 200 GB
Primary memory: 4 GB or higher
27. Job Queue Scheduling Information
with Default Scheduler
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor
/user/hduser/nasa_input /user/hduser/weblog_output/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/
28. Scheduling Information with FAIR
Scheduler.
Using a web browser, open the Hadoop MapReduce status page, e.g. localhost:50030
$ time bin/hadoop jar WeblogHits.jar WeblogHits
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblog_output
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogs_output1
$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogg_output2
30. Job Summary for QueueA using
Capacity Scheduler
$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/capacity/ouputs/weblog_output
31. Job Summary for QueueB using
Capacity Scheduler.
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogTimeOfDayHistogramCreator -Dmapred.job.queue.name=queueB
/user/hduser/heterogeneous/inputs/weblogtime
/user/hduser/heterogeneous/capacity/ouputs/weblogs_output1
37. Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Single-Node Cluster
[Bar chart] Fair 64 MB: 3.7, Fair 128 MB: 2.92, Capacity 64 MB: 3.6, Capacity 128 MB: 3.25
38. Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Single-Node Cluster
[Bar chart] Fair 64 MB: 26.11, Fair 128 MB: 19.04, Capacity 64 MB: 29.96, Capacity 128 MB: 21.24
39. Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Multi-Node Cluster
[Bar chart] Fair 64 MB: 1.54, Fair 128 MB: 1.5, Capacity 64 MB: 3.79, Capacity 128 MB: 1.66
40. Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Multi-Node Cluster
[Bar chart] Fair 64 MB: 7.38, Fair 128 MB: 6.38, Capacity 64 MB: 13.47, Capacity 128 MB: 12.26
41. Results for Movie Dataset using Pig
Running Pig in local mode:
pig -x local
movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
DUMP movies;
42. Filter:
List the movies that were released between 1950 and 1960
movies_1950_1960 = FILTER movies BY (float)year>1949 and (float)year<1961;
store movies_1950_1960 into '/Users/Rich/Desktop/Demo/movies_1950_1960';
43. Foreach Generate:
List movie names and their duration in hours (assuming duration is stored in seconds):
movies_name_duration = foreach movies generate name, (float)duration/3600;
store movies_name_duration into '/Users/Rich/Desktop/Demo/movies_name_duration';
44. Order:
List all movies in descending order of year:
movies_year_sort = order movies by year desc;
store movies_year_sort into '/Users/Rich/Desktop/Demo/movies_year_sort';
45. Results for Movie Dataset using Hive
Displaying the contents of the movie data set:
hive> select * from movies;
46. Results for Movies Released in a Particular Year
hive> select * from movies where year = 1995;
47. Finding all Movies Longer than a Specified Length
hive> select * from movies where length > 3000;
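These queries presuppose a movies table already defined in Hive; a hypothetical sketch of the DDL is below. The column names follow the Pig schema earlier, except that the final query filters on a column named length, so that name is used for the running time; the comma delimiter matches the CSV loaded in the Pig examples.

```sql
-- Hypothetical DDL for the movies table assumed by the HiveQL queries above
CREATE TABLE movies (
  id INT,
  name STRING,
  year INT,
  rating FLOAT,
  length INT  -- running time; the length > 3000 query filters on this column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```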
49. Results for Heterogeneous Datasets using Hadoop, Pig and Hive
              Hadoop   Pig    Hive
Word Count    2.367    1.58   1.52
Movie Rating  1.53     1.45   1.57
50. Conclusion
This work aimed to give a high-level summary of
Big Data, the challenges characterized by the four
V's, and how data stored in HDFS under various
configuration parameters can be processed by
setting up Hadoop, Pig and Hive to retrieve useful
information from bulky data sets.