VISVESVARAYA TECHNOLOGICAL UNIVERSITY
“Jnana Sangama” Belagavi 590018
A Project Report on
“PROJECT TITLE” (in caps)
Submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering
in Computer Science & Engineering during the academic year 2016-20.
By
Student Name USN
Student Name USN
Student Name USN
Student Name USN
Under the guidance of
Guide Name
Assistant Professor
Dept. of CS&E
MRIT, Mandya
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
MYSURU ROYAL INSTITUTE OF TECHNOLOGY, MANDYA
2019 - 2020
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Mysuru Royal Institute of Technology, Mandya – 571606
2019-2020
Department of Computer Science & Engineering
CERTIFICATE
This is to certify that the project work entitled “TITLE” is a bonafide
work carried out by name (usn), in partial fulfillment for the award of Bachelor of
Engineering in Computer Science and Engineering of the Visvesvaraya
Technological University, Belagavi, Karnataka during the year 2019-2020. It is certified
that all corrections/suggestions indicated for the Internal Assessment have been
incorporated in the report. The project report has been approved as it satisfies the
academic requirements in respect of project work prescribed for the Bachelor of
Engineering degree.
------------------------------------------ -----------------------------------------
Signature of Internal Guide Signature of Project Coordinator
Prof. Guide Name Prof. Chethan Raj C
Asst. Professor Asst. Professor
Dept. of CS&E Dept. of CS&E
MRIT, Mandya MRIT, Mandya
------------------------------------------ -----------------------------------------
Signature of HOD Signature of Principal
Prof. Soumya B Dr. Suresh Chandra
Asst. Professor Principal,
Dept. of CS&E MRIT, Mandya
MRIT, Mandya
EXTERNAL VIVA
Name of the Examiner Signature with date
1.
2.
Mysuru Royal Institute of Technology,
Mandya – 571606
Department of Computer Science and Engineering
DECLARATION
I, Student Name, studying in the eighth semester B.E., Computer Science and
Engineering, Mysuru Royal Institute of Technology, Mandya, hereby declare that the
project work entitled “----------TITLE------------------” has been carried out independently
under the guidance of Guide Name, Asst. Professor, Department of Computer Science
and Engineering, Mysuru Royal Institute of Technology, Mandya. This project work is
submitted to the Visvesvaraya Technological University, Belagavi, in partial
fulfillment of the requirements for the award of the degree of Bachelor of Engineering
during the academic year 2016-2020.
This dissertation has not been submitted previously for the award of any other degree or
diploma to any other institution or university.
Date:
Place:
____Sign_________________
Name
(USN)
ACKNOWLEDGEMENT
Happiness cannot be expressed in words, and help received cannot go
unacknowledged. I would like to thank everyone who was a part of my project work.
I am thankful to our principal, Dr. Suresh Chandra H S, MRIT Mandya for all
the facilities provided to us in the college.
I would like to convey my sincere thanks to Prof. Soumya B, Head of the
Department, Dept. of Computer Science and Engineering, MRIT.
I am especially thankful to Prof. Chethan Raj C, Project Coordinator, Dept. of
Computer Science and Engineering, MRIT, for his whole-hearted encouragement and
individual guidance in carrying out this project.
I express my profound gratitude to Prof. Guide Name, Assistant Professor,
Dept. of Computer Science and Engineering, MRIT, who has been my guide and who
guided me throughout my endeavor to complete this project successfully.
My profound thanks to all my lecturers for extending their kind co-operation and
help during this project work.
I would like to express my deepest gratitude to my family members, for their
support and love.
Finally, I would like to thank all my friends, who all made invaluable
contributions to my work.
Thanking You
Student Name
(USN)
ABSTRACT
A Wireless Sensor Network (WSN) is a collection of tiny devices, each equipped
with computational capability, wireless transmitter and receiver technology, and a
power supply. A sensor node consumes energy for sensing, communication, and data
processing, with data communication demanding the most energy. Among wireless
communication systems, the WSN is one of the most widely used networks; it consists
of spatially distributed sensor nodes with sensing, computation, and wireless
communication capabilities. These sensor nodes are scattered in an unattended
environment (i.e. the sensing field) to sense the physical world. It is very costly to deploy a
complete test bed containing multiple networked computers to validate and verify a
certain network protocol or a specific network algorithm; a network simulator saves
both money and time in accomplishing this task. In this project, we introduce two metrics,
a signal strength indicator and a desired distance estimator, to find an optimal and reliable
path between source and destination.
Table of Contents
CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SNAPSHOTS
CHAPTER 1 INTRODUCTION
1.1 Domain Overview
1.2 Project Overview
1.3 Existing System
1.4 Disadvantages of Existing System
1.5 Problem Statement
1.6 Project Motivation
1.7 Proposed System
1.8 Advantages of Proposed System
1.9 Objective of the Project
1.10 Organization of Report
CHAPTER 2 LITERATURE SURVEY
2.1 Literature Review
2.2 Conclusion of Review
CHAPTER 3 SYSTEM REQUIREMENT SPECIFICATION
3.1 Introduction
3.2 Functional Requirements
3.3 Non-Functional Requirements
3.4 System Requirements
3.4.1 Hardware Requirements
3.4.2 Software Requirements
CHAPTER 4 SYSTEM DEVELOPMENT
4.1 Introduction to System Development
4.2 Modules & Methodology
4.2.1 Sub Module 1
4.2.2 Sub Module 2
CHAPTER 5 SYSTEM DESIGN
5.1 Introduction to System Design
5.2 High Level Design
5.2.1 Architecture of the System
5.2.2 Data Flow Diagram
5.3 Low Level Design
5.3.1 Process Diagram
5.3.2 Flow Chart
5.3.3 Sequence Diagram
5.3.4 Activity Diagram
5.3.5 Use Case Diagram
CHAPTER 6 SYSTEM IMPLEMENTATION
6.1 Introduction to System Implementation
6.2 Language Used for Implementation
6.3 Algorithms
6.3.1 Algorithm 1 Name
6.3.2 Algorithm 2 Name
6.4 Code Snippet
CHAPTER 7 TESTING
7.1 Introduction to Testing
7.2 Types of Testing
7.3 Test Cases
CHAPTER 8 RESULTS AND DISCUSSIONS
8.1 Introduction
8.2 Snapshots with Description
CONCLUSION AND FUTURE ENHANCEMENT
REFERENCES
LIST OF FIGURES
Figure No. Name of the Figure
Fig 1.1 Wireless Sensor Network (WSN)
Fig 1.2 Sensor Node Components
Fig 2.1 Components of NS2
Fig 2.2 Network of N nodes at time t1 = 0 sec
Fig 2.3 Network of N nodes at time t2 = 10 sec
Fig 4.1 Architecture of Existing System
Fig 4.2 Architecture of Proposed System
Fig 4.3 Work Flow of Existing System
Fig 4.4 Work Flow of Proposed System
Fig 4.5 Process Diagram
Fig 4.6 Flow Chart
Fig 4.7 Sequence Diagram
Fig 7.1 Initial Position of Nodes
Fig 7.2 Route Request from Source to Destination
Fig 7.3 Obstacle during Route Discovery
Fig 7.4 Finding an Alternate Path
Fig 7.5 Bit Error Rate
Fig 7.6 Packet Delivery Ratio
Fig 7.7 Throughput
LIST OF TABLES
Table No. Name of the Table
Table 1.1 Routing Table 1
Table 1.2 Path Table 1
Table 2.1 Routing Table 2
Table 2.2 Path Table 2
Table 6.1 Unit Test Cases
Table 6.2 Integration Test Cases
LIST OF SNAPSHOTS
Snapshot No. Name of the Snapshot
Snapshot 5.1 Map Function
Snapshot 5.2 Overridden Map Function
Snapshot 5.3 Overridden Reduce Function
Snapshot 5.4 Driver Configuration
Snapshot 7.1 Initialization of Hadoop Cluster
Snapshot 7.2 Parameters of Hadoop Cluster
Snapshot 7.3 Start Hadoop Services
Snapshot 7.4 Hadoop Page
Snapshot 7.5 JPS Verification
Snapshot 7.6 Setting Path for Source File
Snapshot 7.7 MapReduce Process
Snapshot 7.8 Main Page of Slot Configuration
Snapshot 7.9 Bar Graph illustrating the Result
Snapshot 7.10 Line Graph illustrating the Result
ABSTRACT
MapReduce is a prominent framework for scrutinizing and processing
substantially massive data, and the Hadoop framework, its open-source implementation,
has become the default platform today for examining, manipulating, and storing enormous
data. Since educational establishments, business industries, and research and
development centers all rely on Hadoop for processing their data, the performance of the
system must be maintained. One major obstacle of the Hadoop framework that degrades
performance and complicates the overall system is the long makespan, or completion time,
of MapReduce jobs.
The Hadoop scheme presently in use adopts a static assignment of slots, i.e. the map and
reduce slot numbers are predefined for the cluster at the inception of cluster
formation and remain fixed throughout its life. This setting causes underutilization of
resources and long completion times. To reduce these limitations, this project presents a
mechanism in which the slots are assigned dynamically by self-tuning. It collects
execution details of foregoing jobs and, based on these details, allocates the slots for
map and reduce, which in turn leverages the performance of the overall application.
Chapter 1
INTRODUCTION
In recent years the MapReduce programming standard has turned out to be the
prominent technology for analyzing and processing big data, and Apache Hadoop is a
free, open-source implementation of it that can be used for analyzing a broad range of
data. Hadoop is a framework adapted to processing and storing huge bulks of data in a
distributed, parallel environment. Hadoop is designed and written so that it can scale
from a single server to many thousands of systems, each of which offers local storage
and processing of the abundant data submitted by users.
Due to the advancement of cloud computing, Hadoop MapReduce is suitable
not only for large companies and research centres working on data-intensive projects
but also for regular users, who can launch a Hadoop cluster on the cloud.
1.1 Objective
With the rapid advancement of technology, and as more and more data is
generated, applications are employing MapReduce techniques for scrutinizing,
processing, and extracting their data. In this circumstance, the main concerns of the
programmer are how to achieve good reliability and how to enhance the performance of a
Hadoop cluster. The Hadoop framework constitutes a large set of predefined system
parameters, and these parameters play a salient role in leveraging application performance.
The preeminent intentions behind the development of this application are:
First and foremost, to formulate new methods for modifying primitive
attributes of the system in order to improve its overall performance.
Second, to curtail the completion time, also called the makespan, of a
batch of jobs by incorporating the newly formulated methods.
Finally, by achieving the two objectives above, to increase resource
utilization while processing unstable workloads.
1.2 Existing System
In the elementary Hadoop core architecture, the cluster comprises a solitary
master node, responsible for managing and examining all the worker (also called
slave) nodes, and several worker nodes, which host the TaskTracker routine to
execute MapReduce jobs. The JobTracker component resides in the master node; its
main operation is allocating jobs and organizing map and reduce tasks to be executed
on map and reduce slots, respectively, in an adept manner.
The number of tasks which can be accommodated on an individual node is
represented by a term called a slot, and in the elementary Hadoop structure each slot can
run only one task at a time. Based on this, the total number of slots present on every
node indicates the maximal degree of parallelism which can be achieved.
The slot setting is a primitive parameter, fixed at its default value throughout the
cluster's lifetime, and it has a crucial impact on the performance of the system. The
basic Hadoop framework makes use of a fixed slot configuration: the numbers of map
and reduce slots are both predefined for each separate node at the beginning of cluster
creation.
The numbers assigned in this static configuration are arbitrary values chosen without
taking any job attributes into account, so the static configuration of Hadoop is not
optimized and the performance of the whole system may be hindered.
Some of the drawbacks of classic Hadoop MapReduce are:
 It uses a static slot setting, i.e. the numbers of map and reduce slots are predefined for the individual nodes of the cluster throughout its lifetime (see the configuration sketch below).
 A static arrangement of slots causes improper resource utilization.
 It scales down the performance of the overall system under diverse and unstable workloads.
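For concreteness, in classic Hadoop (MRv1) this static setting lives in each TaskTracker's mapred-site.xml. A minimal sketch is shown below; the two property names are the standard MRv1 slot knobs, while the values are purely illustrative:

```xml
<!-- mapred-site.xml: static per-TaskTracker slot counts in classic MRv1.
     These counts stay fixed for the daemon's lifetime; changing them
     requires a TaskTracker restart, which is exactly the rigidity the
     proposed system removes. Values here are illustrative. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```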
1.3 Proposed System
In order to overcome the limitations of the existing system, this project aims at
designing algorithms that modify primitive system attributes and increase the
performance of a batch of jobs. A new concept of dynamically assigning slots is
proposed. The vital goal of this new technique is to decrease the completion time of the
executed tasks while the simplicity of the Hadoop implementation is retained as it is.
The newly projected system is termed TuMM, which stands for
TUnable knob for minimizing the Makespan of MapReduce jobs. Its major goal is to make
the slot allotment proportion of map and reduce tasks automatic. The projected system
is composed of two primary components: the Workload Estimator (WE) and the Slot
Scheduler (SS).
The Workload Estimator is present in the JobTracker routine, and it acquires
details such as the execution times of foregoing completed tasks. These details are used
to compute the current workload in the Hadoop cluster. The second integral component,
the Slot Scheduler, fine-tunes the ratio of map and reduce slots for each worker node
based on the result computed by the Workload Estimator.
A variation of the TuMM technique called H-TuMM is implemented for
heterogeneous clusters; it assigns slots for each node separately to lessen the
makespan of the job batch.
Some of the advantages of this proposed system are:
 It minimizes the completion times of the two phases, thereby scaling down the makespan of multiple jobs by individually allocating slots for nodes in a heterogeneous environment.
 The projected system shows that up to a 28% curtailment of the completion time, or makespan, of a job batch can be achieved.
 This in turn yields about a 20% rise in the proper usage of system resources.
1.4 Organization Of Report
This chapter summarizes the introduction to the project, which is elaborately described
in the later sections of the report. The second chapter provides a detailed survey of the
projected system, covering various papers related to how the performance of Hadoop
can be improved. In the third chapter the requirements and constraints listed by the
user for designing this application are described, and the following chapter illustrates
the design of the application, which comprises the sequence diagram, architecture, and
so on. The fifth chapter gives insight into the implementation, and then the testing
techniques and test cases used for verifying this application are depicted in chapter six.
The analysis of results, containing snapshots, is presented in chapter seven, and finally
the conclusion and future enhancement are specified at the end.
Chapter 2
LITERATURE SURVEY
Improving MapReduce Performance through Data Placement in
Heterogeneous Hadoop Clusters [1]
J. Xie et al. designed a data placement approach in the Hadoop
distributed file system to calibrate the data load in a heterogeneous Hadoop cluster. The
newly designed data placement component first distributes a vast data set to multiple
nodes with respect to the computing capacity of each node. They designed a data
reorganization algorithm along with a data redistribution algorithm in HDFS, and these
two algorithms can be used to solve the data skew problem caused by dynamic data
addition and removal. The initial algorithm is used to divide and distribute file chunks
to the heterogeneous nodes in a cluster at the beginning of cluster formation. Once the
fragments of the input file required by the computing nodes have been distributed to
them, the second algorithm is incorporated to rearrange the file chunks and solve the
data skew problem.
The first, data placement, algorithm starts off by splitting a vast input into
numerous fragments of the same size. These fragments are then allotted to the nodes in
the cluster based on each node's data processing speed; comparatively, high-performance
nodes can store and process more file chunks than low-performance nodes. The input
file fragments distributed by this algorithm may later become unbalanced for the
following reasons: first, new data may be added to the current input file; second, data
fragments may be deleted from the current input file; and third, new computing nodes
may be added to the existing cluster.
To overcome this data load balancing problem, the data redistribution algorithm
is incorporated. It reorders file chunks based on computing ratios: first, information
about disk space utilization and the network topology of the cluster is compiled by the
data distribution server; next, two lists, of over-utilized and under-utilized nodes, are
created; then the server shifts file chunks from the over-utilized node list to the
under-utilized node list until the data load is allocated evenly among the nodes. A rough
sketch of this rebalancing loop follows.
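The sketch below uses our own simplifying assumptions (in-memory chunk lists, a computing-ratio map that sums to one); it is illustrative Java, not the authors' HDFS implementation:

```java
import java.util.Deque;
import java.util.Map;

// Illustrative rebalancing loop: nodes hold file chunks; one chunk at a
// time migrates from the most over-utilized node to the most under-utilized
// one until every node is within its computing-ratio share of the chunks.
public class ChunkRebalancer {
    static void rebalance(Map<String, Deque<String>> chunksByNode,
                          Map<String, Double> computingRatio, int totalChunks) {
        while (true) {
            String over = null, under = null;
            double worstOver = 0, worstUnder = 0;
            for (String node : chunksByNode.keySet()) {
                double target = computingRatio.get(node) * totalChunks;
                double diff = chunksByNode.get(node).size() - target;
                if (diff > worstOver)  { worstOver = diff;  over = node; }
                if (diff < worstUnder) { worstUnder = diff; under = node; }
            }
            // Stop once no node is a full chunk above its target share.
            if (over == null || under == null || worstOver < 1) break;
            chunksByNode.get(under).add(chunksByNode.get(over).poll()); // move one chunk
        }
    }
}
```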
MARLA: MapReduce for Heterogeneous Clusters [2]
Z. Fadika et al. implemented MARLA, a MapReduce paradigm with dynamic load
balancing which can be adapted for homogeneous, heterogeneous, and even
load-imbalanced environments. MARLA relies on a basic shared file system as its input/
output management technique. The idea of this model rests on dynamic task scheduling,
which allows the nodes in the Hadoop cluster to request tasks when required. Previously
in Hadoop MapReduce the tasks were evenly distributed and pre-assigned to the nodes
before running a given application, but in MARLA the nodes in the cluster must request
work once they are done executing their foregoing tasks. The main node is responsible
for registering the number of tasks available, and the nodes are assigned a token
identifying the process, which can be used for requesting tasks. When a task is requested
by a particular node, that specific task becomes unavailable to the rest of the processing
nodes. A node can request a new task only when it has executed and successfully
completed its foregoing tasks, and hence in this scheme fast and slow nodes each
process their fair share.
MARLA is composed of three integral components: the splitter, the task-controller,
and the fault-tracker. The first component, the splitter, is used for the management of
input and output; the second component, the task-controller, is responsible for task
assignment and for checking concurrency; and the last component, the fault-tracker, is
used for fault tolerance. The splitter handles splitting the dataset and distributing it: the
framework takes input fragments, whose number is user-defined, as tasks. The scheme
uses the data visibility provided by the shared-disk file system to present its input data
to the cluster nodes, so input distribution is carried out directly through the shared file
system.
The task-controller is responsible for making the tasks (the data fragments produced
by the splitter) and the user's map and reduce code available to the processing
nodes in the cluster through the shared file system. It frequently checks the progress
of tasks, and failed tasks are sent to a task bag through the fault-tracker component.
Failed tasks in the task bag are put on short-term leave and retried later, while completed
tasks are shifted to a completed-task bag and moved on to the reduce phase. A sketch of
the pull-based scheduling idea follows.
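To make the pull model concrete, here is a minimal, self-contained sketch of the idea, with plain Java threads and a shared queue standing in for MARLA's nodes and task bag (entirely illustrative, not MARLA's code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Pull-based scheduling sketch: each "node" takes the next task from a
// shared bag only after finishing the previous one, so faster nodes
// naturally end up processing more tasks than slower ones.
public class PullScheduler {
    public static void main(String[] args) {
        BlockingQueue<Runnable> taskBag = new LinkedBlockingQueue<>();
        for (int i = 0; i < 12; i++) {
            int id = i;
            taskBag.add(() -> System.out.println("task " + id + " on " +
                    Thread.currentThread().getName()));
        }
        for (int w = 0; w < 3; w++) {            // three "nodes"
            new Thread(() -> {
                Runnable task;
                while ((task = taskBag.poll()) != null) {
                    task.run();                  // request next task only when done
                }
            }, "node-" + w).start();
        }
    }
}
```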
HadoopCL: MapReduce on Distributed Heterogeneous Platforms
through Seamless Integration of Hadoop and OpenCL [3]
In distributed parallel computing, as complexity rises, three challenges rise with it: the
programmability, reliability, and energy efficiency of the system. Attempts to sidestep
these three problems can, in turn, hinder system performance.
In this work M. Grossman et al. introduced the idea of integrating Hadoop
MapReduce with OpenCL to facilitate the use of heterogeneous processors in a distributed
system. Incorporating OpenCL with Hadoop provides: first, a user-friendly, flexible, and
easily learnable application programming interface in a high-level, widely used
programming language; second, the reliability of a distributed file system; and third,
minimal power utilization while leveraging the performance of heterogeneous
processors.
By adopting the new HadoopCL paradigm, all three challenges can be managed
without sacrificing performance in a Hadoop distributed system. The functionality of
HadoopCL includes the following. First, to lessen the modification done to legacy code,
HadoopCL extends the Hadoop framework's mapper and reducer classes to support
execution of user-written Java kernels on heterogeneous Hadoop clusters. Next is the
adoption of dedicated communication threads and asynchronous communication to
escalate the utilization of available bandwidth and restrain communication blockage.
Third, HadoopCL translates Java bytecode to OpenCL kernels automatically using
APARAPI, along with extensions to APARAPI's existing features. Lastly, HadoopCL's
performance is evaluated on two multi-node clusters comprising multicore CPUs, GPUs,
and APUs.
HadoopCL depends on the APARAPI tool for translating Java bytecode to OpenCL
kernels; OpenCL kernel code is produced both for the user-written map and reduce
modules and for the HadoopCL glue code, whose function is to pass keys and values into
the user-written functions. HadoopCL can modify its own memory access arrangement
and loop iteration for the best system performance. Presently it provides optimizations
for GPUs and multicore CPUs, and APARAPI was extended to support asynchronous
kernel execution, accomplished by reworking the APARAPI C++ runtime to store
references to OpenCL events.
Performance Modeling of MapReduce Jobs in Heterogeneous
Cloud Environments [4]
At present, Hadoop is used for heterogeneous data handling and management,
which brings the additional challenges of efficient cluster administration and job
management. Given this heterogeneity of data and resources, it is not clear which system
resources lead to performance hindrance and bottlenecks. In order to provide a mechanism
for configuring and optimizing such Hadoop clusters, Z. Zhang et al. analyzed the
efficiency and precision of the bounds-based performance (BBP) model and used it to
estimate the completion time of MapReduce jobs in heterogeneous clusters.
The BBP (bounds-based performance) paradigm measures the upper and lower limits of
job finishing time. The model relies on the makespan theorem, which is used to calculate
performance bounds on the completion time of a given set of n tasks processed by k
servers.
A greedy algorithm is used for the allotment of tasks to slots; this is an online
allocation technique in which the slot that finishes executing its foregoing task earliest
is assigned a new task. The lower bound is then the product of the average task duration
and n/k, and the upper bound is the sum of the maximum task duration and the product of
the average task duration and (n-1)/k. The difference between these two values delimits
the set of attainable completion times due to task scheduling and non-determinism.
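Restated symbolically (notation ours, following the prose above): with $n$ tasks, $k$ servers, average task duration $\mu$, and maximum task duration $\lambda$,

$$T_{low} = \mu \cdot \frac{n}{k}, \qquad T_{up} = \mu \cdot \frac{n-1}{k} + \lambda .$$

The gap $T_{up} - T_{low}$ bounds the spread of completion times attributable to scheduling non-determinism.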
To approximately reckon the total finishing time of a submitted job, the average and
maximum task durations should first be measured at the different stages of job
execution: the map phase, the shuffle/sort phase, and the reduce phase. These statistics
can be recovered from past job execution records. The completion time of each
processing stage of the job can then be computed using the bound paradigm above.
Dynamic Job Ordering and Slot Configurations for MapReduce
Workloads [5]
MapReduce performance and resource utilization vary with different map/reduce
slot configurations and job execution orders, so S. Tang et al. initiated the use of two
classes of algorithms to reduce the makespan and total completion time of an offline
workload. The first set of algorithms optimizes the job order for a given map/reduce
slot configuration, and the next class optimizes the slot configuration itself.
The algorithm used for optimizing job order is MK_JR, based on Johnson's
Rule for makespan optimization. Johnson's rule provides the best job order for
makespan when there is only one map slot and one reduce slot; in general, when
arbitrary numbers of map and reduce slots are available, minimizing makespan is NP-
hard. The MK_JR algorithm produces a makespan within a factor of 1+δ of the minimum,
where δ<1 can be reckoned as the ratio of the sum of the maximum map and reduce task
sizes to the sum of all task sizes. δ is a very small value because the time needed to
process a single map or reduce task is tiny compared to the processing time of the
overall MapReduce workload. Another algorithm, presented for optimizing makespan and
total completion time concurrently, is MK_TCT_JR: a bi-criteria heuristic that tunes its
parameter values by observing the significant trade-off between completion time and
makespan.
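One plausible way to write this ratio (notation ours; let $t^{m}_{i}$ be the map task sizes and $t^{r}_{j}$ the reduce task sizes):

$$\delta = \frac{\max_i t^{m}_{i} + \max_j t^{r}_{j}}{\sum_i t^{m}_{i} + \sum_j t^{r}_{j}} < 1 .$$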
An optimized map/reduce slot configuration can be obtained by computing and
verifying all possible values from 1 to S-1, where S is the total number of slots. But when
S becomes very large this search may be inefficient, and to overcome the problem a
proportional configuration property is used.
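For context, a minimal sketch of the classic Johnson's-rule ordering that MK_JR builds on (this is the textbook two-machine rule, not the authors' MK_JR code; class and field names are ours, and the record syntax assumes Java 16+):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative job record: total map-phase and reduce-phase times.
record MRJob(String name, double mapTime, double reduceTime) {}

public class JohnsonOrder {
    // Johnson's rule: jobs whose map phase is shorter than their reduce
    // phase come first (ascending map time); the rest follow (descending
    // reduce time). This minimizes makespan for one map and one reduce slot.
    static List<MRJob> order(List<MRJob> jobs) {
        List<MRJob> first = new ArrayList<>(), second = new ArrayList<>();
        for (MRJob j : jobs) {
            if (j.mapTime() <= j.reduceTime()) first.add(j); else second.add(j);
        }
        first.sort(Comparator.comparingDouble(MRJob::mapTime));
        second.sort(Comparator.comparingDouble(MRJob::reduceTime).reversed());
        first.addAll(second);
        return first;
    }
}
```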
Chapter 3
SOFTWARE REQUIREMENT SPECIFICATION
3.1 Introduction
This chapter discusses the various requirements of the project, such as the software
and hardware preconditions, the functional and non-functional prerequisites, and the
constraints the system must adhere to; this section of the report also includes the
purpose and project perspective.
3.2 Purpose
The main purpose of this project is to enhance the performance of the system by
scaling down completion time using the dynamic self-tunable slot technique, which in
turn leverages the resource utilization of the overall system.
3.3 Project Perspective
The elementary Hadoop cluster is predefined with a fixed slot arrangement:
the numbers of slots for the map and reduce stages are permanently defined at the
beginning of cluster formation and cannot be altered later. This elementary mechanism
of the Hadoop framework hinders the performance and optimality of the entire system
and induces underutilization of resources among the nodes. Many techniques were
projected to address this problem and the complications of former methods, for
example:
 Quincy [6] adopted locality constraints and fairness constraints to deal with the job
allocation and management complication.
 Zaharia et al. [7] suggested delay scheduling to boost and facilitate the optimality of
the fair scheduler by leveraging dataset locality.
 Verma et al. [8] projected a heuristic to scale down the makespan of a set of separate,
self-reliant MapReduce jobs by applying the classic Johnson's algorithm.
In order to overcome the limitations of all the above techniques, a self-tunable
slot assignment technique has been implemented in this project. In this technique, map
and reduce slots are allocated dynamically based on feedback collected from the workload
computing component. The first integral component, the workload estimator, calculates
the execution times of foregoing jobs, and these are sent to the next vital component, the
slot scheduler, which properly assigns the map and reduce slots.
3.4 Functional Requirements
Functional requirements express the behavior of a project and the purpose
and role of each component. They are represented using inputs, outputs, and the behavior
produced for a specified input. During the design phase of the system, the functional
requirements are considered, and the behavior of the whole project is realized using use
cases. A use case depicts the behavior of, and the relationships between, the components
or modules of the project; the use case description for this project is illustrated in the
system design chapter.
3.5 Non-Functional Requirements
Non-functional requirements depict conditions that can be used to analyze
the operation of a project rather than its behavior. Requirements of this kind are
depicted elaborately in the framework of this project. They indicate characteristics
like security and usability, which can be termed execution qualities; traits like
reliability, performance, optimality, consistency, and maintainability, which can be
termed evolution qualities, are also described for the projected application.
3.5.1 Performance
Compared to other existing slot assignment techniques, the self-tunable slot
allocation mechanism adapted in this project performs better: it lowers the
makespan of the job batch and increases resource utilization.
3.5.2 Optimality
Optimality defines how well the application runs under any circumstances,
irrespective of the input dataset, resulting in an effective and efficient processing
method. In this application optimality is realized through the newly formulated
methodology we have incorporated for the specific dataset.
3.5.3 Reliability
The reliability of an application mainly depends on the efficiency, throughput, and
performance that directly or indirectly affect the overall behavior of the system for
given data and its required properties. This application is more reliable because it
allocates slots based on feedback from the timing of foregoing jobs, thereby ensuring
proper utilization of resources.
3.5.4 Portability
Portability furnishes insight into how a project can be carried out irrespective of
platform or environment. This application makes use of Hadoop's open-source
architecture and Java as the coding language, so it is compatible with different platforms
and datasets, and easily portable.
3.5.5 Security
The security of this application depends on the Hadoop tool utilized; it is not
affected by the dataset being used.
3.6 System Specification
3.6.1 Hardware Specification
 System Processor : Pentium IV, 2.4 GHz
 Secondary Storage : 40 GB
 Primary Memory : 4 GB
3.6.2 Software Specification
 Platform tool : Windows 7 / Ubuntu
 Programmed in : Java 1.7, Hadoop 0.8.1
 Interface : Eclipse
 RDBMS : MySQL
Chapter 4
SYSTEM DESIGN
System design is a substantially important phase in the building and creation
stages of the whole development cycle. It is a mechanism for depicting and
representing the overall architecture of the system, the interfaces between the different
components, the methods and parameters defined for each module, and the data for the
system, according to the requirements specified by the user.
In order to design a system, the first step is to collect the system requirements,
the functional and non-functional requirements, and the constraints from the user. The
second step is designing the system in an abstract manner; this step provides an outline
of all the major components required for the system architecture. The third step is
detecting and addressing bottlenecks generated in the abstract or high-level design due
to the violation of constraints specified by the user. The next step is designing the
system in a more elaborate and detailed manner; this step constitutes specifying the
methods, parameters, and interfaces of the application components.
4.1 High Level Design
High-level design reveals an abstract layout of the entire application,
pictorially depicting the primitive constituents of the system to be developed. The
architecture of the system and the diagrams depicting the relationships and the flow of
data are all considered high-level designs, and these designs are written using
non-technical terms with a few additional technical terms.
4.1.1 System Architecture
The architecture of an application projects a blueprint of the entire system in
pictorial form. In this project the architecture has three major components: the job
assigner, the slot assigner, and the task processor, as shown in figure 4.1. When the user
submits a batch of jobs to the system, it is first sent to the job-assigner component.
This component in turn contains two sub-components, the slot scheduler and the
workload estimator. The workload estimator repeatedly collects the execution-time
details of lately completed tasks at periodic intervals, and these values are used to
reckon the map/reduce work currently at hand. Based on these estimates, the second
integral part, the slot scheduler, then adjusts and assigns the slot ratio of the map and
reduce slave nodes.
Figure 4.1 System Architecture
4.1.2 Data Flow Diagram
A data flow diagram is constructed using geometric components to represent the
flow of data between the modules of a system; another name for a DFD is a bubble
chart. A DFD may be used to define an abstraction of the whole application at any level,
with the context diagram contemplated as the topmost level of abstraction.
Figure 4.2 portrays the data flow diagram. When the user logs in, he is provided
with two options, working on a homogeneous or a heterogeneous cluster, after which the
user submits the job to be processed. The system then examines whether the job is
scheduled for processing or waiting in the queue. Once the job is scheduled, the workload
is verified, the job is split into tasks, and in the next step each task is assigned to a slot
for execution.
Figure 4.2 Data Flow Diagram
4.2 Low Level Design
Low-level design describes the major components of the system in a detailed and
elaborate manner, so this technique is also called detailed design. Here the diagrams
are constructed by iteratively refining the given details, requirements, and constraints,
and they depict the modules, their parameters and methods, and the relationships among
them.
4.2.1 Use Case Diagram
A use case diagram is a simple mechanism for demonstrating how the user interacts
with the system's functions. It falls under the behavioral drawing category in UML design
and can be used to identify the different users and describe their behavior towards
different use cases.
Figure 4.3 Use Case Diagram
Figure 4.3 portrays the use case representation, where the user interacts with functional
components like the job tracker and the reduce process to assign a job to the system. In
this diagram there are two different users: the first submits the job for processing, and
the second is the one who requires the content.
4.2.2 Sequence Diagram
A sequence diagram portrays how interplay occurs between the different modules
formulated in the application; it falls under the interaction drawing category of UML
design. This diagram shows how processes interact, their order, and the sequence of
messages sent and received, all within a time frame, so such diagrams are also referred
to as event diagrams.
Figure 4.4 Sequence Diagram
Figure 4.4 portrays the sequence diagram, which depicts the interplay between three
components: the user job, the job tracker, and the task tracker. The user job process
interacts with the job tracker by sending a request message for processing the user's
job; the job tracker then sends the user's data to the task tracker, which processes the
job; after execution, the task tracker sends a response message with the results back to
the user.
4.2.3 Activity Diagram
An activity diagram is a kind of behavioral representation which describes the
workflow of the overall system; it is used to describe the actions, interactions, and
activities of the system in a step-by-step manner, and can also be seen as a type of
flowchart. Figure 4.5 showcases the activity diagram, which describes the workflow
between all the modules, i.e. the job tracker, the mapping process, and the reducing job.
Figure 4.5 Activity Diagram
4.2.4 Collaboration Diagram
A collaboration diagram, also called a communication diagram, is a type of interaction
diagram in UML design which describes the interplay among the modules of a system
through messages, as depicted in figure 4.6. This diagram combines both the dynamic
behavior and the static details of a system, and therefore it can be formed from details
taken from the use case diagram, sequence diagram, class diagram, and so on.
Figure 4.6 Collaboration Diagram
Chapter 5
IMPLEMENTATION
5.1 System Implementation
The implementation stage of application creation is the actualization of the ideas,
design, and requirement specification into source code. The primary objective of this
part of building a project is the production of source code in good style, with comments
where necessary, by applying a proper and suitable coding technique with the help of
proper documentation.
Program code is created in accordance with structured coding techniques,
which adhere to control flow, so that the execution sequence follows the order in which
the code is written. This makes the code unambiguous and more readable, which eases
understanding, modifying, debugging, testing, and documenting the programs.
5.2 Modules
The modules implemented in this application for scaling down the makespan of
the submitted jobs and for efficient utilization of resources are described as follows:
5.2.1 Batch Processing of jobs
In this module, jobs are submitted in batches, and the jobs are then processed
batch by batch for ease of understanding.
5.2.2 Estimation of Workload
In the basic version, the assessment of workload was obtained from the number of
remaining jobs for the map and reduce stages. The new idea projected here assumes
workload details known beforehand; such details can be collected from task
configurations, a training stage, or some historical data settings, but in some cases the
information regarding the workload may not be precise or accessible for use.
For this situation, this module estimates the workload without any previous data.
It first considers the incomplete tasks of both the map and reduce stages, then sums
their execution times and uses the sum as the estimate; jobs still waiting in the queue
are not considered in this calculation. A sketch of this estimate follows.
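A minimal sketch of that estimate (our own illustrative types; the report does not show this code): sum the estimated times of the still-incomplete map or reduce tasks, ignoring queued jobs entirely.

```java
import java.util.List;

// Illustrative task record: phase ("map"/"reduce"), estimated time, done flag.
record TaskInfo(String phase, double estimatedTime, boolean completed) {}

// Remaining workload for one phase = sum of estimated times of tasks in
// that phase that have not completed yet. Queued jobs never appear in the
// task list, so they are excluded by construction.
public class WorkloadEstimator {
    static double remainingWorkload(List<TaskInfo> tasks, String phase) {
        return tasks.stream()
                .filter(t -> !t.completed() && t.phase().equals(phase))
                .mapToDouble(TaskInfo::estimatedTime)
                .sum();
    }
}
```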
5.2.3 Feedback for workload
The technique adopted in the last section is most suitable for a homogeneous
environment where all the nodes run the same type of job, with similar configurations
and similar system resource usage. For the now-common heterogeneous setting, where
the slot allocation changes dynamically, a new technique is implemented which uses
feedback from foregoing jobs. This proposal helps balance and accommodate slots
automatically.
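One plausible way to state the feedback rule (notation ours; the report does not give an explicit formula): with $W_{map}$ and $W_{reduce}$ the remaining map and reduce workloads estimated from foregoing jobs, and $s_{map} + s_{reduce}$ fixed at a node's total slot capacity, the scheduler tunes

$$\frac{s_{map}}{s_{reduce}} \approx \frac{W_{map}}{W_{reduce}} ,$$

so the phase with the larger outstanding workload receives proportionally more slots.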
5.2.4 Job manager
In this module, the job manager is used to manage the resources available to the task
administrator. The job manager has three sections: the workload estimator, the slot
scheduler, and the scheduler; the integral slot scheduler component is used to assign
jobs to the task administrator.
5.2.5 Task Administrator
In this module, the task administrator performs the jobs instructed by the job manager,
carrying out each task under the guidance of the task manager. The task manager
implements a job in two phases: map and reduce.
5.3 Snapshot of Code Snippets
Snapshot 5.1: Map Function
The code snippet above shows the map stage, which extends a predefined base class.
Snapshot 5.2: Overridden Map Function
Snapshot 5.2 is the code snippet describing the overridden map procedure.
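The snapshot images themselves are not reproduced in this extract. As an illustration of what snapshots 5.1 and 5.2 describe, here is a minimal mapper against Hadoop's org.apache.hadoop.mapreduce API (a word-count-style example; class and field names are ours, not necessarily the report's):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Snapshot 5.1's idea: the map stage extends the predefined Mapper base class.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Snapshot 5.2's idea: the inherited map() procedure is overridden.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue; // skip blanks from leading whitespace
            word.set(token);
            context.write(word, ONE);      // emit (token, 1)
        }
    }
}
```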
Snapshot 5.3: Overridden Reduce Function
Snapshot 5.3 above shows the code for the overridden reduce function.
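Correspondingly, a minimal reducer sketch for snapshot 5.3 (same assumptions as the mapper sketch above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The reduce stage extends the predefined Reducer base class and overrides
// reduce() to fold all values of one key into a single sum.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```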
Snapshot 5.4: Driver Configuration
Snapshot 5.4, portrayed above, gives insight into the driver configuration
component.
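And a sketch of the driver configuration described in snapshot 5.4, assuming the Hadoop 2.x Job API (input/output paths come from the command line; the job name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires the mapper and reducer sketched above into a job and submits it.
public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "token count");
        job.setJarByClass(Driver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```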
Chapter 6
TESTING
The function of testing is to discover errors; it is used for finding every
plausible fault or glitch that may have developed in the product. Testing presents a way
of assessing the functionality of components, modules, assemblies, sub-assemblies, and
the completed product through thorough examination. It is the phase of product
development used to make sure that the system is designed according to the specified
requirements, adheres to the user-specified constraints, meets user expectations, and
does not fail in an unacceptable manner.
6.1 Testing Types
Testing can be done methodically, using several kinds of testing at the different
stages of the developed product. Some of the testing types are:
6.1.1 Unit Testing
Unit testing is a technique for checking an individual module or piece of code in the
final product with related program inputs so that it produces valid and desired outputs.
For unit testing, the individual constituents of the application are tested with test cases
written for each integral component, considering both positive and negative inputs.
Each decision condition and internal code flow should be verified for the desired
output with thorough examination and scrutiny.
6.1.2 Integration Testing
Integration testing is a mechanism for examining integrated module code. It uses
an iterative approach in which individual components that have been unit tested
are combined and verified for errors and proper functionality. This step is done many
times until all designed and tested components are integrated. Integration testing is
specifically used for finding problems that arise from the combination of components.
6.1.3 System Testing
System testing involves checking the complete integrated software of the developed
product to verify whether it adheres to the user requirements; it can detect errors in
integrated modules as well as in the entire system. System testing considers traits such
as usability, performance, optimality, and exception handling, along with volume and
load testing.
6.1.4 Functional Testing
Functional tests provide systematic demonstrations that the functions tested are
available as specified by the business and technical specifications, documentation, and
user manuals. This kind of testing is focused on key functionality, requirements, and
special test cases. In addition, systematic coverage of business process flows, data
fields, predefined processes, and successive processes must be considered for testing;
before testing is complete, additional tests are identified and the effective value of the
current tests is determined.
6.1.5 Acceptance Testing
Acceptance testing is a method of testing specifically designed for examining
whether the product or application meets all the requirements described by the user and
adheres to the specified constraints. It is conducted after system testing, before the
product is made available to the user, and can be carried out as a black-box testing
technique.
6.1.6 White Box Testing
White-box testing is a mechanism which checks the internal workings, or code, of the
product; test cases are designed based on details of the control and data flow and of
path and branch conditions. This technique checks the system at the code level and can
be incorporated in unit test verification, integration testing, and regression testing.
6.1.7 Black Box Testing
Black-box testing is a mechanism designed for verifying the functionality of the
system's components, modules, or the whole project. It does not consider the internal
flow of the code structures: it provides inputs and checks the outputs without
considering how the inner software structure works. This testing is suitable for most
testing levels, such as acceptance, system, integration, and unit testing.
6.2 Test Cases
A test case is a document containing a set of prerequisites with which an application
can be tested; it can be used for checking individual modules or the whole application.
A test case can be formal (also called technical) or informal (non-technical): in a
formal draft the application is tested for both positive and negative circumstances,
while the latter kind has no technical specification and is based on the working of the
application. A test case can be documented using the technicalities of modules, the
methodology, scenarios, and cases.
6.2.1 Scenario for Hadoop Initialization
Test Case ID: Test_Case_01
Scenario: Hadoop Initialization
Description: This case checks for proper initialization of Hadoop. To initialize Hadoop, it must be installed and configured, and the path must be set properly.
Input: Enter the script to start initialization of Hadoop in the terminal.
Expected Result: The specific Hadoop version with its related files should be added.
Actual Result: Hadoop of the specified version is initialized with its related files.
Remarks: Pass
Table 6.1 Test Case of Hadoop Initialization
6.2.2 Scenario for Hadoop Cluster Formation
Test Case ID: Test_Case_02
Scenario: Hadoop Cluster Formation
Description: The prerequisite for cluster formation is making sure that the Hadoop package is available for use from the same specified path on all the nodes. The cluster is then formed with the required number of nodes.
Input: Enter the scripts start-dfs.sh and start-mapred.sh in the terminal.
Expected Result: A Hadoop cluster with the given number of nodes should be defined.
Actual Result: The required Hadoop cluster is created.
Remarks: Pass
Table 6.2 Test Case of Hadoop Cluster Formation
6.2.3 Scenario for Verification by JPS Command
Test Case ID: Test_Case_03
Scenario: Verification by JPS Command
Description: The Java Virtual Machine Process Status tool (jps) is used to verify that Hadoop daemons like the job history server, name node, and resource manager are functioning properly.
Input: Enter the jps command in the terminal.
Expected Result: JPS should run and all the required daemons and properties should be listed.
Actual Result: JPS runs, and the required daemons, like the name node and resource manager, are listed.
Remarks: Pass
Table 6.3 Test Case of Verification using JPS Command
6.2.4 Scenario for Port Programming
Test Case ID: Test_Case_04
Scenario: Port Programming
Description: Processing of the port assigned to Hadoop.
Expected Result: The port allocated to Hadoop should be properly configured and programmed.
Actual Result: The port assigned to Hadoop is programmed.
Remarks: Pass
Table 6.4 Test Case for Port Programming
6.2.5 Scenario for Configuring Environment for Dataset
Test Case ID: Test_Case_05
Scenario: Setup for submitting a dataset
Description: Configure a directory/folder to hold the source file.
Input: A specific dataset.
Expected Result: The source file for the specific dataset should be created according to the Hadoop parameters in the Hadoop environment.
Actual Result: The source file for the specific dataset is created.
Remarks: Pass
Table 6.5 Test Case for Configuring Environment for Dataset
6.2.6 Scenario for Processing Data Using Mapreduce
Test Case ID: Test_Case_06
Scenario: Processing data using MapReduce
Description: Hadoop processes the large dataset with the MapReduce technique, which contains map and reduce stages for analyzing data effectively.
Input: Any specific dataset.
Expected Result: Hadoop should reduce the processing effort by utilizing the MapReduce technique on the specific dataset.
Actual Result: The Hadoop processing effort is reduced by using the MapReduce technique.
Remarks: Pass
Table 6.6 Test Case for Usage of MapReduce
6.2.7 Scenario for Result Analysis
Test Case ID: Test_Case_07
Scenario: Result Analysis
Description: After a specific dataset is submitted for processing, Hadoop uses the MapReduce mechanism to process the data; the results are then produced as graphs.
Input: Submit a dataset to the MapReduce framework.
Expected Result: Different kinds of graphs which describe the results effectively should be generated.
Actual Result: The application generates graphs for the given dataset.
Remarks: Pass
Table 6.7 Test Case for Result Analysis
Chapter 7
RESULT ANALYSIS
Snapshot 7.1: Initialization of Hadoop Cluster
Snapshot 7.1 shows the initialization of the Hadoop cluster, and snapshot 7.2
illustrates Hadoop cluster parameters like the namenode, tmp, datanode, and pid
settings. Initialization is carried out through a set of commands in the terminal.
Snapshot 7.2: Parameters of Hadoop Cluster
Snapshot 7.3: Start Hadoop Services
The snapshot above shows how the Hadoop services are started with the start-all.sh
command, which starts all the services: the namenode, resource manager, secondary
namenode, and node manager.
Snapshot 7.4: Hadoop Page
The snapshot shows an overview of the Hadoop cluster, comprising the start date,
version, cluster id, and some additional configuration details.
Snapshot 7.5: JPS Verification
Snapshot 7.5 shows JPS tracking. In all Hadoop deployments, the jps command
(the Java Virtual Machine Process Status tool) is used to check that all the Hadoop
daemons are functioning properly. Snapshot 7.6 portrays how the path for the source
file is set.
Snapshot 7.6: Setting Path for Source File
Snapshot 7.7: MapReduce Process
Snapshot 7.8: Main Page of Slot Configuration
Snapshot 7.9: Bar Graph illustrating the Result
Snapshot 7.10: Line Graph illustrating the result
The results of the application are depicted in snapshots 7.9 and 7.10, which present
line and bar graphs of the result analysis.
CONCLUSION
This project presented a new slot assignment technique called TuMM for
dynamically assigning slots in Hadoop. The vital purpose of the application is to boost
the utilization of resources and scale down the makespan of a given batch of n jobs.
This mechanism is suitable for homogeneous clusters; for heterogeneous Hadoop
clusters a modified version of the mechanism, called H-TuMM, is introduced, which
improves performance by configuring the slots of every node separately. With this new
slot allocation method, the project shows about a 28% decrease in completion time.
REFERENCES
[1] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors et al., "Improving MapReduce
Performance through Data Placement in Heterogeneous Hadoop Clusters", in
Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010
IEEE International Symposium.
[2] Z. Fadika, E. Dede, J. Hartog, M. Govindaraju, "MARLA: MapReduce for
Heterogeneous Clusters", in Cluster, Cloud and Grid Computing (CCGrid), 2012
12th IEEE/ACM International Symposium.
[3] M. Grossman, M. Breternitz, V. Sarkar, "HadoopCL: MapReduce on Distributed
Heterogeneous Platforms through Seamless Integration of Hadoop and OpenCL",
in Parallel and Distributed Processing Symposium Workshops and PhD Forum
(IPDPSW), 2013 IEEE 27th International.
[4] Z. Zhang, L. Cherkasova, B. T. Loo, "Performance Modeling of MapReduce Jobs
in Heterogeneous Cloud Environments", in Cloud Computing (CLOUD), 2013
IEEE Sixth International Conference.
[5] S. Tang, B. S. Lee, B. He, "Dynamic Job Ordering and Slot Configurations for
MapReduce Workloads", in IEEE Transactions on Services Computing (vol. 9,
issue 1).
[6] M. Isard, V. Prabhakaran, J. Currey et al., "Quincy: Fair Scheduling for
Distributed Computing Clusters", in SOSP '09, 2009, pp. 261-276.
[7] M. Zaharia, D. Borthakur, J. S. Sarma et al., "Delay Scheduling: A Simple
Technique for Achieving Locality and Fairness in Cluster Scheduling", in
EuroSys '10, 2010.
[8] A. Verma, L. Cherkasova, and R. H. Campbell, "Two Sides of a Coin: Optimizing
the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve
Cluster Performance", in MASCOTS '12, Aug 2012.
 
Student portal system application -Project Book
Student portal system application -Project BookStudent portal system application -Project Book
Student portal system application -Project BookS.M. Fazla Rabbi
 
Major Report on ADIAN
Major Report on ADIANMajor Report on ADIAN
Major Report on ADIANsmittal121
 
IRJET- Course outcome Attainment Estimation System
IRJET-  	  Course outcome Attainment Estimation SystemIRJET-  	  Course outcome Attainment Estimation System
IRJET- Course outcome Attainment Estimation SystemIRJET Journal
 

Similar to Here are the key objectives of the project based on the introduction provided:1. To improve the performance of MapReduce jobs running on Hadoop by reducing job completion time (makespan). 2. To dynamically allocate map and reduce slots based on the execution details of previous jobs, instead of static pre-defined slot allocation. 3. To leverage cluster resources more efficiently by avoiding underutilization through dynamic slot allocation.4. To develop a mechanism for self-tuning of map and reduce slots that can adapt to the resource requirements of different jobs.5. To make Hadoop more scalable and suitable for processing large volumes of data for organizations, research centers and regular users by optimizing job performance (20)

automatic database schema generation
automatic database schema generationautomatic database schema generation
automatic database schema generation
 
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
 
Digital Intelligence, a walkway to Chirology
Digital Intelligence, a walkway to ChirologyDigital Intelligence, a walkway to Chirology
Digital Intelligence, a walkway to Chirology
 
An evaluation of distributed datastores using the app scale cloud platform
An evaluation of distributed datastores using the app scale cloud platformAn evaluation of distributed datastores using the app scale cloud platform
An evaluation of distributed datastores using the app scale cloud platform
 
Project.12
Project.12Project.12
Project.12
 
Minor project report format for 2018 2019 final
Minor project report format for 2018 2019 finalMinor project report format for 2018 2019 final
Minor project report format for 2018 2019 final
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog Potdar
 
Interference Aware Multi-path Routing in Wireless Sensor Networks
Interference Aware Multi-path Routing in Wireless Sensor NetworksInterference Aware Multi-path Routing in Wireless Sensor Networks
Interference Aware Multi-path Routing in Wireless Sensor Networks
 
Face detection
Face detectionFace detection
Face detection
 
WIRELESS ROBOT
WIRELESS ROBOTWIRELESS ROBOT
WIRELESS ROBOT
 
Bit Serial multiplier using Verilog
Bit Serial multiplier using VerilogBit Serial multiplier using Verilog
Bit Serial multiplier using Verilog
 
Front Pages_pdf_format
Front Pages_pdf_formatFront Pages_pdf_format
Front Pages_pdf_format
 
THESIS
THESISTHESIS
THESIS
 
Documentation
DocumentationDocumentation
Documentation
 
Auto Metro Train to Shuttle Between Stations
Auto Metro Train to Shuttle Between StationsAuto Metro Train to Shuttle Between Stations
Auto Metro Train to Shuttle Between Stations
 
Bachelors project
Bachelors projectBachelors project
Bachelors project
 
IRJET- Design of Water Distribution Network System by using Branch Software
IRJET-  	  Design of Water Distribution Network System by using Branch SoftwareIRJET-  	  Design of Water Distribution Network System by using Branch Software
IRJET- Design of Water Distribution Network System by using Branch Software
 
Student portal system application -Project Book
Student portal system application -Project BookStudent portal system application -Project Book
Student portal system application -Project Book
 
Major Report on ADIAN
Major Report on ADIANMajor Report on ADIAN
Major Report on ADIAN
 
IRJET- Course outcome Attainment Estimation System
IRJET-  	  Course outcome Attainment Estimation SystemIRJET-  	  Course outcome Attainment Estimation System
IRJET- Course outcome Attainment Estimation System
 

Recently uploaded

Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 

Recently uploaded (20)

Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 

Here are the key objectives of the project based on the introduction provided:1. To improve the performance of MapReduce jobs running on Hadoop by reducing job completion time (makespan). 2. To dynamically allocate map and reduce slots based on the execution details of previous jobs, instead of static pre-defined slot allocation. 3. To leverage cluster resources more efficiently by avoiding underutilization through dynamic slot allocation.4. To develop a mechanism for self-tuning of map and reduce slots that can adapt to the resource requirements of different jobs.5. To make Hadoop more scalable and suitable for processing large volumes of data for organizations, research centers and regular users by optimizing job performance

  • 5. II ABSTRACT

A Wireless Sensor Network (WSN) is a collection of tiny devices, each equipped with computational processing ability, a wireless receiver and transmitter, and a power supply. A sensor node consumes energy for sensing, communication, and data processing, and data communication demands the most energy of the three. Among wireless communication systems, the WSN is the most widely used network; it consists of spatially distributed sensor nodes with sensing, computation, and wireless communication capabilities. These sensor nodes are scattered in an unattended environment (the sensing field) to sense the physical world. Deploying a complete test bed of multiple networked computers to validate and verify a network protocol or algorithm is very costly, so a network simulator saves both money and time in accomplishing this task. In this project, we introduce two metrics, a signal strength indicator and a desired distance estimator, to find an optimal and reliable path between source and destination.
  • 6. III Table of Contents

CONTENTS PAGE NO.
ACKNOWLEDGEMENT I
ABSTRACT II
LIST OF CONTENTS III
LIST OF FIGURES VI
LIST OF TABLES VII
LIST OF SNAPSHOTS VIII

CHAPTER 1 INTRODUCTION 01
1.1 Domain Overview 01
1.2 Project Overview 02
1.3 Existing System 03
1.4 Disadvantages of Existing System 03
1.5 Problem Statement 04
1.6 Project Motivation 05
1.7 Proposed System 04
1.8 Advantages of Proposed System 05
1.9 Objective of the Project 06
1.10 Organization of Report 07

CHAPTER 2 LITERATURE SURVEY 09
2.1 Literature Review 09
2.2 Conclusion of Review 09

CHAPTER 3 SYSTEM REQUIREMENT SPECIFICATION 16
3.1 Introduction 16
3.2 Functional Requirements 16
3.3 Non-Functional Requirements 17
  • 7. IV
3.4 System Requirements 17
3.4.1 Hardware Requirements 17
3.4.2 Software Requirements 17

CHAPTER 4 SYSTEM DEVELOPMENT 18
4.1 Introduction to System Development 18
4.2 Modules & Methodology 18
4.2.1 Sub Module 1 18
4.2.2 Sub Module 2 21

CHAPTER 5 SYSTEM DESIGN 18
5.1 Introduction to System Design 18
5.2 High Level Design 18
5.2.1 Architecture of the System 18
5.2.2 Data Flow Diagram 21
5.3 Low Level Design 24
5.3.1 Process Diagram 24
5.3.2 Flow Chart 25
5.3.3 Sequence Diagram 26
5.3.4 Collaboration Diagram 24
5.3.5 Activity Diagram 25
5.3.6 Use Case Diagram 26

CHAPTER 6 SYSTEM IMPLEMENTATION 27
6.1 Introduction to System Implementation 27
6.2 Language Used for Implementation 27
6.3 Algorithms 28
6.3.1 Algorithm 1 Name 28
  • 8. V
6.3.2 Algorithm 2 Name 29
6.4 Code Snippets 30

CHAPTER 7 TESTING 36
7.1 Introduction to Testing 36
7.2 Types of Testing 36
7.3 Test Cases 37

CHAPTER 8 RESULTS AND DISCUSSIONS 39
8.1 Introduction 39
8.2 Snapshots with Description 43

CONCLUSION AND FUTURE ENHANCEMENT 46
REFERENCES 47
  • 9. VI LIST OF FIGURES

Figure No. Name of the Figure Page No.
Fig 1.1 Wireless Sensor Network (WSN) 02
Fig 1.2 Sensor Node Components 02
Fig 2.1 Components of NS2 11
Fig 2.2 Network consists of N nodes at time t1 = 0 sec 13
Fig 2.3 Network consists of N nodes at time t2 = 10 sec 14
Fig 4.1 Architecture of Existing System 19
Fig 4.2 Architecture of Proposed System 20
Fig 4.3 Work Flow of Existing System 21
Fig 4.4 Work Flow of Proposed System 23
Fig 4.5 Process Diagram 24
Fig 4.6 Flow Chart 25
Fig 4.7 Sequence Diagram 26
Fig 7.1 Initial Position of Nodes 39
Fig 7.2 Route Request from Source to Destination 40
Fig 7.3 Obstacle During Route Discovery 41
Fig 7.4 Finding an Alternate Path 42
Fig 7.5 Bit Error Rate 43
Fig 7.6 Packet Delivery Ratio 44
Fig 7.7 Throughput 45
  • 10. VII LIST OF TABLES

Table No. Name of the Table Page No.
Table 1.1 Routing Table-1 13
Table 1.2 Path Table-1 14
Table 2.1 Routing Table-2 15
Table 2.2 Path Table-2 15
Table 6.1 Unit Test Cases 37
Table 6.2 Integration Test Cases 38
  • 11. VIII LIST OF SNAPSHOTS

Snapshot No. Name of the Snapshot Page No.
Fig 8.1 Routing Table-1 13
Fig 8.2 Path Table-1 14
Fig 8.3 Routing Table-2 15
Fig 8.4 Path Table-2 15
Fig 8.5 Unit Test Cases 37
Fig 8.6 Integration Test Cases 38
  • 12. ABSTRACT

MapReduce is a prominent framework for analyzing and processing massive data, and Hadoop, its open-source implementation, has become the default platform for examining, manipulating, and storing enormous datasets. Since educational establishments, business industries, and research and development centers all rely on Hadoop to process their data, the performance of the system must be maintained. One major obstacle in the Hadoop framework that degrades performance and complicates the overall system is the long makespan, i.e. the completion time of MapReduce jobs. The Hadoop scheme presently in use adopts a static assignment of slots: the map and reduce slot numbers are predefined for the cluster at the inception of cluster formation and remain fixed throughout its life. This setting causes under-utilization of resources and long completion times. To reduce these limitations, this project presents a mechanism in which the slots are assigned dynamically by self-tuning. It collects execution details of foregoing jobs and, based on these details, allocates the map and reduce slots, which in turn improves the performance of the overall application.
  • 13. Chapter 1 INTRODUCTION

In recent years the MapReduce programming model has become the prominent technology for analyzing and processing big data, and Apache Hadoop is a free, open-source implementation of it that can be used to analyze a broad range of data. Hadoop is a framework adapted to process and store huge bulks of data in a distributed, parallel environment. Hadoop is designed and written so that it can scale from a single server to many thousands of machines, each of which offers local storage and processing of the abundant data submitted by users. Due to the advancement of cloud computing, Hadoop MapReduce is suitable not only for large companies and research centres working on data-intensive projects but also for regular users, who can launch a Hadoop cluster in the cloud.

1.1 Objective

With the rapid advancement of technology, and as more and more data is generated, applications employ MapReduce techniques for scrutinizing, processing, and extracting their data. In this circumstance, the programmer's main concerns are how to achieve good reliability and how to enhance the performance of a Hadoop cluster. The Hadoop framework constitutes a large set of predefined system parameters, and these parameters play a salient role in the performance of an application. The preeminent intentions for the development of this application are as follows. The first and foremost objective is to formulate new methods for modifying primitive attributes of the system to improve its overall performance. The second target is to curtail the completion time, also called the makespan, of a batch of jobs by incorporating the newly formulated methods. Finally, by achieving the two objectives above, the third goal of increasing resource utilization while processing unstable workloads can also be achieved.
  • 14. 1.2 Existing System

In the elementary Hadoop core architecture, the cluster comprises a solitary master node responsible for managing and examining all of the worker (also called slave) nodes, and several worker nodes, each hosting the TaskTracker routine to execute map-reduce jobs. The JobTracker component resides in the master node; its main operation is allocating jobs and organizing map or reduce tasks to be executed on map or reduce slots, respectively, in an adept manner. The number of tasks which can be accommodated on an individual node is represented by a term called a slot, and in the elementary Hadoop structure each slot can run only one task at a given time. Based on this design, the total number of slots present on each node indicates the maximal degree of parallelism which can be achieved.

The slot arrangement is a primitive parameter, fixed by default throughout the cluster's lifetime, and it has a crucial impact on the performance of the system. The basic Hadoop framework uses a fixed slot configuration: the numbers of map and reduce slots are both predefined for each separate node at the beginning of cluster creation. The numbers assigned in this static configuration are arbitrary values chosen without taking any job attributes into account, so the static configuration of Hadoop is not optimized and the performance of the whole system may be hindered. Some drawbacks of classic Hadoop MapReduce are: the system uses a static slot setting, i.e. a predefined number of map and reduce slots on each node of the cluster throughout its lifetime; a static arrangement of slots causes improper resource utilization; and it scales down the performance of the overall system under diverse and unstable workloads.
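To make the static setting concrete, below is a minimal sketch, assuming a classic Hadoop 1.x deployment where the two per-node slot counts are fixed in mapred-site.xml and read once at TaskTracker start-up. The class name, defaults, and printed text are ours for illustration; only the two property names come from Hadoop itself.

import org.apache.hadoop.conf.Configuration;

public class StaticSlotConfig {
    public static void main(String[] args) {
        // In Hadoop 1.x these properties normally live in mapred-site.xml and
        // are read once when the TaskTracker starts; they never change afterwards.
        Configuration conf = new Configuration();
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("static map slots per node    = " + mapSlots);
        System.out.println("static reduce slots per node = " + reduceSlots);
        // This fixed split, chosen without looking at any job attributes, is the
        // source of the under-utilization the proposed system addresses.
    }
}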
  • 15. 1.3 Proposed System

To overcome the limitations of the existing system, this project aims at designing algorithms that modify primitive system attributes and increase system performance for a batch of jobs. A new concept of dynamically assigned slots is proposed. The vital goal of this new technique is to decrease the completion time of the executed tasks while retaining the simplicity of the Hadoop implementation as it is. The newly designed system is termed TuMM, which stands for TUnable knob for minimizing the Makespan of MapReduce jobs; its major goal is to automate the slot allotment ratio between map and reduce tasks. The projected system is composed of two primary components: the Workload Estimator (WE) and the Slot Scheduler (SS). The Workload Estimator is present in the JobTracker routine and acquires details such as the execution times of recently completed tasks; these details are used to compute the current workload in the Hadoop cluster. The second integral component, the Slot Scheduler, fine-tunes the ratio of map and reduce slots for each worker node based on the result computed by the Workload Estimator. A variation of the TuMM technique called H-TuMM is implemented for heterogeneous clusters; it assigns slots for each node separately to lessen the makespan of the job cluster.

Some advantages of the proposed system are: it minimizes the completion time of the two phases, thereby scaling down the makespan of multiple jobs by individually allocating slots for nodes in a heterogeneous environment; the projected system shows that up to a 28% reduction in completion time (makespan) for a batch of jobs can be achieved, which in turn yields a 20% rise in proper usage of system resources.
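To make the two components concrete, here is a minimal, self-contained sketch of the idea, not the report's actual TuMM code: the estimator turns pending task counts and recent task durations into a workload figure, and the scheduler splits a node's slots in proportion to the two workloads. All names and numbers below are illustrative assumptions.

import java.util.Arrays;
import java.util.List;

public class SlotTuningSketch {

    // Remaining workload ~ pending tasks x average duration of recently finished tasks.
    static double estimateWorkload(int pendingTasks, List<Long> recentDurationsMs) {
        long sum = 0;
        for (long d : recentDurationsMs) sum += d;
        double avg = recentDurationsMs.isEmpty() ? 1.0 : (double) sum / recentDurationsMs.size();
        return pendingTasks * avg;
    }

    // Split totalSlots between map and reduce in proportion to their workloads,
    // keeping at least one slot of each kind.
    static int[] tuneSlots(int totalSlots, double mapWork, double reduceWork) {
        double ratio = mapWork / Math.max(mapWork + reduceWork, 1.0);
        int mapSlots = Math.min(totalSlots - 1, Math.max(1, (int) Math.round(totalSlots * ratio)));
        return new int[] { mapSlots, totalSlots - mapSlots };
    }

    public static void main(String[] args) {
        double mapWork = estimateWorkload(40, Arrays.asList(12000L, 9500L, 11000L));
        double reduceWork = estimateWorkload(10, Arrays.asList(30000L, 28000L));
        int[] slots = tuneSlots(8, mapWork, reduceWork);
        System.out.println("map slots = " + slots[0] + ", reduce slots = " + slots[1]);
    }
}

H-TuMM would, in this sketch, simply call tuneSlots once per node with per-node workload estimates instead of one cluster-wide ratio.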
  • 16. 1.4 Organization of Report

This chapter has summarized the introduction to the project, which is described elaborately in later sections of the report. The second chapter provides a detailed survey of the projected system, covering various papers on how the performance of Hadoop can be improved. The third chapter describes the requirements and constraints listed by the user for designing this application, and the following chapter illustrates the design of the application, comprising the sequence diagram, the architecture, and so on. The fifth chapter gives insight into the implementation, and the testing techniques and test cases used to verify the application are depicted in chapter six. The analysis of results, with snapshots, is presented in the seventh chapter, and finally the conclusion and future enhancements are specified at the end.
  • 17. Chapter 2 LITERATURE SURVEY

Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters [1]

J. Xie et al. designed a data placement approach in the Hadoop Distributed File System (HDFS) to calibrate the data load in a heterogeneous Hadoop cluster. The newly designed data placement component first distributes a vast data set to multiple nodes according to the computing capacity of each node. They designed a data reorganization algorithm along with a data redistribution algorithm in HDFS; these two algorithms can be used to solve the data skew problem caused by dynamic data addition and removal. The initial algorithm is used to divide and distribute file chunks to the heterogeneous nodes of a cluster at the beginning of cluster formation. When all file fragments of the input file currently required by the computing nodes are present on a node, the chunks are distributed to the computing nodes, and the second algorithm is then incorporated to rearrange file chunks and resolve the data skew.

The data placement algorithm starts off by splitting a vast input into numerous fragments of the same size. These fragments are then allotted to the nodes in the cluster based on each node's data processing speed; comparatively, high-performance nodes can store and process more file chunks than low-performance nodes. The input file segments distributed by this algorithm may become unbalanced for the following reasons: first, new data may be added to the current input file; second, data fragments may be deleted from the current input file; and third, new computing nodes may be added to the existing cluster. To overcome this load-balancing problem, the data redistribution mechanism is incorporated. It reorders file chunks based on computing ratios: first, information about disk space utilization and the network topology of the cluster is compiled by the data distribution server; next, two lists, of over-utilized and under-utilized nodes, are created; the server then shifts file chunks from nodes on the over-utilized list to nodes on the under-utilized list until the data load is allocated evenly among the nodes.
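As a rough illustration of the proportional idea in [1] (our own sketch; the ratios, names, and rounding policy are assumptions, not the paper's code), each node receives a share of the file chunks proportional to its computing ratio:

import java.util.LinkedHashMap;
import java.util.Map;

public class PlacementSketch {

    // Assign ~ totalChunks * (ratio_i / sum of ratios) chunks to each node,
    // giving any rounding remainder to the fastest node.
    static Map<String, Integer> placeChunks(int totalChunks, Map<String, Double> computingRatio) {
        double sum = 0;
        for (double r : computingRatio.values()) sum += r;
        Map<String, Integer> plan = new LinkedHashMap<String, Integer>();
        int assigned = 0;
        String fastest = null;
        double best = -1;
        for (Map.Entry<String, Double> e : computingRatio.entrySet()) {
            int share = (int) Math.floor(totalChunks * e.getValue() / sum);
            plan.put(e.getKey(), share);
            assigned += share;
            if (e.getValue() > best) { best = e.getValue(); fastest = e.getKey(); }
        }
        plan.put(fastest, plan.get(fastest) + (totalChunks - assigned)); // leftover chunks
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Double> ratios = new LinkedHashMap<String, Double>();
        ratios.put("nodeA", 3.3);  // fast node
        ratios.put("nodeB", 2.0);
        ratios.put("nodeC", 1.0);  // slow node
        System.out.println(placeChunks(100, ratios)); // {nodeA=54, nodeB=31, nodeC=15}
    }
}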
  • 18. MARLA: MapReduce for Heterogeneous Clusters [2]

Z. Fadika et al. implemented MARLA, a MapReduce framework with dynamic load balancing which can be adapted to homogeneous, heterogeneous, and even load-imbalanced environments. MARLA relies on a basic shared file system as its input/output management technique. The idea of this model rests on dynamic task scheduling, which allows the nodes of a Hadoop cluster to request tasks when required. Previously, in Hadoop MapReduce, tasks were evenly distributed and pre-assigned to the nodes before running a given application, but in MARLA the nodes in the cluster must request work when they are done executing their previous tasks. The main node is responsible for registering the number of available tasks, and nodes are assigned a token identifying the process, which can be used to request tasks. When a task is requested by a particular node, that task becomes unavailable to the rest of the processing nodes. A node can request a job only when it has successfully completed its previous tasks, and hence in this scheme both fast and slow nodes process their fair share.

MARLA is composed of three integral components: the splitter, the task-controller, and the fault-tracker. The splitter is used for input and output management, the task-controller is responsible for task assignment and for checking concurrency, and the fault-tracker is used for fault tolerance. The splitter handles splitting and distributing the dataset: the framework takes input fragments as tasks, whose number is chosen by the user. The scheme exploits the data visibility provided by the shared-disk file system to present input data to the cluster nodes, so input distribution is executed directly through the shared file system. The task-controller makes tasks, created from the data fragments produced by the splitter, and the user's map and reduce code available to the processing nodes via the shared file system. It frequently checks the progress of tasks; failed tasks are sent to a task bag through the fault-tracker component, put on short-term leave, and retried later, while completed tasks are shifted to a completed-task bag and moved to the reduce phase.
  • 19. HadoopCL: MapReduce on Distributed Heterogeneous Platforms through Seamless Integration of Hadoop and OpenCL [3]

In distributed parallel computing, as complexity rises, three challenges grow with it: the programmability, reliability, and energy efficiency of the system, and attempts to avoid these three problems can themselves hinder performance. In this work M. Grossman et al. introduced the idea of integrating Hadoop MapReduce with OpenCL to facilitate the use of heterogeneous processors in a distributed system. Incorporating OpenCL with Hadoop provides: first, a user-friendly, flexible, and easily learnable application programming interface in a high-level, widely used programming language; second, the reliability of a distributed file system; and third, minimal power utilization while leveraging the performance of heterogeneous processors. By adopting the new HadoopCL paradigm, all three challenges can be managed without sacrificing performance in a Hadoop distributed system.

The functionality of HadoopCL includes the following. First, to lessen the modification of legacy code, HadoopCL extends the Hadoop framework's mapper and reducer classes to support execution of user-written Java kernels on heterogeneous Hadoop clusters. Next is the adoption of dedicated communication threads and asynchronous communication to escalate utilization of the available bandwidth and restrain communication blockage. Third, HadoopCL translates Java bytecode to OpenCL kernels automatically using APARAPI, with translated extensions to APARAPI's existing features. Lastly, HadoopCL's performance was evaluated on two multi-node clusters comprising multicore CPUs, GPUs, and APUs. HadoopCL depends on the APARAPI tool to translate Java bytecode to OpenCL kernels; OpenCL kernel code is produced for the user-written map and reduce modules and even for the HadoopCL glue code, whose function is to pass keys and values into the user-written functions. HadoopCL can modify its own memory access arrangement and loop iteration for the best system performance. Presently it provides optimization for GPUs and multicore CPUs, and APARAPI was extended to aid asynchronous kernel execution, accomplished by reforming the APARAPI C++ runtime to store references to OpenCL events.
  • 20. Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments [4]

At present, Hadoop is used for heterogeneous data handling and management, which brings the additional challenges of efficient cluster administration and job management. Given this heterogeneity of data and resources, it is not clear which system resources lead to performance hindrance and bottlenecks. To provide a mechanism for configuring and optimizing such Hadoop clusters, Z. Zhang et al. analyzed the efficiency and precision of the bounds-based performance (BBP) model, and using this model they estimated the completion time of MapReduce jobs on heterogeneous clusters.

The BBP paradigm measures the upper and lower limits of job finishing time. The model relies on the makespan theorem, which is used to calculate performance bounds on the completion time of a given set of n tasks processed by k servers. A greedy algorithm is used for allotting tasks to slots; this is an online allocation technique in which the slot that finishes executing its previous task earliest is assigned a new task. The lower bound is then the product of the average task duration and the ratio of the n tasks to the k servers, and the upper bound is the sum of the maximum task duration and the product of the average task duration and the ratio of n-1 tasks to k servers. The difference between the two values indicates the range of attainable completion times due to task scheduling and non-determinism. To approximately reckon the total finishing time of a submitted job, the median and maximum task durations are first measured at the different stages of job execution, i.e. the map phase, the shuffle/sort phase, and the reduce phase; these statistics can be retrieved from the job execution record. The completion time of each processing stage can then be computed using the bound paradigm above.
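In symbols (our notation, not the paper's), let \mu be the average and \lambda the maximum duration of the n tasks executed on k slots. The bounds just described are then

T_{low} = \mu \cdot n / k, \qquad T_{up} = \mu \cdot (n - 1) / k + \lambda

and the achievable makespan of a stage always lies between T_{low} and T_{up}.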
  • 21. Dynamic Job Ordering and Slot Configurations for MapReduce Workloads [5]

MapReduce performance and resource utilization vary with the map-reduce slot configuration and the job execution order, so S. Tang et al. introduced two classes of algorithms to reduce the makespan and the total completion time of an offline workload. The first class of algorithms optimizes job ordering for a given map-reduce slot configuration, and the second class optimizes the slot configuration itself.

The algorithm used for optimizing job order is MK_JR, based on Johnson's Rule for makespan optimization. Johnson's Rule provides the best job order for makespan when there is exactly one map slot and one reduce slot; in general, when arbitrary numbers of map and reduce slots are available, minimizing makespan is NP-hard. The MK_JR algorithm produces a makespan within a factor 1+δ of the minimum, where δ<1 can be reckoned as the ratio of the sum of the maximum map and reduce task sizes to the sum of all task sizes. δ is a very small value because the time needed to process a single map or reduce task is very small compared to the processing time of the overall MapReduce workload.

Another algorithm, MK_TCT_JR, optimizes makespan and total completion time concurrently; it is a bi-criteria heuristic that tunes parameter values by observing the significant trade-off between completion time and makespan. An optimized map-reduce slot configuration is obtained by computing and verifying all possible values from 1 to S-1, where S is the total number of slots; when S becomes very large this search may be inefficient, so a proportional configuration property is used to overcome the problem.
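Written out (with m_i and r_j denoting the map and reduce task durations; the notation is ours), the approximation factor above is

\delta = ( \max_i m_i + \max_j r_j ) / ( \sum_i m_i + \sum_j r_j ) < 1

so the larger the workload is relative to its largest single task, the tighter the MK_JR guarantee becomes.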
  • 22. Chapter 3 SOFTWARE REQUIREMENT SPECIFICATION

3.1 Introduction
This chapter discusses the various requirements of the project, such as the software and hardware prerequisites, the functional and non-functional requirements, and the constraints the system must adhere to. This section of the report also includes the purpose and the project perspective.

3.2 Purpose
The main purpose of this project is to enhance the performance of the system by scaling down completion time using the dynamic, self-tunable slot technique, which in turn improves the resource utilization of the overall system.

3.3 Project Perspective
The elementary Hadoop cluster is predefined with a fixed arrangement of the slots: the number of slots for the map and reduce stages is permanently defined at the beginning of cluster formation and cannot be altered later on. This elementary mechanism hinders the performance and optimality of the entire system and induces under-utilization of resources among the nodes. Many techniques were projected to address this problem and the complications generated by former methods, for example:
- Quincy et al. [6] adopted locality constraints and fairness constraints for dealing with job allocation and management complications.
- Zaharia et al. [7] suggested delay scheduling to boost the optimality of the fair scheduler by leveraging dataset locality.
- Verma et al. [8] projected a heuristic to scale down the makespan of a set of separate, self-reliant MapReduce jobs by applying the classic Johnson's algorithm.

To overcome the limitations of all the above techniques, this project implements a self-tunable slot assignment technique, in which map and reduce slots are allocated dynamically based on feedback collected from the workload-computing component. The first integral component, the workload estimator, calculates the execution time of foregoing jobs; this is then sent to the next vital component, the slot scheduler, which properly assigns slots to map and reduce.
  • 23. 3.4 Functional Requirements
Functional requirements express the behavior of the project and the purpose and role of each component. They are represented by inputs, outputs, and the behavior produced for a specified input. Functional requirements are considered while implementing the design phase of the system, and the behavior of the whole project is realized using use cases. A use case depicts the behavior of, and relationships between, the components or modules of the project; the use case description for this project is illustrated in the system design chapter.

3.5 Non-Functional Requirements
Non-functional requirements depict conditions which can be used to analyze the working of a project rather than its behavior. Requirements of this kind are depicted elaborately in the groundwork of this project. They indicate characteristics like security and usability, termed execution qualities, and traits like reliability, performance, optimality, consistency, and maintainability, termed evolution qualities, which are also described for the projected application.

3.5.1 Performance
Compared to other existing slot assignment techniques, the self-tunable slot allocation mechanism adopted in this project performs well: it lowers the makespan of the job batch and increases resource utilization.

3.5.2 Optimality
Optimality defines how well the application runs in any circumstance, irrespective of the input dataset, resulting in an effective and efficient processing method. In this application, optimality is realized through the newly formulated methodology applied to the specific dataset.
  • 24. 3.5.3 Reliability
The reliability of the application mainly depends on efficiency, throughput, and performance, which directly or indirectly affect the overall behavior of the system for the given data and its required properties. This application is highly reliable: it allocates slots based on feedback about the timing of foregoing jobs, thereby making proper use of resources.

3.5.4 Portability
Portability indicates how readily a project can be carried over to another platform or environment. This application makes use of the open-source Hadoop architecture and Java as the coding language, so it is compatible with different platforms and datasets and is easily portable.

3.5.5 Security
The security of this application depends on the Hadoop tool utilized; it is not affected by the dataset being used.

3.6 System Specification

3.6.1 Hardware Specification
- System Processor: Pentium IV, 2.4 GHz
- Secondary Storage: 40 GB
- Primary Memory: 4 GB

3.6.2 Software Specification
- Platform: Windows 7 / Ubuntu
- Programmed in: Java 1.7, Hadoop 0.8.1
- Interface: Eclipse
- RDB: MySQL
  • 25. Chapter 4 SYSTEM DESIGN

System design is a substantially important phase in the project building and creation stages of the whole development cycle. It is a mechanism for depicting and representing the overall architecture of the system, the interfaces between its different components, the methods and parameters defined for each module, and the data of the system, according to the requirements specified by the user. To design a system, the first step is to collect the system requirements, the functional and non-functional requirements, and the constraints from the user. The second step is designing the system in an abstract manner; this step provides an outline of all major components required for the system architecture. The third step is detecting and addressing bottlenecks generated in the abstract, high-level design due to the violation of constraints specified by the user. The next step is designing the system in a more elaborate, detailed manner, specifying the methods, parameters, and interfaces of the application components.

4.1 High Level Design
High-level design reveals an abstract layout of the entire application, pictorially depicting the primitive constituents of the system to be developed. The architecture of the system and the diagrams depicting the flow of relationships and of data are all considered high-level designs; these designs are written in non-technical terms with slight additional technical vocabulary.

4.1.1 System Architecture
The architecture projects a blueprint of the entire system in a pictorial illustration. In this project the architecture has three major components, the job assigner, the slot assigner, and the task processor, as shown in figure 4.1. When the user submits a batch of jobs to the system, it is first sent to the job-assigner component. This component in turn contains two sub-components, the slot scheduler and the workload estimator. The workload estimator repeatedly collects the execution time details of lately completed tasks at periodic intervals.
  • 26. This value is then used for reckoning the current map-reduce workload at hand. Based on these estimations, the second integral part, the slot scheduler, adjusts and assigns the slot ratio for the map and reduce slave nodes.

Figure 4.1 System Architecture

4.1.2 Data Flow Diagram
A data flow diagram (DFD) is constructed using geometrical components to represent the flow of data between the modules of a system; another name for the DFD is the bubble chart. A DFD may be used to define an abstraction of the whole application at any level, the context diagram being contemplated as the topmost level of abstraction. Figure 4.2 portrays the data flow diagram: when the user logs in, he is provided with two options, working on a homogeneous or a heterogeneous cluster, after which he submits the job to be processed. The system then examines whether the job is scheduled for processing or is waiting in the queue.
  • 27. Once the job is scheduled for carrying out work, the workload is verified, the job is split into tasks, and in the next step each task is assigned to a slot for execution.

Figure 4.2 Data Flow Diagram
  • 28. 4.2 Low Level Design
Low-level design describes the major components of the system in a detailed and elaborate manner, so this technique is also called detailed design. Here the diagrams are constructed by iteratively refining the given details, requirements, and constraints, and they depict the modules, their parameters, their methods, and the relationships among them.

4.2.1 Use Case Diagram
A use case diagram is a simple mechanism for demonstrating how the user interacts with the functions of the system; it falls under the behavioral drawing category of UML design and can be used to identify the different users and describe their behavior towards the different use cases.

Figure 4.3 Use Case Diagram

Figure 4.3 portrays the use case representation, where the user interacts with functional components like the job tracker and the reduce process to assign a job to the system. In this diagram there are two different users: the first submits the job for processing, and the second is the one who requires the content.
  • 29. 4.2.2 Sequence Diagram
A sequence diagram portrays how interplay occurs between the different modules of the application; it falls under the interaction drawing category of UML design. This diagram shows how processes interact, their order, and the sequence of messages sent and received, all within a time frame, so these diagrams are also referred to as event diagrams.

Figure 4.4 Sequence Diagram

Figure 4.4 portrays the sequence diagram, which depicts the interplay between three components: the user job, the job tracker, and the task tracker. The user job process first interacts with the job tracker by sending a request message to process the user job.
  • 30. The job tracker then sends the user data to the task tracker, which processes the job; after execution, the task tracker sends a response message with the user's results back to the user.

4.2.3 Activity Diagram
An activity diagram is a kind of behavioral representation which describes how work flows through the overall system; it is used to describe the actions, interactions, and activities of the system step by step, and can also be viewed as a type of flowchart. Figure 4.5 showcases the activity diagram, which describes the workflow between all the modules, i.e. the job tracker, the mapping process, and the reducing job.

Figure 4.5 Activity Diagram
  • 31. 4.2.4 Collaboration Diagram
A collaboration diagram, also called a communication diagram, is a type of interaction diagram in UML design which describes the interplay among the modules of a system through messages, as depicted in figure 4.6. This diagram combines both the dynamic behavior and the static details of a system, and therefore it can be formed from details taken from the use case diagram, the sequence diagram, the class diagram, and so on.

Figure 4.6 Collaboration Diagram
  • 32. Chapter 5 IMPLEMENTATION

5.1 System Implementation
The implementation stage of application creation is the actualization of ideas, design, and requirement specification into source code. The primary objective of the implementation part of building a project is the production of source code in good style, with comments where necessary, by applying a proper and suitable coding technique with the help of proper documents. Program code is created in accordance with structured coding techniques, which adhere to control flow, so that the execution sequence follows the order in which the code is scripted. This makes the code unambiguous and more readable, which eases understanding, modifying, debugging, testing, and documenting the programs.

5.2 Modules
The modules implemented in this application for scaling down the makespan of the submitted jobs and efficiently utilizing resources are described as follows.

5.2.1 Batch Processing of Jobs
In this module, jobs are submitted in batches; the jobs are then processed batch by batch for ease of understanding.

5.2.2 Estimation of Workload
In the basic version, the assessment of workload was obtained from the amount of remaining jobs for the map and reduce stages. The projected idea relies on workload details known beforehand; such details can be collected from task configurations, a training stage, or some factual data settings, but in some cases the information regarding the workload may not be precise or accessible. For that situation, this module estimates the workload without any previous data: it first considers the incomplete tasks of both the map and reduce stages, then sums their execution times and uses the sum as the estimator value; jobs present in the waiting queue are not considered in this calculation.
  • 33. 5.2.3 Feedback for Workload
The technique described in the last section is most suitable for a homogeneous environment where all the nodes run the same type of job, with similar configuration and similar system resource usage. For the now-common heterogeneous setting, where the slot allocation changes dynamically, a new technique is implemented which uses feedback from foregoing jobs. This proposal helps to balance and accommodate slots automatically.

5.2.4 Job Manager
In this module, the job manager is used to manage the resources available to the task administrator. The job manager has three sections, the workload estimator, the slot scheduler, and the scheduler; the integral slot scheduler component is used to assign jobs to the task administrator.

5.2.5 Task Administrator
In this module, the task administrator performs the job instructed by the job manager, with the guidance of the task manager. The task manager implements a job in two phases: map and reduce.

5.3 Snapshots of Code Snippets

[Snapshot 5.1: Map Function — image not reproduced here]
The snippet shows the code for the map stage, which extends Hadoop's predefined base class.
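Because the snapshot itself is only an image, the following is a representative sketch of the kind of map class the report describes, extended from Hadoop's predefined Mapper base class. It is a WordCount-style illustration written by us, not the report's actual snippet; the class and variable names are assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line and emit (token, 1) for the reduce phase.
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
        }
    }
}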
Snapshot 5.2: Overridden Map Function

Snapshot 5.2 is the code snippet showing the overridden map procedure.

Snapshot 5.3: Overridden Reduce Function

Snapshot 5.3 above is the code snippet for the overridden reduce function.
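As with the map side, the reduce-side snapshots are images in the original report; a minimal sketch of an overridden reduce function consistent with that description could be as follows (the class name SumReducer is an illustrative assumption, paired with the hypothetical TokenMapper above):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reduce class overriding Hadoop's predefined reduce method;
// sums the counts emitted by the mapper for each key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // aggregate all partial counts for this key
        }
        result.set(sum);
        context.write(key, result);
    }
}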
Snapshot 5.4: Driver Configuration

Snapshot 5.4 above gives an insight into the driver configuration, an integral component.
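For completeness, a typical Hadoop driver that wires a mapper and reducer together and configures the job, in the spirit of the snapshot, might be sketched as follows (JobDriver, the job name and the reuse of the hypothetical TokenMapper and SumReducer above are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: configures the MapReduce job, sets the mapper,
// reducer and I/O paths, then submits it and waits for completion.
public class JobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(JobDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // dataset directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}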
Chapter 6

TESTING

The function of testing is to discover errors: it is used to determine every plausible fault or glitch that may have been introduced into the product. It presents a way of assessing the functionality of components, modules, sub-assemblies, assemblies and the completed product through thorough examination. It is the phase of product development used to make sure that the system is designed according to the specified requirements, adheres to the user-specified constraints, meets user expectations and does not fail in an unacceptable manner.

6.1 Testing Types

Testing is carried out methodically, using several kinds of testing at different stages of the developed product. Some of the testing types are:

6.1.1 Unit Testing

Unit testing is a technique for checking an individual module or piece of code of the final product with related program inputs, so that it produces valid and desired outputs. For unit testing, the individual constituents of the application are tested with test cases written for each integral component, considering both positive and negative inputs. Every decision condition and internal code path should be verified for the desired output through thorough examination and scrutiny (a small example is sketched at the end of this section).

6.1.2 Integration Testing

Integration testing is a mechanism for examining integrated module code. This technique uses an iterative approach in which individually unit-tested components are combined and verified for errors and proper functionality. This step is repeated until all designed and tested components have been integrated. Integration testing is specifically used to uncover problems that arise from the combination of components.

6.1.3 System Testing

System testing involves checking the complete integrated software product to verify that it adheres to the user requirements; it can detect errors both in the integrated modules and in the system as a whole. System testing considers traits such as usability, performance, optimality and exception handling, as well as volume and load testing.

6.1.4 Functional Testing

Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical specifications, documentation and user manuals. This kind of testing is focused on key functionality, requirements and special test cases. In addition, systematic coverage of business process flows, data fields, predefined processes and successive processes must be considered; before functional testing is complete, additional tests are identified and the effective value of the current tests is determined.

6.1.5 Acceptance Testing

Acceptance testing is a method of testing specifically designed to examine whether the product or application meets all the requirements described by the user and adheres to the constraints specified. It is conducted after system testing, after which the product is made available to the user; it can be carried out as a black box testing technique.

6.1.6 White Box Testing

White box testing is a mechanism that checks the internal workings, i.e. the code, of the product; test cases are designed based on details of the control and data flow, paths and branch conditions. This technique checks the system at code level and can be incorporated into unit testing as well as integration and regression testing.

6.1.7 Black Box Testing

Black box testing is a mechanism designed to verify the functionality of the system's components, modules, or the whole project. It does not consider the internal structure of the code: it provides inputs and checks the outputs without considering how the inner software structure works. This testing is suitable for most testing levels, such as acceptance, system, integration and unit testing.
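As the example promised in Section 6.1.1, a single unit test for the hypothetical TokenMapper from Chapter 5 could be written with JUnit and the Apache MRUnit library (assumed here to be on the classpath; the test class and input values are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

// Hypothetical unit test: feeds one input record to the mapper and
// asserts that the expected (word, 1) pairs are emitted in order.
public class TokenMapperTest {

    @Test
    public void mapEmitsOneCountPerToken() throws Exception {
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new TokenMapper());
        driver.withInput(new LongWritable(0), new Text("hadoop slot"))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .withOutput(new Text("slot"), new IntWritable(1))
              .runTest(); // fails the test if the actual output differs
    }
}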
6.2 Test Cases

A test case is a document that contains the set of prerequisites with which an application can be tested; it can be used for checking individual modules or the whole application. A test case can be formal (also called technical) or informal (non-technical): in a formal draft the application is tested for both positive and negative circumstances, while the informal kind has no technical specification and is based on the working of the application. Test cases can be documented in terms of the technicalities of the modules, the methodology, scenarios and cases.

6.2.1 Scenario for Hadoop Initialization

Test Case ID: Test_Case_01
Scenario: Hadoop Initialization
Description: This case checks for the proper initialization of Hadoop. To initialize Hadoop, it must be installed and configured, and the path must be set properly.
Input: Enter the script that starts the initialization of Hadoop in the terminal.
Expected Result: The specified Hadoop version with its related files should be loaded.
Actual Result: Hadoop of the specified version is initialized with its related files.
Remarks: Pass

Table 6.1 Test Case of Hadoop Initialization
6.2.2 Scenario for Hadoop Cluster Formation

Test Case ID: Test_Case_02
Scenario: Hadoop Cluster Formation
Description: The prerequisite for cluster formation is ensuring that the Hadoop package is available for use from the same specified path on all the nodes. The cluster is then formed from the required number of nodes.
Input: Enter the scripts start-dfs.sh and start-mapred.sh in the terminal.
Expected Result: A Hadoop cluster with the given number of nodes should be defined.
Actual Result: The required Hadoop cluster is created.
Remarks: Pass

Table 6.2 Test Case of Hadoop Cluster Formation

6.2.3 Scenario for Verification by JPS Command

Test Case ID: Test_Case_03
Scenario: Verification by JPS Command
Description: The Java Virtual Machine Process Status tool is used to verify that Hadoop daemons such as the job history server, name node and resource manager are functioning properly.
Input: Enter the jps command in the terminal.
Expected Result: JPS should run and list all the required Hadoop processes.
Actual Result: JPS runs and the required processes, such as the name node and resource manager, are listed.
Remarks: Pass

Table 6.3 Test Case of Verification Using the JPS Command

6.2.4 Scenario for Port Programming

Test Case ID: Test_Case_04
Scenario: Port Programming
Description: Processing of the port assigned to Hadoop.
Expected Result: The port allocated to Hadoop should be properly configured and programmed.
Actual Result: The port assigned to Hadoop is programmed.
Remarks: Pass

Table 6.4 Test Case for Port Programming
6.2.5 Scenario for Configuring the Environment for the Dataset

Test Case ID: Test_Case_05
Scenario: Setup for submitting a dataset
Description: Configure a directory/folder into which the source file is placed.
Input: A specific dataset.
Expected Result: The source file for the specific dataset should be created according to the Hadoop parameters in the Hadoop environment.
Actual Result: The source file for the specific dataset is created.
Remarks: Pass

Table 6.5 Test Case for Configuring the Environment for the Dataset

6.2.6 Scenario for Processing Data Using MapReduce

Test Case ID: Test_Case_06
Scenario: Processing data using MapReduce
Description: Hadoop processes the large dataset with the MapReduce technique, which contains map and reduce stages for analyzing the data effectively.
Input: Any specific dataset.
Expected Result: Hadoop should reduce the processing time by applying the MapReduce technique to the specific dataset.
Actual Result: The Hadoop processing time is reduced by using the MapReduce technique.
Remarks: Pass

Table 6.6 Test Case for Usage of MapReduce

6.2.7 Scenario for Result Analysis

Test Case ID: Test_Case_07
Scenario: Result Analysis
Description: After a specific dataset is submitted for processing, Hadoop uses the MapReduce mechanism to process the data; the results are then produced as graphs.
Input: Submit a dataset to the MapReduce framework.
Expected Result: Different kinds of graphs that describe the results effectively should be generated.
Actual Result: The application generates graphs for the given dataset.
Remarks: Pass

Table 6.7 Test Case for Result Analysis
Chapter 7

RESULT ANALYSIS

Snapshot 7.1: Initialization of Hadoop Cluster

Snapshot 7.1 shows the initialization of Hadoop cluster 1, and snapshot 7.2 illustrates the Hadoop cluster parameters, such as the name node, tmp, data node and pid settings. Initialization is carried out through a set of commands in the terminal.

Snapshot 7.2: Parameters of Hadoop Cluster
Snapshot 7.3: Start Hadoop Services

The snapshot above shows how the Hadoop services are started with the start-all.sh command, which starts all the services: the name node, resource manager, secondary name node and node manager.

Snapshot 7.4: Hadoop Page

The snapshot shows the overview page of the Hadoop cluster, which comprises the start date, version, cluster ID and some additional configuration details.
Snapshot 7.5: JPS Verification

Snapshot 7.5 shows the JPS check: in any Hadoop deployment, the jps command (the Java Virtual Machine Process Status tool) is used to verify that all the Hadoop daemons are functioning properly. Snapshot 7.6 portrays how the path for the source file is set.

Snapshot 7.6: Setting Path for Source File
Snapshot 7.7: MapReduce Process

Snapshot 7.8: Main Page of Slot Configuration
Snapshot 7.9: Bar Graph Illustrating the Result

Snapshot 7.10: Line Graph Illustrating the Result

The results of the application are depicted in snapshots 7.9 and 7.10, which show the bar and line graphs of the result analysis.
CONCLUSION

This project presented a new slot assignment technique, called TuMM, for dynamically assigning slots in Hadoop. Its vital purpose is to boost resource utilization and scale down the makespan of a given set of n jobs. The mechanism is suitable for homogeneous clusters; for heterogeneous Hadoop clusters, a modified version of the mechanism, called H-TuMM, is introduced, which improves performance by configuring the slots separately for every node. With this new slot allocation method, the project shows a decrease in completion time of about 28%.
REFERENCES

[1] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors et al., "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters," in Proc. IEEE Int. Symp. on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010.

[2] Z. Fadika, E. Dede, J. Hartog and M. Govindaraju, "MARLA: MapReduce for Heterogeneous Clusters," in Proc. 12th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid), 2012.

[3] M. Grossman, M. Breternitz and V. Sarkar, "HadoopCL: MapReduce on Distributed Heterogeneous Platforms through Seamless Integration of Hadoop and OpenCL," in Proc. IEEE 27th Int. Parallel and Distributed Processing Symp. Workshops and PhD Forum (IPDPSW), 2013.

[4] Z. Zhang, L. Cherkasova and B. T. Loo, "Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments," in Proc. IEEE 6th Int. Conf. on Cloud Computing (CLOUD), 2013.

[5] S. Tang, B. S. Lee and B. He, "Dynamic Job Ordering and Slot Configuration for MapReduce Workloads," IEEE Transactions on Services Computing, vol. 9, no. 1.

[6] M. Isard, V. Prabhakaran, J. Currey et al., "Quincy: Fair Scheduling for Distributed Computing Clusters," in Proc. SOSP '09, 2009, pp. 261–276.

[7] M. Zaharia, D. Borthakur, J. S. Sarma et al., "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," in Proc. EuroSys '10, 2010.

[8] A. Verma, L. Cherkasova and R. H. Campbell, "Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve Cluster Performance," in Proc. MASCOTS '12, Aug. 2012.