This document discusses scheduling policies in Hadoop for big data analysis. It describes the default FIFO scheduler in Hadoop as well as alternative schedulers like the Fair Scheduler and Capacity Scheduler. The Fair Scheduler was developed by Facebook to allocate resources fairly between jobs by assigning them to pools with minimum guaranteed capacities. The Capacity Scheduler allows multiple tenants to securely share a large cluster while giving each organization capacity guarantees. It also supports hierarchical queues to prioritize sharing unused resources within an organization.
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan – Feb. 2015), PP 36-40
www.iosrjournals.org, DOI: 10.9790/0661-17143640
Divya S., Kanya Rajesh R., Rini Mary Nithila I., Vinothini M. (PG Scholars)
Department of Information Technology,
Francis Xavier Engineering College, Anna University, TamilNadu, India
Abstract: This paper deals with parallel distributed systems. Hadoop has become a central platform for storing big data through its Hadoop Distributed File System (HDFS) and for running analytics on that stored data using its MapReduce component. The MapReduce programming model has shown great value in processing huge amounts of data, and it is a common framework for data-intensive distributed computing of batch jobs. HDFS is a Java-based file system that provides scalable and reliable data storage designed to span large clusters of commodity servers. All Hadoop implementations provide a default FIFO scheduler, in which jobs are scheduled in FIFO order, along with support for other priority-based schedulers. In this paper, we study the Hadoop framework, the HDFS design, and the MapReduce programming model. We also survey the various schedulers available for Hadoop and describe the behavior of the current scheduling schemes on a locally deployed cluster.
Keywords: Hadoop, Map-Reduce, Big Data Analytics, Scheduling Algorithms
I. Introduction
Apache Hadoop is a software framework for processing Big Data, i.e., data in the range of petabytes. The framework was originally developed by Doug Cutting, the creator of Apache Lucene, as part of his web search engine Apache Nutch. Hadoop is an open source, large-scale, batch data processing, distributed computing framework for big data storage and analytics. It facilitates scalability and takes care of detecting and handling failures. Hadoop ensures high availability of data by creating multiple copies of the data on different nodes throughout the cluster. In Hadoop, the code is moved to the location of the data instead of moving the data towards the code. Hadoop is a highly efficient and reliable cloud computing platform that can be deployed in a cloud computing data center. Hadoop users need not concern themselves with the low-level details of the distributed system but can focus on their business needs when they develop distributed applications. That is why many famous Internet service providers, including Facebook, Twitter, Yahoo, Baidu, and Alibaba, have already chosen Hadoop as one of the core components for building their own cloud systems to provide more efficient services.
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. MapReduce processes large datasets as key-value pairs and divides each job into two types of functions, map and reduce. Both the map and reduce functions take their input as key-value pairs and emit their results as another set of key-value pairs. Each job is divided into a number of map tasks and reduce tasks: the input is first processed by distributed map tasks, and the results are then aggregated by reduce tasks. In order to complete the transactions submitted by users efficiently, Hadoop needs the right job scheduling algorithm and an appropriate task scheduling algorithm. Jobs and tasks are different concepts in Hadoop. When a user submits a transaction, Hadoop creates a job and puts it in the queue of jobs waiting for the job scheduler to dispatch it. The job is then divided into a series of tasks, which can be executed in parallel. The task scheduler is responsible for dispatching tasks according to a certain task scheduling algorithm. Several job scheduling algorithms have been developed, including first-in-first-out (FIFO) scheduling, fair scheduling, and capacity scheduling.
The rest of the paper is organized as follows. Section 2 describes why Hadoop is used for big data analytics. Section 3 provides a brief introduction to Hadoop, HDFS, and the Map/Reduce framework. Section 4 presents our analysis of scheduling policies in a Hadoop cluster.
II. Why Hadoop For Big Data Analytics?
Big Data is moving from a focus on individual projects to an influence on enterprises' strategic information architecture. Dealing with data volume, variety, velocity and complexity is forcing changes to many traditional approaches. This realization is leading organizations to abandon the concept of a single enterprise data warehouse containing all the information needed for decisions. Instead, they are moving towards multiple systems, including content management, data warehouses, data marts and specialized file systems tied together with data services and metadata, which together become the logical enterprise data warehouse. There are various systems available for big data processing and analytics that are alternatives to Hadoop, such as HPCC or the newly launched Redshift by Amazon. However, the success of Hadoop can be gauged by the number of Hadoop distributions available from different technology companies, such as IBM InfoSphere BigInsights, Microsoft HDInsight Service on Azure, Cloudera Hadoop, Yahoo's distribution of Hadoop, and many more. There are several reasons behind its success:
It's an open source project.
It can be used in numerous domains.
It has a lot of scope for improvement with respect to fault tolerance, availability and file systems.
More generally, Hadoop is an attractive option because the platform offers both distributed storage and computational capabilities at a relatively low cost, and it is able to scale to meet the anticipated exponential increase in data generated by mobile technology, social media, the Internet of Things, and other emerging digital technologies.
III. Hadoop Framework
Hadoop is an open source software framework that dramatically simplifies writing distributed data-intensive applications. It provides a distributed file system, which is modeled after the Google File System, and a map/reduce implementation that manages distributed computation.
3.1 HDFS
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage designed to span large clusters of commodity servers. HDFS has a master/slave architecture (Fig. 1). An HDFS cluster consists of a Name Node, a master node that manages the file system namespace and regulates access to files by clients. Additionally, there are a number of Data Nodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and permits user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of Data Nodes. The Name Node executes file system namespace operations like opening, closing, and renaming files and directories, and it determines the mapping of blocks to Data Nodes. Clients' read and write requests are served by the Data Nodes. The Data Nodes are also responsible for block creation, deletion, and replication as instructed by the Name Node.
Figure 1: HDFS Architecture
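To make the client's view of this architecture concrete, the short sketch below (ours, not the paper's) reads a file through the standard org.apache.hadoop.fs.FileSystem API; the Name Node supplies only the block locations, while the bytes stream directly from the Data Nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // handle bound to the cluster's Name Node
    // fs.open() asks the Name Node for the file's block locations, then
    // streams each block from a Data Node holding a replica.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path(args[0]))))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}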
3.1.1 How HDFS works?
An HDFS cluster is comprised of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system of the DataNodes. The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode does not directly send requests to DataNodes; it sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to: replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node.
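The block size and replication behavior described above are cluster configuration rather than code. As a hedged sketch, an hdfs-site.xml along the following lines sets them in 1.x-era Hadoop; the values are illustrative, not prescriptive:

<?xml version="1.0"?>
<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB blocks, as described above -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- each block replicated on three DataNodes -->
  </property>
</configuration>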
3.2 Map Reduce Programming Model
The Map Reduce programming model is designed to process large data sets in parallel by distributing a job into various independent tasks. A job here is a complete MapReduce program: the execution of a Mapper and Reducer across a set of data. A task is the execution of a Mapper or Reducer on a chunk of the data. A MapReduce job normally splits the input data into independent portions, which are processed by the map tasks in a fully parallel fashion. The Hadoop MapReduce framework consists of a single master node running a Job Tracker instance, which accepts job requests from a client node, and slave nodes, each running a Task Tracker instance. The Job Tracker is responsible for distributing the job to the slave nodes, scheduling the job's component tasks on the Task Trackers, monitoring them, and reassigning tasks to other Task Trackers in the event of failure. It also provides status and diagnostic information to the client. The tasks assigned by the Job Tracker are executed by the Task Trackers. Fig. 2 depicts the Map Reduce framework.
Figure 2: Hadoop Map Reduce
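As a concrete illustration of this model (ours, not taken from the paper), the canonical word count job below uses the classic org.apache.hadoop.mapred API of the Job Tracker/Task Tracker era described in this section; input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        output.collect(word, ONE);  // gathered by the OutputCollector
      }
    }
  }
  // Reduce phase: sum the counts gathered for each word during shuffle/sort.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);  // the Job Tracker schedules the map and reduce tasks
  }
}

Each map call emits (word, 1) pairs; the framework groups them by key, and each reduce call sums the counts for one word.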
3.2.1 Execution of Map Task
Every map task is provided with a portion of the input file called a split. By default, a split contains one HDFS block, so the total number of map tasks is equal to the total number of file blocks. The execution of a map task is divided into two phases:
1) The map phase reads the task's split from HDFS, parses it into records (key/value pairs), and applies the map function to every record.
2) After the map function has been applied to every input record, the commit phase registers the final output with the Task Tracker, which then informs the Job Tracker that the task has finished its execution. The map method is given an OutputCollector instance, which collects the output records created by the map function. Because the output of the map step is consumed by the reduce step, the OutputCollector stores the mapper's output in a simple format that is easy for reduce tasks to consume. The Task Tracker reads these data and index files when servicing requests from reduce tasks.
3.2.2 Execution of Reduce Task
The execution of a reduce task is divided into three phases:
1) The shuffle phase is responsible for fetching the reduce task's input. Each reduce task is assigned a partition of the key range produced by the map phase, so the reduce task must fetch the contents of this partition from every map task's output.
2) The sort phase groups together records having the same key. The user-defined reduce function is invoked through the Reducer interface (shown here in simplified form; the production API also passes a Reporter):

public interface Reducer<K2, V2, K3, V3> {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output);
  void close();
}

3) The reduce phase applies the user-defined reduce function to every key and its corresponding list of values. Within the shuffle phase, a reduce task fetches data from every map task by sending HTTP requests to a configurable number of Task Trackers at once. The Job Tracker relays the location of every Task Tracker that hosts map output to each Task Tracker that executes a reduce task. A reduce task cannot fetch the output of a map task until that map has finished execution and committed its final output to disk. The output of both map and reduce tasks is written to disk before it can be consumed.
IV. Scheduling Policies In Hadoop
4.1 Job Scheduling In Hadoop
When Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users. The benefits of sharing are tremendous: with all the data in one place, users can run queries that they might never have been able to execute otherwise, and costs go down because utilization is higher than it would be if each group built a separate Hadoop cluster. However, sharing requires support from the Hadoop job scheduler to provide guaranteed capacity to production jobs and good response time to interactive jobs while allocating resources fairly between users. The scheduler in Hadoop became a pluggable component, which opened the door for innovation in this space. The result was two schedulers for multi-user workloads: the Fair Scheduler, developed at Facebook, and the Capacity Scheduler, developed at Yahoo.
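In the MRv1 line the paper describes, the pluggable scheduler is selected in mapred-site.xml through the mapred.jobtracker.taskScheduler property. The following is a minimal sketch; the scheduler class names are the stock ones shipped with Hadoop:

<?xml version="1.0"?>
<!-- mapred-site.xml: choosing the Job Tracker's scheduler -->
<configuration>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <!-- alternatives: org.apache.hadoop.mapred.CapacityTaskScheduler,
         or the default org.apache.hadoop.mapred.JobQueueTaskScheduler (FIFO) -->
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
</configuration>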
4.1.1 Default FIFO Scheduler
The default Hadoop scheduler operates using a FIFO queue. After a job is partitioned into individual tasks, they are loaded into the queue and assigned to free slots as they become available on Task Tracker nodes. Although there is support for assigning priorities to jobs, this is not turned on by default.
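For illustration (our sketch, not the paper's), a client can request a higher priority through the old mapred API; whether it has any effect depends on the scheduler in use:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PriorityDemo {
  public static JobConf highPriorityConf() {
    JobConf conf = new JobConf(PriorityDemo.class);
    // Honored only when priority support is enabled in the scheduler;
    // the stock FIFO queue ignores it by default, as noted above.
    conf.setJobPriority(JobPriority.HIGH);
    return conf;
  }
}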
4.1.2 Fair Scheduler
The Fair Scheduler was developed at Facebook to manage access to their Hadoop cluster, which runs
several large jobs computing user metrics, etc. on several TBs of data daily. Users may assign jobs to pools,
with each pool allocated a guaranteed minimum number of Map and Reduce slots. Free slots in idle pools may
be allocated to other pools, while excess capacity within a pool is shared among jobs. In addition, administrators
may enforce priority settings on certain pools. Tasks are therefore scheduled in an interleaved manner, based on
their priority within their pool, and the cluster capacity and usage of their pool. Over time, this has the effect of
ensuring that jobs receive roughly equal amounts of resources. Shorter jobs are allocated sufficient resources to
finish quickly. At the same time, longer jobs are guaranteed to not be starved of resources.
The Fair Scheduler is based on three concepts:
Jobs are placed into named “pools” based on a configurable attribute such as user name, UNIX group, or
specifically tagging a job as being in a particular pool through its jobconf.
Each pool can have a “guaranteed capacity” that is specified through a config file, which gives a minimum
number of map slots and reduce slots to allocate to the pool. When there are pending jobs in the pool, it gets
at least this many slots, but if it has no jobs, the slots can be used by other pools.
Excess capacity that is not going toward a pool’s minimum is allocated between jobs using fair sharing. Fair
sharing ensures that over time, each job receives roughly the same amount of resources. This means that
shorter jobs will finish quickly, while longer jobs are guaranteed not to get starved.
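As a hedged sketch of these three concepts (the pool names and slot counts are invented for illustration), an allocation file for the 0.20-era Fair Scheduler documented in [5] could look like this:

<?xml version="1.0"?>
<!-- fair scheduler allocation file: illustrative pools -->
<allocations>
  <pool name="production">
    <minMaps>20</minMaps>        <!-- guaranteed map slots when jobs are pending -->
    <minReduces>10</minReduces>  <!-- guaranteed reduce slots -->
    <weight>2.0</weight>         <!-- double share of the excess capacity -->
  </pool>
  <pool name="research">
    <minMaps>5</minMaps>
    <minReduces>5</minReduces>
  </pool>
  <userMaxJobsDefault>3</userMaxJobsDefault>  <!-- cap on concurrent jobs per user -->
</allocations>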
4.1.3 Capacity Scheduler
The Capacity Scheduler is a pluggable scheduler for Hadoop that allows multiple tenants to securely share a large cluster. Resources are allocated to each tenant's applications in a way that fully utilizes the cluster, governed by the constraints of the allocated capacities. Queues are typically set up by administrators to reflect the economics of the shared cluster. The Capacity Scheduler supports hierarchical queues to ensure that resources are shared among the sub-queues of an organization before other queues are allowed to use free resources.
The Capacity Scheduler is designed to run Hadoop applications in a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster. It is designed to allow sharing a large cluster while giving each organization capacity guarantees. The central idea is that the available resources in the Hadoop cluster are shared among multiple organizations that collectively fund the cluster based on their computing needs. There is an added benefit: an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner. Sharing clusters across organizations necessitates strong support for multi-tenancy, since each organization must be guaranteed capacity and safeguards to ensure the shared cluster is impervious to a single rogue application or user, or sets thereof.
The Capacity Scheduler provides a stringent set of limits to ensure that a single application, user, or queue cannot consume a disproportionate amount of resources in the cluster. The Capacity Scheduler also provides limits on initialized and pending applications from a single user and queue to ensure fairness and stability of the cluster. The primary abstraction provided by the Capacity Scheduler is the concept of queues. To provide further control and predictability in the sharing of resources, the Capacity Scheduler supports hierarchical queues to ensure resources are shared among the sub-queues of an organization before other queues are allowed to use free resources, thereby providing affinity for sharing free resources among applications of a given organization.
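The hierarchical queues described here are declared in capacity-scheduler.xml. The sketch below uses the YARN-era property names, which match this description; the queue names and percentages are our own illustration:

<?xml version="1.0"?>
<!-- capacity-scheduler.xml: illustrative queue hierarchy -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>engineering,marketing</value>  <!-- one top-level queue per organization -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.engineering.capacity</name>
    <value>60</value>  <!-- percent of cluster guaranteed to engineering -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.marketing.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.engineering.queues</name>
    <value>dev,prod</value>  <!-- sub-queues get first claim on engineering's free capacity -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.engineering.dev.capacity</name>
    <value>50</value>  <!-- percent of the parent queue's capacity -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.engineering.prod.capacity</name>
    <value>50</value>
  </property>
</configuration>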
4.2 Resource Aware Scheduling
The Fair Scheduler and Capacity Scheduler described above attempt to allocate capacity fairly among users and jobs without considering resource availability on a more fine-grained basis. As CPU and disk channel capacities have increased in recent years, a Hadoop cluster with heterogeneous nodes can exhibit significant diversity in processing power and disk access speed among nodes. Performance can suffer if multiple processor-intensive or data-intensive tasks are allocated to nodes with slow processors or disk channels, respectively. This possibility arises because the Job Tracker simply treats each Task Tracker node as having a fixed number of available task slots. Even the improved LATE speculative execution could end up increasing the degree of congestion within a busy cluster if speculative copies are simply assigned to machines that are already close to maximum resource utilization.
Scheduling in Hadoop is centralized and worker-initiated. Scheduling decisions are made by a master node, called the Job Tracker, whereas the worker nodes, called Task Trackers, are responsible for task execution. The Job Tracker maintains a queue of currently running jobs, the states of the Task Trackers in the cluster, and the list of tasks allocated to each Task Tracker. Each Task Tracker node is currently configured with a maximum number of available computation slots. Although this can be configured on a per-node basis to reflect the actual processing power, disk channel speed, etc. available on each cluster machine, there is no online modification of this slot capacity.
Two possible resource-aware Job Tracker scheduling mechanisms are:
1) Dynamic Free Slot Advertisement. Instead of having a fixed number of available computation slots configured on each Task Tracker node, this number is computed dynamically using the resource metrics obtained from each node. In one possible heuristic, overall resource availability on a machine is set to the minimum availability across all resource metrics. In a cluster that is not running at maximum utilization at all times, this is expected to improve job response times significantly, since no machine then runs tasks in a manner that hits a resource bottleneck. A minimal sketch of this heuristic follows the list.
2) Free Slot Priorities/Filtering. In this mechanism, cluster administrators configure the maximum number of compute slots per node at configuration time. The order in which free Task Tracker slots are advertised is decided according to their resource availability. As Task Tracker slots become free, they are buffered for a small time period (say, 2 s) and advertised in a block. Task Tracker slots with higher resource availability are presented first for scheduling tasks on. In an environment where even short jobs take a relatively long time to complete, this presents significant performance gains: instead of scheduling a task onto the next available free slot (which may happen to be a relatively resource-deficient machine at that point), job response time is improved by scheduling the task onto a resource-rich machine, even if such a node takes longer to become available. Buffering the advertisement of free slots allows for this scheduling allocation.
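As promised above, here is a minimal sketch of Dynamic Free Slot Advertisement, assuming each node reports normalized availability metrics in [0, 1]; the class and method names are our own invention, not part of Hadoop:

import java.util.Map;

// Illustrative only: derive a node's advertised slot count from its metrics.
public class DynamicSlotAdvertiser {
  private final int maxSlots;  // administrator-configured ceiling for this node

  public DynamicSlotAdvertiser(int maxSlots) {
    this.maxSlots = maxSlots;
  }

  // metrics maps a resource name (e.g. "cpu", "disk", "memory") to its
  // available fraction in [0, 1]. Overall availability is the minimum
  // across all metrics, as in the heuristic described above.
  public int advertisedSlots(Map<String, Double> metrics) {
    double min = 1.0;
    for (double available : metrics.values()) {
      min = Math.min(min, available);
    }
    return (int) Math.floor(maxSlots * min);  // scale by the bottleneck resource
  }
}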
V. Conclusion & Future Work
Hadoop is a popular open source implementation platform for the MapReduce model, used to process and analyze large-scale data sets in parallel. Making the Hadoop scheduler resource-aware is one of the emerging research problems attracting attention, since the current implementation is based on statically configured slots. This paper has summarized the pros and cons of the scheduling policies of various Hadoop schedulers developed by different communities. Each of the schedulers considers resources such as CPU, memory, job deadlines, and I/O, and each addresses one or more scheduling problems in Hadoop. Nevertheless, all the schedulers discussed above assume homogeneous Hadoop clusters. Future work will consider scheduling in Hadoop on heterogeneous clusters.
References
[1]. Apache Hadoop. http://hadoop.apache.org
[2]. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, pages 137–150, 2004.
[3]. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th Symposium on Operating Systems Principles, pages 29–43, Lake George, New York, 2003.
[4]. Hadoop Distributed File System. http://hadoop.apache.org/hdfs
[5]. Hadoop's Fair Scheduler. http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html
[6]. Hadoop's Capacity Scheduler. http://hadoop.apache.org/core/docs/current/capacity_scheduler.html
[7]. K. Kc and K. Anyanwu. Scheduling Hadoop jobs to meet deadlines. In Proc. CloudCom, 2010, pp. 388–392.
[8]. Mark Yong, Nitin Garegrat, and Shiwali Mohan. Towards a resource aware scheduler in Hadoop. In Proc. ICWS, 2009, pp. 102–109.
[9]. B. Thirumala Rao, N. V. Sridevi, V. Krishna Reddy, and L. S. S. Reddy. Performance issues of heterogeneous Hadoop clusters in cloud computing. Global Journal of Computer Science & Technology, Vol. 11, No. 8, May 2011, pp. 81–87.
[10]. Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys '10: Proceedings of the 5th European Conference on Computer Systems, pages 265–278, New York, NY, USA, 2010. ACM.