SCHEDULERS OPTIMIZATION TO
HANDLE MULTIPLE JOBS IN
HADOOP CLUSTER
By
Shivaraj B G
4th Sem
Department of Computer Networking, MITE.
Introduction
Big Data: information is becoming both more accessible to computers and cheaper for them to process.
 A large portion of the Big Data flowing through the world is hard-to-manage content such as text, images, and video on the Web, together with streams of sensor data.
 This is called unstructured data, and it is not readily handled by conventional databases.
 Big Data must first be acquired, then processed and analyzed.
Characteristics of Big Data
 Volume: represents huge amounts of data.
 Velocity (speed): the rapid rate at which data arrives.
 Variety (mixture): unstructured and semi-structured data types.
 Value: data has inherent value, but that value must be discovered.
Hadoop: a framework of open-source tools, libraries, and methods for "big data" analysis.
 Hadoop combines two essential components: the Hadoop Distributed File System (HDFS), which provides storage and management, and MapReduce, which provides processing capability.
Why Hadoop?
 A flexible framework of hardware and software with built-in application-level fault tolerance.
Why use Hadoop?
 Hadoop is reliable and simple to use.
 Stores extensive data sets, with effectively unlimited storage capacity on an adaptable framework.
 100% open-source package.
 Quick recovery from system failures.
 Able to process large amounts of data quickly, in parallel.
 Data written to HDFS once can be read many times (write once, read many).
Contents of Hadoop
1. Hadoop Distributed File System
(HDFS):
 HDFS has a master/slave architecture; a client first interacts with the master, which runs the NameNode and Secondary NameNode daemons.
 Every slave runs both the DataNode and TaskTracker daemons, which communicate with and receive instructions from the master node.
 The TaskTracker daemon is the slave to the JobTracker, and the DataNode daemon is the slave to the NameNode.
2. MapReduce: a simple programming model built around key-value pairs.
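To make the key-value model concrete, the classic word count is sketched below in Java against the Hadoop 1.x API used in this work; the class is illustrative, not code from the deck. The map phase emits a (word, 1) pair for every token, and the reduce phase sums the counts for each word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Mapper: input key = byte offset in the file, input value = one line of text.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        out.collect(word, ONE);            // emit the key-value pair (word, 1)
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) and emits (word, total).
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS output dir
    JobClient.runJob(conf);
  }
}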
Scheduling in Hadoop
 Scheduling is the policy that determines when a job executes its tasks. Nodes communicate with each other, and across clusters, over the TCP/IP suite through the distributed file system.
 Relevant factors include processing time and the communication cost of data transmission, along with the available bandwidth.
 The need for scheduling arises from multiplexing and multitasking.
 A fresh Hadoop setup ships with only the First-In First-Out (FIFO) scheduler.
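In Hadoop 1.x the active scheduler is a JobTracker setting. A minimal mapred-site.xml sketch showing the FIFO default (the class name is Hadoop's own; the snippet is not from the deck):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>  <!-- FIFO default -->
</property>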
Pig
 A scripting language intended to handle any sort of data set, including unstructured data.
 Pig comprises two parts: the language itself, called Pig Latin, and the Pig runtime environment in which Pig Latin programs are executed.
 Pig has two execution modes, Local Mode (pig -x local) and MapReduce Mode (pig or pig -x mapreduce), offering an ad-hoc way of creating and executing MapReduce jobs.
Pig Latin  High-level scripting language. Requires no metadata or schema. Statements are translated into a series of MapReduce jobs.
Grunt  Interactive shell.
Piggybank  Shared repository for User Defined Functions (UDFs).
Hive
 Data warehouse system for Hadoop.
 Provides tools for effortless extract/transform/load (ETL) of records stored directly in HDFS or in other storage systems such as HBase.
 Holds metadata that describes the data residing in HDFS and HBase, not the data itself.
 Uses a straightforward SQL-like query language called HiveQL.
 Queries are executed through MapReduce.
With the aim of optimizing schedulers, the framework allows jobs to complete in a timely manner while letting users who submit queries get results back in a reasonable time. Users therefore have more freedom to adopt the most appropriate scheduler, or other techniques, according to their requirements.
Problem Statement
The First-In First-Out (FIFO) scheduler allows one job to take all task slots in the cluster; no other job can utilize the cluster until the current one completes. Consequently, jobs that arrive later or with a lower priority are blocked by those ahead in the queue. When the total number of jobs is large, the FIFO scheduler causes significant delay.
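A hypothetical illustration with made-up numbers: if a job holding every slot runs for 60 minutes and a 1-minute job is queued behind it, the small job completes only after about 60 + 1 = 61 minutes, roughly 61 times its own running time.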
Proposed System
The proposed system is a novel framework for the optimization approach.
1. Fair Scheduler
The Fair Scheduler assigns smaller jobs to any task slots that become free while other tasks are still executing.
Merits of the Fair Scheduler
 Even when the cluster is shared with large jobs, it allows small jobs to run quickly.
 Unlike the default FIFO scheduler, fair scheduling picks a queued job to run whenever a slot is free, without starving either large or small jobs.
 Serves multiple slots with guaranteed levels of job execution in a shared cluster, and is simple to configure and administer.
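A minimal configuration sketch for enabling the Fair Scheduler in Hadoop 1.x, assuming the contrib fair-scheduler jar is on the JobTracker classpath; the pool name and minimum shares below are illustrative, not from the deck:

mapred-site.xml:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/local/hadoop/conf/fair-scheduler.xml</value>
</property>

fair-scheduler.xml:
<?xml version="1.0"?>
<allocations>
  <pool name="smalljobs">       <!-- hypothetical pool -->
    <minMaps>2</minMaps>        <!-- guaranteed map slots -->
    <minReduces>1</minReduces>  <!-- guaranteed reduce slots -->
    <weight>1.0</weight>
  </pool>
</allocations>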
2. Capacity Scheduler
 The central concept of the Capacity Scheduler is scheduling queues.
 It is designed for large clusters: the cluster is partitioned among multiple users, each with a guaranteed minimum capacity.
Merits of the Capacity Scheduler
 Capacity guarantees: multiple queues support simultaneous execution of the jobs users submit.
 Elasticity: jobs can be assigned resources beyond a queue's capacity when resources are free, which improves overall cluster utilization.
 Multi-tenancy: provides system resources to user-created queues, which link to the JobTracker to execute jobs.
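A configuration sketch for the Capacity Scheduler in Hadoop 1.x, declaring the two queues (queueA, queueB) used by the job submissions later in this deck; the 50/50 capacity split is an assumption:

mapred-site.xml:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>queueA,queueB</value>
</property>

capacity-scheduler.xml:
<property>
  <name>mapred.capacity-scheduler.queue.queueA.capacity</name>
  <value>50</value>  <!-- percent of the cluster guaranteed to queueA -->
</property>
<property>
  <name>mapred.capacity-scheduler.queue.queueB.capacity</name>
  <value>50</value>
</property>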
Objective
 To parallelize job execution, whether in standalone mode or across a cluster.
 To estimate the processing time of all parallel applications, both small and long-running jobs, by distributing tasks to the schedulers.
Methodology
The HDFS framework works on block size. A general parallel file system supports block sizes of 16 KB to 4 MB, with a typical default of 256 KB, whereas HDFS uses a default block size of 64 MB.
Optimization Techniques
1. MapCombineReduce (a combiner sketch follows this list)
2. Distinctive Block Size
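MapCombineReduce runs a combiner, a local reduce on each mapper's output, to shrink the data shuffled to the reducers. With the word-count classes sketched earlier (hypothetical names), enabling it is one line in the job setup:

JobConf conf = new JobConf(WordCount.class);
conf.setMapperClass(WordCount.Map.class);
conf.setCombinerClass(WordCount.Reduce.class);  // combine (word, 1) pairs locally before the shuffle
conf.setReducerClass(WordCount.Reduce.class);

A combiner is safe here because summing counts is associative and commutative, so applying the reduce logic early does not change the final result.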
 dfs.block.size: file system block size, default 67108864 bytes (64 MB).
 E.g., if input data size = 1 GB and dfs.block.size = 64 MB, the minimum number of maps is (1*1024)/64 = 16 maps.
 E.g., if input data size = 1 GB and dfs.block.size = 128 MB (134217728 bytes), the minimum number of maps is (1*1024)/128 = 8 maps.
File Size | Time for Moving Data to HDFS (64 MB / 128 MB) | Total Number of Blocks Created (64 MB / 128 MB)
1.3 GB | 0.48 sec / 0.38 sec | 21 / 11
2.7 GB | 1.38 sec / 1.17 sec | 41 / 21
4.0 GB | 2.32 sec / 1.58 sec | 61 / 31
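To reproduce runs like these, the block size can be set cluster-wide in hdfs-site.xml or per command; a sketch with illustrative file names (67108864 bytes = 64 MB, 134217728 bytes = 128 MB):

hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128 MB for newly written files -->
</property>

Per-copy override from the shell:
$ bin/hadoop fs -D dfs.block.size=134217728 -put weblog.txt /user/hduser/inputs/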
S/W and H/W Specification
Software Specification
 Operating System: Ubuntu 12.04 LTS.
 Java (JDK 6.1), required to run *.jar files.
 Hadoop: hadoop-1.0.3.tar.gz (stable release).
 Pig: apache-pig-0.14.0.tar.gz.
 Hive: apache-hive-0.13.1-bin.tar.gz.
Hardware Specification
 Main processor: P4, 2 GHz or higher.
 Secondary memory: 200 GB.
 Primary memory: 4 GB or higher.
Results
Starting the Hadoop multi-node cluster:
$ cd /usr/local/hadoop/bin
$ ./start-all.sh
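To verify the daemons came up (a quick check, not shown in the deck), jps should list NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker on a single node:
$ jps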
Displaying Master Node and
available cluster summary
 Using a web browser, open the NameNode URL, e.g. localhost:50070 or master:50070
Listing Total Number of Files in
HDFS.
 $ bin/hadoop dfs -ls
Input file of weblogs Stored in
HDFS
Movie dataset Input File Stored in
HDFS
Job Queue Scheduling Information
with Default Scheduler
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor
/user/hduser/nasa_input /user/hduser/weblog_output/
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/
Scheduling Information with FAIR
Scheduler.
 Using a web browser, open the Hadoop MapReduce (JobTracker) URL, e.g. localhost:50030
$ time bin/hadoop jar WeblogHits.jar WeblogHits
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblog_output
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogs_output1
$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogg_output2
Job Summary for QueueA using
Capacity Scheduler
$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/capacity/ouputs/weblog_output
Job Summary for QueueB using
Capacity Scheduler.
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor -Dmapred.job.queue.name=queueB
/user/hduser/heterogeneous/inputs/weblogtime
/user/hduser/heterogeneous/capacity/ouputs/weblogs_output1
Output for Web-Logs with Total
Number of Hits on Particular Links.
Output for Web-Logs with Total
Number of Messages and their Sizes.
[Bar chart: message counts (0 to 9,000,000) across 26 message-size buckets.]
Output for Web-Logs with Time of Day (0-23
Hours) versus Number of Users.
[Bar chart: number of users (0 to 4,000,000) for each hour of the day, 0-23.]
Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Single Node Cluster
[Bar chart, proposed Hadoop system: Fair 64 MB: 3.7, Fair 128 MB: 2.92, Capacity 64 MB: 3.6, Capacity 128 MB: 3.25.]
Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Single Node Cluster
[Bar chart, proposed Hadoop system: Fair 64 MB: 26.11, Fair 128 MB: 19.04, Capacity 64 MB: 29.96, Capacity 128 MB: 21.24.]
Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Multi-Node Cluster.
[Bar chart, proposed Hadoop system: Fair 64 MB: 1.54, Fair 128 MB: 1.5, Capacity 64 MB: 3.79, Capacity 128 MB: 1.66.]
Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Multi-Node Cluster.
[Bar chart, proposed Hadoop system: Fair 64 MB: 7.38, Fair 128 MB: 6.38, Capacity 64 MB: 13.47, Capacity 128 MB: 12.26.]
Results for Movie Dataset using
Pig
 Running Pig in local mode:
 pig -x local
 movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') as (id, name, year, rating, duration);
 DUMP movies;
Filter:
 List the movies that were released between 1950 and 1960
 movies_1950_1960 = FILTER movies BY (float)year>1949 and (float)year<1961;
 store movies_1950_1960 into '/Users/Rich/Desktop/Demo/movies_1950_1960';
 Foreach Generate:
 List movie names and their duration (in minutes)
 movies_name_duration = foreach movies generate name, (float)duration/60; -- duration assumed to be stored in seconds
 store movies_name_duration into '/Users/Rich/Desktop/Demo/movies_name_duration';
Order:
 List all movies in descending order of year
 movies_year_sort = order movies by year desc;
 store movies_year_sort into '/Users/Rich/Desktop/Demo/movies_year_sort';
Results for Movie Dataset using
Hive
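The deck does not show the table definition. Assuming a schema mirroring the CSV used in the Pig section (with the last column named length, since the deck's queries filter on length), a DDL like the following would be needed first; the path is hypothetical:

hive> CREATE TABLE movies (id INT, name STRING, year INT, rating FLOAT, length INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/hduser/movies_data.csv' INTO TABLE movies;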
Displaying the contents of the movie data set:
Hive> select * from movies;
Result for Particular Year of Movies
Released.
Hive> select * from movies where year = 1995;
Finding all movies exceeding a specified
length.
Hive> select * from movies where length > 3000;
Retrieve information using GROUP
BY clause.
Hive> select year, COUNT(1) from movies GROUP
BY year;
Results for heterogeneous datasets using Hadoop, Pig and Hive

              Hadoop   Pig    Hive
Word Count    2.367    1.58   1.52
Movie Rating  1.53     1.45   1.57

[Bar chart of the same WordCount and MovieRating values.]
Conclusion
This effort is intended to give a high-level summary of what Big Data is and how to address the issues raised by the four V's: data is stored in HDFS under various configuration parameters, and Hadoop, Pig, and Hive are set up to retrieve useful information from bulky data sets.