SCHEDULERS OPTIMIZATION TO
HANDLE MULTIPLE JOBS IN
HADOOP CLUSTER
By
Shivaraj B G
4th Sem
Department of Computer Networking, MITE.
Introduction
Big Data: information is becoming more accessible to, and more
usable by, computers.
 A large portion of the Big Data flowing through the world is
uncontrolled material: words, pictures and video on the Web,
and streams of sensor data.
 This is called unstructured data, and it is not readily
handled by conventional databases.
 Big Data must first be acquired, then processed and analyzed.
Characteristics of Big Data
 Volume: the sheer amount of data.
 Velocity (speed): the rapid rate at which data arrives.
 Variety (mixture): unstructured and semi-structured data
types.
 Value: data has inherent value, but that value must be
discovered.
Hadoop: a framework of open-source tools, libraries and
techniques for 'big data' analysis.
 Hadoop combines two essential components: the Hadoop
Distributed File System (HDFS), which provides storage and
management, and MapReduce, which provides processing
capability.
Why Hadoop?
 A scalable framework across hardware and software, with
failure handling built in at the application level.
 Hadoop is reliable and simple to use.
 Stores extensive data, with effectively unlimited storage
capacity in an adaptable framework.
 100% open source.
 Quick recovery from system failures.
 Able to process large amounts of data quickly, in parallel.
 Data written once to HDFS can be read many times.
Contents of Hadoop
1. Hadoop Distributed File System (HDFS):
 HDFS has a master/slave architecture; a client
first interacts with the master, which runs the
NameNode and Secondary NameNode daemons.
 Every slave runs both a DataNode and a
TaskTracker daemon, which communicate with
and receive instructions from the master node.
 The TaskTracker daemon is the slave of the
JobTracker, and the DataNode daemon is the
slave of the NameNode.
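On a running Hadoop 1.x cluster, this layout can be checked
with the JDK's jps tool (the output shown is illustrative;
process IDs will differ):
$ jps        # on the master node
4723 NameNode
4811 SecondaryNameNode
4902 JobTracker
$ jps        # on each slave node
3311 DataNode
3395 TaskTracker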
2. MapReduce: a simple programming model built
around the use of key-value pairs.
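As a sketch of the model, the word-count example bundled with
the Hadoop 1.0.3 release maps each word to the pair (word, 1),
and its reducer sums the values per key (the output path here
is hypothetical):
$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount \
    /user/hduser/nasa_input /user/hduser/wordcount_output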
Scheduling in Hadoop
 Scheduling is the policy that determines when a
job executes its tasks. Nodes communicate with
one another, and across clusters, over the TCP/IP
suite through the distributed file system.
 Relevant factors include processing time and the
communication cost of data transmission, along
with the available bandwidth.
 The need for scheduling arises from multiplexing
and multitasking.
 A fresh Hadoop setup uses only the First-In
First-Out (FIFO) scheduler.
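The active scheduler is selected by the
mapred.jobtracker.taskScheduler property in conf/mapred-site.xml;
when it is unset, Hadoop 1.x falls back to the FIFO scheduler. A
minimal sketch of pinning it explicitly:
$ cat conf/mapred-site.xml        # relevant excerpt (a sketch)
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
</property>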
Pig
 A scripting language intended to handle any sort of data set,
including unstructured data.
 Pig comprises two parts: the first is the language itself,
called Pig Latin; the second is the Pig runtime environment in
which Pig Latin programs are executed.
 Pig has two execution modes, Local mode (pig -x local) and
MapReduce mode (pig or pig -x mapreduce), giving an ad-hoc way
of creating and executing MapReduce jobs.
Pig Latin   High-level scripting language; requires no metadata
            or schema; statements are translated into a series
            of MapReduce jobs.
Grunt       Interactive shell.
Piggybank   Shared repository for User Defined Functions (UDFs).
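A minimal sketch of the classic Pig Latin word count, run in
local mode (input.txt is a hypothetical local sample file):
$ cat > wordcount.pig <<'EOF'
-- tokenize each line, group by word, count each group
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
EOF
$ pig -x local wordcount.pig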
Hive
 Data-warehouse system for Hadoop.
 Provides tools for effortless extract/transform/load (ETL)
of records stored directly in HDFS or in other storage
systems such as HBase.
 Holds metadata describing how to access the data in HDFS
and HBase, not the data itself.
 Uses a straightforward SQL-like query language called
HiveQL.
 Queries are executed through MapReduce.
With the aim of optimizing schedulers, the framework allows
jobs to complete in a timely manner while letting users who
submit queries get results back in a reasonable time, so users
have more freedom to adopt the scheduler or other techniques
most appropriate to their requirements.
Problem Statement
The First-In First-Out (FIFO) scheduler allows one job to take
all task slots within the cluster, i.e., no other job can use
the cluster until the current one completes. Consequently, jobs
that arrive later or with a lower priority are blocked by those
ahead of them in the queue. When the total number of jobs is
large, the FIFO scheduler therefore causes significant delay.
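The blocking is easy to reproduce: submit a long job and then a
short one behind it (the first jar name is taken from the later
slides; the output paths are illustrative):
$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor \
    /user/hduser/nasa_input /user/hduser/out_long &
# Under FIFO, this second, smaller job waits until the first
# releases all of its task slots:
$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount \
    /user/hduser/small_input /user/hduser/out_small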
Proposed System
The proposed system is a novel framework for the optimization
approach.
1. Fair Scheduler
The Fair Scheduler assigns smaller jobs to any slots that
become free while other tasks are still executing, so waiting
jobs start as soon as capacity appears.
Merits of the Fair Scheduler
 Small jobs run quickly even when the cluster is shared with
large jobs.
 Unlike the default FIFO scheduler, fair scheduling picks a
queued job to run a task whenever a slot is free, without
starving either large or small jobs.
 Provides multiple slots with guaranteed levels of job
execution in a shared cluster, and is simple to configure and
administer.
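In the Hadoop 1.0.3 release the Fair Scheduler ships as a
contrib jar; a minimal setup sketch (pool configuration
omitted):
$ cp contrib/fairscheduler/hadoop-fairscheduler-1.0.3.jar lib/
$ cat conf/mapred-site.xml        # relevant excerpt (a sketch)
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
After the JobTracker is restarted, the fair-share view appears
on the /scheduler page of the JobTracker web UI.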
2. Capacity Scheduler
 The central concept of the Capacity Scheduler is scheduling
queues.
 It is designed for large clusters: the cluster is partitioned
among multiple users, each of whom is guaranteed a minimum
share of capacity.
Merits of the Capacity Scheduler
 Capacity guarantees – multiple queues support simultaneous
execution of the jobs users submit.
 Elasticity – jobs can be assigned resources beyond their
queue's capacity when the cluster has free resources, which
improves overall cluster utilization.
 Multi-tenancy – system resources are granted to user-created
queues, which connect to the JobTracker to execute jobs.
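The Capacity Scheduler is likewise a contrib jar in Hadoop
1.0.3; a sketch of defining the two queues used in the runs
below (the capacity percentages are illustrative):
$ cp contrib/capacity-scheduler/hadoop-capacity-scheduler-1.0.3.jar lib/
$ cat conf/mapred-site.xml        # relevant excerpt (a sketch)
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,queueA,queueB</value>
</property>
$ cat conf/capacity-scheduler.xml # per-queue shares (a sketch)
<property>
  <name>mapred.capacity-scheduler.queue.queueA.capacity</name>
  <value>40</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.queueB.capacity</name>
  <value>40</value>
</property>
Jobs are then directed to a queue with
-Dmapred.job.queue.name=queueA, as in the runs shown later.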
Objective
 To parallelize job execution, whether in standalone mode or
across a cluster.
 To estimate the processing time of parallel applications,
both small and long-running jobs, by distributing their tasks
to the schedulers.
Methodology
The HDFS framework works on block size. A general parallel file
system supports block sizes of 16 KB to 4 MB with a default
block size of 256 KB, whereas HDFS uses a default block size of
64 MB.
Optimization Techniques
1. MapCombineReduce
2. Distinctive Block Size
 dfs.block.size: file-system block size, default 67108864
bytes (64 MB).
 E.g. if the input data size is 1 GB and dfs.block.size = 64
MB, the minimum number of maps is (1*1024)/64 = 16 maps.
 E.g. if the input data size is 1 GB and dfs.block.size = 128
MB (134217728 bytes), the minimum number of maps is
(1*1024)/128 = 8 maps.
File Size | Time for Moving Data to HDFS | Total Blocks Created
          | 64 MB      | 128 MB          | 64 MB   | 128 MB
1.3 GB    | 0.48 sec   | 0.38 sec        | 21      | 11
2.7 GB    | 1.38 sec   | 1.17 sec        | 41      | 21
4.0 GB    | 2.32 sec   | 1.58 sec        | 61      | 31
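The block size can be overridden per transfer when loading a
file, which is how the two columns above can be produced (file
and path names are illustrative):
$ bin/hadoop fs -D dfs.block.size=67108864  -put weblog.txt /user/hduser/in_64
$ bin/hadoop fs -D dfs.block.size=134217728 -put weblog.txt /user/hduser/in_128
$ bin/hadoop fsck /user/hduser/in_128 -files -blocks   # verify block count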
S/W and H/W Specification
Software Specification
 Operating System: Ubuntu 12.04 LTS.
 Java (JDK 1.6) is required to run *.jar files.
 Hadoop version: hadoop-1.0.3.tar.gz (stable release).
 Pig: apache-pig-0.14.0.tar.gz.
 Hive: apache-hive-0.13.1-bin.tar.gz.
Hardware Specification
Name             | Specification
Main processor   | P4, 2 GHz or higher
Secondary memory | 200 GB
Primary memory   | 4 GB or higher
Results
Starting the Hadoop multi-node cluster:
$ cd /usr/local/hadoop/bin
$ ./start-all.sh
Displaying Master Node and Available Cluster Summary
 In a web browser, open the master URL, e.g. localhost:50070
or master:50070.
Listing the Total Number of Files in HDFS
 $ bin/hadoop dfs -ls
Input file of weblogs Stored in
HDFS
Movie dataset Input File Stored in
HDFS
Job Queue Scheduling Information
with Default Scheduler
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor
/user/hduser/nasa_input /user/hduser/weblog_output/
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/
Scheduling Information with the Fair Scheduler
 In a web browser, open the Hadoop MapReduce URL, e.g.
localhost:50030.
$ time bin/hadoop jar WeblogHits.jar WeblogHits
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblog_output
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogs_output1
$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogg_output2
Job Summary for QueueA using
Capacity Scheduler
$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/capacity/ouputs/weblog_output
Job Summary for QueueB using
Capacity Scheduler.
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor -Dmapred.job.queue.name=queueB
/user/hduser/heterogeneous/inputs/weblogtime
/user/hduser/heterogeneous/capacity/ouputs/weblogs_output1
Output for Web-Logs with Total
Number of Hits on Particular Links.
Output for Web-Logs with Total Number of Messages by Size.
[Bar chart: number of messages per size bucket; 26 buckets,
y-axis from 0 to 9,000,000.]
Output for Web-Logs: Time of Day (0-23 Hours) versus Number of Users.
[Bar chart: number of users per hour of the day; x-axis: Time
(hours 0-23), y-axis: Users, from 0 to 4,000,000.]
Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Single Node Cluster
[Bar chart: Fair 64 MB: 3.7, Fair 128 MB: 2.92, Capacity 64 MB:
3.6, Capacity 128 MB: 3.25.]
Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Single Node Cluster
[Bar chart: Fair 64 MB: 26.11, Fair 128 MB: 19.04, Capacity 64
MB: 29.96, Capacity 128 MB: 21.24.]
Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Multi-Node Cluster.
[Bar chart: Fair 64 MB: 1.54, Fair 128 MB: 1.5, Capacity 64 MB:
3.79, Capacity 128 MB: 1.66.]
Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Multi-Node Cluster.
[Bar chart: Fair 64 MB: 7.38, Fair 128 MB: 6.38, Capacity 64
MB: 13.47, Capacity 128 MB: 12.26.]
Results for Movie Dataset using Pig
 Running Pig in local mode:
 pig -x local
 movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
 DUMP movies;
Filter:
 List the movies that were released between 1950 and 1960.
 movies_1950_1960 = FILTER movies BY (float)year > 1949 AND (float)year < 1961;
 STORE movies_1950_1960 INTO '/Users/Rich/Desktop/Demo/movies_1950_1960';
Foreach ... Generate:
 List movie names and their duration in minutes (the duration
field is stored in seconds, so divide by 60).
 movies_name_duration = FOREACH movies GENERATE name, (float)duration/60;
 STORE movies_name_duration INTO '/Users/Rich/Desktop/Demo/movies_name_duration';
Order:
 List all movies in descending order of year.
 movies_year_sort = ORDER movies BY year DESC;
 STORE movies_year_sort INTO '/Users/Rich/Desktop/Demo/movies_year_sort';
Results for Movie Dataset using Hive
Displaying the contents of the movie dataset:
hive> SELECT * FROM movies;
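These queries assume a movies table over the same CSV used in
the Pig examples; a minimal, hypothetical setup sketch (column
names are chosen to match the queries on the following slides):
hive> CREATE TABLE movies (id INT, name STRING, year INT,
    >                      rating FLOAT, length INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH 'movies_data.csv' INTO TABLE movies;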
Result for Movies Released in a Particular Year
hive> SELECT * FROM movies WHERE year = 1995;
Finding All Movies Longer than a Specified Length
hive> SELECT * FROM movies WHERE length > 3000;
Retrieve Information using the GROUP BY Clause
hive> SELECT year, COUNT(1) FROM movies GROUP BY year;
Results for Heterogeneous Datasets using Hadoop, Pig and Hive
             | Hadoop | Pig  | Hive
Word Count   | 2.367  | 1.58 | 1.52
Movie Rating | 1.53   | 1.45 | 1.57
Conclusion
This effort is intended to give a high-level summary of what
Big Data is and how to address the issues raised by the four
V's: data is stored in HDFS under various configuration
parameters, and Hadoop, Pig and Hive are set up to retrieve
useful information from bulky data sets.