This work aims to give a high-level summary of Big Data, the challenges characterized by the four V's, and how data stored in HDFS under various configuration parameters can be processed by setting up Hadoop, Pig and Hive to retrieve useful information from bulky data sets.
2. Introduction
Big Data: Information is becoming both more accessible and more affordable to process by computers.
A large portion of the world's flow of Big Data is unmanaged content such as words, pictures and video on the Web, along with floods of sensor data.
This is called unstructured information, and it is not readily handled by conventional databases.
Big Data must first be acquired, then processed and analyzed.
3. Characteristics of Big Data
Volume: the sheer amount of information.
Velocity (speed): the rapid rate at which information arrives.
Variety (mixture): unstructured and semi-structured data types.
Value: information has inherent value, but that value must be discovered.
4. Hadoop: A framework of open-source tools, libraries
and techniques for big data analysis.
Hadoop combines two essential components: the
Hadoop Distributed File System (HDFS), which
provides storage and management, and MapReduce,
which provides processing capability.
Why Hadoop? An adaptable framework of hardware and
software with built-in application-level fault tolerance.
5. Why Use Hadoop?
Hadoop is reliable and simple.
Used to store extensive information, offering effectively
unlimited storage capacity in an adaptable framework.
It is a 100% open-source package.
Quick recovery from system failures.
Able to handle large amounts of information quickly,
in parallel.
Information written once to HDFS can be read
many times.
6. Contents of Hadoop
1. Hadoop Distributed File System
(HDFS):
HDFS follows a master/slave architecture; the
client first interacts with the master through
the NameNode (assisted by a Secondary
NameNode).
Every slave runs both a DataNode and a
TaskTracker daemon, which communicate with
and receive instructions from the master node.
The TaskTracker daemon is the slave to the
JobTracker, and the DataNode daemon is the
slave to the NameNode.
7. 2. MapReduce: A simple programming model
built around the use of key-value pairs.
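As an illustration not taken from the original slides, the map/shuffle/reduce flow over key-value pairs can be mimicked with standard Unix tools, where `tr` plays the mapper, `sort` the shuffle, and `uniq -c` the reducer of a word count job:

```shell
# Word count as a map -> shuffle -> reduce pipeline:
#   map:     tr emits one (word) record per line
#   shuffle: sort groups identical keys together
#   reduce:  uniq -c outputs (count, word) per key group
printf 'big data big hadoop\n' |
  tr ' ' '\n' |
  sort |
  uniq -c
```

Hadoop Streaming applies the same pattern at cluster scale, with mapper and reducer scripts standing in for `tr` and `uniq -c`.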
8. Scheduling in Hadoop
Scheduling is a policy used to determine when a
job executes its tasks. Nodes communicate with each
other, and across clusters, over the TCP/IP suite
through the distributed file system.
Scheduling must account for factors such as processing
time and the communication cost of data transmission
given the available bandwidth.
The need for scheduling stems from multiplexing
and multitasking.
A fresh Hadoop setup uses only the First-In
First-Out (FIFO) scheduler.
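For reference, in the Hadoop 1.x releases used later in this deck, this FIFO behaviour comes from the default task scheduler; the corresponding mapred-site.xml property is sketched below (it normally need not be set explicitly, since it is the default):

```xml
<!-- mapred-site.xml: the default (FIFO) task scheduler in Hadoop 1.x -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
</property>
```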
9. Pig
A scripting language intended to handle any sort of data set,
including unstructured information.
Pig comprises two parts: the language itself, called
Pig Latin, and the Pig runtime environment in which
Pig Latin programs are executed.
Pig has two execution modes: local mode (pig -x local) and
MapReduce mode (pig or pig -x mapreduce), offering an ad-hoc way
of creating and executing MapReduce jobs.
Pig Latin: the high-level scripting language.
Requires no metadata or schema.
Statements are translated into a series of MapReduce
jobs.
Grunt: the interactive shell.
Piggybank: a shared repository of User Defined Functions (UDFs).
10. Hive
A data warehouse system for Hadoop.
Provides tools for effortless extract/transform/load
(ETL) of records stored directly in HDFS or in other
storage systems such as HBase.
Contains metadata describing how to access
information in HDFS and HBase, not the information
itself.
Uses a straightforward SQL-like query language
called HiveQL.
Queries are executed through MapReduce.
11. By optimizing schedulers, the framework
allows jobs to complete in a timely manner while
letting users who submit queries get results back in a
reasonable time. Users thus have more freedom to
adopt the most appropriate scheduler or other
techniques according to their requirements.
12. Problem Statement
First in first out (FIFO) scheduler approach allows one
job to take all task slots within the cluster, i.e., no other jobs
can utilize the cluster until the current one completes.
Consequently, jobs that arrive later or with a lower
priority are blocked by those ahead in the queue. When
the total number of jobs is large, the FIFO scheduler
causes significant delay.
13. Proposed System
The proposed system is a novel framework for the optimization approach.
1. Fair Scheduler
The Fair Scheduler assigns smaller jobs to any slots that become
available while other tasks are executing.
Merits of the Fair Scheduler
Allows small jobs to run quickly even when the cluster is shared
with large jobs.
Unlike the default FIFO scheduler, fair scheduling picks a job from the
queue to run a task whenever a slot is free, without starving large or small jobs.
Provides multiple pools with guaranteed levels of job execution
in a shared cluster, and is simple to configure and administer.
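With the Hadoop 1.0.3 release listed in the specification, the Fair Scheduler is typically enabled through mapred-site.xml along the following lines (a sketch; the allocation-file path is an example, and pools are defined in that separate file):

```xml
<!-- mapred-site.xml: switch the JobTracker to the Fair Scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- example path to the pool allocation file -->
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/local/hadoop/conf/fair-scheduler.xml</value>
</property>
```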
14. 2. Capacity Scheduler
The core concept provided by the Capacity Scheduler is
scheduling queues.
It is designed for large clusters, guaranteeing each queue
a minimum capacity while the cluster is shared among
multiple users.
Merits of the Capacity Scheduler
Capacity guarantees: multiple queues support simultaneous
execution of the jobs users submit.
Elasticity: jobs can be assigned beyond a queue's capacity when
resources are freely available, which improves overall cluster utilization.
Multi-tenancy: system resources are provided to user-created
queues, which link to the JobTracker to execute jobs.
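Again as a sketch for Hadoop 1.x: the Capacity Scheduler is selected in mapred-site.xml, and the queues (queueA and queueB, as used in the job commands later) receive their shares in capacity-scheduler.xml; the 50/50 split shown here is an example value:

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>queueA,queueB</value>
</property>

<!-- capacity-scheduler.xml: per-queue capacity (example 50/50 split) -->
<property>
  <name>mapred.capacity-scheduler.queue.queueA.capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.queueB.capacity</name>
  <value>50</value>
</property>
```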
15. Objective
To parallelize job execution across standalone mode
or a cluster.
To estimate the processing time of parallel
applications, such as small or long-running jobs, by
distributing tasks to the schedulers.
16. Methodology
The HDFS framework works on block size. A general parallel file
system supports block sizes of 16 KB to 4 MB with a default block
size of 256 KB, whereas HDFS uses a default block size of 64 MB.
20. 2. Distinctive Block Size
dfs.block.size: file system block size, default 67108864 bytes (64 MB).
E.g. if input data size = 1 GB and dfs.block.size = 64 MB,
then the minimum number of maps is (1*1024)/64 = 16
maps.
E.g. if input data size = 1 GB and dfs.block.size = 128
MB (134217728 bytes), then the minimum number of maps is
(1*1024)/128 = 8 maps.
File Size   Time for Moving Data to HDFS   Total Number of Blocks Created
            64 MB      | 128 MB            64 MB | 128 MB
1.3 GB      0.48 sec   | 0.38 sec          21    | 11
2.7 GB      1.38 sec   | 1.17 sec          41    | 21
4.0 GB      2.32 sec   | 1.58 sec          61    | 31
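The block counts and minimum map counts above follow from a ceiling division of input size by block size; a small sketch (sizes in MB, function name illustrative):

```shell
# Minimum number of map tasks (= number of HDFS blocks)
# is ceil(input_size / block_size).
min_maps() {
  input_mb=$1
  block_mb=$2
  echo $(( (input_mb + block_mb - 1) / block_mb ))
}

min_maps 1024 64    # 1 GB with 64 MB blocks    -> 16
min_maps 1024 128   # 1 GB with 128 MB blocks   -> 8
min_maps 1331 64    # ~1.3 GB with 64 MB blocks -> 21, matching the table
```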
21. S/W and H/W Specification
Software Specification
Operating System: Ubuntu 12.04 LTS.
Java (Jdk 6.1) is required to run *.jar files.
Hadoop Version-hadoop-1.0.3.tar.gz Stable Release.
Pig: apache- pig-0.14.0.tar.gz.
Hive: apache-hive-0.13.1-bin.tar.gz.
Hardware Specification
Main processor: P4, 2 GHz or higher
Secondary memory: 200 GB
Primary memory: 4 GB or higher
27. Job Queue Scheduling Information
with Default Scheduler
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor
/user/hduser/nasa_input /user/hduser/weblog_output/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/
28. Scheduling Information with FAIR
Scheduler.
Using a web browser, open the Hadoop MapReduce status page, e.g. localhost:50030
$ time bin/hadoop jar WeblogHits.jar WeblogHits
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblog_output
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogs_output1
$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar
WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogg_output2
30. Job Summary for QueueA using
Capacity Scheduler
$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA
/user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/capacity/ouputs/weblog_output
31. Job Summary for QueueB using
Capacity Scheduler.
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar
WeblogTimeOfDayHistogramCreator -Dmapred.job.queue.name=queueB
/user/hduser/heterogeneous/inputs/weblogtime
/user/hduser/heterogeneous/capacity/ouputs/weblogs_output1
37. Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Single-Node Cluster
[Bar chart] Fair 64 MB: 3.7, Fair 128 MB: 2.92, Capacity 64 MB: 3.6, Capacity 128 MB: 3.25
38. Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Single-Node Cluster
[Bar chart] Fair 64 MB: 26.11, Fair 128 MB: 19.04, Capacity 64 MB: 29.96, Capacity 128 MB: 21.24
39. Performance Analysis of Proposed Hadoop
System using Homogeneous Hadoop Jobs with
Multi-Node Cluster
[Bar chart] Fair 64 MB: 1.54, Fair 128 MB: 1.5, Capacity 64 MB: 3.79, Capacity 128 MB: 1.66
40. Performance Analysis of Proposed Hadoop
System using Heterogeneous Hadoop Jobs with
Multi-Node Cluster
[Bar chart] Fair 64 MB: 7.38, Fair 128 MB: 6.38, Capacity 64 MB: 13.47, Capacity 128 MB: 12.26
41. Results for Movie Dataset using Pig
Running Pig in local mode:
pig -x local
movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
DUMP movies;
42. Filter:
List the movies that were released between 1950 and 1960
movies_1950_1960 = FILTER movies BY (float)year>1949 and (float)year<1961;
store movies_1950_1960 into '/Users/Rich/Desktop/Demo/movies_1950_1960';
43. Foreach Generate:
List movie names and their duration in hours (assuming duration is stored in seconds):
movies_name_duration = foreach movies generate name, (float)duration/3600;
store movies_name_duration into '/Users/Rich/Desktop/Demo/movies_name_duration';
44. Order:
List all movies in descending order of year:
movies_year_sort = order movies by year desc;
store movies_year_sort into '/Users/Rich/Desktop/Demo/movies_year_sort';
45. Results for Movie Dataset using Hive
Displaying the contents of the movie data set:
hive> select * from movies;
46. Results for Movies Released in a Particular Year
hive> select * from movies where year = 1995;
47. Finding all Movies Longer than a Specified Length
hive> select * from movies where length > 3000;
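These queries presuppose a movies table already defined in Hive; a hypothetical sketch of the DDL is below. The column names follow the Pig schema earlier, except that the final query filters on a column named length, so that name is used for the running time; the comma delimiter matches the CSV loaded in the Pig examples.

```sql
-- Hypothetical DDL for the movies table assumed by the HiveQL queries above
CREATE TABLE movies (
  id INT,
  name STRING,
  year INT,
  rating FLOAT,
  length INT  -- running time; the length > 3000 query filters on this column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```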
49. Results for Heterogeneous Datasets using Hadoop, Pig and Hive
              Hadoop   Pig    Hive
Word Count    2.367    1.58   1.52
Movie Rating  1.53     1.45   1.57
50. Conclusion
This work aimed to give a high-level summary of
Big Data, the challenges characterized by the four
V's, and how data stored in HDFS under various
configuration parameters can be processed by
setting up Hadoop, Pig and Hive to retrieve useful
information from bulky data sets.