YouTube Big Data Analysis Using Hadoop, Pig, and Hive

A
seminar on
Practical Training
on
Big Data and Hadoop
SUBMITTED BY:
Pankaj Chhipa
Final year, CS
Roll No. 15/276
DEPARTMENT OF ELECTRONICS ENGINEERING
RAJASTHAN TECHNICAL UNIVERSITY
KOTA
SUBMITTED TO:-
Mrs. Jyoti Yaduwanshi

YouTube Data Analysis Using Hadoop

Purpose and Scope
 "YouTube has over a billion users, and every day
people watch hundreds of millions of hours on
YouTube and generate billions of views."
 "Every day, people across the world are uploading
1.2 million videos to YouTube, or over 100 hours
per minute, and this number is ever increasing" [3].
 A relational SQL database is not enough to analyze
and understand activity on such a massive scale.
Data of this kind is well suited to a massively
parallel, distributed system like Hadoop.

 The main objective of this project is to show
how data generated from YouTube can be
mined and used by different companies to
make targeted, real-time, and informed
decisions about their products that can increase
their market share.

This project uses the following concepts and tools
throughout its lifecycle:
 Java API
 Hadoop
 Hive
 Pig

 In this project we fetch a specific channel's YouTube data
using the YouTube API. We use the Google Developers
Console to generate a unique access key, which is required
to fetch public YouTube channel data. Once the API key is
generated, a Java-based console application uses the
YouTube API to fetch video information based on search
criteria.
 The text file generated by the console application is
stored in HDFS and then loaded into a Hive database. HDFS
is a core Hadoop component, and a user can interact with it
directly using the shell-like commands that Hadoop
supports. We then run Hive queries on the data to extract
meaningful output that management can use for analysis.
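The extraction step described above can be sketched as follows. This is a minimal illustration in Python rather than the Java used by the project: it only builds a request URL for the `search` method of the YouTube Data API v3 and serializes records to comma-delimited text. The helper names, the sample records, and the placeholder key `YOUR_API_KEY` are assumptions made for illustration, not part of the original application.

```python
import csv
import io
from urllib.parse import urlencode

# Public endpoint of the YouTube Data API v3 search method.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, query, max_results=25):
    """Return the request URL for a video search (no network call is made)."""
    params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "key": api_key,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to comma-delimited text, one video per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample records; the field names are illustrative only.
sample = [
    {"id": "v1", "uploader": "alice", "category": "Music"},
    {"id": "v2", "uploader": "bob", "category": "Comedy"},
]
url = build_search_url("YOUR_API_KEY", "hadoop tutorial", max_results=10)
text = records_to_csv(sample, ["id", "uploader", "category"])
```

The resulting text file would then be copied into HDFS (for example with `hdfs dfs -put`) before being loaded into Hive.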

YouTube data extraction -> Load data into HDFS -> Hive (databases) -> Process data using Hive commands -> Result

 Wikipedia defines Big Data as "a collection of data
sets so large and complex that it becomes difficult
to process using the available database
management tools."
 The challenges include how to capture, curate,
store, search, share, analyze, and visualize Big
Data.
 In today's environment, we have access to more
types of data. These data sources include online
transactions, social networking activities, mobile
device services, internet gaming, etc.

 Big Data is a collection of data sets that are large
and complex in nature. They contain both
structured and unstructured data and grow so
large, so fast, that traditional relational database
systems cannot manage them.
 Big Data is defined as any kind of data source that
has at least three shared characteristics:
 Extremely large Volumes of data
 Extremely high Velocity of data
 Extremely wide Variety of data

 Hadoop is a framework for storing data on large clusters.
 It is also a framework for manipulating large amounts
of data.
Hadoop consists of two main components:
 A distributed processing framework named
MapReduce (which is now supported by a component
called YARN, Yet Another Resource Negotiator), and
 A distributed file system known as the Hadoop
Distributed File System, or HDFS.

 Hadoop provides a programming model called
MapReduce for distributed processing of large
data sets.
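To make the model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in Python. It is not Hadoop code: the function names, the sample records, and the assumption that the category is the fourth comma-separated field are illustrative only. On a real cluster the same three phases run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (category, 1) pair for one input record."""
    fields = record.split(",")
    yield fields[3], 1  # assumed layout: category is the fourth field


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


# Hypothetical records: id, uploader, age, category, rating, views
records = [
    "1,alice,17,Music,4,100",
    "2,bob,25,Comedy,5,250",
    "3,carol,15,Music,3,80",
]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Pig and Hive generate jobs of this shape from higher-level scripts and queries, which is why the analysis in this project does not need hand-written MapReduce code.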

1. Identify the top 5 categories in which the most videos
are uploaded.
 A = load '/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 B = group A by $3; -- $3 is the category
 C = foreach B generate group,COUNT(A);
 D = order C by $1 DESC;
 E = limit D 5;
 dump E;

2. Top ten highest rated videos
 A = load
'/home/pankaj/Documents/pankaj1/hive-
PigStorage(',');
 X = filter A by $0 != 'ID'; -- skip the header row
 B = foreach X generate $0,(int)$4; -- cast so ratings sort numerically
 C = order B by $1 DESC;
 D = limit C 10;
 dump D;

3. Top ten most viewed videos
 A = load
'/home/pankaj/Documents/pankaj1/hive-
PigStorage(',');
 X = filter A by $0 != 'ID'; -- skip the header row
 B = foreach X generate $0,$3,(int)$5; -- $5 is the view count
 C = order B by $2 DESC;
 D = limit C 10;
 dump D;

4. Top 10 longest videos
 A = load
'/home/pankaj/Documents/pankaj1/hive-
PigStorage(',');
 X = filter A by $0 != 'ID'; -- skip the header row
 B = foreach X generate $0,$3,(int)$7; -- $7 is the video length
 C = order B by $2 DESC;
 D = limit C 10;
 dump D;

5. How many people under 18 years old have
uploaded a video?
PigStorage(',');
 X = filter A by $0 != 'ID'; -- skip the header row
 B = foreach X generate $1,(int)$2,$5; -- uploader, age, views
 C = filter B by $1 < 18 and $2 != 0;
 D = group C by $0; -- one group per uploader
 E = foreach D generate group;
 F = group E all;
 G = foreach F generate COUNT(E); -- number of distinct uploaders
 dump G;
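The question asks for a number of people rather than a number of videos, so an uploader with several videos should be counted once. A minimal Python sketch of that distinct-count logic, using hypothetical in-memory rows of (uploader, age) rather than the project's data file:

```python
def count_minor_uploaders(rows):
    """Count distinct uploaders whose age is below 18."""
    uploaders = {uploader for uploader, age in rows if age < 18}
    return len(uploaders)


# Hypothetical sample: alice appears twice but is counted once.
rows = [("alice", 17), ("alice", 17), ("bob", 25), ("carol", 15)]
result = count_minor_uploaders(rows)
```

Grouping by uploader first and then counting the groups, as a Pig script would, yields the same figure as the set-based count here.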

6. How many videos in each category were uploaded
by people under 18 years old?
PigStorage(',');
 X = filter A by $0 != 'ID'; -- skip the header row
 B = foreach X generate $3,(int)$2; -- category, age
 Z = filter B by $1 < 18;
 C = group Z by $0;
 D = foreach C generate group,COUNT(Z);
 dump D;

Step 1: Use the following command to create a
table in Hive
 hive> create table youtube (a1 int, a2 string, a3 int,
a4 string, a5 int, a6 int, a7 string, a8 int) ROW
FORMAT DELIMITED FIELDS TERMINATED BY
',' STORED AS TEXTFILE;
This command creates a Hive table named
'youtube' in which each row is delimited text and
the fields within a row are terminated by commas.

Step 2: Load YouTube data into the Hive table
Use the command given below to load YouTube
data into the Hive table created in Step 1:
 hive> load data local inpath
'/home/pankaj/Desktop/YouTube.csv' overwrite into table
youtube;

1. Identify the top 5 categories in which the most
videos are uploaded.
 hive> select a4, count(a6) as l FROM youtube
group by a4 ORDER BY l DESC LIMIT 5;
output

2. Top ten highest rated videos
 hive> select a1, a5 FROM youtube ORDER BY a5
DESC LIMIT 10;
output

3. Top 10 longest videos
 hive> select a1, a4, a8 FROM youtube ORDER BY
a8 DESC LIMIT 10;
output

4. How many people under 18 years old have
uploaded a video?
 hive> select count(DISTINCT a2) FROM youtube
where a3 < 18;
output

5. How many videos in each category were uploaded
by people under 18 years old?
 hive> select a4, count(a2) FROM youtube where
a3 < 18 group by a4;
output

6. Number of comments per video
 hive> select a4,count(a7) FROM YouTube group
by a4;
output

7. Sort the data by uploader name
 hive> select * FROM youtube ORDER BY a2;
output
