A seminar on
Practical Training on
Big Data and Hadoop
SUBMITTED BY:
Pankaj chhipa
Final year, CS
Roll No. 15/276
DEPARTMENT OF ELECTRONICS ENGINEERING
RAJASTHAN TECHNICAL UNIVERSITY
KOTA
SUBMITTED TO:-
Mrs. Jyoti Yaduwanshi
YouTube Data Analysis
Using Hadoop
Purpose and Scope
 "YouTube has over a billion users and every day
people watch hundreds of millions of hours on
YouTube and generate billions of views."
 "Every day, people across the world are uploading
1.2 million videos to YouTube, or over 100 hours
per minute and this number is ever increasing" [3].
To analyze and understand activity occurring
on such a massive scale, a relational SQL database
is not enough. Data of this kind is well suited to a
massively parallel and distributed system like
Hadoop.
 The main objective of this project is to show
how data generated from YouTube can be
mined and used by different companies to
make targeted, real-time, and informed
decisions about their products that can increase
their market share.
This project uses the following concepts and tools
throughout its lifecycle:
 Java API
 Hadoop
 Hive
 Pig
 In this project we fetch a specific channel's YouTube data
using the YouTube API. We use the Google Developers
Console to generate a unique access key, which is required
to fetch public YouTube channel data. Once the API key is
generated, a Java-based console application uses the
YouTube API to fetch video information based on
search criteria.
 The text file output generated by the console application
is then loaded from an HDFS file into a Hive database. HDFS is
Hadoop's primary storage layer, and a user can interact
with it directly using the shell-like commands
supported by Hadoop. We then run Hive queries on the data
to extract meaningful output that management can
use for analysis.
YouTube data extraction -> Load data into HDFS -> Hive (databases) -> Processing data using Hive commands -> Result
 Wikipedia defines Big Data as "a collection of data
sets so large and complex that it becomes difficult
to process using the available database
management tools."
 The challenges include how to capture, curate,
store, search, share, analyze and visualize Big
Data.
 In today's environment, we have access to more
types of data. These data sources include online
transactions, social networking activities, mobile
device services, internet gaming, etc.
 Big Data is a collection of data sets that are large
and complex in nature. They comprise both
structured and unstructured data that grow so large
so fast that they are not manageable by traditional
relational database systems.
 Big Data is defined as any kind of data source that
has at least three shared characteristics:
 Extremely large Volumes of data
 Extremely high Velocity of data
 Extremely wide Variety of data
 Hadoop is a framework for storing data on large clusters.
 It is also a framework used to manipulate large amounts
of data.
Hadoop consists of two main components:
 A distributed processing framework named
MapReduce (which is now supported by a component
called YARN, Yet Another Resource Negotiator), and
 A distributed file system known as the Hadoop
Distributed File System, or HDFS.
 Hadoop provides a programming model called
MapReduce for distributed processing
of large data sets.
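The MapReduce idea can be sketched in a few lines of plain Python. This is only an illustration of the three stages (map, shuffle, reduce) on a single machine, not Hadoop's actual API. The sample records and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical sample records: (video_id, category). Not taken from the project's data.
records = [
    ("v1", "Music"),
    ("v2", "Comedy"),
    ("v3", "Music"),
    ("v4", "Sports"),
    ("v5", "Music"),
    ("v6", "Comedy"),
]

def map_phase(record):
    """Emit one (key, value) pair for one input record."""
    _video_id, category = record
    yield (category, 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return (key, sum(values))

mapped = [pair for record in records for pair in map_phase(record)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes; here they run one after another so the data flow is easy to follow.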
Input file
1. Identify the top 5 categories in which the most videos
are uploaded.
 A = load '/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 X = filter A by $0 != 'ID';  -- skip the header row
 B = group X by $3;  -- $3 is the category column
 C = foreach B generate group,COUNT(X);
 D = order C by $1 DESC;
 E = limit D 5;
 dump E;
output
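The group, count, order, and limit steps of the Pig script above can be mirrored in plain Python to check the logic on a small file. This is a sketch under assumptions: the sample rows are hypothetical, and the column order (ID, uploader, age, category) is inferred from how the Pig scripts index the fields.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the column order the Pig scripts imply:
# ID, uploader, age, category (only these four columns are needed here).
SAMPLE = """ID,Uploader,Age,Category
v1,alice,17,Music
v2,bob,25,Comedy
v3,carol,16,Music
v4,dave,30,Sports
v5,alice,17,Music
v6,erin,22,Comedy
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))  # load ... using PigStorage(',')
data = [r for r in rows if r[0] != "ID"]      # drop the header row
per_category = Counter(r[3] for r in data)    # group by $3 and COUNT
top5 = per_category.most_common(5)            # order by count DESC, limit 5
print(top5)
```

With fewer than five categories in the sample, `most_common(5)` simply returns all of them, just as `limit` does in Pig.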
2. Top ten highest rated videos
 A = load
'/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 X = filter A by $0 != 'ID';
 B = foreach X generate $0,(int)$4;  -- cast the rating so it sorts numerically
 C = order B by $1 DESC;
 D = limit C 10;
 dump D;
output
3. Top ten most viewed videos
 A = load
'/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 X = filter A by $0 != 'ID';
 B = foreach X generate $0,$3,(int)$5;
 C = order B by $2 DESC;
 D = limit C 10;
 dump D;
output
4. Top 10 longest videos
 A = load
'/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 X = filter A by $0 != 'ID';
 B = foreach X generate $0,$3,(int)$7;
 C = order B by $2 DESC;
 D = limit C 10;
 dump D;
output
5. How many people under 18 years old have
uploaded a video?
 A = load '/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 X = filter A by $0 != 'ID';
 B = foreach X generate $1,(int)$2,$5;
 C = filter B by $1 < 18 and $2 != 0;
 D = group C by $0;  -- one group per uploader
 E = foreach D generate COUNT(C);
 F = group E all;  -- gather every per-uploader row into one group
 G = foreach F generate COUNT(E);
 dump G;
output
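The intent of this query, counting each under-18 uploader once however many videos they uploaded, can be sketched in plain Python. The rows below are hypothetical (uploader, age, views) tuples; the `views != 0` condition mirrors the `$2 != 0` filter in the Pig script.

```python
# Hypothetical rows: (uploader, age, views). Not taken from the project's data.
rows = [
    ("alice", 17, 120),
    ("alice", 17, 45),
    ("bob", 25, 300),
    ("carol", 16, 10),
    ("dave", 15, 0),
]

# filter: age < 18 and views != 0
under_18 = [(u, a, v) for (u, a, v) in rows if a < 18 and v != 0]

# grouping by uploader collapses repeat uploads from the same person
distinct_uploaders = {u for (u, _a, _v) in under_18}

print(len(distinct_uploaders))
```

Using a set plays the role of the group-by-uploader step: "alice" appears twice in the filtered rows but is counted once.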
6. In each category, how many videos were uploaded
by users under 18 years old?
 A = load '/home/pankaj/Documents/pankaj1/hive-student/youtube_analysis/Demo_Data.csv' using
PigStorage(',');
 X = filter A by $0 != 'ID';
 B = foreach X generate $3,(int)$2;
 Z = filter B by $1 < 18;
 C = group Z by $0;
 D = foreach C generate group,COUNT(Z);
 dump D;
output
Step 1: Use the following command to create a
table in Hive:
 hive> create table youtube (a1 int,a2
string,a3 int,a4 string,a5 int,a6 int,a7 string,a8 int) ROW
FORMAT DELIMITED FIELDS TERMINATED BY
',' STORED AS TEXTFILE;
This command creates a Hive table named
'youtube' in which each row is delimited text and
the fields within a row are terminated by commas.
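What a delimited text table means for each line of the file can be sketched in Python: split the line on the delimiter, then read each field as its declared type. This is an illustration only, not how Hive is implemented. The sample lines are hypothetical, and using `None` for a failed cast stands in for the NULL that Hive typically returns for a value it cannot parse as the declared type.

```python
# Columns a1 to a7 and their types follow the CREATE TABLE statement;
# any extra trailing fields are ignored in this sketch.
SCHEMA = [
    ("a1", int), ("a2", str), ("a3", int), ("a4", str),
    ("a5", int), ("a6", int), ("a7", str),
]

def parse_row(line, delimiter=","):
    """Split one text line into typed columns, using None where a cast fails."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for (name, cast), raw in zip(SCHEMA, fields):
        try:
            row[name] = cast(raw)
        except ValueError:
            row[name] = None  # plays the role of NULL for unparsable values
    return row

# Hypothetical sample lines; the real file's contents are not shown in the slides.
data_line = "1,alice,17,Music,5,1200,nice video"
header_line = "ID,Uploader,Age,Category,Rating,Views,Comment"

parsed = parse_row(data_line)
parsed_header = parse_row(header_line)
print(parsed)
print(parsed_header)
```

The header line shows why a header row left in the data file is a nuisance: its text cannot be read as `int`, so the numeric columns of that row come back empty.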
Step 2: Load YouTube data into the Hive table
Use the command given below to load YouTube
data into the Hive table created in Step 1:
 hive> load data local inpath '/home/pankaj/Desktop/YouTube.csv' overwrite into table
youtube;
1. Identify the top 5 categories in which the most
videos are uploaded.
 hive> select a4,count(a6) as l FROM youtube
group by a4 ORDER BY l DESC LIMIT 5;
output
2. Top ten highest rated videos
 hive> select a4,a5 FROM youtube ORDER BY a5
DESC LIMIT 10;
output
3. Top 10 longest videos
 hive> select a1,a4,a8 FROM youtube ORDER BY
a8 DESC LIMIT 10;
output
4. How many people under 18 years old
have uploaded a video?
 hive> select count(DISTINCT a2) FROM youtube where a3
< 18;
output
5. In each category, how many videos were uploaded
by users under 18 years old?
 hive> select a4,count(a3) FROM youtube where
a3 < 18 group by a4;
output
6. Number of comments per video
 hive> select a4,count(a7) FROM youtube group
by a4;
output
7. Sort the data by uploader name
 hive> select * FROM youtube ORDER BY a2;
output
Youtube big data analysis using hadoop,pig,hive