Analysis of Trending Videos Pattern on YouTube
using Hadoop MapReduce
Pooja Kumar
MSc Data Analytics
National College of Ireland
Dublin, Ireland
x18181929@student.ncirl.ie
Abstract— YouTube is one of the prominent sites for video
hosting and sharing. To enhance the content and quality of the
video, analysis of user interaction factor and streaming
information of videos is required. Video content can be
scrutinized using factors like number of views, likes, comments
and dislikes. Based on this the channel which is more active and
in which the user has more interest can be identified. This
project aims to utilize the data which has user interaction
factors and information about videos to analyze the trending
pattern. The large dataset of YouTube can be managed in
Hadoop. Data is stored in the Hadoop Distributed File System
(HDFS) and processed using the MapReduce framework. The data
was analyzed to distinguish content types and discover which
receive a better response, which helps in organizing videos in
the future. Further, the analysis helps identify the most
popular category, the most popular channel, the video with the
highest number of views and the most disliked category of videos.
Keywords— Hadoop, Hadoop Distributed File System
(HDFS), MapReduce, YouTube, Mapper, Reducer.
I. INTRODUCTION
Social networking sites offer a lucrative way to promote
and advertise. They offer various opportunities for product
publicity, showcasing new movie trailers and images, and
corporate promotions [2]. Beyond YouTube's clear
entertainment and leisure advantages, it is also being
leveraged for professional and other industry gains [3]. Most
companies release previews of their new products on
YouTube for greater exposure. YouTube lets uploaders receive
reviews, likes and shares from viewers as feedback on their
videos. This feedback can guide future improvements and help
uploaders tailor a product to viewer needs. Each
minute, over 500 hours of video are uploaded to YouTube.
YouTube has more than 2 billion users, who watch an
average of 1 billion hours of video daily. In 2018, around
32 million inappropriate videos were removed with the help of
10,000 reviewers. YouTube can be browsed in over 80 languages
[7]. In the United States (U.S), YouTube reaches more young
people on mobile than any television network. Around “73%
of U.S adults use YouTube”. Figure 1 shows how many
internet users in the U.S use YouTube. Nearly “15% of site
traffics on YouTube comes from the U.S”. Next come “India
with 8.1% and Japan with 4.6%”. YouTube is expected to generate
$5.5 billion in advertising revenues in the US alone by 2020 [8].
In this paper, Hadoop, a big data framework, was used for
handling data. It is a Java-based programming framework.
The MapReduce programming model is used for
processing massive amounts of data in parallel. The Hadoop
Distributed File System (HDFS) provides the distributed
storage and data access on which this processing runs. U.S
YouTube data was analyzed using HDFS and the MapReduce
framework to provide a better understanding of viewer
interest.
Fig. 1. U.S YouTube Users
Research Question: How can the Hadoop MapReduce framework
help to analyze the pattern of trending videos on YouTube
and users' interest, so that people can make well-informed
decisions about trending topics?
The research question is answered by analyzing the following
topics using MapReduce.
Q1) How are videos distributed, and which category has the
most videos?
Q2) Which channel has the most views?
Q3) Which video is disliked by the most users?
Q4) Which channel has the most videos, and to which category
do they belong?
Q5) Which video has been trending for the most days?
Q6) How many videos were removed from each category?
Q7) For how many videos in each channel is the rating
disabled?
Q8) How many videos in a channel are not associated with a
particular category?
Q9) How well are the factors views, likes and comment
count correlated with each other? (Statistical approach using
Python)
In this project, trending topics and people's interest are
analyzed. This information can benefit people by helping them
make well-informed decisions about trending topics. The
analysis helps uploaders learn more about their
competitors on YouTube. It discovers the videos that perform
best. Based on the videos uploaded and the viewer interaction
factors, it detects the active channels and categories. Using
this method, it is easy to find the number of videos in each
category, the number of views for each channel and the videos
that viewers do not like.
II. RELATED WORK
When communication took traditional forms, the volume of
data transmitted was limited. Now, due to the
widespread usage of the internet, companies and social media
platforms hold massive amounts of data. Because the data
is too large to inspect directly, analytics is used to extract
information from it. Critical data analysis has resulted in
advanced analytical intelligence for various data segments
and supports forecasting. Conventional data analysis
software is not designed to handle such huge and complex data
and cannot generate results for the accumulated
data. Thus [Biju and Mathew, 2017]
explained the need for Big Data analytics [6]. Because of the
availability of Big Data and low-cost hardware, there
was a requirement to process data quickly and
cost-effectively [3].
The Hadoop MapReduce framework splits a large dataset
into smaller chunks that are processed in parallel, and it
manages the scheduling. Each chunk is mapped to intermediate
values, which are then reduced to the result of the analysis.
A MapReduce algorithm can be rewritten to suit the problem,
so that the problem is broken into chunks and solved in
parallel. This is how it
addresses large datasets with a distributed solving method
[5]. “YARN (Yet Another Resource Negotiator) separates
resource management functions from programming model”.
MapReduce runs as an application on top of YARN. YARN splits
the JobTracker's responsibilities, resource management and job
scheduling, into separate daemons. The
authors explained how Hadoop MapReduce and YARN work
[5].
At present, uploaders can use YouTube analytics
to analyze their own channel. This analysis
provides the overall parameters, but only for their
own channel. Competitors' information is not revealed [3].
The authors [Harikumar, Kapoor and Waghmare, 2019]
analyzed YouTube data. They implemented a technique
to detect sensitive text in comments and to identify the
popular content types, which helps both YouTubers and
advertising companies decide which videos to upload based on
popularity. The data was first collected, the sensitive
content in the comments was substituted, and the result was
then analyzed for popular content types. Based on this analysis,
YouTubers can advertise products from which ad revenue is
earned [5].
The authors [Dabas, Kaur, Gulati and Tilak, 2019]
presented the classification and analysis of YouTube video
comments using a Hadoop-based application. The comments were
summarized using Hive queries, and sentiment
analysis was performed on them using Python. The
system delivered promising results for queries in terms of
execution time [2].
III. METHODOLOGY
This study focuses on generating a pattern of user
interaction on YouTube using the Hadoop environment. For
this research two datasets were considered: the videos trending
on YouTube and the region-specific video categories. Both
were downloaded from Kaggle [9].
The YouTube dataset holds a daily record of trending
videos over several months. It includes the data of many
countries, each in a separate file. The U.S YouTube data file
was chosen for the analysis. It contains 40949 records and has
information such as video id, trending date, title, channel,
category id, publish time, tags, views, likes, dislikes,
comment count, thumbnail link, comment disabled
information, ratings disabled information, information about
the error or removed video and description. The extracted file
was in comma-separated values (CSV) format and required little
pre-processing, as there were no missing or NaN values. However,
commas inside field values were removed because the comma acts
as the delimiter, and the date was converted to the format
dd/mm/yyyy.
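The two cleaning steps described above can be sketched in Python as follows. This is a minimal illustration, not the author's script: the function names are hypothetical, and the source date layout `%y.%d.%m` (for example `17.14.11`) is an assumption about the raw file that should be checked against the actual data.

```python
from datetime import datetime


def clean_field(value: str) -> str:
    """Drop embedded commas so that a plain split(',') yields one column per field."""
    return value.replace(",", "")


def reformat_date(raw: str, source_format: str = "%y.%d.%m") -> str:
    """Convert a date string to dd/mm/yyyy.

    source_format is an assumption about how the raw file stores dates.
    """
    return datetime.strptime(raw, source_format).strftime("%d/%m/%Y")


print(clean_field("Official Trailer, Part 2"))  # Official Trailer Part 2
print(reformat_date("17.14.11"))                # 14/11/2017
```

Removing embedded commas loses a little punctuation in titles and descriptions, but it lets the mapper split each line on the delimiter without a CSV parser.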
The category dataset is different for each region. This
information was fetched from the associated JSON file on
Kaggle [9]. The JSON file includes fields such as etag,
category id, snippet-channel id, snippet-title and
snippet-assignable. The JSON file was converted to structured
data, i.e., to CSV format, using Python code. This dataset
needed no pre-processing, as it contained all the information.
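A conversion of this kind might look like the sketch below. The JSON layout is an assumption inferred from the field names listed above (a top-level `items` list whose entries carry `id` and a nested `snippet` object), and the function name and output column names are hypothetical.

```python
import csv
import io
import json


def categories_to_csv(json_text: str) -> str:
    """Flatten category JSON into CSV text with one row per category."""
    data = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["category_id", "title", "assignable"])
    # Assumed layout: {"items": [{"id": ..., "snippet": {"title": ..., "assignable": ...}}]}
    for item in data["items"]:
        snippet = item["snippet"]
        writer.writerow([item["id"], snippet["title"], snippet["assignable"]])
    return out.getvalue()


sample = (
    '{"items": [{"etag": "x", "id": "10", '
    '"snippet": {"channelId": "c", "title": "Music", "assignable": true}}]}'
)
print(categories_to_csv(sample))
```

Writing to an in-memory buffer keeps the sketch testable; in practice the same writer could target a file opened with `newline=""`.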
Both datasets were merged using Python code. While
merging, only the required information was kept, and the
columns not used in the
analysis were removed. After merging there were 40949
records in the dataset. Figure 2 shows the information about
the merged dataset. The combined dataset was loaded into
HDFS. The Java MapReduce framework was used for all
insights in the entire process. After the results were
generated, they were visualized. By analyzing this dataset,
patterns in the trending videos can be drawn that would be
useful for YouTubers. This shows which kind of
category or video has a better response, and YouTubers can
upload content on that basis.
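One way to picture the merge is as a left join of the video records onto the category titles, keeping only the columns needed later. The sketch below uses plain dictionaries; the function name, the column names and the `"Unknown"` fallback are illustrative assumptions rather than the author's code.

```python
def merge_videos_with_categories(videos, categories, keep):
    """Left-join video rows to category titles and keep only the listed columns."""
    title_by_id = {c["category_id"]: c["title"] for c in categories}
    merged = []
    for row in videos:
        out = {col: row[col] for col in keep}
        # Rows whose category id has no match keep a placeholder title.
        out["category"] = title_by_id.get(row["category_id"], "Unknown")
        merged.append(out)
    return merged


videos = [
    {"video_id": "a1", "category_id": "10", "views": "100", "tags": "x|y"},
    {"video_id": "b2", "category_id": "17", "views": "250", "tags": "z"},
]
categories = [{"category_id": "10", "title": "Music"},
              {"category_id": "17", "title": "Sports"}]
print(merge_videos_with_categories(videos, categories, keep=["video_id", "views"]))
```

A left join produces exactly one output row per video row, which is consistent with the record count staying at 40949 after the merge.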
Fig. 2. Dataset after merging
IV. IMPLEMENTATION
An architecture was designed for the implementation and
followed to perform the analysis on YouTube data. The
required processing was done on the dataset, and all the
essential records were merged and loaded into
HDFS. The MapReduce program read the data
from HDFS, performed the mapping and reducing operations
and stored the generated output in HDFS.
The code was written in the Java programming language. The
generated output was then visualized. Figure 3 shows the process
flow of the data analysis.
Fig. 3. Process Flow of Data Analysis
To perform a MapReduce task, three Java classes are
created: a Mapper, a Reducer and a Driver class. The mapper
class processes the input, splitting each record and emitting
it as intermediate (key, value) pairs. The output from the
mapper is fed as input
to the reducer class. In the reducer, data can be
aggregated, filtered and combined in several ways
to generate the output. The Driver class is responsible for
setting up the MapReduce job to run in Hadoop. The job name,
the data types for input and output, and the class names of the
mapper and reducer are specified in the driver class. During
Java program compilation, a directory named after the package
declared in the Java source file is created in the current
directory, and all compiled files are put into it. A JAR file
also has to be
created. The input is read from HDFS and the generated
output is stored back into HDFS. This procedure is the same
for all the queries.
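The mapper, shuffle and reducer stages described above can be imitated in a few lines of Python to make the data flow concrete. This is an in-memory sketch for illustration only, not Hadoop code; `run_job`, `word_mapper` and `sum_reducer` are hypothetical names, and the grouping dictionary stands in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    # Reduce phase: one call per key, with all values for that key.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


print(run_job(["a b a", "b a"], word_mapper, sum_reducer))  # {'a': 3, 'b': 2}
```

In this picture the driver's role corresponds to the call to `run_job`, which names the mapper and reducer to use.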
Q1) The first task is to find the distribution of videos based
on category. To perform this task, three Java classes were
created: categoryMapper, categoryReducer and a driver class.
In the mapper class, shown in Figure 4, a variable is declared
that holds each line read from the input file. The line is
split on the delimiter and the values are stored as an array.
The column that holds category information is read and emitted
as the key, and for each key the value is set to 1. This is the
intermediate key-value output of the mapper class. The reducer
class reads the mapper output as its input; after shuffling
and reducing, the output is generated. Figure 5 shows the
reducer Java class. Figure 6 shows the driver class, which
specifies the job name (YouTube), the output data types (Text,
IntWritable) and the names of the mapper and reducer classes
needed to run the program. The program was compiled, all the
classes were added to a JAR file using the JAR command, as
shown in Figure 7, and the program was executed. The output is
stored in HDFS. This task was performed to learn how videos
are distributed.
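The counting logic of this task can be sketched in Python as below. It mirrors the description (split the line, emit the category with the value 1, sum per key) but is not the Java code in the figures; the column position `CATEGORY_COLUMN` and the function names are assumptions chosen for the example.

```python
from collections import Counter

CATEGORY_COLUMN = 4  # hypothetical position of the category field in each line


def category_mapper(line, delimiter=","):
    """Emit (category, 1) for one input line."""
    fields = line.split(delimiter)
    yield fields[CATEGORY_COLUMN], 1


def count_by_category(lines):
    """Sum the 1s emitted for each category, as the reducer would."""
    totals = Counter()
    for line in lines:
        for category, one in category_mapper(line):
            totals[category] += one
    return dict(totals)


lines = ["v1,14/11/2017,Title A,Chan A,Music",
         "v2,14/11/2017,Title B,Chan B,Sports",
         "v3,15/11/2017,Title C,Chan A,Music"]
print(count_by_category(lines))  # {'Music': 2, 'Sports': 1}
```

Splitting on the bare delimiter only works because embedded commas were removed during pre-processing.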
Fig. 4. Mapper class
Fig. 5. Reducer class
Fig. 6. Driver class
Fig. 7. JAR file for Task 1
Q2) The second task is to identify which channel has the most
views. In the mapper class, each channel and the number of
views associated with it are extracted and mapped
as a key-value pair. In the reducer, the key-value pairs were
shuffled and reduced. All the Java classes were added to the
JAR file to perform the execution. After execution, the output
was stored in HDFS.
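A sketch of this aggregation in Python follows. The text does not state how the reducer combines the view counts, so summing them per channel is an assumption here, as are the column names `channel_title` and `views`.

```python
from collections import defaultdict


def views_mapper(row):
    """Emit (channel, views) for one record."""
    yield row["channel_title"], int(row["views"])


def total_views_per_channel(rows):
    """Add up the emitted view counts per channel, largest total first."""
    totals = defaultdict(int)
    for row in rows:
        for channel, views in views_mapper(row):
            totals[channel] += views
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


rows = [{"channel_title": "A", "views": "100"},
        {"channel_title": "B", "views": "400"},
        {"channel_title": "A", "views": "150"}]
print(total_views_per_channel(rows))  # [('B', 400), ('A', 250)]
```

Because a video appears once for every day it trends, a sum counts its views repeatedly; taking the maximum per video before summing per channel would be an alternative design.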
Q3) The third task is to identify the most disliked video.
The mapper takes the video ID of each video
and its number of dislikes and emits them as an
intermediate key-value pair. The pairs are sent to the reducer
for shuffling and reducing, where the number of dislikes given
to each video is counted.
From this output, the video that most viewers do
not like can be identified.
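The ranking step can be illustrated with the Python sketch below. It assumes that each daily record carries a running dislike total, so the reducer keeps the largest value seen per video; that reading, the column names and the function name are assumptions, not details confirmed by the text.

```python
import heapq


def most_disliked(rows, n=20):
    """Return the n videos with the highest dislike count, largest first."""
    peak = {}
    for row in rows:
        video_id, dislikes = row["video_id"], int(row["dislikes"])
        # Keep the largest count seen for each video across its daily records.
        peak[video_id] = max(peak.get(video_id, 0), dislikes)
    return heapq.nlargest(n, peak.items(), key=lambda kv: kv[1])


rows = [{"video_id": "a", "dislikes": "5"},
        {"video_id": "a", "dislikes": "9"},
        {"video_id": "b", "dislikes": "7"}]
print(most_disliked(rows, n=2))  # [('a', 9), ('b', 7)]
```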
Q4) The fourth task is to find, for each category, the channel
that has the most videos. In the mapper, the category and the
channel ID, which is the name of the channel, are combined to
form a key. For every key a value is set, and the key-value
pair is sent to the reducer. The reducer counts each occurrence
of the same channel within a category. Thus, the
channel that has the most videos in a category is
obtained as output.
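The composite key described above can be modelled in Python as a tuple of category and channel. This is a sketch under assumed column names (`category`, `channel_title`); in a Java job the two parts would typically be joined into a single text key instead.

```python
from collections import Counter


def videos_per_category_channel(rows):
    """Count records per (category, channel) key, most frequent first."""
    counts = Counter((row["category"], row["channel_title"]) for row in rows)
    return counts.most_common()


rows = [{"category": "Sports", "channel_title": "ESPN"},
        {"category": "Sports", "channel_title": "ESPN"},
        {"category": "Music", "channel_title": "Vevo"}]
print(videos_per_category_channel(rows))
# [(('Sports', 'ESPN'), 2), (('Music', 'Vevo'), 1)]
```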
Q5) The fifth task is to identify which video has been
trending for the most days on YouTube. For each video, there is
a record of the publishing date and the trending date. Using
this information, the number of trending days of a video can be
derived in the mapper class. The output from the mapper was
sent as input to the reducer, and the final reduced output was
generated.
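The text does not spell out how the number of trending days is computed. One plausible reading, sketched below in Python, counts the distinct trending dates recorded for each video; this is an assumption, and measuring the gap between the publishing date and the last trending date would be another. The column names and function name are hypothetical.

```python
from collections import defaultdict


def trending_days(rows):
    """Count the distinct trending dates recorded for each video."""
    dates = defaultdict(set)
    for row in rows:
        # A set ignores duplicate records for the same video and day.
        dates[row["video_id"]].add(row["trending_date"])
    return {video_id: len(days) for video_id, days in dates.items()}


rows = [{"video_id": "a", "trending_date": "14/11/2017"},
        {"video_id": "a", "trending_date": "15/11/2017"},
        {"video_id": "a", "trending_date": "15/11/2017"},
        {"video_id": "b", "trending_date": "14/11/2017"}]
print(trending_days(rows))  # {'a': 2, 'b': 1}
```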
Q6) The sixth task is to check how many videos were
removed from each category. The mapper uses the category and
the binary flag in the column video_error_or_removed, which
records whether a video was removed. When the flag
is True, the corresponding category name is set as the key
and the value is stored as 1, generating the intermediate
key-value pairs. In the reducer, the input from the mapper is
reduced and the final key-value output is
generated.
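A filtering mapper of this kind emits a pair only when the flag marks the record of interest. The Python sketch below assumes that the flag reads `True` for a removed video and that the columns are named `category` and `video_error_or_removed`; both the encoding of the flag and the function names are assumptions to verify against the data.

```python
from collections import Counter


def removed_mapper(row):
    """Emit (category, 1) only for records flagged as removed."""
    # Assumption: the flag is the text "True" when the video was removed.
    if row["video_error_or_removed"].strip().lower() == "true":
        yield row["category"], 1


def removed_per_category(rows):
    """Sum the emitted 1s per category, as the reducer would."""
    totals = Counter()
    for row in rows:
        for category, one in removed_mapper(row):
            totals[category] += one
    return dict(totals)


rows = [{"category": "Sports", "video_error_or_removed": "True"},
        {"category": "Music", "video_error_or_removed": "False"},
        {"category": "Sports", "video_error_or_removed": "True"}]
print(removed_per_category(rows))  # {'Sports': 2}
```

Categories with no flagged records never reach the reducer, so they are absent from the output rather than listed with a zero. The same pattern applies to any other binary column, with the key changed as needed.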
Q7) The seventh task is to identify for how many videos in each
channel the rating is disabled. As in the previous task,
the ratings-disabled flag is binary information. In the mapper,
each time the flag is True, the corresponding
channel name is set as the key and the value is set to 1. In
the reducer class, the values for each repeated channel name
are added up and the output is generated.
Q8) The eighth task is to find how many videos in a
channel are not associated with a category. The
snippet-assignable column holds a binary value. Every time the
value is False, the corresponding channel and category are set
as the key and a value is set in the mapper class. In the
reducer class, each time the same channel name and category
recur, the count is increased by 1, and thus the reducer output
is generated.
The JAR file information for all the Java programs is shown
in Figure 8.
Q9) The final task was to analyze the correlation between
views, likes and comment count. This was done using Python.
If views, likes and comment counts are highly correlated, this
suggests viewer interest in a particular video.
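The text does not name the correlation measure. Assuming the common Pearson coefficient, the pairwise computation can be sketched in plain Python as follows; the function names are hypothetical, and a library routine would normally be used instead.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up numbers purely to show the call shape, not results from the dataset.
data = {"views": [100, 200, 300, 400],
        "likes": [10, 22, 29, 41],
        "comment_count": [5, 3, 8, 6]}
matrix = correlation_matrix(data)
print(round(matrix["views"]["likes"], 3))
```

Values close to 1 indicate that two factors rise together, while values near 0 indicate little linear relationship.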
V. RESULTS
Q1) As seen in Figure 10, the Entertainment and Music
categories have more videos than all others. The
graph shows the distribution of videos across categories.
It is evident that YouTubers upload more videos to the
Entertainment and Music categories. Figure 11 shows the same
distribution as a word cloud: the more often a category
name appears, the bigger and bolder its text
becomes.
Fig. 8. JAR file of all Tasks
Fig. 9. Output for Task1
Fig. 10. Videos in each category
Fig. 11. Distribution of videos
Q2) Figure 12 shows the top 20 channels by number of
views. These 20 channels are the most viewed
on YouTube in this dataset. Most of them belong to the
Music, Entertainment and Sports categories.
Fig.12. Top 20 channel with more views
Q3) Figure 13 shows the video IDs of the 20 most
disliked videos. The channel to which a video belongs can be
identified from its video ID. Here,
ID FlsCjmMhFmw belongs to the YouTube-Spotlight
channel, and this channel is the most disliked by
viewers.
Fig.13. Most Disliked videos
Q4) The result in Figure 14 shows which channel
in which category has the most videos. It is observed
that, in the Sports category, the ESPN channel has more videos
than all other channels.
Fig.14. Channel which has more number of Videos
Q5) Figure 15 displays the top 10 videos that
have been trending for the most days on YouTube. Of these, the
two videos that trended for the most days belong to the
Sports and News & Politics categories.
Fig.15. Top 10 shows trending from many days
Q6) Figure 16 shows that some of the removed videos
belong to the Entertainment, Film & Animation and Sports
categories. Videos belonging to other categories were not
removed.
Fig.16. Categories in which more videos are removed
Q7) The graph in Figure 17 shows for
how many videos in each channel the rating has been disabled.
Most of the videos for which the rating has been
disabled belong to the How To & Style category.
Fig.17. Videos Streaming for many days
Q8) Figure 18 shows that the videos
belonging to the CNET and Bleacher Report channels are not
associated with their category, which is Shows. This means
the snippet is not assignable.
Fig.18. Number of videos snippet not assignable
Q9) The result in Figure 19 shows that views
and likes are more highly correlated with each other than with
comment count. Based on this correlation matrix, viewer
interest can be analyzed. If a video has more views, it has
a high chance of being liked by many viewers.
Fig.19. Correlation Matrix
VI. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, YouTube data was analyzed
using the Hadoop MapReduce framework. The research
question was answered using MapReduce tasks. Through
this analysis, it was found that the Entertainment and Music
categories have more videos and a greater number of
views. Videos belonging to the Sports and News & Politics
categories have been trending for the most days. Through these
tasks, the trending pattern of videos and user interest were
analyzed. Viewer interest can be identified based on views,
likes and comments. The project results also highlight the
advantages of the Hadoop framework; its disadvantage is
the syntax complexity of Java MapReduce.
In the future, the analysis can be extended using uploader
information, and sentiment analysis can be performed on the
descriptions of the videos.
REFERENCES
[1] P. Merla and Y. Liang, “Data Analysis using Hadoop MapReduce
Environment”, IEEE Conf. on Big Data, 2017 [Online]. Available:
IEEE Xplore, https://www.ieee.org/ [Accessed: Apr. 10, 2020].
[2] C. Dabas, P. Kaur, N. Gulati and M. Tilak, “Analysis of Comments on
YouTube Videos using Hadoop”, Fifth International Conf. on Image
Information Processing (ICIIP), 2019.
[3] F. Shaikh, D. Pawaskar, U. Khan and A. Siddiqui, “YouTube Data
Analysis using MapReduce on Hadoop”, IEEE Conf. on Recent Trends
in Electronics, Information & Communication Technology, May
18/19, 2018.
[4] K. Bhatter, S. Gavhane, P. Dhamne, G. Aochar and S. Rabade, “A
Review on YouTube Data Analysis using MapReduce on Hadoop”,
International Journal of Research in Engineering, Science and
Management, vol. 2, no. 12, Dec. 2019 [Online]. Available:
https://www.ijresm.com/ [Accessed: Apr. 10, 2020].
[5] D. Harikumar, D. Kapoor and S. Waghmare, “YouTube Data
Sensitivity and Analysis using Hadoop Framework”, International
Research Journal of Engineering and Technology, vol. 6, no. 4, 2019
[Online]. Available: https://www.irjet.net/ [Accessed: Apr. 10,
2020].
[6] S. Biju and A. Mathew, “Comparative Analysis of Selected Big Data
Analytics Tools”, Journal of International Technology and Information
Management, vol. 26, no. 2, 2017.
[7] FactSlides.com, “25 Facts about YouTube FactSlides”, Nov. 28, 2019
[Online]. Available: https://www.factslides.com/s-YouTube/
[Accessed: Apr. 10, 2020].
[8] P. Cooper, Hootsuite Social Media Management, “23 YouTube
Statistics that Matter to Marketers in 2020”, Dec. 17, 2019 [Online].
Available: https://blog.hootsuite.com/youtube-stats-marketers/
[Accessed: Apr. 10, 2020].
[9] M. J, “Trending YouTube Video Statistics”, Kaggle.com, 2019
[Online]. Available: https://www.kaggle.com/datasnaek/youtube-new/
[Accessed: Apr. 10, 2020].

Determinants of health, dimensions of health, positive health and spectrum of...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 

Data Intensive Architectures

  • 1. Analysis of Trending Videos Pattern on YouTube using Hadoop MapReduce
Pooja Kumar, MSc Data Analytics, National College of Ireland, Dublin, Ireland, x18181929@student.ncirl.ie
Abstract: YouTube is one of the most prominent sites for video hosting and sharing. Improving the content and quality of videos requires analysis of user interaction factors and of streaming information about the videos. Video content can be assessed using factors such as the number of views, likes, comments and dislikes. From these, the channels that are most active and that attract the most user interest can be identified. This project uses data containing user interaction factors and video information to analyze trending patterns. The large YouTube dataset is managed in Hadoop: data is stored in the Hadoop Distributed File System (HDFS) and processed using the MapReduce framework. The data was analyzed to distinguish content and discover which content draws a better response, which helps in organizing videos in the future. The analysis also helps in understanding the most popular category, the most popular channel, the video with the highest number of views, and the most disliked category of videos.
Keywords: Hadoop, Hadoop Distributed File System (HDFS), MapReduce, YouTube, Mapper, Reducer.
I. INTRODUCTION
Social networking sites offer a lucrative way to promote and advertise. They offer various opportunities for product publicity, for showcasing new movie trailers and images, and for corporate promotions [2]. Beyond YouTube's clear entertainment and leisure advantages, it is also being leveraged for professional and other industry gains [3]. Most companies release previews of their new products on YouTube for greater exposure. YouTube lets uploaders receive reviews, likes and shares from viewers as feedback on their videos. This information can be used to make future improvements and to customize the product according to viewer needs.
Each minute, over 500 hours of video are uploaded to YouTube. YouTube has more than 2 billion users, and on average 1 billion hours of video are watched on it every day. Around 32 million inappropriate videos were removed in 2018, based on the work of 10,000 reviewers. YouTube can be browsed in over 80 languages [7]. In the United States (U.S), YouTube reaches more young people on mobile than any television network. Around “73% of U.S adults use YouTube”. Figure 1 shows how many internet users in the U.S use YouTube. Nearly “15% of site traffics on YouTube comes from the U.S”, followed by “India with 8.1% and Japan with 4.6%”. YouTube is expected to generate $5.5 billion in advertising revenue in the U.S alone by 2020 [8].
In this paper, Hadoop, a Java-based big data framework, was used for handling data. The MapReduce programming model is used to process massive amounts of data in parallel. The Hadoop Distributed File System (HDFS) provides data storage, access, processing and computation on top of a file system. U.S YouTube data was analyzed using HDFS and the MapReduce framework to provide a better understanding of viewer interest.
Fig. 1. U.S YouTube Users
Research Question: How can the Hadoop MapReduce framework help to analyze the pattern of trending videos on YouTube and users' interests, so that people can make informed decisions about trending topics?
The research question is answered by analyzing the following topics using MapReduce.
Q1) How are videos distributed, and which category has the most videos?
Q2) Which channel has the most views?
Q3) Which video is disliked by the most users?
Q4) Which channel has the most videos, and to which category do they belong?
Q5) Which video has been trending for the most days?
Q6) How many videos were removed from each category?
Q7) For how many videos in each channel are ratings disabled?
Q8) How many videos in a channel are not associated with a particular category?
Q9) How well are the factors views, likes and comment count correlated with each other? (Statistical approach using Python)
In this project, trending topics and people's interests can be analyzed. This information can be used to benefit people by helping them make sound decisions about trending topics.
  • 2. This analysis helps the uploader learn more about competitors on YouTube. It discovers the videos that perform best. Based on the videos uploaded and the viewer interaction factors, it detects the most active channels and categories. Using this method, it is easy to find the number of videos in each category, the number of views for each channel, and the videos that viewers do not like.
II. RELATED WORK
When communication took traditional forms, the amount of data transmitted was limited. Now, due to the widespread use of the internet, companies and social media platforms hold massive amounts of data. Because this data is too large to inspect directly, analytics is used to analyze it and extract information from it. Critical data analysis has led to advanced analytical intelligence for various data segments and to forecasting of future trends. Conventional data analysis software is not designed to handle such huge and complex data, and cannot generate results for the accumulated data because of its complexity. On this basis, [Biju and Mathew, 2017] explained the need for Big Data analytics [6]. The availability of Big Data and of low-cost hardware created a requirement to process data quickly and cost-effectively [3].
The Hadoop MapReduce framework splits a large problem into smaller chunks that run in parallel, and it manages the scheduling. Each piece is mapped to an intermediate value, and the intermediate values are then reduced to the result of the analysis. A MapReduce algorithm can be rewritten to suit the problem, which is broken into chunks that are solved in parallel. This is how it addresses large datasets with a distributed solving method [5]. “YARN (Yet Another Resource Negotiator) separates resource management functions from programming model”. MapReduce runs as an application on top of YARN. YARN splits the JobTracker's two responsibilities, resource management and job scheduling, into separate daemons. The authors explained how Hadoop MapReduce and YARN work [5].
Currently, uploaders can use YouTube Analytics to analyze their own channel. This analysis provides the overall parameters, but only for the uploader's own channel; competitors' information is not revealed [3]. The authors [Harikumar, Kapoor and Waghmare, 2019] analyzed YouTube data. A technique was implemented to detect sensitive text in comments and to identify the popular content types, which helped both YouTubers and advertising companies to upload videos based on popularity. First, the data was collected and the sensitive content in the comments was substituted; the data was then analyzed for popular content types. Based on this analysis, YouTubers can advertise products from which ad revenue is earned [5]. The authors [Dabas, Kaur, Gulati and Tilak, 2019] presented the classification and analysis of YouTube video comments based on a Hadoop application. The comments were summarized using Hive queries, and sentiment analysis was performed on the comments using Python. The system delivered promising query results in terms of execution time [2].
III. METHODOLOGY
This study focuses on generating patterns of user interaction on YouTube using the Hadoop environment. Two datasets were downloaded from Kaggle [9] for this research: the videos trending on YouTube and the region-specific video categories. The YouTube dataset holds a daily record of trending videos over several months, with the data of each country in a separate file. The U.S YouTube data file was chosen for the analysis. It contains 40949 records with the following information: video id, trending date, title, channel, category id, publish time, tags, views, likes, dislikes, comment count, thumbnail link, comments disabled flag, ratings disabled flag, video error or removed flag, and description.
The extracted file was in comma-separated values (CSV) format and required little pre-processing, as there were no missing or NaN values. However, all commas inside the text were removed, because the comma acts as the delimiter, and the date was converted to the format dd/mm/yyyy. The category dataset differs for each region; this information was fetched from the associated JSON file from Kaggle [9]. The JSON file includes the fields etag, category id, snippet-channel id, snippet-title and snippet-assignable. It was converted to structured data in CSV format using Python code. This dataset needed no pre-processing, as it contained all the information. Both datasets were merged using Python code. While merging, only the required information was kept, and columns not used in the analysis were removed. After merging, the dataset contained 40949 records. Figure 2 shows the information about the merged dataset.
The combined dataset was loaded into HDFS. The Java MapReduce framework was used for all insights throughout the process. After the results were generated, they were visualized. By analyzing this dataset, patterns in the trending videos can be drawn that are useful for YouTubers: knowing which kind of category or video gets a better response, YouTubers can decide what content to upload.
Fig. 2. Dataset after merging
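The conversion and merge steps described above can be illustrated with a small sketch. The paper states only that Python code was used, so this is an illustration under assumptions rather than the author's code: it uses the standard library only, and the JSON layout (an `items` list with `id` and `snippet` fields), the column names and the sample rows are hypothetical stand-ins.

```python
import csv
import io
import json

# Hypothetical category JSON, assumed to follow an "items" list layout with
# id and snippet fields (title, assignable); the values are illustrative only.
CATEGORY_JSON = """
{"items": [
  {"etag": "e1", "id": "10", "snippet": {"channelId": "c", "title": "Music", "assignable": true}},
  {"etag": "e2", "id": "24", "snippet": {"channelId": "c", "title": "Entertainment", "assignable": true}}
]}
"""

# Hypothetical trending-video rows; the column names are assumptions.
VIDEOS_CSV = """video_id,title,channel_title,category_id,views
v1,"Song, live",ChannelA,10,100
v2,Trailer,ChannelB,24,250
"""


def flatten_categories(raw_json):
    """Turn the nested category JSON into flat rows keyed by category id."""
    rows = {}
    for item in json.loads(raw_json)["items"]:
        snippet = item["snippet"]
        rows[item["id"]] = {
            "category": snippet["title"],
            "assignable": str(snippet["assignable"]),
        }
    return rows


def merge_videos(videos_csv, categories):
    """Attach the category title to each video row and strip commas from text."""
    merged = []
    for row in csv.DictReader(io.StringIO(videos_csv)):
        cat = categories.get(row["category_id"], {"category": "", "assignable": ""})
        merged.append({
            "video_id": row["video_id"],
            # Commas inside text fields would clash with the delimiter later on.
            "title": row["title"].replace(",", ""),
            "channel_title": row["channel_title"].replace(",", ""),
            "category": cat["category"],
            "assignable": cat["assignable"],
            "views": row["views"],
        })
    return merged


def to_csv(rows):
    """Serialize the merged rows back to CSV text, ready to be copied into HDFS."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


merged_rows = merge_videos(VIDEOS_CSV, flatten_categories(CATEGORY_JSON))
merged_csv = to_csv(merged_rows)
```

Keeping only the needed columns at merge time mirrors the statement that unused columns were dropped; the real files would be read from disk instead of from in-memory strings.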
  • 3. IV. IMPLEMENTATION
An architecture was designed for the implementation and followed to perform the analysis on the YouTube data. The required processing was done on the dataset, and all the essential records were merged and loaded into HDFS. The MapReduce program read the data from HDFS, performed the map and reduce operations, and stored the generated output back into HDFS. The code was written in Java. The generated output was then visualized. Figure 3 shows the process flow of the data analysis.
Fig. 3. Process Flow of Data Analysis
To perform a MapReduce task, three Java classes are created: a Mapper, a Reducer and a Driver class. The mapper class processes the input and generates intermediate output by splitting the input and recording it as (key, value) pairs. The output of the mapper is fed as input to the reducer class. In the reducer, data can be aggregated, filtered and combined in several ways to generate the output for a wide range of processing. The Driver class is responsible for setting up the MapReduce job to run in Hadoop: the job name, the data types for input and output, and the class names of the mapper and reducer are specified in it. During compilation of the Java program, a directory named after the package declared in the Java source file is created in the current directory, and all compiled files are placed in it. A JAR file also has to be created. The input was read from HDFS and the generated output was stored back into HDFS. This procedure is the same for all the queries.
Q1) The first task is to find the distribution of videos by category. To perform this task, three Java classes were created: categoryMapper, categoryReducer and a driver class. In the mapper class, shown in Figure 4, a variable is declared to hold the lines read from the input file. Each line is split on the delimiter and the values are stored in an array. The column holding the category information is read and mapped as the key. For each key, the value is stored as 1. This is the intermediate key-value pair output generated by the mapper class. The reducer class reads the output of the mapper as its input and generates the output by shuffling and reducing. Figure 5 shows the reducer Java class. Figure 6 shows the driver class, in which the job name (YouTube), the output data types (Text, IntWritable) and the names of the mapper and reducer classes are specified to run the program. The program was compiled, all the classes were added to the JAR file with the JAR command, as shown in Figure 7, and the program was executed. The output is stored in HDFS. This task was performed to learn how videos are distributed.
Fig. 4. Mapper class
Fig. 5. Reducer class
Fig. 6. Driver class
Fig. 7. JAR file for Task 1
Q2) The second task is to identify which channel has the most views. In the mapper class, each channel and the number of views associated with it are sorted and mapped as a key-value pair. In the reducer, the key-value pairs were shuffled and reduced. All the Java classes were added to the JAR file for execution. After execution, the output was stored in HDFS.
Q3) The third task is to identify the most disliked video. In the mapper, the video ID of each video and its number of dislikes were taken into consideration. The video ID and dislike count were sorted and an intermediate key-value pair was generated. It was sent to the reducer for shuffling and reducing, where the number of dislikes given to each video was counted. Based on this output, the video that most viewers do not like can be identified.
  • 4. Q4) The fourth task is to find, per category, the channel that has the most videos. In the mapper, the category and the channel ID, which is the name of the channel, were grouped to form a key. A value was set for every key, and this key-value pair was sent to the reducer. In the reducer, the values were counted for each occurrence of the same channel in a category. Thus, the channel with the most videos in a category was produced as output.
Q5) The fifth task is to identify which video has been trending for the most days on YouTube. Each video has a record of its publishing date and trending date; using this information, the number of trending days of a video can be derived in the mapper class. The output of the mapper was sent as input to the reducer, and the final reduced output was generated.
Q6) The sixth task is to check how many videos were removed from each category. The mapper used the category and the binary video-removed information held in the column video_error_or_removed. When the binary value is False, the corresponding category name was set as the key and the value was stored as 1. Thus, the intermediate key-value pair was generated. In the reducer, the input from the mapper was reduced and the final key-value output was generated.
Q7) The seventh task is to identify for how many videos in a channel ratings are disabled. As in the previous task, the ratings-disabled field is binary information. In the mapper, each time the ratings-disabled value is True, the corresponding channel name is set as the key and the value is set to 1. In the reducer class, the value is added each time the channel name repeats, and the output is generated.
Q8) The eighth task is to find how many videos in a channel are not associated with a category. The snippet assignable column holds a binary value; every time the value is False, the corresponding channel and category were set as the key and a value was set in the mapper class. In the reducer class, every time the channel name and category were the same, the count was increased by 1, and thus the reducer output was generated.
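The counting and summing tasks above share one pattern: the mapper emits a key with a value, the framework groups the values by key, and the reducer adds them up. The sketch below illustrates that pattern in plain Python as a single-process stand-in. It is not the author's Java code and does not use Hadoop; the record layout, column positions and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical merged records; the field layout and values are illustrative.
RECORDS = [
    "v1,ChannelA,Music,100,False",
    "v2,ChannelB,Entertainment,250,True",
    "v3,ChannelA,Music,50,False",
    "v4,ChannelB,Entertainment,75,True",
    "v5,ChannelC,Sports,30,False",
]
# Assumed column positions within each comma-separated line.
CHANNEL, CATEGORY, VIEWS, RATINGS_DISABLED = 1, 2, 3, 4


def category_mapper(line):
    """Q1-style mapper: emit (category, 1) for every record."""
    fields = line.split(",")
    yield fields[CATEGORY], 1


def views_mapper(line):
    """Q2-style mapper: emit (channel, views) for every record."""
    fields = line.split(",")
    yield fields[CHANNEL], int(fields[VIEWS])


def ratings_disabled_mapper(line):
    """Q7-style mapper: emit (channel, 1) only when ratings are disabled."""
    fields = line.split(",")
    if fields[RATINGS_DISABLED] == "True":
        yield fields[CHANNEL], 1


def sum_reducer(key, values):
    """Reducer shared by all three tasks: add up the values for one key."""
    return key, sum(values)


def run_job(lines, mapper, reducer):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    grouped = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            grouped[key].append(value)      # shuffle: group values by key
    # reduce phase: one reducer call per key, in sorted key order
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))


videos_per_category = run_job(RECORDS, category_mapper, sum_reducer)
views_per_channel = run_job(RECORDS, views_mapper, sum_reducer)
disabled_per_channel = run_job(RECORDS, ratings_disabled_mapper, sum_reducer)
```

Only the mapper changes between tasks here. In the Java version described in the paper, the same roles are played by the Mapper and Reducer classes that the Driver class wires into a job.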
The JAR file information for all the Java programs is shown in Figure 8.
Q9) The final task was to analyze the correlation between views, likes and comment count. This was done using Python. If views, likes and comment count are highly correlated, this suggests the viewers' interest in a particular video.
V. RESULTS
Q1) As seen in Figure 10, the Entertainment and Music categories have more videos than all the others. The graph shows the distribution of videos across categories. It is evident that YouTubers upload more videos to the Entertainment and Music categories. Figure 11 shows the distribution of videos as a word cloud: the more often a category name appears, the bigger and bolder its text.
Fig. 8. JAR file of all Tasks
Fig. 9. Output for Task1
Fig. 10. Videos in each category
Fig. 11. Distribution of videos
Q2) Figure 12 shows the top 20 channels by number of views. These 20 channels are the most viewed on YouTube. Most of them belong to the Music, Entertainment and Sports categories.
  • 5. Fig. 12. Top 20 channel with more views
Q3) Figure 13 shows the video IDs of the 20 most disliked videos. Using a video ID, the channel to which the video belongs can be identified. Here, the ID FlsCjmMhFmw belongs to the YouTube-Spotlight channel, which is the channel most disliked by viewers.
Fig. 13. Most Disliked videos
Q4) The result in Figure 14 shows which channel in which category has the most videos. In the Sports category, the ESPN channel has more videos than all other channels.
Fig. 14. Channel which has more number of Videos
Q5) Figure 15 displays the top 10 videos that have been trending for the most days on YouTube. The two videos that trended longest belong to the Sports and News & Politics categories.
Fig. 15. Top 10 shows trending from many days
Q6) Figure 16 shows that some of the videos belonging to the Entertainment, Film & Animation and Sports categories were removed. Videos belonging to other categories were not removed.
Fig. 16. Categories in which more videos are removed
Q7) The graph in Figure 17 shows for how many videos in a channel ratings have been disabled. Most of the videos with ratings disabled belong to the How To & Style category.
Fig. 17. Videos Streaming for many days
Q8) Figure 18 shows that the videos belonging to the CNET and Bleacher Report channels are not associated with their category, which is Shows. This means the snippet is not assignable.
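For Q9, the paper states only that the correlation between views, likes and comment count was computed in Python. A minimal sketch of such a computation follows, assuming the Pearson coefficient, since the paper does not name the coefficient used. The numbers are made-up toy values, not figures from the dataset, and the function names are hypothetical.

```python
import math

# Made-up toy values for illustration only; they are not taken from the dataset.
COLUMNS = {
    "views": [100, 200, 300, 400, 500],
    "likes": [10, 22, 29, 41, 50],
    "comment_count": [5, 3, 8, 4, 9],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients as a nested dict: matrix[a][b]."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


matrix = correlation_matrix(COLUMNS)
```

A coefficient close to 1 for a pair of columns means the two factors rise together, which is how such a matrix is read.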
  • 6. Fig. 18. Number of videos snippet not assignable
Q9) The result in Figure 19 shows that views and likes are more highly correlated with each other than either is with comment count. Viewers' interest can be analyzed from this correlation matrix: if a video has more views, it has a high chance of being liked by many viewers.
Fig. 19. Correlation Matrix
VI. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, YouTube data was analyzed using the Hadoop MapReduce framework. The research question was answered through MapReduce tasks. The analysis found that the Entertainment and Music categories have more videos and a greater number of views. Videos belonging to the Sports and News & Politics categories have been trending for many days. From these tasks, the trending videos, their patterns and user interest were analyzed. Viewers' interest can be identified from views, likes and comments. The project results also highlight the advantages of the Hadoop framework; its disadvantage is the syntactic complexity of Java MapReduce. In the future, the analysis can be extended using uploaders' information, and sentiment analysis can be performed on the video descriptions.
REFERENCES
[1] P. Merla and Y. Liang, “Data Analysis using Hadoop MapReduce Environment”, IEEE Conf. on Big Data, 2017 [Online]. Available: IEEE Xplore, https://www.ieee.org/ [Accessed: Apr. 10, 2020].
[2] C. Dabas, P. Kaur, N. Gulati and M. Tilak, “Analysis of Comments on YouTube Videos using Hadoop”, Fifth International Conf. on Image Information Processing (ICIIP), 2019.
[3] F. Shaikh, D. Pawaskar, U. Khan and A. Siddiqui, “YouTube Data Analysis using MapReduce on Hadoop”, IEEE Conf. on Recent Trends in Electronics, Information & Communication Technology, May 18/19, 2018.
[4] K. Bhatter, S. Gavhane, P. Dhamne, G. Aochar and S. Rabade, “A Review on YouTube Data Analysis using MapReduce on Hadoop”, International Journal of Research in Engineering, Science and Management, vol. 2, no. 12, Dec. 2019 [Online]. Available: https://www.ijresm.com/ [Accessed: Apr. 10, 2020].
[5] D. Harikumar, D. Kapoor and S. Waghmare, “YouTube Data Sensitivity and Analysis using Hadoop Framework”, International Research Journal of Engineering and Technology, vol. 6, no. 4, 2019 [Online]. Available: https://www.irjet.net/ [Accessed: Apr. 10, 2020].
[6] S. Biju and A. Mathew, “Comparative Analysis of Selected Big Data Analytics Tools”, Journal of International Technology and Information Management, vol. 26, no. 2, 2017.
[7] FactSlides.com, “25 Facts about YouTube FactSlides”, Nov. 28, 2019 [Online]. Available: https://www.factslides.com/s-YouTube/ [Accessed: Apr. 10, 2020].
[8] P. Cooper, Hootsuite Social Media Management, “23 YouTube Statistics that Matter to Marketers in 2020”, Dec. 17, 2019 [Online]. Available: https://blog.hootsuite.com/youtube-stats-marketers/ [Accessed: Apr. 10, 2020].
[9] M. J, “Trending YouTube Video Statistics”, Kaggle.com, 2019 [Online]. Available: https://www.kaggle.com/datasnaek/youtube-new/ [Accessed: Apr. 10, 2020].