Analysis of Trending Videos Pattern on YouTube
using Hadoop MapReduce
Pooja Kumar
MSc Data Analytics
National College of Ireland
Dublin, Ireland
x18181929@student.ncirl.ie
Abstract— YouTube is one of the prominent sites for video
hosting and sharing. To enhance the content and quality of the
video, analysis of user interaction factor and streaming
information of videos is required. Video content can be
scrutinized using factors like number of views, likes, comments
and dislikes. Based on this the channel which is more active and
in which the user has more interest can be identified. This
project aims to utilize the data which has user interaction
factors and information about videos to analyze the trending
pattern. The large dataset of YouTube can be managed in
Hadoop. Data is stored in the Hadoop Distributed File System
(HDFS) and processed using the MapReduce framework. Data
was analyzed to distinguish the content and discover which has
a better response, which helps to organize the videos in the
future. Further, the analysis helps in understanding the popular
category, the popular channel, the video which has a high
number of views, and which category of videos is most disliked.
Keywords— Hadoop, Hadoop Distributed File System
(HDFS), MapReduce, YouTube, Mapper, Reducer.
I. INTRODUCTION
Social networking sites offer a lucrative way to promote
and advertise. They offer various opportunities for product
publicity, showcasing new movie trailers and images, and
corporate promotions [2]. Beyond YouTube's clear
entertainment and leisure advantages, it is also being
leveraged for professional and other industry gains [3]. Most
companies are releasing previews of their new products on
YouTube for greater exposure. YouTube lets uploaders receive
reviews, likes and shares from viewers as feedback on their
videos. Using this information, future improvements can be
made and the product can be customized according to viewer
needs. Each minute, over 500 hours of video are uploaded to
YouTube. YouTube has more than 2 billion users. On average,
1 billion hours of video are watched on YouTube daily. Based on
10,000 reviewers around 32 million inappropriate videos were
removed in 2018. YouTube can be browsed in over 80 languages
[7]. In the United States (U.S), YouTube reaches more young
people on mobile than any television network. Around “73%
of U.S adults use YouTube”. Figure 1 shows how many
internet users in the U.S use YouTube. Nearly “15% of site
traffics on YouTube comes from the U.S”. Next come “India
with 8.1% and Japan with 4.6%”. YouTube will generate $5.5
billion in advertising revenues in the US alone by 2020 [8].
In this paper, Hadoop, a big data framework, was used for
handling data. It is a Java-based programming framework.
The MapReduce programming model is implemented for
processing massive amounts of data in parallel. The Hadoop
Distributed File System (HDFS) provides the file system on
which data is stored, accessed and processed. U.S
YouTube data was analyzed using HDFS and the MapReduce
framework to provide a better understanding of viewer
interest.
Fig. 1. U.S YouTube Users
Research Question: How can the Hadoop MapReduce framework
help to analyze the pattern of trending videos on YouTube
and users’ interest, so that people can make proper
decisions to advance on trending topics?
The research question is answered by analyzing the following
topics using MapReduce.
Q1) How are videos distributed, and which category has the
most videos?
Q2) Which channel has more views?
Q3) Which video is disliked by most of the users?
Q4) Which channel has more videos, and to which category
do they belong?
Q5) Which video has been trending for many days?
Q6) How many videos were removed from each category?
Q7) For how many videos in each channel is the rating
disabled?
Q8) How many videos in a channel are not associated with a
particular category?
Q9) How well are the factors views, likes and comment
count correlated with each other? (Statistical approach using
Python)
In this project, trending topics and people’s interest are
analyzed. This information can be used to benefit people by
helping them make proper decisions to advance on trending
topics. This analysis helps the uploader to find out more about
competitors on YouTube. It discovers the videos that perform
best. Based on the videos uploaded and the viewer interaction
factors, it detects the active channel and category. Using this
method, it is easy to find the number of videos in each category,
the number of views of each channel, and the videos that
viewers don’t like.
II. RELATED WORK
When communication took traditional forms, the data
transmitted was limited in size. Now, due to the widespread
use of the internet, companies and social media platforms
possess massive amounts of data. Because this data is too
large to analyze and extract information from directly,
analytics is used. Critical data analysis has resulted in
advanced analytical intelligence for various data segments
and in forecasts of future trends. Conventional data analysis
software is not designed to handle such huge and complex
data, and it cannot generate results for the accumulated
data because of its complexity. On this basis, Biju and Mathew
(2017) explained the need for Big Data analytics [6]. Because
of the availability of Big Data and low-cost hardware, there
was a requirement to process data quickly and
cost-effectively [3].
The Hadoop MapReduce framework splits a large job
into smaller chunks that run in parallel and manages their
scheduling. Each chunk is mapped to intermediate values,
which are then reduced to the result of the analysis. A
MapReduce algorithm can be rewritten according to the
problem, which is broken into chunks to be solved in parallel.
This is how it addresses large datasets with a distributed
solving method [5]. “YARN (Yet Another Resource Negotiator)
separates resource management functions from programming
model”. MapReduce runs as an application on top of YARN.
The JobTracker’s two responsibilities, resource management
and job scheduling, are split into separate daemons. The
authors explained how Hadoop MapReduce and YARN work
[5].
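The map, shuffle and reduce phases described above can be illustrated with a minimal sketch in plain Python. It is not Hadoop code: the function names (map_chunk, shuffle, reduce_groups, run_job), the chunk size and the thread pool that stands in for parallel workers are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map phase: turn each record of one chunk into an intermediate (key, 1) pair."""
    return [(record, 1) for record in chunk]


def shuffle(mapped_chunks):
    """Shuffle phase: group the intermediate values of all chunks by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce phase: collapse the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records, chunk_size=2):
    # Split the input into chunks that can be mapped independently.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    # The thread pool is only a stand-in for Hadoop's parallel map tasks.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


# Hypothetical input: one category name per record.
result = run_job(["Music", "Sports", "Music", "Comedy", "Music"])
```

Because each chunk is mapped without reference to any other chunk, the map step can be distributed freely; only the shuffle needs to see the output of every chunk.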
At present, uploaders can use YouTube analytics to
analyze their own channel. This analysis provides the overall
parameters, but only for the uploader’s own channel;
competitors’ information is not revealed [3].
Harikumar, Kapoor and Waghmare (2019) analyzed
YouTube data. A technique was implemented to detect
sensitive text in comments and to identify the popular
content type, which helped both YouTubers and
advertising companies to upload videos based on
popularity. First, the data was collected and the sensitive
content in the comments was substituted; the data was then
analyzed for the popular content type. Based on this analysis,
YouTubers advertise the products from which ad revenue is
earned [5].
Dabas, Kaur, Gulati and Tilak (2019)
presented the classification and analysis of YouTube video
comments based on a Hadoop application. The comments were
summarized using Hive queries. Sentiment analysis was
performed on the comments using Python. The system
delivered a promising result for queries in terms of
execution time [2].
III. METHODOLOGY
This study focuses on generating a pattern using the
Hadoop environment for user interaction on YouTube. For
this research, two datasets were considered: the videos
trending on YouTube and the region-specific video
categories. Both were downloaded from Kaggle [9].
The YouTube dataset has a daily record of trending
videos over several months. It includes the data of many
countries, each in a separate file. The U.S YouTube data file
was chosen for the analysis. It contains 40949 records and has
information such as video id, trending date, title, channel,
category id, publish time, tags, views, likes, dislikes,
comment count, thumbnail link, comment disabled
information, ratings disabled information, information about
the error or removed video and description. The extracted file
was in comma-separated values (CSV) format and required
little pre-processing as there were no missing or NaN values.
However, all commas inside field values were removed, as the
comma acts as the delimiter, and the date was converted into
the proper format, i.e., dd/mm/yyyy.
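The two cleaning steps mentioned above (removing commas and reformatting the date) can be sketched as follows. The helper names are hypothetical, and the raw date layout yy.dd.mm is an assumption about the downloaded file; the paper only states the target format dd/mm/yyyy.

```python
from datetime import datetime


def clean_text(value):
    """Drop commas so a text field cannot be mistaken for extra CSV columns."""
    return value.replace(",", "")


def format_trending_date(raw, source_format="%y.%d.%m"):
    """Convert a raw date (assumed to be yy.dd.mm) into dd/mm/yyyy."""
    return datetime.strptime(raw, source_format).strftime("%d/%m/%Y")


# Hypothetical field values, not taken from the dataset.
title = clean_text("New trailer, official, HD")
trending_date = format_trending_date("17.14.11")
```

If the raw file uses a different date layout, only the source_format argument needs to change.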
The category dataset is different for each region. This
information was fetched from the associated JSON file from
Kaggle [9]. The JSON file includes information such as etag,
category id, snippet-channel id, snippet-title and
snippet-assignable. The JSON file was converted to structured
data, i.e., to CSV format, using Python code. There was no need
to pre-process this dataset as it contained all the information.
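A minimal sketch of the JSON-to-CSV conversion is given below. The nesting of the JSON (an items list whose entries carry etag, id and a snippet object) is an assumption based on the field names listed above, and the sample values, column names and function names are invented for illustration.

```python
import csv
import io
import json

# Hypothetical sample in the assumed layout of the category file.
SAMPLE_JSON = """
{"items": [
  {"etag": "tag-a", "id": "1",
   "snippet": {"channelId": "chan-x", "title": "Category A", "assignable": true}},
  {"etag": "tag-b", "id": "2",
   "snippet": {"channelId": "chan-x", "title": "Category B", "assignable": false}}
]}
"""

FIELDS = ["etag", "category_id", "channel_id", "title", "assignable"]


def categories_to_rows(json_text):
    """Flatten each nested category item into one flat row."""
    items = json.loads(json_text)["items"]
    return [
        {
            "etag": item["etag"],
            "category_id": item["id"],
            "channel_id": item["snippet"]["channelId"],
            "title": item["snippet"]["title"],
            "assignable": item["snippet"]["assignable"],
        }
        for item in items
    ]


def rows_to_csv(rows):
    """Write the flat rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(categories_to_rows(SAMPLE_JSON))
```

The sketch writes to an in-memory buffer so that it runs without touching the file system; writing to a real file would use the same writer on an opened file object.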
Both datasets were merged using Python code. While
merging, only the required information was taken into
consideration, and the columns that were not used in the
analysis were removed. After merging there were 40949
records in the dataset. Figure 2 shows the information about
the merged dataset. The combined dataset was loaded into
HDFS. The Java MapReduce framework was utilized for all
insights in the entire process. After the results were
generated, they were visualized. By analyzing this dataset,
patterns about the trending videos can be drawn, which would
be useful for YouTubers. This helps to know which kind of
category or video has a better response; based on that,
YouTubers can upload their content.
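The merge step can be sketched as a lookup join in plain Python. The join key (category id), the retained columns and the function name are assumptions; the paper does not list which columns were kept.

```python
# Columns retained from the video file (hypothetical choice).
KEEP = ["video_id", "channel_title", "views"]


def merge_videos_with_categories(videos, categories):
    """Attach the category title to every video row and keep only needed columns."""
    title_by_id = {c["category_id"]: c["title"] for c in categories}
    merged = []
    for video in videos:
        row = {name: video[name] for name in KEEP}
        # Videos with an unknown category id are kept with an empty category.
        row["category"] = title_by_id.get(video["category_id"], "")
        merged.append(row)
    return merged


# Hypothetical rows for illustration.
videos = [
    {"video_id": "v1", "channel_title": "ChannelA", "views": "100",
     "category_id": "1", "tags": "x"},
    {"video_id": "v2", "channel_title": "ChannelB", "views": "250",
     "category_id": "2", "tags": "y"},
]
categories = [
    {"category_id": "1", "title": "Category A"},
    {"category_id": "2", "title": "Category B"},
]
merged = merge_videos_with_categories(videos, categories)
```

Because every video row is kept, the merged dataset has as many records as the video file, which is consistent with the record count of 40949 reported before and after merging.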
Fig. 2. Dataset after merging
IV. IMPLEMENTATION
An architecture was designed for the implementation and
followed to perform the analysis on YouTube data. The
required processing was done on the dataset, and all the
essential records from the dataset were merged and loaded into
HDFS. Using the MapReduce program, the data was read
from HDFS. Mapping and reducing operations were
performed and the generated output was stored in HDFS.
The code was written in the Java programming language. The
generated output was then visualized. Figure 3 shows the
process flow of the data analysis.
Fig. 3. Process Flow of Data Analysis
To perform a MapReduce task, three Java classes are
created: the Mapper, Reducer and Driver classes. In the mapper
class, the input is processed and intermediate key-value pairs
are generated by splitting the input and recording it in the form
of (key, value) pairs. The output from the mapper is fed as input
to the reducer class. For a wide range of processing, data can be
aggregated, filtered and combined in several ways in the reducer
to generate the output. The Driver class is responsible for
configuring the MapReduce job to run in Hadoop. The job name,
the data types for input and output, and the class names of the
mapper and reducer are specified in the driver class. During
Java program compilation, a directory is created in the current
directory with the package name mentioned in the Java source
file, and all compiled files are put into it. A JAR file also has to
be created. The input was read from HDFS and the generated
output was stored back into HDFS. This procedure is the same
for all the queries.
Q1) The first task is to find the distribution of videos based
on category. To perform this task, three Java classes were
created: categoryMapper, categoryReducer and the driver class.
In the mapper class, seen in Figure 4, a variable is declared that
holds the lines read from the input file. Using the delimiter, each
line is split and the values are stored as an array. The column
which has the category information is read and mapped as the
key. For each key, the value is stored as 1. This is the
intermediate key-value pair output generated by the mapper
class. In the reducer class, the output from the mapper is read as
input. By shuffling and reducing, the output is generated.
Figure 5 shows the reducer Java class. Figure 6 shows the driver
class, in which the job name (YouTube), the data type for the
output (Text, IntWritable) and the names of the mapper and
reducer classes are specified to run the program. This program
was compiled. Using the JAR command, all the classes were
added to the JAR file shown in Figure 7 and the program was
executed. The output is stored in HDFS. This task was
performed to know how videos are distributed.
Fig. 4. Mapper class
Fig. 5. Reducer class
Fig. 6. Driver class
Fig. 7. JAR file for Task 1
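The logic of the Q1 mapper and reducer can be sketched in plain Python as follows. This is an illustration of the key-value flow only, not the Java classes shown in the figures; the column index, the delimiter and the function names are assumptions.

```python
from collections import defaultdict

CATEGORY_COLUMN = 3  # hypothetical position of the category column


def category_mapper(line, delimiter=","):
    """Split one input line and emit (category, 1)."""
    fields = line.split(delimiter)
    yield fields[CATEGORY_COLUMN], 1


def count_reducer(key, values):
    """Sum the 1s emitted for one category."""
    return key, sum(values)


def run(lines):
    # Group mapper output by key, as the shuffle step would.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in category_mapper(line):
            grouped[key].append(value)
    return dict(count_reducer(key, values) for key, values in grouped.items())


# Hypothetical input lines: video id, trending date, channel, category.
lines = [
    "v1,14/11/2017,ChannelA,Music",
    "v2,14/11/2017,ChannelB,Sports",
    "v3,15/11/2017,ChannelA,Music",
]
category_counts = run(lines)
```

The split on the delimiter works only because commas inside field values were removed during pre-processing.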
Q2) The second task is to identify which channel has more
views. In the mapper class, the channel and the number of views
associated with that channel are mapped as a key-value pair. In
the reducer, the key-value pairs were shuffled and reduced. All
the Java classes were added to the JAR file to perform the
execution. After execution, the output was stored in HDFS.
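A possible reading of the Q2 job is sketched below in plain Python. It assumes that the reducer sums the view counts of each channel and that the channels are ranked afterwards; the paper does not state the aggregation, and the field and function names are hypothetical.

```python
from collections import defaultdict


def views_mapper(record):
    """Emit (channel, views) for one record."""
    return record["channel"], int(record["views"])


def top_channels(records, n):
    """Sum the views per channel and return the n channels with the most views."""
    totals = defaultdict(int)
    for record in records:
        channel, views = views_mapper(record)
        totals[channel] += views
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical records for illustration.
records = [
    {"channel": "ChannelA", "views": "100"},
    {"channel": "ChannelB", "views": "120"},
    {"channel": "ChannelA", "views": "50"},
    {"channel": "ChannelC", "views": "10"},
]
top_two = top_channels(records, 2)
```

Since the dataset holds one record per video per trending day, a sum counts the same video on every day it trends; taking the maximum per video first would be an alternative design.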
Q3) The third task is to identify the video which is most
disliked. In the mapper, the video ID associated with each video
and the number of dislikes were taken into consideration.
The video ID and dislike count were emitted as an intermediate
key-value pair. It was sent to the reducer to perform the
shuffling and reducing task. An operation was performed to
count the number of dislikes that were given to the video.
Based on this output, the video which most of the viewers do
not like can be identified.
Q4) The fourth task is to analyze which channel in each
category has more videos. In the mapper, the category and the
channel ID (the name of the channel) were combined to form a
key. For every key, a value was set and this key-value pair was
sent to the reducer. In the reducer, for each occurrence of the
same channel in a category, the value was counted. Thus, the
channel in a category that has more videos was
fetched as output.
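The composite key used in Q4 can be sketched in plain Python as a (category, channel) tuple. The value of 1 per record, the final selection of the busiest channel per category and all names are assumptions made for illustration.

```python
from collections import Counter


def pair_mapper(record):
    """Emit ((category, channel), 1) so that both fields together form the key."""
    return (record["category"], record["channel"]), 1


def count_pairs(records):
    """Count how often each (category, channel) key occurs."""
    counts = Counter()
    for record in records:
        key, value = pair_mapper(record)
        counts[key] += value
    return counts


def busiest_channel_per_category(counts):
    """For each category, keep the channel with the highest count."""
    best = {}
    for (category, channel), total in counts.items():
        if category not in best or total > best[category][1]:
            best[category] = (channel, total)
    return best


# Hypothetical records for illustration.
records = [
    {"category": "Sports", "channel": "ChannelA"},
    {"category": "Sports", "channel": "ChannelA"},
    {"category": "Sports", "channel": "ChannelB"},
    {"category": "Music", "channel": "ChannelC"},
]
best = busiest_channel_per_category(count_pairs(records))
```

Using both fields as the key keeps channels with the same name in different categories apart.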
Q5) The fifth task is to identify which video has been
trending for many days on YouTube. For each video, there is
a record of the publishing date and the trending date. Using
this information, the number of trending days of a video can be
computed in the mapper class. The output from the mapper was
sent as input to the reducer and the final reduced output was
generated.
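One way to read the Q5 job is sketched below in plain Python. It assumes that the mapper computes the number of days between the publishing date and the trending date, that the reducer keeps the largest value per video, and that both dates are already in dd/mm/yyyy format; none of these details is confirmed by the paper, and the names are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # assumed format of both date columns


def days_mapper(record):
    """Emit (video id, days between publishing date and trending date)."""
    published = datetime.strptime(record["publish_date"], DATE_FORMAT)
    trending = datetime.strptime(record["trending_date"], DATE_FORMAT)
    return record["video_id"], (trending - published).days


def longest_span(records):
    """Keep the largest day count seen for each video."""
    spans = {}
    for record in records:
        video_id, days = days_mapper(record)
        spans[video_id] = max(days, spans.get(video_id, days))
    return spans


# Hypothetical records for illustration.
records = [
    {"video_id": "v1", "publish_date": "10/11/2017", "trending_date": "14/11/2017"},
    {"video_id": "v1", "publish_date": "10/11/2017", "trending_date": "16/11/2017"},
    {"video_id": "v2", "publish_date": "01/12/2017", "trending_date": "02/12/2017"},
]
spans = longest_span(records)
```

An alternative under the same data would be to count the distinct trending dates per video, which measures days on the trending list rather than days since publication.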
Q6) The sixth task is to check how many videos were
removed from each category. In the mapper, the category and
the video-removed flag, which is binary information
in the column video_error_or_removed, were used. When the
flag is False, the corresponding category name was set as the
key and the value was stored as 1. Thus, the intermediate
key-value pair was generated. In the reducer, the input from the
mapper was reduced and the final key-value output was
generated.
Q7) The seventh task is to identify for how many videos in a
channel the rating is disabled. This is similar to the above task,
as the rating disabled flag is binary information. In the mapper,
each time the rating disabled flag is True, the corresponding
channel name is set as the key and the value is set to 1. In the
reducer class, each time the channel name is repeated, the
value is added and the output is generated.
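The filter-then-count pattern of Q7 can be sketched in plain Python as follows. The field names, the string encoding of the flag ("True"/"False") and the function names are assumptions.

```python
from collections import Counter


def ratings_disabled_mapper(record):
    """Emit (channel, 1) only for records whose rating is disabled."""
    if record["ratings_disabled"] == "True":
        yield record["channel"], 1


def count_by_channel(records):
    """Add up the emitted 1s per channel, as the reducer would."""
    totals = Counter()
    for record in records:
        for channel, value in ratings_disabled_mapper(record):
            totals[channel] += value
    return dict(totals)


# Hypothetical records for illustration.
records = [
    {"channel": "ChannelA", "ratings_disabled": "True"},
    {"channel": "ChannelB", "ratings_disabled": "False"},
    {"channel": "ChannelA", "ratings_disabled": "True"},
    {"channel": "ChannelC", "ratings_disabled": "True"},
]
disabled_counts = count_by_channel(records)
```

Channels that never have the flag set emit nothing in the mapper and therefore do not appear in the output at all.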
Q8) The eighth task is to find how many videos in a
channel are not associated with the category. Here, the snippet
assignable column holds a binary value. Each time the value is
False, the corresponding channel and category were set as the
key and a value was set in the mapper class. In the reducer
class, each time the channel name and category were the same,
the count was increased by 1, and thus the reducer output was
generated.
The JAR file information of all the Java programs is shown
in Figure 8.
Q9) The final task was to analyze the correlation between
views, likes and comment count. This was done using Python.
If views, likes and comment count are highly correlated, this
suggests the viewers’ interest in a particular video.
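A correlation matrix of this kind can be sketched in plain Python as follows. The use of the Pearson coefficient is an assumption, since the paper does not name the correlation measure, and the sample values and function names are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlation of every column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values, not taken from the dataset.
columns = {
    "views": [100, 200, 300, 400],
    "likes": [10, 20, 30, 40],
    "comment_count": [5, 1, 4, 2],
}
matrix = correlation_matrix(columns)
```

In this toy data, likes grow exactly in proportion to views, so their coefficient is 1, while comment count is only weakly (and negatively) related to views.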
V. RESULTS
Q1) As seen in Figure 10, the entertainment and music
categories have more videos than all the others. This
graph shows the distribution of videos for each category. It is
evident that YouTubers upload more videos to the
entertainment and music categories. Figure 11 shows the
distribution of videos as a word cloud: the more often a
category name appears, the bigger and bolder its text
becomes.
Fig. 8. JAR file of all Tasks
Fig. 9. Output for Task1
Fig. 10. Videos in each category
Fig. 11. Distribution of videos
Q2) Figure 12 shows the top 20 channels which have
more views. It can be seen that these 20 channels are viewed
more on YouTube. Most of the channels here belong to the
music, entertainment and sports categories.
Fig.12. Top 20 channel with more views
Q3) Figure 13 shows the video IDs of the 20 videos that are
most disliked. Using a video ID, the corresponding
channel to which the video belongs can be identified. Here,
the ID FlsCjmMhFmw belongs to the YouTube-Spotlight
channel, and this channel is the channel most disliked by the
viewers.
Fig.13. Most Disliked videos
Q4) The result shown in Figure 14 indicates which channel
in which category has more videos. It is observed
that in the sports category the ESPN channel has more videos
than all other channels.
Fig.14. Channel which has more number of Videos
Q5) Figure 15 displays the top 10 videos that
have been trending for many days on YouTube. Of these, the
two videos which have been trending for the most days belong
to the Sports and News & Politics categories.
Fig.15. Top 10 shows trending from many days
Q6) In Figure 16, it is seen that some of the videos which
belong to the Entertainment, Film & Animation and Sports
categories were removed. The videos belonging to other
categories were not removed.
Fig.16. Categories in which more videos are removed
Q7) The graph in Figure 17 shows for
how many videos in a channel the rating has been disabled.
Most of the videos for which the rating has been
disabled belong to the How To & Style category.
Fig.17. Videos Streaming for many days
Q8) From Figure 18, it is observed that the videos
belonging to the CNET and Bleacher Report channels are not
associated with their category, which is Shows. This means
the snippet is not assignable.
Fig.18. Number of videos snippet not assignable
Q9) From the result in Figure 19, it is observed that views
and likes are more highly correlated with each other than with
comment count. Based on this correlation matrix, viewers’
interest can be analyzed. If a video has more views, then it has
a high chance of being liked by many viewers.
Fig.19. Correlation Matrix
VI. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, an analysis was made of YouTube data
using the Hadoop MapReduce framework. The research
question was answered by using MapReduce tasks. Through
this analysis, it was found that the entertainment and music
categories have more videos and a greater number of
views. The videos which belong to the sports and news &
politics categories have been trending for many days. From
these tasks, the trending pattern of videos and user interest
were analyzed. The viewers’ interest can be identified based on
views, likes and comments. The project results also highlight
the advantages of the Hadoop framework; its disadvantage is
the syntactic complexity of Java MapReduce.
In the future, the analysis can be extended using uploaders’
information, and sentiment analysis can be done on the
description of the video.
REFERENCES
[1] P. Merla and Y. Liang, “Data Analysis using Hadoop MapReduce
Environment”, IEEE Conf. on Big Data, 2017 [Online]. Available:
IEEE Xplore, https://www.ieee.org/ [Accessed on: Apr. 10, 2020].
[2] C. Dabas, P. Kaur, N. Gulati and M. Tilak, “Analysis of Comments on
YouTube Videos using Hadoop”, Fifth International Conf. on Image
Information Processing (ICIIP), 2019.
[3] F. Shaikh, D. Pawaskar, U. Khan and A. Siddiqui, “YouTube Data
Analysis using MapReduce on Hadoop”, IEEE Conf. on Recent Trends
in Electronics, Information & Communication Technology, May
18/19, 2018.
[4] K. Bhatter, S. Gavhane, P. Dhamne, G. Aochar and S. Rabade, “A
Review on YouTube Data Analysis using MapReduce on Hadoop”,
International Journal of Research in Engineering, Science and
Management, vol. 2, no. 12, Dec., 2019 [Online]. Available:
https://www.ijresm.com/ [Accessed on Apr. 10, 2020].
[5] D. Harikumar, D. Kapoor and S. Waghmare, “YouTube Data
Sensitivity and Analysis using Hadoop Framework”, International
Research Journal of Engineering and Technology, vol. 6, no. 4, 2019
[Online]. Available: https://www.irjet.net/ [Accessed on Apr. 10,
2020].
[6] S. Biju and A. Mathew, “Comparative Analysis of Selected Big Data
Analytics Tools”, Journal of International Technology and Information
Management, vol. 26, no. 2, 2017.
[7] FactSlides.com, “25 Facts about YouTube FactSlides”, Nov. 28, 2019
[Online]. Available: https://www.factslides.com/s-YouTube/
[Accessed on Apr. 10, 2020].
[8] P. Cooper, Hootsuite Social Media Management, “23 YouTube
Statistics that Matter to Marketers in 2020”, Dec. 17, 2019 [Online].
Available: https://blog.hootsuite.com/youtube-stats-marketers/
[Accessed on Apr. 10, 2020].
[9] M. J, “Trending YouTube Video Statistics”, Kaggle.com, 2019
[Online]. Available: https://www.kaggle.com/datasnaek/youtube-new/
[Accessed: Apr. 10, 2020].

Influence of Marketing Strategy and Market Competition on Business Plan
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 

Data Intensive Architectures

Analysis of Trending Videos Pattern on YouTube using Hadoop MapReduce
Pooja Kumar
MSc Data Analytics
National College of Ireland
Dublin, Ireland
x18181929@student.ncirl.ie

Abstract: YouTube is one of the most prominent sites for video hosting and sharing. Improving the content and quality of videos requires analysis of user interaction factors and streaming information. Video content can be assessed using factors such as the number of views, likes, comments and dislikes. From these, the channels that are most active and that attract the most user interest can be identified. This project uses data containing user interaction factors and video information to analyze the trending pattern. The large YouTube dataset can be managed in Hadoop: data is stored in the Hadoop Distributed File System (HDFS) and processed using the MapReduce framework. The data was analyzed to distinguish the content and discover which content draws a better response, which helps in organizing videos in the future. The analysis also helps in understanding the popular category, the popular channel, the video with a high number of views, and the category of videos that is most disliked.

Keywords: Hadoop, Hadoop Distributed File System (HDFS), MapReduce, YouTube, Mapper, Reducer.

I. INTRODUCTION
Social networking sites offer a lucrative way to promote and advertise. They offer various opportunities for product publicity, showcasing new movie trailers and images, and corporate promotions [2]. Beyond YouTube's clear entertainment and leisure advantages, it is also being leveraged for professional and other industry gains [3]. Most companies release previews of their new products on YouTube for greater exposure. YouTube lets uploaders receive reviews, likes and shares from viewers as feedback on their videos. This feedback can guide future improvements and helps to tailor the product to viewers' needs.
Each minute, over 500 hours of video are uploaded to YouTube. YouTube has more than 2 billion users, and on average 1 billion hours of video are watched on it daily. With 10,000 reviewers, around 32 million inappropriate videos were removed in 2018. YouTube can be browsed in over 80 languages [7]. In the United States (U.S.), YouTube reaches more young people on mobile than any television network. Around “73% of U.S adults use YouTube”. Figure 1 shows how many internet users in the U.S. use YouTube. Nearly “15% of site traffics on YouTube comes from the U.S”, followed by “India with 8.1% and Japan with 4.6%”. YouTube is expected to generate $5.5 billion in advertising revenue in the U.S. alone by 2020 [8].

In this paper, Hadoop, a Java-based big data framework, was used for handling the data. The MapReduce programming model is used to process massive amounts of data in parallel. The Hadoop Distributed File System (HDFS) provides data storage and access, together with processing and computation over a file system. U.S. YouTube data was analyzed using HDFS and the MapReduce framework to provide a better understanding of viewer interest.
Fig. 1. U.S YouTube Users

Research Question: How does the Hadoop MapReduce framework help to analyze the pattern of trending videos on YouTube and users' interests, so that people can make proper decisions to advance on trending topics?

The research question is answered by analyzing the following topics using MapReduce:
Q1) How are videos distributed, and which category has the most videos?
Q2) Which channel has the most views?
Q3) Which video is disliked by the most users?
Q4) Which channel has the most videos, and to which category does it belong?
Q5) Which video has been trending for the most days?
Q6) How many videos were removed from each category?
Q7) For how many videos in each channel is the rating disabled?
Q8) How many videos in a channel are not associated with a particular category?
Q9) How well are the factors views, likes and comment count correlated with each other? (Statistical approach using Python)

In this project, trending topics and people's interests can be analyzed. This information can be used to benefit people by helping them make proper decisions to advance on trending topics. This analysis helps the uploader find out more about competitors on YouTube. It discovers the videos that perform best. Based on the videos uploaded and the viewer interaction factors, it detects the active channels and categories. Using this method, it is easy to find the number of videos in each category, the number of viewers in each channel, and the videos that viewers do not like.

II. RELATED WORK
When communication took traditional forms, the data transmitted was limited in size. Now, owing to the widespread use of the internet, companies and social media platforms hold massive amounts of data. Because the data is too large to analyze and extract information from directly, analytics is used. Critical data analysis has produced advanced analytical intelligence for various data segments and supports forecasting. Conventional data analysis software is not designed to handle such huge and complex data and, because of that complexity, cannot generate results for the accumulated data. Thus, [Biju and Mathew, 2017] explained the need for Big Data analytics [6]. The availability of Big Data and low-cost hardware created a requirement to process data quickly and cost-effectively [3].

The Hadoop MapReduce framework splits a large job into smaller chunks that run in parallel and manages their scheduling. Each piece is mapped to an intermediate value, and the intermediate values are then reduced to the result of the analysis. A MapReduce algorithm can be rewritten to suit the problem, which is broken into chunks that are solved in parallel. This is how it addresses large datasets with a distributed solving method [5]. “YARN (Yet Another Resource Negotiator) separates resource management functions from programming model”. MapReduce runs as an application on top of YARN. YARN splits the JobTracker's responsibilities, resource management and job scheduling, into separate daemons. The author explained how Hadoop MapReduce and YARN work [5].
At present, uploaders can use YouTube Analytics to analyze their own channel. This analysis provides overall parameters, but only for the uploader's own channel; competitors' information is not revealed [3]. The authors [Harikumar, Kapoor and Waghmare, 2019] analyzed YouTube data. They implemented a technique to detect sensitive text in comments and to identify the popular content type, which helped both YouTubers and advertising companies to upload videos based on popularity. The data was first collected, the sensitive content in the comments was substituted, and the result was then analyzed for popular content types. Based on this analysis, YouTubers advertise products, from which ad revenue is earned [5]. The authors [Dabas, Kaur, Gulati and Tilak, 2019] presented the classification and analysis of YouTube video comments based on a Hadoop application. The comments were summarized using Hive queries, and sentiment analysis was performed on the comments using Python. The system delivered promising results for queries in terms of execution time [2].

III. METHODOLOGY
This study focuses on generating patterns of user interaction on YouTube using the Hadoop environment. Two datasets were considered for this research: the videos trending on YouTube and the region-specific video categories, both downloaded from Kaggle [9]. The YouTube dataset holds a daily record of trending videos over several months, with the data for each country in a separate file. The U.S. YouTube data file was chosen for the analysis. It contains 40949 records and has information such as video ID, trending date, title, channel, category ID, publish time, tags, views, likes, dislikes, comment count, thumbnail link, comments-disabled information, ratings-disabled information, information about an error or removed video, and description.
The extracted file was in comma-separated values (CSV) format and required little pre-processing, as there were no missing or NaN values. However, commas inside field values were removed, because the comma acts as the delimiter, and the dates were converted to a consistent format, i.e., dd/mm/yyyy. The category dataset is different for each region; this information was fetched from the associated JSON file on Kaggle [9]. The JSON file includes fields such as etag, category ID, snippet-channel ID, snippet-title and snippet-assignable. It was converted to structured data, i.e., CSV format, using Python code. This dataset needed no pre-processing, as it contained all the information. Both datasets were merged using Python code. While merging, only the required information was kept, and columns not used in the analysis were removed. After merging, there were 40949 records in the dataset. Figure 2 shows information about the merged dataset. The combined dataset was loaded into HDFS. The Java MapReduce framework was used for all insights throughout the process. Once the results were generated, they were visualized. By analyzing this dataset, patterns in the trending videos can be drawn that would be useful for YouTubers: knowing which category or kind of video gets a better response, they can decide what content to upload.
Fig. 2. Dataset after merging
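The JSON-to-table conversion and the merge described above could be sketched as follows. This is a minimal illustration in Python and not the author's code: the sample JSON and video records are invented, the field names (`items`, `id`, `snippet`, `title`, `assignable`, `category_id`) are assumptions based on the fields listed in the text, and the helper names `flatten_categories` and `merge_on_category` are hypothetical.

```python
import json

# Invented sample standing in for the category JSON file (structure assumed).
CATEGORY_JSON = """
{"items": [
  {"etag": "a", "id": "10",
   "snippet": {"channelId": "c1", "title": "Music", "assignable": true}},
  {"etag": "b", "id": "24",
   "snippet": {"channelId": "c1", "title": "Entertainment", "assignable": true}}
]}
"""

# Invented trending-video records with a few of the columns named in the text.
VIDEOS = [
    {"video_id": "v1", "channel_title": "ChannelA", "category_id": "10", "views": 100},
    {"video_id": "v2", "channel_title": "ChannelB", "category_id": "24", "views": 250},
    {"video_id": "v3", "channel_title": "ChannelA", "category_id": "10", "views": 40},
]


def flatten_categories(raw_json):
    """Turn the nested JSON items into flat rows keyed by category id."""
    rows = {}
    for item in json.loads(raw_json)["items"]:
        snippet = item["snippet"]
        rows[item["id"]] = {
            "category": snippet["title"],
            "assignable": snippet["assignable"],
        }
    return rows


def merge_on_category(videos, categories):
    """Attach the category row to each video record, matching on category_id."""
    missing = {"category": None, "assignable": None}
    return [{**video, **categories.get(video["category_id"], missing)}
            for video in videos]


merged_rows = merge_on_category(VIDEOS, flatten_categories(CATEGORY_JSON))
```

Every video record is kept by the merge, which matches the observation in the text that the record count is unchanged after merging.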
IV. IMPLEMENTATION
An architecture was designed for the implementation and followed to perform the analysis on the YouTube data. The required processing was done on the dataset, and all the essential records were merged and loaded into HDFS. The MapReduce program read the data from HDFS, performed the mapping and reducing operations, and stored the generated output back into HDFS. The code was written in the Java programming language. The generated output was then visualized. Figure 3 shows the process flow of the data analysis.
Fig. 3. Process Flow of Data Analysis

To perform a MapReduce task, three Java classes are created: a Mapper, a Reducer and a Driver class. The mapper class processes the input, splitting it and recording it as intermediate (key, value) pairs. The output of the mapper is fed as input to the reducer class. In the reducer, data can be aggregated, filtered and combined in several ways to generate the output. The Driver class is responsible for setting up the MapReduce job to run in Hadoop; it specifies the job name, the input and output data types, and the class names of the mapper and reducer. When the Java program is compiled, a directory named after the package declared in the Java source file is created in the current directory, and all compiled files are put into it. A JAR file also has to be created. The input is read from HDFS and the generated output is stored back into HDFS. This procedure is the same for all the queries.

Q1) The first task is to find the distribution of videos by category. To perform this task, three Java classes were created: categoryMapper, categoryReducer and a driver class. In the mapper class, shown in figure 4, a variable is declared that holds the line read from the input file. The line is split using the delimiter and the values are stored as an array. The column that holds the category information is read and mapped as the key.
For each key, the value is stored as 1. This is the intermediate key-value pair output generated by the mapper class. The reducer class reads the output of the mapper as its input and, after shuffling, reduces it to generate the output. Figure 5 shows the reducer Java class. Figure 6 shows the driver class, in which the job name (YouTube), the output data types (Text, IntWritable) and the names of the mapper and reducer classes are specified to run the program. The program was compiled, all the classes were added to a JAR file using the JAR command, as shown in figure 7, and the program was executed. The output is stored in HDFS. This task was performed to find out how videos are distributed.
Fig. 4. Mapper class
Fig. 5. Reducer class
Fig. 6. Driver class
Fig. 7. JAR file for Task 1

Q2) The second task is to identify which channel has the most views. In the mapper class, each channel and the number of views associated with it are sorted and mapped as a key-value pair. In the reducer, the key-value pairs were shuffled and reduced. All the Java classes were added to the JAR file for execution, and after execution the output was stored in HDFS.

Q3) The third task is to identify the most disliked video. In the mapper, the video ID of each video and its number of dislikes were taken into consideration. The video ID and dislike count were sorted and an intermediate key-value pair was generated. This was sent to the reducer for shuffling and reducing, where the number of dislikes given to each video was counted. From this output, the video that most viewers dislike can be identified.

Q4) The fourth task is to analyze which channel has the most videos in each category. In the mapper, the category and the channel ID, which is the name of the channel, were grouped to form a key. For every key, a value was set, and this key-value pair was sent to the reducer. In the reducer, the value was counted for each occurrence of the same channel in a category. Thus, the channel with the most videos in a given category was fetched as output.

Q5) The fifth task is to identify which video has been trending for the most days on YouTube. For each video, there is a record of the publishing date and the trending date; using this information, the number of trending days of a video can be obtained in the mapper class. The output of the mapper was sent as input to the reducer and the final reduced output was generated.

Q6) The sixth task is to check how many videos were removed from each category. The mapper used the category and the removed flag, which is binary information held in the column video_error_or_removed. When the flag is True, the corresponding category name was set as the key and the value was stored as 1. Thus, the intermediate key-value pair was generated. In the reducer, the input from the mapper was reduced and the final key-value output was generated.

Q7) The seventh task is to identify for how many videos in each channel the rating is disabled. This is similar to the previous task, as the ratings-disabled field is also binary information. In the mapper, each time the ratings-disabled value is True, the corresponding channel name is set as the key and the value is set to 1. In the reducer class, the values are added up each time the channel name is repeated, and the output is generated.

Q8) The eighth task is to find how many videos in a channel are not associated with the category. Here, the snippet assignable column holds a binary value; every time the value is false, the corresponding channel and category were set as the key and a value was set in the mapper class. In the reducer class, every time the channel name and category were the same, the count was increased by 1, and thus the reducer output was generated.
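The tasks above share one pattern: the mapper emits a key with a value, the framework groups the values by key, and the reducer aggregates each group. The following plain-Python sketch mimics that flow in memory for the Q1, Q2, Q4 and Q7 styles of task. It is not the Java MapReduce code used in the paper and does not use Hadoop; the records, the field names and the `run_job` helper are hypothetical.

```python
from collections import defaultdict

# Invented records; field names loosely follow the dataset description.
RECORDS = [
    {"channel": "ChannelA", "category": "Music", "views": 100, "ratings_disabled": False},
    {"channel": "ChannelB", "category": "Entertainment", "views": 250, "ratings_disabled": True},
    {"channel": "ChannelA", "category": "Music", "views": 40, "ratings_disabled": False},
    {"channel": "ChannelB", "category": "Sports", "views": 10, "ratings_disabled": True},
]


def run_job(records, mapper, reducer):
    """Map every record, group values by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


def category_mapper(record):
    # Q1 style: emit (category, 1) for every record.
    yield record["category"], 1


def views_mapper(record):
    # Q2 style: emit (channel, views) so the reducer can total views per channel.
    yield record["channel"], record["views"]


def category_channel_mapper(record):
    # Q4 style: a composite (category, channel) key with the value 1.
    yield (record["category"], record["channel"]), 1


def ratings_disabled_mapper(record):
    # Q7 style: emit (channel, 1) only when the binary flag is True.
    if record["ratings_disabled"]:
        yield record["channel"], 1


# Summing the grouped values plays the role of the reducer in each task.
videos_per_category = run_job(RECORDS, category_mapper, sum)
views_per_channel = run_job(RECORDS, views_mapper, sum)
videos_per_category_channel = run_job(RECORDS, category_channel_mapper, sum)
ratings_disabled_per_channel = run_job(RECORDS, ratings_disabled_mapper, sum)
```

Only the mapper changes between tasks in this sketch, which mirrors how the text describes reusing the same reducer logic with a different key or filter condition.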
Information on the JAR files of all the Java programs is shown in figure 8.

Q9) The final task was to analyze the correlation between views, likes and comment count. This was done using Python. If views, likes and comment count are highly correlated, this suggests viewer interest in a particular video.

V. RESULTS
Q1) As seen in figure 10, the Entertainment and Music categories have more videos than all the others. The graph shows the distribution of videos across categories. It is evident that YouTubers upload more videos to the Entertainment and Music categories. Figure 11 shows the distribution of videos as a word cloud, in which a category name becomes bigger and bolder the more times it appears.
Fig. 8. JAR file of all Tasks
Fig. 9. Output for Task1
Fig. 10. Videos in each category
Fig. 11. Distribution of videos

Q2) Figure 12 shows the top 20 channels by number of views. These 20 channels are viewed more than the others on YouTube. Most of them belong to the Music, Entertainment and Sports categories.
Fig. 12. Top 20 channels with more views

Q3) Figure 13 shows the video IDs of the 20 most disliked videos. From a video ID, the channel to which the video belongs can be identified. Here, the ID FlsCjmMhFmw belongs to the YouTube-Spotlight channel, which makes it the channel most disliked by viewers.
Fig. 13. Most Disliked videos

Q4) The result in figure 14 shows which channel in which category has the most videos. In the Sports category, the ESPN channel has more videos than all other channels.
Fig. 14. Channel which has more number of Videos

Q5) Figure 15 displays the top 10 videos that have been trending for many days on YouTube. The two videos that trended for the most days belong to the Sports and News & Politics categories.
Fig. 15. Top 10 shows trending from many days

Q6) Figure 16 shows that some of the videos belonging to the Entertainment, Film & Animation and Sports categories were removed. Videos belonging to other categories were not removed.
Fig. 16. Categories in which more videos are removed

Q7) The graph in figure 17 shows for how many videos in each channel the rating has been disabled. Most of the videos for which the rating has been disabled belong to the How To & Style category.
Fig. 17. Videos Streaming for many days

Q8) From figure 18, it is observed that the videos belonging to the CNET and Bleacher Report channels are not associated with their category, which is Shows. This means the snippet is not assignable.
Fig. 18. Number of videos snippet not assignable

Q9) From the result in figure 19, it is observed that views and likes are more highly correlated with each other than with comment count. Viewer interest can be analyzed from this correlation matrix: if a video has more views, it has a high chance of being liked by many viewers.
Fig. 19. Correlation Matrix

VI. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, YouTube data was analyzed using the Hadoop MapReduce framework. The research question was answered using MapReduce tasks. The analysis found that the Entertainment and Music categories have more videos and a greater number of views. The shows that belong to the Sports and News & Politics categories have been trending for many days. From these tasks, the video trending pattern and user interest were analyzed. Viewer interest can be identified from views, likes and comments. The project results also highlight the advantages of the Hadoop framework; its disadvantage is the syntax complexity of Java MapReduce. In the future, the analysis can be extended using uploader information, and sentiment analysis can be performed on video descriptions.

REFERENCES
[1] P. Merla and Y. Liang, “Data Analysis using Hadoop MapReduce Environment”, IEEE Conf. on Big Data, 2017 [Online]. Available: IEEE Xplore, https://www.ieee.org/ [Accessed on: Apr. 10, 2020].
[2] C. Dabas, P. Kaur, N. Gulati and M. Tilak, “Analysis of Comments on YouTube Videos using Hadoop”, Fifth International Conf. on Image Information Processing (ICIIP), 2019.
[3] F. Shaikh, D. Pawaskar, U. Khan and A. Siddiqui, “YouTube Data Analysis using MapReduce on Hadoop”, IEEE Conf. on Recent Trends in Electronics, Information & Communication Technology, May 18/19, 2018.
[4] K. Bhatter, S. Gavhane, P. Dhamne, G. Aochar and S. Rabade, “A Review on YouTube Data Analysis using MapReduce on Hadoop”, International Journal of Research in Engineering, Science and Management, vol. 2, no. 12, Dec. 2019 [Online]. Available: https://www.ijresm.com/ [Accessed on: Apr. 10, 2020].
[5] D. Harikumar, D. Kapoor and S. Waghmare, “YouTube Data Sensitivity and Analysis using Hadoop Framework”, International Research Journal of Engineering and Technology, vol. 6, no. 4, 2019 [Online]. Available: https://www.irjet.net/ [Accessed on: Apr. 10, 2020].
[6] S. Biju and A. Mathew, “Comparative Analysis of Selected Big Data Analytics Tools”, Journal of International Technology and Information Management, vol. 26, no. 2, 2017.
[7] FactSlides.com, “25 Facts about YouTube FactSlides”, Nov. 28, 2019 [Online]. Available: https://www.factslides.com/s-YouTube/ [Accessed on: Apr. 10, 2020].
[8] P. Cooper, Hootsuite Social Media Management, “23 YouTube Statistics that Matter to Marketers in 2020”, Dec. 17, 2019 [Online]. Available: https://blog.hootsuite.com/youtube-stats-marketers/ [Accessed on: Apr. 10, 2020].
[9] M. J, “Trending YouTube Video Statistics”, Kaggle.com, 2019 [Online]. Available: https://www.kaggle.com/datasnaek/youtube-new/ [Accessed on: Apr. 10, 2020].