Data Intensive Architectures

Analysis of Trending Videos Pattern on YouTube
using Hadoop MapReduce
Pooja Kumar
MSc Data Analytics
National College of Ireland
Dublin, Ireland
x18181929@student.ncirl.ie
Abstract— YouTube is one of the prominent sites for video
hosting and sharing. Improving the content and quality of
videos requires analysis of user interaction factors and of
streaming information about the videos. Video content can be
assessed using factors such as the number of views, likes,
comments and dislikes. Based on these, the channels that are
most active and that interest users most can be identified.
This project uses data containing user interaction factors
and video information to analyze the trending pattern. The
large YouTube dataset can be managed in Hadoop. Data is stored
in the Hadoop Distributed File System (HDFS) and processed
using the MapReduce framework. The data was analyzed to
distinguish the content and discover which content draws a
better response, which helps in organizing videos in the
future. Further, the analysis helps in understanding the
popular categories, the popular channels, the videos with a
high number of views, and the category of videos that is most
disliked.
Keywords— Hadoop, Hadoop Distributed File System
(HDFS), MapReduce, YouTube, Mapper, Reducer.
I. INTRODUCTION
Social networking sites offer a lucrative way to promote
and advertise. They offer various opportunities for product
publicity, for showcasing new movie trailers and images, and
for corporate promotions [2]. Beyond YouTube's clear
entertainment and leisure advantages, it is also being
leveraged for professional and other industry gains [3]. Most
companies release previews of their new products on
YouTube for greater exposure. YouTube lets uploaders
receive reviews, likes and shares from viewers as feedback on
their videos. This information can guide future improvements
and helps to tailor the product to viewers' needs. Each
minute, over 500 hours of video are uploaded to YouTube.
YouTube has more than 2 billion users, and on average 1
billion hours of video are watched on it daily. With the help
of 10,000 reviewers, around 32 million inappropriate videos
were removed in 2018. YouTube can be browsed in over 80
languages [7]. In the United States (U.S), YouTube reaches
more young people on mobile than any television network.
Around “73% of U.S adults use YouTube”. Figure 1 shows how
many internet users in the U.S use YouTube. Nearly “15% of
site traffics on YouTube comes from the U.S”, followed by
“India with 8.1% and Japan with 4.6%”. YouTube is expected to
generate $5.5 billion in advertising revenue in the U.S
alone by 2020 [8].
In this paper, Hadoop, a Java-based big data framework, was
used for handling the data. The MapReduce programming model
is used to process massive amounts of data in parallel, while
the Hadoop Distributed File System (HDFS) provides storage of
and access to the data for that processing. U.S YouTube data
was analyzed using HDFS and the MapReduce framework to
provide a better understanding of viewer interest.
Fig. 1. U.S YouTube Users
Research Question: How does the Hadoop MapReduce framework
help to analyze the pattern of trending videos on YouTube
and users’ interests, so that people can make sound
decisions about which trending topics to pursue?
The research question is answered by analyzing the following
topics using MapReduce.
Q1) How are videos distributed across categories, and which
category has the most videos?
Q2) Which channel has the most views?
Q3) Which video is disliked by the most users?
Q4) Which channel has the most videos, and to which category
does it belong?
Q5) Which video has been trending for the most days?
Q6) How many videos were removed from each category?
Q7) For how many videos in each channel are ratings
disabled?
Q8) How many videos in a channel are not associated with a
particular category?
Q9) How well are views, likes and comment count correlated
with each other? (Statistical approach using Python)
In this project, trending topics and people’s interests are
analyzed. This information can help people make sound
decisions about which trending topics to pursue. The
analysis helps uploaders learn more about their
competitors on YouTube. It discovers the videos that perform
best. Based on the videos uploaded and viewer interaction
factors, it detects the active channels and categories. With
this method, it is easy to find the number of videos in each
category, the number of views for each channel, and the
videos that viewers do not like.
II. RELATED WORK
When communication took traditional forms, the amount of
data transmitted was limited. Now, owing to widespread use of
the internet, companies and social media platforms hold
massive amounts of data. Because the data is too large to
inspect directly, analytics is used to extract information
from it. Critical data analysis has produced advanced
analytical intelligence for various data segments and for
forecasting. Conventional data analysis software is not
designed to handle such huge and complex data, and cannot
produce results for the accumulated data because of its
complexity. Thus, [Biju and Mathew, 2017] explained the need
for Big Data analytics [6]. With Big Data and low-cost
hardware available, there was a requirement to process data
quickly and cost-effectively [3].
The Hadoop MapReduce framework splits a large job into
smaller chunks that run in parallel and manages their
scheduling. Each chunk is mapped to intermediate values,
which are then reduced to the result of the analysis. A
MapReduce algorithm can be rewritten to suit the problem,
which is broken into chunks that are solved in parallel. This
is how it addresses large datasets with a distributed solving
method [5]. “YARN (Yet Another Resource Negotiator) separates
resource management functions from programming model”.
MapReduce runs as an application on top of YARN. The two
responsibilities of the JobTracker, resource management and
job scheduling, are split into separate daemons. The authors
explained how Hadoop MapReduce and YARN work [5].
At present, uploaders can use YouTube Analytics to analyze
their own channel. This analysis provides the overall
parameters, but only for their own channel; competitors’
information is not revealed [3].
The authors [Harikumar, Kapoor and Waghmare, 2019]
analyzed YouTube data. A technique was implemented to detect
sensitive text in comments and to identify the popular
content type, which helped both YouTubers and advertising
companies to upload videos based on popularity. Here, the
data was first collected, the sensitive content in the
comments was substituted, and the data was then analyzed for
popular content types. Based on this analysis, YouTubers
advertise products, from which ad revenue is earned [5].
The authors [Dabas, Kaur, Gulati and Tilak, 2019]
presented the classification and analysis of YouTube video
comments using a Hadoop application. The comments were
summarized using Hive queries, and sentiment analysis was
performed on the comments using Python. The system delivered
promising results for the queries in terms of execution
time [2].
III. METHODOLOGY
This study focuses on generating a pattern of user
interaction on YouTube using the Hadoop environment. Two
datasets were considered for this research: the videos
trending on YouTube and the region-specific video
categories. Both were downloaded from Kaggle [9].
The YouTube dataset holds a daily record of trending
videos over several months. It includes data for many
countries, each in a separate file. The U.S YouTube data file
was chosen for the analysis. It contains 40949 records with
the following information: video id, trending date, title,
channel, category id, publish time, tags, views, likes,
dislikes, comment count, thumbnail link, comments-disabled
flag, ratings-disabled flag, a flag for an error or removed
video, and description. The extracted file was in
comma-separated values (CSV) format and required little
pre-processing, as there were no missing or NaN values.
However, commas occurring inside the field values were
removed, because the comma acts as the delimiter, and the
dates were converted to the dd/mm/yyyy format.
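The two clean-up steps described above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the raw date layout (yy.dd.MM, for example 17.14.11) is an assumption, since the text does not state the original format.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class PreprocessSketch {
    // Assumption: raw trending dates look like "17.14.11" (yy.dd.MM).
    private static final DateTimeFormatter RAW = DateTimeFormatter.ofPattern("yy.dd.MM");
    // Target layout named in the text: dd/mm/yyyy.
    private static final DateTimeFormatter TARGET = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** Drops commas inside a single field so they cannot be mistaken for delimiters. */
    static String stripCommas(String field) {
        return field.replace(",", "");
    }

    /** Converts a raw date string to the dd/mm/yyyy layout. */
    static String toTargetDate(String raw) {
        return LocalDate.parse(raw, RAW).format(TARGET);
    }
}
```

Applying stripCommas to each text field before the row is written back as CSV keeps the column count stable for the later split on the delimiter.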
The category dataset differs for each region; this
information was fetched from the associated JSON file on
Kaggle [9]. The JSON file includes fields such as etag,
category id, snippet-channel id, snippet-title and
snippet-assignable. The JSON file was converted to structured
data in CSV format using Python code. This dataset needed no
pre-processing, as it contained all the information.
Both datasets were merged using Python code. While
merging, only the required information was kept, and columns
not used in the analysis were removed. After merging, the
dataset contained 40949 records. Figure 2 shows information
about the merged dataset. The combined dataset was loaded
into HDFS. The Java MapReduce framework was used for all
insights throughout the process. Once the results were
generated, they were visualized. By analyzing this dataset,
patterns in the trending videos can be drawn, which is useful
for YouTubers: knowing which kind of category or video gets a
better response helps them decide what content to upload.
Fig. 2. Dataset after merging
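The merge of the video records with the category table was done in Python in this work; the same idea, a left join on the category id, is sketched below in Java to stay close to the MapReduce code. The row layouts ("video_id,category_id,views" and "id,title") and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    /**
     * Appends the category title to every video row.
     * videoRows:    "video_id,category_id,views" (hypothetical layout)
     * categoryRows: "id,title"                   (hypothetical layout)
     */
    static List<String> merge(List<String> videoRows, List<String> categoryRows) {
        // Build a lookup table from category id to category title.
        Map<String, String> titleById = new HashMap<>();
        for (String row : categoryRows) {
            String[] f = row.split(",", 2);
            titleById.put(f[0], f[1]);
        }
        // Left join: every video row is kept, even if its category id is unknown.
        List<String> merged = new ArrayList<>();
        for (String row : videoRows) {
            String[] f = row.split(",");
            merged.add(row + "," + titleById.getOrDefault(f[1], "Unknown"));
        }
        return merged;
    }
}
```

Because every video row is kept, the number of records is the same before and after the merge, which matches the record count reported above.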

IV. IMPLEMENTATION
An architecture was designed for the implementation and
followed to perform the analysis on the YouTube data. The
required pre-processing was done on the dataset, and all the
essential records were merged and loaded into HDFS. The
MapReduce program read the data from HDFS, performed the
mapping and reducing operations, and stored the generated
output back in HDFS. The code was written in the Java
programming language. The generated output was then
visualized. Figure 3 shows the process flow of the data
analysis.
Fig. 3. Process Flow of Data Analysis
To perform a MapReduce task, three Java classes are
created: a Mapper, a Reducer and a Driver class. The mapper
class processes the input by splitting it and emitting
intermediate (key, value) pairs. The output of the mapper is
fed as input to the reducer class. In the reducer, data can
be aggregated, filtered and combined in several ways to
generate the output. The Driver class is responsible for
setting up the MapReduce job to run in Hadoop. The job name,
the input and output data types, and the class names of the
mapper and reducer are specified in the driver class. When
the Java program is compiled, a directory named after the
package declared in the Java source file is created in the
current directory, and all compiled files are put into it.
A JAR file also has to be created. The input is read from
HDFS and the generated output is stored back in HDFS. These
steps are the same for all the queries.
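The division of work between mapper, shuffle and reducer can be illustrated without a cluster. The sketch below is a plain-Java model of that flow and does not use the Hadoop Mapper, Reducer or Job classes; the names MiniMapReduce, Mapper, Reducer and run are hypothetical, and run only plays the role that the driver and the framework play in a real job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

class MiniMapReduce {
    /** Turns one input line into zero or more (key, value) pairs. */
    interface Mapper { void map(String line, BiConsumer<String, Integer> emit); }

    /** Collapses all values that share a key into one result. */
    interface Reducer { int reduce(String key, List<Integer> values); }

    static Map<String, Integer> run(List<String> lines, Mapper mapper, Reducer reducer) {
        // Map phase plus shuffle: group the emitted values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            mapper.map(line, (k, v) -> shuffled.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one reducer call per distinct key.
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            output.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return output;
    }
}
```

In an actual Hadoop job the grouping step is carried out by the framework between the map and reduce phases, and input and output live in HDFS rather than in memory.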
Q1) The first task is to find the distribution of videos by
category. To perform this task, three Java classes were
created: categoryMapper, categoryReducer and a driver class.
In the mapper class, shown in Figure 4, a variable is
declared to hold the lines read from the input file. Each
line is split on the delimiter and the values are stored in
an array. The column holding the category information is read
and emitted as the key, and for each key the value is set to
1. This is the intermediate key-value output of the mapper
class. The reducer class reads the mapper output as its
input, and the final output is generated by shuffling and
reducing. Figure 5 shows the reducer Java class. Figure 6
shows the driver class, in which the job name (YouTube), the
output data types (Text, IntWritable) and the names of the
mapper and reducer classes are specified to run the program.
The program was compiled, all the classes were added to a
JAR file with the jar command, as shown in Figure 7, and the
program was executed. The output is stored in HDFS. This task
was performed to learn how videos are distributed.
Fig. 4. Mapper class
Fig. 5. Reducer class
Fig. 6. Driver class
Fig. 7. JAR file for Task 1
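The counting logic of the first task can be sketched in plain Java as below. It is not the code shown in Figures 4 to 6: the class name and the column index of the category are hypothetical, and the map and reduce steps are folded into one loop for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CategoryCountSketch {
    // Hypothetical position of the category name in the merged CSV row.
    static final int CATEGORY_COL = 3;

    static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");            // map: split the line on the delimiter
            if (fields.length <= CATEGORY_COL) continue;   // skip rows that are too short
            // map emits (category, 1); reduce adds the 1s that share a key
            counts.merge(fields[CATEGORY_COL], 1, Integer::sum);
        }
        return counts;
    }
}
```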
Q2) The second task is to identify which channel has the
most views. In the mapper class, each channel and the number
of views associated with it are sorted and mapped as a
key-value pair. In the reducer, the key-value pairs are
shuffled and reduced. All the Java classes were added to the
JAR file for execution. After execution, the output was
stored in HDFS.
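A possible shape of this task is sketched below: views are summed per channel and the channels are then ranked. Summing is an assumption, since the text does not state how the reducer combines the view counts; the two-column row layout and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChannelViewsSketch {
    /** Rows are "channel,views" (hypothetical layout). Returns the n channels with the largest totals. */
    static List<Map.Entry<String, Long>> topChannels(List<String> rows, int n) {
        // Reduce step: add up the views emitted for each channel key.
        Map<String, Long> totals = new HashMap<>();
        for (String row : rows) {
            String[] f = row.split(",");
            totals.merge(f[0], Long.parseLong(f[1]), Long::sum);
        }
        // Rank channels by total views, largest first.
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return ranked.subList(0, Math.min(n, ranked.size()));
    }
}
```

Calling topChannels(rows, 20) would give a ranking of the kind plotted for the top 20 channels.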
Q3) The third task is to identify the most disliked video.
In the mapper, the video ID of each video and its number of
dislikes were considered. The video ID and dislike count were
sorted and an intermediate key-value pair was generated and
sent to the reducer for shuffling and reducing. An operation
was performed to count the number of dislikes given to each
video. From this output, the video that most viewers dislike
can be identified.
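One way to realise this task is sketched below. Because the same video can appear on several trending days, the sketch keeps the highest dislike count seen per video ID rather than adding the counts; that choice is an assumption and is not stated in the text. The row layout and names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MostDislikedSketch {
    /** Rows are "video_id,dislikes" (hypothetical layout). Returns the id with the highest count. */
    static String mostDisliked(List<String> rows) {
        // Reduce step: keep the peak dislike count observed for each video id.
        Map<String, Long> peak = new HashMap<>();
        for (String row : rows) {
            String[] f = row.split(",");
            peak.merge(f[0], Long.parseLong(f[1]), Long::max);
        }
        // Scan the reduced output for the overall maximum.
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : peak.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null when there are no rows
    }
}
```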
Q4) The fourth task is to find, per category, the channel
that has the most videos. In the mapper, the category and the
channel ID, which is the name of the channel, were combined
to form the key. For every key, the value was set, and the
key-value pair was sent to the reducer. In the reducer, the
value was counted for each occurrence of the same channel in
a category. Thus, the channel with the most videos in a
category was obtained as output.
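The composite key used in this task can be illustrated as follows. The sketch joins category and channel with a tab character to form one key; the separator, the row layout and the names are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CategoryChannelSketch {
    /** Rows are "category,channel" (hypothetical layout). Counts rows per (category, channel) pair. */
    static Map<String, Integer> videosPerPair(List<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            String[] f = row.split(",");
            String key = f[0] + "\t" + f[1];      // composite key: category + channel
            counts.merge(key, 1, Integer::sum);   // one count per occurrence of the pair
        }
        return counts;
    }
}
```

Since the key starts with the category, the sorted output lists all channels of one category next to each other, which makes the busiest channel per category easy to read off.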
Q5) The fifth task is to identify which video has been
trending for the most days on YouTube. Each video has a
record of its publishing date and trending date; using this
information, the number of trending days of a video can be
derived in the mapper class. The output of the mapper was
sent as input to the reducer, and the final reduced output
was generated.
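The text does not spell out how the number of trending days is derived. One possible reading, sketched below, takes the gap in days between the publishing date and each trending date in the mapper and keeps the largest gap per video in the reducer. This reading, the row layout and all names are assumptions.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrendingDaysSketch {
    // Dates are assumed to be in the dd/mm/yyyy layout produced during pre-processing.
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** Rows are "video_id,publish_date,trending_date" (hypothetical layout). */
    static Map<String, Long> longestSpan(List<String> rows) {
        Map<String, Long> span = new HashMap<>();
        for (String row : rows) {
            String[] f = row.split(",");
            // Map step: days from publishing to this trending date.
            long days = ChronoUnit.DAYS.between(LocalDate.parse(f[1], FMT), LocalDate.parse(f[2], FMT));
            // Reduce step: keep the largest span seen for the video.
            span.merge(f[0], days, Long::max);
        }
        return span;
    }
}
```

An alternative reading would count the distinct trending dates per video; the same map and reduce structure applies with a counter in place of the maximum.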
Q6) The sixth task is to check how many videos were removed
from each category. The mapper used the category and the
removal status, a binary value held in the column
video_error_or_removed. When the value is True, the
corresponding category name was set as the key and the value
was stored as 1, generating the intermediate key-value pair.
In the reducer, the input from the mapper was reduced and the
final key-value output was generated.
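The filter-then-count pattern of this task can be sketched as below. The sketch assumes that a video counts as removed when its video_error_or_removed flag reads True; the two-column row layout and the names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RemovedPerCategorySketch {
    /** Rows are "category,video_error_or_removed" (hypothetical layout). */
    static Map<String, Integer> removedPerCategory(List<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            String[] f = row.split(",");
            // Map step: emit (category, 1) only for rows flagged as removed.
            // Boolean.parseBoolean ignores case, so "True" and "true" both match.
            if (Boolean.parseBoolean(f[1].trim())) {
                counts.merge(f[0], 1, Integer::sum); // reduce step: sum per category
            }
        }
        return counts;
    }
}
```

The same pattern applies to any of the binary columns in the dataset by changing the column that is tested and the column used as the key.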
Q7) The seventh task is to identify for how many videos in
a channel ratings are disabled. This is similar to the
previous task, as the ratings-disabled field is binary. In
the mapper, each time the ratings-disabled value is True, the
corresponding channel name is set as the key and the value is
set to 1. In the reducer class, the values are added each
time the channel name repeats, and the output is generated.
Q8) The eighth task is to find how many videos in a
channel are not associated with their category. The snippet
assignable column holds a binary value; every time the value
is False, the corresponding channel and category were set as
the key and the value was set in the mapper class. In the
reducer class, every time the channel name and category were
the same, the count was increased by 1, and thus the reducer
output was generated.
The JAR file information for all the Java programs is shown
in Figure 8.
Q9) The final task was to analyze the correlation between
views, likes and comment count. This was done using Python.
If views, likes and comment counts are highly correlated, it
indicates the viewers’ interest in a particular video.
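The correlation in this work was computed in Python. For reference, the Pearson coefficient, which is assumed here to be the measure used, is r = cov(x, y) / (sd(x) * sd(y)); a small Java sketch with hypothetical names follows.

```java
class PearsonSketch {
    /** Pearson correlation of two equally long series; NaN if either series is constant. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length of at least 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        // Accumulate the co-deviation and the squared deviations around the means.
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

Computing the coefficient for each pair of views, likes and comment count gives the entries of a correlation matrix.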
V. RESULTS
Q1) As seen in Figure 10, the Entertainment and Music
categories have more videos than all the others. The graph
shows the distribution of videos for each category. It is
evident that YouTubers upload more videos to the
Entertainment and Music categories. Figure 11 shows the
distribution of videos as a word cloud, in which the text of
a category name becomes bigger and bolder the more often the
category appears.
Fig. 8. JAR file of all Tasks
Fig. 9. Output for Task1
Fig. 10. Videos in each category
Fig. 11. Distribution of videos
Q2) Figure 12 shows the top 20 channels by number of
views. These 20 channels are the most viewed on YouTube. Most
of them belong to the Music, Entertainment and Sports
categories.

Fig. 12. Top 20 channels with the most views
Q3) Figure 13 shows the video IDs of the 20 most disliked
videos. From a video ID, the channel to which the video
belongs can be identified. Here, the ID FlsCjmMhFmw belongs
to the YouTube-Spotlight channel, which is the channel most
disliked by viewers.
Fig. 13. Most disliked videos
Q4) The result in Figure 14 shows which channel in which
category has the most videos. In the Sports category, the
ESPN channel has more videos than all other channels.
Fig. 14. Channels with the most videos
Q5) Figure 15 displays the top 10 videos that have been
trending for the most days on YouTube. The two videos that
trended longest belong to the Sports and News & Politics
categories.
Fig. 15. Top 10 videos trending for the most days
Q6) Figure 16 shows that some of the videos belonging to the
Entertainment, Film & Animation and Sports categories were
removed. Videos belonging to other categories were not
removed.
Fig. 16. Categories from which videos were removed
Q7) The graph in Figure 17 shows, for each channel, the
number of videos for which ratings have been disabled. Most
of the videos with ratings disabled belong to the How To &
Style category.
Fig. 17. Videos with ratings disabled
Q8) Figure 18 shows that the videos belonging to the CNET
and Bleacher Report channels are not associated with their
category, which is Shows. This means the snippet is not
assignable.

Fig. 18. Number of videos whose snippet is not assignable
Q9) The result in Figure 19 shows that views and likes are
more highly correlated with each other than with comment
count. Based on this correlation matrix, viewers’ interest
can be analyzed. If a video has more views, it has a high
chance of being liked by many viewers.
Fig. 19. Correlation Matrix
VI. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, YouTube data was analyzed using the Hadoop
MapReduce framework. The research question was answered
through MapReduce tasks. The analysis found that the
Entertainment and Music categories have more videos and a
greater number of views. The videos that trended for the
most days belong to the Sports and News & Politics
categories. From these tasks, the trending videos, their
pattern and user interest were analyzed. Viewers’ interest
can be identified based on views, likes and comments. The
project results also highlight the advantages of the Hadoop
framework; its disadvantage is the syntax complexity of Java
MapReduce.
In the future, the analysis can be extended with uploaders’
information, and sentiment analysis can be performed on the
video descriptions.
REFERENCES
[1] P. Merla and Y. Liang, “Data Analysis using Hadoop MapReduce
Environment”, IEEE Conf. on Big Data, 2017 [Online]. Available:
IEEE Xplore, https://www.ieee.org/ [Accessed on: Apr. 10, 2020].
[2] C. Dabas, P. Kaur, N. Gulati and M. Tilak, “Analysis of Comments on
YouTube Videos using Hadoop”, Fifth International Conf. on Image
Information Processing (ICIIP), 2019.
[3] F. Shaikh, D. Pawaskar, U. Khan and A. Siddiqui, “YouTube Data
Analysis using MapReduce on Hadoop”, IEEE Conf. on Recent Trends
in Electronics, Information & Communication Technology, May.
18/19, 2018.
[4] K. Bhatter, S. Gavhane, P. Dhamne, G. Aochar and S. Rabade, “A
Review on YouTube Data Analysis using MapReduce on Hadoop”,
International Journal of Research in Engineering, Science and
Management, vol. 2, no. 12, Dec., 2019 [Online]. Available:
https://www.ijresm.com/ [Accessed on Apr. 10, 2020].
[5] D. Harikumar, D. Kapoor and S. Waghmare, “YouTube Data
Sensitivity and Analysis using Hadoop Framework”, International
Research Journal of Engineering and Technology, vol. 6, no. 4, 2019
[Online]. Available: https://www.irjet.net/ [Accessed on Apr. 10,
2020].
[6] S. Biju and A. Mathew, “Comparative Analysis of Selected Big Data
Analytics Tools”, Journal of International Technology and Information
Management, vol. 26, no. 2, 2017.
[7] FactSlides.com, “25 Facts about YouTube FactSlides”, Nov. 28, 2019
[Online]. Available: https://www.factslides.com/s-YouTube/
[Accessed on Apr. 10, 2020].
[8] P. Cooper, Hootsuite Social Media Management, “23 YouTube
Statistics that Matter to Marketers in 2020”, Dec. 17, 2019 [Online].
Available: https://blog.hootsuite.com/youtube-stats-marketers/
[Accessed on Apr. 10, 2020].
[9] M. J, “Trending YouTube Video Statistics”, Kaggle.com, 2019
[Online]. Available: https://www.kaggle.com/datasnaek/youtube-new/
[Accessed: Apr. 10, 2020].
