youtube.docx

Major project documentation
“YouTube trending video analysis DBMS”
is submitted to
Department of Computer Applications,
Submitted To:
Submitted By:
Project Undertaken:

Acknowledgement
The satisfaction that accompanies the successful completion of any task
would be incomplete without the mention of the people whose ceaseless
cooperation made it possible, and whose constant guidance and encouragement
crown all efforts with success. We are grateful to our project guide, Mr. Shakti
Kundu, for the guidance, inspiration and constructive suggestions that helped
us in the preparation of this project.
We are also thankful to our colleagues, with whom we had
fruitful discussions that helped us a lot in giving a final shape to the
program.
ABSTRACT
Unlike popular videos, which would have already achieved
high viewership numbers by the time they are declared
popular, YouTube trending videos represent content that
targets viewers’ attention over a relatively short time, and
has the potential of becoming popular. Despite their
importance and visibility, YouTube trending videos have
not been studied or analyzed thoroughly. In this paper, we
present our findings for measuring, analyzing, and
comparing key aspects of YouTube trending videos. Our
study is based on collecting and monitoring high-resolution
time-series of the viewership and related statistics of more
than 8,000 YouTube videos over an aggregate period of
nine months. Since trending videos are declared as such
just several hours after they are uploaded, we are able to
analyze trending videos’ time-series across critical and
sufficiently-long durations of their lifecycle. In addition,
we analyze the profile of users who upload trending videos,
to potentially identify the role that these users’ profile plays
in getting their uploaded videos trending. Furthermore, we
conduct a directional-relationship analysis among all pairs
of trending videos’ time-series that we have monitored. We
employ Granger Causality (GC) with significance testing to
conduct this analysis. Unlike traditional correlation
measures, our directional-relationship analysis provides a
deeper insight into the viewership pattern of different
categories of trending videos. Our findings include the
following. Trending videos and their channels have clearly
distinct statistical attributes when compared to other
YouTube content that has not been labeled as trending.
Based on the GC measure, the viewership of nearly all
trending videos has some level of directional-relationship
with other trending videos in our dataset. Our results also
reveal a highly asymmetric directional-relationship among
different categories of trending videos. Our directionality
analysis also shows a clear pattern of viewership toward
popular categories, whereas some categories tend to be
isolated with little evidence of transitions among them.

Introduction
YouTube, a user-generated content platform, is one of the largest
and most popular video sharing websites. It serves over four
billion views a day. YouTube provides public statistics
regarding its uploaded videos, most notably the number of
views, which shows the aggregate number of times a video
has been watched up to that point. Naturally, the number
of views for a video indicates the level of popularity of that
video; and it takes a varying amount of time for a video to
become popular (if it becomes popular). Meanwhile, there
is content that targets viewers’ attention over a
relatively short time. YouTube also supports a feature
called trending, which represents content that has the
potential of becoming popular in a relatively short time.
Consequently, although trending videos are usually not
popular (yet) when declared as trending by YouTube, they
have the potential of becoming popular (eventually). For
example, some videos are labeled trending while having
only a few hundred views. From another
perspective, through trending videos, YouTube tries to
highlight emerging trends developing within different
viewership communities.
Meanwhile, the general attributes of the viewership of
trending videos have not been studied thoroughly. To the
best of our knowledge, basic statistics about YouTube
trending videos have not been studied, analyzed, or even
received any adequate attention. Considering the fact that
more than one billion unique users visit YouTube each
month and they upload 72 hours of video every minute
[26], YouTube is the best place for e.g. brand engagement
or advertising, but it is genuinely difficult and competitive
to get the attention of users. Therefore when a video
becomes popular, it is exposed to millions of users for free
and has the opportunity of keeping their attention for a
while. Finding these trends is so important that
many websites have emerged just to track trends on
YouTube for content owners or advertisers. Better
understanding of YouTube trending videos and their
statistics, and a deeper insight about their lifecycles, can
greatly affect the strategies for marketing, target
advertising, recommendation systems and search engines,
as was suggested by prior YouTube measurement studies
[2]. This represents a key motivation for our effort.
Scope
Our aim is to produce a data science preprocessing analysis working solely
with the US Videos dataset. This step is important for all data processing
exercises, and we wish to emphasize it. Before building theories from data,
we want to understand key data attributes such as missing values, distinct
counts, outliers, and time-series trends. This kernel aims to serve as a
tutorial for anyone interested in exploiting large datasets. We focus only on
the US videos dataset, which is not too large by big-data standards (only
23,362 rows by 16 columns as of March 2018). This dataset contains only
YouTube data and no data that are difficult to process and store, such as
video, image, audio, or large text documents. Still, we proceed with data
preprocessing and exploratory data analysis (EDA) as if this were a very
large dataset, using techniques that would be used in much more difficult
data mining exercises. We employ a number of techniques from the
scikit-learn toolkit to give meaning to the data at hand.
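The attribute checks named above (missing values, distinct counts, outliers) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the sample rows are made up, the `profile` helper is hypothetical, and the 1.5 x IQR rule is one common outlier convention rather than the method used in this project, which refers to scikit-learn.

```python
import csv
import io
from statistics import quantiles

# Hypothetical sample in the shape of the US videos file; all values are made up.
SAMPLE_CSV = """channel_title,views,likes
ChannelA,1000,50
ChannelB,1200,
ChannelA,900,40
ChannelC,1100,55
ChannelB,50000,70
"""


def profile(rows, numeric_column):
    """Return missing counts, distinct counts, and IQR outliers for one numeric column."""
    columns = list(rows[0].keys())
    # An empty string is treated as a missing value.
    missing = {c: sum(1 for r in rows if r[c] == "") for c in columns}
    distinct = {c: len({r[c] for r in rows if r[c] != ""}) for c in columns}
    values = [float(r[numeric_column]) for r in rows if r[numeric_column] != ""]
    # Quartiles of the column; "inclusive" treats the sample as the whole population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in values if v < low or v > high]
    return missing, distinct, outliers


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing, distinct, outliers = profile(rows, "views")
print(missing)
print(distinct)
print(outliers)
```

On the real file the same function would be fed rows read from disk instead of the in-memory sample.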
INPUT DESIGN
Input design is the process of converting user-oriented input to a computer-based format.
Input design is a part of overall system design and requires very careful attention.
Often the collection of input data is the most expensive part of the system.
The main objectives of input design are:
1. Produce cost effective method of input
2. Achieve highest possible level of accuracy
3. Ensure that the input is acceptable to and understood by the staff.

INPUT DATA:
The goal of designing input data is to make data entry as easy, logical, and error-free as
possible. Data entry operators need to know the space allocated for each field and the field
sequence, which must match that in the source document. The format in which the data fields
are entered should be given in the input form. Here data entry is online; it makes use of a
processor that accepts commands and data from the operator through a keyboard. The processor
analyzes the input and then accepts or rejects it.
Input stages include the following processes:
Data Recording
Data Transcription
Data Conversion
Data Verification
Data Control
Data Transmission
Data Correction
One of the aims of the system analyst must be to select data capture methods and devices
that reduce the number of stages, so as to reduce both the chances of errors and the cost.
Input types can be characterized as:
External
Internal
Operational
Computerized
Interactive
Input files can exist in document form before being input to the computer. Input design is
rather complex since it involves procedures for capturing data as well as inputting it to the
computer.
Trending
Channel_Title | No of videos | Subscribers | Company

Primary key: channel_title is the primary key because it is unique, and all
other information about a channel can be obtained through this single key.
Candidate key: there is no candidate key other than channel_title, because no
other attribute can uniquely identify a record of the Trending entity.
Foreign key: no foreign key is present in this entity.
Coding
DATABASE DESIGN
There are two choices for defining the database schema: SQL and
NoSQL. We can use a traditional database management system such as
MS SQL Server or MySQL to keep the data. Information about videos and
users should be kept in the RDBMS, and other information about videos,
called metadata, should be kept too. This gives us three main tables to
keep the data. (Note that we consider only the basic properties of YouTube
and leave out the recommendation system.)
User
– UserID (primary key)
– Name (nvarchar)
– Age (Integer)
– Email (nvarchar)
– Address (nvarchar)
– Register Date (DateTime)
– Last Login (DateTime)
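The User fields listed above could be declared roughly as follows. This is a minimal sketch, not the project's actual schema: SQLite (through Python's `sqlite3` module) stands in for MySQL or MS SQL Server so that it runs without a server, the snake_case column names and the column lengths are assumptions, and the inserted row is made up.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the RDBMS named in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user (
        user_id       INTEGER PRIMARY KEY,  -- UserID (primary key)
        name          NVARCHAR(100),        -- length is an assumed value
        age           INTEGER,
        email         NVARCHAR(255),        -- length is an assumed value
        address       NVARCHAR(255),        -- length is an assumed value
        register_date DATETIME,
        last_login    DATETIME
    )
    """
)

# Made-up row; user_id is assigned automatically because it is the integer primary key.
conn.execute(
    "INSERT INTO user (name, age, email, address, register_date, last_login) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Example User", 30, "user@example.com", "1 Example Street",
     "2018-01-01 10:00:00", "2018-01-02 09:30:00"),
)

# List the column names the table ended up with.
columns = [row[1] for row in conn.execute("PRAGMA table_info(user)")]
print(columns)
```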
SQL is a language used for managing data in relational databases
that store data in tabular form with labelled rows and columns. We
query data from a relational database with the select statement of
SQL. The select statement is highly versatile and flexible in terms of
data transformation and filtering operations.

In that sense, SQL can be considered as a data analysis tool. The
advantage of using SQL for data transformation and filtering is that
we only retrieve the data we need. It is more practical and efficient
than retrieving all the data and then applying these operations.
In this article, we will use SQL statements and functions to analyze
YouTube trending video statistics. The dataset is available on
Kaggle. I created an SQL table that contains a small part of this
dataset.
Note: I’m using MySQL as the database management system.
Although SQL syntax is mostly the same for all database
management systems, there might be small differences.
The table is called “trending” and it has the following structure.
trending table
We have the dates a video is published and becomes trending. We
also have the title and channel of the video. The views and likes are
the other two features the dataset contains.
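For readers who want to follow along, a table with these features can be sketched as below. The column names are the ones the queries in this document use; the column types, the use of SQLite through Python's `sqlite3` module in place of MySQL, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trending (
        trending_date DATE,      -- day the video appeared as trending
        publish_time  DATETIME,  -- upload date and time
        title         TEXT,
        channel_title TEXT,
        views         INTEGER,
        likes         INTEGER
    )
    """
)

# Made-up rows for illustration only; they are not taken from the Kaggle dataset.
sample_rows = [
    ("2018-01-02", "2018-01-01 15:30:03", "Video one", "Channel A", 1000, 40),
    ("2018-01-02", "2018-01-01 01:05:59", "Video two", "Channel A", 3000, 90),
    ("2018-01-03", "2018-01-01 14:21:14", "Video three", "Channel B", 500, 10),
]
conn.executemany("INSERT INTO trending VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

# Count trending videos per channel, most frequent first.
top_channels = conn.execute(
    """
    SELECT channel_title, COUNT(*) AS number_of_videos
    FROM trending
    GROUP BY channel_title
    ORDER BY number_of_videos DESC
    LIMIT 5
    """
).fetchall()
print(top_channels)
```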

Regarding all these features (i.e. columns) we can do a bunch of
different operations. For instance, a simple one can be finding the
top 5 channels in terms of the number of trending videos.
mysql> select channel_title, count(*) as number_of_videos
    -> from trending
    -> group by channel_title
    -> order by number_of_videos desc
    -> limit 5;
+-----------------+------------------+
| channel_title   | number_of_videos |
+-----------------+------------------+
| Washington Post |               28 |
| Netflix         |               28 |
| ESPN            |               27 |
| TED-Ed          |               27 |
| CNN             |               27 |
+-----------------+------------------+
We select the channel title column and count the number of rows.
The “as” keyword is used to assign a new name to the aggregated
columns. The group by clause is used to group the videos (i.e. rows)
based on channels. Finally, we sort the results in descending order
using the order by clause and display the first 5.
The number of videos seems to be too low because I only included
the ones published in January, 2018.
We may want to see the title of the most-viewed video.
mysql> select title, views
-> from trending
-> where views = (select max(views) from trending);

The query above contains a nested select statement. It is used in
the where clause to keep only the rows whose views value equals the
maximum of the views column.
The most-viewed video in this table has been watched almost 60
million times.
SQL provides many different options for filtering the data. In the
previous example, we found out that the most-viewed video belongs
to Bruno Mars. We can filter the titles to see only the videos that
belong to Bruno Mars.
mysql> select distinct(title)
-> from trending
-> where title like "%Bruno Mars%";
We do not have to provide the exact value for filtering if we use the
like keyword. The “%” wildcard matches any sequence of zero or more
characters, so “%Bruno Mars%” matches any value that contains the
“Bruno Mars” phrase. The distinct keyword is used to remove the duplicates.
If we are not sure about characters being lower or uppercase, we can
convert all the characters to lowercase before filtering. The conversion
is applied in the where clause as well as in the select list.
mysql> select distinct(lower(title))
    -> from trending
    -> where lower(title) like "%bruno mars%";

The dataset contains the published date of videos and when they
become trending. We can calculate the average time it takes for a
video to become trending.
Before calculating the difference, we need to extract the date part
from the publish time column because it contains both the date and
time.
mysql> select trending_date, publish_time
-> from trending
-> limit 3;
+---------------+---------------------+
| trending_date | publish_time |
+---------------+---------------------+
| 2018-01-02 | 2018-01-01 15:30:03 |
| 2018-01-02 | 2018-01-01 01:05:59 |
| 2018-01-02 | 2018-01-01 14:21:14 |
+---------------+---------------------+
The date function extracts the date part and the datediff function
calculates the difference. Thus, we can calculate the average
difference as follows:
mysql> select avg(datediff(trending_date, date(publish_time)))
    -> as avg_diff
    -> from trending;
+----------+
| avg_diff |
+----------+
|   3.9221 |
+----------+

The datediff function takes two dates separated by a comma and
returns the first date minus the second, in days. It takes 3.92 days
on average for a video to become trending.
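To make the arithmetic concrete, the same calculation can be traced by hand on the three sample rows printed earlier. The sketch below uses Python's `datetime` module rather than MySQL; the `datediff` helper is a hypothetical stand-in that mirrors the argument order of the SQL function.

```python
from datetime import date, datetime


def datediff(later: date, earlier: date) -> int:
    """Whole-day difference, later minus earlier (hypothetical helper)."""
    return (later - earlier).days


# The three sample rows shown earlier: (trending_date, publish_time).
rows = [
    ("2018-01-02", "2018-01-01 15:30:03"),
    ("2018-01-02", "2018-01-01 01:05:59"),
    ("2018-01-02", "2018-01-01 14:21:14"),
]

diffs = []
for trending_date, publish_time in rows:
    trending = date.fromisoformat(trending_date)
    # Keep only the date part of publish_time, as date(publish_time) does in the query.
    published = datetime.strptime(publish_time, "%Y-%m-%d %H:%M:%S").date()
    diffs.append(datediff(trending, published))

avg_diff = sum(diffs) / len(diffs)
print(diffs, avg_diff)
```

Because the time of day is dropped first, a video published at 15:30 on one day and trending on the next counts as a full one-day difference, even though fewer than 24 hours have passed.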
We can also calculate the average difference for videos that are
published in a specific time period. We just need to add a where
clause for filtering.
mysql> select avg(datediff(trending_date, date(publish_time)))
    -> as avg_diff
    -> from trending
    -> where hour(publish_time) > 20;
+----------+
| avg_diff |
+----------+
|   4.4825 |
+----------+
We extract the hour value from publish time and use it in the where
clause, so only videos published at 21:00 or later are included.
SQL provides functions for data aggregation which can be
implemented in the select statement. For instance, we can calculate
the average ratio of likes over views of videos published by Netflix.
mysql> select avg(likes / views)
    -> from trending
    -> where channel_title = "Netflix";
+--------------------+
| avg(likes / views) |
+--------------------+
|         0.01816295 |
+--------------------+
The average value is close to 0.02, so Netflix videos have a
like-to-view ratio of approximately 2 percent.

Let’s write a slightly more complicated query and calculate the
average video views of channels that published more than 25 videos.
We will also sort the results in descending order by the averages.
mysql> select channel_title, avg(views) as avg_views,
-> count(title) as number_of_videos
-> from trending
-> group by channel_title
-> having number_of_videos > 25
-> order by avg_views desc;
The retrieved data contains 3 columns. One is the channel title
column and the other two are aggregated columns. We filter the
channels based on the number of videos.
You may have noticed that we used the “having” clause instead of
the “where” clause for filtering. The “having” clause is used for
filtering based on aggregated columns.
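The difference between the two clauses is easiest to see when both appear in one query. The following is a minimal sketch, assuming SQLite (via Python's `sqlite3` module) in place of MySQL, a made-up six-row table, and a threshold of 1 video instead of 25 so that the tiny sample produces output.

```python
import sqlite3

# Assumption: an in-memory SQLite database with made-up rows stands in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trending (channel_title TEXT, title TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO trending VALUES (?, ?, ?)",
    [
        ("Channel A", "a1", 100),
        ("Channel A", "a2", 300),
        ("Channel A", "a3", 50),
        ("Channel B", "b1", 900),
        ("Channel C", "c1", 200),
        ("Channel C", "c2", 400),
    ],
)

result = conn.execute(
    """
    SELECT channel_title, AVG(views) AS avg_views, COUNT(title) AS number_of_videos
    FROM trending
    WHERE views >= 100           -- row filter: applied before grouping
    GROUP BY channel_title
    HAVING COUNT(title) > 1      -- group filter: applied after aggregation
    ORDER BY avg_views DESC
    """
).fetchall()
print(result)
```

The where clause drops the 50-view row before any group is formed. The having clause then drops Channel B, which has a single remaining video, even though that video has the highest view count.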
ER Diagram

Conclusion
We have worked through several examples that analyze the YouTube trending
video statistics. The examples demonstrate that SQL can also
be used as a data analysis tool.
