3. The emergence of the internet has made it possible for anyone to
address a mass audience. Today, every person can have an online
presence and use it to express their views on social networking
sites such as Twitter, YouTube, Facebook, and Instagram. These sites
offer easy access to their platforms, with few to no checks on the
user. Some people exploit this lack of verification and use their
anonymous identities to disturb others’ peace. The comment section
of a post, which is meant for meaningful discussion of the published
content, now often contains toxic and offensive messages. Many users
are demanding that comment sections be removed, as they offer little
to no value. A system that detects toxic text containing insults,
threats, and similar content would greatly help in filtering these
comment sections.
8. The main aim of this research is to propose a moderation system that can filter out
offensive, harsh, and abusive comments from the comment sections of various social media
platforms. Such models already exist for resource-rich languages such as English and French.
With the help of Natural Language Processing, similar models can be applied to comments
written in Hinglish (Hindi + English).
The research objectives, formulated from the aim of the study, are as follows:
• To classify offensive words commonly found in comments into classes such
as abusive, hateful, vulgar, insulting, threatening, etc.
• To apply various pre-processing techniques to the self-created dataset.
• To compare various predictive models and identify the one that most accurately
classifies comments into their correct classes.
• To evaluate the performance of the selected model on the created dataset.
• To integrate the evaluated model with platforms to help moderators filter out
classified comments.
10. In the era of the internet, everyone lives two lives, one physical and one digital. Many
people are more active digitally than physically. For example, a person with no friends
offline can have 100 friends on social media platforms. These digital identities are hard to
trace because of the privacy features that platforms offer their users. Most people express
their emotions through posts, videos, tweets, etc. Many people view this content, show their
support, and appreciate the creator via the attached comment section. But some people use
these comment sections to spread negativity. Their comments contain disheartening
messages that can discourage the creator. They also disrupt meaningful discussions,
so many users stop taking part in them.
These comments can be removed by a moderator, but the volume of comments
posted makes it impossible for any moderator to go through all of them. Our work
will help moderators sort and filter these comments without having to review each
one manually. In India, most people use Hinglish to chat and comment. To our knowledge,
our work would be the first for this language. It will also be able to filter English
comments. Our work will use a multilingual model to filter toxic comments, with the aim of
better performance.
12. In this project, we aim to develop a model that can classify
toxic comments written in English as well as in Hinglish.
Supporting other languages is very difficult within this
project. Detecting misspellings and identifying substituted
words used to disguise the original word are also out of the
scope of this project. However, the developed moderation
system can serve as a prototype for identifying toxic
comments in other languages.
14. The dataset is self-created using the
Python package youtube_comment_scrapper,
which requires only a YouTube link as
input.
It returns a CSV file containing the
columns [Comments, Time, Like, UserLink,
user].
15. Dropping a few columns that are not required
Columns such as "Unnamed:0", "Likes", "Time", "UserLink", and "user" are not
needed when determining whether a comment is toxic.
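
The column-dropping step could look roughly like the sketch below, assuming the scraped CSV is loaded with pandas. The sample rows, the `keep_comment_text` helper, and the `UNUSED_COLUMNS` list are hypothetical and stand in for the real scraper output. pandas names a header-less index column "Unnamed: 0" (with a space), and both "Like" and "Likes" are listed because the exact header spelling is an assumption.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the scraper's CSV output; the real file
# comes from scraping a YouTube link and is not reproduced here.
SAMPLE_CSV = (
    ",Comments,Time,Likes,UserLink,user\n"
    "0,Great video!,2 days ago,3,https://example.com/u/a,alice\n"
    "1,You are useless,1 day ago,0,https://example.com/u/b,bob\n"
)

# Columns that carry no signal for toxicity classification. Both "Like" and
# "Likes" are listed because the exact header spelling is an assumption.
UNUSED_COLUMNS = ["Unnamed: 0", "Like", "Likes", "Time", "UserLink", "user"]


def keep_comment_text(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the metadata columns."""
    # errors="ignore" skips any listed column that is absent from df.
    return df.drop(columns=UNUSED_COLUMNS, errors="ignore")


raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
comments = keep_comment_text(raw)
print(list(comments.columns))
```

Passing `errors="ignore"` keeps the step from failing when a listed column is missing, which is convenient while the exact CSV headers are still uncertain.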