Filtering Offensive Language in Online Communities using Grammatical Relations
BY
SAMUEL AYOKUNLE ADEKANMBI
MATRIC NO: 133466
Project submitted in partial fulfillment of the requirements for the award of the Master of Science degree
(Computer Science)
Department of Computer Science,
University of Ibadan.
February, 2014.
Certification
I certify that this research work was carried out by Samuel Ayokunle ADEKANMBI (133466)
under my supervision.
____________________ _______________________
Date Dr B O Longe
DEDICATION
This entire work is dedicated to everyone that believes in the PromoUpdate dream.
ACKNOWLEDGEMENT
My profound gratitude goes to my parent and my siblings for their moral and financial support
which has immensely led to the success of this project. To my Dad, You are the best; I love you
so much even though I don’t show it.
I am indeed grateful to my supervisor, Dr. Olumide B. Longe for his moral support, patience and
understanding during the course of this project. Thank you very much Sir.
I also want to appreciate my very good and crazy friends: Tini, Phina, Kunchasho, TY, Alamu,
Oluwashola Amiola Philip, Emmanuel, Muideen, Lola Mojekodunmi, Jane, Gbenro, N.O Jimoh,
Tifa; You guys are my brothers from another mother.
I cannot overemphasize the effort of all my lecturers in the department; I pray the blessing of the Lord
shall not depart from your homes.
My M.Sc. programme would have been incomplete without a set of wonderful people: Tini,
Phina, Helen, Rotimi, Modupe, Tolu, Big Fish, Last Don, Giel, and the whole crew at chief Madu’s
Palace. Thanks for being there for me.
To all my classmates, Dimple, Becky, Elohor, Ben, Fake AYs, John, Uzomma, Deola, Banky,
Shukurat, Toyosi, Shola, Adesi, GP, Toyosi, Tosinsss, etc; you have been a blessing to me and the
success of my programme. I say a big thanks to you for your support throughout the programme.
I appreciate your love. Thanks for believing in the PromoUpdate dream. You guys are the best.
Finally, to anyone that has contributed to the success of this project and my success in life, whose
name is not mentioned here, please just know that you are not unknown to me and you are
appreciated more than you know. God bless you all. See you at the top.
TABLE OF CONTENTS
Title page
Certification
Dedication
Acknowledgement
Table of contents
Abstract
CHAPTER ONE: INTRODUCTION
1.1 Background of Study
1.2 Problem Statement
1.3 Aims and Objectives
1.4 Research Methodology
1.5 Scope and Limitation
1.6 Organization of the Study
1.7 Expected Contribution to Knowledge
1.7.1 Glossary of Terms
CHAPTER TWO: REVIEW OF THE LITERATURE
2.1 Offensive Language in Online Communities
2.2 Rate of Cyberbullying among Youth
2.3 Traditional Bullying and Cyber-Bullying
2.4 Types of Bullying Online
2.5 Challenges in the Fight to Stop Cyberbullying
2.6 Preventing Cyberbullying
2.7 Responding to Cyberbullying
2.8 Grammatical Relations
2.9 Using Text Mining Techniques to Detect Online Offensive Content
2.10 Heads and Dependents
2.11 Statistical Parsing
2.12 Dependency Parsing
CHAPTER THREE: SYSTEM ANALYSIS AND DESIGN
3.1 Systems Analysis
3.2 Analysis of the Existing System
3.3 Problems of the Existing Approaches
3.4 Proposed Filtering Philosophy
3.5 Identify Removable Content by Grammatical Relations
CHAPTER FOUR: IMPLEMENTATION
4.1 Justification of Programming Language Used
4.2 System Specification
4.3 System Implementation
CHAPTER FIVE: SUMMARY, CONCLUSION AND FUTURE WORKS
5.1 Summary
5.2 Conclusion
5.3 Future Works
References
ABSTRACT
Offensive language has risen to be a big issue to the health of both online communities and their
users. To the online community, the spread of offensive language undermines its reputation, drives
users away, and even directly affects its growth. To users, viewing offensive language brings
negative influence to their mental health, especially for children and youth.
A semantic filtering model is proposed and implemented using grammatical analysis and part-of-speech
tagging. Statistical/probabilistic analysis of recurring offensive tokens is done using a
Bayesian method. The designed semantic filtering system was tested as an online web application
with a client application by engaging users to validate the efficiency of the designed system.
When offensive language is detected in a user message, a problem arises about how the offensive
language should be removed, i.e. the offensive language filtering problem.
Our semantic filtering technique is based on the grammatical relations of words in a sentence so
that the rest of the filtered sentence is readable and the existence of offensive words in the original
sentence is hard to notice. We tested the effectiveness of our approach with a large dataset and the
results show that our techniques are very effective and accurate with little processing overhead.
Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process,
we will examine other lightweight NLP techniques to speed up sentence parsing. We also
plan to extend our filtering approach to support other languages such as Chinese and French in
future work.
CHAPTER ONE
INTRODUCTION
Online social networking (OSN) websites have enjoyed a great success in recent years and have
become the new frontier in today’s social relationships providing great places for self-expression
and exchange of ideas.
Social networking has provided opportunities for new relationships as well as strengthening
existing relationships. Benefits of social networking platforms vary based on platform type,
features and the company itself. OSN allows organizations to improve communication and
productivity by disseminating information among different groups of employees in a more
efficient manner, resulting in increased productivity.
In the past, social networks were viewed as a distraction and offered no educational benefit.
Blocking these social networks was a form of protection for students against wasting time,
bullying, and invasions of privacy. In an educational setting, OSNs are seen by many instructors
and educators as a frivolous, time-wasting distraction from schoolwork, and it is not uncommon
for them to be banned in school computer labs. Cyberbullying has also become an issue of concern with
social networks. According to the Children Go Online survey of 9-24 year olds, a third had
received bullying comments online (http://internetsafety101.org). To avoid this
problem, many school districts/boards have blocked access to online social networks within the
school environment.
Social networking services often include a lot of personal information posted publicly, and many
believe that sharing personal information is a window into privacy theft. Schools have taken action
to protect students from this. It is believed that this outpouring of identifiable information and the
easy communication vehicle that social networking services provide open the door to sexual predators,
cyberbullying, and cyber-stalking (http://en.wikipedia.org/wiki/Social_networking_service). In
contrast, however, 70% of social-media-using teens and 85% of adults believe that people are
mostly kind to one another on social network sites
(http://en.wikipedia.org/wiki/Social_networking_service). Research has suggested that there has
been a shift in blocking the use of social networking services. In many cases, the opposite is
occurring as the potential of online networking services is being realized. It has been suggested
that if schools block them [Online Social Networks], they’re preventing students from learning the
skills they need. Banning social networking is not only inappropriate but also borderline
irresponsible when it comes to providing the best educational experiences for students. Schools
have the option of teaching safe media usage as well as incorporating digital media into the
classroom experience, thus preparing students for the literacy they will encounter in the future.
Cyberbullying is a fast growing trend that experts believe is more harmful than typical schoolyard
bullying. Nearly all of us can be contacted 24/7 via the internet or our mobile phones. Victims can
be reached anytime and at anyplace. For many children, home is no longer a refuge from the
bullies. “Children can escape threats and abuse in the classroom, only to find text messages and
emails from the same tormentors when they arrive home.”
“There’s no safe place anymore and one can be bullied 24/7; even in the privacy of his/her own
bedroom.” (Cyberbullying, Able Publishing Newsletter - Term 3, 2008).
Online social networking sites have become increasingly popular with children, especially young
teens, as a place where they can meet other people, communicate, and exchange information. No
type of bullying is harmless. In some cases, it can constitute criminal behaviour. In extreme
incidents, cyberbullying has led teenagers to suicide. Most victims, however, suffer shame,
embarrassment, anger, depression and withdrawal.(Cyberbullying, Able Publishing Newsletter -
Term 3, 2008) Cyberbullying is often seen as anonymous, and the nature of the internet allows it
to spread quickly to hundreds and thousands of people.
Cyberbullying has the same insidious effects as any kind of bullying, turning children away from
school, friendships, and in tragic instances, life itself. Parents often tell their children to turn off
the mobile phones or stay off the computer. Many parents don’t understand that the internet and
mobile phone act as a social lifeline for teenagers to their peer group. Victims often don't tell their
parents because they think their parents will only make the problem worse, or that they might even
confiscate their mobile phone or take away their internet access, removing that social lifeline.
While bullying is something that is often ‘under the radar’ of adults, cyberbullying is even more
so. Teenagers are increasingly communicating in ways that are often unknown by adults and away
from their supervision. They organize their social lives through these mediums. Their friendships
are made and broken over these mediums.
So the question remains: "How can we avoid offensive language in OSNs?" This research work
aims at removing offensive language from user messages. When offensive language is detected in
a user message, a problem arises about how the offensive language should be removed, i.e. the
offensive language filtering problem. To solve this problem, the manual filtering approach is known
to produce the best filtering result. However, manual filtering is costly in time and labor and thus
cannot be widely applied (http://en.wikipedia.org/wiki/Anti-spam_techniques). Here, we will
analyze the offensive language in text messages posted in online communities, and propose a new
automatic sentence-level filtering approach that is able to semantically remove the offensive
language by utilizing the grammatical relations among words. Compared with existing automatic
filtering approaches, the proposed filtering approach provides filtering results much closer to
manual filtering.
1.1 Problem Statement
The online community has encouraged the use of offensive language, which has spread into about
80% of all OSNs and has been very harmful to the mental health of both children and youth (Zhi
Xu and Sencun Zhu, 2010). To the online community, the deluge of offensive language undermines
the community’s reputation, drives users away, and even directly affects its growth.
People have realized the problems brought by offensive language in online communities and many
efforts have been made on detecting the existence of offensive language within user messages.
However, detection alone is not enough to eliminate the hazard caused by offensive language.
When offensive content is detected within a user message, a question arises naturally about how
the detected offensive content should be removed from the message before it is transmitted.
Also, how do we remove or filter offensive language and words from a message thoroughly and
still keep inoffensive content untouched as much as possible? And can the readability of the filtered
content be guaranteed so as to make our filtering transparent to readers?
1.2 Aims and Objectives
This project work intends to develop and implement a sentence-level semantic filtering System,
which will
1. Utilize grammatical relations among words to stop cyberbullying by semantically removing
offensive content in a sentence.
2. Produce minimal error when filtering offensive language and words from a message and
still keep inoffensive content untouched as much as possible.
3. Guarantee the readability of filtered content so as to make the filtering transparent to
readers.
4. Implement the designed model which is going to be a sophisticated NLP application, not
an AI application, since learning is not going to be involved.
5. Help reduce the chances of victimization in Online Social Networking Sites.
1.3 Research Methodology
The methodology adopted in carrying out this project includes the use of interviews to gather
primary data from a number of leading filtering vendors in Nigeria. Both telephone and face-to-
face interviews will be carried out with the relevant technology experts within selected
organizations. Also, an existing database of offensive words and languages will be collected and
used to simulate an offensive database engine. A semantic filtering model will be proposed and
implemented using grammatical analysis and part-of-speech tagging. Statistical/probabilistic
analysis of recurring offensive tokens will be done using a Bayesian method. The designed
semantic filtering system will be tested as an online
web application with a client application by engaging users to validate the efficiency of the
designed system.
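To make the Bayesian token analysis concrete, the sketch below shows one plausible way such a statistical analysis of recurring offensive tokens could be carried out, in the style of a naive Bayes spam filter. The sample messages, the add-one smoothing, and the 0.5 prior are illustrative assumptions, not the configuration of the system actually built in this project.

from collections import Counter

def train_token_probs(offensive_msgs, clean_msgs):
    # Estimate P(token | offensive) and P(token | clean) with add-one smoothing.
    off = Counter(t for m in offensive_msgs for t in m.lower().split())
    cln = Counter(t for m in clean_msgs for t in m.lower().split())
    vocab = set(off) | set(cln)
    n_off, n_cln, v = sum(off.values()), sum(cln.values()), len(vocab)
    return {t: ((off[t] + 1) / (n_off + v), (cln[t] + 1) / (n_cln + v)) for t in vocab}

def offensive_score(message, probs, prior_off=0.5):
    # Posterior P(offensive | tokens) under a naive independence assumption.
    p_off, p_cln = prior_off, 1.0 - prior_off
    for t in message.lower().split():
        if t in probs:
            p_off *= probs[t][0]
            p_cln *= probs[t][1]
    return p_off / (p_off + p_cln)

probs = train_token_probs(["you stupid idiot", "what an idiot"],
                          ["have a nice day", "what a day"])
print(round(offensive_score("you idiot", probs), 2))   # high score, roughly 0.87 on this toy data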
1.4 Organization of the study
The thesis work is arranged in five chapters with the breakdown as follows:
The First Chapter is termed introduction and it includes the Online Social Networking System,
research aim and objectives, research methodology and organization of dissertation.
Chapter Two deals with the literature review on grammatical relations, cyberbullying and the
concept of a semantic filtering system.
Chapter Three presents the Methodology and analysis of the input and output specification of the
proposed system and the design of the system.
Chapter Four describes the system implementation and evaluation of the system design. This
would consist of a brief description of each program module and its functions. It also justifies
the choice of package and describes the software required to implement the system. It also shows
the measures taken during the implementation.
Chapter Five summarizes the project work. It covers the conclusion and recommendations for the
project.
CHAPTER TWO
LITERATURE REVIEW
2.1 Offensive Language in Online Communities
People, most especially kids, have been bullying each other for generations. The latest
generation, however, has been able to utilize technology to expand their reach and the extent of
their harm. (http://cyberbullying.us) This phenomenon is being called cyberbullying, defined as:
“willful and repeated harm inflicted through the use of computers, cell phones, and other
electronic devices.” Basically, we are referring to incidents where adolescents use technology,
usually computers or cell phones, to harass, threaten, humiliate, or otherwise hassle their peers.
For example, youth can send hurtful text messages to others or spread rumors using cell phones
or computers. Teens have also created web pages, videos, and profiles on social networking sites
making fun of others. With cell phones, adolescents have taken pictures in a bedroom, a
bathroom, or another location where privacy is expected, and posted or distributed them online.
More recently, some have recorded unauthorized videos of other kids and uploaded them for the
world to see, rate, tag, and discuss.(http://cyberbullying.us)
However, there are many detrimental outcomes associated with cyberbullying and making use of
offensive languages that reach into the real world. First, many targets of cyberbullying report
feeling depressed, sad, angry, and frustrated. As one teenager stated: “It makes me hurt both
physically and mentally. It scares me and takes away all my confidence. It makes me feel sick
and worthless.” Victims who experience cyberbullying also reveal that they were afraid or
embarrassed to go to school or even come out to talk in public.(http://cyberbullying.us) In
addition, there is a link between cyberbullying and low self-esteem, family problems, academic
problems, school violence, and delinquent behavior. Finally, cyberbullied youth also report
having suicidal thoughts, and there have been a number of examples around the world where
youth who were victimized ended up taking their own lives.(http://cyberbullying.us)
Cyberbullying occurs across a variety of venues and mediums in cyberspace, and it shouldn’t
come as a surprise that it occurs most often where teenagers congregate. Initially, many kids
hung out in chat rooms, and as a result that is where most harassment took place. In recent years,
most youth have been drawn to social networking websites (such as Facebook, Twitter,
LinkedIn, etc.) and video-sharing websites (such as YouTube). This trend has led to increased
reports of cyberbullying occurring in those environments. (Burgess-Proctor, Patchin, & Hinduja,
2009; Hinduja & Patchin, 2008b; R. M. Kowalski & Limber, 2007; Lenhart, 2007; Li, 2007a;
Patchin & Hinduja, 2006). Instant messaging on the Internet or text messaging via a cell phone
also appear to be common ways in which youth are harassing one another.
2.2 Rate of Cyberbullying among Youth
Estimates of the number of youth who experience cyberbullying vary widely (ranging from 10-
40% or more), depending on the age of the group studied and how cyberbullying is formally
defined. In this research, we informed secondary school students (of International School, Ibadan;
Abadina College, U.I; and Igbobi College Yaba, Lagos) that cyberbullying is when someone
“repeatedly picks on another person by making use of offensive languages through OSN when
chatting or when someone posts something offensive online about another person that they don’t
like.” Using this definition, about 62% of the over 800 randomly-selected 11-18 year-old
students indicated they had been a victim at some point in their life. About this same number
admitted to cyberbullying others during their lifetime. Finally, about 40% of youths in this recent
study said they had both been a victim and an offender.
Fig 2.1
2.3 Traditional-Bullying and Cyber-Bullying
While often similar in terms of form and technique, bullying and cyberbullying have many
differences that can make the latter even more devastating. First, victims often do not know who
the bully is, or why they are being targeted. The cyberbully can cloak his or her identity behind a
computer using anonymous email addresses or pseudonymous screen names.
Second, the hurtful actions of a cyberbully are viral; that is, a large number of people (at school,
in the neighborhood, in the city, in the world!) can be involved in a cyber-attack on a victim, or
at least find out about the incident with a few keystrokes or clicks of the mouse. The perception,
then, is that absolutely everyone knows about it.
Third, it is often easier to be cruel using technology because cyberbullying can be done from a
physically distant location, and the bully doesn’t have to see the immediate response by the
target. In fact, some teens simply might not recognize the serious harm they are causing because
they are sheltered from the victim’s response.
Finally, while parents and teachers are doing a better job supervising youth at school and at
home, many adults don’t have the technological know-how to keep track of what teens are up to
online. As a result, a victim’s experience may be missed and a bully’s actions may be left
unchecked. Even if bullies are identified, many adults find themselves unprepared to adequately
respond.
All these and more make cyberbullying a growing problem, because increasing numbers of
kids have completely embraced interactions via computers and cell phones. Two-
thirds of youth go online every day for school work, to keep in touch with their friends, to play
games, to learn about celebrities, to share their digital creations, or for many other reasons.
Because the online communication tools have become an important part of their lives, it is not
surprising that some youths have decided to use the technology to be malicious or menacing
towards others. The fact that teens are connected to technology 24/7 means they are susceptible
to victimization (and able to act on mean intentions toward others) around the clock. Apart
from a measure of anonymity, it is also easier to be hateful using typed words rather than spoken
words face-to-face and because some adults have been slow to respond to cyberbullying, many
cyberbullies feel that there are little to no consequences for their actions.
Cyberbullying crosses all geographical boundaries. The Internet has really opened up the whole
world to users who access it on a broad array of devices, and for the most part, this has been a
good thing. Nevertheless, some kids feel free to post or send whatever they want while online
without considering how that content can inflict pain – and sometimes cause severe
psychological and emotional wounds.
2.4 Types of Bullying Online
According to the Internet Safety 101 curriculum, there are many types of cyberbullying which
include:
• Gossip: Posting or sending cruel gossip to damage a person’s reputation and
relationships with friends, family, and acquaintances.
• Exclusion: Deliberately excluding someone from an online group.
• Impersonation: Breaking into someone’s e-mail or other online account and sending
messages that will cause embarrassment or damage to the person’s reputation and
affect his or her relationship with others.
• Harassment: Repeatedly posting or sending offensive, rude, and insulting messages.
• Cyber-stalking: Posting or sending unwanted or intimidating messages, which may
include threats.
• Flaming: Online fights where scornful and offensive messages are posted on websites,
forums, or blogs.
• Outing and Trickery: Tricking someone into revealing secrets or embarrassing
information, which is then shared online.
• Cyber-threats: Remarks on the Internet threatening or implying violent behavior,
displaying suicidal tendencies.
2.5 Challenges in the fight to stop cyberbullying
There are two major challenges that make it difficult to prevent cyberbullying. First, many
people don’t see the harm associated with it. Some attempt to dismiss or disregard cyberbullying
because there are “more serious forms of aggression to worry about.” While it is true that there
are many issues facing adolescents, parents, teachers, and law enforcement today, we first need
to accept that cyberbullying is one such problem that will only get more serious if ignored.
The other challenge relates to who is willing to step up and take responsibility for responding to
inappropriate use of technology. Parents often say that they don’t have the technical skills to
keep up with their kids’ online behavior; teachers are afraid to intervene in behaviors that often
occur away from school; and law enforcement is hesitant to get involved unless there is clear
evidence of a crime or a significant threat to someone’s physical safety. As a result,
cyberbullying incidents often slip through the cracks. Indeed, the behavior often continues and
escalates because it is not quickly addressed. Based on these challenges, there is a need to
collectively create an environment where kids feel comfortable talking with adults about this
problem and feel confident that meaningful steps will be taken to resolve the situation. We also
need to get everyone involved - youth, parents, educators, counselors, law enforcement, social
media companies, and the community at large. It will take a concerted and comprehensive effort
from all stakeholders to really make a difference in reducing cyberbullying.
2.6 Preventing Cyberbullying
The most important preventive step that schools can take is to educate the school community
about responsible internet use. Students need to know that all forms of bullying are wrong and
that those who engage in harassing or threatening behaviors will be subject to discipline. It is
therefore important to discuss issues related to the appropriate use of online communications
technology in various areas of the general curriculum. To be sure, these messages should be
reinforced in classes that regularly utilize technology. Signage also should be posted in the
computer lab or at each computer workstation to remind students of the rules of acceptable use.
In general, it is crucial to establish and maintain a school climate of respect and integrity where
violations result in informal or formal sanction.
Furthermore, school district personnel should review their harassment and bullying policies to
see if they allow for the discipline of students who engage in cyberbullying. If their policy covers
it, cyberbullying incidents that occur at school - or that originate off campus but ultimately result
in a substantial disruption of the learning environment - are well within a school’s legal authority
to intervene. The school then needs to make it clear to students, parents, and all staff that these
behaviors are unacceptable and will be subject to discipline. In some cases, simply discussing the
incident with the offender’s parents will result in the behavior stopping.
2.7 Responding to Cyberbullying
Students should already know that cyberbullying is unacceptable and that the behavior will result
in discipline. Utilize school liaison officers or other members of law enforcement to thoroughly
investigate incidents, as needed, if the behaviors cross a certain threshold of severity. Once the
offending party has been identified, develop a response that is commensurate with the harm done
and the disruption that occurred.
School administrators should also work with parents to convey to the student that cyberbullying
behaviors are taken seriously and are not trivialized. Moreover, schools should come up with
creative response strategies, particularly for relatively minor forms of harassment that do not
result in significant harm. For example, students may be required to create anti-cyberbullying
posters to be displayed throughout the school. Older students might be required to give a brief
presentation to younger students about the importance of using technology in ethically-sound
ways. The point here, again, is to condemn the behavior while sending a message to the rest of
the school community that bullying in any form is wrong and will not be tolerated.
Even though the vast majority of these incidents can be handled informally (calling parents,
counseling the bully and target, expressing condemnation of the behavior), there may be
occasions where formal response from the school is warranted. This is particularly the case in
incidents involving serious threats toward another student, if the target no longer feels
comfortable coming to school, or if cyberbullying behaviors continue after informal attempts to
stop it have failed. In these cases, detention, suspension, changes of placement, or even
expulsion may be necessary. If these extreme measures are required, it is important that
educators are able to clearly demonstrate the link to school and present evidence that supports
their action.
Also, youth should develop a relationship with an adult they trust (a parent, teacher, or someone
else) so they can talk about any experiences they have online (or off) that make them upset or
uncomfortable. If possible, teens should ignore minor teasing or name calling, and not respond to
the bully as that might simply make the problem continue. It’s also useful to keep all evidence of
cyberbullying to show an adult who can help with the situation. If targets of cyberbullying are
able to keep a log or a journal of the dates and times and instances of the online harassment, that
can also help prove what was going on and who started it.
Overall, youth should go online with their parents – show them what web sites they use, and
why. At the same time, they need to be responsible when interacting with others on the Internet.
For instance, they shouldn’t say anything to anyone online that they wouldn’t say to them in
person with their parents in the room. Finally, youth ought to take advantage of the privacy
settings within Facebook and other websites, and the social software (instant messaging, email,
and chat programs) that they use – they are there to help reduce the chances of victimization.
Users can adjust the settings to restrict and monitor who can contact them and who can read their
online content.
Law enforcement officers also have a role in preventing and responding to cyberbullying. To
begin, they need to be aware of ever-evolving state and local laws concerning online behaviors,
and equip themselves with the skills and knowledge to intervene as necessary. In a recent survey
of school resource officers, we found that almost one-quarter did not know if their state had a
cyberbullying law. This is surprising since their most visible responsibility involves responding
to actions which are in violation of law (e.g., harassment, threats, stalking). Even if the behavior
doesn’t immediately appear to rise to the level of a crime, officers should use their discretion to
handle the situation in a way that is appropriate for the circumstances. For example, a simple
discussion of the legal issues involved in cyberbullying may be enough to deter some youth from
future misbehavior. Officers might also talk to parents about their child’s conduct and express to
them the seriousness of online harassment.
Relatedly, officers can play an essential role in preventing cyberbullying from occurring or
getting out of hand in the first place. They can speak to students in classrooms about
cyberbullying and online safety issues more broadly in an attempt to discourage them from
engaging in risky or unacceptable actions and interactions. They might also speak to parents
about local and state laws, so that they are informed and can properly respond if their child is
involved in an incident.
2.8 Grammatical Relations
Grammatical relations refer to functional relationships between constituents in a clause. The
standard examples of grammatical functions from traditional grammar are subject, direct object,
and indirect object. Beyond these concepts from traditional grammar, more modern theories of
grammar are likely to acknowledge many further types of grammatical relations (e.g.
complement, specifier, predicative, etc.). The role of grammatical relations in theories of
grammar is the greatest in many dependency grammars, which tend to posit dozens of distinct
grammatical relations. Every head-dependent dependency bears a grammatical function.
Grammatical relations are exemplified in traditional grammar by the notions of subject, direct
object, and indirect object;
For example:
Adekanmbi gave Samuel the book.
The subject Adekanmbi performs or is the source of the action. The direct object the book is
acted upon by the subject, and the indirect object Samuel receives the direct object or otherwise
benefits from the action. Traditional grammars often begin with these rather vague notions of the
grammatical functions. When one begins to examine the distinctions more closely, it quickly
becomes clear that these basic definitions do not provide much more than a loose orientation
point. What is indisputable about the grammatical relations is that they are relational. That is,
subject and object can exist as such only by virtue of the context in which they appear. A noun
such as Adekanmbi or a noun phrase such as the book cannot qualify as subject and direct object,
respectively, unless they appear in an environment, e.g. a clause, where they are related to each
other and/or to an action or state. In this regard, the main verb in a clause is responsible for
assigning grammatical relations to the clause "participants".
2.9 Using Text Mining Techniques to Detect Online Offensive Contents
Offensive language identification in social media is a difficult task because the textual content
in such environments is often unstructured, informal, and even misspelled. Since the defensive
methods adopted by current social media are not sufficient, researchers have studied intelligent
ways to identify offensive contents using text mining approach. Implementing text mining
techniques to analyze online data requires the following phases:
1) Data acquisition and preprocessing,
2) Feature extraction
3) Classification
The major challenges of using text mining to detect offensive content lie in the feature selection
phase, which will be elaborated in the following sections.
a) Message-level Feature Extraction
Most offensive content detection research extracts two kinds of features: lexical and syntactic
features.
Lexical features treat each word and phrase as an entity. Word patterns such as the appearance of
certain keywords and their frequencies are often used to represent the language model. Early
research used Bag-of-Words (BoW) in offensiveness detection. The BoW approach treats a text
as an unordered collection of words and disregards the syntactic and semantic information.
However, using the BoW approach alone not only yields low accuracy in subtle offensive language
detection, but also brings in a high false positive rate, especially during heated arguments,
defensive reactions to others’ offensive posts, and even conversations between close friends. The
N-gram approach is considered an improvement in that it brings words’ nearby context
information into consideration to detect offensive content. N-grams represent subsequences of
N contiguous words in texts. Bi-grams and tri-grams are the most popular N-grams used in text
mining. However, N-grams have difficulty capturing related words separated by long
distances in texts. Simply increasing N can alleviate the problem but will slow down system
processing speed and bring in more false positives.
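As a rough illustration of the lexical features described above, the sketch below extracts Bag-of-Words and word bigram features from a message; the whitespace tokenisation and lower-casing are simplifying assumptions made only for this example.

def extract_features(message, n=2):
    # Bag-of-Words plus word n-gram features for a lower-cased, whitespace-tokenised message.
    tokens = message.lower().split()
    bow = set(tokens)                                                  # unordered Bag-of-Words
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return bow, ngrams

bow, bigrams = extract_features("you are such an idiot")
print(bow)       # the five word types (set order is arbitrary)
print(bigrams)   # the four adjacent word pairs, e.g. ('an', 'idiot')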
Syntactic features: Although lexical features perform well in detecting offensive entities,
without considering the syntactic structure of the whole sentence, they fail to distinguish the
offensiveness of sentences that contain the same words in different orders. Therefore, to
capture syntactic features, natural language parsers are introduced to parse
sentences into grammatical structures before feature selection. Equipping the system with a parser can help
avoid selecting unrelated word sets as features in offensiveness detection.
b) User-level Offensiveness Detection
Most contemporary research on detecting online offensive language focuses only on sentence-
level and message-level constructs. Since no detection technique is 100% accurate, if users keep
connecting with sources of offensive content (e.g., online users or websites), they are at high
risk of continuous exposure to offensive content. However, user-level detection is a more
challenging task and studies associated with the user level of analysis are largely missing. There
are some limited efforts at the user level. For example, Kontostathis et al. propose a rule-based
communication model to track and categorize online predators. Pendar uses lexical features with
machine learning classifiers to differentiate victims from predators in an online chatting
environment. Pazienza and Tudorache propose utilizing user profiling features to detect
aggressive discussions. They use users’ online behavior histories (e.g., presence and
conversations) to predict whether or not users’ future posts will be offensive. Although their
work points out an interesting direction to incorporate user information in detecting offensive
content, more advanced user information such as users’ writing styles, posting trends, or
reputations has not been included to improve the detection rate.
Fig 2.2
2.10 Heads and dependents
The importance of the syntactic functions reaches its greatest extent in dependency grammar
(DG) theories of syntax. Every head-dependent dependency bears a syntactic function. The result
is that an inventory consisting of dozens of distinct syntactic functions is needed for each
language. For example, a determiner-noun dependency might be assumed to bear the DET
(determiner) function, and an adjective-noun dependency is assumed to bear the ATTR
(attribute) function. These functions are often produced as labels on the dependencies themselves
in the syntactic tree, e.g.
Fig 2.3
The tree contains the following syntactic functions: ATTR (attribute), CCOMP (clause
complement), DET (determiner), MOD (modifier), OBJ (object), SUBJ (subject), and VCOMP
(verb complement). The actual inventories of syntactic functions will differ from the one
suggested here in the number and types of functions that are assumed. In this regard, this tree is
merely intended to be illustrative of the importance that the syntactic functions can take on in
some theories of syntax and grammar.
2.11 Statistical parsing
CFGs can be used to parse, but ambiguous sentences cannot be disambiguated by the grammar alone, and we
would like to know the most likely parse. A corpus can be used to do that.
2.11.1 Basic idea
1. Start with a Treebank (we can say bank of trees, e.g. Penn Treebank) which is a
collection of sentences with syntactic annotation, i.e., already-parsed sentences.
2. Examine which parse trees occur frequently
3. Extract grammar rules corresponding to those parse trees, estimating the probability of
the grammar rule based on its frequency.
That is, we’ll have a CFG augmented with probabilities (PCFG).
2.11.2 Probabilistic Context-Free Grammars (PCFGs)
Definition of a PCFG:
- Set of non-terminals (N)
- Set of terminals (T)
- Set of rules/productions (P), of the form A → β
- Designated start symbol (S)
- A function D that assigns a probability to each rule in P:
D(A → β) = P(A → β)
2.11.3 Estimating Probabilities using a Treebank
- Given a corpus of sentences annotated with syntactic annotation
(e.g., the Penn Treebank)
- Consider all parse trees
- (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for
that rule
- (2) Also count the number of times A is on the left-hand side of a rule
- Divide (1) by (2): D = P(A → β | A) = Count(A → β) / Count(A)
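A minimal sketch of this maximum-likelihood estimation, assuming the treebank has already been reduced to a list of (left-hand side, right-hand side) rule occurrences:

from collections import Counter

def estimate_rule_probs(rule_occurrences):
    # P(A -> beta | A) = Count(A -> beta) / Count(A)
    rule_counts = Counter(rule_occurrences)                    # step (1): count each rule
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)   # step (2): count each left-hand side
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

# hypothetical rule occurrences read off annotated parse trees
rules = [("NP", ("Det", "N"))] * 3 + [("NP", ("Pronoun",))]
print(estimate_rule_probs(rules))   # {('NP', ('Det', 'N')): 0.75, ('NP', ('Pronoun',)): 0.25}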
2.11.4 Using Probabilities to Parse
• P (T) = probability of a particular parse tree
= the product of the probabilities of all the rules r used to expand each node n in the parse
tree
Fig 2.4
We have the following rules and probabilities
- S → VP .05
- VP → V NP .40
- NP → Det N .20
- V → book .30
- Det → that .05
- N → flight .25
P ( T ) = P ( S → VP ) * P ( VP→ V NP ) *… * P ( N → flight )
= .05 * .40 * .20 * .30 * .05 * .25 = .000015
So, the probability for that parse is 0.000015. Probabilities are useful for comparing with other
probabilities. Whereas we couldn’t decide between two parses using a regular CFG, we now can.
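Using the rule probabilities listed above, the same product can be computed directly; the parse is written here as the list of rules used to expand each node (a sketch of the calculation only, not a parser).

from functools import reduce

rule_prob = {("S", ("VP",)): 0.05, ("VP", ("V", "NP")): 0.40, ("NP", ("Det", "N")): 0.20,
             ("V", ("book",)): 0.30, ("Det", ("that",)): 0.05, ("N", ("flight",)): 0.25}

# the parse of "book that flight", written as the rules used to expand each node
parse = [("S", ("VP",)), ("VP", ("V", "NP")), ("V", ("book",)),
         ("NP", ("Det", "N")), ("Det", ("that",)), ("N", ("flight",))]

p_tree = reduce(lambda acc, r: acc * rule_prob[r], parse, 1.0)
print(p_tree)   # approximately 0.000015, as computed above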
2.11.5 Obtaining the best parse
The best parse T(S), where S is our sentence, is the tree which has the highest probability.
We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse:
- CYK is a form of dynamic programming
- CYK is a chart parser, like the Earley parser
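The sketch below is a compact probabilistic CYK-style recogniser for the toy grammar above. For every span it records the best probability with which each non-terminal can derive it, and applies the unary rule S → VP after a cell is filled; back-pointers for recovering the actual tree are omitted to keep the sketch short.

from collections import defaultdict

unary_rules = {("S", ("VP",)): 0.05}                                     # A -> B
binary_rules = {("VP", ("V", "NP")): 0.40, ("NP", ("Det", "N")): 0.20}   # A -> B C
lexical_rules = {("V", "book"): 0.30, ("Det", "that"): 0.05, ("N", "flight"): 0.25}

def cyk(words):
    n = len(words)
    chart = defaultdict(dict)   # chart[(i, j)][A] = best probability that A derives words[i:j]

    def apply_unaries(cell):
        # keep applying unary rules (here S -> VP) while they improve some probability
        changed = True
        while changed:
            changed = False
            for (a, (b,)), p in unary_rules.items():
                if b in cell and p * cell[b] > cell.get(a, 0.0):
                    cell[a] = p * cell[b]
                    changed = True

    for i, w in enumerate(words):                                        # length-1 spans
        for (a, word), p in lexical_rules.items():
            if word == w:
                chart[(i, i + 1)][a] = p
        apply_unaries(chart[(i, i + 1)])

    for span in range(2, n + 1):                                         # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                                    # every split point
                for (a, (b, c)), p in binary_rules.items():
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        cand = p * chart[(i, k)][b] * chart[(k, j)][c]
                        if cand > chart[(i, j)].get(a, 0.0):
                            chart[(i, j)][a] = cand
            apply_unaries(chart[(i, j)])
    return chart[(0, n)]

print(cyk("book that flight".split()))   # approximately {'VP': 0.0003, 'S': 0.000015}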
2.11.6 Problems with PCFGs
It’s still only a CFG, so dependencies on non-CFG information are not captured.
- e.g., Pronouns are more likely to be subjects than objects:
P [ ( NP → Pronoun ) | NP = subject ] >> P [ ( NP → Pronoun)
| NP =obj]
Ignores lexical dependency information (statistics), which is usually crucial for disambiguation
of “PP attachment ambiguity” and “Coordination ambiguity”.
- (T1) America sent [ [250,000 soldiers] [into Iraq] ]   (the PP attaches low, inside the object NP)
- (T2) America sent [250,000 soldiers] [into Iraq]   (the PP attaches high, to the verb)
Lexically, “sent” with an “into”-PP almost always prefers high attachment (T2), but a plain PCFG cannot
express this word-specific preference.
An example of Coordination ambiguity is two parses of the phrase “dogs in houses and cats”
- (T1) [ [NP dogs] in [ NP houses and cats ] ]
- (T2) [ [NP dogs in houses] and [NP cats ] ]
Here T1 is semantically wrong and T2 is correct, but both trees receive the same score. So a PCFG alone
is not enough to disambiguate parse trees; lexical dependency information is also needed.
To handle lexical information, we’ll turn to lexicalized PCFGs.
2.11.7 Lexicalized PCFGs
• Lexicalized Parse Trees
- Add “headwords” to each phrasal node. Each PCFG rule in a tree is augmented to
identify one RHS constituent to be the head daughter
- The headword for a node is set to the head word of its head daughter
- Headship not in (most) treebanks
- Usually use head rules, e.g.:
- NP:
• Take leftmost NP
• Take rightmost N*
• Take rightmost JJ
• Take right child
- VP:
• Take leftmost VB*
• Take leftmost VP
• Take left child
Fig 2.5
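The following sketch applies head rules of the simplified kind listed above to pick a head daughter. It is not the full Collins head-rule table, and the phrase representation (a list of (label, headword) pairs) is an assumption made purely for illustration.

def find_head(category, children):
    # children is a list of (label, headword) pairs for the daughters of the phrase
    labels = [label for label, _ in children]

    def leftmost(pred):
        return next((i for i, l in enumerate(labels) if pred(l)), None)

    def rightmost(pred):
        return next((i for i in reversed(range(len(labels))) if pred(labels[i])), None)

    if category == "NP":
        idx = leftmost(lambda l: l == "NP")                 # take leftmost NP
        if idx is None:
            idx = rightmost(lambda l: l.startswith("N"))    # else rightmost N*
        if idx is None:
            idx = rightmost(lambda l: l == "JJ")            # else rightmost JJ
        if idx is None:
            idx = len(labels) - 1                           # else the right child
    elif category == "VP":
        idx = leftmost(lambda l: l.startswith("VB"))        # take leftmost VB*
        if idx is None:
            idx = leftmost(lambda l: l == "VP")             # else leftmost VP
        if idx is None:
            idx = 0                                         # else the left child
    else:
        idx = 0
    return children[idx]

print(find_head("NP", [("DT", "the"), ("NN", "flight")]))       # ('NN', 'flight')
print(find_head("VP", [("VBD", "booked"), ("NP", "flight")]))   # ('VBD', 'booked')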
2.11.8 Incorporating head probabilities
• Previously, we conditioned on the mother node (A):
- P(A → β | A)
• Now, we can condition on the mother node and the headword of A (h(A)):
- P( A → β | A , h (A) )
We’re no longer conditioning on simply the mother category A, but on the mother category when
h(A) is the head.
- e.g., P ( VP → VBD NP PP | VP , dumped)
2.11.9 Calculating rule probabilities
• We calculate this by comparing how many times the rule occurs with h(n) as the
headword versus how many times the mother/headword combination appear in total:
P ( VP → VBD NP PP | VP , dumped )
= C (VP (dumped) → VBD NP PP) / Σβ C ( VP ( dumped ) → β)
2.11.10 Adding info about word-word dependencies
• We want to take into account one other factor: the probability of being a head word (in a
given context)
- P(h(n) = word | …)
• We condition this probability on two things: 1. the category of the node (n), and 2. the
headword of the mother (h(m(n)))
- P(h(n) = word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n)))
- P(sacks | NP, dumped)
• What we’re really doing is factoring in how words relate to each other
• We will call this a dependency relation later: sacks is dependent on dumped, in this case
Fig 2.6: Lexicalized parsing can be seen as producing dependency trees
2.12 Dependency Parsing
Modern dependency grammar was created by the French linguist Lucien Tesnière (1959), although
its roots may be traced back to Pāṇini’s grammar of Sanskrit (a predecessor of Bangla) several
centuries earlier. In NLP, a dependency parse tree is thought of as a ‘bridge’ between
syntactic and semantic analysis, since it gives some semantic information as well as syntactic.
Some people also argue that it is another version of chunk parsing, because careful
observation of a dependency tree reveals that every subpart of a sentence (subject, object, or
complements) appears in a different subtree or under a different relation, where each node is
dependent on another node. These subtrees, or semantically dependent nodes, can be thought of
as separate chunks.
2.12.1 Basic Concepts
In a dependency representation every node in the structure is a surface word (there are no
abstract nodes such as NP or VP), but each word may have additional attributes such as its part-
of-speech (POS) tag. The parent word is known as the head, and its children are its modifiers.
The observation which underlies DG is: in a sentence, all but one word depend on other words.
The one word that doesn’t depend on any other is called the root of the sentence. A typical DG
analysis of the sentence “A man sleeps” is demonstrated below:
A depends on man
Man depends on sleeps
Sleeps depends on nothing (it is the root of the sentence)
Or, put differently
A modifies man
Man is the subject of sleeps
Sleeps is the main verb of the sentence
This is Dependency Grammar. A formulation of dependency grammar is given below:
• Capturing relations between words is moving in the direction of dependency grammar
(DG)
• In DG, there is no such thing as constituency
• The structure of a sentence is purely the binary relations between words; A → B means
that B depends on A
Dependencies are motivated by grammatical function, both syntactically and semantically. A
word depends on another either if it is a complement or a modifier of the latter. The edge
between a parent and a child node specifies the grammatical relationship between the two words
(e.g. subj, obj, and adj).
In most formulations of DG, for example, functional heads or governors (e.g. verbs)
subcategorize for their complements. Hence, a transitive verb like ‘like’ requires two
complements (dependents), one noun with the grammatical function subject and one with the
function object.
In this research thesis, we use the Stanford Parser (jdk1.5 version) for all of the output.
Ex sentence: John likes Italian food.
Tagged output: John/NNP likes/VBZ Italian/NN food/NN
Constituent structure output:
(ROOT
(S
(NP (NNP John))
(VP (VBZ likes)
(NP (NN Italian) (NN food)))))
Dependency structure output:
nsubj(likes-2, John-1)
nn(food-4, italian-3)
dobj(likes-2, food-4)
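Represented as plain (relation, head, dependent) triples, the output above can be queried directly, for instance to list the dependents governed by a word or to find the head a word depends on. This representation is only an illustrative stand-in, not the parser's own API.

deps = [("nsubj", ("likes", 2), ("John", 1)),
        ("nn",    ("food", 4),  ("Italian", 3)),
        ("dobj",  ("likes", 2), ("food", 4))]

def dependents_of(head_word, deps):
    # all (relation, dependent) pairs governed by the given head word
    return [(rel, dep) for rel, head, dep in deps if head[0] == head_word]

def head_of(word, deps):
    # the (relation, head) pair the word depends on, or None if it is the root
    return next(((rel, head) for rel, head, dep in deps if dep[0] == word), None)

print(dependents_of("likes", deps))   # [('nsubj', ('John', 1)), ('dobj', ('food', 4))]
print(head_of("likes", deps))         # None, so 'likes' is the root of the sentence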
2.12.2 Dependency functions
2.12.2.1 Main functions
main
main element
The main element of a clause is usually a verb, but in a verb-less clause other elements may
serve as a head as well.
Ex: a sentence with a verb
He doesn't know whether to send a gift.
nsubj(know-4, He-1)
aux(know-4, does-2)
advmod(know-4, n't-3)
aux(send-7, to-6)
whether(know-4, send-7)
det(gift-9, a-8)
dobj(send-7, gift-9)
Ex: a sentence without a verb
A comprehensive grammar of the English language
det(grammar-3, A-1)
amod(grammar-3, comprehensive-2)
det(language-7, the-5)
amod(language-7, english-6)
of(grammar-3, language-7)
2.12.2.2 Verb complementation
nsubj
nominal subject
The dependency syntax collapses the classes of formal subject and ordinary subject into
one. The subject may also be a non-finite clause, that-clause, WH-clause, etc.
dobj
direct object
The notion of object is wider than that in Quirk, comprising essentially all types of
second arguments, except subject complements. The motivation is that the subtypes of
second arguments are complementary, i.e. they occupy the same valency slot. There are
both simple nominal objects and more complex objects such as a non-finite clause, that-
clause, WH-clause or quote structure.
Ex: John explained that topic
nsubj(explained-2, John-1)
det(topic-4, that-3)
dobj(explained-2, topic-4)
ccomp
clausal complement
A clausal complement is a dependent clause, with its own internal subject, that functions like an
object of the verb.
Ex: Mary said John didn't go there
nsubj(said-2, Mary-1)
nsubj(go-6, John-3)
aux(go-6, did-4)
advmod(go-6, n't-5)
ccomp(said-2, go-6)
advmod(go-6, there-7)
iobj
indirect object
Indirect object corresponds to a third argument. The prepositional dative is described
accordingly. Again, the syntactic motivation is that the prepositional phrase occupies the
same valency slot as the indirect object and is semantically equivalent to it.
Ex: I gave him my address.
nsubj(gave-2, I-1)
iobj(gave-2, him-3)
dep(address-5, my-4)
dobj(gave-2, address-5)
Other examples:
What did Pauline give Tom?
Pauline gave it to Tom.
2.12.2.3 Determinative functions
det
determiner
Central determiners (articles) or a determining pronoun. Successive determiners are
linked to each other.
Ex: This is an apple
nsubj(is-2, This-1)
det(apple-4, an-3)
dobj(is-2, apple-4)
2.12.3 Robinson’s axiom
Robinson (1970) formulated four axioms to govern the well-formedness of dependency
structures, depicted below:
1. One and only one element is independent.
2. All others depend directly on some element.
3. No element depends directly on more than one other.
4. If A depends directly on B and some element C intervenes between them (in the linear
order of string), then C depends directly on A or B or some other intervening element.
The first three axioms ensure that dependency structures are trees. Axioms 1 and 2 state that in each
sentence, only one element is independent and all others depend on some other element. Axiom 3
states that if element A depends on B, it must not depend on another element C. This
requirement is referred to as single-headedness. Axiom 4 is called the requirement of projectivity and
disallows crossing edges in dependency trees.
2.12.4 Dependency relation
Let W be a set of nodes and let a mapping M map W to the actual words of a sentence. A dependency
relation R is defined over W such that, for w1, w2 ∈ W, ⟨w1, w2⟩ ∈ R asserts that w1 is dependent on
w2. The properties of R impose the same treeness constraints on dependency graphs as Robinson’s axioms.
Ex: Mary loves another Mary, with the four words mapped to the nodes w1, w2, w3, w4 respectively.
1. R ⊂ W × W
2. ∀ w1, w2, …, wk ∈ W: if ⟨w1, w2⟩ ∈ R ∧ … ∧ ⟨wk−1, wk⟩ ∈ R, then w1 ≠ wk (acyclicity)
3. ∃! w1 ∈ W : ∀ w2 ∈ W: ⟨w1, w2⟩ ∉ R (rootedness)
4. ∀ w1, w2, w3 ∈ W: ⟨w1, w2⟩ ∈ R ∧ ⟨w1, w3⟩ ∈ R → w2 = w3 (single-headedness)
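These well-formedness conditions can be checked mechanically. The sketch below tests single-headedness, rootedness, acyclicity, and (standing in for axiom 4) the absence of crossing arcs over a set of ⟨dependent, head⟩ pairs; using word positions as node identities is a simplification made for illustration only.

def is_well_formed(words, deps):
    # deps is a list of (dependent_index, head_index) pairs over positions in `words`
    heads = {}
    for dep, head in deps:
        if dep in heads:                                   # axiom 3: single-headedness
            return False
        heads[dep] = head
    roots = [w for w in range(len(words)) if w not in heads]
    if len(roots) != 1:                                    # axioms 1-2: exactly one independent element
        return False
    for w in range(len(words)):                            # acyclicity: every head chain terminates
        seen, cur = set(), w
        while cur in heads:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    arcs = [(min(d, h), max(d, h)) for d, h in deps]       # axiom 4: no crossing arcs
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            if a1 < a2 < b1 < b2:
                return False
    return True

# "Mary loves another Mary": loves (position 1) is the root; both Marys depend on it
print(is_well_formed(["Mary", "loves", "another", "Mary"], [(0, 1), (3, 1), (2, 3)]))   # True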
2.12.5 Stanford dependency parser by Dan Klein
This parser uses features of Collins’ parser. Michael Collins, in his ‘Head-Driven Statistical
Parser’, showed a mapping of his statistical parser to the dependency relation sets. Dan Klein’s
Stanford parser deals with tagged words: pairs <w, t>. First the head <wh, th> of a constituent is
generated using the ‘Collins head finder’ method, then successive right dependents <wd, td> until a
‘stop’ token is generated, then successive left dependents until a ‘stop’ token is generated. It
supports three formats for output:
1. dependencies
2. typedDependencies
3. typedDependenciesCollapsed
For example: Factory payrolls fell in September.
Tagged output: Factory/NN payrolls/NNS fell/VBD in/IN September/NNP
Dependency structure:
nn(payrolls-2, Factory-1)
nsubj(fell-3, payrolls-2)
in(fell-3, September-5)
Fig 2.7
First, fell-VBD is chosen as the head of the sentence, then, in-IN to the right is generated, which
then generates September-NN to the right, which generates ‘stop’ token on both sides. Then
return to in-IN, generate ‘stop’ to the right, and so on. The above output is the
‘typedDependenciesCollapsed’ format of the Stanford dependency parse tree. The
‘typedDependenciesCollapsed’ format does not make separate nodes for words that are obvious in
any dependency relation in a sentence; instead it turns them into a relation between the two prominent
words. In the above example the preposition ‘in’ is used as a relation or dependency function
between the words ‘fell’ and ‘September’.
For comparison, the plain ‘typedDependencies’ format of the above sentence will be:
nn(payrolls-2, Factory-1)
nsubj(fell-3, payrolls-2)
dep(fell-3, in-4)
dep(in-4, September-5)
Fig 2.8
This example shows that the format makes a separate node ‘in’ between ‘fell’ and ‘September’, which
could instead be used as a relation to make the tree shorter in depth. This thesis uses the
‘typedDependenciesCollapsed’ format because we do not need to look at every word to
extract the necessary information.
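The collapsing step can be mimicked with a short sketch that rewrites preposition nodes into relation labels, as in the example above. The list of preposition forms and the triple representation are assumptions made purely for this illustration.

def collapse_preps(deps, preps=("in", "of", "to", "on", "at")):
    # dep(fell, in) + dep(in, September)  ->  in(fell, September)
    governed_by = {dep: head for rel, head, dep in deps if dep[0].lower() in preps}
    collapsed = []
    for rel, head, dep in deps:
        if dep in governed_by:
            continue                                              # drop the arc onto the preposition
        if head in governed_by:
            collapsed.append((head[0].lower(), governed_by[head], dep))   # relabel with the preposition
        else:
            collapsed.append((rel, head, dep))
    return collapsed

deps = [("nn", ("payrolls", 2), ("Factory", 1)),
        ("nsubj", ("fell", 3), ("payrolls", 2)),
        ("dep", ("fell", 3), ("in", 4)),
        ("dep", ("in", 4), ("September", 5))]
print(collapse_preps(deps))
# [('nn', ('payrolls', 2), ('Factory', 1)), ('nsubj', ('fell', 3), ('payrolls', 2)),
#  ('in', ('fell', 3), ('September', 5))]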
CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN
In the following sections of this chapter, existing sentence-level semantic filtering approaches and
methodologies for online social networking communities will be thoroughly examined, and issues
related to these approaches will be highlighted.
The proposed sentence-level semantic filtering approach will also be examined, and its operation
procedures, benefits, and feasibility will be expressed. Methodologies employed in acquiring the
requirement towards the successful implementation of the proposed filtering System will also be
discussed.
The design of the filtering system will be discussed from both perspectives, along with its program
components.
3.1 System Analysis
System analysis can be defined as the process of analyzing a system with the essential goal of
improving or modifying it. It can also be defined as the methodical study of a system, its current
and future required objectives, and procedures in order to form a basis for the system design.
It is the first of the three major phases in developing an information system. All the system analysis
efforts are directed towards deciding these 3 basic objectives:
1. Identify system owner and system users.
2. Define what the system will do.
3. Determine the technical, economic, and operational feasibility of the proposed system.
The purpose of analysis is to produce a clear requirements specification of the newly designed or
upgraded system efficiently and effectively. It requires the ability to analyze the essential features
of a system.
This knowledge of a system is achieved through the investigation of the system and its
environment.
3.2 Analysis of the existing system
Online social networking sites have become increasingly popular with children, especially young
teens, as a place where they can meet other people, communicate, and exchange information.
However, this medium has encouraged the wide usage of offensive language and has also brought
about a fast-growing trend that experts believe is very harmful, called cyberbullying, which has
led teenagers to suicide in very extreme cases. People have realized the problems brought by
offensive language in online communities and many efforts have been made at detecting and
eliminating the existence of offensive language within user messages. The approaches used are
discussed below.
3.2.1 Keyword Censoring Approach
Keyword censoring approaches match words appearing in user messages with offensive words
stored in a blacklist. Once found, these offensive words will be removed, partially replaced
(e.g., “b***h”), completely replaced (e.g., “******”), or substituted with family-friendly words
(e.g., “naughty”). Because of its simplicity, the keyword-based censoring approach has been widely
applied in OSN websites, such as YouTube and World of Warcraft. However, the filtering result
is not as desired; crudely removing words from user messages breaks the readability of the
messages. Replacing offensive words with symbols usually makes it easy to guess the original
offensive words. The idea of substitution seems tempting, but accurate substitution is usually
impractical. Inaccurate substitution will introduce additional issues. For example, in 2001,
Yahoo! deployed an email filter which could automatically replace certain words in emails with
family-friendly words. This filter was criticized as a “foolish filter” by BBC News because of its
inaccurate substitution.
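For reference, the baseline keyword censoring behaviour can be sketched in a few lines; the blacklist entries and their masked replacements below are hypothetical.

import re

BLACKLIST = {"fuck": "f**k", "bitch": "b***h"}     # hypothetical blacklist entries

def keyword_censor(message):
    # mask blacklisted words in place; this is the baseline approach, not the proposed filter
    pattern = re.compile("|".join(re.escape(w) for w in BLACKLIST), re.IGNORECASE)
    return pattern.sub(lambda m: BLACKLIST[m.group(0).lower()], message)

print(keyword_censor("What the fuck is wrong with you?"))   # "What the f**k is wrong with you?"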
To demonstrate the shortcoming of keyword censoring approaches, we present an example
below.
Filtering results with Keyword Censoring
Original comment: “What the fuck is wrong with you?”
Keyword Censoring: “What the f**k is wrong with you?”
According to the presented filtering results, readers can still easily understand what the offender
wants to say and can even infer the removed words. This indicates a filtering failure
because the offensive opinion has been successfully delivered to the victim. Also, removing words
from a sentence without considering their context breaks the readability of the rest of the sentence.
Compared with keyword censoring approaches, our proposed semantic filtering approach is
much more sophisticated and can achieve thorough filtering by utilizing the grammatical
relations among words in the sentence. Given a sentence containing both offensive and
inoffensive words, not only the offensive words but also the inoffensive words that assist in expressing
offensive opinions will be removed during our filtering. In this way, we essentially stop the
delivery of the offensive opinion, and there is no way to infer the offensive content of the original
message after filtering.
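A highly simplified sketch of this philosophy is given below: each offensive word is removed together with its dependency subtree, so that the remaining words still read as a sentence. The offensive-word lexicon and the hand-written dependency map stand in for the real lexicon and parser output used by the system.

OFFENSIVE = {"fuck", "idiot"}     # hypothetical offensive-word lexicon

def semantic_filter(tokens, children):
    # children maps a head's index to the indices of its dependents (simplified parser output)
    def subtree(idx):
        nodes = {idx}
        for child in children.get(idx, []):
            nodes |= subtree(child)
        return nodes

    to_remove = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in OFFENSIVE:
            to_remove |= subtree(i)                # drop the offensive word and its dependents
    return " ".join(t for i, t in enumerate(tokens) if i not in to_remove)

tokens = ["What", "the", "fuck", "is", "wrong", "with", "you", "?"]
children = {2: [1]}                                # fuck-2 governs the-1; other relations omitted
print(semantic_filter(tokens, children))           # "What is wrong with you ?"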
3.2.2 Content Control Approach
Content control approaches are usually deployed at the user side or ISP side to prevent users from
seeing inappropriate content on the Internet. Their filtering is usually done based on certain criteria,
such as URL address, the occurrence of offensive words, and topic classification. Here our focus
is on text-based criteria.
For example, consider a sentence-based content control approach whose threshold is set on the
number of offensive words in a sentence. If at least one offensive word is detected within
a sentence, the filter will remove the whole sentence from the user message.
To demonstrate the shortcoming of content control approaches, we present examples below.
Filtering results with the Content Control approach
Original comment: “What the fuck is wrong with you?”
Content Control: “ ”
However, content control approaches are too coarse-grained to be applied in online communities.
First of all, an offender can easily bypass the filtering once he or she knows the estimation criteria.
More importantly, a sentence in a user comment may contain both offensive and inoffensive content,
and the inoffensive part may be removed falsely because of the offensive part. Not allowing users to post
inoffensive content would easily drive them away and thus affect the growth of the community.
Compared with content control approaches, we provide a fine-grained filtering by removing only
the smallest syntactic part in the sentence containing offensive language. The inoffensive content
in the original message will remain; thereby, user still has the freedom of speech for posting
inoffensive content. We believe such delicate filtering will be more acceptable to online
communities.
3.2.3 Manual Filtering Approach
Manual filtering is believed to produce the best filtering result. Basically, user messages are reviewed by the community administrator before being posted on the website.
Filtering results with Manual Filtering Approach
Original comment: “What the fuck is wrong with you?”
Manual Filtering: “What is wrong with you?”
As shown above, the administrator is able to easily understand what the author wants to express
and precisely remove only the offensive content within the message.
However, manual filtering is very time- and labor-consuming, making it impossible to apply widely. For example, in the Linda-Ikeji blog community (http://lindaikeji.blogspot.com), the blog administrator manually reviews and filters user comments on some celebrities' public blogs. Obviously, users then experience a delay between posting a comment on a blog and the comment appearing on the blog's webpage. Further, the filtering relies entirely on the judgment of the community administrator. Our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words in order to remove the offensive content semantically. The proposed semantic filtering approach is fully automatic, requiring no intervention from any administrator.
3.3 Problem of the existing approaches
From the study of the existing approaches and based on the information provided above, the following problems have been identified:
 Using the keyword censoring approach, readers can still easily understand what the offender wants to say and can even infer the removed words. This indicates a failure of the filtering, because the offensive opinion has still been delivered to the victims. Also, removing words from a sentence without considering their context breaks the readability of the rest of the sentence.
 The content control approaches are too coarse-grained to be applied in online communities. An offender can easily bypass the filtering once the estimation criteria are known and, more importantly, a sentence in a user comment may contain both offensive and inoffensive content. The inoffensive part may be removed falsely because of the offensive part, and not allowing users to post inoffensive content would easily drive users away and thus affect the growth of the community.
 The manual filtering approach is very time- and labor-consuming. The administrator has to manually review and filter all the users' comments and messages, making it impossible to apply widely. Also, the filtering relies entirely on the judgment of the community administrator.
3.4 Proposed Filtering Philosophy
The goal of our semantic filtering is to achieve filtering results close to those of manual filtering.
To reach this goal, the foremost thing is to answer the question about how the filtering should be
performed in order to get the desired filtering results. In this section, we present our answer in
three steps. First, we analyze the characteristics of offensive text content in user messages. Then,
we introduce our filtering philosophy according to the summarized characteristics. Finally, we
show how this philosophy is transformed into heuristic rules applicable in the filtering process.
3.4.1 Offensive Language Text Content
Based on the observation on user comments collected from YouTube website, a sentence in a
user message may contain both offensive and inoffensive text content. Offensive text content is
exposed intentionally with purpose of bringing negative influence to victims (e.g., the readers of
message). The victim receives the negative influence by reading the offensive part of sentence
and understanding the carried offensive information.
Hence, the information carried by original sentence can be represented as
I = Ioff + Iinoff
The offender reaches his goal when the offensive information Ioff is delivered to readers.
Therefore, to achieve a thorough filtering, all words used to deliver Ioff should be removed.
Meanwhile, with respect to free speech, the part with Iinoff should be saved.
3.4.2 Filtering Philosophy
According to the analysis, we propose the philosophy that should be followed in sentence-level
offensive language filtering:
 Precisely identify all offensive contents and remove them semantically, so that viewers
will not notice the existence of offensive language in the original sentence;
 Keep the readability and inoffensive content in the sentence, so that the author will still
be allowed to express his opinion freely as long as it is not offensive;
This is called the philosophy of “filtering instead of blocking”. To the filter, the philosophy
states that: if removing one word will make another word meaningless or confusing to readers,
we should consider removing both words to keep the readability of a filtered sentence;
meanwhile, we only remove words that are affected by offensive words.
For example, consider the sentence “Samuel said it and what the fuck is wrong with what he said?”, and suppose “fuck” is the only offensive word. The sentence can be separated into two parts: the first part, “Samuel said it”, is inoffensive, while the second part, “what the fuck is wrong with what he said?”, is offensive. Therefore, we should remove the offensive word in the second part while keeping the first part and still leaving the sentence meaningful and readable, i.e. we
won't have:
Samuel said it and what the is wrong with what he said? (Wrong)
But
Samuel said it and what is wrong with what he said? (Correct)
The words “the” and “fuck” must be removed in order to keep the transparency of filtering as
well as the readability of filtered text content.
3.4.3 Filtering Rules
Specifically, the proposed philosophy is transformed into two heuristic rules to estimate the
impact of removing words in a sentence.
Rule 1. (Modification Relation) In a modification relation, if the modifier is determined to be offensive, removing the modifier alone is enough; if the head is determined to be offensive, both the head and the modifier should be removed.
The modification relation is a binary semantic relationship between two syntactic elements, such as words or phrases. One element is named the head and the other the modifier. The modifier is used to describe the head (i.e. the modified component). Semantically, modifiers describe and provide a more accurate definitional meaning for the head. As the modifier acts as a complement, removing the modifier typically does not affect the grammaticality of the construction. For example, in the sentence “she likes red apples.”, the adjective “red” modifies the noun “apples”; removing “red” keeps the readability of the rest of the sentence. We admit that removing modifiers loses some of the information they carry. However, if the modifier is determined removable but the head is not, removing the modifier removes only the offensive information.
Rule 2. (Pattern Integrity) If removing the offensive word breaks the integrity of the sentence's basic pattern, the whole sentence should be removed in order to keep the readability.
English sentences and clauses are organized in basic patterns, such as “Subject-Verb”, “Subject-Verb-Object”, “Subject-Verb-Adjective”, “Subject-Verb-Adverb”, and “Subject-Verb-Noun”. Every sentence or clause can be categorized into one pattern. The integrity of the basic pattern is essential to the readability of the content. For example, the sentence “she sleeps on the sofa.” follows the “Subject-Verb” pattern. If we only remove “sleeps”, the rest of the sentence, “she on the sofa.”, becomes meaningless.
We will be applying these two rules during the filtering of the sentences.
3.5 Identify Removable Content by Grammatical Relations
A text or user message can be decomposed into a sequence of sentences. Each sentence is
considered as a unit in filtering. Given a sentence containing both offensive words and
inoffensive words, the goal of filtering is to identify inoffensive words which should be removed
together with offensive words. We define the words that should be removed by the filtering as
“removable” words.
We noticed that manual filtering can easily achieve this goal because humans can easily understand the context of words in a sentence and precisely identify which words should be removed together with the known offensive words. So, we mimic manual filtering: we extract the grammatical relations among the words of a sentence and use the proposed filtering rules to estimate the impact of removing offensive words on the other, inoffensive words based on the extracted grammatical relations.
Specifically, the proposed approach includes two steps. In the first step, we scan the sentence and
see if offensive words exist. If offensive words exist, we continue to retrieve grammatical
information (i.e. Part-of-Speech tags and typed dependency relations) among words in the
sentence. Using retrieved grammatical information, we create a tree data structure, named
RelTree, for the second step estimation. In this second step, we propose a set of estimation
functions following the filtering rules we proposed. Using the RelTree structure and the proposed
rules, we then estimate if there are inoffensive words that should be removed together with those
identified offensive words.
An overview of our semantic filtering approach is shown in Algorithm 1 below. Within the algorithm, the functions POStagging and TDgenerator generate Part-of-Speech tags and typed dependency relations, respectively; we use existing NLP (Natural Language Processing) tools to implement these two functions. We focus on the design of the two other functions, CreateRelTree and EstimateRelTree.
In this methodology, we are assuming that the filtering is based on a comprehensive offensive
lexicon containing all offensive words. Words that do not appear in the lexicon are considered
inoffensive.
input : a text comment T,
        a blacklist of offensive words Blacklist
output: a filtered text comment T′
1  T′ ← “”;
2  senList ← chunk T into a list of sentences;
3  foreach sentence s ∈ senList do
4      scan s for offensive words using Blacklist;
5      if no offensive word found then
6          T′ ← T′ + s;
7      end
8      else
9          PTree ← POStagging(s);                                /* get parse tree */
10         TDset ← TDgenerator(s);                               /* get typed dependency relations */
11         RelTree ← CreateRelTree(PTree, TDset);                /* create RelTree */
12         LabelRelTree ← EstimateRelTree(RelTree, Blacklist);   /* estimate using RelTree */
13         s′ ← remove all words in LabelRelTree that are labeled as “removable”;
14         T′ ← T′ + s′;
15     end
16 end
17 Return T′;
Algorithm 1: Procedure of Semantic Filtering
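For illustration, the skeleton below expresses the control flow of Algorithm 1 in Java. The four injected functions are hypothetical hooks standing in for POStagging, TDgenerator, CreateRelTree and the combined estimation-and-removal step; the sentence chunking and the offensive-word scan are deliberately naive so the sketch compiles on its own.

import java.util.Arrays;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

public class SemanticFilter {

    public static String filter(String comment,
                                Set<String> blacklist,
                                Function<String, Object> posTagging,                 // s -> parse tree
                                Function<String, Object> tdGenerator,                // s -> typed dependencies
                                BiFunction<Object, Object, Object> createRelTree,    // (PTree, TDset) -> RelTree
                                BiFunction<Object, Set<String>, String> estimateAndClean) {
        StringBuilder filtered = new StringBuilder();                  // line 1: T' <- ""
        for (String s : comment.split("(?<=[.!?])\\s+")) {             // line 2: chunk into sentences (naive)
            boolean offensive = Arrays.stream(s.toLowerCase().split("\\W+"))
                                      .anyMatch(blacklist::contains);  // line 4: scan for offensive words
            if (!offensive) {
                filtered.append(s).append(' ');                        // line 6: keep the sentence as-is
            } else {
                Object pTree   = posTagging.apply(s);                  // line 9
                Object tdSet   = tdGenerator.apply(s);                 // line 10
                Object relTree = createRelTree.apply(pTree, tdSet);    // line 11
                String cleaned = estimateAndClean.apply(relTree, blacklist); // lines 12-13
                filtered.append(cleaned).append(' ');                  // line 14
            }
        }
        return filtered.toString().trim();                             // line 17
    }
}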
3.5.1 First Step: Grammatical Analysis
In the first step, we extract two types of grammatical information from a given sentence. One is
the Part-of-Speech information associated with every word. The other is the dependency relation
among words. Part-of-Speech information helps us to understand the organization of a sentence,
which is essential for keeping the readability when we try to remove words from a sentence.
Dependency relations will be used directly to estimate the impact of removing one word on other
semantically related words, making the filtering more “meaningful”. Combining these two types
of information, we can create a new data structure, called RelTree, for the next-step estimation.
3.5.1.1 Part of Speech Tagging
Part-of-Speech tagging has been widely used in Natural Language Processing applications to
identify the syntactic properties of lexical items in a sentence, such as words or phrases. Through Part-of-Speech tagging, the sentence can be represented as a tree structure based on Part-of-Speech tags. We adopt the Penn Treebank tag set for our Part-of-Speech tagging.
An example of Penn Treebank style parse tree is shown in Figure 1 below.
Figure 1: A parse tree of a sentence based on Part-of-Speech tags
Here, the leaf nodes are words appearing in the sentence. The non-leaf nodes represent syntactic
elements, such as phrases or clauses. Each element consists of the words within its subtree. For
example, the words “said” and “it” constitute a Verb Phrase (i.e. VP) node.
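As an illustration of how such a parse tree can be obtained in practice, the sketch below uses the Stanford CoreNLP pipeline. This is an assumption for the sake of the example: we only require that some existing tagger/parser produces Penn Treebank style output, and the sketch assumes CoreNLP 3.9 or later with its English models on the classpath.

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.Tree;
import java.util.Properties;

public class PosTaggingDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // tokenize, split sentences, POS-tag and build a constituency parse tree
        props.setProperty("annotators", "tokenize,ssplit,pos,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Samuel said it and what is wrong with what he said?");
        pipeline.annotate(doc);

        for (CoreSentence sentence : doc.sentences()) {
            System.out.println(sentence.posTags());      // Penn Treebank POS tags, one per token
            Tree parse = sentence.constituencyParse();   // parse tree of the kind shown in Figure 1
            parse.pennPrint();                           // bracketed Penn Treebank format
        }
    }
}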
3.5.1.2 Typed Dependency Relations
Typed dependencies are general relations describing the grammatical dependencies within a sentence, proposed by the Stanford Natural Language Processing Group. Each typed dependency consists of a dependency type and a (governor, dependent) word pair. For example, in the sentence “what the fuck is wrong with what he said?”, the typed dependency amod(wrong, fuck) means that “fuck” is an adjectival modifier of a noun phrase containing “wrong”. A typed dependency may represent the dependency relation between two syntactic elements, not only between single words.
Fig 2: An example of typed dependency graph
The typed dependencies in a sentence can be represented as a graph. For example, Figure 2 shows the typed dependency relations for the same sentence shown in Figure 1. We explain the relations appearing in Figure 2 from left to right: the nominal subject relation, nsubj(it, Samuel), means that “Samuel” is the syntactic subject of the clause (likewise for nsubj(wrong, he)); the copula relation, cop(it, said), means that “it” is the complement of the verb “said” (likewise for cop(wrong, is)); the determiner relation, det(fuck, the), means that “the” is a determiner of “fuck”; the adjectival modifier relation, amod(fuck, wrong), means that “fuck” serves as an adjectival modifier of “wrong”; and the conjunct relation, conj_and(it, wrong), means that the coordinating conjunction “and” connects two elements with heads “it” and “wrong”, respectively.
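The typed dependencies can be extracted with the same kind of tooling. The sketch below, again assuming Stanford CoreNLP with its depparse annotator, prints every relation in the type(governor, dependent) form used above. Note that recent CoreNLP releases emit Universal Dependencies, whose relation names may differ slightly from the original Stanford typed dependency names used in this chapter.

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import java.util.Properties;

public class TypedDependencyDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Samuel said it and what is wrong with what he said?");
        pipeline.annotate(doc);

        SemanticGraph graph = doc.sentences().get(0).dependencyParse();
        for (SemanticGraphEdge edge : graph.edgeListSorted()) {
            // prints one line per relation, e.g. nsubj(...), det(...), conj(...)
            System.out.printf("%s(%s, %s)%n",
                    edge.getRelation().toString(),
                    edge.getGovernor().word(),
                    edge.getDependent().word());
        }
    }
}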
3.5.1.3 Relation Tree (RelTree)
Both the Part-of-Speech tags and the typed dependency relations are utilized in the second-step estimation. The parse tree shows the syntactic organization of the sentence, and the typed dependency relations provide semantic information among the words. To combine both kinds of information, we propose a new data structure called RelTree.
In a RelTree, the leaf nodes are the words in the sentence, and each non-leaf node represents either a phrase or a clause inside the sentence. With each non-leaf node, we associate the set of typed dependency relations over the words within its subtree. Each node only contains the typed dependency relations that have not already appeared in the nodes of its subtree.
Figure 3: A RelTree combining the parse tree and typed dependency relations
input : a parse tree PTree,
        a set of typed dependency relations TDset
output: a RelTree RelTree
1  RelTree ← PTree;
2  Remove all word nodes in RelTree;
3  Traverse RelTree in postorder; foreach node n visited do
4      if n is a leaf node then
5          n.wordset ← {n};                              /* create word nodes */
6      end
7      if n is not a leaf node then
8          n.wordset ← ∅;
9          foreach direct child node ci do
10             n.wordset ← n.wordset ∪ ci.wordset;
11         end
12         n.rel ← ∅;
13         foreach relation Ti(Gi, Di) in TDset do
14             if Gi ∈ n.wordset and Di ∈ n.wordset then
15                 n.rel ← n.rel ∪ {Ti(Gi, Di)};
16                 TDset ← TDset − {Ti(Gi, Di)};
17             end
18         end
19     end
20 end
21 Return RelTree;
Algorithm 2: create a RelTree using the parse tree and typed dependency relations
The RelTree data structure is proposed only for the convenience of the offensiveness estimation in the next step. Algorithm 2 shows the algorithm for RelTree construction. With the parse tree PTree given, the computational complexity of CreateRelTree depends on the post-order traversal and the search in TDset. As the number of relations never exceeds N(N − 1)/2, where N is the number of words in the sentence, the computational complexity is O(N³). This computational complexity is acceptable, and there are many ways to improve the efficiency of the implementation of this algorithm.
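One possible Java realisation of the RelTree node and of the core of Algorithm 2 is sketched below. The class and field names are illustrative assumptions; the sketch also assumes the constituency tree has already been converted into RelTreeNode objects with word leaves, and it uses plain strings for words where a full implementation would use token indices.

import java.util.*;

// One node of a RelTree. Leaf nodes hold a single word; non-leaf nodes hold
// the words of their subtree plus the typed dependencies first covered there.
class RelTreeNode {
    final String word;                                    // null for non-leaf (phrase/clause) nodes
    String posTag;                                        // Part-of-Speech tag for leaf nodes (e.g. NN, JJ, CC)
    final List<RelTreeNode> children = new ArrayList<>();
    final Set<String> wordSet = new LinkedHashSet<>();    // words covered by this subtree
    final List<TypedDep> relations = new ArrayList<>();   // relations first covered at this node
    boolean removable = false;

    RelTreeNode(String word) { this.word = word; }
    boolean isLeaf() { return children.isEmpty(); }
}

// A typed dependency as a (type, governor, dependent) triple; a real implementation
// would use token indices rather than plain strings to distinguish repeated words.
class TypedDep {
    final String type, governor, dependent;
    TypedDep(String type, String governor, String dependent) {
        this.type = type; this.governor = governor; this.dependent = dependent;
    }
}

class RelTreeBuilder {
    // Algorithm 2: attach each typed dependency to the lowest node whose word set
    // contains both its governor and its dependent, filling word sets bottom-up.
    static void annotate(RelTreeNode node, Set<TypedDep> tdSet) {
        if (node.isLeaf()) {
            node.wordSet.add(node.word);
            return;
        }
        for (RelTreeNode child : node.children) {
            annotate(child, tdSet);                       // post-order traversal
            node.wordSet.addAll(child.wordSet);
        }
        for (Iterator<TypedDep> it = tdSet.iterator(); it.hasNext(); ) {
            TypedDep td = it.next();
            if (node.wordSet.contains(td.governor) && node.wordSet.contains(td.dependent)) {
                node.relations.add(td);
                it.remove();                              // each relation ends up in exactly one node
            }
        }
    }
}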
3.5.2 Step Two: Bottom-Up Estimation
In the second step, we first use the offensive lexicon to identify offensive words in the sentence.
The leaf node with an offensive word will be labeled as “removable”. Starting from leaf nodes,
we perform bottom-up estimation through a postorder traversal on the RelTree.
For each non-leaf node in the RelTree, we estimate whether it should be removed based on (1)
the associated typed dependency relations and (2) its child nodes within its subtree. If a non-leaf
node is estimated to be “removable”, all its descendants, including words, within its subtree will
also be labeled as “removable”. The meaning of “removable” to a non-leaf node is that all words,
phrases, or even clauses within its subtree have been determined to be removed at the end of
filtering. The estimation process includes two steps. We first estimate based on typed
dependency relations, and then apply a set of heuristic rules as complements.
3.5.2.1 Estimation with Typed Dependency Relations
Consider a non-leaf node n in a RelTree with a set n.rel of typed dependency relations. Each
relation describes a semantic connection between a governor word and a dependent word. Both
words are leaf nodes in the subtree rooted at n. n.rel could be empty when n only has one child
node. For each typed dependency relation in n.rel, we study its semantic information and map it
to an estimation function.
These estimation functions and mapping are created following the Modification Relation and
Pattern Integrity rules. Take the Direct Object (dobj) relation as an example. The dobj(G, D) relation is defined as follows: the direct object of the verb phrase containing the governor word G is the noun phrase containing the dependent word D. For example, in the relation dobj(win, match), “win” is the governor word and “match” is the dependent word. According to the Pattern Integrity rule, we know that “Subject-Verb-Object” is a basic pattern. Therefore, if either the phrase containing G or the phrase containing D is to be removed because of offensiveness, both phrases should be removed together.
To formalize, we define an estimation function H(T) = H(P(G)) OR H(P(D)) and map the relation dobj(G, D) to it. We use the symbols C(G) and P(G) to denote the clause and the phrase containing word G as head, respectively. In this estimation function, H(T) is the label to be assigned to the relation T, and H(P(G)) is the label of the phrase node containing G in the RelTree.
Using the estimation function, we generate a label for every relation associated with node n and
then for the node itself. If a relation T(G,D) of node n is estimated and labeled as “removable”,
the two child nodes of n, containing word G and word D, will be labeled as “removable”. If all
relations in n.rel are labeled as “removable”, the node n as well as all its descendants, will be
labeled as “removable”.
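To make the mapping from relation types to estimation functions concrete, a minimal sketch is given below (assuming Java 9+ for Map.of). The Label enum, the function shape and the RULES map are illustrative assumptions; only the dobj rule H(T) = H(P(G)) OR H(P(D)) from the text is encoded, and a complete filter would register one function per typed dependency relation used in the filtering.

import java.util.Map;
import java.util.function.BinaryOperator;

enum Label { KEEP, REMOVABLE }

class EstimationFunctions {

    // H(T) = H(P(G)) OR H(P(D)) for the dobj relation: the "Subject-Verb-Object"
    // pattern loses its integrity if either the verb phrase or the object phrase goes,
    // so the relation is removable as soon as one of the two phrases is removable.
    static final BinaryOperator<Label> DOBJ =
            (govPhrase, depPhrase) ->
                    (govPhrase == Label.REMOVABLE || depPhrase == Label.REMOVABLE)
                            ? Label.REMOVABLE : Label.KEEP;

    // One entry per typed dependency type; a full system would map every relation
    // type used in the filtering to its own function derived from the two rules.
    static final Map<String, BinaryOperator<Label>> RULES = Map.of("dobj", DOBJ);
}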
3.5.2.2 Estimation with Heuristic Rules
Heuristic rules are also applied as a complement after the typed dependency relation estimation. Applying heuristic rules is necessary mainly for two reasons. First of all, the typed dependency relations contain some, but limited, syntactic information. For example, the possessive ending (i.e. POS) tag, which is quite a common Part-of-Speech tag, is ignored during typed dependency tagging.
Secondly, not all relations between syntactic elements in a sentence can be classified into one of the typed dependency relations. For such uncertain relations, a generic grammatical relation named dep is defined. To prevent confusing the filter, we map dep to the rule H(T) = H(G) AND H(D), which means that labeling either G or D as removable does not, by itself, affect the other or the label of T. Because the dep relation stands for an uncertain relation, we have to rely on the Part-of-Speech tags in the RelTree for our filtering.
Take the conj tag node rule as an example. The conjunct relation (conj) is a relation between two syntactic elements connected by a coordinating conjunction, such as “and”. The parameters of conj do not include the coordinating conjunction itself; however, in the sentence, the coordinating conjunction sits between the two parameters of conj. If one side is determined removable, the coordinating conjunction should be removed as well. For example, in the sentence “I like A and B”, if either A or B is removed, the coordinating conjunction “and” should also be removed.
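A sketch of this heuristic, reusing the illustrative RelTreeNode structure from the earlier sketch (with its posTag field for leaf nodes), is given below. The assumption is that a leaf tagged CC that sits between two sibling nodes is removed whenever either sibling is removable.

import java.util.List;

class ConjHeuristic {
    // Heuristic for the conj relation: a leaf carrying the POS tag "CC" (coordinating
    // conjunction) that sits between two sibling nodes is marked removable whenever
    // either of those siblings is removable, so no dangling "and"/"or" survives filtering.
    static void apply(RelTreeNode parent) {
        List<RelTreeNode> kids = parent.children;
        for (int i = 1; i < kids.size() - 1; i++) {
            RelTreeNode middle = kids.get(i);
            if (middle.isLeaf() && "CC".equals(middle.posTag)
                    && (kids.get(i - 1).removable || kids.get(i + 1).removable)) {
                middle.removable = true;
            }
        }
    }
}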
Figure 4: Estimate a RelTree in a bottom-up manner
3.5.2.3 Estimation Algorithm
To estimate and assign labels for all nodes in a RelTree, we perform the estimation also in a
bottom-up manner. Figure 4 shows an example estimation process. The number in the circle
represents the order of estimation for each node in the RelTree. The dashed nodes are estimated
as “removable”. For example, the clause node with nsubj(you, fuck) is estimated as “removable”
according to the estimation. Therefore, its two child nodes containing “you” and “fuck”
respectively are both labeled as “removable”. Moreover, the word “and” is removable according
to the heuristic rule (i.e. conj tag node rule), in order to keep the filtering transparent to readers.
Finally, inoffensive words, “what”, “the”, “is”, “wrong”, “with”, “he”, and “said”, are removed
with the offensive word, “fuck” in the filtering.
According to Algorithm 2, each typed dependency relation will appear exactly once in the
RelTree. No relation will be checked repeatedly in the estimation. The cleaned sentence after
filtering in this example will be “Samuel said it.”. As we can see, the result satisfies the
requirement of our proposed filtering philosophy. Only the offensive part, “what the fuck is
wrong with what he said”, is removed. The reader can still get the inoffensive information. The
detailed algorithm for estimation process is presented below.
input : a RelTree RelTree,
        a blacklist of offensive words Blacklist
output: a labeled RelTree LabelRelTree
1  LabelRelTree ← RelTree;
2  Label all leaf nodes with offensive words as “removable” in LabelRelTree;
3  Traverse LabelRelTree in postorder; foreach node n visited do
4      if n is a leaf node then
5          ignore;                                       /* already labeled */
6      end
7      if n is not a leaf node then
8          if n only has one child node then
9              n.label ← n.child.label;
10         end
11         if n has more than one child node then
12             Estimate the label for n from its associated relations and child labels,
               using the proposed estimation functions and heuristic rules;
13         end
14     end
15 end
16 Return LabelRelTree;
Algorithm 3: estimate nodes in a RelTree
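For illustration, the sketch below renders Algorithm 3 in Java, reusing the illustrative RelTreeNode, TypedDep, Label, EstimationFunctions and ConjHeuristic sketches from earlier in this chapter. The way labels propagate from relations to child subtrees is a simplified assumption that follows the description above; it is not a drop-in implementation.

import java.util.Set;

class RelTreeEstimator {

    // Bottom-up estimation over the RelTree (Algorithm 3).
    static void estimate(RelTreeNode node, Set<String> blacklist) {
        if (node.isLeaf()) {
            node.removable = blacklist.contains(node.word.toLowerCase());   // label offensive leaves
            return;
        }
        for (RelTreeNode child : node.children) {
            estimate(child, blacklist);                                     // post-order traversal
        }
        if (node.children.size() == 1) {
            node.removable = node.children.get(0).removable;                // single child: copy its label
            return;
        }
        // Multiple children: evaluate every typed dependency attached to this node.
        boolean allRemovable = !node.relations.isEmpty();
        for (TypedDep td : node.relations) {
            Label gov = labelOfChildContaining(node, td.governor);
            Label dep = labelOfChildContaining(node, td.dependent);
            Label result = EstimationFunctions.RULES
                    .getOrDefault(td.type,                                  // unmapped types fall back to the
                            (g, d) -> (g == Label.REMOVABLE && d == Label.REMOVABLE)  // dep rule: H(G) AND H(D)
                                    ? Label.REMOVABLE : Label.KEEP)
                    .apply(gov, dep);
            if (result == Label.REMOVABLE) {
                markChildContaining(node, td.governor);                     // a removable relation drags both
                markChildContaining(node, td.dependent);                    // phrases out of the sentence
            } else {
                allRemovable = false;
            }
        }
        ConjHeuristic.apply(node);                                          // heuristic rules as a complement
        if (allRemovable) {
            markSubtree(node);                                              // all relations removable => whole node goes
        }
    }

    private static Label labelOfChildContaining(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children) {
            if (c.wordSet.contains(word)) {
                return c.removable ? Label.REMOVABLE : Label.KEEP;
            }
        }
        return Label.KEEP;
    }

    private static void markChildContaining(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children) {
            if (c.wordSet.contains(word)) {
                markSubtree(c);
            }
        }
    }

    private static void markSubtree(RelTreeNode n) {
        n.removable = true;
        for (RelTreeNode c : n.children) {
            markSubtree(c);
        }
    }
}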
CHAPTER FOUR
IMPLEMENTATION
4.1. JUSTIFICATION OF PROGRAMMING LANGUAGE USED.
The offensive language filtering system is an online application implemented using HTML, JavaServer Pages (JSP), JavaScript, and the MySQL relational database software.
4.1.1 HTML
HTML, which stands for Hypertext Markup Language, is the predominant markup language for web pages. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs and lists, as well as for links, quotes and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of “tags” surrounded by angle brackets within the webpage content. It can embed or load scripts in languages such as JavaScript, which affect the behaviour of HTML processors like web browsers, and Cascading Style Sheets (CSS), which define the appearance and layout of text and other material.
4.1.2 JAVASCRIPT
JavaScript has been around for several years now, in many different flavors. The main benefit of
JavaScript is to add additional interaction between the web site and its visitors at the cost of a
little extra work by the web developer. JavaScript allows industrious web masters to get more out
of their website than HTML and CSS can provide.
By definition, JavaScript is a client-side scripting language. This means the web surfer's browser will be running the script. The opposite of client-side is server-side scripting, which occurs in a language like PHP; those scripts are run by the web hosting server.
There are many uses (and abuses!) for the powerful JavaScript language. Here, it is being used
for:
 Alert Messages
 Popup Windows
 HTML Form Data Validation
4.1.3 JAVASERVER PAGES (JSP)
JSP (JavaServer Pages) is an HTML-embedded server-side technology whose goal is to allow developers to write dynamically generated pages quickly. It is specifically designed for creating dynamic web pages. JSP will allow you to:
 Reduce the time to create large websites.
 Create a customized user experience for visitors based on information that you have
gathered from them.
 Open up thousands of possibilities for online tools.
Unlike some other server-side technologies, JSP has freely available open source implementations. When someone visits your JSP webpage, your web server processes the Java code. It determines which parts it needs to show to visitors (content and pictures) and hides the rest (file operations, calculations, etc.), then translates your JSP into HTML. After the translation into HTML, it sends the webpage to your visitor's web browser.
4.1.4 MYSQL
MySQL is the most popular open source database server in existence because of its consistent fast
performance, high reliability and ease of use. It's used in more than 6 million installations ranging
from large corporations to specialized embedded applications on every continent in the world. It
is very commonly used in conjunction with PHP scripts to create dynamic and powerful server
applications. MySQL has been criticized in the past because it does not have all the features of other Database Management Systems. However, MySQL continues to improve significantly with each major upgrade, and it has gained great popularity because of these improvements.
4.1.5 CSS
Cascading Style Sheets (CSS) are a way to control the look and feel of HTML documents in an organized and efficient manner. CSS enables us to add new looks to existing HTML, completely restyle a web site with only a few changes to the CSS code, and reuse the “style” we create on any webpage we wish. With CSS you will be able to:
 Add new looks to your old HTML
 Completely restyle a web site with only a few changes to your CSS code
 Use the "style" you create on any webpage you wish
4.2 System Specification
The system specification is divided into two parts:
1. Hardware Specification
2. Software Specification
4.2.1 HARDWARE SPECIFICATION FOR THE APPLICATION
Any computer tagged by the manufacturer as a workstation can be used to access this application
using the internet browser, but the following minimum specification would be required to host
the application:
1. A computer tagged by the manufacturer as a server
2. A Core 2 Duo processor or better
3. 2GB of memory or more
4. A keyboard and a mouse
5. A hard disk of 120GB and above
4.2.2 SOFTWARE SPECIFICATION FOR THE APPLICATION
 Windows Server 2005 and above
 Microsoft .NET framework version 3.0 and above must be installed
 Microsoft SQL Server 2005 and above should be installed
 Microsoft Internet Information Server (IIS) should be enabled
 Server FTP capability must be enabled
4.3 System Implementation
This section briefly describes the screens of the online application.
4.3.1 Application Login Screen
This system contains a secure login panel that requires a combination of email address and
password. The email address is used because it is meant to be unique.
Fig 4.1 – Web Application Login Screen
4.3.2 Application Registration Page
FIG. 4.2 – Web Application Registration Page
Here the user fills in his/her details and the system verifies that all the details provided are correct. The page also includes a CAPTCHA image, which acts as a spam guard to ensure that the data was entered by a human and not a robot.
4.3.3 Post and Comment Page
FIG. 4.3 – Filtered Post Page Using Keyword Censoring Approach
FIG. 4.4 – Filtered Post Page Using Content Control Censoring Approach
FIG. 4.5 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above, we can see that our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words, and it has removed the offensive content semantically. The proposed semantic filtering approach is fully automated, requires no intervention from an administrator, and at the same time eliminates the offensive words in the sentence.
“What the fuck is wrong with you?” has been changed to “What is wrong with you?” using the proposed semantic filtering approach, instead of “What the f*** is wrong with you?”, which still delivers the offensive words to the victims.
Our semantic filtering result is also close to that of manual filtering, as the desired results have been produced just by applying the heuristic rules in the filtering process.
FIG. 4.6 – Filtered Post Page Using Keyword Censoring Approach
FIG. 4.7 – Filtered Post Page Using Content Control Censoring Approach
FIG. 4.8 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above in Fig. 4.6, 4.7 and 4.8, we can see that our proposed semantic filtering approach again mimics the procedure of manual filtering by trying to understand the relations among words and has removed the offensive content semantically. The proposed semantic filtering approach is fully automated, requires no intervention from an administrator, and at the same time eliminates the offensive words in the sentence.
“I have told all these bitches to stop calling my husband’s phone” has been changed to “I have told all to stop calling my husband’s phone” using the proposed semantic filtering approach, instead of “I have told all these b****** to stop calling my husband’s phone”, which still delivers the offensive words to the victims.
Our semantic filtering result is also close to that of manual filtering, as the desired results have been produced just by applying the heuristic rules in the filtering process.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary and Conclusion
Online social networking sites have become increasingly popular with children, especially young
teens, as a place where they can meet other people, communicate, and exchange information.
This has also brought cyberbullying which is a fast growing trend that experts believe is more
harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via online social
networking communities. Victims can be reached anytime and at anyplace. For many children,
home is no longer a refuge from the bullies. Children can escape threats and abuse in the
classroom, only to find offensive comments and posts from the same tormentors when they
arrive home. There’s no safe place anymore and one can be bullied 24/7; even in the privacy of
his/her own bedroom.
However, we are not only trying to filter out offensive content but also making sure that the filtered sentences still make sense. Statistical analysis has revealed that more than 60% of insulting messages are posted as direct insults, and direct insulting messages always contain insulting words or phrases. From a psychological point of view, if these messages are categorized and users are restricted from sending such messages, the human intention to post or exchange abusive messages can be significantly reduced.
Offensive language is a serious problem facing the online community. Our semantic filtering
technique is based on the grammatical relations of words in a sentence so that the rest of the
filtered sentence is readable and the existence of offensive words in the original sentence is hard
to notice. We tested the effectiveness of our approach with a large dataset, and the results show that our techniques are very effective and accurate with little processing overhead.
5.2 Recommendation
Our future work includes addressing the issues discussed above. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other lightweight NLP techniques to speed up sentence parsing. Last but not least, we also plan to extend our filtering approach to support other languages such as Chinese and French.

More Related Content

What's hot

THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...Kasthuripriya Nanda Kumar
 
The Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student ExperienceThe Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student Experience
Liz Gross, Ph.D.
 
Research Paper - Facebook
Research Paper - FacebookResearch Paper - Facebook
Research Paper - Facebook
GuiM _
 
The use of social media among nigerian youths.2
The use of social media among nigerian youths.2The use of social media among nigerian youths.2
The use of social media among nigerian youths.2
Lami Attah
 
The Effects on Social Networking on Education
The Effects on Social Networking on EducationThe Effects on Social Networking on Education
The Effects on Social Networking on Education
Nash Nash
 
Cyberbullying Resources
Cyberbullying ResourcesCyberbullying Resources
Cyberbullying Resources
Andy Jeter
 
Facebook and Academic Performance
Facebook and Academic PerformanceFacebook and Academic Performance
Facebook and Academic Performance
Htet Khaing
 
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Imesha Perera
 
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
Mahesh Kodituwakku
 
Example of Proposal
Example of ProposalExample of Proposal
Example of Proposal
JohanEddyLuaran
 
Negative impacts of social media as my space and facebook on teenagers in th...
Negative impacts of social media as my space and facebook on teenagers  in th...Negative impacts of social media as my space and facebook on teenagers  in th...
Negative impacts of social media as my space and facebook on teenagers in th...
GeorgeDolezal
 
The effects of social media on college students
The effects of social media on college studentsThe effects of social media on college students
The effects of social media on college studentsArina Fauzi
 
Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?
Evan Atkinson
 
Social Networking Sites and Reference Services
Social Networking Sites and Reference ServicesSocial Networking Sites and Reference Services
Social Networking Sites and Reference Services
Stephen Francoeur
 
Introduction to Social Media for Researchers
Introduction to Social Media for ResearchersIntroduction to Social Media for Researchers
Introduction to Social Media for Researchers
Helen Dixon
 
Impact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_studentsImpact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_studentsmiftah uddin
 
Social Media Effects on Study Habits
Social Media Effects on Study HabitsSocial Media Effects on Study Habits
Social Media Effects on Study HabitsRobert Breen
 
effects of Social media
effects of Social mediaeffects of Social media
effects of Social media
kimi7792
 
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDYIMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
Sajjad Sayed
 

What's hot (20)

THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
 
The Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student ExperienceThe Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student Experience
 
Final project
Final projectFinal project
Final project
 
Research Paper - Facebook
Research Paper - FacebookResearch Paper - Facebook
Research Paper - Facebook
 
The use of social media among nigerian youths.2
The use of social media among nigerian youths.2The use of social media among nigerian youths.2
The use of social media among nigerian youths.2
 
The Effects on Social Networking on Education
The Effects on Social Networking on EducationThe Effects on Social Networking on Education
The Effects on Social Networking on Education
 
Cyberbullying Resources
Cyberbullying ResourcesCyberbullying Resources
Cyberbullying Resources
 
Facebook and Academic Performance
Facebook and Academic PerformanceFacebook and Academic Performance
Facebook and Academic Performance
 
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
 
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
 
Example of Proposal
Example of ProposalExample of Proposal
Example of Proposal
 
Negative impacts of social media as my space and facebook on teenagers in th...
Negative impacts of social media as my space and facebook on teenagers  in th...Negative impacts of social media as my space and facebook on teenagers  in th...
Negative impacts of social media as my space and facebook on teenagers in th...
 
The effects of social media on college students
The effects of social media on college studentsThe effects of social media on college students
The effects of social media on college students
 
Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?
 
Social Networking Sites and Reference Services
Social Networking Sites and Reference ServicesSocial Networking Sites and Reference Services
Social Networking Sites and Reference Services
 
Introduction to Social Media for Researchers
Introduction to Social Media for ResearchersIntroduction to Social Media for Researchers
Introduction to Social Media for Researchers
 
Impact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_studentsImpact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_students
 
Social Media Effects on Study Habits
Social Media Effects on Study HabitsSocial Media Effects on Study Habits
Social Media Effects on Study Habits
 
effects of Social media
effects of Social mediaeffects of Social media
effects of Social media
 
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDYIMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
 

Viewers also liked

Portal de transparencia.
Portal de transparencia. Portal de transparencia.
Portal de transparencia.
Juan Antonio Díaz
 
Yo soy reykon el lider
Yo soy reykon el liderYo soy reykon el lider
Yo soy reykon el lider
Flakita Deysi
 
áLbum De FotografíAs
áLbum De FotografíAsáLbum De FotografíAs
áLbum De FotografíAsfipmerchi
 
Anamaria bolos
Anamaria bolosAnamaria bolos
Anamaria bolosrose
 
Facebook How To Guide
Facebook How To GuideFacebook How To Guide
Facebook How To Guide
adcieo
 
GUERRA FRÍA
GUERRA FRÍAGUERRA FRÍA
GUERRA FRÍA
Waldir So Mora
 
buscadores
buscadoresbuscadores
buscadores
cristonfo
 
Anima el texto
Anima el textoAnima el texto
Anima el texto
Martin Solano
 
El principito
El principitoEl principito
El principito
paqui_linan
 
Nfl week 10 all picks
Nfl week 10 all picksNfl week 10 all picks
Nfl week 10 all picks
FootballSucks
 
Nestle Company
Nestle CompanyNestle Company
Nestle Company
Avinash Labade
 

Viewers also liked (13)

Portal de transparencia.
Portal de transparencia. Portal de transparencia.
Portal de transparencia.
 
Yo soy reykon el lider
Yo soy reykon el liderYo soy reykon el lider
Yo soy reykon el lider
 
áLbum De FotografíAs
áLbum De FotografíAsáLbum De FotografíAs
áLbum De FotografíAs
 
Anamaria bolos
Anamaria bolosAnamaria bolos
Anamaria bolos
 
Facebook How To Guide
Facebook How To GuideFacebook How To Guide
Facebook How To Guide
 
GUERRA FRÍA
GUERRA FRÍAGUERRA FRÍA
GUERRA FRÍA
 
Deniz ortiz
Deniz ortizDeniz ortiz
Deniz ortiz
 
buscadores
buscadoresbuscadores
buscadores
 
Anima el texto
Anima el textoAnima el texto
Anima el texto
 
El principito
El principitoEl principito
El principito
 
Loi Hay Y Dep
Loi Hay Y DepLoi Hay Y Dep
Loi Hay Y Dep
 
Nfl week 10 all picks
Nfl week 10 all picksNfl week 10 all picks
Nfl week 10 all picks
 
Nestle Company
Nestle CompanyNestle Company
Nestle Company
 

Similar to SAMUEL FULL MSC PROJECT

2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_92021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
PuwaCalvin
 
Unit 1 cape sociology
Unit 1 cape sociologyUnit 1 cape sociology
Unit 1 cape sociology
Andreen18
 
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docxFernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
ssuser454af01
 
Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...
Alexander Decker
 
Caribbean studies IA Dejon Harris
Caribbean studies IA Dejon HarrisCaribbean studies IA Dejon Harris
Caribbean studies IA Dejon Harris
Dejon Harris
 
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATIONIMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
Arif, Mohammed Nazrul Islam
 
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
African Virtual University
 
Engaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social MediaEngaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social Media
Brittany Smith
 
The Teleological Divide and ICT
The Teleological Divide and ICTThe Teleological Divide and ICT
The Teleological Divide and ICT
Colin Harrison
 
NM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD EcosystemsNM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD Ecosystems
Julia Parra
 
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
cscpconf
 
Social Networks:Places of Learning?
Social Networks:Places of Learning?Social Networks:Places of Learning?
Social Networks:Places of Learning?
David Brear
 
Cyber / digital literacy.pptx
Cyber / digital literacy.pptxCyber / digital literacy.pptx
Cyber / digital literacy.pptx
Floralyn Victoria
 
cyberbullying detection seminar.pdf
cyberbullying detection seminar.pdfcyberbullying detection seminar.pdf
cyberbullying detection seminar.pdf
Akshay712352
 
Critique on social networking.
Critique on social networking.Critique on social networking.
Critique on social networking.Lucy Oliver
 
Analysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic studentsAnalysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic students
Jahangeer Qadiree
 
Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...
iosrjce
 
Business research report on Internet and children
Business research report on Internet and childrenBusiness research report on Internet and children
Business research report on Internet and children
Apon Comilla
 
trabajo de tesis
trabajo de tesistrabajo de tesis
trabajo de tesis
Keviin Alexander
 
Social Media and its impact on students
Social Media and its impact on studentsSocial Media and its impact on students
Social Media and its impact on students
HaxNain BalGhari
 

Similar to SAMUEL FULL MSC PROJECT (20)

2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_92021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
 
Unit 1 cape sociology
Unit 1 cape sociologyUnit 1 cape sociology
Unit 1 cape sociology
 
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docxFernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
 
Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...
 
Caribbean studies IA Dejon Harris
Caribbean studies IA Dejon HarrisCaribbean studies IA Dejon Harris
Caribbean studies IA Dejon Harris
 
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATIONIMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
 
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
 
Engaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social MediaEngaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social Media
 
The Teleological Divide and ICT
The Teleological Divide and ICTThe Teleological Divide and ICT
The Teleological Divide and ICT
 
NM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD EcosystemsNM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD Ecosystems
 
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
 
Social Networks:Places of Learning?
Social Networks:Places of Learning?Social Networks:Places of Learning?
Social Networks:Places of Learning?
 
Cyber / digital literacy.pptx
Cyber / digital literacy.pptxCyber / digital literacy.pptx
Cyber / digital literacy.pptx
 
cyberbullying detection seminar.pdf
cyberbullying detection seminar.pdfcyberbullying detection seminar.pdf
cyberbullying detection seminar.pdf
 
Critique on social networking.
Critique on social networking.Critique on social networking.
Critique on social networking.
 
Analysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic studentsAnalysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic students
 
Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...
 
Business research report on Internet and children
Business research report on Internet and childrenBusiness research report on Internet and children
Business research report on Internet and children
 
trabajo de tesis
trabajo de tesistrabajo de tesis
trabajo de tesis
 
Social Media and its impact on students
Social Media and its impact on studentsSocial Media and its impact on students
Social Media and its impact on students
 

SAMUEL FULL MSC PROJECT

  • 1. Filtering Offensive Language in Online Communities using Grammatical Relations BY SAMUEL AYOKUNLE ADEKANMBI MATRIC NO: 133466 Project submitted in partial fulfillment award of Master of Science degree (Computer science) Department of computer science, University of Ibadan. February, 2014.
  • 2. Certification I certify that this research work was carried out by Samuel Ayokunle ADEKANMBI (133466) under my supervision. . ____________________ _______________________ Date Dr B O Longe
  • 3. DEDICATION This entire work dedicated to everyone that believes in the PromoUpdate dream.
  • 4. ACKNOWLEDGEMENT My profound gratitude goes to my parent and my siblings for their moral and financial support which has immensely led to the success of this project. To my Dad, You are the best; I love you so much even though I don’t show it. I am indeed grateful to my supervisor, Dr. Olumide B. Longe for his moral support, patience and understanding during the course of this project. Thank you very much Sir. I also want to appreciate my very good and crazy friends: Tini, Phina, Kunchasho, TY, Alamu, Oluwashola Amiola Philip, Emmanuel, Muideen, Lola Mojekodunmi, Jane, Gbenro, N.O Jimoh, Tifa; You guys are my brothers from another mother. I can’t underestimate the effort of all my lecturers in the department; I pray the blessing of the lord shall not depart from your homes. My Msc. Programme will have being incomplete without some set of wonderful people: Tini, Phina, Helen, Rotimi, Modupe, Tolu, Big Fish, Last Don, Giel, and the whole crew at chief Madu’s Palace. Thanks for being there for me. To all my classmates, Dimple, Becky, Elohor, Ben, Fake AYs, John, Uzomma, Deola, Banky, Shukurat, Toyosi, Shola, Adesi, GP, Toyosi, Tosinsss, etc; you have been a blessing to me and the success of my programme. I say a big thanks to you for your support throughout the programme. I appreciate your love. Thanks for believing in the PromoUpdate dream. You guys are the best. Finally, to anyone that has contributed to the success of this project and my success in life, whose name is not mentioned here, please just know that you are not unknown to me to me and you are appreciated more than you know. God bless you all. See you at the top.
  • 5. TABLE OF CONTENT page Title page i Certification ii Dedication iii Acknowledgement iv Table of content v Abstract viii CHAPTER ONE: INTRODUCTION 1.1 Background of study 1 1.2 Problem Statement 4 1.3 Aims and Objectives 4 1.4 Research Methodology 5 1.5 Scope and Limitation 5 1.6 Organization of the study 6 1.7 Expected Contribution to Knowledge 6 1.7.1 Glossary of terms. 7 CHAPTER TWO: REVIEW OF THE LITERATURE 2.1 Offensive Language in Online Communities 8 2.2 Rate of Cyberbullying among youth 9 2.3 Tradition-Bullying and Cyber-Bullying 10 2.4 Type of Bullying Online 12 2.5 Challenges in the fight to stop cyberbullying 12 2.6 Preventing Cyberbullying 13 2.7 Responding to Cyberbullying 14 2.8 Grammatical Relations 16
  • 6. 2.9 Using text mining techniques to detect online offensive content 17 2.10 Heads and Dependents 20 2.11 Statistical Parsing 21 2.12 Dependency Parsing 27 2.9 Using text mining techniques to detect online offensive content 17 2.9 Using text mining techniques to detect online offensive content 17 2.9 Using text mining techniques to detect online offensive content 17 CHAPTER THREE: SYSTEM ANALYSIS AND DESIGN 3.1 Systems Analysis 36 3.2 Analysis of the existing system 37 3.3 Problem of the existing approaches 40 3.4 Proposed Filtering Philosophy 41 3.5 Identify Removable Content by Grammatical Relations 44 CHAPTER FOUR: IMPLEMENTATION 4.1 Justification of Programming Language Used 56 4.2 System Specification 58 4.3 System Implementation 59 CHAPTER FIVE: SUMMARY, CONCLUSION AND FUTURE WORKS 5.1 Summary. 65
  • 7. 5.2 Conclusion 65 5.3 Future Works 66 References 67
  • 8. ABSTRACT Offensive language has risen to be a big issue to the health of both online communities and their users. To the online community, the spread of offensive language undermines its reputation, drives users away, and even directly affects its growth. To users, viewing offensive language brings negative influence to their mental health, especially for children and youth. A semantic filtering model is been proposed and implemented using grammatical analysis and part of speech tagging. Statistical/probabilistic analysis of recurring offensive tokens is been done using Bayesian method. The designed semantic filtering system was tested as an online web application with a client application by engaging users to validate the efficiency of the designed system. When offensive language is detected in a user message, a problem arises about how the offensive language should be removed, i.e. the offensive language filtering problem. Our semantic filtering technique is based on the grammatical relations of words in a sentence so that the rest of the filtered sentence is readable and the existence of offensive words in the original sentence is hard to notice. We tested the effectiveness of our approach with a large dataset and the results show that our techniques are very effective and accurate with little process overhead. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other light-weighted NLP techniques to speed up sentence parsing. Also, we also plan to extend our filtering approach to support other languages such as Chinese and French in future works.
  • 9. CHAPTER ONE INTRODUCTION Online social networking (OSN) websites have enjoyed a great success in recent years and have become the new frontier in today’s social relationships providing great places for self-expression and exchange of ideas. Social networking has provided opportunities for new relationships as well as strengthening existing relationships. Benefits of social networking platforms vary based on platform type, features and the company itself. OSN allows organizations to improve communication and productivity by disseminating information among different groups of employees in a more efficient manner, resulting in increased productivity. In the past, social networks were viewed as a distraction and offered no educational benefit. Blocking these social networks was a form of protection for students against wasting time, bullying, and invasions of privacy. In an educational setting, OSNs are seen by many instructors and educators as a frivolous, time-wasting distraction from schoolwork, and it is not uncommon to be banned in school computer labs. Cyberbullying has also become an issue of concern with social networks. According to the Children Go Online survey of 9-24 year olds, it was found that a third have received bullying comments online.( http://internetsafety101.org) To avoid this problem, many school districts/boards have blocked access to online social networks within the school environment. I Social networking services often include a lot of personal information posted publicly, and many believe that sharing personal information is a window into privacy theft. Schools have taken action to protect students from this. It is believed that this outpouring of identifiable information and the
  • 10. easy communication vehicle that social networking services provide open the door to sexual predators, cyberbullying, and cyber-stalking (http://en.wikipedia.org/wiki/Social_networking_service). In contrast, however, 70% of social media-using teens and 85% of adults believe that people are mostly kind to one another on social network sites (http://en.wikipedia.org/wiki/Social_networking_service). Research has suggested that there has been a shift away from blocking the use of social networking services. In many cases, the opposite is occurring as the potential of online networking services is being realized. It has been suggested that if schools block them [Online Social Networks], they're preventing students from learning the skills they need. Banning social networking is not only inappropriate but also borderline irresponsible when it comes to providing the best educational experiences for students. Schools have the option of teaching safe media usage as well as incorporating digital media into the classroom experience, thus preparing students for the literacy they will encounter in the future. Cyberbullying is a fast-growing trend that experts believe is more harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via the internet or our mobile phones, so victims can be reached anytime and anywhere. For many children, home is no longer a refuge from bullies. "Children can escape threats and abuse in the classroom, only to find text messages and emails from the same tormentors when they arrive home." "There's no safe place anymore and one can be bullied 24/7; even in the privacy of his/her own bedroom." (Cyberbullying, Able Publishing Newsletter - Term 3, 2008). Online social networking sites have become increasingly popular with children, especially young teens, as a place where they can meet other people, communicate, and exchange information. No type of bullying is harmless. In some cases, it can constitute criminal behaviour. In extreme incidents, cyberbullying has led teenagers to suicide. Most victims, however, suffer shame,
  • 11. embarrassment, anger, depression and withdrawal (Cyberbullying, Able Publishing Newsletter - Term 3, 2008). Cyberbullying is often seen as anonymous, and the nature of the internet allows it to spread quickly to hundreds and thousands of people. Cyberbullying has the same insidious effects as any kind of bullying, turning children away from school, friendships, and in tragic instances, life itself. Parents often tell their children to turn off their mobile phones or stay off the computer. Many parents don't understand that the internet and the mobile phone act as a social lifeline connecting teenagers to their peer group. Victims often don't tell their parents because they think their parents will only make the problem worse, or that they might even confiscate their mobile phone or take away their internet access, removing that social lifeline. While bullying is something that is often 'under the radar' of adults, cyberbullying is even more so. Teenagers are increasingly communicating in ways that are often unknown to adults and away from their supervision. They organize their social lives through these mediums; their friendships are made and broken over them. So the question remains: "How can we avoid offensive language in OSNs?" This research work aims at removing offensive language from user messages. When offensive language is detected in a user message, a problem arises about how the offensive language should be removed, i.e. the offensive language filtering problem. To solve this problem, the manual filtering approach is known to produce the best filtering result. However, manual filtering is costly in time and labor and thus cannot be widely applied (http://en.wikipedia.org/wiki/Anti-spam_techniques). Here, we analyze the offensive language in text messages posted in online communities, and propose a new automatic sentence-level filtering approach that is able to semantically remove the offensive language by utilizing the grammatical relations among words. Compared with existing automatic
  • 12. filtering approaches, the proposed approach provides filtering results much closer to those of manual filtering. 1.1 Problem Statement The online community has encouraged the use of offensive language, which has spread into about 80% of all OSNs and has been very harmful to the mental health of both children and youth (Zhi Xu and Sencun Zhu, 2010). To the online community, the deluge of offensive language undermines the community's reputation, drives users away, and even directly affects its growth. People have realized the problems brought by offensive language in online communities and many efforts have been made to detect the existence of offensive language within user messages. However, detection alone is not enough to eliminate the hazard caused by offensive language. When offensive content is detected within a user message, a question arises naturally about how the detected offensive content should be removed from the message before it is transmitted. Also, how do we remove or filter offensive language and words from a message thoroughly while still keeping inoffensive content untouched as much as possible? And can the readability of the filtered content be guaranteed so as to make our filtering transparent to readers? 1.2 Aims and Objectives This project work intends to develop and implement a sentence-level semantic filtering system, which will:
  • 13. 1. Utilize grammatical relations among words to stop cyberbullying by semantically removing offensive content from a sentence. 2. Produce minimal error when filtering offensive language and words from a message, while keeping inoffensive content untouched as much as possible. 3. Guarantee the readability of filtered content so as to make the filtering transparent to readers. 4. Implement the designed model as a sophisticated NLP application rather than an AI application, since learning is not going to be involved. 5. Help reduce the chances of victimization on online social networking sites. 1.3 Research Methodology The methodology adopted in carrying out this project includes the use of interviews to gather primary data from a number of leading filtering vendors in Nigeria. Both telephone and face-to-face interviews will be carried out with the relevant technology experts within selected organizations. Also, an existing database of offensive words and languages will be collected and used to simulate an offensive-language database engine. A semantic filtering model will be proposed and implemented using XYZ. Statistical/probabilistic analysis of recurring offensive tokens will be done using a Bayesian method. The designed semantic filtering system will be tested as an online web application with a client application by engaging users to validate the efficiency of the designed system.
  • 14. 1.4 Organization of the study The thesis is arranged in five chapters with the breakdown as follows: Chapter One is the introduction; it covers online social networking systems, the research aim and objectives, the research methodology, and the organization of the dissertation. Chapter Two deals with the literature review on grammatical relations, cyberbullying, and the concept of a semantic filtering system. Chapter Three presents the methodology, the analysis of the input and output specifications of the proposed system, and the design of the system. Chapter Four describes the system implementation and the evaluation of the system design. It consists of a brief description of each program module and its functions, justifies the choice of package, describes the software required to implement the system, and shows the measures taken during the implementation. Chapter Five summarizes the project work; it covers the conclusion and recommendations for the project.
  • 15. CHAPTER TWO LITERATURE REVIEW 2.1 Offensive Language in Online Communities People, most especially kids, have been bullying each other for generations. The latest generation, however, has been able to utilize technology to expand the reach and extent of the harm (http://cyberbullying.us). This phenomenon is being called cyberbullying, defined as: "willful and repeated harm inflicted through the use of computers, cell phones, and other electronic devices." Basically, we are referring to incidents where adolescents use technology, usually computers or cell phones, to harass, threaten, humiliate, or otherwise hassle their peers. For example, youth can send hurtful text messages to others or spread rumors using cell phones or computers. Teens have also created web pages, videos, and profiles on social networking sites to make fun of others. With cell phones, adolescents have taken pictures in a bedroom, a bathroom, or another location where privacy is expected, and posted or distributed them online. More recently, some have recorded unauthorized videos of other kids and uploaded them for the world to see, rate, tag, and discuss (http://cyberbullying.us). There are many detrimental outcomes associated with cyberbullying and the use of offensive language that reach into the real world. First, many targets of cyberbullying report feeling depressed, sad, angry, and frustrated. As one teenager stated: "It makes me hurt both physically and mentally. It scares me and takes away all my confidence. It makes me feel sick and worthless." Victims who experience cyberbullying also reveal that they were afraid or embarrassed to go to school or even to come out and talk in public (http://cyberbullying.us). In addition, there is a link between cyberbullying and low self-esteem, family problems, academic
  • 16. problems, school violence, and delinquent behavior. Finally, cyberbullied youth also report having suicidal thoughts, and there have been a number of examples around the world where youth who were victimized ended up taking their own lives (http://cyberbullying.us). Cyberbullying occurs across a variety of venues and mediums in cyberspace, and it shouldn't come as a surprise that it occurs most often where teenagers congregate. Initially, many kids hung out in chat rooms, and as a result that is where most harassment took place. In recent years, most youth have been drawn to social networking websites (such as Facebook, Twitter, and LinkedIn) and video-sharing websites (such as YouTube). This trend has led to increased reports of cyberbullying occurring in those environments (Burgess-Proctor, Patchin, & Hinduja, 2009; Hinduja & Patchin, 2008b; R. M. Kowalski & Limber, 2007; Lenhart, 2007; Li, 2007a; Patchin & Hinduja, 2006). Instant messaging on the Internet and text messaging via cell phone also appear to be common ways in which youth harass one another. 2.2 Rate of Cyberbullying among Youth Estimates of the number of youth who experience cyberbullying vary widely (ranging from 10-40% or more), depending on the age of the group studied and how cyberbullying is formally defined. In this research, we informed secondary school students (of International School, Ibadan; Abadina College, U.I.; and Igbobi College, Yaba, Lagos) that cyberbullying is when someone "repeatedly picks on another person by making use of offensive language through OSN when chatting, or when someone posts something offensive online about another person that they don't like." Using this definition, about 62% of the over 800 randomly selected 11-18 year-old students indicated they had been a victim at some point in their life. About this same number
  • 17. admitted to cyberbullying others during their lifetime. Finally, about 40% of youths in this recent study said they had both been a victim and an offender. Fig 2.1 2.3 Traditional-Bullying and Cyber-Bullying While often similar in terms of form and technique, bullying and cyberbullying have many differences that can make the latter even more devastating. First, victims often do not know who the bully is, or why they are being targeted. The cyberbully can cloak his or her identity behind a computer using anonymous email addresses or pseudonymous screen names. Second, the hurtful actions of a cyberbully are viral; that is, a large number of people (at school, in the neighborhood, in the city, in the world!) can be involved in a cyber-attack on a victim, or at least find out about the incident with a few keystrokes or clicks of the mouse. The perception, then, is that absolutely everyone knows about it.
  • 18. Third, it is often easier to be cruel using technology because cyberbullying can be done from a physically distant location, and the bully doesn't have to see the immediate response of the target. In fact, some teens simply might not recognize the serious harm they are causing because they are sheltered from the victim's response. Finally, while parents and teachers are doing a better job supervising youth at school and at home, many adults don't have the technological know-how to keep track of what teens are up to online. As a result, a victim's experience may be missed and a bully's actions may be left unchecked. Even if bullies are identified, many adults find themselves unprepared to adequately respond. All this and more makes cyberbullying a growing problem, because increasing numbers of kids are using, and have completely embraced, interaction via computers and cell phones. Two-thirds of youth go online every day for school work, to keep in touch with their friends, to play games, to learn about celebrities, to share their digital creations, or for many other reasons. Because these online communication tools have become an important part of their lives, it is not surprising that some youths have decided to use the technology to be malicious or menacing towards others. The fact that teens are connected to technology 24/7 means they are susceptible to victimization (and able to act on mean intentions toward others) around the clock. Apart from offering a measure of anonymity, it is also easier to be hateful using typed words rather than spoken words face-to-face, and because some adults have been slow to respond to cyberbullying, many cyberbullies feel that there are little to no consequences for their actions. Cyberbullying crosses all geographical boundaries. The Internet has really opened up the whole world to users who access it on a broad array of devices, and for the most part, this has been a good thing. Nevertheless, some kids feel free to post or send whatever they want while online
  • 19. without considering how that content can inflict pain – and sometimes cause severe psychological and emotional wounds. 2.4 Types of Bullying Online According to the Internet Safety 101 curriculum, there are many types of cyberbullying, which include: • Gossip: Posting or sending cruel gossip to damage a person's reputation and relationships with friends, family, and acquaintances. • Exclusion: Deliberately excluding someone from an online group. • Impersonation: Breaking into someone's e-mail or other online account and sending messages that will cause embarrassment or damage to the person's reputation and affect his or her relationship with others. • Harassment: Repeatedly posting or sending offensive, rude, and insulting messages. • Cyber-stalking: Posting or sending unwanted or intimidating messages, which may include threats. • Flaming: Online fights where scornful and offensive messages are posted on websites, forums, or blogs. • Outing and Trickery: Tricking someone into revealing secrets or embarrassing information, which is then shared online. • Cyber-threats: Remarks on the Internet threatening or implying violent behavior, or displaying suicidal tendencies.
  • 20. 2.5 Challenges in the fight to stop cyberbullying There are two major challenges that make it difficult to prevent cyberbullying. First, many people don't see the harm associated with it. Some attempt to dismiss or disregard cyberbullying because there are "more serious forms of aggression to worry about." While it is true that there are many issues facing adolescents, parents, teachers, and law enforcement today, we first need to accept that cyberbullying is one such problem that will only get more serious if ignored. The other challenge relates to who is willing to step up and take responsibility for responding to inappropriate use of technology. Parents often say that they don't have the technical skills to keep up with their kids' online behavior; teachers are afraid to intervene in behaviors that often occur away from school; and law enforcement is hesitant to get involved unless there is clear evidence of a crime or a significant threat to someone's physical safety. As a result, cyberbullying incidents often slip through the cracks. Indeed, the behavior often continues and escalates because incidents are not quickly addressed. Based on these challenges, there is a need to collectively create an environment where kids feel comfortable talking with adults about this problem and feel confident that meaningful steps will be taken to resolve the situation. We also need to get everyone involved - youth, parents, educators, counselors, law enforcement, social media companies, and the community at large. It will take a concerted and comprehensive effort from all stakeholders to really make a difference in reducing cyberbullying.
  • 21. 2.6 Preventing Cyberbullying The most important preventive step that schools can take is to educate the school community about responsible internet use. Students need to know that all forms of bullying are wrong and that those who engage in harassing or threatening behaviors will be subject to discipline. It is therefore important to discuss issues related to the appropriate use of online communications technology in various areas of the general curriculum. To be sure, these messages should be reinforced in classes that regularly utilize technology. Signage should also be posted in the computer lab or at each computer workstation to remind students of the rules of acceptable use. In general, it is crucial to establish and maintain a school climate of respect and integrity where violations result in informal or formal sanctions. Furthermore, school district personnel should review their harassment and bullying policies to see if they allow for the discipline of students who engage in cyberbullying. If their policy covers it, it is well within a school's legal authority to intervene in cyberbullying incidents that occur at school - or that originate off campus but ultimately result in a substantial disruption of the learning environment. The school then needs to make it clear to students, parents, and all staff that these behaviors are unacceptable and will be subject to discipline. In some cases, simply discussing the incident with the offender's parents will result in the behavior stopping. 2.7 Responding to Cyberbullying Students should already know that cyberbullying is unacceptable and that the behavior will result in discipline. Utilize school liaison officers or other members of law enforcement to thoroughly investigate incidents, as needed, if the behaviors cross a certain threshold of severity. Once the
  • 22. offending party has been identified, develop a response that is commensurate with the harm done and the disruption that occurred. School administrators should also work with parents to convey to the student that cyberbullying behaviors are taken seriously and are not trivialized. Moreover, schools should come up with creative response strategies, particularly for relatively minor forms of harassment that do not result in significant harm. For example, students may be required to create anti-cyberbullying posters to be displayed throughout the school. Older students might be required to give a brief presentation to younger students about the importance of using technology in ethically-sound ways. The point here, again, is to condemn the behavior while sending a message to the rest of the school community that bullying in any form is wrong and will not be tolerated. Even though the vast majority of these incidents can be handled informally (calling parents, counseling the bully and target, expressing condemnation of the behavior), there may be occasions where formal response from the school is warranted. This is particularly the case in incidents involving serious threats toward another student, if the target no longer feels comfortable coming to school, or if cyberbullying behaviors continue after informal attempts to stop it have failed. In these cases, detention, suspension, changes of placement, or even expulsion may be necessary. If these extreme measures are required, it is important that educators are able to clearly demonstrate the link to school and present evidence that supports their action. Also, youth should develop a relationship with an adult they trust (a parent, teacher, or someone else) so they can talk about any experiences they have online (or off) that make them upset or uncomfortable. If possible, teens should ignore minor teasing or name calling, and not respond to the bully as that might simply make the problem continue. It’s also useful to keep all evidence of
  • 23. cyberbullying to show an adult who can help with the situation. If targets of cyberbullying are able to keep a log or a journal of the dates and times and instances of the online harassment, that can also help prove what was going on and who started it. Overall, youth should go online with their parents – show them what web sites they use, and why. At the same time, they need to be responsible when interacting with others on the Internet. For instance, they shouldn’t say anything to anyone online that they wouldn’t say to them in person with their parents in the room. Finally, youth ought to take advantage of the privacy settings within Facebook and other websites, and the social software (instant messaging, email, and chat programs) that they use – they are there to help reduce the chances of victimization. Users can adjust the settings to restrict and monitor who can contact them and who can read their online content. Law enforcement officers also have a role in preventing and responding to cyberbullying. To begin, they need to be aware of ever-evolving state and local laws concerning online behaviors, and equip themselves with the skills and knowledge to intervene as necessary. In a recent survey of school resource officers, we found that almost one-quarter did not know if their state had a cyberbullying law. This is surprising since their most visible responsibility involves responding to actions which are in violation of law (e.g., harassment, threats, stalking). Even if the behavior doesn’t immediately appear to rise to the level of a crime, officers should use their discretion to handle the situation in a way that is appropriate for the circumstances. For example, a simple discussion of the legal issues involved in cyberbullying may be enough to deter some youth from future misbehavior. Officers might also talk to parents about their child’s conduct and express to them the seriousness of online harassment.
  • 24. Relatedly, officers can play an essential role in preventing cyberbullying from occurring or getting out of hand in the first place. They can speak to students in classrooms about cyberbullying and online safety issues more broadly in an attempt to discourage them from engaging in risky or unacceptable actions and interactions. They might also speak to parents about local and state laws, so that they are informed and can properly respond if their child is involved in an incident. 2.8 Grammatical Relations Grammatical relations refer to functional relationships between constituents in a clause. The standard examples of grammatical functions from traditional grammar are subject, direct object, and indirect object. Beyond these concepts from traditional grammar, more modern theories of grammar are likely to acknowledge many further types of grammatical relations (e.g. complement, specifier, predicative, etc.). The role of grammatical relations in theories of grammar is greatest in many dependency grammars, which tend to posit dozens of distinct grammatical relations. Every head-dependent dependency bears a grammatical function. Grammatical relations are exemplified in traditional grammar by the notions of subject, direct object, and indirect object; for example: Adekanmbi gave Samuel the book. The subject Adekanmbi performs or is the source of the action. The direct object the book is acted upon by the subject, and the indirect object Samuel receives the direct object or otherwise benefits from the action. Traditional grammars often begin with these rather vague notions of the grammatical functions. When one begins to examine the distinctions more closely, it quickly
  • 25. becomes clear that these basic definitions do not provide much more than a loose orientation point. What is indisputable about the grammatical relations is that they are relational. That is, subject and object can exist as such only by virtue of the context in which they appear. A noun such as Adekanmbi or a noun phrase such as the book cannot qualify as subject and direct object, respectively, unless they appear in an environment, e.g. a clause, where they are related to each other and/or to an action or state. In this regard, the main verb in a clause is responsible for assigning grammatical relations to the clause "participants". 2.9 Using Text Mining Techniques to Detect Online Offensive Contents Offensive language identification in social media is a difficult task because the textual content in such environments is often unstructured, informal, and even misspelled. Since the defensive methods adopted by current social media are not sufficient, researchers have studied intelligent ways to identify offensive content using text mining approaches. Implementing text mining techniques to analyze online data requires the following phases: 1) data acquisition and preprocessing, 2) feature extraction, and 3) classification. The major challenges of using text mining to detect offensive content lie in the feature selection phase, which is elaborated in the following sections. a) Message-level Feature Extraction Most offensive content detection research extracts two kinds of features: lexical and syntactic features.
  • 26. Lexical features treat each word and phrase as an entity. Word patterns such as the appearance of certain keywords and their frequencies are often used to represent the language model. Early research used Bag-of-Words (BoW) in offensiveness detection. The BoW approach treats a text as an unordered collection of words and disregards syntactic and semantic information. However, using the BoW approach alone not only yields low accuracy in detecting subtle offensive language, but also brings in a high false positive rate, especially during heated arguments, defensive reactions to others' offensive posts, and even conversations between close friends. The N-gram approach is considered an improvement in that it brings words' nearby context into consideration when detecting offensive content. N-grams represent subsequences of N contiguous words in texts. Bi-grams and tri-grams are the most popular N-grams used in text mining. However, N-grams have difficulty capturing related words separated by long distances in texts. Simply increasing N can alleviate the problem but will slow down system processing and bring in more false positives. Syntactic features: Although lexical features perform well in detecting offensive entities, without considering the syntactic structure of the whole sentence they fail to distinguish the offensiveness of sentences that contain the same words in different orders. Therefore, to take syntactic features into account, natural language parsers are introduced to parse sentences into grammatical structures before feature selection. Equipping the system with a parser can help avoid selecting unrelated word sets as features in offensiveness detection. b) User-level Offensiveness Detection Most contemporary research on detecting online offensive language focuses only on sentence-level and message-level constructs. Since no detection technique is 100% accurate, if users keep
  • 27. connecting with the sources of offensive content (e.g., online users or websites), they are at high risk of continuous exposure to offensive content. However, user-level detection is a more challenging task and studies associated with the user level of analysis are largely missing. There are some limited efforts at the user level. For example, Kontostathis et al. propose a rule-based communication model to track and categorize online predators. Pendar uses lexical features with machine learning classifiers to differentiate victims from predators in online chatting environments. Pazienza and Tudorache propose utilizing user profiling features to detect aggressive discussions. They use users' online behavior histories (e.g., presence and conversations) to predict whether or not users' future posts will be offensive. Although their work points out an interesting direction for incorporating user information in detecting offensive content, more advanced user information such as users' writing styles, posting trends, or reputations has not been included to improve the detection rate. Fig 2.2
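To make the lexical features described in this section concrete, the short sketch below extracts Bag-of-Words and bigram counts from a message (plain Python; the message text and the simple tokenizer are made up for the illustration):

from collections import Counter
import re

def tokenize(text):
    # lowercase and keep simple word tokens; real social-media text needs far
    # more robust normalisation (misspellings, emoticons, repeated letters, ...)
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(tokens):
    return Counter(tokens)                      # unordered word counts

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))     # ordered pairs of adjacent words

message = "you are such an idiot, such an idiot"   # made-up example message
tokens = tokenize(message)
print(bag_of_words(tokens))    # Counter({'such': 2, 'an': 2, 'idiot': 2, 'you': 1, 'are': 1})
print(bigrams(tokens))         # bigram counts keep local word order, which BoW discards

The BoW counter discards word order entirely, while the bigram counter keeps local order; this is exactly the trade-off between the two lexical representations discussed above.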
  • 28. 2.10 Heads and dependents The importance of the syntactic functions reaches its greatest extent in dependency grammar (DG) theories of syntax. Every head-dependent dependency bears a syntactic function. The result is that an inventory consisting of dozens of distinct syntactic functions is needed for each language. For example, a determiner-noun dependency might be assumed to bear the DET (determiner) function, and an adjective-noun dependency is assumed to bear the ATTR (attribute) function. These functions are often produced as labels on the dependencies themselves in the syntactic tree, e.g. Fig 2.3 The tree contains the following syntactic functions: ATTR (attribute), CCOMP (clause complement), DET (determiner), MOD (modifier), OBJ (object), SUBJ (subject), and VCOMP
  • 29. (verb complement). The actual inventories of syntactic functions will differ from the one suggested here in the number and types of functions that are assumed. In this regard, this tree is merely intended to be illustrative of the importance that the syntactic functions can take on in some theories of syntax and grammar. 2.11 Statistical parsing CFGs can be used to parse, but some ambiguous sentences cannot be disambiguated, and we would like to know the most likely parse. A corpus can be used to do that. 2.11.1 Basic idea 1. Start with a treebank (a bank of trees, e.g. the Penn Treebank), which is a collection of sentences with syntactic annotation, i.e., already-parsed sentences. 2. Examine which parse trees occur frequently. 3. Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule based on its frequency. That is, we'll have a CFG augmented with probabilities (a PCFG). 2.11.2 Probabilistic Context-Free Grammars (PCFGs) Definition of a PCFG: - Set of non-terminals (N) - Set of terminals (T)
  • 30. - Set of rules/productions (P), of the form A → β - Designated start symbol (S) - A function D that assigns a probability to each rule in P: D = P(A → β) 2.11.3 Estimating Probabilities using a Treebank - Given a corpus of sentences with syntactic annotation (e.g., the Penn Treebank) - Consider all parse trees - (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule - (2) Also count the number of times A is on the left-hand side of a rule - Divide (1) by (2): D = P(A → β | A) = Count(A → β) / Count(A) 2.11.4 Using Probabilities to Parse • P(T) = probability of a particular parse tree = the product of the probabilities of all the rules r used to expand each node n in the parse tree
  • 31. Fig 2.4 We have the following rules and probabilities - S → VP .05 - VP → V NP .40 - NP → Det N .20 - V → book .30 - Det → that .05 - N → flight .25 P ( T ) = P ( S → VP ) * P ( VP→ V NP ) *… * P ( N → flight ) = .05 * .40 * .20 * .30 * .05 * .25 = .000015 So, the probability for that parse is 0.000015. Probabilities are useful for comparing with other probabilities. Whereas we couldn’t decide between two parses using a regular CFG, we now can.
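The worked example above can be reproduced in a few lines: the sketch below simply stores the six rule probabilities and multiplies them, which is all that P(T) requires (illustration only, not a parser):

# rule probabilities taken from the example above
rule_prob = {
    "S -> VP": 0.05,
    "VP -> V NP": 0.40,
    "NP -> Det N": 0.20,
    "V -> book": 0.30,
    "Det -> that": 0.05,
    "N -> flight": 0.25,
}

def parse_probability(rules_used):
    p = 1.0
    for rule in rules_used:
        p *= rule_prob[rule]        # P(T) = product of the probabilities of the rules used
    return p

tree_rules = ["S -> VP", "VP -> V NP", "NP -> Det N",
              "V -> book", "Det -> that", "N -> flight"]
print(parse_probability(tree_rules))    # ~1.5e-05, i.e. 0.000015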
  • 32. 2.11.5 Obtaining the best parse The best parse T(S), where S is our sentence, is the tree with the highest probability. We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse - CYK is a form of dynamic programming - CYK is a chart parser, like the Earley parser 2.11.6 Problems with PCFGs It's still only a CFG, so dependencies on non-CFG information are not captured. - e.g., pronouns are more likely to be subjects than objects: P[(NP → Pronoun) | NP = subject] >> P[(NP → Pronoun) | NP = object] A PCFG also ignores lexical dependency information (statistics), which is usually crucial for disambiguating "PP attachment ambiguity" and "coordination ambiguity". - (T1) America sent [ [250,000 soldiers] [into Iraq] ] - (T2) America sent [250,000 soldiers] [into Iraq] "Sent" with an "into"-PP is almost always attached high (T2), so lexical statistics should give T2 the higher probability. An example of coordination ambiguity is the two parses of the phrase "dogs in houses and cats" - (T1) [ [NP dogs] in [ NP houses and cats ] ] - (T2) [ [NP dogs in houses] and [NP cats ] ] Here T1 is semantically wrong and T2 is correct, but both trees get the same score. So a plain PCFG is not enough to disambiguate parse trees; lexical dependency information is also needed. To handle lexical information, we'll turn to lexicalized PCFGs.
  • 33. 2.11.7 Lexicalized PCFGs • Lexicalized Parse Trees - Add "headwords" to each phrasal node. Each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter - The headword for a node is set to the headword of its head daughter - Headship is not annotated in (most) treebanks - Usually, head rules are used, e.g.: - NP: • Take leftmost NP • Take rightmost N* • Take rightmost JJ • Take right child - VP: • Take leftmost VB* • Take leftmost VP • Take left child
  • 34. Fig 2.5 2.11.8 Incorporating head probabilities • Previously, we conditioned on the mother node (A): - P(A → β | A) • Now, we can condition on the mother node and the headword of A (h(A)): - P(A → β | A, h(A)) We're no longer conditioning on simply the mother category A, but on the mother category when h(A) is the head. - e.g., P(VP → VBD NP PP | VP, dumped)
  • 35. 2.11.9 Calculating rule probabilities • We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total: P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β) 2.11.10 Adding info about word-word dependencies • We want to take into account one other factor: the probability of being a head word (in a given context) - P(h(n) = word | …) • We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n))) - P(h(n) = word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n))) - P(sacks | NP, dumped) • What we're really doing is factoring in how words relate to each other • We will call this a dependency relation later: sacks is dependent on dumped, in this case
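As a rough sketch of the count-based estimate above, the snippet below uses made-up counts C(VP(dumped) → β), as they might be gathered from a treebank, to compute P(VP → VBD NP PP | VP, dumped):

from collections import Counter

# made-up counts of how VP expands when its headword is "dumped"
vp_dumped_counts = Counter({
    "VP -> VBD NP PP": 6,
    "VP -> VBD NP": 3,
    "VP -> VBD PP": 1,
})

def lexicalised_rule_prob(rule, counts):
    # P(rule | VP, dumped) = C(VP(dumped) -> rhs) / sum over beta of C(VP(dumped) -> beta)
    return counts[rule] / sum(counts.values())

print(lexicalised_rule_prob("VP -> VBD NP PP", vp_dumped_counts))   # 6 / 10 = 0.6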
  • 36. Fig 2.6: Lexicalized parsing can be seen as producing dependency trees 2.12 Dependency Parsing Modern dependency grammar was created by the French linguist Lucien Tesnière (1959), although its roots may be traced back to Panini's grammar of Sanskrit (a predecessor of Bangla) many centuries earlier. In NLP, a dependency parse tree is thought of as a 'bridge' between syntactic and semantic analysis, since it gives some semantic information as well as syntactic information. Some people also argue that it is another version of chunk parsing, because careful observation of a dependency tree reveals that every subpart of a sentence (subject, object, or complement) appears in a different subtree or under a different relation, where each node is dependent on another node. These subtrees, or semantically dependent nodes, can be thought of as separate chunks.
  • 37. 2.12.1 Basic Concepts In a dependency representation every node in the structure is a surface word (there are no abstract nodes such as NP or VP), but each word may have additional attributes such as its part-of-speech (POS) tag. The parent word is known as the head, and its children are its modifiers. The observation which drives DG is: in a sentence, all but one word depend on other words. The one word that doesn't depend on any other is called the root of the sentence. A typical DG analysis of the sentence "A man sleeps" is demonstrated below: A depends on man Man depends on sleeps Sleeps depends on nothing (it is the root of the sentence) Or, put differently: A modifies man Man is the subject of sleeps Sleeps is the main verb of the sentence This is Dependency Grammar. A formulation of dependency grammar is given below: • Capturing relations between words is moving in the direction of dependency grammar (DG) • In DG, there is no such thing as constituency • The structure of a sentence is purely the binary relations between words; A → B means that B depends on A
  • 38. Dependencies are motivated by grammatical function, both syntactically and semantically. A word depends on another either if it is a complement or a modifier of the latter. The edge between a parent and a child node specifies the grammatical relationship between the two words (e.g. subj, obj, and adj). In most formulations of DG for example, functional heads or governors (e.g. verbs) subcategorize for their complements. Hence, a transitive verb like ‘like’ requires two complements (dependents), one noun with the grammatical function subject and one with the function object. In this research thesis, we are using Stanford-Parser version-jdk1.5 for all of the output. Ex sentence: John likes Italian food. Tagged output: John/NNP likes/VBZ Italian/NN food/NN Constituent structure output: (ROOT (S (NP (NNP John)) (VP (VBZ likes) (NP (NN Italian) (NN food))))) Dependency structure output: nsubj(likes-2, John-1) nn(food-4, italian-3) dobj(likes-2, food-4)
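As a side note, the same style of typed-dependency output can be obtained from other parsers; the sketch below uses spaCy rather than the Stanford parser that this thesis relies on (an illustration only: spaCy's label set differs slightly, e.g. amod/compound rather than nn, and the en_core_web_sm model must be installed separately):

import spacy

nlp = spacy.load("en_core_web_sm")           # small English model, installed separately
doc = nlp("John likes Italian food.")

for token in doc:
    if token.dep_ != "ROOT":
        # relation(head-index, dependent-index), mirroring the output shown above
        print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, {token.text}-{token.i + 1})")

# Expected output along these lines (labels differ slightly from the Stanford scheme):
#   nsubj(likes-2, John-1)
#   amod(food-4, Italian-3)
#   dobj(likes-2, food-4)
#   punct(likes-2, .-5)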
  • 39. 2.12.2 Dependency functions 2.12.2.1 Main functions main main element The main element of a clause is usually a verb, but in a verb-less clause other elements may serve as a head as well. Ex: a sentence with a verb He doesn't know whether to send a gift. nsubj(know-4, He-1) aux(know-4, does-2) advmod(know-4, n't-3) aux(send-7, to-6) whether(know-4, send-7) det(gift-9, a-8) dobj(send-7, gift-9) Ex: a sentence without a verb A comprehensive grammar of the English language det(grammar-3, A-1) amod(grammar-3, comprehensive-2) det(language-7, the-5)
  • 40. amod(language-7, english-6) of(grammar-3, language-7) 2.12.2.2 Verb complementation nsubj nominal subject The dependency syntax collapses the classes of formal subject and ordinary subject into one. The subject may also be a non-finite clause, a that-clause, a WH-clause, etc. dobj direct object The notion of object is wider than that in Quirk, comprising essentially all types of second arguments, except subject complements. The motivation is that the subtypes of second arguments are complementary, i.e. they occupy the same valency slot. There are both simple nominal objects and more complex objects such as a non-finite clause, that-clause, WH-clause or quote structure. Ex: John explained that topic nsubj(explained-2, John-1) det(topic-4, that-3) dobj(explained-2, topic-4) ccomp clausal complement The subject complement is the second argument of a copular verb.
  • 41. Ex: Mary said John didn't go there nsubj(said-2, Mary-1) nsubj(go-6, John-3) aux(go-6, did-4) advmod(go-6, n't-5) ccomp(said-2, go-6) advmod(go-6, there-7) iobj indirect object Indirect object corresponds to a third argument. The prepositional dative is described accordingly. Again, the syntactic motivation is that the prepositional phrase occupies the same valency slot as the indirect object and is semantically equivalent to it. Ex: I gave him my address. nsubj(gave-2, I-1) iobj(gave-2, him-3) dep(address-5, my-4) dobj(gave-2, address-5) What did Pauline give Tom? Pauline gave it to Tom.
  • 42. 2.12.2.3 Determinative functions det determiner Central determiners (articles) or a determining pronoun. Successive determiners are linked to each other. Ex: This is an apple nsubj(is-2, This-1) det(apple-4, an-3) dobj(is-2, apple-4) 2.12.3 Robinson's axioms Robinson (1970) formulated four axioms to govern the well-formedness of dependency structures, depicted below: 1. One and only one element is independent. 2. All others depend directly on some element. 3. No element depends directly on more than one other. 4. If A depends directly on B and some element C intervenes between them (in the linear order of the string), then C depends directly on A or B or some other intervening element. The first three axioms ensure that dependency structures are trees. Axioms 1 and 2 state that in each sentence only one element is independent and all others depend on some other element. Axiom 3 states that if element A depends on B, it must not also depend on another element C. This
  • 43. requirement is referred to as single-headedness. Axiom 4 is called the requirement of projectivity and disallows crossing edges in dependency trees. 2.12.4 Dependency relation A mapping M maps a set of nodes W to the actual words of a sentence. Now, for w1, w2 ∈ W, <w1, w2> ∈ R asserts that w1 is dependent on w2. The properties of R impose the same treeness constraints on dependency graphs as Robinson's axioms. Ex: Mary loves another Mary ↑ ↑ ↑ ↑ w1 w2 w3 w4 here, M maps w1…w4 ∈ W to the words 1. R ⊂ W × W 2. ∀ w1, w2, …, wk ∈ W: <w1,w2> ∈ R, …, <wk-1,wk> ∈ R ⇒ w1 ≠ wk (acyclicity) 3. ∃! w1 ∈ W: ∀ w2 ∈ W: <w1,w2> ∉ R (rootedness) 4. ∀ w1, w2, w3 ∈ W: <w1,w2> ∈ R ∧ <w1,w3> ∈ R → w2 = w3 (single-headedness) 2.12.5 Stanford dependency parser by Dan Klein This parser uses features of Collins's parser. Michael Collins, in his head-driven statistical parser, showed a mapping from his statistical parser to dependency relation sets. Dan Klein's
  • 44. Stanford parser deals with tagged words: pairs <w, t>. First the head <wh, th> of a constituent is generated using ‘Collins head finder’ method, then successive right dependents <wd, td> until a ‘stop’ token is generated, then successive left dependents until ‘stop’ token is generated. It supports three formats for output: 1. dependencies 2. typedDependencies 3. typedDependenciesCollapsed For example: Factory payrolls fell in September. Tagged output: Factory/NN payrolls/NNS fell/VBD in/IN September/NNP Dependency structure: nn(payrolls-2, Factory-1) nsubj(fell-3, payrolls-2) in(fell-3, September-5) Fig 2.7
  • 45. First, fell-VBD is chosen as the head of the sentence, then, in-IN to the right is generated, which then generates September-NN to the right, which generates ‘stop’ token on both sides. Then return to in-IN, generate ‘stop’ to the right, and so on. The above output is the ‘typedDependenciesCollapsed’ format of Stanford dependency parse tree. This ‘typedDependenciesCollapsed’ doesn’t make separate nodes for the words, which are obvious in any dependency relation in a sentence; instead it makes it a relation between two prominent words. In the above example the preposition ‘in’ is used as a relation or dependency function between the words ‘fell’ and ‘September’. For example, only ‘typedDependencies’ format of the above sentence will be: nn(payrolls-2, Factory-1) nsubj(fell-3, payrolls-2) dep(fell-3, in-4) dep(in-4, September-5) Fig 2.8
  • 46. The example shows that the plain 'typedDependencies' format makes a separate node for 'in' between 'fell' and 'September', whereas in the collapsed format 'in' is used as a relation, which makes the tree shorter in depth. This thesis uses the 'typedDependenciesCollapsed' format as well, because we don't need to look at every word to extract the necessary information.
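Before moving on to the system design, the small sketch below ties the parser output above back to Robinson's treeness constraints from Section 2.12.3: it checks rootedness, acyclicity, and single-headedness on a hand-written head map of "Factory payrolls fell in September" (illustration only):

# head map for "Factory payrolls fell in September" (typedDependencies above);
# each word points to its head, and None marks the root
heads = {
    "Factory": "payrolls",      # nn(payrolls-2, Factory-1)
    "payrolls": "fell",         # nsubj(fell-3, payrolls-2)
    "in": "fell",               # dep(fell-3, in-4)
    "September": "in",          # dep(in-4, September-5)
    "fell": None,               # the root: it depends on nothing
}

def is_well_formed(heads):
    roots = [w for w, h in heads.items() if h is None]
    if len(roots) != 1:                     # rootedness: exactly one independent word
        return False
    for word in heads:                      # acyclicity: walking up the heads must terminate
        seen = set()
        w = word
        while w is not None:
            if w in seen:
                return False
            seen.add(w)
            w = heads[w]
    return True                             # single-headedness holds by construction here,
                                            # since each word appears only once as a key

print(is_well_formed(heads))                # True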
  • 47. CHAPTER THREE SYSTEM ANALYSIS AND DESIGN In the following sections of this chapter, existing sentence-level semantic filtering approaches and methodologies for online social networking communities will be thoroughly examined, and issues related to these approaches will be highlighted. The proposed sentence-level semantic filtering approach will also be examined, and its operating procedures, benefits, and feasibility will be described. The methodologies employed in acquiring the requirements for the successful implementation of the proposed filtering system will also be discussed, and the design of the filtering system will be presented together with its program components. 3.1 System Analysis System analysis can be defined as the process of analyzing a system with the essential goal of improving or modifying it. It can also be defined as the methodical study of a system, its current and future required objectives, and its procedures, in order to form a basis for the system design. It is the first of the three major phases in developing an information system. All system analysis efforts are directed towards deciding these three basic objectives: 1. Identify the system owner and system users. 2. Define what the system will do.
  • 48. 3. Determine the technical, economic, and operational feasibility of the proposed system. The purpose of the analysis is to produce a clear requirement specification of the newly designed or upgraded system efficiently and effectively. It requires the ability to analyze the essential features of a system, and this knowledge of a system is achieved through the investigation of the system and its environment. 3.2 Analysis of the existing system Online social networking sites have become increasingly popular with children, especially young teens, as a place where they can meet other people, communicate, and exchange information. However, this medium has encouraged the wide usage of offensive language and has also brought about a fast-growing trend, called cyberbullying, that experts believe is very harmful and that has led teenagers to suicide in very extreme cases. People have realized the problems brought by offensive language in online communities and many efforts have been made to detect and eliminate the existence of offensive language within user messages. The approaches used are discussed below.
  • 49. 3.2.1 Keyword Censoring Approach Keyword censoring approaches match words appearing in user messages against offensive words stored in a blacklist. Once found, these offensive words will be removed, partially replaced (e.g., "bitch" becomes "b***h"), completely replaced (e.g., "******"), or substituted by family-friendly words (e.g., "naughty"). Because of its simplicity, the keyword-based censoring approach has been widely applied in OSN websites, such as YouTube and World of Warcraft. However, the filtering result is not as desired; brutally removing words from a user's message breaks the readability of the message. Replacing offensive words with symbols usually makes it easy to guess the original offensive words. The idea of substitution seems tempting, but accurate substitution is usually impractical, and inaccurate substitution will introduce additional issues. For example, in 2001, Yahoo! deployed an email filter which would automatically alter certain words in emails to family-friendly words. This filter was criticized as a "foolish filter" by BBC News because of its inaccurate substitution. To demonstrate the shortcoming of keyword censoring approaches, we present an example below. Filtering results with Keyword Censoring Original comment: "What the fuck is wrong with you?" Keyword Censoring: "What the f**k is wrong with you?" According to the presented filtering results, readers can still easily understand what the offender wants to say and may even be able to infer the removed words. This indicates a failure of filtering, because the offensive opinion has been successfully delivered to victims. Also, removing words
  • 50. from a sentence without considering their context breaks the readability of the rest of the sentence. Compared with keyword censoring approaches, our proposed semantic filtering approach is much more sophisticated and can achieve thorough filtering by utilizing the grammatical relations among words in the sentence. Given a sentence containing both offensive and inoffensive words, not only the offensive words but also the inoffensive words assisting in expressing offensive opinions will be removed during our filtering. In this way, we essentially stop the delivery of the offensive opinion, and there will be no way to infer the offensive content in the original message after filtering. 3.2.2 Content Control Approach Content control approaches are usually deployed at the user side or ISP side to prevent users from seeing inappropriate content on the Internet. The filtering is usually done based on certain criteria, such as the URL address, the occurrence of offensive words, and topic classification. Here our focus is on text-based criteria. For example, consider a sentence-based content control approach whose threshold is set as the number of offensive words in the sentence. If at least one offensive word is detected within a sentence, the filter will remove the whole sentence from the user message. To demonstrate the shortcoming of content control approaches, we present an example below. Filtering results with Content Control Original comment: "What the fuck is wrong with you?" Content Control: " "
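For illustration, the two baselines described in Sections 3.2.1 and 3.2.2 can be sketched in a few lines (hypothetical two-word blacklist; this is not the proposed semantic filter):

import re

BLACKLIST = {"fuck", "bitch"}                # hypothetical blacklist for the sketch

def keyword_censor(sentence):
    # keyword censoring: keep the first and last letters, mask the middle with asterisks
    def mask(match):
        w = match.group(0)
        return w[0] + "*" * (len(w) - 2) + w[-1] if w.lower() in BLACKLIST else w
    return re.sub(r"[A-Za-z]+", mask, sentence)

def content_control(sentence):
    # content control: drop the whole sentence once a blacklisted word is found
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}
    return "" if words & BLACKLIST else sentence

comment = "What the fuck is wrong with you?"
print(keyword_censor(comment))      # What the f**k is wrong with you?
print(content_control(comment))     # (empty string: the whole sentence is removed)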
  • 51. However, content control approaches are too coarse-grained to be applied in online communities. First of all, an offender can easily bypass the filtering as long as he knows the estimation criteria. More importantly, a sentence in a user comment may contain both offensive and inoffensive content, and the inoffensive part may be removed falsely because of the offensive part. Not allowing a user to post inoffensive content would easily drive users away and thus affect the growth of the community. Compared with content control approaches, we provide fine-grained filtering by removing only the smallest syntactic part of the sentence containing offensive language. The inoffensive content in the original message will remain; thereby, the user still has the freedom of speech for posting inoffensive content. We believe such delicate filtering will be more acceptable to online communities. 3.2.3 Manual Filtering Approach Manual filtering is believed to produce the best filtering result. Basically, user messages are reviewed by a community administrator before being posted on the website. Filtering results with Manual Filtering Original comment: "What the fuck is wrong with you?" Manual Filtering: "What is wrong with you?" As shown above, the administrator is able to easily understand what the author wants to express and precisely remove only the offensive content within the message. However, manual filtering is very time- and labor-consuming, making it impossible to be widely applied. For example, in the Linda-Ikeji blog community (http://lindaikeji.blogspot.com), the
  • 52. blog administrator will manually review and filter user comments on some celebrities' public blogs. Obviously, users would expect a delay between posting a comment on a blog and the comment being displayed on the blog's webpage. Further, the filtering totally relies on the judgment of the community administrator. Our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words in order to remove the offensive content semantically. The proposed semantic filtering approach will be fully automatic, requiring no interference from any administrator. 3.3 Problem of the existing approaches From the study of the existing approaches and based on the information provided above, the following problems have been identified: • Using the keyword censoring approach, readers can still easily understand what the offender wants to say and may even be able to infer the removed words. This indicates a failure of filtering because the offensive opinion has been successfully delivered to victims. Also, removing words from a sentence without considering their context breaks the readability of the rest of the sentence. • The content control approaches are too coarse-grained to be applied in online communities. An offender can easily bypass the filtering as long as he knows the estimation criteria and, more importantly, a sentence in a user comment may contain both offensive and inoffensive content, so the inoffensive part may be removed falsely because of the offensive part. Not allowing a user to post inoffensive content would easily drive users away and thus affect the growth of the community.
  • 53. • The manual filtering approach is very time- and labor-consuming. The administrator would have to manually review and filter all of the users' comments and messages, making it impossible to be widely applied. Also, the filtering totally relies on the judgment of the community administrator. 3.4 Proposed Filtering Philosophy The goal of our semantic filtering is to achieve filtering results close to those of manual filtering. To reach this goal, the foremost thing is to answer the question of how the filtering should be performed in order to get the desired filtering results. In this section, we present our answer in three steps. First, we analyze the characteristics of offensive text content in user messages. Then, we introduce our filtering philosophy according to the summarized characteristics. Finally, we show how this philosophy is transformed into heuristic rules applicable in the filtering process. 3.4.1 Offensive Language Text Content Based on observation of user comments collected from the YouTube website, a sentence in a user message may contain both offensive and inoffensive text content. Offensive text content is exposed intentionally with the purpose of bringing negative influence to victims (e.g., the readers of the message). The victim receives the negative influence by reading the offensive part of the sentence and understanding the carried offensive information. Hence, the information carried by the original sentence can be represented as I = I_off + I_inoff
  • 54. The offender reaches his goal when the offensive information I_off is delivered to readers. Therefore, to achieve thorough filtering, all words used to deliver I_off should be removed. Meanwhile, with respect to free speech, the part carrying I_inoff should be preserved. 3.4.2 Filtering Philosophy According to the analysis, we propose the philosophy that should be followed in sentence-level offensive language filtering: • Precisely identify all offensive content and remove it semantically, so that viewers will not notice the existence of offensive language in the original sentence; • Keep the readability and the inoffensive content of the sentence, so that the author is still allowed to express his opinion freely as long as it is not offensive. This is called the philosophy of "filtering instead of blocking". To the filter, the philosophy states that: if removing one word will make another word meaningless or confusing to readers, we should consider removing both words to keep the readability of the filtered sentence; meanwhile, we only remove words that are affected by offensive words. For example, in the sentence "Samuel said it and what the fuck is wrong with what he said?", suppose "fuck" is the only offensive word; the sentence can be separated into two parts. The first part, "Samuel said it", is inoffensive, but the second part, "what the fuck is wrong with what he said?", is offensive. Therefore, we should remove the offensive word in the second part while keeping the first part and still making the sentence meaningful and readable, i.e. we won't have: Samuel said it and what the is wrong with what he said? (Wrong)
  • 55. But Samuel said it and what is wrong with what he said? (Correct) The words "the" and "fuck" must be removed in order to keep the transparency of filtering as well as the readability of the filtered text content. 3.4.3 Filtering Rules Specifically, the proposed philosophy is transformed into two heuristic rules to estimate the impact of removing words in a sentence. Rule 1. (Modification Relation) In a modification relation, if the modifier is determined to be offensive, removing the modifier alone is enough; if the head is determined to be offensive, both the head and the modifier should be removed. The modification relation is a binary semantic relationship between two syntactic elements, such as words or phrases. One element is named the head and the other is named the modifier. The modifier is used to describe the head (i.e. the modified component). Semantically, modifiers describe and provide a more accurate definitional meaning for the head. As the modifier acts as a complement, the removal of the modifier typically will not affect the grammaticality of the construction. For example, in the sentence "she likes red apples.", the adjective "red" is used to modify the noun "apples". Removing "red" will keep the readability of the rest of the sentence. We admit that removing modifiers will lose some information carried by the modifiers. However, if the modifier is determined to be removable but the head is not, removing the modifier will remove only the offensive information.
Rule 2 (Pattern Integrity). If removing the offensive word breaks the integrity of the sentence's basic pattern, the whole sentence should be removed in order to preserve readability.

English sentences and clauses are organized in basic patterns, such as "Subject-Verb", "Subject-Verb-Object", "Subject-Verb-Adjective", "Subject-Verb-Adverb", and "Subject-Verb-Noun". Every sentence or clause can be categorized into one pattern, and the integrity of the basic pattern is essential to the readability of the content. For example, the sentence "she sleeps on the sofa." follows the "Subject-Verb" pattern. If we remove only "sleeps", the rest of the sentence, "she on the sofa.", becomes meaningless. We apply these two rules during the filtering of sentences.

3.5 Identify Removable Content by Grammatical Relations
A text or user message can be decomposed into a sequence of sentences, and each sentence is treated as a unit in filtering. Given a sentence containing both offensive and inoffensive words, the goal of filtering is to identify the inoffensive words that should be removed together with the offensive words. We define the words that should be removed by the filtering as "removable" words. We observe that manual filtering achieves this goal easily because a human can understand the context of words in a sentence and precisely identify which words should be removed along with the known offensive words. We therefore mimic manual filtering: we extract the grammatical relations among the words of a sentence and use the proposed filtering rules to estimate the impact of removing offensive words on the other, inoffensive words based on the extracted grammatical relations. A brief code sketch of the initial blacklist scan that precedes this analysis is given below; the complete procedure is then formalized in Algorithm 1.
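To make the pre-filtering scan concrete, the following is a minimal sketch (with hypothetical helper names of our own) of chunking a comment into sentences and checking each one against the blacklist, corresponding to lines 2-7 of Algorithm 1 below. Sentences flagged here would go on to the grammatical analysis of Section 3.5.1.

import java.text.BreakIterator;
import java.util.*;

// Sketch of the pre-filtering scan: chunk a comment into sentences and flag
// those containing blacklisted words. Sentences with no offensive word are
// passed through unchanged; the rest continue to the grammatical analysis.
public class OffensiveScan {

    static List<String> chunkIntoSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    static boolean containsOffensiveWord(String sentence, Set<String> blacklist) {
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (blacklist.contains(token)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> blacklist = Set.of("fuck");   // toy blacklist for illustration only
        String comment = "Samuel said it. What the fuck is wrong with what he said?";
        for (String s : chunkIntoSentences(comment)) {
            System.out.println((containsOffensiveWord(s, blacklist) ? "FILTER: " : "KEEP:   ") + s);
        }
    }
}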
Specifically, the proposed approach consists of two steps. In the first step, we scan the sentence to see whether offensive words exist. If they do, we retrieve grammatical information (i.e., Part-of-Speech tags and typed dependency relations) for the words in the sentence. Using the retrieved grammatical information, we create a tree data structure, named RelTree, for the second-step estimation. In the second step, we propose a set of estimation functions that follow the filtering rules introduced above. Using the RelTree structure and the proposed rules, we then estimate whether there are inoffensive words that should be removed together with the identified offensive words. The overall idea of our semantic filtering approach is shown in Algorithm 1 below. Within the algorithm, the functions POStagging and TDgenerator generate Part-of-Speech tags and typed dependency relations, respectively; we use existing NLP (Natural Language Processing) tools to implement these two functions. We focus on the design of the two other functions, CreateRelTree and EstimateRelTree. In this methodology, we assume that the filtering is based on a comprehensive offensive lexicon containing all offensive words; words that do not appear in the lexicon are considered inoffensive.

input : a text comment T, a blacklist of offensive words Blacklist
output: a filtered text comment T′
1 T′ ← "";
2 senList ← chunk T into a list of sentences;
3 foreach sentence s ∈ senList do
4     scan s for offensive words using Blacklist;
5     if no offensive word found then
6         T′ ← T′ + s;
7     end
8     else
9         PTree ← POStagging(s);                          /* get parse tree */
10        TDset ← TDgenerator(s);                         /* get typed dependency relations */
11        RelTree ← CreateRelTree(PTree, TDset);          /* create RelTree */
12        LabelRelTree ← EstimateRelTree(RelTree, Blacklist);  /* estimate using RelTree */
13        s′ ← remove all words in LabelRelTree that are labeled as "removable";
14        T′ ← T′ + s′;
15    end
16 end
17 Return T′;
Algorithm 1: Procedure of Semantic Filtering

3.5.1 First Step: Grammatical Analysis
In the first step, we extract two types of grammatical information from a given sentence. One is the Part-of-Speech information associated with every word. The other is the dependency relation
among words. Part-of-Speech information helps us to understand the organization of a sentence, which is essential for keeping the readability when we try to remove words from it. Dependency relations are used directly to estimate the impact of removing one word on other semantically related words, making the filtering more "meaningful". Combining these two types of information, we create a new data structure, called RelTree, for the next-step estimation.

3.5.1.1 Part-of-Speech Tagging
Part-of-Speech tagging has been widely used in Natural Language Processing applications to identify the syntactic properties of lexical items in a sentence, such as words or phrases. Through Part-of-Speech tagging, the sentence can be represented in a tree structure based on Part-of-Speech tags. We adopt the Penn Treebank tag set for our Part-of-Speech tagging. An example of a Penn Treebank style parse tree is shown in Figure 1 below.

Figure 1: A parse tree of a sentence based on Part-of-Speech tags
Here, the leaf nodes are the words appearing in the sentence. The non-leaf nodes represent syntactic elements such as phrases or clauses, and each element consists of the words within its subtree. For example, the words "said" and "it" constitute a Verb Phrase (i.e., VP) node.

3.5.1.2 Typed Dependency Relations
Typed dependencies are a family of general relations describing the grammatical dependencies within a sentence, proposed by the Stanford Natural Language Processing Group. Each typed dependency consists of a dependency type and a (governor, dependent) word pair. For example, in the sentence "what the fuck is wrong with what he said?", the typed dependency amod(wrong, fuck) means that "fuck" is an adjectival modifier of the phrase containing "wrong". A typed dependency may describe the dependency relation between two syntactic elements in general, not only between words.

Figure 2: An example of a typed dependency graph
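As an illustration of how the POStagging and TDgenerator functions of Algorithm 1 can be realized with an existing NLP tool, the sketch below uses the Stanford CoreNLP pipeline to produce both a Penn Treebank parse tree (as in Figure 1) and the typed dependencies (as in Figure 2) for a sentence. This is only a sketch: it assumes the CoreNLP library is on the classpath, and the exact annotation keys and dependency styles vary between CoreNLP versions.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class GrammaticalAnalysis {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("What the fuck is wrong with what he said?");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Penn Treebank style parse tree (the POStagging step)
            Tree parseTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            parseTree.pennPrint();

            // Typed dependency relations (the TDgenerator step)
            SemanticGraph deps = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(deps.toString());
        }
    }
}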
The typed dependencies in a sentence can be represented as a graph. For example, Figure 2 shows the typed dependency relations for the same sentence shown in Figure 1. We explain the relations appearing in Figure 2 from left to right: the nominal subject relation, nsubj(it, Samuel), means that "Samuel" is the syntactic subject of the clause (likewise nsubj(wrong, he)); the copula relation, cop(it, said), means that "it" is the complement of the verb "said" (likewise cop(wrong, is)); the determiner relation, det(fuck, the), means that "the" is a determiner of "fuck"; the adjectival modifier, amod(wrong, fuck), means that "fuck" serves as an adjectival modifier of "wrong"; and the conjunct relation, conj_and(it, wrong), means that the coordinating conjunction "and" connects the two elements headed by "it" and "wrong", respectively.

3.5.1.3 Relation Tree (RelTree)
Both the Part-of-Speech and the typed dependency relations are utilized in the second-step estimation. The parse tree shows the syntactic organization of the sentence, while the typed dependency relations provide semantic information among words. To combine both kinds of information, we propose a new data structure called RelTree. In a RelTree, the leaf nodes are the words in the sentence, and each non-leaf node represents either a phrase or a clause inside the sentence. With each non-leaf node, we associate the set of typed dependency relations over the words within its subtree; each node contains only the typed dependency relations that have not already appeared in the nodes of its subtree.
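To make this structure concrete, a minimal sketch of a RelTree node is given below. The class and field names are hypothetical: a node carries its label (a word for leaves, a phrase or clause tag otherwise), the set of words covered by its subtree, the typed dependency relations first claimed at this node, and the "removable" flag assigned during the second-step estimation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal representation of a typed dependency such as det(fuck, the).
class TypedDependency {
    String type, governor, dependent;
    TypedDependency(String type, String governor, String dependent) {
        this.type = type; this.governor = governor; this.dependent = dependent;
    }
}

// Sketch of a RelTree node as described in Section 3.5.1.3.
class RelTreeNode {
    String label;                             // the word (leaves) or phrase/clause tag (non-leaves)
    boolean removable = false;                // assigned during the bottom-up estimation (Section 3.5.2)
    List<RelTreeNode> children = new ArrayList<>();
    Set<String> wordSet = new HashSet<>();    // words covered by this subtree
    List<TypedDependency> relations = new ArrayList<>();  // relations not already claimed by a descendant

    RelTreeNode(String label) { this.label = label; }
    boolean isLeaf() { return children.isEmpty(); }
}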
Figure 3: A RelTree combining the parse tree and typed dependency relations

input : a parse tree PTree, a set of typed dependency relations TDset
output: a RelTree RelTree
1 RelTree ← PTree;
2 Remove all word nodes in RelTree;
3 Traverse RelTree in postorder foreach node n visited do
4     if n is a leaf node then
5         n.wordset ← {n};    /* create word nodes */
6     end
7     if n is not a leaf node then
8         n.wordset ← ∅;
9         foreach direct child node ci do
10            n.wordset ← n.wordset ∪ ci.wordset;
11        end
12        n.rel ← ∅;
13        foreach relation Ti(Gi, Di) in TDset do
14            if Gi ∈ n.wordset and Di ∈ n.wordset then
15                n.rel ← n.rel ∪ Ti(Gi, Di);
16                TDset ← TDset − Ti(Gi, Di);
17            end
18        end
19    end
20 end
21 Return RelTree;
Algorithm 2: Create a RelTree using the parse tree and typed dependency relations

The RelTree data structure is proposed only for the convenience of the offensiveness estimation in the next step. Algorithm 2 shows the procedure for RelTree construction. With the parse tree PTree given, the computational complexity of CreateRelTree is determined by the post-order traversal and the search in TDset. Since the number of relations never exceeds N(N−1)/2, where N is the number of words in the sentence, the computational complexity is O(N³). This complexity is acceptable in practice, and there are many ways to improve the efficiency of an implementation of this algorithm. A compact code sketch of CreateRelTree is given below.
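The code sketch below is our condensed reading of CreateRelTree. It reuses the RelTreeNode and TypedDependency classes sketched in Section 3.5.1.3 and assumes the parse tree has already been converted into RelTreeNode form with the words as leaves; for brevity, words are identified by their surface form, whereas a real implementation would use token indices to handle repeated words.

import java.util.ArrayList;
import java.util.List;

// Sketch of Algorithm 2 (CreateRelTree): fill in word sets bottom-up and attach
// each typed dependency to the lowest node whose subtree covers both of its ends.
public class CreateRelTree {

    static RelTreeNode createRelTree(RelTreeNode parseTree, List<TypedDependency> tdSet) {
        annotate(parseTree, new ArrayList<>(tdSet));   // work on a copy of the relation set
        return parseTree;
    }

    private static void annotate(RelTreeNode n, List<TypedDependency> remaining) {
        if (n.isLeaf()) {
            n.wordSet.add(n.label);                    // a leaf covers exactly its own word
            return;
        }
        for (RelTreeNode child : n.children) {
            annotate(child, remaining);                // post-order: children first
            n.wordSet.addAll(child.wordSet);           // union of the children's word sets
        }
        // claim every relation whose governor and dependent both fall inside this subtree
        remaining.removeIf(td -> {
            if (n.wordSet.contains(td.governor) && n.wordSet.contains(td.dependent)) {
                n.relations.add(td);                   // consumed here, so it never reappears above
                return true;
            }
            return false;
        });
    }
}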
3.5.2 Second Step: Bottom-Up Estimation
In the second step, we first use the offensive lexicon to identify offensive words in the sentence. Every leaf node containing an offensive word is labeled "removable". Starting from the leaf nodes, we then perform a bottom-up estimation through a post-order traversal of the RelTree. For each non-leaf node in the RelTree, we estimate whether it should be removed based on (1) the associated typed dependency relations and (2) its child nodes within its subtree. If a non-leaf node is estimated to be "removable", all of its descendants, including the words within its subtree, are also labeled "removable". For a non-leaf node, "removable" means that all words, phrases, or even clauses within its subtree have been determined to be removed at the end of the filtering. The estimation process itself has two parts: we first estimate based on typed dependency relations, and then apply a set of heuristic rules as a complement.

3.5.2.1 Estimation with Typed Dependency Relations
Consider a non-leaf node n in a RelTree with a set n.rel of typed dependency relations. Each relation describes a semantic connection between a governor word and a dependent word; both words are leaf nodes in the subtree rooted at n. The set n.rel can be empty when n has only one child node. For each typed dependency relation in n.rel, we study its semantic information and map it to an estimation function. These estimation functions and the mapping are created following the Modification Relation and Pattern Integrity rules. Take the Direct Object (dobj) relation as an example. The relation dobj(G, D) is defined as follows: the direct object of the verb phrase containing the governor word G is the noun phrase containing the dependent word D. For example, in the relation dobj(win, match), "win" is the governor word and "match" is the dependent word. According to the Pattern Integrity rule, we know
that "Subject-Verb-Object" is a basic pattern. Therefore, if either the phrase containing G or the phrase containing D is to be removed because of offensiveness, both phrases should be removed together. To formalize this, we define an estimation function H(T) = H(P(G)) OR H(P(D)) and map the relation dobj(G, D) to it. We use the symbols C(G) and P(G) to denote the clause and the phrase that contain the word G as head, respectively. In this estimation function, H(T) is the label to be assigned to relation T, and H(P(G)) is the label of the phrase node containing G in the RelTree. Using the estimation functions, we generate a label for every relation associated with node n and then for the node itself. If a relation T(G, D) of node n is estimated and labeled "removable", the two child nodes of n containing word G and word D are labeled "removable". If all relations in n.rel are labeled "removable", the node n, as well as all of its descendants, is labeled "removable".

3.5.2.2 Estimation with Heuristic Rules
Heuristic rules are applied as a complement after the typed dependency relation estimation. Applying heuristic rules is necessary mainly for two reasons. First, typed dependency relations carry only limited syntactic information; for example, the possessive ending (i.e., POS) tag, which is quite a common Part-of-Speech tag, is ignored during typed dependency tagging. Secondly, not all relations between syntactic elements in a sentence can be classified into one of the typed dependency relation types. For such uncertain relations, a generic grammatical relation named dep is defined. To avoid confusing the filter, we map dep to the rule H(T) = H(G) AND H(D), which means that labeling either G or D as removable on its own affects neither the other word nor the
label of T. Because the dep relation stands for an uncertain relation, we have to rely on the Part-of-Speech tags in the RelTree for our filtering. Take the conj tag node rule as an example. The conjunct relation (conj) is a relation between two syntactic elements connected by a coordinating conjunction, such as "and". The parameters of conj do not include the coordinating conjunction itself; however, the coordinating conjunction explicitly sits between the two parameters of conj. If either side is determined to be removable, the coordinating conjunction should be removed as well. For example, in the sentence "I like A and B", if either A or B is removed, the coordinating conjunction "and" should also be removed.

Figure 4: Estimate a RelTree in a bottom-up manner
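The mapping from relation types to estimation functions can be sketched as a small lookup table. The assignments below are illustrative, not exhaustive, and follow the two rules of Section 3.4.3: relations that bind a basic pattern together (e.g., dobj, nsubj, cop) combine with OR, modification relations (e.g., amod, det) follow Rule 1, and the catch-all dep combines with AND. An actual implementation would cover the full Stanford typed dependency inventory.

import java.util.Map;

// Sketch of mapping typed dependency types to label-combination functions
// (true = "removable"), following the Modification Relation and Pattern
// Integrity rules.
public class RelationEstimation {

    enum Combine { OR, AND, MODIFIER_ONLY }

    static final Map<String, Combine> RULES = Map.of(
            "dobj",  Combine.OR,             // Pattern Integrity: if the object goes, the verb goes too
            "nsubj", Combine.OR,             // Pattern Integrity: subject and predicate stand or fall together
            "cop",   Combine.OR,
            "amod",  Combine.MODIFIER_ONLY,  // Rule 1: an offensive modifier goes alone
            "det",   Combine.MODIFIER_ONLY,  // Rule 1: an offensive head drags its determiner along
            "dep",   Combine.AND             // uncertain relation: the two sides do not affect each other
    );

    // Label for a relation given the labels of the governor-side and dependent-side nodes.
    static boolean estimate(String type, boolean governorRemovable, boolean dependentRemovable) {
        switch (RULES.getOrDefault(type, Combine.AND)) {
            case OR:            return governorRemovable || dependentRemovable;
            case AND:           return governorRemovable && dependentRemovable;
            case MODIFIER_ONLY: return governorRemovable;  // offensive head (governor) => remove both ends
            default:            return false;
        }
    }
}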
3.5.2.3 Estimation Algorithm
To estimate and assign labels for all nodes in a RelTree, we also perform the estimation in a bottom-up manner. Figure 4 shows an example estimation process. The number in each circle represents the order of estimation of that node in the RelTree, and the dashed nodes are those estimated as "removable". For example, the clause node with nsubj(you, fuck) is estimated to be "removable"; therefore, its two child nodes, containing "you" and "fuck" respectively, are both labeled "removable". Moreover, the word "and" is removable according to the heuristic rule (i.e., the conj tag node rule), in order to keep the filtering transparent to readers. Finally, the inoffensive words "what", "the", "is", "wrong", "with", "he", and "said" are removed together with the offensive word "fuck" in the filtering. According to Algorithm 2, each typed dependency relation appears exactly once in the RelTree, so no relation is checked repeatedly during the estimation. The cleaned sentence after filtering in this example is "Samuel said it.". As we can see, the result satisfies the requirements of our proposed filtering philosophy: only the offensive part, "what the fuck is wrong with what he said", is removed, and the reader can still get the inoffensive information. The detailed algorithm for the estimation process is presented below.

input : a RelTree RelTree, a blacklist of offensive words Blacklist
output: a labeled RelTree LabelRelTree
1 LabelRelTree ← RelTree;
2 Label all leaf nodes containing offensive words as "removable" in LabelRelTree;
3 Traverse LabelRelTree in postorder foreach node n visited do
4     if n is a leaf node then
5         ignore;    /* already labeled */
6     end
7     if n is not a leaf node then
8         if n only has one child node then
9             n.label ← n.child.label;
10        end
11        if n has more than one child node then
12            Estimate the label for n from its associated relations and the labels of its children, using the proposed estimation functions and heuristic rules;
13        end
14    end
15 end
16 Return LabelRelTree;
Algorithm 3: Estimate nodes in the RelTree
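Algorithm 3 can be sketched in code as follows, reusing the RelTreeNode and TypedDependency classes from Section 3.5.1.3 and the estimate() mapping sketched in Section 3.5.2.2. This is our condensed reading of the algorithm, not the actual implementation: each relation's label is computed from the labels of the child nodes covering its governor and dependent, a removable relation marks both of those subtrees, and a node all of whose relations are removable takes its whole subtree with it.

import java.util.Set;

// Sketch of Algorithm 3 (EstimateRelTree): post-order labeling of a RelTree.
public class EstimateRelTree {

    static void estimateTree(RelTreeNode n, Set<String> blacklist) {
        if (n.isLeaf()) {                                // leaves: label blacklisted words
            n.removable = blacklist.contains(n.label.toLowerCase());
            return;
        }
        for (RelTreeNode child : n.children) {
            estimateTree(child, blacklist);              // post-order: children first
        }
        if (n.children.size() == 1) {
            n.removable = n.children.get(0).removable;   // single child: inherit its label
            return;
        }
        boolean allRemovable = !n.relations.isEmpty();
        for (TypedDependency td : n.relations) {
            RelTreeNode govChild = childCovering(n, td.governor);
            RelTreeNode depChild = childCovering(n, td.dependent);
            boolean relRemovable = RelationEstimation.estimate(
                    td.type, govChild.removable, depChild.removable);
            if (relRemovable) {                          // a removable relation takes both sides with it
                markSubtree(govChild);
                markSubtree(depChild);
            } else {
                allRemovable = false;
            }
        }
        n.removable = allRemovable;
        if (n.removable) markSubtree(n);                 // a removable node removes its whole subtree
    }

    private static RelTreeNode childCovering(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children) {
            if (c.wordSet.contains(word)) return c;
        }
        return n.children.get(0);                        // fallback, for illustration only
    }

    private static void markSubtree(RelTreeNode n) {
        n.removable = true;
        for (RelTreeNode c : n.children) markSubtree(c);
    }
}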
CHAPTER FOUR
IMPLEMENTATION

4.1 JUSTIFICATION OF PROGRAMMING LANGUAGE USED
The offensive language filtering system is an online application implemented using HTML, JavaServer Pages (JSP), JavaScript, and the MySQL relational database.

4.1.1 HTML
HTML, which stands for Hypertext Markup Language, is the predominant markup language for web pages. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, and lists, as well as for links, quotes and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of "tags" surrounded by angle brackets within the webpage content. It can include or load scripts in languages such as JavaScript, which affect the behaviour of HTML processors like web browsers, and Cascading Style Sheets (CSS), which define the appearance and layout of text and other material.
4.1.2 JAVASCRIPT
JavaScript has been around for several years now, in many different flavors. Its main benefit is that it adds interaction between the website and its visitors at the cost of a little extra work by the web developer, allowing industrious web masters to get more out of their websites than HTML and CSS alone can provide. By definition, JavaScript is a client-side scripting language, which means the web surfer's browser runs the script. The opposite of client-side is server-side, as in a language like PHP, whose scripts are run by the web hosting server. There are many uses (and abuses!) for the powerful JavaScript language. Here, it is used for:
- Alert messages
- Popup windows
- HTML form data validation

4.1.3 JAVASERVER PAGES (JSP)
JSP is a server-side technology for embedding Java in HTML pages; its goal is to allow developers to write dynamically generated pages quickly, and it is specifically designed for creating dynamic web pages. JSP allows you to:
- Reduce the time needed to create large websites.
- Create a customized user experience for visitors based on information you have gathered from them.
- Open up thousands of possibilities for online tools.
JSP is freely available, and open source implementations of the servlet/JSP specification (such as Apache Tomcat) exist. When someone visits your JSP webpage, your web server processes the Java code: it determines which parts to show to visitors (content and pictures), hides the other parts (file operations, calculations, etc.), and translates your JSP into HTML. After the translation into HTML, it sends the webpage to your visitor's web browser.

4.1.4 MYSQL
MySQL is the most popular open source database server in existence because of its consistently fast performance, high reliability and ease of use. It is used in more than 6 million installations, ranging from large corporations to specialized embedded applications, on every continent in the world. It is very commonly used in conjunction with PHP scripts to create dynamic and powerful server applications. MySQL has been criticized in the past because it does not have all the features of other Database Management Systems. However, MySQL continues to improve significantly with each major upgrade and has gained great popularity because of these improvements.
4.1.5 CSS
Cascading Style Sheets (CSS) are a way to control the look and feel of HTML documents in an organized and efficient manner. CSS enables us to add new looks to existing HTML, completely restyle a website with only a few changes to the CSS code, and reuse a "style" on any webpage we wish. With CSS you will be able to:
- Add new looks to your old HTML
- Completely restyle a website with only a few changes to your CSS code
- Use the "style" you create on any webpage you wish

4.2 System Specification
The system specification is divided into two parts:
1. Hardware Specification
2. Software Specification

4.2.1 HARDWARE SPECIFICATION FOR THE APPLICATION
Any computer tagged by the manufacturer as a workstation can be used to access this application through an internet browser, but the following minimum specification is required to host the application:
1. A computer tagged by the manufacturer as a server
2. A Core 2 Duo processor or above
3. At least 2 GB of memory
4. A keyboard and a mouse
5. A hard disk of 120 GB or above

4.2.2 SOFTWARE SPECIFICATION FOR THE APPLICATION
- Windows Server 2005 or above
- Microsoft .NET Framework version 3.0 or above must be installed
- Microsoft SQL Server 2005 or above should be installed
- Microsoft Internet Information Server (IIS) should be enabled
- Server FTP capability must be enabled

4.3 System Implementation
This section briefly describes the screens of the online application.

4.3.1 Application Login Screen
The system contains a secure login panel that requires a combination of an email address and a password. The email address is used because it is meant to be unique.
Fig. 4.1 – Web Application Login Screen

4.3.2 Application Registration Page
Fig. 4.2 – Web Application Registration Page
Here the user fills in his/her details, and the system verifies that all the details provided are correct. The page also includes a captcha image, which acts as a spam guard to ensure that the data was entered by a human and not a robot.

4.3.3 Post and Comment Page
Fig. 4.3 – Filtered Post Page Using Keyword Censoring Approach
Fig. 4.4 – Filtered Post Page Using Content Control Censoring Approach
Fig. 4.5 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above (Figs. 4.3, 4.4 and 4.5), we can see that our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words, and that it has removed the offensive content semantically. The approach is fully automated: it requires no intervention from an administrator while still eliminating the offensive words in the sentence. "What the fuck is wrong with you?" has been changed to "What is wrong with you?" by the proposed semantic filtering approach, instead of becoming "what the f*** is wrong with you?", which would still deliver the offensive word to the victim. Our semantic filtering result is also very close to that of manual filtering, as the desired result is produced simply by applying the heuristic rules during the filtering process.

Fig. 4.6 – Filtered Post Page Using Keyword Censoring Approach
Fig. 4.7 – Filtered Post Page Using Content Control Censoring Approach
Fig. 4.8 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above in Figs. 4.6, 4.7 and 4.8, we can see that our proposed semantic filtering approach again mimics the procedure of manual filtering by trying to understand the relations among words, and that it has removed the offensive content semantically. The approach is fully automated: it requires no intervention from an administrator while still eliminating the offensive words in the sentence. "I have told all these bitches to stop calling my husband's phone" has been changed to "I have told all to stop calling my husband's phone" by the proposed semantic filtering approach, instead of becoming "I have told all these b****** to stop calling my husband's phone", which would still deliver the offensive word to the victim. Our semantic filtering result is again very close to that of manual filtering, as the desired result is produced simply by applying the heuristic rules during the filtering process.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.1 Summary and Conclusion
Online social networking sites have become increasingly popular with children, especially young teens, as a place where they can meet other people, communicate, and exchange information. This has also brought cyberbullying, a fast-growing trend that experts believe is more harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via online social networking communities, so victims can be reached at any time and in any place. For many children, home is no longer a refuge from bullies: children can escape threats and abuse in the classroom, only to find offensive comments and posts from the same tormentors when they arrive home. There is no safe place anymore, and one can be bullied 24/7, even in the privacy of one's own bedroom. In this work, however, we do not only try to filter out offensive content; we also make sure the filtered sentences still make sense. Statistical analysis has revealed that more than 60% of insulting messages are posted as direct insults, and direct insulting messages always contain insulting words or phrases. From a psychological point of view, if these messages are identified and a user is restricted from sending them, the human intention to post or exchange abusive messages can be reduced significantly. Offensive language is a serious problem facing the online community. Our semantic filtering technique is based on the grammatical relations of the words in a sentence, so that the rest of the filtered sentence is readable and the existence of offensive words in the original sentence is hard
to notice. We tested the effectiveness of our approach on a large dataset, and the results show that our techniques are very effective and accurate, with little processing overhead.

5.2 Recommendations
Our future work includes looking at the issues described in the discussion section. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other lightweight NLP techniques to speed up sentence parsing. Last but not least, we also plan to extend our filtering approach to support other languages such as Chinese and French.