Filtering Offensive Language in Online Communities using Grammatical Relations
BY
SAMUEL AYOKUNLE ADEKANMBI
MATRIC NO: 133466
Project submitted in partial fulfillment of the requirements for the award of the Master of Science degree
(Computer Science)
Department of Computer Science,
University of Ibadan.
February, 2014.
Certification
I certify that this research work was carried out by Samuel Ayokunle ADEKANMBI (133466)
under my supervision.
____________________ _______________________
Date Dr B O Longe
DEDICATION
This entire work is dedicated to everyone that believes in the PromoUpdate dream.
ACKNOWLEDGEMENT
My profound gratitude goes to my parent and my siblings for their moral and financial support
which has immensely led to the success of this project. To my Dad, You are the best; I love you
so much even though I don’t show it.
I am indeed grateful to my supervisor, Dr. Olumide B. Longe for his moral support, patience and
understanding during the course of this project. Thank you very much Sir.
I also want to appreciate my very good and crazy friends: Tini, Phina, Kunchasho, TY, Alamu,
Oluwashola Amiola Philip, Emmanuel, Muideen, Lola Mojekodunmi, Jane, Gbenro, N.O Jimoh,
Tifa; You guys are my brothers from another mother.
I cannot overemphasize the effort of all my lecturers in the department; I pray the blessing of the Lord
shall not depart from your homes.
My M.Sc. programme would have been incomplete without a set of wonderful people: Tini,
Phina, Helen, Rotimi, Modupe, Tolu, Big Fish, Last Don, Giel, and the whole crew at chief Madu’s
Palace. Thanks for being there for me.
To all my classmates, Dimple, Becky, Elohor, Ben, Fake AYs, John, Uzomma, Deola, Banky,
Shukurat, Toyosi, Shola, Adesi, GP, Toyosi, Tosinsss, etc; you have been a blessing to me and the
success of my programme. I say a big thanks to you for your support throughout the programme.
I appreciate your love. Thanks for believing in the PromoUpdate dream. You guys are the best.
Finally, to anyone that has contributed to the success of this project and my success in life, whose
name is not mentioned here, please just know that you are not unknown to me and you are
appreciated more than you know. God bless you all. See you at the top.
TABLE OF CONTENTS
Title page
Certification
Dedication
Acknowledgement
Table of contents
Abstract
CHAPTER ONE: INTRODUCTION
1.1 Background of Study
1.2 Problem Statement
1.3 Aims and Objectives
1.4 Research Methodology
1.5 Scope and Limitation
1.6 Organization of the Study
1.7 Expected Contribution to Knowledge
1.7.1 Glossary of Terms
CHAPTER TWO: REVIEW OF THE LITERATURE
2.1 Offensive Language in Online Communities
2.2 Rate of Cyberbullying among Youth
2.3 Traditional Bullying and Cyber-Bullying
2.4 Types of Bullying Online
2.5 Challenges in the Fight to Stop Cyberbullying
2.6 Preventing Cyberbullying
2.7 Responding to Cyberbullying
2.8 Grammatical Relations
2.9 Using Text Mining Techniques to Detect Online Offensive Content
2.10 Heads and Dependents
2.11 Statistical Parsing
2.12 Dependency Parsing
CHAPTER THREE: SYSTEM ANALYSIS AND DESIGN
3.1 Systems Analysis
3.2 Analysis of the Existing System
3.3 Problems of the Existing Approaches
3.4 Proposed Filtering Philosophy
3.5 Identify Removable Content by Grammatical Relations
CHAPTER FOUR: IMPLEMENTATION
4.1 Justification of Programming Language Used
4.2 System Specification
4.3 System Implementation
CHAPTER FIVE: SUMMARY, CONCLUSION AND FUTURE WORKS
5.1 Summary
5.2 Conclusion
5.3 Future Works
References
ABSTRACT
Offensive language has risen to be a big issue to the health of both online communities and their
users. To the online community, the spread of offensive language undermines its reputation, drives
users away, and even directly affects its growth. To users, viewing offensive language brings
negative influence to their mental health, especially for children and youth.
A semantic filtering model is proposed and implemented using grammatical analysis and part-of-speech
tagging. Statistical/probabilistic analysis of recurring offensive tokens is done using a
Bayesian method. The designed semantic filtering system was tested as an online web application
with a client application by engaging users to validate the efficiency of the designed system.
When offensive language is detected in a user message, a problem arises about how the offensive
language should be removed, i.e. the offensive language filtering problem.
Our semantic filtering technique is based on the grammatical relations of words in a sentence so
that the rest of the filtered sentence is readable and the existence of offensive words in the original
sentence is hard to notice. We tested the effectiveness of our approach with a large dataset and the
results show that our techniques are very effective and accurate with little processing overhead.
Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process,
we will examine other lightweight NLP techniques to speed up sentence parsing. We also
plan to extend our filtering approach to support other languages such as Chinese and French in
future work.
CHAPTER ONE
INTRODUCTION
Online social networking (OSN) websites have enjoyed a great success in recent years and have
become the new frontier in today’s social relationships providing great places for self-expression
and exchange of ideas.
Social networking has provided opportunities for new relationships as well as strengthening
existing relationships. Benefits of social networking platforms vary based on platform type,
features and the company itself. OSN allows organizations to improve communication and
productivity by disseminating information among different groups of employees in a more
efficient manner, resulting in increased productivity.
In the past, social networks were viewed as a distraction and offered no educational benefit.
Blocking these social networks was a form of protection for students against wasting time,
bullying, and invasions of privacy. In an educational setting, OSNs are seen by many instructors
and educators as a frivolous, time-wasting distraction from schoolwork, and it is not uncommon
for them to be banned in school computer labs. Cyberbullying has also become an issue of concern with
social networks. According to the Children Go Online survey of 9-24 year olds, a third had
received bullying comments online (http://internetsafety101.org). To avoid this
problem, many school districts/boards have blocked access to online social networks within the
school environment.
Social networking services often include a lot of personal information posted publicly, and many
believe that sharing personal information is a window into privacy theft. Schools have taken action
to protect students from this. It is believed that this outpouring of identifiable information and the
easy communication vehicle that social networking services provide open the door to sexual predators,
cyberbullying, and cyber-stalking (http://en.wikipedia.org/wiki/Social_networking_service). In
contrast, however, 70% of social-media-using teens and 85% of adults believe that people are
mostly kind to one another on social network sites
(http://en.wikipedia.org/wiki/Social_networking_service). Research has suggested that there has
been a shift in blocking the use of social networking services. In many cases, the opposite is
occurring as the potential of online networking services is being realized. It has been suggested
that if schools block them [Online Social Networks], they’re preventing students from learning the
skills they need. Banning social networking is not only inappropriate but also borderline
irresponsible when it comes to providing the best educational experiences for students. Schools
have the option of teaching safe media usage as well as incorporating digital media into the
classroom experience, thus preparing students for the literacy they will encounter in the future.
Cyberbullying is a fast growing trend that experts believe is more harmful than typical schoolyard
bullying. Nearly all of us can be contacted 24/7 via the internet or our mobile phones. Victims can
be reached anytime and at anyplace. For many children, home is no longer a refuge from the
bullies. “Children can escape threats and abuse in the classroom, only to find text messages and
emails from the same tormentors when they arrive home.”
“There’s no safe place anymore and one can be bullied 24/7; even in the privacy of his/her own
bedroom.” (Cyberbullying, Able Publishing Newsletter - Term 3, 2008).
Online social networking sites have become increasingly popular with children, especially young
teens, as a place where they can meet other people, communicate, and exchange information. No
type of bullying is harmless. In some cases, it can constitute criminal behaviour. In extreme
incidents, cyberbullying has led teenagers to suicide. Most victims, however, suffer shame,
embarrassment, anger, depression and withdrawal.(Cyberbullying, Able Publishing Newsletter -
Term 3, 2008) Cyberbullying is often seen as anonymous, and the nature of the internet allows it
to spread quickly to hundreds and thousands of people.
Cyberbullying has the same insidious effects as any kind of bullying, turning children away from
school, friendships, and in tragic instances, life itself. Parents often tell their children to turn off
the mobile phones or stay off the computer. Many parents don’t understand that the internet and
mobile phone act as a social lifeline for teenagers to their peer group. Victims often don't tell their
parents because they think their parents will only make the problem worse, or that they might even
confiscate their mobile phone or take away their internet access, removing that social lifeline.
While bullying is something that is often ‘under the radar’ of adults, cyberbullying is even more
so. Teenagers are increasingly communicating in ways that are often unknown by adults and away
from their supervision. They organize their social lives through these mediums. Their friendships
are made and broken over these mediums.
So the question remains: "How can we avoid offensive language in OSNs?" This research work
aims at removing offensive language from user messages. When offensive language is detected in
a user message, a problem arises about how the offensive language should be removed, i.e. the
offensive language filtering problem. To solve this problem, the manual filtering approach is known
to produce the best filtering result. However, manual filtering is costly in time and labor and thus
cannot be widely applied (http://en.wikipedia.org/wiki/Anti-spam_techniques). Here, we will
analyze the offensive language in text messages posted in online communities, and propose a new
automatic sentence-level filtering approach that is able to semantically remove the offensive
language by utilizing the grammatical relations among words. Compared with existing automatic
filtering approaches, the proposed filtering approach provides filtering results much closer to
manual filtering.
1.1 Problem Statement
The online community has encouraged the use of offensive language, which has spread into about
80% of all OSNs and has been very harmful to the mental health of both children and youth (Zhi
Xu and Sencun Zhu, 2010). To the online community, the deluge of offensive language undermines
the community’s reputation, drives users away, and even directly affects its growth.
People have realized the problems brought by offensive language in online communities and many
efforts have been made on detecting the existence of offensive language within user messages.
However, detection alone is not enough to eliminate the hazard caused by offensive language.
When offensive content is detected within a user message, a question arises naturally about how
the detected offensive content should be removed from the message before it is transmitted.
Also, how do we remove or filter offensive language and words from a message thoroughly and
still keep inoffensive content untouched as much as possible? And can the readability of the filtered
content be guaranteed so as to make our filtering transparent to readers?
1.2 Aims and Objectives
This project work intends to develop and implement a sentence-level semantic filtering System,
which will
1. Utilize grammatical relations among words to stop cyberbullying by semantically removing
offensive content in a sentence.
2. Produce minimal error when filtering offensive language and words from a message and
still keep inoffensive content untouched as much as possible.
3. Guarantee the readability of filtered content so as to make the filtering transparent to
readers.
4. Implement the designed model which is going to be a sophisticated NLP application, not
an AI application, since learning is not going to be involved.
5. Help reduce the chances of victimization in Online Social Networking Sites.
1.3 Research Methodology
The methodology adopted in carrying out this project includes the use of interviews to gather
primary data from a number of leading filtering vendors in Nigeria. Both telephone and face-to-
face interviews will be carried out with the relevant technology experts within selected
organizations. Also, an existing database of offensive words and languages will be collected and
used to simulate an offensive database engine. A semantic filtering model will be proposed and
implemented using grammatical analysis and part-of-speech tagging. Statistical/probabilistic
analysis of recurring offensive tokens will be done using a Bayesian method. The designed
semantic filtering system will be tested as an online
web application with a client application by engaging users to validate the efficiency of the
designed system.
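To make the Bayesian token analysis concrete, the sketch below shows one plausible way such a statistical analysis of recurring offensive tokens could be carried out, in the style of a naive Bayes spam filter. The sample messages, the add-one smoothing, and the 0.5 prior are illustrative assumptions, not the configuration of the system actually built in this project.

from collections import Counter

def train_token_probs(offensive_msgs, clean_msgs):
    # Estimate P(token | offensive) and P(token | clean) with add-one smoothing.
    off = Counter(t for m in offensive_msgs for t in m.lower().split())
    cln = Counter(t for m in clean_msgs for t in m.lower().split())
    vocab = set(off) | set(cln)
    n_off, n_cln, v = sum(off.values()), sum(cln.values()), len(vocab)
    return {t: ((off[t] + 1) / (n_off + v), (cln[t] + 1) / (n_cln + v)) for t in vocab}

def offensive_score(message, probs, prior_off=0.5):
    # Posterior P(offensive | tokens) under a naive independence assumption.
    p_off, p_cln = prior_off, 1.0 - prior_off
    for t in message.lower().split():
        if t in probs:
            p_off *= probs[t][0]
            p_cln *= probs[t][1]
    return p_off / (p_off + p_cln)

probs = train_token_probs(["you stupid idiot", "what an idiot"],
                          ["have a nice day", "what a day"])
print(round(offensive_score("you idiot", probs), 2))   # high score, roughly 0.87 on this toy data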
1.4 Organization of the study
The thesis work is arranged in five chapters with the breakdown as follows:
The First Chapter is termed introduction and it includes the Online Social Networking System,
research aim and objectives, research methodology and organization of dissertation.
Chapter Two deals with the literature review on grammatical relations, cyberbullying and the
concept of a semantic filtering system.
Chapter Three presents the Methodology and analysis of the input and output specification of the
proposed system and the design of the system.
Chapter Four describes the system implementation and evaluation of the system design. This
would consist of a brief description of each program module and its functions. It also justifies
the choice of package and describes the software required to implement the system. It also shows
the measures taken during the implementation.
Chapter Five summarizes the project work. It covers the conclusion and recommendations for the
project.
CHAPTER TWO
LITERATURE REVIEW
2.1 Offensive Language in Online Communities
People, most especially kids, have been bullying each other for generations. The latest
generation, however, has been able to utilize technology to expand their reach and the extent of
their harm. (http://cyberbullying.us) This phenomenon is being called cyberbullying, defined as:
“willful and repeated harm inflicted through the use of computers, cell phones, and other
electronic devices.” Basically, we are referring to incidents where adolescents use technology,
usually computers or cell phones, to harass, threaten, humiliate, or otherwise hassle their peers.
For example, youth can send hurtful text messages to others or spread rumors using cell phones
or computers. Teens have also created web pages, videos, and profiles on social networking sites
making fun of others. With cell phones, adolescents have taken pictures in a bedroom, a
bathroom, or another location where privacy is expected, and posted or distributed them online.
More recently, some have recorded unauthorized videos of other kids and uploaded them for the
world to see, rate, tag, and discuss.(http://cyberbullying.us)
However, there are many detrimental outcomes associated with cyberbullying and making use of
offensive languages that reach into the real world. First, many targets of cyberbullying report
feeling depressed, sad, angry, and frustrated. As one teenager stated: “It makes me hurt both
physically and mentally. It scares me and takes away all my confidence. It makes me feel sick
and worthless.” Victims who experience cyberbullying also reveal that they were afraid or
embarrassed to go to school or even come out to talk in public.(http://cyberbullying.us) In
addition, there is a link between cyberbullying and low self-esteem, family problems, academic
problems, school violence, and delinquent behavior. Finally, cyberbullied youth also report
having suicidal thoughts, and there have been a number of examples around the world where
youth who were victimized ended up taking their own lives.(http://cyberbullying.us)
Cyberbullying occurs across a variety of venues and mediums in cyberspace, and it shouldn’t
come as a surprise that it occurs most often where teenagers congregate. Initially, many kids
hung out in chat rooms, and as a result that is where most harassment took place. In recent years,
most youth have been drawn to social networking websites (such as Facebook, Twitter,
LinkedIn, etc.) and video-sharing websites (such as YouTube). This trend has led to increased
reports of cyberbullying occurring in those environments. (Burgess-Proctor, Patchin, & Hinduja,
2009; Hinduja & Patchin, 2008b; R. M. Kowalski & Limber, 2007; Lenhart, 2007; Li, 2007a;
Patchin & Hinduja, 2006). Instant messaging on the Internet or text messaging via a cell phone
also appear to be common ways in which youth are harassing one another.
2.2 Rate of Cyberbullying among Youth
Estimates of the number of youth who experience cyberbullying vary widely (ranging from 10-
40% or more), depending on the age of the group studied and how cyberbullying is formally
defined. In this research, we informed secondary school students (of International School, Ibadan;
Abadina College, U.I; and Igbobi College Yaba, Lagos) that cyberbullying is when someone
“repeatedly picks on another person by making use of offensive languages through OSN when
chatting or when someone posts something offensive online about another person that they don’t
like.” Using this definition, about 62% of the over 800 randomly-selected 11-18 year-old
students indicated they had been a victim at some point in their life. About this same number
admitted to cyberbullying others during their lifetime. Finally, about 40% of youths in this recent
study said they had both been a victim and an offender.
Fig 2.1
2.3 Traditional-Bullying and Cyber-Bullying
While often similar in terms of form and technique, bullying and cyberbullying have many
differences that can make the latter even more devastating. First, victims often do not know who
the bully is, or why they are being targeted. The cyberbully can cloak his or her identity behind a
computer using anonymous email addresses or pseudonymous screen names.
Second, the hurtful actions of a cyberbully are viral; that is, a large number of people (at school,
in the neighborhood, in the city, in the world!) can be involved in a cyber-attack on a victim, or
at least find out about the incident with a few keystrokes or clicks of the mouse. The perception,
then, is that absolutely everyone knows about it.
Third, it is often easier to be cruel using technology because cyberbullying can be done from a
physically distant location, and the bully doesn’t have to see the immediate response by the
target. In fact, some teens simply might not recognize the serious harm they are causing because
they are sheltered from the victim’s response.
Finally, while parents and teachers are doing a better job supervising youth at school and at
home, many adults don’t have the technological know-how to keep track of what teens are up to
online. As a result, a victim’s experience may be missed and a bully’s actions may be left
unchecked. Even if bullies are identified, many adults find themselves unprepared to adequately
respond.
All these and more make cyberbullying a growing problem, because increasing numbers of
kids have completely embraced interactions via computers and cell phones. Two-
thirds of youth go online every day for school work, to keep in touch with their friends, to play
games, to learn about celebrities, to share their digital creations, or for many other reasons.
Because the online communication tools have become an important part of their lives, it is not
surprising that some youths have decided to use the technology to be malicious or menacing
towards others. The fact that teens are connected to technology 24/7 means they are susceptible
to victimization (and able to act on mean intentions toward others) around the clock. Apart
from a measure of anonymity, it is also easier to be hateful using typed words rather than spoken
words face-to-face and because some adults have been slow to respond to cyberbullying, many
cyberbullies feel that there are little to no consequences for their actions.
Cyberbullying crosses all geographical boundaries. The Internet has really opened up the whole
world to users who access it on a broad array of devices, and for the most part, this has been a
good thing. Nevertheless, some kids feel free to post or send whatever they want while online
without considering how that content can inflict pain – and sometimes cause severe
psychological and emotional wounds.
2.4 Types of Bullying Online
According to the Internet Safety 101 curriculum, there are many types of cyberbullying which
include:
• Gossip: Posting or sending cruel gossip to damage a person’s reputation and
relationships with friends, family, and acquaintances.
• Exclusion: Deliberately excluding someone from an online group.
• Impersonation: Breaking into someone’s e-mail or other online account and sending
messages that will cause embarrassment or damage to the person’s reputation and
affect his or her relationship with others.
• Harassment: Repeatedly posting or sending offensive, rude, and insulting messages.
• Cyber-stalking: Posting or sending unwanted or intimidating messages, which may
include threats.
• Flaming: Online fights where scornful and offensive messages are posted on websites,
forums, or blogs.
• Outing and Trickery: Tricking someone into revealing secrets or embarrassing
information, which is then shared online.
• Cyber-threats: Remarks on the Internet threatening or implying violent behavior,
displaying suicidal tendencies.
2.5 Challenges in the fight to stop cyberbullying
There are two major challenges that make it difficult to prevent cyberbullying. First, many
people don’t see the harm associated with it. Some attempt to dismiss or disregard cyberbullying
because there are “more serious forms of aggression to worry about.” While it is true that there
are many issues facing adolescents, parents, teachers, and law enforcement today, we first need
to accept that cyberbullying is one such problem that will only get more serious if ignored.
The other challenge relates to who is willing to step up and take responsibility for responding to
inappropriate use of technology. Parents often say that they don’t have the technical skills to
keep up with their kids’ online behavior; teachers are afraid to intervene in behaviors that often
occur away from school; and law enforcement is hesitant to get involved unless there is clear
evidence of a crime or a significant threat to someone’s physical safety. As a result,
cyberbullying incidents often slip through the cracks. Indeed, the behavior often continues and
escalates because it is not quickly addressed. Based on these challenges, there is a need to
collectively create an environment where kids feel comfortable talking with adults about this
problem and feel confident that meaningful steps will be taken to resolve the situation. We also
need to get everyone involved - youth, parents, educators, counselors, law enforcement, social
media companies, and the community at large. It will take a concerted and comprehensive effort
from all stakeholders to really make a difference in reducing cyberbullying.
2.6 Preventing Cyberbullying
The most important preventive step that schools can take is to educate the school community
about responsible internet use. Students need to know that all forms of bullying are wrong and
that those who engage in harassing or threatening behaviors will be subject to discipline. It is
therefore important to discuss issues related to the appropriate use of online communications
technology in various areas of the general curriculum. To be sure, these messages should be
reinforced in classes that regularly utilize technology. Signage also should be posted in the
computer lab or at each computer workstation to remind students of the rules of acceptable use.
In general, it is crucial to establish and maintain a school climate of respect and integrity where
violations result in informal or formal sanction.
Furthermore, school district personnel should review their harassment and bullying policies to
see if they allow for the discipline of students who engage in cyberbullying. If their policy covers
it, cyberbullying incidents that occur at school - or that originate off campus but ultimately result
in a substantial disruption of the learning environment - are well within a school’s legal authority
to intervene. The school then needs to make it clear to students, parents, and all staff that these
behaviors are unacceptable and will be subject to discipline. In some cases, simply discussing the
incident with the offender’s parents will result in the behavior stopping.
2.7 Responding to Cyberbullying
Students should already know that cyberbullying is unacceptable and that the behavior will result
in discipline. Utilize school liaison officers or other members of law enforcement to thoroughly
investigate incidents, as needed, if the behaviors cross a certain threshold of severity. Once the
offending party has been identified, develop a response that is commensurate with the harm done
and the disruption that occurred.
School administrators should also work with parents to convey to the student that cyberbullying
behaviors are taken seriously and are not trivialized. Moreover, schools should come up with
creative response strategies, particularly for relatively minor forms of harassment that do not
result in significant harm. For example, students may be required to create anti-cyberbullying
posters to be displayed throughout the school. Older students might be required to give a brief
presentation to younger students about the importance of using technology in ethically-sound
ways. The point here, again, is to condemn the behavior while sending a message to the rest of
the school community that bullying in any form is wrong and will not be tolerated.
Even though the vast majority of these incidents can be handled informally (calling parents,
counseling the bully and target, expressing condemnation of the behavior), there may be
occasions where formal response from the school is warranted. This is particularly the case in
incidents involving serious threats toward another student, if the target no longer feels
comfortable coming to school, or if cyberbullying behaviors continue after informal attempts to
stop it have failed. In these cases, detention, suspension, changes of placement, or even
expulsion may be necessary. If these extreme measures are required, it is important that
educators are able to clearly demonstrate the link to school and present evidence that supports
their action.
Also, youth should develop a relationship with an adult they trust (a parent, teacher, or someone
else) so they can talk about any experiences they have online (or off) that make them upset or
uncomfortable. If possible, teens should ignore minor teasing or name calling, and not respond to
the bully as that might simply make the problem continue. It’s also useful to keep all evidence of
cyberbullying to show an adult who can help with the situation. If targets of cyberbullying are
able to keep a log or a journal of the dates and times and instances of the online harassment, that
can also help prove what was going on and who started it.
Overall, youth should go online with their parents – show them what web sites they use, and
why. At the same time, they need to be responsible when interacting with others on the Internet.
For instance, they shouldn’t say anything to anyone online that they wouldn’t say to them in
person with their parents in the room. Finally, youth ought to take advantage of the privacy
settings within Facebook and other websites, and the social software (instant messaging, email,
and chat programs) that they use – they are there to help reduce the chances of victimization.
Users can adjust the settings to restrict and monitor who can contact them and who can read their
online content.
Law enforcement officers also have a role in preventing and responding to cyberbullying. To
begin, they need to be aware of ever-evolving state and local laws concerning online behaviors,
and equip themselves with the skills and knowledge to intervene as necessary. In a recent survey
of school resource officers, we found that almost one-quarter did not know if their state had a
cyberbullying law. This is surprising since their most visible responsibility involves responding
to actions which are in violation of law (e.g., harassment, threats, stalking). Even if the behavior
doesn’t immediately appear to rise to the level of a crime, officers should use their discretion to
handle the situation in a way that is appropriate for the circumstances. For example, a simple
discussion of the legal issues involved in cyberbullying may be enough to deter some youth from
future misbehavior. Officers might also talk to parents about their child’s conduct and express to
them the seriousness of online harassment.
Relatedly, officers can play an essential role in preventing cyberbullying from occurring or
getting out of hand in the first place. They can speak to students in classrooms about
cyberbullying and online safety issues more broadly in an attempt to discourage them from
engaging in risky or unacceptable actions and interactions. They might also speak to parents
about local and state laws, so that they are informed and can properly respond if their child is
involved in an incident.
2.8 Grammatical Relations
Grammatical relations refer to functional relationships between constituents in a clause. The
standard examples of grammatical functions from traditional grammar are subject, direct object,
and indirect object. Beyond these concepts from traditional grammar, more modern theories of
grammar are likely to acknowledge many further types of grammatical relations (e.g.
complement, specifier, predicative, etc.). The role of grammatical relations in theories of
grammar is the greatest in many dependency grammars, which tend to posit dozens of distinct
grammatical relations. Every head-dependent dependency bears a grammatical function.
Grammatical relations are exemplified in traditional grammar by the notions of subject, direct
object, and indirect object;
For example:
Adekanmbi gave Samuel the book.
The subject Adekanmbi performs or is the source of the action. The direct object the book is
acted upon by the subject, and the indirect object Samuel receives the direct object or otherwise
benefits from the action. Traditional grammars often begin with these rather vague notions of the
grammatical functions. When one begins to examine the distinctions more closely, it quickly
becomes clear that these basic definitions do not provide much more than a loose orientation
point. What is indisputable about the grammatical relations is that they are relational. That is,
subject and object can exist as such only by virtue of the context in which they appear. A noun
such as Adekanmbi or a noun phrase such as the book cannot qualify as subject and direct object,
respectively, unless they appear in an environment, e.g. a clause, where they are related to each
other and/or to an action or state. In this regard, the main verb in a clause is responsible for
assigning grammatical relations to the clause "participants".
2.9 Using Text Mining Techniques to Detect Online Offensive Contents
Offensive language identification in social media is a difficult task because the textual content
in such environments is often unstructured, informal, and even misspelled. Since the defensive
methods adopted by current social media are not sufficient, researchers have studied intelligent
ways to identify offensive contents using text mining approach. Implementing text mining
techniques to analyze online data requires the following phases:
1) Data acquisition and preprocessing,
2) Feature extraction
3) Classification
The major challenges of using text mining to detect offensive content lie in the feature selection
phase, which will be elaborated in the following sections.
a) Message-level Feature Extraction
Most offensive content detection research extracts two kinds of features: lexical and syntactic
features.
Lexical features treat each word and phrase as an entity. Word patterns such as the appearance of
certain keywords and their frequencies are often used to represent the language model. Early
research used Bag-of-Words (BoW) in offensiveness detection. The BoW approach treats a text
as an unordered collection of words and disregards the syntactic and semantic information.
However, using the BoW approach alone not only yields low accuracy in subtle offensive language
detection, but also brings in a high false positive rate, especially during heated arguments,
defensive reactions to others’ offensive posts, and even conversations between close friends. The
N-gram approach is considered an improvement in that it brings words’ nearby context
information into consideration to detect offensive content. N-grams represent subsequences of
N contiguous words in texts. Bi-grams and tri-grams are the most popular N-grams used in text
mining. However, N-grams have difficulty capturing related words separated by long
distances in texts. Simply increasing N can alleviate the problem but will slow down system
processing speed and bring in more false positives.
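As a rough illustration of the lexical features described above, the sketch below extracts Bag-of-Words and word bigram features from a message; the whitespace tokenisation and lower-casing are simplifying assumptions made only for this example.

def extract_features(message, n=2):
    # Bag-of-Words plus word n-gram features for a lower-cased, whitespace-tokenised message.
    tokens = message.lower().split()
    bow = set(tokens)                                                  # unordered Bag-of-Words
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return bow, ngrams

bow, bigrams = extract_features("you are such an idiot")
print(bow)       # the five word types (set order is arbitrary)
print(bigrams)   # the four adjacent word pairs, e.g. ('an', 'idiot')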
Syntactic features: Although lexical features perform well in detecting offensive entities,
without considering the syntactic structure of the whole sentence, they fail to distinguish the
offensiveness of sentences that contain the same words in different orders. Therefore, to
capture syntactic features, natural language parsers are introduced to parse
sentences into grammatical structures before feature selection. Equipping the system with a parser can help
avoid selecting unrelated word sets as features in offensiveness detection.
b) User-level Offensiveness Detection
Most contemporary research on detecting online offensive language focuses only on sentence-
level and message-level constructs. Since no detection technique is 100% accurate, if users keep
connecting with sources of offensive content (e.g., online users or websites), they are at high
risk of continuous exposure to offensive content. However, user-level detection is a more
challenging task and studies associated with the user level of analysis are largely missing. There
are some limited efforts at the user level. For example, Kontostathis et al. propose a rule-based
communication model to track and categorize online predators. Pendar uses lexical features with
machine learning classifiers to differentiate victims from predators in an online chatting
environment. Pazienza and Tudorache propose utilizing user profiling features to detect
aggressive discussions. They use users’ online behavior histories (e.g., presence and
conversations) to predict whether or not users’ future posts will be offensive. Although their
work points out an interesting direction to incorporate user information in detecting offensive
content, more advanced user information such as users’ writing styles, posting trends, or
reputations has not been included to improve the detection rate.
Fig 2.2
2.10 Heads and dependents
The importance of the syntactic functions reaches its greatest extent in dependency grammar
(DG) theories of syntax. Every head-dependent dependency bears a syntactic function. The result
is that an inventory consisting of dozens of distinct syntactic functions is needed for each
language. For example, a determiner-noun dependency might be assumed to bear the DET
(determiner) function, and an adjective-noun dependency is assumed to bear the ATTR
(attribute) function. These functions are often produced as labels on the dependencies themselves
in the syntactic tree, e.g.
Fig 2.3
The tree contains the following syntactic functions: ATTR (attribute), CCOMP (clause
complement), DET (determiner), MOD (modifier), OBJ (object), SUBJ (subject), and VCOMP
(verb complement). The actual inventories of syntactic functions will differ from the one
suggested here in the number and types of functions that are assumed. In this regard, this tree is
merely intended to be illustrative of the importance that the syntactic functions can take on in
some theories of syntax and grammar.
2.11 Statistical parsing
CFGs can be used to parse, but ambiguous sentences cannot be disambiguated by the grammar alone, and we
would like to know the most likely parse. A corpus can be used to do that.
2.11.1 Basic idea
1. Start with a Treebank (we can say bank of trees, e.g. Penn Treebank) which is a
collection of sentences with syntactic annotation, i.e., already-parsed sentences.
2. Examine which parse trees occur frequently
3. Extract grammar rules corresponding to those parse trees, estimating the probability of
the grammar rule based on its frequency.
That is, we’ll have a CFG augmented with probabilities (PCFG).
2.11.2 Probabilistic Context-Free Grammars (PCFGs)
Definition of a PCFG:
- Set of non-terminals (N)
- Set of terminals (T)
- Set of rules/productions (P), of the form A → β
- Designated start symbol (S)
- A function D that assigns a probability to each rule in P:
D(A → β) = P(A → β)
2.11.3 Estimating Probabilities using a Treebank
- Given a corpus of sentences annotated with syntactic annotation
(e.g., the Penn Treebank)
- Consider all parse trees
- (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for
that rule
- (2) Also count the number of times A is on the left-hand side of a rule
- Divide (1) by (2): D = P(A → β | A) = Count(A → β) / Count(A)
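A minimal sketch of this maximum-likelihood estimation, assuming the treebank has already been reduced to a list of (left-hand side, right-hand side) rule occurrences:

from collections import Counter

def estimate_rule_probs(rule_occurrences):
    # P(A -> beta | A) = Count(A -> beta) / Count(A)
    rule_counts = Counter(rule_occurrences)                    # step (1): count each rule
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)   # step (2): count each left-hand side
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

# hypothetical rule occurrences read off annotated parse trees
rules = [("NP", ("Det", "N"))] * 3 + [("NP", ("Pronoun",))]
print(estimate_rule_probs(rules))   # {('NP', ('Det', 'N')): 0.75, ('NP', ('Pronoun',)): 0.25}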
2.11.4 Using Probabilities to Parse
• P (T) = probability of a particular parse tree
= the product of the probabilities of all the rules r used to expand each node n in the parse
tree
Fig 2.4
We have the following rules and probabilities
- S → VP .05
- VP → V NP .40
- NP → Det N .20
- V → book .30
- Det → that .05
- N → flight .25
P ( T ) = P ( S → VP ) * P ( VP→ V NP ) *… * P ( N → flight )
= .05 * .40 * .20 * .30 * .05 * .25 = .000015
So, the probability for that parse is 0.000015. Probabilities are useful for comparing with other
probabilities. Whereas we couldn’t decide between two parses using a regular CFG, we now can.
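Using the rule probabilities listed above, the same product can be computed directly; the parse is written here as the list of rules used to expand each node (a sketch of the calculation only, not a parser).

from functools import reduce

rule_prob = {("S", ("VP",)): 0.05, ("VP", ("V", "NP")): 0.40, ("NP", ("Det", "N")): 0.20,
             ("V", ("book",)): 0.30, ("Det", ("that",)): 0.05, ("N", ("flight",)): 0.25}

# the parse of "book that flight", written as the rules used to expand each node
parse = [("S", ("VP",)), ("VP", ("V", "NP")), ("V", ("book",)),
         ("NP", ("Det", "N")), ("Det", ("that",)), ("N", ("flight",))]

p_tree = reduce(lambda acc, r: acc * rule_prob[r], parse, 1.0)
print(p_tree)   # approximately 0.000015, as computed above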
2.11.5 Obtaining the best parse
The best parse T(S), where S is our sentence, is the tree which has the highest probability.
We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse:
- CYK is a form of dynamic programming
- CYK is a chart parser, like the Earley parser
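The sketch below is a compact probabilistic CYK-style recogniser for the toy grammar above. For every span it records the best probability with which each non-terminal can derive it, and applies the unary rule S → VP after a cell is filled; back-pointers for recovering the actual tree are omitted to keep the sketch short.

from collections import defaultdict

unary_rules = {("S", ("VP",)): 0.05}                                     # A -> B
binary_rules = {("VP", ("V", "NP")): 0.40, ("NP", ("Det", "N")): 0.20}   # A -> B C
lexical_rules = {("V", "book"): 0.30, ("Det", "that"): 0.05, ("N", "flight"): 0.25}

def cyk(words):
    n = len(words)
    chart = defaultdict(dict)   # chart[(i, j)][A] = best probability that A derives words[i:j]

    def apply_unaries(cell):
        # keep applying unary rules (here S -> VP) while they improve some probability
        changed = True
        while changed:
            changed = False
            for (a, (b,)), p in unary_rules.items():
                if b in cell and p * cell[b] > cell.get(a, 0.0):
                    cell[a] = p * cell[b]
                    changed = True

    for i, w in enumerate(words):                                        # length-1 spans
        for (a, word), p in lexical_rules.items():
            if word == w:
                chart[(i, i + 1)][a] = p
        apply_unaries(chart[(i, i + 1)])

    for span in range(2, n + 1):                                         # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                                    # every split point
                for (a, (b, c)), p in binary_rules.items():
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        cand = p * chart[(i, k)][b] * chart[(k, j)][c]
                        if cand > chart[(i, j)].get(a, 0.0):
                            chart[(i, j)][a] = cand
            apply_unaries(chart[(i, j)])
    return chart[(0, n)]

print(cyk("book that flight".split()))   # approximately {'VP': 0.0003, 'S': 0.000015}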
2.11.6 Problems with PCFGs
It’s still only a CFG, so dependencies on non-CFG information are not captured.
- e.g., Pronouns are more likely to be subjects than objects:
P [ ( NP → Pronoun ) | NP = subject ] >> P [ ( NP → Pronoun)
| NP =obj]
Ignores lexical dependency information (statistics), which is usually crucial for disambiguation
of “PP attachment ambiguity” and “Coordination ambiguity”.
- (T1) America sent [ [250,000 soldiers] [into Iraq] ]   (the PP attaches low, inside the object NP)
- (T2) America sent [250,000 soldiers] [into Iraq]   (the PP attaches high, to the verb)
Lexically, “sent” with an “into”-PP almost always prefers high attachment (T2), but a plain PCFG cannot
express this word-specific preference.
An example of Coordination ambiguity is two parses of the phrase “dogs in houses and cats”
- (T1) [ [NP dogs] in [ NP houses and cats ] ]
- (T2) [ [NP dogs in houses] and [NP cats ] ]
Here T1 is semantically wrong and T2 is correct, but both trees receive the same score. So a PCFG alone
is not enough to disambiguate parse trees; lexical dependency information is also needed.
To handle lexical information, we’ll turn to lexicalized PCFGs.
2.11.7 Lexicalized PCFGs
• Lexicalized Parse Trees
- Add “headwords” to each phrasal node. Each PCFG rule in a tree is augmented to
identify one RHS constituent to be the head daughter
- The headword for a node is set to the head word of its head daughter
- Headship not in (most) treebanks
- Usually use head rules, e.g.:
- NP:
• Take leftmost NP
• Take rightmost N*
• Take rightmost JJ
• Take right child
- VP:
• Take leftmost VB*
• Take leftmost VP
• Take left child
Fig 2.5
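The following sketch applies head rules of the simplified kind listed above to pick a head daughter. It is not the full Collins head-rule table, and the phrase representation (a list of (label, headword) pairs) is an assumption made purely for illustration.

def find_head(category, children):
    # children is a list of (label, headword) pairs for the daughters of the phrase
    labels = [label for label, _ in children]

    def leftmost(pred):
        return next((i for i, l in enumerate(labels) if pred(l)), None)

    def rightmost(pred):
        return next((i for i in reversed(range(len(labels))) if pred(labels[i])), None)

    if category == "NP":
        idx = leftmost(lambda l: l == "NP")                 # take leftmost NP
        if idx is None:
            idx = rightmost(lambda l: l.startswith("N"))    # else rightmost N*
        if idx is None:
            idx = rightmost(lambda l: l == "JJ")            # else rightmost JJ
        if idx is None:
            idx = len(labels) - 1                           # else the right child
    elif category == "VP":
        idx = leftmost(lambda l: l.startswith("VB"))        # take leftmost VB*
        if idx is None:
            idx = leftmost(lambda l: l == "VP")             # else leftmost VP
        if idx is None:
            idx = 0                                         # else the left child
    else:
        idx = 0
    return children[idx]

print(find_head("NP", [("DT", "the"), ("NN", "flight")]))       # ('NN', 'flight')
print(find_head("VP", [("VBD", "booked"), ("NP", "flight")]))   # ('VBD', 'booked')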
2.11.8 Incorporating head probabilities
• Previously, we conditioned on the mother node (A):
- P(A → β | A)
• Now, we can condition on the mother node and the headword of A (h(A)):
- P( A → β | A , h (A) )
We’re no longer conditioning on simply the mother category A, but on the mother category when
h(A) is the head.
- e.g., P ( VP → VBD NP PP | VP , dumped)
2.11.9 Calculating rule probabilities
• We calculate this by comparing how many times the rule occurs with h(n) as the
headword versus how many times the mother/headword combination appear in total:
P ( VP → VBD NP PP | VP , dumped )
= C (VP (dumped) → VBD NP PP) / Σβ C ( VP ( dumped ) → β)
2.11.10 Adding info about word-word dependencies
• We want to take into account one other factor: the probability of being a head word (in a
given context)
- P(h(n) = word | …)
• We condition this probability on two things: 1. the category of the node (n), and 2. the
headword of the mother (h(m(n)))
- P(h(n) = word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n)))
- P(sacks | NP, dumped)
• What we’re really doing is factoring in how words relate to each other
• We will call this a dependency relation later: sacks is dependent on dumped, in this case
Fig 2.6: Lexicalized parsing can be seen as producing dependency trees
2.12 Dependency Parsing
Modern dependency grammar was created by the French linguist Lucien Tesnière (1959), although
its roots may be traced back to Pāṇini’s grammar of Sanskrit (a predecessor of Bangla) several
centuries earlier. In NLP, a dependency parse tree is thought of as a ‘bridge’ between
syntactic and semantic analysis, since it gives some semantic information as well as syntactic.
Some people also argue that it is another version of chunk parsing, because careful
observation of a dependency tree reveals that every subpart of a sentence (subject, object, or
complements) appears in a different subtree or under a different relation, where each node is
dependent on another node. These subtrees, or semantically dependent nodes, can be thought of
as separate chunks.
2.12.1 Basic Concepts
In a dependency representation every node in the structure is a surface word (there are no
abstract nodes such as NP or VP), but each word may have additional attributes such as its part-
of-speech (POS) tag. The parent word is known as the head, and its children are its modifiers.
The observation which underlies DG is: in a sentence, all but one word depend on other words.
The one word that doesn’t depend on any other is called the root of the sentence. A typical DG
analysis of the sentence “A man sleeps” is demonstrated below:
A depends on man
Man depends on sleeps
Sleeps depends on nothing (it is the root of the sentence)
Or, put differently
A modifies man
Man is the subject of sleeps
Sleeps is the main verb of the sentence
This is Dependency Grammar. A formulation of dependency grammar is given below:
• Capturing relations between words is moving in the direction of dependency grammar
(DG)
• In DG, there is no such thing as constituency
• The structure of a sentence is purely the binary relations between words; A → B means
that B depends on A
Dependencies are motivated by grammatical function, both syntactically and semantically. A
word depends on another either if it is a complement or a modifier of the latter. The edge
between a parent and a child node specifies the grammatical relationship between the two words
(e.g. subj, obj, and adj).
In most formulations of DG, for example, functional heads or governors (e.g. verbs)
subcategorize for their complements. Hence, a transitive verb like ‘like’ requires two
complements (dependents), one noun with the grammatical function subject and one with the
function object.
In this research thesis, we use the Stanford Parser (jdk1.5 version) for all of the output.
Ex sentence: John likes Italian food.
Tagged output: John/NNP likes/VBZ Italian/NN food/NN
Constituent structure output:
(ROOT
(S
(NP (NNP John))
(VP (VBZ likes)
(NP (NN Italian) (NN food)))))
Dependency structure output:
nsubj(likes-2, John-1)
nn(food-4, italian-3)
dobj(likes-2, food-4)
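Represented as plain (relation, head, dependent) triples, the output above can be queried directly, for instance to list the dependents governed by a word or to find the head a word depends on. This representation is only an illustrative stand-in, not the parser's own API.

deps = [("nsubj", ("likes", 2), ("John", 1)),
        ("nn",    ("food", 4),  ("Italian", 3)),
        ("dobj",  ("likes", 2), ("food", 4))]

def dependents_of(head_word, deps):
    # all (relation, dependent) pairs governed by the given head word
    return [(rel, dep) for rel, head, dep in deps if head[0] == head_word]

def head_of(word, deps):
    # the (relation, head) pair the word depends on, or None if it is the root
    return next(((rel, head) for rel, head, dep in deps if dep[0] == word), None)

print(dependents_of("likes", deps))   # [('nsubj', ('John', 1)), ('dobj', ('food', 4))]
print(head_of("likes", deps))         # None, so 'likes' is the root of the sentence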
2.12.2 Dependency functions
2.12.2.1 Main functions
main
main element
The main element of a clause is usually a verb, but in a verb-less clause other elements may
serve as a head as well.
Ex: a sentence with a verb
He doesn't know whether to send a gift.
nsubj(know-4, He-1)
aux(know-4, does-2)
advmod(know-4, n't-3)
aux(send-7, to-6)
whether(know-4, send-7)
det(gift-9, a-8)
dobj(send-7, gift-9)
Ex: a sentence without a verb
A comprehensive grammar of the English language
det(grammar-3, A-1)
amod(grammar-3, comprehensive-2)
det(language-7, the-5)
amod(language-7, english-6)
of(grammar-3, language-7)
2.12.2.2 Verb complementation
nsubj
nominal subject
The dependency syntax collapses the classes of formal subject and ordinary subject into
one. The subject may also be a non-finite clause, that-clause, WH-clause, etc.
dobj
direct object
The notion of object is wider than that in Quirk, comprising essentially all types of
second arguments, except subject complements. The motivation is that the subtypes of
second arguments are complementary, i.e. they occupy the same valency slot. There are
both simple nominal objects and more complex objects such as a non-finite clause, that-
clause, WH-clause or quote structure.
Ex: John explained that topic
nsubj(explained-2, John-1)
det(topic-4, that-3)
dobj(explained-2, topic-4)
ccomp
clausal complement
A clausal complement is a dependent clause, with its own internal subject, that functions like an
object of the verb.
Ex: Mary said John didn't go there
nsubj(said-2, Mary-1)
nsubj(go-6, John-3)
aux(go-6, did-4)
advmod(go-6, n't-5)
ccomp(said-2, go-6)
advmod(go-6, there-7)
iobj
indirect object
Indirect object corresponds to a third argument. The prepositional dative is described
accordingly. Again, the syntactic motivation is that the prepositional phrase occupies the
same valency slot as the indirect object and is semantically equivalent to it.
Ex: I gave him my address.
nsubj(gave-2, I-1)
iobj(gave-2, him-3)
dep(address-5, my-4)
dobj(gave-2, address-5)
Other examples:
What did Pauline give Tom?
Pauline gave it to Tom.
2.12.2.3 Determinative functions
det
determiner
Central determiners (articles) or a determining pronoun. Successive determiners are
linked to each other.
Ex: This is an apple
nsubj(is-2, This-1)
det(apple-4, an-3)
dobj(is-2, apple-4)
2.12.3 Robinson’s axiom
Robinson (1970) formulated four axioms to govern the well-formedness of dependency
structures, depicted below:
1. One and only one element is independent.
2. All others depend directly on some element.
3. No element depends directly on more than one other.
4. If A depends directly on B and some element C intervenes between them (in the linear
order of string), then C depends directly on A or B or some other intervening element.
The first three axioms ensure that dependency structures are trees. Axioms 1 and 2 state that in each
sentence, only one element is independent and all others depend on some other element. Axiom 3
states that if element A depends on B, it must not depend on another element C. This
requirement is referred to as single-headedness. Axiom 4 is called the requirement of projectivity and
disallows crossing edges in dependency trees.
2.12.4 Dependency relation
Let W be a set of nodes and let a mapping M map W to the actual words of a sentence. A dependency
relation R is defined over W such that, for w1, w2 ∈ W, ⟨w1, w2⟩ ∈ R asserts that w1 is dependent on
w2. The properties of R impose the same treeness constraints on dependency graphs as Robinson’s axioms.
Ex: Mary loves another Mary, with the four words mapped to the nodes w1, w2, w3, w4 respectively.
1. R ⊂ W × W
2. ∀ w1, w2, …, wk ∈ W: if ⟨w1, w2⟩ ∈ R ∧ … ∧ ⟨wk−1, wk⟩ ∈ R, then w1 ≠ wk (acyclicity)
3. ∃! w1 ∈ W : ∀ w2 ∈ W: ⟨w1, w2⟩ ∉ R (rootedness)
4. ∀ w1, w2, w3 ∈ W: ⟨w1, w2⟩ ∈ R ∧ ⟨w1, w3⟩ ∈ R → w2 = w3 (single-headedness)
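These well-formedness conditions can be checked mechanically. The sketch below tests single-headedness, rootedness, acyclicity, and (standing in for axiom 4) the absence of crossing arcs over a set of ⟨dependent, head⟩ pairs; using word positions as node identities is a simplification made for illustration only.

def is_well_formed(words, deps):
    # deps is a list of (dependent_index, head_index) pairs over positions in `words`
    heads = {}
    for dep, head in deps:
        if dep in heads:                                   # axiom 3: single-headedness
            return False
        heads[dep] = head
    roots = [w for w in range(len(words)) if w not in heads]
    if len(roots) != 1:                                    # axioms 1-2: exactly one independent element
        return False
    for w in range(len(words)):                            # acyclicity: every head chain terminates
        seen, cur = set(), w
        while cur in heads:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    arcs = [(min(d, h), max(d, h)) for d, h in deps]       # axiom 4: no crossing arcs
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            if a1 < a2 < b1 < b2:
                return False
    return True

# "Mary loves another Mary": loves (position 1) is the root; both Marys depend on it
print(is_well_formed(["Mary", "loves", "another", "Mary"], [(0, 1), (3, 1), (2, 3)]))   # True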
2.12.5 Stanford dependency parser by Dan Klein
This parser uses features of Collins’ parser. Michael Collins, in his ‘Head-Driven Statistical
Parser’, showed a mapping of his statistical parser to the dependency relation sets. Dan Klein’s
Stanford parser deals with tagged words: pairs <w, t>. First the head <wh, th> of a constituent is
generated using the ‘Collins head finder’ method, then successive right dependents <wd, td> until a
‘stop’ token is generated, then successive left dependents until a ‘stop’ token is generated. It
supports three formats for output:
1. dependencies
2. typedDependencies
3. typedDependenciesCollapsed
For example: Factory payrolls fell in September.
Tagged output: Factory/NN payrolls/NNS fell/VBD in/IN September/NNP
Dependency structure:
nn(payrolls-2, Factory-1)
nsubj(fell-3, payrolls-2)
in(fell-3, September-5)
Fig 2.7
First, fell-VBD is chosen as the head of the sentence, then, in-IN to the right is generated, which
then generates September-NN to the right, which generates ‘stop’ token on both sides. Then
return to in-IN, generate ‘stop’ to the right, and so on. The above output is the
‘typedDependenciesCollapsed’ format of the Stanford dependency parse tree. The
‘typedDependenciesCollapsed’ format does not make separate nodes for words that are obvious in
any dependency relation in a sentence; instead it turns them into a relation between the two prominent
words. In the above example the preposition ‘in’ is used as a relation or dependency function
between the words ‘fell’ and ‘September’.
For comparison, the plain ‘typedDependencies’ format of the above sentence will be:
nn(payrolls-2, Factory-1)
nsubj(fell-3, payrolls-2)
dep(fell-3, in-4)
dep(in-4, September-5)
Fig 2.8
This example shows that the format makes a separate node ‘in’ between ‘fell’ and ‘September’, which
could instead be used as a relation to make the tree shorter in depth. This thesis uses the
‘typedDependenciesCollapsed’ format because we do not need to look at every word to
extract the necessary information.
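The collapsing step can be mimicked with a short sketch that rewrites preposition nodes into relation labels, as in the example above. The list of preposition forms and the triple representation are assumptions made purely for this illustration.

def collapse_preps(deps, preps=("in", "of", "to", "on", "at")):
    # dep(fell, in) + dep(in, September)  ->  in(fell, September)
    governed_by = {dep: head for rel, head, dep in deps if dep[0].lower() in preps}
    collapsed = []
    for rel, head, dep in deps:
        if dep in governed_by:
            continue                                              # drop the arc onto the preposition
        if head in governed_by:
            collapsed.append((head[0].lower(), governed_by[head], dep))   # relabel with the preposition
        else:
            collapsed.append((rel, head, dep))
    return collapsed

deps = [("nn", ("payrolls", 2), ("Factory", 1)),
        ("nsubj", ("fell", 3), ("payrolls", 2)),
        ("dep", ("fell", 3), ("in", 4)),
        ("dep", ("in", 4), ("September", 5))]
print(collapse_preps(deps))
# [('nn', ('payrolls', 2), ('Factory', 1)), ('nsubj', ('fell', 3), ('payrolls', 2)),
#  ('in', ('fell', 3), ('September', 5))]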
CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN
In the following sections of this chapter, existing sentence-level semantic filtering approaches and
methodologies for online social networking communities will be thoroughly examined, and issues
related to these approaches will be highlighted.
The proposed sentence-level semantic filtering approach will also be examined, and its operation
procedures, benefits, and feasibility will be expressed. Methodologies employed in acquiring the
requirement towards the successful implementation of the proposed filtering System will also be
discussed.
The design of the filtering system will be discussed from both perspectives, along with its program
components.
3.1 System Analysis
System analysis can be defined as the process of analyzing a system with the essential goal of
improving or modifying it. It can also be defined as the methodical study of a system, its current
and future required objectives, and procedures in order to form a basis for the system design.
It is the first of the three major phases in developing an information system. All the system analysis
efforts are directed towards deciding these 3 basic objectives:
1. Identify system owner and system users.
2. Define what the system will do.
3. Determine the technical, economic, and operational feasibility of the proposed system.
The purpose of analysis is to produce a clear requirements specification of the newly designed or
upgraded system efficiently and effectively. It requires the ability to analyze the essential features
of a system.
This knowledge of a system is achieved through the investigation of the system and its
environment.
3.2 Analysis of the existing system
Online social networking sites have become increasingly popular with children, especially young
teens, as a place where they can meet other people, communicate, and exchange information.
However, this medium has encouraged the wide usage of offensive language and has also brought
about a fast-growing trend that experts believe is very harmful, called cyberbullying, which has
led teenagers to suicide in very extreme cases. People have realized the problems brought by
offensive language in online communities and many efforts have been made at detecting and
eliminating the existence of offensive language within user messages. The approaches used are
discussed below.
3.2.1 Keyword Censoring Approach
Keyword censoring approaches match words appearing in user messages with offensive words
stored in a blacklist. Once found, these offensive words will be removed, partially replaced
(e.g., “b***h”), completely replaced (e.g., “******”), or substituted with family-friendly words
(e.g., “naughty”). Because of its simplicity, the keyword-based censoring approach has been widely
applied in OSN websites, such as YouTube and World of Warcraft. However, the filtering result
is not as desired; crudely removing words from user messages breaks the readability of the
messages. Replacing offensive words with symbols usually makes it easy to guess the original
offensive words. The idea of substitution seems tempting, but accurate substitution is usually
impractical. Inaccurate substitution will introduce additional issues. For example, in 2001,
Yahoo! deployed an email filter which could automatically replace certain words in emails with
family-friendly words. This filter was criticized as a “foolish filter” by BBC News because of its
inaccurate substitution.
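For reference, the baseline keyword censoring behaviour can be sketched in a few lines; the blacklist entries and their masked replacements below are hypothetical.

import re

BLACKLIST = {"fuck": "f**k", "bitch": "b***h"}     # hypothetical blacklist entries

def keyword_censor(message):
    # mask blacklisted words in place; this is the baseline approach, not the proposed filter
    pattern = re.compile("|".join(re.escape(w) for w in BLACKLIST), re.IGNORECASE)
    return pattern.sub(lambda m: BLACKLIST[m.group(0).lower()], message)

print(keyword_censor("What the fuck is wrong with you?"))   # "What the f**k is wrong with you?"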
To demonstrate the shortcoming of keyword censoring approaches, we present an example
below.
Filtering results with Keyword Censoring
Original comment: “What the fuck is wrong with you?”
Keyword Censoring: “What the f**k is wrong with you?”
According to the presented filtering results, readers can still easily understand what the offender
wants to say and can even infer the removed words. This indicates a filtering failure
because the offensive opinion has been successfully delivered to the victim. Also, removing words
from a sentence without considering their context breaks the readability of the rest of the sentence.
Compared with keyword censoring approaches, our proposed semantic filtering approach is
much more sophisticated and can achieve thorough filtering by utilizing the grammatical
relations among words in the sentence. Given a sentence containing both offensive and
inoffensive words, not only the offensive words but also the inoffensive words that assist in expressing
offensive opinions will be removed during our filtering. In this way, we essentially stop the
delivery of the offensive opinion, and there is no way to infer the offensive content of the original
message after filtering.
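A highly simplified sketch of this philosophy is given below: each offensive word is removed together with its dependency subtree, so that the remaining words still read as a sentence. The offensive-word lexicon and the hand-written dependency map stand in for the real lexicon and parser output used by the system.

OFFENSIVE = {"fuck", "idiot"}     # hypothetical offensive-word lexicon

def semantic_filter(tokens, children):
    # children maps a head's index to the indices of its dependents (simplified parser output)
    def subtree(idx):
        nodes = {idx}
        for child in children.get(idx, []):
            nodes |= subtree(child)
        return nodes

    to_remove = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in OFFENSIVE:
            to_remove |= subtree(i)                # drop the offensive word and its dependents
    return " ".join(t for i, t in enumerate(tokens) if i not in to_remove)

tokens = ["What", "the", "fuck", "is", "wrong", "with", "you", "?"]
children = {2: [1]}                                # fuck-2 governs the-1; other relations omitted
print(semantic_filter(tokens, children))           # "What is wrong with you ?"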
3.2.2 Content Control Approach
Content control approaches are usually deployed at the user side or ISP side to prevent users from
seeing inappropriate content on the Internet. Their filtering is usually done based on certain criteria,
such as URL address, the occurrence of offensive words, and topic classification. Here our focus
is on text-based criteria.
For example, consider a sentence-based content control approach whose threshold is set on the
number of offensive words in a sentence. If at least one offensive word is detected within
a sentence, the filter will remove the whole sentence from the user message.
To demonstrate the shortcoming of content control approaches, we present examples below.
Filtering results with the Content Control approach
Original comment: “What the fuck is wrong with you?”
Content Control: “ ”
However, content control approaches are too coarse-grained to be applied in online communities.
First of all, an offender can easily bypass the filtering once he or she knows the estimation criteria.
More importantly, a sentence in a user comment may contain both offensive and inoffensive content,
and the inoffensive part may be removed falsely because of the offensive part. Not allowing users to post
inoffensive content would easily drive them away and thus affect the growth of the community.
Compared with content control approaches, we provide a fine-grained filtering by removing only
the smallest syntactic part in the sentence containing offensive language. The inoffensive content
in the original message will remain; thereby, user still has the freedom of speech for posting
inoffensive content. We believe such delicate filtering will be more acceptable to online
communities.
3.2.3 Manual Filtering Approach
Manual filtering is believed to produce the best filtering result. Basically, user messages are reviewed by the community administrator before being posted on the website.
Filtering results with Manual Filtering Approach
Original comment: “What the fuck is wrong with you?”
Manual Filtering: “What is wrong with you?”
As shown above, the administrator is able to easily understand what the author wants to express
and precisely remove only the offensive content within the message.
However, manual filtering is very time- and labor-consuming, making it impossible to apply widely. For example, in the Linda-Ikeji blog community (http://lindaikeji.blogspot.com), the blog administrator manually reviews and filters user comments on some celebrities' public blogs. Obviously, users then experience a delay between posting a comment on a blog and the comment appearing on the blog's webpage. Further, the filtering relies entirely on the judgment of the community administrator. Our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words in order to remove the offensive content semantically. The proposed semantic filtering approach is fully automatic, requiring no intervention from any administrator.
3.3 Problem of the existing approaches
From the study of the existing approaches and based on the information provided above, the following problems have been identified:
 Using the keyword censoring approach, readers can still easily understand what the offender wants to say and can even infer the removed words. This indicates a failure of the filtering, because the offensive opinion has still been delivered to the victims. Also, removing words from a sentence without considering their context breaks the readability of the rest of the sentence.
 The content control approaches are too coarse-grained to be applied in online communities. An offender can easily bypass the filtering once the estimation criteria are known and, more importantly, a sentence in a user comment may contain both offensive and inoffensive content. The inoffensive part may be removed falsely because of the offensive part, and not allowing users to post inoffensive content would easily drive users away and thus affect the growth of the community.
 The manual filtering approach is very time- and labor-consuming. The administrator has to manually review and filter all the users' comments and messages, making it impossible to apply widely. Also, the filtering relies entirely on the judgment of the community administrator.
3.4 Proposed Filtering Philosophy
The goal of our semantic filtering is to achieve filtering results close to those of manual filtering.
To reach this goal, the foremost thing is to answer the question about how the filtering should be
performed in order to get the desired filtering results. In this section, we present our answer in
three steps. First, we analyze the characteristics of offensive text content in user messages. Then,
we introduce our filtering philosophy according to the summarized characteristics. Finally, we
show how this philosophy is transformed into heuristic rules applicable in the filtering process.
3.4.1 Offensive Language Text Content
Based on the observation on user comments collected from YouTube website, a sentence in a
user message may contain both offensive and inoffensive text content. Offensive text content is
exposed intentionally with purpose of bringing negative influence to victims (e.g., the readers of
message). The victim receives the negative influence by reading the offensive part of sentence
and understanding the carried offensive information.
Hence, the information carried by original sentence can be represented as
I = Ioff + Iinoff
The offender reaches his goal when the offensive information Ioff is delivered to readers.
Therefore, to achieve a thorough filtering, all words used to deliver Ioff should be removed.
Meanwhile, with respect to free speech, the part with Iinoff should be saved.
3.4.2 Filtering Philosophy
According to the analysis, we propose the philosophy that should be followed in sentence-level
offensive language filtering:
 Precisely identify all offensive contents and remove them semantically, so that viewers
will not notice the existence of offensive language in the original sentence;
 Keep the readability and inoffensive content in the sentence, so that the author will still
be allowed to express his opinion freely as long as it is not offensive;
This is called the philosophy of “filtering instead of blocking”. To the filter, the philosophy
states that: if removing one word will make another word meaningless or confusing to readers,
we should consider removing both words to keep the readability of a filtered sentence;
meanwhile, we only remove words that are affected by offensive words.
For example, consider the sentence “Samuel said it and what the fuck is wrong with what he said?”, and suppose “fuck” is the only offensive word. The sentence can be separated into two parts: the first part, “Samuel said it”, is inoffensive, while the second part, “what the fuck is wrong with what he said?”, is offensive. Therefore, we should remove the offensive word in the second part while keeping the first part and still leaving the sentence meaningful and readable, i.e. we
won't have:
Samuel said it and what the is wrong with what he said? (Wrong)
But
Samuel said it and what is wrong with what he said? (Correct)
The words “the” and “fuck” must be removed in order to keep the transparency of filtering as
well as the readability of filtered text content.
3.4.3 Filtering Rules
Specifically, the proposed philosophy is transformed into two heuristic rules to estimate the
impact of removing words in a sentence.
Rule 1. (Modification Relation) In a modification relation, if the modifier is determined to be offensive, removing the modifier alone is enough; if the head is determined to be offensive, both the head and the modifier should be removed.
The modification relation is a binary semantic relationship between two syntactic elements, such as words or phrases. One element is named the head and the other the modifier. The modifier is used to describe the head (i.e. the modified component). Semantically, modifiers describe and provide a more accurate definitional meaning for the head. As the modifier acts as a complement, removing the modifier typically does not affect the grammaticality of the construction. For example, in the sentence “she likes red apples.”, the adjective “red” modifies the noun “apples”; removing “red” keeps the readability of the rest of the sentence. We admit that removing modifiers loses some of the information they carry. However, if the modifier is determined removable but the head is not, removing the modifier removes only the offensive information.
Rule 2. (Pattern Integrity) If removing the offensive word breaks the integrity of the sentence's basic pattern, the whole sentence should be removed in order to keep the readability.
English sentences and clauses are organized in basic patterns, such as “Subject-Verb”, “Subject-Verb-Object”, “Subject-Verb-Adjective”, “Subject-Verb-Adverb”, and “Subject-Verb-Noun”. Every sentence or clause can be categorized into one pattern. The integrity of the basic pattern is essential to the readability of the content. For example, the sentence “she sleeps on the sofa.” follows the “Subject-Verb” pattern. If we only remove “sleeps”, the rest of the sentence, “she on the sofa.”, becomes meaningless.
We will be applying these two rules during the filtering of the sentences.
3.5 Identify Removable Content by Grammatical Relations
A text or user message can be decomposed into a sequence of sentences. Each sentence is
considered as a unit in filtering. Given a sentence containing both offensive words and
inoffensive words, the goal of filtering is to identify inoffensive words which should be removed
together with offensive words. We define the words that should be removed by the filtering as
“removable” words.
We noticed that manual filtering can easily achieve this goal because humans can easily understand the context of words in a sentence and precisely identify which words should be removed together with the known offensive words. So, we mimic manual filtering: we extract the grammatical relations among the words of a sentence and use the proposed filtering rules to estimate the impact of removing offensive words on the other, inoffensive words based on the extracted grammatical relations.
Specifically, the proposed approach includes two steps. In the first step, we scan the sentence and
see if offensive words exist. If offensive words exist, we continue to retrieve grammatical
information (i.e. Part-of-Speech tags and typed dependency relations) among words in the
sentence. Using retrieved grammatical information, we create a tree data structure, named
RelTree, for the second step estimation. In this second step, we propose a set of estimation
functions following the filtering rules we proposed. Using the RelTree structure and the proposed
rules, we then estimate if there are inoffensive words that should be removed together with those
identified offensive words.
An overview of our semantic filtering approach is shown in Algorithm 1 below. Within the algorithm, the functions POStagging and TDgenerator generate Part-of-Speech tags and typed dependency relations, respectively; we use existing NLP (Natural Language Processing) tools to implement these two functions. We focus on the design of the two other functions, CreateRelTree and EstimateRelTree.
In this methodology, we are assuming that the filtering is based on a comprehensive offensive
lexicon containing all offensive words. Words that do not appear in the lexicon are considered
inoffensive.
input : a text comment T,
        a blacklist of offensive words Blacklist
output: a filtered text comment T′
1  T′ ← “”;
2  senList ← chunk T into a list of sentences;
3  foreach sentence s ∈ senList do
4      scan s for offensive words using Blacklist;
5      if no offensive word found then
6          T′ ← T′ + s;
7      end
8      else
9          PTree ← POStagging(s);                                /* get parse tree */
10         TDset ← TDgenerator(s);                               /* get typed dependency relations */
11         RelTree ← CreateRelTree(PTree, TDset);                /* create RelTree */
12         LabelRelTree ← EstimateRelTree(RelTree, Blacklist);   /* estimate using RelTree */
13         s′ ← remove all words in LabelRelTree that are labeled as “removable”;
14         T′ ← T′ + s′;
15     end
16 end
17 Return T′;
Algorithm 1: Procedure of Semantic Filtering
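For illustration, the skeleton below expresses the control flow of Algorithm 1 in Java. The four injected functions are hypothetical hooks standing in for POStagging, TDgenerator, CreateRelTree and the combined estimation-and-removal step; the sentence chunking and the offensive-word scan are deliberately naive so the sketch compiles on its own.

import java.util.Arrays;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

public class SemanticFilter {

    public static String filter(String comment,
                                Set<String> blacklist,
                                Function<String, Object> posTagging,                 // s -> parse tree
                                Function<String, Object> tdGenerator,                // s -> typed dependencies
                                BiFunction<Object, Object, Object> createRelTree,    // (PTree, TDset) -> RelTree
                                BiFunction<Object, Set<String>, String> estimateAndClean) {
        StringBuilder filtered = new StringBuilder();                  // line 1: T' <- ""
        for (String s : comment.split("(?<=[.!?])\\s+")) {             // line 2: chunk into sentences (naive)
            boolean offensive = Arrays.stream(s.toLowerCase().split("\\W+"))
                                      .anyMatch(blacklist::contains);  // line 4: scan for offensive words
            if (!offensive) {
                filtered.append(s).append(' ');                        // line 6: keep the sentence as-is
            } else {
                Object pTree   = posTagging.apply(s);                  // line 9
                Object tdSet   = tdGenerator.apply(s);                 // line 10
                Object relTree = createRelTree.apply(pTree, tdSet);    // line 11
                String cleaned = estimateAndClean.apply(relTree, blacklist); // lines 12-13
                filtered.append(cleaned).append(' ');                  // line 14
            }
        }
        return filtered.toString().trim();                             // line 17
    }
}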
3.5.1 First Step: Grammatical Analysis
In the first step, we extract two types of grammatical information from a given sentence. One is
the Part-of-Speech information associated with every word. The other is the dependency relation
among words. Part-of-Speech information helps us to understand the organization of a sentence,
which is essential for keeping the readability when we try to remove words from a sentence.
Dependency relations will be used directly to estimate the impact of removing one word on other
semantically related words, making the filtering more “meaningful”. Combining these two types
of information, we can create a new data structure, called RelTree, for the next-step estimation.
3.5.1.1 Part of Speech Tagging
Part-of-Speech tagging has been widely used in Natural Language Processing applications to
identify the syntactic properties of lexical items in a sentence, such as words or phrases. Through Part-of-Speech tagging, the sentence can be represented as a tree structure based on Part-of-Speech tags. We adopt the Penn Treebank tag set for our Part-of-Speech tagging.
An example of Penn Treebank style parse tree is shown in Figure 1 below.
Figure 1: A parse tree of a sentence based on Part-of-Speech tags
Here, the leaf nodes are words appearing in the sentence. The non-leaf nodes represent syntactic
elements, such as phrases or clauses. Each element consists of the words within its subtree. For
example, the words “said” and “it” constitute a Verb Phrase (i.e. VP) node.
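As an illustration of how such a parse tree can be obtained in practice, the sketch below uses the Stanford CoreNLP pipeline. This is an assumption for the sake of the example: we only require that some existing tagger/parser produces Penn Treebank style output, and the sketch assumes CoreNLP 3.9 or later with its English models on the classpath.

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.Tree;
import java.util.Properties;

public class PosTaggingDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // tokenize, split sentences, POS-tag and build a constituency parse tree
        props.setProperty("annotators", "tokenize,ssplit,pos,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Samuel said it and what is wrong with what he said?");
        pipeline.annotate(doc);

        for (CoreSentence sentence : doc.sentences()) {
            System.out.println(sentence.posTags());      // Penn Treebank POS tags, one per token
            Tree parse = sentence.constituencyParse();   // parse tree of the kind shown in Figure 1
            parse.pennPrint();                           // bracketed Penn Treebank format
        }
    }
}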
3.5.1.2 Typed Dependency Relations
Typed dependencies are general relations describing the grammatical dependencies within a sentence, proposed by the Stanford Natural Language Processing Group. Each typed dependency consists of a dependency type and a (governor, dependent) word pair. For example, in the sentence “what the fuck is wrong with what he said?”, the typed dependency amod(wrong, fuck) means that “fuck” is an adjectival modifier of a noun phrase containing “wrong”. A typed dependency may represent the dependency relation between two syntactic elements, not only between single words.
Fig 2: An example of typed dependency graph
The typed dependencies in a sentence can be represented as a graph. For example, Figure 2 shows the typed dependency relations for the same sentence shown in Figure 1. We explain the relations appearing in Figure 2 from left to right: the nominal subject relation, nsubj(it, Samuel), means that “Samuel” is the syntactic subject of the clause (likewise for nsubj(wrong, he)); the copula relation, cop(it, said), means that “it” is the complement of the verb “said” (likewise for cop(wrong, is)); the determiner relation, det(fuck, the), means that “the” is a determiner of “fuck”; the adjectival modifier relation, amod(fuck, wrong), means that “fuck” serves as an adjectival modifier of “wrong”; and the conjunct relation, conj_and(it, wrong), means that the coordinating conjunction “and” connects two elements with heads “it” and “wrong”, respectively.
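The typed dependencies can be extracted with the same kind of tooling. The sketch below, again assuming Stanford CoreNLP with its depparse annotator, prints every relation in the type(governor, dependent) form used above. Note that recent CoreNLP releases emit Universal Dependencies, whose relation names may differ slightly from the original Stanford typed dependency names used in this chapter.

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import java.util.Properties;

public class TypedDependencyDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Samuel said it and what is wrong with what he said?");
        pipeline.annotate(doc);

        SemanticGraph graph = doc.sentences().get(0).dependencyParse();
        for (SemanticGraphEdge edge : graph.edgeListSorted()) {
            // prints one line per relation, e.g. nsubj(...), det(...), conj(...)
            System.out.printf("%s(%s, %s)%n",
                    edge.getRelation().toString(),
                    edge.getGovernor().word(),
                    edge.getDependent().word());
        }
    }
}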
3.5.1.3 Relation Tree (RelTree)
Both the Part-of-Speech tags and the typed dependency relations are utilized in the second-step estimation. The parse tree shows the syntactic organization of the sentence, and the typed dependency relations provide semantic information among the words. To combine both kinds of information, we propose a new data structure called RelTree.
In a RelTree, the leaf nodes are the words in the sentence, and each non-leaf node represents either a phrase or a clause inside the sentence. With each non-leaf node, we associate the set of typed dependency relations over the words within its subtree. Each node only contains the typed dependency relations that have not already appeared in the nodes of its subtree.
Figure 3: A RelTree combining the parse tree and typed dependency relations
input : a parse tree PTree,
        a set of typed dependency relations TDset
output: a RelTree RelTree
1  RelTree ← PTree;
2  Remove all word nodes in RelTree;
3  Traverse RelTree in postorder; foreach node n visited do
4      if n is a leaf node then
5          n.wordset ← {n};                              /* create word nodes */
6      end
7      if n is not a leaf node then
8          n.wordset ← ∅;
9          foreach direct child node ci do
10             n.wordset ← n.wordset ∪ ci.wordset;
11         end
12         n.rel ← ∅;
13         foreach relation Ti(Gi, Di) in TDset do
14             if Gi ∈ n.wordset and Di ∈ n.wordset then
15                 n.rel ← n.rel ∪ {Ti(Gi, Di)};
16                 TDset ← TDset − {Ti(Gi, Di)};
17             end
18         end
19     end
20 end
21 Return RelTree;
Algorithm 2: create a RelTree using the parse tree and typed dependency relations
The RelTree data structure is proposed only for the convenience of the offensiveness estimation in the next step. Algorithm 2 shows the algorithm for RelTree construction. With the parse tree PTree given, the computational complexity of CreateRelTree depends on the post-order traversal and the search in TDset. As the number of relations never exceeds N(N − 1)/2, where N is the number of words in the sentence, the computational complexity is O(N³). This computational complexity is acceptable, and there are many ways to improve the efficiency of the implementation of this algorithm.
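One possible Java realisation of the RelTree node and of the core of Algorithm 2 is sketched below. The class and field names are illustrative assumptions; the sketch also assumes the constituency tree has already been converted into RelTreeNode objects with word leaves, and it uses plain strings for words where a full implementation would use token indices.

import java.util.*;

// One node of a RelTree. Leaf nodes hold a single word; non-leaf nodes hold
// the words of their subtree plus the typed dependencies first covered there.
class RelTreeNode {
    final String word;                                    // null for non-leaf (phrase/clause) nodes
    String posTag;                                        // Part-of-Speech tag for leaf nodes (e.g. NN, JJ, CC)
    final List<RelTreeNode> children = new ArrayList<>();
    final Set<String> wordSet = new LinkedHashSet<>();    // words covered by this subtree
    final List<TypedDep> relations = new ArrayList<>();   // relations first covered at this node
    boolean removable = false;

    RelTreeNode(String word) { this.word = word; }
    boolean isLeaf() { return children.isEmpty(); }
}

// A typed dependency as a (type, governor, dependent) triple; a real implementation
// would use token indices rather than plain strings to distinguish repeated words.
class TypedDep {
    final String type, governor, dependent;
    TypedDep(String type, String governor, String dependent) {
        this.type = type; this.governor = governor; this.dependent = dependent;
    }
}

class RelTreeBuilder {
    // Algorithm 2: attach each typed dependency to the lowest node whose word set
    // contains both its governor and its dependent, filling word sets bottom-up.
    static void annotate(RelTreeNode node, Set<TypedDep> tdSet) {
        if (node.isLeaf()) {
            node.wordSet.add(node.word);
            return;
        }
        for (RelTreeNode child : node.children) {
            annotate(child, tdSet);                       // post-order traversal
            node.wordSet.addAll(child.wordSet);
        }
        for (Iterator<TypedDep> it = tdSet.iterator(); it.hasNext(); ) {
            TypedDep td = it.next();
            if (node.wordSet.contains(td.governor) && node.wordSet.contains(td.dependent)) {
                node.relations.add(td);
                it.remove();                              // each relation ends up in exactly one node
            }
        }
    }
}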
3.5.2 Step Two: Bottom-Up Estimation
In the second step, we first use the offensive lexicon to identify offensive words in the sentence.
The leaf node with an offensive word will be labeled as “removable”. Starting from leaf nodes,
we perform bottom-up estimation through a postorder traversal on the RelTree.
For each non-leaf node in the RelTree, we estimate whether it should be removed based on (1)
the associated typed dependency relations and (2) its child nodes within its subtree. If a non-leaf
node is estimated to be “removable”, all its descendants, including words, within its subtree will
also be labeled as “removable”. The meaning of “removable” to a non-leaf node is that all words,
phrases, or even clauses within its subtree have been determined to be removed at the end of
filtering. The estimation process includes two steps. We first estimate based on typed
dependency relations, and then apply a set of heuristic rules as complements.
3.5.2.1 Estimation with Typed Dependency Relations
Consider a non-leaf node n in a RelTree with a set n.rel of typed dependency relations. Each
relation describes a semantic connection between a governor word and a dependent word. Both
words are leaf nodes in the subtree rooted at n. n.rel could be empty when n only has one child
node. For each typed dependency relation in n.rel, we study its semantic information and map it
to an estimation function.
These estimation functions and mapping are created following the Modification Relation and
Pattern Integrity rules. Take the Direct Object (dobj) relation as an example. The dobj(G, D) relation is defined as follows: the direct object of the verb phrase containing the governor word G is the noun phrase containing the dependent word D. For example, in the relation dobj(win, match), “win” is the governor word and “match” is the dependent word. According to the Pattern Integrity rule, we know that “Subject-Verb-Object” is a basic pattern. Therefore, if either the phrase containing G or the phrase containing D is to be removed because of offensiveness, both phrases should be removed together.
To formalize, we define an estimation function H(T) = H(P(G)) OR H(P(D)) and map the relation dobj(G, D) to it. We use the symbols C(G) and P(G) to denote the clause and the phrase containing word G as head, respectively. In this estimation function, H(T) is the label to be assigned to the relation T, and H(P(G)) is the label of the phrase node containing G in the RelTree.
Using the estimation function, we generate a label for every relation associated with node n and
then for the node itself. If a relation T(G,D) of node n is estimated and labeled as “removable”,
the two child nodes of n, containing word G and word D, will be labeled as “removable”. If all
relations in n.rel are labeled as “removable”, the node n as well as all its descendants, will be
labeled as “removable”.
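To make the mapping from relation types to estimation functions concrete, a minimal sketch is given below (assuming Java 9+ for Map.of). The Label enum, the function shape and the RULES map are illustrative assumptions; only the dobj rule H(T) = H(P(G)) OR H(P(D)) from the text is encoded, and a complete filter would register one function per typed dependency relation used in the filtering.

import java.util.Map;
import java.util.function.BinaryOperator;

enum Label { KEEP, REMOVABLE }

class EstimationFunctions {

    // H(T) = H(P(G)) OR H(P(D)) for the dobj relation: the "Subject-Verb-Object"
    // pattern loses its integrity if either the verb phrase or the object phrase goes,
    // so the relation is removable as soon as one of the two phrases is removable.
    static final BinaryOperator<Label> DOBJ =
            (govPhrase, depPhrase) ->
                    (govPhrase == Label.REMOVABLE || depPhrase == Label.REMOVABLE)
                            ? Label.REMOVABLE : Label.KEEP;

    // One entry per typed dependency type; a full system would map every relation
    // type used in the filtering to its own function derived from the two rules.
    static final Map<String, BinaryOperator<Label>> RULES = Map.of("dobj", DOBJ);
}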
3.5.2.2 Estimation with Heuristic Rules
Heuristic rules are also applied as a complement after the typed dependency relation estimation. Applying heuristic rules is necessary mainly for two reasons. First of all, the typed dependency relations contain some, but limited, syntactic information. For example, the possessive ending (i.e. POS) tag, which is quite a common Part-of-Speech tag, is ignored during typed dependency tagging.
Secondly, not all relations between syntactic elements in a sentence can be classified into one of the typed dependency relations. For such uncertain relations, a generic grammatical relation named dep is defined. To prevent confusing the filter, we map dep to the rule H(T) = H(G) AND H(D), which means that labeling either G or D as removable does not, by itself, affect the other or the label of T. Because the dep relation stands for an uncertain relation, we have to rely on the Part-of-Speech tags in the RelTree for our filtering.
Take the conj tag node rule as an example. The conjunct relation (conj) is a relation between two syntactic elements connected by a coordinating conjunction, such as “and”. The parameters of conj do not include the coordinating conjunction itself; however, in the sentence, the coordinating conjunction sits between the two parameters of conj. If one side is determined removable, the coordinating conjunction should be removed as well. For example, in the sentence “I like A and B”, if either A or B is removed, the coordinating conjunction “and” should also be removed.
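A sketch of this heuristic, reusing the illustrative RelTreeNode structure from the earlier sketch (with its posTag field for leaf nodes), is given below. The assumption is that a leaf tagged CC that sits between two sibling nodes is removed whenever either sibling is removable.

import java.util.List;

class ConjHeuristic {
    // Heuristic for the conj relation: a leaf carrying the POS tag "CC" (coordinating
    // conjunction) that sits between two sibling nodes is marked removable whenever
    // either of those siblings is removable, so no dangling "and"/"or" survives filtering.
    static void apply(RelTreeNode parent) {
        List<RelTreeNode> kids = parent.children;
        for (int i = 1; i < kids.size() - 1; i++) {
            RelTreeNode middle = kids.get(i);
            if (middle.isLeaf() && "CC".equals(middle.posTag)
                    && (kids.get(i - 1).removable || kids.get(i + 1).removable)) {
                middle.removable = true;
            }
        }
    }
}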
Figure 4: Estimate a RelTree in a bottom-up manner
3.5.2.3 Estimation Algorithm
To estimate and assign labels for all nodes in a RelTree, we perform the estimation also in a
bottom-up manner. Figure 4 shows an example estimation process. The number in the circle
represents the order of estimation for each node in the RelTree. The dashed nodes are estimated
as “removable”. For example, the clause node with nsubj(you, fuck) is estimated as “removable”
according to the estimation. Therefore, its two child nodes containing “you” and “fuck”
respectively are both labeled as “removable”. Moreover, the word “and” is removable according
to the heuristic rule (i.e. conj tag node rule), in order to keep the filtering transparent to readers.
Finally, inoffensive words, “what”, “the”, “is”, “wrong”, “with”, “he”, and “said”, are removed
with the offensive word, “fuck” in the filtering.
According to Algorithm 2, each typed dependency relation will appear exactly once in the
RelTree. No relation will be checked repeatedly in the estimation. The cleaned sentence after
filtering in this example will be “Samuel said it.”. As we can see, the result satisfies the
requirement of our proposed filtering philosophy. Only the offensive part, “what the fuck is
wrong with what he said”, is removed. The reader can still get the inoffensive information. The
detailed algorithm for estimation process is presented below.
input : a RelTree RelTree,
        a blacklist of offensive words Blacklist
output: a labeled RelTree LabelRelTree
1  LabelRelTree ← RelTree;
2  Label all leaf nodes with offensive words as “removable” in LabelRelTree;
3  Traverse LabelRelTree in postorder; foreach node n visited do
4      if n is a leaf node then
5          ignore;                                       /* already labeled */
6      end
7      if n is not a leaf node then
8          if n only has one child node then
9              n.label ← n.child.label;
10         end
11         if n has more than one child node then
12             Estimate the label for n from its associated relations and child labels,
               using the proposed estimation functions and heuristic rules;
13         end
14     end
15 end
16 Return LabelRelTree;
Algorithm 3: estimate nodes in a RelTree
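For illustration, the sketch below renders Algorithm 3 in Java, reusing the illustrative RelTreeNode, TypedDep, Label, EstimationFunctions and ConjHeuristic sketches from earlier in this chapter. The way labels propagate from relations to child subtrees is a simplified assumption that follows the description above; it is not a drop-in implementation.

import java.util.Set;

class RelTreeEstimator {

    // Bottom-up estimation over the RelTree (Algorithm 3).
    static void estimate(RelTreeNode node, Set<String> blacklist) {
        if (node.isLeaf()) {
            node.removable = blacklist.contains(node.word.toLowerCase());   // label offensive leaves
            return;
        }
        for (RelTreeNode child : node.children) {
            estimate(child, blacklist);                                     // post-order traversal
        }
        if (node.children.size() == 1) {
            node.removable = node.children.get(0).removable;                // single child: copy its label
            return;
        }
        // Multiple children: evaluate every typed dependency attached to this node.
        boolean allRemovable = !node.relations.isEmpty();
        for (TypedDep td : node.relations) {
            Label gov = labelOfChildContaining(node, td.governor);
            Label dep = labelOfChildContaining(node, td.dependent);
            Label result = EstimationFunctions.RULES
                    .getOrDefault(td.type,                                  // unmapped types fall back to the
                            (g, d) -> (g == Label.REMOVABLE && d == Label.REMOVABLE)  // dep rule: H(G) AND H(D)
                                    ? Label.REMOVABLE : Label.KEEP)
                    .apply(gov, dep);
            if (result == Label.REMOVABLE) {
                markChildContaining(node, td.governor);                     // a removable relation drags both
                markChildContaining(node, td.dependent);                    // phrases out of the sentence
            } else {
                allRemovable = false;
            }
        }
        ConjHeuristic.apply(node);                                          // heuristic rules as a complement
        if (allRemovable) {
            markSubtree(node);                                              // all relations removable => whole node goes
        }
    }

    private static Label labelOfChildContaining(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children) {
            if (c.wordSet.contains(word)) {
                return c.removable ? Label.REMOVABLE : Label.KEEP;
            }
        }
        return Label.KEEP;
    }

    private static void markChildContaining(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children) {
            if (c.wordSet.contains(word)) {
                markSubtree(c);
            }
        }
    }

    private static void markSubtree(RelTreeNode n) {
        n.removable = true;
        for (RelTreeNode c : n.children) {
            markSubtree(c);
        }
    }
}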
CHAPTER FOUR
IMPLEMENTATION
4.1. JUSTIFICATION OF PROGRAMMING LANGUAGE USED.
The offensive language filtering system is an online application implemented using HTML, JavaServer Pages (JSP), JavaScript, and the MySQL relational database software.
4.1.1 HTML
HTML, which stands for Hypertext Markup Language, is the predominant markup language for web pages. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs and lists, as well as for links, quotes and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of “tags” surrounded by angle brackets within the webpage content. It can embed or load scripts in languages such as JavaScript, which affect the behaviour of HTML processors like web browsers, and Cascading Style Sheets (CSS), which define the appearance and layout of text and other material.
4.1.2 JAVASCRIPT
JavaScript has been around for several years now, in many different flavors. The main benefit of
JavaScript is to add additional interaction between the web site and its visitors at the cost of a
little extra work by the web developer. JavaScript allows industrious web masters to get more out
of their website than HTML and CSS can provide.
By definition, JavaScript is a client-side scripting language. This means the web surfer's browser will be running the script. The opposite of client-side is server-side scripting, which occurs in a language like PHP; those scripts are run by the web hosting server.
There are many uses (and abuses!) for the powerful JavaScript language. Here, it is being used
for:
 Alert Messages
 Popup Windows
 HTML Form Data Validation
4.1.3 JAVASERVER PAGES (JSP)
JSP (JavaServer Pages) is an HTML-embedded server-side technology whose goal is to allow developers to write dynamically generated pages quickly. It is specifically designed for creating dynamic web pages. JSP will allow you to:
 Reduce the time to create large websites.
 Create a customized user experience for visitors based on information that you have
gathered from them.
 Open up thousands of possibilities for online tools.
Unlike some other server-side technologies, JSP has freely available open source implementations. When someone visits your JSP webpage, your web server processes the Java code. It determines which parts it needs to show to visitors (content and pictures) and hides the rest (file operations, calculations, etc.), then translates your JSP into HTML. After the translation into HTML, it sends the webpage to your visitor's web browser.
4.1.4 MYSQL
MySQL is the most popular open source database server in existence because of its consistent fast
performance, high reliability and ease of use. It's used in more than 6 million installations ranging
from large corporations to specialized embedded applications on every continent in the world. It
is very commonly used in conjunction with PHP scripts to create dynamic and powerful server
applications. MySQL has been criticized in the past because it does not have all the features of other Database Management Systems. However, MySQL continues to improve significantly with each major upgrade, and it has gained great popularity because of these improvements.
4.1.5 CSS
Cascading Style Sheets (CSS) are a way to control the look and feel of HTML documents in an organized and efficient manner. CSS enables us to add new looks to existing HTML, completely restyle a web site with only a few changes to the CSS code, and reuse the “style” we create on any webpage we wish. With CSS you will be able to:
 Add new looks to your old HTML
 Completely restyle a web site with only a few changes to your CSS code
 Use the "style" you create on any webpage you wish
4.2 System Specification
The system specification is divided into two parts:
1. Hardware Specification
2. Software Specification
4.2.1 HARDWARE SPECIFICATION FOR THE APPLICATION
Any computer tagged by the manufacturer as a workstation can be used to access this application
using the internet browser, but the following minimum specification would be required to host
the application:
1. A computer tagged by the manufacturer as a server
2. A Core 2 Duo processor or better
3. 2GB of memory or more
4. A keyboard and a mouse
5. A hard disk of 120GB and above
4.2.2 SOFTWARE SPECIFICATION FOR THE APPLICATION
 Windows Server 2005 and above
 Microsoft .NET framework version 3.0 and above must be installed
 Microsoft SQL Server 2005 and above should be installed
 Microsoft Internet Information Server (IIS) should be enabled
 Server FTP capability must be enabled
4.3 System Implementation
This section briefly describes the screens of the online application.
4.3.1 Application Login Screen
This system contains a secure login panel that requires a combination of email address and
password. The email address is used because it is meant to be unique.
Fig 4.1 – Web Application Login Screen
4.3.2 Application Registration Page
FIG. 4.2 – Web Application Registration Page
Here the user fills in his/her details and the system verifies that all the details provided are correct. The page also includes a CAPTCHA image, which acts as a spam guard to ensure that the data was entered by a human and not a robot.
4.3.3 Post and Comment Page
FIG. 4.3 – Filtered Post Page Using Keyword Censoring Approach
FIG. 4.4 – Filtered Post Page Using Content Control Censoring Approach
FIG. 4.5 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above, we can see that our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words, and it has removed the offensive content semantically. The proposed semantic filtering approach is fully automated, requires no intervention from an administrator, and at the same time eliminates the offensive words in the sentence.
“What the fuck is wrong with you?” has been changed to “What is wrong with you?” using the proposed semantic filtering approach, instead of “What the f*** is wrong with you?”, which still delivers the offensive words to the victims.
Our semantic filtering result is also close to that of manual filtering, as the desired results have been produced just by applying the heuristic rules in the filtering process.
FIG. 4.6 – Filtered Post Page Using Keyword Censoring Approach
FIG. 4.7 – Filtered Post Page Using Content Control Censoring Approach
FIG. 4.8 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above in Fig. 4.6, 4.7 and 4.8, we can see that our proposed semantic filtering approach again mimics the procedure of manual filtering by trying to understand the relations among words and has removed the offensive content semantically. The proposed semantic filtering approach is fully automated, requires no intervention from an administrator, and at the same time eliminates the offensive words in the sentence.
“I have told all these bitches to stop calling my husband’s phone” has been changed to “I have told all to stop calling my husband’s phone” using the proposed semantic filtering approach, instead of “I have told all these b****** to stop calling my husband’s phone”, which still delivers the offensive words to the victims.
Our semantic filtering result is also close to that of manual filtering, as the desired results have been produced just by applying the heuristic rules in the filtering process.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary and Conclusion
Online social networking sites have become increasingly popular with children, especially young
teens, as a place where they can meet other people, communicate, and exchange information.
This has also brought cyberbullying which is a fast growing trend that experts believe is more
harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via online social
networking communities. Victims can be reached anytime and at anyplace. For many children,
home is no longer a refuge from the bullies. Children can escape threats and abuse in the
classroom, only to find offensive comments and posts from the same tormentors when they
arrive home. There’s no safe place anymore and one can be bullied 24/7; even in the privacy of
his/her own bedroom.
However, we are not only trying to filter out offensive content but also making sure that the filtered sentences still make sense. Statistical analysis has revealed that more than 60% of insulting messages are posted as direct insults, and direct insulting messages always contain insulting words or phrases. From a psychological point of view, if these messages are categorized and users are restricted from sending such messages, the human intention to post or exchange abusive messages can be significantly reduced.
Offensive language is a serious problem facing the online community. Our semantic filtering
technique is based on the grammatical relations of words in a sentence so that the rest of the
filtered sentence is readable and the existence of offensive words in the original sentence is hard
to notice. We tested the effectiveness of our approach with a large dataset, and the results show that our techniques are very effective and accurate with little processing overhead.
5.2 Recommendation
Our future work includes addressing the issues discussed above. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other lightweight NLP techniques to speed up sentence parsing. Last but not least, we also plan to extend our filtering approach to support other languages such as Chinese and French.

More Related Content

What's hot

THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...Kasthuripriya Nanda Kumar
 
The Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student ExperienceThe Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student Experience
Liz Gross, Ph.D.
 
Research Paper - Facebook
Research Paper - FacebookResearch Paper - Facebook
Research Paper - Facebook
GuiM _
 
The use of social media among nigerian youths.2
The use of social media among nigerian youths.2The use of social media among nigerian youths.2
The use of social media among nigerian youths.2
Lami Attah
 
The Effects on Social Networking on Education
The Effects on Social Networking on EducationThe Effects on Social Networking on Education
The Effects on Social Networking on Education
Nash Nash
 
Cyberbullying Resources
Cyberbullying ResourcesCyberbullying Resources
Cyberbullying Resources
Andy Jeter
 
Facebook and Academic Performance
Facebook and Academic PerformanceFacebook and Academic Performance
Facebook and Academic Performance
Htet Khaing
 
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Imesha Perera
 
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
Mahesh Kodituwakku
 
Example of Proposal
Example of ProposalExample of Proposal
Example of Proposal
JohanEddyLuaran
 
Negative impacts of social media as my space and facebook on teenagers in th...
Negative impacts of social media as my space and facebook on teenagers  in th...Negative impacts of social media as my space and facebook on teenagers  in th...
Negative impacts of social media as my space and facebook on teenagers in th...
GeorgeDolezal
 
The effects of social media on college students
The effects of social media on college studentsThe effects of social media on college students
The effects of social media on college studentsArina Fauzi
 
Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?
Evan Atkinson
 
Social Networking Sites and Reference Services
Social Networking Sites and Reference ServicesSocial Networking Sites and Reference Services
Social Networking Sites and Reference Services
Stephen Francoeur
 
Introduction to Social Media for Researchers
Introduction to Social Media for ResearchersIntroduction to Social Media for Researchers
Introduction to Social Media for Researchers
Helen Dixon
 
Impact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_studentsImpact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_studentsmiftah uddin
 
Social Media Effects on Study Habits
Social Media Effects on Study HabitsSocial Media Effects on Study Habits
Social Media Effects on Study HabitsRobert Breen
 
effects of Social media
effects of Social mediaeffects of Social media
effects of Social media
kimi7792
 
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDYIMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
Sajjad Sayed
 

What's hot (20)

THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
THE EFFECTS OF SOCIAL NETWORKING SITES ON THE ACADEMIC PERFORMANCE OF STUDENT...
 
The Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student ExperienceThe Role of Social Media in Today's College Student Experience
The Role of Social Media in Today's College Student Experience
 
Final project
Final projectFinal project
Final project
 
Research Paper - Facebook
Research Paper - FacebookResearch Paper - Facebook
Research Paper - Facebook
 
The use of social media among nigerian youths.2
The use of social media among nigerian youths.2The use of social media among nigerian youths.2
The use of social media among nigerian youths.2
 
The Effects on Social Networking on Education
The Effects on Social Networking on EducationThe Effects on Social Networking on Education
The Effects on Social Networking on Education
 
Cyberbullying Resources
Cyberbullying ResourcesCyberbullying Resources
Cyberbullying Resources
 
Facebook and Academic Performance
Facebook and Academic PerformanceFacebook and Academic Performance
Facebook and Academic Performance
 
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
Survey paper: Social Networking and its impact on Youth, Culture, Communicati...
 
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
USE OF SOCIAL NETWORKS AND ITS EFFECTS ON STUDENTS
 
Example of Proposal
Example of ProposalExample of Proposal
Example of Proposal
 
Negative impacts of social media as my space and facebook on teenagers in th...
Negative impacts of social media as my space and facebook on teenagers  in th...Negative impacts of social media as my space and facebook on teenagers  in th...
Negative impacts of social media as my space and facebook on teenagers in th...
 
The effects of social media on college students
The effects of social media on college studentsThe effects of social media on college students
The effects of social media on college students
 
Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?Are Social Media Websites Harmful To The Youth?
Are Social Media Websites Harmful To The Youth?
 
Social Networking Sites and Reference Services
Social Networking Sites and Reference ServicesSocial Networking Sites and Reference Services
Social Networking Sites and Reference Services
 
Introduction to Social Media for Researchers
Introduction to Social Media for ResearchersIntroduction to Social Media for Researchers
Introduction to Social Media for Researchers
 
Impact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_studentsImpact_of_internet_use_on_young_students
Impact_of_internet_use_on_young_students
 
Social Media Effects on Study Habits
Social Media Effects on Study HabitsSocial Media Effects on Study Habits
Social Media Effects on Study Habits
 
effects of Social media
effects of Social mediaeffects of Social media
effects of Social media
 
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDYIMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
IMPACT OF FACEBOOK USAGE ON THEACADEMIC GRADES: A CASE STUDY
 

Viewers also liked

Portal de transparencia.
Portal de transparencia. Portal de transparencia.
Portal de transparencia.
Juan Antonio Díaz
 
Yo soy reykon el lider
Yo soy reykon el liderYo soy reykon el lider
Yo soy reykon el lider
Flakita Deysi
 
áLbum De FotografíAs
áLbum De FotografíAsáLbum De FotografíAs
áLbum De FotografíAsfipmerchi
 
Anamaria bolos
Anamaria bolosAnamaria bolos
Anamaria bolosrose
 
Facebook How To Guide
Facebook How To GuideFacebook How To Guide
Facebook How To Guide
adcieo
 
GUERRA FRÍA
GUERRA FRÍAGUERRA FRÍA
GUERRA FRÍA
Waldir So Mora
 
buscadores
buscadoresbuscadores
buscadores
cristonfo
 
Anima el texto
Anima el textoAnima el texto
Anima el texto
Martin Solano
 
El principito
El principitoEl principito
El principito
paqui_linan
 
Nfl week 10 all picks
Nfl week 10 all picksNfl week 10 all picks
Nfl week 10 all picks
FootballSucks
 
Nestle Company
Nestle CompanyNestle Company
Nestle Company
Avinash Labade
 

Viewers also liked (13)

Portal de transparencia.
Portal de transparencia. Portal de transparencia.
Portal de transparencia.
 
Yo soy reykon el lider
Yo soy reykon el liderYo soy reykon el lider
Yo soy reykon el lider
 
áLbum De FotografíAs
áLbum De FotografíAsáLbum De FotografíAs
áLbum De FotografíAs
 
Anamaria bolos
Anamaria bolosAnamaria bolos
Anamaria bolos
 
Facebook How To Guide
Facebook How To GuideFacebook How To Guide
Facebook How To Guide
 
GUERRA FRÍA
GUERRA FRÍAGUERRA FRÍA
GUERRA FRÍA
 
Deniz ortiz
Deniz ortizDeniz ortiz
Deniz ortiz
 
buscadores
buscadoresbuscadores
buscadores
 
Anima el texto
Anima el textoAnima el texto
Anima el texto
 
El principito
El principitoEl principito
El principito
 
Loi Hay Y Dep
Loi Hay Y DepLoi Hay Y Dep
Loi Hay Y Dep
 
Nfl week 10 all picks
Nfl week 10 all picksNfl week 10 all picks
Nfl week 10 all picks
 
Nestle Company
Nestle CompanyNestle Company
Nestle Company
 

Similar to SAMUEL FULL MSC PROJECT

2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_92021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
PuwaCalvin
 
Unit 1 cape sociology
Unit 1 cape sociologyUnit 1 cape sociology
Unit 1 cape sociology
Andreen18
 
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docxFernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
ssuser454af01
 
Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...
Alexander Decker
 
Caribbean studies IA Dejon Harris
Caribbean studies IA Dejon HarrisCaribbean studies IA Dejon Harris
Caribbean studies IA Dejon Harris
Dejon Harris
 
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATIONIMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
Arif, Mohammed Nazrul Islam
 
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
African Virtual University
 
Engaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social MediaEngaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social Media
Brittany Smith
 
The Teleological Divide and ICT
The Teleological Divide and ICTThe Teleological Divide and ICT
The Teleological Divide and ICT
Colin Harrison
 
NM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD EcosystemsNM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD Ecosystems
Julia Parra
 
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
cscpconf
 
Social Networks:Places of Learning?
Social Networks:Places of Learning?Social Networks:Places of Learning?
Social Networks:Places of Learning?
David Brear
 
Cyber / digital literacy.pptx
Cyber / digital literacy.pptxCyber / digital literacy.pptx
Cyber / digital literacy.pptx
Floralyn Victoria
 
cyberbullying detection seminar.pdf
cyberbullying detection seminar.pdfcyberbullying detection seminar.pdf
cyberbullying detection seminar.pdf
Akshay712352
 
Critique on social networking.
Critique on social networking.Critique on social networking.
Critique on social networking.Lucy Oliver
 
Analysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic studentsAnalysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic students
Jahangeer Qadiree
 
Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...
iosrjce
 
Business research report on Internet and children
Business research report on Internet and childrenBusiness research report on Internet and children
Business research report on Internet and children
Apon Comilla
 
trabajo de tesis
trabajo de tesistrabajo de tesis
trabajo de tesis
Keviin Alexander
 
Social Media and its impact on students
Social Media and its impact on studentsSocial Media and its impact on students
Social Media and its impact on students
HaxNain BalGhari
 

Similar to SAMUEL FULL MSC PROJECT (20)

2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_92021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
2021 adt imc_u_ba19_t0563_abah_stephany_mbong_680829564_task_9
 
Unit 1 cape sociology
Unit 1 cape sociologyUnit 1 cape sociology
Unit 1 cape sociology
 
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docxFernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
Fernando 1Sheehan FernandoProfessor MorrisonEnglish 1001.docx
 
Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...Online social networking and the academic achievement of university students ...
Online social networking and the academic achievement of university students ...
 
Caribbean studies IA Dejon Harris
Caribbean studies IA Dejon HarrisCaribbean studies IA Dejon Harris
Caribbean studies IA Dejon Harris
 
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATIONIMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
IMPACT OF SOCIAL NETWORKING SITES ON YOUNG GENERATION
 
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
EXPLORING THE PERCEPTIONS AND USAGE OF SOCIAL NETWORKING SITES AMONG DISTANCE...
 
Engaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social MediaEngaging Youth & Young Adults in Social Media
Engaging Youth & Young Adults in Social Media
 
The Teleological Divide and ICT
The Teleological Divide and ICTThe Teleological Divide and ICT
The Teleological Divide and ICT
 
NM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD EcosystemsNM TIE Presentation on PD Ecosystems
NM TIE Presentation on PD Ecosystems
 
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
THE RELATIONSHIP BETWEEN THE USE OF BLACKBERRY WITH THE STUDENTS’ DEMAND FULF...
 
Social Networks:Places of Learning?
Social Networks:Places of Learning?Social Networks:Places of Learning?
Social Networks:Places of Learning?
 
Cyber / digital literacy.pptx
Cyber / digital literacy.pptxCyber / digital literacy.pptx
Cyber / digital literacy.pptx
 
cyberbullying detection seminar.pdf
cyberbullying detection seminar.pdfcyberbullying detection seminar.pdf
cyberbullying detection seminar.pdf
 
Critique on social networking.
Critique on social networking.Critique on social networking.
Critique on social networking.
 
Analysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic studentsAnalysis of social networking websites and its effect on academic students
Analysis of social networking websites and its effect on academic students
 
Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...Extent of social media usage by students for improved learning in Tertiary In...
Extent of social media usage by students for improved learning in Tertiary In...
 
Business research report on Internet and children
Business research report on Internet and childrenBusiness research report on Internet and children
Business research report on Internet and children
 
trabajo de tesis
trabajo de tesistrabajo de tesis
trabajo de tesis
 
Social Media and its impact on students
Social Media and its impact on studentsSocial Media and its impact on students
Social Media and its impact on students
 

SAMUEL FULL MSC PROJECT

  • 1. Filtering Offensive Language in Online Communities using Grammatical Relations BY SAMUEL AYOKUNLE ADEKANMBI MATRIC NO: 133466 Project submitted in partial fulfillment award of Master of Science degree (Computer science) Department of computer science, University of Ibadan. February, 2014.
  • 2. Certification I certify that this research work was carried out by Samuel Ayokunle ADEKANMBI (133466) under my supervision. . ____________________ _______________________ Date Dr B O Longe
  • 3. DEDICATION This entire work dedicated to everyone that believes in the PromoUpdate dream.
  • 4. ACKNOWLEDGEMENT My profound gratitude goes to my parent and my siblings for their moral and financial support which has immensely led to the success of this project. To my Dad, You are the best; I love you so much even though I don’t show it. I am indeed grateful to my supervisor, Dr. Olumide B. Longe for his moral support, patience and understanding during the course of this project. Thank you very much Sir. I also want to appreciate my very good and crazy friends: Tini, Phina, Kunchasho, TY, Alamu, Oluwashola Amiola Philip, Emmanuel, Muideen, Lola Mojekodunmi, Jane, Gbenro, N.O Jimoh, Tifa; You guys are my brothers from another mother. I can’t underestimate the effort of all my lecturers in the department; I pray the blessing of the lord shall not depart from your homes. My Msc. Programme will have being incomplete without some set of wonderful people: Tini, Phina, Helen, Rotimi, Modupe, Tolu, Big Fish, Last Don, Giel, and the whole crew at chief Madu’s Palace. Thanks for being there for me. To all my classmates, Dimple, Becky, Elohor, Ben, Fake AYs, John, Uzomma, Deola, Banky, Shukurat, Toyosi, Shola, Adesi, GP, Toyosi, Tosinsss, etc; you have been a blessing to me and the success of my programme. I say a big thanks to you for your support throughout the programme. I appreciate your love. Thanks for believing in the PromoUpdate dream. You guys are the best. Finally, to anyone that has contributed to the success of this project and my success in life, whose name is not mentioned here, please just know that you are not unknown to me to me and you are appreciated more than you know. God bless you all. See you at the top.
  • 5. TABLE OF CONTENT page Title page i Certification ii Dedication iii Acknowledgement iv Table of content v Abstract viii CHAPTER ONE: INTRODUCTION 1.1 Background of study 1 1.2 Problem Statement 4 1.3 Aims and Objectives 4 1.4 Research Methodology 5 1.5 Scope and Limitation 5 1.6 Organization of the study 6 1.7 Expected Contribution to Knowledge 6 1.7.1 Glossary of terms. 7 CHAPTER TWO: REVIEW OF THE LITERATURE 2.1 Offensive Language in Online Communities 8 2.2 Rate of Cyberbullying among youth 9 2.3 Tradition-Bullying and Cyber-Bullying 10 2.4 Type of Bullying Online 12 2.5 Challenges in the fight to stop cyberbullying 12 2.6 Preventing Cyberbullying 13 2.7 Responding to Cyberbullying 14 2.8 Grammatical Relations 16
  • 6. 2.9 Using text mining techniques to detect online offensive content 17 2.10 Heads and Dependents 20 2.11 Statistical Parsing 21 2.12 Dependency Parsing 27 2.9 Using text mining techniques to detect online offensive content 17 2.9 Using text mining techniques to detect online offensive content 17 2.9 Using text mining techniques to detect online offensive content 17 CHAPTER THREE: SYSTEM ANALYSIS AND DESIGN 3.1 Systems Analysis 36 3.2 Analysis of the existing system 37 3.3 Problem of the existing approaches 40 3.4 Proposed Filtering Philosophy 41 3.5 Identify Removable Content by Grammatical Relations 44 CHAPTER FOUR: IMPLEMENTATION 4.1 Justification of Programming Language Used 56 4.2 System Specification 58 4.3 System Implementation 59 CHAPTER FIVE: SUMMARY, CONCLUSION AND FUTURE WORKS 5.1 Summary. 65
  • 7. 5.2 Conclusion 65 5.3 Future Works 66 References 67
  • 8. ABSTRACT Offensive language has risen to be a big issue to the health of both online communities and their users. To the online community, the spread of offensive language undermines its reputation, drives users away, and even directly affects its growth. To users, viewing offensive language brings negative influence to their mental health, especially for children and youth. A semantic filtering model is been proposed and implemented using grammatical analysis and part of speech tagging. Statistical/probabilistic analysis of recurring offensive tokens is been done using Bayesian method. The designed semantic filtering system was tested as an online web application with a client application by engaging users to validate the efficiency of the designed system. When offensive language is detected in a user message, a problem arises about how the offensive language should be removed, i.e. the offensive language filtering problem. Our semantic filtering technique is based on the grammatical relations of words in a sentence so that the rest of the filtered sentence is readable and the existence of offensive words in the original sentence is hard to notice. We tested the effectiveness of our approach with a large dataset and the results show that our techniques are very effective and accurate with little process overhead. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other light-weighted NLP techniques to speed up sentence parsing. Also, we also plan to extend our filtering approach to support other languages such as Chinese and French in future works.
  • 9. CHAPTER ONE INTRODUCTION Online social networking (OSN) websites have enjoyed a great success in recent years and have become the new frontier in today’s social relationships providing great places for self-expression and exchange of ideas. Social networking has provided opportunities for new relationships as well as strengthening existing relationships. Benefits of social networking platforms vary based on platform type, features and the company itself. OSN allows organizations to improve communication and productivity by disseminating information among different groups of employees in a more efficient manner, resulting in increased productivity. In the past, social networks were viewed as a distraction and offered no educational benefit. Blocking these social networks was a form of protection for students against wasting time, bullying, and invasions of privacy. In an educational setting, OSNs are seen by many instructors and educators as a frivolous, time-wasting distraction from schoolwork, and it is not uncommon to be banned in school computer labs. Cyberbullying has also become an issue of concern with social networks. According to the Children Go Online survey of 9-24 year olds, it was found that a third have received bullying comments online.( http://internetsafety101.org) To avoid this problem, many school districts/boards have blocked access to online social networks within the school environment. I Social networking services often include a lot of personal information posted publicly, and many believe that sharing personal information is a window into privacy theft. Schools have taken action to protect students from this. It is believed that this outpouring of identifiable information and the
  • 10. easy communication vehicle that social networking services provide open the door to sexual predators, cyberbullying, and cyber-stalking (http://en.wikipedia.org/wiki/Social_networking_service). In contrast, however, 70% of social media-using teens and 85% of adults believe that people are mostly kind to one another on social network sites (http://en.wikipedia.org/wiki/Social_networking_service). Research has suggested that there has been a shift away from blocking the use of social networking services. In many cases, the opposite is occurring as the potential of online networking services is being realized. It has been suggested that if schools block them [Online Social Networks], they're preventing students from learning the skills they need. Banning social networking is not only inappropriate but also borderline irresponsible when it comes to providing the best educational experiences for students. Schools have the option of teaching safe media usage as well as incorporating digital media into the classroom experience, thus preparing students for the literacy they will encounter in the future. Cyberbullying is a fast-growing trend that experts believe is more harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via the internet or our mobile phones, so victims can be reached anytime and anywhere. For many children, home is no longer a refuge from bullies. "Children can escape threats and abuse in the classroom, only to find text messages and emails from the same tormentors when they arrive home." "There's no safe place anymore and one can be bullied 24/7; even in the privacy of his/her own bedroom." (Cyberbullying, Able Publishing Newsletter - Term 3, 2008). Online social networking sites have become increasingly popular with children, especially young teens, as a place where they can meet other people, communicate, and exchange information. No type of bullying is harmless. In some cases, it can constitute criminal behaviour. In extreme incidents, cyberbullying has led teenagers to suicide. Most victims, however, suffer shame,
  • 11. embarrassment, anger, depression and withdrawal (Cyberbullying, Able Publishing Newsletter - Term 3, 2008). Cyberbullying is often seen as anonymous, and the nature of the internet allows it to spread quickly to hundreds and thousands of people. Cyberbullying has the same insidious effects as any kind of bullying, turning children away from school, friendships, and in tragic instances, life itself. Parents often tell their children to turn off their mobile phones or stay off the computer. Many parents don't understand that the internet and the mobile phone act as a social lifeline connecting teenagers to their peer group. Victims often don't tell their parents because they think their parents will only make the problem worse, or that they might even confiscate their mobile phone or take away their internet access, removing that social lifeline. While bullying is something that is often 'under the radar' of adults, cyberbullying is even more so. Teenagers are increasingly communicating in ways that are often unknown to adults and away from their supervision. They organize their social lives through these mediums; their friendships are made and broken over them. So the question remains: "How can we avoid offensive language in OSNs?" This research work aims at removing offensive language from user messages. When offensive language is detected in a user message, a problem arises about how the offensive language should be removed, i.e. the offensive language filtering problem. To solve this problem, the manual filtering approach is known to produce the best filtering result. However, manual filtering is costly in time and labor and thus cannot be widely applied (http://en.wikipedia.org/wiki/Anti-spam_techniques). Here, we analyze the offensive language in text messages posted in online communities, and propose a new automatic sentence-level filtering approach that is able to semantically remove the offensive language by utilizing the grammatical relations among words. Compared with existing automatic
  • 12. filtering approaches, the proposed approach provides filtering results much closer to those of manual filtering. 1.1 Problem Statement The online community has encouraged the use of offensive language, which has spread into about 80% of all OSNs and has been very harmful to the mental health of both children and youth (Zhi Xu and Sencun Zhu, 2010). To the online community, the deluge of offensive language undermines the community's reputation, drives users away, and even directly affects its growth. People have realized the problems brought by offensive language in online communities and many efforts have been made to detect the existence of offensive language within user messages. However, detection alone is not enough to eliminate the hazard caused by offensive language. When offensive content is detected within a user message, a question arises naturally about how the detected offensive content should be removed from the message before it is transmitted. Also, how do we remove or filter offensive language and words from a message thoroughly while still keeping inoffensive content untouched as much as possible? And can the readability of the filtered content be guaranteed so as to make our filtering transparent to readers? 1.2 Aims and Objectives This project work intends to develop and implement a sentence-level semantic filtering system, which will:
  • 13. 1. Utilize grammatical relations among words to stop cyberbullying by semantically removing offensive content from a sentence. 2. Produce minimal error when filtering offensive language and words from a message, while keeping inoffensive content untouched as much as possible. 3. Guarantee the readability of filtered content so as to make the filtering transparent to readers. 4. Implement the designed model as a sophisticated NLP application rather than an AI application, since learning is not going to be involved. 5. Help reduce the chances of victimization on online social networking sites. 1.3 Research Methodology The methodology adopted in carrying out this project includes the use of interviews to gather primary data from a number of leading filtering vendors in Nigeria. Both telephone and face-to-face interviews will be carried out with the relevant technology experts within selected organizations. Also, an existing database of offensive words and languages will be collected and used to simulate an offensive-language database engine. A semantic filtering model will be proposed and implemented using XYZ. Statistical/probabilistic analysis of recurring offensive tokens will be done using a Bayesian method. The designed semantic filtering system will be tested as an online web application with a client application by engaging users to validate the efficiency of the designed system.
  • 14. 1.4 Organization of the study The thesis is arranged in five chapters with the breakdown as follows: Chapter One is the introduction; it covers online social networking systems, the research aim and objectives, the research methodology, and the organization of the dissertation. Chapter Two deals with the literature review on grammatical relations, cyberbullying, and the concept of a semantic filtering system. Chapter Three presents the methodology, the analysis of the input and output specifications of the proposed system, and the design of the system. Chapter Four describes the system implementation and the evaluation of the system design. It consists of a brief description of each program module and its functions, justifies the choice of package, describes the software required to implement the system, and shows the measures taken during the implementation. Chapter Five summarizes the project work; it covers the conclusion and recommendations for the project.
  • 15. CHAPTER TWO LITERATURE REVIEW 2.1 Offensive Language in Online Communities People, most especially kids, have been bullying each other for generations. The latest generation, however, has been able to utilize technology to expand the reach and extent of the harm (http://cyberbullying.us). This phenomenon is being called cyberbullying, defined as: "willful and repeated harm inflicted through the use of computers, cell phones, and other electronic devices." Basically, we are referring to incidents where adolescents use technology, usually computers or cell phones, to harass, threaten, humiliate, or otherwise hassle their peers. For example, youth can send hurtful text messages to others or spread rumors using cell phones or computers. Teens have also created web pages, videos, and profiles on social networking sites to make fun of others. With cell phones, adolescents have taken pictures in a bedroom, a bathroom, or another location where privacy is expected, and posted or distributed them online. More recently, some have recorded unauthorized videos of other kids and uploaded them for the world to see, rate, tag, and discuss (http://cyberbullying.us). There are many detrimental outcomes associated with cyberbullying and the use of offensive language that reach into the real world. First, many targets of cyberbullying report feeling depressed, sad, angry, and frustrated. As one teenager stated: "It makes me hurt both physically and mentally. It scares me and takes away all my confidence. It makes me feel sick and worthless." Victims who experience cyberbullying also reveal that they were afraid or embarrassed to go to school or even to come out and talk in public (http://cyberbullying.us). In addition, there is a link between cyberbullying and low self-esteem, family problems, academic
  • 16. problems, school violence, and delinquent behavior. Finally, cyberbullied youth also report having suicidal thoughts, and there have been a number of examples around the world where youth who were victimized ended up taking their own lives (http://cyberbullying.us). Cyberbullying occurs across a variety of venues and mediums in cyberspace, and it shouldn't come as a surprise that it occurs most often where teenagers congregate. Initially, many kids hung out in chat rooms, and as a result that is where most harassment took place. In recent years, most youth have been drawn to social networking websites (such as Facebook, Twitter, and LinkedIn) and video-sharing websites (such as YouTube). This trend has led to increased reports of cyberbullying occurring in those environments (Burgess-Proctor, Patchin, & Hinduja, 2009; Hinduja & Patchin, 2008b; R. M. Kowalski & Limber, 2007; Lenhart, 2007; Li, 2007a; Patchin & Hinduja, 2006). Instant messaging on the Internet and text messaging via cell phone also appear to be common ways in which youth harass one another. 2.2 Rate of Cyberbullying among Youth Estimates of the number of youth who experience cyberbullying vary widely (ranging from 10-40% or more), depending on the age of the group studied and how cyberbullying is formally defined. In this research, we informed secondary school students (of International School, Ibadan; Abadina College, U.I.; and Igbobi College, Yaba, Lagos) that cyberbullying is when someone "repeatedly picks on another person by making use of offensive language through OSN when chatting, or when someone posts something offensive online about another person that they don't like." Using this definition, about 62% of the over 800 randomly selected 11-18 year-old students indicated they had been a victim at some point in their life. About this same number
  • 17. admitted to cyberbullying others during their lifetime. Finally, about 40% of youths in this recent study said they had both been a victim and an offender. Fig 2.1 2.3 Traditional-Bullying and Cyber-Bullying While often similar in terms of form and technique, bullying and cyberbullying have many differences that can make the latter even more devastating. First, victims often do not know who the bully is, or why they are being targeted. The cyberbully can cloak his or her identity behind a computer using anonymous email addresses or pseudonymous screen names. Second, the hurtful actions of a cyberbully are viral; that is, a large number of people (at school, in the neighborhood, in the city, in the world!) can be involved in a cyber-attack on a victim, or at least find out about the incident with a few keystrokes or clicks of the mouse. The perception, then, is that absolutely everyone knows about it.
  • 18. Third, it is often easier to be cruel using technology because cyberbullying can be done from a physically distant location, and the bully doesn't have to see the immediate response of the target. In fact, some teens simply might not recognize the serious harm they are causing because they are sheltered from the victim's response. Finally, while parents and teachers are doing a better job supervising youth at school and at home, many adults don't have the technological know-how to keep track of what teens are up to online. As a result, a victim's experience may be missed and a bully's actions may be left unchecked. Even if bullies are identified, many adults find themselves unprepared to adequately respond. All this and more makes cyberbullying a growing problem, because increasing numbers of kids are using, and have completely embraced, interaction via computers and cell phones. Two-thirds of youth go online every day for school work, to keep in touch with their friends, to play games, to learn about celebrities, to share their digital creations, or for many other reasons. Because these online communication tools have become an important part of their lives, it is not surprising that some youths have decided to use the technology to be malicious or menacing towards others. The fact that teens are connected to technology 24/7 means they are susceptible to victimization (and able to act on mean intentions toward others) around the clock. Apart from offering a measure of anonymity, it is also easier to be hateful using typed words rather than spoken words face-to-face, and because some adults have been slow to respond to cyberbullying, many cyberbullies feel that there are little to no consequences for their actions. Cyberbullying crosses all geographical boundaries. The Internet has really opened up the whole world to users who access it on a broad array of devices, and for the most part, this has been a good thing. Nevertheless, some kids feel free to post or send whatever they want while online
  • 19. without considering how that content can inflict pain – and sometimes cause severe psychological and emotional wounds. 2.4 Types of Bullying Online According to the Internet Safety 101 curriculum, there are many types of cyberbullying, which include: • Gossip: Posting or sending cruel gossip to damage a person's reputation and relationships with friends, family, and acquaintances. • Exclusion: Deliberately excluding someone from an online group. • Impersonation: Breaking into someone's e-mail or other online account and sending messages that will cause embarrassment or damage to the person's reputation and affect his or her relationship with others. • Harassment: Repeatedly posting or sending offensive, rude, and insulting messages. • Cyber-stalking: Posting or sending unwanted or intimidating messages, which may include threats. • Flaming: Online fights where scornful and offensive messages are posted on websites, forums, or blogs. • Outing and Trickery: Tricking someone into revealing secrets or embarrassing information, which is then shared online. • Cyber-threats: Remarks on the Internet threatening or implying violent behavior, or displaying suicidal tendencies.
  • 20. 2.5 Challenges in the fight to stop cyberbullying There are two major challenges that make it difficult to prevent cyberbullying. First, many people don't see the harm associated with it. Some attempt to dismiss or disregard cyberbullying because there are "more serious forms of aggression to worry about." While it is true that there are many issues facing adolescents, parents, teachers, and law enforcement today, we first need to accept that cyberbullying is one such problem that will only get more serious if ignored. The other challenge relates to who is willing to step up and take responsibility for responding to inappropriate use of technology. Parents often say that they don't have the technical skills to keep up with their kids' online behavior; teachers are afraid to intervene in behaviors that often occur away from school; and law enforcement is hesitant to get involved unless there is clear evidence of a crime or a significant threat to someone's physical safety. As a result, cyberbullying incidents often slip through the cracks. Indeed, the behavior often continues and escalates because incidents are not quickly addressed. Based on these challenges, there is a need to collectively create an environment where kids feel comfortable talking with adults about this problem and feel confident that meaningful steps will be taken to resolve the situation. We also need to get everyone involved - youth, parents, educators, counselors, law enforcement, social media companies, and the community at large. It will take a concerted and comprehensive effort from all stakeholders to really make a difference in reducing cyberbullying.
  • 21. 2.6 Preventing Cyberbullying The most important preventive step that schools can take is to educate the school community about responsible internet use. Students need to know that all forms of bullying are wrong and that those who engage in harassing or threatening behaviors will be subject to discipline. It is therefore important to discuss issues related to the appropriate use of online communications technology in various areas of the general curriculum. To be sure, these messages should be reinforced in classes that regularly utilize technology. Signage should also be posted in the computer lab or at each computer workstation to remind students of the rules of acceptable use. In general, it is crucial to establish and maintain a school climate of respect and integrity where violations result in informal or formal sanctions. Furthermore, school district personnel should review their harassment and bullying policies to see if they allow for the discipline of students who engage in cyberbullying. If their policy covers it, it is well within a school's legal authority to intervene in cyberbullying incidents that occur at school - or that originate off campus but ultimately result in a substantial disruption of the learning environment. The school then needs to make it clear to students, parents, and all staff that these behaviors are unacceptable and will be subject to discipline. In some cases, simply discussing the incident with the offender's parents will result in the behavior stopping. 2.7 Responding to Cyberbullying Students should already know that cyberbullying is unacceptable and that the behavior will result in discipline. Utilize school liaison officers or other members of law enforcement to thoroughly investigate incidents, as needed, if the behaviors cross a certain threshold of severity. Once the
  • 22. offending party has been identified, develop a response that is commensurate with the harm done and the disruption that occurred. School administrators should also work with parents to convey to the student that cyberbullying behaviors are taken seriously and are not trivialized. Moreover, schools should come up with creative response strategies, particularly for relatively minor forms of harassment that do not result in significant harm. For example, students may be required to create anti-cyberbullying posters to be displayed throughout the school. Older students might be required to give a brief presentation to younger students about the importance of using technology in ethically-sound ways. The point here, again, is to condemn the behavior while sending a message to the rest of the school community that bullying in any form is wrong and will not be tolerated. Even though the vast majority of these incidents can be handled informally (calling parents, counseling the bully and target, expressing condemnation of the behavior), there may be occasions where formal response from the school is warranted. This is particularly the case in incidents involving serious threats toward another student, if the target no longer feels comfortable coming to school, or if cyberbullying behaviors continue after informal attempts to stop it have failed. In these cases, detention, suspension, changes of placement, or even expulsion may be necessary. If these extreme measures are required, it is important that educators are able to clearly demonstrate the link to school and present evidence that supports their action. Also, youth should develop a relationship with an adult they trust (a parent, teacher, or someone else) so they can talk about any experiences they have online (or off) that make them upset or uncomfortable. If possible, teens should ignore minor teasing or name calling, and not respond to the bully as that might simply make the problem continue. It’s also useful to keep all evidence of
  • 23. cyberbullying to show an adult who can help with the situation. If targets of cyberbullying are able to keep a log or a journal of the dates and times and instances of the online harassment, that can also help prove what was going on and who started it. Overall, youth should go online with their parents – show them what web sites they use, and why. At the same time, they need to be responsible when interacting with others on the Internet. For instance, they shouldn’t say anything to anyone online that they wouldn’t say to them in person with their parents in the room. Finally, youth ought to take advantage of the privacy settings within Facebook and other websites, and the social software (instant messaging, email, and chat programs) that they use – they are there to help reduce the chances of victimization. Users can adjust the settings to restrict and monitor who can contact them and who can read their online content. Law enforcement officers also have a role in preventing and responding to cyberbullying. To begin, they need to be aware of ever-evolving state and local laws concerning online behaviors, and equip themselves with the skills and knowledge to intervene as necessary. In a recent survey of school resource officers, we found that almost one-quarter did not know if their state had a cyberbullying law. This is surprising since their most visible responsibility involves responding to actions which are in violation of law (e.g., harassment, threats, stalking). Even if the behavior doesn’t immediately appear to rise to the level of a crime, officers should use their discretion to handle the situation in a way that is appropriate for the circumstances. For example, a simple discussion of the legal issues involved in cyberbullying may be enough to deter some youth from future misbehavior. Officers might also talk to parents about their child’s conduct and express to them the seriousness of online harassment.
  • 24. Relatedly, officers can play an essential role in preventing cyberbullying from occurring or getting out of hand in the first place. They can speak to students in classrooms about cyberbullying and online safety issues more broadly in an attempt to discourage them from engaging in risky or unacceptable actions and interactions. They might also speak to parents about local and state laws, so that they are informed and can properly respond if their child is involved in an incident. 2.8 Grammatical Relations Grammatical relations refer to functional relationships between constituents in a clause. The standard examples of grammatical functions from traditional grammar are subject, direct object, and indirect object. Beyond these concepts from traditional grammar, more modern theories of grammar are likely to acknowledge many further types of grammatical relations (e.g. complement, specifier, predicative, etc.). The role of grammatical relations in theories of grammar is greatest in many dependency grammars, which tend to posit dozens of distinct grammatical relations. Every head-dependent dependency bears a grammatical function. Grammatical relations are exemplified in traditional grammar by the notions of subject, direct object, and indirect object; for example: Adekanmbi gave Samuel the book. The subject Adekanmbi performs or is the source of the action. The direct object the book is acted upon by the subject, and the indirect object Samuel receives the direct object or otherwise benefits from the action. Traditional grammars often begin with these rather vague notions of the grammatical functions. When one begins to examine the distinctions more closely, it quickly
  • 25. becomes clear that these basic definitions do not provide much more than a loose orientation point. What is indisputable about the grammatical relations is that they are relational. That is, subject and object can exist as such only by virtue of the context in which they appear. A noun such as Adekanmbi or a noun phrase such as the book cannot qualify as subject and direct object, respectively, unless they appear in an environment, e.g. a clause, where they are related to each other and/or to an action or state. In this regard, the main verb in a clause is responsible for assigning grammatical relations to the clause "participants". 2.9 Using Text Mining Techniques to Detect Online Offensive Contents Offensive language identification in social media is a difficult task because the textual content in such environments is often unstructured, informal, and even misspelled. Since the defensive methods adopted by current social media are not sufficient, researchers have studied intelligent ways to identify offensive content using text mining approaches. Implementing text mining techniques to analyze online data requires the following phases: 1) data acquisition and preprocessing, 2) feature extraction, and 3) classification. The major challenges of using text mining to detect offensive content lie in the feature selection phase, which is elaborated in the following sections. a) Message-level Feature Extraction Most offensive content detection research extracts two kinds of features: lexical and syntactic features.
  • 26. Lexical features treat each word and phrase as an entity. Word patterns such as the appearance of certain keywords and their frequencies are often used to represent the language model. Early research used Bag-of-Words (BoW) in offensiveness detection. The BoW approach treats a text as an unordered collection of words and disregards syntactic and semantic information. However, using the BoW approach alone not only yields low accuracy in detecting subtle offensive language, but also brings in a high false positive rate, especially during heated arguments, defensive reactions to others' offensive posts, and even conversations between close friends. The N-gram approach is considered an improvement in that it brings words' nearby context into consideration when detecting offensive content. N-grams represent subsequences of N contiguous words in texts. Bi-grams and tri-grams are the most popular N-grams used in text mining. However, N-grams have difficulty capturing related words separated by long distances in texts. Simply increasing N can alleviate the problem but will slow down system processing and bring in more false positives. Syntactic features: Although lexical features perform well in detecting offensive entities, without considering the syntactic structure of the whole sentence they fail to distinguish the offensiveness of sentences that contain the same words in different orders. Therefore, to take syntactic features into account, natural language parsers are introduced to parse sentences into grammatical structures before feature selection. Equipping the system with a parser can help avoid selecting unrelated word sets as features in offensiveness detection. b) User-level Offensiveness Detection Most contemporary research on detecting online offensive language focuses only on sentence-level and message-level constructs. Since no detection technique is 100% accurate, if users keep
  • 27. connecting with the sources of offensive content (e.g., online users or websites), they are at high risk of continuous exposure to offensive content. However, user-level detection is a more challenging task and studies associated with the user level of analysis are largely missing. There are some limited efforts at the user level. For example, Kontostathis et al. propose a rule-based communication model to track and categorize online predators. Pendar uses lexical features with machine learning classifiers to differentiate victims from predators in online chatting environments. Pazienza and Tudorache propose utilizing user profiling features to detect aggressive discussions. They use users' online behavior histories (e.g., presence and conversations) to predict whether or not users' future posts will be offensive. Although their work points out an interesting direction for incorporating user information in detecting offensive content, more advanced user information such as users' writing styles, posting trends, or reputations has not been included to improve the detection rate. Fig 2.2
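To make the lexical features described in this section concrete, the short sketch below extracts Bag-of-Words and bigram counts from a message (plain Python; the message text and the simple tokenizer are made up for the illustration):

from collections import Counter
import re

def tokenize(text):
    # lowercase and keep simple word tokens; real social-media text needs far
    # more robust normalisation (misspellings, emoticons, repeated letters, ...)
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(tokens):
    return Counter(tokens)                      # unordered word counts

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))     # ordered pairs of adjacent words

message = "you are such an idiot, such an idiot"   # made-up example message
tokens = tokenize(message)
print(bag_of_words(tokens))    # Counter({'such': 2, 'an': 2, 'idiot': 2, 'you': 1, 'are': 1})
print(bigrams(tokens))         # bigram counts keep local word order, which BoW discards

The BoW counter discards word order entirely, while the bigram counter keeps local order; this is exactly the trade-off between the two lexical representations discussed above.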
  • 28. 2.10 Heads and dependents The importance of the syntactic functions reaches its greatest extent in dependency grammar (DG) theories of syntax. Every head-dependent dependency bears a syntactic function. The result is that an inventory consisting of dozens of distinct syntactic functions is needed for each language. For example, a determiner-noun dependency might be assumed to bear the DET (determiner) function, and an adjective-noun dependency is assumed to bear the ATTR (attribute) function. These functions are often produced as labels on the dependencies themselves in the syntactic tree, e.g. Fig 2.3 The tree contains the following syntactic functions: ATTR (attribute), CCOMP (clause complement), DET (determiner), MOD (modifier), OBJ (object), SUBJ (subject), and VCOMP
  • 29. (verb complement). The actual inventories of syntactic functions will differ from the one suggested here in the number and types of functions that are assumed. In this regard, this tree is merely intended to be illustrative of the importance that the syntactic functions can take on in some theories of syntax and grammar. 2.11 Statistical parsing CFGs can be used to parse, but some ambiguous sentences cannot be disambiguated, and we would like to know the most likely parse. A corpus can be used to do that. 2.11.1 Basic idea 1. Start with a treebank (a bank of trees, e.g. the Penn Treebank), which is a collection of sentences with syntactic annotation, i.e., already-parsed sentences. 2. Examine which parse trees occur frequently. 3. Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule based on its frequency. That is, we'll have a CFG augmented with probabilities (a PCFG). 2.11.2 Probabilistic Context-Free Grammars (PCFGs) Definition of a PCFG: - Set of non-terminals (N) - Set of terminals (T)
  • 30. - Set of rules/productions (P), of the form A → β - Designated start symbol (S) - A function D that assigns a probability to each rule in P: D = P(A → β) 2.11.3 Estimating Probabilities using a Treebank - Given a corpus of sentences with syntactic annotation (e.g., the Penn Treebank) - Consider all parse trees - (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule - (2) Also count the number of times A is on the left-hand side of a rule - Divide (1) by (2): D = P(A → β | A) = Count(A → β) / Count(A) 2.11.4 Using Probabilities to Parse • P(T) = probability of a particular parse tree = the product of the probabilities of all the rules r used to expand each node n in the parse tree
  • 31. Fig 2.4 We have the following rules and probabilities - S → VP .05 - VP → V NP .40 - NP → Det N .20 - V → book .30 - Det → that .05 - N → flight .25 P ( T ) = P ( S → VP ) * P ( VP→ V NP ) *… * P ( N → flight ) = .05 * .40 * .20 * .30 * .05 * .25 = .000015 So, the probability for that parse is 0.000015. Probabilities are useful for comparing with other probabilities. Whereas we couldn’t decide between two parses using a regular CFG, we now can.
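The worked example above can be reproduced in a few lines: the sketch below simply stores the six rule probabilities and multiplies them, which is all that P(T) requires (illustration only, not a parser):

# rule probabilities taken from the example above
rule_prob = {
    "S -> VP": 0.05,
    "VP -> V NP": 0.40,
    "NP -> Det N": 0.20,
    "V -> book": 0.30,
    "Det -> that": 0.05,
    "N -> flight": 0.25,
}

def parse_probability(rules_used):
    p = 1.0
    for rule in rules_used:
        p *= rule_prob[rule]        # P(T) = product of the probabilities of the rules used
    return p

tree_rules = ["S -> VP", "VP -> V NP", "NP -> Det N",
              "V -> book", "Det -> that", "N -> flight"]
print(parse_probability(tree_rules))    # ~1.5e-05, i.e. 0.000015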
  • 32. 2.11.5 Obtaining the best parse The best parse T(S), where S is our sentence, is the tree with the highest probability. We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse - CYK is a form of dynamic programming - CYK is a chart parser, like the Earley parser 2.11.6 Problems with PCFGs It's still only a CFG, so dependencies on non-CFG information are not captured. - e.g., pronouns are more likely to be subjects than objects: P[(NP → Pronoun) | NP = subject] >> P[(NP → Pronoun) | NP = object] A PCFG also ignores lexical dependency information (statistics), which is usually crucial for disambiguating "PP attachment ambiguity" and "coordination ambiguity". - (T1) America sent [ [250,000 soldiers] [into Iraq] ] - (T2) America sent [250,000 soldiers] [into Iraq] "Sent" with an "into"-PP is almost always attached high (T2), so lexical statistics should give T2 the higher probability. An example of coordination ambiguity is the two parses of the phrase "dogs in houses and cats" - (T1) [ [NP dogs] in [ NP houses and cats ] ] - (T2) [ [NP dogs in houses] and [NP cats ] ] Here T1 is semantically wrong and T2 is correct, but both trees get the same score. So a plain PCFG is not enough to disambiguate parse trees; lexical dependency information is also needed. To handle lexical information, we'll turn to lexicalized PCFGs.
  • 33. 2.11.7 Lexicalized PCFGs • Lexicalized Parse Trees - Add "headwords" to each phrasal node. Each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter - The headword for a node is set to the headword of its head daughter - Headship is not annotated in (most) treebanks - Usually, head rules are used, e.g.: - NP: • Take leftmost NP • Take rightmost N* • Take rightmost JJ • Take right child - VP: • Take leftmost VB* • Take leftmost VP • Take left child
  • 34. Fig 2.5 2.11.8 Incorporating head probabilities • Previously, we conditioned on the mother node (A): - P(A → β | A) • Now, we can condition on the mother node and the headword of A (h(A)): - P(A → β | A, h(A)) We're no longer conditioning on simply the mother category A, but on the mother category when h(A) is the head. - e.g., P(VP → VBD NP PP | VP, dumped)
  • 35. 2.11.9 Calculating rule probabilities • We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total: P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β) 2.11.10 Adding info about word-word dependencies • We want to take into account one other factor: the probability of being a head word (in a given context) - P(h(n) = word | …) • We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n))) - P(h(n) = word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n))) - P(sacks | NP, dumped) • What we're really doing is factoring in how words relate to each other • We will call this a dependency relation later: sacks is dependent on dumped, in this case
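As a rough sketch of the count-based estimate above, the snippet below uses made-up counts C(VP(dumped) → β), as they might be gathered from a treebank, to compute P(VP → VBD NP PP | VP, dumped):

from collections import Counter

# made-up counts of how VP expands when its headword is "dumped"
vp_dumped_counts = Counter({
    "VP -> VBD NP PP": 6,
    "VP -> VBD NP": 3,
    "VP -> VBD PP": 1,
})

def lexicalised_rule_prob(rule, counts):
    # P(rule | VP, dumped) = C(VP(dumped) -> rhs) / sum over beta of C(VP(dumped) -> beta)
    return counts[rule] / sum(counts.values())

print(lexicalised_rule_prob("VP -> VBD NP PP", vp_dumped_counts))   # 6 / 10 = 0.6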
  • 36. Fig 2.6: Lexicalized parsing can be seen as producing dependency trees 2.12 Dependency Parsing Modern dependency grammar was created by the French linguist Lucien Tesnière (1959), although its roots may be traced back to Panini's grammar of Sanskrit (a predecessor of Bangla) many centuries earlier. In NLP, a dependency parse tree is thought of as a 'bridge' between syntactic and semantic analysis, since it gives some semantic information as well as syntactic information. Some people also argue that it is another version of chunk parsing, because careful observation of a dependency tree reveals that every subpart of a sentence (subject, object, or complement) appears in a different subtree or under a different relation, where each node is dependent on another node. These subtrees, or semantically dependent nodes, can be thought of as separate chunks.
  • 37. 2.12.1 Basic Concepts In a dependency representation every node in the structure is a surface word (there are no abstract nodes such as NP or VP), but each word may have additional attributes such as its part-of-speech (POS) tag. The parent word is known as the head, and its children are its modifiers. The observation which drives DG is: in a sentence, all but one word depend on other words. The one word that doesn't depend on any other is called the root of the sentence. A typical DG analysis of the sentence "A man sleeps" is demonstrated below: A depends on man Man depends on sleeps Sleeps depends on nothing (it is the root of the sentence) Or, put differently: A modifies man Man is the subject of sleeps Sleeps is the main verb of the sentence This is Dependency Grammar. A formulation of dependency grammar is given below: • Capturing relations between words is moving in the direction of dependency grammar (DG) • In DG, there is no such thing as constituency • The structure of a sentence is purely the binary relations between words; A → B means that B depends on A
  • 38. Dependencies are motivated by grammatical function, both syntactically and semantically. A word depends on another either if it is a complement or a modifier of the latter. The edge between a parent and a child node specifies the grammatical relationship between the two words (e.g. subj, obj, and adj). In most formulations of DG for example, functional heads or governors (e.g. verbs) subcategorize for their complements. Hence, a transitive verb like ‘like’ requires two complements (dependents), one noun with the grammatical function subject and one with the function object. In this research thesis, we are using Stanford-Parser version-jdk1.5 for all of the output. Ex sentence: John likes Italian food. Tagged output: John/NNP likes/VBZ Italian/NN food/NN Constituent structure output: (ROOT (S (NP (NNP John)) (VP (VBZ likes) (NP (NN Italian) (NN food))))) Dependency structure output: nsubj(likes-2, John-1) nn(food-4, italian-3) dobj(likes-2, food-4)
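As a side note, the same style of typed-dependency output can be obtained from other parsers; the sketch below uses spaCy rather than the Stanford parser that this thesis relies on (an illustration only: spaCy's label set differs slightly, e.g. amod/compound rather than nn, and the en_core_web_sm model must be installed separately):

import spacy

nlp = spacy.load("en_core_web_sm")           # small English model, installed separately
doc = nlp("John likes Italian food.")

for token in doc:
    if token.dep_ != "ROOT":
        # relation(head-index, dependent-index), mirroring the output shown above
        print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, {token.text}-{token.i + 1})")

# Expected output along these lines (labels differ slightly from the Stanford scheme):
#   nsubj(likes-2, John-1)
#   amod(food-4, Italian-3)
#   dobj(likes-2, food-4)
#   punct(likes-2, .-5)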
  • 39. 2.12.2 Dependency functions 2.12.2.1 Main functions main main element The main element of a clause is usually a verb, but in a verb-less clause other elements may serve as a head as well. Ex: a sentence with a verb He doesn't know whether to send a gift. nsubj(know-4, He-1) aux(know-4, does-2) advmod(know-4, n't-3) aux(send-7, to-6) whether(know-4, send-7) det(gift-9, a-8) dobj(send-7, gift-9) Ex: a sentence without a verb A comprehensive grammar of the English language det(grammar-3, A-1) amod(grammar-3, comprehensive-2) det(language-7, the-5)
  • 40. amod(language-7, english-6) of(grammar-3, language-7) 2.12.2.2 Verb complementation nsubj nominal subject The dependency syntax collapses the classes of formal subject and ordinary subject into one. The subject may also be a non-finite clause, a that-clause, a WH-clause, etc. dobj direct object The notion of object is wider than that in Quirk, comprising essentially all types of second arguments, except subject complements. The motivation is that the subtypes of second arguments are complementary, i.e. they occupy the same valency slot. There are both simple nominal objects and more complex objects such as a non-finite clause, that-clause, WH-clause or quote structure. Ex: John explained that topic nsubj(explained-2, John-1) det(topic-4, that-3) dobj(explained-2, topic-4) ccomp clausal complement The subject complement is the second argument of a copular verb.
  • 41. Ex: Mary said John didn't go there nsubj(said-2, Mary-1) nsubj(go-6, John-3) aux(go-6, did-4) advmod(go-6, n't-5) ccomp(said-2, go-6) advmod(go-6, there-7) iobj indirect object Indirect object corresponds to a third argument. The prepositional dative is described accordingly. Again, the syntactic motivation is that the prepositional phrase occupies the same valency slot as the indirect object and is semantically equivalent to it. Ex: I gave him my address. nsubj(gave-2, I-1) iobj(gave-2, him-3) dep(address-5, my-4) dobj(gave-2, address-5) What did Pauline give Tom? Pauline gave it to Tom.
  • 42. 2.12.2.3 Determinative functions det determiner Central determiners (articles) or a determining pronoun. Successive determiners are linked to each other. Ex: This is an apple nsubj(is-2, This-1) det(apple-4, an-3) dobj(is-2, apple-4) 2.12.3 Robinson's axioms Robinson (1970) formulated four axioms to govern the well-formedness of dependency structures, depicted below: 1. One and only one element is independent. 2. All others depend directly on some element. 3. No element depends directly on more than one other. 4. If A depends directly on B and some element C intervenes between them (in the linear order of the string), then C depends directly on A or B or some other intervening element. The first three axioms ensure that dependency structures are trees. Axioms 1 and 2 state that in each sentence only one element is independent and all others depend on some other element. Axiom 3 states that if element A depends on B, it must not also depend on another element C. This
  • 43. requirement is referred to as single-headedness. Axiom 4 is called the requirement of projectivity and disallows crossing edges in dependency trees. 2.12.4 Dependency relation A mapping M maps a set of nodes W to the actual words of a sentence. Now, for w1, w2 ∈ W, <w1, w2> ∈ R asserts that w1 is dependent on w2. The properties of R impose the same treeness constraints on dependency graphs as Robinson's axioms. Ex: Mary loves another Mary ↑ ↑ ↑ ↑ w1 w2 w3 w4 here, M maps w1…w4 ∈ W to the words 1. R ⊂ W × W 2. ∀ w1, w2, …, wk ∈ W: <w1,w2> ∈ R, …, <wk-1,wk> ∈ R ⇒ w1 ≠ wk (acyclicity) 3. ∃! w1 ∈ W: ∀ w2 ∈ W: <w1,w2> ∉ R (rootedness) 4. ∀ w1, w2, w3 ∈ W: <w1,w2> ∈ R ∧ <w1,w3> ∈ R → w2 = w3 (single-headedness) 2.12.5 Stanford dependency parser by Dan Klein This parser uses features of Collins's parser. Michael Collins, in his head-driven statistical parser, showed a mapping from his statistical parser to dependency relation sets. Dan Klein's
  • 44. Stanford parser deals with tagged words: pairs <w, t>. First the head <wh, th> of a constituent is generated using ‘Collins head finder’ method, then successive right dependents <wd, td> until a ‘stop’ token is generated, then successive left dependents until ‘stop’ token is generated. It supports three formats for output: 1. dependencies 2. typedDependencies 3. typedDependenciesCollapsed For example: Factory payrolls fell in September. Tagged output: Factory/NN payrolls/NNS fell/VBD in/IN September/NNP Dependency structure: nn(payrolls-2, Factory-1) nsubj(fell-3, payrolls-2) in(fell-3, September-5) Fig 2.7
  • 45. First, fell-VBD is chosen as the head of the sentence, then, in-IN to the right is generated, which then generates September-NN to the right, which generates ‘stop’ token on both sides. Then return to in-IN, generate ‘stop’ to the right, and so on. The above output is the ‘typedDependenciesCollapsed’ format of Stanford dependency parse tree. This ‘typedDependenciesCollapsed’ doesn’t make separate nodes for the words, which are obvious in any dependency relation in a sentence; instead it makes it a relation between two prominent words. In the above example the preposition ‘in’ is used as a relation or dependency function between the words ‘fell’ and ‘September’. For example, only ‘typedDependencies’ format of the above sentence will be: nn(payrolls-2, Factory-1) nsubj(fell-3, payrolls-2) dep(fell-3, in-4) dep(in-4, September-5) Fig 2.8
  • 46. The example shows that the plain 'typedDependencies' format makes a separate node for 'in' between 'fell' and 'September', whereas in the collapsed format 'in' is used as a relation, which makes the tree shorter in depth. This thesis uses the 'typedDependenciesCollapsed' format as well, because we don't need to look at every word to extract the necessary information.
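Before moving on to the system design, the small sketch below ties the parser output above back to Robinson's treeness constraints from Section 2.12.3: it checks rootedness, acyclicity, and single-headedness on a hand-written head map of "Factory payrolls fell in September" (illustration only):

# head map for "Factory payrolls fell in September" (typedDependencies above);
# each word points to its head, and None marks the root
heads = {
    "Factory": "payrolls",      # nn(payrolls-2, Factory-1)
    "payrolls": "fell",         # nsubj(fell-3, payrolls-2)
    "in": "fell",               # dep(fell-3, in-4)
    "September": "in",          # dep(in-4, September-5)
    "fell": None,               # the root: it depends on nothing
}

def is_well_formed(heads):
    roots = [w for w, h in heads.items() if h is None]
    if len(roots) != 1:                     # rootedness: exactly one independent word
        return False
    for word in heads:                      # acyclicity: walking up the heads must terminate
        seen = set()
        w = word
        while w is not None:
            if w in seen:
                return False
            seen.add(w)
            w = heads[w]
    return True                             # single-headedness holds by construction here,
                                            # since each word appears only once as a key

print(is_well_formed(heads))                # True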
  • 47. CHAPTER THREE SYSTEM ANALYSIS AND DESIGN In the following sections of this chapter, existing sentence-level semantic filtering approaches and methodologies for online social networking communities will be thoroughly examined, and issues related to these approaches will be highlighted. The proposed sentence-level semantic filtering approach will also be examined, and its operating procedures, benefits, and feasibility will be described. The methodologies employed in acquiring the requirements for the successful implementation of the proposed filtering system will also be discussed, and the design of the filtering system will be presented together with its program components. 3.1 System Analysis System analysis can be defined as the process of analyzing a system with the essential goal of improving or modifying it. It can also be defined as the methodical study of a system, its current and future required objectives, and its procedures, in order to form a basis for the system design. It is the first of the three major phases in developing an information system. All system analysis efforts are directed towards deciding these three basic objectives: 1. Identify the system owner and system users. 2. Define what the system will do.
  • 48. 3. Determine the technical, economic, and operational feasibility of the proposed system. The purpose of the analysis is to produce a clear requirement specification of the newly designed or upgraded system efficiently and effectively. It requires the ability to analyze the essential features of a system, and this knowledge of a system is achieved through the investigation of the system and its environment. 3.2 Analysis of the existing system Online social networking sites have become increasingly popular with children, especially young teens, as a place where they can meet other people, communicate, and exchange information. However, this medium has encouraged the wide usage of offensive language and has also brought about a fast-growing trend, called cyberbullying, that experts believe is very harmful and that has led teenagers to suicide in very extreme cases. People have realized the problems brought by offensive language in online communities and many efforts have been made to detect and eliminate the existence of offensive language within user messages. The approaches used are discussed below.
  • 49. 3.2.1 Keyword Censoring Approach Keyword censoring approaches match words appearing in user messages against offensive words stored in a blacklist. Once found, these offensive words will be removed, partially replaced (e.g., "bitch" becomes "b***h"), completely replaced (e.g., "******"), or substituted by family-friendly words (e.g., "naughty"). Because of its simplicity, the keyword-based censoring approach has been widely applied in OSN websites, such as YouTube and World of Warcraft. However, the filtering result is not as desired; brutally removing words from a user's message breaks the readability of the message. Replacing offensive words with symbols usually makes it easy to guess the original offensive words. The idea of substitution seems tempting, but accurate substitution is usually impractical, and inaccurate substitution will introduce additional issues. For example, in 2001, Yahoo! deployed an email filter which would automatically alter certain words in emails to family-friendly words. This filter was criticized as a "foolish filter" by BBC News because of its inaccurate substitution. To demonstrate the shortcoming of keyword censoring approaches, we present an example below. Filtering results with Keyword Censoring Original comment: "What the fuck is wrong with you?" Keyword Censoring: "What the f**k is wrong with you?" According to the presented filtering results, readers can still easily understand what the offender wants to say and may even be able to infer the removed words. This indicates a failure of filtering, because the offensive opinion has been successfully delivered to victims. Also, removing words
  • 50. from a sentence without considering their context breaks the readability of the rest of the sentence. Compared with keyword censoring approaches, our proposed semantic filtering approach is much more sophisticated and can achieve thorough filtering by utilizing the grammatical relations among words in the sentence. Given a sentence containing both offensive and inoffensive words, not only the offensive words but also the inoffensive words assisting in expressing offensive opinions will be removed during our filtering. In this way, we essentially stop the delivery of the offensive opinion, and there will be no way to infer the offensive content in the original message after filtering. 3.2.2 Content Control Approach Content control approaches are usually deployed at the user side or ISP side to prevent users from seeing inappropriate content on the Internet. The filtering is usually done based on certain criteria, such as the URL address, the occurrence of offensive words, and topic classification. Here our focus is on text-based criteria. For example, consider a sentence-based content control approach whose threshold is set as the number of offensive words in the sentence. If at least one offensive word is detected within a sentence, the filter will remove the whole sentence from the user message. To demonstrate the shortcoming of content control approaches, we present an example below. Filtering results with Content Control Original comment: "What the fuck is wrong with you?" Content Control: " "
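For illustration, the two baselines described in Sections 3.2.1 and 3.2.2 can be sketched in a few lines (hypothetical two-word blacklist; this is not the proposed semantic filter):

import re

BLACKLIST = {"fuck", "bitch"}                # hypothetical blacklist for the sketch

def keyword_censor(sentence):
    # keyword censoring: keep the first and last letters, mask the middle with asterisks
    def mask(match):
        w = match.group(0)
        return w[0] + "*" * (len(w) - 2) + w[-1] if w.lower() in BLACKLIST else w
    return re.sub(r"[A-Za-z]+", mask, sentence)

def content_control(sentence):
    # content control: drop the whole sentence once a blacklisted word is found
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}
    return "" if words & BLACKLIST else sentence

comment = "What the fuck is wrong with you?"
print(keyword_censor(comment))      # What the f**k is wrong with you?
print(content_control(comment))     # (empty string: the whole sentence is removed)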
  • 51. However, content control approaches are too coarse-grained to be applied in online communities. First of all, an offender can easily bypass the filtering as long as he knows the estimation criteria. More importantly, a sentence in a user comment may contain both offensive and inoffensive content, and the inoffensive part may be removed falsely because of the offensive part. Not allowing a user to post inoffensive content would easily drive users away and thus affect the growth of the community. Compared with content control approaches, we provide fine-grained filtering by removing only the smallest syntactic part of the sentence containing offensive language. The inoffensive content in the original message will remain; thereby, the user still has the freedom of speech for posting inoffensive content. We believe such delicate filtering will be more acceptable to online communities. 3.2.3 Manual Filtering Approach Manual filtering is believed to produce the best filtering result. Basically, user messages are reviewed by a community administrator before being posted on the website. Filtering results with Manual Filtering Original comment: "What the fuck is wrong with you?" Manual Filtering: "What is wrong with you?" As shown above, the administrator is able to easily understand what the author wants to express and precisely remove only the offensive content within the message. However, manual filtering is very time- and labor-consuming, making it impossible to be widely applied. For example, in the Linda-Ikeji blog community (http://lindaikeji.blogspot.com), the
  • 52. blog administrator will manually review and filter user comments on some celebrities' public blogs. Obviously, users would expect a delay between posting a comment on a blog and the comment being displayed on the blog's webpage. Further, the filtering totally relies on the judgment of the community administrator. Our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words in order to remove the offensive content semantically. The proposed semantic filtering approach will be fully automatic, requiring no interference from any administrator. 3.3 Problem of the existing approaches From the study of the existing approaches and based on the information provided above, the following problems have been identified: • Using the keyword censoring approach, readers can still easily understand what the offender wants to say and may even be able to infer the removed words. This indicates a failure of filtering because the offensive opinion has been successfully delivered to victims. Also, removing words from a sentence without considering their context breaks the readability of the rest of the sentence. • The content control approaches are too coarse-grained to be applied in online communities. An offender can easily bypass the filtering as long as he knows the estimation criteria and, more importantly, a sentence in a user comment may contain both offensive and inoffensive content, so the inoffensive part may be removed falsely because of the offensive part. Not allowing a user to post inoffensive content would easily drive users away and thus affect the growth of the community.
  • 53. • The manual filtering approach is very time- and labor-consuming. The administrator would have to manually review and filter all of the users' comments and messages, making it impossible to be widely applied. Also, the filtering totally relies on the judgment of the community administrator. 3.4 Proposed Filtering Philosophy The goal of our semantic filtering is to achieve filtering results close to those of manual filtering. To reach this goal, the foremost thing is to answer the question of how the filtering should be performed in order to get the desired filtering results. In this section, we present our answer in three steps. First, we analyze the characteristics of offensive text content in user messages. Then, we introduce our filtering philosophy according to the summarized characteristics. Finally, we show how this philosophy is transformed into heuristic rules applicable in the filtering process. 3.4.1 Offensive Language Text Content Based on observation of user comments collected from the YouTube website, a sentence in a user message may contain both offensive and inoffensive text content. Offensive text content is exposed intentionally with the purpose of bringing negative influence to victims (e.g., the readers of the message). The victim receives the negative influence by reading the offensive part of the sentence and understanding the carried offensive information. Hence, the information carried by the original sentence can be represented as I = I_off + I_inoff
  • 54. The offender reaches his goal when the offensive information I_off is delivered to readers. Therefore, to achieve thorough filtering, all words used to deliver I_off should be removed. Meanwhile, with respect to free speech, the part carrying I_inoff should be preserved. 3.4.2 Filtering Philosophy According to the analysis, we propose the philosophy that should be followed in sentence-level offensive language filtering: • Precisely identify all offensive content and remove it semantically, so that viewers will not notice the existence of offensive language in the original sentence; • Keep the readability and the inoffensive content of the sentence, so that the author is still allowed to express his opinion freely as long as it is not offensive. This is called the philosophy of "filtering instead of blocking". To the filter, the philosophy states that: if removing one word will make another word meaningless or confusing to readers, we should consider removing both words to keep the readability of the filtered sentence; meanwhile, we only remove words that are affected by offensive words. For example, in the sentence "Samuel said it and what the fuck is wrong with what he said?", suppose "fuck" is the only offensive word; the sentence can be separated into two parts. The first part, "Samuel said it", is inoffensive, but the second part, "what the fuck is wrong with what he said?", is offensive. Therefore, we should remove the offensive word in the second part while keeping the first part and still making the sentence meaningful and readable, i.e. we won't have: Samuel said it and what the is wrong with what he said? (Wrong)
  • 55. But Samuel said it and what is wrong with what he said? (Correct) The words "the" and "fuck" must be removed in order to keep the transparency of filtering as well as the readability of the filtered text content. 3.4.3 Filtering Rules Specifically, the proposed philosophy is transformed into two heuristic rules to estimate the impact of removing words in a sentence. Rule 1. (Modification Relation) In a modification relation, if the modifier is determined to be offensive, removing the modifier alone is enough; if the head is determined to be offensive, both the head and the modifier should be removed. The modification relation is a binary semantic relationship between two syntactic elements, such as words or phrases. One element is named the head and the other is named the modifier. The modifier is used to describe the head (i.e. the modified component). Semantically, modifiers describe and provide a more accurate definitional meaning for the head. As the modifier acts as a complement, the removal of the modifier typically will not affect the grammaticality of the construction. For example, in the sentence "she likes red apples.", the adjective "red" is used to modify the noun "apples". Removing "red" will keep the readability of the rest of the sentence. We admit that removing modifiers will lose some information carried by the modifiers. However, if the modifier is determined to be removable but the head is not, removing the modifier will remove only the offensive information.
Rule 2 (Pattern Integrity). If removing the offensive word breaks the integrity of the sentence's basic pattern, the whole sentence should be removed in order to preserve readability.

English sentences and clauses are organized in basic patterns, such as "Subject-Verb", "Subject-Verb-Object", "Subject-Verb-Adjective", "Subject-Verb-Adverb", and "Subject-Verb-Noun". Every sentence or clause can be categorized into one pattern, and the integrity of the basic pattern is essential to the readability of the content. For example, the sentence "she sleeps on the sofa." follows the "Subject-Verb" pattern. If we remove only "sleeps", the rest of the sentence, "she on the sofa.", becomes meaningless. We apply these two rules during the filtering of sentences.

3.5 Identify Removable Content by Grammatical Relations
A text or user message can be decomposed into a sequence of sentences, and each sentence is treated as a unit in filtering. Given a sentence containing both offensive and inoffensive words, the goal of filtering is to identify the inoffensive words that should be removed together with the offensive words. We define the words that should be removed by the filtering as "removable" words. We observe that manual filtering achieves this goal easily because a human can understand the context of words in a sentence and precisely identify which words should be removed along with the known offensive words. We therefore mimic manual filtering: we extract the grammatical relations among the words of a sentence and use the proposed filtering rules to estimate the impact of removing offensive words on the other, inoffensive words based on the extracted grammatical relations. A brief code sketch of the initial blacklist scan that precedes this analysis is given below; the complete procedure is then formalized in Algorithm 1.
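To make the pre-filtering scan concrete, the following is a minimal sketch (with hypothetical helper names of our own) of chunking a comment into sentences and checking each one against the blacklist, corresponding to lines 2-7 of Algorithm 1 below. Sentences flagged here would go on to the grammatical analysis of Section 3.5.1.

import java.text.BreakIterator;
import java.util.*;

// Sketch of the pre-filtering scan: chunk a comment into sentences and flag
// those containing blacklisted words. Sentences with no offensive word are
// passed through unchanged; the rest continue to the grammatical analysis.
public class OffensiveScan {

    static List<String> chunkIntoSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    static boolean containsOffensiveWord(String sentence, Set<String> blacklist) {
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (blacklist.contains(token)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> blacklist = Set.of("fuck");   // toy blacklist for illustration only
        String comment = "Samuel said it. What the fuck is wrong with what he said?";
        for (String s : chunkIntoSentences(comment)) {
            System.out.println((containsOffensiveWord(s, blacklist) ? "FILTER: " : "KEEP:   ") + s);
        }
    }
}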
Specifically, the proposed approach consists of two steps. In the first step, we scan the sentence to see whether offensive words exist. If they do, we retrieve grammatical information (i.e., Part-of-Speech tags and typed dependency relations) for the words in the sentence. Using the retrieved grammatical information, we create a tree data structure, named RelTree, for the second-step estimation. In the second step, we propose a set of estimation functions that follow the filtering rules introduced above. Using the RelTree structure and the proposed rules, we then estimate whether there are inoffensive words that should be removed together with the identified offensive words. The overall idea of our semantic filtering approach is shown in Algorithm 1 below. Within the algorithm, the functions POStagging and TDgenerator generate Part-of-Speech tags and typed dependency relations, respectively; we use existing NLP (Natural Language Processing) tools to implement these two functions. We focus on the design of the two other functions, CreateRelTree and EstimateRelTree. In this methodology, we assume that the filtering is based on a comprehensive offensive lexicon containing all offensive words; words that do not appear in the lexicon are considered inoffensive.

input : a text comment T, a blacklist of offensive words Blacklist
output: a filtered text comment T′
1 T′ ← "";
2 senList ← chunk T into a list of sentences;
3 foreach sentence s ∈ senList do
4     scan s for offensive words using Blacklist;
5     if no offensive word found then
6         T′ ← T′ + s;
7     end
8     else
9         PTree ← POStagging(s);                          /* get parse tree */
10        TDset ← TDgenerator(s);                         /* get typed dependency relations */
11        RelTree ← CreateRelTree(PTree, TDset);          /* create RelTree */
12        LabelRelTree ← EstimateRelTree(RelTree, Blacklist);  /* estimate using RelTree */
13        s′ ← remove all words in LabelRelTree that are labeled as "removable";
14        T′ ← T′ + s′;
15    end
16 end
17 Return T′;
Algorithm 1: Procedure of Semantic Filtering

3.5.1 First Step: Grammatical Analysis
In the first step, we extract two types of grammatical information from a given sentence. One is the Part-of-Speech information associated with every word. The other is the dependency relation
among words. Part-of-Speech information helps us to understand the organization of a sentence, which is essential for keeping the readability when we try to remove words from it. Dependency relations are used directly to estimate the impact of removing one word on other semantically related words, making the filtering more "meaningful". Combining these two types of information, we create a new data structure, called RelTree, for the next-step estimation.

3.5.1.1 Part-of-Speech Tagging
Part-of-Speech tagging has been widely used in Natural Language Processing applications to identify the syntactic properties of lexical items in a sentence, such as words or phrases. Through Part-of-Speech tagging, the sentence can be represented in a tree structure based on Part-of-Speech tags. We adopt the Penn Treebank tag set for our Part-of-Speech tagging. An example of a Penn Treebank style parse tree is shown in Figure 1 below.

Figure 1: A parse tree of a sentence based on Part-of-Speech tags
Here, the leaf nodes are the words appearing in the sentence. The non-leaf nodes represent syntactic elements such as phrases or clauses, and each element consists of the words within its subtree. For example, the words "said" and "it" constitute a Verb Phrase (i.e., VP) node.

3.5.1.2 Typed Dependency Relations
Typed dependencies are a family of general relations describing the grammatical dependencies within a sentence, proposed by the Stanford Natural Language Processing Group. Each typed dependency consists of a dependency type and a (governor, dependent) word pair. For example, in the sentence "what the fuck is wrong with what he said?", the typed dependency amod(wrong, fuck) means that "fuck" is an adjectival modifier of the phrase containing "wrong". A typed dependency may describe the dependency relation between two syntactic elements in general, not only between words.

Figure 2: An example of a typed dependency graph
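As an illustration of how the POStagging and TDgenerator functions of Algorithm 1 can be realized with an existing NLP tool, the sketch below uses the Stanford CoreNLP pipeline to produce both a Penn Treebank parse tree (as in Figure 1) and the typed dependencies (as in Figure 2) for a sentence. This is only a sketch: it assumes the CoreNLP library is on the classpath, and the exact annotation keys and dependency styles vary between CoreNLP versions.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class GrammaticalAnalysis {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("What the fuck is wrong with what he said?");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Penn Treebank style parse tree (the POStagging step)
            Tree parseTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            parseTree.pennPrint();

            // Typed dependency relations (the TDgenerator step)
            SemanticGraph deps = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(deps.toString());
        }
    }
}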
The typed dependencies in a sentence can be represented as a graph. For example, Figure 2 shows the typed dependency relations for the same sentence shown in Figure 1. We explain the relations appearing in Figure 2 from left to right: the nominal subject relation, nsubj(it, Samuel), means that "Samuel" is the syntactic subject of the clause (likewise nsubj(wrong, he)); the copula relation, cop(it, said), means that "it" is the complement of the verb "said" (likewise cop(wrong, is)); the determiner relation, det(fuck, the), means that "the" is a determiner of "fuck"; the adjectival modifier, amod(wrong, fuck), means that "fuck" serves as an adjectival modifier of "wrong"; and the conjunct relation, conj_and(it, wrong), means that the coordinating conjunction "and" connects the two elements headed by "it" and "wrong", respectively.

3.5.1.3 Relation Tree (RelTree)
Both the Part-of-Speech and the typed dependency relations are utilized in the second-step estimation. The parse tree shows the syntactic organization of the sentence, while the typed dependency relations provide semantic information among words. To combine both kinds of information, we propose a new data structure called RelTree. In a RelTree, the leaf nodes are the words in the sentence, and each non-leaf node represents either a phrase or a clause inside the sentence. With each non-leaf node, we associate the set of typed dependency relations over the words within its subtree; each node contains only the typed dependency relations that have not already appeared in the nodes of its subtree.
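To make this structure concrete, a minimal sketch of a RelTree node is given below. The class and field names are hypothetical: a node carries its label (a word for leaves, a phrase or clause tag otherwise), the set of words covered by its subtree, the typed dependency relations first claimed at this node, and the "removable" flag assigned during the second-step estimation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal representation of a typed dependency such as det(fuck, the).
class TypedDependency {
    String type, governor, dependent;
    TypedDependency(String type, String governor, String dependent) {
        this.type = type; this.governor = governor; this.dependent = dependent;
    }
}

// Sketch of a RelTree node as described in Section 3.5.1.3.
class RelTreeNode {
    String label;                             // the word (leaves) or phrase/clause tag (non-leaves)
    boolean removable = false;                // assigned during the bottom-up estimation (Section 3.5.2)
    List<RelTreeNode> children = new ArrayList<>();
    Set<String> wordSet = new HashSet<>();    // words covered by this subtree
    List<TypedDependency> relations = new ArrayList<>();  // relations not already claimed by a descendant

    RelTreeNode(String label) { this.label = label; }
    boolean isLeaf() { return children.isEmpty(); }
}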
Figure 3: A RelTree combining the parse tree and typed dependency relations

input : a parse tree PTree, a set of typed dependency relations TDset
output: a RelTree RelTree
1 RelTree ← PTree;
2 Remove all word nodes in RelTree;
3 Traverse RelTree in postorder foreach node n visited do
4     if n is a leaf node then
5         n.wordset ← {n};    /* create word nodes */
6     end
7     if n is not a leaf node then
8         n.wordset ← ∅;
9         foreach direct child node ci do
10            n.wordset ← n.wordset ∪ ci.wordset;
11        end
12        n.rel ← ∅;
13        foreach relation Ti(Gi, Di) in TDset do
14            if Gi ∈ n.wordset and Di ∈ n.wordset then
15                n.rel ← n.rel ∪ Ti(Gi, Di);
16                TDset ← TDset − Ti(Gi, Di);
17            end
18        end
19    end
20 end
21 Return RelTree;
Algorithm 2: Create a RelTree using the parse tree and typed dependency relations

The RelTree data structure is proposed only for the convenience of the offensiveness estimation in the next step. Algorithm 2 shows the procedure for RelTree construction. With the parse tree PTree given, the computational complexity of CreateRelTree is determined by the post-order traversal and the search in TDset. Since the number of relations never exceeds N(N−1)/2, where N is the number of words in the sentence, the computational complexity is O(N³). This complexity is acceptable in practice, and there are many ways to improve the efficiency of an implementation of this algorithm. A compact code sketch of CreateRelTree is given below.
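The code sketch below is our condensed reading of CreateRelTree. It reuses the RelTreeNode and TypedDependency classes sketched in Section 3.5.1.3 and assumes the parse tree has already been converted into RelTreeNode form with the words as leaves; for brevity, words are identified by their surface form, whereas a real implementation would use token indices to handle repeated words.

import java.util.ArrayList;
import java.util.List;

// Sketch of Algorithm 2 (CreateRelTree): fill in word sets bottom-up and attach
// each typed dependency to the lowest node whose subtree covers both of its ends.
public class CreateRelTree {

    static RelTreeNode createRelTree(RelTreeNode parseTree, List<TypedDependency> tdSet) {
        annotate(parseTree, new ArrayList<>(tdSet));   // work on a copy of the relation set
        return parseTree;
    }

    private static void annotate(RelTreeNode n, List<TypedDependency> remaining) {
        if (n.isLeaf()) {
            n.wordSet.add(n.label);                    // a leaf covers exactly its own word
            return;
        }
        for (RelTreeNode child : n.children) {
            annotate(child, remaining);                // post-order: children first
            n.wordSet.addAll(child.wordSet);           // union of the children's word sets
        }
        // claim every relation whose governor and dependent both fall inside this subtree
        remaining.removeIf(td -> {
            if (n.wordSet.contains(td.governor) && n.wordSet.contains(td.dependent)) {
                n.relations.add(td);                   // consumed here, so it never reappears above
                return true;
            }
            return false;
        });
    }
}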
3.5.2 Second Step: Bottom-Up Estimation
In the second step, we first use the offensive lexicon to identify offensive words in the sentence. Every leaf node containing an offensive word is labeled "removable". Starting from the leaf nodes, we then perform a bottom-up estimation through a post-order traversal of the RelTree. For each non-leaf node in the RelTree, we estimate whether it should be removed based on (1) the associated typed dependency relations and (2) its child nodes within its subtree. If a non-leaf node is estimated to be "removable", all of its descendants, including the words within its subtree, are also labeled "removable". For a non-leaf node, "removable" means that all words, phrases, or even clauses within its subtree have been determined to be removed at the end of the filtering. The estimation process itself has two parts: we first estimate based on typed dependency relations, and then apply a set of heuristic rules as a complement.

3.5.2.1 Estimation with Typed Dependency Relations
Consider a non-leaf node n in a RelTree with a set n.rel of typed dependency relations. Each relation describes a semantic connection between a governor word and a dependent word; both words are leaf nodes in the subtree rooted at n. The set n.rel can be empty when n has only one child node. For each typed dependency relation in n.rel, we study its semantic information and map it to an estimation function. These estimation functions and the mapping are created following the Modification Relation and Pattern Integrity rules. Take the Direct Object (dobj) relation as an example. The relation dobj(G, D) is defined as follows: the direct object of the verb phrase containing the governor word G is the noun phrase containing the dependent word D. For example, in the relation dobj(win, match), "win" is the governor word and "match" is the dependent word. According to the Pattern Integrity rule, we know
that "Subject-Verb-Object" is a basic pattern. Therefore, if either the phrase containing G or the phrase containing D is to be removed because of offensiveness, both phrases should be removed together. To formalize this, we define an estimation function H(T) = H(P(G)) OR H(P(D)) and map the relation dobj(G, D) to it. We use the symbols C(G) and P(G) to denote the clause and the phrase that contain the word G as head, respectively. In this estimation function, H(T) is the label to be assigned to relation T, and H(P(G)) is the label of the phrase node containing G in the RelTree. Using the estimation functions, we generate a label for every relation associated with node n and then for the node itself. If a relation T(G, D) of node n is estimated and labeled "removable", the two child nodes of n containing word G and word D are labeled "removable". If all relations in n.rel are labeled "removable", the node n, as well as all of its descendants, is labeled "removable".

3.5.2.2 Estimation with Heuristic Rules
Heuristic rules are applied as a complement after the typed dependency relation estimation. Applying heuristic rules is necessary mainly for two reasons. First, typed dependency relations carry only limited syntactic information; for example, the possessive ending (i.e., POS) tag, which is quite a common Part-of-Speech tag, is ignored during typed dependency tagging. Secondly, not all relations between syntactic elements in a sentence can be classified into one of the typed dependency relation types. For such uncertain relations, a generic grammatical relation named dep is defined. To avoid confusing the filter, we map dep to the rule H(T) = H(G) AND H(D), which means that labeling either G or D as removable on its own affects neither the other word nor the
label of T. Because the dep relation stands for an uncertain relation, we have to rely on the Part-of-Speech tags in the RelTree for our filtering. Take the conj tag node rule as an example. The conjunct relation (conj) is a relation between two syntactic elements connected by a coordinating conjunction, such as "and". The parameters of conj do not include the coordinating conjunction itself; however, the coordinating conjunction explicitly sits between the two parameters of conj. If either side is determined to be removable, the coordinating conjunction should be removed as well. For example, in the sentence "I like A and B", if either A or B is removed, the coordinating conjunction "and" should also be removed.

Figure 4: Estimate a RelTree in a bottom-up manner
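The mapping from relation types to estimation functions can be sketched as a small lookup table. The assignments below are illustrative, not exhaustive, and follow the two rules of Section 3.4.3: relations that bind a basic pattern together (e.g., dobj, nsubj, cop) combine with OR, modification relations (e.g., amod, det) follow Rule 1, and the catch-all dep combines with AND. An actual implementation would cover the full Stanford typed dependency inventory.

import java.util.Map;

// Sketch of mapping typed dependency types to label-combination functions
// (true = "removable"), following the Modification Relation and Pattern
// Integrity rules.
public class RelationEstimation {

    enum Combine { OR, AND, MODIFIER_ONLY }

    static final Map<String, Combine> RULES = Map.of(
            "dobj",  Combine.OR,             // Pattern Integrity: if the object goes, the verb goes too
            "nsubj", Combine.OR,             // Pattern Integrity: subject and predicate stand or fall together
            "cop",   Combine.OR,
            "amod",  Combine.MODIFIER_ONLY,  // Rule 1: an offensive modifier goes alone
            "det",   Combine.MODIFIER_ONLY,  // Rule 1: an offensive head drags its determiner along
            "dep",   Combine.AND             // uncertain relation: the two sides do not affect each other
    );

    // Label for a relation given the labels of the governor-side and dependent-side nodes.
    static boolean estimate(String type, boolean governorRemovable, boolean dependentRemovable) {
        switch (RULES.getOrDefault(type, Combine.AND)) {
            case OR:            return governorRemovable || dependentRemovable;
            case AND:           return governorRemovable && dependentRemovable;
            case MODIFIER_ONLY: return governorRemovable;  // offensive head (governor) => remove both ends
            default:            return false;
        }
    }
}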
3.5.2.3 Estimation Algorithm
To estimate and assign labels for all nodes in a RelTree, we also perform the estimation in a bottom-up manner. Figure 4 shows an example estimation process. The number in each circle represents the order of estimation of that node in the RelTree, and the dashed nodes are those estimated as "removable". For example, the clause node with nsubj(you, fuck) is estimated to be "removable"; therefore, its two child nodes, containing "you" and "fuck" respectively, are both labeled "removable". Moreover, the word "and" is removable according to the heuristic rule (i.e., the conj tag node rule), in order to keep the filtering transparent to readers. Finally, the inoffensive words "what", "the", "is", "wrong", "with", "he", and "said" are removed together with the offensive word "fuck" in the filtering. According to Algorithm 2, each typed dependency relation appears exactly once in the RelTree, so no relation is checked repeatedly during the estimation. The cleaned sentence after filtering in this example is "Samuel said it.". As we can see, the result satisfies the requirements of our proposed filtering philosophy: only the offensive part, "what the fuck is wrong with what he said", is removed, and the reader can still get the inoffensive information. The detailed algorithm for the estimation process is presented below.

input : a RelTree RelTree, a blacklist of offensive words Blacklist
output: a labeled RelTree LabelRelTree
1 LabelRelTree ← RelTree;
2 Label all leaf nodes containing offensive words as "removable" in LabelRelTree;
3 Traverse LabelRelTree in postorder foreach node n visited do
4     if n is a leaf node then
5         ignore;    /* already labeled */
6     end
7     if n is not a leaf node then
8         if n only has one child node then
9             n.label ← n.child.label;
10        end
11        if n has more than one child node then
12            Estimate the label for n from its associated relations and the labels of its children, using the proposed estimation functions and heuristic rules;
13        end
14    end
15 end
16 Return LabelRelTree;
Algorithm 3: Estimate nodes in the RelTree
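Algorithm 3 can be sketched in code as follows, reusing the RelTreeNode and TypedDependency classes from Section 3.5.1.3 and the estimate() mapping sketched in Section 3.5.2.2. This is our condensed reading of the algorithm, not the actual implementation: each relation's label is computed from the labels of the child nodes covering its governor and dependent, a removable relation marks both of those subtrees, and a node all of whose relations are removable takes its whole subtree with it.

import java.util.Set;

// Sketch of Algorithm 3 (EstimateRelTree): post-order labeling of a RelTree.
public class EstimateRelTree {

    static void estimateTree(RelTreeNode n, Set<String> blacklist) {
        if (n.isLeaf()) {                                // leaves: label blacklisted words
            n.removable = blacklist.contains(n.label.toLowerCase());
            return;
        }
        for (RelTreeNode child : n.children) {
            estimateTree(child, blacklist);              // post-order: children first
        }
        if (n.children.size() == 1) {
            n.removable = n.children.get(0).removable;   // single child: inherit its label
            return;
        }
        boolean allRemovable = !n.relations.isEmpty();
        for (TypedDependency td : n.relations) {
            RelTreeNode govChild = childCovering(n, td.governor);
            RelTreeNode depChild = childCovering(n, td.dependent);
            boolean relRemovable = RelationEstimation.estimate(
                    td.type, govChild.removable, depChild.removable);
            if (relRemovable) {                          // a removable relation takes both sides with it
                markSubtree(govChild);
                markSubtree(depChild);
            } else {
                allRemovable = false;
            }
        }
        n.removable = allRemovable;
        if (n.removable) markSubtree(n);                 // a removable node removes its whole subtree
    }

    private static RelTreeNode childCovering(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children) {
            if (c.wordSet.contains(word)) return c;
        }
        return n.children.get(0);                        // fallback, for illustration only
    }

    private static void markSubtree(RelTreeNode n) {
        n.removable = true;
        for (RelTreeNode c : n.children) markSubtree(c);
    }
}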
CHAPTER FOUR
IMPLEMENTATION

4.1 JUSTIFICATION OF PROGRAMMING LANGUAGE USED
The offensive language filtering system is an online application implemented using HTML, JavaServer Pages (JSP), JavaScript, and the MySQL relational database.

4.1.1 HTML
HTML, which stands for Hypertext Markup Language, is the predominant markup language for web pages. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, and lists, as well as for links, quotes and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of "tags" surrounded by angle brackets within the webpage content. It can include or load scripts in languages such as JavaScript, which affect the behaviour of HTML processors like web browsers, and Cascading Style Sheets (CSS), which define the appearance and layout of text and other material.
4.1.2 JAVASCRIPT
JavaScript has been around for several years now, in many different flavors. Its main benefit is that it adds interaction between the website and its visitors at the cost of a little extra work by the web developer, allowing industrious web masters to get more out of their websites than HTML and CSS alone can provide. By definition, JavaScript is a client-side scripting language, which means the web surfer's browser runs the script. The opposite of client-side is server-side, as in a language like PHP, whose scripts are run by the web hosting server. There are many uses (and abuses!) for the powerful JavaScript language. Here, it is used for:
- Alert messages
- Popup windows
- HTML form data validation

4.1.3 JAVASERVER PAGES (JSP)
JSP is a server-side technology for embedding Java in HTML pages; its goal is to allow developers to write dynamically generated pages quickly, and it is specifically designed for creating dynamic web pages. JSP allows you to:
- Reduce the time needed to create large websites.
- Create a customized user experience for visitors based on information you have gathered from them.
- Open up thousands of possibilities for online tools.
JSP is freely available, and open source implementations of the servlet/JSP specification (such as Apache Tomcat) exist. When someone visits your JSP webpage, your web server processes the Java code: it determines which parts to show to visitors (content and pictures), hides the other parts (file operations, calculations, etc.), and translates your JSP into HTML. After the translation into HTML, it sends the webpage to your visitor's web browser.

4.1.4 MYSQL
MySQL is the most popular open source database server in existence because of its consistently fast performance, high reliability and ease of use. It is used in more than 6 million installations, ranging from large corporations to specialized embedded applications, on every continent in the world. It is very commonly used in conjunction with PHP scripts to create dynamic and powerful server applications. MySQL has been criticized in the past because it does not have all the features of other Database Management Systems. However, MySQL continues to improve significantly with each major upgrade and has gained great popularity because of these improvements.
4.1.5 CSS
Cascading Style Sheets (CSS) are a way to control the look and feel of HTML documents in an organized and efficient manner. CSS enables us to add new looks to existing HTML, completely restyle a website with only a few changes to the CSS code, and reuse a "style" on any webpage we wish. With CSS you will be able to:
- Add new looks to your old HTML
- Completely restyle a website with only a few changes to your CSS code
- Use the "style" you create on any webpage you wish

4.2 System Specification
The system specification is divided into two parts:
1. Hardware Specification
2. Software Specification

4.2.1 HARDWARE SPECIFICATION FOR THE APPLICATION
Any computer tagged by the manufacturer as a workstation can be used to access this application through an internet browser, but the following minimum specification is required to host the application:
1. A computer tagged by the manufacturer as a server
2. A Core 2 Duo processor or above
3. At least 2 GB of memory
4. A keyboard and a mouse
5. A hard disk of 120 GB or above

4.2.2 SOFTWARE SPECIFICATION FOR THE APPLICATION
- Windows Server 2005 or above
- Microsoft .NET Framework version 3.0 or above must be installed
- Microsoft SQL Server 2005 or above should be installed
- Microsoft Internet Information Server (IIS) should be enabled
- Server FTP capability must be enabled

4.3 System Implementation
This section briefly describes the screens of the online application.

4.3.1 Application Login Screen
The system contains a secure login panel that requires a combination of an email address and a password. The email address is used because it is meant to be unique.
Fig. 4.1 – Web Application Login Screen

4.3.2 Application Registration Page
Fig. 4.2 – Web Application Registration Page
Here the user fills in his/her details, and the system verifies that all the details provided are correct. The page also includes a captcha image, which acts as a spam guard to ensure that the data was entered by a human and not a robot.

4.3.3 Post and Comment Page
Fig. 4.3 – Filtered Post Page Using Keyword Censoring Approach
Fig. 4.4 – Filtered Post Page Using Content Control Censoring Approach
Fig. 4.5 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above (Figs. 4.3, 4.4 and 4.5), we can see that our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words, and that it has removed the offensive content semantically. The approach is fully automated: it requires no intervention from an administrator while still eliminating the offensive words in the sentence. "What the fuck is wrong with you?" has been changed to "What is wrong with you?" by the proposed semantic filtering approach, instead of becoming "what the f*** is wrong with you?", which would still deliver the offensive word to the victim. Our semantic filtering result is also very close to that of manual filtering, as the desired result is produced simply by applying the heuristic rules during the filtering process.

Fig. 4.6 – Filtered Post Page Using Keyword Censoring Approach
Fig. 4.7 – Filtered Post Page Using Content Control Censoring Approach
Fig. 4.8 – Filtered Post Page Using FOLOC Censoring Approach
Looking at the three post and comment pages above in Figs. 4.6, 4.7 and 4.8, we can see that our proposed semantic filtering approach again mimics the procedure of manual filtering by trying to understand the relations among words, and that it has removed the offensive content semantically. The approach is fully automated: it requires no intervention from an administrator while still eliminating the offensive words in the sentence. "I have told all these bitches to stop calling my husband's phone" has been changed to "I have told all to stop calling my husband's phone" by the proposed semantic filtering approach, instead of becoming "I have told all these b****** to stop calling my husband's phone", which would still deliver the offensive word to the victim. Our semantic filtering result is again very close to that of manual filtering, as the desired result is produced simply by applying the heuristic rules during the filtering process.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.1 Summary and Conclusion
Online social networking sites have become increasingly popular with children, especially young teens, as a place where they can meet other people, communicate, and exchange information. This has also brought cyberbullying, a fast-growing trend that experts believe is more harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via online social networking communities, so victims can be reached at any time and in any place. For many children, home is no longer a refuge from bullies: children can escape threats and abuse in the classroom, only to find offensive comments and posts from the same tormentors when they arrive home. There is no safe place anymore, and one can be bullied 24/7, even in the privacy of one's own bedroom. In this work, however, we do not only try to filter out offensive content; we also make sure the filtered sentences still make sense. Statistical analysis has revealed that more than 60% of insulting messages are posted as direct insults, and direct insulting messages always contain insulting words or phrases. From a psychological point of view, if these messages are identified and a user is restricted from sending them, the human intention to post or exchange abusive messages can be reduced significantly. Offensive language is a serious problem facing the online community. Our semantic filtering technique is based on the grammatical relations of the words in a sentence, so that the rest of the filtered sentence is readable and the existence of offensive words in the original sentence is hard
to notice. We tested the effectiveness of our approach on a large dataset, and the results show that our techniques are very effective and accurate, with little processing overhead.

5.2 Recommendations
Our future work includes looking at the issues described in the discussion section. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other lightweight NLP techniques to speed up sentence parsing. Last but not least, we also plan to extend our filtering approach to support other languages such as Chinese and French.