Ontology Based Opinion Mining for Book Reviews
Mohammed Samsudeen Firzhan Naqash
(138220P)
Degree of Master of Science
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka
March 2017
Ontology Based Opinion Mining for Book Reviews
Mohammed Samsudeen Firzhan Naqash
(138220P)
Dissertation submitted in partial fulfilment of the requirements for the degree of Master of Science
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka
March 2017
ABSTRACT
The recent burst in web usage has contributed to the growth of various kinds of online reviews, such
as consumer product reviews, legal reviews, political reviews, movie reviews and book reviews.
Some of these reviews are context sensitive and others are not. Objective reviews have been studied
heavily to date, with opinion mining typically performed using predefined tags that are not context
sensitive. Subjective reviews, on the other hand, are context sensitive and depend on the polarity
orientation of each term in its sentence. Opinion mining on subjective reviews has not yet been
explored in depth.
Unlike reviews of movies and consumer electronic products, there has been no significant work in
the area of opinion mining on book reviews, which can be categorised as subjective. The contents of
book reviews are subjective, and each review differs from the rest in various ways. Aggregating
these differing opinions into a single perspective on book aspects may therefore add value for book
readers in both the academic and commercial sectors.
This research introduces a fine-grained approach to opinion mining on online non-scholarly book
reviews, in which an ontology reference model is an essential part of the opinion extraction process,
taking into account the relations between concepts. In other words, this research exploits the benefits
of an ontology structure for mining context-sensitive book reviews. The methodology adopted for
mining context-sensitive reviews yielded promising results when tested on an Amazon data set of
book reviews.
ACKNOWLEDGEMENT
I sincerely thank my family for continuously supporting and motivating me to make this report a
success. I would like to thank Dr. Daya Chinthana Wimalasuriya for providing me with valuable
guidance during the initial stages of this project. Last but not least, I would like to thank the
University of Moratuwa for granting me access to the library resources for my research work.
Contents

ABSTRACT
ACKNOWLEDGEMENT
List of Figures
List of Tables
LIST OF ABBREVIATIONS
Chapter 1
Introduction
1.1 Opinion Mining
1.3 Ontology
1.4 The Problem/opportunity
1.5 Motivation
1.6 Objectives
1.7 Contributions
1.8 Outline of thesis
1.8.1 Chapter 02 - Literature Survey
1.8.2 Chapter 03 - Methodology
1.8.3 Chapter 04 - Evaluation
1.8.4 Chapter 05 - Conclusion and Future Work
Chapter 2
Literature Survey
2.1 Opinion Mining
2.1.1 Document level sentiment analysis
2.1.2 Sentence level sentiment analysis
2.1.3 Aspect level sentiment analysis
2.1.4 Domains and Sentiment Analysis
2.1.5 Sentiment Lexicon
2.1.6 Tools for Sentiment Analysis
2.2 Ontology
2.2.1 Advantages of using an ontology
2.2.2 Semantic Web and OWL
2.2.3 Wordnet
2.2.4 Ontology Development
2.3.3 Ontology and Sentiment Analysis
2.3.4 Related Existing Ontology
2.4 Data Pre-processing
2.4.1 Lexical Analysis
2.4.2 Name Entity Recognition (NER)
2.5 Discussion
Chapter 3
Methodology
3.1 Book Ontology
3.1.1 Data Selection and Preparation
3.1.2 Ontology Development
3.1.3 Ontology Model
3.2 Pre-processing
3.3 Feature extraction with Ontology
3.3.1 Loading ontology concepts
3.3.2 Prepare the pre-processed review text for feature extraction
3.3.3 Prepare the tuples of aspects and modifiers
3.4 Feature Score Calculation
3.4.1 Calculating the score for a single feature
3.5 Polarity Identification
3.5.1 Loading the SentiWordNet 3.0 Dictionary
3.5.2 Calculating the Polarity
3.6 Sentiment Analysis
3.6.1 Euclidean Vector
3.6.2 Calculating the Polarity
Chapter 4
Evaluation
4.1 Selecting Corpus for reviews
4.1.1 Primary set of book reviews
4.1.2 Secondary set of book reviews
4.2 Evaluation of Opinion Mining Methodology
4.3 Preparing human evaluators
4.4 Detecting the Ideal Range for Tupling Concepts and Adjectives
4.5 Result of Aspect detection
4.6 Results of Sentiment Analysis
4.7 Summary of Evaluation
Chapter 5
Conclusion and Future Work
References
List of Figures

Figure 2.1: Classes, Individuals and Properties
Figure 3.1: Overview of the Implementation Architecture
Figure 3.2: Ontology Development
Figure 3.3: Book Model Ontology in Protege
Figure 3.4: Fragment of Concept Hierarchy of Book
Figure 3.5: Feature Model Ontology in Protege
Figure 3.6: Fragment of Concept Hierarchy of Features
Figure 3.7: Steps for Pre-processing Review Texts
Figure 3.8: Feature Identification
Figure 4.1: Negative Range Detection Results
Figure 4.2: Positive Range Detection Results
Figure 4.3: Positive Opinion Detection Results
Figure 4.4: Negative Opinion Detection Results
Figure 4.5: Aspect Frequency Comparison between Positive and Negative Reviews for the Secondary Data Set
List of Tables

Table 2.1: Document Level Opinion Mining
Table 3.1: OWL Properties of the Feature Model
Table 3.2: Feature Scores
Table 3.3: SentiWN: 'exciting'
Table 4.1: Example of Feature Annotation by Coders
Table 4.2: Output of the Range Detection Experiment
Table 4.3: Aspect/Feature Detection Results
Table 4.4: Opinion Detection Results
Table 4.5: Descriptive Statistics of Positive Book Review Concepts
Table 4.6: Descriptive Statistics of Negative Book Review Concepts
Table 5.1: Feature Scores for the Reviews of Water for Elephants
LIST OF ABBREVIATIONS

Abbreviation  Description
ADJ           Adjective
FN            False Negative
FP            False Positive
HTML          Hyper Text Markup Language
IE            Information Extraction
LHS           Left Hand Side
ML            Machine Learning
MUC           Message Understanding Conference
NED           Named Entity Detection
NER           Named Entity Recognition
NLP           Natural Language Processing
OPHLC         Opinion High Level Concepts
OWL           Web Ontology Language
PDF           Portable Document Format
PN            Positive Negative
POS           Part Of Speech
RDF           Resource Description Framework
RHS           Right Hand Side
SEO           Search Engine Optimization
SentiNet      SentiWordNet
SO            Subject Objectivity
SP            Sentiment Polarity
SVM           Support Vector Machine
TP            True Positive
W3C           World Wide Web Consortium
XML           Extensible Markup Language
Chapter 1
Introduction
The recent burst in web usage has contributed to the growth of online book reviews. Various readers,
poets, novelists, journalists and websites like Amazon (http://www.amazon.com) encourage others to
write more reviews on books. Opinions vary as different people have different views about the book
they are reviewing. Therefore, some aspects carry positive comments whereas some carry negative
comments [1].
These opinions are useful for readers to decide whether a particular book matches their taste. When
considering the reviews, initially the reviewers consider the high level aspects like introduction of
subject, introduction of author, summary of the intended purpose of the book, contribution of the
book on improving the discipline, description on how the author approaches the topic, the rigor of
the research, the logic of argument, readability of prose, comparison with earlier or similar books in
the same domain or discipline, and evaluation of the book's merits and usefulness [22]. Each high
level concept has its own set of sub-aspects. For example, the high-level concept method of
development considers attributes such as description, narration, exposition and argument [22].
Therefore, users have to go through each granular-level attribute to summarize the high-level
concepts. Based on the orientation of those attributes, they can then conclude whether the book has
content that interests them.
Different reviewers may go through various attributes and may identify the aspects as positive,
negative, or neutral. In other words, a review can contain a mixture of negative, positive and neutral
comments.
Therefore, this project aims at performing opinion mining on book reviews from the high concept
level to the granular attribute level. Since this opinion mining is being done across various levels of
reviews, the final outcome of the opinion mining on book reviews may be used to find books based
on finer granular level attributes or based on the high level concepts of the book.
1.1 Opinion Mining
Opinion mining is a discipline that detects or identifies user opinions in textual content. Whereas
traditional topic-oriented text mining deals with objective topics, opinion mining must also capture
subjective perceptions, which makes it a fundamentally different task. Text mining focuses on
specific topics as well as topic shifts, whereas mining opinions is considerably more difficult than
mining topics. This is partially because topics are represented explicitly with keywords, while
opinions are expressed with subtlety [2][3][5][9][23].
Online book reviews contain user-generated content: unstructured text whose opinions may carry a
positive, negative or neutral sentiment orientation. This orientation is called sentiment polarity
[2][23]. Opinion mining identifies the polarities of all the important sentiments and eventually
derives a global sentiment polarity for the unstructured text as a whole.
Opinion mining can be performed at multiple levels: document level, sentence level and phrase
(aspect) level [16][17][21]. Each level addresses different issues; for example, document level
opinion mining is used to classify documents or to detect spam.
1.3 Ontology
Ontology is a formal explicit description of concepts in a domain of discourse. It defines a common
vocabulary for researchers who need to share information in a domain. It includes machine-
interpretable definitions of basic concepts in the domain and relations among them
[1][7][23][24][25].
An ontology specifies the properties of each concept, the restrictions on each concept and the
relationships among the concepts. The knowledge base of an ontology consists of a set of individual
instances of those concepts.
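As a concrete illustration of these ideas, the fragment below sketches a tiny, hand-rolled book-domain ontology in Python: a concept hierarchy, object properties relating concepts, and individuals instantiating concepts. The class and property names (Novel, hasCharacter, and so on) are illustrative assumptions only, not part of the ontology models developed later in this thesis; a real model would be expressed in OWL.

```python
# A minimal hand-rolled ontology fragment for the book domain (illustrative).
# Concepts form a hierarchy, properties relate concepts, and individuals
# instantiate concepts.

concepts = {
    "Book":      {"subclass_of": None},
    "Novel":     {"subclass_of": "Book"},
    "Character": {"subclass_of": None},
    "Plot":      {"subclass_of": None},
}

# Object properties expressed as (domain, property, range) triples.
properties = [
    ("Book", "hasCharacter", "Character"),
    ("Book", "hasPlot", "Plot"),
]

# Individuals (instances) in the knowledge base.
individuals = {"WaterForElephants": "Novel"}

def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or is a transitive subclass of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = concepts[concept]["subclass_of"]
    return False

# A Novel is a Book, so Book-level properties also apply to novels.
print(is_a("Novel", "Book"))  # True
```

Because the hierarchy is traversed transitively, any property whose domain is Book is inherited by Novel, which is what lets an ontology-based parser match a specific concept against a more general one.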
1.4 The Problem/opportunity
As detailed in previous sections, there are a huge number of online book reviews available all over
the web. Those reviews are either blogged or written on websites like Amazon
(http://www.amazon.com). Therefore, a single book may be reviewed by a variety of reviewers.
Each reviewer may have a different perspective with respect to the features they consider for
reviewing. In addition, they may use different wording in different contexts to describe features of
the book. These features may receive negative or positive comments, and sometimes the same feature
receives a negative orientation from one reviewer and a positive orientation from another.
Therefore, readers who want to see the review before trying out the book have a hard time
identifying the real value and validity of the book due to the conflicting nature of the reviews.
Therefore, the domain of opinion mining on book reviews needs immediate attention to sort out the
issues mentioned above.
1.5 Motivation
Research on aspect level opinion mining has been done extensively in the domain of consumer
product reviews [3][4][6] and movie reviews [49][1][28][29]. However, there hasn’t been any
research done on the domain of book reviews.
On the other hand, research by Zhou and Chaovalit [28] identified ontology as the best-fitting
solution for conceptualizing domain-specific information in a more structured way for opinion
mining. As further evidence, Zhao and Li [1] and Isidro and Rafael [48] used an ontology model as
the domain model and devised new approaches to perform sentiment analysis and opinion mining
effectively. All of this ontology-based research yielded positive results because an ontology
describes the semantics of a domain in a well-structured format that is both human-readable and
machine-processable.
The contents of the book reviews are quite diverse as some of the contents are more relevant in
discovering aspects and others are neutral contents that are not relevant for the opinion mining and
sentiment analysis task. In addition, book reviews may have several aspects and each aspect can be
expressed in different ways by different users.
Therefore, the recent emergence of ontologies for conceptualizing domain-specific information
makes it possible to address the subjective aspect-identification problem in the book review domain
by employing an ontology structure. The ontology structure enables book reviews to be interpreted at
finer levels of aspect granularity, with shared meaning.
By means of ontology, during the feature extraction process, specific aspects of a domain and the
relationship between the concepts of that domain can be efficiently identified. This improves,
enhances and refines the process of sentiment analysis. In addition, new domain concepts (aspects)
and relationships can be added without needing to change semantic rules. In other words, ontology
provides a common vocabulary for a domain to enable ontology based parsing of the corpus to
identify aspects in reviews.
Therefore, developing ontology models for books and their features, and using those models together
with suitable, already-researched techniques for sentiment analysis and opinion mining, can produce
the desired results in determining the polarity of the aspects mentioned in book reviews.
1.6 Objectives
The main objective of this research is to develop a methodology to effectively perform opinion
mining on book reviews using ontology. This objective can be categorised into the following sub-
objectives.
● Build ontology models for books and book features.
● Discover aspects in unstructured review text through ontology-based parsing.
● Calculate the polarities of the individual aspects within the structured review text.
● Calculate the global polarity to determine the overall value of the opinion.
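The sub-objectives above can be sketched as a minimal end-to-end pipeline. The function names, the toy lexicon and the naive token matching below are illustrative assumptions only; the actual components are developed in Chapter 3.

```python
# Minimal sketch of the pipeline implied by the sub-objectives:
# ontology-based aspect discovery, per-aspect polarity, global aggregation.

def extract_aspects(review_text, ontology_terms):
    """Discover aspects by matching ontology concept terms in the text."""
    tokens = review_text.lower().split()
    return [t.strip(".,") for t in tokens if t.strip(".,") in ontology_terms]

def aspect_polarity(aspect, review_text, lexicon):
    """Crude polarity: sum lexicon scores of words near the aspect."""
    tokens = [t.strip(".,").lower() for t in review_text.split()]
    if aspect not in tokens:
        return 0.0
    i = tokens.index(aspect)
    window = tokens[max(0, i - 3): i + 4]  # small context window
    return sum(lexicon.get(w, 0.0) for w in window)

def global_polarity(review_text, ontology_terms, lexicon):
    """Aggregate individual aspect polarities into one review-level score."""
    aspects = extract_aspects(review_text, ontology_terms)
    if not aspects:
        return 0.0
    return sum(aspect_polarity(a, review_text, lexicon)
               for a in aspects) / len(aspects)

terms = {"plot", "characters"}
lex = {"gripping": 1.0, "flat": -1.0}
review = "The plot was gripping but the characters felt flat."
print(global_polarity(review, terms, lex))  # 0.5
```

Even this toy version shows the key property of the approach: the ontology terms drive what counts as an aspect, and polarity is computed per aspect before being aggregated.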
1.7 Contributions
The main contributions of this thesis are as follows:
● This thesis presents a methodology to perform opinion mining on book reviews by means of
ontology. As described in chapter 3, this methodology along with the ontology model
processes book reviews and outputs sentiment analysis and opinion mining scores of those
review texts.
● This thesis presents new ontology models for both books and features (aspects). The concepts
of these ontology models are defined after thoroughly analysing the contents of the primary
data set of book reviews. Afterwards, ontology models are annotated with synonyms of
concepts.
● This thesis presents some of the important lexical rules used to perform lexical analysis on top of
the pre-processed review text. As described in chapter 3, these rules use the lexical relationships
between phrases in a sentence to determine concept nouns/aspects and the corresponding
adjective/gerund that modifies each aspect.
● This thesis presents the test results for the methodology presented in chapter 3. As described in
chapter 4, two data sets, a primary and a secondary set, were prepared to run different types of
tests covering aspect detection, polarity calculation and aspect sentiment detection. The results
obtained from these tests are encouraging.
1.8 Outline of thesis
This section provides the overview outline of each chapter’s content.
1.8.1 Chapter 02 - Literature Survey
In order to derive a methodology for this research, all areas related to the domain have been
analysed. Chapter 02 presents those findings: it first covers sentiment analysis and opinion mining,
then the area of ontology, the use of ontology in opinion mining, and how to develop an efficient
ontology domain model for books. Finally, it examines unstructured data processing and how
unstructured review texts can be converted into a structured, machine-processable format. The
chapter closes with a discussion of the importance of this research topic.
1.8.2 Chapter 03 - Methodology
This chapter comprehensively explains the methodology adopted to achieve our aim and objectives.
At a high level, this research builds ontology models for both books and aspects, and uses those
models to perform aspect-level opinion mining on unstructured book review texts.
1.8.3 Chapter 04 - Evaluation
This chapter provides the evaluation results of our methodology for the opinion mining of book
reviews using ontology. In this chapter, we provide a detailed explanation of the results obtained for
aspect identification, and sentiment analysis.
1.8.4 Chapter 05 - Conclusion and Future Work
This chapter concludes the thesis with future work.
Chapter 2
Literature Survey
This chapter describes the literature survey carried out to derive the methodology for the problem
addressed in this research, giving a comprehensive overview of background work drawn from
existing web resources and research papers. It first focuses on opinion mining and the basic
principles of sentiment classification. It then looks into ontology principles and ontology
development, and describes the integration of ontology with opinion mining. Finally, it gives a
detailed explanation of polarity identification and sentiment analysis.
2.1 Opinion Mining
Opinion mining has been discussed extensively in the research literature. It is a mining strategy used
to detect patterns among opinions. The entire opinion mining concept revolves around an entity: a
concrete or abstract object that can be represented as a hierarchy of components, sub-components
and so on. Components are represented as nodes, each with a set of attributes, and an opinion can be
expressed about any node or about an attribute of a node [23].
Therefore, an opinion can be described as a quintuple [11][13][15][16]:

Opinion = (e_j, a_jk, so_ijkl, h_i, t_l)

where
e_j - the target entity
a_jk - the k-th aspect of entity e_j
h_i - the opinion holder
t_l - the time at which the opinion is expressed
so_ijkl - the sentiment orientation of opinion holder h_i on aspect a_jk of entity e_j at time t_l
The five components of the quintuple are essential and must correspond to one another; without any
one of them, the opinion may be of limited use. The goal of opinion mining is therefore to discover
all the quintuples in a collection of review documents. In simpler terms, the unstructured text of each
review document is converted into structured quintuples [14][26].
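The quintuple maps directly onto a small record type. The following Python sketch mirrors the five components defined above; the field values are illustrative only.

```python
from dataclasses import dataclass

# The opinion quintuple (e_j, a_jk, so_ijkl, h_i, t_l) as a plain record.
@dataclass
class Opinion:
    entity: str       # e_j  : target entity
    aspect: str       # a_jk : k-th aspect of entity e_j
    orientation: str  # so_ijkl : sentiment orientation
    holder: str       # h_i  : opinion holder
    time: str         # t_l  : time the opinion was expressed

# Illustrative instance (values are assumptions, not real review data).
op = Opinion(entity="Water for Elephants", aspect="plot",
             orientation="positive", holder="reviewer_42", time="2017-03-01")
print(op.aspect, op.orientation)  # plot positive
```

Extracting structured records of this shape from free text is exactly the conversion step described in the paragraph above.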
The converted, structured quintuples are then processed at various levels of sentiment analysis [26]:
● Document level sentiment analysis
● Sentence level sentiment analysis
● Aspect level sentiment analysis
2.1.1 Document level sentiment analysis
This is the simplest form of sentiment analysis, done under the assumption that there is only one
main subject described in the document. Many approaches use a simple bag-of-words or bag-of-
phrases model and may utilize TF-IDF (Term Frequency - Inverse Document Frequency) and POS
(Part of Speech) information of terms. Turney [42] introduced a method to classify reviews using the
average semantic orientation of the adjectival or adverbial phrases in the review. This work also
introduced a method to measure the semantic orientation of a phrase using the Point-wise Mutual
Information (PMI) [43] between words, eliminating the need for a predefined sentiment lexicon.
PMI is a measure of correlation (co-occurrence) between two words: a phrase's PMI with the words
"excellent" and "poor" is used to determine its semantic orientation. The probabilities are estimated
with an information retrieval (IR) method, by counting the number of hits for phrase queries that use
a NEAR operator with the words "poor" and "excellent".
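Turney's measure can be followed with a small numeric sketch. Here hits() is a stub returning invented counts (a real implementation would issue search-engine queries with a NEAR operator), and the phrase-probability terms in the two PMI values cancel into a ratio of hit counts.

```python
import math

# Turney-style semantic orientation via PMI, with invented hit counts.
HITS = {
    ("romantic ambience", "excellent"): 80,
    ("romantic ambience", "poor"): 10,
    "excellent": 1_000_000,
    "poor": 1_500_000,
}

def hits(*key):
    """Stub for search-engine hit counts (single word or phrase-NEAR-word)."""
    return HITS[key if len(key) > 1 else key[0]]

def semantic_orientation(phrase):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor').
    The p(phrase) terms cancel, leaving a ratio of hit counts."""
    return math.log2(
        (hits(phrase, "excellent") * hits("poor")) /
        (hits(phrase, "poor") * hits("excellent"))
    )

print(round(semantic_orientation("romantic ambience"), 2))  # 3.58
```

A positive score means the phrase co-occurs with "excellent" more strongly than with "poor", so the phrase leans positive; the review's classification is the average of these scores over its phrases.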
Document level sentiment analysis involves the sub-tasks shown in Table 2.1.

Opinion Mining Task             Description
Subjectivity Classification     Determines whether or not a given document expresses an opinion
Sentiment Classification        Determines whether the sentiment polarity is positive or negative
Opinion Helpfulness Prediction  Estimates the helpfulness of a review
Opinion Spam Detection          Identifies whether or not a review is spam

Table 2.1: Document level sentiment analysis [26]
2.1.2 Sentence level sentiment analysis
Here, analysis is done under the assumption that each sentence or phrase appraises a single (though
possibly different) entity. A distinction is made between the subjectivity and objectivity of opinions;
handling objective statements has not been researched much. Most approaches [38][39][42][43]
classify sentences by subjectivity and then classify subjective sentences or clauses as positive or
negative.
Liu et al. [26] argue that a one-technique-fits-all solution is unlikely and advocate treating specific
types of sentences differently by exploiting their unique characteristics. Sentence level sentiment
analysis performs opinion summarization by extracting key sentences based on either concepts or
aspects.
2.1.3 Aspect level sentiment analysis
Sentiments are analyzed at the finest grain, where the author appraises various aspects (attributes or
features) of an entity. Several approaches are used to identify the aspects of entities, such as frequent
item sets [20], supervised learning (e.g. Conditional Random Fields) [13][14] and co-occurrence
with sentiment expressions [15]. In the sentiment analysis phase, sentiments are linked to the
referenced aspects in addition to determining their orientation. Common approaches at this level are
rule-based methods [20], supervised learning and clustering. Agarwal [20] uses a rule-based
approach in which a window of five words around the target aspect is analyzed. Mukherjee and
Bhattacharyya [17] use a clustering-based method in which sentences are transformed into a parse
tree and the probable aspects are chosen as cluster heads; sentiment words are clustered according to
their shortest path length to the cluster heads in the parse tree, and the cluster members are then
treated as the sentiment expressions related to each cluster-head aspect.
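The five-word-window rule can be sketched as follows. The sentiment lexicon and the example sentence below are invented for illustration; Agarwal's actual rules are richer than a plain score sum.

```python
# Sketch of the rule-based window approach: sentiment words within
# `width` tokens of the target aspect are attributed to that aspect.
LEXICON = {"excellent": 1, "boring": -1, "weak": -1, "vivid": 1}

def window_sentiment(tokens, aspect, width=5):
    """Sum lexicon scores of words within `width` tokens of the aspect."""
    if aspect not in tokens:
        return 0
    i = tokens.index(aspect)
    window = tokens[max(0, i - width): i + width + 1]
    return sum(LEXICON.get(w, 0) for w in window if w != aspect)

sent = "the plot was vivid but the ending felt weak and boring".split()
print(window_sentiment(sent, "plot"))    # 1  (only "vivid" is in range)
print(window_sentiment(sent, "ending"))  # -1 ("vivid" +1, "weak" -1, "boring" -1)
```

The window keeps distant sentiment words from leaking onto the wrong aspect, which is exactly why evaluating the ideal window range matters (as done in Chapter 4 for concept-adjective tupling).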
2.1.4 Domains and Sentiment Analysis
Myriads of opinion mining and sentiment analysis techniques are being implemented and tested on
various domains. However, predominantly, most of the sentiment analysis techniques are tested on
the reviews of movies and consumer electronics products [1][28][49][29][3][4][6].
In addition, there are domains like restaurant reviews [52][54], hotel reviews [53] and transportation
reviews [45] being researched recently. With opinion mining and sentiment analysis reaching finer
levels of granularity, more and more domains are being researched, but we could not find any
research on sentiment analysis and opinion mining in the book review domain.
2.1.5 Sentiment Lexicon
Sentiment lexicons are crucial resources for representing linguistic knowledge for opinion mining.
These lexicons contain entries of words with sentiment orientation.
The following public domain sentiment lexicons are available.
● SentiWordNet 3.0 [22]
● Sentiment Lexicon [18]
● Emotion lexicon [20]
● Manual or Automatic Acquisition [21]
2.1.5.1 SentiWordNet 3.0
SentiWordNet is based on quantitative analysis of the glosses associated with synsets. The synsets
go through a semi-supervised classification that produces a vectorial term representation for synset
classification. This representation is used to derive three scores, produced by a committee of eight
ternary classifiers that have similar accuracy but yield different classifications [22].
As described above, SentiWordNet is a lexical resource in which each synset of WordNet is
associated with three numerical scores: positivity, negativity and objectivity. The underlying
assumption in switching from terms to synsets is that different senses of the same term may have
different opinion-related properties. Each score ranges from 0.0 to 1.0, and the three scores of a
synset sum to 1.0. This graded evaluation of the opinion-related properties of terms helps avoid
missing their subtly subjective character [22].
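The scoring scheme can be illustrated with a small sketch in Python (the synset names and score values below are invented for illustration; they are not taken from the actual SentiWordNet database):

```python
# Toy sketch of SentiWordNet-style scoring (synset names and values are
# invented for illustration): each synset carries positive, negative and
# objective scores that always sum to 1.0.
hypothetical_synsets = {
    "interesting.a.01": (0.625, 0.0, 0.375),   # (pos, neg, obj)
    "concern.v.02":     (0.25, 0.125, 0.625),
}

def polarity(scores):
    """Reduce a (pos, neg, obj) triple to a single orientation label."""
    pos, neg, obj = scores
    assert abs(pos + neg + obj - 1.0) < 1e-9   # the three scores sum to 1.0
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(polarity(hypothetical_synsets["interesting.a.01"]))  # positive
```

A sense whose positive and negative scores cancel out is treated as neutral, mirroring the graded evaluation described above.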
2.1.5.2 Sentiment lexicon
This lexicon contains, for every synset word in the dictionary, a probability distribution over
positive, negative and objective orientations [21]. For example, it represents the word
“interesting” with different polarity values under its different senses.
2.1.5.3 Emotion lexicon
This lexicon records whether a word is associated with any emotion and the orientation of
that emotion, for example negative or positive [23].
2.1.5.4 Manual or Automatic Acquisition
A general-purpose or domain-specific lexicon can be formed through manual coding, a
dictionary based approach [21][22][23], a corpus based approach [24] or a hybrid approach.
Both dictionary and corpus based approaches use a known seed set to expand the vocabulary.
Corpus based approaches are widely used in building domain specific sentiment lexicons [24].
An approach called double propagation, which simultaneously acquires a domain-specific
sentiment lexicon and a set of aspects, was introduced by Qiu et al [25]. It uses known aspects
to find sentiment expressions and vice versa.
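The double propagation idea can be sketched as follows (a toy illustration, not Qiu et al.'s actual algorithm: real systems extract sentiment–aspect pairs from dependency parses, whereas here the pairs are supplied directly):

```python
# Minimal sketch of double propagation: known sentiment words discover new
# aspects, and known aspects discover new sentiment words, iterating to a
# fixed point. The pairs below stand in for parsed modifier relations.
pairs = [
    ("gripping", "plot"),       # "a gripping plot"
    ("gripping", "narration"),  # "gripping narration"
    ("dull", "narration"),      # "dull narration"
]

def double_propagation(pairs, seed_sentiments):
    sentiments, aspects = set(seed_sentiments), set()
    changed = True
    while changed:
        changed = False
        for s, a in pairs:
            if s in sentiments and a not in aspects:
                aspects.add(a)          # sentiment word -> new aspect
                changed = True
            if a in aspects and s not in sentiments:
                sentiments.add(s)       # aspect -> new sentiment word
                changed = True
    return sentiments, aspects

sentiments, aspects = double_propagation(pairs, {"gripping"})
# "gripping" finds plot and narration; narration in turn yields "dull"
```

Starting from the single seed "gripping", the sketch recovers two aspects and one additional sentiment word, showing how the lexicon and the aspect set grow together.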
2.1.6 Tools for Sentiment Analysis
There are a number of text analysing tools that provide the facility of sentiment analysis. Most of
them use a machine learning based approach.
2.1.6.1 Sentiment Analysis with Stanford Library
The Stanford sentiment analysis library is a free, open source module for sentiment analysis.
By default it is trained on the movie review domain, but it can be retrained for any other domain.
It uses a deep learning model that builds up a representation of a whole sentence based on its
structure, trained on a dataset called the Stanford Sentiment Treebank.
2.1.6.2 Sentiment Analysis with Rapid Miner
Rapid Miner is a free open source analytics tool and an excellent prototyping platform due to its
flexibility and robustness. It has a comprehensive set of algorithms that allows you to quickly swap
out and try different models. Rapid Miner has R and Groovy plugins, and as it is based on Java, it
can run on any platform.
For training, two sets of reviews should be given: one containing positive reviews, and another
containing negative reviews. Based on the training set, it counts the frequencies of the words
present. Based on these frequencies, Rapid Miner analyses new reviews and gives the probability
of a review being positive or negative. This is a relatively simple approach to sentiment analysis.
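The frequency-counting scheme described above can be sketched in a few lines of Python (a simplified illustration, not Rapid Miner's actual implementation; real classifiers add smoothing and class priors):

```python
# Sketch of the word-frequency approach: count word frequencies in positive
# vs. negative training reviews, then score a new review by which class its
# words appear in more often. The training sentences are invented samples.
from collections import Counter

def train(reviews):
    counts = Counter()
    for text in reviews:
        counts.update(text.lower().split())
    return counts

pos_counts = train(["a wonderful moving story", "wonderful characters"])
neg_counts = train(["a dull predictable story", "dull slow plot"])

def classify(text):
    pos = sum(pos_counts[w] for w in text.lower().split())
    neg = sum(neg_counts[w] for w in text.lower().split())
    return "positive" if pos >= neg else "negative"

print(classify("wonderful plot"))  # positive: "wonderful" outweighs "plot"
```

Words unseen in training contribute nothing to either score, which is one reason real implementations apply smoothing.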
2.1.6.3 Sentiment Analysis with LingPipe
One way to do sentiment analysis with LingPipe is to use its language classification framework
for two classification tasks: separating subjective from objective sentences, and separating
positive from negative product reviews. A hierarchical classifier can then be built by composing
these models: the subjectivity classifier extracts the subjective sentences from a review, which are
then passed to the polarity classifier. Hierarchical models are quite common in the classification,
statistics and machine learning literature.
2.2 Ontology
An ontology is a formal, explicit description of concepts in a domain of discourse. It defines a
common vocabulary for researchers who need to share information in a domain, and includes
machine-interpretable definitions of basic concepts in the domain and relations among them
[1][7][23][24][25]. Ontologies use components such as classes, individuals, attributes and
properties to represent domain specific or general computational information.
In recent years, ontologies have become a de facto standard for developing and maintaining
domain models. Using an ontology to guide information extraction from a domain corpus and to
present the results has therefore become common practice [27]. In addition, while extracting
information from a domain corpus, the ontology itself can be further enhanced and populated with
new instances.
2.2.1 Advantages of using an ontology
With the emergence of the semantic web, the use of ontologies has gained a lot of momentum and
interest. It can be seen that the use of an ontology for both the purpose of representing domain
knowledge and as a guide to the information extraction process has many advantages. The
importance of ontology has been discussed by Menzies [30].
2.2.1.1 Share common understanding of the structure of information among people
Ontology models provide an expressive, structured knowledge base that allows relationships to be
defined both within and across domains. Ontologies can be built easily using templates, which
allows multiple applications to share and use the aggregated information of an ontology for their
own purposes.
2.2.1.2 Enabling reuse of domain knowledge
Reusing ontologies across multiple applications enhances interoperability among them and saves
a lot of time in defining ontology models. There are standard ontology models, such as the movie
model [33], that have been used as-is or slightly modified in different research areas; using parts
of these ontologies in new ontologies is acceptable.
2.2.1.3 Separating the domain knowledge from the operational knowledge
A common conceptualization allows various applications to share domain knowledge. Therefore, the
applications should be focused only on operational knowledge. Also, different ontologies
representing the same concepts can be mapped to a common terminology between different
applications.
2.2.1.4 Facilitating organization of knowledge modeling
Ontologies allow knowledge to be defined in a hierarchical is-a relationship, and also support
part-whole (has-a) relationships. In addition, they give the flexibility of defining ontology
constructs as disjoint classes or unions. These abilities help a great deal in structuring an
organizational knowledge model.
2.2.1.5 Easing the communication across entities and machines
Since ontologies are both machine and human understandable, ontologies can be used for
communication purposes to ensure interoperability between computer programs (as well as humans)
and to share a common understanding of the structure of information.
It can be used to disambiguate and uniquely identify the meaning of domain concepts. It also
facilitates knowledge transfer excluding unwanted interpretations through the use of formal
semantics.
2.2.1.6 Facilitating computational inference
Through the axioms it contains, an ontology structure enables programs to run their own
inferencing mechanisms while searching and querying the domain model and its instances. In
addition, ontologies enable programming-language-independent serialization along with model
inconsistency detection.
2.2.1.7 Knowledge querying and browsing
An OWL ontology exposes a set of metadata that enables querying and browsing; this metadata
can be queried using the SPARQL query language.
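The pattern-matching idea behind SPARQL can be illustrated with a toy triple store in Python (an illustrative sketch only; a real system would use a SPARQL engine such as the one in rdflib):

```python
# Toy triple store illustrating how a SPARQL-style basic graph pattern is
# matched over ontology metadata. The triples are invented examples in the
# spirit of the book ontology discussed later.
triples = {
    ("Book", "hasWriting", "Writing"),
    ("Book", "hasPlot", "Plot"),
    ("Writer", "isSynonymOf", "Author"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Analogous to: SELECT ?p ?o WHERE { :Book ?p ?o }
print(sorted(match(s="Book")))
```

Each `None` argument plays the role of a SPARQL variable, so querying reduces to filtering the triple set against a partially bound pattern.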
2.2.2 Semantic Web and OWL
The Semantic Web is a set of standards that promote common data formats on the World Wide
Web. Its main intention is to transform the entire World Wide Web into a structured, well
understood and easily processable web of data [31]. OWL is a semantic markup language built on
Semantic Web standards for publishing and sharing ontologies in an easily machine-processable
format. OWL uses the ability of XML to define custom tag schemes and the flexible approach of
RDF to representing data. RDF provides semantics for the syntax and defines hierarchies and
generalization; OWL adds vocabulary for describing properties and classes [32].
2.2.2.1 OWL constructs
Like any ontology language, OWL has three primary constructs:
● Classes:- This is similar to the concept in ontology. It represents a set of objects in the
domain.
● Individuals:- This is similar to the instances in ontology. It represents individuals of each
class of that domain.
● Properties:- Binary relationship between individuals.
Figure 2-1 illustrates these constructs.
Figure 2-1- Classes, Individuals, and Properties.
2.2.3 Wordnet
WordNet is a large lexical database of English that encompasses nouns, verbs, adjectives and
adverbs [35][36]. WordNet offers two kinds of relationships: lexical and conceptual.
Lexical relations help identify synonymy and antonymy; this part of the lexical relations is used
to enrich concepts with synonyms via concept labels in OWL. Conceptual relations, on the other
hand, can be used to discover hypernym-hyponym and part-whole relationships.
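These two kinds of relations can be illustrated with a toy, dictionary-backed lookup (the entries are invented for illustration; in practice WordNet would be accessed through a corpus reader such as NLTK's):

```python
# Toy lexical/conceptual relation lookup in the spirit of WordNet; the
# entries are illustrative, not taken from the real database.
lexicon = {
    "author":  {"synonyms": {"writer"}, "hypernyms": {"person"}},
    "novel":   {"synonyms": {"book"},   "hypernyms": {"fiction"}},
    "chapter": {"synonyms": set(),      "part_of": {"book"}},
}

def synonyms(word):
    """Lexical relation: used here to enrich ontology concept labels."""
    return lexicon.get(word, {}).get("synonyms", set())

def hypernyms(word):
    """Conceptual relation: more general concepts for hierarchy building."""
    return lexicon.get(word, {}).get("hypernyms", set())

print(synonyms("author"))   # {'writer'}
print(hypernyms("novel"))   # {'fiction'}
```

In this research, the `synonyms` lookup corresponds to attaching synonym labels to ontology concepts, while `hypernyms` corresponds to placing concepts in an is-a hierarchy.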
2.2.4 Ontology Development
Based on the domain dependence, ontology can be divided into four types [1][14][24][25].
Those are,
● Generic Ontology
● Domain Ontology
● Task Ontology
● Application Ontology
A domain ontology describes the concepts of a specific area, the attributes of those concepts, the
relationships between concepts and the constraints among relationships. The goal of constructing
a domain ontology is to define the common terminologies of the area and the relationships among
them.
Ontology construction can be performed using the following methodologies [29]:
● Top-Down Approach
● Bottom-Up Approach
● Formal Concept Analysis
● Hybrid Model
2.3.1.1 Top-Down Approach
The top-down approach starts with high-level ontological concepts and gradually expands them
into a fully-fledged, complete ontological structure [28]. First, the important metadata of the
domain is identified at the highest level; for the movie domain, for example, title, cast, crew,
production and miscellaneous concepts are identified as the initial ontological concepts. In the
next phase, content analysis of the domain is performed against this initial structure to identify
child concepts; for example, “protagonist” is identified as a child of the parent concept cast.
2.3.1.2 Bottom-Up Approach
Bottom-up approach is the inverse of the top-down approach, where initially, entire domain contents
are thoroughly analysed and fine-grained aspects are detected first. These fine grained aspects are
the concepts.
During the next phase, these concepts are grouped together under parent concepts to create an
ontological hierarchy. For example, concepts like protagonist, supporting cast, comedian are
subclasses of the cast concept. Likewise, all of the appropriate concepts are brought under suitable
parent concepts to create an ontological hierarchy [28][51].
2.3.1.3 Formal Concept Analysis (FCA)
This approach has been described by Shein [38]. Unlike the top-down and bottom-up approaches,
FCA considers two elements: formal objects and formal attributes. In this method, objects are
chosen as formal concepts and features as formal attributes. FCA allows forming semantic
structures that are formal abstractions of concepts in human thought, and makes it possible to
identify conceptual structures among data sets.
The main characteristics of FCA are,
● Concepts are described by properties
● Properties determine the hierarchy of the concepts
● When the properties of different concepts are the same, then the concepts are the same
Contexts
FCA defines a context as a triple (O, A, R), where
● O is a finite set of objects
● A is a finite set of attributes
● R is a binary relation between O and A
FCA uses the following steps to develop an ontology:
● Start with an empty set of concepts and properties
● Add concepts and properties
● Modify the ontology by adding/removing concepts to reflect changes
● Repeat these steps until the ontology construction is complete
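The derivation operators at the heart of FCA can be sketched over a small example context (the objects and attributes below are invented for illustration):

```python
# Sketch of FCA's two derivation operators over a context (O, A, R): the
# attributes shared by a set of objects, and the objects sharing a set of
# attributes. A pair closed under both maps is a formal concept.
O = {"plot", "protagonist", "comedian"}
A = {"is_aspect", "is_cast"}
R = {("plot", "is_aspect"), ("protagonist", "is_aspect"),
     ("comedian", "is_aspect"), ("protagonist", "is_cast"),
     ("comedian", "is_cast")}

def common_attributes(objects):
    return {a for a in A if all((o, a) in R for o in objects)}

def common_objects(attributes):
    return {o for o in O if all((o, a) in R for a in attributes)}

# ({"protagonist", "comedian"}, {"is_aspect", "is_cast"}) is a formal
# concept: applying one operator and then the other returns the same pair.
ext = common_objects({"is_cast"})
print(ext, common_attributes(ext))
```

The hierarchy of formal concepts produced by these operators is what FCA turns into the concept hierarchy of the ontology.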
2.3.1.4 Hybrid Model
This approach has been described by Zhou and Chaovalit [28] and combines the bottom-up and
top-down approaches to create a movie ontology. The top-down approach starts with high-level
ontological concepts, which gradually expand into a fully-fledged ontology; the bottom-up
approach starts with textual documents and extracts ontological knowledge from them. The hybrid
approach, which has become popular in recent years, simultaneously derives knowledge from the
top-level ontology and extracts low-level ontologies from documents, then creates mappings
between the different levels of ontologies.
2.3.3 Ontology and Sentiment Analysis
Sentiment analysis techniques found it hard to perform polarity mining based on aspects and
sub-aspects, as each sentence may describe different kinds of aspects [28]. In other words, those
techniques struggled to conceptualize domain-specific information in a structured pattern that
represents it at a finer level of granularity.
Zhou and Chaovalit [28] identified that ontology is the best fitting solution for conceptualizing the
domain specific information in a much more structured way for sentiment analysis. They developed
a movie ontology model and evaluated the model against multiple opinion mining techniques at
various granularity levels. Notably they evaluated the model against techniques like support vector
machines, Naive Bayes classifier and decision tree at the granularity levels of document, sentence
and phrase.
With the introduction of the ontology as the domain-specific knowledge model for opinion
mining, many other researchers carried out ontology-based sentiment analysis in the product and
movie review domains. Most notable is the work of Zhao and Li [1], who built two ontology
models to structure domain-specific information: a movie model (holding the meta information of
the movie) and a feature model (holding the aspect information of the movie model). Based on
these models, they introduced a technique to compute the polarities of opinions by traversing the
nodes along the hierarchy relationship. Later, Isidro and Rafael [48] introduced a new approach
for ontology-based sentiment analysis by applying vector analysis to the opinion mining
calculations. Freitas and Vieira [29] followed the same approach as Isidro and Rafael [48], but
performed opinion mining on Portuguese movie reviews. In the product review domain, Wang,
Nie and Liu [15] introduced a new hierarchical fuzzy domain sentiment ontology that defines a
space of product features and corresponding opinions; this enables product classification, based
on which scores are assigned to common features during sentiment analysis.
2.3.4 Related Existing Ontology
2.3.4.1 Movie Ontology
The movie ontology describes movie-related types and properties for the Semantic Web. It is
made up of 78 concepts, 30 object properties and 4 data properties [33].
It has superclasses such as Genre, Presentation, Territory, Award and Certification, along with
multiple subclasses. These generic classes represent actual movies of a respective type as well as
wider sets of movies related to a certain category.
As this ontology is designed for various applications on e-commerce websites, it can represent
information beyond the movie domain itself.
This ontology supports three types of movie properties:
● Quantitative properties – Used for movie features with numeric values
● Qualitative properties – Used for movie features with predefined value instances
● Data type properties - Used only for features with the data types string, date, time, datetime
or boolean.
2.3.4.2 Hotel Ontology
The hotel ontology describes the concepts, types and properties of the accommodation domain.
Its current model has 282 concepts, 8 object properties and 31 data properties [50].
It has superclasses such as HotelCategory, Facility, GuestType, HotelChain, Room, Service,
Staff, Location, Meal and Price, along with multiple subclasses. Since the hotel domain is
comprehensive, complex and changes frequently, this ontology has been widely used by
hospitality applications: it allows the domain model to change without affecting the applications.
2.4 Data Pre-processing
Data pre-processing transforms unstructured text into a structured form; it is an extensive topic
that has attracted a huge amount of research over the decades. It includes lexical analysis, Named
Entity Recognition (NER) and co-reference resolution.
2.4.1 Lexical Analysis
Lexical analysis is the phase of identifying the sentences, splitting the sentences, performing Part of
Speech (POS) tagging on them, identifying the dependencies and extracting noun and verb chunks as
well [39].
Lexical analysis comprises the following steps:
● Sentence splitting and tokenization
● Part Of Speech tagging
● Phrase chunker
● Dependency analyser
The sentence analyser splits sentences and the tokenizer breaks each sentence into tokens.
Part of Speech (POS) taggers analyse sentences and assign each word its grammatical category such
as noun, verb, adjective, adverb, article, conjunct and pronoun. POS taggers can analyse the
grammar in sentences to a very detailed level, giving information about the tense of verbs and
active/passiveness.
A phrase chunker segments a sentence into sub-constituents such as noun phrases, verb phrases
and prepositional phrases; typically, a context-free grammar is used to identify these constituents.
This comes in handy for named entity extraction, as named entities are typically noun phrases.
The dependency analyser identifies words in a sentence that form arguments of other words in it.
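The pipeline above can be illustrated with a toy implementation (the tag lexicon and the chunking rule are drastically simplified; real systems use trained taggers and parsers):

```python
# Toy lexical analysis pipeline: tokenize, tag with a stub lexicon, then
# chunk maximal DT/JJ/NN runs containing at least one noun as noun phrases.
import re

TAGS = {"the": "DT", "story": "NN", "is": "VB", "unique": "JJ",
        "a": "DT", "good": "JJ", "plot": "NN"}

def tokenize(sentence):
    return re.findall(r"[A-Za-z]+", sentence.lower())

def pos_tag(tokens):
    return [(t, TAGS.get(t, "NN")) for t in tokens]  # default to noun

def noun_phrases(tagged):
    phrases, current = [], []
    for word, tag in tagged:
        if tag in ("DT", "JJ", "NN"):
            current.append((word, tag))
        else:
            if any(t == "NN" for _, t in current):
                phrases.append(" ".join(w for w, _ in current))
            current = []
    if any(t == "NN" for _, t in current):
        phrases.append(" ".join(w for w, _ in current))
    return phrases

print(noun_phrases(pos_tag(tokenize("The story is unique"))))  # ['the story']
```

The extracted noun phrases are exactly the candidates that the feature identification stage later matches against ontology concepts.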
2.4.2 Named Entity Recognition (NER)
Named Entity Recognition identifies named entities such as ‘Person’, ‘Location’ and
‘Organization’ using domain knowledge about the information to be extracted; it can identify
such entities without specific prior information about each entity. For example, in the sentence
“Cold weather has been reported in Colombo”, NER identifies Colombo as a place.
In addition, named entity extraction also includes co-reference resolution: detecting expressions
that refer to the same entity within a sentence or anywhere in the document. For example, in the
sentence “The story is unique, but it is a bit slow.”, “it” refers to the story, and co-reference
resolution is able to identify this [42].
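A minimal gazetteer-based sketch of NER (an illustration only; real recognizers use statistical models rather than fixed word lists):

```python
# Toy gazetteer-based NER: look each token up in a fixed entity list.
# Real recognizers infer entity types from context with trained models.
GAZETTEER = {"colombo": "LOCATION", "amazon": "ORGANIZATION"}

def recognize(sentence):
    entities = []
    for raw in sentence.split():
        tok = raw.strip(".,")               # drop trailing punctuation
        label = GAZETTEER.get(tok.lower())
        if label:
            entities.append((tok, label))
    return entities

print(recognize("Cold weather has been reported in Colombo."))
# [('Colombo', 'LOCATION')]
```

The limitation is obvious: any entity missing from the list goes unrecognized, which is precisely why statistical NER is preferred in practice.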
2.5 Discussion
With the increase in online reviews across various domains, opinion mining has emerged as a
heavily researched field, with a particular surge in domains such as movie reviews and consumer
product reviews. Initially, reviews were mined using hard-coded tagged bag-of-aspects classifiers
and other means. However, with the emergence of the ontology as a structured, well-defined data
model, researchers started using ontologies as domain models to hold aspect-related information
while performing opinion mining. Since the ontology is a powerful, structured domain model,
researchers focused on how to leverage it in processing unstructured review texts, and as a result
produced several good techniques [49][29][1][28] for performing sentiment analysis on a review
corpus. In addition, the introduction of advanced lexical analysis functionality made sentiment
analysis considerably more accurate, with higher precision and recall.
Although a considerable amount of sentiment analysis work has been done on movie reviews and
consumer product reviews, the book review domain remains unexplored. With a large number of
books being released worldwide on a daily basis, performing opinion mining on book reviews
would clearly add value for readers.
Chapter 3
Methodology
The entire process involves the following tasks to perform opinion mining on book reviews [1]:
● Data Pre-processing
● Feature Identification
● Polarity Identification
● Sentiment Analysis
The complete workflow can be explained as depicted in Figure 3.1.
Figure 3-1-Overview of the implementation Architecture
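The four tasks can be sketched as a pipeline skeleton (every stage below is a stub with invented word lists and aspects, standing in for the components described in the following sections):

```python
# Skeleton of the four-stage workflow; each stage is a simplified stub.
def preprocess(review_text):
    # stands in for sentence splitting, POS tagging, NER, co-reference
    return [s.strip() for s in review_text.split(".") if s.strip()]

def identify_features(sentences):
    # stands in for matching nouns against ontology concepts
    known_aspects = {"plot", "character", "writing"}   # invented subset
    return [(s, [a for a in known_aspects if a in s.lower()])
            for s in sentences]

def identify_polarity(sentence):
    # stands in for a sentiment lexicon lookup (e.g. SentiWordNet)
    if any(w in sentence.lower() for w in ("good", "wonderful")):
        return "positive"
    if any(w in sentence.lower() for w in ("dull", "predictable")):
        return "negative"
    return "neutral"

def analyse(review_text):
    return [(aspects, identify_polarity(s))
            for s, aspects in identify_features(preprocess(review_text))]

print(analyse("Good plot. The ending is predictable."))
```

Each stub returns plausible intermediate results so the data flow between the four stages is visible end to end.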
As detailed in previous sections, the main objective of this project is to utilize the ontology
structure to conceptualize domain-specific information, enabling us to solve the subjective-based
feature identification problem in the book review domain. Since we have already obtained a set of
book reviews from Amazon, we do not need to focus on obtaining reviews from external sites.
3.1 Book Ontology
As already mentioned, based on the domain dependence, ontology can be divided into four types
[1][14][24][25]; Generic Ontology, Domain Ontology, Task Ontology and Application Ontology.
This research focuses on a specific domain, which is the book review domain. Therefore, for this
research, we constructed a domain dependent ontology.
3.1.1 Data Selection and Preparation
3.1.1.1 Book Reviews
Book reviews are the main source of information processed throughout this research. We
obtained book reviews from Amazon as large text files in which each review is represented as a
separate JSON message. The fields of a JSON message and a sample review message are given
below.
reviewerID - ID of the reader who reviewed it
asin - ASIN number of the book
reviewerName - Name of the reviewer
helpful - [Up votes, Down votes] up votes and down votes received by this review
reviewText - Review text
overall - Ratings given by the reviewer
summary - Summary of the review
unixReviewTime - Reviewed time in UNIX value
reviewTime - Review date, month and year
{"reviewerID": "AXZ6WA3GA2WRH", "asin": "0002007770", "reviewerName": "Amazon
Customer", "helpful": [1, 1], "reviewText": "I had heard a lot of positive comments on this book
for a while and thought I would see for myself what all the praise was about. Well this book
deserves 5 stars and more. It is really a book that will stay with you for a while. I know it will
be among my all time favorites, as well it should. The writer has done an excellent job fleshing
out the people and you almost feel as if you know them personally or have dealt with people
like time in real life. The depression was deep in the US when the story is introduced. Life can
hand out a lot of hard knocks, even to animals. I really never thought about that aspect of life
before. I am very happy I had the opportunity to read this book. I delayed reading it many
times because frankly the title put me off. Then I remembered to never judge a book by it's
title.", "overall": 5.0, "summary": "Wonderful story", "unixReviewTime": 1361836800,
"reviewTime": "02 26, 2013"}
Of all the given fields, this research considers only reviewText, since that field contains the
review text itself; all other fields describe meta information about the review. We used the book
reviews obtained from Amazon [47]. From the entire review corpus, we carefully selected 600
reviews, divided into a primary data set and a secondary data set:
● Primary set of book reviews
1. Reviews that are more positive - 75
2. Reviews that are more negative - 75
● Secondary set of book reviews
1. Reviews that are more positive - 225
2. Reviews that are more negative - 225
The details of this data set have been further explained in section 4.1.
Each review text may contain positive, negative or neutral opinions on various aspects.
For example, a sentence such as “Good character descriptions and scenes are beautifully
depicted” clearly expresses a positive opinion on the aspects character and scenes of the book.
A sentence with negative polarity expresses a negative opinion on an aspect, as in “The
storyline of the novel is quite predictable”, where the aspect story/storyline is considered
predictable. A neutral mention occurs when the reviewer makes a statement about an aspect
without evaluating the current work, as in “This author previously has written good books and
I love previous books as well”: this sentence praises the aspect author but says nothing about
the author’s current work, so it is considered neutral. Likewise, if the reviewer did not evaluate
an aspect using adjectives or gerunds, that mention is also counted as neutral.
However, when categorizing an entire document, we consider the overall opinions on its aspects:
if 50% or more of the aspects carry positive sentiment, we consider the review positive, and vice
versa for a negative review.
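This document-level rule can be sketched as follows (assuming, as we read the 50% rule, that a tie counts as positive):

```python
# The document-level rule described above: a review is positive when at
# least half of its aspect-level sentiments are positive.
def document_polarity(aspect_sentiments):
    if not aspect_sentiments:
        return "neutral"        # no evaluated aspects at all
    positives = sum(1 for s in aspect_sentiments if s == "positive")
    ratio = positives / len(aspect_sentiments)
    return "positive" if ratio >= 0.5 else "negative"

print(document_polarity(["positive", "positive", "negative"]))  # positive
```

A review with no evaluated aspects is treated as neutral here, which is our assumption; the thesis only defines the positive/negative cases explicitly.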
These 600 reviews were used for the experiments in this research. We decided on 600 reviews
based on previous research evaluating ontology-based sentiment analysis on movie reviews:
Zhao and Li [1] used 120 reviews (60 positive and 60 negative) to calculate polarity, Zhou and
Chaovalit [28] used 180 IMDB reviews, and Isidro and Rafael [49] used 100 movie reviews to
validate their approach. Within a huge dataset, choosing positive and negative reviews randomly
is sufficient to evaluate this methodology.
Within each of the categories into which we classified these book reviews, there are reviews of
different kinds, for example longer reviews, shorter reviews, medium-sized reviews, reviews with
a higher number of up votes and reviews with a lower number of up votes.
The reviews of the primary dataset were carefully analysed, reviewed and used for ontology
development.
3.1.2 Ontology Development
Since the ontology here serves the specific purpose of opinion mining, the ontology we build for
books does not need the full complexity that OWL allows. The aim is to find the opinion on a
feature of the book, or on attributes of a feature. Therefore, the book ontology concepts are
divided into two parts: book and feature.
To build the book domain ontology, we adopted the bottom-up approach in an iterative way. The
bottom-up approach starts with textual documents and extracts ontological concepts from the raw
text; this process is repeated until no new concepts are found, yielding a fully-fledged domain
ontology.
The goal of ontology construction is to extract concepts from the book reviews, starting with a
seed set. The following steps are followed for this task [1].
● Select relevant sentences containing initial/basic concepts
● Extract concepts from those sentences
3.1.2.1 Select relevant sentences with initial/basic conceptions
Before commencing this step, we manually labelled some concept seeds by going through the
book review documents. Once this manual labelling phase was done, we automated the concept
discovery phase in later stages.
After completing the initial manual labelling, we identified the concept seeds in the chosen
corpus of book reviews. We followed two procedures to extract and identify further concept
seeds:
1. Identifying sentences containing conjunction words.
2. Detecting sentences containing at least one concept seed.
First, we checked whether a sentence contained a conjunction word; if it did, we checked whether
it also contained a concept seed. For example, consider the sentence “This book has a good plot
and storyline”. Since storyline is a known concept and a conjunction joins nouns with the same
characteristics, applying the conjunction rule easily identifies the noun “plot” as another concept
for the feature domain model.
3.1.2.2 Extraction of concepts from sentences
Once the sentences with conjunctions were identified, before adding a candidate concept to the
ontology we first checked whether it already existed under a different synonym. If the new
concept did not exist in the book domain ontology, we added it to the feature domain model and
labelled it with its synonyms.
The above process is repeated until no new concepts/aspects are discovered.
As explained above, the ontology for the book model was initially generated by analysing the
review corpus of the primary data set. Before processing the secondary data set, we also traversed
it looking for new concepts/aspects.
For example, in the sentence “The writing and research of the book is first rate”, the
concepts/aspects writing and research are connected by the conjunction “and”. Suppose the
current ontology model knows writing but is not aware of research as a concept/aspect; since a
conjunction joins entities with the same characteristics, this step adds the concept research to the
ontology model.
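The conjunction-based expansion can be sketched as follows (a simplified illustration that matches the literal word “and” between single tokens; the real procedure operates on parsed sentences):

```python
# Sketch of the conjunction rule: when a sentence joins a known concept and
# an unknown noun with "and", the unknown noun is proposed as a new
# concept, iterating until no new concepts appear.
import re

def expand_concepts(sentences, known):
    known = set(known)
    found = True
    while found:
        found = False
        for s in sentences:
            for left, right in re.findall(r"(\w+) and (\w+)", s.lower()):
                for a, b in ((left, right), (right, left)):
                    if a in known and b not in known:
                        known.add(b)       # conjunct of a known concept
                        found = True
    return known

sentences = ["The writing and research of the book is first rate",
             "good research and citations throughout"]
print(sorted(expand_concepts(sentences, {"writing"})))
```

Note how the iteration propagates across sentences: the seed writing yields research from the first sentence, and research in turn yields citations from the second.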
This entire process can be explained as depicted in Figure 3.2 [2].
Figure 3-2-Ontology Development
3.1.3 Ontology Model
Since we were chiefly concerned with the concepts and properties of the book domain, we did
not need to build a complex ontology model; we adopted an approach similar to the model in
Zhao and Li’s work [1].
As ontology models, we built separate models for books and for the features of books.
3.1.3.2 Book Model
The ontology data model used here indicates the relationships among books. This model adopts
a forest structure to maintain those relationships [25]. In the forest structure, every node has a
parent node and child nodes, and each node carries several synonyms that represent the
terminologies of the area; these synonyms are attached to each concept as labels. For example,
the concept “writer” can be interchanged with the word “author”. This feature of a single node
having several synonyms makes the model a forest-type model.
In abstract terms, the book model shows the hierarchical relationships of concepts, starting from
the root node “Book”.
A fragment of the book model developed using the Protege tool has the structure as depicted in
Figure 3.3.
Figure 3-3-Book Model Ontology In Protege
A fragment of the concept hierarchy of a book can be described as depicted in Figure 3.4.
Figure 3-4-Fragment of the concept hierarchy of book
3.1.3.3 Feature Model
The ontology data model used here indicates the relationships among features. The root node of
the feature model is "Features".
Unlike the book model, this model does not have a multi-level hierarchy: every node has a single
parent node, the "Features" node. Sibling nodes are related to each other through ontology
object properties. For example, the object property "hasWriting" has
"Writing" as its range class and "Book" as its domain class. Through these properties, all the
concepts in the feature model are connected.
The Feature model developed using Protege tool has the structure as depicted in Figure 3.5.
Figure 3-5-Feature Model Ontology In Protege
Fragment of the concept hierarchy of book features can be described as Figure-3.6.
Figure 3-6-Fragment of the concept hierarchy of features
We defined 21 different properties to connect the concept classes of features. All the feature
concept classes are disjoint with each other.
All the concept classes were connected with the properties via the domain and range attributes
of OWL. The domain is the set of individuals to which the property is applicable; the range is
the set of individuals that can appear as values of the property. Table 3.1 describes how the
properties are connected with domains and ranges in the feature model.
Property Name Concept Domain Class Concept Range Class
hasChapter Book Chapter
hasCharacter Book Character
hasClimax Book Climax
hasContent Book Content
hasCover Book Cover
hasDescription Book Description
hasDialogue Book Dialogue
hasIdea Book Idea
hasPace Book Pace
hasPage Book Page
hasPlot Book Plot
hasProtagonist Book Protagonist
hasRead Book Read
hasRendering Book Rendering
hasResearch Book Research
hasSentence Book Sentence
hasSetting Book Setting
hasStory Book Story
hasTitle Book Title
hasVocabulary Book Vocabulary
hasWriting Book Writing
Table 3.1 OWL Properties of feature model
3.2 Pre-processing
Pre-processing is the step in which unstructured review text is converted into a fixed-format,
unambiguous, structured form. This step involves sentence detection, tokenization, POS tagging,
lemmatization, syntactic parsing, named entity recognition and co-reference resolution across
sentences.
The entire pre-processing step is carried out using only Stanford NLP libraries.
The entire process can be explained as shown in Figure 3-7.
Figure 3-7-Steps for Pre-processing review texts
1. English Tokenizer :
Tokenizes words
2. Sentence Splitter :
Detects sentence boundaries and splits the text at them. This can be tricky for
sentences like "It took only 3.5 hours for me to finish this book.", where the
splitter must not split at the token "3.5".
3. Morphological Analyzer
Also called a lemmatizer; it derives the root form of each word. For example,
the root of "simplest" is "simple".
4. POS Tagger
This process segments the sentences and identifies the verbs, nouns, adjectives, adverbs,
determiners and prepositions.
5. Co-reference Pruning
This process involves finding all the references related to a particular entity in the
whole text. For example, “The story is unique, but it is a bit slow.”.
After performing the co-reference pruning, we can modify the sentence as,
“The story is unique, but the story is a bit slow.”
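The real pipeline uses Stanford NLP for all of these steps. As a library-free illustration of the tricky sentence-splitting case mentioned above, the following sketch splits on sentence terminators while ignoring decimal points; it is a simplified stand-in, not the Stanford splitter:

```java
import java.util.*;

// Minimal sketch of the sentence-splitting step: split on '.', '!' or '?',
// but do not split inside decimal numbers such as "3.5".
public class SentenceSplitter {
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            cur.append(c);
            boolean terminator = c == '.' || c == '!' || c == '?';
            // A '.' flanked by digits is a decimal point, not a sentence end.
            boolean decimalPoint = c == '.' && i > 0 && i + 1 < text.length()
                    && Character.isDigit(text.charAt(i - 1))
                    && Character.isDigit(text.charAt(i + 1));
            if (terminator && !decimalPoint) {
                sentences.add(cur.toString().trim());
                cur.setLength(0);
            }
        }
        if (cur.toString().trim().length() > 0) sentences.add(cur.toString().trim());
        return sentences;
    }
}
```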
Let us consider the review text “Book is an exceptional read. The characters are well-developed
and authentic, the story entertaining and enlightening, the writing and research first rate. All
in all, a tremendous read. I can't imagine anyone who would not enjoy this book, I highly
recommend.”
Once this review went through the pre-processing stage, we got the following output.
Book_NN is_VB an_DT exceptional_JJ read_NN ._. The_DT characters_NN are_VB
well-developed_JJ and_CC authentic_JJ ,_, the_DT story_NN entertaining_JJ and_CC
enlightening_JJ ,_, the_DT writing_NN and_CC research_NN first_JJ rate_NN ._.
All_DT in_IN all_DT ,_, a_DT tremendous_JJ read_NN ._. I_PRP ca_MD n't_RB
imagine_VB anyone_NN who_WP would_MD not_RB enjoy_VB this_DT book_NN ,_,
I_PRP highly_RB recommend_VBP ._.
The above output is further used for feature identification, polarity measuring and sentiment
analysis.
3.3 Feature extraction with Ontology
The pre-processed texts are integrated with the ontology to improve the accuracy of feature
extraction. In this step, ontology terminologies are used to extract the feature-bearing
POS-tagged words for further processing. The sentences containing ontology terminologies are
identified first, and features are then extracted from those sentences [1]. Feature extraction
has the following steps.
1. Load the ontology feature concepts into the system.
2. Prepare the pre-processed review text for feature extraction.
3. Extract each concept noun and the adjectives that modify that particular concept.
3.3.1 Loading ontology concepts
In order to read the OWL file, we used the OWL API [48], which can read, manipulate and
serialize OWL concepts, instances and properties. We used it to read the feature model and the
book model and to identify the concepts to look for in the review texts.
Each loaded concept model object has the following structure.
class ConceptModel {
    private String className;                                   // OWL class name, e.g. "Author"
    private Map<String, String> propertyMap = new HashMap<>();  // object properties of the concept
    private List<String> labelList = new ArrayList<>();         // synonym labels, e.g. "author", "writer"
}
As can be seen, each OWL concept is labelled with synonyms; for example, the Author concept has
the labels author and writer. Therefore, when going through the review text, whenever the system
encounters a noun, it checks every concept model's labels to see whether the noun matches any
concept.
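This label lookup can be sketched as follows. The class and the example labels are illustrative assumptions; the thesis implementation iterates over ConceptModel objects loaded via the OWL API:

```java
import java.util.*;

// Hedged sketch of the concept-lookup step: map every synonym label to its
// concept class so that any noun in a review resolves to a concept in O(1).
public class ConceptIndex {
    private final Map<String, String> labelToConcept = new HashMap<>();

    public void register(String conceptName, List<String> labels) {
        for (String label : labels) labelToConcept.put(label.toLowerCase(), conceptName);
    }

    // Returns the concept class for a noun, or null if it is not an aspect.
    public String lookup(String noun) {
        return labelToConcept.get(noun.toLowerCase());
    }
}
```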
3.3.2 Prepare the pre-processed review text for feature extraction
Pre-processed review text is further processed using the following operations:
● First, we replaced proper names with the corresponding concepts. For example, if the review
sentence is "Sara Gruen has written an excellent book", where Sara Gruen is the name of the
author, we rewrote the sentence as "Author has written an excellent book". Likewise, we
replaced the names of characters with the constant word "Character". Since the concepts author
and character exist in the ontology domain model, this part of the review is not missed. Proper
nouns are marked with the tag NNP by the Stanford POS tagger.
● Next, we removed all stop words such as "a", "an" and "the"; these are tagged as determiners
(DT) by the Stanford POS tagger. In addition, we eliminated terms like "that" and "this"; such
words are tagged as IN or DT by the Stanford POS tagger.
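The two operations above can be sketched together. The "word_TAG" token format and the name map are assumptions for illustration:

```java
import java.util.*;

// Hedged sketch of the preparation step: replace known proper names with
// their concept word, then drop determiner (DT) and preposition (IN) tokens.
public class ReviewPreparer {
    public static List<String> prepare(List<String> tagged, Map<String, String> nameToConcept) {
        List<String> out = new ArrayList<>();
        for (String token : tagged) {
            int sep = token.lastIndexOf('_');
            String word = token.substring(0, sep), tag = token.substring(sep + 1);
            if (nameToConcept.containsKey(word)) {        // e.g. "Gruen" -> "Author"
                out.add(nameToConcept.get(word) + "_NN");
                continue;
            }
            if (tag.equals("DT") || tag.equals("IN")) continue;  // drop stop words
            out.add(token);
        }
        return out;
    }
}
```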
3.3.3 Prepare the tuples of aspects and modifiers
On top of the pre-processed text, we then identified the aspect nouns. For each aspect noun, we
defined a range of 4 words on either side as the border area from which to extract adjectives.
Freitas and Vieira [29] defined a range of 3 for Portuguese; for English, after analysing
multiple review texts, we decided on a range of 4 [refer Section 4.4]. For example, consider the
aspect noun "author" and the sentence "It is amazing that Author was able to write with such
realism.". After pre-processing, we got the following processed text:
It_PRP amazing_JJ Author_NN is_VBD able_JJ to_TO write_VB such_JJ realism_NN ._.
As can be seen from the above sentence, the author has a range of 4 words on the right-hand side
(is, able, to, write) and 2 words on the left-hand side (It, amazing). Although the author has
inherited the 2 adjectives "able" and "amazing", "amazing" sits right next to the pronoun "It".
Therefore, with specific linguistic rules like this, we can conclude that the author takes only
the adjective able.
Therefore, the tuple for the concept author is (Author, able).
Likewise, we defined multiple domain specific rules to correctly identify the adjectives for each and
every aspect.
By applying the rules, eventually, we ended up with the following tuples.
(story, interesting ), (story, entertaining ), (story, thrilling)
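The basic windowing step underlying these rules can be sketched as follows; a minimal illustration, with the class name and "word_TAG" format assumed:

```java
import java.util.*;

// Hedged sketch of the window step: collect up to 'range' tagged tokens on
// each side of the aspect noun at position 'idx'.
public class AspectWindow {
    public static List<String> rhs(List<String> tokens, int idx, int range) {
        return new ArrayList<>(tokens.subList(idx + 1, Math.min(tokens.size(), idx + 1 + range)));
    }

    public static List<String> lhs(List<String> tokens, int idx, int range) {
        List<String> left = new ArrayList<>(tokens.subList(Math.max(0, idx - range), idx));
        Collections.reverse(left);  // nearest word first, as in the worked examples
        return left;
    }
}
```

For the "Author" example above this yields RHS = {is, able, to, write} and LHS = {amazing, It}, matching the sets in the text.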
Below, we list some of the important rules that were heavily used during the lexical analysis
operation.
Rule: Defining the range of aspect nouns
Although we defined a hard range of four, on some occasions there are not enough words to fill
the range because the sentence ends. On other occasions we extended beyond the range of four on
the right-hand side, and sometimes we could not cover all four words on the left-hand side
either.
Let us consider the following review text,
“The characters are well-developed and authentic, the story entertaining and enlightening, the
writing and research first rate. Thank you author for a Refreshing, Awesome, touching read!”
After this sentence goes through all the pre-processing where common determiner terms are
removed, words are lemmatized and each word is appended with POS tags, eventually we got the
following processed text;
“characters_NNS is_VBP well-develop_JJ, interesting_JJ, engaging_JJ and_CC authentic_JJ
,_, story_NN entertaining_JJ and_CC enlightening_VB ,_, writing_NN and_CC research_NN
first_JJ rate_NN. Thank_VB you_PRP author_NN for_IN Refreshing_JJ, Awesome_JJ ,
exciting_JJ, touching_VBG read_NN !_.”
Rule #1
As we consider the noun/aspect "story" in the above review text, the LHS of the noun "story"
contains a comma separator. When a comma appears on the LHS before the count of four is reached,
only the words following the comma are included in the range.
If we consider the aspect “story” and if we apply the LHS word count rule, we get the following set
of words in RHS and LHS:
RHS [story]= {entertaining, and, enlightening, “,”}
LHS [story] = {} (Empty set since there are no words following the comma and the only word next
to the comma is the aspect “story”.)
Rule #2
As we consider the noun/aspect “character”, ideally we have the following set for RHS:
RHS[character] = { is, well-develop, “,”, interesting}
In the RHS set, we found out that there are two adjectives; namely “well-develop” and
“interesting”.
In that case, we have to check whether there are any prepositions/nouns/pronouns in between these
two adjectives. Since we do not have any of these, we can safely assume that these adjectives belong
only to the noun/aspect “character”.
However, the range of four may miss further adjectives describing the noun/aspect "character"
later in this series of phrases. Therefore, we applied the following lexical techniques to
identify the complete set of adjectives for that noun/aspect.
● If the RHS set ends with an adjective (regardless of the total number of adjectives in the
set), we then checked beyond the four-word limit: we checked the 5th word from the noun/aspect.
If the 5th word is a comma separator or a conjunction (like AND or OR) and the 6th word is an
adjective or gerund, that adjective/gerund belongs to the concerned aspect/noun only. We
repeated this step of checking for a comma separator or a conjunction and then for an
adjective/gerund until we encountered a non-adjective/gerund term, a full stop, or a break in
the comma-and-adjective pattern.
● If the RHS has an adjective/gerund as the 3rd word and if the 4th word is a comma
separator or a conjunction, we followed the same iteration to discover the adjectives/gerunds.
After following the above rules for the aspect “characters”, we ended up with the following RHS
set.
RHS[characters] = { is, well-develop, enlightening, “,” , interesting, engaging, and, authentic}
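The continuation scan in Rule #2 can be sketched as follows; a simplified illustration under the "word_TAG" assumption, starting from the 5th word to the right of the aspect noun:

```java
import java.util.*;

// Hedged sketch of Rule #2's continuation scan: past the 4-word window, keep
// collecting adjectives while the (comma-or-conjunction, adjective/gerund)
// pattern repeats; stop as soon as the pattern fails.
public class AdjectiveContinuation {
    static boolean isAdj(String tag) { return tag.startsWith("JJ") || tag.equals("VBG"); }
    static boolean isLink(String word, String tag) { return word.equals(",") || tag.equals("CC"); }

    public static List<String> collect(List<String> tokens) {
        List<String> adjectives = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i += 2) {
            String[] link = splitTok(tokens.get(i));
            String[] next = splitTok(tokens.get(i + 1));
            if (isLink(link[0], link[1]) && isAdj(next[1])) adjectives.add(next[0]);
            else break;  // pattern failed: stop collecting
        }
        return adjectives;
    }

    static String[] splitTok(String t) {
        int s = t.lastIndexOf('_');
        return new String[]{ t.substring(0, s), t.substring(s + 1) };
    }
}
```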
Rule #3
Let us consider the following phrase of a pre-processed review text.
“It_PRP is_VBZ exciting_JJ and_CC paints_NN several_JJ unique_JJ ,_, unforgettable_JJ
and_CC vivid_JJ scenes_NN, which_DT is_VBZ MORE_RBR than_IN most_JJS books_NNS
are_VBP able_JJ do_VB.”
As we consider the noun/aspect “scenes”, after applying Rule#1 and Rule#2, we got an empty set
for RHS:
RHS[scenes] = {} (empty set)
On the other hand, we used the following rules to create the LHS set.
● Initially, the first 4 words that are to the left of the aspect/concept/noun are considered to be
part of the set unless there is a comma separator within those 4 words or a full stop of the
previous sentence.
● However, if there is an adjective/gerund as the first word of the LHS:
1. Check for the second word. If the second word is a conjunction (AND, OR) continue
traversing to the third word.
2. In the event that the first word is an adjective/gerund and the second word is not a
conjunction, only the first word is considered as the adjective/gerund that modifies
the concept noun from the LHS.
3. In the event that the second word is a conjunction, check the third word for an
adjective/gerund. If the third one is an adjective/gerund, then check for the fourth
word. If the fourth word has a comma separator and the fifth word has an
adjective/gerund, we continue to iterate till we cannot find this pattern anymore.
4. Negation words usually appear close to the adjectives/gerunds, so when negation is
present it can be detected within the first 4 words, near the first
adjective/gerund.
After applying the above principles, we obtained the following set of LHS:
LHS[scenes] = {vivid, and, unforgettable, unique, several}
Finally, we got the following output tuples:
(scenes, vivid), (scenes, unforgettable), (scenes, unique) and (scenes, several)
Rules: Selecting modifiers within the aspect range
Once the ranges of an aspect are defined as LHS and RHS sets, the following rules are applied on
top of them to discover the adjectives/gerunds/verbs that actually modify the aspect/noun. Most
of the rules handle adjectives/gerunds only. Some of the important rules are explained in this
section.
Rule #4
Rule #4 is applied on top of the output that we got from Rule#1, Rule#2 and Rule #3.
When we consider the last sentence from the above review text:
“Thank_VB you_PRP author_NN for_IN Refreshing_JJ, Awesome_JJ , exciting_JJ,
touching_VBG read_NN !_.”
When we consider the aspect/concept “author”, we got the following RHS:
RHS[author] ={for, Refreshing, “,” , Awesome, “,”, exciting}
However, the first word of the set, "for", is a preposition, and a preposition is a connecting
word that expresses the relation between two other words. Therefore, although there are
adjectives within the range of author, they do not belong to the author.
From the word "exciting" we moved forward until we found a noun/pronoun and assigned the
adjectives to it. In fact, this RHS became the LHS of the aspect/concept "read"
in the above phrase.
LHS[read] = {Refreshing, “,” , Awesome, “,”, exciting, touching}
Rule #5
All the previous rules are mostly concerned with handling adjectives and gerunds. This rule
handles nouns/concepts/aspects.
In the above review itself, let us consider the sentence,
characters_NNS is_VBP well-develop_JJ, interesting_JJ, engaging_JJ and_CC authentic_JJ
,_, story_NN entertaining_JJ and_CC enlightening_VB ,_, writing_NN and_CC research_NN
first_JJ rate_NN.
In the above sentence, when considering the aspect writing, we got the following RHS and LHS:
RHS[writing] = {and, research, first, rate}
However, as seen above, the first word of the RHS is a conjunction. When a conjunction is
followed by a noun that is also an aspect, whatever adjectives describe the concerned concept
also describe that following noun. In the above example, the adjective "first" describes both
the nouns/aspects writing and research. Therefore, we created the following tuples:
(writing, first) and (research, first)
Rule #6
This is a variant of Rule #4. In Rule #4, the preposition "for" followed the concept/aspect;
this rule deals with prepositions preceding the concept/aspect.
Let us consider the following review text.
“I grew up with the Berenstein Bear's books and love their illustrations and messages, but this
one lacks in content and craftsmanship.”
After pre-processing this, we got the following output. During the pre-processing, proper nouns like
book name and character names are replaced with tags like [BOOK_NAME] and character.
“I_PRP grew_VBD up_RP with_IN [BOOK_NAME] 's_POS books_NNS and_CC love_VBP
their_PRP illustrations_NNS and_CC messages_NNS ,_, but_CC one_CD lacks_VBZ in_IN
content_NN and_CC craftsmanship_NN.”
When we consider the noun/aspect "content", the aspect "craftsmanship" is affected in the same
way because of the preposition "in": a preposition connects the noun/pronoun with other parts
of the sentence.
Therefore, the LHS of the noun/aspect “content” is:
LHS[content] = {in, lacks, one, but}
However, as seen above, the first word of the LHS is a preposition. Therefore, the 2nd word
reflects the sentiment of the noun/aspect; this second word can be any form of verb. In some
scenarios a preposition consists of more than one word, such as "as if"; in that case, the next
immediate verb phrase is selected for sentiment analysis.
Rule #7
Let us consider the following phrase of a pre-processed review text.
“I_PRP loved_VBD that_IN author_NN did_VBD significant_JJ research_NN make_VB
story_NN realistic_JJ.”
When we consider the noun/concept/aspect author, we got the following RHS of the word range:
RHS[author] = { did, significant, research, make}
As per the above set, the first word is a verb and an adjective follows it. However, the
adjective "significant" sits immediately before the concept "research", and there are no
connecting words between the aspects research and author.
Therefore, we can safely assume that in this scenario the adjective "significant" modifies only
the aspect/concept research.
The set can now be redefined as follows:
LHS[research] = {significant, did, author, that}
Eventually, we got the final tuple for the aspect “research” as (research, significant).
Likewise, by adding multiple rules, we performed the lexical analysis; as we defined more rules,
our precision and recall increased as well. Let us assume that, after processing through the
rules, we get the following set of noun-adjective tuples for the aspect/concept story:
(story, interesting), (story, entertaining), (story, thrilling)
Next, we further processed the RHS and LHS sets of the story aspect to check whether negation
words like NO, NEVER or NOT exist in those sets. Whenever we found a negation word, we extended
the tuple with a third value, 1. The new tuples are:
(story, interesting, 1 ), (story, entertaining, 1 ), (story, thrilling, 1)
If no negation words had been found for those adjectives, the flag would be 0 instead of 1:
(story, interesting, 0), (story, entertaining, 0), (story, thrilling, 0)
Now the concept/aspects are ready to be processed for sentiment analysis and opinion mining.
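The negation-flagging step can be sketched as follows; the negation word list here is illustrative, not exhaustive:

```java
import java.util.*;

// Hedged sketch of negation flagging: a tuple (aspect, adjective) gets flag 1
// when a negation word appears in the aspect's LHS/RHS window, otherwise 0.
public class NegationFlagger {
    private static final Set<String> NEGATIONS =
            new HashSet<>(Arrays.asList("no", "not", "never", "n't"));

    public static int flag(List<String> windowWords) {
        for (String w : windowWords)
            if (NEGATIONS.contains(w.toLowerCase())) return 1;
        return 0;
    }
}
```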
The entire process can be depicted as shown in Figure 3.8
Figure 3.8-Feature Identification
3.4 Feature Score Calculation
Once the features are extracted, we assigned scores for each feature. This score is later used to
calculate the polarity for each feature.
Unlike traditional methods, which give equal importance to all features, we followed the
approach of Isidro and Rafael [49], who weight each feature based on multiple factors:
● Features cited more often by users receive higher scores.
● The score of a feature depends on where in the text the feature occurs.
In Rafael's work, the entire review corpus is divided into three parts based on the word count.
For this research, since the phrases within a sentence carry the same weight, we instead divided
each review text into three parts based on the sentence count. Based on the works [29][49], we
used the same constant values for each part: the first part's constant z1 is 0.3, the second
part's (z2) is 0.2, and the final part's (z3) is 0.5.
3.4.1 Calculating the score for a single feature
In order to calculate the score for each user opinion, we found that rather than depending on a
single review, it is better to crawl all 600 reviews and calculate the score across the entire
corpus. Based on the work of Isidro and Rafael [49], we used their equation; the only difference
is that rather than calculating it for a single review, we look for the same feature across all
available reviews. The following equation calculates the score of a feature for a single
review [29]:
Equation #1
score(f, userop_i) = z1 * |O1| + z2 * |O2| + z3 * |O3|
As per Equation #1, z1, z2 and z3 are the constants mentioned in the previous subsection:
z1 = 0.3, z2 = 0.2 and z3 = 0.5
|O1| = number of occurrences of the concept in the first part of the review
|O2| = number of occurrences of the concept in the second part of the review
|O3| = number of occurrences of the concept in the final part of the review
Next, we aggregated the score across all reviews and took the arithmetic mean to obtain the
average score of the concept, as shown in Equation #2 [1][29][49].
Equation #2
Tscore(f) = ( Σ_i score(f, userop_i) ) / n
The value of n is 600. For the feature score calculation alone, we used the primary and
secondary data sets together, so that the aspects of both data sets share the same base scores
across tests. After running this calculation, we obtained the scores for the individual aspects
shown in Table 3.2.
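The scoring of Equations #1 and #2 can be sketched as follows. Splitting a review into thirds by sentence index and matching features by substring are simplifying assumptions for illustration:

```java
import java.util.*;

// Hedged sketch of Equations #1 and #2: score a feature within one review by
// its occurrences in the first/middle/last third of the sentences (weights
// z1 = 0.3, z2 = 0.2, z3 = 0.5), then average the score over all reviews.
public class FeatureScore {
    static final double Z1 = 0.3, Z2 = 0.2, Z3 = 0.5;

    // Equation #1: one review, given as a list of sentences.
    public static double score(String feature, List<String> sentences) {
        int n = sentences.size(), o1 = 0, o2 = 0, o3 = 0;
        for (int i = 0; i < n; i++) {
            if (!sentences.get(i).toLowerCase().contains(feature)) continue;
            if (i < n / 3) o1++;
            else if (i < 2 * n / 3) o2++;
            else o3++;
        }
        return Z1 * o1 + Z2 * o2 + Z3 * o3;
    }

    // Equation #2: arithmetic mean over the whole corpus of reviews.
    public static double tScore(String feature, List<List<String>> reviews) {
        double sum = 0;
        for (List<String> review : reviews) sum += score(feature, review);
        return sum / reviews.size();
    }
}
```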
Concept Score Value
Chapter 0.005
Character 0.015
Climax 0.02
Content 0.06
Cover 0.01
Description 0.03
Dialogue 0.05
Idea 0.08
Pace 0.02
Page 0.01
Plot 0.04
Protagonist 0.03
Read 0.10
Rendering 0.08
Research 0.05
Sentence 0.09
Setting 0.05
Story 0.18
Title 0.03
Vocabulary 0.01
Writing 0.04
Table 3.2 Feature Scores
3.5 Polarity Identification
Polarity calculation for this research was carried out using SentiWordNet 3.0 [21]. As already
explained, the feature extraction process outputs tuples with a concept.
Suppose we got the tuple (story, exciting, 0). We then performed a SentiWordNet dictionary
lookup to identify the polarity of the adjective "exciting", which lets us determine the
polarity of the concept story.
3.5.1 Loading the SentiWordNet 3.0 Dictionary
SentiWordNet 3.0 provides positive, negative and neutral polarity values for the nouns,
adjectives and verbs located within the defined range of concepts/features. Going through the
SentiWordNet 3.0 CSV file, we observed entries such as those shown in Table 3.3.
POS Tag   Offset     Pos(t)   Neg(t)   Sense
Adj       00921014   0.375    0.0      1
Adj       336578     0.25     0.025    2
Adj       423458     0.0      0.65     3
Table 3.3 SentiWordNet senses of "exciting"
SentiWordNet 3.0 has three different senses of polarity values for the adjective word “exciting”.
Therefore, before performing the polarity processing operation, we updated the SentiWordNet 3.0
dictionary with appropriate positive, negative and neutral values.
As the initial step, neutral values of all three senses can be easily calculated by following Equation
#3.
Equation #3
score(neutral ) = 1 - ( score(pos) + score(neg) )
Once the neutral value has been identified, the next step is to compute a single common
positive, negative and neutral polarity value per word. Analysing SentiWordNet 3.0, we observed
that a considerable number of phrases have multiple senses. Therefore, to calculate single
positive, negative and neutral values, we used the following approach.
As already mentioned, the adjective "exciting" has 3 senses. Each sense is a separate row in the
SentiWordNet dictionary with its own polarity values, so the adjective "exciting" has 3 sets of
positive, negative and neutral values. The equations below collapse these into a single
positive, negative and neutral value for the adjective "exciting".
Equation #4 [1]
score(word, pos) = ( Σ_{ws_i ∈ pos} score(ws_i) − Σ_{ws_i ∈ neg} score(ws_i) ) / |{ws_i ∈ pos}|

Equation #5 [1]
score(word, neg) = ( Σ_{ws_i ∈ neg} score(ws_i) − Σ_{ws_i ∈ pos} score(ws_i) ) / |{ws_i ∈ neg}|

Equation #6 [1]
score(word, obj) = ( Σ_{ws_i ∈ obj} score(ws_i) ) / |{ws_i ∈ obj}|
In the above Equations #4, #5 and #6,
score('word') = average sense value of the word.
score(ws_i) = score of the i-th sense of the word. For example, the positive value of the 2nd
sense of the adjective "exciting" is 0.25.
|{ws_i ∈ …}| = number of senses of the word with the given polarity (the cardinality used as
the denominator).
Equations #4, #5 and #6 can be used for terms that carry all three senses. If a term has only
two of the three senses, those two values are calculated using the corresponding equations, and
the third polarity value is obtained by subtracting the sum of those two from 1. If the term has
only one sense, the SentiWordNet 3.0 value can be used directly.
By applying Equations #4, #5 and #6, the average positive, negative and neutral polarity values
for the word "exciting" can be obtained. These newly obtained values are stored in the
SentiWordNet dictionary map, where each word or phrase is represented as:
exciting = [ positive polarity, negative polarity, neutral polarity ]
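The sense-averaging step can be sketched as follows. Classifying a sense as positive when its positive score exceeds its negative score (and vice versa) is an assumption made for this illustration:

```java
import java.util.*;

// Hedged sketch of Equations #3-#6: collapse a word's multiple SentiWordNet
// senses into single positive/negative/neutral values.
public class SenseAverager {
    // Each sense: {pos, neg}; neutral = 1 - pos - neg (Equation #3).
    public static double[] average(double[][] senses) {
        double posSum = 0, negSum = 0, objSum = 0;
        int posN = 0, negN = 0, objN = 0;
        for (double[] s : senses) {
            double pos = s[0], neg = s[1], obj = 1 - pos - neg;
            if (pos > neg) { posSum += pos; posN++; }        // positive sense
            else if (neg > pos) { negSum += neg; negN++; }   // negative sense
            objSum += obj; objN++;
        }
        double avgPos = posN > 0 ? (posSum - negSum) / posN : 0;  // Equation #4
        double avgNeg = negN > 0 ? (negSum - posSum) / negN : 0;  // Equation #5
        double avgObj = objN > 0 ? objSum / objN : 0;             // Equation #6
        return new double[]{ avgPos, avgNeg, avgObj };
    }
}
```

Fed the three "exciting" senses from Table 3.3, this produces one [positive, negative, neutral] triple for the word.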
3.5.2 Calculating the Polarity
For example, when calculating the polarity of the concept story, the concept and adjective are
represented as (story, exciting, 0).
According to Sanchez and Penalver [49], multiplying the score of the story feature (calculated
using Equation #2) by the polarity vector of the adjective exciting gives the polarity of the
story feature. This operation is shown below.
Equation #7 [49]
V(f) = Tscore(f) * ( a * ScorePos, a * ScoreNeg, ScoreObj )
V(f) = polarity vector of feature f; in our case, the feature story.
Tscore(f) = score of the feature, derived from Equation #2.
a = either 1 or -1. If any negation words appear within the range of the given concept noun,
a is -1 and both ScorePos and ScoreNeg take negative values.
Once we obtain the V(f), the polarity of the feature story can be calculated easily. Next, we
identified the feature sentiment classification for that particular feature.
3.6 Sentiment Analysis
In order to perform sentiment analysis, we used a technique based on vector analysis for aspect
sentiment score computation. As explained in the previous section, we defined our sentiment
score as a vector in R^3 [49]. This module follows the approach of Isidro and Rafael [49].
The target point of the Euclidean vector contains the polarity values
[positive sentiment, negative sentiment, neutral sentiment].
This Euclidean vector is further processed to obtain the final aggregated global polarity of the
review text.
3.6.1 Euclidean Vector
The Euclidean vector is defined in R^3 [1]. It takes an origin point and a target point.
The origin point is always O = (0, 0, 0).
The target point takes the form T = (x, y, z).
Therefore, the Euclidean difference/position vector is V = (x, y, z).
3.6.2 Calculating the Polarity
Feature polarities are likewise represented as Euclidean vectors:
V(f) = (x, y, z)
x = positive polarity of the feature
y = negative polarity of the feature
z = neutral polarity of the feature
Now, based on Equation #7 and vector V(f), we derived the following equation.
Equation #8 [49]
Polarity(f) = { +x (positive), −1 ≤ x ≤ 1
              { −y (negative), −1 ≤ y ≤ 1
              { 0  (neutral),   0 ≤ z ≤ 1
Since Equation #8 defines the polarity for a single feature, the global polarity for the whole user
review can be determined by having the summation of polarities for all features. Equation #9 gives
the global polarity.
Equation #9 [49]
Polarity = Σ_{i=1..n} V(f_i)
Equation #9 outputs a Euclidean vector containing the global positive polarity and the global
negative polarity; its neutral component is always 0.
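Equations #7 and #9 can be sketched together as follows; a minimal illustration, with vectors represented as plain [positive, negative, neutral] arrays:

```java
import java.util.*;

// Hedged sketch of Equations #7 and #9: build the polarity vector of a
// feature, then sum feature vectors into the review's global polarity.
public class PolarityVectors {
    // Equation #7: tScore is the feature weight; negation flips pos/neg sign.
    public static double[] featureVector(double tScore, double pos, double neg,
                                         double obj, boolean negated) {
        double a = negated ? -1 : 1;
        return new double[]{ tScore * a * pos, tScore * a * neg, tScore * obj };
    }

    // Equation #9: component-wise sum over all feature vectors.
    public static double[] globalPolarity(List<double[]> vectors) {
        double[] sum = new double[3];
        for (double[] v : vectors)
            for (int i = 0; i < 3; i++) sum[i] += v[i];
        return sum;
    }
}
```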
Chapter 4
Evaluation
4.1 Selecting Corpus for reviews
In the research conducted to evaluate ontology-based sentiment analysis on movie reviews, Zhao
and Li [1] used 120 reviews: 60 positive and 60 negative. To evaluate how effective the proposed
method is for book reviews, we chose a corpus of 600 book reviews. Unlike movies, books are
richer in literature, grammar, plot and storyline; to cover all these aspects, the 600 reviews
were chosen for variety and diversity.
We chose two types of data sets,
● Primary set of book reviews
● Secondary set of book reviews
4.1.1 Primary set of book reviews
This set contains 150 book reviews, carefully chosen for their relatively large text content and
for covering multiple aspects with fairly complex lexical patterns. These 150 reviews were
examined and analysed to create the ontology and to design the lexical patterns for analysis.
Those reviews can be broken down into the following categories.
● Reviews that are more positive - 75
● Reviews that are more negative - 75
4.1.2 Secondary set of book reviews
This set contains 450 book reviews, chosen carefully from a corpus of 15,000 book reviews. The
length of the reviews in this set varies widely.
Those reviews can be broken down into the following categories.
● Reviews that are more positive - 225
● Reviews that are more negative - 225
4.2 Evaluation of Opinion Mining Methodology
We primarily carried out three different types of tests to evaluate the proposed methodology:
finding the best window length around words/concepts/aspects, aspect detection, and sentiment
analysis. All results are summarized and reported in terms of Precision, Recall and F1 measure.
Precision
Precision is defined as:
Precision = TP / (TP + FP)
where
TP = number of polarities correctly predicted as matching (as annotated by the human coders)
TP + FP = total number of polarities predicted as matching

Recall
Recall is defined as:
Recall = TP / P
where
TP = number of polarities correctly predicted as matching (as annotated by the human coders)
P = total number of actually matching polarities, i.e. those correctly predicted as matching
plus those incorrectly predicted as not matching

F1 measure
The F1 measure is the harmonic mean of precision and recall:
F1 = 2 * Precision * Recall / (Precision + Recall)

Based on the above equations, the scores for precision, recall and F1 measure can be calculated.
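These metrics can be computed directly from the confusion counts, as in the following sketch:

```java
// Sketch of the evaluation metrics: precision, recall and F1 computed from
// true positives (tp), false positives (fp) and false negatives (fn).
public class Metrics {
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    public static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    public static double f1(double p, double r)    { return 2 * p * r / (p + r); }
}
```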
4.3 Preparing human evaluators
In order to verify the output, two human coders were first trained to label sentences with
features and polarity, using 4 of the 600 reviews from the corpus as training examples. Each
coder was then given a set of 300 book reviews. Once they were done, they swapped outputs and
verified each other's results; any inconsistencies between the coders were resolved through
discussion.
Eventually, we prepared an aggregated output of sentence labels and their polarities.
For example, consider a review stating, “Water for Elephants is an exceptional read. The
characters are well-developed and authentic, the story entertaining and enlightening, the
writing and research first rate. All in all, a tremendous read. I can't imagine anyone who
would not enjoy this book, I highly recommend. I love author’s other books as well.”
The aggregated output of the annotation is shown in Table 4.1.

| Review Sentence | Features | Sentiment |
| --- | --- | --- |
| Water for Elephants is an exceptional read. The characters are well-developed and authentic, the story entertaining and enlightening, the writing and research first rate | characters, story, writing, research | characters = +, story = +, writing = +, research = + |
| All in all, a tremendous read | read | read = + |
| I can't imagine anyone who would not enjoy this book, I highly recommend | - | 0 |
| I love author’s other books as well | author | author = 0 |

Table 4.1. Example of feature annotation by coders
As shown in the above table, the features read, characters, story, writing and research are
annotated with positive polarity, whereas the feature author is considered neutral. Likewise, all
600 reviews were annotated by both coders and later aggregated into a single output.
The same approach is followed by Zhao and Li [1] and by Zhou and Chaovalit [28].
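The aggregation of the two coders' annotations can be sketched as follows. This is a hypothetical illustration, not the thesis code: the dictionary structure, the example polarities, and the treatment of disagreements are assumptions based on the description above (agreements are kept; disagreements, such as the author feature here, are resolved through discussion).

```python
# Sketch: merge two coders' feature/polarity annotations for one review.
# Agreements go straight into the aggregated output; disagreements are
# flagged for discussion between the coders.

coder_a = {"characters": "+", "story": "+", "writing": "+",
           "research": "+", "read": "+", "author": "0"}
coder_b = {"characters": "+", "story": "+", "writing": "+",
           "research": "+", "read": "+", "author": "+"}

# Features where both coders assigned the same polarity.
agreed = {f: p for f, p in coder_a.items() if coder_b.get(f) == p}

# Features the coders disagree on, to be resolved through discussion.
disputed = sorted((set(coder_a) | set(coder_b)) - set(agreed))

print(agreed)
print(disputed)  # ['author']
```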
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews
Ontology Based Opinion Mining for Book reviews

More Related Content

Similar to Ontology Based Opinion Mining for Book reviews

Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment AnalysisMakrand Patil
 
A systematic review_of_internet_banking
A systematic review_of_internet_bankingA systematic review_of_internet_banking
A systematic review_of_internet_bankingsaali5984
 
M Sc Thesis for Work & Organisational Behaviour 2014 UL - GR
M Sc Thesis for Work & Organisational Behaviour 2014 UL - GRM Sc Thesis for Work & Organisational Behaviour 2014 UL - GR
M Sc Thesis for Work & Organisational Behaviour 2014 UL - GRGer Ryan, BA, M. Sc.
 
Combining Knowledge and Data Mining to Understand Sentiment
Combining Knowledge and Data Mining to Understand SentimentCombining Knowledge and Data Mining to Understand Sentiment
Combining Knowledge and Data Mining to Understand SentimentC.Y Wong
 
MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...
MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...
MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...Amany Hamza
 
Masters Dissertation
Masters DissertationMasters Dissertation
Masters DissertationDarren Gash
 
JamesPipe_ManagementDissertation_FormingEffectiveTeams
JamesPipe_ManagementDissertation_FormingEffectiveTeamsJamesPipe_ManagementDissertation_FormingEffectiveTeams
JamesPipe_ManagementDissertation_FormingEffectiveTeamsJames Pipe
 
DENG Master Improving data quality and regulatory compliance in global Inform...
DENG Master Improving data quality and regulatory compliance in global Inform...DENG Master Improving data quality and regulatory compliance in global Inform...
DENG Master Improving data quality and regulatory compliance in global Inform...Harvey Robson
 
Master's Final Dissertation
Master's Final DissertationMaster's Final Dissertation
Master's Final DissertationClick Mark
 
The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...
The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...
The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...Vatanak Kung
 
Book recommendation system using opinion mining technique
Book recommendation system using opinion mining techniqueBook recommendation system using opinion mining technique
Book recommendation system using opinion mining techniqueeSAT Journals
 
Investigating SCM in Indian Construction Industry
Investigating SCM in Indian Construction IndustryInvestigating SCM in Indian Construction Industry
Investigating SCM in Indian Construction IndustryAbhishek Singh
 
AN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docx
AN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docxAN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docx
AN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docxgalerussel59292
 
Exploring high impact scholarship
Exploring high impact scholarshipExploring high impact scholarship
Exploring high impact scholarshipEdaham Ismail
 
Consumer behaviour Project in market research
Consumer behaviour Project in market researchConsumer behaviour Project in market research
Consumer behaviour Project in market researchAnujKumar41745
 
Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...
Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...
Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...Alper Utku
 

Similar to Ontology Based Opinion Mining for Book reviews (20)

Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment Analysis
 
A systematic review_of_internet_banking
A systematic review_of_internet_bankingA systematic review_of_internet_banking
A systematic review_of_internet_banking
 
M Sc Thesis for Work & Organisational Behaviour 2014 UL - GR
M Sc Thesis for Work & Organisational Behaviour 2014 UL - GRM Sc Thesis for Work & Organisational Behaviour 2014 UL - GR
M Sc Thesis for Work & Organisational Behaviour 2014 UL - GR
 
NGO
NGONGO
NGO
 
Combining Knowledge and Data Mining to Understand Sentiment
Combining Knowledge and Data Mining to Understand SentimentCombining Knowledge and Data Mining to Understand Sentiment
Combining Knowledge and Data Mining to Understand Sentiment
 
MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...
MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...
MBA Dissertation - The voice of the Stakeholder: Customer attitudes to the ro...
 
Masters Dissertation
Masters DissertationMasters Dissertation
Masters Dissertation
 
Research handbook
Research handbookResearch handbook
Research handbook
 
JamesPipe_ManagementDissertation_FormingEffectiveTeams
JamesPipe_ManagementDissertation_FormingEffectiveTeamsJamesPipe_ManagementDissertation_FormingEffectiveTeams
JamesPipe_ManagementDissertation_FormingEffectiveTeams
 
DENG Master Improving data quality and regulatory compliance in global Inform...
DENG Master Improving data quality and regulatory compliance in global Inform...DENG Master Improving data quality and regulatory compliance in global Inform...
DENG Master Improving data quality and regulatory compliance in global Inform...
 
Master's Final Dissertation
Master's Final DissertationMaster's Final Dissertation
Master's Final Dissertation
 
The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...
The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...
The Effect of Counterfeit Products on the Consumers’ Perception of a Brand: A...
 
Book recommendation system using opinion mining technique
Book recommendation system using opinion mining techniqueBook recommendation system using opinion mining technique
Book recommendation system using opinion mining technique
 
Investigating SCM in Indian Construction Industry
Investigating SCM in Indian Construction IndustryInvestigating SCM in Indian Construction Industry
Investigating SCM in Indian Construction Industry
 
AN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docx
AN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docxAN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docx
AN ANALYSIS OF PARENTS AND TEACHERS PERCEPTIONS196EDUCAT.docx
 
Exploring high impact scholarship
Exploring high impact scholarshipExploring high impact scholarship
Exploring high impact scholarship
 
Hahn "Student perspectives on personalized account-based recommender systems ...
Hahn "Student perspectives on personalized account-based recommender systems ...Hahn "Student perspectives on personalized account-based recommender systems ...
Hahn "Student perspectives on personalized account-based recommender systems ...
 
Consumer behaviour Project in market research
Consumer behaviour Project in market researchConsumer behaviour Project in market research
Consumer behaviour Project in market research
 
Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...
Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...
Leadership with a Heart (Alper Utku-M00138906-ADOC Thesis Resubmission Edited...
 
Writing a Literature Review- handout
Writing a Literature Review- handout Writing a Literature Review- handout
Writing a Literature Review- handout
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Ontology Based Opinion Mining for Book reviews

  • 1. Ontology Based Opinion Mining for Book Reviews Mohammed Samsudeen Firzhan Naqash (138220P) Degree of Master of Science Department of Computer Science and Engineering University of Moratuwa Sri Lanka March 2017
  • 2. Ontology Based Opinion Mining for Book Reviews Mohammed Samsudeen Firzhan Naqash (138220P) Dissertation submitted in partial fulfilment of the requirements for the degree Master Department of Computer Science and Engineering University of Moratuwa Sri Lanka March 2017
  • 3. i
  • 4. ii
  • 5. iii ABSTRACT The recent burst in web usage has contributed to the growth of a number of various online reviews like consumer product reviews, legal reviews, political reviews, movie reviews and book reviews. Out of these reviews, some are context sensitive and others are not that so. The objective based review area has been heavily studied up to now where opinion mining is basically done using predefined tags that are not context sensitive. On the other hand, subjective based reviews are context sensitive and they depend on the polarity orientation of the term in the sentence. Opinion mining on subjective reviews has not yet been explored in depth. Unlike other reviews like movies and consumer electronic products, there hasn’t been any significant work done in the area of opinion mining on book reviews, which can be categorised as subjective. Contents of book reviews are subjective and each differs from the rest in various ways. Therefore, by aggregating those different opinions on those reviews to a single perspective opinion on book aspects may add more value to the book readers, in academic as well as commercial sectors. This research focuses on introducing a fine-grained approach for opinion mining on online non- scholarly book reviews, where an ontology reference model is used as an essential part of the opinion extraction process by taking into account the relations between concepts. In other words, this research exploits the benefits of using ontology structure for the mining of context sensitive book reviews. Eventually the methodology adopted for mining the context sensitive reviews yielded quite promising results when tested on amazon data set of book reviews.
  • 6. iv ACKNOWLEDGEMENT I sincerely thank my family for continuously supporting and motivating me to make this report a success. I would like to thank Dr. Daya Chinthana Wimalasuriya for providing me valuable guideline during the initial state of this project. Last but not least, I would like to thank the University of Moratuwa for granting me access to use the library resources for my research work.
  • 7. v Contents ABSTRACT..........................................................................................................................................................i ACKNOWLEDGEMENT ..................................................................................................................................iv List of Figures...................................................................................................................................................viii List of Tables ......................................................................................................................................................ix LIST OF ABBREVIATIONS..............................................................................................................................x Chapter 1..............................................................................................................................................................1 Introduction..........................................................................................................................................................1 1.1 Opinion Mining..........................................................................................................................................2 1.3 Ontology ....................................................................................................................................................2 1.4 The Problem/opportunity...........................................................................................................................3 1.5 Motivation..................................................................................................................................................3 1.6 Objectives ..................................................................................................................................................4 1.7 
Contributions..............................................................................................................................................5 1.8 Outline of thesis.........................................................................................................................................5 1.8.1 Chapter 02 - Literature Survey............................................................................................................6 1.8.1 Chapter 03 - Methodology ..................................................................................................................6 1.8.1 Chapter 04 - Evaluation ......................................................................................................................6 1.8.1 Chapter 05 - Conclusion and Future Work .........................................................................................6 Literature Survey .................................................................................................................................................7 2.1 Opinion Mining..........................................................................................................................................7 2.1.1 Document level sentiment analysis.....................................................................................................8 2.1.2 Sentence level sentiment analysis.......................................................................................................9 2.1.3 Aspect level sentiment analysis ....................................................................................................... 10 2.1.4 Domains and Sentiment Analysis .................................................................................................... 10 2.1.5 Sentiment Lexicon ........................................................................................................................... 
10 2.1.6 Tools for Sentiment Analysis........................................................................................................... 12 2.2 Ontology ................................................................................................................................................. 13
  • 8. vi 2.2.1 Advantages of using an ontology..................................................................................................... 14 2.2.2 Semantic Web and OWL ................................................................................................................. 16 2.2.3 Wordnet............................................................................................................................................ 17 2.2.4 Ontology Development.................................................................................................................... 18 2.3.3 Ontology and Sentiment Analysis.................................................................................................... 20 2.3.4 Related Existing Ontology............................................................................................................... 21 2.4 Data Pre-processing ................................................................................................................................ 22 2.4.1 Lexical Analysis............................................................................................................................... 22 2.4.2 Name Entity Recognition (NER) ..................................................................................................... 23 2.5 Discussion............................................................................................................................................... 24 Methodology..................................................................................................................................................... 25 3.1 Book Ontology........................................................................................................................................ 26 3.1.1 Data Selection and Preparation........................................................................................................ 
27 3.1.2 Ontology Development.................................................................................................................... 29 3.1.3 Ontology Model............................................................................................................................... 32 .................................................................................................................................................................. 34 3.2 Pre-processing......................................................................................................................................... 37 3.3 Feature extraction with Ontology ........................................................................................................... 40 3.3.1 Loading ontology concepts .............................................................................................................. 40 3.3.2 Prepare the pre-processed review text for feature extraction........................................................... 41 3.3.3 Prepare the tuples of aspects and modifier....................................................................................... 41 3.4 Feature Score Calculation....................................................................................................................... 49 3.4.1 Calculating the score for a single feature......................................................................................... 50 3.5 Polarity Identification ............................................................................................................................. 53 3.5.1 Loading the SenitWordNet 3.0 Dictionary ...................................................................................... 53 3.5.2 Calculating the Polarity.................................................................................................................... 
55 3.6 Sentiment Analysis ................................................................................................................................. 56 3.6.1 Euclidean Vector.............................................................................................................................. 56 3.6.2 Calculating the Polarity.................................................................................................................... 57 Evaluation ......................................................................................................................................................... 59
  • 9. vii 4.1 Selecting Corpus for reviews.................................................................................................................. 59 4.1.1 Primary set of book reviews............................................................................................................. 59 4.1.2 Secondary set of book reviews......................................................................................................... 60 4.2 Evaluation of Opinion Mining Methodology.......................................................................................... 60 4.3 Preparing human evaluators.................................................................................................................... 61 4.4 Detecting the Ideal Range for Tupling Concepts and Adjectives........................................................... 63 4.5 Result of Aspect detection ...................................................................................................................... 66 4.6 Results of Sentiment Analysis ................................................................................................................ 67 4.7 Summary of Evaluation .......................................................................................................................... 73 Conclusion and Future Work............................................................................................................................ 74 References ....................................................................................................................................................... 77
List of Figures

Figure 2.1: Classes, Individuals and Properties
Figure 3.1: Overview of the implementation Architecture
Figure 3.2: Ontology Development
Figure 3.3: Book Model Ontology in Protege
Figure 3.4: Fragment of Concept hierarchy of book
Figure 3.5: Feature Model Ontology in Protege
Figure 3.6: Fragment of Concept hierarchy of features
Figure 3.7: Steps for pre-processing review texts
Figure 3.8: Feature Identification
Figure 4.1: Negative Range Detection results
Figure 4.2: Positive Range Detection results
Figure 4.3: Positive Opinion Detection results
Figure 4.4: Negative Opinion Detection results
Figure 4.5: Aspects frequency comparison between positive and negative reviews for the secondary data set
List of Tables

Table 2.1: Document Level Opinion Mining
Table 3.1: OWL Properties of feature model
Table 3.2: Feature Scores
Table 3.3: SentiWN: 'exciting'
Table 4.1: Example of feature annotation by coders
Table 4.2: Output of the range detection experiment
Table 4.3: Aspect/Feature detection results
Table 4.4: Opinion detection results
Table 4.5: Descriptive statistics of positive book review concepts
Table 4.6: Descriptive statistics of negative book review concepts
Table 5.1: Feature scores for the reviews of Water for Elephants
LIST OF ABBREVIATIONS

Abbreviation  Description
ADJ           Adjective
FN            False Negative
FP            False Positive
HTML          Hyper Text Markup Language
IE            Information Extraction
LHS           Left Hand Side
ML            Machine Learning
MUC           Message Understanding Conference
NED           Named Entity Detection
NER           Named Entity Recognition
NLP           Natural Language Processing
OPHLC         Opinion High Level Concepts
OWL           Web Ontology Language
PDF           Portable Document Format
PN            Positive Negative
POS           Part Of Speech
RDF           Resource Description Framework
RHS           Right Hand Side
SentiNet      SentiWordNet
SEO           Search Engine Optimization
SO            Subject Objectivity
SP            Sentiment Polarity
SVM           Support Vector Machine
TP            True Positive
W3C           World Wide Web Consortium
Chapter 1
Introduction

The recent burst in web usage has contributed to the growth of online book reviews. Various readers, poets, novelists, journalists and websites like Amazon (http://www.amazon.com) encourage others to write more reviews on books. Opinions vary, as different people have different views about the book they are reviewing; consequently, some aspects receive positive comments while others receive negative comments [1]. These opinions help readers decide whether a particular book matches their taste.

Reviewers typically begin with high-level aspects such as the introduction of the subject, the introduction of the author, a summary of the intended purpose of the book, the contribution of the book to its discipline, a description of how the author approaches the topic, the rigor of the research, the logic of the argument, the readability of the prose, a comparison with earlier or similar books in the same domain or discipline, and an evaluation of the book's merits and usefulness [22]. Each high-level concept has its own set of sub-aspects. For example, the high-level concept method of development considers attributes such as description, narration, exposition and argument [22]. Readers therefore have to go through each granular-level attribute to summarize the high-level concepts; based on the orientation of those attributes, they can conclude whether the book has content that interests them.

Different reviewers may examine different attributes and may judge the same aspects as positive, negative or neutral. In other words, a single review can contain a mixture of negative, positive and neutral comments. This project therefore aims at performing opinion mining on book reviews from the high concept level down to the granular attribute level.
Since this opinion mining is performed across various levels of reviews, its final outcome may be used to find books based on finer, granular-level attributes or on the high-level concepts of the book.
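The roll-up from granular attributes to high-level concepts described above can be sketched as follows. Only "method of development" and its attributes come from the text; the second concept, the scores and the simple averaging scheme are hypothetical illustrations of one possible aggregation strategy.

```python
# Roll leaf-level attribute polarities up to their high-level concepts.
# Concept hierarchy and scores are hypothetical illustrations.

hierarchy = {
    "method_of_development": ["description", "narration", "exposition", "argument"],
    "readability": ["prose", "clarity"],
}

# Polarity of each granular attribute, in [-1.0, 1.0].
attribute_scores = {
    "description": 0.6, "narration": 0.2, "exposition": -0.4, "argument": 0.8,
    "prose": 0.5, "clarity": -0.1,
}

def concept_polarity(concept):
    """Average the polarities of a concept's sub-aspects."""
    subs = hierarchy[concept]
    scored = [attribute_scores[s] for s in subs if s in attribute_scores]
    return sum(scored) / len(scored) if scored else 0.0

for concept in hierarchy:
    print(concept, round(concept_polarity(concept), 3))
```

Averaging is the simplest choice; a real system might weight sub-aspects by how often reviewers mention them.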
1.1 Opinion Mining

Opinion mining is a discipline that detects or identifies user opinions in textual content. Unlike traditional text mining, which focuses on objective topics, opinion detection must deal with subjective perceptions; opinion mining is therefore quite different from traditional text mining. Text mining tracks specific topics and topic shifts, whereas opinion mining is considerably harder: topics are represented explicitly with keywords, while opinions are expressed with subtlety [2][3][5][9][23].

Online book reviews consist of user-generated content: unstructured text whose opinions may carry a positive, negative or neutral sentiment orientation. This orientation is called sentiment polarity [2][23]. Opinion mining identifies the polarities of all the important sentiments and eventually provides the global sentiment polarity of the unstructured text concerned.

Opinion mining can be performed at multiple levels: document level, sentence level and phrase (aspect) level [16][17][21]. Each level addresses different issues; for example, document-level opinion mining is used for classifying documents or for spam detection.

1.3 Ontology

Ontology is a formal explicit description of concepts in a domain of discourse. It defines a common vocabulary for researchers who need to share information in a domain, and includes machine-interpretable definitions of basic concepts in the domain and relations among them [1][7][23][24][25].
Ontology explains the properties of each concept, the restrictions on each concept and the relationships among the concepts. The knowledge base of an ontology consists of a set of individual instances of concepts.

1.4 The Problem/Opportunity

As detailed in the previous sections, a huge number of online book reviews are available all over the web, either blogged or written on websites like Amazon (http://www.amazon.com). A single book may therefore be reviewed by a variety of reviewers. Each reviewer may have a different perspective with respect to the features they consider, and they may use different wording under different contexts to describe features of the book. These features may receive negative or positive comments, and sometimes the same feature receives a negative orientation from one reviewer and a positive orientation from another. Readers who consult the reviews before trying out the book consequently have a hard time identifying the real value and validity of the book, due to the conflicting nature of the reviews. The domain of opinion mining on book reviews therefore needs immediate attention to sort out the issues mentioned above.

1.5 Motivation

Research on aspect-level opinion mining has been done extensively in the domains of consumer product reviews [3][4][6] and movie reviews [49][1][28][29]. However, no comparable research has been done in the domain of book reviews. On the other hand, research by Zhou and Chaovalit [28] identified ontology as the best-fitting solution for conceptualizing domain-specific information in a much more structured way for
opinion mining. As further proof, Zhao and Li [1] and Isidro and Rafael [48] used the ontology model as the domain model and devised new approaches to effectively perform sentiment analysis and opinion mining. All this ontology-based research yielded positive results because ontology describes the semantics of a domain in a structured format that is both human readable and machine processable.

The contents of book reviews are quite diverse: some content is relevant for discovering aspects, while other, neutral content is irrelevant to the opinion mining and sentiment analysis task. In addition, book reviews may cover several aspects, and each aspect can be expressed in different ways by different users. The recent emergence of ontology for the conceptualization of domain-specific information has therefore made it possible to solve the subjective aspect-identification problem in the domain of book reviews by employing an ontology structure. The ontology structure enables us to interpret book reviews at finer-granularity aspect levels with shared meaning.

By means of ontology, during the feature extraction process, specific aspects of a domain and the relationships between the concepts of that domain can be efficiently identified. This improves, enhances and refines the process of sentiment analysis. In addition, new domain concepts (aspects) and relationships can be added without changing semantic rules. In other words, ontology provides a common vocabulary for a domain, enabling ontology-based parsing of the corpus to identify aspects in reviews. Therefore, developing ontology models for books and features, and using those models to perform sentiment analysis and opinion mining with suitable techniques already researched, can produce the desired results in determining the polarity of aspects mentioned in book reviews.
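A minimal sketch of how ontology concept labels enriched with synonyms could drive aspect detection in review text. The concepts and synonym labels below are invented for illustration; they are not taken from the thesis ontology, and real ontology-based parsing would also use the concept hierarchy and POS information.

```python
# Detect book aspects in a review sentence by matching tokens against
# ontology concept labels enriched with synonyms (labels are hypothetical).

concept_labels = {
    "plot":      {"plot", "storyline", "narrative"},
    "character": {"character", "characters", "protagonist"},
    "prose":     {"prose", "writing", "style"},
}

def detect_aspects(sentence):
    """Return the ontology concepts whose labels occur in the sentence."""
    tokens = {t.strip(".,!?").lower() for t in sentence.split()}
    return sorted(c for c, labels in concept_labels.items() if labels & tokens)

print(detect_aspects("The storyline dragged, but the protagonist felt real."))
# → ['character', 'plot']
```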
1.6 Objectives

The main objective of this research is to develop a methodology to effectively perform opinion mining on book reviews using ontology. This objective can be categorised into the following sub-objectives.
● Build an ontology model for books and book features.
● Discover aspects in unstructured review text by ontology-based parsing.
● Calculate the polarities of individual aspects within the structured review text.
● Calculate the global polarity to determine the overall value of the opinion.

1.7 Contributions

The main contributions of this thesis are as follows.

● This thesis presents a methodology to perform opinion mining on book reviews by means of ontology. As described in chapter 3, this methodology, along with the ontology model, processes book reviews and outputs sentiment analysis and opinion mining scores for those review texts.
● This thesis presents new ontology models for both books and features (aspects). The concepts of these ontology models were defined after thoroughly analysing the contents of the primary data set of book reviews. Afterwards, the ontology models were annotated with synonyms of concepts.
● This thesis presents some of the important lexical rules that are used to perform lexical analysis on top of pre-processed review text. As described in chapter 3, these rules use the lexical relationships between phrases in sentences to determine the concept nouns/aspects and the corresponding adjective/gerund that modifies each particular aspect.
● This thesis presents the test results for the methodology presented in chapter 3. As described in chapter 4, two data sets, namely primary and secondary, were prepared to run different types of tests such as aspect detection, polarity calculation and aspect sentiment detection. The results obtained from all these tests are encouraging.

1.8 Outline of thesis

This section provides an overview of each chapter's content.
1.8.1 Chapter 02 - Literature Survey

In order to derive a methodology for this research topic, all the related areas of this domain have been analysed. Chapter 02 presents those findings. We initially cover the areas of sentiment analysis and opinion mining. Afterwards, we analyse the area of ontology, the use of ontology in opinion mining, and how to develop an efficient ontology domain model for books. Finally, we focus on unstructured data processing and how unstructured review texts can be converted into a structured, machine-processable format. We sum up this chapter with a discussion of the importance of this research topic.

1.8.2 Chapter 03 - Methodology

This chapter comprehensively explains the methodology that we adopted to achieve our aim and objectives. At a high level, this research builds ontology models for both books and aspects and, using those models, performs aspect-level opinion mining on unstructured book review texts.

1.8.3 Chapter 04 - Evaluation

This chapter provides the evaluation results of our methodology for the opinion mining of book reviews using ontology. We provide a detailed explanation of the results obtained for aspect identification and sentiment analysis.

1.8.4 Chapter 05 - Conclusion and Future Work

This chapter concludes the thesis with future work.
Chapter 2
Literature Survey

This chapter describes the literature survey carried out to derive the methodology for solving the problem addressed by this research, giving a comprehensive overview of the background research based on existing web resources and research papers. It first focuses on opinion mining and sentiment classification, then looks into ontology principles, ontology development and the integration of ontology with opinion mining, and finally gives a detailed explanation of polarity identification and sentiment analysis.

2.1 Opinion Mining

Opinion mining is a topic that has been discussed extensively in the research literature. It is a mining strategy used to detect patterns among opinions. The entire opinion mining concept revolves around an entity: a concrete or abstract object that can be represented as a hierarchy of components, sub-components, and so on. Components are represented as nodes, each with a set of attributes, and an opinion can be expressed about any node or any attribute of a node [23]. An opinion can therefore be described using a quintuple [11][13][15][16]:

Opinion = (e_j, a_jk, so_ijkl, h_i, t_l)

e_j - Target entity
a_jk - kth aspect of entity e_j
h_i - Opinion holder
t_l - Time when the opinion is expressed
so_ijkl - Sentiment orientation of opinion holder h_i on aspect a_jk of entity e_j at time t_l

The five components of the quintuple are essential and must correspond to one another; without any of them, the opinion may be of limited use. The goal of opinion mining is therefore to discover all the quintuples in the review documents. In simpler terms, the unstructured text in a review document is converted into structured quintuples [14][26]. The converted, structured quintuples are then processed at various levels of sentiment analysis [26]:

● Document level sentiment analysis
● Sentence level sentiment analysis
● Aspect level sentiment analysis

2.1.1 Document level sentiment analysis

This is the simplest form of sentiment analysis, done under the assumption that there is only one main subject described in the document. Many approaches use a simple bag-of-words or bag-of-phrases model and may utilize TF-IDF (Term Frequency - Inverse Document Frequency) and POS (Part of Speech) information of terms. Turney [42] introduced a method to classify reviews using the average semantic orientation of the adjectival or adverbial phrases in the review. This work also introduced a method to measure the semantic orientation of a phrase using the Pointwise Mutual Information (PMI) [43] between words, alleviating the need for a predefined sentiment lexicon. PMI is a measure of correlation (co-occurrence) between two words. For a given phrase, its PMI with the words "excellent" and "poor" is used to determine its semantic orientation. The
probabilities are calculated using an information retrieval (IR) method, by counting the number of hits for phrase queries using a NEAR operator with the words "poor" and "excellent".

Document level sentiment analysis has the following levels of analysis, as shown in Table 2.1.

Opinion Mining                   Description
Subjectivity Classification      Determines whether or not a given document expresses an opinion
Sentiment Classification         Determines whether a sentiment polarity is positive or negative
Opinion Helpfulness Prediction   Estimates the helpfulness of a review
Opinion Spam Detection           Identifies whether or not a review is spam

Table 2.1: Document level sentiment analysis [26]

2.1.2 Sentence level sentiment analysis

Here, the analysis is done under the assumption that each sentence or phrase may contain a different but single entity being appraised. A distinction is made between the subjectivity and objectivity of opinions; handling objective statements has not been researched much. Most approaches [38][39][42][43] classify sentences according to subjectivity and then classify subjective sentences or clauses as positive or negative. Liu et al. [26] argue that no single technique fits all cases, and advocate dealing with specific types of sentences differently by exploiting their unique characteristics. Sentence level sentiment analysis performs opinion summarization by extracting key sentences based on either concept or aspect.
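Turney's PMI-based semantic orientation (Section 2.1.1) boils down to a log-ratio of such hit counts. The sketch below uses invented counts, and the small smoothing constant is an assumption added to avoid division by zero for rare phrases:

```python
import math

def so_pmi(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor):
    """Semantic orientation of a phrase from hit counts, in Turney's style.

    A positive result means the phrase co-occurs more strongly with
    "excellent" than with "poor"; negative means the opposite.
    """
    eps = 0.01  # smoothing to avoid division by zero (an assumption)
    return math.log2(
        ((hits_near_excellent + eps) * hits_poor) /
        ((hits_near_poor + eps) * hits_excellent)
    )

# Invented hit counts for a phrase such as "gripping plot":
print(so_pmi(hits_near_excellent=120, hits_near_poor=15,
             hits_excellent=5000, hits_poor=4000))  # > 0: positive orientation
```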
2.1.3 Aspect level sentiment analysis

Sentiments are analyzed at the finest grain, where the author appraises various aspects (attributes or features) of an entity. Various approaches are followed to identify aspects of entities, such as frequent item sets [20], supervised learning (e.g. Conditional Random Fields) [13][14] and co-occurrence with sentiment expressions [15]. In the sentiment analysis phase, the sentiments are linked to the referred aspects in addition to finding their orientation. The common approaches for sentiment analysis at this level are rule-based approaches [20], supervised learning and clustering. Agarwal [20] uses a rule-based approach in which a window of 5 words around the target aspect is analyzed. Mukherjee and Bhattacharyya [17] use a clustering-based method in which the sentences are transformed into a parse tree and the probable aspects are chosen as cluster heads. The sentiment words are clustered according to the shortest distance to the cluster heads in the parse tree, and the cluster members are then treated as the sentiment expressions related to the cluster-head aspects.

2.1.4 Domains and Sentiment Analysis

Myriad opinion mining and sentiment analysis techniques have been implemented and tested on various domains. Predominantly, however, most sentiment analysis techniques are tested on reviews of movies and consumer electronics products [1][28][49][29][3][4][6]. In addition, domains like restaurant reviews [52][54], hotel reviews [53] and transportation reviews [45] have been researched recently. With opinion mining and sentiment analysis moving to finer-granular levels, more and more domains are being researched, but we still could not find any research on sentiment analysis and opinion mining for the domain of book reviews.

2.1.5 Sentiment Lexicon

Sentiment lexicons are crucial resources for representing linguistic knowledge for opinion mining.
These lexicons contain entries of words with sentiment orientation. The following public domain sentiment lexicons are available.
● SentiWordNet 3.0 [22]
● Sentiment Lexicon [18]
● Emotion lexicon [20]
● Manual or Automatic Acquisition [21]

2.1.5.1 SentiWordNet 3.0

SentiWordNet is based on a quantitative analysis of the glosses associated with synsets. Once these synsets go through semi-supervised classification, a vectorial term representation is produced for synset classification. This vectorial representation is used to derive three scores, produced by the combination of eight ternary classifiers, each with a similar accuracy level but different classification behaviour [22].

As described above, SentiWordNet is a lexical resource in which each synset of WordNet is associated with three numerical values: positive, negative and objective (neutral). The underlying assumption in switching from terms to synsets is that different senses of the same term may have different opinion-related properties. Each score ranges from 0.0 to 1.0, and the three scores of a synset sum to 1.0. This graded evaluation of the opinion-related properties of a term helps avoid missing its subtly subjective character [22].

2.1.5.2 Sentiment lexicon

This lexicon contains a probability distribution over the positive, negative and objective senses of every synset word in the dictionary [21]. For example, the Sentiment Lexicon represents the word "interesting" with different polarity values under different senses.
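A small sketch of how SentiWordNet-style scores behave. The sense scores below are illustrative stand-ins, not actual SentiWordNet entries, and averaging positive-minus-negative across senses is just one simple way to collapse a word's senses into a single polarity:

```python
# SentiWordNet-style scores: each synset of a word carries (positive,
# negative, objective) values summing to 1.0.  The senses and numbers
# below are illustrative, not actual SentiWordNet entries.

senses_of_exciting = [
    # (pos, neg, obj) per sense of the word "exciting"
    (0.625, 0.0, 0.375),
    (0.5,   0.0, 0.5),
]

# Sanity check: the three scores of every synset sum to 1.0.
assert all(abs(sum(s) - 1.0) < 1e-9 for s in senses_of_exciting)

def word_polarity(senses):
    """Average positive-minus-negative score across a word's senses."""
    return sum(pos - neg for pos, neg, _obj in senses) / len(senses)

print(word_polarity(senses_of_exciting))
```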
2.1.5.3 Emotion lexicon

This lexicon records whether a word is associated with any emotions and the orientation of the emotion, for example negative or positive [23].

2.1.5.4 Manual or Automatic Acquisition

A general-purpose or domain-specific lexicon can be formed through manual coding, a dictionary-based approach [21][22][23], a corpus-based approach [24] or a hybrid approach. Both dictionary- and corpus-based approaches use a known seed set to expand the vocabulary. Corpus-based approaches are widely used in building domain-specific sentiment lexicons [24]. An approach called double propagation, which simultaneously acquires a domain-specific sentiment lexicon and a set of aspects, was introduced by Qiu et al. [25]. It uses known aspects to find sentiment expressions and vice versa.

2.1.6 Tools for Sentiment Analysis

A number of text analysing tools provide sentiment analysis facilities. Most of them use a machine learning based approach.

2.1.6.1 Sentiment Analysis with Stanford Library

The Stanford sentiment analysis library is a free, open source analytics module for sentiment analysis. By default it has been trained for the movie review domain, but it can be trained for any domain. It uses a deep learning model that builds up a representation of the whole sentence based on sentence structure. This model is called the "Sentiment Treebank".
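As a rough illustration of the frequency-counting, machine-learning style these tools employ, here is a toy Naive Bayes-like classifier with add-one smoothing. The training reviews are invented, and this is far simpler than what the tools above actually do:

```python
from collections import Counter
import math

# Toy word-frequency sentiment model: count word frequencies per class,
# then score new text by smoothed log-probabilities.  Training data is
# invented for illustration.

train = {
    "pos": ["a gripping and moving story", "excellent prose and a moving plot"],
    "neg": ["dull plot and poor pacing", "poor characters and a dull ending"],
}

counts = {c: Counter(w for doc in docs for w in doc.split())
          for c, docs in train.items()}
vocab = set().union(*counts.values())

def score(text, c):
    """Log-likelihood of the text under class c, with add-one smoothing."""
    total = sum(counts[c].values()) + len(vocab)
    return sum(math.log((counts[c][w] + 1) / total) for w in text.split())

def classify(text):
    """Pick the class whose word frequencies best explain the text."""
    return max(train, key=lambda c: score(text, c))

print(classify("a moving plot with excellent prose"))  # → pos
print(classify("dull characters and poor prose"))      # → neg
```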
2.1.6.2 Sentiment Analysis with Rapid Miner

Rapid Miner is a free, open source analytics tool and an excellent prototyping platform due to its flexibility and robustness. It has a comprehensive set of algorithms that allows you to quickly swap out and try different models. Rapid Miner has R and Groovy plugins, and, being Java based, it can run on any platform. For training, two sets of reviews should be given: one containing positive reviews and another containing negative reviews. From the training set, it counts the frequencies of the words present; based on these frequencies, Rapid Miner analyses new reviews and gives the probability of a review being positive or negative. This is a relatively simple approach to sentiment analysis.

2.1.6.3 Sentiment Analysis with LingPipe

One way to do sentiment analysis using LingPipe is to use its language classification framework for two classification tasks: separating subjective from objective sentences, and separating positive from negative product reviews. A hierarchical classifier can then be built by composing these models; specifically, the subjectivity classifier extracts subjective sentences from reviews, which are then used for polarity classification. Hierarchical models are quite common in the classification, general statistics and machine learning literatures.

2.2 Ontology

Ontology is a formal explicit description of concepts in a domain of discourse. It defines a common vocabulary for researchers who need to share information in a domain, and includes machine-interpretable definitions of basic concepts in the domain and relations among them [1][7][23][24][25]. Ontology uses components like classes, individuals, attributes and properties to represent domain-specific or general computational information.
In recent years, ontology has become a de facto standard for developing and maintaining a domain model. Using ontology to guide information extraction from a domain corpus and to present the results has therefore become a general practice [27]. In addition, while extracting information from a domain corpus, the ontology itself can be further enhanced and populated with new instances.

2.2.1 Advantages of using an ontology

With the emergence of the semantic web, the use of ontologies has gained a lot of momentum and interest. Using an ontology both to represent domain knowledge and to guide the information extraction process has many advantages; the importance of ontology has been discussed by Menzies [30].

2.2.1.1 Sharing a common understanding of the structure of information among people

Ontology models are an expressive and structured knowledge base that allows relationships to be defined within and across domains. Ontologies can be easily built using templates, which allows multiple applications to share and use the aggregated information of an ontology for their own purposes.

2.2.1.2 Enabling reuse of domain knowledge

Reusing ontologies across multiple applications enhances interoperability among them and saves a lot of time in defining ontology models. Standard ontology models, such as the movie model [33], have been reused as-is, or slightly modified, in different research areas. Using parts of these ontologies in new ontologies is acceptable.

2.2.1.3 Separating the domain knowledge from the operational knowledge

A common conceptualization allows various applications to share domain knowledge, so the applications can focus only on operational knowledge. Also, different ontologies representing the same concepts can be mapped to a common terminology between different applications.
2.2.1.4 Facilitating organization of knowledge modeling

Ontologies allow knowledge to be defined through hierarchical is-a relationships, and also support part-whole and has-a relationships. In addition, they give the flexibility of defining ontology constructs as disjoints or unions. This ability helps a great deal in structuring an organizational knowledge model.

2.2.1.5 Easing the communication across entities and machines

Since ontologies are both machine and human understandable, they can be used for communication, to ensure interoperability between computer programs (as well as humans), and to share a common understanding of the structure of information. An ontology can be used to disambiguate and uniquely identify the meaning of domain concepts. It also facilitates knowledge transfer, excluding unwanted interpretations through the use of formal semantics.

2.2.1.6 Facilitating computational inference

Through the axioms it contains, the ontology structure allows programs to apply their own inferencing mechanisms when searching and querying the domain model and its instances. In addition, ontology enables programming-language-independent serialization, along with a mechanism for detecting model inconsistencies.

2.2.1.7 Knowledge querying and browsing

An ontology provides a set of metadata that enables querying and browsing within OWL ontologies. These metadata can be queried using SPARQL.
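To make the querying idea concrete, here is a toy pattern-matching query over ontology-style triples. The book facts are hypothetical, and a real system would store OWL/RDF and query it with SPARQL rather than Python lists:

```python
# Minimal triple-store sketch: ontology facts as (subject, predicate, object)
# triples, queried with simple pattern matching.  The triples are invented.

triples = [
    ("Book", "subClassOf", "Publication"),
    ("WaterForElephants", "type", "Book"),
    ("WaterForElephants", "hasAspect", "plot"),
    ("WaterForElephants", "hasAspect", "characters"),
]

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which aspects does the individual WaterForElephants have?
print([o for _, _, o in query(s="WaterForElephants", p="hasAspect")])
# → ['plot', 'characters']
```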
2.2.2 Semantic Web and OWL

The Semantic Web is a standard to promote common data formats on the World Wide Web. The main intention of this standard is to transform the entire World Wide Web into a structured, well understood and easily processable web of data [31]. OWL is a semantic markup language, developed under the semantic web standard, for publishing and sharing ontologies in an easily machine-processable format. OWL uses the ability of XML to define custom tag schemes and the flexible approach of RDF to represent data. RDF provides semantics for the syntax and defines hierarchies and generalization; OWL adds vocabulary for describing properties and classes [32].

2.2.2.1 OWL constructs

Like ontology in general, OWL has three primary constructs:

● Classes: similar to concepts in ontology; a class represents a set of objects in the domain.
● Individuals: similar to instances in ontology; individuals are members of the classes of the domain.
● Properties: binary relationships between individuals.

Figure 2.1 illustrates this.
Figure 2.1: Classes, Individuals and Properties

2.2.3 WordNet

WordNet is a large lexical database of English, encompassing nouns, verbs, adjectives and adverbs in the English lexicon [35][36]. WordNet offers two kinds of relationships: lexical and conceptual. Lexical relations help us identify synonymy and antonymy; this part of the lexical relations is used to enrich the concepts with synonyms via concept labels in the OWL model. Conceptual relations, on the other hand, can be used to discover hypernym-hyponym, part-of and whole relationships.
2.2.4 Ontology Development

Based on domain dependence, ontologies can be divided into four types [1][14][24][25]:

● Generic Ontology
● Domain Ontology
● Task Ontology
● Application Ontology

Domain ontology describes the concepts in a specific area, the attributes of those concepts, the relationships between concepts and the constraints among relationships. The target of constructing a domain ontology is to define the common terminologies of the area and to give the definitions of the relationships among those terminologies. Ontology construction can be performed by the following methodologies [29]:

● Top-Down Approach
● Bottom-Up Approach
● Formal Concept Analysis
● Hybrid Model

2.3.1.1 Top-Down Approach

The top-down approach initially starts with high-level ontological concepts and gradually expands into a fully-fledged, complete ontological structure [28]. In this approach, the important metadata of the domain is identified first, at the highest level; for example, for the movie domain, title, cast, crew, production and miscellaneous concepts are identified as the initial ontological concepts. In the next phase, based on the initial ontological structure and a content analysis of the domain, the child concepts of the initial concepts are identified; for example, "protagonist" is identified as a child of the parent concept cast.
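The top-down expansion just described can be sketched as incrementally attaching child concepts to an initial high-level skeleton. The nested-dict representation and the helper function are illustrative assumptions; only the concept names come from the movie-domain example above:

```python
# Top-down ontology construction sketch: start from high-level concepts,
# then attach child concepts discovered through content analysis.

ontology = {"movie": {"title": {}, "cast": {}, "crew": {}, "production": {}}}

def add_child(tree, parent, child):
    """Attach `child` under the first node named `parent` (depth-first)."""
    for name, subtree in tree.items():
        if name == parent:
            subtree[child] = {}
            return True
        if add_child(subtree, parent, child):
            return True
    return False

# Content analysis suggests "protagonist" belongs under "cast".
add_child(ontology, "cast", "protagonist")
print(ontology["movie"]["cast"])  # → {'protagonist': {}}
```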
2.3.1.2 Bottom-Up Approach

The bottom-up approach is the inverse of the top-down approach: the entire domain contents are thoroughly analysed first, and fine-grained aspects are detected. These fine-grained aspects become the concepts. In the next phase, the concepts are grouped together under parent concepts to create an ontological hierarchy. For example, concepts like protagonist, supporting cast and comedian become subclasses of the cast concept. Likewise, all appropriate concepts are brought under suitable parent concepts to create an ontological hierarchy [28][51].

2.3.1.3 Formal Concept Analysis (FCA)

This approach has been described by Shein [38]. Unlike the top-down and bottom-up approaches, FCA considers two elements: formal objects and formal attributes. In this method, objects are chosen as formal concepts and features are chosen as formal attributes. This approach forms semantic structures that are formal abstractions of concepts of human thought, and makes it possible to identify conceptual structures among data sets. The main characteristics of FCA are:

● Concepts are described by properties
● Properties determine the hierarchy of the concepts
● When the properties of different concepts are the same, the concepts are the same

Contexts
• 33. 20 FCA defines a context as a triple (O, A, R), where:
O = a finite set of objects
A = a finite set of attributes
R = a binary relation between O and A
FCA uses the following steps to develop an ontology:
● Start with an empty set of concepts and properties
● Add concepts and properties
● Modify the ontology by adding/removing concepts to reflect changes
● Repeat these steps until the ontology construction is complete
2.3.1.4 Hybrid Model
This approach has been described by Zhou and Chaovalit [28]. It combines the bottom-up and top-down approaches to create a movie ontology. The top-down approach starts with the high-level ontological concepts, which then gradually expand into a full-fledged ontology. The bottom-up approach starts with textual documents and extracts ontological knowledge from them. A hybrid approach, which has become popular in recent years, simultaneously derives knowledge from the top-level ontology and extracts low-level ontologies from documents, and then creates mappings between the different levels of ontologies.
2.3.3 Ontology and Sentiment Analysis
Sentiment analysis techniques found it hard to perform polarity mining based on aspects and sub-aspects, since each sentence may describe different kinds of aspects [28]. In other words, those techniques struggled to conceptualize domain-specific information in a structured way that represents it at a finer level of granularity.
• 34. 21 Zhou and Chaovalit [28] identified ontology as a well-fitting solution for conceptualizing domain-specific information in a structured way for sentiment analysis. They developed a movie ontology model and evaluated it against multiple opinion mining techniques at various granularity levels; notably, they compared it with support vector machines, the Naive Bayes classifier and decision trees at the document, sentence and phrase levels. With the introduction of ontology as the domain-specific knowledge model for opinion mining, many other researchers carried out ontology-based sentiment analysis in the product and movie review domains. Most notable is the work of Zhao and Li [1], who built two ontology models to structure domain-specific information: a movie model (holding the meta information of the movie) and a feature model (holding the aspect information of the movie model). Based on these models, they introduced a technique to compute the polarities of opinions by traversing the nodes along the hierarchy relationship. Later, Isidro and Rafael [48] introduced a new approach to ontology-based sentiment analysis by applying vector analysis for opinion mining calculations. Freitas and Vieira [29] followed the same approach as Isidro and Rafael [48], but performed opinion mining on Portuguese movie reviews. In the product review domain, Wang, Nie and Liu [15] introduced a new hierarchical fuzzy domain sentiment ontology that defines a space of product features and corresponding opinions. This enables product classification, based on which scores are assigned to common features during sentiment analysis.
2.3.4 Related Existing Ontology
2.3.4.1 Movie Ontology
The Movie ontology describes the types and properties of the movie domain for the semantic web.
Movie ontology is made up of 78 concepts, 30 object properties and 4 data properties [33]. Movie ontology has super classes like Genre, Presentation, Territory, Award and Certification and multiple sub classes. These generic classes represent actual movies of a respective type and a wider set of movies related to a certain category.
• 35. 22 As this ontology is designed for various applications on e-commerce websites, it can represent information beyond the movie domain itself. The ontology supports three types of movie properties:
● Quantitative properties – used for movie features with numeric values
● Qualitative properties – used for movie features with predefined value instances
● Data type properties – used only for features with the data types string, date, time, datetime or boolean
2.3.4.2 Hotel Ontology
The Hotel ontology describes the concepts, types and properties of the accommodation domain. Its current model has 282 concepts, 8 object properties and 31 data properties [50]. It has super classes like HotelCategory, Facility, GuestType, HotelChain, Room, Service, Staff, Location, Meal and Price, and multiple sub classes. Since the hotel domain is comprehensive, complex and subject to frequent change, this ontology has been widely used by hospitality applications; it allows the domain model to change without affecting the application.
2.4 Data Pre-processing
Data pre-processing, an extensive topic researched over decades, transforms unstructured text into a structured form. It includes lexical analysis, Named Entity Recognition (NER) and co-reference resolution.
2.4.1 Lexical Analysis
• 36. 23 Lexical analysis is the phase of identifying and splitting sentences, performing Part of Speech (POS) tagging on them, identifying dependencies, and extracting noun and verb chunks [39]. Lexical analysis comprises the following steps:
● Sentence splitting and tokenization
● Part of Speech tagging
● Phrase chunking
● Dependency analysis
The sentence analyser splits sentences and the tokenizer breaks each sentence into tokens. Part of Speech (POS) taggers analyse sentences and assign each word its grammatical category, such as noun, verb, adjective, adverb, article, conjunction or pronoun. POS taggers can analyse the grammar of sentences at a very detailed level, giving information about the tense of verbs and active/passive voice. A phrase chunker segments a sentence into its sub-constituents, such as noun phrases, verb phrases and prepositional phrases. Typically, a context-free grammar is used to identify these constituents. This comes in handy in Named Entity Extraction, as named entities are typically noun phrases. The dependency analyser identifies words in a sentence that form arguments of other words in it.
2.4.2 Named Entity Recognition (NER)
Named Entity Recognition identifies named entities such as ‘Person’, ‘Location’ and ‘Organization’ using domain knowledge of the information to be extracted, without requiring specific information about each entity. For example, in the sentence “Cold weather has been reported in Colombo”, NER identifies Colombo as a place. In addition, Named Entity Extraction also includes co-reference resolution, that is, detecting expressions referring to the same entity within a sentence or anywhere in the document. For example,
• 37. 24 in the sentence “The story is unique, but it is a bit slow.”, “it” refers to the story, and co-reference resolution is able to identify this [42].
2.5 Discussion
With the increase in online reviews across various domains, opinion mining has emerged as a highly researched area, with a particular surge in domains such as movie reviews and consumer product reviews. Initially, reviews were processed using hard-coded, tagged bag-of-aspects classifiers and similar means. However, with the emergence of ontology as a structured, well-defined data model, researchers started using ontology as the domain model to hold aspect-related information while performing opinion mining. Since ontology is a powerful, structured domain model, researchers focused on how to leverage it in processing unstructured review texts. As a result, several effective techniques [49][29][1][28] were developed to perform sentiment analysis on a review corpus. In addition, with the introduction of advanced lexical analysis functionalities, sentiment analysis became considerably more accurate, with higher precision and recall. Although a considerable amount of sentiment analysis work has been done in the domains of movie reviews and consumer product reviews, the book review domain remains unexplored. With a large number of books being released throughout the world on a daily basis, opinion mining on book reviews can definitely add value for readers.
• 38. 25 Chapter 3 Methodology
The entire process involves the following tasks to perform opinion mining on book reviews [1]:
● Data Pre-processing
● Feature Identification
● Polarity Identification
● Sentiment Analysis
The complete workflow is depicted in Figure 3.1.
• 39. 26 Figure 3-1-Overview of the implementation Architecture
As detailed in previous sections, the main objective of this project is to utilize the ontology structure for the conceptualization of domain-specific information, enabling us to solve the subjective feature identification problem in the book review domain. Since we have already obtained a set of book reviews from Amazon, we do not have to focus much on obtaining reviews from external sites.
3.1 Book Ontology
As already mentioned, based on domain dependence, ontology can be divided into four types [1][14][24][25]: Generic Ontology, Domain Ontology, Task Ontology and Application Ontology. This research focuses on a specific domain, the book review domain; therefore, we constructed a domain-dependent ontology.
• 40. 27 3.1.1 Data Selection and Preparation
3.1.1.1 Book Reviews
Book reviews are the main source of information processed throughout this research. We obtained book reviews from Amazon as large text files, where each review is a separate JSON message. The fields of a JSON message, followed by a sample review message, are given below.
● reviewerID - ID of the reader who reviewed it
● asin - ASIN number of the book
● reviewerName - Name of the reviewer
● helpful - [Up votes, Down votes] up votes and down votes received by this review
● reviewText - Review text
● overall - Rating given by the reviewer
● summary - Summary of the review
● unixReviewTime - Review time as a UNIX value
● reviewTime - Review date, month and year
{"reviewerID": "AXZ6WA3GA2WRH", "asin": "0002007770", "reviewerName": "Amazon Customer", "helpful": [1, 1], "reviewText": "I had heard a lot of positive comments on this book for a while and thought I would see for myself what all the praise was about. Well this book deserves 5 stars and more. It is really a book that will stay with you for a while. I know it will be among my all time favorites, as well it should. The writer has done an excellent job fleshing out the people and you almost feel as if you know them personally or have dealt with people like time in real life. The depression was deep in the US when the story is introduced. Life can hand out a lot of hard knocks, even to animals. I really never thought about that aspect of life before. I am very happy I had the opportunity to read this book. I delayed reading it many times because frankly the title put me off. Then I remembered to never judge a book by it's
• 41. 28 title.", "overall": 5.0, "summary": "Wonderful story", "unixReviewTime": 1361836800, "reviewTime": "02 26, 2013"}
Out of all the given fields, this research considers only reviewText, since this field contains the review text itself while all other fields hold meta information about the review. We used the book reviews obtained from Amazon [47]. From the entire review corpus, we carefully selected 600 reviews, divided into a primary data set and a secondary data set:
● Primary set of book reviews
1. Reviews that are more positive - 75
2. Reviews that are more negative - 75
● Secondary set of book reviews
1. Reviews that are more positive - 225
2. Reviews that are more negative - 225
The details of these data sets are further explained in section 4.1. Each review text may contain positive, negative or neutral opinions on various aspects. For example, a sentence such as “Good character descriptions and scenes are beautifully depicted” clearly expresses a positive opinion on the aspects character and scenes of the book. On the other hand, a sentence with negative polarity expresses a negative opinion on an aspect, as in “The storyline of the novel is quite predictable”, where the aspect story/storyline is considered predictable. A neutral polarity arises when the reviewer merely makes a statement about an aspect, as in “This author previously has written good books and I love previous books as well”; this sentence praises the aspect author but says nothing about the author’s current work, so it is considered a neutral review. Likewise, if the reviewer did not describe the aspects using adjectives/gerunds, the review is also counted as neutral.
• 42. 29 However, when categorizing the entire document, we consider the overall opinions on the aspects: if at least 50% of the aspects carry a positive sentiment, we consider the review positive, and vice versa for a negative review. These 600 reviews were used in the experiments of this research. We decided on 600 reviews based on previous research evaluating ontology-based sentiment analysis on movie reviews: Zhao and Li [1] used 120 reviews (60 positive and 60 negative) to calculate polarity, Zhou and Chaovalit [28] used 180 IMDB reviews for their research, and Isidro and Rafael [49] used 100 movie reviews to validate their approach. Within a huge dataset, choosing the negative and positive reviews randomly is sufficient to evaluate this methodology. Each classification of the book reviews contains reviews of different kinds, for example longer reviews, shorter reviews, medium-sized reviews, reviews with a higher number of up votes and reviews with a lower number of up votes. The reviews of the primary dataset are carefully analysed, reviewed and used for ontology development.
3.1.2 Ontology Development
Since the ontology here serves as a generic domain model for opinion mining, the ontology we build for books does not need the full complexity of an elaborate OWL structure. The aim is to find the opinion on a feature of the book, or on some attributes of a feature. Therefore, the book ontology concepts are divided into two parts: book and feature. To build the book domain ontology, we adopted the bottom-up approach in an iterative way: it starts with textual documents and extracts ontological concepts from the raw text. This process is repeated iteratively until no new concepts are found and the domain ontology is fully fledged.
• 43. 30 The goal of ontology construction is to extract concepts from book reviews starting from a seed set. The following steps are followed for this task [1]:
● Select relevant sentences containing initial/basic conceptions
● Extract conceptions from those sentences
3.1.2.1 Select relevant sentences with initial/basic conceptions
Before commencing this step, we manually labelled some concept seeds by going through the book review documents. Once this manual labelling phase was done, we automated the concept discovery phase in later stages. After completing the initial manual labelling, we identified the concept seeds in the chosen corpus of book reviews. We followed two procedures to extract and identify further concept seeds:
1. Identifying the sentences containing conjunction words.
2. Detecting the sentences containing at least one concept seed.
As explained above, we first checked whether a sentence contains a conjunction word. If it did, we checked whether that particular sentence also contained a concept seed. For example, consider the sentence “This book has a good plot and storyline”. Since storyline is a known concept, and a conjunction joins nouns with the same characteristics, applying this conjunction rule lets us easily identify that the noun “plot” is also a concept in the feature domain model.
3.1.2.2 Extraction of concepts from sentences
• 44. 31 Once the sentences with conjunctions are identified, before adding the new concepts to the ontology, we first checked whether the new concept had already been added under a different synonym. If the new concept did not exist in the book domain ontology, we added it to the feature domain model and labelled the new concept with its synonyms as well. The above process is repeated until no new concepts/aspects are discovered. As explained above, the ontology for the book model was initially generated after analysing the review corpus of the primary data set. Before processing the secondary data set, we traversed it as well, looking for new concepts/aspects. For example, in the sentence “The writing and research of the book is first rate”, the concepts/aspects writing and research are connected by the conjunction word “and”. Suppose the current ontology model for book aspects is not aware of research as a concept/aspect and only knows writing. Since in a sentence a conjunction joins entities with the same characteristics, this step lets us add the concept research to the ontology model. This entire process is depicted in Figure 3.2 [2].
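The conjunction-based discovery step can be sketched as follows. This is an illustrative simplification over already POS-tagged tokens, not the thesis implementation; the class name, method names and the word_TAG token format are our own assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch of conjunction-based concept discovery: in a pattern
// "<noun> and <noun>", if one noun is a known concept seed, the other noun
// is proposed as a new concept candidate.
public class ConceptDiscovery {

    // Returns nouns conjoined (via a CC tag) with a known concept seed.
    public static List<String> discover(String[] taggedTokens, Set<String> seeds) {
        List<String> candidates = new ArrayList<>();
        for (int i = 1; i < taggedTokens.length - 1; i++) {
            String[] mid = taggedTokens[i].split("_");
            if (!mid[1].equals("CC")) continue;            // need a conjunction
            String[] left = taggedTokens[i - 1].split("_");
            String[] right = taggedTokens[i + 1].split("_");
            if (!left[1].startsWith("NN") || !right[1].startsWith("NN")) continue;
            String l = left[0].toLowerCase(Locale.ROOT);
            String r = right[0].toLowerCase(Locale.ROOT);
            if (seeds.contains(l) && !seeds.contains(r)) candidates.add(r);
            if (seeds.contains(r) && !seeds.contains(l)) candidates.add(l);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // "This book has a good plot and storyline" -> "plot" is proposed,
        // because "storyline" is already a seed.
        String[] tokens = {"This_DT", "book_NN", "has_VBZ", "a_DT",
                "good_JJ", "plot_NN", "and_CC", "storyline_NN"};
        System.out.println(discover(tokens, Set.of("storyline")));
    }
}
```

A real implementation would additionally consult the synonym labels of existing concepts before proposing a candidate, as described above.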
• 45. 32 Figure 3-2-Ontology Development
3.1.3 Ontology Model
Since we were mainly concerned with the concepts and properties of the book domain, we did not need to build a complex ontology model. We adopted an approach rather similar to the model in Zhao and Li’s work [1]. As ontology models, we built separate individual models for the book and for the features of books.
3.1.3.2 Book Model
The ontology data model used here indicates the relationships among books. This model adopted the forest structure to maintain the relationships among books [25]. In the forest structure, every node has a parent node and child nodes. In addition, each node carries several synonyms that represent the terminologies of the area; these synonyms are labelled for each concept. For
• 46. 33 example, the concept “writer” can be interchanged with the word “author”. This feature of a single node carrying several synonyms makes the model a forest-type model. In summary, the book model shows the hierarchical relationships of concepts, starting from the root node “Book”. A fragment of the book model developed using the Protege tool is depicted in Figure 3.3.
Figure 3-3-Book Model Ontology In Protege
A fragment of the concept hierarchy of a book is depicted in Figure 3.4.
• 47. 34 Figure 3-4-Fragment of the concept hierarchy of book
3.1.3.3 Feature Model
The ontology data model used here indicates the relationships among features. The root node of the feature model is “Features”. Unlike the book model, this model does not have a multi-level hierarchy: every node has the single parent node “Features”. Sibling nodes are related to each other via OWL object properties. For example, the object property “hasWriting” has
the range class “Writing” and the domain class “Book”. Using these properties, all the concepts are connected in the feature model. The feature model developed using the Protege tool has the structure depicted in Figure 3.5.
Figure 3-5-Feature Model Ontology In Protege
A fragment of the concept hierarchy of book features is depicted in Figure 3.6.
• 49. 36 Figure 3-6-Fragment of the concept hierarchy of features
We defined 19 different properties to connect the concept classes of features. All the concept classes of Features are disjoint with each other. All the concept classes were connected with the properties via the domain and range attributes of OWL: the domain is the set of individuals to which the property is applicable, and the range is the set of individuals that are applicable as values of the property. Table 3.1 describes how the properties are connected with domains and ranges in the feature model.
• 50. 37
Property Name      Concept Domain Class   Concept Range Class
hasChapter         Book                   Chapter
hasCharacter       Book                   Character
hasClimax          Book                   Climax
hasContent         Book                   Content
hasCover           Book                   Cover
hasDescription     Book                   Description
hasDialogue        Book                   Dialogue
hasIdea            Book                   Idea
hasPace            Book                   Pace
hasPage            Book                   Page
hasPlot            Book                   Plot
hasProtagonist     Book                   Protagonist
hasRead            Book                   Read
hasRendering       Book                   Rendering
hasResearch        Book                   Research
hasSentence        Book                   Sentence
hasSetting         Book                   Setting
hasStory           Book                   Story
hasTitle           Book                   Title
hasVocabulary      Book                   Vocabulary
hasWriting         Book                   Writing
Table 3.1 OWL Properties of feature model
3.2 Pre-processing
• 51. 38 Pre-processing is the step in which unstructured review text is converted into fixed-format, unambiguous and structured text. This step involves sentence detection, tokenization, POS tagging, lemmatization, syntactic parsing, named entity recognition and co-reference resolution across sentences. The entire pre-processing step is carried out using only the Stanford NLP libraries. The entire process is shown in Figure 3-7.
Figure 3-7-Steps for Pre-processing review texts
• 52. 39
1. English Tokenizer: tokenizes words.
2. Sentence Splitter: detects the end of each sentence and splits the text there. This can be tricky with sentences like “It took only 3.5 hours for me to finish this book.”, where the splitter should not split at the token “3.5”.
3. Morphological Analyzer: also called a lemmatizer; it derives the root of each word, for example the root of “simplest” is “simple”.
4. POS Tagger: segments the sentences and identifies the verbs, nouns, adjectives, adverbs, determiners and prepositions.
5. Co-reference Pruning: finds all the references to a particular entity in the whole text. For example, after co-reference pruning, “The story is unique, but it is a bit slow.” becomes “The story is unique, but the story is a bit slow.”
Let us consider the review text “Book is an exceptional read. The characters are well-developed and authentic, the story entertaining and enlightening, the writing and research first rate. All in all, a tremendous read. I can't imagine anyone who would not enjoy this book, I highly recommend.” Once this review went through the pre-processing stage, we got the following output:
Book_NN is_VB an_DT exceptional_JJ read_NN ._. The_DT characters_NN are_VB well-developed_JJ and_CC authentic_JJ ,_, the_DT story_NN entertaining_JJ and_CC enlightening_JJ ,_, the_DT writing_NN and_CC research_CC first_JJ rate_NN ._. All_DT in_IN all_DT ,_, a_DT tremendous_JJ read_NN ._. I_PRP ca_VB n't_RB imagine_VB anyone_NN who_WP would_MD not_RB enjoy_VB this_DT book_NN ,_, I_PRP highly_RB recommend_VBP ._.
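Tagged output in this word_TAG form can be parsed back into (word, tag) pairs for the later steps. The following helper is a minimal illustrative sketch; the class and record names are our own, not part of the thesis implementation or the Stanford libraries.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper for the tagged output shown above: parses a
// "word_TAG" stream into (word, tag) pairs for the later feature
// extraction steps.
public class TaggedTokens {

    public record Token(String word, String tag) { }

    // Splits on whitespace, then on the FINAL underscore of each item,
    // so hyphenated words such as "well-developed_JJ" survive intact.
    // Punctuation items like "._." become (".", ".") and are kept, since
    // the comma rules later in this chapter rely on them.
    public static List<Token> parse(String tagged) {
        List<Token> tokens = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int cut = item.lastIndexOf('_');
            if (cut <= 0) continue;                 // no tag attached; skip
            tokens.add(new Token(item.substring(0, cut), item.substring(cut + 1)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> tokens = parse("Book_NN is_VB an_DT exceptional_JJ read_NN ._.");
        for (Token t : tokens) {
            System.out.println(t.word() + " / " + t.tag());
        }
    }
}
```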
• 53. 40 The above output is further used for feature identification, polarity measurement and sentiment analysis.
3.3 Feature extraction with Ontology
The pre-processed texts are integrated with the ontology to improve the accuracy of feature extraction. In this step, ontology terminologies are used to extract the feature-related POS-tagged words for further processing: the sentences containing ontology terminologies are identified first, and features are then easily extracted from those sentences [1]. Feature extraction has the following steps:
1. Loading the ontology feature concepts into the system.
2. Preparing the pre-processed review text for feature extraction.
3. Extracting each concept noun and the adjectives that modify that particular concept.
3.3.1 Loading ontology concepts
In order to read the OWL file, we use the OWL API [48], which can read, manipulate and serialize OWL concepts, instances and properties. We used the OWL API to read the feature model and the book model to identify the concepts to look for in the review texts. Each loaded concept model object has the following structure:
class ConceptModel {
    private String className;
    private Map<String, String> propertyMap = new HashMap<String, String>();
    private List<String> labelList = new ArrayList<String>();
}
As can be seen, each OWL concept is labelled with synonyms; for example, the Author concept has the labels author and writer. Therefore, when going through the review text, whenever the system encounters a noun, it checks every concept model’s labels to see whether the noun matches a concept.
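The label lookup just described can be sketched like this. It is a simplified, self-contained stand-in for the OWL-API-backed ConceptModel objects; the class name, method name and example labels are our own assumptions.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of synonym-label matching: a noun found in a review
// is mapped back to its ontology concept by scanning each concept's label
// list (as would be loaded from the OWL file's label annotations).
public class ConceptLookup {

    // Concept class name -> its synonym labels.
    private final Map<String, List<String>> labels;

    public ConceptLookup(Map<String, List<String>> labels) {
        this.labels = labels;
    }

    // Returns the concept whose label list contains the given noun, if any.
    public Optional<String> conceptFor(String noun) {
        String needle = noun.toLowerCase(Locale.ROOT);
        return labels.entrySet().stream()
                .filter(e -> e.getValue().contains(needle))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        ConceptLookup lookup = new ConceptLookup(Map.of(
                "Author", List.of("author", "writer"),
                "Story", List.of("story", "storyline")));
        // "writer" in a review text resolves to the Author concept.
        System.out.println(lookup.conceptFor("writer").orElse("none"));
    }
}
```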
• 54. 41 3.3.2 Prepare the pre-processed review text for feature extraction
This phase of the operation is quite interesting. The pre-processed review text is further processed using the following operations:
● First, we replaced proper names with the corresponding concepts. For example, if the review sentence is “Sara Gruen has written an excellent book”, where Sara Gruen is the name of the author, we rewrote the sentence as “Author has written an excellent book”. Likewise, we replaced the names of characters with the constant word “Character”. Since the concepts author and character exist in the ontology domain model, this part of the review will not be missed. Proper nouns are marked with the tag NNP by the Stanford POS tagger.
● Next, we removed all stop words such as “a”, “an” or “the”; these are tagged as determiners (DT) by the Stanford POS tagger. In addition, we eliminated terms like “that” and “this”, which are tagged as IN by the Stanford POS tagger.
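These two clean-up operations can be sketched as a single pass over the word_TAG token stream. This is an illustrative sketch only; the name-to-concept map, class name and token format are our own assumptions, not the thesis code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the two clean-up operations above:
// (1) replace proper nouns (NNP) that are known entity names with their
//     concept word, and
// (2) drop determiners (DT) and IN-tagged terms.
public class ReviewCleanup {

    public static List<String> clean(List<String> taggedTokens,
                                     Map<String, String> nameToConcept) {
        Set<String> dropTags = Set.of("DT", "IN");
        List<String> out = new ArrayList<>();
        for (String item : taggedTokens) {
            int cut = item.lastIndexOf('_');
            if (cut <= 0) continue;                        // no tag; skip
            String word = item.substring(0, cut);
            String tag = item.substring(cut + 1);
            if (dropTags.contains(tag)) continue;          // stop words out
            if (tag.equals("NNP") && nameToConcept.containsKey(word)) {
                out.add(nameToConcept.get(word) + "_NN");  // name -> concept
            } else {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Sara_NNP", "has_VBZ", "written_VBN",
                "an_DT", "excellent_JJ", "book_NN");
        System.out.println(clean(tokens, Map.of("Sara", "Author")));
    }
}
```

In the thesis pipeline the name-to-concept mapping would come from the NER and co-reference steps of the pre-processing stage.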
  • 55. 42 As can be seen from the above sentence, the author has a range of 4 words in the right hand side (is, able, to, write) and 2 words in the left hand side (It, amazing ). Although here, the author has inherited 2 adjectives as “able” and “amazing”, “amazing” is right next to a preposition “It”. Therefore, by having specific linguistic rules like this we can come to the conclusion that the author has the adjective able. Therefore, the tuple for the concept author is (Author, able). Likewise, we defined multiple domain specific rules to correctly identify the adjectives for each and every aspect. By applying the rules, eventually, we ended up with the following tuples. (story, interesting ), (story, entertaining ), (story, thrilling) Below, we have mentioned some important rules that were heavily used during lexical analysis operation. Rule: Defining the range of aspect nouns Although we defined a hard range of four, at some occasions, there aren’t enough words to cover the four words in a sentence due to the sentence ending. Apart from that, there are certain occasions where we exceeded the range of four on the right hand side and at some occasions we could not cover all four words in the left hand side as well. Let us consider the following review text, “The characters are well-developed and authentic, the story entertaining and enlightening, the writing and research first rate. Thank you author for a Refreshing, Awesome, touching read!” After this sentence goes through all the pre-processing where common determiner terms are removed, words are lemmatized and each word is appended with POS tags, eventually we got the following processed text;
  • 56. 43 “characters_NNS is_VBP well-develop_JJ, interesting_JJ, engaging_JJ and_CC authentic_JJ ,_, story_NN entertaining_JJ and_CC enlightening_VB ,_, writing_NN and_CC research_NN first_JJ rate_NN. Thank_VB you_PRP author_NN for_IN Refreshing_JJ, Awesome_JJ , exciting_JJ, touching_VBG read_NN !_.” Rule #1 As we consider the noun/aspect “story” in the above review text, LHS of the noun “story” has a comma separator. If a scenario like that happens in the LHS before we finished reaching the count four, only the words following the comma separator are included into the range of four. If we consider the aspect “story” and if we apply the LHS word count rule, we get the following set of words in RHS and LHS: RHS [story]= {entertaining, and, enlightening, “,”} LHS [story] = {} (Empty set since there are no words following the comma and the only word next to the comma is the aspect “story”.) Rule #2 As we consider the noun/aspect “character”, ideally we have the following set for RHS: RHS[character] = { is, well-develop, “,”, interesting} In the RHS set, we found out that there are two adjectives; namely “well-develop” and “interesting”. In that case, we have to check whether there are any prepositions/nouns/pronouns in between these two adjectives. Since we do not have any of these, we can safely assume that these adjectives belong only to the noun/aspect “character”. However, since we only considered the range of four, we only checked whether there are more adjectives describing the noun/aspect “character” in this series of phrases/sentence. Therefore, we
  • 57. 44 applied the following lexical techniques to identify the complete set of adjectives of that noun/aspect. ● If the RHS set ends with an adjective (disregarding the total number of adjectives in the set), we then checked beyond the four-range limit. In that case, we checked the 5th word from the noun/aspect. If the 5th word is a comma separator or a conjunction (like AND or OR) and if the 6th word is an adjective or gerund, that adjective/gerund definitely belongs to the concerned aspect/noun only. We repeated this step of checking for a comma separator or a conjunction and then checking for adjectives/gerunds until we encountered a non- adjective/gerund term or a full stop or if the comma and adjective pattern failed. ● If the RHS has an adjective/gerund as the 3rd word and if the 4th word is a comma separator or a conjunction, we followed the same iteration to discover the adjectives/gerunds. After following the above rules for the aspect “characters”, we ended up with the following RHS set. RHS[characters] = { is, well-develop, enlightening, “,” , interesting, engaging, and, authentic} Rule #3 Let us consider the following phrase of a pre-processed review text. “It_PRP is_VBZ exciting_JJ and_CC paints_NN several_JJ unique_JJ ,_, unforgettable_JJ and_CC vivid_JJ scenes_NN, which_DT is_VBZ MORE_RBR than_IN most_JJS books_NNS are_VBP able_JJ do_VB.” As we consider the noun/aspect “scenes”, after applying Rule#1 and Rule#2, we got an empty set for RHS: RHS[scenes] = {} (empty set) On the other hand, we used the following rules to create the LHS set. ● Initially, the first 4 words that are to the left of the aspect/concept/noun are considered to be part of the set unless there is a comma separator within those 4 words or a full stop of the previous sentence. ● However, if there is an adjective/gerund as the first word of the LHS:
• 58. 45 1. Check the second word. If it is a conjunction (AND, OR), continue to the third word.
2. If the first word is an adjective/gerund and the second word is not a conjunction, only the first word is taken as the adjective/gerund that modifies the concept noun from the LHS.
3. If the second word is a conjunction, check the third word for an adjective/gerund. If the third word is an adjective/gerund, check the fourth word. If the fourth word is a comma separator and the fifth word is an adjective/gerund, we continue iterating until this pattern no longer holds.
4. Negation words usually occur close to the adjectives/gerunds, so when negation is present it can be detected within the first 4 words, near the first adjective/gerund.
After applying the above principles, we obtained the following LHS set:
LHS[scenes] = {vivid, and, unforgettable, unique, several}
Finally, we got the following output tuples:
(scenes, vivid), (scenes, unforgettable), (scenes, unique) and (scenes, several)
Rule: Defining the range of aspect nouns
Once the ranges of the aspects are defined as LHS and RHS, the following sets of rules are applied on top of them to discover the adjectives/gerunds/verbs that most plausibly modify the aspect/noun. Most of these rules handle adjectives/gerunds only. Some of the important rules are explained in this section.
Rule #4
Rule #4 is applied on top of the output of Rule #1, Rule #2 and Rule #3. Consider the last sentence of the review text above:
• 59. 46 “Thank_VB you_PRP author_NN for_IN Refreshing_JJ, Awesome_JJ , exciting_JJ, touching_VBG read_NN !_.”
For the aspect/concept “author”, we got the following RHS:
RHS[author] = {for, Refreshing, “,”, Awesome, “,”, exciting}
However, the first word of the set, “for”, is a preposition; a preposition is a connecting word that expresses the relation between two other words. Although there are adjectives within the range of “author”, they do not belong to it. Starting from the word “exciting”, we therefore moved forward until we found a noun/pronoun and reassigned the set to that noun. In our example, the set became the LHS of the aspect/concept “read” from the above phrase:
LHS[read] = {Refreshing, “,”, Awesome, “,”, exciting, touching}
Rule #5
All the previous rules are mostly concerned with handling adjectives and gerunds. This rule handles the nouns/concepts/aspects. From the same review, consider the sentence:
“characters_NNS is_VBP well-develop_JJ, interesting_JJ, engaging_JJ and_CC authentic_JJ ,_, story_NN entertaining_JJ and_CC enlightening_VB ,_, writing_NN and_CC research_NN first_JJ rate_NN.”
For the aspect “writing”, we got the following RHS:
RHS[writing] = {and, research, first, rate}
As seen above, the first word of the RHS is a conjunction. If a conjunction is followed by a noun that is also an aspect, whatever adjectives describe the concerned concept also describe that following noun. In the above example, the adjective “first” describes both aspects, “writing” and “research”. Therefore, we created the tuples:
• 60. 47 (writing, first) and (research, first)
Rule #6
This is a variant of Rule #4. In Rule #4, the preposition “for” followed the concept/aspect; this rule deals with prepositions preceding the concept/aspect. Consider the following review text.
“I grew up with the Berenstein Bear's books and love their illustrations and messages, but this one lacks in content and craftsmanship.”
After pre-processing, we got the following output. During pre-processing, proper nouns such as book names and character names are replaced with tags like [BOOK_NAME] and character.
“I_PRP grew_VBD up_RP with_IN [BOOK_NAME] 's_POS books_NNS and_CC love_VBP their_PRP illustrations_NNS and_CC messages_NNS ,_, but_CC one_CD lacks_VBZ in_IN content_NN and_CC craftsmanship_NN.”
We considered the noun/aspect “content” by applying Rule #4; the aspect “craftsmanship” is affected in the same way because of the preposition “in”. A preposition connects a noun/pronoun with other parts of the sentence. The LHS of the noun/aspect “content” is:
LHS[content] = {in, lacks, one, but}
However, as seen above, the first word of the LHS is a preposition. Therefore, the 2nd word, which can be any form of a verb, reflects the sentiment of the noun aspect. In some scenarios a preposition can span more than a single word, as in “as if”; in that case, the next immediate verb phrase is selected for sentiment analysis.
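The four-word windowing of Rules #1-#3 can be sketched in code. This is a minimal illustration under our own assumptions, not the thesis implementation: the (word, POS-tag) token format and the helper names `rhs_window`/`lhs_window` are ours, and only the comma/conjunction adjective-chaining steps are modelled.

```python
# Sketch of the LHS/RHS window extraction of Rules #1-#3 (hypothetical
# helper names; the thesis does not publish reference code).

COMMA, STOP = ",", "."
ADJ_TAGS = {"JJ", "JJR", "JJS", "VBG"}   # adjectives and gerunds
CONJ = {"and", "or"}

def rhs_window(tokens, i, width=4):
    """Collect up to `width` tokens to the right of the aspect at index
    i, stopping at a full stop; then chain "<comma/conjunction>
    <adjective>" pairs beyond the window (Rule #2)."""
    out, j = [], i + 1
    while j < len(tokens) and len(out) < width:
        word, _ = tokens[j]
        if word == STOP:
            break
        out.append(word)
        j += 1
    # Rule #2: while the collected span ends on an adjective/gerund,
    # keep absorbing separator + adjective pairs.
    while out and j + 1 < len(tokens):
        if tokens[j - 1][1] not in ADJ_TAGS:
            break
        sep = tokens[j][0].lower()
        nxt_word, nxt_tag = tokens[j + 1]
        if (sep == COMMA or sep in CONJ) and nxt_tag in ADJ_TAGS:
            out += [tokens[j][0], nxt_word]
            j += 2
        else:
            break
    return out

def lhs_window(tokens, i, width=4):
    """Collect up to `width` tokens to the left of the aspect,
    truncating at a comma or the previous sentence's full stop
    (Rule #1), then chain adjectives past a comma (Rule #3)."""
    out, j = [], i - 1
    while j >= 0 and len(out) < width:
        word, _ = tokens[j]
        if word in (COMMA, STOP):
            break
        out.append(word)
        j -= 1
    # Rule #3 (simplified): hop over a comma when an adjective/gerund
    # run lies beyond it, and absorb the whole run.
    while j >= 1 and tokens[j][0] == COMMA and tokens[j - 1][1] in ADJ_TAGS:
        j -= 1
        while j >= 0 and tokens[j][1] in ADJ_TAGS:
            out.append(tokens[j][0])
            j -= 1
    return out
```

Run on the worked examples above, `rhs_window` reproduces RHS[characters] (with separators kept as window members, as in the thesis sets) and `lhs_window` reproduces LHS[scenes].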
• 61. 48 Rule #7
Consider the following phrase of a pre-processed review text.
“I_PRP loved_VBD that_IN author_NN did_VBD significant_JJ research_NN make_VB story_NN realistic_JJ.”
For the noun/concept/aspect “author”, we got the following RHS word range:
RHS[author] = {did, significant, research, make}
In this set, the first word is a verb and an adjective follows. However, the adjective “significant” is located immediately before the concept “research”, and there are no connecting words between the aspects “research” and “author”. Therefore, we can safely assume in this scenario that the adjective “significant” modifies the aspect/concept “research” only. The set can now be redefined as:
LHS[research] = {significant, did, author, that}
Eventually, we got the final tuple for the aspect “research”: (research, significant).
Likewise, by adding multiple rules, we performed the lexical analysis. As we defined more rules, our precision and recall rates increased as well. Assume that after processing through the rules, we get the following set of noun-adjective tuples for the aspect/concept story:
(story, interesting), (story, entertaining), (story, thrilling)
Next, we further processed the RHS and LHS sets of the story aspect to check whether negation words like NO, NEVER or NOT exist in those sets. Whenever we found a negation word, we updated the tuple with a new value, “1”. The new tuples are:
(story, interesting, 1), (story, entertaining, 1), (story, thrilling, 1)
If no negation words had been found for those adjectives, the flag would instead be 0:
• 62. 49 (story, interesting, 0), (story, entertaining, 0), (story, thrilling, 0)
Now the concepts/aspects are ready to be processed for sentiment analysis and opinion mining. The entire process is depicted in Figure 3.8.
Figure 3.8 - Feature Identification
3.4 Feature Score Calculation
Once the features are extracted, we assigned a score to each feature. This score is later used to calculate the polarity of each feature.
• 63. 50 Unlike traditional methods, which give equal importance to all features, we followed the approach of Isidro and Rafael [49], who weight each feature based on multiple factors:
● Features cited more often by users receive higher scores.
● The score of each feature depends on where the feature occurs in the text.
In their work, the entire review corpus is divided into three parts based on word counts. However, since the phrases in each sentence carry the same weight for this research, we decided to divide the review text into three parts based on sentence counts. Based on the work in [29] [49], we also used the same constant values for each part: the initial part z1 = 0.3, the second part z2 = 0.2 and the final part z3 = 0.5.
3.4.1 Calculating the score for a single feature
To calculate the score for each user opinion, rather than depending on a single review, we found it more reliable to crawl all 600 reviews and calculate the score across the entire corpus. Following Isidro and Rafael [49], we used the same equation, the only difference being that we look for the same feature across all available reviews rather than within a single review. The following equation calculates the score of an opinion on a single review [29]:
• 64. 51 Equation #1
score(f, userop_i) = z1 * |O1| + z2 * |O2| + z3 * |O3|
In Equation #1, z1, z2 and z3 are the constants mentioned in the previous subsection: z1 = 0.3, z2 = 0.2 and z3 = 0.5.
|O1| = number of occurrences of the concept in the first part of the paragraph
|O2| = number of occurrences of the concept in the second part of the paragraph
|O3| = number of occurrences of the concept in the final part of the paragraph
Next, we aggregated the score across all reviews and calculated the arithmetic mean to obtain the average score of the concept, as shown in Equation #2 [1] [29] [49].
Equation #2
Tscore(f) = ( Σ_i score(f, userop_i) ) / n
The value of n is 600. For the feature score calculation only, we performed the computation on the primary and secondary data sets together, to make sure that aspects of both data sets have the same base scores across tests. After running the test, we obtained the scores for individual aspects shown in Table 3.2.
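Equations #1 and #2 can be sketched as follows, using the positional weights z1 = 0.3, z2 = 0.2, z3 = 0.5 from above. The helper names (`opinion_score`, `feature_score`) and the naive whole-word matching are our own illustration.

```python
# Sketch of Equation #1 (positionally weighted occurrence count) and
# Equation #2 (arithmetic mean over the corpus).

Z = (0.3, 0.2, 0.5)   # weights for the first, second and final part

def opinion_score(sentences, feature, z=Z):
    """Equation #1: split a review into three parts by sentence count
    and weight the feature's occurrence count in each part."""
    n = len(sentences)
    bounds = (n // 3, 2 * n // 3)
    counts = [0, 0, 0]
    for i, sent in enumerate(sentences):
        part = 0 if i < bounds[0] else 1 if i < bounds[1] else 2
        counts[part] += sent.lower().split().count(feature)
    return sum(zi * ci for zi, ci in zip(z, counts))

def feature_score(reviews, feature):
    """Equation #2: mean of Equation #1 over all reviews (n = 600 in
    the thesis; here, however many reviews are supplied)."""
    return sum(opinion_score(r, feature) for r in reviews) / len(reviews)
```

Each review is passed as a list of sentences; a real pipeline would reuse the tokenised, lemmatised text from the pre-processing stage rather than `str.split`.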
• 65. 52
Concept | Score Value
Chapter | 0.005
Character | 0.015
Climax | 0.02
Content | 0.06
Cover | 0.01
Description | 0.03
Dialogue | 0.05
Idea | 0.08
Pace | 0.02
Page | 0.01
Plot | 0.04
Protagonist | 0.03
Read | 0.10
Rendering | 0.08
Research | 0.05
Sentence | 0.09
Setting | 0.05
Story | 0.18
Title | 0.03
Vocabulary | 0.01
Writing | 0.04
Table 3.2. Feature Scores
• 66. 53 3.5 Polarity Identification
Polarity calculation for this research was carried out using SentiWordNet 3.0 [21]. As already explained, the feature extraction process outputs tuples containing a concept. Assume we got a tuple like (story, exciting, 0). We then performed a SentiWordNet dictionary lookup to identify the polarity of the adjective “exciting”, which enabled us to determine the polarity of the concept story.
3.5.1 Loading the SentiWordNet 3.0 Dictionary
SentiWordNet 3.0 holds positive, negative and neutral polarity values for the nouns, adjectives and verbs that are located within the defined range of concepts/features. Going through the SentiWordNet 3.0 CSV file, we find entries such as those shown in Table 3.3.
POS Tagger | Offset | Pos(t) | Neg(t) | Sense
Adj | 00921014 | 0.375 | 0.0 | 1
Adj | 336578 | 0.25 | 0.025 | 2
Adj | 423458 | 0.0 | 0.65 | 3
Table 3.3 SentiWN: exciting
SentiWordNet 3.0 has three different senses, each with its own polarity values, for the adjective “exciting”. Therefore, before performing the polarity processing operation, we updated the SentiWordNet 3.0 dictionary with a single set of positive, negative and neutral values per word. As the initial step, the neutral value of each sense can be calculated using Equation #3.
• 67. 54 Equation #3
score(neutral) = 1 - ( score(pos) + score(neg) )
Once the neutral value has been identified, the next step is to find common positive, negative and neutral polarity values for a word. Analysing SentiWordNet 3.0, we observed that a considerable number of entries have multiple senses. Therefore, to calculate a single positive, negative and neutral value, we used the following approach. As already mentioned, the adjective “exciting” has 3 senses. Each sense is a single row entry in the SentiWordNet dictionary with its own polarity values, so the adjective “exciting” has 3 sets of positive, negative and neutral values. The equations below derive a single positive, negative and neutral value for the adjective “exciting”.
Equation #4 [1]
score('word' = pos) = ( Σ_{ws_i ∈ pos} score(ws_i) − Σ_{ws_i ∈ neg} score(ws_i) ) / |word ws_i ∈ pos|
Equation #5 [1]
score('word' = neg) = ( Σ_{ws_i ∈ neg} score(ws_i) − Σ_{ws_i ∈ pos} score(ws_i) ) / |word ws_i ∈ neg|
Equation #6 [1]
score('word' = obj) = ( Σ_{ws_i ∈ obj} score(ws_i) ) / |word ws_i ∈ obj|
• 68. 55 In Equations #4, #5 and #6:
score('word') = average sense value of the word.
score(ws_i) = score of the i-th sense of the word. For example, the positive value of the 2nd sense of the adjective “exciting” is 0.25.
|word ws_i ∈ X| = the cardinality term: the number of senses of the word that fall in category X.
Equations #4, #5 and #6 are used when the term has all three senses. If the term has only two of the three senses, those two values are calculated with the corresponding equations, and the third polarity value is obtained by deducting their sum from 1. If the term has only one sense, the SentiWordNet 3.0 value is used directly. By applying Equations #4, #5 and #6, average positive, negative and neutral polarity values for the word “exciting” are obtained. These newly obtained values are stored in the SentiWordNet dictionary map, where each word or phrase is defined as:
exciting = [ positive polarity, negative polarity, neutral polarity ]
3.5.2 Calculating the Polarity
For example, when calculating the polarity of the concept story, the concept and the adjective can be represented as (story, exciting, 0). According to Sanchez and Penalver [49], the polarity of the story feature can be identified by multiplying the score of the story feature, calculated using Equation #2, by the polarity vector of the adjective exciting. This operation is shown below.
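The sense-collapsing step of Equations #3-#6 can be sketched as follows. This is an illustrative reading, not the thesis code: we assume a sense belongs to the positive (or negative) category when its positive (or negative) score is non-zero, and the helper name `collapse_senses` is our own.

```python
# Sketch of Equations #3-#6: collapsing a word's multiple SentiWordNet
# senses into a single [pos, neg, obj] triple.

def collapse_senses(senses):
    """senses: list of (pos, neg) pairs, one per SentiWordNet sense.
    Returns one [pos, neg, obj] triple for the word."""
    pos_senses = [p for p, n in senses if p > 0]      # ws_i in pos
    neg_senses = [n for p, n in senses if n > 0]      # ws_i in neg
    obj_senses = [1.0 - (p + n) for p, n in senses]   # Equation #3

    total_pos, total_neg = sum(pos_senses), sum(neg_senses)
    # Equations #4 and #5: difference of the category sums, divided by
    # the number of senses in the target category.
    pos = (total_pos - total_neg) / len(pos_senses) if pos_senses else 0.0
    neg = (total_neg - total_pos) / len(neg_senses) if neg_senses else 0.0
    # Equation #6: plain average of the per-sense objective scores.
    obj = sum(obj_senses) / len(obj_senses)
    return [pos, neg, obj]
```

Note that, as stated, Equations #4 and #5 can yield a negative value when the opposite category dominates; the sketch follows the formulas literally.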
• 69. 56 Equation #7 [49]
V(f) = Tscore(f) * ( a * ScorePos, a * ScoreNeg, ScoreObj )
V(f) = polarity of feature f; in our case, the feature story.
Tscore(f) = score of the feature, derived from Equation #2.
a = either 1 or -1. If there are any negation words within the range of the given concept noun, a is -1, and both ScorePos and ScoreNeg obtain negated values.
Once we obtain V(f), the polarity of the feature story can be calculated easily. Next, we identified the sentiment classification for that particular feature.
3.6 Sentiment Analysis
To perform sentiment analysis, we used a technique based on vector analysis for aspect sentiment score computation. As explained in the previous section, we defined the sentiment score as a vector in R3 [48]. This module follows the approach of Isidro and Rafael [48]. The target point of the Euclidean vector contains the polarity values [positive sentiment, negative sentiment, neutral sentiment]. This Euclidean vector is further processed to obtain the final aggregated global polarity of the review text.
3.6.1 Euclidean Vector
The Euclidean vector is defined in R3 [1] and takes an origin point and a target point.
Origin point: O = (0, 0, 0)
Target point: T = (x, y, z)
• 70. 57 Therefore, the expression of the Euclidean difference/position vector is V = (x, y, z).
3.6.2 Calculating the Polarity
Feature polarities are also represented in the form of the Euclidean vector V:
V(f) = (x, y, z)
x = positive polarity of the feature
y = negative polarity of the feature
z = neutral polarity of the feature
Now, based on Equation #7 and the vector V(f), we derived the following equation.
Equation #8 [49]
Polarity(f) = +x (positive), where -1 ≤ x ≤ 1
Polarity(f) = -y (negative), where -1 ≤ y ≤ 1
Polarity(f) = 0 (neutral), where 0 ≤ z ≤ 1
Since Equation #8 defines the polarity for a single feature, the global polarity for the whole user review can be determined by summing the polarity vectors of all features. Equation #9 gives the global polarity.
Equation #9 [49]
Polarity = Σ_{i=1..n} V(f_i)
• 71. 58 Equation #9 outputs a Euclidean vector containing the global positive polarity and the global negative polarity, with the neutral component always 0.
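The vector computation of Equations #7 and #9 can be sketched as follows. The helper names (`feature_vector`, `global_polarity`) and the sample polarity values in the test are our own illustration, not values from the thesis.

```python
# Sketch of Equation #7 (per-feature polarity vector) and Equation #9
# (component-wise sum of all feature vectors for a review).

def feature_vector(tscore, adj_polarity, negated):
    """Equation #7: V(f) = Tscore(f) * (a*pos, a*neg, obj), where
    a = -1 when a negation word was found in the feature's window."""
    pos, neg, obj = adj_polarity
    a = -1.0 if negated else 1.0
    return (tscore * a * pos, tscore * a * neg, tscore * obj)

def global_polarity(vectors):
    """Equation #9: sum the V(f_i) vectors component-wise."""
    return tuple(sum(component) for component in zip(*vectors))
```

For example, the feature story with Tscore 0.18 and a hypothetical polarity triple [0.5, 0.1, 0.4] for “exciting” yields V(story) = (0.09, 0.018, 0.072); a negated window flips the sign of the first two components.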
• 72. 59 Chapter 4 Evaluation
4.1 Selecting a Corpus of Reviews
In their research evaluating ontology-based sentiment analysis on movie reviews, Zhao and Li [1] used 120 reviews: 60 positive and 60 negative. To evaluate how effective the proposed method is for book reviews, we chose a corpus of 600 book reviews. Unlike movies, books are richer in literature, grammar, plot and storyline; a corpus of 600 reviews gives us the variety and diversity needed to cover all the aspects. We used two types of data sets:
● Primary set of book reviews
● Secondary set of book reviews
4.1.1 Primary set of book reviews
This set contains 150 book reviews. These are carefully chosen reviews that have relatively large text content and cover multiple aspects with fairly complex lexical patterns. These 150 reviews were studied and analysed to create the ontology and design the lexical patterns for analysis. The set breaks down into the following categories:
● Reviews that are more positive - 75
● Reviews that are more negative - 75
• 73. 60 4.1.2 Secondary set of book reviews
This set contains 450 book reviews, chosen carefully from a review corpus of 15,000 book reviews. The content size of the reviews within this set varies from small to large. The set breaks down into the following categories:
● Reviews that are more positive - 225
● Reviews that are more negative - 225
4.2 Evaluation of Opinion Mining Methodology
We carried out three different types of tests to evaluate the proposed methodology: the most suitable window length for words/concepts/aspects, aspect detection, and sentiment analysis. All results are summarised and reported by means of Precision, Recall and F1 measure.
Precision
Precision can be defined using the following equation:
Precision = TP / (TP + FP)
In the above equation,
TP = number of polarities correctly predicted as matching (as annotated by the human coders)
TP + FP = total number of polarities predicted as matching
• 74. 61 Recall
Recall can be defined using the following equation:
Recall = TP / (TP + FN)
In the above equation,
TP = number of polarities correctly predicted as matching (as annotated by the human coders)
TP + FN = number of polarities correctly predicted as matching + number of polarities incorrectly predicted as not matching
F1 measure
The F1 measure is the harmonic mean of Precision and Recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Based on the above equations, scores for precision, recall and F1 measure can be calculated.
4.3 Preparing human evaluators
In order to verify the output, two human coders were first trained to label sentences with features and polarity. They were trained on 4 of the 600 reviews in the corpus, learning how to annotate each sentence with its features and polarity. Afterwards, each coder was given a set of 300 book reviews. Once done, they were asked to exchange their outputs and verify each other's results. Any inconsistencies between the coders were resolved through discussion. Eventually, we prepared an aggregated output of sentence labels and polarities. For example, consider a review text stating, “Water for Elephants is an exceptional read. The characters are well-developed and authentic, the story entertaining and enlightening, the
• 75. 62 writing and research first rate. All in all, a tremendous read. I can't imagine anyone who would not enjoy this book, I highly recommend. I love author’s other books as well.”
The aggregated output of the annotation is shown in Table 4.1.
Review Sentence | Features | Sentiment
Water for Elephants is an exceptional read. The characters are well-developed and authentic, the story entertaining and enlightening, the writing and research first rate | characters, story, writing, research | characters = +, story = +, writing = +, research = +
All in all, a tremendous read | read | read = +
I can't imagine anyone who would not enjoy this book, I highly recommend | - | 0
I love author’s other books as well | author | author = 0
Table 4.1. Example of feature annotation by coders
As shown in the above table, the features read, characters, story, writing and research are annotated with positive polarity, whereas the feature author is considered neutral. Likewise, all 600 reviews were annotated by both coders and later aggregated into a single output. The same approach was followed by Zhao and Li [1] and Zhou and Chaovalit [28].
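The evaluation metrics of Section 4.2 can then be computed against the coder-annotated gold labels. The following sketch assumes our own representation of the annotations as feature-to-polarity dictionaries; the helper name `evaluate` is hypothetical.

```python
# Sketch of Precision, Recall and F1 over (feature, polarity)
# predictions versus the aggregated human-coder annotations.

def evaluate(predicted, gold):
    """predicted/gold: dicts mapping a feature to its polarity
    ('+', '-' or '0'). Returns (precision, recall, f1)."""
    tp = sum(1 for f, pol in predicted.items() if gold.get(f) == pol)
    fp = len(predicted) - tp          # predicted but wrong or spurious
    fn = len(gold) - tp               # annotated but not recovered
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

A feature predicted with the wrong polarity counts against both precision (as a false positive) and recall (as a missed gold label), which matches treating each (feature, polarity) pair as the unit of evaluation.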