Predicting the Semantic Orientation of Adjectives

Vasileios Hatzivassiloglou and Kathleen R. McKeown
Department of Computer Science
450 Computer Science Building
Columbia University
New York, N.Y. 10027, USA
{vh, kathy}@cs.columbia.edu
Abstract
We identify and validate from a large corpus constraints from conjunctions on the positive or negative semantic orientation of the conjoined adjectives. A log-linear regression model uses these constraints to predict whether conjoined adjectives are of same or different orientations, achieving 82% accuracy in this task when each conjunction is considered independently. Combining the constraints across many adjectives, a clustering algorithm separates the adjectives into groups of different orientations, and finally, adjectives are labeled positive or negative. Evaluations on real data and simulation experiments indicate high levels of performance: classification precision is more than 90% for adjectives that occur in a modest number of conjunctions in the corpus.
1 Introduction
The semantic orientation or polarity of a word indicates the direction the word deviates from the norm for its semantic group or lexical field (Lehrer, 1974). It also constrains the word's usage in the language (Lyons, 1977), due to its evaluative characteristics (Battistella, 1990). For example, some nearly synonymous words differ in orientation because one implies desirability and the other does not (e.g., simple versus simplistic). In linguistic constructs such as conjunctions, which impose constraints on the semantic orientation of their arguments (Anscombre and Ducrot, 1983; Elhadad and McKeown, 1990), the choices of arguments and connective are mutually constrained, as illustrated by:

    The tax proposal was
        simple and well-received
        simplistic but well-received
        *simplistic and well-received
    by the public.
In addition, almost all antonyms have different semantic orientations.1 If we know that two words relate to the same property (for example, members of the same scalar group such as hot and cold) but have different orientations, we can usually infer that they are antonyms. Given that semantically similar words can be identified automatically on the basis of distributional properties and linguistic cues (Brown et al., 1992; Pereira et al., 1993; Hatzivassiloglou and McKeown, 1993), identifying the semantic orientation of words would allow a system to further refine the retrieved semantic similarity relationships, extracting antonyms.
Unfortunately, dictionaries and similar sources (thesauri, WordNet (Miller et al., 1990)) do not include semantic orientation information.2 Explicit links between antonyms and synonyms may also be lacking, particularly when they depend on the domain of discourse; for example, the opposition bear-bull appears only in stock market reports, where the two words take specialized meanings.
In this paper, we present and evaluate a method that automatically retrieves semantic orientation information using indirect information collected from a large corpus. Because the method relies on the corpus, it extracts domain-dependent information and automatically adapts to a new domain when the corpus is changed. Our method achieves high precision (more than 90%), and, while our focus to date has been on adjectives, it can be directly applied to other word classes. Ultimately, our goal is to use this method in a larger system to automatically identify antonyms and distinguish near synonyms.
2 Overview of Our Approach
Our approach relies on an analysis of textual corpora that correlates linguistic features, or indicators, with semantic orientation. While no direct indicators of positive or negative semantic orientation have been proposed,3 we demonstrate that conjunctions between adjectives provide indirect information about orientation. For most connectives, the conjoined adjectives usually are of the same orientation: compare fair and legitimate and corrupt and brutal, which actually occur in our corpus, with *fair and brutal and *corrupt and legitimate (or the other cross-products of the above conjunctions), which are semantically anomalous. The situation is reversed for but, which usually connects two adjectives of different orientations.

1 Exceptions include a small number of terms that are both negative from a pragmatic viewpoint and yet stand in an antonymic relationship; such terms frequently lexicalize two unwanted extremes, e.g., verbose-terse.

2 Except implicitly, in the form of definitions and usage examples.
The system identifies and uses this indirect information in the following stages:

1. All conjunctions of adjectives are extracted from the corpus along with relevant morphological relations.

2. A log-linear regression model combines information from different conjunctions to determine if each two conjoined adjectives are of same or different orientation. The result is a graph with hypothesized same- or different-orientation links between adjectives.

3. A clustering algorithm separates the adjectives into two subsets of different orientation. It places as many words of same orientation as possible into the same subset.

4. The average frequencies in each group are compared and the group with the higher frequency is labeled as positive.
In the following sections, we first present the set
of adjectives used for training and evaluation. We
next validate our hypothesis that conjunctions con-
strain the orientation of conjoined adjectives and
then describe the remaining three steps of the algo-
rithm. After presenting our results and evaluation,
we discuss simulation experiments that show how
our method performs under different conditions of
sparseness of data.
3 Data Collection
For our experiments, we use the 21 million word
1987 Wall Street Journal corpus,4 automatically annotated with part-of-speech tags using the PARTS tagger (Church, 1988).
In order to verify our hypothesis about the ori-
entations of conjoined adjectives, and also to train
and evaluate our subsequent algorithms, we need a
3 Certain words inflected with negative affixes (such
as in- or un-) tend to be mostly negative, but this rule
applies only to a fraction of the negative words. Further-
more, there are words so inflected which have positive
orientation, e.g., independent and unbiased.
4 Available from the ACL Data Collection Initiative as CD ROM 1.
Positive: adequate central clever famous
intelligent remarkable reputed
sensitive slender thriving
Negative: contagious drunken ignorant lanky
listless primitive strident troublesome
unresolved unsuspecting
Figure 1: Randomly selected adjectives with positive
and negative orientations.
set of adjectives with predetermined orientation labels. We constructed this set by taking all adjectives appearing in our corpus 20 times or more, then removing adjectives that have no orientation. These are typically members of groups of complementary, qualitative terms (Lyons, 1977), e.g., domestic or medical.

We then assigned an orientation label (either + or −) to each adjective, using an evaluative approach. The criterion was whether the use of this adjective ascribes in general a positive or negative quality to the modified item, making it better or worse than a similar unmodified item. We were unable to reach a unique label out of context for several adjectives which we removed from consideration; for example, cheap is positive if it is used as a synonym of inexpensive, but negative if it implies inferior quality. The operations of selecting adjectives and assigning labels were performed before testing our conjunction hypothesis or implementing any other algorithms, to avoid any influence on our labels. The final set contained 1,336 adjectives (657 positive and 679 negative terms). Figure 1 shows randomly selected terms from this set.
To further validate our set of labeled adjectives, we subsequently asked four people to independently label a randomly drawn sample of 500 of these adjectives. They agreed with us that the positive/negative concept applies to 89.15% of these adjectives on average. For the adjectives where a positive or negative label was assigned by both us and the independent evaluators, the average agreement on the label was 97.38%. The average inter-reviewer agreement on labeled adjectives was 96.97%. These results are extremely significant statistically and compare favorably with validation studies performed for other tasks (e.g., sense disambiguation) in the past. They show that positive and negative orientation are objective properties that can be reliably determined by humans.
To extract conjunctions between adjectives, we
used a two-level finite-state grammar, which covers
complex modification patterns and noun-adjective
apposition. Running this parser on the 21 million word corpus, we collected 13,426 conjunctions of adjectives, expanding to a total of 15,431 conjoined adjective pairs. After morphological transformations, the remaining 15,048 conjunction tokens involve 9,296 distinct pairs of conjoined adjectives (types).

Conjunction category               Types     % same-       % same-       P-value
                                   analyzed  orientation   orientation   (for types)
                                             (types)       (tokens)
All conjunctions                   2,748     77.84%        72.39%        < 1 · 10^-16
All and conjunctions               2,294     81.73%        78.07%        < 1 · 10^-16
All or conjunctions                  305     77.05%        60.97%        < 1 · 10^-16
All but conjunctions                 214     30.84%        25.94%        2.09 · 10^-8
All attributive and conjunctions   1,077     80.04%        76.82%        < 1 · 10^-16
All predicative and conjunctions     860     84.77%        84.54%        < 1 · 10^-16
All appositive and conjunctions       30     70.00%        63.64%        0.04277

Table 1: Validation of our conjunction hypothesis. The P-value is the probability that similar or more extreme results would have been obtained if same- and different-orientation conjunction types were actually equally distributed.

Each conjunction token is classified by the
parser according to three variables: the conjunction
used (and, or, but, either-or, or neither-nor), the
type of modification (attributive, predicative, appos-
itive, resultative), and the number of the modified
noun (singular or plural).
4 Validation of the Conjunction Hypothesis
Using the three attributes extracted by the parser, we constructed a cross-classification of the conjunctions in a three-way table. We counted types and tokens of each conjoined pair that had both members in the set of pre-selected labeled adjectives discussed above; 2,748 (29.56%) of all conjoined pairs (types) and 4,024 (26.74%) of all conjunction occurrences (tokens) met this criterion. We augmented this table with marginal totals, arriving at 90 categories, each of which represents a triplet of attribute values, possibly with one or more "don't care" elements.

We then measured the percentage of conjunctions in each category with adjectives of same or different orientations. Under the null hypothesis of same proportions of adjective pairs (types) of same and different orientation in a given category, the number of same- or different-orientation pairs follows a binomial distribution with p = 0.5 (Conover, 1980).
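This significance computation can be sketched as an exact two-sided binomial (sign) test. The 66-of-214 example below is illustrative, chosen only to match the roughly 30.84% same-orientation rate reported for but conjunctions; it is not taken directly from the paper's tables.

```python
from math import comb

def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Probability, under Binomial(trials, p=0.5), of a count at least as
    far from trials/2 as the one observed (two-sided sign test)."""
    deviation = abs(successes - trials / 2)
    mass = sum(comb(trials, k) for k in range(trials + 1)
               if abs(k - trials / 2) >= deviation)
    return min(mass * 0.5 ** trials, 1.0)

# Roughly 66 same-orientation types out of 214 "but" conjunctions (~30.84%):
# far too skewed to arise from equally distributed types.
print(binomial_two_sided_p(66, 214) < 1e-6)  # True
```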
We show in Table 1 the results for several representative categories, and summarize all results below:

• Our conjunction hypothesis is validated overall and for almost all individual cases. The results are extremely significant statistically, except for a few cases where the sample is small.

• Aside from the use of but with adjectives of different orientations, there are, rather surprisingly, small differences in the behavior of conjunctions between linguistic environments (as represented by the three attributes). There are a few exceptions, e.g., appositive and conjunctions modifying plural nouns are evenly split between same and different orientation. But in these exceptional cases the sample is very small, and the observed behavior may be due to chance.

• Further analysis of different-orientation pairs in conjunctions other than but shows that conjoined antonyms are far more frequent than expected by chance, in agreement with (Justeson and Katz, 1991).
5 Prediction of Link Type
The analysis in the previous section suggests a baseline method for classifying links between adjectives: since 77.84% of all links from conjunctions indicate same orientation, we can achieve this level of performance by always guessing that a link is of the same-orientation type. However, we can improve performance by noting that conjunctions using but exhibit the opposite pattern, usually involving adjectives of different orientations. Thus, a revised but still simple rule predicts a different-orientation link if the two adjectives have been seen in a but conjunction, and a same-orientation link otherwise, assuming the two adjectives were seen connected by at least one conjunction.
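A minimal sketch of this revised rule follows; the list-of-connectives input format is an assumption for illustration, not the system's actual data structure.

```python
def predict_link(pair_conjunctions):
    """Revised baseline rule: predict different orientation iff the pair
    was seen in at least one `but` conjunction, same orientation otherwise.

    pair_conjunctions: connectives observed for one adjective pair,
    e.g. ["and", "and", "but"] (illustrative format)."""
    if not pair_conjunctions:
        return None  # pair never seen conjoined: no prediction
    if "but" in pair_conjunctions:
        return "different-orientation"
    return "same-orientation"

print(predict_link(["and", "but"]))  # different-orientation
print(predict_link(["and", "or"]))   # same-orientation
```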
Prediction method   Morphology   Accuracy on reported      Accuracy on reported          Overall
                    used?        same-orientation links    different-orientation links   accuracy
Always predict      No           77.84%                    --                            77.84%
same orientation    Yes          78.18%                    97.06%                        78.86%
But rule            No           81.81%                    69.16%                        80.82%
                    Yes          82.20%                    78.16%                        81.75%
Log-linear model    No           81.53%                    73.70%                        80.97%
                    Yes          82.00%                    82.44%                        82.05%

Table 2: Accuracy of several link prediction models.

Morphological relationships between adjectives also play a role. Adjectives related in form (e.g., adequate-inadequate or thoughtful-thoughtless) almost always have different semantic orientations. We implemented a morphological analyzer which matches adjectives related in this manner. This process is highly accurate, but unfortunately does not apply to many of the possible pairs: in our set of 1,336 labeled adjectives (891,780 possible pairs), 102 pairs are morphologically related; among them, 99 are of different orientation, yielding 97.06% accuracy for the morphology method. This information is orthogonal to that extracted from conjunctions: only 12 of the 102 morphologically related pairs have been observed in conjunctions in our corpus. Thus, we add to the predictions made from conjunctions the different-orientation links suggested by morphological relationships.
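The analyzer itself is not specified in detail; the sketch below shows the kind of matching involved, with a hypothetical affix inventory that is an assumption, not the paper's rule set.

```python
# Illustrative negation prefixes; the actual analyzer's inventory is unknown.
NEG_PREFIXES = ("in", "un", "im", "ir", "dis")

def morphologically_opposed(a: str, b: str) -> bool:
    """Heuristic sketch: b is a negation-prefixed variant of a, or the
    two are a -ful/-less pair (checked in both directions)."""
    for x, y in ((a, b), (b, a)):
        for prefix in NEG_PREFIXES:
            if y == prefix + x:
                return True
        if x.endswith("ful") and y == x[:-3] + "less":
            return True
    return False

print(morphologically_opposed("adequate", "inadequate"))     # True
print(morphologically_opposed("thoughtful", "thoughtless"))  # True
print(morphologically_opposed("simple", "simplistic"))       # False
```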
We improve the accuracy of classifying links derived from conjunctions as same or different orientation with a log-linear regression model (Santner and Duffy, 1989), exploiting the differences between the various conjunction categories. This is a generalized linear model (McCullagh and Nelder, 1989) with a linear predictor

    η = w^T x

where x is the vector of the observed counts in the various conjunction categories for the particular adjective pair we try to classify and w is a vector of weights to be learned during training. The response y is non-linearly related to η through the inverse logit function,

    y = e^η / (1 + e^η).

Note that y ∈ (0, 1), with each of these endpoints associated with one of the possible outcomes.
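The predictor and response can be sketched directly (the example weights and counts are made up for illustration):

```python
import math

def predicted_response(w, x):
    """Inverse-logit response y = exp(eta) / (1 + exp(eta)), eta = w . x."""
    eta = sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

# eta = 0.5*1.0 + (-0.25)*2.0 = 0 gives the neutral response 0.5
print(predicted_response([0.5, -0.25], [1.0, 2.0]))  # 0.5
```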
We have 90 possible predictor variables, 42 of
which are linearly independent. Since using all the
42 independent predictors invites overfitting (Duda
and Hart, 1973), we have investigated subsets of the
full log-linear model for our data using the method of iterative stepwise refinement: starting with an ini-
tial model, variables are added or dropped if their
contribution to the reduction or increase of the resid-
ual deviance compares favorably to the resulting loss
or gain of residual degrees of freedom. This process
led to the selection of nine predictor variables.
We evaluated the three prediction models dis-
cussed above with and without the secondary source
of morphology relations. For the log-linear model,
we repeatedly partitioned our data into equally sized
training and testing sets, estimated the weights on
the training set, and scored the model's performance
on the testing set, averaging the resulting scores. 5
Table 2 shows the results of these analyses. Although the log-linear model offers only a small improvement in pair classification over the simpler but prediction rule, it confers the important advantage of rating each prediction between 0 and 1. We make extensive use of this in the next phase of our algorithm.

5 When morphology is to be used as a supplementary predictor, we remove the morphologically related pairs from the training and testing sets.
6 Finding Groups of Same-Oriented
Adjectives
The third phase of our method assigns the adjectives into groups, placing adjectives of the same (but unknown) orientation in the same group. Each pair of adjectives has an associated dissimilarity value between 0 and 1; adjectives connected by same-orientation links have low dissimilarities, and conversely, different-orientation links result in high dissimilarities. Adjective pairs with no connecting links are assigned the neutral dissimilarity 0.5.
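The mapping from link predictions to dissimilarities described here and in the next two paragraphs can be sketched as follows (the 0.82/0.78 accuracies passed in are illustrative values, not estimates from the paper's training data):

```python
def dissimilarity(prediction, p_correct=None, y=None):
    """Map a link prediction to a dissimilarity in [0, 1].

    prediction: "same", "different", or None (pair never conjoined).
    p_correct:  estimated accuracy of this link type (baseline / but rule).
    y:          log-linear response in (0, 1), where 1 means same orientation.
    """
    if prediction is None:
        return 0.5                 # neutral: no evidence either way
    if y is not None:
        return 1.0 - y             # graded confidence from the regression
    if prediction == "same":
        return 1.0 - p_correct     # confident same link -> low dissimilarity
    return p_correct               # confident different link -> high dissimilarity

print(dissimilarity("same", p_correct=0.82))       # ~0.18
print(dissimilarity("different", p_correct=0.78))  # 0.78
print(dissimilarity("same", y=0.93))               # ~0.07
```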
The baseline and but methods make qualitative distinctions only (i.e., same-orientation, different-orientation, or unknown); for them, we define dissimilarity for same-orientation links as one minus the probability that such a classification link is correct and dissimilarity for different-orientation links as the probability that such a classification is correct. These probabilities are estimated from separate training data. Note that for these prediction models, dissimilarities are identical for similarly classified links.
The log-linear model, on the other hand, offers an estimate of how good each prediction is, since it produces a value y between 0 and 1. We construct the model so that 1 corresponds to same-orientation, and define dissimilarity as one minus the produced value.
Same- and different-orientation links between adjectives form a graph. To partition the graph nodes into subsets of the same orientation, we employ an iterative optimization procedure on each connected component, based on the exchange method, a non-hierarchical clustering algorithm (Späth, 1985). We define an objective function Φ scoring each possible partition P of the adjectives into two subgroups C1 and C2 as

    Φ(P) = Σ_{i=1,2} (1/|C_i|) Σ_{x,y ∈ C_i} d(x,y)
α   Number of adjectives   Number of links      Average number of        Accuracy   Ratio of average
    in test set (|A_α|)    in test set (|L_α|)  links for each adjective            group frequencies
2   730                    2,568                7.04                     78.08%     1.8699
3   516                    2,159                8.37                     82.56%     1.9235
4   369                    1,742                9.44                     87.26%     1.3486
5   236                    1,238                10.49                    92.37%     1.4040

Table 3: Evaluation of the adjective classification and labeling methods.
where |C_i| stands for the cardinality of cluster i, and d(x, y) is the dissimilarity between adjectives x and y. We want to select the partition P_min that minimizes Φ, subject to the additional constraint that for each adjective x in a cluster C,

    (1/(|C| − 1)) Σ_{y ∈ C, y ≠ x} d(x,y)  <  (1/|C̄|) Σ_{y ∈ C̄} d(x,y)    (1)

where C̄ is the complement of cluster C, i.e., the other member of the partition. This constraint, based on Rousseeuw's (1987) silhouettes, helps correct wrong cluster assignments.
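Constraint (1) can be checked per adjective as follows; this is a sketch, with `d` a symmetric dissimilarity lookup (dict of dicts) and the toy values chosen for illustration.

```python
def violates_constraint(x, cluster, other, d):
    """Constraint (1): x's mean dissimilarity within its own cluster must be
    strictly smaller than its mean dissimilarity to the other cluster."""
    within = [d[x][y] for y in cluster if y != x]
    across = [d[x][y] for y in other]
    if not within or not across:
        return False  # singleton or empty complement: constraint is vacuous
    return sum(within) / len(within) >= sum(across) / len(across)

# "bad" sits closer (on average) to the other cluster than to its own:
d = {"bad":  {"good": 0.9, "nice": 0.85},
     "good": {"bad": 0.9,  "nice": 0.1},
     "nice": {"good": 0.1, "bad": 0.85}}
print(violates_constraint("bad", {"good", "bad"}, {"nice"}, d))   # True
print(violates_constraint("good", {"good", "nice"}, {"bad"}, d))  # False
```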
To find P_min, we first construct a random partition of the adjectives, then locate the adjective that will most reduce the objective function if it is moved from its current cluster. We move this adjective and proceed with the next iteration until no movements can improve the objective function. At the final iteration, the cluster assignment of any adjective that violates constraint (1) is changed. This is a steepest-descent hill-climbing method, and thus is guaranteed to converge. However, it will in general find a local minimum rather than the global one; the problem is NP-complete (Garey and Johnson, 1979). We can arbitrarily increase the probability of finding the globally optimal solution by repeatedly running the algorithm with different starting partitions.
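A compact sketch of this procedure (omitting the final constraint-repair step, and storing dissimilarities as a dict of dicts; the four toy adjectives are illustrative):

```python
import itertools
import random

def phi(partition, d):
    """Objective: for each cluster, sum of within-cluster dissimilarities
    divided by cluster size."""
    total = 0.0
    for cluster in partition:
        if cluster:
            pairs = itertools.combinations(sorted(cluster), 2)
            total += sum(d[x][y] for x, y in pairs) / len(cluster)
    return total

def exchange_cluster(nodes, d, seed=0):
    """Steepest-descent exchange method: repeatedly move the single
    adjective whose relocation most reduces phi, until no move improves."""
    rng = random.Random(seed)
    part = [set(), set()]
    for n in nodes:
        part[rng.randrange(2)].add(n)
    while True:
        base, best = phi(part, d), None
        for i in (0, 1):
            for n in list(part[i]):
                part[i].remove(n); part[1 - i].add(n)   # try the move
                score = phi(part, d)
                if best is None or score < best[0]:
                    best = (score, n, i)
                part[1 - i].remove(n); part[i].add(n)   # undo it
        if best is None or best[0] >= base - 1e-12:
            return part
        _, n, i = best
        part[i].remove(n); part[1 - i].add(n)

# Toy dissimilarities: low within {good, nice} and {bad, awful}, high across.
adjs = ["good", "nice", "bad", "awful"]
d = {a: {} for a in adjs}
for a, b in itertools.combinations(adjs, 2):
    close = {a, b} <= {"good", "nice"} or {a, b} <= {"bad", "awful"}
    d[a][b] = d[b][a] = 0.1 if close else 0.9
print(sorted(sorted(c) for c in exchange_cluster(adjs, d)))
# [['awful', 'bad'], ['good', 'nice']]
```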
7 Labeling the Clusters as Positive or Negative
The clustering algorithm separates each component of the graph into two groups of adjectives, but does not actually label the adjectives as positive or negative. To accomplish that, we use a simple criterion that applies only to pairs or groups of words of opposite orientation. We have previously shown (Hatzivassiloglou and McKeown, 1995) that in oppositions of gradable adjectives where one member is semantically unmarked, the unmarked member is the most frequent one about 81% of the time. This is relevant to our task because semantic markedness exhibits a strong correlation with orientation, the unmarked member almost always having positive orientation (Lehrer, 1985; Battistella, 1990).
We compute the average frequency of the words in each group, expecting the group with higher average frequency to contain the positive terms. This aggregation operation increases the precision of the labeling dramatically since indicators for many pairs of words are combined, even when some of the words are incorrectly assigned to their group.
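A sketch of the labeling criterion; the frequency counts below are invented for illustration, not drawn from the corpus.

```python
def label_groups(group_a, group_b, freq):
    """Return (positive, negative): the group with the higher average corpus
    frequency is labeled positive (unmarked members tend to be both more
    frequent and positively oriented). Ties default to group_a."""
    avg = lambda g: sum(freq[w] for w in g) / len(g)
    return (group_a, group_b) if avg(group_a) >= avg(group_b) else (group_b, group_a)

# Illustrative counts only.
freq = {"good": 500, "clever": 80, "bad": 300, "listless": 10}
pos, neg = label_groups({"good", "clever"}, {"bad", "listless"}, freq)
print(sorted(pos))  # ['clever', 'good']
```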
8 Results and Evaluation
Since graph connectivity affects performance, we devised a method of selecting test sets that makes this dependence explicit. Note that the graph density is largely a function of corpus size, and thus can be increased by adding more data. Nevertheless, we report results on sparser test sets to show how our algorithm scales up.
We separated our sets of adjectives A (containing 1,336 adjectives) and conjunction- and morphology-based links L (containing 2,838 links) into training and testing groups by selecting, for several values of the parameter α, the maximal subset of A, A_α, which includes an adjective x if and only if there exist at least α links from L between x and other elements of A_α. This operation in turn defines a subset of L, L_α, which includes all links between members of A_α. We train our log-linear model on L − L_α (excluding links between morphologically related adjectives), compute predictions and dissimilarities for the links in L_α, and use these to classify and label the adjectives in A_α. α must be at least 2, since we need to leave some links for training.
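The construction of A_α is a maximal-core computation: repeatedly dropping under-connected adjectives yields the unique maximal subset in which every member keeps at least α links. A sketch, on a toy graph:

```python
def alpha_core(adjectives, links, alpha):
    """Maximal subset in which every adjective has at least `alpha` links
    to other members, plus the links surviving between members."""
    members = set(adjectives)
    changed = True
    while changed:
        changed = False
        for a in list(members):
            degree = sum(1 for (x, y) in links
                         if a in (x, y) and x in members and y in members)
            if degree < alpha:
                members.discard(a)  # under-connected: drop and re-check
                changed = True
    kept_links = [(x, y) for (x, y) in links if x in members and y in members]
    return members, kept_links

# "d" has only one link, so it is pruned for alpha = 2.
members, kept = alpha_core(["a", "b", "c", "d"],
                           [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")],
                           alpha=2)
print(sorted(members))  # ['a', 'b', 'c']
```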
Table 3 shows the results of these experiments for α = 2 to 5. Our method produced the correct classification between 78% of the time on the sparsest test set up to more than 92% of the time when a higher number of links was present. Moreover, in all cases, the ratio of the two group frequencies correctly identified the positive subgroup. These results are extremely significant statistically (P-value less than 10^-16) when compared with the baseline method of randomly assigning orientations to adjectives, or the baseline method of always predicting the most frequent (for types) category (50.82% of the adjectives in our collection are classified as negative). Figure 2 shows some of the adjectives in set A_4 and their classifications.
Classified as positive:
bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty

Classified as negative:
ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful

Figure 2: Sample retrieved classifications of adjectives from set A_4. Correctly matched adjectives are shown in bold.
9 Graph Connectivity and Performance
A strong point of our method is that decisions on individual words are aggregated to provide decisions on how to group words into a class and whether to label the class as positive or negative. Thus, the overall result can be much more accurate than the individual indicators. To verify this, we ran a series of simulation experiments. Each experiment measures how our algorithm performs for a given level of precision P for identifying links and a given average number of links k for each word. The goal is to show that even when P is low, given enough data (i.e., high k), we can achieve high performance for the grouping.
As we noted earlier, the corpus data is eventually represented in our system as a graph, with the nodes corresponding to adjectives and the links to predictions about whether the two connected adjectives have the same or different orientation. Thus the parameter P in the simulation experiments measures how well we are able to predict each link independently of the others, and the parameter k measures the number of distinct adjectives each adjective appears with in conjunctions. P therefore directly represents the precision of the link classification algorithm, while k indirectly represents the corpus size.
To measure the effect of P and k (which are reflected in the graph topology), we need to carry out a series of experiments where we systematically vary their values. For example, as k (or the amount of data) increases for a given level of precision P for individual links, we want to measure how this affects overall accuracy of the resulting groups of nodes. Thus, we need to construct a series of data sets, or graphs, which represent different scenarios corresponding to a given combination of values of P and k. To do this, we construct a random graph by randomly assigning 50 nodes to the two possible orientations. Because we don't have frequency and morphology information on these abstract nodes, we cannot predict whether two nodes are of the same or different orientation. Rather, we randomly assign links between nodes so that, on average, each node participates in k links and 100 × P% of all links connect nodes of the same orientation. Then we consider these links as identified by the link prediction algorithm as connecting two nodes with the same orientation (so that 100 × P% of these predictions will be correct). This is equivalent to the baseline link classification method, and provides a lower bound on the performance of the algorithm actually used in our system (Section 5).
Because of the lack of actual measurements such as frequency on these abstract nodes, we also decouple the partitioning and labeling components of our system and score the partition found under the best matching conditions for the actual labels. Thus the simulation measures only how well the system separates positive from negative adjectives, not how well it determines which is which. However, in all the experiments performed on real corpus data (Section 8), the system correctly found the labels of the groups; any misclassifications came from misplacing an adjective in the wrong group. The whole procedure of constructing the random graph and finding and scoring the groups is repeated 200 times for any given combination of P and k, and the results are averaged, thus avoiding accidentally evaluating our system on a graph that is not truly representative of graphs with the given P and k.
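One way to sketch the graph construction is shown below. As a simplification, it fixes the link counts exactly (n·k/2 links, a fraction P of them same-orientation) rather than matching those proportions only on average.

```python
import itertools
import random

def simulation_graph(n=50, k=7, p=0.8, seed=0):
    """Random instance for one (P, k) setting: n nodes with random
    orientations, n*k/2 links, a fraction p of which join nodes of the
    same orientation (the rest join nodes of opposite orientation)."""
    rng = random.Random(seed)
    orient = {i: rng.choice((-1, 1)) for i in range(n)}
    same = [e for e in itertools.combinations(range(n), 2)
            if orient[e[0]] == orient[e[1]]]
    diff = [e for e in itertools.combinations(range(n), 2)
            if orient[e[0]] != orient[e[1]]]
    m = n * k // 2                     # total number of links
    m_same = round(m * p)              # links marked "same orientation"
    links = rng.sample(same, m_same) + rng.sample(diff, m - m_same)
    return orient, links

orient, links = simulation_graph()
print(len(links))  # 175
```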
We observe (Figure 3) that even for relatively low P, our ability to correctly classify the nodes approaches very high levels with a modest number of links. For P = 0.8, we need only about 7 links per adjective for classification performance over 90% and only 12 links per adjective for performance over 99%.6 The difference between low and high values of P is in the rate at which increasing data increases overall precision. These results are somewhat more optimistic than those obtained with real data (Section 8), a difference which is probably due to the uniform distributional assumptions in the simulation. Nevertheless, we expect the trends to be similar to the ones shown in Figure 3 and the results of Table 3 on real data support this expectation.
10 Conclusion and Future Work
We have proposed and verified from corpus data constraints on the semantic orientations of conjoined adjectives. We used these constraints to automatically construct a log-linear regression model, which, combined with supplementary morphology rules, predicts whether two conjoined adjectives are of same

6 12 links per adjective for a set of n adjectives requires 6n conjunctions between the n adjectives in the corpus.
[Figure 3 consists of four plots of classification accuracy (50-100%) against the average number of neighbors per node, for panels (a) P = 0.75, (b) P = 0.8, (c) P = 0.85, and (d) P = 0.9.]

Figure 3: Simulation results obtained on 50 nodes. In each figure, the last x coordinate indicates the (average) maximum possible value of k for this P, and the dotted line shows the performance of a random classifier.
or different orientation with 82% accuracy. We then
classified several sets of adjectives according to the
links inferred in this way and labeled them as posi-
tive or negative, obtaining 92% accuracy on the clas-
sification task for reasonably dense graphs and 100%
accuracy on the labeling task. Simulation experi-
ments establish that very high levels of performance
can be obtained with a modest number of links per
word, even when the links themselves are not always
correctly classified.
As part of our clustering algorithm's output, a "goodness-of-fit" measure for each word is computed, based on Rousseeuw's (1987) silhouettes. This measure ranks the words according to how well they fit in their group, and can thus be used as a quantitative measure of orientation, refining the binary positive-negative distinction. By restricting the labeling decisions to words with high values of this measure we can also increase the precision of our system, at the cost of sacrificing some coverage.

We are currently combining the output of this system with a semantic group finding system so that we can automatically identify antonyms from the corpus, without access to any semantic descriptions. The learned semantic categorization of the adjectives can also be used in the reverse direction, to help in interpreting the conjunctions they participate in. We will also extend our analyses to nouns and verbs.
Acknowledgements

This work was supported in part by the Office of Naval Research under grant N00014-95-1-0745, jointly by the Office of Naval Research and the Advanced Research Projects Agency under grant N00014-89-J-1782, by the National Science Foundation under grant GER-90-24069, and by the New York State Center for Advanced Technology under contracts NYSSTF-CAT(95)-013 and NYSSTF-CAT(96)-013. We thank Ken Church and the AT&T Bell Laboratories for making the PARTS part-of-speech tagger available to us. We also thank Dragomir Radev, Eric Siegel, and Gregory Sean McKinley who provided models for the categorization of the adjectives in our training and testing sets as positive and negative.
References
Jean-Claude Anscombre and Oswald Ducrot. 1983. L'Argumentation dans la Langue. Philosophie et Langage. Pierre Mardaga, Brussels, Belgium.
Edwin L. Battistella. 1990. Markedness: The Eval-
uative Superstructure of Language. State Univer-
sity of New York Press, Albany, New York.
Peter F. Brown, Vincent J. della Pietra, Peter V.
de Souza, Jennifer C. Lai, and Robert L. Mercer.
1992. Class-based n-gram models of natural lan-
guage. Computational Linguistics, 18(4):467-479.
Kenneth W. Church. 1988. A stochastic parts
program and noun phrase parser for unrestricted
text. In Proceedings of the Second Conference on
Applied Natural Language Processing (ANLP-88),
pages 136-143, Austin, Texas, February. Associa-
tion for Computational Linguistics.
W. J. Conover. 1980. Practical Nonparametric
Statistics. Wiley, New York, 2nd edition.
Richard O. Duda and Peter E. Hart. 1973. Pattern
Classification and Scene Analysis. Wiley, New
York.
Michael Elhadad and Kathleen R. McKeown. 1990.
A procedure for generating connectives. In Pro-
ceedings of COLING, Helsinki, Finland, July.
Michael R. Garey and David S. Johnson. 1979.
Computers and Intractability: A Guide to the
Theory of NP-Completeness. W. H. Freeman, San
Francisco, California.
Vasileios Hatzivassiloglou and Kathleen R. McKe-
own. 1993. Towards the automatic identification
of adjectival scales: Clustering adjectives accord-
ing to meaning. In Proceedings of the 31st Annual
Meeting of the ACL, pages 172-182, Columbus,
Ohio, June. Association for Computational Lin-
guistics.
Vasileios Hatzivassiloglou and Kathleen R. McKe-
own. 1995. A quantitative evaluation of linguis-
tic tests for the automatic prediction of semantic
markedness. In Proceedings of the 33rd Annual
Meeting of the ACL, pages 197-204, Boston, Mas-
sachusetts, June. Association for Computational
Linguistics.
John S. Justeson and Slava M. Katz. 1991. Co-
occurrences of antonymous adjectives and their
contexts. Computational Linguistics, 17(1):1-19.
Adrienne Lehrer. 1974. Semantic Fields and Lexical
Structure. North Holland, Amsterdam and New
York.
Adrienne Lehrer. 1985. Markedness and antonymy.
Journal of Linguistics, 21(3):397-429, September.
John Lyons. 1977. Semantics, volume 1. Cambridge
University Press, Cambridge, England.
Peter McCullagh and John A. Nelder. 1989. Gen-
eralized Linear Models. Chapman and Hall, Lon-
don, 2nd edition.
George A. Miller, Richard Beckwith, Christiane Fell-
baum, Derek Gross, and Katherine J. Miller.
1990. Introduction to WordNet: An on-line lexi-
cal database. International Journal of Lexicogra-
phy (special issue), 3(4):235-312.
Fernando Pereira, Naftali Tishby, and Lillian Lee.
1993. Distributional clustering of English words.
In Proceedings of the 31st Annual Meeting of the
ACL, pages 183-190, Columbus, Ohio, June. As-
sociation for Computational Linguistics.
Peter J. Rousseeuw. 1987. Silhouettes: A graphical
aid to the interpretation and validation of cluster
analysis. Journal of Computational and Applied
Mathematics, 20:53-65.
Thomas J. Santner and Diane E. Duffy. 1989. The
Statistical Analysis of Discrete Data. Springer-
Verlag, New York.
Helmuth Späth. 1985. Cluster Dissection and Anal-
ysis: Theory, FORTRAN Programs, Examples.
Ellis Horwood, Chichester, West Sussex, England.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews
Peter D. Turney
Institute for Information Technology
National Research Council of Canada
Ottawa, Ontario, Canada, K1A 0R6
[email protected]
Abstract
This paper presents a simple unsupervised
learning algorithm for classifying reviews
as recommended (thumbs up) or not rec-
ommended (thumbs down). The classifi-
cation of a review is predicted by the
average semantic orientation of the
phrases in the review that contain adjec-
tives or adverbs. A phrase has a positive
semantic orientation when it has good as-
sociations (e.g., “subtle nuances”) and a
negative semantic orientation when it has
bad associations (e.g., “very cavalier”). In
this paper, the semantic orientation of a
phrase is calculated as the mutual infor-
mation between the given phrase and the
word “excellent” minus the mutual
information between the given phrase and
the word “poor”. A review is classified as
recommended if the average semantic ori-
entation of its phrases is positive. The al-
gorithm achieves an average accuracy of
74% when evaluated on 410 reviews from
Epinions, sampled from four different
domains (reviews of automobiles, banks,
movies, and travel destinations). The ac-
curacy ranges from 84% for automobile
reviews to 66% for movie reviews.
1 Introduction
If you are considering a vacation in Akumal, Mex-
ico, you might go to a search engine and enter the
query “Akumal travel review”. However, in this
case, Google1 reports about 5,000 matches. It
would be useful to know what fraction of these
matches recommend Akumal as a travel destina-
tion. With an algorithm for automatically classify-
ing a review as “thumbs up” or “thumbs down”, it
would be possible for a search engine to report
such summary statistics. This is the motivation for
the research described here. Other potential appli-
cations include recognizing “flames” (abusive
newsgroup messages) (Spertus, 1997) and develop-
ing new kinds of search tools (Hearst, 1992).
In this paper, I present a simple unsupervised
learning algorithm for classifying a review as rec-
ommended or not recommended. The algorithm
takes a written review as input and produces a
classification as output. The first step is to use a
part-of-speech tagger to identify phrases in the in-
put text that contain adjectives or adverbs (Brill,
1994). The second step is to estimate the semantic
orientation of each extracted phrase (Hatzivassi-
loglou & McKeown, 1997). A phrase has a posi-
tive semantic orientation when it has good
associations (e.g., “romantic ambience”) and a
negative semantic orientation when it has bad as-
sociations (e.g., “horrific events”). The third step is
to assign the given review to a class, recommended
or not recommended, based on the average seman-
tic orientation of the phrases extracted from the re-
view. If the average is positive, the prediction is
that the review recommends the item it discusses.
Otherwise, the prediction is that the item is not
recommended.
The PMI-IR algorithm is employed to estimate
the semantic orientation of a phrase (Turney,
2001). PMI-IR uses Pointwise Mutual Information
(PMI) and Information Retrieval (IR) to measure
the similarity of pairs of words or phrases. The se-
1 http://www.google.com
mantic orientation of a given phrase is calculated
by comparing its similarity to a positive reference
word (“excellent”) with its similarity to a negative
reference word (“poor”). More specifically, a
phrase is assigned a numerical rating by taking the
mutual information between the given phrase and
the word “excellent” and subtracting the mutual
information between the given phrase and the word
“poor”. In addition to determining the direction of
the phrase’s semantic orientation (positive or nega-
tive, based on the sign of the rating), this numerical
rating also indicates the strength of the semantic
orientation (based on the magnitude of the num-
ber). The algorithm is presented in Section 2.
Hatzivassiloglou and McKeown (1997) have
also developed an algorithm for predicting seman-
tic orientation. Their algorithm performs well, but
it is designed for isolated adjectives, rather than
phrases containing adjectives or adverbs. This is
discussed in more detail in Section 3, along with
other related work.
The classification algorithm is evaluated on 410
reviews from Epinions2, randomly sampled from
four different domains: reviews of automobiles,
banks, movies, and travel destinations. Reviews at
Epinions are not written by professional writers;
any person with a Web browser can become a
member of Epinions and contribute a review. Each
of these 410 reviews was written by a different au-
thor. Of these reviews, 170 are not recommended
and the remaining 240 are recommended (these
classifications are given by the authors). Always
guessing the majority class would yield an accu-
racy of 59%. The algorithm achieves an average
accuracy of 74%, ranging from 84% for automo-
bile reviews to 66% for movie reviews. The ex-
perimental results are given in Section 4.
The interpretation of the experimental results,
the limitations of this work, and future work are
discussed in Section 5. Potential applications are
outlined in Section 6. Finally, conclusions are pre-
sented in Section 7.
2 Classifying Reviews
The first step of the algorithm is to extract phrases
containing adjectives or adverbs. Past work has
demonstrated that adjectives are good indicators of
subjective, evaluative sentences (Hatzivassiloglou
2 http://www.epinions.com
& Wiebe, 2000; Wiebe, 2000; Wiebe et al., 2001).
However, although an isolated adjective may indi-
cate subjectivity, there may be insufficient context
to determine semantic orientation. For example,
the adjective “unpredictable” may have a negative
orientation in an automotive review, in a phrase
such as “unpredictable steering”, but it could have
a positive orientation in a movie review, in a
phrase such as “unpredictable plot”. Therefore the
algorithm extracts two consecutive words, where
one member of the pair is an adjective or an adverb
and the second provides context.
First a part-of-speech tagger is applied to the
review (Brill, 1994).3 Two consecutive words are
extracted from the review if their tags conform to
any of the patterns in Table 1. The JJ tags indicate
adjectives, the NN tags are nouns, the RB tags are
adverbs, and the VB tags are verbs.4 The second
pattern, for example, means that two consecutive
words are extracted if the first word is an adverb
and the second word is an adjective, but the third
word (which is not extracted) cannot be a noun.
NNP and NNPS (singular and plural proper nouns)
are avoided, so that the names of the items in the
review cannot influence the classification.
Table 1. Patterns of tags for extracting two-word
phrases from reviews.

     First Word           Second Word            Third Word (Not Extracted)
  1. JJ                   NN or NNS              anything
  2. RB, RBR, or RBS      JJ                     not NN nor NNS
  3. JJ                   JJ                     not NN nor NNS
  4. NN or NNS            JJ                     not NN nor NNS
  5. RB, RBR, or RBS      VB, VBD, VBN, or VBG   anything
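The patterns in Table 1 can be sketched directly in code. The sketch below is my own (not the paper's implementation) and assumes the review has already been tagged; the example sentence and its tags are invented. Note how the third-word restriction correctly blocks “very talented” when a noun follows, exactly as pattern 2 requires.

```python
# Sketch of the two-word phrase extraction step from Table 1.
# Input: a list of (word, Penn-Treebank-tag) pairs.

PATTERNS = [
    # (first-tag set, second-tag set, forbidden third-tag set)
    ({"JJ"}, {"NN", "NNS"}, set()),                               # pattern 1
    ({"RB", "RBR", "RBS"}, {"JJ"}, {"NN", "NNS"}),                # pattern 2
    ({"JJ"}, {"JJ"}, {"NN", "NNS"}),                              # pattern 3
    ({"NN", "NNS"}, {"JJ"}, {"NN", "NNS"}),                       # pattern 4
    ({"RB", "RBR", "RBS"}, {"VB", "VBD", "VBN", "VBG"}, set()),   # pattern 5
]

def extract_phrases(tagged):
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # The third word is checked but never extracted.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, not_third in PATTERNS:
            if t1 in first and t2 in second and t3 not in not_third:
                phrases.append(w1 + " " + w2)
                break
    return phrases

tagged = [("the", "DT"), ("very", "RB"), ("talented", "JJ"),
          ("actor", "NN"), ("gave", "VBD"), ("a", "DT"),
          ("romantic", "JJ"), ("ambience", "NN")]
print(extract_phrases(tagged))  # ['talented actor', 'romantic ambience']
```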
The second step is to estimate the semantic ori-
entation of the extracted phrases, using the PMI-IR
algorithm. This algorithm uses mutual information
as a measure of the strength of semantic associa-
tion between two words (Church & Hanks, 1989).
PMI-IR has been empirically evaluated using 80
synonym test questions from the Test of English as
a Foreign Language (TOEFL), obtaining a score of
74% (Turney, 2001). For comparison, Latent Se-
mantic Analysis (LSA), another statistical measure
of word association, attains a score of 64% on the
3 http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
4 See Santorini (1995) for a complete description of the tags.
same 80 TOEFL questions (Landauer & Dumais,
1997).
The Pointwise Mutual Information (PMI) be-
tween two words, word1 and word2, is defined as
follows (Church & Hanks, 1989):
PMI(word1, word2) = log2 [ p(word1 & word2) / ( p(word1) p(word2) ) ]   (1)
Here, p(word1 & word2) is the probability that
word1 and word2 co-occur. If the words are statisti-
cally independent, then the probability that they
co-occur is given by the product p(word1)
p(word2). The ratio between p(word1 & word2) and
p(word1) p(word2) is thus a measure of the degree
of statistical dependence between the words. The
log of this ratio is the amount of information that
we acquire about the presence of one of the words
when we observe the other.
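Equation (1) can be illustrated with a toy corpus in place of a search index. Everything below (the four miniature “documents” and the counting scheme) is invented for illustration; PMI-IR itself estimates these probabilities from search-engine hit counts rather than a local corpus.

```python
import math

# Toy illustration of Equation (1): PMI from document co-occurrence counts.

docs = [
    {"excellent", "service"},
    {"excellent", "food", "service"},
    {"poor", "service"},
    {"food", "poor"},
]

def p(*words):
    # Fraction of documents containing all the given words.
    return sum(all(w in d for w in words) for d in docs) / len(docs)

def pmi(w1, w2):
    return math.log2(p(w1, w2) / (p(w1) * p(w2)))

# "excellent" and "service" co-occur more often than chance -> positive PMI.
print(pmi("excellent", "service"))
```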
The Semantic Orientation (SO) of a phrase,
phrase, is calculated here as follows:
SO(phrase) = PMI(phrase, “excellent”) − PMI(phrase, “poor”)   (2)
The reference words “excellent” and “poor” were
chosen because, in the five star review rating sys-
tem, it is common to define one star as “poor” and
five stars as “excellent”. SO is positive when
phrase is more strongly associated with “excellent”
and negative when phrase is more strongly associ-
ated with “poor”.
PMI-IR estimates PMI by issuing queries to a
search engine (hence the IR in PMI-IR) and noting
the number of hits (matching documents). The fol-
lowing experiments use the AltaVista Advanced
Search engine5, which indexes approximately 350
million web pages (counting only those pages that
are in English). I chose AltaVista because it has a
NEAR operator. The AltaVista NEAR operator
constrains the search to documents that contain the
words within ten words of one another, in either
order. Previous work has shown that NEAR per-
forms better than AND when measuring the
strength of semantic association between words
(Turney, 2001).
Let hits(query) be the number of hits returned,
given the query query. The following estimate of
SO can be derived from equations (1) and (2) with
5 http://www.altavista.com/sites/search/adv
some minor algebraic manipulation, if co-
occurrence is interpreted as NEAR:
SO(phrase) = log2 [ ( hits(phrase NEAR “excellent”) hits(“poor”) ) /
                    ( hits(phrase NEAR “poor”) hits(“excellent”) ) ]   (3)
Equation (3) is a log-odds ratio (Agresti, 1996).
To avoid division by zero, I added 0.01 to the hits.
I also skipped phrase when both hits(phrase
NEAR “excellent”) and hits(phrase NEAR
“poor”) were (simultaneously) less than four.
These numbers (0.01 and 4) were arbitrarily cho-
sen. To eliminate any possible influence from the
testing data, I added “AND (NOT host:epinions)”
to every query, which tells AltaVista not to include
the Epinions Web site in its searches.
The third step is to calculate the average seman-
tic orientation of the phrases in the given review
and classify the review as recommended if the av-
erage is positive and otherwise not recommended.
Table 2 shows an example for a recommended
review and Table 3 shows an example for a not
recommended review. Both are reviews of the
Bank of America. Both are in the collection of 410
reviews from Epinions that are used in the experi-
ments in Section 4.
Table 2. An example of the processing of a review that
the author has classified as recommended.6

Extracted Phrase         Part-of-Speech Tags   Semantic Orientation
online experience        JJ NN                  2.253
low fees                 JJ NNS                 0.333
local branch             JJ NN                  0.421
small part               JJ NN                  0.053
online service           JJ NN                  2.780
printable version        JJ NN                 -0.705
direct deposit           JJ NN                  1.288
well other               RB JJ                  0.237
inconveniently located   RB VBN                -1.541
other bank               JJ NN                 -0.850
true service             JJ NN                 -0.732
Average Semantic Orientation                    0.322
6 The semantic orientation in the following tables is calculated
using the natural logarithm (base e), rather than base 2. The
natural log is more common in the literature on log-odds ratio.
Since all logs are equivalent up to a constant factor, it makes
no difference for the algorithm.
Table 3. An example of the processing of a review that
the author has classified as not recommended.

Extracted Phrase         Part-of-Speech Tags   Semantic Orientation
little difference        JJ NN                 -1.615
clever tricks            JJ NNS                -0.040
programs such            NNS JJ                 0.117
possible moment          JJ NN                 -0.668
unethical practices      JJ NNS                -8.484
low funds                JJ NNS                -6.843
old man                  JJ NN                 -2.566
other problems           JJ NNS                -2.748
probably wondering       RB VBG                -1.830
virtual monopoly         JJ NN                 -2.050
other bank               JJ NN                 -0.850
extra day                JJ NN                 -0.286
direct deposits          JJ NNS                 5.771
online web               JJ NN                  1.936
cool thing               JJ NN                  0.395
very handy               RB JJ                  1.349
lesser evil              RBR JJ                -2.288
Average Semantic Orientation                   -1.218
3 Related Work
This work is most closely related to Hatzivassi-
loglou and McKeown’s (1997) work on predicting
the semantic orientation of adjectives. They note
that there are linguistic constraints on the semantic
orientations of adjectives in conjunctions. As an
example, they present the following three sen-
tences (Hatzivassiloglou & McKeown, 1997):
1. The tax proposal was simple and well-
received by the public.
2. The tax proposal was simplistic but well-
received by the public.
3. (*) The tax proposal was simplistic and
well-received by the public.
The third sentence is incorrect, because we use
“and” with adjectives that have the same semantic
orientation (“simple” and “well-received” are both
positive), but we use “but” with adjectives that
have different semantic orientations (“simplistic”
is negative).
Hatzivassiloglou and McKeown (1997) use a
four-step supervised learning algorithm to infer the
semantic orientation of adjectives from constraints
on conjunctions:
1. All conjunctions of adjectives are extracted
from the given corpus.
2. A supervised learning algorithm combines
multiple sources of evidence to label pairs of
adjectives as having the same semantic orienta-
tion or different semantic orientations. The re-
sult is a graph where the nodes are adjectives
and links indicate sameness or difference of
semantic orientation.
3. A clustering algorithm processes the graph
structure to produce two subsets of adjectives,
such that links across the two subsets are
mainly different-orientation links, and links in-
side a subset are mainly same-orientation links.
4. Since it is known that positive adjectives
tend to be used more frequently than negative
adjectives, the cluster with the higher average
frequency is classified as having positive se-
mantic orientation.
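The input and goal of steps 2 and 3 can be illustrated with a toy example. The sketch below is not the authors' clustering algorithm (the real problem is NP-hard and needs an iterative method); it simply searches exhaustively for the two-way split that violates the fewest same/different links. The links themselves are invented here.

```python
from itertools import product

# Toy illustration of step 3: split adjectives into two groups so that
# "same"-orientation links stay inside a group and "diff" links cross.
links = [  # (adjective, adjective, "same" or "diff") -- invented examples
    ("simple", "well-received", "same"),
    ("simplistic", "well-received", "diff"),
    ("simple", "simplistic", "diff"),
    ("poor", "simplistic", "same"),
]
words = sorted({w for a, b, _ in links for w in (a, b)})

def violations(assign):
    bad = 0
    for a, b, kind in links:
        same_side = assign[a] == assign[b]
        bad += (kind == "same") != same_side
    return bad

# Exhaustive search over all 2^n assignments (fine for a toy example).
best = min((dict(zip(words, bits))
            for bits in product([0, 1], repeat=len(words))),
           key=violations)
groups = [sorted(w for w in words if best[w] == g) for g in (0, 1)]
print(sorted(groups))  # [['poor', 'simplistic'], ['simple', 'well-received']]
```

Step 4 would then label whichever group has the higher average corpus frequency as the positive one.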
This algorithm classifies adjectives with accuracies
ranging from 78% to 92%, depending on the
amount of training data that is available. The algo-
rithm can go beyond a binary positive-negative dis-
tinction, because the clustering algorithm (step 3
above) can produce a “goodness-of-fit” measure
that indicates how well an adjective fits in its as-
signed cluster.
Although they do not consider the task of clas-
sifying reviews, it seems their algorithm could be
plugged into the classification algorithm presented
in Section 2, where it would replace PMI-IR and
equation (3) in the second step. However, PMI-IR
is conceptually simpler, easier to implement, and it
can handle phrases and adverbs, in addition to iso-
lated adjectives.
As far as I know, the only prior published work
on the task of classifying reviews as thumbs up or
down is Tong’s (2001) system for generating sen-
timent timelines. This system tracks online discus-
sions about movies and displays a plot of the
number of positive sentiment and negative senti-
ment messages over time. Messages are classified
by looking for specific phrases that indicate the
sentiment of the author towards the movie (e.g.,
“great acting”, “wonderful visuals”, “terrible
score”, “uneven editing”). Each phrase must be
manually added to a special lexicon and manually
tagged as indicating positive or negative sentiment.
The lexicon is specific to the domain (e.g., movies)
and must be built anew for each new domain. The
company Mindfuleye7 offers a technology called
Lexant™ that appears similar to Tong’s (2001)
system.
Other related work is concerned with determin-
ing subjectivity (Hatzivassiloglou & Wiebe, 2000;
Wiebe, 2000; Wiebe et al., 2001). The task is to
distinguish sentences that present opinions and
evaluations from sentences that objectively present
factual information (Wiebe, 2000). Wiebe et al.
(2001) list a variety of potential applications for
automated subjectivity tagging, such as recogniz-
ing “flames” (Spertus, 1997), classifying email,
recognizing speaker role in radio broadcasts, and
mining reviews. In several of these applications,
the first step is to recognize that the text is subjec-
tive and then the natural second step is to deter-
mine the semantic orientation of the subjective
text. For example, a flame detector cannot merely
detect that a newsgroup message is subjective, it
must further detect that the message has a negative
semantic orientation; otherwise a message of praise
could be classified as a flame.
Hearst (1992) observes that most search en-
gines focus on finding documents on a given topic,
but do not allow the user to specify the directional-
ity of the documents (e.g., is the author in favor of,
neutral, or opposed to the event or item discussed
in the document?). The directionality of a docu-
ment is determined by its deep argumentative
structure, rather than a shallow analysis of its ad-
jectives. Sentences are interpreted metaphorically
in terms of agents exerting force, resisting force,
and overcoming resistance. It seems likely that
there could be some benefit to combining shallow
and deep analysis of the text.
4 Experiments
Table 4 describes the 410 reviews from Epinions
that were used in the experiments. 170 (41%) of
the reviews are not recommended and the remain-
ing 240 (59%) are recommended. Always guessing
the majority class would yield an accuracy of 59%.
The third column shows the average number of
phrases that were extracted from the reviews.
Table 5 shows the experimental results. Except
for the travel reviews, there is surprisingly little
variation in the accuracy within a domain. In addi-
7 http://www.mindfuleye.com/
tion to recommended and not recommended, Epin-
ions reviews are classified using the five star rating
system. The third column shows the correlation be-
tween the average semantic orientation and the
number of stars assigned by the author of the re-
view. The results show a strong positive correla-
tion between the average semantic orientation and
the author’s rating out of five stars.
Table 4. A summary of the corpus of reviews.

Domain of Review        Number of Reviews   Average Phrases per Review
Automobiles                    75                 20.87
  Honda Accord                 37                 18.78
  Volkswagen Jetta             38                 22.89
Banks                         120                 18.52
  Bank of America              60                 22.02
  Washington Mutual            60                 15.02
Movies                        120                 29.13
  The Matrix                   60                 19.08
  Pearl Harbor                 60                 39.17
Travel Destinations            95                 35.54
  Cancun                       59                 30.02
  Puerto Vallarta              36                 44.58
All                           410                 26.00
Table 5. The accuracy of the classification and the cor-
relation of the semantic orientation with the star rating.
Domain of Review Accuracy Correlation
Automobiles 84.00 % 0.4618
Honda Accord 83.78 % 0.2721
Volkswagen Jetta 84.21 % 0.6299
Banks 80.00 % 0.6167
Bank of America 78.33 % 0.6423
Washington Mutual 81.67 % 0.5896
Movies 65.83 % 0.3608
The Matrix 66.67 % 0.3811
Pearl Harbor 65.00 % 0.2907
Travel Destinations 70.53 % 0.4155
Cancun 64.41 % 0.4194
Puerto Vallarta 80.56 % 0.1447
All 74.39 % 0.5174
5 Discussion of Results
A natural question, given the preceding results, is
what makes movie reviews hard to classify? Table
6 shows that classification by the average SO tends
to err on the side of guessing that a review is not
recommended, when it is actually recommended.
This suggests the hypothesis that a good movie
will often contain unpleasant scenes (e.g., violence,
death, mayhem), and a recommended movie re-
view may thus have its average semantic orienta-
tion reduced if it contains descriptions of these un-
pleasant scenes. However, if we add a constant
value to the average SO of the movie reviews, to
compensate for this bias, the accuracy does not
improve. This suggests that, just as positive re-
views mention unpleasant things, so negative re-
views often mention pleasant scenes.
Table 6. The confusion matrix for movie classifications.

                               Author’s Classification
Average Semantic Orientation   Thumbs Up   Thumbs Down       Sum
Positive                        28.33 %      12.50 %       40.83 %
Negative                        21.67 %      37.50 %       59.17 %
Sum                             50.00 %      50.00 %      100.00 %
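As a quick consistency check (my own, not from the paper), the diagonal of Table 6, i.e. the correctly classified reviews, should reproduce the 65.83% movie accuracy reported in Table 5:

```python
# Diagonal of the Table 6 confusion matrix = movie classification accuracy.
confusion = {("positive", "up"): 28.33, ("positive", "down"): 12.50,
             ("negative", "up"): 21.67, ("negative", "down"): 37.50}
accuracy = confusion[("positive", "up")] + confusion[("negative", "down")]
print(accuracy)
```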
Table 7 shows some examples that lend support
to this hypothesis. For example, the phrase “ more
evil” does have negative connotations, thus an SO
of -4.384 is appropriate, but an evil character does
not make a bad movie. The difficulty with movie
reviews is that there are two aspects to a movie, the
events and actors in the movie (the elements of the
movie), and the style and art of the movie (the
movie as a gestalt; a unified whole). This is likely
also the explanation for the lower accuracy of the
Cancun reviews: good beaches do not necessarily
add up to a good vacation. On the other hand, good
automotive parts usually do add up to a good
automobile and good banking services add up to a
good bank. It is not clear how to address this issue.
Future work might look at whether it is possible to
tag sentences as discussing elements or wholes.
Another area for future work is to empirically
compare PMI-IR and the algorithm of Hatzivassi-
loglou and McKeown (1997). Although their algo-
rithm does not readily extend to two-word phrases,
I have not yet demonstrated that two-word phrases
are necessary for accurate classification of reviews.
On the other hand, it would be interesting to evalu-
ate PMI-IR on the collection of 1,336 hand-labeled
adjectives that were used in the experiments of
Hatzivassiloglou and McKeown (1997). A related
question for future work is the relationship of ac-
curacy of the estimation of semantic orientation at
the level of individual phrases to accuracy of re-
view classification. Since the review classification
is based on an average, it might be quite resistant
to noise in the SO estimate for individual phrases.
But it is possible that a better SO estimator could
produce significantly better classifications.
Table 7. Sample phrases from misclassified reviews.

Movie:               The Matrix
Author’s Rating:     recommended (5 stars)
Average SO:          -0.219 (not recommended)
Sample Phrase:       more evil [RBR JJ]
SO of Sample Phrase: -4.384
Context of Sample Phrase: The slow, methodical way he
    spoke. I loved it! It made him seem more arrogant
    and even more evil.

Movie:               Pearl Harbor
Author’s Rating:     recommended (5 stars)
Average SO:          -0.378 (not recommended)
Sample Phrase:       sick feeling [JJ NN]
SO of Sample Phrase: -8.308
Context of Sample Phrase: During this period I had a sick
    feeling, knowing what was coming, knowing what was
    part of our history.

Movie:               The Matrix
Author’s Rating:     not recommended (2 stars)
Average SO:          0.177 (recommended)
Sample Phrase:       very talented [RB JJ]
SO of Sample Phrase: 1.992
Context of Sample Phrase: Well as usual Keanu Reeves is
    nothing special, but surprisingly, the very talented
    Laurence Fishbourne is not so good either, I was
    surprised.

Movie:               Pearl Harbor
Author’s Rating:     not recommended (3 stars)
Average SO:          0.015 (recommended)
Sample Phrase:       blue skies [JJ NNS]
SO of Sample Phrase: 1.263
Context of Sample Phrase: Anyone who saw the trailer in
    the theater over the course of the last year will never
    forget the images of Japanese war planes swooping out
    of the blue skies, flying past the children playing
    baseball, or the truly remarkable shot of a bomb
    falling from an enemy plane into the deck of the USS
    Arizona.
Equation (3) is a very simple estimator of se-
mantic orientation. It might benefit from more so-
phisticated statistical analysis (Agresti, 1996). One
possibility is to apply a statistical significance test
to each estimated SO. There is a large statistical
literature on the log-odds ratio, which might lead
to improved results on this task.
This paper has focused on unsupervised classification, but average semantic orientation could be supplemented by other features in a supervised classification system. The other features could be based on the presence or absence of specific words, as is common in most text classification work. This could yield higher accuracies, but the intent here was to study this one feature in isolation, to simplify the analysis, before combining it with other features.
Table 5 shows a high correlation between the average semantic orientation and the star rating of a review. I plan to experiment with ordinal classification of reviews in the five-star rating system, using the algorithm of Frank and Hall (2001). For ordinal classification, the average semantic orientation would be supplemented with other features in a supervised classification system.
A limitation of PMI-IR is the time required to send queries to AltaVista. Inspection of Equation (3) shows that it takes four queries to calculate the semantic orientation of a phrase. However, I cached all query results, and since there is no need to recalculate hits("poor") and hits("excellent") for every phrase, each phrase requires an average of slightly less than two queries. As a courtesy to AltaVista, I used a five-second delay between queries.[8] The 410 reviews yielded 10,658 phrases, so the total time required to process the corpus was roughly 106,580 seconds, or about 30 hours.
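The caching scheme can be sketched as follows. Here `fetch` stands in for the live search-engine query (a hypothetical callable, since the actual AltaVista interface is not shown), and the courtesy delay is applied only when a query actually goes out to the engine.

```python
import time

class HitCounter:
    """Cache hit counts so repeated queries such as hits("poor") and
    hits("excellent") are answered from memory rather than re-fetched."""

    def __init__(self, fetch, delay=5.0):
        self.fetch = fetch          # callable: query string -> hit count
        self.delay = delay          # courtesy delay between live queries
        self.cache = {}
        self.live_queries = 0

    def hits(self, query):
        if query not in self.cache:
            if self.live_queries > 0:
                time.sleep(self.delay)   # pause only before a live query
            self.cache[query] = self.fetch(query)
            self.live_queries += 1
        return self.cache[query]
```

With this cache, hits("poor") and hits("excellent") are fetched once for the whole corpus, so each phrase costs roughly only its two NEAR queries, matching the "slightly less than two queries" figure above.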
This might appear to be a significant limitation, but extrapolation of current trends in computer memory capacity suggests that, in about ten years, the average desktop computer will be able to easily store and search AltaVista's 350 million Web pages. This will reduce the processing time to less than one second per review.
6 Applications
There are a variety of potential applications for
automated review rating. As mentioned in the introduction, one application is to provide summary statistics for search engines. Given the query "Akumal travel review", a search engine could report, "There are 5,000 hits, of which 80% are thumbs up and 20% are thumbs down." The search results could be sorted by average semantic orientation, so that the user could easily sample the most extreme reviews. Similarly, a search engine could allow the user to specify the topic and the rating of the desired reviews (Hearst, 1992).

[8] This line of research depends on the good will of the major search engines. For a discussion of the ethics of Web robots, see http://www.robotstxt.org/wc/robots.html. For query robots, the proposed extended standard for robot exclusion would be useful. See http://www.conman.org/people/spc/robots2.html.
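Such a report is a one-line aggregation once each hit has been rated; the sketch below is illustrative (the wording and integer rounding are my own, not from the paper).

```python
def thumbs_summary(labels):
    """Summarize per-review thumbs up / thumbs down labels for a result set."""
    n = len(labels)
    up = sum(1 for label in labels if label == "thumbs up")
    return (f"There are {n} hits, of which {100 * up // n}% are "
            f"thumbs up and {100 * (n - up) // n}% are thumbs down.")
```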
Preliminary experiments indicate that semantic orientation is also useful for summarization of reviews. A positive review could be summarized by picking out the sentence with the highest positive semantic orientation, and a negative review could be summarized by extracting the sentence with the lowest negative semantic orientation.
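Assuming each sentence has already been assigned an average SO (for instance, by averaging the SO of its extracted phrases), the extraction step is a simple argmax/argmin; the function name and interface here are illustrative.

```python
def summary_sentence(sentences, sentence_so, positive=True):
    """Pick the sentence with the most extreme semantic orientation.

    sentences:   list of sentence strings
    sentence_so: parallel list of average SO values per sentence
    """
    pick = max if positive else min
    best = pick(range(len(sentences)), key=lambda i: sentence_so[i])
    return sentences[best]
```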
Epinions asks its reviewers to provide a short description of pros and cons for the reviewed item. A pro/con summarizer could be evaluated by measuring the overlap between the reviewer's pros and cons and the phrases in the review that have the most extreme semantic orientation.
Another potential application is filtering "flames" for newsgroups (Spertus, 1997). There could be a threshold, such that a newsgroup message is held for verification by the human moderator when the semantic orientation of a phrase drops below the threshold. A related use might be a tool for helping academic referees when reviewing journal and conference papers. Ideally, referees are unbiased and objective, but sometimes their criticism can be unintentionally harsh. It might be possible to highlight passages in a draft referee's report where the choice of words should be modified towards a more neutral tone.
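The moderation threshold described above amounts to a single predicate over the phrase SO values of a message; a minimal sketch (the threshold value is illustrative and would be tuned per newsgroup):

```python
def hold_for_moderator(phrase_so_values, threshold=-5.0):
    """Flag a message for human review when any phrase's semantic
    orientation drops below the threshold."""
    return any(so < threshold for so in phrase_so_values)
```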
Tong's (2001) system for detecting and tracking opinions in on-line discussions could benefit from the use of a learning algorithm, instead of (or in addition to) a hand-built lexicon. With automated review rating (opinion rating), advertisers could track advertising campaigns, politicians could track public opinion, reporters could track public response to current events, stock traders could track financial opinions, and trend analyzers could track entertainment and technology trends.
7 Conclusions
This paper introduces a simple unsupervised learning algorithm for rating a review as thumbs up or down. The algorithm has three steps: (1) extract phrases containing adjectives or adverbs, (2) estimate the semantic orientation of each phrase, and (3) classify the review based on the average semantic orientation of the phrases. The core of the algorithm is the second step, which uses PMI-IR to calculate semantic orientation (Turney, 2001).
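Condensing the three steps into code, and assuming steps (1) and (2) have already produced a list of per-phrase SO values, the final classification (step 3) is just a sign test on the average:

```python
def classify_review(phrase_so_values):
    """Step (3): thumbs up if the average semantic orientation
    of the extracted phrases is positive, thumbs down otherwise."""
    average_so = sum(phrase_so_values) / len(phrase_so_values)
    return "thumbs up" if average_so > 0 else "thumbs down"
```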
In experiments with 410 reviews from Epinions, the algorithm attains an average accuracy of 74%. It appears that movie reviews are difficult to classify, because the whole is not necessarily the sum of the parts; thus the accuracy on movie reviews is about 66%. On the other hand, for banks and automobiles, it seems that the whole is the sum of the parts, and the accuracy is 80% to 84%. Travel reviews are an intermediate case.
Previous work on determining the semantic orientation of adjectives has used a complex algorithm that does not readily extend beyond isolated adjectives to adverbs or longer phrases (Hatzivassiloglou and McKeown, 1997). The simplicity of PMI-IR may encourage further work with semantic orientation.
The limitations of this work include the time required for queries and, for some applications, the level of accuracy that was achieved. The former difficulty will be eliminated by progress in hardware. The latter difficulty might be addressed by using semantic orientation combined with other features in a supervised classification algorithm.
Acknowledgements
Thanks to Joel Martin and Michael Littman for
helpful comments.
References
Agresti, A. 1996. An introduction to categorical data analysis. New York: Wiley.

Brill, E. 1994. Some advances in transformation-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (pp. 722-727). Menlo Park, CA: AAAI Press.

Church, K.W., & Hanks, P. 1989. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Conference of the ACL (pp. 76-83). New Brunswick, NJ: ACL.

Frank, E., & Hall, M. 2001. A simple approach to ordinal classification. Proceedings of the Twelfth European Conference on Machine Learning (pp. 145-156). Berlin: Springer-Verlag.

Hatzivassiloglou, V., & McKeown, K.R. 1997. Predicting the semantic orientation of adjectives. Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the European Chapter of the ACL (pp. 174-181). New Brunswick, NJ: ACL.

Hatzivassiloglou, V., & Wiebe, J.M. 2000. Effects of adjective orientation and gradability on sentence subjectivity. Proceedings of the 18th International Conference on Computational Linguistics. New Brunswick, NJ: ACL.

Hearst, M.A. 1992. Direction-based text interpretation as an information access refinement. In P. Jacobs (Ed.), Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Mahwah, NJ: Lawrence Erlbaum Associates.

Landauer, T.K., & Dumais, S.T. 1997. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Santorini, B. 1995. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing). Technical Report, Department of Computer and Information Science, University of Pennsylvania.

Spertus, E. 1997. Smokey: Automatic recognition of hostile messages. Proceedings of the Conference on Innovative Applications of Artificial Intelligence (pp. 1058-1065). Menlo Park, CA: AAAI Press.

Tong, R.M. 2001. An operational system for detecting and tracking opinions in on-line discussions. Working Notes of the ACM SIGIR 2001 Workshop on Operational Text Classification (pp. 1-6). New York, NY: ACM.

Turney, P.D. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (pp. 491-502). Berlin: Springer-Verlag.

Wiebe, J.M. 2000. Learning subjective adjectives from corpora. Proceedings of the 17th National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Wiebe, J.M., Bruce, R., Bell, M., Martin, M., & Wilson, T. 2001. A corpus study of evaluative and speculative language. Proceedings of the Second ACL SIG on Dialogue Workshop on Discourse and Dialogue. Aalborg, Denmark.

Accessible design: Minimum effort, maximum impact
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 

Predicting the Semantic Orientation of Adjectives

Predicting the Semantic Orientation of Adjectives

Vasileios Hatzivassiloglou and Kathleen R. McKeown
Department of Computer Science
450 Computer Science Building
Columbia University
New York, N.Y. 10027, USA
{vh, kathy}@cs.columbia.edu

Abstract

We identify and validate from a large corpus constraints from conjunctions on the positive or negative semantic orientation of the conjoined adjectives. A log-linear regression model uses these constraints to predict whether conjoined adjectives are of same or different orientations, achieving 82% accuracy in this task when each conjunction is considered independently. Combining the constraints across many adjectives, a clustering algorithm separates the adjectives into groups of different orientations, and finally, adjectives are labeled positive or negative. Evaluations on real
data and simulation experiments indicate high levels of performance: classification precision is more than 90% for adjectives that occur in a modest number of conjunctions in the corpus.

1 Introduction

The semantic orientation or polarity of a word indicates the direction the word deviates from the norm for its semantic group or lexical field (Lehrer, 1974). It also constrains the word's usage in the language (Lyons, 1977), due to its evaluative characteristics (Battistella, 1990). For example, some nearly synonymous words differ in orientation because one implies desirability and the other does not (e.g., simple versus simplistic). In linguistic constructs such as conjunctions, which impose constraints on the semantic orientation of their arguments (Anscombre and Ducrot, 1983; Elhadad and McKeown, 1990), the choices of arguments and connective are mutually constrained, as illustrated by:

The tax proposal was
    simple and well-received
    simplistic but well-received
    *simplistic and well-received
by the public.

In addition, almost all antonyms have different semantic orientations.¹ If we know that two words relate to the same property (for example, members of the same scalar group such as hot and cold) but have different orientations, we can usually infer that
they are antonyms. Given that semantically similar words can be identified automatically on the basis of distributional properties and linguistic cues (Brown et al., 1992; Pereira et al., 1993; Hatzivassiloglou and McKeown, 1993), identifying the semantic orientation of words would allow a system to further refine the retrieved semantic similarity relationships, extracting antonyms.

Unfortunately, dictionaries and similar sources (thesauri, WordNet (Miller et al., 1990)) do not include semantic orientation information.² Explicit links between antonyms and synonyms may also be lacking, particularly when they depend on the domain of discourse; for example, the opposition bear-bull appears only in stock market reports, where the two words take specialized meanings.

In this paper, we present and evaluate a method that automatically retrieves semantic orientation information using indirect information collected from a large corpus. Because the method relies on the corpus, it extracts domain-dependent information and automatically adapts to a new domain when the corpus is changed. Our method achieves high precision (more than 90%), and, while our focus to date has been on adjectives, it can be directly applied to other word classes. Ultimately, our goal is to use this method in a larger system to automatically identify antonyms and distinguish near synonyms.

¹ Exceptions include a small number of terms that are both negative from a pragmatic viewpoint and yet stand in an antonymic relationship; such terms frequently lexicalize two unwanted extremes, e.g., verbose-terse.
² Except implicitly, in the form of definitions and usage examples.

2 Overview of Our Approach

Our approach relies on an analysis of textual corpora that correlates linguistic features, or indicators, with semantic orientation. While no direct indicators of positive or negative semantic orientation have been proposed,³ we demonstrate that conjunctions between adjectives provide indirect information about orientation. For most connectives, the conjoined adjectives usually are of the same orientation: compare fair and legitimate and corrupt and brutal, which actually occur in our corpus, with *fair and brutal and *corrupt and legitimate (or the other cross-products of the above conjunctions), which are semantically anomalous. The situation is reversed for but, which usually connects two adjectives of different orientations.

The system identifies and uses this indirect information in the following stages:

1. All conjunctions of adjectives are extracted from the corpus along with relevant morphological relations.

2. A log-linear regression model combines information from different conjunctions to determine if each two conjoined adjectives are of same
or different orientation. The result is a graph with hypothesized same- or different-orientation links between adjectives.

3. A clustering algorithm separates the adjectives into two subsets of different orientation. It places as many words of same orientation as possible into the same subset.

4. The average frequencies in each group are compared and the group with the higher frequency is labeled as positive.

In the following sections, we first present the set of adjectives used for training and evaluation. We next validate our hypothesis that conjunctions constrain the orientation of conjoined adjectives and then describe the remaining three steps of the algorithm. After presenting our results and evaluation, we discuss simulation experiments that show how our method performs under different conditions of sparseness of data.

3 Data Collection

For our experiments, we use the 21 million word 1987 Wall Street Journal corpus,⁴ automatically annotated with part-of-speech tags using the PARTS tagger (Church, 1988).

In order to verify our hypothesis about the orientations of conjoined adjectives, and also to train and evaluate our subsequent algorithms, we need a

³ Certain words inflected with negative affixes (such as in- or un-) tend to be mostly negative, but this rule
applies only to a fraction of the negative words. Furthermore, there are words so inflected which have positive orientation, e.g., independent and unbiased.
⁴ Available from the ACL Data Collection Initiative as CD ROM 1.

Positive: adequate central clever famous intelligent remarkable reputed sensitive slender thriving
Negative: contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting

Figure 1: Randomly selected adjectives with positive and negative orientations.

set of adjectives with predetermined orientation labels. We constructed this set by taking all adjectives appearing in our corpus 20 times or more, then removing adjectives that have no orientation. These are typically members of groups of complementary, qualitative terms (Lyons, 1977), e.g., domestic or medical.

We then assigned an orientation label (either + or −) to each adjective, using an evaluative approach. The criterion was whether the use of this adjective ascribes in general a positive or negative quality to the modified item, making it better or worse than a similar unmodified item. We were unable to reach a unique label out of context for several adjectives which we removed from consideration; for example, cheap is positive if it is used as a synonym of inexpensive, but negative if it implies inferior quality.
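A candidate-selection pass of the kind described above can be sketched as follows; this is a minimal illustration (the function name, the Penn Treebank "JJ" tag convention, and the toy corpus are ours, not the authors' code):

```python
from collections import Counter

def select_candidates(tagged_tokens, min_count=20):
    """Collect adjectives occurring at least `min_count` times.

    `tagged_tokens` is an iterable of (word, pos) pairs, with
    adjectives tagged "JJ" as in the Penn Treebank tag set.
    """
    counts = Counter(w.lower() for w, pos in tagged_tokens if pos == "JJ")
    return {w for w, c in counts.items() if c >= min_count}

# Toy corpus: "simple" clears the threshold, "simplistic" does not.
corpus = [("simple", "JJ"), ("tax", "NN")] * 25 + [("simplistic", "JJ")]
print(select_candidates(corpus))  # {'simple'}
```

In the paper, the surviving adjectives were then hand-labeled; the filtering step only bounds the candidate set.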
The operations of selecting adjectives and assigning labels were performed before testing our conjunction hypothesis or implementing any other algorithms, to avoid any influence on our labels. The final set contained 1,336 adjectives (657 positive and 679 negative terms). Figure 1 shows randomly selected terms from this set.

To further validate our set of labeled adjectives, we subsequently asked four people to independently label a randomly drawn sample of 500 of these adjectives. They agreed with us that the positive/negative concept applies to 89.15% of these adjectives on average. For the adjectives where a positive or negative label was assigned by both us and the independent evaluators, the average agreement on the label was 97.38%. The average inter-reviewer agreement on labeled adjectives was 96.97%. These results are extremely significant statistically and compare favorably with validation studies performed for other tasks (e.g., sense disambiguation) in the past. They show that positive and negative orientation are objective properties that can be reliably determined by humans.

To extract conjunctions between adjectives, we used a two-level finite-state grammar, which covers complex modification patterns and noun-adjective apposition. Running this parser on the 21 million word corpus, we collected 13,426 conjunctions of adjectives, expanding to a total of 15,431 conjoined adjective pairs. After morphological transformations, the remaining 15,048 conjunction tokens involve 9,296 distinct pairs of conjoined adjectives (types). Each conjunction token is classified by the parser according to three variables: the conjunction used (and, or, but, either-or, or neither-nor), the type of modification (attributive, predicative, appositive, resultative), and the number of the modified noun (singular or plural).

Conjunction category               Conjunction      % same-         % same-         P-value
                                   types analyzed   orientation     orientation     (for types)
                                                    (types)         (tokens)
All conjunctions                   2,748            77.84%          72.39%          < 1·10⁻¹⁶
All and conjunctions               2,294            81.73%          78.07%          < 1·10⁻¹⁶
All or conjunctions                305              77.05%          60.97%          < 1·10⁻¹⁶
All but conjunctions               214              30.84%          25.94%          2.09·10⁻¹⁰
All attributive and conjunctions   1,077            80.04%          76.82%          < 1·10⁻¹⁶
All predicative and conjunctions   860              84.77%          84.54%          < 1·10⁻¹⁶
All appositive and conjunctions    30               70.00%          63.64%          0.04277

Table 1: Validation of our conjunction hypothesis. The P-value is the probability that similar extreme results would have been obtained if same- and different-orientation conjunction types were equally distributed.
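The P-values in Table 1 follow from an exact binomial test under the null hypothesis p = 0.5. A small sketch (our own illustration, not the authors' code) reproduces the appositive-and entry, where 21 of 30 pairs are same-orientation:

```python
from math import comb

def binom_two_sided(successes, n):
    """Exact two-sided binomial test for p = 0.5: doubles the one-sided
    tail probability of a result at least as extreme as the one observed."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Appositive-and row of Table 1: 21 of 30 pairs same-orientation.
print(round(binom_two_sided(21, 30), 5))  # 0.04277
```

For the larger categories the tail probabilities underflow any printable precision, which is why Table 1 reports them simply as < 1·10⁻¹⁶.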
4 Validation of the Conjunction Hypothesis

Using the three attributes extracted by the parser, we constructed a cross-classification of the conjunctions in a three-way table. We counted types and tokens of each conjoined pair that had both members in the set of pre-selected labeled adjectives discussed above; 2,748 (29.56%) of all conjoined pairs (types) and 4,024 (26.74%) of all conjunction occurrences (tokens) met this criterion. We augmented this table with marginal totals, arriving at 90 categories, each of which represents a triplet of attribute values, possibly with one or more "don't care" elements. We then measured the percentage of conjunctions in each category with adjectives of same or different orientations. Under the null hypothesis of same proportions of adjective pairs (types) of same and different orientation in a given category, the number of same- or different-orientation pairs follows a binomial distribution with p = 0.5 (Conover, 1980).

We show in Table 1 the results for several representative categories, and summarize all results below:

• Our conjunction hypothesis is validated overall and for almost all individual cases. The results are extremely significant statistically, except for a few cases where the sample is small.

• Aside from the use of but with adjectives of different orientations, there are, rather surprisingly, small differences in the behavior of conjunctions between linguistic environments (as represented by the three attributes). There are a few exceptions, e.g., appositive and conjunctions modifying plural nouns are evenly split between same and different orientation. But in these exceptional cases the sample is very small, and the observed behavior may be due to chance.

• Further analysis of different-orientation pairs in conjunctions other than but shows that conjoined antonyms are far more frequent than expected by chance, in agreement with (Justeson and Katz, 1991).

5 Prediction of Link Type

The analysis in the previous section suggests a baseline method for classifying links between adjectives: since 77.84% of all links from conjunctions indicate same orientation, we can achieve this level of performance by always guessing that a link is of the same-orientation type. However, we can improve performance by noting that conjunctions using but exhibit the opposite pattern, usually involving adjectives of different orientations. Thus, a revised but still simple rule predicts a different-orientation link if the two adjectives have been seen in a but conjunction, and a same-orientation link otherwise, assuming the two adjectives were seen connected by at least one conjunction.

Morphological relationships between adjectives also play a role. Adjectives related in form (e.g., adequate-inadequate or thoughtful-thoughtless) almost always have different semantic orientations. We implemented a morphological analyzer which matches adjectives related in this manner. This process is highly accurate, but unfortunately does not apply to many of the possible pairs: in our set of 1,336 labeled adjectives (891,780 possible pairs), 102 pairs are morphologically related; among them, 99 are of different orientation, yielding 97.06% accuracy for the morphology method. This information is orthogonal to that extracted from conjunctions: only 12 of the 102 morphologically related pairs have been observed in conjunctions in our corpus. Thus, we add to the predictions made from conjunctions the different-orientation links suggested by morphological relationships.

Prediction method                 Morphology   Accuracy on reported      Accuracy on reported          Overall
                                  used?        same-orientation links    different-orientation links   accuracy
Always predict same orientation   No           77.84%                    —                             77.84%
Always predict same orientation   Yes          78.18%                    97.06%                        78.86%
But rule                          No           81.81%                    69.16%                        80.82%
But rule                          Yes          82.20%                    78.16%                        81.75%
Log-linear model                  No           81.53%                    73.70%                        80.97%
Log-linear model                  Yes          82.00%                    82.44%                        82.05%

Table 2: Accuracy of several link prediction models.
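The simple rules of this section can be sketched as follows (the data structures and names are ours, for illustration only):

```python
def predict_link(pair, but_pairs, morph_pairs):
    """Classify an adjective pair as 'same' or 'different' orientation.

    but_pairs: pairs observed in at least one `but` conjunction.
    morph_pairs: pairs related by a morphological rule (e.g. the
    affixes in-/un-), which almost always signal opposite orientation.
    """
    if pair in morph_pairs or pair in but_pairs:
        return "different"
    return "same"  # the majority class for conjoined adjectives

but_pairs = {("simplistic", "well-received")}
morph_pairs = {("adequate", "inadequate")}
print(predict_link(("fair", "legitimate"), but_pairs, morph_pairs))      # same
print(predict_link(("adequate", "inadequate"), but_pairs, morph_pairs))  # different
```

The "always same" baseline is the special case where both sets are empty.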
We improve the accuracy of classifying links derived from conjunctions as same or different orientation with a log-linear regression model (Santner and Duffy, 1989), exploiting the differences between the various conjunction categories. This is a generalized linear model (McCullagh and Nelder, 1989) with a linear predictor

    η = wᵀx

where x is the vector of the observed counts in the various conjunction categories for the particular adjective pair we try to classify and w is a vector of weights to be learned during training. The response y is non-linearly related to η through the inverse logit function,

    y = e^η / (1 + e^η)

Note that y ∈ (0, 1), with each of these endpoints associated with one of the possible outcomes.

We have 90 possible predictor variables, 42 of which are linearly independent. Since using all the 42 independent predictors invites overfitting (Duda and Hart, 1973), we have investigated subsets of the full log-linear model for our data using the method of iterative stepwise refinement: starting with an initial model, variables are added or dropped if their contribution to the reduction or increase of the residual deviance compares favorably to the resulting loss or gain of residual degrees of freedom. This process led to the selection of nine predictor variables.
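The linear predictor and inverse logit above can be sketched directly; the weights and counts below are made-up illustrations, not the fitted model:

```python
from math import exp

def inverse_logit(eta):
    """Map the linear predictor eta = w.x to a response y in (0, 1)."""
    return exp(eta) / (1.0 + exp(eta))

def linear_predictor(w, x):
    """Dot product of learned weights and conjunction-category counts."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical weights and counts for one adjective pair:
w = [0.8, -1.5]     # e.g. an `and`-category weight and a `but`-category weight
x = [3, 0]          # the pair was seen in three `and` conjunctions, no `but`
y = inverse_logit(linear_predictor(w, x))
print(round(y, 3))  # 0.917 -> confident same-orientation prediction
```

With y near 1 read as same-orientation and y near 0 as different-orientation, the model's output doubles as a confidence score, which the clustering phase exploits.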
We evaluated the three prediction models discussed above with and without the secondary source of morphology relations. For the log-linear model, we repeatedly partitioned our data into equally sized training and testing sets, estimated the weights on the training set, and scored the model's performance on the testing set, averaging the resulting scores.⁵

Table 2 shows the results of these analyses. Although the log-linear model offers only a small improvement in pair classification over the simpler but prediction rule, it confers the important advantage of rating each prediction between 0 and 1. We make extensive use of this in the next phase of our algorithm.

⁵ When morphology is to be used as a supplementary predictor, we remove the morphologically related pairs from the training and testing sets.

α   Number of adjectives    Number of links       Average number of          Accuracy   Ratio of average
    in test set (|A_α|)     in test set (|L_α|)   links for each adjective              group frequencies
2   730                     2,568                 7.04                       78.08%     1.8699
3   516                     2,159                 8.37                       82.56%     1.9235
4   369                     1,742                 9.44                       87.26%     1.3486
5   236                     1,238                 10.49                      92.37%     1.4040

Table 3: Evaluation of the adjective classification and labeling methods.

6 Finding Groups of Same-Oriented Adjectives

The third phase of our method assigns the adjectives into groups, placing adjectives of the same (but unknown) orientation in the same group. Each pair of adjectives has an associated dissimilarity value between 0 and 1; adjectives connected by same-orientation links have low dissimilarities, and conversely, different-orientation links result in high dissimilarities. Adjective pairs with no connecting links are assigned the neutral dissimilarity 0.5.

The baseline and but methods make qualitative distinctions only (i.e., same-orientation, different-orientation, or unknown); for them, we define dissimilarity for same-orientation links as one minus the probability that such a classification link is correct and dissimilarity for different-orientation links as the probability that such a classification is correct. These probabilities are estimated from separate training data. Note that for these prediction models, dissimilarities are identical for similarly classified links. The log-linear model, on the other hand, offers an estimate of how good each prediction is, since it produces a value y between 0 and 1. We construct the model so that 1 corresponds to same-orientation, and define dissimilarity as one minus the produced value.

Same- and different-orientation links between adjectives form a graph. To partition the graph nodes into subsets of the same orientation, we employ an iterative optimization procedure on each connected component, based on the exchange method, a non-hierarchical clustering algorithm (Späth, 1985). We define an objective function Φ scoring each possible partition P of the adjectives into two subgroups C₁ and C₂ as

    Φ(P) = Σᵢ₌₁² (1/|Cᵢ|) Σ_{x,y ∈ Cᵢ} d(x, y)

where |Cᵢ| stands for the cardinality of cluster i, and d(x, y) is the dissimilarity between adjectives x and y. We want to select the partition P_min that minimizes Φ, subject to the additional constraint that for each adjective x in a cluster C,

    (1/(|C| − 1)) Σ_{y ∈ C} d(x, y) < (1/|C̄|) Σ_{y ∈ C̄} d(x, y)    (1)

where C̄ is the complement of cluster C, i.e., the other member of the partition. This constraint, based on Rousseeuw's (1987) silhouettes, helps correct wrong cluster assignments.

To find P_min, we first construct a random partition of the adjectives, then locate the adjective that will most reduce the objective function if it is moved from its current cluster. We move this adjective and proceed with the next iteration until no movements can improve the objective function. At the final iteration, the cluster assignment of any adjective that violates constraint (1) is changed. This is a steepest-descent hill-climbing method, and thus is guaranteed to converge. However, it will in general find a local minimum rather than the global one; the problem is NP-complete (Garey and Johnson, 1979). We can arbitrarily increase the probability of finding the
globally optimal solution by repeatedly running the algorithm with different starting partitions.

7 Labeling the Clusters as Positive or Negative

The clustering algorithm separates each component of the graph into two groups of adjectives, but does not actually label the adjectives as positive or negative. To accomplish that, we use a simple criterion that applies only to pairs or groups of words of opposite orientation. We have previously shown (Hatzivassiloglou and McKeown, 1995) that in oppositions of gradable adjectives where one member is semantically unmarked, the unmarked member is the most frequent one about 81% of the time. This is relevant to our task because semantic markedness exhibits a strong correlation with orientation, the unmarked member almost always having positive orientation (Lehrer, 1985; Battistella, 1990).

We compute the average frequency of the words in each group, expecting the group with higher average frequency to contain the positive terms. This aggregation operation increases the precision of the labeling dramatically since indicators for many pairs of words are combined, even when some of the words are incorrectly assigned to their group.

8 Results and Evaluation

Since graph connectivity affects performance, we devised a method of selecting test sets that makes this dependence explicit. Note that the graph density is
largely a function of corpus size, and thus can be increased by adding more data. Nevertheless, we report results on sparser test sets to show how our algorithm scales up.

We separated our sets of adjectives A (containing 1,336 adjectives) and conjunction- and morphology-based links L (containing 2,838 links) into training and testing groups by selecting, for several values of the parameter α, the maximal subset of A, A_α, which includes an adjective x if and only if there exist at least α links from L between x and other elements of A_α. This operation in turn defines a subset of L, L_α, which includes all links between members of A_α. We train our log-linear model on L − L_α (excluding links between morphologically related adjectives), compute predictions and dissimilarities for the links in L_α, and use these to classify and label the adjectives in A_α. α must be at least 2, since we need to leave some links for training.

Table 3 shows the results of these experiments for α = 2 to 5. Our method produced the correct classification from 78% of the time on the sparsest test set up to more than 92% of the time when a higher number of links was present. Moreover, in all cases, the ratio of the two group frequencies correctly identified the positive subgroup. These results are extremely significant statistically (P-value less than 10⁻¹⁶) when compared with the baseline method of randomly assigning orientations to adjectives, or the baseline method of always predicting the most frequent (for types) category (50.82% of the adjectives in our collection are classified as negative). Figure 2 shows some of the adjectives in set A₄ and their classifications.
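The exchange-method partitioning (Section 6) and the frequency-based labeling (Section 7) can be sketched together on a toy dissimilarity matrix. This illustration omits the silhouette-based correction of constraint (1), and the words, dissimilarities, and frequencies are invented, not the paper's data:

```python
import random

def phi(partition, d):
    """Objective Phi: sum over clusters of average intra-cluster dissimilarity."""
    return sum(sum(d[x][y] for x in c for y in c) / len(c)
               for c in partition if c)

def exchange_cluster(words, d, seed=0):
    """Steepest-descent exchange method: repeatedly move the single
    adjective whose reassignment most reduces the objective."""
    rng = random.Random(seed)
    part = [set(), set()]
    for w in words:                       # random starting partition
        part[rng.randrange(2)].add(w)
    while True:
        base, best = phi(part, d), None
        for i, j in ((0, 1), (1, 0)):
            for w in list(part[i]):
                part[i].remove(w); part[j].add(w)   # try the move
                score = phi(part, d)
                part[j].remove(w); part[i].add(w)   # undo it
                if score < base - 1e-12 and (best is None or score < best[0]):
                    best = (score, w, i, j)
        if best is None:
            return part                   # no move improves Phi: local minimum
        _, w, i, j = best
        part[i].remove(w); part[j].add(w)

def label_positive(partition, freq):
    """Label the group with the higher average corpus frequency as positive."""
    avg = [sum(freq[w] for w in c) / len(c) for c in partition]
    return partition[avg.index(max(avg))]

# Toy graph: a-b and c-d are same-orientation (low dissimilarity),
# cross pairs are different-orientation (high dissimilarity).
words = ["a", "b", "c", "d"]
low = {("a", "b"): 0.1, ("c", "d"): 0.2}
d = {x: {y: 0.0 if x == y else low.get((x, y), low.get((y, x), 0.9))
         for y in words} for x in words}
groups = exchange_cluster(words, d)
freq = {"a": 50, "b": 40, "c": 5, "d": 5}
print(sorted(label_positive(groups, freq)))  # ['a', 'b']
```

In the paper this procedure is run once per connected component of the link graph, and restarted from several random partitions to escape local minima.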
Classified as positive: bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty

Classified as negative: ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful

Figure 2: Sample retrieved classifications of adjectives from set A₄. Correctly matched adjectives are shown in bold.

9 Graph Connectivity and Performance

A strong point of our method is that decisions on individual words are aggregated to provide decisions on how to group words into a class and whether to label the class as positive or negative. Thus, the overall result can be much more accurate than the individual indicators. To verify this, we ran a series of simulation experiments. Each experiment measures how our algorithm performs for a given level of precision P for identifying links and a given average number of links k for each word. The goal is to show that even when P is low, given enough data (i.e., high k), we can achieve high performance for the grouping.

As we noted earlier, the corpus data is eventually represented in our system as a graph, with the nodes corresponding to adjectives and the links to predictions about whether the two connected adjectives have the same or different orientation. Thus the parameter P in the simulation experiments measures how well we are able to predict each link independently of the others, and the parameter k measures the number of distinct adjectives each adjective appears with in conjunctions. P therefore directly represents the precision of the link classification algorithm, while k indirectly represents the corpus size.

To measure the effect of P and k (which are reflected in the graph topology), we need to carry out a series of experiments where we systematically vary their values. For example, as k (or the amount of data) increases for a given level of precision P for individual links, we want to measure how this affects overall accuracy of the resulting groups of nodes. Thus, we need to construct a series of data sets, or graphs, which represent different scenarios corresponding to a given combination of values of P and k. To do this, we construct a random graph by randomly assigning 50 nodes to the two possible orientations. Because we don't have frequency and morphology information on these abstract nodes, we cannot predict whether two nodes are of the same or different orientation. Rather, we randomly assign links between nodes so that, on average, each
node participates in k links and 100 × P% of all links connect nodes of the same orientation. Then we consider these links as identified by the link prediction algorithm as connecting two nodes with the same orientation (so that 100 × P% of these predictions will be correct). This is equivalent to the baseline link classification method, and provides a lower bound on the performance of the algorithm actually used in our system (Section 5).

Because of the lack of actual measurements such as frequency on these abstract nodes, we also decouple the partitioning and labeling components of our system and score the partition found under the best matching conditions for the actual labels. Thus the simulation measures only how well the system separates positive from negative adjectives, not how well it determines which is which. However, in all the experiments performed on real corpus data (Section 8), the system correctly found the labels of the groups; any misclassifications came from misplacing an adjective in the wrong group. The whole procedure of constructing the random graph and finding and scoring the groups is repeated 200 times for any given combination of P and k, and the results are averaged, thus avoiding accidentally evaluating our system on a graph that is not truly representative of graphs with the given P and k.

We observe (Figure 3) that even for relatively low P, our ability to correctly classify the nodes approaches very high levels with a modest number of links. For P = 0.8, we need only about 7 links per adjective for classification performance over 90% and only 12 links per adjective for performance over 99%.⁸ The difference between low and high values
of P is in the rate at which increasing data increases overall precision. These results are somewhat more optimistic than those obtained with real data (Section 8), a difference which is probably due to the uniform distributional assumptions in the simulation. Nevertheless, we expect the trends to be similar to the ones shown in Figure 3, and the results of Table 3 on real data support this expectation.

[8] 12 links per adjective for a set of n adjectives requires 6n conjunctions between the n adjectives in the corpus.

[Figure 3: Simulation results obtained on 50 nodes, shown in four panels: (a) P = 0.75, (b) P = 0.8, (c) P = 0.85, and (d) P = 0.9. Each panel plots classification performance against the average number of neighbors per node. In each figure, the last x coordinate indicates the (average) maximum possible value of k for this P, and the dotted line shows the performance of a random classifier.]

10 Conclusion and Future Work

We have proposed and verified from corpus data constraints on the semantic orientations of conjoined adjectives. We used these constraints to automatically construct a log-linear regression model which, combined with supplementary morphology rules, predicts whether two conjoined adjectives are of same or different orientation with 82% accuracy. We then classified several sets of adjectives according to the links inferred in this way and labeled them as positive or negative, obtaining 92% accuracy on the classification task for reasonably dense graphs and 100% accuracy on the labeling task. Simulation experiments establish that very high levels of performance can be obtained with a modest number of links per word, even when the links themselves are not always correctly classified.

As part of our clustering algorithm's output, a
"goodness-of-fit" measure for each word is computed, based on Rousseeuw's (1987) silhouettes. This measure ranks the words according to how well they fit in their group, and can thus be used as a quantitative measure of orientation, refining the binary positive-negative distinction. By restricting the labeling decisions to words with high values of this measure we can also increase the precision of our system, at the cost of sacrificing some coverage.

We are currently combining the output of this system with a semantic group finding system so that we can automatically identify antonyms from the corpus, without access to any semantic descriptions. The learned semantic categorization of the adjectives can also be used in the reverse direction, to help in interpreting the conjunctions they participate in. We will also extend our analyses to nouns and verbs.

Acknowledgements

This work was supported in part by the Office of Naval Research under grant N00014-95-1-0745, jointly by the Office of Naval Research and the Advanced Research Projects Agency under grant N00014-89-J-1782, by the National Science Foundation under grant GER-90-24069, and by the New York State Center for Advanced Technology under contracts NYSSTF-CAT(95)-013 and NYSSTF-CAT(96)-013. We thank Ken Church and the AT&T Bell Laboratories for making the PARTS part-of-speech tagger available to us. We also thank Dragomir Radev, Eric Siegel, and Gregory Sean McKinley, who provided models for the categorization of the adjectives in our training and testing sets as positive and negative.

References

Jean-Claude Anscombre and Oswald Ducrot. 1983. L'Argumentation dans la Langue. Philosophie et Langage. Pierre Mardaga, Brussels, Belgium.

Edwin L. Battistella. 1990. Markedness: The Evaluative Superstructure of Language. State University of New York Press, Albany, New York.

Peter F. Brown, Vincent J. della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Kenneth W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing (ANLP-88), pages 136-143, Austin, Texas, February. Association for Computational Linguistics.

W. J. Conover. 1980. Practical Nonparametric Statistics. Wiley, New York, 2nd edition.

Richard O. Duda and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Michael Elhadad and Kathleen R. McKeown. 1990. A procedure for generating connectives. In Proceedings of COLING, Helsinki, Finland, July.

Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco, California.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1993. Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning. In Proceedings of the 31st Annual Meeting of the ACL, pages 172-182, Columbus, Ohio, June. Association for Computational Linguistics.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1995. A quantitative evaluation of linguistic tests for the automatic prediction of semantic markedness. In Proceedings of the 33rd Annual Meeting of the ACL, pages 197-204, Boston, Massachusetts, June. Association for Computational Linguistics.

John S. Justeson and Slava M. Katz. 1991. Co-occurrences of antonymous adjectives and their contexts. Computational Linguistics, 17(1):1-19.

Adrienne Lehrer. 1974. Semantic Fields and Lexical Structure. North Holland, Amsterdam and New York.

Adrienne Lehrer. 1985. Markedness and antonymy.
Journal of Linguistics, 31(3):397-429, September.

John Lyons. 1977. Semantics, volume 1. Cambridge University Press, Cambridge, England.

Peter McCullagh and John A. Nelder. 1989. Generalized Linear Models. Chapman and Hall, London, 2nd edition.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4):235-312.

Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the ACL, pages 183-190, Columbus, Ohio, June. Association for Computational Linguistics.

Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65.

Thomas J. Santner and Diane E. Duffy. 1989. The Statistical Analysis of Discrete Data. Springer-Verlag, New York.

Helmuth Späth. 1985. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Ellis Horwood, Chichester, West Sussex, England.
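The random-graph simulation of Section 9 (50 abstract nodes, an average of k links per node, a fraction P of links whose same/different-orientation prediction is correct, scored under the best matching of group labels and averaged over repeated trials) can be sketched as follows. This is a toy reconstruction under stated assumptions, not the authors' code: `greedy_partition` is a simple local-search stand-in for the clustering algorithm of Section 5.

```python
import random

def greedy_partition(n, links, rng, sweeps=20):
    """Toy partitioner: local search that flips a node's group whenever the
    flip increases agreement with the predicted same/different links."""
    label = [rng.choice((1, -1)) for _ in range(n)]

    def gain(i):
        # Flipping node i toggles agreement on every link incident to i.
        g = 0
        for a, b, same in links:
            if i == a or i == b:
                j = b if i == a else a
                agree = (label[i] == label[j]) == same
                g += -1 if agree else 1
        return g

    for _ in range(sweeps):
        improved = False
        for i in range(n):
            if gain(i) > 0:
                label[i] = -label[i]
                improved = True
        if not improved:
            break
    return label

def simulate(num_nodes=50, k=6, p=0.8, trials=200, seed=0):
    """Each trial draws ~k*n/2 random links whose same/different prediction
    is correct with probability P, partitions the nodes, and scores the
    partition under the better of the two group-to-label matchings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        truth = [rng.choice((1, -1)) for _ in range(num_nodes)]
        links = []
        for _ in range(num_nodes * k // 2):
            a, b = rng.sample(range(num_nodes), 2)
            same = truth[a] == truth[b]
            # The link's predicted label is correct with probability P.
            predicted_same = same if rng.random() < p else not same
            links.append((a, b, predicted_same))
        guess = greedy_partition(num_nodes, links, rng)
        agree = sum(g == t for g, t in zip(guess, truth))
        total += max(agree, num_nodes - agree) / num_nodes
    return total / trials
```

In this sketch, as in Figure 3, classification accuracy at a fixed P rises as the average number of links per node grows; the exact numbers depend on the toy partitioner.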
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews

Peter D. Turney
Institute for Information Technology
National Research Council of Canada
Ottawa, Ontario, Canada, K1A 0R6
[email protected]

Abstract

This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. A phrase has a positive semantic orientation when it has good associations (e.g., “subtle nuances”) and a negative semantic orientation when it has bad associations (e.g., “very cavalier”). In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word “excellent” minus the mutual information between the given phrase and the word “poor”. A review is classified as recommended if the average semantic orientation of its phrases is positive. The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The accuracy ranges from 84% for automobile reviews to 66% for movie reviews.

1 Introduction

If you are considering a vacation in Akumal, Mexico, you might go to a search engine and enter the query “Akumal travel review”. However, in this case, Google1 reports about 5,000 matches. It would be useful to know what fraction of these matches recommend Akumal as a travel destination. With an algorithm for automatically classifying a review as “thumbs up” or “thumbs down”, it would be possible for a search engine to report such summary statistics. This is the motivation for the research described here. Other potential applications include recognizing “flames” (abusive newsgroup messages) (Spertus, 1997) and developing new kinds of search tools (Hearst, 1992).

1 http://www.google.com

In this paper, I present a simple unsupervised learning algorithm for classifying a review as recommended or not recommended. The algorithm takes a written review as input and produces a classification as output. The first step is to use a part-of-speech tagger to identify phrases in the input text that contain adjectives or adverbs (Brill, 1994). The second step is to estimate the semantic orientation of each extracted phrase (Hatzivassiloglou & McKeown, 1997). A phrase has a positive semantic orientation when it has good associations (e.g., “romantic ambience”) and a negative semantic orientation when it has bad associations (e.g., “horrific events”). The third step is to assign the given review to a class, recommended or not recommended, based on the average semantic orientation of the phrases extracted from the review. If the average is positive, the prediction is that the review recommends the item it discusses. Otherwise, the prediction is that the item is not recommended.

The PMI-IR algorithm is employed to estimate the semantic orientation of a phrase (Turney, 2001). PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases. The semantic orientation of a given phrase is calculated by comparing its similarity to a positive reference word (“excellent”) with its similarity to a negative reference word (“poor”). More specifically, a phrase is assigned a numerical rating by taking the mutual information between the given phrase and the word “excellent” and subtracting the mutual information between the given phrase and the word “poor”. In addition to determining the direction of the phrase's semantic orientation (positive or negative, based on the sign of the rating), this numerical rating also indicates the strength of the semantic orientation (based on the magnitude of the number). The algorithm is presented in Section 2.

Hatzivassiloglou and McKeown (1997) have also developed an algorithm for predicting semantic orientation. Their algorithm performs well, but it is designed for isolated adjectives, rather than phrases containing adjectives or adverbs. This is discussed in more detail in Section 3, along with other related work.

The classification algorithm is evaluated on 410 reviews from Epinions2, randomly sampled from four different domains: reviews of automobiles, banks, movies, and travel destinations. Reviews at Epinions are not written by professional writers; any person with a Web browser can become a member of Epinions and contribute a review. Each of these 410 reviews was written by a different author. Of these reviews, 170 are not recommended and the remaining 240 are recommended (these classifications are given by the authors). Always guessing the majority class would yield an accuracy of 59%. The algorithm achieves an average accuracy of 74%, ranging from 84% for automobile reviews to 66% for movie reviews. The experimental results are given in Section 4.
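The three-step procedure just described (extract phrases, estimate each phrase's semantic orientation, average and threshold at zero) can be sketched end to end as follows. This is a minimal illustration rather than the paper's implementation: `toy_extract` and the `TOY_SO` lexicon are hypothetical stand-ins for the POS-pattern extractor of Section 2 and the PMI-IR estimator, with sample values taken from Table 2.

```python
def classify_review(review, extract_phrases, semantic_orientation):
    """Step 1: extract adjective/adverb phrases; step 2: estimate each
    phrase's semantic orientation; step 3: average and compare to zero."""
    phrases = extract_phrases(review)
    scores = [semantic_orientation(p) for p in phrases]
    if not scores:
        return "not recommended"  # no evaluative phrases found
    average = sum(scores) / len(scores)
    return "recommended" if average > 0 else "not recommended"


# Hypothetical stand-ins: a fixed phrase list and a tiny SO lexicon
# (the values are illustrative, taken from Table 2 of the paper).
def toy_extract(review):
    return [p for p in ("low fees", "online service", "true service")
            if p in review]

TOY_SO = {"low fees": 0.333, "online service": 2.780, "true service": -0.732}
```

For example, `classify_review("low fees and online service", toy_extract, TOY_SO.get)` averages 0.333 and 2.780, a positive value, so the review is classified as recommended.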
The interpretation of the experimental results, the limitations of this work, and future work are discussed in Section 5. Potential applications are outlined in Section 6. Finally, conclusions are presented in Section 7.

2 Classifying Reviews

The first step of the algorithm is to extract phrases containing adjectives or adverbs. Past work has demonstrated that adjectives are good indicators of subjective, evaluative sentences (Hatzivassiloglou & Wiebe, 2000; Wiebe, 2000; Wiebe et al., 2001). However, although an isolated adjective may indicate subjectivity, there may be insufficient context to determine semantic orientation. For example, the adjective “unpredictable” may have a negative orientation in an automotive review, in a phrase such as “unpredictable steering”, but it could have a positive orientation in a movie review, in a phrase such as “unpredictable plot”. Therefore the algorithm extracts two consecutive words, where one member of the pair is an adjective or an adverb and the second provides context.

2 http://www.epinions.com

First a part-of-speech tagger is applied to the review (Brill, 1994).3 Two consecutive words are extracted from the review if their tags conform to any of the patterns in Table 1. The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are verbs.4 The second
pattern, for example, means that two consecutive words are extracted if the first word is an adverb and the second word is an adjective, but the third word (which is not extracted) cannot be a noun. NNP and NNPS (singular and plural proper nouns) are avoided, so that the names of the items in the review cannot influence the classification.

Table 1. Patterns of tags for extracting two-word phrases from reviews.

       First Word           Second Word             Third Word (Not Extracted)
    1. JJ                   NN or NNS               anything
    2. RB, RBR, or RBS      JJ                      not NN nor NNS
    3. JJ                   JJ                      not NN nor NNS
    4. NN or NNS            JJ                      not NN nor NNS
    5. RB, RBR, or RBS      VB, VBD, VBN, or VBG    anything

The second step is to estimate the semantic orientation of the extracted phrases, using the PMI-IR algorithm. This algorithm uses mutual information as a measure of the strength of semantic association between two words (Church & Hanks, 1989). PMI-IR has been empirically evaluated using 80
synonym test questions from the Test of English as a Foreign Language (TOEFL), obtaining a score of 74% (Turney, 2001). For comparison, Latent Semantic Analysis (LSA), another statistical measure of word association, attains a score of 64% on the same 80 TOEFL questions (Landauer & Dumais, 1997).

3 http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
4 See Santorini (1995) for a complete description of the tags.

The Pointwise Mutual Information (PMI) between two words, word1 and word2, is defined as follows (Church & Hanks, 1989):

    PMI(word1, word2) = log2 [ p(word1 & word2) / (p(word1) p(word2)) ]    (1)

Here, p(word1 & word2) is the probability that word1 and word2 co-occur. If the words are statistically independent, then the probability that they co-occur is given by the product p(word1) p(word2). The ratio between p(word1 & word2) and p(word1) p(word2) is thus a measure of the degree of statistical dependence between the words. The log of this ratio is the amount of information that we acquire about the presence of one of the words when we observe the other.
The Semantic Orientation (SO) of a phrase, phrase, is calculated here as follows:

    SO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”)    (2)

The reference words “excellent” and “poor” were chosen because, in the five star review rating system, it is common to define one star as “poor” and five stars as “excellent”. SO is positive when phrase is more strongly associated with “excellent” and negative when phrase is more strongly associated with “poor”.

PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents). The following experiments use the AltaVista Advanced Search engine5, which indexes approximately 350 million web pages (counting only those pages that are in English). I chose AltaVista because it has a NEAR operator. The AltaVista NEAR operator constrains the search to documents that contain the words within ten words of one another, in either order. Previous work has shown that NEAR performs better than AND when measuring the strength of semantic association between words (Turney, 2001).

Let hits(query) be the number of hits returned, given the query query. The following estimate of SO can be derived from equations (1) and (2) with
some minor algebraic manipulation, if co-occurrence is interpreted as NEAR:

    SO(phrase) = log2 [ (hits(phrase NEAR “excellent”) hits(“poor”)) /
                        (hits(phrase NEAR “poor”) hits(“excellent”)) ]    (3)

5 http://www.altavista.com/sites/search/adv

Equation (3) is a log-odds ratio (Agresti, 1996). To avoid division by zero, I added 0.01 to the hits. I also skipped phrase when both hits(phrase NEAR “excellent”) and hits(phrase NEAR “poor”) were (simultaneously) less than four. These numbers (0.01 and 4) were arbitrarily chosen. To eliminate any possible influence from the testing data, I added “AND (NOT host:epinions)” to every query, which tells AltaVista not to include the Epinions Web site in its searches.

The third step is to calculate the average semantic orientation of the phrases in the given review and classify the review as recommended if the average is positive and otherwise not recommended.

Table 2 shows an example for a recommended review and Table 3 shows an example for a not recommended review. Both are reviews of the Bank of America. Both are in the collection of 410 reviews from Epinions that are used in the experiments in Section 4.

Table 2. An example of the processing of a review that the author has classified as recommended.6

    Extracted Phrase           Part-of-Speech Tags    Semantic Orientation
    online experience          JJ NN                   2.253
    low fees                   JJ NNS                  0.333
    local branch               JJ NN                   0.421
    small part                 JJ NN                   0.053
    online service             JJ NN                   2.780
    printable version          JJ NN                  -0.705
    direct deposit             JJ NN                   1.288
    well other                 RB JJ                   0.237
    inconveniently located     RB VBN                 -1.541
    other bank                 JJ NN                  -0.850
    true service               JJ NN                  -0.732
    Average Semantic Orientation                       0.322

6 The semantic orientation in the following tables is calculated using the natural logarithm (base e), rather than base 2. The natural log is more common in the literature on log-odds ratio. Since all logs are equivalent up to a constant factor, it makes no difference for the algorithm.
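The hit-count estimate of SO described above (equation (3) with the 0.01 smoothing and the minimum of four hits) can be sketched as follows. `hits` is a stand-in for the search-engine query call, simulated below with a hand-made table of invented counts; exactly where the 0.01 is added is an assumption of this sketch.

```python
import math

def so_from_hits(phrase, hits, eps=0.01, min_hits=4):
    """Equation (3): log-odds ratio of NEAR co-occurrence counts with the
    reference words.  eps (0.01) guards against zero counts; a phrase
    whose co-occurrence counts with both reference words fall below
    min_hits is skipped (None returned), as in the paper."""
    ex = hits(phrase + ' NEAR "excellent"')
    po = hits(phrase + ' NEAR "poor"')
    if ex < min_hits and po < min_hits:
        return None  # too little evidence; the paper skips such phrases
    return math.log2(((ex + eps) * hits('"poor"')) /
                     ((po + eps) * hits('"excellent"')))
```

With an invented hit table in which a phrase co-occurs 40 times with “excellent” and 10 times with “poor”, while “poor” alone returns twice as many hits as “excellent”, the SO comes out strongly positive, close to 3 bits.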
Table 3. An example of the processing of a review that the author has classified as not recommended.

    Extracted Phrase       Part-of-Speech Tags    Semantic Orientation
    little difference      JJ NN                  -1.615
    clever tricks          JJ NNS                 -0.040
    programs such          NNS JJ                  0.117
    possible moment        JJ NN                  -0.668
    unethical practices    JJ NNS                 -8.484
    low funds              JJ NNS                 -6.843
    old man                JJ NN                  -2.566
    other problems         JJ NNS                 -2.748
    probably wondering     RB VBG                 -1.830
    virtual monopoly       JJ NN                  -2.050
    other bank             JJ NN                  -0.850
    extra day              JJ NN                  -0.286
    direct deposits        JJ NNS                  5.771
    online web             JJ NN                   1.936
    cool thing             JJ NN                   0.395
    very handy             RB JJ                   1.349
    lesser evil            RBR JJ                 -2.288
    Average Semantic Orientation                  -1.218

3 Related Work

This work is most closely related to Hatzivassiloglou and McKeown's (1997) work on predicting the semantic orientation of adjectives. They note that there are linguistic constraints on the semantic
orientations of adjectives in conjunctions. As an example, they present the following three sentences (Hatzivassiloglou & McKeown, 1997):

    1. The tax proposal was simple and well-received by the public.
    2. The tax proposal was simplistic but well-received by the public.
    3. (*) The tax proposal was simplistic and well-received by the public.

The third sentence is incorrect, because we use “and” with adjectives that have the same semantic orientation (“simple” and “well-received” are both positive), but we use “but” with adjectives that have different semantic orientations (“simplistic” is negative).

Hatzivassiloglou and McKeown (1997) use a four-step supervised learning algorithm to infer the semantic orientation of adjectives from constraints on conjunctions:

    1. All conjunctions of adjectives are extracted from the given corpus.
    2. A supervised learning algorithm combines multiple sources of evidence to label pairs of adjectives as having the same semantic orientation or different semantic orientations. The result is a graph where the nodes are adjectives and links indicate sameness or difference of semantic orientation.
    3. A clustering algorithm processes the graph structure to produce two subsets of adjectives, such that links across the two subsets are mainly different-orientation links, and links inside a subset are mainly same-orientation links.
    4. Since it is known that positive adjectives tend to be used more frequently than negative adjectives, the cluster with the higher average frequency is classified as having positive semantic orientation.

This algorithm classifies adjectives with accuracies ranging from 78% to 92%, depending on the amount of training data that is available. The algorithm can go beyond a binary positive-negative distinction, because the clustering algorithm (step 3 above) can produce a “goodness-of-fit” measure that indicates how well an adjective fits in its assigned cluster.

Although they do not consider the task of classifying reviews, it seems their algorithm could be plugged into the classification algorithm presented in Section 2, where it would replace PMI-IR and equation (3) in the second step. However, PMI-IR is conceptually simpler, easier to implement, and it can handle phrases and adverbs, in addition to isolated adjectives.

As far as I know, the only prior published work on the task of classifying reviews as thumbs up or down is Tong's (2001) system for generating sentiment timelines. This system tracks online discussions about movies and displays a plot of the number of positive sentiment and negative sentiment messages over time. Messages are classified by looking for specific phrases that indicate the sentiment of the author towards the movie (e.g., “great acting”, “wonderful visuals”, “terrible score”, “uneven editing”). Each phrase must be manually added to a special lexicon and manually tagged as indicating positive or negative sentiment. The lexicon is specific to the domain (e.g., movies) and must be built anew for each new domain. The company Mindfuleye7 offers a technology called Lexant™ that appears similar to Tong's (2001) system.

Other related work is concerned with determining subjectivity (Hatzivassiloglou & Wiebe, 2000; Wiebe, 2000; Wiebe et al., 2001). The task is to distinguish sentences that present opinions and evaluations from sentences that objectively present factual information (Wiebe, 2000). Wiebe et al. (2001) list a variety of potential applications for automated subjectivity tagging, such as recognizing “flames” (Spertus, 1997), classifying email, recognizing speaker role in radio broadcasts, and mining reviews. In several of these applications, the first step is to recognize that the text is subjective, and then the natural second step is to determine the semantic orientation of the subjective text. For example, a flame detector cannot merely detect that a newsgroup message is subjective; it must further detect that the message has a negative semantic orientation; otherwise a message of praise could be classified as a flame.
Hearst (1992) observes that most search engines focus on finding documents on a given topic, but do not allow the user to specify the directionality of the documents (e.g., is the author in favor of, neutral, or opposed to the event or item discussed in the document?). The directionality of a document is determined by its deep argumentative structure, rather than a shallow analysis of its adjectives. Sentences are interpreted metaphorically in terms of agents exerting force, resisting force, and overcoming resistance. It seems likely that there could be some benefit to combining shallow and deep analysis of the text.

7 http://www.mindfuleye.com/

4 Experiments

Table 4 describes the 410 reviews from Epinions that were used in the experiments. 170 (41%) of the reviews are not recommended and the remaining 240 (59%) are recommended. Always guessing the majority class would yield an accuracy of 59%. The third column shows the average number of phrases that were extracted from the reviews.

Table 5 shows the experimental results. Except for the travel reviews, there is surprisingly little variation in the accuracy within a domain. In addition to recommended and not recommended, Epinions reviews are classified using the five star rating system. The third column shows the correlation between the average semantic orientation and the number of stars assigned by the author of the review. The results show a strong positive correlation between the average semantic orientation and the author's rating out of five stars.

Table 4. A summary of the corpus of reviews.

    Domain of Review        Number of Reviews    Average Phrases per Review
    Automobiles                    75                 20.87
      Honda Accord                 37                 18.78
      Volkswagen Jetta             38                 22.89
    Banks                         120                 18.52
      Bank of America              60                 22.02
      Washington Mutual            60                 15.02
    Movies                        120                 29.13
      The Matrix                   60                 19.08
      Pearl Harbor                 60                 39.17
    Travel Destinations            95                 35.54
      Cancun                       59                 30.02
      Puerto Vallarta              36                 44.58
    All                           410                 26.00

Table 5. The accuracy of the classification and the correlation of the semantic orientation with the star rating.

    Domain of Review        Accuracy    Correlation
    Automobiles             84.00 %      0.4618
      Honda Accord          83.78 %      0.2721
      Volkswagen Jetta      84.21 %      0.6299
    Banks                   80.00 %      0.6167
      Bank of America       78.33 %      0.6423
      Washington Mutual     81.67 %      0.5896
    Movies                  65.83 %      0.3608
      The Matrix            66.67 %      0.3811
      Pearl Harbor          65.00 %      0.2907
    Travel Destinations     70.53 %      0.4155
      Cancun                64.41 %      0.4194
      Puerto Vallarta       80.56 %      0.1447
    All                     74.39 %      0.5174

5 Discussion of Results

A natural question, given the preceding results, is what makes movie reviews hard to classify? Table 6 shows that classification by the average SO tends to err on the side of guessing that a review is not recommended, when it is actually recommended. This suggests the hypothesis that a good movie will often contain unpleasant scenes (e.g., violence, death, mayhem), and a recommended movie review may thus have its average semantic orientation reduced if it contains descriptions of these unpleasant scenes. However, if we add a constant value to the average SO of the movie reviews, to compensate for this bias, the accuracy does not
improve. This suggests that, just as positive reviews mention unpleasant things, so negative reviews often mention pleasant scenes.

Table 6. The confusion matrix for movie classifications.

                                 Author's Classification
Average Semantic Orientation   Thumbs Up   Thumbs Down      Sum
Positive                        28.33 %      12.50 %      40.83 %
Negative                        21.67 %      37.50 %      59.17 %
Sum                             50.00 %      50.00 %     100.00 %

Table 7 shows some examples that lend support to this hypothesis. For example, the phrase "more evil" does have negative connotations, thus an SO of -4.384 is appropriate, but an evil character does not make a bad movie. The difficulty with movie reviews is that there are two aspects to a movie: the events and actors in the movie (the elements of the movie), and the style and art of the movie (the movie as a gestalt; a unified whole). This is likely also the explanation for the lower accuracy of the Cancun reviews: good beaches do not necessarily add up to a good vacation. On the other hand, good automotive parts usually do add up to a good
automobile and good banking services add up to a good bank. It is not clear how to address this issue. Future work might look at whether it is possible to tag sentences as discussing elements or wholes.

Another area for future work is to empirically compare PMI-IR and the algorithm of Hatzivassiloglou and McKeown (1997). Although their algorithm does not readily extend to two-word phrases, I have not yet demonstrated that two-word phrases are necessary for accurate classification of reviews. On the other hand, it would be interesting to evaluate PMI-IR on the collection of 1,336 hand-labeled adjectives that were used in the experiments of Hatzivassiloglou and McKeown (1997). A related question for future work is the relationship of accuracy of the estimation of semantic orientation at the level of individual phrases to accuracy of review classification. Since the review classification is based on an average, it might be quite resistant to noise in the SO estimate for individual phrases. But it is possible that a better SO estimator could produce significantly better classifications.

Table 7. Sample phrases from misclassified reviews.

Movie: The Matrix
Author's Rating: recommended (5 stars)
Average SO: -0.219 (not recommended)
Sample Phrase: more evil [RBR JJ]
SO of Sample Phrase: -4.384
Context of Sample Phrase: The slow, methodical way he spoke. I loved it! It made him seem more arrogant and even more evil.

Movie: Pearl Harbor
Author's Rating: recommended (5 stars)
Average SO: -0.378 (not recommended)
Sample Phrase: sick feeling [JJ NN]
SO of Sample Phrase: -8.308
Context of Sample Phrase: During this period I had a sick feeling, knowing what was coming, knowing what was part of our history.

Movie: The Matrix
Author's Rating: not recommended (2 stars)
Average SO: 0.177 (recommended)
Sample Phrase: very talented [RB JJ]
SO of Sample Phrase: 1.992
Context of Sample Phrase: Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised.

Movie: Pearl Harbor
Author's Rating: not recommended (3 stars)
Average SO: 0.015 (recommended)
Sample Phrase: blue skies [JJ NNS]
SO of Sample Phrase: 1.263
Context of Sample Phrase: Anyone who saw the trailer in the theater over the course of the last year will never forget the images of Japanese war planes swooping out of the blue skies, flying past the children playing baseball, or the truly remarkable shot of a bomb falling from an enemy plane into the deck of the USS Arizona.

Equation (3) is a very simple estimator of semantic orientation. It might benefit from more sophisticated statistical analysis (Agresti, 1996). One
possibility is to apply a statistical significance test to each estimated SO. There is a large statistical literature on the log-odds ratio, which might lead to improved results on this task.

This paper has focused on unsupervised classification, but average semantic orientation could be supplemented by other features, in a supervised classification system. The other features could be based on the presence or absence of specific words, as is common in most text classification work. This could yield higher accuracies, but the intent here was to study this one feature in isolation, to simplify the analysis, before combining it with other features.

Table 5 shows a high correlation between the average semantic orientation and the star rating of a review. I plan to experiment with ordinal classification of reviews in the five star rating system, using the algorithm of Frank and Hall (2001). For ordinal classification, the average semantic orientation would be supplemented with other features in a supervised classification system.

A limitation of PMI-IR is the time required to send queries to AltaVista. Inspection of Equation (3) shows that it takes four queries to calculate the semantic orientation of a phrase. However, I cached all query results, and since there is no need to recalculate hits("poor") and hits("excellent") for every phrase, each phrase requires an average of slightly less than two queries. As a courtesy to AltaVista, I used a five second delay between queries.8 The 410 reviews yielded 10,658 phrases, so
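The four hit counts behind Equation (3) can be combined into a semantic orientation score as in the following sketch. It assumes the log2 odds-ratio form of the PMI-IR estimator and adds a small smoothing constant to guard against zero hit counts; the example counts are purely illustrative, not real AltaVista results.

```python
import math

def semantic_orientation(hits_phrase_near_excellent, hits_phrase_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """PMI-IR estimate of semantic orientation: the log2 ratio of the
    phrase's association with "excellent" to its association with "poor".
    The smoothing constant (an assumption here) avoids division by zero
    when a phrase never co-occurs with one of the reference words."""
    return math.log2(
        ((hits_phrase_near_excellent + smoothing) * (hits_poor + smoothing)) /
        ((hits_phrase_near_poor + smoothing) * (hits_excellent + smoothing))
    )

# Hypothetical counts: a phrase seen four times as often near "excellent"
# as near "poor" (with equal baseline counts) scores about +2.
print(semantic_orientation(400, 100, 10000, 10000))
```

Since hits("excellent") and hits("poor") are constants across the corpus, only the two phrase-specific counts require fresh queries, which is why caching brings the cost below two queries per phrase.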
the total time required to process the corpus was roughly 106,580 seconds, or about 30 hours. This might appear to be a significant limitation, but extrapolation of current trends in computer memory capacity suggests that, in about ten years, the average desktop computer will be able to easily store and search AltaVista's 350 million Web pages. This will reduce the processing time to less than one second per review.

8 This line of research depends on the good will of the major search engines. For a discussion of the ethics of Web robots, see http://www.robotstxt.org/wc/robots.html. For query robots, the proposed extended standard for robot exclusion would be useful. See http://www.conman.org/people/spc/robots2.html.

6 Applications

There are a variety of potential applications for automated review rating. As mentioned in the introduction, one application is to provide summary statistics for search engines. Given the query "Akumal travel review", a search engine could report, "There are 5,000 hits, of which 80% are thumbs up and 20% are thumbs down." The search results could be sorted by average semantic orientation, so that the user could easily sample the most extreme reviews. Similarly, a search engine could allow the user to specify the topic and the rating of the desired reviews (Hearst, 1992).

Preliminary experiments indicate that semantic orientation is also useful for summarization of reviews. A positive review could be summarized by picking out the sentence with the highest positive semantic orientation and a negative review could be summarized by extracting the sentence with the lowest negative semantic orientation. Epinions asks its reviewers to provide a short description of pros and cons for the reviewed item. A pro/con summarizer could be evaluated by measuring the overlap between the reviewer's pros and cons and the phrases in the review that have the most extreme semantic orientation.

Another potential application is filtering "flames" for newsgroups (Spertus, 1997). There could be a threshold, such that a newsgroup message is held for verification by the human moderator when the semantic orientation of a phrase drops below the threshold. A related use might be a tool for helping academic referees when reviewing journal and conference papers. Ideally, referees are unbiased and objective, but sometimes their criticism can be unintentionally harsh. It might be possible to highlight passages in a draft referee's report, where the choice of words should be modified towards a more neutral tone.

Tong's (2001) system for detecting and tracking opinions in on-line discussions could benefit from the use of a learning algorithm, instead of (or in addition to) a hand-built lexicon. With automated review rating (opinion rating), advertisers could track advertising campaigns, politicians could track public opinion, reporters could track public response to current events, stock traders could track financial opinions, and trend analyzers
could track entertainment and technology trends.

7 Conclusions

This paper introduces a simple unsupervised learning algorithm for rating a review as thumbs up or down. The algorithm has three steps: (1) extract phrases containing adjectives or adverbs, (2) estimate the semantic orientation of each phrase, and (3) classify the review based on the average semantic orientation of the phrases. The core of the algorithm is the second step, which uses PMI-IR to calculate semantic orientation (Turney, 2001).

In experiments with 410 reviews from Epinions, the algorithm attains an average accuracy of 74%. It appears that movie reviews are difficult to classify, because the whole is not necessarily the sum of the parts; thus the accuracy on movie reviews is about 66%. On the other hand, for banks and automobiles, it seems that the whole is the sum of the parts, and the accuracy is 80% to 84%. Travel reviews are an intermediate case.

Previous work on determining the semantic orientation of adjectives has used a complex algorithm that does not readily extend beyond isolated adjectives to adverbs or longer phrases (Hatzivassiloglou and McKeown, 1997). The simplicity of PMI-IR may encourage further work with semantic orientation.

The limitations of this work include the time
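The three-step pipeline can be sketched as follows. The phrase extraction and SO estimation here are stand-ins: a toy lexicon of hypothetical SO values replaces the part-of-speech tagger and the PMI-IR queries, so only the structure of the algorithm is shown, not its actual components.

```python
# Hypothetical SO values standing in for PMI-IR estimates (step 2);
# a real implementation would issue the AltaVista queries per phrase.
TOY_SO = {"very talented": 1.992, "more evil": -4.384, "blue skies": 1.263}

def extract_phrases(review):
    """Step 1 (stub): return the two-word phrases we have SO values for.
    The real algorithm extracts adjective/adverb phrases with a POS tagger."""
    return [p for p in TOY_SO if p in review.lower()]

def classify_review(review):
    """Steps 2 and 3: average the semantic orientation of the extracted
    phrases; thumbs up if the average is positive, thumbs down otherwise."""
    phrases = extract_phrases(review)
    if not phrases:
        return "thumbs down"  # no evidence; arbitrary default
    average_so = sum(TOY_SO[p] for p in phrases) / len(phrases)
    return "thumbs up" if average_so > 0 else "thumbs down"

print(classify_review("The very talented cast under blue skies..."))
```

A single strongly negative phrase can outweigh several mildly positive ones in the average, which is exactly the failure mode the movie-review examples in Table 7 illustrate.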
required for queries and, for some applications, the level of accuracy that was achieved. The former difficulty will be eliminated by progress in hardware. The latter difficulty might be addressed by using semantic orientation combined with other features in a supervised classification algorithm.

Acknowledgements

Thanks to Joel Martin and Michael Littman for helpful comments.

References

Agresti, A. 1996. An introduction to categorical data analysis. New York: Wiley.

Brill, E. 1994. Some advances in transformation-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (pp. 722-727). Menlo Park, CA: AAAI Press.

Church, K.W., & Hanks, P. 1989. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Conference of the ACL (pp. 76-83). New Brunswick, NJ: ACL.

Frank, E., & Hall, M. 2001. A simple approach to ordinal classification. Proceedings of the Twelfth European Conference on Machine Learning (pp. 145-156). Berlin: Springer-Verlag.

Hatzivassiloglou, V., & McKeown, K.R. 1997. Predicting the semantic orientation of adjectives. Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the European Chapter of the ACL (pp. 174-181). New Brunswick, NJ: ACL.

Hatzivassiloglou, V., & Wiebe, J.M. 2000. Effects of adjective orientation and gradability on sentence subjectivity. Proceedings of the 18th International Conference on Computational Linguistics. New Brunswick, NJ: ACL.

Hearst, M.A. 1992. Direction-based text interpretation as an information access refinement. In P. Jacobs (Ed.), Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Mahwah, NJ: Lawrence Erlbaum Associates.

Landauer, T.K., & Dumais, S.T. 1997. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Santorini, B. 1995. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing). Technical Report, Department of Computer and Information Science, University of Pennsylvania.

Spertus, E. 1997. Smokey: Automatic recognition of hostile messages. Proceedings of the Conference on Innovative Applications of Artificial Intelligence (pp. 1058-1065). Menlo Park, CA: AAAI Press.

Tong, R.M. 2001. An operational system for detecting and tracking opinions in on-line discussions. Working Notes of the ACM SIGIR 2001 Workshop on Operational Text Classification (pp. 1-6). New York, NY: ACM.

Turney, P.D. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (pp. 491-502). Berlin: Springer-Verlag.

Wiebe, J.M. 2000. Learning subjective adjectives from corpora. Proceedings of the 17th National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Wiebe, J.M., Bruce, R., Bell, M., Martin, M., & Wilson, T. 2001. A corpus study of evaluative and speculative language. Proceedings of the Second ACL SIG on Dialogue Workshop on Discourse and Dialogue. Aalborg, Denmark.