SlideShare a Scribd company logo
Integrating Multiple Knowledge Sources to
     Disambiguate Word Sense: An Exemplar-Based Approach

                      Hwee Tou Ng                                          Hian Beng Lee
            Defence Science Organisation                           Defence Science Organisation
                20 S c i e n c e P a r k D r i v e                    20 S c i e n c e P a r k D r i v e
                  S i n g a p o r e 118230                              S i n g a p o r e 118230
           nhweet ou©trantor, dso. gov. s g                       lhianben@trant or. d s o . gov. sg



                    Abstract                                  (LEXical Ambiguity-resolving _System), we tested it
                                                              on a common data set involving the noun "interest"
    In this paper, we present a new approach
                                                              used by Bruce and Wiebe (Bruce and Wiebe, 1994).
    for word sense disambiguation (WSD) us-                   LEXAS achieves a mean accuracy of 87.4% on this
    ing an exemplar-based learning algorithm.
                                                              data set, which is higher than the accuracy of 78%
    This approach integrates a diverse set of
                                                              reported in (Bruce and Wiebe, 1994).
    knowledge sources to disambiguate word
    sense, including part of speech of neigh-                    Moreover, to test the scalability of LEXAS, we have
                                                              acquired a corpus in which 192,800 word occurrences
    boring words, morphological form, the un-
                                                              have been manually tagged with senses from WORD-
    ordered set of surrounding words, local
                                                              NET, which is a public domain lexical database con-
    collocations, and verb-object syntactic re-
                                                              taining about 95,000 word forms and 70,000 lexical
    lation.   We tested our WSD program,
                                                              concepts (Miller, 1990). These sense tagged word
    named LEXAS, on both a common data
                                                              occurrences consist of 191 most frequently occur-
    set used in previous work, as well as on
                                                              ring and most ambiguous nouns and verbs. When
    a large sense-tagged corpus that we sep-
                                                              tested on this large data set, LEXAS performs better
    arately constructed.    LEXAS achieves a
                                                              than the default strategy of picking the most fre-
    higher accuracy on the common data set,
                                                              quent sense. To our knowledge, this is the first time
    and performs better than the most frequent
    heuristic on the highly ambiguous words                   that a WSD program has been tested on such a large
    in the large corpus tagged with the refined               scale, and yielding results better than the most fre-
                                                              quent heuristic on highly ambiguous words with the
    senses of WoRDNET.
                                                              refined sense distinctions of WOttDNET.
1   Introduction
                                                              2     Task   Description
One important problem of Natural Language Pro-
cessing (NLP) is figuring out what a word means               The input to a WSD program consists of unre-
when it is used in a particular context. The different        stricted, real-world English sentences. In the out-
meanings of a word are listed as its various senses in        put, each word occurrence w is tagged with its cor-
a dictionary. The task of Word Sense Disambigua-              rect sense (according to the context) in the form of
tion (WSD) is to identify the correct sense of a word         a sense number i, where i corresponds to the i-th
in context. Improvement in the accuracy of iden-              sense definition of w as given in some dictionary.
tifying the correct word sense will result in better          The choice of which sense definitions to use (and
machine translation systems, information retrieval            according to which dictionary) is agreed upon in ad-
systems, etc. For example, in machine translation,            vance.
knowing the correct word sense helps to select the               For our work, we use the sense definitions as given
appropriate target words to use in order to translate         in WORDNET, which is comparable to a good desk-
into a target language.                                       top printed dictionary in its coverage and sense dis-
   In this paper, we present a new approach for               tinction. Since WO•DNET only provides sense def-
WSD using an exemplar-based learning algorithm.               initions for content words, (i.e., words in the parts
This approach integrates a diverse set of knowledge           of speech (POS) noun, verb, adjective, and adverb),
sources to disambiguate word sense, including part            LEXAS is only concerned with disambiguating the
of speech (POS) of neighboring words, morphologi-             sense of content words. However, almost all existing
cal form, the unordered set of surrounding words,             work in WSD deals only with disambiguating con-
local collocations, and verb-object syntactic rela-           tent words too.
tion. To evaluate our WSD program, named LEXAS                   LEXAS assumes that each word in an input sen-
                                                         40
tence has been pre-tagged with its correct POS, so               ing w, the morphological form of w, the frequently
that the possible senses to consider for a content               co-occurring words surrounding w, the local colloca-
word w are only those associated with the particu-               tions containing w, and the verb t h a t takes w as an
lar POS of w in the sentence. For instance, given                object (for the case when w is a noun). These values
the sentence "A reduction of principal and interest              form the features of a test example.
is one way the problem m a y be solved.", since the                 This test example is then compared to every train-
word "interest" appears as a noun in this sentence,              ing example. The sense of word w in the test exam-
LEXAS will only consider the noun senses of "inter-              ple is the sense of w in the closest matching train-
est" but not its verb senses. T h a t is, LEXAS is only          ing example, where there is a precise, computational
concerned with disambiguating senses of a word in                definition of "closest match" as explained later.
a given POS. Making such an assumption is reason-
able since POS taggers that can achieve accuracy                 3.1     Feature Extraction
of 96% are readily available to assign POS to un-                The first step of the algorithm is to extract a set F
restricted English sentences (Brill, 1992; Cutting et            of features such that each sentence containing an oc-
al., 1992).                                                      currence of w will form a training example supplying
   In addition, sense definitions are only available for         the necessary values for the set F of features.
root words in a dictionary. These are words that                    Specifically, LEXAS uses the following set of fea-
are not morphologically inflected, such as "interest"            tures to form a training example:
(as opposed to the plural form "interests"), "fall"
(as opposed to the other inflected forms like "fell",            L3, L2, LI, 1~i, R2, R3, M, K I , . . . , Kin,   el,..., 69, V
 "fallen", "falling", "falls"), etc. The sense of a mor-
                                                                 3.1.1 Part of Speech and Morphological
phologically inflected content word is the sense of its
                                                                       Form
uninflected form. LEXAS follows this convention by
first converting each word in an input sentence into                The value of feature Li is the part of speech (POS)
its morphological root using the morphological ana-              of the word i-th position to the left of w. The value
lyzer of WORD NET, before assigning the appropriate              of Ri is the POS of the word i-th position to the right
word sense to the root form.                                     of w. Feature M denotes the morphological form of
                                                                 w in the sentence s. For a noun, the value for this
                                                                 feature is either singular or plural; for a verb, the
3    Algorithm                                                   value is one of infinitive (as in the uninflected form
LEXAS performs WSD by first learning from a train-               of a verb like "fall"), present-third-person-singular
ing corpus of sentences in which words have been                 (as in "falls"), past (as in "fell"), present-participle
pre-tagged with their correct senses. T h a t is, it uses        (as in "falling") or past-participle (as in "fallen").
supervised learning, in particular exemplar-based                3.1.2       Unordered Set of Surrounding Words
learning, to achieve WSD. Our approach has been
                                                                    K t , • •., Km are features corresponding to a set of
fully implemented in the program LExAs. Part of
                                                                 keywords that frequently co-occur with word w in
the implementation uses PEBLS (Cost and Salzberg,
                                                                 the same sentence. For a sentence s, the value of
1993; Rachlin and Salzberg, 1993), a public domain
                                                                 feature Ki is one if the keyword It'~ appears some-
exemplar-based learning system.
                                                                 where in sentence s, else the value of Ki is zero.
   LEXAS builds one exemplar-based classifier for                  The set of keywords K 1 , . . . , Km are determined
each content word w. It operates in two phases:                  based on conditional probability. All the word to-
training phase and test phase. In the training phase,            kens other than the word occurrence w in a sen-
LEXAS is given a set S of sentences in the training              tence s are candidates for consideration as keywords.
corpus in which sense-tagged occurrences of w ap-                These tokens are converted to lower case form before
pear. For each training sentence with an occurrence              being considered as candidates for keywords.
of w, LEXAS extracts the parts of speech (POS) of                  Let cp(ilk ) denotes the conditional probability of
words surrounding w, the morphological form of w,                sense i of w given keyword k, where
the words that frequently co-occur with w in the
same sentence, and the local collocations containing                                             Ni,k
w. For disambiguating a noun w, the verb which                                         cp(ilk) = N~
takes the current noun w as the object is also iden-
tified. This set of values form the features of an ex-           Nk is the number of sentences in which keyword k co-
ample, with one training sentence contributing one               occurs with w, and Ni,k is the number of sentences
training example.                                                in which keyword k co-occurs with w where w has
   Subsequently, in the test phase, LEXAS is given               sense i.
new, previously unseen sentences. For a new sen-                    For a keyword k to be selected as a feature, it
                                                                 must satisfy the following criteria:
tence containing the word w, LI~XAS extracts from
the new sentence the values for the same set of fea-               1. cp(ilk ) >_ Mi for some sense i, where M1 is some
tures, including parts of speech of words surround-                    predefined m i n i m u m probability.
                                                            41
2. The keyword k must occur at least M2 times                 Left Offset     Right Offset         Collocation Example
     in some sense i, where /1//2 is some predefined                 -1              -1             accrued interest
     m i n i m u m value.                                             1               1             interest rate
                                                                     -2              -1             principal and interest
 3. Select at most M3 number of keywords for a
                                                                     -1               1             national interest in
    given sense i if the number of keywords satisfy-
                                                                      1               2             interest and dividends
    ing the first two criteria for a given sense i ex-
    ceeds M3. In this case, keywords that co-occur                   -3              -1             sale of an interest
    more frequently (in terms of absolute frequency)                 -2                             in the interest of
    with sense i of word w are selected over those                   -1               2             an interest in a
    co-occurring less frequently.                                     1               3             interest on the bonds

   Condition 1 ensures that a selected keyword is in-                    Table 1: Features for Collocations
dicative of some sense i of w since cp(ilk) is at least
some m i n i m u m probability M1. Condition 2 reduces
the possibility of selecting a keyword based on spu-           of the nine collocation features, LEXAS concatenates
rious occurrence. Condition 3 prefers keywords that            the words between the left and right offset positions.
co-occur more frequently if there is a large number            Using similar conditional probability criteria for the
of eligible keywords.                                          selection of keywords, collocations t h a t are predic-
   For example, M1 = 0.8, Ms = 5, M3 = 5 when                  tive of a certain sense are selected to form the pos-
LEXAS was tested on the c o m m o n data set reported          sible values for a collocation feature.
in Section 4.1.
                                                               3.1.4 V e r b - O b j e c t S y n t a c t i c R e l a t i o n
   To illustrate, when disambiguating the noun "in-
terest", some of the selected keywords are: ex-                   LEXAS also makes use of the verb-object syntactic
pressed, acquiring, great, attracted, expressions,             relation as one feature V for the disambiguation of
pursue, best, conflict, served, short, minority, rates,        nouns. If a noun to be disambiguated is the head of
rate, bonds, lower, payments.                                  a noun group, as indicated by its last position in a
                                                               noun group bracketing, and if the word immediately
3.1.3   Local Collocations                                     preceding the opening noun group bracketing is a
   Local collocations are common expressions con-              verb, LEXAS takes such a verb-noun pair to be in a
taining the word to be disambiguated. For our pur-             verb-object syntactic relation. Again, using similar
pose, the term collocation does not imply idiomatic            conditional probability criteria for the selection of
usage, just words that are frequently adjacent to the          keywords, verbs that are predictive of a certain sense
word to be disambiguated. Examples of local collo-             of the noun to be disambiguated are selected to form
cations of the noun "interest" include "in the interest        the possible values for this verb-object feature V.
of", "principal and interest", etc. When a word to                Since our training and test sentences come with
be disambiguated occurs as part of a collocation, its          noun group bracketing, determining verb-object re-
sense can be frequently determined very reliably. For          lation using the above heuristic can be readily done.
example, the collocation "in the interest of" always           In future work, we plan to incorporate more syntac-
implies the "advantage, advancement, favor" sense              tic relations including subject-verb, and adjective-
of the noun "interest". Note t h a t the method for            headnoun relations.         We also plan to use verb-
extraction of keywords that we described earlier will          object and subject-verb relations to disambiguate
fail to find the words "in", "the", "of" as keywords,          verb senses.
since these words will appear in m a n y different po-
sitions in a sentence for m a n y senses of the noun           3.2    Training and Testing
 "interest". It is only when these words appear in             The heart of exemplar-based learning is a measure
the exact order "in the interest of" around the noun           of the similarity, or distance, between two examples.
 "interest" that strongly implies the "advantage, ad-          If the distance between two examples is small, then
vancement, favor" sense.                                       the two examples are similar. We use the following
   There are nine features related to collocations in          definition of distance between two symbolic values
an example. Table 1 lists the nine features and some           vl and v2 of a feature f:
collocation examples for the noun "interest". For ex-
ample, the feature with left offset = -2 and right off-
set = 1 refers to the possible collocations beginning
                                                                              e(vl, v2) =         Ic1'cl c2,c. I
                                                                                            i=1
 at the word two positions to the left of "interest"
 and ending at the word one position to the right of              Cl,i is the number of training examples with value
 "interest". An example of such a collocation is "in           vl for feature f that is classified as sense i in the
 the interest of".                                             training corpus, and C1 is the number of training
   The method for extraction of local collocations is          examples with value vl for feature f in any sense.
similar to that for extraction of keywords. For each           C2,i and C2 denote similar quantities for value v2 of
                                                          42
feature f . n is the total number of senses for a word                 L D O C E sense            Frequency      Percent
W.                                                               1: readiness to give                361          15%
   This metric for measuring distance is adopted                 attention
from (Cost and Salzberg, 1993), which in turn is                 2: quality of causing                 11          <1%
adapted from the value difference metric of the ear-             attention to be given
lier work of (Stanfill and Waltz, 1986). The distance            3: activity, subject, etc.            67           3%
between two examples is the sum of the distances                 which one gives time and
between the values of all the features of the two ex-            attention to
amples.                                                          4: advantage,                        178           8%
   During the training phase, the appropriate set of             advancement, or favor
features is extracted based on the method described              5: a share (in a company,            499          21%
in Section 3.1. From the training examples formed,               business, etc.)
the distance between any two values for a feature f              6: money paid for the use            1253         53%
is computed based on the above formula.                          of money
   During the test phase, a test example is compared
against allthe training examples. LEXAS then deter-                      Table 2: Distribution of Sense Tags
mines the closest matching training example as the
one with the m i n i m u m distance to the test example.
The sense of w in the test example is the sense of w            verb-object syntactic relation to form the features of
in this closest matching training example.                      examples.
   If there is a tie among several training examples               In the results reported in (Bruce and Wiebe,
with the same m i n i m u m distance to the test exam-          1994), they used a test set of 600 randomly selected
ple, LEXAS randomly selects one of these training               sentences from the 2369 sentences. Unfortunately,
examples as the closet matching training example in             in the data set made available in the public domain,
order to break the tie.                                         there is no indication of which sentences are used as
                                                                test sentences. As such, we conducted 100 random
4     Evaluation                                                trials, and in each trial, 600 sentences were randomly
                                                                selected to form the test set. LEXAS is trained on
To evaluate the performance of LEXAS, we con-                   the remaining 1769 sentences, and then tested on a
ducted two tests, one on a common data set used in              separate test set of sentences in each trial.
(Bruce and Wiebe, 1994), and another on a larger                    Note that in Bruce and Wiebe's test run, the pro-
data set that we separately collected.                          portion of sentences in each sense in the test set is
                                                                approximately equal to their proportion in the whole
4.1    Evaluation on a Common          Data Set                 d a t a set. Since we use r a n d o m selection of test sen-
To our knowledge, very few of the existing work on              tences, the proportion of each sense in our test set is
WSD has been tested and compared on a common                    also approximately equal to their proportion in the
data set. This is in contrast to established practice           whole d a t a set in our r a n d o m trials.
in the machine learning community. This is partly                  The average accuracy of LEXAS over 100 random
because there are not m a n y c o m m o n data sets pub-        trials is 87.4%, and the standard deviation is 1.37%.
licly available for testing WSD programs.                       In each of our 100 r a n d o m trials, the accuracy of
   One exception is the sense-tagged data set used              LEXAS is always higher than the accuracy of 78%
in (Bruce and Wiebe, 1994), which has been made                 reported in (Bruce and Wiebe, 1994).
available in the public domain by Bruce and Wiebe.                  Bruce and Wiebe also performed a separate test
This data set consists of 2369 sentences each con-              by using a subset of the "interest" data set with only
taining an occurrence of the noun "interest" (or its            4 senses (sense 1, 4, 5, and 6), so as to compare their
plural form "interests") with its correct sense man-            results with previous work on WSD (Black, 1988;
ually tagged. The noun "interest" occurs in six dif-            Zernik, 1990; Yarowsky, 1992), which were tested
ferent senses in this d a t a set. Table 2 shows the            on 4 senses of the noun "interest". However, the
distribution of sense tags from the d a t a set that we         work of (Black, 1988; Zernik, 1990; Yarowsky, 1992)
obtained. Note that the sense definitions used in this          were not based on the present set of sentences, so
data set are those from Longman Dictionary of Con-              the comparison is only suggestive. We reproduced
temporary English (LDOCE) (Procter, 1978). This                 in Table 3 the results of past work as well as the clas-
does not pose any problem for LEXAS, since LEXAS                sification accuracy of LEXAS, which is 89.9% with a
only requires that there be a division of senses into           standard deviation of 1.09% over 100 random trials.
different classes, regardless of how the sense classes             In summary, when tested on the noun "interest",
are defined or numbered.                                        LEXAS gives higher classification accuracy than pre-
   POS of words are given in the data set, as well              vious work on WSD.
as the bracketings of noun groups. These are used                  In order to evaluate the relative contribution of
to determine the POS of neighboring words and the               the knowledge sources, including (1) POS and mor-
                                                           43
WSD research                 Accuracy                   These 192,800 word occurrences consist of 121
         Black (1988)                       72%                    nouns and 70 verbs which are the m o s t frequently oc-
         Zernik (1990)                      70%                    curring and most ambiguous words of English. The
         Yarowsky (1992)                    72%                    121 nouns are:
         Bruce & Wiebe (1994)               79%
                                                                       action activity age air area art board
         LEXhS (1996)                       89%
                                                                       body book business car case center cen-
                                                                       tury change child church city class college
      Table 3: Comparison with previous results
                                                                       community company condition cost coun-
                                                                       try course day death development differ-
 Knowledge Source           Mean Accuracy           Std Dev
                                                                       ence door effect effort end example experi-
 POS & morpho                   77.2%                1.44%             ence face fact family field figure foot force
 surrounding words              62.0%                1.82%             form girl government ground head history
 collocations                   80.2%                1.55%             home hour house information interest job
 verb-object                    43.5%                1.79%             land law level life light line m a n mate-
                                                                       rial m a t t e r m e m b e r mind m o m e n t money
Table 4:      Relative     Contribution       of Knowledge             m o n t h n a m e nation need number order
Sources                                                                part party picture place plan point pol-
                                                                       icy position power pressure problem pro-
                                                                       cess program public purpose question rea-
phological form; (2) unordered set of surrounding
                                                                       son result right r o o m school section sense
words; (3) local collocations; and (4) verb to the left
                                                                       service side society stage state step student
(verb-object syntactic relation), we conducted 4 sep-
                                                                       study surface system table t e r m thing time
arate runs of 100 r a n d o m trials each. In each run,
                                                                       town type use value voice water way word
we utilized only one knowledge source and compute
                                                                       work world
the average classification accuracy and the standard
deviation. The results are given in Table 4.                         The 70 verbs are:
   Local collocation knowledge yields the highest ac-
curacy, followed by POS and morphological form.                         add appear ask become believe bring build
Surrounding words give lower accuracy, perhaps be-                      call carry change come consider continue
cause in our work, only the current sentence forms                      determine develop draw expect fall give
the surrounding context, which averages about 20                        go grow happen help hold indicate involve
words. Previous work on using the unordered set of                      keep know lead leave lie like live look lose
                                                                        mean meet move need open pay raise read
surrounding words have used a much larger window,
such as the 100-word window of (Yarowsky, 1992),                        receive r e m e m b e r require return rise run
and the 2-sentence context of (Leacock et al., 1993).                   see seem send set show sit speak stand start
Verb-object syntactic relation is the weakest knowl-                    stop strike take talk tell think turn wait
edge source.                                                            walk want work write
   Our experimental finding, t h a t local collocations               For this set of nouns and verbs, the average num-
are the most predictive, agrees with past observa-                 ber of senses per noun is 7.8, while the average num-
tion that humans need a narrow window of only a                    ber of senses per verb is 12.0. We draw our sen-
few words to perform WSD (Choueka and Lusignan,                    tences containing the occurrences of the 191 words
 1985).                                                            listed above from the combined corpus of the 1 mil-
   The processing speed of LEXAS is satisfactory.                  lion word Brown corpus and the 2.5 million word
Running on an SGI Unix workstation, LEXAS can                      Wall Street Journal (WSJ) corpus. For every word
process about 15 examples per second when tested                   in the two lists, up to 1,500 sentences each con-
on the "interest" data set.                                        taining an occurrence of the word are extracted
                                                                   from the combined corpus. In all, there are about
4.2    E v a l u a t i o n o n a L a r g e D a t a Set             113,000 noun occurrences and a b o u t 79,800 verb oc-
Previous research on W S D tend to be tested only                  currences. This set of 121 nouns accounts for about
on a dozen number of words, where each word fre-                   20% of all occurrences of nouns t h a t one expects to
quently has either two or a few senses. To test the                encounter in any unrestricted English text. Simi-
scalability of LEXAS, we have gathered a corpus in                 larly, about 20% of all verb occurrences in any unre-
which 192,800 word occurrences have been manually                  stricted text come from the set of 70 verbs chosen.
tagged with senses from WoRDNET 1.5. This data                        We estimate that there are 10-20% errors in our
set is almost two orders of magnitude larger in size               sense-tagged d a t a set. To get an idea of how the
than the above "interest" data set. Manual tagging                 sense assignments of our d a t a set compare with
was done by university undergraduates majoring in                  those provided by WoRDNET linguists in SEMCOR,
Linguistics, and approximately one man-year of ef-                 the sense-tagged subset of Brown corpus prepared
forts were expended in tagging our data set.                       by Miller et al. (Miller et al., 1994), we compare
                                                              44
Test set   Sense 1    Most Frequent     LEXAS              ambiguate words coming from such a wide variety of
   BC50        40.5%        47.1%          54.0%              texts.
   WSJ6        44.8%        63.7%          68.6%
                                                              5   Related     Work
     Table 5: Evaluation on a Large Data Set
                                                              There is now a large body of past work on WSD.
                                                              Early work on WSD, such as (Kelly and Stone, 1975;
a subset of the occurrences that overlap. Out of              Hirst, 1987) used hand-coding of knowledge to per-
5,317 occurrences that overlap, about 57% of the              form WSD. The knowledge acquisition process is la-
sense assignments in our data set agree with those            borious. In contrast, LEXAS learns from tagged sen-
in SEMCOR. This should not be too surprising, as              tences, without human engineering of complex rules.
it is widely believed that sense tagging using the               The recent emphasis on corpus based NLP has re-
full set of refined senses found in a large dictionary        sulted in much work on WSD of unconstrained real-
like WORDNET involve making subtle human judg-                world texts. One line of research focuses on the use
ments (Wilks et al., 1990; Bruce and Wiebe, 1994),            of the knowledge contained in a machine-readable
such that there are many genuine cases where two              dictionary to perform WSD, such as (Wilks et al.,
humans will not agree fully on the best sense assign-         1990; Luk, 1995). In contrast, LEXAS uses super-
ments.                                                        vised learning from tagged sentences, which is also
   We evaluated LEXAS on this larger set of noisy,            the approach taken by most recent work on WSD, in-
sense-tagged data. We first set aside two subsets for         cluding (Bruce and Wiebe, 1994; Miller et al., 1994;
testing. The first test set, named BC50, consists of          Leacock et al., 1993; Yarowsky, 1994; Yarowsky,
7,119 occurrences of the 191 content words that oc-           1993; Yarowsky, 1992).
cur in 50 text files of the Brown corpus. The second             The work of (Miller et al., 1994; Leacock et al.,
test set, named WSJ6, consists of 14,139 occurrences          1993; Yarowsky, 1992) used only the unordered set of
of the 191 content words that occur in 6 text files of        surrounding words to perform WSD, and they used
the WSJ corpus.                                               statistical classifiers, neural networks, or IR-based
   We compared the classification accuracy of LEXAS           techniques. The work of (Bruce and Wiebe, 1994)
against the default strategy of picking the most fre-         used parts of speech (POS) and morphological form,
quent sense. This default strategy has been advo-             in addition to surrounding words. However, the POS
cated as the baseline performance level for compar-           used are abbreviated POS, and only in a window of
ison with WSD programs (Gale et al., 1992). There             -b2 words. No local collocation knowledge is used. A
are two instantiations of this strategy in our current        probabilistic classifier is used in (Bruce and Wiebe,
evaluation. Since WORDNET orders its senses such              1994).
that sense 1 is the most frequent sense, one pos-                T h a t local collocation knowledge provides impor-
sibility is to always pick sense 1 as the best sense          tant clues to WSD is pointed out in (Yarowsky,
assignment. This assignment method does not even              1993), although it was demonstrated only on per-
need to look at the training sentences. We call this          forming binary (or very coarse) sense disambigua-
method "Sense 1" in Table 5. Another assignment               tion. The work of (Yarowsky, 1994) is perhaps the
method is to determine the most frequently occur-             most similar to our present work. However, his work
ring sense in the training sentences, and to assign           used decision list to perform classification, in which
this sense to all test sentences. We call this method         only the single best disambiguating evidence that
 "Most Frequent" in Table 5. The accuracy of LEXAS            matched a target context is used. In contrast, we
on these two test sets is given in Table 5.                   used exemplar-based learning, where the contribu-
   Our results indicate that exemplar-based classi-           tions of all features are summed up and taken into
fication of word senses scales up quite well when             account in coming up with a classification. We also
tested on a large set of words. The classification            include verb-object syntactic relation as a feature,
accuracy of LEXAS is always better than the default           which is not used in (Yarowsky, 1994). Although the
strategy of picking the most frequent sense. We be-           work of (Yarowsky, i994) can be applied to WSD,
lieve that our result is significant, especially when         the results reported in (Yarowsky, 1994) only dealt
the training data is noisy, and the words are highly          with accent restoration, which is a much simpler
ambiguous with a large number of refined sense dis-           problem. It is unclear how Yarowsky's method will
tinctions per word.                                           fare on WSD of a common test data set like the one
   The accuracy on Brown corpus test files is lower           we used, nor has his method been tested on a large
than that achieved on the Wall Street Journal test            data set with highly ambiguous words tagged with
files, primarily because the Brown corpus consists            the refined senses of WORDNET.
of texts from a wide variety of genres, including                The work of (Miller et al., 1994) is the only prior
newspaper reports, newspaper editorial, biblical pas-         work we know of which attempted to evaluate WSD
sages, science and mathematics articles, general fic-         on a large data set and using the refined sense dis-
tion, romance story, humor, etc. It is harder to dis-         tinction of WORDNET. However, their results show
                                                         45
no improvement (in fact a slight degradation in per-           References
formance) when using surrounding words to perform
WSD as compared to the most frequent heuristic.                Ezra Black. 1988. An experiment in computational
They attributed this to insufficient training data in            discrimination of English word senses. IBM Jour-
SEMCOm In contrast, we adopt a different strategy                nal of Research and Development, 32(2):185-194.
of collecting the training data set. Instead of tagging        Eric Brill. 1992. A simple rule-based part of speech
every word in a running text, as is done in SEMCOR,              tagger. In Proceedings of the Third Conference on
we only concentrate on the set of 191 most frequently            Applied Natural Language Processing, pages 152-
occurring and most ambiguous words, and collected                155.
large enough training data for these words only. This
strategy yields better results, as indicated by a bet-         Rebecca Bruce and Janyce Wiebe. 1994. Word-
ter performance of LEXAS compared with the most                  sense disambiguation using decomposable mod-
frequent heuristic on this set of words.                         els. In Proceedings of the 32nd Annual Meeting
   Most recently, Yarowsky used an unsupervised                  of the Association for Computational Linguistics,
learning procedure to perform WSD (Yarowsky,                     Las Cruces, New Mexico.
1995), although this is only tested on disambiguat-            Claire Cardie. 1993. A case-based approach to
ing words into binary, coarse sense distinction. The             knowledge acquisition for domain-specific sen-
effectiveness of unsupervised learning on disam-                 tence analysis. In Proceedings of the Eleventh Na-
biguating words into the refined sense distinction of            tional Conference on Artificial Intelligence, pages
WoRBNET needs to be further investigated. The                    798-803, Washington, DC.
work of (McRoy, 1992) pointed out that a diverse
set of knowledge sources are important to achieve              Y. Choueka and S. Lusignan. 1985. Disambiguation
WSD, but no quantitative evaluation was given on                 by short contexts. Computers and the Humani-
the relative importance of each knowledge source.                ties, 19:147-157.
No previous work has reported any such evaluation              Scott Cost and Steven Salzberg. 1993. A weighted
either. The work of (Cardie, 1993) used a case-based             nearest neighbor algorithm for learning with sym-
approach that simultaneously learns part of speech,              bolic features. Machine Learning, 10(1):57-78.
word sense, and concept activation knowledge, al-
                                                               Doug Cutting, Julian Kupiec, Jan Pedersen, and
though the method is only tested on domain-specific
                                                                 Penelope Sibun. 1992. A practical part-of-speech
texts with domain-specific word senses.
                                                                tagger. In Proceedings of the Third Conference on
6    Conclusion                                                 Applied Natural Language Processing, pages 133-
                                                                 140.
In this paper, we have presented a new approach for            William Gale, Kenneth Ward Church, and David
WSD using an exemplar based learning algorithm.                 Yarowsky. 1992. Estimating upper and lower
This approach integrates a diverse set of knowledge             bounds on the performance of word-sense disam-
sources to disambiguate word sense. When tested on
                                                                biguation programs. In Proceedings of the 30th
a common data set, our WSD program gives higher
                                                                 Annual Meeting of the Association for Computa-
classification accuracy than previous work on WSD.               tional Linguistics, Newark, Delaware.
When tested on a large, separately collected data
set, our program performs better than the default              Graeme Hirst. 1987. Semantic Interpretation and
strategy of picking the most frequent sense. To our              the Resolution of Ambiguity. Cambridge Univer-
knowledge, this is the first time that a WSD program            sity Press, Cambridge.
has been tested on such a large scale, and yielding            Edward Kelly and Phillip Stone.     1 9 7 5 . Com-
results better than the most frequent heuristic on               puter Recognition of English Word Senses. North-
highly ambiguous words with the refined senses of                Holland, Amsterdam.
WoRDNET.
                                                               Claudia Leacock, Geoffrey Towell, and Ellen
7    Acknowledgements                                            Voorhees. 1993. Corpus-based statistical sense
                                                                 resolution. In Proceedings of the ARPA Human
We would like to thank: Dr Paul Wu for sharing                   Language Technology Workshop.
the Brown Corpus and Wall Street Journal Corpus;
Dr Christopher Ting for downloading and installing             Alpha K. Luk. 1995. Statistical sense disambigua-
WoRDNET and SEMCOR, and for reformatting the                     tion with relatively small corpora using dictio-
corpora; the 12 undergraduates from the Linguis-                 nary definitions. In Proceedings of the 33rd An-
tics Program of the National University of Singa-                nual Meeting of the Association for Computa-
pore for preparing the sense-tagged corpus; and Prof             tional Linguistics, Cambridge, Massachusetts.
K. P. Mohanan for his support of the sense-tagging             Susan W. McRoy 1992. Using multiple knowledge
project.                                                         sources for word sense discrimination. Computa-
                                                                 tional Linguistics, 18(1):1-30.
                                                          46
George A. Miller, Ed. 1990. WordNet: An on-line
  lexical database. International Journal of Lezi-
  cography, 3(4):235-312.
George A. Miller, Martin Chodorow, Shari Landes,
  Claudia Leacock, and Robert G. Thomas. 1994.
  Using a semantic concordance for sense identifi-
  cation. In Proceedings of the ARPA Human Lan-
  guage Technology Workshop.
Paul Procter et al. 1978. Longman Dictionary of
  Contemporary English.
John Rachlin and Steven Salzberg. 1993. PEBLS
  3.0 User's Guide.
C Stanfill and David Waltz. 1986. Toward memory-
  based reasoning. Communications of the ACM,
  29(12):1213-1228.
Yorick Wilks, Dan Fass, Cheng-Ming Guo, James E.
  McDonald, Tony Plate, and Brian M. Slator.
  1990. Providing machine tractable dictionary
  tools. Machine Translation, 5(2):99-154.
David Yarowsky. 1992. Word-sense disambigua-
  tion using statistical models of Roger's categories
  trained on large corpora. In Proceedings of the
  Fifteenth International Conference on Computa-
  tional Linguistics, pages 454-460, Nantes, France.
David Yarowsky. 1993. One sense per colloca-
  tion. In Proceedings of the ARPA Human Lan-
  guage Technology Workshop.
David Yarowsky. 1994. Decision lists for lexical am-
  biguity resolution: Application to accent restora-
  tion in Spanish and French. In Proceedings of the
   32nd Annual Meeting of the Association for Com-
  putational Linguistics, Las Cruces, New Mexico.
David Yarowsky. 1995. Unsupervised word sense
  disambiguation rivaling supervised methods. In
  Proceedings of the 33rd Annual Meeting of the
  Association for Computational Linguistics, Cam-
  bridge, Massachusetts.
Uri Zernik. 1990. Tagging word senses in corpus:
  the needle in the haystack revisited. Technical
  Report 90CRD198, GE R&D Center.




                                                        47

More Related Content

What's hot

DDH 2021-03-03: Text Processing and Searching in the Medical Domain
DDH 2021-03-03: Text Processing and Searching in the Medical DomainDDH 2021-03-03: Text Processing and Searching in the Medical Domain
DDH 2021-03-03: Text Processing and Searching in the Medical Domain
LuukBoulogne
 
The CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognatesThe CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognates
Mark Planigale
 
Wsd final paper
Wsd final paperWsd final paper
Wsd final paper
Milind Gokhale
 
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...
Svetlin Nakov
 
LDG-basic-slides
LDG-basic-slidesLDG-basic-slides
LDG-basic-slides
Noor van Leusen
 
Automatic Identification of False Friends in Parallel Corpora: Statistical an...
Automatic Identification of False Friends in Parallel Corpora: Statistical an...Automatic Identification of False Friends in Parallel Corpora: Statistical an...
Automatic Identification of False Friends in Parallel Corpora: Statistical an...
Svetlin Nakov
 
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
ssuserc35c0e
 
Parts Of A Speech
Parts Of A SpeechParts Of A Speech
Parts Of A Speech
guestfefd6f8
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Fulvio Rotella
 
Cohesion In English Wasee
Cohesion In English  WaseeCohesion In English  Wasee
Cohesion In English Wasee
Dr. Cupid Lucid
 
Cohesion Final
Cohesion FinalCohesion Final
Cohesion Final
Dr. Cupid Lucid
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
andrefsantos
 
OBSC Framework
OBSC FrameworkOBSC Framework
OBSC Framework
Mouhamad Kawas
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
Editor IJARCET
 
Formal languages
Formal languagesFormal languages
Formal languages
Vestforsk.no
 
Surface realization
Surface realizationSurface realization
Surface realization
Mahalingam Ramaswamy
 
dialogue act modeling for automatic tagging and recognition
 dialogue act modeling for automatic tagging and recognition dialogue act modeling for automatic tagging and recognition
dialogue act modeling for automatic tagging and recognition
Vipul Munot
 
The three little pigs
The three little pigsThe three little pigs
The three little pigs
sgilcep
 
P-6
P-6P-6
P-6
butest
 

What's hot (19)

DDH 2021-03-03: Text Processing and Searching in the Medical Domain
DDH 2021-03-03: Text Processing and Searching in the Medical DomainDDH 2021-03-03: Text Processing and Searching in the Medical Domain
DDH 2021-03-03: Text Processing and Searching in the Medical Domain
 
The CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognatesThe CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognates
 
Wsd final paper
Wsd final paperWsd final paper
Wsd final paper
 
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...
 
LDG-basic-slides
LDG-basic-slidesLDG-basic-slides
LDG-basic-slides
 
Automatic Identification of False Friends in Parallel Corpora: Statistical an...
Automatic Identification of False Friends in Parallel Corpora: Statistical an...Automatic Identification of False Friends in Parallel Corpora: Statistical an...
Automatic Identification of False Friends in Parallel Corpora: Statistical an...
 
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
 
Parts Of A Speech
Parts Of A SpeechParts Of A Speech
Parts Of A Speech
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
Cohesion In English Wasee
Cohesion In English  WaseeCohesion In English  Wasee
Cohesion In English Wasee
 
Cohesion Final
Cohesion FinalCohesion Final
Cohesion Final
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
OBSC Framework
OBSC FrameworkOBSC Framework
OBSC Framework
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
 
Formal languages
Formal languagesFormal languages
Formal languages
 
Surface realization
Surface realizationSurface realization
Surface realization
 
dialogue act modeling for automatic tagging and recognition
 dialogue act modeling for automatic tagging and recognition dialogue act modeling for automatic tagging and recognition
dialogue act modeling for automatic tagging and recognition
 
The three little pigs
The three little pigsThe three little pigs
The three little pigs
 
P-6
P-6P-6
P-6
 

Viewers also liked

Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'
pinksguru
 
Presentatie derrik zakkelijk_apps
Presentatie derrik zakkelijk_appsPresentatie derrik zakkelijk_apps
Presentatie derrik zakkelijk_appsDerrik van Bemmel
 
Project 6 – online information
Project 6 – online informationProject 6 – online information
Project 6 – online information
Simon Peters
 
Pagan Vs Christianity
Pagan Vs ChristianityPagan Vs Christianity
Pagan Vs Christianity
pinksguru
 
Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'
pinksguru
 
TripleWide Media In Action
TripleWide Media In ActionTripleWide Media In Action
TripleWide Media In Action
Teddy Cheek
 
pvcc April7
pvcc April7pvcc April7
pvcc April7
pvccangola
 
'Uhuru" in 'A Grain of Wheat'
'Uhuru" in 'A Grain of Wheat''Uhuru" in 'A Grain of Wheat'
'Uhuru" in 'A Grain of Wheat'
pinksguru
 
TV and Children
TV and ChildrenTV and Children
TV and Children
pinksguru
 
Unify_Presentation_Short_20160106
Unify_Presentation_Short_20160106Unify_Presentation_Short_20160106
Unify_Presentation_Short_20160106
Oben Melih Tunç
 
Marginal costing
Marginal costingMarginal costing
Marginal costing
Ajiths Sathya
 
Transformer design and protection
Transformer design and protectionTransformer design and protection
Transformer design and protection
ashwin fcc
 
госдолг сша
госдолг сшагосдолг сша
госдолг сша
Maria Solo
 
Diktat sistem basis_data
Diktat sistem basis_dataDiktat sistem basis_data
Diktat sistem basis_dataSolihin Lihin
 
Genre Research - Horror Forms and Conventions
Genre Research - Horror Forms and ConventionsGenre Research - Horror Forms and Conventions
Genre Research - Horror Forms and Conventions
Lee Turner
 

Viewers also liked (15)

Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'
 
Presentatie derrik zakkelijk_apps
Presentatie derrik zakkelijk_appsPresentatie derrik zakkelijk_apps
Presentatie derrik zakkelijk_apps
 
Project 6 – online information
Project 6 – online informationProject 6 – online information
Project 6 – online information
 
Pagan Vs Christianity
Pagan Vs ChristianityPagan Vs Christianity
Pagan Vs Christianity
 
Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'Feminism in 'Da Vinci Code'
Feminism in 'Da Vinci Code'
 
TripleWide Media In Action
TripleWide Media In ActionTripleWide Media In Action
TripleWide Media In Action
 
pvcc April7
pvcc April7pvcc April7
pvcc April7
 
'Uhuru" in 'A Grain of Wheat'
'Uhuru" in 'A Grain of Wheat''Uhuru" in 'A Grain of Wheat'
'Uhuru" in 'A Grain of Wheat'
 
TV and Children
TV and ChildrenTV and Children
TV and Children
 
Unify_Presentation_Short_20160106
Unify_Presentation_Short_20160106Unify_Presentation_Short_20160106
Unify_Presentation_Short_20160106
 
Marginal costing
Marginal costingMarginal costing
Marginal costing
 
Transformer design and protection
Transformer design and protectionTransformer design and protection
Transformer design and protection
 
госдолг сша
госдолг сшагосдолг сша
госдолг сша
 
Diktat sistem basis_data
Diktat sistem basis_dataDiktat sistem basis_data
Diktat sistem basis_data
 
Genre Research - Horror Forms and Conventions
Genre Research - Horror Forms and ConventionsGenre Research - Horror Forms and Conventions
Genre Research - Horror Forms and Conventions
 

Similar to Exempler approach

Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
csandit
 
Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy words
ijnlc
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationStatistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
Andi Wu
 
Nlp ambiguity presentation
Nlp ambiguity presentationNlp ambiguity presentation
Nlp ambiguity presentation
Gurram Poorna Prudhvi
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
RwanEnan
 
Text smilarity02 corpus_based
Text smilarity02 corpus_basedText smilarity02 corpus_based
Text smilarity02 corpus_based
cyan1d3
 
Introduction to Distributional Semantics
Introduction to Distributional SemanticsIntroduction to Distributional Semantics
Introduction to Distributional Semantics
Andre Freitas
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
MLconf
 
05 linguistic theory meets lexicography
05 linguistic theory meets lexicography05 linguistic theory meets lexicography
05 linguistic theory meets lexicography
Duygu Aşıklar
 
New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
Lukáš Svoboda
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networks
connectbeubax
 
Class14
Class14Class14
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
LaraOlmosCamarena
 
Gwc 2016
Gwc 2016Gwc 2016
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Saurabh Kaushik
 
An Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationAn Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense Disambiguation
Surabhi Verma
 
Chapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfChapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdf
JemalNesre1
 
1 l5eng
1 l5eng1 l5eng
1 l5eng
Noobie312
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
IJERA Editor
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 

Similar to Exempler approach (20)

Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
 
Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy words
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationStatistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
 
Nlp ambiguity presentation
Nlp ambiguity presentationNlp ambiguity presentation
Nlp ambiguity presentation
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Text smilarity02 corpus_based
Text smilarity02 corpus_basedText smilarity02 corpus_based
Text smilarity02 corpus_based
 
Introduction to Distributional Semantics
Introduction to Distributional SemanticsIntroduction to Distributional Semantics
Introduction to Distributional Semantics
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
05 linguistic theory meets lexicography
05 linguistic theory meets lexicography05 linguistic theory meets lexicography
05 linguistic theory meets lexicography
 
New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networks
 
Class14
Class14Class14
Class14
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
Gwc 2016
Gwc 2016Gwc 2016
Gwc 2016
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
An Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationAn Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense Disambiguation
 
Chapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdfChapter 2 Text Operation and Term Weighting.pdf
Chapter 2 Text Operation and Term Weighting.pdf
 
1 l5eng
1 l5eng1 l5eng
1 l5eng
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 

Exempler approach

  • 1. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach Hwee Tou Ng Hian Beng Lee Defence Science Organisation Defence Science Organisation 20 S c i e n c e P a r k D r i v e 20 S c i e n c e P a r k D r i v e S i n g a p o r e 118230 S i n g a p o r e 118230 nhweet ou©trantor, dso. gov. s g lhianben@trant or. d s o . gov. sg Abstract (LEXical Ambiguity-resolving _System), we tested it on a common data set involving the noun "interest" In this paper, we present a new approach used by Bruce and Wiebe (Bruce and Wiebe, 1994). for word sense disambiguation (WSD) us- LEXAS achieves a mean accuracy of 87.4% on this ing an exemplar-based learning algorithm. data set, which is higher than the accuracy of 78% This approach integrates a diverse set of reported in (Bruce and Wiebe, 1994). knowledge sources to disambiguate word sense, including part of speech of neigh- Moreover, to test the scalability of LEXAS, we have acquired a corpus in which 192,800 word occurrences boring words, morphological form, the un- have been manually tagged with senses from WORD- ordered set of surrounding words, local NET, which is a public domain lexical database con- collocations, and verb-object syntactic re- taining about 95,000 word forms and 70,000 lexical lation. We tested our WSD program, concepts (Miller, 1990). These sense tagged word named LEXAS, on both a common data occurrences consist of 191 most frequently occur- set used in previous work, as well as on ring and most ambiguous nouns and verbs. When a large sense-tagged corpus that we sep- tested on this large data set, LEXAS performs better arately constructed. LEXAS achieves a than the default strategy of picking the most fre- higher accuracy on the common data set, quent sense. To our knowledge, this is the first time and performs better than the most frequent heuristic on the highly ambiguous words that a WSD program has been tested on such a large in the large corpus tagged with the refined scale, and yielding results better than the most fre- quent heuristic on highly ambiguous words with the senses of WoRDNET. refined sense distinctions of WOttDNET. 1 Introduction 2 Task Description One important problem of Natural Language Pro- cessing (NLP) is figuring out what a word means The input to a WSD program consists of unre- when it is used in a particular context. The different stricted, real-world English sentences. In the out- meanings of a word are listed as its various senses in put, each word occurrence w is tagged with its cor- a dictionary. The task of Word Sense Disambigua- rect sense (according to the context) in the form of tion (WSD) is to identify the correct sense of a word a sense number i, where i corresponds to the i-th in context. Improvement in the accuracy of iden- sense definition of w as given in some dictionary. tifying the correct word sense will result in better The choice of which sense definitions to use (and machine translation systems, information retrieval according to which dictionary) is agreed upon in ad- systems, etc. For example, in machine translation, vance. knowing the correct word sense helps to select the For our work, we use the sense definitions as given appropriate target words to use in order to translate in WORDNET, which is comparable to a good desk- into a target language. top printed dictionary in its coverage and sense dis- In this paper, we present a new approach for tinction. Since WO•DNET only provides sense def- WSD using an exemplar-based learning algorithm. initions for content words, (i.e., words in the parts This approach integrates a diverse set of knowledge of speech (POS) noun, verb, adjective, and adverb), sources to disambiguate word sense, including part LEXAS is only concerned with disambiguating the of speech (POS) of neighboring words, morphologi- sense of content words. However, almost all existing cal form, the unordered set of surrounding words, work in WSD deals only with disambiguating con- local collocations, and verb-object syntactic rela- tent words too. tion. To evaluate our WSD program, named LEXAS LEXAS assumes that each word in an input sen- 40
  • 2. tence has been pre-tagged with its correct POS, so ing w, the morphological form of w, the frequently that the possible senses to consider for a content co-occurring words surrounding w, the local colloca- word w are only those associated with the particu- tions containing w, and the verb t h a t takes w as an lar POS of w in the sentence. For instance, given object (for the case when w is a noun). These values the sentence "A reduction of principal and interest form the features of a test example. is one way the problem m a y be solved.", since the This test example is then compared to every train- word "interest" appears as a noun in this sentence, ing example. The sense of word w in the test exam- LEXAS will only consider the noun senses of "inter- ple is the sense of w in the closest matching train- est" but not its verb senses. T h a t is, LEXAS is only ing example, where there is a precise, computational concerned with disambiguating senses of a word in definition of "closest match" as explained later. a given POS. Making such an assumption is reason- able since POS taggers that can achieve accuracy 3.1 Feature Extraction of 96% are readily available to assign POS to un- The first step of the algorithm is to extract a set F restricted English sentences (Brill, 1992; Cutting et of features such that each sentence containing an oc- al., 1992). currence of w will form a training example supplying In addition, sense definitions are only available for the necessary values for the set F of features. root words in a dictionary. These are words that Specifically, LEXAS uses the following set of fea- are not morphologically inflected, such as "interest" tures to form a training example: (as opposed to the plural form "interests"), "fall" (as opposed to the other inflected forms like "fell", L3, L2, LI, 1~i, R2, R3, M, K I , . . . , Kin, el,..., 69, V "fallen", "falling", "falls"), etc. The sense of a mor- 3.1.1 Part of Speech and Morphological phologically inflected content word is the sense of its Form uninflected form. LEXAS follows this convention by first converting each word in an input sentence into The value of feature Li is the part of speech (POS) its morphological root using the morphological ana- of the word i-th position to the left of w. The value lyzer of WORD NET, before assigning the appropriate of Ri is the POS of the word i-th position to the right word sense to the root form. of w. Feature M denotes the morphological form of w in the sentence s. For a noun, the value for this feature is either singular or plural; for a verb, the 3 Algorithm value is one of infinitive (as in the uninflected form LEXAS performs WSD by first learning from a train- of a verb like "fall"), present-third-person-singular ing corpus of sentences in which words have been (as in "falls"), past (as in "fell"), present-participle pre-tagged with their correct senses. T h a t is, it uses (as in "falling") or past-participle (as in "fallen"). supervised learning, in particular exemplar-based 3.1.2 Unordered Set of Surrounding Words learning, to achieve WSD. Our approach has been K t , • •., Km are features corresponding to a set of fully implemented in the program LExAs. Part of keywords that frequently co-occur with word w in the implementation uses PEBLS (Cost and Salzberg, the same sentence. For a sentence s, the value of 1993; Rachlin and Salzberg, 1993), a public domain feature Ki is one if the keyword It'~ appears some- exemplar-based learning system. where in sentence s, else the value of Ki is zero. LEXAS builds one exemplar-based classifier for The set of keywords K 1 , . . . , Km are determined each content word w. It operates in two phases: based on conditional probability. All the word to- training phase and test phase. In the training phase, kens other than the word occurrence w in a sen- LEXAS is given a set S of sentences in the training tence s are candidates for consideration as keywords. corpus in which sense-tagged occurrences of w ap- These tokens are converted to lower case form before pear. For each training sentence with an occurrence being considered as candidates for keywords. of w, LEXAS extracts the parts of speech (POS) of Let cp(ilk ) denotes the conditional probability of words surrounding w, the morphological form of w, sense i of w given keyword k, where the words that frequently co-occur with w in the same sentence, and the local collocations containing Ni,k w. For disambiguating a noun w, the verb which cp(ilk) = N~ takes the current noun w as the object is also iden- tified. This set of values form the features of an ex- Nk is the number of sentences in which keyword k co- ample, with one training sentence contributing one occurs with w, and Ni,k is the number of sentences training example. in which keyword k co-occurs with w where w has Subsequently, in the test phase, LEXAS is given sense i. new, previously unseen sentences. For a new sen- For a keyword k to be selected as a feature, it must satisfy the following criteria: tence containing the word w, LI~XAS extracts from the new sentence the values for the same set of fea- 1. cp(ilk ) >_ Mi for some sense i, where M1 is some tures, including parts of speech of words surround- predefined m i n i m u m probability. 41
  • 3. 2. The keyword k must occur at least M2 times Left Offset Right Offset Collocation Example in some sense i, where /1//2 is some predefined -1 -1 accrued interest m i n i m u m value. 1 1 interest rate -2 -1 principal and interest 3. Select at most M3 number of keywords for a -1 1 national interest in given sense i if the number of keywords satisfy- 1 2 interest and dividends ing the first two criteria for a given sense i ex- ceeds M3. In this case, keywords that co-occur -3 -1 sale of an interest more frequently (in terms of absolute frequency) -2 in the interest of with sense i of word w are selected over those -1 2 an interest in a co-occurring less frequently. 1 3 interest on the bonds Condition 1 ensures that a selected keyword is in- Table 1: Features for Collocations dicative of some sense i of w since cp(ilk) is at least some m i n i m u m probability M1. Condition 2 reduces the possibility of selecting a keyword based on spu- of the nine collocation features, LEXAS concatenates rious occurrence. Condition 3 prefers keywords that the words between the left and right offset positions. co-occur more frequently if there is a large number Using similar conditional probability criteria for the of eligible keywords. selection of keywords, collocations t h a t are predic- For example, M1 = 0.8, Ms = 5, M3 = 5 when tive of a certain sense are selected to form the pos- LEXAS was tested on the c o m m o n data set reported sible values for a collocation feature. in Section 4.1. 3.1.4 V e r b - O b j e c t S y n t a c t i c R e l a t i o n To illustrate, when disambiguating the noun "in- terest", some of the selected keywords are: ex- LEXAS also makes use of the verb-object syntactic pressed, acquiring, great, attracted, expressions, relation as one feature V for the disambiguation of pursue, best, conflict, served, short, minority, rates, nouns. If a noun to be disambiguated is the head of rate, bonds, lower, payments. a noun group, as indicated by its last position in a noun group bracketing, and if the word immediately 3.1.3 Local Collocations preceding the opening noun group bracketing is a Local collocations are common expressions con- verb, LEXAS takes such a verb-noun pair to be in a taining the word to be disambiguated. For our pur- verb-object syntactic relation. Again, using similar pose, the term collocation does not imply idiomatic conditional probability criteria for the selection of usage, just words that are frequently adjacent to the keywords, verbs that are predictive of a certain sense word to be disambiguated. Examples of local collo- of the noun to be disambiguated are selected to form cations of the noun "interest" include "in the interest the possible values for this verb-object feature V. of", "principal and interest", etc. When a word to Since our training and test sentences come with be disambiguated occurs as part of a collocation, its noun group bracketing, determining verb-object re- sense can be frequently determined very reliably. For lation using the above heuristic can be readily done. example, the collocation "in the interest of" always In future work, we plan to incorporate more syntac- implies the "advantage, advancement, favor" sense tic relations including subject-verb, and adjective- of the noun "interest". Note t h a t the method for headnoun relations. We also plan to use verb- extraction of keywords that we described earlier will object and subject-verb relations to disambiguate fail to find the words "in", "the", "of" as keywords, verb senses. since these words will appear in m a n y different po- sitions in a sentence for m a n y senses of the noun 3.2 Training and Testing "interest". It is only when these words appear in The heart of exemplar-based learning is a measure the exact order "in the interest of" around the noun of the similarity, or distance, between two examples. "interest" that strongly implies the "advantage, ad- If the distance between two examples is small, then vancement, favor" sense. the two examples are similar. We use the following There are nine features related to collocations in definition of distance between two symbolic values an example. Table 1 lists the nine features and some vl and v2 of a feature f: collocation examples for the noun "interest". For ex- ample, the feature with left offset = -2 and right off- set = 1 refers to the possible collocations beginning e(vl, v2) = Ic1'cl c2,c. I i=1 at the word two positions to the left of "interest" and ending at the word one position to the right of Cl,i is the number of training examples with value "interest". An example of such a collocation is "in vl for feature f that is classified as sense i in the the interest of". training corpus, and C1 is the number of training The method for extraction of local collocations is examples with value vl for feature f in any sense. similar to that for extraction of keywords. For each C2,i and C2 denote similar quantities for value v2 of 42
  • 4. feature f . n is the total number of senses for a word L D O C E sense Frequency Percent W. 1: readiness to give 361 15% This metric for measuring distance is adopted attention from (Cost and Salzberg, 1993), which in turn is 2: quality of causing 11 <1% adapted from the value difference metric of the ear- attention to be given lier work of (Stanfill and Waltz, 1986). The distance 3: activity, subject, etc. 67 3% between two examples is the sum of the distances which one gives time and between the values of all the features of the two ex- attention to amples. 4: advantage, 178 8% During the training phase, the appropriate set of advancement, or favor features is extracted based on the method described 5: a share (in a company, 499 21% in Section 3.1. From the training examples formed, business, etc.) the distance between any two values for a feature f 6: money paid for the use 1253 53% is computed based on the above formula. of money During the test phase, a test example is compared against allthe training examples. LEXAS then deter- Table 2: Distribution of Sense Tags mines the closest matching training example as the one with the m i n i m u m distance to the test example. The sense of w in the test example is the sense of w verb-object syntactic relation to form the features of in this closest matching training example. examples. If there is a tie among several training examples In the results reported in (Bruce and Wiebe, with the same m i n i m u m distance to the test exam- 1994), they used a test set of 600 randomly selected ple, LEXAS randomly selects one of these training sentences from the 2369 sentences. Unfortunately, examples as the closet matching training example in in the data set made available in the public domain, order to break the tie. there is no indication of which sentences are used as test sentences. As such, we conducted 100 random 4 Evaluation trials, and in each trial, 600 sentences were randomly selected to form the test set. LEXAS is trained on To evaluate the performance of LEXAS, we con- the remaining 1769 sentences, and then tested on a ducted two tests, one on a common data set used in separate test set of sentences in each trial. (Bruce and Wiebe, 1994), and another on a larger Note that in Bruce and Wiebe's test run, the pro- data set that we separately collected. portion of sentences in each sense in the test set is approximately equal to their proportion in the whole 4.1 Evaluation on a Common Data Set d a t a set. Since we use r a n d o m selection of test sen- To our knowledge, very few of the existing work on tences, the proportion of each sense in our test set is WSD has been tested and compared on a common also approximately equal to their proportion in the data set. This is in contrast to established practice whole d a t a set in our r a n d o m trials. in the machine learning community. This is partly The average accuracy of LEXAS over 100 random because there are not m a n y c o m m o n data sets pub- trials is 87.4%, and the standard deviation is 1.37%. licly available for testing WSD programs. In each of our 100 r a n d o m trials, the accuracy of One exception is the sense-tagged data set used LEXAS is always higher than the accuracy of 78% in (Bruce and Wiebe, 1994), which has been made reported in (Bruce and Wiebe, 1994). available in the public domain by Bruce and Wiebe. Bruce and Wiebe also performed a separate test This data set consists of 2369 sentences each con- by using a subset of the "interest" data set with only taining an occurrence of the noun "interest" (or its 4 senses (sense 1, 4, 5, and 6), so as to compare their plural form "interests") with its correct sense man- results with previous work on WSD (Black, 1988; ually tagged. The noun "interest" occurs in six dif- Zernik, 1990; Yarowsky, 1992), which were tested ferent senses in this d a t a set. Table 2 shows the on 4 senses of the noun "interest". However, the distribution of sense tags from the d a t a set that we work of (Black, 1988; Zernik, 1990; Yarowsky, 1992) obtained. Note that the sense definitions used in this were not based on the present set of sentences, so data set are those from Longman Dictionary of Con- the comparison is only suggestive. We reproduced temporary English (LDOCE) (Procter, 1978). This in Table 3 the results of past work as well as the clas- does not pose any problem for LEXAS, since LEXAS sification accuracy of LEXAS, which is 89.9% with a only requires that there be a division of senses into standard deviation of 1.09% over 100 random trials. different classes, regardless of how the sense classes In summary, when tested on the noun "interest", are defined or numbered. LEXAS gives higher classification accuracy than pre- POS of words are given in the data set, as well vious work on WSD. as the bracketings of noun groups. These are used In order to evaluate the relative contribution of to determine the POS of neighboring words and the the knowledge sources, including (1) POS and mor- 43
  • 5. WSD research Accuracy These 192,800 word occurrences consist of 121 Black (1988) 72% nouns and 70 verbs which are the m o s t frequently oc- Zernik (1990) 70% curring and most ambiguous words of English. The Yarowsky (1992) 72% 121 nouns are: Bruce & Wiebe (1994) 79% action activity age air area art board LEXhS (1996) 89% body book business car case center cen- tury change child church city class college Table 3: Comparison with previous results community company condition cost coun- try course day death development differ- Knowledge Source Mean Accuracy Std Dev ence door effect effort end example experi- POS & morpho 77.2% 1.44% ence face fact family field figure foot force surrounding words 62.0% 1.82% form girl government ground head history collocations 80.2% 1.55% home hour house information interest job verb-object 43.5% 1.79% land law level life light line m a n mate- rial m a t t e r m e m b e r mind m o m e n t money Table 4: Relative Contribution of Knowledge m o n t h n a m e nation need number order Sources part party picture place plan point pol- icy position power pressure problem pro- cess program public purpose question rea- phological form; (2) unordered set of surrounding son result right r o o m school section sense words; (3) local collocations; and (4) verb to the left service side society stage state step student (verb-object syntactic relation), we conducted 4 sep- study surface system table t e r m thing time arate runs of 100 r a n d o m trials each. In each run, town type use value voice water way word we utilized only one knowledge source and compute work world the average classification accuracy and the standard deviation. The results are given in Table 4. The 70 verbs are: Local collocation knowledge yields the highest ac- curacy, followed by POS and morphological form. add appear ask become believe bring build Surrounding words give lower accuracy, perhaps be- call carry change come consider continue cause in our work, only the current sentence forms determine develop draw expect fall give the surrounding context, which averages about 20 go grow happen help hold indicate involve words. Previous work on using the unordered set of keep know lead leave lie like live look lose mean meet move need open pay raise read surrounding words have used a much larger window, such as the 100-word window of (Yarowsky, 1992), receive r e m e m b e r require return rise run and the 2-sentence context of (Leacock et al., 1993). see seem send set show sit speak stand start Verb-object syntactic relation is the weakest knowl- stop strike take talk tell think turn wait edge source. walk want work write Our experimental finding, t h a t local collocations For this set of nouns and verbs, the average num- are the most predictive, agrees with past observa- ber of senses per noun is 7.8, while the average num- tion that humans need a narrow window of only a ber of senses per verb is 12.0. We draw our sen- few words to perform WSD (Choueka and Lusignan, tences containing the occurrences of the 191 words 1985). listed above from the combined corpus of the 1 mil- The processing speed of LEXAS is satisfactory. lion word Brown corpus and the 2.5 million word Running on an SGI Unix workstation, LEXAS can Wall Street Journal (WSJ) corpus. For every word process about 15 examples per second when tested in the two lists, up to 1,500 sentences each con- on the "interest" data set. taining an occurrence of the word are extracted from the combined corpus. In all, there are about 4.2 E v a l u a t i o n o n a L a r g e D a t a Set 113,000 noun occurrences and a b o u t 79,800 verb oc- Previous research on W S D tend to be tested only currences. This set of 121 nouns accounts for about on a dozen number of words, where each word fre- 20% of all occurrences of nouns t h a t one expects to quently has either two or a few senses. To test the encounter in any unrestricted English text. Simi- scalability of LEXAS, we have gathered a corpus in larly, about 20% of all verb occurrences in any unre- which 192,800 word occurrences have been manually stricted text come from the set of 70 verbs chosen. tagged with senses from WoRDNET 1.5. This data We estimate that there are 10-20% errors in our set is almost two orders of magnitude larger in size sense-tagged d a t a set. To get an idea of how the than the above "interest" data set. Manual tagging sense assignments of our d a t a set compare with was done by university undergraduates majoring in those provided by WoRDNET linguists in SEMCOR, Linguistics, and approximately one man-year of ef- the sense-tagged subset of Brown corpus prepared forts were expended in tagging our data set. by Miller et al. (Miller et al., 1994), we compare 44
  • 6. Test set Sense 1 Most Frequent LEXAS ambiguate words coming from such a wide variety of BC50 40.5% 47.1% 54.0% texts. WSJ6 44.8% 63.7% 68.6% 5 Related Work Table 5: Evaluation on a Large Data Set There is now a large body of past work on WSD. Early work on WSD, such as (Kelly and Stone, 1975; a subset of the occurrences that overlap. Out of Hirst, 1987) used hand-coding of knowledge to per- 5,317 occurrences that overlap, about 57% of the form WSD. The knowledge acquisition process is la- sense assignments in our data set agree with those borious. In contrast, LEXAS learns from tagged sen- in SEMCOR. This should not be too surprising, as tences, without human engineering of complex rules. it is widely believed that sense tagging using the The recent emphasis on corpus based NLP has re- full set of refined senses found in a large dictionary sulted in much work on WSD of unconstrained real- like WORDNET involve making subtle human judg- world texts. One line of research focuses on the use ments (Wilks et al., 1990; Bruce and Wiebe, 1994), of the knowledge contained in a machine-readable such that there are many genuine cases where two dictionary to perform WSD, such as (Wilks et al., humans will not agree fully on the best sense assign- 1990; Luk, 1995). In contrast, LEXAS uses super- ments. vised learning from tagged sentences, which is also We evaluated LEXAS on this larger set of noisy, the approach taken by most recent work on WSD, in- sense-tagged data. We first set aside two subsets for cluding (Bruce and Wiebe, 1994; Miller et al., 1994; testing. The first test set, named BC50, consists of Leacock et al., 1993; Yarowsky, 1994; Yarowsky, 7,119 occurrences of the 191 content words that oc- 1993; Yarowsky, 1992). cur in 50 text files of the Brown corpus. The second The work of (Miller et al., 1994; Leacock et al., test set, named WSJ6, consists of 14,139 occurrences 1993; Yarowsky, 1992) used only the unordered set of of the 191 content words that occur in 6 text files of surrounding words to perform WSD, and they used the WSJ corpus. statistical classifiers, neural networks, or IR-based We compared the classification accuracy of LEXAS techniques. The work of (Bruce and Wiebe, 1994) against the default strategy of picking the most fre- used parts of speech (POS) and morphological form, quent sense. This default strategy has been advo- in addition to surrounding words. However, the POS cated as the baseline performance level for compar- used are abbreviated POS, and only in a window of ison with WSD programs (Gale et al., 1992). There -b2 words. No local collocation knowledge is used. A are two instantiations of this strategy in our current probabilistic classifier is used in (Bruce and Wiebe, evaluation. Since WORDNET orders its senses such 1994). that sense 1 is the most frequent sense, one pos- T h a t local collocation knowledge provides impor- sibility is to always pick sense 1 as the best sense tant clues to WSD is pointed out in (Yarowsky, assignment. This assignment method does not even 1993), although it was demonstrated only on per- need to look at the training sentences. We call this forming binary (or very coarse) sense disambigua- method "Sense 1" in Table 5. Another assignment tion. The work of (Yarowsky, 1994) is perhaps the method is to determine the most frequently occur- most similar to our present work. However, his work ring sense in the training sentences, and to assign used decision list to perform classification, in which this sense to all test sentences. We call this method only the single best disambiguating evidence that "Most Frequent" in Table 5. The accuracy of LEXAS matched a target context is used. In contrast, we on these two test sets is given in Table 5. used exemplar-based learning, where the contribu- Our results indicate that exemplar-based classi- tions of all features are summed up and taken into fication of word senses scales up quite well when account in coming up with a classification. We also tested on a large set of words. The classification include verb-object syntactic relation as a feature, accuracy of LEXAS is always better than the default which is not used in (Yarowsky, 1994). Although the strategy of picking the most frequent sense. We be- work of (Yarowsky, i994) can be applied to WSD, lieve that our result is significant, especially when the results reported in (Yarowsky, 1994) only dealt the training data is noisy, and the words are highly with accent restoration, which is a much simpler ambiguous with a large number of refined sense dis- problem. It is unclear how Yarowsky's method will tinctions per word. fare on WSD of a common test data set like the one The accuracy on Brown corpus test files is lower we used, nor has his method been tested on a large than that achieved on the Wall Street Journal test data set with highly ambiguous words tagged with files, primarily because the Brown corpus consists the refined senses of WORDNET. of texts from a wide variety of genres, including The work of (Miller et al., 1994) is the only prior newspaper reports, newspaper editorial, biblical pas- work we know of which attempted to evaluate WSD sages, science and mathematics articles, general fic- on a large data set and using the refined sense dis- tion, romance story, humor, etc. It is harder to dis- tinction of WORDNET. However, their results show 45
  • 7. no improvement (in fact a slight degradation in per- References formance) when using surrounding words to perform WSD as compared to the most frequent heuristic. Ezra Black. 1988. An experiment in computational They attributed this to insufficient training data in discrimination of English word senses. IBM Jour- SEMCOm In contrast, we adopt a different strategy nal of Research and Development, 32(2):185-194. of collecting the training data set. Instead of tagging Eric Brill. 1992. A simple rule-based part of speech every word in a running text, as is done in SEMCOR, tagger. In Proceedings of the Third Conference on we only concentrate on the set of 191 most frequently Applied Natural Language Processing, pages 152- occurring and most ambiguous words, and collected 155. large enough training data for these words only. This strategy yields better results, as indicated by a bet- Rebecca Bruce and Janyce Wiebe. 1994. Word- ter performance of LEXAS compared with the most sense disambiguation using decomposable mod- frequent heuristic on this set of words. els. In Proceedings of the 32nd Annual Meeting Most recently, Yarowsky used an unsupervised of the Association for Computational Linguistics, learning procedure to perform WSD (Yarowsky, Las Cruces, New Mexico. 1995), although this is only tested on disambiguat- Claire Cardie. 1993. A case-based approach to ing words into binary, coarse sense distinction. The knowledge acquisition for domain-specific sen- effectiveness of unsupervised learning on disam- tence analysis. In Proceedings of the Eleventh Na- biguating words into the refined sense distinction of tional Conference on Artificial Intelligence, pages WoRBNET needs to be further investigated. The 798-803, Washington, DC. work of (McRoy, 1992) pointed out that a diverse set of knowledge sources are important to achieve Y. Choueka and S. Lusignan. 1985. Disambiguation WSD, but no quantitative evaluation was given on by short contexts. Computers and the Humani- the relative importance of each knowledge source. ties, 19:147-157. No previous work has reported any such evaluation Scott Cost and Steven Salzberg. 1993. A weighted either. The work of (Cardie, 1993) used a case-based nearest neighbor algorithm for learning with sym- approach that simultaneously learns part of speech, bolic features. Machine Learning, 10(1):57-78. word sense, and concept activation knowledge, al- Doug Cutting, Julian Kupiec, Jan Pedersen, and though the method is only tested on domain-specific Penelope Sibun. 1992. A practical part-of-speech texts with domain-specific word senses. tagger. In Proceedings of the Third Conference on 6 Conclusion Applied Natural Language Processing, pages 133- 140. In this paper, we have presented a new approach for William Gale, Kenneth Ward Church, and David WSD using an exemplar based learning algorithm. Yarowsky. 1992. Estimating upper and lower This approach integrates a diverse set of knowledge bounds on the performance of word-sense disam- sources to disambiguate word sense. When tested on biguation programs. In Proceedings of the 30th a common data set, our WSD program gives higher Annual Meeting of the Association for Computa- classification accuracy than previous work on WSD. tional Linguistics, Newark, Delaware. When tested on a large, separately collected data set, our program performs better than the default Graeme Hirst. 1987. Semantic Interpretation and strategy of picking the most frequent sense. To our the Resolution of Ambiguity. Cambridge Univer- knowledge, this is the first time that a WSD program sity Press, Cambridge. has been tested on such a large scale, and yielding Edward Kelly and Phillip Stone. 1 9 7 5 . Com- results better than the most frequent heuristic on puter Recognition of English Word Senses. North- highly ambiguous words with the refined senses of Holland, Amsterdam. WoRDNET. Claudia Leacock, Geoffrey Towell, and Ellen 7 Acknowledgements Voorhees. 1993. Corpus-based statistical sense resolution. In Proceedings of the ARPA Human We would like to thank: Dr Paul Wu for sharing Language Technology Workshop. the Brown Corpus and Wall Street Journal Corpus; Dr Christopher Ting for downloading and installing Alpha K. Luk. 1995. Statistical sense disambigua- WoRDNET and SEMCOR, and for reformatting the tion with relatively small corpora using dictio- corpora; the 12 undergraduates from the Linguis- nary definitions. In Proceedings of the 33rd An- tics Program of the National University of Singa- nual Meeting of the Association for Computa- pore for preparing the sense-tagged corpus; and Prof tional Linguistics, Cambridge, Massachusetts. K. P. Mohanan for his support of the sense-tagging Susan W. McRoy 1992. Using multiple knowledge project. sources for word sense discrimination. Computa- tional Linguistics, 18(1):1-30. 46
  • 8. George A. Miller, Ed. 1990. WordNet: An on-line lexical database. International Journal of Lezi- cography, 3(4):235-312. George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identifi- cation. In Proceedings of the ARPA Human Lan- guage Technology Workshop. Paul Procter et al. 1978. Longman Dictionary of Contemporary English. John Rachlin and Steven Salzberg. 1993. PEBLS 3.0 User's Guide. C Stanfill and David Waltz. 1986. Toward memory- based reasoning. Communications of the ACM, 29(12):1213-1228. Yorick Wilks, Dan Fass, Cheng-Ming Guo, James E. McDonald, Tony Plate, and Brian M. Slator. 1990. Providing machine tractable dictionary tools. Machine Translation, 5(2):99-154. David Yarowsky. 1992. Word-sense disambigua- tion using statistical models of Roger's categories trained on large corpora. In Proceedings of the Fifteenth International Conference on Computa- tional Linguistics, pages 454-460, Nantes, France. David Yarowsky. 1993. One sense per colloca- tion. In Proceedings of the ARPA Human Lan- guage Technology Workshop. David Yarowsky. 1994. Decision lists for lexical am- biguity resolution: Application to accent restora- tion in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Com- putational Linguistics, Las Cruces, New Mexico. David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cam- bridge, Massachusetts. Uri Zernik. 1990. Tagging word senses in corpus: the needle in the haystack revisited. Technical Report 90CRD198, GE R&D Center. 47