Doc format.



Jessica Hullman
Natural Language Processing, Fall 2006
Professor Rich Thomason
December 15, 2006

Abstract

I modeled my project after the implementation of supervised word sense disambiguation with Support Vector Machines by Lee, Ng, and Chia. The authors participated in the English lexical sample task of the word sense disambiguation competition Senseval-3, training Support Vector Machines (SVM) on four knowledge sources: the Part of Speech (POS) of neighboring words, single words in the surrounding context, local collocations, and syntactic relations. This paper details the first section of my project, in which I modify the POS portion of their implementation using the identically formatted Senseval-2 data. I scored my performance on the accuracy of the sense assignments made by the SVM and obtained a mean accuracy of 87%, with a standard deviation of 15% and a median of 92%.

This paper has five parts: Introduction, Support Vector Machines, Method, Evaluation, and Possible Improvements. These are followed by a bibliography and a section called “Programs”, which outlines more specifically how the experiment proceeded.

I would like to acknowledge the following individuals who helped me (particularly in designing a couple of the more complicated programs): Robert Finn, Joshua Gerrish, and Rich Thomason.

Introduction

Word sense disambiguation (WSD), an area of considerable research in Computational Linguistics, refers to the problem of differentiating the various meanings of a word. A word is described as polysemous if it has multiple meanings. For example, given the word “bar” and a set of word senses such as “a long piece of wood, metal etc. used as a support”, “a barrier of any kind”, and “a plea arresting an action or claim”, the goal is to identify the correct sense of “bar” in a given sentence.

The problem of disambiguation can be described as AI-complete, in that some representation of common sense and real-world knowledge is required before it can be resolved (Lecture). Disambiguating a given word involves two steps. First, all of the different senses of the word, as well as the words to be considered along with it, must be determined. Second, a means must be found of assigning each occurrence of the word to the correct sense. Two major sources of information are typically used: the word’s context, and external knowledge sources including lexicons (Ide and Veronis, 1998, pg. 3).
WordNet, a large lexical database of English developed under George A. Miller, is the best known of the external knowledge sources. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept; different senses of a word are therefore in different synsets (WordNet). The meaning of the synsets is further clarified with short definitions.

Context-based methods (also called data-driven or corpus-based methods) use knowledge about previously disambiguated instances of the word within corpora (Ide and Veronis, 1998, pg. 3). The distinction between lexicon-driven, knowledge-based methods and corpus-based methods is often the same as the distinction between unsupervised and supervised learning (supervised referring to a task in which the sense label of each training instance is known, unsupervised to one in which it is not). Unsupervised methods outline a clustering task, in which the external knowledge source of a dictionary or lexicon is used to seed the system, which then augments the labeled instances by learning from unlabeled instances. Supervised learning, on the other hand, can be seen as a classification task in which a function is deduced from labeled data points.

Numerous issues arise with regard to word sense disambiguation. WordNet’s many synsets per word bring up one of the most prevalent of these: determining the appropriate degree of sense granularity for a given task. Several authors (e.g. Slator and Wilks, 1987) have remarked that the sense divisions one finds in dictionaries are often too fine for the purposes of NLP work; WordNet’s sense distinctions have been criticized, for example, for being more fine-grained than what may be needed in most natural language processing applications (Ide and Veronis, 1998, pg. 13).
Overly fine sense distinctions create practical difficulties for automated WSD by requiring sense choices that are extremely difficult, even for expert lexicographers. The problem of data sparseness also becomes severe: supervised methods need very large amounts of text to ensure that all of the possible senses of a word are represented. Producing corpora hand-labeled for senses, however, is an expensive, time-consuming task, and the results are often less than satisfactory. There is often a fair amount of disagreement among human taggers regarding the finer sense distinctions of a word.

Natural Language Processing tasks in which word sense disambiguation is a relevant concern include information retrieval, machine translation, and speech processing. Quite apart from the issues of granularity, evaluating WSD systems outside of these tasks remains a well-documented problem, arising not only from the substantial differences in test conditions across studies, but also from differences in test words and variance in the criteria for evaluating the correctness of a sense assignment. The SENSEVAL competition arose out of this need for accepted evaluation standards. SENSEVAL uses in vitro evaluation, which involves scoring a system’s output for a given input using precision and recall (versus in vivo evaluation, in which results are evaluated in terms of their contribution to the overall performance of a system for a given application) (Ide and Veronis, 1998, pg. 25). While somewhat artificial, the reasoning behind the Senseval competition, and thus behind my project, is that close examination of the problems that arise in word sense disambiguation will best improve the methods used.

Within the Senseval competition, participants can compete in tasks including translation as well as language-specific disambiguation. English tasks in Senseval include an English all-words task and the English lexical sample task; my project is concerned with the latter. In the lexical sample task, evaluation is based on how well a system disambiguates word-class specific (for example, all noun) instances in the test data of a sampling of words pulled from the WordNet lexicon. Tagging algorithms are expected to assign probabilities to the possible tags they output. To date three Senseval competitions have been held; this project uses Senseval-2 data. The corpus for the Senseval-2 English tasks is composed of sentences from the British National Corpus 2, the Penn Treebank, and the web, and is provided in XML format. I used only this corpus in training my system.

Machine Learning Using Support Vector Machines (SVM)

In recent years linear regression-based methods have increased in popularity for supervised learning tasks. A linear classifier is simply a classifier that bases its classification decision on a linear function of its inputs. In other words, given that the input feature vector to the classifier is a real vector $\vec{x}$, the estimated output score (or probability) is $y = f(\vec{w} \cdot \vec{x}) = f\left(\sum_j w_j x_j\right)$, where $\vec{w}$ is a real vector of weights and $f$ is a function that converts the dot product of the two vectors into the desired output (Wikipedia, Linear classifier). In general, linear classifiers are fast and work well when the number of dimensions of the input vector is very large; for example, in document classification, each element in the input vector is typically the count of a word in a document.
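As a minimal illustration of the formula above (not part of the original project), the following Python sketch computes the score w . x and applies a simple threshold as the function f. The weights, the word counts, and the cutoff of 0 are made-up values chosen only for the example.

```python
def linear_score(weights, features):
    """Dot product w . x of two equal-length real vectors."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features))


def threshold(score, cutoff=0.0):
    """One simple choice of f: +1 above the cutoff, -1 otherwise."""
    return 1 if score > cutoff else -1


def classify(weights, features, cutoff=0.0):
    """Linear classifier: y = f(w . x) with f a threshold function."""
    return threshold(linear_score(weights, features), cutoff)


# Hypothetical weights and word counts for a three-word vocabulary.
example_weights = [0.5, -1.0, 2.0]
example_counts = [2, 1, 1]
example_label = classify(example_weights, example_counts)
```

Here the score is 0.5*2 - 1.0*1 + 2.0*1 = 2.0, which lies above the cutoff, so the example is assigned to the positive class.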
Linear classifiers can be divided into generative models, which model conditional density functions, and discriminative models, which attempt to maximize the quality of the output on a training set. While common generative methods like Bayesian classification handle missing data well, discriminative training methods, including the perceptron and Support Vector Machines, generally yield a higher accuracy (Wikipedia, Linear classifier).
For a binary classification problem, f is a simple function mapping all values above a certain threshold to one class and all other values to a second class (e.g. “yes” and “no”). One can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the plane belong to the first class, and all points on the other side belong to the second class.

The SVM is a binary classification learning method that categorizes data by constructing a hyperplane, using optimization, between training instances mapped into a feature space (Schölkopf and Smola, 2002). Because Lee, Ng, and Chia built one binary classifier for each sense class, I opted to do the same. Like the authors, I converted nominal features with numerous possible values into the corresponding number of binary (0 or 1) features. In this scheme, if a nominal feature takes its nth value, then the corresponding (nth) binary feature is 1 and all of the other features are set to 0 (Witten and Frank, 2000).

The software I used is SVMlight, an implementation of SVMs in C. SVMlight solves classification and regression problems, as well as ranking problems, for which it learns a ranking function: the goal is to learn a function from preference examples, so that it orders a new set of objects as accurately as possible. Such ranking problems naturally occur in applications like search engines and recommender systems. SVMlight handles many thousands of support vectors and several hundred thousand training examples. It is an implementation of Vapnik's Support Vector Machine (Vapnik, 1995), and the algorithms it uses are described in Joachims (1999). The code has been used on a large range of problems, including text classification, image recognition tasks, bioinformatics and medical applications (SVMlight).

Method

Features: POS of Neighboring Words

The choice of features shapes the rest of the project.
Like Lee, Ng, and Chia, I used the Parts of Speech (POS) of neighboring words as features. The first step was to decide how many words ahead of and behind the given word to consider for POS information. Both because Lee, Ng, and Chia used a three-word window, and because research has shown that a window larger than k = 3 or 4 is unnecessary (Yarowsky, 1994a and b), I opted to use a three-word window.
As an example, given the training corpus sentence “As the leaves grow, train them through the bars for a lovely effect,” with “train” as the word to be disambiguated, the input vector corresponding to < P-3, P-2, P-1, P0, P1, P2, P3 > is set to < DT, NNS, VB, VB, PRP, IN, DT >. I converted all nominal features with numerous possible values into the corresponding number of binary features. This results in an input vector resembling < 01000…, 00001…, 00000…, 00000…, 10000…, 00000…, 00000… >, wherein each place in the vector corresponds to a 45-digit string of 0’s, with a 1 in the place corresponding to that particular tag.

Senseval provides the corpus already divided into a training and a test set. My first step involved parsing the XML format of the corpus. For this I used XML::Twig, a non-event-based XML parser that provides an easy-to-access tree interface (XML::Twig). While Twig made the initial parsing task much more efficient in terms of programming, I still had to develop a program to insert spacing where the parser took out certain tags.

The accuracy of the POS tagger used in a word sense disambiguation task is a limiting factor. My next step was to POS tag the corpus. I opted to use the Brill Tagger, an error-driven transformation-based tagger that works by first tagging a corpus based on the broadest of a set of tagging rules, then applying a slightly more specific rule, and repeating this process until some stopping criterion is reached (Jurafsky and Martin, 2006). I chose the tagger for its accuracy of 95-97% (Brill Tagger). Next, I needed to substitute certain characters output by the Brill tagger, because Perl treats these characters as special. I then needed to extract the POS information of the given words from the output of this program and convert the information to the format needed by the SVM, namely vectors of zeros and ones.
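The window extraction and binary encoding described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's Perl program: the ordering of the 45 Penn Treebank tags in `PENN_TAGS`, the function names, and the choice to leave a position all zeros when it falls outside the token list are all hypothetical. Punctuation is left out of the example tag sequence so that it matches the example vector given for the sentence above.

```python
# The 45 Penn Treebank tags (36 word tags plus 9 punctuation tags).
# The ordering here is an assumption, not the table used in the project.
PENN_TAGS = [
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
    "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
    "VBZ", "WDT", "WP", "WP$", "WRB",
    "$", "#", "``", "''", "(", ")", ",", ".", ":",
]
TAG_INDEX = {tag: i for i, tag in enumerate(PENN_TAGS)}
WINDOW = 3  # three words before and three words after the target


def pos_window(tags, target, window=WINDOW):
    """Tags at positions target-window .. target+window (None if out of range)."""
    return [tags[i] if 0 <= i < len(tags) else None
            for i in range(target - window, target + window + 1)]


def one_hot(tag):
    """45 binary features with a single 1 at the tag's position."""
    bits = [0] * len(PENN_TAGS)
    if tag is not None:  # assumption: out-of-range positions stay all zero
        bits[TAG_INDEX[tag]] = 1
    return bits


def encode(tags, target, window=WINDOW):
    """Concatenate the one-hot strings for P-3 .. P3 into one feature vector."""
    vector = []
    for tag in pos_window(tags, target, window):
        vector.extend(one_hot(tag))
    return vector


# "As the leaves grow train them through the bars for a lovely effect"
example_tags = ["IN", "DT", "NNS", "VB", "VB", "PRP", "IN", "DT",
                "NNS", "IN", "DT", "JJ", "NN"]
example_vector = encode(example_tags, target=4)  # index 4 is "train"
```

With a three-word window and 45 tags, each instance becomes 7 x 45 = 315 binary features, exactly seven of which are set to 1 when the whole window lies inside the token list.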
To build these vectors, I used a program that created a table corresponding to the 45 parts of speech, and then read through the parsed, POS-tagged corpora, keeping track of when it reached a new instance of a word. It then reads through the context associated with each instance, keeping track of when it gets to the given word. When the word to be disambiguated is reached, the POS’s of that word and of the three words before and after it are converted into a vector of seven 45-digit strings, with each place in a string corresponding to a POS. For each POS in the vector, a one is inserted in the place corresponding to that POS, while all the other places remain 0’s. A separate file is created for each word. After this, the corpora (now in the form of separate files for each word) needed to be separated into files corresponding to each separate word sense, so that the SVM could be run once for each particular sense of a word.

Evaluation

I evaluated my project using the evaluation module built into the SVM software. Provided that the correct answers are supplied with the test data, the SVM outputs statistics on the accuracy, precision, and recall of its sense assignments. I evaluated my project on the accuracy of the sense assignments, getting a mean of 87%, with a median of 92% and a standard deviation of 15%.

Possible Improvements

Several minor changes might considerably influence my results. Most importantly, to compare this project accurately to the one it was modeled on (Lee, Ng, and Chia), I would need to use the Senseval-3 data (now in the public domain) as well as the Senseval scoring software.

Running a sentence segmentation program on the corpora before POS tagging them would have allowed me to track where in the sentence the word to be disambiguated occurred. Currently, my project tracks POS information across sentence boundaries.

Like Lee, Ng, and Chia, I built one binary classifier for each sense (meaning) of a word. However, I might instead have run the SVM with a step-wise reduction method, in which a binary classifier is first built over all senses of a word; then, as one sense at a time is eliminated by the SVM, it is removed from the input data file, and a new classifier is built for the remaining senses. This method would be more computationally efficient, but whether it would improve the accuracy remains to be seen.
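The sentence-boundary improvement mentioned above could look roughly like the following Python sketch. It assumes that a sentence segmenter has already assigned a sentence number to every token; the `sentence_ids` list, the function name, and the `"NULL"` padding tag are hypothetical choices for illustration and were not part of the project.

```python
def bounded_window(tags, sentence_ids, target, window=3, pad="NULL"):
    """POS tags around `target`, with `pad` for positions outside its sentence.

    tags         -- POS tag of every token in the context
    sentence_ids -- sentence number of every token (same length as tags)
    target       -- index of the word to be disambiguated
    """
    result = []
    for i in range(target - window, target + window + 1):
        inside = (0 <= i < len(tags)
                  and sentence_ids[i] == sentence_ids[target])
        result.append(tags[i] if inside else pad)
    return result


# Two short sentences; the target (index 4) is the first word of the second.
example_tags = ["DT", "NN", "VBD", ".", "PRP", "VBD", "DT", "NN", "."]
example_sentence_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]
example_window = bounded_window(example_tags, example_sentence_ids, target=4)
```

In this example the three positions before the target belong to the previous sentence, so they are padded rather than filled with that sentence's tags.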
Bibliography

Ide, Nancy and Jean Veronis. (1998). “Word sense disambiguation: The state of the art.” In Computational Linguistics, 24(1).

Joachims, Thorsten. (1999). “Transductive inference for text classification using support vector machines.” Universität Dortmund, Dortmund, Germany.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd edition. (Online version).

Lee, Yoong Keok and Hwee Tou Ng. (2002). “An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Schölkopf, Bernhard and Alex Smola. (2002). Learning with kernels. MIT Press, Cambridge, MA.

Vapnik, Vladimir N. (1995). The nature of statistical learning theory. Springer-Verlag, New York.

Witten, Ian H. and Eibe Frank. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.

Yarowsky, D. (1995). “A comparison of corpus-based techniques for restoring accents in Spanish and French text.” Proceedings of the 2nd Annual Workshop on Very Large Text Corpora. Las Cruces.
Programs

All programs can be found in (and are to be run from) /data0/users/rthomaso/tmp/hullman on tangra. All files created within the referenced programs are output to this directory. This sequence of commands/programs was run first on the training data, then on the test data. To re-run it, some of the pathnames specified in the programs may need to be changed back (they are currently set to run on the test data); the actual running of this sequence gets rather complicated.

Steps:

1. Run “”. This program inserts a space in the original XML corpora so that the next program, which parses them, does not abut two words without a space between them.

2. Run “”. This program calls up XML::Twig and parses the corpora, outputting a file “senseval_data_spaced.txt”.

3. Run the following three commands:

   cd /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data
   export PATH=$PATH:/data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data
   /data2/tools/RULE_BASED_TAGGER_V1.14/Bin_and_Data/tagger LEXICON /data0/users/rthomaso/tmp/hullman/senseval_data_spaced.txt BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > /data0/users/rthomaso/tmp/hullman/senseval_tagged.txt

   These three commands run the Brill tagger on the parsed corpora, outputting the results to a file called senseval_tagged.txt.

4. Run “”. This program takes the POS-tagged corpora and substitutes strings for problematic characters output by the POS tagger, including $, (, ), #, ., ;, and commas. It outputs a file “senseval_tagged_POS_substituted.txt”.
5. Run “”. This program and the next are the bulk of the project. This one reads in the POS-tagged, character-substituted output file from the previous program, tracking when it gets to a new instance of a word. It then reads in the POS’s, and when it gets to the given word, creates the vector of the POS of that word as well as of the three words before and after it. For each POS in that vector, a 45-digit string is created, with a one inserted in the place in that string corresponding to the part of speech of the word. A separate file is created for every word, along with an indication of which particular vector corresponds to the POS of the given word (out of all of the other words in the context). Each of these files ends in “_SVM_input.txt”.

6. Run “”. This splits the data into separate files corresponding to each instance and sense id of the word. The files it outputs for each instance/sense id are formatted for input into SVMlight; each ends in “_SVM_prepared_input.txt”.

7. Run the SVM from /data0/tools/svmlight using two commands: “./svm_learn input_file model_file” and “./svm_classify input_file model_file output_file”. The first command runs the learning module on the training data file (corresponding to one sense of a word) and outputs a model file (parameters), which the classifying module reads in order to make predictions. The classifying module takes an input file (corresponding to one sense of a word in the test data) and the model file created by running the learning module on the training data file for that sense. It outputs a file with a prediction for each instance (in the form of a one or negative one, depending on which side of the hyperplane the instance falls), as well as statistics including the accuracy, precision, and recall of the classification.
*Because the Senseval test data did not supply the correct answers within the XML corpora, and because this information was needed to evaluate the SVM, I needed to call up a separate Senseval file of answers for the test data and insert this information into the test corpora (see “test_sense_id.plx”).
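Steps 6 and 7 above turn each word's instances into one binary problem per sense, written in SVMlight's input format, where each line reads `<target> <feature>:<value> ...` with feature numbers starting at 1. The Python sketch below illustrates that conversion; it is not the author's Perl program. The function names and the sense labels `sense_a` and `sense_b` are hypothetical, and writing only the non-zero features (the sparse form SVMlight accepts) is an assumption about the layout rather than a description of the project's actual files.

```python
def svmlight_line(label, bits):
    """Format one example as '<target> <feature>:<value> ...'.

    label -- positive for the sense being learned, otherwise negative
    bits  -- list of 0/1 features; only the non-zero ones are written,
             numbered from 1 in increasing order
    """
    target = "+1" if label > 0 else "-1"
    pairs = ["%d:%d" % (i + 1, b) for i, b in enumerate(bits) if b != 0]
    return " ".join([target] + pairs)


def one_vs_rest(instances, sense):
    """Lines for the binary problem 'is this instance tagged with `sense`?'.

    instances -- list of (sense_id, bits) pairs for a single word
    """
    return [svmlight_line(1 if s == sense else -1, bits)
            for s, bits in instances]


# Three toy instances with five features each (real vectors have 315).
example_instances = [
    ("sense_a", [0, 1, 0, 0, 1]),
    ("sense_b", [1, 0, 0, 1, 0]),
    ("sense_a", [0, 0, 1, 0, 0]),
]
example_lines = one_vs_rest(example_instances, "sense_a")
```

Writing `example_lines` to a file, one line per instance, would give an input file of the kind passed to svm_learn for training and to svm_classify for testing; repeating the call with each sense of the word yields one file per sense.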