Adding morphological information to a connectionist Part-Of-Speech tagger

In this paper, we describe our recent advances in a novel approach to Part-Of-Speech tagging based on neural networks. Multilayer perceptrons are trained by corpus-based learning from contextual, lexical and morphological information. The Penn Treebank corpus was used to train and evaluate the tagging system. The results show that the connectionist approach is feasible and comparable with other approaches.

Usage Rights

CC Attribution-NonCommercial-NoDerivs License

Presentation Transcript

  • Adding morphological information to a connectionist Part-Of-Speech tagger
    F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, S. Tortajada-Velert
    Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain
    Escuela Superior de Enseñanzas Técnicas, Universidad CEU-Cardenal Herrera, Alfara del Patriarca, Valencia, Spain
    F. Zamora et al (UPV/CEU-UCH), CAEPIA 2009, 10-12 November 2009, Sevilla
  • Index
    1 POS tagging
    2 Probabilistic tagging
    3 Connectionist tagging
    4 The Penn Treebank Corpus
    5 The connectionist POS taggers
    6 Conclusions
  • What is Part-Of-Speech (POS) tagging?
    $T = \{\tau_1, \tau_2, \ldots, \tau_k\}$: a set of POS tags
    $\Omega = \{\omega_1, \omega_2, \ldots, \omega_m\}$: the vocabulary of the application
    The goal of a Part-Of-Speech tagger is to associate each word in a text with its correct lexical-syntactic category (represented by a tag).
    Example:
      The  grand  jury  commented  on  a   number  of  other  topics
      DT   JJ     NN    VBD        IN  DT  NN      IN  JJ     NNS
  • Ambiguity and applications
    Words often have more than one POS tag. Example: "lower"
      Europe proposed lower rate increases ...  = JJR
      To push the pound even lower ...          = RBR
      ... should be able to lower long-term ... = VB
    Ambiguity!
    Applications: speech synthesis, speech recognition, information retrieval, word-sense disambiguation, machine translation, ...
  • How hard is POS tagging?
    Measuring ambiguity: Penn Treebank (45-tag corpus)
      Unambiguous (1 tag)     36,678 (84%)
      Ambiguous (2-7 tags)     7,088 (16%)
      Details:
        2 tags    5,475
        3 tags    1,313 (e.g. lower)
        4 tags      250
        5 tags       41
        6 tags        7
        7 tags        2 (bet, open)
    A simple approach that assigns each word only its most common tag achieves 90% accuracy!
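The most-common-tag baseline mentioned on this slide can be sketched in a few lines. This is only an illustration on toy data: the function names, the toy sentences, and the fallback tag `NN` for unseen words are assumptions of the sketch, not part of the presentation.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_common_tag, default_tag="NN"):
    # default_tag for unseen words is an assumption of this sketch.
    return [most_common_tag.get(word, default_tag) for word in words]

# Toy training data (hypothetical, loosely based on the "lower" example).
train = [
    [("Europe", "NNP"), ("proposed", "VBD"), ("lower", "JJR"),
     ("rate", "NN"), ("increases", "NNS")],
    [("push", "VB"), ("the", "DT"), ("pound", "NN"),
     ("even", "RB"), ("lower", "RBR")],
    [("a", "DT"), ("lower", "JJR"), ("rate", "NN")],
]
model = train_baseline(train)
predicted = tag_baseline(["the", "lower", "rate", "unseen"], model)
```

Here "lower" is seen twice as JJR and once as RBR, so the baseline always answers JJR, which is exactly the kind of error a context-aware tagger is meant to fix.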
  • Unknown Words
    How can one assign a tag to a given word if that word is unknown to the tagger?
    Unknown words are the hardest problem for POS tagging!
  • Probabilistic model
    We are given a sentence: what is the best sequence of tags that corresponds to the sequence of words?
    Probabilistic view: consider all possible sequences of tags and, out of this universe of sequences, choose the tag sequence that is most probable given the observed sequence of words.
    $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n) \, P(t_1^n)$
  • Probabilistic model: Simplifications
    To simplify, two assumptions are made:
    1. Words are independent of each other, and a word's identity depends only on its tag → lexical probabilities:
       $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$
    2. The probability of a tag depends only on its predecessor tag(s) (bigram, trigram, ...) → contextual probabilities:
       $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
  • Probabilistic model: Limitations
    With these assumptions, a typical probabilistic model is expressed as:
    $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})$,
    where $\hat{t}_1^n$ is the best estimation of POS tags for the given sentence $w_1^n = w_1 w_2 \ldots w_n$, and considering that $P(t_1 \mid t_0) = 1$.
    1. It does not model long-distance relationships.
    2. The contextual information takes into account the context on the left, while the context on the right is not considered.
    Both limitations can be overcome using ANN models.
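The argmax over tag sequences in the bigram model is usually computed with dynamic programming (the Viterbi algorithm). The following is a minimal sketch under stated assumptions: the probability tables are toy values invented for illustration, the dictionary layout `lexical[(word, tag)]` and `contextual[(tag, previous_tag)]` is hypothetical, and the first transition is given probability 1 to mirror $P(t_1 \mid t_0) = 1$.

```python
import math

def viterbi(words, tags, lexical, contextual, start="<s>"):
    """Return the tag sequence maximizing prod_i P(w_i|t_i) * P(t_i|t_{i-1})."""
    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[t] = (log score of the best path ending in tag t, that path)
    best = {start: (0.0, [])}
    for word in words:
        new_best = {}
        for tag in tags:
            emit = logp(lexical, (word, tag))
            candidates = []
            for prev, (score, path) in best.items():
                # P(t_1 | t_0) = 1, so the first transition adds log(1) = 0.
                trans = 0.0 if prev == start else logp(contextual, (tag, prev))
                candidates.append((score + trans + emit, path + [tag]))
            new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

# Toy probabilities (invented for illustration only).
tags = ["JJR", "VB", "NN"]
lexical = {("lower", "JJR"): 0.6, ("lower", "VB"): 0.4,
           ("rate", "NN"): 0.9, ("rate", "VB"): 0.1}
contextual = {("NN", "JJR"): 0.7, ("VB", "JJR"): 0.1,
              ("NN", "VB"): 0.3, ("VB", "VB"): 0.05}
best_tags = viterbi(["lower", "rate"], tags, lexical, contextual)
```

Note that each step only looks at the previous tag, which is the left-context limitation listed on the slide.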
  • Basic connectionist model
    Example:
      Europe  proposed  lower  rate  increases
      NNP     VBD       ?????  NN    NNS
    MLPs as POS tag classifiers:
      MLP input:
        lower: $w_i$, the ambiguous input word, locally coded → projection layer
        NNP, VBD, NN, NNS: $c_i$, the tags of the words surrounding the ambiguous word to be tagged (past and future context), locally coded
      MLP output: the probability of each tag given the input: Pr(JJR|input) = 0.6, Pr(RBR|input) = 0.2, Pr(VB|input) = 0.1, ...
    Therefore, the network learns the following mapping:
    $F(w_i, c_i, t_i, \Theta) = \Pr_{\Theta}(t_i \mid w_i, c_i)$
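Local coding of the context can be pictured as one one-hot vector per context tag, concatenated in order. The sketch below is illustrative only: the reduced tag list, the boundary-tag spellings `<s>` and `</s>`, and the function names are assumptions, not taken from the presentation.

```python
# Tag inventory for the toy example; "<s>" and "</s>" stand for the two
# sentence-boundary tags (their spelling is an assumption of this sketch).
TAGSET = ["<s>", "</s>", "NNP", "VBD", "NN", "NNS", "JJR", "RBR", "VB"]

def one_hot(index, size):
    """Local coding: a vector of zeros with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector

def context_tags(tags, i, past, future):
    """Tags of the `past` words before and `future` words after position i."""
    left = [tags[j] if j >= 0 else "<s>" for j in range(i - past, i)]
    right = [tags[j] if j < len(tags) else "</s>"
             for j in range(i + 1, i + 1 + future)]
    return left + right

def encode_context(tags, i, past=2, future=2, tagset=TAGSET):
    """Concatenate the one-hot codes of all context tags."""
    vector = []
    for tag in context_tags(tags, i, past, future):
        vector.extend(one_hot(tagset.index(tag), len(tagset)))
    return vector

# "Europe proposed lower rate increases": position 2 is the word to tag.
tags = ["NNP", "VBD", None, "NN", "NNS"]
context = context_tags(tags, 2, past=2, future=2)
encoded = encode_context(tags, 2, past=2, future=2)
```

With 4 context positions and 9 tags in this toy inventory, the context part of the input has 36 units, exactly 4 of which are active.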
  • Morphological extended connectionist model
    Example:
      Europe   proposed  lower      rate     increases
      NNP-Cap  VBD-NCap  ?????      NN-NCap  NNS-NCap
                         NCap, -er
    MLPs as POS tag classifiers:
      MLP input:
        lower: $w_i$, the ambiguous input word, locally coded → projection layer
        NCap, -er: $m_i$, morphological information related to the ambiguous input word
        NNP-Cap, VBD-NCap, NN-NCap, NNS-NCap: $c_i$, the tags of the words surrounding the ambiguous word to be tagged (past and future context), extended with morphological information, locally coded
      MLP output: the probability of each tag given the input.
    Therefore, the network learns the following mapping:
    $F(w_i, m_i, c_i, t_i, \Theta) = \Pr_{\Theta}(t_i \mid w_i, m_i, c_i)$
  • And what about Unknown Words?
    When evaluating the model, some words have never been seen during training; they belong neither to the vocabulary of known ambiguous words nor to the vocabulary of known non-ambiguous words. These "unknown words" are the hardest problem for the network to tag correctly.
    Proposed solution: a combination of two specialized models:
      MLPKnow: the MLP specialized in known ambiguous words
      MLPUnk: the MLP specialized in unknown words
  • MLPKnow for known ambiguous words
    $w_i$: known ambiguous input word, locally coded at the input of the projection layer
    $m_i$: morphological information related to the ambiguous input word
    Context: two labels of past context and one label of future context, extended with morphological information.
    $F_{Know}(w_i, m_i, c_i, t_i, \Theta_K) = \Pr_{\Theta_K}(t_i \mid w_i, m_i, c_i)$
  • MLPUnk for unknown words
    $m_i$: morphological information related to the unknown input word (the same as for MLPKnow)
    $s_i$: more specific morphological information related to the unknown input word (not used by MLPKnow)
    Context: three labels of past context and one label of future context, extended with morphological information.
    $F_{Unk}(m_i, s_i, c_i, t_i, \Theta_U) = \Pr_{\Theta_U}(t_i \mid m_i, s_i, c_i)$,
    where $s_i$ corresponds to additional morphological information related to the i-th unknown input word.
  • $T_{w_i}$ table with the POS tags
      minutes        NNS, NNPS
      magnification  NN
      strikes        NNS, VBZ
      size           VBP, NN
      layoff         NN
      cohens         NNPS
      ...            ...
    $T_{minutes}$ = {NNS, NNPS}: known ambiguous word
    $T_{magnification}$ = {NN}: known non-ambiguous word
  • Final connectionist model
    For each possible known word (ambiguous and non-ambiguous) we have a $T_{w_i}$ table with the POS tags observed in training for word $w_i$:
    $F(w_i, m_i, s_i, c_i, t_i, \Theta_K, \Theta_U) =
      \begin{cases}
        0 & \text{if } t_i \notin T_{w_i}, \\
        1 & \text{if } T_{w_i} = \{t_i\}, \\
        F_{Know}(w_i, m_i, c_i, t_i, \Theta_K) & \text{if } w_i \in \Omega' \wedge t_i \in T_{w_i}, \\
        F_{Unk}(m_i, s_i, c_i, t_i, \Theta_U) & \text{otherwise,}
      \end{cases}$
    where $\Omega'$ is the vocabulary of ambiguous words.
    $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} \Pr(t_1^n \mid w_1^n) \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} F(w_i, m_i, s_i, c_i, t_i, \Theta_K, \Theta_U)$
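The case analysis of the final model can be read as a small dispatch function. The sketch below is a simplification under stated assumptions: `f_know` and `f_unk` are hypothetical stand-ins for the two trained MLPs (here they receive only the word and the tag, whereas the real networks also receive morphological and context inputs), and a word with no entry in the tag table is treated as unknown and routed to `f_unk`.

```python
def combined_score(word, tag, tag_table, ambiguous_vocab, f_know, f_unk):
    """Score of `tag` for `word`, following the four cases of the final model."""
    observed = tag_table.get(word)
    if observed is None:
        # Assumption of this sketch: no table entry means an unknown word.
        return f_unk(word, tag)
    if tag not in observed:
        return 0.0                   # tag never seen with this known word
    if len(observed) == 1:
        return 1.0                   # known non-ambiguous word
    if word in ambiguous_vocab:
        return f_know(word, tag)     # known ambiguous word
    return f_unk(word, tag)          # any other case

# Toy table and stand-in models (values invented for illustration).
tag_table = {"minutes": {"NNS", "NNPS"}, "magnification": {"NN"}}
ambiguous_vocab = {"minutes"}
f_know = lambda word, tag: {"NNS": 0.9, "NNPS": 0.1}[tag]
f_unk = lambda word, tag: 0.25

scores = [
    combined_score("magnification", "NN", tag_table, ambiguous_vocab, f_know, f_unk),
    combined_score("magnification", "VB", tag_table, ambiguous_vocab, f_know, f_unk),
    combined_score("minutes", "NNS", tag_table, ambiguous_vocab, f_know, f_unk),
    combined_score("zyx", "NN", tag_table, ambiguous_vocab, f_know, f_unk),
]
```

The first two cases never call a network, so only ambiguous and unknown words incur the cost of an MLP evaluation.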
  • The Penn Treebank Corpus
    This corpus consists of a set of English texts from the Wall Street Journal, distributed in 25 directories of 100 files each, with several sentences per file.
    The corpus contains about one million words, of which 49 000 are distinct.
    The whole corpus is labeled with POS and syntactic tags.
    The POS labeling uses a set of 45 different categories. Two more tags were added to mark the beginning and end of a sentence, giving a total of 47 different POS tags.
  • The Penn Treebank Corpus: Partitions
      Dataset   Directories  Num. of sentences  Num. of words  Vocabulary size
      Training  00-18                   38 219        912 344           34 064
      Tuning    19-21                    5 527        131 768           12 389
      Test      22-24                    5 462        129 654           11 548
      Total     00-24                   49 208      1 173 766           38 452
  • The Penn Treebank Corpus: Preprocessing
    The corpus is huge, with many words in the ambiguous vocabulary. Preprocessing to reduce the vocabulary:
      The training set was split into ten random partitions of equal size. Words that appeared in only one partition were considered unknown words.
      For each word, POS tags that account for less than 1% of its tag occurrences were eliminated (treated as tagging errors).
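Both preprocessing steps can be sketched on a list of tagged sentences. This is one possible reading of the slide, not the authors' code: partitioning by shuffled sentence index, the fixed random seed, and interpreting the 1% threshold as a share of a word's tag occurrences are all assumptions.

```python
import random
from collections import Counter, defaultdict

def find_unknown_words(sentences, n_parts=10, seed=0):
    """Words seen in only one of n_parts random partitions count as unknown."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    partitions_of = defaultdict(set)
    for index, sentence in enumerate(shuffled):
        part = index % n_parts          # near-equal partition sizes
        for word, _tag in sentence:
            partitions_of[word].add(part)
    return {word for word, parts in partitions_of.items() if len(parts) == 1}

def prune_rare_tags(sentences, min_share=0.01):
    """Build the per-word tag table, dropping tags below min_share of uses."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    table = {}
    for word, tags in counts.items():
        total = sum(tags.values())
        table[word] = {tag for tag, c in tags.items() if c / total >= min_share}
    return table

# Toy data: "the" occurs in every sentence, "cohens" in exactly one.
corpus = [[("the", "DT")] for _ in range(9)] + [[("the", "DT"), ("cohens", "NNPS")]]
unknown = find_unknown_words(corpus)

# "size" tagged NN 199 times and VBP once: VBP falls below the 1% threshold.
noisy = [[("size", "NN")]] * 199 + [[("size", "VBP")]]
table = prune_rare_tags(noisy)
```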
  • The Penn Treebank Corpus: Morph. information
    Two morphological preprocessing filters:
      Deleting the prefixes from compound words (using a set of the 125 most common English prefixes). In this way, some unknown words were converted to known words.
        Example: pre-, electro-, tele-, ...
      All cardinal and ordinal numbers (except "one" and "second", which are polysemous) were replaced with the special token *CD*.
        Example:
          twenty-years-old ⇒ *CD*-years-old
          post-1987 ⇒ post-*CD*
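The two filters might look roughly as follows. This is a hedged sketch: the prefix list contains only the three prefixes named on the slide (the full set of 125 is not given), the number-word list is a tiny invented sample, and stripping a prefix only when the remainder is a known word is an assumption about how the filter was applied.

```python
import re

PREFIXES = ("pre", "electro", "tele")   # the three examples from the slide
NUMBER_WORDS = {"two", "three", "ten", "twenty", "third", "tenth"}  # toy sample
POLYSEMOUS = {"one", "second"}          # kept as ordinary words
DIGITS = re.compile(r"\d+(?:[.,]\d+)*")

def strip_prefix(word, known_words):
    """Drop a leading prefix if what remains is a known word (assumption)."""
    if word in known_words:
        return word
    lowered = word.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            rest = word[len(prefix):].lstrip("-")
            if rest in known_words:
                return rest
    return word

def replace_numbers(token):
    """Replace numeric parts of a hyphenated token with the *CD* token."""
    parts = []
    for part in token.split("-"):
        lowered = part.lower()
        is_number = lowered in NUMBER_WORDS or DIGITS.fullmatch(part) is not None
        parts.append("*CD*" if is_number and lowered not in POLYSEMOUS else part)
    return "-".join(parts)

examples = [replace_numbers("twenty-years-old"), replace_numbers("post-1987")]
```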
  • The Penn Treebank Corpus: Morph. information
    Morphological information added to the MLPs:
      Three input units indicating whether the input word has an initial capital letter, is all in capitals, or has a subset of its letters in capitals. This is an important morphological characteristic, and it was also added to the POS tags of the context (both MLPs).
      A unit indicating whether the word contains a dash "-" (both MLPs).
      A unit indicating whether the word contains a period "." (both MLPs).
      Suffix analysis to deal with unknown words (only MLPUnk):
        Compute the probability distribution of tags for suffixes of length less than or equal to 10 ⇒ 709 suffixes found.
        An agglomerative hierarchical clustering process was followed, and an empirical set of clusters was chosen.
        Finally, a set of the 21 most common grammatical suffixes was added.
        MLPUnk needs 209 units to take into account the presence of suffixes in words.
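The five binary units shared by both MLPs can be sketched as a small feature function. The exact definition of the three capitalization units is not spelled out on the slide, so the reading used here (initial capital, all capitals, some but not all letters capital) is an assumption, as is the function name.

```python
def morph_features(word):
    """Return [first_cap, all_caps, some_caps, has_dash, has_period] as 0/1."""
    letters = [ch for ch in word if ch.isalpha()]
    uppers = [ch for ch in letters if ch.isupper()]
    first_cap = word[:1].isupper()
    all_caps = bool(letters) and len(uppers) == len(letters)
    # Assumed reading of "a subset": some, but not all, letters are capitals.
    some_caps = bool(uppers) and not all_caps
    return [int(first_cap), int(all_caps), int(some_caps),
            int("-" in word), int("." in word)]

features = {word: morph_features(word)
            for word in ["Europe", "IBM", "lower", "U.S.", "long-term"]}
```

In the model these units are appended to the input vector of the word being tagged, and the capitalization information is also attached to the context tags (for example NNP-Cap, VBD-NCap).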
  • The Penn Treebank Corpus: after preprocessing
      Dataset   Num. of words  Unambiguous  Ambiguous  Unknown
      Training        912 344      549 272    361 704    1 368
      Tuning          131 768       77 347     51 292    3 129
      Test            129 654       75 758     51 315    2 581
      Total         1 173 766      702 377    464 311    7 078
    Vocabulary in training: 6 239 ambiguous words and 25 798 unambiguous words were obtained.
  • The connectionist POS taggers
    MLPs with a projection layer, trained with the error backpropagation algorithm.
    The topology and parameters of the multilayer perceptrons were selected in previous experimentation.
    For the experiments we used a toolkit for pattern recognition tasks developed by our research group.
    MLPKnow was trained with the words of the ambiguous vocabulary.
    MLPUnk was trained with words that appear fewer than 4 times.
      Parameter                MLPKnow                   MLPUnk
      Input layer size         |T + M|(p + f) + 50 + |M|  |T + M|(p + f) + |M| + |S|
      Output layer size        |T|                       |T|
      Projection layer size    |Ω'| → 50                 -
      Hidden layer(s) size     100-75                    175-100
      Hidden layer act. func.  Hyperbolic tangent
      Output layer act. func.  Softmax
      Learning rate            0.005
      Momentum                 0.001
      Weight decay             0.0000001
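The forward pass of such a network (projection layer, tanh hidden layers, softmax output) can be sketched in plain Python. This is not the authors' toolkit: the layer sizes are tiny placeholders instead of the values in the table, the weights are random and untrained, and all names are hypothetical.

```python
import math
import random

def dense(x, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(word_id, other_inputs, projection, layers):
    # A locally coded word times the projection matrix is just a row lookup.
    x = projection[word_id] + other_inputs
    *hidden, output = layers
    for weights, biases in hidden:
        x = [math.tanh(v) for v in dense(x, weights, biases)]
    return softmax(dense(x, *output))

rng = random.Random(0)
vocab_size, proj_size, n_other, n_tags = 5, 4, 6, 3     # toy sizes
projection = [[rng.uniform(-0.1, 0.1) for _ in range(proj_size)]
              for _ in range(vocab_size)]
layers = [init_layer(rng, proj_size + n_other, 8),       # hidden layer 1
          init_layer(rng, 8, 6),                         # hidden layer 2
          init_layer(rng, 6, n_tags)]                    # output layer
probs = forward(2, [1, 0, 0, 1, 0, 0], projection, layers)
```

The output is a probability distribution over the tags, which is what the final model multiplies along the sentence.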
  • Performance on the tuning set
    POS tagging error rate (%) for the tuning set, varying the context (p is the past context and f is the future context).
    MLPUnk error
      Past \ Future      1      2      3
      1              12.56  12.46  12.40
      2              12.27  12.08  12.37
      3              12.59  11.95  12.24
      4              12.72  12.34  12.46
    MLPKnow error
      Past \ Future     2     3     4     5
      2              6.30  6.26  6.25  6.31
      3              6.28  6.22  6.20  6.31
      4              6.28  6.27  6.28  6.31
  • Test POS tagging performance
    POS tagging error rate for the tuning and test sets for the global system. Comparison of our connectionist system with morphological information versus our previous system without morphological information.
      Partition  With morph. info.  Without morph. info.
      Tuning     3.2%               4.2%
      Test       3.3%               4.3%
  • Conclusions: Comparison with other tagging systems
    POS tagging error rate (%) for the test set. KnownAmb is the disambiguation error for known ambiguous words. Unknown is the POS tag error for unknown words. Total is the total POS tag error, over ambiguous, non-ambiguous, and unknown words.
      Model         KnownAmb  Unknown  Total
      SVMs               6.1     11.0    2.8
      MT                   -     23.5    3.5
      TnT                7.8     14.1    3.5
      NetTagger            -        -    3.8
      HMM Tagger           -        -    5.8
      RANN                 -        -    8.0
      Our approach       6.7     10.3    3.3
    Results are comparable with state-of-the-art systems.
  • Conclusions: Future work
    Increase the amount of morphological information.
    Test the models in a graph-based approach.
    Introduce a language model of POS tags to improve the results.
  • Thank you!