To Be or Not to Be a Zero Pronoun: A Machine Learning Approach for Romanian
Mihăilă, C., Ilisei, I. & Inkpen, D. To Be or Not to Be a Zero Pronoun: A Machine Learning Approach for Romanian. In Proceedings of PROMISE joint with CICLing 2010, Iaşi, Romania


Usage Rights

CC Attribution-NonCommercial-ShareAlike License

To Be or Not to Be a Zero Pronoun: A Machine Learning Approach for Romanian Presentation Transcript

  • 1. To Be or Not To Be a Zero Pronoun? A Machine Learning Approach for Romanian
    Claudiu Mihăilă (1), Iustina Ilisei (2), Diana Inkpen (3)
    (1) Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iaşi
    (2) Research Institute in Information and Language Processing, University of Wolverhampton
    (3) School of Information Technology and Engineering, University of Ottawa
    PROMISE, 29 March 2010, Iaşi, Romania
    Sections: Introduction, Corpus, Identification, Conclusions
    Running footer: Mihăilă, Ilisei & Inkpen, Identifying Romanian Zero Pronouns
  • 2. Outline
    1 Introduction
        Motivation
        Zero Subjects vs. Zero Pronouns
        Previous Work
    2 Corpus
        Annotation
        Statistics
    3 Identification
        Features
        Algorithms
        Results
    4 Conclusions
  • 3. Motivation
    The problem
        Invisible anaphors
        Lack of morphological information
    Utility
        Information extraction/retrieval
        Automatic summarisation
        Machine translation
        Multiple-choice test item generation
        etc.
  • 5. Zero Subjects vs. Zero Pronouns
    Zero subjects
        The verb does not need a subject
        Plouă. (It is raining.)
        Îmi pare rău de voi. (I feel sorry for you.)
        Azi nu-mi arde de glumă. (I am in no mood for jokes today.)
    Zero pronouns
        Lexically retrievable from the inflection of the verb
        Coreferring an overt noun, noun phrase, or clause
        zp[Eu] Merg la şcoală. ([I] go to school.)
        Cine a auzit s-a întors şi zp[acela] a plecat. (Whoever heard turned back and [that one] left.)
  • 7. Previous Work
    For other languages
        Spanish: Ferrández & Peral (2000), Rello & Ilisei (2009)
        Chinese: Converse (2006), Zhao & Ng (2007)
        Japanese, Korean, Portuguese, etc.
    For Romanian
        Harabagiu & Maiorano (2000)
        Pavel et al. (2006)
  • 9. Annotation
    Empty XML tag with attributes
        id
        antecedent – the reference id, 'non-nominal', or 'elliptic'
        dependent verb – the reference id
        clause type – main, coordinated, juxtaposed, or subordinated
        annotator confidence – regarding the position, high or low
    Inter-annotator agreement
        Agreement on ZP's dependent verb: ≈ 98%
        Cohen's Kappa Coefficient: κ ≈ 90%
        Agreement on ZP's position in text: ≈ 90%
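The kappa figure on the annotation slide corrects raw agreement for agreement expected by chance. The following is a minimal sketch of that computation for two annotators. The label names and the toy annotations are hypothetical and are not taken from the RoZP corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: one decision per verb ("zp" or "none").
annotator_1 = ["zp", "zp", "none", "none", "zp",
               "none", "none", "zp", "none", "none"]
annotator_2 = ["zp", "zp", "none", "none", "zp",
               "none", "zp", "zp", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 9 of 10 items, chance agreement is 0.5, and kappa comes out lower than the raw agreement.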
  • 11. Statistics
    Corpus size overview
                           NT      ET      LT      ST      Overall
        No. of tokens      18690   12963   13739   3391    48783
        No. of sentences   816     574     790     253     2433
        No. of ZPs         245     172     113     251     781
        Avg. tokens/sent.  22.90   22.58   17.39   13.40   20.05
        Avg. ZP/sent.      0.30    0.30    0.14    0.99    0.32
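The two average rows of the statistics table follow directly from the three count rows. A short sketch that recomputes them is given here as a consistency check. The dictionary layout and function name are assumptions made for illustration, and the counts are copied from the slide.

```python
# Counts per sub-corpus, copied from the statistics slide.
corpus = {
    "NT": {"tokens": 18690, "sentences": 816, "zps": 245},
    "ET": {"tokens": 12963, "sentences": 574, "zps": 172},
    "LT": {"tokens": 13739, "sentences": 790, "zps": 113},
    "ST": {"tokens": 3391, "sentences": 253, "zps": 251},
}


def averages(counts):
    """Return (tokens per sentence, ZPs per sentence), rounded to 2 places."""
    return (
        round(counts["tokens"] / counts["sentences"], 2),
        round(counts["zps"] / counts["sentences"], 2),
    )


# The "Overall" column is the sum of the four sub-corpora.
overall = {
    key: sum(part[key] for part in corpus.values())
    for key in ("tokens", "sentences", "zps")
}

per_part = {name: averages(counts) for name, counts in corpus.items()}
overall_averages = averages(overall)
```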
  • 12. Features
    10 features
    From RACAI's parser
        type – main, auxiliary, copulative, or modal
        mood – indicative, subjunctive, etc.
        tense – present, imperfect, past, or pluperfect
        person – first, second, or third
        number – singular or plural
        gender – masculine, feminine, or neuter
        clitic – whether clitic form or not
    Dynamically computed
        impersonality – whether strictly impersonal or not
        'se' – verb preceded by reflexive pronoun 'se'
    The verb class from the manual annotation
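The slides do not say how the 'se' feature is computed. One possible reading is sketched here, assuming a tokenised sentence and a known verb position, and checking only the token immediately before the verb. The function name and this adjacency rule are assumptions, not the authors' implementation, and contracted forms such as "s-a" are not handled.

```python
def preceded_by_se(tokens, verb_index):
    """True if the token right before the verb is the reflexive 'se'.

    Assumption: only the immediately preceding token is inspected.
    """
    if not 0 <= verb_index < len(tokens):
        raise IndexError("verb_index is outside the sentence")
    return verb_index > 0 and tokens[verb_index - 1].lower() == "se"


# Example sentences taken from the slides, tokenised by hand.
impersonal = ["Se", "întunecă", "la", "ora", "cinci", "."]
weather = ["Plouă", "."]

se_before_verb = preceded_by_se(impersonal, 1)
no_se = preceded_by_se(weather, 0)
```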
  • 15. Algorithms
    Weka classifiers
        SMO – implementation of SVM
        JRip – implementation of decision rules
        J48 – implementation of decision trees
        Vote – majority-voting meta-classifier on previous three
    Data set
        781 verbs with a ZP
        781 randomly selected verbs without a ZP
        10-fold cross validation
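The experimental setup (a balanced sample, ten folds, and a majority vote over three classifiers) can be sketched in plain Python. This is an illustration only: the experiments used Weka, and the function names, the seed, and the label strings here are assumptions.

```python
import random
from collections import Counter


def balanced_sample(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random set of negatives."""
    rng = random.Random(seed)
    return list(positives) + rng.sample(list(negatives), len(positives))


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    With three voters and two classes a tie cannot occur.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical ids: 781 verbs with a ZP, a larger pool of verbs without one.
with_zp = range(781)
without_zp = range(1000, 4000)

dataset = balanced_sample(with_zp, without_zp)
folds = k_fold_indices(len(dataset), k=10)
decision = majority_vote(["zp", "no_zp", "zp"])
```

Each fold serves once as the test set while the other nine are used for training.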
  • 17. Results
    Classifier results
                            has ZP                  not ZP
        Classifier  Acc.    P      R      F1        P      R      F1
        SMO         0.739   0.684  0.889  0.773     0.841  0.590  0.694
        JRip        0.733   0.709  0.793  0.748     0.765  0.675  0.717
        J48         0.720   0.698  0.777  0.735     0.749  0.663  0.703
        Vote        0.733   0.705  0.802  0.750     0.770  0.665  0.713
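The F1 columns are the harmonic mean of precision and recall. Because the data set is balanced (781 verbs per class), accuracy equals the mean of the two class recalls. The sketch below recomputes both from the figures on the slide. The tuple layout is an assumption, and small differences from the printed values come from rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy_balanced(recall_pos, recall_neg):
    """Accuracy when both classes have the same number of instances."""
    return (recall_pos + recall_neg) / 2


# (accuracy, P, R, F1 for "has ZP", P, R, F1 for "not ZP"), from the slide.
results = {
    "SMO": (0.739, 0.684, 0.889, 0.773, 0.841, 0.590, 0.694),
    "JRip": (0.733, 0.709, 0.793, 0.748, 0.765, 0.675, 0.717),
    "J48": (0.720, 0.698, 0.777, 0.735, 0.749, 0.663, 0.703),
    "Vote": (0.733, 0.705, 0.802, 0.750, 0.770, 0.665, 0.713),
}

recomputed = {
    name: (
        accuracy_balanced(row[2], row[5]),
        f1_score(row[1], row[2]),
        f1_score(row[4], row[5]),
    )
    for name, row in results.items()
}
```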
  • 18. Results
    Attribute evaluation
        Attribute      ChiSquare  InfoGain
        Mood           402.546    0.206
        'Se'           25.719     0.012
        Person         21.217     0.010
        Impersonality  12.092     0.007
        Tense          9.371      0.004
        Type           2.577      0.001
        Number         0.354      1E-4
        Gender         7E-4       3E-7
        Clitic         0          0
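Information gain measures how much knowing an attribute's value reduces the entropy of the class label. Here is a minimal sketch on toy data. The attribute values and labels are hypothetical and do not reproduce the figures in the table.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: verb mood against presence of a zero pronoun.
moods = ["ind", "ind", "ind", "inf", "inf", "inf", "ind", "inf"]
labels = ["zp", "zp", "zp", "no", "no", "no", "no", "zp"]

gain_mood = information_gain(moods, labels)
gain_constant = information_gain(["same"] * len(labels), labels)
```

An attribute that takes the same value for every instance yields a gain of 0, since splitting on it leaves the label distribution unchanged.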
  • 19. Results
    Error analysis
        Ambiguity:
            E greu fără bani. (It is hard without money.)
            E greu de scris o carte. (It is hard to write a book.)
            Se întunecă la ora cinci. (It gets dark at five o'clock.)
            El se întunecă la faţă. (His face darkens.)
        Parser errors
  • 21. Conclusions
    Summary
        RoZP, a corpus with manually annotated ZPs
        Identification of over 70% of ZPs using ML methods
    Outlook
        Improve the identification accuracy
            other features – no. of verbs in sentence
            syntactic information?
        Resolve the identified ZPs
  • 23. Thank you! Questions?