Añotador: a Temporal Tagger for Spanish

Presentation of Añotador, a temporal tagger for Spanish, at the LKE2019 (Dublin, Ireland)


  1. Añotador: a Temporal Tagger for Spanish. María Navas-Loro (mnavas@fi.upm.es, https://marianavas.oeg-upm.net) and Víctor Rodríguez-Doncel, Ontology Engineering Group, Universidad Politécnica de Madrid, Spain. LKE 2019, 29/10/2019.
  2. Outline
     • Brief Introduction
     • State of the Art
       o Temporal Taggers
       o Corpora
     • Añotador
     • Corpus
     • Results
     • Challenges
     • Conclusions
  3. BRIEF INTRODUCTION: Temporal Annotation
  4. Temporal information
     • Temporal information is everywhere: news, medical texts, legal documents…
     • It has many applications: timeline generation, search engines, summarization, pattern detection and question answering (QA).
     • QA example: “When was the complaint against Google Inc made?” “It was lodged on 5 March 2010.” “Was it accepted?” “It was rejected with regard to La Vanguardia. No information related to Google Inc.”
     • Nevertheless, few tools are available to extract temporal information in Spanish.
  5. The task. Temporal taggers:
     1) Detect temporal expressions: DATE, DURATION, SET, TIME.
     2) Normalize them.
     3) Additionally, extract events and relations.
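To make steps 1) and 2) concrete, here is a minimal Python sketch that detects a single date pattern ("D de MES de YYYY") and normalizes it to an ISO-style value. It is not Añotador's code: the `Timex` tuple, the `MONTHS` lexicon and `tag_dates` are hypothetical names introduced only for illustration.

```python
import re
from typing import List, NamedTuple

# Hypothetical mini-lexicon of Spanish month names (illustration only).
MONTHS = {"enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5,
          "junio": 6, "julio": 7, "agosto": 8, "septiembre": 9,
          "octubre": 10, "noviembre": 11, "diciembre": 12}


class Timex(NamedTuple):
    text: str   # extent: the span found in the text
    type: str   # DATE, TIME, DURATION or SET
    value: str  # normalized value


DATE_RE = re.compile(
    r"\b(\d{1,2}) de (" + "|".join(MONTHS) + r") de (\d{4})\b",
    re.IGNORECASE,
)


def tag_dates(text: str) -> List[Timex]:
    """Detect 'D de MES de YYYY' and normalize it to YYYY-MM-DD.

    The day number is not validated here; a real tagger would also
    reject impossible dates.
    """
    found = []
    for m in DATE_RE.finditer(text):
        day = int(m.group(1))
        month = MONTHS[m.group(2).lower()]
        year = int(m.group(3))
        found.append(Timex(m.group(0), "DATE", f"{year:04d}-{month:02d}-{day:02d}"))
    return found


print(tag_dates("La demanda se presentó el 5 de marzo de 2010."))
```

A real tagger covers all four types and many more surface forms; this sketch only shows how detection (the extent) and normalization (the value) are separate outputs.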
  6. STATE OF THE ART: What has been done
  7. Corpora
     • A lot in English!
       o News: Timebank corpus [1], TempEval challenges datasets [2], MEANTIME corpus [3]
       o Medical: THYME [4]
       o Historical: Wikiwars corpus [5]
       o Scientific: Abstracts [6]
       o Colloquial: SMS [6] and tweets [7]
     • Fewer for Spanish…
       o News: The Spanish TimeBank corpus [8], MEANTIME [3]
       o Historical: The ModeS TimeBank 1.06 (texts from the 17th and 18th centuries) [9]
     * TempEval 2 and TempEval 3 challenges (fragment of TimeBank)
  8. Temporal Taggers
     • A lot in English!
       o Rule-based: HeidelTime [10], SUTime* [11], SynTime [12]
       o Machine-learning-based: ClearTK-TimeML [13], USFD2 [14], UWTime [15], TARSQI [16], CAEVO [17], TIPSem [18]
     • For Spanish, just the bold ones… (there are more in the literature, but we could not find them available!)
     • No corpora, no taggers!
  9. AÑOTADOR: Our tool
  10. Añotador (aka Annotador) is a temporal tagger for Spanish and English. It is able to find and normalize dates, times, durations and sets (temporal expressions that repeat periodically, such as “monthly”, “twice a year” or “every Tuesday”). A demo is available on its webpage, and it can be called as a web service or downloaded from GitHub (where all the code and a Docker image are available). http://annotador.oeg-upm.net/
  11. Pipeline of Añotador (diagram)
     • INPUT: text + anchor date
     • CoreNLP pipeline + IxaPipes
     • RULES:
       o token rules: identify numbers, months, granularities (day)...
       o basic TEs rules: # + granularity (two days)
       o compound rules: merge TEs, e.g. 1 month (1M) and 2 days (2D) = P1M2D
     • NORMALIZATION ALGORITHM: for each sentence, for each expression, depending on its type (DATE, TIME, DURATION, SET). Decision nodes in the flowchart: reference current date?, reference next/last?, is anchored? Action nodes: do calculus, setFormat (YYYY-MM-DD), normalize XX values, add lastDate, process durations, update lastDate.
     • OUTPUT
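The compound-rule example in the diagram (1 month and 2 days merged into P1M2D) can be sketched in a few lines of Python. This is an illustrative sketch, not Añotador's implementation: `merge_durations` and the unit tables are hypothetical names, and only ISO 8601 year/month/day and hour/minute/second designators are handled.

```python
from typing import Dict, Iterable, Tuple

# ISO 8601 duration designators, in the order the standard expects them.
DATE_UNITS = {"year": "Y", "month": "M", "day": "D"}
TIME_UNITS = {"hour": "H", "minute": "M", "second": "S"}


def merge_durations(parts: Iterable[Tuple[int, str]]) -> str:
    """Merge (amount, granularity) pairs into one ISO 8601 duration."""
    totals: Dict[str, int] = {}
    for amount, unit in parts:
        if unit not in DATE_UNITS and unit not in TIME_UNITS:
            raise ValueError(f"unknown granularity: {unit}")
        totals[unit] = totals.get(unit, 0) + amount
    if not totals:
        raise ValueError("no duration parts given")
    date_part = "".join(f"{totals[u]}{c}" for u, c in DATE_UNITS.items() if u in totals)
    time_part = "".join(f"{totals[u]}{c}" for u, c in TIME_UNITS.items() if u in totals)
    # The "T" separator is only written when a time component exists.
    return "P" + date_part + ("T" + time_part if time_part else "")


print(merge_durations([(1, "month"), (2, "day")]))
print(merge_durations([(1, "year"), (6, "month"), (1, "day")]))
```

The second call corresponds to an expression such as “1 año, 6 meses y un día”, once each basic expression has been recognized separately.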
  12. Pipeline of Añotador, step 1: Preprocessing
     • We get as input the text and the anchor date (if none is given, we assume the current day).
     • We use CoreNLP for lemmatizing, sentence splitting…
     • We added IxaPipes models for Spanish to improve the quality of the output.
  13. Pipeline of Añotador, step 2: Rules. More than 100 rules written in CoreNLP TokensRegex format:
     1) Token-based rules, for expressions such as numerals, granularities…
     2) Basic temporal expression rules, working on previously found basic expressions.
     3) Compound expression rules, for value inheritance or composition.
     4) Literal expression rules, for specific expressions such as “previously”, polysemous expressions (Spanish “mañana”) or expressions including tokens that can appear in other temporal expressions, requiring a dedicated approach.
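The layering of the first two rule levels can be sketched as follows. This is plain Python, not TokensRegex syntax, and it is not taken from Añotador's rule files: the lexicons and the functions `token_rules` and `basic_rules` are hypothetical and cover only a handful of words.

```python
from typing import List, Optional, Tuple

# Hypothetical mini-lexicons (illustration only).
NUMBERS = {"un": 1, "una": 1, "dos": 2, "tres": 3, "cinco": 5}
GRANULARITIES = {"día": "D", "días": "D", "mes": "M", "meses": "M",
                 "año": "Y", "años": "Y"}

Labelled = Tuple[str, Optional[str], object]


def token_rules(tokens: List[str]) -> List[Labelled]:
    """Level 1: label each token as NUMBER, GRANULARITY or nothing."""
    labelled = []
    for tok in tokens:
        low = tok.lower()
        if low.isdecimal():
            labelled.append((tok, "NUMBER", int(low)))
        elif low in NUMBERS:
            labelled.append((tok, "NUMBER", NUMBERS[low]))
        elif low in GRANULARITIES:
            labelled.append((tok, "GRANULARITY", GRANULARITIES[low]))
        else:
            labelled.append((tok, None, None))
    return labelled


def basic_rules(labelled: List[Labelled]) -> List[Tuple[str, str]]:
    """Level 2: a NUMBER followed by a GRANULARITY is a basic duration."""
    found = []
    for (t1, l1, v1), (t2, l2, v2) in zip(labelled, labelled[1:]):
        if l1 == "NUMBER" and l2 == "GRANULARITY":
            found.append((f"{t1} {t2}", f"P{v1}{v2}"))
    return found


print(basic_rules(token_rules("Tardó dos días y 1 mes".split())))
```

Each level only consumes labels produced by the level before it, which is why literal and polysemous expressions such as “mañana” need their own dedicated rules.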
  14. Mañana, a tricky case. Additionally, some expressions are really tricky… also syntactically! “por la mañana” vs “en la mañana” (“in the morning”).
  15. Pipeline of Añotador, step 3: Normalization algorithm
     • Works on each sentence separately.
     • Uses a different approach for each type of expression.
     • Can take into account different reference dates.
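One way to picture normalization with more than one reference date is the sketch below. It is an assumption-laden illustration, not the algorithm used by Añotador: the two lexicons, the function `normalize_sentence` and the policy (some expressions resolve against the anchor date, others against the last date seen in the sentence) are all hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical lexicons: day offsets relative to the anchor date
# and relative to the last date mentioned in the sentence.
ANCHOR_RELATIVE = {"anteayer": -2, "ayer": -1, "hoy": 0}
LAST_RELATIVE = {"el día anterior": -1, "al día siguiente": 1}


def normalize_sentence(expressions: List[str],
                       anchor: Optional[date] = None) -> List[Optional[str]]:
    """Normalize the expressions of one sentence, in order of appearance."""
    anchor = anchor or date.today()  # no anchor given: assume the current day
    last_date = anchor
    values: List[Optional[str]] = []
    for expr in expressions:
        key = expr.lower()
        if key in ANCHOR_RELATIVE:
            resolved = anchor + timedelta(days=ANCHOR_RELATIVE[key])
        elif key in LAST_RELATIVE:
            resolved = last_date + timedelta(days=LAST_RELATIVE[key])
        else:
            try:
                # Assume explicit dates were already put in YYYY-MM-DD form.
                resolved = date.fromisoformat(expr)
            except ValueError:
                values.append(None)  # not covered by this sketch
                continue
        last_date = resolved  # every resolved date becomes the new reference
        values.append(resolved.isoformat())
    return values


print(normalize_sentence(["2010-03-05", "al día siguiente", "ayer"],
                         anchor=date(2019, 10, 29)))
```

Here “al día siguiente” is resolved against the explicit date that precedes it, while “ayer” is resolved against the anchor date.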
  16. Pipeline of Añotador, step 4: Output format
     • TimeML
     • JSON
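As a rough illustration of the two output formats, the sketch below serializes already-normalized spans as inline TimeML `TIMEX3` tags and as JSON. The `tid`, `type` and `value` attributes are standard TIMEX3 attributes; the JSON field names and the functions `to_timeml` and `to_json` are assumptions made for this example and do not describe Añotador's actual output schema.

```python
import json
from typing import List, Tuple
from xml.sax.saxutils import escape, quoteattr

Span = Tuple[int, int, str, str]  # (start, end, type, value)


def to_timeml(text: str, spans: List[Span]) -> str:
    """Wrap each span in a TIMEX3 tag; spans must be sorted and non-overlapping."""
    out, pos = [], 0
    for i, (start, end, ttype, value) in enumerate(spans, 1):
        out.append(escape(text[pos:start]))
        out.append(
            f'<TIMEX3 tid="t{i}" type={quoteattr(ttype)} value={quoteattr(value)}>'
            f"{escape(text[start:end])}</TIMEX3>"
        )
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


def to_json(text: str, spans: List[Span]) -> str:
    """Hypothetical JSON layout: one object per temporal expression."""
    items = [{"text": text[s:e], "start": s, "end": e, "type": t, "value": v}
             for s, e, t, v in spans]
    return json.dumps(items, ensure_ascii=False)


sentence = "Llegó el 5 de marzo de 2010."
begin = sentence.index("5 de")
spans = [(begin, begin + len("5 de marzo de 2010"), "DATE", "2010-03-05")]
print(to_timeml(sentence, spans))
print(to_json(sentence, spans))
```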
  17. HOURGLASS: Corpus built
  18. Hourglass corpus. While building Añotador, we noticed both the scarcity and the lack of heterogeneity of corpora in Spanish.
     • We decided to create a synthetic dataset of expressions that a temporal tagger should cover, including common expressions and sentences that tend to produce false positives.
     • We additionally asked people unfamiliar with the task to provide temporal expressions. These contributors:
       o Had different backgrounds (linguists, computer scientists, translators…)
       o Were from different Spanish-speaking countries and regions (such as Venezuelans or Ecuadorians, and people from different regions of Spain)
  19. Hourglass corpus. The result was the Hourglass corpus, composed of two parts:
     • SYNTHETIC part:
       o Short texts annotated with TimeML.
       o It includes additional tags to facilitate the evaluation of different kinds of temporal expressions, such as “Dates”, “Fractions” or “False” (referring to false positives like “50/2/1991” or “February 30”).
     • PEOPLE part:
       o Sentences from contributors with different backgrounds and from different Spanish-speaking countries.
       o Also tagged and annotated following the standards, as well as with the register of the sentence (colloquial, chat, Latin expressions, Latin American expressions, phrases and one literary sentence).
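The “False” examples above look like dates but do not exist in the calendar. A minimal way to reject them, sketched here in Python with the standard `datetime` module, is to try to build the date and see whether it is valid. The function names are hypothetical and this is not necessarily how any of the evaluated taggers handle the case.

```python
from datetime import date


def is_valid_date(day: int, month: int, year: int) -> bool:
    """True if day/month/year is a real date in the Gregorian calendar."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def is_valid_day_month(day: int, month: int) -> bool:
    """Check a date without a year; a leap year is used so 29 February passes."""
    return is_valid_date(day, month, 2000)


print(is_valid_date(50, 2, 1991))   # the "50/2/1991" example
print(is_valid_day_month(30, 2))    # the "February 30" example
print(is_valid_date(5, 3, 2010))
```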
  20. Hourglass corpus in numbers
  21. EVALUATION: Performance
  22. Hourglass corpus. We evaluated Añotador against the available state-of-the-art temporal taggers for Spanish*:
     • HeidelTime
     • SUTime
     We used the GATE software and the measures precision (P), recall (R) and F1-measure (F1) on the following corpora:
     • Hourglass corpus
     • TempEval2 dataset**
     * We also tried to compare it to TIPSem, but we were not able to run it for Spanish despite contacting the creator. We nevertheless thank Héctor Llorens for his help.
     ** The TempEval3 test set is not available online, and TimeModeS contains historical texts, which are outside our target.
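For reference, the three measures are defined as P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The Python sketch below computes them for exact-match extents; it is only an illustration of the formulas, and the function `prf` and the representation of an annotation as a `(start, end)` pair are assumptions, not a description of how GATE performs the comparison.

```python
from typing import Iterable, Tuple

Extent = Tuple[int, int]  # (start offset, end offset) of a temporal expression


def prf(gold: Iterable[Extent], predicted: Iterable[Extent]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for exact-match extents."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 5), (10, 15), (20, 25)]
predicted = [(0, 5), (10, 15), (30, 35), (40, 45)]
print(prf(gold, predicted))
```

In this toy case two of the four predicted extents are correct and two of the three gold extents are found, so precision is 1/2 and recall is 2/3.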
  23. Results against the Hourglass corpus. In the Hourglass corpus, Añotador always shows the best performance.
     • Value means that the normalization is correct.
     • Type means that the temporal expression was correctly classified.
     • Extent means that the temporal expression was correctly marked in the text.
  24. Results against the TempEval2 corpus. Results of the temporal taggers on the TempEval 2 corpus: HeidelTime is slightly better at finding the normalized value (by 0.0027 on average), but Añotador is better on the rest of the metrics. SUTime shows good precision.
  25. Some examples. The following examples were difficult for the taggers to handle (the slide compared the output of Añotador, SUTime and HeidelTime for each one):
     • “1 año, 6 meses y un día” (“1 year, 6 months and one day”)
     • “Cinco para las 11.” (“Five to eleven.”)
     • “lo vuestro dura 1h, no?” (“your stuff lasts 1h, right?”)
     • “en cero coma” (in a short amount of time)
  26. CHALLENGES: To do
  27. Context-dependent temporal expressions
     • Most temporal expressions are context-free: we need no extra information to detect or normalize them, or they depend just on an anchor date.
     • Therefore, temporal taggers tend to cover them easily.
     • But we additionally find context-dependent ones, which can be challenging.
     • After some analysis, we detected three general types of dependency:
       o TE dependent on the register
       o TE dependent on temporal information
       o TE dependent on geographical information
  28. Context-dependent temporal expressions: TE dependent on the register
     • Non-temporal words used in a temporal sense:
       o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”), both meaning “He is 37 years old”.
     • Meronyms:
       o “Tiene 30 primaveras/abriles ya” (“He already has 30 springs/Aprils”), meaning “He is 30 years old”.
     • Idioms involving temporal expressions:
       o “Hasta el 40 de mayo no te quites el sayo” (“Until 40th May, do not take off the jacket”), meaning that the beginning of June can still be chilly.
     • Latin America:
       o “la hora del moro” (“the hour of the moor”) means “lunch time” in the Dominican Republic.
  29. Context-dependent temporal expressions: TE dependent on temporal information
     • We usually assume the Gregorian calendar and take the current date as the reference date, but it is not always that simple.
     • There are other calendars, both historical and contemporary, such as:
       o The Julian calendar
       o The French Republican calendar
       o The Japanese era system
       o The Chinese calendar
       o The Persian Solar Hijri calendar
       o …
  30. Context-dependent temporal expressions: TE dependent on geographical information
     • Geographical information influences how to normalize:
       o Temponyms such as:
         - International Children’s Day (15/04 Spain, 30/04 Mexico)
         - National Day (12/10 Spain, 14/07 France)
       o Date formats:
         - Some countries have their own separators (e.g., Germany uses DD.MM.YYYY)
         - Europe tends to use DD/MM/YYYY
         - The USA tends to use MM/DD/YYYY
         - Japan tends to use YYYY/MM/DD (when using our calendar!)
         - e.g.: 11/09 is 11th September in Spain, but 9th November in the USA.
     • These dependencies can also combine (e.g., the Emperor’s birthday in Japan).
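The 11/09 ambiguity can be reproduced with Python's standard `datetime.strptime`, choosing the format string from the region. The region codes, the `FORMATS` table and `parse_by_region` are hypothetical names for this sketch, and a year is added to the slide's example so that the string can be parsed as a full date.

```python
from datetime import date, datetime

# Hypothetical mapping from a region code to its usual numeric date format.
FORMATS = {
    "ES": "%d/%m/%Y",  # Spain: DD/MM/YYYY
    "US": "%m/%d/%Y",  # USA: MM/DD/YYYY
    "DE": "%d.%m.%Y",  # Germany: DD.MM.YYYY
    "JP": "%Y/%m/%d",  # Japan: YYYY/MM/DD
}


def parse_by_region(text: str, region: str) -> date:
    """Parse a numeric date according to the convention of the given region."""
    return datetime.strptime(text, FORMATS[region]).date()


print(parse_by_region("11/09/2019", "ES"))  # 11th September
print(parse_by_region("11/09/2019", "US"))  # 9th November
```

The same string yields two different normalized values, so a tagger needs the geographical context before it can choose one.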
  31. CONCLUSIONS: Future work
  32. Conclusions
     • Results:
       o Añotador:
         - Able to process Spanish (and English) texts and extract temporal expressions.
         - Improves on the state of the art both in our corpus and in TempEval2.
       o Hourglass corpus: new corpus with two parts.
         - Synthetic dataset for systematic testing of temporal taggers, including new registers and non-Castilian Spanish texts.
         - Contributions from people with real, common temporal expressions.
     • Future work:
       o Improve the system to better cover the examples in Hourglass.
       o Deal with context-dependent temporal expressions.
  33. Contact: María Navas-Loro (mnavas@fi.upm.es, https://marianavas.oeg-upm.net) and Víctor Rodríguez-Doncel, Ontology Engineering Group, Universidad Politécnica de Madrid, Spain.
  34. References - Corpora
     [1] J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40, Lancaster, UK, 2003.
     [2] http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5
     [3] L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in Proc. LREC 2016, 2016.
     [4] W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of ACL, vol. 2, pp. 143–154, 2014.
     [5] P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in Proc. EMNLP 2010, pp. 913–922, 2010.
     [6] J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies, and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012.
     [7] J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016.
     [8] https://catalog.ldc.upenn.edu/LDC2012T12
     [9] https://catalog.ldc.upenn.edu/LDC2012T01
  35. References - Taggers
     [10] J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,” LREv, vol. 47, no. 2, pp. 269–298, 2013.
     [11] A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012.
     [12] X. Zhong et al., “Time expression analysis and recognition using syntactic token types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol. 1, pp. 420–429, 2017.
     [13] S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc. the Workshop SemEval 2013, pp. 10–14, ACL, 2013.
     [14] L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010.
     [15] K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc. the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014.
     [16] M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL, 2005.
     [17] N. Chambers et al., “Dense event ordering with a multi-pass architecture,” Transactions of ACL, vol. 2, pp. 273–284, 2014.
     [18] H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.
