Añotador: a Temporal Tagger for Spanish

María Navas-Loro
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Víctor Rodríguez-Doncel
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
mnavas@fi.upm.es
https://marianavas.oeg-upm.net
29/10/2019
LKE 2019

Outline
- Brief Introduction
- State of the Art
o Temporal Taggers
o Corpora
- Añotador
- Corpus
- Results
- Challenges
- Conclusions

BRIEF INTRODUCTION
Temporal Annotation

Temporal information
Temporal information is everywhere: news, medical texts, legal documents…
… and we can use it for a lot of things:
- Timeline generation
- Search engines
- Summarization
- Pattern detection
- Question answering (QA):
o Q: When was the complaint against Google Inc made? A: It was lodged on 5 March 2010.
o Q: Was it accepted? A: It was rejected in relation to La Vanguardia. No information related to Google Inc.
Nevertheless, few tools are available to extract temporal information in Spanish.

The task
Temporal taggers:
1) Detect temporal expressions:
- DATE
- TIME
- DURATION
- SET
2) Normalize them.
3) Additionally, extract events and relations.

STATE OF THE ART
What has been done

Corpora
- A lot in English!
o News:
• TimeBank corpus [1]
• TempEval challenges datasets [2]
• MEANTIME corpus [3]
o Medical: THYME [4]
o Historical: WikiWars corpus [5]
o Scientific: Abstracts [6]
o Colloquial: SMS [6] and tweets [7]
- Less for Spanish…
o News: The Spanish TimeBank corpus [8], MEANTIME [3]
o Historical: The ModeS TimeBank 1.06 (texts from the 17th and 18th centuries) [9]
* TempEval 2 and TempEval 3 challenges (fragment of TimeBank)

Temporal Taggers
- A lot in English!
o Rule-based:
• HeidelTime [10]
• SUTime* [11]
• SynTime [12]
o Machine-learning-based:
• ClearTK-TimeML [13]
• USFD2 [14]
• UWTime [15]
• TARSQI [16]
• CAEVO [17]
• TIPSem [18]
- For Spanish, just a few of them (HeidelTime, SUTime and TIPSem)… (there are more in the literature, but we could not find them available!)
- No corpora, no taggers!

Añotador
Añotador (aka Annotador) is a temporal tagger for Spanish
and English. It finds and normalizes dates, times,
durations and sets (temporal expressions that repeat
periodically, such as “monthly”, “twice a year” or “every
Tuesday”). A demo is available on its web page, and it
can be called as a web service or downloaded from GitHub
(where all the code and a Docker image are available).
http://annotador.oeg-upm.net/

Pipeline of Añotador
[Figure: pipeline of Añotador (ANNOTADOR)]
INPUT: text + anchor date
Stages: CoreNLP pipeline + IxaPipes; RULES; NORMALIZATION ALGORITHM; OUTPUT
RULES:
- token rules identify numbers, months, granularities (day)...
- basic TEs rules: # + granularity (two days)
- compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D
NORMALIZATION ALGORITHM (for each sentence, for each expression):
- branches on type: DATE, TIME, DURATION, SET
- decisions (YES/NO): reference current date? reference next/last? is anchored?
- steps: do Calculus, setFormat (YYYY-MM-DD), normalize XX values, add lastDate, update lastDate, process Durations
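
The compound-rule example in the pipeline figure (1 month and 2 days = P1M2D) can be sketched in a few lines. This is only an illustration in Python, not Añotador's own code; the names `parse_duration` and `merge_durations` and the dictionary representation are assumptions made for the example.

```python
import re

# ISO 8601 duration designators, in the order they must appear.
DATE_UNITS = ["Y", "M", "W", "D"]
TIME_UNITS = ["H", "M", "S"]

_DURATION = re.compile(
    r"^P(?P<date>(?:\d+[YMWD])*)(?:T(?P<time>(?:\d+[HMS])+))?$"
)


def parse_duration(value):
    """Split a duration such as 'P1M2D' into {('date', 'M'): 1, ('date', 'D'): 2}."""
    match = _DURATION.match(value)
    if match is None or value == "P":
        raise ValueError(f"not a duration: {value!r}")
    parts = {}
    for section in ("date", "time"):
        found = re.findall(r"(\d+)([A-Z])", match.group(section) or "")
        for amount, unit in found:
            key = (section, unit)
            parts[key] = parts.get(key, 0) + int(amount)
    return parts


def merge_durations(values):
    """Merge several simple durations into one compound duration value."""
    total = {}
    for value in values:
        for key, amount in parse_duration(value).items():
            total[key] = total.get(key, 0) + amount
    date = "".join(
        f"{total[('date', u)]}{u}" for u in DATE_UNITS if ("date", u) in total
    )
    time = "".join(
        f"{total[('time', u)]}{u}" for u in TIME_UNITS if ("time", u) in total
    )
    return "P" + date + ("T" + time if time else "")


print(merge_durations(["P1M", "P2D"]))  # P1M2D
```

Keeping date and time designators apart matters because "M" means months before the "T" separator and minutes after it.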

1. Preprocessing:
- We get as input the text and the anchor date (if none, we assume the current day).
- We use CoreNLP for lemmatizing, sentence splitting…
- We added IxaPipes models for Spanish to improve the quality of the output.

2. Rules:
More than 100 rules written in CoreNLP TokensRegex format.
1. Token-based rules for expressions such as numerals, granularities…
2. Basic temporal expression rules, working on previously found basic expressions.
3. Compound expression rules, for value inheritance or composition.
4. Literal expression rules, for specific expressions such as “previously”, polysemic expressions (Spanish “mañana”) or expressions including tokens that can appear in other temporal expressions, requiring a dedicated approach.
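
To give a feel for how layered rules build on each other, here is a minimal sketch of the first two layers in plain Python. It is not the TokensRegex rule set used by Añotador; the tiny lexicons and the function names (`token_rules`, `basic_rules`, `tag_durations`) are hypothetical and cover only a handful of Spanish words.

```python
# Hypothetical mini-lexicons; a real rule set would be far larger.
NUMERALS = {"un": 1, "una": 1, "uno": 1, "dos": 2, "tres": 3,
            "cuatro": 4, "cinco": 5, "seis": 6}
GRANULARITIES = {"día": "D", "días": "D", "semana": "W", "semanas": "W",
                 "mes": "M", "meses": "M", "año": "Y", "años": "Y"}


def token_rules(tokens):
    """Layer 1: label each token as NUMBER, GRANULARITY or OTHER."""
    labelled = []
    for token in tokens:
        lower = token.lower()
        if lower.isdecimal():
            labelled.append(("NUMBER", int(lower)))
        elif lower in NUMERALS:
            labelled.append(("NUMBER", NUMERALS[lower]))
        elif lower in GRANULARITIES:
            labelled.append(("GRANULARITY", GRANULARITIES[lower]))
        else:
            labelled.append(("OTHER", token))
    return labelled


def basic_rules(labelled):
    """Layer 2: a NUMBER followed by a GRANULARITY becomes a duration value."""
    durations = []
    for (kind, value), (next_kind, next_value) in zip(labelled, labelled[1:]):
        if kind == "NUMBER" and next_kind == "GRANULARITY":
            durations.append(f"P{value}{next_value}")
    return durations


def tag_durations(text):
    """Run both layers on a whitespace-tokenized sentence."""
    tokens = text.replace(",", " ").split()
    return basic_rules(token_rules(tokens))


print(tag_durations("1 año, 6 meses y un día"))  # ['P1Y', 'P6M', 'P1D']
```

A compound layer would then merge the three values into a single expression, and literal rules would handle the cases that this number-plus-granularity pattern cannot express.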

Mañana, a tricky case
Additionally, some expressions are really tricky: “mañana” can mean either “tomorrow” or “morning”…
… and they also vary syntactically:
“por la mañana” vs “en la mañana” (both “in the morning”).

3. Normalization algorithm:
- Works on each sentence separately.
- Uses a different approach for each type of expression.
- Can take into account different reference dates.
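
The idea of resolving expressions against different reference dates can be sketched as follows. This is a strong simplification, assuming only two kinds of reference (the anchor date and the last date seen in the sentence); the lexicon `OFFSETS` and the function `normalize_sentence` are hypothetical and do not reproduce Añotador's algorithm.

```python
from datetime import date, timedelta

# Hypothetical mini-lexicon of Spanish deictic expressions and day offsets.
OFFSETS = {"hoy": 0, "ayer": -1, "anteayer": -2, "pasado mañana": 2}


def normalize_sentence(expressions, anchor):
    """Normalize the expressions of one sentence, in order.

    Deictic expressions ('ayer') are resolved against the anchor date.
    The anaphoric 'al día siguiente' is resolved against the last date
    seen in the sentence, falling back to the anchor date.
    Unknown expressions yield None.
    """
    last_date = anchor
    values = []
    for expression in expressions:
        text = expression.lower()
        if text in OFFSETS:
            resolved = anchor + timedelta(days=OFFSETS[text])
        elif text == "al día siguiente":
            resolved = last_date + timedelta(days=1)
        else:
            try:
                resolved = date.fromisoformat(text)
            except ValueError:
                values.append(None)
                continue
        last_date = resolved  # later expressions may refer back to this date
        values.append(resolved.isoformat())
    return values


anchor = date(2019, 10, 29)
print(normalize_sentence(["2010-03-05", "al día siguiente"], anchor))
# ['2010-03-05', '2010-03-06']
```

Because the last date is reset for every call, each sentence is processed independently of the others.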

4. Output format:
- TimeML
- JSON
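
As an illustration of the two output styles, the sketch below serializes the same annotations as inline TimeML `TIMEX3` tags and as JSON. The `tid`, `type` and `value` attributes come from TimeML; the JSON field names and the functions `to_timeml` and `to_json` are assumptions for the example and may differ from what Añotador actually returns.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def to_timeml(text, annotations):
    """Wrap each (start, end, type, value) span of `text` in a TIMEX3 tag."""
    pieces = []
    cursor = 0
    for index, (start, end, timex_type, value) in enumerate(sorted(annotations), start=1):
        pieces.append(escape(text[cursor:start]))
        pieces.append(
            f"<TIMEX3 tid={quoteattr('t' + str(index))} "
            f"type={quoteattr(timex_type)} value={quoteattr(value)}>"
            f"{escape(text[start:end])}</TIMEX3>"
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


def to_json(text, annotations):
    """Serialize the same spans as a JSON list (hypothetical field names)."""
    return json.dumps(
        [
            {"text": text[s:e], "start": s, "end": e, "type": t, "value": v}
            for s, e, t, v in sorted(annotations)
        ],
        ensure_ascii=False,
    )


sentence = "It was lodged on 5 March 2010."
start = sentence.index("5 March 2010")
spans = [(start, start + len("5 March 2010"), "DATE", "2010-03-05")]
print(to_timeml(sentence, spans))
print(to_json(sentence, spans))
```

The inline form keeps the annotation next to the text, while the JSON form with character offsets is easier to consume from other programs.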

Hourglass corpus
While building Añotador, we noticed both the scarcity and the lack of heterogeneity of corpora in Spanish.
- We decided to create a synthetic dataset of expressions that a temporal tagger should cover, including common expressions and sentences that tend to produce false positives.
- We additionally asked people unfamiliar with the task to provide temporal expressions. These contributors:
o Had different backgrounds (linguists, computer scientists, translators…)
o Were from different Spanish-speaking countries and regions (such as Venezuelans or Ecuadorians, and people from different regions of Spain)

Hourglass corpus
The result was the Hourglass corpus, composed of two parts:
SYNTHETIC part:
o Short texts annotated with TimeML.
o It includes additional tags to facilitate the evaluation of different temporal expressions, such as “Dates”, “Fractions” or “False” (referring to false positives like “50/2/1991” or “February 30”).
PEOPLE part:
o Sentences from contributors with different backgrounds and from different Spanish-speaking countries.
o Also tagged and annotated following the standards, as well as with the register of the sentence (colloquial, chat, Latin expressions, Latin American expressions, phrases and one literary sentence).

Hourglass corpus in numbers

Hourglass corpus
We evaluated Añotador against the available state-of-the-art temporal taggers for Spanish*:
- HeidelTime
- SUTime
We used the GATE software and the measures precision (P), recall (R) and F1-measure (F1), on the following corpora:
- Hourglass corpus
- TempEval2 dataset**
* We also tried to compare it to TIPSem, but we were not able to run it for Spanish despite contacting the creator. We nevertheless thank Héctor Llorens for his help.
** The TempEval3 test set is not available online, and the ModeS TimeBank contains historical texts, which are outside our target.

Results against the Hourglass corpus
In the Hourglass corpus, Añotador always shows the best performance.
Value means the normalization is correct
Type means that the temporal expression was correctly classified
Extent means that the temporal expression was correctly marked in the text
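
The three criteria above can be turned into precision, recall and F1 as in the sketch below. It assumes strict matching (a predicted span counts only if its offsets are identical to a gold span), which is an assumption of this example: the slides do not say whether strict or lenient matching was configured in GATE. The tuple layout and the names `precision_recall_f1` and `score` are likewise hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compute P, R and F1 between two collections of hashable items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def score(gold, predicted):
    """Score annotations given as (start, end, type, value) tuples.

    extent: same offsets; type: same offsets and type;
    value: same offsets and normalized value.
    """
    return {
        "extent": precision_recall_f1(
            [a[:2] for a in gold], [a[:2] for a in predicted]),
        "type": precision_recall_f1(
            [a[:3] for a in gold], [a[:3] for a in predicted]),
        "value": precision_recall_f1(
            [(a[0], a[1], a[3]) for a in gold],
            [(a[0], a[1], a[3]) for a in predicted]),
    }


gold = [(0, 4, "DATE", "2019-10-29"), (10, 18, "DURATION", "P2D")]
predicted = [(0, 4, "DATE", "2019-10-28"), (10, 18, "DURATION", "P2D"),
             (20, 25, "DATE", "2019")]
for criterion, (p, r, f1) in score(gold, predicted).items():
    print(f"{criterion}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy case both gold extents are found, but one normalized value is wrong and one span is spurious, so the value scores are lower than the extent scores.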

Results against the TempEval2 corpus
Results of the temporal taggers on the TempEval2 corpus.
HeidelTime is slightly better at finding the normalized value
(by 0.0027 on average), but Añotador is better on the remaining
metrics. SUTime shows good precision.

Some examples
The following examples were difficult for the taggers to handle:
Example | Añotador | SUTime | HeidelTime
“1 año, 6 meses y un día” (“1 year, 6 months and one day”) | 1 año, 6 meses y un día | 1 año, 6 meses y un día | 1 año, 6 meses y un día
“Cinco para las 11.” (“Five to eleven.”) | Cinco para las 11. | Cinco para las 11. | Cinco para las 11.
“lo vuestro dura 1h, no?” (“your stuff lasts 1h, right?”) | lo vuestro dura 1h, no? | lo vuestro dura 1h, no? | lo vuestro dura 1h, no?
“en cero coma” (in a short amount of time) | en cero coma | en cero coma | en cero coma

Context-dependent temporal expressions
- Most temporal expressions are context-free: we need no extra information to detect or normalize them, or they depend only on an anchor date.
- Therefore, temporal taggers tend to cover them easily.
- But we additionally find context-dependent ones, which can be challenging.
- After some analysis, we detected three general types of dependency:
o TE dependent on the register
o TE dependent on temporal information
o TE dependent on geographical information

Context-dependent temporal expressions
TE dependent on the register.
- Non-temporal words used in a temporal sense:
o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”), both meaning “He is 37 years old”.
- Meronyms:
o “Tiene 30 primaveras/abriles ya” (“He already has 30 springs/Aprils”), meaning “He is 30 years old”.
- Idioms involving temporal expressions:
o “Hasta el 40 de mayo no te quites el sayo” (“Until the 40th of May, do not take off the jacket”), meaning that the beginning of June can still be chilly.
- Latin America:
o “la hora del moro” (“the hour of the Moor”) means “lunch time” in the Dominican Republic.

TE dependent on temporal information
- We usually assume the Gregorian calendar and take the current date as the reference date, but it is not always that simple.
- There are other calendars, both historical and contemporary, such as:
o The Julian calendar
o The French Republican calendar
o The Japanese era system
o The Chinese calendar
o The Persian Solar Hijri calendar
o …

TE dependent on geographical information
- Geographical information influences how to normalize:
o Temponyms such as:
• International Children’s Day (15/04 in Spain, 30/04 in Mexico)
• National Day (12/10 in Spain, 14/07 in France)
o Date formats:
• Some countries have their own separators (e.g., Germany uses DD.MM.YYYY)
• Europe tends to use DD/MM/YYYY
• The USA tends to use MM/DD/YYYY
• Japan tends to use YYYY/MM/DD (when using our calendar!)
• e.g., 11/09 is 11th September in Spain, but 9th November in the USA.
These dependencies can be combined (e.g., the Emperor’s birthday in Japan).
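
The date-format ambiguity can be reproduced with the Python standard library. In this sketch the mapping `DATE_PATTERNS` and the function `normalize_numeric_date` are hypothetical, and a year is added to the slide's “11/09” example so that a full date can be produced.

```python
from datetime import datetime

# Hypothetical mapping from a country code to its usual numeric date pattern.
DATE_PATTERNS = {
    "ES": "%d/%m/%Y",  # Spain: DD/MM/YYYY
    "US": "%m/%d/%Y",  # USA: MM/DD/YYYY
    "JP": "%Y/%m/%d",  # Japan: YYYY/MM/DD
    "DE": "%d.%m.%Y",  # Germany: DD.MM.YYYY
}


def normalize_numeric_date(text, country):
    """Return the ISO date (YYYY-MM-DD) for a numeric date written in `country`.

    Raises ValueError for impossible dates such as '50/2/1991'.
    """
    return datetime.strptime(text, DATE_PATTERNS[country]).date().isoformat()


print(normalize_numeric_date("11/09/2019", "ES"))  # 2019-09-11
print(normalize_numeric_date("11/09/2019", "US"))  # 2019-11-09
```

The same string yields two different normalized values, so a tagger needs the geographical context (or a sensible default) before it can choose a pattern.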

Conclusions
- Results:
o Añotador:
• Able to process Spanish (and English) texts and extract temporal expressions.
• Improves the state of the art both on our corpus and on TempEval2.
o Hourglass corpus: new corpus with two parts.
• Synthetic dataset for systematic testing of temporal taggers, including new registers and non-Castilian Spanish texts.
• Contributions from people, with real, common temporal expressions.
- Future work:
o Improve the system to better cover the examples in Hourglass.
o Deal with context-dependent temporal expressions.

References - Corpora
1. J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40,
Lancaster, UK, 2003.
2. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5
3. L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in
Proc. LREC 2016, 2016.
4. W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of
ACL, vol. 2, pp. 143–154, 2014.
5. P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in
Proc. EMNLP 2010, pp. 913–922, 2010.
6. J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies,
and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012.
7. J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and
normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016.
8. https://catalog.ldc.upenn.edu/LDC2012T12
9. https://catalog.ldc.upenn.edu/LDC2012T01

References - Taggers
10. J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,” Language
Resources and Evaluation, vol. 47, no. 2, pp. 269–298, 2013.
11. A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing
time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012.
12. X. Zhong et al., “Time expression analysis and recognition using syntactic token
types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol.
1, pp. 420–429, 2017.
13. S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc.
the Workshop SemEval 2013, pp. 10–14, ACL, 2013.
14. L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for
tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010.
15. K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc.
the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014.
16. M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the
ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL,
2005.
17. N. Chambers et al., “Dense event ordering with a multi-pass architecture,”
Transactions of ACL, vol. 2, pp. 273–284, 2014.
18. H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles
in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.
