Añotador: a Temporal Tagger for Spanish
1. María Navas-Loro
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Víctor Rodríguez-Doncel
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
mnavas@fi.upm.es
https://marianavas.oeg-upm.net
29/10/2019
LKE 2019
2. Añotador: a Temporal Tagger for Spanish
Outline
Brief Introduction
State of the Art
o Temporal Taggers
o Corpora
Añotador
Corpus
Results
Challenges
Conclusions
4. Añotador: a Temporal Tagger for Spanish
Temporal information
Temporal information is everywhere:
news, medical texts, legal documents…
… and we can use it for a lot of things!
o TIMELINE GENERATION
o SEARCH ENGINE
o SUMMARIZATION
o PATTERN DETECTION
o QA:
• Q: When was the complaint against Google Inc made?
  A: It was lodged on 5 March 2010.
• Q: Was it accepted?
  A: It was rejected related to La Vanguardia. No information
  related to Google Inc.
Nevertheless, few tools are available to extract temporal
information in Spanish.
5. Añotador: a Temporal Tagger for Spanish
The task
Temporal taggers:
1) Detect temporal expressions:
o DATE
o DURATION
o SET
o TIME
2) Normalize them.
3) Additionally, extract events and relations.
7. Añotador: a Temporal Tagger for Spanish
Corpora
A lot in English!
o News:
• Timebank corpus [1]
• TempEval challenges datasets [2]
• MEANTIME corpus [3]
o Medical: THYME [4]
o Historical: Wikiwars corpus [5]
o Scientific: Abstracts [6]
o Colloquial: SMS [6] and tweets [7]
Fewer for Spanish…
o News: The Spanish TimeBank corpus [8], MEANTIME [3]
o Historical: The ModeS TimeBank 1.06 (texts from the 17th and
18th centuries) [9]
* TempEval 2 and TempEval 3 challenges (fragment of TimeBank)
8. Añotador: a Temporal Tagger for Spanish
Temporal Taggers
A lot in English!
o Rule-based:
• HeidelTime [10]
• SUTime* [11]
• SynTime [12]
o Machine-Learning-based
• ClearTK-TimeML [13]
• USFD2 [14]
• UWTime [15]
• TARSQI [16]
• CAEVO [17]
• TIPSem [18]
For Spanish, just the bold ones… (there are more in
literature, but we could not find them available!)
No corpora, no taggers!
10. Añotador
Añotador (aka Annotador) is a temporal tagger for Spanish
and English. It is able to find and normalize dates, times,
durations and sets (temporal expressions that repeat
periodically, such as “monthly”, “twice a year” or “every
Tuesday”). A demo is available on its webpage, and it
can be called as a web service or downloaded from GitHub
(where all the code and a Docker image are available).
http://annotador.oeg-upm.net/
11. Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
[Figure: pipeline of ANNOTADOR.
INPUT (text + anchor date) → CoreNLP pipeline + IxaPipes → RULES → NORMALIZATION ALGORITHM → OUTPUT.
RULES:
o token rules identify numbers, months, granularities (day)...
o basic TE rules combine a number and a granularity: two days
o compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D
NORMALIZATION ALGORITHM: for each sentence, for each expression, branch on the type (DATE, or TIME / DURATION / SET).
Steps and decisions shown: reference current date? (YES/NO), do calculus, setFormat (YYYY-MM-DD), normalize XX values, add lastDate, process durations, reference next/last? (YES/NO), is anchored? (YES/NO), update lastDate.]
12. Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
1. Preprocessing:
We get as input the text and the
anchor date (if none, we assume
the current day)
We use CoreNLP for
lemmatizing, sentence splitting…
We added IxaPipes models for
Spanish to improve the quality of
the output.
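The anchor-date fallback described above can be sketched as follows. This is a minimal illustration, not Añotador's actual code: the function name `resolve_anchor` and the ISO `YYYY-MM-DD` input format are assumptions.

```python
from datetime import date
from typing import Optional


def resolve_anchor(anchor: Optional[str] = None,
                   today: Optional[date] = None) -> date:
    """Return the anchor date; fall back to the current day if none is given.

    `anchor` is assumed to be an ISO string (YYYY-MM-DD). `today` exists only
    so that the fallback can be tested deterministically.
    """
    if anchor:
        return date.fromisoformat(anchor)
    return today if today is not None else date.today()
```

For instance, `resolve_anchor("2019-10-29")` yields the date of this talk, while `resolve_anchor()` yields the day on which the tagger is run.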
13. Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
2. Rules:
More than 100 rules written in CoreNLP
TokensRegex format.
1. Token-based rules for expressions
such as numerals, granularities…
2. Basic temporal expression rules,
working on previously found basic
expressions
3. Compound expression rules, for
value inheritance or composition.
4. Literal expression rules, for specific
expressions such as “previously”,
polysemic expressions (Spanish
“mañana”) or expressions including
tokens that can appear in other
temporal expressions, requiring a
dedicated approach.
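The real rules are written in CoreNLP TokensRegex and are not reproduced here. As a rough Python illustration of what a compound rule achieves, the sketch below merges already-normalized durations, so that 1 month (1M) and 2 days (2D) become P1M2D. The helper name `merge_durations` is hypothetical, and only the date designators Y, M, W and D are handled.

```python
import re

# Date-part designators in ISO 8601 order (time parts are left out of this sketch).
_ORDER = "YMWD"
_PART = re.compile(r"(\d+)([YMWD])")


def merge_durations(*durations: str) -> str:
    """Merge simple durations such as '1M' or 'P2D' into one ISO 8601 value."""
    totals = {}
    for duration in durations:
        body = duration[1:] if duration.startswith("P") else duration
        parts = _PART.findall(body)
        if not parts:
            raise ValueError("not a simple duration: %r" % duration)
        for amount, unit in parts:
            # Amounts with the same unit are added together.
            totals[unit] = totals.get(unit, 0) + int(amount)
    return "P" + "".join("%d%s" % (totals[u], u) for u in _ORDER if u in totals)


print(merge_durations("1M", "2D"))          # P1M2D
print(merge_durations("P1Y", "P6M", "P1D"))  # P1Y6M1D
```

The second call corresponds to an expression such as “1 año, 6 meses y un día”, where three basic durations form a single compound one.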
14. Añotador: a Temporal Tagger for Spanish
Mañana, a tricky case
Additionally, some expressions are really tricky…
… also syntactically!
“por la mañana” vs “en la mañana” (“in the morning”).
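To make the ambiguity concrete: “mañana” can mean “tomorrow” or “morning”. The sketch below uses a deliberately naive heuristic (a preceding determiner suggests the “morning” reading) that is an assumption for illustration only, not the rule set used by Añotador. The function name and the determiner list are hypothetical.

```python
import re
from typing import List

# Determiners that, in this toy heuristic, signal the part-of-day reading.
_MORNING_CUES = {"la", "esta", "una", "cada"}


def classify_manana(sentence: str) -> List[str]:
    """Return one reading ('morning' or 'tomorrow') per occurrence of 'mañana'."""
    tokens = re.findall(r"\w+", sentence.lower())
    readings = []
    for i, token in enumerate(tokens):
        if token != "mañana":
            continue
        previous = tokens[i - 1] if i > 0 else ""
        readings.append("morning" if previous in _MORNING_CUES else "tomorrow")
    return readings


print(classify_manana("Nos vemos mañana por la mañana"))  # ['tomorrow', 'morning']
```

Cases such as “el mañana” (“the future”) are not covered, which is one reason a dedicated rule is needed in a real tagger.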
15. Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
3. Normalization
algorithm:
Works on each
sentence
separately.
Uses a different
approach for
each type of
expression.
Can take into
account different
reference dates.
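The idea of different reference dates can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual normalization algorithm: the two lookup tables, the function name `normalize_dates` and the choice of expressions are all hypothetical. Some expressions are resolved against the anchor date, others against the last date found in the sentence.

```python
from datetime import date, timedelta
from typing import List, Optional

# Expressions resolved against the anchor date (offsets in days).
ANCHOR_OFFSETS = {"hoy": 0, "ayer": -1, "mañana": 1}
# Expressions resolved against the last date mentioned in the sentence.
LAST_DATE_OFFSETS = {"al día siguiente": 1, "el día anterior": -1}


def normalize_dates(expressions: List[str], anchor: date) -> List[Optional[str]]:
    """Normalize the DATE expressions of one sentence to YYYY-MM-DD strings."""
    last_date = anchor  # until a date is found, the anchor is the only reference
    values = []
    for expression in expressions:
        key = expression.lower().strip()
        if key in ANCHOR_OFFSETS:
            resolved = anchor + timedelta(days=ANCHOR_OFFSETS[key])
        elif key in LAST_DATE_OFFSETS:
            resolved = last_date + timedelta(days=LAST_DATE_OFFSETS[key])
        else:
            values.append(None)  # not covered by this sketch
            continue
        last_date = resolved  # later expressions may refer to this date
        values.append(resolved.isoformat())
    return values


print(normalize_dates(["ayer", "al día siguiente"], date(2019, 10, 29)))
# ['2019-10-28', '2019-10-29']
```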
16. Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
4. Output format:
- TimeML
- JSON
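As an illustration of the two output formats, the sketch below renders the same annotations as inline TimeML `TIMEX3` tags and as JSON. The `tid`, `type` and `value` attributes belong to TimeML; the annotation dictionary layout, the JSON field names and the function names are assumptions made for this example and may differ from what Añotador actually returns.

```python
import json
from typing import Dict, List
from xml.sax.saxutils import escape, quoteattr


def to_timeml(text: str, annotations: List[Dict]) -> str:
    """Wrap each annotated span of `text` in a TIMEX3 tag (spans must not overlap)."""
    pieces = []
    position = 0
    ordered = sorted(annotations, key=lambda a: a["start"])
    for number, ann in enumerate(ordered, start=1):
        pieces.append(escape(text[position:ann["start"]]))
        pieces.append('<TIMEX3 tid="t%d" type=%s value=%s>%s</TIMEX3>' % (
            number,
            quoteattr(ann["type"]),
            quoteattr(ann["value"]),
            escape(text[ann["start"]:ann["end"]]),
        ))
        position = ann["end"]
    pieces.append(escape(text[position:]))
    return "".join(pieces)


def to_json(annotations: List[Dict]) -> str:
    """Serialize the annotations, keeping non-ASCII characters readable."""
    return json.dumps(annotations, ensure_ascii=False)


text = "It was lodged on 5 March 2010."
start = text.index("5 March 2010")
annotations = [{"start": start, "end": start + len("5 March 2010"),
                "type": "DATE", "value": "2010-03-05"}]
print(to_timeml(text, annotations))
print(to_json(annotations))
```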
18. Añotador: a Temporal Tagger for Spanish
Hourglass corpus
While building Añotador, we noticed both the scarcity and
lack of heterogeneity of corpora in Spanish.
We decided to create a synthetic dataset of expressions
that a temporal tagger should cover, including common
expressions and sentences that tend to produce false
positives.
We additionally asked people unfamiliar with the task to
provide temporal expressions. These contributors:
o Had different backgrounds (linguists, computer scientists,
translators…)
o Were from different Spanish-speaking countries and regions
(such as Venezuelans and Ecuadorians, and people from different
regions of Spain)
19. Añotador: a Temporal Tagger for Spanish
Hourglass corpus
The result was the Hourglass corpus, composed of two parts:
SYNTHETIC part:
o Short texts annotated with
TimeML.
o It includes additional tags in
order to facilitate the
evaluation of different
temporal expressions, such
as “Dates”, “Fractions” or
“False” (referring to false
positives like “50/2/1991” or
“February 30”).
PEOPLE part:
o Sentences from contributors
with different backgrounds
and from different Spanish-
speaking countries.
o Also tagged and annotated
following the standards, as
well as with the register of
the sentence (colloquial,
chat, Latin expressions, Latin
American expressions,
phrases and one literary
sentence).
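False positives such as “50/2/1991” or “February 30” from the synthetic part can be rejected with a calendar check. The sketch below is one possible way to do this, not the check implemented in Añotador; the function name is hypothetical, and a leap year is assumed when no year is given so that 29 February is still accepted.

```python
from datetime import date
from typing import Optional


def is_plausible_date(day: int, month: int, year: Optional[int] = None) -> bool:
    """Return True if day/month(/year) exists in the Gregorian calendar."""
    try:
        # Without a year, use 2000 (a leap year) so that 29 February passes.
        date(year if year is not None else 2000, month, day)
        return True
    except ValueError:
        return False


print(is_plausible_date(50, 2, 1991))  # False ("50/2/1991")
print(is_plausible_date(30, 2))        # False ("February 30")
print(is_plausible_date(5, 3, 2010))   # True  ("5 March 2010")
```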
22. Añotador: a Temporal Tagger for Spanish
Hourglass corpus
We evaluated Añotador against the available state-of-the-art
temporal taggers for Spanish*:
HeidelTime
SUTime
We used the software GATE and the measures precision (P),
recall (R) and F1-measure (F1), on the following corpora:
Hourglass corpus.
TempEval2 dataset**
* We also tried to compare it to TIPSem, but we were not able to run it for Spanish
despite contacting the creator. We nevertheless thank Héctor Llorens for his help.
** TempEval3 test is not available online, and TimeModeS contains historical
texts, out of our target.
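The evaluation itself was done with GATE. For reference, the three measures can be computed as in the sketch below, which assumes strict matching of (start, end) spans; the function name and the span representation are illustrative choices, not GATE's interface.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]


def precision_recall_f1(gold: Iterable[Span],
                        predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Compute P, R and F1 over exactly matching annotation spans."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 3 gold expressions, 4 predictions, 2 of them correct.
gold = [(0, 4), (10, 15), (20, 25)]
predicted = [(0, 4), (10, 15), (30, 35), (40, 42)]
p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```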
23. Añotador: a Temporal Tagger for Spanish
Results against the Hourglass corpus
In the Hourglass corpus, Añotador always shows the best
performance.
Value means the normalization is correct
Type means that the temporal expression was correctly classified
Extent means that the temporal expression was correctly marked in the text
24. Añotador: a Temporal Tagger for Spanish
Results against the TempEval2 corpus
Results of the temporal taggers in the TempEval 2 corpus.
HeidelTime is slightly better at finding the normalized value
(0.0027 on average), but Añotador is better in the rest of
metrics. SUTime shows good precision.
25. Añotador: a Temporal Tagger for Spanish
Some examples
The following examples were difficult for the taggers to handle:
Example: “1 año, 6 meses y un día” (“1 year, 6 months and one day”)
o Añotador: 1 año, 6 meses y un día
o SUTime: 1 año, 6 meses y un día
o HeidelTime: 1 año, 6 meses y un día
Example: “Cinco para las 11.” (“Five to eleven.”)
o Añotador: Cinco para las 11.
o SUTime: Cinco para las 11.
o HeidelTime: Cinco para las 11.
Example: “lo vuestro dura 1h, no?” (“your stuff lasts 1h, right?”)
o Añotador: lo vuestro dura 1h, no?
o SUTime: lo vuestro dura 1h, no?
o HeidelTime: lo vuestro dura 1h, no?
Example: “en cero coma” (in a short amount of time)
o Añotador: en cero coma
o SUTime: en cero coma
o HeidelTime: en cero coma
27. Añotador: a Temporal Tagger for Spanish
Context-dependent temporal expressions
Most temporal expressions are context-free, so we
require no extra information to detect or normalize them,
or depend just on an anchor date.
Therefore, temporal taggers tend to cover them
easily.
But we additionally find context-dependent ones, which can
be challenging.
After some analysis, we detected three types of
general dependencies:
o TE dependent on the register
o TE dependent on temporal information
o TE dependent on geographical information
28. Añotador: a Temporal Tagger for Spanish
Context-dependent temporal expressions
TE dependent on the register.
Non-temporal words used in a temporal sense:
o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”),
both meaning “He is 37 years old”.
Meronyms:
o “Tiene 30 primaveras/abriles ya” (“He already has 30
springs/Aprils”), meaning “He is 30 years old”.
Idioms involving temporal expressions:
o “Hasta el 40 de mayo no te quites el sayo” (“Until 40th May,
do not take off the jacket”), meaning that the beginning of
June can still be chilly.
Latin America:
o “la hora del moro” (“the hour of the moor”) means “lunch
time” in the Dominican Republic.
29. Añotador: a Temporal Tagger for Spanish
Context-dependent temporal expressions
TE dependent on temporal information
We usually assume the Gregorian calendar and take the
current date as the reference date, but it is not always
that simple.
There are other calendars, both historical and
contemporary, such as:
o The Julian calendar
o The French Republican calendar
o The Japanese era system.
o The Chinese Calendar
o Persian Solar Hijri calendar
o …
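As a worked example of why the calendar matters, the sketch below converts a Julian calendar date to the Gregorian calendar through the Julian Day Number, using a standard integer formula. It is an illustration only, since handling other calendars is listed as a challenge and not as a feature of Añotador; the function name is hypothetical.

```python
from datetime import date


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian calendar date (year >= 1) to a Gregorian `date`."""
    a = (14 - month) // 12       # 1 for January and February, else 0
    y = year + 4800 - a          # years counted from a March-based epoch
    m = month + 12 * a - 3       # March = 0, ..., February = 11
    julian_day_number = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python ordinal 1 (0001-01-01 Gregorian) is Julian Day Number 1721426.
    return date.fromordinal(julian_day_number - 1721425)


# The day after Thursday 4 October 1582 (Julian) was Friday 15 October 1582.
print(julian_to_gregorian(1582, 10, 5))   # 1582-10-15
# The "October Revolution" took place in November in the Gregorian calendar.
print(julian_to_gregorian(1917, 10, 25))  # 1917-11-07
```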
30. Añotador: a Temporal Tagger for Spanish
Context-dependent temporal expressions
TE dependent on geographical information
Geographical information influences how to normalize:
o Temponyms such as:
• International Children’s Day (15/04 Spain, 30/04 Mexico)
• National Day (12/10 Spain, 14/07 France)
o Date formats:
• Some countries have their own separators (e.g., Germany uses
DD.MM.YYYY)
• Europe tends to DD/MM/YYYY
• The USA tends to MM/DD/YYYY
• Japan tends to YYYY/MM/DD (when using our calendar!)
• e.g.: 11/09 is 11th September in Spain, but 9th November in
the USA.
These dependencies can also be combined (e.g., the Emperor’s birthday in Japan)
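The effect of geographical information can be sketched with two small lookups: one for the day/month order and one for temponyms, using only the values given above. The country codes, table names and function names are assumptions for this example; Añotador does not necessarily work this way.

```python
from datetime import date

# True when the country writes the day before the month.
DAY_FIRST = {"ES": True, "US": False}

# (temponym, country) -> (day, month), with the values listed above.
TEMPONYMS = {
    ("National Day", "ES"): (12, 10),
    ("National Day", "FR"): (14, 7),
}


def parse_numeric_date(text: str, country: str, year: int) -> date:
    """Interpret an ambiguous 'NN/NN' date according to the country."""
    first, second = (int(part) for part in text.split("/"))
    day, month = (first, second) if DAY_FIRST[country] else (second, first)
    return date(year, month, day)


def resolve_temponym(name: str, country: str, year: int) -> date:
    """Resolve a named day to a date for the given country and year."""
    day, month = TEMPONYMS[(name, country)]
    return date(year, month, day)


print(parse_numeric_date("11/09", "ES", 2019))        # 2019-09-11
print(parse_numeric_date("11/09", "US", 2019))        # 2019-11-09
print(resolve_temponym("National Day", "FR", 2019))   # 2019-07-14
```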
32. Añotador: a Temporal Tagger for Spanish
Conclusions
Results:
o Añotador:
• Able to process Spanish (and English) texts and extract
temporal expressions.
• Improves the state of the art both in our corpus and in
TempEval2.
o Hourglass corpus: new corpus with two parts.
• Synthetic dataset for systematic testing of temporal taggers,
including new registers and non-Castilian Spanish texts.
• Contributions from people, with real, common temporal expressions.
Future work:
o Improve the system to better cover the examples in
Hourglass.
o Deal with context-dependent temporal expressions.
34. Añotador: a Temporal Tagger for Spanish
References - Corpora
1. J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40,
Lancaster, UK, 2003.
2. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5
3. L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in
Proc. LREC 2016, 2016.
4. W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of
ACL, vol. 2, pp. 143–154, 2014.
5. P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in
Proc. EMNLP 2010, pp. 913–922, 2010.
6. J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies,
and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012.
7. J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and
normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016.
8. https://catalog.ldc.upenn.edu/LDC2012T12
9. https://catalog.ldc.upenn.edu/LDC2012T01
35. Añotador: a Temporal Tagger for Spanish
References - Taggers
10. J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,”
Language Resources and Evaluation, vol. 47, no. 2, pp. 269–298, 2013.
11. A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing
time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012.
12. X. Zhong et al., “Time expression analysis and recognition using syntactic token
types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol.
1, pp. 420–429, 2017.
13. S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc.
the Workshop SemEval 2013, pp. 10–14, ACL, 2013.
14. L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for
tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010.
15. K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc.
the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014.
16. M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the
ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL,
2005.
17. N. Chambers et al., “Dense event ordering with a multi-pass architecture,”
Transactions of ACL, vol. 2, pp. 273–284, 2014.
18. H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles
in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.