SlideShare a Scribd company logo
1 of 35
María Navas-Loro
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Víctor Rodríguez-Doncel
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Añotador:
a Temporal Tagger for Spanish
mnavas@fi.upm.es
https://marianavas.oeg-upm.net
29/10/2019
LKE 2019
Añotador: a Temporal Tagger for Spanish
Outline
2
Outline
 Brief Introduction
 State of the Art
o Temporal Taggers
o Corpora
 Añotador
 Corpus
 Results
 Challenges
 Conclusions
3
BRIEF INTRODUCTION
Temporal Annotation
Añotador: a Temporal Tagger for Spanish
Temporal information
4
Temporal information is everywhere:
news, medical texts, legal documents…
… and we can use it for a lot of things!
TIMELINE GENERATION
SEARCH
ENGINESUMMARIZATION
PATTERN
DETECTION
It was lodged on 5 March 2010.
It was rejected related to La
Vanguardia. No information
related to Google Inc.
Was it accepted?
When was the complaint against
Google Inc made?
QA
Nevertheless, there are
not a lot of tools to
extract temporal
information in Spanish.
Añotador: a Temporal Tagger for Spanish
The task
5
Temporal taggers:
1) Detect temporal expressions:
- DATE - DURATION
- SET - TIME
2) Normalize them.
3) Additionally, event and relation extraction
STATE OF THE ART
What has been done
6
Añotador: a Temporal Tagger for Spanish
Corpora
7
 A lot in English!
o News:
• Timebank corpus [1]
• TempEval challenges datasets [2]
• MEANTIME corpus [3]
o Medical: THYME [4]
o Historical: Wikiwars corpus [5]
o Scientific: Abstracts [6]
o Colloquial: SMS [6] and tweets [7]
 Less for Spanish…
o News: The Spanish TimeBank corpus [8], MEANTIME [3]
o Historical: The ModeS TimeBank 1.06 (texts from the 17th and
18th centuries) [9]
* TempEval 2 and TempEval 3 challenges (fragment of TimeBank)
Añotador: a Temporal Tagger for Spanish
Temporal Taggers
8
 A lot in English!
o Rule-based:
• HeidelTime [10]
• SUTime* [11]
• SynTime [12]
o Machine-Learning-based
• ClearTK-TimeML [13]
• USFD2 [14]
• UWTime [15]
• TARSQI [16]
• CAEVO [17]
• TIPSem [18]
 For Spanish, just the bold ones… (there are more in
literature, but we could not find them available!)
 No corpora, no taggers!
9
AÑOTADOR
Our tool
Añotador
10
Añotador (aka Annotador) is a temporal tagger for Spanish
and English. It is able to find and normalize dates, times,
durations and sets (temporal expressions that repeat
periodically, such as “monthly”, “twice a year” or “every
Tuesday”). There is a demo available in its webpage, and
can be called as webservice or downloaded from GitHub
(where all the code and a docker image are available).
http://annotador.oeg-upm.net/
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
11
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
12
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
1. Preprocessing:
 We get as input the text and the
anchor date (if none, we assume
the current day)
 We use CoreNLP for
lemmatizing, sentence splitting…
 We added IxaPipes models for
Spanish to improve the quality of
the output.
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
13
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
2. Rules:
More than 100 rules written in CoreNLP
TokensRegex format.
1. Token-based rules for expressions
such as numerals, granularities…
2. Basic temporal expression rules,
working on previously found basic
expressions
3. Compound expression rules, for
inheritance values or composition.
4. Literal expression rules, for specific
expressions such as “previously”,
polysemic expressions (Spanish
“mañana”) or expressions including
tokens that can appear in other
temporal expressions, requiring a
dedicated approach.
Añotador: a Temporal Tagger for Spanish
Mañana, a tricky case
14
Additionally, some expressions are really tricky…
… also syntactically!
“por la mañana” vs “en la mañana” (“in the morning”).
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
15
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
3. Normalization
algorithm:
 Works for each
sentence
sepparately.
 Different
approaches for
each type of
expressions.
 Can take into
account different
reference dates.
Añotador: a Temporal Tagger for Spanish
Pipeline of Añotador
16
CoreNLP pipeline
+ IxaPipes
RULES
for each
sentence
for each
expression
reference
current
Date?
do
Calculus
type?
DATE
YES
setFormat
INPUT
text + anchor date
OUTPUT
NORMALIZATION ALGORITHM
token rules identify numbers, months,
granularities ( day )...
basic TEs rules # + granularity: two days
compound rules merge TEs: 1 month (1M)
and 2 days (2D) = P1M2D
YYYY-MM-DD
TIME DURATION SET
normalize
XX values
NO
add
lastDate
process
Durations
NO
reference
next/last
YES
NO
is
anchored
?
YES
update
lastDate
ANNOTADOR
4. Output format:
- TimeML
- JSON
17
HOURGLASS
Corpus built
Añotador: a Temporal Tagger for Spanish
Hourglass corpus
While building Añotador, we noticed both the scarcity and
lack of heterogeneity of corpora in Spanish.
 We decided to create a synthetic dataset of expressions
that a temporal tagger should cover, including common
expressions and sentences that tend to produce false
positives.
 We additionally asked people foreign to the task to
provide temporal expressions. These contributors:
o Had different backgrounds (linguists, computer scientists,
translators…)
o Were from different Spanish-speaking countries and regions
(such as Venezuelan or Ecuatorians, and people from different
regions in Spain)
18
Añotador: a Temporal Tagger for Spanish
Hourglass corpus
The result was the Hourglass corpus, composed of two parts:
19
SYNTHETIC part:
o Short texts annotated with
TimeML.
o It includes additional tags in
order to facilitate the
evaluation of different
temporal expressions, such
as “Dates”, “Fractions” or
“False” (referring to false
positives like “50/2/1991” or
“February 30”).
PEOPLE part:
o Sentences from contributors
with different backgrounds
and from different Spanish-
speaking countries.
o Also tagged and annotated
following the standards, as
well as with the register of
the sentence (colloquial,
chat, latin expressions, Latin
American expressions,
phrases and one literary
sentence).
Añotador: a Temporal Tagger for Spanish
Hourglass corpus in numbers
20
21
EVALUATION
Performance
Añotador: a Temporal Tagger for Spanish
Hourglass corpus
We evaluated Añotador against the available state of the art
temporal taggers for Spanish*:
 HeidelTime
 SUTime
We used the software GATE and the measures precision (P),
recall (R) and F1-measure (F1), on the following corpora:
 Hourglass corpus.
 TempEval2 dataset**
* We also tried to compare it to TIPSem, but we were not able to run it for Spanish
despite contacting the creator. We nevertheless thank Héctor Llorens for his help.
** TempEval3 test is not available online, and TimeModeS contains historical
texts, out of our target.
22
Añotador: a Temporal Tagger for Spanish
Results against the Hourglass corpus
23
In the HourGlass corpus, Añotador always shows the best
performance.
Value means the normalization is correct
Type means that the temporal expression was correctly classified
Extent means that the temporal expression was correctly marked in the text
Añotador: a Temporal Tagger for Spanish
Results against the TempEval2 corpus
24
Results of the temporal taggers in the TempEval 2 corpus.
HeidelTime is slightly better at finding the normalized value
(0.0027 on average), but Añotador is better in the rest of
metrics. SUTime shows good Precission.
Añotador: a Temporal Tagger for Spanish
Some examples
The following examples were difficult to handle to the taggers:
25
Example Añotador SUTime HeidelTime
“1 año, 6 meses y un día”
(“1 year, 6 months and one day”)
1 año, 6
meses y un
día
1 año, 6
meses y
un día
1 año, 6
meses y
un día
“Cinco para las 11.”
(“Five to eleven.”)
Cinco para
las 11.
Cinco para
las 11.
Cinco para
las 11.
“lo vuestro dura 1h, no?”
(“your stuff lasts 1h, right?”)
lo vuestro
dura 1h, no?
lo vuestro
dura 1h,
no?
lo vuestro
dura 1h,
no?
“en cero coma”
(in a short amount of time)
en cero coma en cero
coma
en cero
coma
26
CHALLENGES
To do
Añotador: a Temporal Tagger for Spanish
Context-dependent temporal expressions
 Most temporal expressions are context-free, so we
require no extra information to detect or normalize them,
or depend just on an anchor date.
 Therefore, temporal taggers tend to cover them
easily.
 But we additionally find context-dependent ones, that can
be challenging.
 After some analysis, we detected three types of
general dependencies:
o TE dependent on the register
o TE dependent on temporal information
o TE dependent on geographical information
27
Añotador: a Temporal Tagger for Spanish
Context dependent temporal expressions
TE dependent on the register.
 Non-temporal words used in a temporal sense:
o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”),
both meaning “He is 37 years old”.
 Meronyms:
o “Tiene 30 primaveras/abriles ya” (“He already has 30
springs/Aprils”), meaning “He is 30 years old”.
 Idioms involving temporal expressions:
o “Hasta el 40 de mayo no te quites el sayo” (“Until 40th May,
do not take off the jacket”), meaning that the beginning of
June can still be chilly.
 Latin America:
o “la hora del moro” (“the hour of the moor”) means “lunch
time” in the Dominican Republic.
28
Añotador: a Temporal Tagger for Spanish
Context dependent temporal expressions
TE dependent on temporal information
 We usually asume the Gregorian calendar and the
reference date as the current one, but this is not always
that simple.
 There are other calendars, both historical and
contemporary, such as:
o The Julian calendar
o The French Republican calendar
o The Japanese era system.
o The Chinese Calendar
o Persian Solar Hijri calendar
o …
29
Añotador: a Temporal Tagger for Spanish
Context dependent temporal expressions
TE dependent on geographical information
 Geographical information infludes in how to normalize:
o Temponyms such as:
• International Children’s Day (15/04 Spain, 30/04 Mexico)
• National Day (12/10 Spain, 14/07 France)
o Date formats:
• Some countries have their own separators (eg, Germany uses
DD.MM.YYYY)
• Europe tends to DD/MM/YYYY
• EEUU tends to MM/DD/YYYY
• Japan tends to YYYY/MM/DD (when using our calendar!)
• eg: 11/09 is 11th September in Spain, but 9th November in
EEUU.
Can be a combination (eg, Emperor’s birthday in Japan)
30
31
CONCLUSIONS
Future work
Añotador: a Temporal Tagger for Spanish
Conclusions
 Results:
o Añotador:
• Able to process Spanish (and English) texts and extract
temporal expressions.
• Improves the state of the art both in our corpus and in
TempEval2.
o Hourglass corpus: new corpus with two parts.
• Synthetic dataset for systematic testing of temporal taggers,
including new registers and non-Castilian Spanish texts.
• People contributions of real common temporal expressions.
 Future work:
o Improve the system to make better cover the examples in
HourGlass.
o Deal with Context Dependent Temporal Expressions.
32
María Navas-Loro
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Víctor Rodríguez-Doncel
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Añotador:
a Temporal Tagger for Spanish
mnavas@fi.upm.es
https://marianavas.oeg-upm.net
29/10/2019
LKE 2019
Añotador: a Temporal Tagger for Spanish
References - Corpora
34
1. J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40,
Lancaster, UK, 2003.
2. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5
3. L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in
Proc. LREC 2016, 2016.
4. W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of
ACL, vol. 2, pp. 143–154, 2014.
5. P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in
Proc. EMNLP 2010, pp. 913–922, 2010.
6. J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies,
and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012.
7. J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and
normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016.
8. https://catalog.ldc.upenn.edu/LDC2012T12
9. https://catalog.ldc.upenn.edu/LDC2012T01
Añotador: a Temporal Tagger for Spanish
References - Taggers
35
10. J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,” LREv,
vol. 47, no. 2, pp. 269–298, 2013.
11. A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing
time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012.
12. X. Zhong et al., “Time expression analysis and recognition using syntactic token
types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol.
1, pp. 420–429, 2017.
13. S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc.
the Workshop SemEval 2013, pp. 10–14, ACL, 2013.
14. L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for
tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010.
15. K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc.
the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014.
16. M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the
ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL,
2005.
17. N. Chambers et al., “Dense event ordering with a multi-pass architecture,”
Transactions of ACL, vol. 2, pp. 273–284, 2014.
18. H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles
in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.

More Related Content

Similar to Añotador: a Temporal Tagger for Spanish

DSDT Meetup Septembre 2021
DSDT Meetup Septembre 2021DSDT Meetup Septembre 2021
DSDT Meetup Septembre 2021DSDT_MTL
 
EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03hpcosta
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015RIILP
 
Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...IMPACT Centre of Competence
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramAnnalina Caputo
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportAlexandre Rademaker
 
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...RIILP
 
Mapping Early Modern News Networks
Mapping Early Modern News NetworksMapping Early Modern News Networks
Mapping Early Modern News NetworksGiovanni Colavizza
 
Anil timeline construction
Anil timeline constructionAnil timeline construction
Anil timeline constructionanilcs0405
 

Similar to Añotador: a Temporal Tagger for Spanish (10)

DSDT Meetup Septembre 2021
DSDT Meetup Septembre 2021DSDT Meetup Septembre 2021
DSDT Meetup Septembre 2021
 
EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03EXPERT_Malaga-ESR03
EXPERT_Malaga-ESR03
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
 
Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google Ngram
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
Time and computers
Time and computersTime and computers
Time and computers
 
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
12. Gloria Corpas, Jorge Leiva, Miriam Seghiri (UMA) Human Translation & Tran...
 
Mapping Early Modern News Networks
Mapping Early Modern News NetworksMapping Early Modern News Networks
Mapping Early Modern News Networks
 
Anil timeline construction
Anil timeline constructionAnil timeline construction
Anil timeline construction
 

Recently uploaded

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 

Recently uploaded (20)

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 

Añotador: a Temporal Tagger for Spanish

  • 1. María Navas-Loro Ontology Engineering Group Universidad Politécnica de Madrid, Spain Víctor Rodríguez-Doncel Ontology Engineering Group Universidad Politécnica de Madrid, Spain Añotador: a Temporal Tagger for Spanish mnavas@fi.upm.es https://marianavas.oeg-upm.net 29/10/2019 LKE 2019
  • 2. Añotador: a Temporal Tagger for Spanish Outline 2 Outline  Brief Introduction  State of the Art o Temporal Taggers o Corpora  Añotador  Corpus  Results  Challenges  Conclusions
  • 4. Añotador: a Temporal Tagger for Spanish Temporal information 4 Temporal information is everywhere: news, medical texts, legal documents… … and we can use it for a lot of things! TIMELINE GENERATION SEARCH ENGINESUMMARIZATION PATTERN DETECTION It was lodged on 5 March 2010. It was rejected related to La Vanguardia. No information related to Google Inc. Was it accepted? When was the complaint against Google Inc made? QA Nevertheless, there are not a lot of tools to extract temporal information in Spanish.
  • 5. Añotador: a Temporal Tagger for Spanish The task 5 Temporal taggers: 1) Detect temporal expressions: - DATE - DURATION - SET - TIME 2) Normalize them. 3) Additionally, event and relation extraction
  • 6. STATE OF THE ART What has been done 6
  • 7. Añotador: a Temporal Tagger for Spanish Corpora 7  A lot in English! o News: • Timebank corpus [1] • TempEval challenges datasets [2] • MEANTIME corpus [3] o Medical: THYME [4] o Historical: Wikiwars corpus [5] o Scientific: Abstracts [6] o Colloquial: SMS [6] and tweets [7]  Less for Spanish… o News: The Spanish TimeBank corpus [8], MEANTIME [3] o Historical: The ModeS TimeBank 1.06 (texts from the 17th and 18th centuries) [9] * TempEval 2 and TempEval 3 challenges (fragment of TimeBank)
  • 8. Añotador: a Temporal Tagger for Spanish Temporal Taggers 8  A lot in English! o Rule-based: • HeidelTime [10] • SUTime* [11] • SynTime [12] o Machine-Learning-based • ClearTK-TimeML [13] • USFD2 [14] • UWTime [15] • TARSQI [16] • CAEVO [17] • TIPSem [18]  For Spanish, just the bold ones… (there are more in literature, but we could not find them available!)  No corpora, no taggers!
  • 10. Añotador 10 Añotador (aka Annotador) is a temporal tagger for Spanish and English. It is able to find and normalize dates, times, durations and sets (temporal expressions that repeat periodically, such as “monthly”, “twice a year” or “every Tuesday”). There is a demo available in its webpage, and can be called as webservice or downloaded from GitHub (where all the code and a docker image are available). http://annotador.oeg-upm.net/
  • 11. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 11 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR
  • 12. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 12 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 1. Preprocessing:  We get as input the text and the anchor date (if none, we assume the current day)  We use CoreNLP for lemmatizing, sentence splitting…  We added IxaPipes models for Spanish to improve the quality of the output.
  • 13. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 13 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 2. Rules: More than 100 rules written in CoreNLP TokensRegex format. 1. Token-based rules for expressions such as numerals, granularities… 2. Basic temporal expression rules, working on previously found basic expressions 3. Compound expression rules, for inheritance values or composition. 4. Literal expression rules, for specific expressions such as “previously”, polysemic expressions (Spanish “mañana”) or expressions including tokens that can appear in other temporal expressions, requiring a dedicated approach.
  • 14. Añotador: a Temporal Tagger for Spanish Mañana, a tricky case 14 Additionally, some expressions are really tricky… … also syntactically! “por la mañana” vs “en la mañana” (“in the morning”).
  • 15. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 15 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 3. Normalization algorithm:  Works for each sentence sepparately.  Different approaches for each type of expressions.  Can take into account different reference dates.
  • 16. Añotador: a Temporal Tagger for Spanish Pipeline of Añotador 16 CoreNLP pipeline + IxaPipes RULES for each sentence for each expression reference current Date? do Calculus type? DATE YES setFormat INPUT text + anchor date OUTPUT NORMALIZATION ALGORITHM token rules identify numbers, months, granularities ( day )... basic TEs rules # + granularity: two days compound rules merge TEs: 1 month (1M) and 2 days (2D) = P1M2D YYYY-MM-DD TIME DURATION SET normalize XX values NO add lastDate process Durations NO reference next/last YES NO is anchored ? YES update lastDate ANNOTADOR 4. Output format: - TimeML - JSON
  • 18. Añotador: a Temporal Tagger for Spanish Hourglass corpus While building Añotador, we noticed both the scarcity and lack of heterogeneity of corpora in Spanish.  We decided to create a synthetic dataset of expressions that a temporal tagger should cover, including common expressions and sentences that tend to produce false positives.  We additionally asked people foreign to the task to provide temporal expressions. These contributors: o Had different backgrounds (linguists, computer scientists, translators…) o Were from different Spanish-speaking countries and regions (such as Venezuelan or Ecuatorians, and people from different regions in Spain) 18
  • 19. Añotador: a Temporal Tagger for Spanish Hourglass corpus The result was the Hourglass corpus, composed of two parts: 19 SYNTHETIC part: o Short texts annotated with TimeML. o It includes additional tags in order to facilitate the evaluation of different temporal expressions, such as “Dates”, “Fractions” or “False” (referring to false positives like “50/2/1991” or “February 30”). PEOPLE part: o Sentences from contributors with different backgrounds and from different Spanish- speaking countries. o Also tagged and annotated following the standards, as well as with the register of the sentence (colloquial, chat, latin expressions, Latin American expressions, phrases and one literary sentence).
  • 20. Añotador: a Temporal Tagger for Spanish Hourglass corpus in numbers 20
  • 22. Añotador: a Temporal Tagger for Spanish Hourglass corpus We evaluated Añotador against the available state of the art temporal taggers for Spanish*:  HeidelTime  SUTime We used the software GATE and the measures precision (P), recall (R) and F1-measure (F1), on the following corpora:  Hourglass corpus.  TempEval2 dataset** * We also tried to compare it to TIPSem, but we were not able to run it for Spanish despite contacting the creator. We nevertheless thank Héctor Llorens for his help. ** TempEval3 test is not available online, and TimeModeS contains historical texts, out of our target. 22
  • 23. Añotador: a Temporal Tagger for Spanish Results against the Hourglass corpus 23 In the HourGlass corpus, Añotador always shows the best performance. Value means the normalization is correct Type means that the temporal expression was correctly classified Extent means that the temporal expression was correctly marked in the text
  • 24. Añotador: a Temporal Tagger for Spanish Results against the TempEval2 corpus 24 Results of the temporal taggers in the TempEval 2 corpus. HeidelTime is slightly better at finding the normalized value (0.0027 on average), but Añotador is better in the rest of metrics. SUTime shows good Precission.
  • 25. Añotador: a Temporal Tagger for Spanish Some examples The following examples were difficult to handle to the taggers: 25 Example Añotador SUTime HeidelTime “1 año, 6 meses y un día” (“1 year, 6 months and one day”) 1 año, 6 meses y un día 1 año, 6 meses y un día 1 año, 6 meses y un día “Cinco para las 11.” (“Five to eleven.”) Cinco para las 11. Cinco para las 11. Cinco para las 11. “lo vuestro dura 1h, no?” (“your stuff lasts 1h, right?”) lo vuestro dura 1h, no? lo vuestro dura 1h, no? lo vuestro dura 1h, no? “en cero coma” (in a short amount of time) en cero coma en cero coma en cero coma
  • 27. Añotador: a Temporal Tagger for Spanish Context-dependent temporal expressions  Most temporal expressions are context-free, so we require no extra information to detect or normalize them, or depend just on an anchor date.  Therefore, temporal taggers tend to cover them easily.  But we additionally find context-dependent ones, that can be challenging.  After some analysis, we detected three types of general dependencies: o TE dependent on the register o TE dependent on temporal information o TE dependent on geographical information 27
  • 28. Añotador: a Temporal Tagger for Spanish Context dependent temporal expressions TE dependent on the register.  Non-temporal words used in a temporal sense: o “Él tiene 37 castañas/tacos” (“He has 37 chestnuts/tacos”), both meaning “He is 37 years old”.  Meronyms: o “Tiene 30 primaveras/abriles ya” (“He already has 30 springs/Aprils”), meaning “He is 30 years old”.  Idioms involving temporal expressions: o “Hasta el 40 de mayo no te quites el sayo” (“Until 40th May, do not take off the jacket”), meaning that the beginning of June can still be chilly.  Latin America: o “la hora del moro” (“the hour of the moor”) means “lunch time” in the Dominican Republic. 28
  • 29. Añotador: a Temporal Tagger for Spanish Context dependent temporal expressions TE dependent on temporal information  We usually asume the Gregorian calendar and the reference date as the current one, but this is not always that simple.  There are other calendars, both historical and contemporary, such as: o The Julian calendar o The French Republican calendar o The Japanese era system. o The Chinese Calendar o Persian Solar Hijri calendar o … 29
  • 30. Añotador: a Temporal Tagger for Spanish Context dependent temporal expressions TE dependent on geographical information  Geographical information infludes in how to normalize: o Temponyms such as: • International Children’s Day (15/04 Spain, 30/04 Mexico) • National Day (12/10 Spain, 14/07 France) o Date formats: • Some countries have their own separators (eg, Germany uses DD.MM.YYYY) • Europe tends to DD/MM/YYYY • EEUU tends to MM/DD/YYYY • Japan tends to YYYY/MM/DD (when using our calendar!) • eg: 11/09 is 11th September in Spain, but 9th November in EEUU. Can be a combination (eg, Emperor’s birthday in Japan) 30
  • 32. Añotador: a Temporal Tagger for Spanish Conclusions  Results: o Añotador: • Able to process Spanish (and English) texts and extract temporal expressions. • Improves the state of the art both in our corpus and in TempEval2. o Hourglass corpus: new corpus with two parts. • Synthetic dataset for systematic testing of temporal taggers, including new registers and non-Castilian Spanish texts. • People contributions of real common temporal expressions.  Future work: o Improve the system to make better cover the examples in HourGlass. o Deal with Context Dependent Temporal Expressions. 32
  • 33. María Navas-Loro Ontology Engineering Group Universidad Politécnica de Madrid, Spain Víctor Rodríguez-Doncel Ontology Engineering Group Universidad Politécnica de Madrid, Spain Añotador: a Temporal Tagger for Spanish mnavas@fi.upm.es https://marianavas.oeg-upm.net 29/10/2019 LKE 2019
  • 34. Añotador: a Temporal Tagger for Spanish References - Corpora 34 1. J. Pustejovsky et al., “The timebank corpus,” in Corpus linguistics, vol. 2003, p. 40, Lancaster, UK, 2003. 2. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=5 3. L. Minard et al., “Meantime, the newsreader multilingual event and time corpus,” in Proc. LREC 2016, 2016. 4. W. Styler IV et al., “Temporal annotation in the clinical domain,” Transactions of ACL, vol. 2, pp. 143–154, 2014. 5. P. Mazur et al., “Wikiwars: A new corpus for research on temporal expressions,” in Proc. EMNLP 2010, pp. 913–922, 2010. 6. J. Strötgen et al., “Temporal tagging on different domains: Challenges, strategies, and gold standards,” in LREC, vol. 12, pp. 3746–3753, 2012. 7. J. Tabassum et al., “Tweetime: A minimally supervised method for recognizing and normalizing time expressions in twitter,” arXiv preprint arXiv:1608.02904, 2016. 8. https://catalog.ldc.upenn.edu/LDC2012T12 9. https://catalog.ldc.upenn.edu/LDC2012T01
  • 35. Añotador: a Temporal Tagger for Spanish References - Taggers 35 10. J. Strötgen and M. Gertz, “Multilingual and cross-domain temporal tagging,” LREv, vol. 47, no. 2, pp. 269–298, 2013. 11. A. X. Chang and C. D. Manning, “Sutime: A library for recognizing and normalizing time expressions,” in LREC, vol. 2012, pp. 3735–3740, 2012. 12. X. Zhong et al., “Time expression analysis and recognition using syntactic token types and general heuristic rules,” in Proc. the 55th Annual Meeting of the ACL, vol. 1, pp. 420–429, 2017. 13. S. Bethard, “ClearTK-TimeML: A minimalist approach to TempEval 2013,” in Proc. the Workshop SemEval 2013, pp. 10–14, ACL, 2013. 14. L. Derczynski et al., “Usfd2: Annotating temporal expresions and tlinks for tempeval-2,” in Proc. the Workshop SemEval, pp. 337–340, ACL, 2010. 15. K. Lee et al., “Context-dependent semantic parsing for time expressions,” in Proc. the 52nd Annual Meeting of the ACL, vol. 1, pp. 1437–1447, 2014. 16. M. Verhagen et al., “Automating Temporal Annotation with TARSQI,” in Proc. the ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84, ACL, 2005. 17. N. Chambers et al., “Dense event ordering with a multi-pass architecture,” Transactions of ACL, vol. 2, pp. 273–284, 2014. 18. H. Llorens et al., “Tipsem (english and spanish): Evaluating crfs and semantic roles in tempeval-2,” in Proc. the Workshop SemEval, pp. 284–291, ACL, 2010.