How to add a language to
the linguistic resources map
Corina Forăscu
Alexandru Ioan Cuza University of Iasi - Faculty of Computer Science &
Romanian Academy Research Institute for Artificial Intelligence “Mihai Drăgănescu”
corinfor@info.uaic.ro
Distinguished Speakers Departmental Seminars
10th of February, 2015
 How to efficiently use time in research and… personal life?
 Why are languages and language technologies (LT) important in our societies?
 How to deal with a less-studied language?
 How to build and exploit new language resources?
 How much time is needed?
 How to represent and use time (temporal information in NLP applications)?
Agenda
 Languages
 Language technologies
 for Romanian
 Language resources
 for Romanian
 Research projects / competitions &
scientific / personal events
Languages – native speakers
Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2014. Ethnologue: Languages of the World, Seventeenth edition.
Dallas, Texas: SIL International http://www.ethnologue.com/statistics/size
Languages – Internet speakers
http://www.vistawide.com/languages/top_30_languages.htm
Languages and LRTs
Technology | Excellent support | Good support | Moderate support | Fragmentary support | Weak/no support
Machine Translation | - | EN | FR, ESP | RO, GER, IT, POL, HU, … | majority
Speech Processing | - | EN | CZE, POR, NL, FR, GER, IT, ESP, .. | RO, BG, GR, .. | MT, LIT, SRB, ..
Text Analysis | - | EN | NL, FR, GER, IT, ESP, POR, .. | RO, CZE, POR, BG, GR, .. | MT, LIT, SRB, ..
Speech and Text Resources | - | EN | CZE, NL, FR, GER, IT, ESP, .. | RO, POR, BG, GR, .. | MT, LIT, ..
http://www.meta-net.eu/whitepapers/ 2012
Romania
http://www.internetworldstats.com/eu/ro.htm
year Users Population % Pop.
2000 800,000 22,217,700 3.6 %
2004 4,000,000 21,377,426 18.7 %
2006 4,940,000 21,154,226 23.4 %
2007 5,062,500 21,154,226 23.9 %
2010 7,786,700 21,959,278 35.5 %
2012 9,642,383 21,848,504 44.1 %
2014 10,812,784 21,729,871 49.8%
 5,374,980 Facebook subscribers on Dec 31, 2012, 24.7%
penetration rate.
 55.54 Mbps Broadband download speed on August, 2014
 Netindex: 60.53 Mbps download speed
Romanian
 Romance language, with influences from old Slavic, Turkish,
Greek, German, Hungarian, Bulgarian, Russian
 spoken by about 29 mil. people, with 4 official dialects
 highly inflected language
 pro-drop language ([en] It rains. / [ro] Plouă)
 with clitic doubling ([en] I see her. [ro] O văd pe ea.)
 with negative concord
 with double negation
 Mihai Eminescu
 Emil Cioran
 Mircea Eliade
 Mircea Cărtărescu
BLARK - Basic LAnguage Resource Kit
 (a) the minimal general text corpus to be able to do any
precompetitive research for the language at all,
annotated according to some generally accepted
standards
 (a’) something similar for a spoken text corpus
 (b) a collection of basic tools to manipulate and analyze
the corpora → LT systems
 (c) a collection of skills that constitute the minimal
starting point for the development of a competitive
NL/Speech technology industry
http://www.elsnet.org/dox/blark.html
LT systems
Preprocessing
•Cleaning data
•Format analysis / removal
•Language identification
Morpho-syntactic analysis
•Sentence segmentation
•Tokenization
•POS-tagging, chunking
Semantic analysis
•Word sense disambiguation
•NER, event extraction
•Anaphora resolution
•Discourse processing
Specific modules
•QA
•TE
•Summarization
•MT
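The stages listed above can be read as a pipeline in which each stage consumes the output of the previous one. A minimal sketch of the first steps only, assuming toy regex-based stand-ins (the function names are illustrative and are not the UAIC or RACAI tools; tagging and the later stages are left out):

```python
import re

def clean(text):
    """Preprocessing: drop markup remnants and normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Sentence segmentation: naive split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence):
    """Tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def run_pipeline(raw):
    """Chain the stages; each one feeds the next."""
    return [tokenize(s) for s in split_sentences(clean(raw))]

result = run_pipeline("<p>Plouă.  O văd pe ea.</p>")
for tokens in result:
    print(tokens)
```

A real system would replace each function with a trained module, but the data flow stays the same.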
Language Identification
 web service derived from a standalone application that was
initially aimed at autonomously collecting web data for English
and Romanian
 distinguishes among the 22 languages of the European Union
present in the JRC-Acquis parallel corpus
Romanian LTs: morpho-syntactic analysis
 UAIC Romanian POS tagger
 http://nlptools.infoiasi.ro/WebPosRo/ (webservice)
 Sentence splitting, tokenization, POS tagging (406 MSD tags, based on a
morphological dictionary of 1.25 million words and a statistical model) and
lemmatization
 TTL (Tokenizing, Tagging and Lemmatizing free running texts)
 http://www.racai.ro/tools/text/ (web service & standalone application)
 Sentence splitting, tokenization, POS tagging (approx. 600 CTAGs),
lemmatization and chunking of Romanian, English and French texts
Precision | Without rules | With rules
For unknown words | 88.88% | 93.31%
For all words | 95.12% | 97.03%
Romanian diacritics recovery – DIAC +
 fata / fată / fată / făta / fâță
the girl / girl / (she) calves / (to) calve / a fussy girl
 Diacritics have a high frequency (every third word might
contain at least one diacritical character)
 Diacritics have a significant contribution to the morpho-lexical
and semantic disambiguation of the words
 Plugin for Office 2003/2007/2010/2013
 http://www.racai.ro/downloads/diac/diac+.zip
 Relying on tokenization, sentence splitting, lemmatization and,
above all, POS tagging (MSD tags), DIAC+ disambiguates
between the possible word forms, which may or may not
contain diacritics
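The disambiguation idea can be illustrated with a toy lookup: all forms that share the same diacritic-free spelling are candidates, and a POS tag narrows them down. This is only a sketch under stated assumptions; the mini-lexicon and the coarse POS labels below are hypothetical (they are not MSD tags), and it does not reproduce how DIAC+ itself works.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (ă -> a, ț -> t)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical mini-lexicon: (word form, illustrative POS label).
LEXICON = [
    ("fata", "NOUN_DEF"),   # the girl
    ("fată", "NOUN"),       # girl
    ("fată", "VERB_3SG"),   # (she) calves
    ("făta", "VERB_INF"),   # (to) calve
    ("fâță", "NOUN"),       # a fussy girl
]

# Index every form under its diacritic-free spelling.
INDEX = defaultdict(set)
for form, pos in LEXICON:
    INDEX[strip_diacritics(form)].add((form, pos))

def candidates(word, pos=None):
    """Forms sharing the word's diacritic-free spelling, optionally filtered by POS."""
    entries = INDEX.get(strip_diacritics(word), set())
    return sorted({form for form, tag in entries if pos is None or tag == pos})

print(candidates("fata"))              # every form spelled "fata" without diacritics
print(candidates("fata", "VERB_INF"))  # the tag leaves a single candidate
```

Note that the label NOUN still leaves two candidates here, which is why a tagger with a fine-grained tagset and some context is needed in practice.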
Romanian LTs: NP-chunker
 The Romanian NP Chunker uses the UAIC POS tagger and GGS
(Graphical Grammar Studio http://sourceforge.net/projects/ggs/),
a visual tool for describing grammars.
 A Romanian grammar has been developed allowing fully recursive
NP chunks.
 http://nlptools.infoiasi.ro/WebNpChunkerRo/ (webservice)
Romanian FDG parser
 http://nlptools.infoiasi.ro/WebFdgRo/ (webservice)
 The parser was trained on a dependency treebank.
Romanian Word Linker - LexPar
 A link between two syntactically and semantically related words in a
sentence approximates a dependency relation, with no
orientation and no labeling.
 A link structure of a sentence is constructed with a Lexical
Attraction Model
 Dan Tufiș, Radu Ion, Alexandru Ceaușu, and Dan Ștefănescu.
RACAI's Linguistic Web Services. In Proceedings of the 6th
Language Resources and Evaluation Conference - LREC 2008,
Marrakech, Morocco, May 2008. ELRA - European Language
Resources Association.
RO / EN Named Entity Recognizer & Editor
 http://nlptools.infoiasi.ro/UAIC.NamedEntityRecognizer/ (web
service)
 NEs are assigned, through a voting system, to four top-level
classes (PERSON, LOCATION, ORGANIZATION and
MISC) and a total of nine subclasses
RO / EN Anaphora Recognizer & Editor
 http://nlptools.infoiasi.ro/UAIC.AnaphoraResolution/
 http://nlptools.infoiasi.ro/UAIC.AnaphoraEditor/
 Features used to decide whether two NPs belong to the same
co-referential chain:
 number agreement, gender agreement and morphological
description, computed on the head noun;
 similarity between the two noun phrases, at both lemma level and
text level, computed on the head noun and on the entire
noun phrase;
 whether or not the two noun phrases belong to the same phrase.
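The features listed above can be pictured as one feature vector per pair of noun phrases. A minimal sketch, assuming a hypothetical `NounPhrase` record whose field names are made up for illustration; the actual UAIC feature encoding and classifier are not described on the slide.

```python
from dataclasses import dataclass

@dataclass
class NounPhrase:
    text: str        # full surface text of the NP
    head: str        # surface form of the head noun
    head_lemma: str  # lemma of the head noun
    number: str      # e.g. "sg" / "pl"
    gender: str      # e.g. "f" / "m" / "n"
    msd: str         # morphological description of the head (illustrative string)
    phrase_id: int   # identifier of the enclosing phrase

def pair_features(a, b):
    """Boolean features for deciding whether two NPs may co-refer."""
    return {
        "number_agree": a.number == b.number,
        "gender_agree": a.gender == b.gender,
        "msd_match": a.msd == b.msd,
        "head_lemma_match": a.head_lemma == b.head_lemma,
        "head_text_match": a.head.lower() == b.head.lower(),
        "np_text_match": a.text.lower() == b.text.lower(),
        "same_phrase": a.phrase_id == b.phrase_id,
    }

np1 = NounPhrase("the girl", "girl", "girl", "sg", "f", "Ncfs", 0)
np2 = NounPhrase("girls", "girls", "girl", "pl", "f", "Ncfp", 1)
features = pair_features(np1, np2)
print(features)
```

Such a dictionary would typically be fed to a classifier that decides, pair by pair, whether to extend a co-reference chain.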
RO / EN Clause Splitter & Editor
 http://nlptools.infoiasi.ro/UAIC.ClauseSplitter/
 http://nlptools.infoiasi.ro/UAIC.ClauseEditor/
 Features used to build the model of compound
verbs:
 the distance between the verbs
 the presence of punctuation or markers between them
 the lemma and the morphological description of the verbs
RO / EN Discourse Parser
 http://nlptools.infoiasi.ro/UAIC.DiscourseParser/
 The generated discourse trees mark only the
nuclearity of the nodes; the names of the relations are ignored.
 The discourse parser builds the trees incrementally
and is constrained by two general principles of
discourse parsing: sequentiality of the terminal nodes (Marcu,
2000) and attachment restricted to the right frontier.
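The two constraints can be made concrete with a small binary tree: the right frontier is the path from the root down the rightmost branch, and a new unit may only be adjoined to a node on that path, which keeps the terminal nodes in text order. A minimal sketch, assuming a hypothetical `Node` class; nuclearity is not modelled and this is not the UAIC parser.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def right_frontier(root):
    """Nodes on the path from the root down the rightmost branch."""
    path, node = [], root
    while node is not None:
        path.append(node)
        node = node.right
    return path

def attach(root, depth, leaf):
    """Adjoin `leaf` to the right-frontier node at `depth`; return the new root."""
    frontier = right_frontier(root)
    target = frontier[depth]
    joined = Node("*", left=target, right=leaf)
    if depth == 0:
        return joined
    frontier[depth - 1].right = joined
    return root

def leaves(node):
    """Terminal nodes in left-to-right order."""
    if node.left is None and node.right is None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)

tree = Node("u1")
tree = attach(tree, 0, Node("u2"))  # (u1 u2)
tree = attach(tree, 1, Node("u3"))  # (u1 (u2 u3))
print(leaves(tree))
print([n.label for n in right_frontier(tree)])
```

Whatever depth is chosen, the leaf order stays u1, u2, u3, which is the sequentiality principle.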
EBMT system
 http://www.racai.ro/tools/translation/racai-translation-system/
 Available for EN ↔ RO, EN ↔ GER, EN ↔ ESP, RO ↔ SLO
Language resources
 Dictionaries: monolingual / bilingual… → eDTLR
 (60) Wordnets → RoWordnet
 Lexical / morphological resources
 Corpora
 mono- / multilingual → translation / language models → CoRoLa
 comparable
 parallel → RoTimeBank, JRC-Acquis, Ro-Semcor
 treebanks
 Text collections
Romanian Wordnet
 BalkaNet, 2004: lexical semantic network of Romanian
 Hierarchy Preservation Principle and Conceptual Density Principle
 aligned at the conceptual level with the English Princeton
WordNet 3.0, the SUMO & MILO ontologies and the IRST
DOMAINS taxonomy
 PWN 2.0-3.0 mappings: http://dev.racai.ro/dw/PWNMappings20-30/PWN_3.0-2.0_Concept_Mapping.zip
 It includes the SentiWordNet subjectivity mark-up.
 words belonging both to the general vocabulary and to various
domains of activity
 Approx. 60,000 synsets
 Used in word sense disambiguation, machine translation and question
answering systems
Romanian Wordnet (2)
 http://www.racai.ro/en/tools/text/rowordnet-visualizer/
Romanian Wordnet (3)
PoS | Synsets | Literals | Unique literals | Non-lexicalised
Nouns | 41063 | 56532 | 52009 | 1839
Verbs | 10397 | 16484 | 14210 | 759
Adjectives | 4822 | 8203 | 7407 | 79
Adverbs | 3066 | 4019 | 3248 | 110
TOTAL | 59348 | 85238 | 75656 | 2787
Barbu Mititelu, Verginica and Dumitrescu, Ștefan Daniel and Tufiș, Dan. News
about the Romanian Wordnet. In Proceedings of the 7th International Global
WordNet Conference. Tartu, Estonia, 2014
DTLR
 Romanian Academy, since 1913
 33 volumes, more than 15,000 pages and about 175,000 entries,
with citations collected from more than 2,500 volumes of the
written Romanian literature
eDTLR
 The digital form of DTLR, including its sources in digital form
and the software to access them
 National project, 2007 - 2010
 Steps in Building eDTLR:
 Preliminary processing of the paper version
 Scanning
 Image Processing
 Automatic recognition of symbols - OCR
 Correction phases – volunteers + specialists
 Parsing the entries
 Correcting the structure - specialists
 Linking the dictionary entries to sources
CoRoLa – the reference electronic corpus of
contemporary Romanian language
 http://www.racai.ro/en/research-activities/corola-program-prioritar-al-academiei-romane/
 a large corpus (more than 500 million word forms)
 all functional styles will be represented
 written texts: from books, newspaper articles, booklets,
theses and technical reports
 oral texts: 300 hours of recordings accompanied by their
transcripts
 pre-processed and annotated texts (at least at the
morphological level, but maybe also at a syntactic and even
semantic and discourse level).
CoRoLa - partners
CoRoLa – current stats
Domain | Sentences | Tokens | Words | Content words
News | 651,872 | 10,294,016 | 8,558,619 | 4,662,528
Medical | 603,161 | 10,950,271 | 9,163,029 | 5,226,837
Legal | 659,646 | 9,067,516 | 7,482,484 | 4,247,737
Biogr. | 314,368 | 5,802,961 | 4,298,493 | 2,567,427
Fiction | 517,803 | 8,002,596 | 6,773,648 | 3,531,156
Total | 2,746,850 | 44,117,360 | 36,276,273 | 20,235,685
Barbu Mititelu, Verginica and Irimia, Elena and Tufiș, Dan. CoRoLa – The Reference
Corpus of Contemporary Romanian Language. In Proceedings of LREC'14. Reykjavik,
Iceland, pp. 1235–1239, 2014
RoTimeBank - motivations
1. QA:
• when?, how often? or how long?
• Temporally-anchored questions
2. IE & IR
• Tracks in evaluation campaigns (SemEval, ACE, TAC)
3. MT:
• translated and normalized temporal references
• mappings between the different behavior of tenses
across languages
4. DP:
• temporal structure of discourse
• Summarization (biographic summaries)
RoTimeBank – motivations (2)
• Time-consuming, error-prone annotation for
Romanian
• “fuzzy” situations
• all sentences express an EVENT
• acum câteva zile (a few days ago), (în) următoarele luni ((in) the following months)
• long-distance relations (dependencies)
• Extensions to other domains (literature,
legislation)
• ISO standard
TimeML standard
 A metadata standard developed especially
for (English) news articles, for marking
 events: EVENT, MAKEINSTANCE
 temporal anchoring of events: TIMEX3,
SIGNAL
 links between events and/or timexes:
TLINK, ALINK, SLINK
 ISO proposal including Italian, Chinese,
Korean
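To show how these tags fit together, the sketch below parses a small hand-written TimeML-style fragment and resolves its TLINK into a readable (event, relation, time) triple. The sentence and its annotation are made up for illustration and are not taken from TimeBank; the attribute names follow the TimeML conventions as far as this example needs them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written illustrative fragment, not a TimeBank document.
FRAGMENT = """<TimeML>
<s>He <EVENT eid="e1" class="OCCURRENCE">arrived</EVENT>
<SIGNAL sid="s1">on</SIGNAL>
<TIMEX3 tid="t1" type="DATE" value="2015-02-10">10 February 2015</TIMEX3>.</s>
<MAKEINSTANCE eiid="ei1" eventID="e1" tense="PAST" aspect="NONE"/>
<TLINK lid="l1" eventInstanceID="ei1" relatedToTime="t1" signalID="s1" relType="IS_INCLUDED"/>
</TimeML>"""

root = ET.fromstring(FRAGMENT)

# Count how often each tag occurs.
tag_counts = Counter(el.tag for el in root.iter())

# Lookup tables from identifiers to annotated text.
events = {e.get("eid"): e.text for e in root.iter("EVENT")}
timexes = {t.get("tid"): t.text for t in root.iter("TIMEX3")}
instances = {m.get("eiid"): m.get("eventID") for m in root.iter("MAKEINSTANCE")}

# Resolve each TLINK: event instance -> event text, time id -> timex text.
links = [
    (events[instances[link.get("eventInstanceID")]],
     link.get("relType"),
     timexes[link.get("relatedToTime")])
    for link in root.iter("TLINK")
]
print(tag_counts)
print(links)
```

The indirection through MAKEINSTANCE is what lets one EVENT in the text correspond to several event instances.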
TimeBank corpus
 183 English news reports annotated with TimeML,
freely distributed through LDC
 4715 sentences with
 10586 unique lexical units, from
 a total of 61042 lexical units
 Non-TimeML markup in TimeBank 1.1:
 structure information: header
 named entity recognition: <ENAMEX>, <NUMEX>,
<CARDINAL>
 sentence boundary information: <s>
TimeBank - Parallel corpus creation & processing
1. Translation (guidelines)
2. Pre-processing (tokenizing, POS-tagging)
3. Alignment (word-level, manual
correction)
4. Annotation import (automatic, with
manual evaluation)
5. ISO-TimeML adapted to Romanian
(annotation guideline)
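Step 4, the annotation import, can be sketched as projecting each English annotation span through the word alignment onto the Romanian tokens. The sentence pair, the alignment and the span encoding below are hand-made assumptions for illustration, not the project's actual data format or import tool.

```python
def project(annotations, alignment):
    """Map (tag, start, end) spans over source tokens to spans over target tokens.

    `alignment` maps a source token index to a list of target token indices.
    A span with no aligned token cannot be transferred and gets (tag, None, None).
    """
    projected = []
    for tag, start, end in annotations:
        targets = sorted({j for i in range(start, end + 1)
                          for j in alignment.get(i, [])})
        if targets:
            projected.append((tag, targets[0], targets[-1]))
        else:
            projected.append((tag, None, None))
    return projected

en = ["He", "arrived", "a", "few", "days", "ago"]
ro = ["A", "sosit", "acum", "câteva", "zile"]

# Hypothetical manual word alignment (English index -> Romanian indices).
# "He" has no counterpart because Romanian is pro-drop.
alignment = {1: [0, 1], 2: [3], 3: [3], 4: [4], 5: [2]}

en_annotations = [("EVENT", 1, 1), ("TIMEX3", 2, 5)]
ro_annotations = project(en_annotations, alignment)

for tag, start, end in ro_annotations:
    print(tag, ro[start:end + 1])
```

Taking the minimum and maximum aligned index keeps the projected span contiguous; cases where that is wrong are the ones that need manual amendment.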
Analysis of the annotation import
1. Types of temporal annotation import
1. Perfect transfer
2. Transfer with some amendments due to TimeML
specifications
3. Transfer with amendments imposed by
language-specific phenomena
4. Impossible transfer
2. Temporal elements not (yet) marked in the
Romanian & English corpus
RO-EN TimeBank stats
TimeML tags | RO | EN | % final transfer
EVENTs | 7926 | 7935 | 99.89
instances | - | 7940 | -
TIMEXes | 1414 | 1414 | 100.00
SIGNALs | 669 | 688 | 97.24
TLINKs | 6311 | 6418 | 98.33
SLINKs | 2908 | 2932 | 99.18
ALINKs | 262 | 265 | 98.87
TOTAL | 19490 | 19652 | 99.18
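The transfer percentages are the ratio of Romanian to English tag counts, and the totals are the column sums without the instances row. A short check of that arithmetic, using only the counts from the table (the variable names are illustrative):

```python
# (RO count, EN count) per TimeML tag, copied from the table above.
counts = {
    "EVENTs": (7926, 7935),
    "TIMEXes": (1414, 1414),
    "SIGNALs": (669, 688),
    "TLINKs": (6311, 6418),
    "SLINKs": (2908, 2932),
    "ALINKs": (262, 265),
}

def transfer_rate(ro, en):
    """Percentage of English annotations that have a Romanian counterpart."""
    return round(100 * ro / en, 2)

rates = {tag: transfer_rate(ro, en) for tag, (ro, en) in counts.items()}
total_ro = sum(ro for ro, _ in counts.values())
total_en = sum(en for _, en in counts.values())
total_rate = transfer_rate(total_ro, total_en)

for tag, rate in rates.items():
    print(f"{tag}: {rate:.2f}")
print(f"TOTAL: {total_ro} / {total_en} = {total_rate:.2f}")
```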
Final thoughts
 Time is the only critic without ambition.
(John Steinbeck)
 Time is a great teacher. Unfortunately, it kills
all its pupils.
(Hector Berlioz)
Evaluation competitions for LRT development
 CLEF: Cross-Language Evaluation Forum →
Conference and Labs of the Evaluation Forum
 QA@CLEF 2007-2008
 ResPubliQA 2009-2010
 QA4MRE 2011-2013
 QALD 2015
 GikiCLEF 2009
 MultiLing @ ACL 2013
Scientific & awareness-raising events
 EUROLAN summer schools
 2015, 12th edition, Sibiu, Romania:
 Linguistic Linked Open Data
http://eurolan.info.uaic.ro/2015/
 ConsILR workshops (Conference on Linguistic Resources
and Tools for Processing the Romanian Language)
 http://consilr.info.uaic.ro/2014/index.php?list=eng
 CICLing 2010, GWC 2016
 LT4RD 2012 – Language Technologies in Romanian
Diaspora
 Following Anita Borg @ Iasi, through WITchIS
Thank you for your attention!
Further information:
corinfor@info.uaic.ro
References
 META-NET white papers - http://www.meta-net.eu/whitepapers/overview
 Steven Krauwer (2003), “The Basic Language Resource Kit (BLARK) as the First
Milestone for the Language Resources Roadmap”, in Proceedings of the
International Workshop “Speech and Computer”, Moscow, Russia.
More Related Content

Similar to Disntinguished Speaker - Corina Forascu

Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsAdnanBaloch15
 
Recent advances in LVCSR : A benchmark comparison of performances
Recent advances in LVCSR : A benchmark comparison of performancesRecent advances in LVCSR : A benchmark comparison of performances
Recent advances in LVCSR : A benchmark comparison of performancesIJECEIAES
 
2010-04-29-swnj-pcls-presentation
2010-04-29-swnj-pcls-presentation2010-04-29-swnj-pcls-presentation
2010-04-29-swnj-pcls-presentationDouglas Randall
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerCOMRADES project
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...ijaia
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...gerogepatton
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...gerogepatton
 
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
 
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
Educational Multimedia Dictionary
Educational Multimedia DictionaryEducational Multimedia Dictionary
Educational Multimedia DictionaryYannis Kotsanis
 
Corpus study design
Corpus study designCorpus study design
Corpus study designbikashtaly
 
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...ijnlc
 
Fao Semantics Related Projects
Fao Semantics Related ProjectsFao Semantics Related Projects
Fao Semantics Related ProjectsMargherita Sini
 
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxSyedNadeemAbbas6
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
 
Using Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR SystemsUsing Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR SystemsMauro Dragoni
 
Realization of natural language interfaces using
Realization of natural language interfaces usingRealization of natural language interfaces using
Realization of natural language interfaces usingunyil96
 

Similar to Disntinguished Speaker - Corina Forascu (20)

OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual DictionariesOpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Recent advances in LVCSR : A benchmark comparison of performances
Recent advances in LVCSR : A benchmark comparison of performancesRecent advances in LVCSR : A benchmark comparison of performances
Recent advances in LVCSR : A benchmark comparison of performances
 
2010-04-29-swnj-pcls-presentation
2010-04-29-swnj-pcls-presentation2010-04-29-swnj-pcls-presentation
2010-04-29-swnj-pcls-presentation
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source Lemmatizer
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
 
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
Educational Multimedia Dictionary
Educational Multimedia DictionaryEducational Multimedia Dictionary
Educational Multimedia Dictionary
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
 
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
 
Fao Semantics Related Projects
Fao Semantics Related ProjectsFao Semantics Related Projects
Fao Semantics Related Projects
 
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Using Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR SystemsUsing Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR Systems
 
Realization of natural language interfaces using
Realization of natural language interfaces usingRealization of natural language interfaces using
Realization of natural language interfaces using
 
RusLTC at TSD-2014 (Brno)
RusLTC at TSD-2014 (Brno)RusLTC at TSD-2014 (Brno)
RusLTC at TSD-2014 (Brno)
 

Recently uploaded

Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners

Distinguished Speaker - Corina Forascu

  • 1. How to add a language to the linguistic resources map Corina Forăscu Alexandru Ioan Cuza University of Iasi - Faculty of Computer Science & Romanian Academy Research Institute for Artificial Intelligence “Mihai Drăgănescu” corinfor@info.uaic.ro Distinguished Speakers Departmental Seminars 10th of February, 2015
  • 2.  How to efficiently use time in research and… personal life?  Why are languages and language technologies (LT) important in our societies?  How to deal with a less-studied language?  How to build and exploit new language resources?  How much time is needed?  How to represent and use time (temporal information in NLP applications)?
  • 3. Agenda  Languages  Language technologies  for Romanian  Language resources  for Romanian  Research projects / competitions & scientific / personal events
  • 4. Languages – native speakers Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2014. Ethnologue: Languages of the World, Seventeenth edition. Dallas, Texas: SIL International http://www.ethnologue.com/statistics/size
  • 5. Languages – Internet speakers http://www.vistawide.com/languages/top_30_languages.htm
  • 6. Languages and LRTs (level of support; http://www.meta-net.eu/whitepapers/ 2012)
      Area                       Excellent  Good  Moderate                            Fragmentary               Weak/no
      Machine Translation        -          EN    FR, ESP                             RO, GER, IT, POL, HU, …   majority
      Speech Processing          -          EN    CZE, POR, NL, FR, GER, IT, ESP, ..  RO, BG, GR, ..            MT, LIT, SRB, ..
      Text Analysis              -          EN    NL, FR, GER, IT, ESP, POR..         RO, CZE, POR, BG, GR, ..  MT, LIT, SRB, ..
      Speech and Text Resources  -          EN    CZE, NL, FR, GER, IT, ESP, ..       RO, POR, BG, GR, ..       MT, LIT, ..
  • 7. Romania http://www.internetworldstats.com/eu/ro.htm
      Year  Users       Population  % Pop.
      2000  800,000     22,217,700  3.6 %
      2004  4,000,000   21,377,426  18.7 %
      2006  4,940,000   21,154,226  23.4 %
      2007  5,062,500   21,154,226  23.9 %
      2010  7,786,700   21,959,278  35.5 %
      2012  9,642,383   21,848,504  44.1 %
      2014  10,812,784  21,729,871  49.8 %
       5,374,980 Facebook subscribers on Dec 31, 2012, 24.7% penetration rate.
       55.54 Mbps broadband download speed in August 2014
       Netindex: 60.53 Mbps download speed
  • 8. Romanian  Romance language, with influences from old Slavic, Turkish, Greek, German, Hungarian, Bulgarian, Russian  spoken by about 29 mil. people, with 4 official dialects  highly inflected language  pro-drop language ([en] It rains. / [ro] Plouă)  with clitic doubling ([en] I see her. [ro] O văd pe ea.)  with negative concord  with double negation  Mihai Eminescu  Emil Cioran  Mircea Eliade  Mircea Cărtărescu
  • 9. BLARK - Basic LAnguage Resource Kit  (a) the minimal general text corpus to be able to do any precompetitive research for the language at all, annotated according to some generally accepted standards  (a’) something similar for a spoken text corpus  (b) a collection of basic tools to manipulate and analyze the corpora  LT systems  (c) a collection of skills that constitute the minimal starting point for the development of a competitive NL/Speech technology industry http://www.elsnet.org/dox/blark.html
  • 10. LT systems
      Preprocessing: cleaning data; format analysis / removal; language identification
      Morpho-syntactic analysis: sentence segmentation; tokenization; POS-tagging, chunking
      Semantic analysis: word sense disambiguation; NER, event extraction; anaphora resolution; discourse processing
      Specific modules: QA; TE; summarization; MT
  • 11. Language Identification  web service derived from a standalone application that was initially aimed at autonomously collecting web data for English and Romanian  distinguishes among the 22 languages of the European Union present in the JRC-Acquis parallel corpus
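The slide does not say how the web service tells languages apart. Purely as an illustration, the sketch below uses a stopword-overlap heuristic for two of the languages; the stopword lists and the `identify` helper are assumptions made for this example and are not taken from the service.

```python
# Minimal sketch of language identification by stopword overlap.
# NOTE: the stopword lists and identify() are illustrative assumptions;
# the web service described on the slide may work very differently.
import re

STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "ro": {"și", "este", "de", "la", "în", "un", "o", "care", "pe"},
}

def identify(text: str) -> str:
    """Return the language code whose stopwords occur most often in text."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no stopword hits at all, the heuristic cannot decide.
    return best if scores[best] > 0 else "unknown"

print(identify("The girl is in the garden and it rains."))
print(identify("Fata este în grădină și plouă."))
```

A heuristic like this only scales to a couple of languages; covering all 22 would more plausibly rely on statistics learned from a corpus such as JRC-Acquis.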
  • 12. Romanian LTs: morpho-syntactic analysis
       UAIC Romanian POS tagger
          http://nlptools.infoiasi.ro/WebPosRo/ (web service)
          Sentence splitting, tokenizing, POS-tagging (406 MSD tags, based on a morphological dictionary of 1.25 million words and a statistical model) and lemmatizing
       TTL (Tokenizing, Tagging and Lemmatizing free running texts)
          http://www.racai.ro/tools/text/ (web service & standalone application)
          Sentence splitting, tokenization, POS tagging (about 600 CTAGs), lemmatization and chunking on Romanian, English and French texts
      Precision          Without rules  With rules
      For unknown words  88.88%         93.31%
      For all words      95.12%         97.03%
  • 13. Romanian diacritics recovery – DIAC+
       fata (the girl) / fată (girl) / fată ((she) calves) / făta ((to) calve) / fâță (a fussy girl)
       Diacritics have a high frequency (every third word might contain at least one diacritical character)
       Diacritics contribute significantly to the morpho-lexical and semantic disambiguation of words
       Plugin for Office 2003/2007/2010/2013
       http://www.racai.ro/downloads/diac/diac+.zip
       Based on tokenization, sentence splitting, lemmatization, and especially POS tagging (MSD tags), DIAC+ disambiguates between several possible word forms that may or may not contain diacritics
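The ambiguity behind the "fata" example can be made concrete with a small sketch. It groups known word forms by their diacritic-free spelling, so that one plain input maps to several candidates. The tiny `LEXICON` and the helper names are assumptions for illustration, not DIAC+ internals; the tool, as the slide says, uses POS tagging to choose among such candidates.

```python
# Sketch: group lexicon word forms by their diacritic-free spelling.
# The tiny LEXICON and the helper names are illustrative assumptions,
# not the internals of DIAC+.
import unicodedata
from collections import defaultdict

def strip_diacritics(word: str) -> str:
    """Drop combining marks after canonical decomposition (ă -> a, ț -> t)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

LEXICON = ["fata", "fată", "făta", "fâță", "plouă"]

def build_index(lexicon):
    """Map each diacritic-free spelling to the set of forms that produce it."""
    index = defaultdict(set)
    for form in lexicon:
        index[strip_diacritics(form)].add(form)
    return index

INDEX = build_index(LEXICON)

def candidates(plain_word: str) -> set:
    """All known forms that collapse to plain_word once diacritics are dropped."""
    return set(INDEX.get(plain_word, {plain_word}))

print(sorted(candidates("fata")))
```

Generating candidates is the easy half; picking the right one in context is where the tagger's morpho-syntactic information is needed.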
  • 14. Romanian LTs: NP-chunker  The Romanian NP Chunker uses the UAIC POS tagger and GGS (Graphical Grammar Studio http://sourceforge.net/projects/ggs/), a visual tool for describing grammars.  A Romanian grammar has been developed allowing fully recursive NP chunks.  http://nlptools.infoiasi.ro/WebNpChunkerRo/ (webservice)
  • 15. Romanian FDG parser  http://nlptools.infoiasi.ro/WebFdgRo/ (webservice)  The parser was trained on a dependency treebank linguistic resource.
  • 16. Romanian Word Linker - LexPar  A link between two syntactico-semantically related words in a sentence is an approximation of a dependency relation, with no orientation and no labeling.  A link structure of a sentence is constructed with a Lexical Attraction Model  Dan Tufiș, Radu Ion, Alexandru Ceaușu, and Dan Ștefănescu. RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association.
  • 17. RO / EN Named Entity Recognizer & Editor  http://nlptools.infoiasi.ro/UAIC.NamedEntityRecognizer/ (web service)  NEs are organized – based on a voting system - under four top level classes (PERSON, LOCATION, ORGANIZATION and MISC) and a total of nine subclasses
  • 18. RO / EN Anaphora Recognizer & Editor  http://nlptools.infoiasi.ro/UAIC.AnaphoraResolution/  http://nlptools.infoiasi.ro/UAIC.AnaphoraEditor/  Features used to decide if there is a co-referential chain between two NPs:  number agreement, gender agreement, and morphological description, implemented on the head noun;  similarity between the two noun phrases, at both lemma level and text level, implemented on the head noun and also on the entire noun phrase;  whether or not the two noun phrases belong to the same phrase.
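The feature types listed above could be encoded roughly as follows. This is a minimal sketch: the `Mention` class, its fields, the feature names and the use of `difflib` for surface similarity are all assumptions for illustration, since the slide names the kinds of features but not how the tool represents them.

```python
# Sketch of pairwise features for deciding whether two NPs corefer.
# The Mention class, its fields and the feature names are illustrative
# assumptions; the slide lists feature types, not their encoding.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Mention:
    text: str        # full noun phrase as it appears in the text
    head_lemma: str  # lemma of the head noun
    number: str      # "sg" or "pl"
    gender: str      # "m", "f" or "n"
    phrase_id: int   # identifier of the enclosing phrase

def pair_features(a: Mention, b: Mention) -> dict:
    """Compute agreement, similarity and locality features for an NP pair."""
    return {
        "number_agree": a.number == b.number,
        "gender_agree": a.gender == b.gender,
        "head_lemma_match": a.head_lemma == b.head_lemma,
        # Surface similarity of the whole phrases, a value in [0, 1].
        "text_similarity": SequenceMatcher(
            None, a.text.lower(), b.text.lower()
        ).ratio(),
        "same_phrase": a.phrase_id == b.phrase_id,
    }

first = Mention("fata", "fată", "sg", "f", 0)
second = Mention("o fată", "fată", "sg", "f", 1)
print(pair_features(first, second))
```

A classifier or a set of weighted rules would then consume such feature dictionaries to decide whether to link the two mentions.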
  • 19. RO / EN Clause Splitter & Editor  http://nlptools.infoiasi.ro/UAIC.ClauseSplitter/ http://nlptools.infoiasi.ro/UAIC.ClauseEditor/  Features used to build the model of compound verbs:  the distance between the verbs  the existence of punctuation or markers between them  the lemma and the morphological description of the verbs
  • 20. RO / EN Discourse Parser  http://nlptools.infoiasi.ro/UAIC.DiscourseParser/  The generated discourse trees show only the nuclearity of the nodes; the names of the relations are ignored.  The discourse parser builds the trees incrementally and is constrained by two general principles in discourse parsing: sequentiality of the terminal nodes (Marcu, 2000) and attachment restricted to the right frontier.
  • 21. EBMT system  http://www.racai.ro/tools/translation/racai-translation-system/  Available for the language pairs EN-RO, EN-GER, EN-ESP, RO-SLO
  • 22. Language resources  Dictionaries monolingual / bilingual…  eDTLR  (60) Wordnets  RoWordnet  Lexical / morphological resources  Corpora Mono / multi-lingual  translation / language models  CoRoLa comparable parallel  RoTimeBank, JRC-Acquis, Ro-Semcor treebanks  Text collections
  • 23. Romanian Wordnet  BalkaNet, 2004: lexical semantic network of Romanian  Hierarchy Preservation Principle and Conceptual Density Principle  aligned at the conceptual level with Princeton WordNet 3.0 (the English WordNet), the SUMO & MILO ontologies, and the IRST DOMAINS taxonomy  PWN 2.0-3.0 mappings http://dev.racai.ro/dw/PWNMappings20-30/PWN_3.0-2.0_Concept_Mapping.zip  It includes the SentiWordNet subjectivity mark-up.  words belonging both to the general vocabulary and to various domains of activity  About 60,000 synsets  Used in word sense disambiguation, machine translation and question answering systems
  • 24. Romanian Wordnet (2)  http://www.racai.ro/en/tools/text/rowordnet-visualizer/
  • 25. Romanian Wordnet (3)
      PoS         Synsets  Literals  Unique Lit  NonLexicalised
      Nouns       41063    56532     52009       1839
      Verbs       10397    16484     14210       759
      Adjectives  4822     8203      7407        79
      Adverbs     3066     4019      3248        110
      TOTAL       59348    85238     75656       2787
      Barbu Mititelu, Verginica and Dumitrescu, Ștefan Daniel and Tufiș, Dan. News about the Romanian Wordnet. In Proceedings of the 7th International Global WordNet Conference. Tartu, Estonia, 2014
  • 26. DTLR  Romanian Academy, since 1913  33 volumes, more than 15,000 pages and about 175,000 entries, with citations collected from more than 2,500 volumes of the written Romanian literature
  • 27. eDTLR  The digital form of DTLR, including its sources in digital form and the software to access them  National project, 2007 - 2010  Steps in Building eDTLR:  Preliminary processing of the paper version  Scanning  Image Processing  Automatic recognition of symbols - OCR  Correction phases – volunteers + specialists  Parsing the entries  Correcting the structure - specialists  Linking the dictionary entries to sources
  • 28. CoRoLa – the reference electronic corpus of contemporary Romanian language  http://www.racai.ro/en/research-activities/corola-program-prioritar-al-academiei-romane/  a big corpus (more than 500 million word forms)  all functional styles will be represented  written texts: from books, newspaper articles, booklets, theses and technical reports  oral texts: 300 hours of recordings accompanied by their transcripts  pre-processed and annotated texts (at least at the morphological level, but maybe also at a syntactic and even semantic and discourse level).
  • 30. CoRoLa – current stats
               Sentences  Tokens      Words       Content words
      News     651,872    10,294,016  8,558,619   4,662,528
      Medical  603,161    10,950,271  9,163,029   5,226,837
      Legal    659,646    9,067,516   7,482,484   4,247,737
      Biogr.   314,368    5,802,961   4,298,493   2,567,427
      Fiction  517,803    8,002,596   6,773,648   3,531,156
      Total    2,746,850  44,117,360  36,276,273  20,235,685
      Barbu Mititelu, Verginica and Irimia, Elena and Tufiș, Dan. CoRoLa – The Reference Corpus of Contemporary Romanian Language. In Proceedings of LREC'14. Reykjavik, Iceland, pp. 1235–1239, 2014
  • 31. RoTimeBank - motivations
      1. QA:
         • when?, how often? or how long?
         • Temporally-anchored questions
      2. IE & IR
         • Tracks in evaluation campaigns (SemEval, ACE, TAC)
      3. MT:
         • translated and normalized temporal references
         • mappings between different behavior of tenses from language to language
      4. DP:
         • temporal structure of discourse
         • Summarization (biographic summaries)
  • 32. RoTimeBank – motivations (2) • Time-consuming, error-prone annotation for Romanian • “fuzzy” situations • all sentences express an EVENT • acum câteva zile, (în) următoarele luni • long-distance relations (dependencies) • Extensions to other domains (literature, legislation) • ISO standard
  • 33. TimeML standard  A metadata standard developed especially for (English) news articles, for marking  events: EVENT, MAKEINSTANCE  temporal anchoring of events: TIMEX3, SIGNAL  links between events and/or timexes: TLINK, ALINK, SLINK  ISO proposal including Italian, Chinese, Korean
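To show how the tag families above fit together, here is a small sketch that parses a TimeML-style fragment with Python's standard library and follows a TLINK from an event instance to a time expression. The sentence and all attribute values are invented for illustration, and the fragment is simplified rather than validated against the TimeML schema.

```python
# Sketch: read a TimeML-style fragment and resolve its temporal links.
# The sentence and all attribute values are invented for illustration;
# the fragment is simplified and not validated against the TimeML schema.
import xml.etree.ElementTree as ET

FRAGMENT = """<TimeML>
The seminar <EVENT eid="e1" class="OCCURRENCE">started</EVENT>
<SIGNAL sid="s1">on</SIGNAL>
<TIMEX3 tid="t1" type="DATE" value="2015-02-10">10 February 2015</TIMEX3>.
<MAKEINSTANCE eiid="ei1" eventID="e1" tense="PAST" aspect="NONE"/>
<TLINK lid="l1" eventInstanceID="ei1" relatedToTime="t1" signalID="s1" relType="IS_INCLUDED"/>
</TimeML>"""

root = ET.fromstring(FRAGMENT)

# Index the annotated elements by their identifiers.
events = {e.get("eid"): e.text for e in root.iter("EVENT")}
times = {t.get("tid"): t.get("value") for t in root.iter("TIMEX3")}
instances = {m.get("eiid"): m.get("eventID") for m in root.iter("MAKEINSTANCE")}

def describe_links(tree):
    """Return (event text, relation type, normalized time) for each TLINK."""
    described = []
    for link in tree.iter("TLINK"):
        event_id = instances[link.get("eventInstanceID")]
        described.append(
            (events[event_id], link.get("relType"), times[link.get("relatedToTime")])
        )
    return described

print(describe_links(root))
```

The indirection through MAKEINSTANCE is why the link refers to `ei1` rather than to the EVENT element itself.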
  • 34. TimeBank corpus  183 English news reports annotated with TimeML, freely distributed through LDC  4715 sentences with  10586 unique lexical units, from  a total of 61042 lexical units  Non-TimeML markup in TimeBank 1.1:  structure information: header  named entity recognition: <ENAMEX>, <NUMEX>, <CARDINAL>  sentence boundary information: <s>
  • 35. TimeBank - Parallel corpus creation & processing 1. Translation (guidelines) 2. Pre-processing (tokenizing, POS-tagging) 3. Alignment (word-level, manual correction) 4. Annotation import (automatic, with manual evaluation) 5. ISO-TimeML adapted to Romanian (annotation guideline)
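Step 4, importing annotations through the word alignment, can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for RoTimeBank: the tokens, tags, alignment pairs and the `project_tags` helper are all invented for the example.

```python
# Sketch: project token-level annotations through a word alignment.
# Tokens, tags and alignment pairs are invented for illustration; this is
# not the import procedure actually used for the Romanian corpus.

def project_tags(src_tags, alignment, tgt_len):
    """Copy each source token's tag to the target tokens aligned with it.

    src_tags:  dict {source token index: tag}
    alignment: iterable of (source index, target index) pairs
    Returns (list of target tags, source indices whose tag found no target).
    """
    tgt_tags = [None] * tgt_len
    transferred = set()
    for src_i, tgt_i in alignment:
        if src_i in src_tags:
            tgt_tags[tgt_i] = src_tags[src_i]
            transferred.add(src_i)
    lost = sorted(set(src_tags) - transferred)
    return tgt_tags, lost

en = ["The", "company", "announced", "losses", "yesterday"]
ro = ["Compania", "a", "anunțat", "pierderi", "ieri"]
en_tags = {2: "EVENT", 4: "TIMEX3"}
# "announced" aligns to two Romanian tokens ("a anunțat").
alignment = [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (4, 4)]

ro_tags, lost = project_tags(en_tags, alignment, len(ro))
print(list(zip(ro, ro_tags)), lost)
```

Annotated source tokens left without a target end up in `lost`, and one-to-many alignments such as "a anunțat" spread a tag over several tokens; both are the kind of case that the manual evaluation step would need to review.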
  • 36. Analysis of the annotation import
      1. Types of temporal annotation import
         1. Perfect transfer
         2. Transfer with some amendments due to TimeML specifications
         3. Transfer with amendments imposed by language-specific phenomena
         4. Impossible transfer
      2. Temporal elements not (yet) marked in the Romanian & English corpus
  • 37. RO-EN TimeBank stats
      TimeML tags  RO     EN     % final transfer
      EVENTs       7926   7935   99.89
      instances    -      7940   -
      TIMEXes      1414   1414   100.00
      SIGNALs      669    688    97.24
      TLINKs       6311   6418   98.33
      SLINKs       2908   2932   99.18
      ALINKs       262    265    98.87
      TOTAL        19490  19652  99.18
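The percentage column appears to be the Romanian count divided by the English count for each tag. The short check below recomputes it from the counts on the slide; reading the column as this ratio is an assumption, although it reproduces every reported value.

```python
# Recompute the "% final transfer" column as RO count / EN count.
# Counts are copied from the slide; reading the column as this ratio
# is an assumption (it does reproduce the reported figures).
COUNTS = {  # tag: (RO, EN)
    "EVENT": (7926, 7935),
    "TIMEX3": (1414, 1414),
    "SIGNAL": (669, 688),
    "TLINK": (6311, 6418),
    "SLINK": (2908, 2932),
    "ALINK": (262, 265),
}

def transfer_rate(ro: int, en: int) -> float:
    """Share of English annotations that have a Romanian counterpart, in %."""
    return round(100 * ro / en, 2)

rates = {tag: transfer_rate(ro, en) for tag, (ro, en) in COUNTS.items()}
total_ro = sum(ro for ro, _ in COUNTS.values())
total_en = sum(en for _, en in COUNTS.values())
rates["TOTAL"] = transfer_rate(total_ro, total_en)

for tag, rate in rates.items():
    print(f"{tag:7s}{rate:7.2f}")
```

The totals, 19490 and 19652, match the slide only when the MAKEINSTANCE row (7940 instances, English only) is left out of the sum.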
  • 38. Final thoughts  Time is the only critic without ambition. (John Steinbeck)  Time is a great teacher. Unfortunately, it kills all its pupils. (Hector Berlioz)
  • 39. Evaluation competitions for LRT development  CLEF: Cross-Language Evaluation Forum  Conference and Labs of the Evaluation Forum  QA@CLEF 2007-2008  ResPubliQA 2009 – 2010  QA4MRE 2011-2013  QALD 2015-2015  GikiCLEF 2009  MultiLing @ ACL 2013
  • 40. Scientific & raising awareness events  EUROLAN summer schools  2015, 12th edition, Sibiu, Romania:  Linguistic Linked Open Data http://eurolan.info.uaic.ro/2015/  ConsILR workshops (Conference on Linguistic Resources and Tools for Processing the Romanian Language)  http://consilr.info.uaic.ro/2014/index.php?list=eng  CICLing 2010, GWC 2016  LT4RD 2012 – Language Technologies in Romanian Diaspora  Following Anita Borg @ Iasi, through WITchIS
  • 42. Thank you for your attention! Further information: corinfor@info.uaic.ro
  • 43. References  META-NET whitepapers - http://www.meta-net.eu/whitepapers/overview  Steven Krauwer (2003), “The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap”, in Proceedings of the International Workshop “Speech and Computer”, Moscow, Russia.