Text Mining for Global Event Monitoring
EMM Team
(speaker: Vanni Zavarella)
European Commission – Joint Research Centre (JRC)
Structure and Dynamics of Media Flows, INA, 7 & 8 July 2016
• Hristo Tanev
• Vanni Zavarella
• Jakub Piskorski
• Martin Atkinson
EMM-NEXUS Team
And more colleagues from the OPTIMA team
• What does the Event Extraction (EE) application do?
• Custom domain event types
• Multilinguality
• Core extraction engine
• Resource Learning
• Event Time
• Event Location
• Performance
• References
Agenda
[Diagram: EMM pipeline. Components shown: RSS input, RSS Cache, real-time news clustering (typically the last 8 hours of RSS per language), Cluster, Stories, Summaries, EntityRecogniser, CrosslingualClustering, EventExtractor, breaking-news detection based on cluster growth, Mailer/SMS. Output: a continuously updated RSS+ feed annotated with <text>, <entity>, <geo>, <quote>, <tonality>, <category> and duplicate=.]
EMM Pipeline
• For each language, continuously collect media
reports from upstream EMM pipeline
• Every 10 minutes, cluster the latest articles (4-hour
window) about the same event or subject
• Hierarchical, agglomerative clustering, using
average group linkage and cosine similarity over
simple word count vectors
• Apply core language processing modules and
extraction grammars to each article's title and
description
• Identify and extract information on the main
events of the cluster
• Display the latest events on a map
• Give access to extracted information and to full
articles.
Nexus Pipeline
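The clustering step above can be sketched roughly as follows. This is a minimal illustration, not the EMM implementation: it takes "average group linkage" to mean the mean pairwise cosine similarity between the members of two clusters, and the function names and the stopping `threshold` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_counts(text):
    """Simple word-count vector for one article (title + description)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_linkage(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_articles(texts, threshold=0.3):
    """Bottom-up merging until no pair of clusters is similar enough.

    `threshold` is an assumed stopping value, not taken from the slides.
    """
    vectors = [word_counts(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > 1:
        # Pick the most similar pair of current clusters.
        (a, b), best = max(
            (((x, y), average_linkage(clusters[x], clusters[y], vectors))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return sorted(sorted(c) for c in clusters)
```

Each returned cluster is a list of article indices; two reports of the same bombing end up together while an unrelated finance story stays on its own.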
Car bomber strikes north Pakistan
ech-chorouk-en Tuesday, November 10, 2009 2:23:00 PM CET
A car bomb has exploded in Pakistani's northwestern town of Charsadda killing at least 10 people....
Bomb explodes in northwestern Pakistani town
yediotaharonot Tuesday, November 10, 2009 1:58:00 PM CET
A bomb exploded in the northwestern Pakistani town of Charsadda on Tuesday causing
an unknown number of casualties, police said. "It was a bomb blast....
10 killed in Pakistan bomb
RTERadio Tuesday, November 10, 2009 1:57:00 PM CET
A bomb has exploded in the north-western Pakistani town of Charsadda, killing 10 people....
TYPE Bombing
PLACE Charsadda, Pakistan
EVENT LOCATION Charsadda, Pakistan
TIME Tuesday, November 10, 2009
DEAD COUNT 10
DEAD DESCRIPTION people
WOUNDED COUNT/DESC
DISPLACED COUNT/DESC
HOMELESS COUNT/DESC
ARRESTED COUNT/DESC
PERPETRATOR
WEAPONS Bomb
Information Aggregation
http://emm.newsbrief.eu/NewsBrief/eventedition/all/latest.html
KML Network Links:
http://labs.emm4u.eu/EventModerator/service?target=event&language=en&format=kml&source=rtn
Public Interfaces
Event Types
Crisis event
Disaster Security-related Humanitarian crisis
Natural Disaster
Manmade disaster/accident
Arrest
Trial
Kidnapping/Hostage taking
Hostage release
Hostage video
Release
Violent event
Terrorist attack
Shooting
Armed conflict
Execution
Crimes
Assassination
Medical events
Event Type Hierarchy
[Figure: language labels IT, EN, FR, ES, AR, RU]
Multilinguality
Covers English, Italian, French, Spanish, Russian, Portuguese, Turkish,
Romanian, Bulgarian, Czech and Arabic after Machine Translation.
Light-weight and shallow process to allow coverage of many languages
Morphological dictionaries
• Static resources, mainly for grammatical structure of rules.
Domain-specific lexica
• (Possibly multiword) expressions subcategorised into semantic classes
relevant for the domain.
Surface-level extraction patterns
• Often learned (semi-)automatically, e.g. [VICTIM] was heavily wounded
Finite-state grammar rules
• To recognise person groups or other partial patterns, e.g. the actor, three Iraqi soldiers
• To generalise the surface-level extraction patterns, e.g. has been (strongly) wounded
• Ideally language-agnostic
System Resources
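As a rough illustration of how a surface pattern such as "[VICTIM] was heavily wounded" might be generalised to also cover "has been (strongly) wounded", here is a sketch using Python regular expressions in place of the finite-state grammar engine. The adverb list, the number list and the `PERSON_GROUP` expression are hypothetical stand-ins for the lexica and person-group grammars described above.

```python
import re

# Hypothetical mini-lexicon; the real system relies on dictionaries and domain lexica.
ADVERBS = r"(?:heavily|strongly|seriously|badly)"
NUMBER = r"(?:\d+|one|two|three|four|five)"
# Crude stand-in for a person-group grammar, e.g. "three Iraqi soldiers", "the actor".
PERSON_GROUP = rf"(?:the|a|{NUMBER})(?:\s+[A-Za-z]+){{1,3}}?"

# Generalised pattern: several auxiliary forms and an optional adverb.
WOUNDED = re.compile(
    rf"\b(?P<victim>{PERSON_GROUP})\s+"
    rf"(?:was|were|has\s+been|have\s+been)\s+"
    rf"(?:{ADVERBS}\s+)?wounded",
    re.IGNORECASE,
)


def extract_wounded(sentence):
    """Return the text filling the victim slot, or None if the pattern does not match."""
    match = WOUNDED.search(sentence)
    return match.group("victim") if match else None
```

For example, `extract_wounded("Three Iraqi soldiers were heavily wounded in the attack")` returns `"Three Iraqi soldiers"`, and the same expression also accepts "The actor has been wounded".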
• ExPRESS is a blend of JAPE (GATE) and XTDL
(SProUT).
• LHS of the rule is a regular expression over flat
feature structures.
• RHS specifies the output structure.
• Allows variables, labels, functional operators,
grammar cascading, …
• Multiple and nested labels (multiple actions)
• Rule sample:
Finite-State Engine
• Learning new patterns for target slots
• Learning semantic classes
• Learning domain-specific words
Resource learning
Pattern Learning
• <DEAD> was shot by <PERPETRATOR>
• police nabbes <ARRESTED>
• <KIDNAPPED> has been taken hostage
• <WOUNDED> was found injured
• raptou <KIDNAPPED>
• <DEAD> foram mortas
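Learned patterns like the ones above can be applied by compiling each template into a matcher with one capture per slot. The sketch below assumes a very crude slot filler (one to four word tokens, matched greedily); `SLOT_FILLER`, `compile_pattern` and `apply_pattern` are illustrative names, and a real system would use person-group grammars rather than a fixed token window.

```python
import re

# Hypothetical slot filler: one to four word tokens, matched greedily.
SLOT_FILLER = r"\w+(?:\s+\w+){0,3}"


def compile_pattern(template):
    """Turn a template such as '<DEAD> was shot by <PERPETRATOR>' into a regex."""
    pieces = []
    for token in template.split():
        slot = re.fullmatch(r"<([A-Z]+)>", token)
        if slot:
            # Slot name becomes a named group, e.g. <DEAD> -> (?P<dead>...)
            pieces.append(f"(?P<{slot.group(1).lower()}>{SLOT_FILLER})")
        else:
            pieces.append(re.escape(token))
    return re.compile(r"\b" + r"\s+".join(pieces) + r"\b", re.IGNORECASE)


def apply_pattern(template, sentence):
    """Return a dict of slot fillers, or an empty dict when nothing matches."""
    match = compile_pattern(template).search(sentence)
    return match.groupdict() if match else {}
```

Because the filler is greedy and purely lexical, it over-captures on longer sentences; this is one reason the slides pair surface patterns with grammar rules for person groups.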
Sometimes, it is necessary to learn specific semantic classes, e.g.
disasters, types of chemicals, facilities, professions, etc.
Language-independent system, only needs language-specific stop word lists
Two-step process: feature extraction and weighting (uni/bi-grams), term extraction and ranking
E.g.
• Seeds: toxic, hazardous
• Output (Top):
• hazardous 77.20
• toxic 73.10
• radioactive 18.67
• harmful 13.78
• nuclear 12.18
• dangerous 9.68
• organic 8.63
• chemical 8.56
• poisonous 8.00
• toxic substances 7.94
• highly toxic 7.37
• solid 7.26
• carcinogenic 7.21
• noxious 6.47
• industrial 5.73
• corrosive 5.45
Semantic class learning
Input: a handful of keywords (seeds)
Output: a set of keywords which tend to co-occur with the seeds, ordered by weight
• TF.IDF-like formula for term weighting:
Weight(term) = TF × IDF²
TF = Frequency(seeds, term): the number of documents which contain both the term and at least one of the seeds
IDF = log(NumberDocuments / Frequency(term)), where Frequency(term) is the number of documents containing the term
E.g.
• Seeds: sustainable development, sustainable energy, clean energy,
environmental, greenhouse gases
• Output:
• environment
• emissions
• climate
• carbon
• differ materially
• impact
• global
• development
• risks and uncertainties
• resources
• efficiency
• water
• future
• projects
• developing
• based
• cost
• economic
• potential
• reducing
• renewable
• quality
• management
• technologies
• efficient
• developed
• sustainability
• industry
• technology
• …
Terminology Learning
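The weighting formula can be sketched as below, reading the slide's expression as TF multiplied by the square of IDF. This is an assumption-laden illustration: documents are given as lists of unigram tokens (the slides also mention bigrams, which are left out here), stop-word filtering is omitted, and `rank_terms` is a hypothetical name.

```python
import math
from collections import Counter


def rank_terms(documents, seeds, top_n=10):
    """Rank terms by TF * IDF**2.

    documents: list of token lists; seeds: set of seed terms.
    """
    n_docs = len(documents)
    doc_freq = Counter()  # Frequency(term): documents containing the term
    co_freq = Counter()   # Frequency(seeds, term): documents with the term and at least one seed
    for tokens in documents:
        terms = set(tokens)
        doc_freq.update(terms)
        if terms & seeds:
            co_freq.update(terms)
    weights = {}
    for term, tf in co_freq.items():
        idf = math.log(n_docs / doc_freq[term])
        weights[term] = tf * idf ** 2
    # Highest weight first; ties broken alphabetically for a stable order.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Terms that never appear alongside a seed receive no weight at all, and the seeds themselves can appear in the ranking, as they do in the "toxic, hazardous" example.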
[Diagram: Time Extractor architecture. Stages shown: TEXT; Tokenization, Morphological Analysis, Temporal Lexicon Lookup; RECOGNITION (language-level grammar rules); INFORMATION GATHERING (compositional rules); FEATURE STRUCTURES (intermediate annotation); ATTRIBUTE NORMALIZATION; ANCHORS SELECTION; CALENDAR ARITHMETIC using the document CDATE; TIMEX3 OBJECTS. Implementation labels: EXPRESS GRAMMAR and JAVA CODE.]
• A rule-based system featuring finite-state pattern rules
• Very shallow text analysis modules, language-specific recognition rules, language-independent
normalization process
• Good Precision scores for EN (~90%) and successful porting to ES without significant
Precision drop, although Recall still lagging behind (~52%) at TempEval-3 (SemEval 2013)
Time Extractor Module
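The normalization step resolves relative expressions against the document creation date (CDATE) with calendar arithmetic. A minimal sketch of that idea follows; it handles only a few English expressions, and it assumes that a bare weekday in a news report refers to the most recent such day on or before CDATE. The function name and the covered expressions are illustrative.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def normalise(expression, cdate):
    """Resolve a relative temporal expression to an ISO date string, or None."""
    expr = expression.lower().strip()
    if expr == "today":
        return cdate.isoformat()
    if expr == "yesterday":
        return (cdate - timedelta(days=1)).isoformat()
    if expr == "tomorrow":
        return (cdate + timedelta(days=1)).isoformat()
    if expr in WEEKDAYS:
        # Assumption: a bare weekday points to the latest such day up to CDATE.
        delta = (cdate.weekday() - WEEKDAYS.index(expr)) % 7
        return (cdate - timedelta(days=delta)).isoformat()
    return None  # expression not covered by this sketch
```

With the Charsadda report dated Tuesday, 10 November 2009, `normalise("Tuesday", date(2009, 11, 10))` gives `"2009-11-10"`, a value in the ISO format that TIMEX3 uses for dates.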
• A language-independent algorithm for article geocoding
• Uses person/organization entities and the language variant tag to resolve geo/non-geo
ambiguity (e.g. Clinton as a city, Seul as a city or as an adjective in French)
• Uses admin containment and place size info to resolve geo/geo ambiguity
• Newswire location filtering
A linguistic algorithm for fine-grained event geocoding
• uses grammars for parsing locative prepositional phrases
Geocoding
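The geo/geo disambiguation heuristics (admin containment, then place size) could look roughly like this. The gazetteer structure, the `Place` fields and the use of population as "place size" are assumptions made for illustration, and any gazetteer entries passed in are expected to be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str      # containing administrative unit (country code here)
    population: int   # stand-in for "place size"


def resolve(name, gazetteer, context_countries):
    """Pick one gazetteer entry for an ambiguous place name, or None."""
    candidates = [p for p in gazetteer if p.name == name]
    if not candidates:
        return None
    # Admin containment: prefer candidates inside a country mentioned in the article.
    contained = [p for p in candidates if p.country in context_countries]
    pool = contained or candidates
    # Place size: among the remaining candidates, prefer the largest.
    return max(pool, key=lambda p: p.population)
```

A name shared by several places is resolved to the one inside a country already mentioned in the article, and to the largest candidate when no such context is available.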
                     Dead   Wounded   Kidnapped   Perpetrators
Precision English    91%    91%       100%        69%

                     Dead   Wounded   Kidnapped   Arrested
F1 Portuguese        0.69   0.51      0.67        0.47
F1 Spanish           0.46   -         -           0.13
F1 Italian           0.87   0.62      -           0.67
Conclusion:
• There are errors in the output, therefore manual verification is necessary.
• Some less-reported events can remain undetected.
• Two or more events are sometimes merged into one event description.
• The same event can be presented via several event descriptions (event duplication).
Some evaluation results
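For readers comparing the Precision and F1 figures above, the standard definitions can be written as a small helper. The function name is illustrative, and the counts in the usage note are made up for the example rather than taken from the evaluation.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall and F1 from slot-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With 8 correct slot fills, 2 spurious ones and 4 missed ones, precision is 0.8, recall is about 0.67 and F1 is about 0.73, which shows how a system with high precision can still have a modest F1 when recall is low.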
Event moderation interface
Twitter multimedia link extraction
Exploiting Social Media
Tanev, H. and Zavarella, V. (2014). Multilingual Lexicalisation and Population of Event Ontologies: A Case Study for Social Media. In: Buitelaar, P. and Cimiano, P. (Eds.), Towards the Multilingual Semantic Web. Springer Berlin Heidelberg.
Zavarella, V., Küçük, D., Tanev, H. and Hürriyetoğlu, A. (2014). Event Extraction for Balkan Languages. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 65-68, Gothenburg, Sweden.
Zavarella, V. and Tanev, H. (2013). FSS-TimEx for TempEval-3: Extracting Temporal Information from Text. In: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Volume 2, pp. 58-63, Association for Computational Linguistics, Atlanta, Georgia, USA.
Tanev, H., Ehrmann, M., Piskorski, J. and Zavarella, V. (2012). Enhancing Event Descriptions through Twitter Mining. In: Proceedings of the AAAI International Conference on Weblogs and Social Media 2012, Dublin, Ireland.
Atkinson, M., Piskorski, J., Pouliquen, B., Steinberger, R., Tanev, H. and Zavarella, V. (2008). Online-monitoring of security-related events. In: Proceedings of the 22nd International Conference on Computational Linguistics (CoLing'2008), Manchester, UK, 18-22 August 2008.
References
http://emm.newsbrief.eu/NewsBrief/eventedition/en/latest.html
(text format)
http://medusa.jrc.it/medisys/eventedition/en/rss.html
(text format for the Medical Information System MedISys)
Live Access
