EventExtractionGeomedia

Text Mining for Global Event Monitoring
EMM Team
(speaker: Vanni Zavarella)
European Commission – Joint Research Centre (JRC)
Structure and Dynamics of Media Flows
INA, 7 & 8 July 2016

• Hristo Tanev
• Vanni Zavarella
• Jakub Piskorski
• Martin Atkinson
EMM-NEXUS Team
And more colleagues from the OPTIMA team

• What does the EE application do?
• Custom domain event types
• Multilinguality
• Core extraction engine
• Resource Learning
• Event Time
• Event Location
• Performance
• References
Agenda

[Diagram: EMM pipeline. Labelled elements:]
• RSS, RSS Cache
• Real-time news clustering (typically last 8 hours of RSS per language)
• Cluster, Stories, Summaries
• Entity Recogniser, Cross-lingual Clustering, Event Extractor
• Breaking news, based on cluster growth
• Mailer/SMS
• RSS+ (continuously updated RSS) with <text>, <entity>, <geo>, <quote>, <tonality>, <category>, duplicate=
EMM Pipeline

• For each language, continuously collect media reports from the upstream EMM pipeline
• Every 10 minutes, cluster the latest articles (4-hour window) about the same event or subject
• Clustering is hierarchical and agglomerative, using average group linkage and cosine similarity over simple word count vectors
• Apply core language processing modules and extraction grammars to each article's title and description
• Identify and extract information on the main events of the cluster
• Display the latest events on a map
• Give access to the extracted information and to the full articles
Nexus Pipeline
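
The clustering step above can be illustrated with a minimal sketch. It assumes that average linkage is computed as the mean pairwise cosine similarity between the members of two clusters, and that merging stops below a similarity threshold; the threshold value, the whitespace tokenisation and all function names are illustrative assumptions, not details of the Nexus implementation.

```python
import math
from collections import Counter


def word_counts(text):
    """Simple word-count vector for an article title plus description."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Agglomerative clustering; threshold is an assumed stopping criterion."""
    vectors = [word_counts(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with singletons
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Average linkage: mean similarity over all cross-cluster pairs.
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[x] for j in clusters[y]]
                avg = sum(sims) / len(sims)
                if avg > best:
                    best, pair = avg, (x, y)
        if best < threshold:
            break  # no pair of clusters is similar enough to merge
        x, y = pair
        clusters[x] += clusters.pop(y)
    return clusters


articles = [
    "bomb explodes in pakistani town",
    "bomb explodes in northwestern pakistani town",
    "football cup final tonight",
]
print(cluster(articles))
```

The quadratic search over cluster pairs is acceptable for a sketch over a short time window of articles; a production system would likely need something more efficient.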

Car bomber strikes north Pakistan
ech-chorouk-en Tuesday, November 10, 2009 2:23:00 PM CET
A car bomb has exploded in Pakistani's northwestern town of Charsadda killing at least 10 people....
Bomb explodes in northwestern Pakistani town
yediotaharonot Tuesday, November 10, 2009 1:58:00 PM CET
A bomb exploded in the northwestern Pakistani town of Charsadda on Tuesday causing
an unknown number of casualties, police said. "It was a bomb blast....
10 killed in Pakistan bomb
RTERadio Tuesday, November 10, 2009 1:57:00 PM CET
A bomb has exploded in the north-western Pakistani town of Charsadda, killing 10 people....
TYPE Bombing
PLACE Charsadda, Pakistan
EVENT LOCATION Charsadda, Pakistan
TIME Tuesday, November 10, 2009
DEAD COUNT 10
DEAD DESCRIPTION people
WOUNDED COUNT/DESC
DISPLACED COUNT/DESC
HOMELESS COUNT/DESC
ARRESTED COUNT/DESC
PERPETRATOR
WEAPONS Bomb
Information Aggregation

http://emm.newsbrief.eu/NewsBrief/eventedition/all/latest.html
KML Network Links:
http://labs.emm4u.eu/EventModerator/service?target=event&language=en&format=kml&source=rtn
Public Interfaces

Crisis event
Disaster Security-related Humanitarian crisis
Natural Disaster
Manmade disaster/accident
Arrest
Trial
Kidnapping/Hostage taking
Hostage release
Hostage video
Release
Violent event
Terrorist attack
Shooting
Armed conflict
Execution
Crimes
Assassination
Medical events
Event Type Hierarchy

IT EN FR ES AR RU
Multilinguality
Covers English, Italian, French, Spanish, Russian, Portuguese, Turkish,
Romanian, Bulgarian, Czech and Arabic after Machine Translation.

Lightweight, shallow processing to allow coverage of many languages
Morphological dictionaries
• Static resources, mainly for the grammatical structure of rules
Domain-specific lexica
• (Possibly multiword) expressions subcategorised into semantic classes relevant for the domain
Surface-level extraction patterns
• Often learned (semi-)automatically, e.g. [VICTIM] was heavily wounded
Finite-state grammar rules
• Recognise person groups or other partial patterns, e.g. the actor, three Iraqi soldiers
• Generalise the surface-level extraction patterns, e.g. has been (strongly) wounded
• Ideally language-agnostic
System Resources
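
As a rough illustration of how a surface pattern such as "[VICTIM] was heavily wounded" can be generalised to variants like "has been (strongly) wounded", the sketch below uses plain Python regular expressions. The word lists and the toy person-group expression are assumptions made for the example; the actual system relies on lexica and finite-state grammars rather than hand-written regexes.

```python
import re

# Toy stand-in for a person-group grammar (assumed word lists):
# optional number, optional capitalised modifier, then a person noun.
NUMBER = r"(?:one|two|three|four|five|\d+)"
PERSON_NOUN = r"(?:soldiers?|civilians?|people|policem[ae]n|children)"
PERSON_GROUP = rf"(?:{NUMBER}\s+)?(?:[A-Z][a-z]+\s+)?{PERSON_NOUN}"

# Generalised surface pattern: auxiliary variants plus an optional intensifier.
WOUNDED = re.compile(
    rf"(?P<victim>{PERSON_GROUP})\s+"
    r"(?:was|were|has been|have been)\s+"
    r"(?:(?:heavily|strongly|seriously)\s+)?wounded"
)


def extract_wounded(sentence):
    """Return the text filling the WOUNDED slot, or None if no match."""
    match = WOUNDED.search(sentence)
    return match.group("victim") if match else None


print(extract_wounded(
    "Officials said three Iraqi soldiers were heavily wounded in the blast."
))
```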

• ExPRESS is a blend of JAPE (GATE) and XTDL
(SProUT).
• LHS of the rule is a regular expression over flat
feature structures.
• RHS specifies the output structure.
• Allows variables, labels, functional operators,
grammar cascading, …
• Multiple and nested labels (multiple actions)
• Rule sample:
Finite-State Engine
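
The idea of a rule whose left-hand side is a pattern over flat feature structures and whose right-hand side builds an output structure can be sketched in Python as follows. This is not ExPRESS syntax: the token features, the rule encoding as (label, constraint, optional) triples and the function names are all hypothetical, and the matcher is greedy without backtracking.

```python
def matches(token, constraint):
    """A flat feature structure satisfies a constraint if all features agree."""
    return all(token.get(key) == value for key, value in constraint.items())


def apply_rule(tokens, lhs, rhs):
    """Try the LHS at each position; on success, build the RHS output."""
    for start in range(len(tokens)):
        pos, bindings = start, {}
        for label, constraint, optional in lhs:
            if pos < len(tokens) and matches(tokens[pos], constraint):
                bindings[label] = tokens[pos]  # bind the labelled token
                pos += 1
            elif not optional:
                break  # a mandatory element failed at this start position
        else:
            return rhs(bindings)
    return None


# Hypothetical person-group rule: number, optional nationality, person noun.
PERSON_GROUP_LHS = [
    ("count", {"type": "number"}, False),
    ("origin", {"type": "nationality"}, True),
    ("head", {"type": "person_noun"}, False),
]


def person_group_rhs(bindings):
    """Output structure assembled from the labelled tokens."""
    return {
        "type": "person_group",
        "count": bindings["count"]["value"],
        "description": bindings["head"]["surface"],
    }


tokens = [
    {"surface": "killed", "type": "verb"},
    {"surface": "three", "type": "number", "value": 3},
    {"surface": "Iraqi", "type": "nationality"},
    {"surface": "soldiers", "type": "person_noun"},
]
print(apply_rule(tokens, PERSON_GROUP_LHS, person_group_rhs))
```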

• Learning new patterns for target slots
• Learning semantic classes
• Learning domain-specific words
Resource learning

Pattern Learning
• <DEAD> was shot by <PERPETRATOR>
• police nabbed <ARRESTED>
• <KIDNAPPED> has been taken hostage
• <WOUNDED> was found injured
• raptou <KIDNAPPED>
• <DEAD> foram mortas

Sometimes, it is necessary to learn specific semantic classes, e.g.
disasters, types of chemicals, facilities, professions, etc.
Language-independent system, only needs language-specific stop word lists
Two-step process: feature extraction and weighting (uni/bi-grams), term extraction and ranking
E.g.
• Seeds: toxic, hazardous
• Output (Top):
• hazardous 77.20
• toxic 73.10
• radioactive 18.67
• harmful 13.78
• nuclear 12.18
• dangerous 9.68
• organic 8.63
• chemical 8.56
• poisonous 8.00
• toxic substances 7.94
• highly toxic 7.37
• solid 7.26
• carcinogenic 7.21
• noxious 6.47
• industrial 5.73
• corrosive 5.45
Semantic class learning

Input: a handful of keywords - seeds
Output: a set of keywords which tend to co-occur with seeds, ordered by weight
• TF.IDF-like formula for term weighting:
Weight(term) = TF · IDF^2
TF = Frequency(seeds, term): the number of documents which contain both the term and at least one of the seeds
IDF = log(NumberDocuments / Frequency(term))
E.g.
• Seeds: sustainable development, sustainable energy, clean energy,
environmental, greenhouse gases
• Output:
• environment
• emissions
• climate
• carbon
• differ materially
• impact
• global
• development
• risks and uncertainties
• resources
• efficiency
• water
• future
• projects
• developing
• based
• cost
• economic
• potential
• reducing
• renewable
• quality
• management
• technologies
• efficient
• developed
• sustainability
• industry
• technology
• …
Terminology Learning
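
The weighting formula on this slide can be sketched in a few lines of Python. The sketch assumes that each document has already been reduced to a set of candidate terms, that the exponent applies to the IDF factor only, and that the logarithm is natural (the slide does not state the base); the function name and input format are illustrative.

```python
import math
from collections import Counter


def term_weights(documents, seeds):
    """Rank terms that co-occur with the seeds.

    documents: list of sets of terms (assumed pre-extracted).
    seeds: iterable of seed terms.
    """
    seeds = set(seeds)
    n_docs = len(documents)
    doc_freq = Counter()   # Frequency(term): documents containing the term
    co_freq = Counter()    # TF: documents containing the term and a seed
    for doc in documents:
        has_seed = bool(doc & seeds)
        for term in doc:
            doc_freq[term] += 1
            if has_seed:
                co_freq[term] += 1
    # Weight(term) = TF * IDF^2; seeds are not filtered out of the ranking.
    weights = {
        term: co_freq[term] * math.log(n_docs / doc_freq[term]) ** 2
        for term in co_freq
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


docs = [
    {"toxic", "waste", "spill"},
    {"toxic", "waste"},
    {"football", "match"},
    {"waste", "football"},
]
for term, weight in term_weights(docs, ["toxic"]):
    print(f"{term}\t{weight:.2f}")
```

Squaring the IDF factor penalises terms that are frequent across the whole collection more strongly than plain TF.IDF would.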

[Diagram: Time Extractor processing flow. Labelled stages:]
• TEXT
• Tokenization, Morphological Analysis, Temporal Lexicon Lookup
• RECOGNITION (language-level grammar rules)
• FEATURE STRUCTURES (intermediate annotation)
• INFORMATION GATHERING (compositional rules)
• ATTRIBUTE NORMALIZATION, ANCHORS SELECTION, CALENDAR ARITHMETIC, using the document CDATE
• TIMEX3 OBJECTS
• Implementation labels: ExPRESS grammar, Java code
• A rule-based system featuring finite-state pattern rules
• Very shallow text analysis modules, language-specific recognition rules, language-independent normalization process
• Good Precision for EN (~90%) and successful porting to ES without a significant Precision drop, although Recall was still lagging (~52%), at TempEval-3 (SemEval 2013)
Time Extractor Module
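
The normalization stage resolves relative expressions against the document creation date (CDATE) using calendar arithmetic. The sketch below shows the idea for a handful of English expressions. It assumes that a bare weekday refers to the most recent such day on or before the CDATE, which suits news reports of past events; the covered expressions and the function name are illustrative only.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalise(expression, cdate):
    """Map a relative temporal expression to an ISO date string, or None."""
    text = expression.lower().strip()
    if text == "today":
        return cdate.isoformat()
    if text == "yesterday":
        return (cdate - timedelta(days=1)).isoformat()
    match = re.fullmatch(r"(\d+) days ago", text)
    if match:
        return (cdate - timedelta(days=int(match.group(1)))).isoformat()
    match = re.fullmatch(r"(?:on )?(\w+)", text)
    if match and match.group(1) in WEEKDAYS:
        # Assumption: anchor to the latest matching weekday up to the CDATE.
        back = (cdate.weekday() - WEEKDAYS.index(match.group(1))) % 7
        return (cdate - timedelta(days=back)).isoformat()
    return None  # expression not covered by this sketch


cdate = date(2009, 11, 10)  # publication date of the Charsadda reports
print(normalise("on Tuesday", cdate))
print(normalise("yesterday", cdate))
```

The returned ISO string is the kind of value that a TIMEX3 object would carry in its value attribute.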

A language-independent algorithm for article geocoding
• Uses person/organization entities and the language variant tag to resolve geo/non-geo ambiguity (e.g. Clinton as a city; Seul as a city, or as an adjective in French)
• Uses administrative containment and place size information to resolve geo/geo ambiguity
• Newswire location filtering
A linguistic algorithm for fine-grained event geocoding
• Uses grammars for parsing locative prepositional phrases
Geocoding
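
One way to picture geo/geo disambiguation with administrative containment and place size is the sketch below. The gazetteer entries, the relative size scores and the decision order (containment first, then size) are assumptions made for illustration; they are not taken from the system's gazetteer or algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    size: int  # relative size score, illustrative only (not real figures)


# Hypothetical mini-gazetteer with one ambiguous place name.
GAZETTEER = {
    "Tripoli": [
        Place("Tripoli", "Libya", size=3),
        Place("Tripoli", "Lebanon", size=2),
    ],
}


def resolve(name, context_countries, gazetteer=GAZETTEER):
    """Pick one candidate for an ambiguous place name.

    context_countries: countries already mentioned in the article.
    """
    candidates = gazetteer.get(name, [])
    if not candidates:
        return None
    # Administrative containment: prefer candidates inside a mentioned country.
    contained = [p for p in candidates if p.country in context_countries]
    pool = contained or candidates
    # Place size: among the remaining candidates, prefer the largest.
    return max(pool, key=lambda p: p.size)


print(resolve("Tripoli", {"Lebanon"}))
print(resolve("Tripoli", set()))
```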

Precision    Dead   Wounded   Kidnapped   Perpetrators
English      91%    91%       100%        69%

F1           Dead   Wounded   Kidnapped   Arrested
Portuguese   0.69   0.51      0.67        0.47
Spanish      0.46   -         -           0.13
Italian      0.87   0.62      -           0.67
Conclusion:
• There are errors in the output, therefore manual verification is necessary.
• Some less-reported events can remain undetected.
• Two or more events are sometimes merged into one event description.
• The same event can be presented via several event descriptions (event duplication).
Some evaluation results

Twitter multimedia link extraction

Tanev, H. and Zavarella, V. (2014). Multilingual Lexicalisation and Population of Event Ontologies: A Case Study for Social Media. In: Buitelaar, P. and Cimiano, P. (Eds.), Towards the Multilingual Semantic Web, Springer Berlin Heidelberg.
Zavarella, V., Kucuk, D., Tanev, H. and Hurriyetoglu, A. (2014). Event Extraction for Balkan Languages. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 65-68, Gothenburg, Sweden.
Zavarella, V. and Tanev, H. (2013). FSS-TimEx for TempEval-3: Extracting Temporal Information from Text. In: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Volume 2, Association for Computational Linguistics, pages 58-63, Atlanta, Georgia, USA.
Tanev, H., Ehrmann, M., Piskorski, J. and Zavarella, V. (2012). Enhancing Event Descriptions through Twitter Mining. In: Proceedings of the AAAI International Conference on Weblogs and Social Media 2012, Dublin, Ireland.
Atkinson, M., Piskorski, J., Pouliquen, B., Steinberger, R., Tanev, H. and Zavarella, V. (2008). Online-monitoring of security-related events. In: Proceedings of the 22nd International Conference on Computational Linguistics (CoLing'2008), Manchester, UK, 18-22 August 2008.
References

http://emm.newsbrief.eu/NewsBrief/eventedition/en/latest.html
(text format)
http://medusa.jrc.it/medisys/eventedition/en/rss.html
(text format for the Medical Information System MedISys)
Live Access
