This document provides an overview of text mining for global event monitoring. It describes an application that collects media reports in multiple languages, clusters articles about the same event, applies natural language processing to extract key information, and displays the latest events on a map with access to the full articles and the extracted data. The system is continuously updated, clusters articles at 10-minute intervals, and covers over 10 languages using machine translation. It also discusses techniques for multilingual event extraction, named entity recognition, event geocoding, temporal information extraction, learning new extraction patterns, and exploiting social media.
1. Text Mining for Global Event Monitoring
EMM Team
(speaker: Vanni Zavarella)
European Commission – Joint Research Centre (JRC)
Structure and Dynamics of Media Flows, INA, 7-8 July 2016
2. • Hristo Tanev
• Vanni Zavarella
• Jakub Piskorski
• Martin Atkinson
EMM-NEXUS Team
And more colleagues from the OPTIMA team
3. • What does the EE application do?
• Custom domain event types
• Multilinguality
• Core extraction engine
• Resource Learning
• Event Time
• Event Location
• Performance
• References
Agenda
4. [Diagram. Labelled components: RSS; RSS Cache; Cluster; RealTimenewsclustering (typically the last 8 hours of RSS per language); Stories; Summaries; EntityRecogniser; CrosslingualClustering; EventExtractor; breaking news, based on cluster growth; Mailer/SMS. Output: continuously updated RSS+ carrying <text>, <entity>, <geo>, <quote>, <tonality>, <category> and duplicate=.]
EMM Pipeline
5. • For each language, continuously collect media reports from the upstream EMM pipeline
• Every 10 minutes, cluster the latest articles (4-hour window) about the same event or subject
• Hierarchical agglomerative clustering, using average group linkage and cosine similarity over simple word-count vectors
• Apply core language processing modules and extraction grammars to each article's title and description
• Identify and extract information on the main events of the cluster
• Display the latest events on a map
• Give access to the extracted information and to the full articles
Nexus Pipeline
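The clustering step on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the EMM implementation: the whitespace tokenizer, the stopping threshold of 0.3, the function names, and the sample titles are all hypothetical, and "average group linkage" is read here as the mean pairwise cosine similarity between the members of two clusters.

```python
import math
from collections import Counter


def vectorize(text):
    """Simple word-count vector from whitespace tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_linkage(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster(texts, threshold=0.3):
    """Agglomerative clustering: repeatedly merge the most similar pair of
    clusters until no pair reaches the (hypothetical) threshold."""
    vectors = [vectorize(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_linkage(clusters[x], clusters[y], vectors)
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical article titles, for illustration only.
titles = [
    "bomb explodes in pakistani town",
    "car bomb explodes in pakistani town of charsadda",
    "election results announced in france",
    "france election results",
]
groups = cluster(titles)
```

The pairwise search is quadratic in the number of clusters per merge, which is acceptable for a sketch; a production system running every 10 minutes would presumably use something more efficient.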
6. Car bomber strikes north Pakistan
ech-chorouk-en Tuesday, November 10, 2009 2:23:00 PM CET
A car bomb has exploded in Pakistani's northwestern town of Charsadda killing at least 10 people....
Bomb explodes in northwestern Pakistani town
yediotaharonot Tuesday, November 10, 2009 1:58:00 PM CET
A bomb exploded in the northwestern Pakistani town of Charsadda on Tuesday causing
an unknown number of casualties, police said. "It was a bomb blast....
10 killed in Pakistan bomb
RTERadio Tuesday, November 10, 2009 1:57:00 PM CET
A bomb has exploded in the north-western Pakistani town of Charsadda, killing 10 people....
TYPE Bombing
PLACE Charsadda, Pakistan
EVENT LOCATION Charsadda, Pakistan
TIME Tuesday, November 10, 2009
DEAD COUNT 10
DEAD DESCRIPTION people
WOUNDED COUNT/DESC
DISPLACED COUNT/DESC
HOMELESS COUNT/DESC
ARRESTED COUNT/DESC
PERPETRATOR
WEAPONS Bomb
Information Aggregation
10. IT
EN FR ES
AR RU
Multilinguality
Covers English, Italian, French, Spanish, Russian, Portuguese, Turkish,
Romanian, Bulgarian, Czech and Arabic after Machine Translation.
11. Light-weight, shallow processing to allow coverage of many languages
Morphological dictionaries
• Static resources, mainly for the grammatical structure of rules.
Domain-specific lexica
• (Possibly multiword) expressions subcategorised into semantic classes relevant for the domain.
Surface-level extraction patterns
• Often learned (semi-)automatically, e.g. [VICTIM] was heavily wounded
Finite-state grammar rules
• To recognise person groups or other partial patterns, e.g. the actor, three Iraqi soldiers
• Generalise the surface-level extraction patterns, e.g. has been (strongly) wounded
• Ideally language-agnostic
System Resources
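To make the interplay of these resources concrete, here is a minimal sketch that uses Python regular expressions as a stand-in for the finite-state grammar. It combines a toy person-group recogniser with a generalised version of the surface pattern "[VICTIM] was heavily wounded". The word lists, the `PERSON_GROUP` and `WOUNDED` names, and the `extract_wounded` helper are assumptions made for illustration; the real system uses lexica and grammar rules rather than hard-coded alternations.

```python
import re

# Toy person-group recogniser (assumed word lists): optional determiner or
# number, optional capitalised modifiers, then a head noun.
PERSON_GROUP = (
    r"(?:(?:the|a|an|two|three|four|five|\d+)\s+)?"
    r"(?:[A-Z][a-z]+\s+)*"
    r"(?:soldiers?|civilians?|people|actor|police(?:men)?)"
)

# Generalised surface pattern: auxiliary variants and an optional adverb.
WOUNDED = re.compile(
    rf"\b(?P<victim>{PERSON_GROUP})\s+"
    r"(?:was|were|has been|have been)\s+"
    r"(?:(?:heavily|strongly|seriously)\s+)?wounded"
)


def extract_wounded(sentence):
    """Return the text filling the victim slot, or None if nothing matches."""
    match = WOUNDED.search(sentence)
    return match.group("victim") if match else None


victim = extract_wounded(
    "Officials said three Iraqi soldiers were heavily wounded in the attack."
)
```

Keeping the person-group recogniser separate from the event pattern mirrors the slide's split between partial-pattern rules and surface-level patterns: the same recogniser can fill other slots without being rewritten.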
12. • ExPRESS is a blend of JAPE (GATE) and XTDL (SProUT).
• The LHS of a rule is a regular expression over flat feature structures.
• The RHS specifies the output structure.
• Allows variables, labels, functional operators, grammar cascading, …
• Multiple and nested labels (multiple actions)
• Rule sample:
Finite-State Engine
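The idea of a rule whose left-hand side matches a sequence of flat feature structures, and whose right-hand side builds an output structure from labelled matches, can be illustrated as follows. This is deliberately not ExPRESS syntax: the dictionary representation, the attribute names, and the `apply_rule` helper are hypothetical, and the sketch supports only plain concatenation, without the variables, operators, or cascading that the slide lists.

```python
def token_matches(token, constraints):
    """A flat feature structure matches if every constrained attribute agrees."""
    return all(token.get(key) == value for key, value in constraints.items())


def apply_rule(tokens, lhs, rhs):
    """Slide the LHS (a sequence of (constraints, label) pairs) over the tokens.
    On the first match, pass the labelled tokens to the RHS builder."""
    n = len(lhs)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(token_matches(t, c) for t, (c, _) in zip(window, lhs)):
            bindings = {label: t for t, (_, label) in zip(window, lhs) if label}
            return rhs(bindings)
    return None


# Tokens as flat feature structures (attribute names are assumptions).
tokens = [
    {"surface": "three", "type": "numeral", "value": 3},
    {"surface": "Iraqi", "type": "nationality"},
    {"surface": "soldiers", "type": "person_noun", "number": "plural"},
]

# LHS: numeral, nationality, person noun, each bound to a label.
lhs = [
    ({"type": "numeral"}, "count"),
    ({"type": "nationality"}, "origin"),
    ({"type": "person_noun"}, "head"),
]


def rhs(bound):
    """RHS: assemble the output structure from the labelled tokens."""
    return {
        "type": "person_group",
        "count": bound["count"]["value"],
        "origin": bound["origin"]["surface"],
        "head": bound["head"]["surface"],
    }


result = apply_rule(tokens, lhs, rhs)
```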
13. • Learning new patterns for target slots
• Learning semantic classes
• Learning domain-specific words
Resource learning
14. Pattern Learning
• <DEAD> was shot by <PERPETRATOR>
• police nabbed <ARRESTED>
• <KIDNAPPED> has been taken hostage
• <WOUNDED> was found injured
• raptou <KIDNAPPED>
• <DEAD> foram mortas
15. Sometimes, it is necessary to learn specific semantic classes, e.g.
disasters, types of chemicals, facilities, professions, etc.
Language-independent system, only needs language-specific stop word lists
Two-step process: feature extraction and weighting (uni/bi-grams), term extraction and ranking
E.g.
• Seeds: toxic, hazardous
• Output (Top):
• hazardous 77.20
• toxic 73.10
• radioactive 18.67
• harmful 13.78
• nuclear 12.18
• dangerous 9.68
• organic 8.63
• chemical 8.56
• poisonous 8.00
• toxic substances 7.94
• highly toxic 7.37
• solid 7.26
• carcinogenic 7.21
• noxious 6.47
• industrial 5.73
• corrosive 5.45
Semantic class learning
16. Input: a handful of keywords - seeds
Output: a set of keywords which tend to co-occur with seeds, ordered by weight
• TF.IDF-like formula for term weighting:
Weight(term) = TF × IDF²
TF = Frequency(seeds, term): the number of documents that contain both the term and at least one of the seeds
IDF = log(NumberDocuments / Frequency(term))
E.g.
• Seeds: sustainable development, sustainable energy, clean energy, environmental, greenhouse gases
• Output:
• environment
• emissions
• climate
• carbon
• differ materially
• impact
• global
• development
• risks and uncertainties
• resources
• efficiency
• water
• future
• projects
• developing
• based
• cost
• economic
• potential
• reducing
• renewable
• quality
• management
• technologies
• efficient
• developed
• sustainability
• industry
• technology
• …
Terminology Learning
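The weighting scheme on this slide can be sketched as below. Two assumptions are worth flagging: the slide's weighting formula is read as TF multiplied by the square of IDF (the exponent appears to be a flattened superscript), and terms are restricted to single whitespace-separated words, so multiword seeds such as "sustainable development" are not handled. The `rank_terms` name, the stop-word parameter, and the sample documents are hypothetical.

```python
import math


def rank_terms(documents, seeds, stop_words=frozenset()):
    """Rank terms by TF * IDF**2, where
    TF  = number of documents containing the term and at least one seed,
    IDF = log(number of documents / number of documents containing the term)."""
    docs = [set(d.lower().split()) for d in documents]
    seed_set = {s.lower() for s in seeds}
    n_docs = len(docs)
    doc_freq, co_freq = {}, {}
    for words in docs:
        has_seed = bool(words & seed_set)
        for word in words:
            if word in stop_words:
                continue
            doc_freq[word] = doc_freq.get(word, 0) + 1
            if has_seed:
                co_freq[word] = co_freq.get(word, 0) + 1
    weights = {
        term: tf * math.log(n_docs / doc_freq[term]) ** 2
        for term, tf in co_freq.items()
    }
    # Highest weight first; ties broken alphabetically for a stable order.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "toxic waste spill reported",
    "hazardous waste found near river",
    "river festival opens",
    "football match tonight",
]
ranked = rank_terms(corpus, seeds=["toxic", "hazardous"])
```

On a corpus this small the ranking is dominated by rare words; the squared IDF only becomes a useful filter against common vocabulary when document counts are large.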
17. [Diagram. Processing stages: TEXT; tokenization, morphological analysis, temporal lexicon lookup; RECOGNITION (language-level grammar rules); INFORMATION GATHERING (compositional rules); FEATURE STRUCTURES (intermediate annotation); ATTRIBUTE NORMALIZATION; ANCHORS SELECTION; CALENDAR ARITHMETIC; TIMEX3 OBJECTS. Other labels: EXPRESS GRAMMAR, JAVA CODE, Document CDATE.]
• A rule-based system featuring finite-state pattern rules
• Very shallow text analysis modules, language-specific recognition rules, language-independent normalization process
• Good precision for EN (~90%) and successful porting to ES without a significant precision drop, although recall still lags behind (~52%) at TempEval-3 (SemEval 2013)
Time Extractor Module
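The normalization stage, where a recognised expression is anchored to the document creation date and resolved by calendar arithmetic, might look roughly like this. The sketch assumes a tiny English lexicon and one policy choice that is not stated on the slide: a bare weekday name resolves to the most recent such day on or before the creation date, since news usually reports past events. The `normalize` function and its return format (an ISO 8601 date string, as used for TIMEX3 values) are illustrative.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def normalize(expression, cdate):
    """Resolve a relative temporal expression against the document creation
    date (cdate). Returns an ISO 8601 date string, or None if unsupported."""
    expr = expression.lower().strip()
    if expr in RELATIVE_DAYS:
        return (cdate + timedelta(days=RELATIVE_DAYS[expr])).isoformat()
    if expr.startswith("last ") and expr[5:] in WEEKDAYS:
        # Strictly before cdate: a same-day match goes back a full week.
        back = (cdate.weekday() - WEEKDAYS.index(expr[5:])) % 7 or 7
        return (cdate - timedelta(days=back)).isoformat()
    if expr in WEEKDAYS:
        # Assumed policy: most recent such weekday on or before cdate.
        back = (cdate.weekday() - WEEKDAYS.index(expr)) % 7
        return (cdate - timedelta(days=back)).isoformat()
    return None


# An article created on Tuesday, 10 November 2009 that says "on Tuesday".
resolved = normalize("Tuesday", date(2009, 11, 10))
```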
18. A language-independent algorithm for article geocoding
• Uses person/organization entities and the language variant tag to resolve geo/non-geo ambiguity (Clinton as a city; Seul as a city, or as an adjective in French)
• Uses administrative containment and place size information to resolve geo/geo ambiguity
• Newswire location filtering
A linguistic algorithm for fine-grained event geocoding
• Uses grammars for parsing locative prepositional phrases
Geocoding
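The two disambiguation steps for article geocoding can be sketched with a toy gazetteer. Everything concrete here is an assumption for illustration: the `Place` record, the gazetteer entries, the population figures (rough placeholders, not real statistics), and the simplification of administrative containment to a country code. The language variant tag and the newswire location filter are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str      # containing country, standing in for admin containment
    population: int   # placeholder size value, not a real statistic


# Hypothetical gazetteer; sizes only need to be ordered plausibly.
GAZETTEER = {
    "Paris": [Place("Paris", "FR", 2_000_000), Place("Paris", "US", 25_000)],
    "Clinton": [Place("Clinton", "US", 25_000)],
}


def resolve(mention, person_names, context_countries):
    """Return the chosen Place for a mention, or None if it is not a place.

    geo/non-geo: a mention that is part of a recognised person name is dropped.
    geo/geo: prefer candidates inside a country already seen in the article,
    then fall back to the largest place."""
    if any(mention in name.split() for name in person_names):
        return None
    candidates = GAZETTEER.get(mention, [])
    if not candidates:
        return None
    contained = [p for p in candidates if p.country in context_countries]
    pool = contained or candidates
    return max(pool, key=lambda p: p.population)


not_a_place = resolve("Clinton", {"Hillary Clinton"}, set())
default_paris = resolve("Paris", set(), set())
texan_paris = resolve("Paris", set(), {"US"})
```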
19. Precision and F1 per extracted slot
Precision    Dead   Wounded   Kidnapped   Perpetrators
English      91%    91%       100%        69%
F1           Dead   Wounded   Kidnapped   Arrested
Portuguese   0.69   0.51      0.67        0.47
Spanish      0.46   -         -           0.13
Italian      0.87   0.62      -           0.67
Conclusion:
• There are errors in the output, therefore manual verification is necessary.
• Some less-reported events can remain undetected.
• Two or more events are sometimes merged into one event description.
• The same event can be presented via several event descriptions (event duplication).
Some evaluation results
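For readers less familiar with the metrics in these tables, the standard definitions of precision, recall, and F1 are computed as follows. The counts in the example call are made up for illustration and do not come from the evaluation reported on the slide.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard slot-level metrics from counts of correct, spurious,
    and missed extractions."""
    extracted = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / extracted if extracted else 0.0
    recall = true_pos / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 9 correct slot fills, 1 spurious, 3 missed.
precision, recall, f1 = precision_recall_f1(9, 1, 3)
```

Because F1 is the harmonic mean, a slot with high precision but low recall still scores poorly, which is one reason the slide's conclusion stresses undetected events alongside output errors.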
24. Tanev, H. and Zavarella, V. (2014). Multilingual Lexicalisation and Population of Event Ontologies: A Case Study for Social Media. In: Buitelaar, P. and Cimiano, P. (Eds.), Towards the Multilingual Semantic Web, Springer Berlin Heidelberg.
Zavarella, V., Kucuk, D., Tanev, H. and Hurriyetouglu, A. (2014). Event Extraction for Balkan Languages. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 65-68, Gothenburg, Sweden.
Zavarella, V. and Tanev, H. (2013). FSS-TimEx for TempEval-3: Extracting Temporal Information from Text. In: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Volume 2, Association for Computational Linguistics, pages 58-63, Atlanta, Georgia, USA.
Tanev, H., Ehrman, M., Piskorski, J. and Zavarella, V. (2012). Enhancing Event Descriptions through Twitter Mining. In: Proceedings of the AAAI International Conference on Weblogs and Social Media 2012, Dublin, Ireland.
Atkinson, M., Piskorski, J., Pouliquen, B., Steinberger, R., Tanev, H. and Zavarella, V. (2008). Online-monitoring of security-related events. In: Proceedings of the 22nd International Conference on Computational Linguistics (CoLing'2008), Manchester, UK, 18-22 August 2008.
References