This document describes a project that uses crowdsourcing to extract structured information about events from news articles. The goal is to identify event types like earthquakes or product launches and extract related information like magnitude, location, and time. Annotations are collected through an online interface that clusters related articles and recommends possible event types. An evaluation with 11 annotators showed moderate agreement on event types and roles. Future work will focus on improving the event type recommender and testing the approach in a professional environment.
Crowdsourcing event extraction
1. Crowdsourcing Event Extraction
Aljaž Košmerlj, Jenya Belyaeva, Gregor Leban,
Blaž Fortuna, Marko Grobelnik
Jozef Stefan Institute
2. Goal
Identify and extract features (info-box) about events (e.g. earthquake, product launch…) reported in the news.
Automatically extracting structured information about events from news articles is challenging.
Even when limited to news articles there is little structure in the text
Human annotators can alleviate shortcomings of automatic approaches
Problem: expert annotators are expensive
Solution: use crowdsourcing to lower costs
3. Event type example
„San Bernardino, California was struck by a moderate earthquake on Thursday night, with shaking felt from Los Angeles to Orange County.
A preliminary reading by the U.S. Geological Survey showed a 4.5-magnitude quake struck at 7:49pm.…“
Event type: earthquake
Roles:
• magnitude – What was the magnitude of the earthquake?
• location – Where did the earthquake occur?
• time – At what time did the earthquake occur?
•…
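The event type and its roles above can be read as a small template that an annotator fills in to produce an info-box. The following is a minimal sketch of that idea, not the project's actual data model: the class names `EventTemplate` and `InfoBox` and their methods are hypothetical, and the filled values are taken from the quoted earthquake article.

```python
from dataclasses import dataclass, field


@dataclass
class EventTemplate:
    """An event type with its roles (hypothetical structure)."""
    event_type: str
    # Maps each role name to the question shown to the annotator.
    roles: dict[str, str] = field(default_factory=dict)


@dataclass
class InfoBox:
    """A template filled in for one story (hypothetical structure)."""
    template: EventTemplate
    values: dict[str, str] = field(default_factory=dict)

    def fill(self, role: str, value: str) -> None:
        # Only roles defined by the template may be filled.
        if role not in self.template.roles:
            raise KeyError(f"unknown role for {self.template.event_type}: {role}")
        self.values[role] = value

    def missing_roles(self) -> list[str]:
        # Roles that the annotator has not filled yet.
        return [r for r in self.template.roles if r not in self.values]


earthquake = EventTemplate(
    "earthquake",
    {
        "magnitude": "What was the magnitude of the earthquake?",
        "location": "Where did the earthquake occur?",
        "time": "At what time did the earthquake occur?",
    },
)

box = InfoBox(earthquake)
box.fill("magnitude", "4.5")
box.fill("location", "San Bernardino, California")
box.fill("time", "7:49pm")
```

Keeping the roles in a plain dictionary means a new role can be added to a template without changing any code, which is one way to keep the schema open.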
4. Constraints and considerations
A price of $1–$10 per article is acceptable
The annotation process needs to be guided (semi-automatic) in order to be efficient, reliable and cheap.
We can assume some highly skilled workers (e.g. editors)
The schema of the extracted data has to be open and extensible
5. Event extraction subtasks
1. Identify articles that can be meaningfully structured
2. Identify a set of event types
3. For each event type identify a set of roles (a template)
4. For each new article identify its event type and fill the roles with the entities from the article
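Subtask 4 can be sketched as a consistency check on a single annotation: the event type must be known, every role must belong to that type's template, and every filler must be an entity taken from the article. This is an illustrative sketch only. The `TEMPLATES` store, the `annotate` function, and the candidate list are hypothetical and do not describe the project's back end.

```python
# Hypothetical template store: event type -> role names (subtasks 2 and 3).
TEMPLATES = {"earthquake": ["magnitude", "location", "time"]}


def annotate(event_type, fillers, candidates, templates=TEMPLATES):
    """Build one annotation record (subtask 4), rejecting inconsistent input."""
    if event_type not in templates:
        raise ValueError(f"unknown event type: {event_type!r}")
    roles = templates[event_type]
    filled = {}
    for role, filler in fillers.items():
        if role not in roles:
            raise ValueError(f"role {role!r} is not in the {event_type!r} template")
        if filler not in candidates:
            raise ValueError(f"filler {filler!r} is not an entity from the article")
        filled[role] = filler
    # Roles left empty are reported so the interface could prompt for them.
    unfilled = [role for role in roles if role not in filled]
    return {"event_type": event_type, "roles": filled, "unfilled": unfilled}


# Hand-picked candidate entities from the earthquake example (an assumption).
candidates = ["San Bernardino", "California", "4.5-magnitude", "7:49pm"]
record = annotate(
    "earthquake",
    {"location": "San Bernardino", "time": "7:49pm"},
    candidates,
)
```

Because the templates live in a dictionary passed as an argument, a new event type can be supported by adding an entry, without touching the checking logic.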
6. Annotation interface
We annotate stories, not individual articles. A story is a cluster of articles about the same event.
Sources of clusters: Event Registry, Google clusters…
The articles are sent through the Enrycher* service (POS tagging, named entity extraction…)
Entities proposed for annotation are currently identified using only POS tags (sequences of numerals and nouns)
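The POS-based heuristic can be sketched as collecting maximal runs of consecutive noun and numeral tokens. The sketch below assumes Penn Treebank style tags (`NN*` for nouns, `CD` for numerals) and hand-tagged input. The tag set actually returned by Enrycher is not stated in the slides, and the function name `candidate_entities` is hypothetical.

```python
def candidate_entities(tagged_tokens):
    """Return maximal runs of consecutive noun/numeral tokens as strings.

    Assumes Penn Treebank style tags: NN, NNS, NNP, NNPS for nouns, CD for numerals.
    """
    candidates, run = [], []
    for token, tag in tagged_tokens:
        if tag.startswith("NN") or tag == "CD":
            run.append(token)
        else:
            # Any other tag ends the current run.
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates


# Hand-tagged fragment of the earthquake example (tags are an assumption).
tagged = [
    ("San", "NNP"), ("Bernardino", "NNP"), (",", ","), ("California", "NNP"),
    ("was", "VBD"), ("struck", "VBN"), ("by", "IN"), ("a", "DT"),
    ("moderate", "JJ"), ("earthquake", "NN"), ("on", "IN"),
    ("Thursday", "NNP"), ("night", "NN"),
]

print(candidate_entities(tagged))
# ['San Bernardino', 'California', 'earthquake', 'Thursday night']
```

A heuristic like this favors recall over precision: it also proposes generic nouns such as "earthquake", and the annotator decides which candidates fill a role.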
Online annotation interface
Front end: JavaScript
Back end: Python
* http://enrycher.ijs.si/
11. Future work
Improve recommender
use predicates in features
Testing in a „professional“ environment
improvement in speed?
what is a „correct“ annotation?
Building a taxonomy of event types
active learning