1. SEED: A Framework for Extracting
Social Events from Press News
University Ca’ Foscari – Venice
WWW2013 Rio de Janeiro - May 13th, 2013
Salvatore Orlando
orlando@unive.it
Francesco Pizzolon
pizzolon.francesco@gmail.com
Gabriele Tolomei
gabriele.tolomei@unive.it
2. Overview
• Introduction to the problem
• Background
• SEED
• Experiments
• Results
• Conclusions and future work
4. Events creation
[Figure: events creation process. Elements: News agencies, Portal’s editorial division, events DB, yourportal.com; numbered steps 1 to 5]
1. A news agency composes the press news
2. The press news is sent to the portal’s editorial division by mail
3. A journalist reads and analyzes the long, verbose press news
4. New entertainment events are added to the events DB
5. The journalist publishes the event on the portal’s site
GOAL: automate step 3, helping journalists identify the right events
Events creation process
Intro Background SEED Experiments Results Conclusions
5. Starting from unstructured text we have to extract structured information
Information Extraction
Named Entity Recognition (NER)
Find entities of the classes:
• Date
• Location
• Place
• Artist
Relation Extraction (RE)
Find 3-ary tuples in the form:
• (Date, Location, Artist)
• (Date, Place, Artist)
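The target output above can be pictured as a small typed schema. The following is a minimal sketch only: the names `Entity` and `EventTuple` are hypothetical and do not come from the SEED implementation.

```python
from dataclasses import dataclass

# The four entity classes named on the slide.
ENTITY_CLASSES = {"Date", "Location", "Place", "Artist"}


@dataclass(frozen=True)
class Entity:
    cls: str   # one of ENTITY_CLASSES
    text: str  # surface form found in the press news

    def __post_init__(self):
        if self.cls not in ENTITY_CLASSES:
            raise ValueError(f"unknown entity class: {self.cls}")


@dataclass(frozen=True)
class EventTuple:
    """A 3-ary tuple: (Date, Location, Artist) or (Date, Place, Artist)."""
    date: Entity
    where: Entity  # either a Location or a Place
    artist: Entity

    def __post_init__(self):
        if self.date.cls != "Date" or self.artist.cls != "Artist":
            raise ValueError("expected a Date first and an Artist last")
        if self.where.cls not in ("Location", "Place"):
            raise ValueError("middle element must be a Location or a Place")


event = EventTuple(
    Entity("Date", "Martedì 24 Luglio"),
    Entity("Place", "Roma Vintage"),
    Entity("Artist", "Anna Calvi"),
)
```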
6. Il 2011 e' stato il suo anno. L'omonimo album di debutto l’ha resa celebre in ogni dove coronandola
"la nuova musa made in UK".
Un grande successo di pubblico e critica ottenuto grazie alla vincente combinazione di bravura, classe
e passione che Anna Calvi riesce ad esprimere con la sua musica e attraverso i live show. Anna Calvi e'
una grande artista, una fuoriclasse.
Gia' indaffarata per i prossimi show estivi che la vedranno ospite di numerosi ed importanti festival,
Anna Calvi fara' tappa in Italia con prevendite attive da Lunedì 14 Maggio sui circuiti vivaticket.it,
ticketone.it.
Martedì 24 Luglio
Roma – Parco di San Sebastiano
Roma Vintage
Via di Porta San Sebastiano 2 (P.le Numa Pompilio), 00187 Roma
Biglietto: 15,00 euro + d.p.
L'album di debutto si sviluppa sulla straordinaria chitarra di Anna e sulla sua potente e ammaliante
voce; e' un album indimenticabile e appassionante. Influenzata dalle vocalita' di artisti diversi come
Nina Simone, Maria Callas e Scott Walker, dalle chitarre di Django Rheinhard e Robert Johnson, dal
classico romanticismo di Ravel e Debussy, Anna Calvi anche se ispirata da musicisti di un lontano
passato, ha un sound totalmente attuale ma soprattutto originale. Complici lo sguardo ipnotico e una
bellezza sensuale, Anna Calvi ha conquistato le copertine ed intere pagine delle migliori riviste e
magazine francesi, tedeschi ed Italiani.
Benvenuti nel magico mondo di Anna Calvi – un luogo dove bellezza e oscurita' complottano e si
scontrano tra loro, dove indomite emozioni conquistano e consumano.
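A sample press news (in Italian, the language of the corpus): it announces an Anna Calvi concert on Tuesday 24 July at Roma Vintage, Parco di San Sebastiano, Rome, with presales open from Monday 14 May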
7. Named Entity Recognition (NER)
Knowledge-based
Requires:
• a dictionary for every entity class
PROs:
• fast
• high precision score
• no labeled corpus needed
CONs:
• dictionaries need updates
• creating new dictionaries requires effort
Rule-based
Requires:
• a set of rules
• policies to apply the rules
PROs:
• no labeled corpus needed
CONs:
• hand-crafting rules is tedious
Statistical
Requires:
• a large corpus with labeled examples
• a model for text decomposition
• algorithms to train and deploy the model
PROs:
• domain insensitive
CONs:
• large corpora for new domains are unavailable
8. Relation Extraction (RE)
Supervised
Requires:
• a set of features to train a classifier
• a labeled corpus
PROs:
• can be used with any relation
CONs:
• difficult to extend
• requires preprocessing the input
• extension to higher-order relations is difficult
Semi-supervised
Requires:
• a given relation
• a seed set
PROs:
• high precision
• no need for labeled data
Dipre
• hard pattern matching
Snowball
• relies on a NER tagger
• soft pattern matching
TextRunner
• self-supervised learner
• single-pass extractor
• redundancy-based assessor
• relies on a dependency parser to self-annotate training data
• no relationship given
10. Named Entity Recognition Approach
GOALS
Find entities of the classes Date, Location, Place and Artist in unstructured text
VS
ISSUES
Closed domain, no labeled corpus, press news are in Italian
SOLUTIONS
• Date: predefined forms → rule-based methods
• Location: present in Wikipedia → knowledge-based approach
• Place: present in company’s database → knowledge-based approach
• Artist: present in Wikipedia → knowledge-based approach
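The mix of rule-based and knowledge-based tagging listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the SEED code: the date pattern covers only the "weekday day month" form seen in the sample, the dictionaries are tiny hand-written stand-ins for the Wikipedia and company-database lists, and `tag_entities` is a hypothetical name.

```python
import re

# Rule-based part: Italian dates such as "Martedì 24 Luglio" (assumed form).
WEEKDAYS = "lunedì|martedì|mercoledì|giovedì|venerdì|sabato|domenica"
MONTHS = ("gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|"
          "settembre|ottobre|novembre|dicembre")
DATE_RE = re.compile(
    rf"\b(?:(?:{WEEKDAYS})\s*)?\d{{1,2}}\s+(?:{MONTHS})\b", re.IGNORECASE
)


def tag_entities(text, dictionaries):
    """Return (class, surface form) pairs in order of appearance.

    `dictionaries` maps a class name to a set of known names
    (knowledge-based part).
    """
    spans = [(m.start(), m.end(), "Date") for m in DATE_RE.finditer(text)]
    for label, names in dictionaries.items():
        for name in names:
            pattern = rf"\b{re.escape(name)}\b"
            spans += [(m.start(), m.end(), label)
                      for m in re.finditer(pattern, text)]
    # Prefer longer matches, so "Roma Vintage" (Place) wins over the
    # "Roma" (Location) it contains.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in spans:
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append((start, end, label))
    return [(label, text[start:end]) for start, end, label in sorted(kept)]


# Toy dictionaries standing in for the real knowledge bases.
dictionaries = {
    "Location": {"Roma"},
    "Place": {"Roma Vintage"},
    "Artist": {"Anna Calvi"},
}
sample = "Martedì 24 Luglio\nRoma – Parco di San Sebastiano\nRoma Vintage\nAnna Calvi"
entities = tag_entities(sample, dictionaries)
```

Resolving overlaps by longest match is a design choice made for this sketch; the slides do not say how SEED handles a Location name nested inside a Place name.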
12. Il 2011 e' stato il suo anno. L'omonimo album di debutto l’ha resa celebre in ogni dove coronandola "la
nuova musa made in UK".
Un grande successo di pubblico e critica ottenuto grazie alla vincente combinazione di bravura, classe e
passione che [art Anna Calvi] riesce ad esprimere con la sua musica e attraverso i live show. [art Anna
Calvi] e' una grande artista, una fuoriclasse.
Gia' indaffarata per i prossimi show estivi che la vedranno ospite di numerosi ed importanti festival, [art
Anna Calvi] fara' tappa in Italia con prevendite attive da [date Lunedì 14 Maggio] sui circuiti vivaticket.it,
ticketone.it.
[date Martedì 24 Luglio]
[loc Roma] – Parco di San Sebastiano
[place Roma Vintage]
Via di Porta San Sebastiano 2 (P.le Numa Pompilio), 00187 [loc Roma]
Biglietto: 15,00 euro + d.p.
L'album di debutto si sviluppa sulla straordinaria chitarra di Anna e sulla sua potente e ammaliante voce;
e' un album indimenticabile e appassionante. Influenzata dalle vocalita' di artisti diversi come [art Nina
Simone], [art Maria Callas] e [art Scott Walker], dalle chitarre di [art Django Rheinhard] e [art Robert
Johnson], dal classico romanticismo di [art Ravel] e [art Debussy], [art Anna Calvi] anche se ispirata da
musicisti di un lontano passato, ha un sound totalmente attuale ma soprattutto originale. Complici lo
sguardo ipnotico e una bellezza sensuale, [art Anna Calvi] ha conquistato le copertine ed intere pagine
delle migliori riviste e magazine francesi, tedeschi ed Italiani.
Benvenuti nel magico mondo di [art Anna Calvi] – un luogo dove bellezza e oscurità complottano e si
scontrano tra loro, dove indomite emozioni conquistano e consumano.
The sample press news after the NER phase
13. Relation Extraction Approach
GOALS
Find two predefined relations between the extracted entities:
• (Date, Location, Artist)
• (Date, Place, Artist)
VS
ISSUES
Events within press news span more than a single sentence, but state-of-the-art methods work at the sentence level
HINT
Documents about entertainment events are often abundant on the Social Web
14. SOLUTION
Use an external source of fresh social knowledge (blogs, social networks) to infer the right entertainment events, in particular to disambiguate candidates in the Relation Extraction task
16. Which fresh social knowledge?
• Encyclopedic one? Too static: events are inserted after they happen!
• Social networks? Data is not structured for our purpose
• … and what about search engines (SEs)? Well, they return documents that are related and relevant to a given query… Let’s try!
17. Scoring tuples against the SE result list
Example query tuple: (Martedì 24 luglio, Roma, Anna Calvi)
Scoring principles
• product of the frequency counts of the tuple elements
• more importance to title matches than to snippet matches
• more importance to top results
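One way to read the three scoring principles is sketched below. The slides do not give the exact formula, so everything here is an assumption made for illustration: the function names, the 2:1 title/snippet weights, the 1/rank discount, and the choice to multiply counts per field. Search results are modelled as plain (title, snippet) pairs rather than calls to a real search engine API.

```python
from math import prod


def frequency(term, text):
    """Case-insensitive count of `term` in `text`."""
    return text.lower().count(term.lower())


def field_score(candidate, text):
    # Product of frequency counts: zero unless every element of the
    # tuple appears in the field at least once.
    return prod(frequency(term, text) for term in candidate)


def score_tuple(candidate, results, title_weight=2.0, snippet_weight=1.0,
                rank_weight=lambda rank: 1.0 / rank):
    """Score a candidate tuple against a ranked list of (title, snippet)."""
    total = 0.0
    for rank, (title, snippet) in enumerate(results, start=1):
        match = (title_weight * field_score(candidate, title)
                 + snippet_weight * field_score(candidate, snippet))
        total += rank_weight(rank) * match  # top results count more
    return total


# Hand-written stand-in for a search engine result list.
results = [
    ("Anna Calvi a Roma, Martedì 24 luglio",
     "Anna Calvi live a Roma Vintage martedì 24 luglio"),
    ("Concerti estate", "Tutti i festival dell'estate"),
]
good = score_tuple(("Martedì 24 luglio", "Roma", "Anna Calvi"), results)
bad = score_tuple(("Lunedì 14 maggio", "Roma", "Anna Calvi"), results)
```

With these toy results the tuple built on the presale date scores zero, because that date never co-occurs with the artist and the city in a title or snippet.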
18. NER step
Date: Lunedì 14 Maggio, Martedì 24 Luglio
Location: Roma
Place: Roma Vintage
Artist: Nina Simone, Maria Callas, Anna Calvi, Scott Walker, Django Rheinhard, Debussy, Ravel
RE step
Candidate Extraction
(Lunedì 14 maggio, Roma, Anna Calvi),
(Lunedì 14 maggio, Roma Vintage, Anna Calvi),
…
(Lunedì 14 maggio, Roma Vintage, Ravel),
(Martedì 24 luglio, Roma, Anna Calvi),
(Martedì 24 luglio, Roma Vintage, Anna Calvi),
…
(Martedì 24 luglio, Roma Vintage, Ravel)
Candidate Ranking
(Martedì 24 luglio, Roma, Anna Calvi),
(Martedì 24 luglio, Roma Vintage, Anna Calvi)
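Candidate extraction as shown on this slide amounts to a Cartesian product over the entity lists, followed by a sort on some score. A minimal sketch follows; `candidate_tuples`, `rank_candidates`, and the toy scorer are hypothetical names, and the real ranking would use the search-engine-based score rather than the hard-coded rule used here.

```python
from itertools import product


def candidate_tuples(dates, locations, places, artists):
    """All (Date, Location-or-Place, Artist) combinations."""
    wheres = list(locations) + list(places)
    return list(product(dates, wheres, artists))


def rank_candidates(candidates, score):
    """Sort candidates by descending score (stable for ties)."""
    return sorted(candidates, key=score, reverse=True)


# Entity lists from the NER step on the slide.
dates = ["Lunedì 14 Maggio", "Martedì 24 Luglio"]
locations = ["Roma"]
places = ["Roma Vintage"]
artists = ["Nina Simone", "Maria Callas", "Anna Calvi", "Scott Walker",
           "Django Rheinhard", "Debussy", "Ravel"]

candidates = candidate_tuples(dates, locations, places, artists)


def toy_score(candidate):
    # Stand-in for the search-engine score: it simply favours the
    # tuples that the slide reports as top ranked.
    date, _, artist = candidate
    return 1.0 if date.startswith("Martedì") and artist == "Anna Calvi" else 0.0


top_two = rank_candidates(candidates, toy_score)[:2]
```

The candidate set grows multiplicatively with the number of entities per class (2 × 2 × 7 = 28 here), which is why a ranking step is needed to pick the actual event.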
19. DATASET
One hundred press news items, provided by the company and manually labeled by a member of the editorial office
Evaluation of a class in the NER phase
Precision: # correctly labeled entities / # labeled entities
Recall: # correctly labeled entities / # true (manually) labeled entities
F-measure: harmonic mean of Precision and Recall
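The three measures can be computed from a set of predicted labels and a set of gold (manually assigned) labels. The sketch below is a generic illustration with a hypothetical function name and made-up labels; it returns 0.0 when a denominator is empty, which is a convention chosen here, not one stated in the slides.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-measure over two collections of labels."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: 4 entities labeled by the system, 3 in the gold standard,
# 2 in common.
predicted = {"Anna Calvi", "Ravel", "Debussy", "Nina Simone"}
gold = {"Anna Calvi", "Ravel", "Scott Walker"}
p, r, f = precision_recall_f(predicted, gold)
```

The same function applies unchanged to the RE phase if the sets contain relation tuples instead of entities.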
20. Evaluation of the RE phase
Precision: # correctly labeled relations / # labeled relations
Recall: # correctly labeled relations / # true (manually) labeled relations
F-measure: harmonic mean of Precision and Recall
Baselines
Baseline1: if an artist, a place and a date are named in the same sentence, then a tuple containing them is returned.
Baseline2: the artist, the place and the date that are named more often than the others form the tuple that is returned.
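The two baselines can be sketched over NER output grouped by sentence. The input format (a list of sentences, each a list of (class, text) mentions) and the function names are assumptions made for this illustration. Only the Place class is used for the "where" slot, following the wording of the slide.

```python
from collections import Counter
from itertools import product

CLASSES = ("Date", "Place", "Artist")


def baseline1(sentences):
    """Tuples whose date, place and artist co-occur in one sentence."""
    tuples = set()
    for mentions in sentences:
        by_class = {}
        for cls, text in mentions:
            by_class.setdefault(cls, set()).add(text)
        tuples.update(product(*(by_class.get(cls, ()) for cls in CLASSES)))
    return tuples


def baseline2(sentences):
    """Tuple of the most frequently named date, place and artist."""
    counts = {}
    for mentions in sentences:
        for cls, text in mentions:
            counts.setdefault(cls, Counter())[text] += 1
    if not all(cls in counts for cls in CLASSES):
        return None  # some class never appears in the document
    return tuple(counts[cls].most_common(1)[0][0] for cls in CLASSES)


# Toy document: the event details are spread over several sentences.
sentences = [
    [("Artist", "Anna Calvi")],
    [("Artist", "Anna Calvi"), ("Date", "Lunedì 14 Maggio")],
    [("Date", "Martedì 24 Luglio")],
    [("Place", "Roma Vintage"), ("Date", "Martedì 24 Luglio")],
    [("Artist", "Nina Simone"), ("Artist", "Anna Calvi")],
]
same_sentence = baseline1(sentences)
most_frequent = baseline2(sentences)
```

On this toy input Baseline1 finds nothing, since no single sentence names all three classes, which is the sentence-level limitation the deck points out.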
SEED
Linear SEED: same importance given to SERP elements
Non-Linear SEED: more importance given to top-K SERP elements
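The difference between the two variants can be illustrated with two rank-weighting schemes applied to per-result match scores. The actual weighting function used by SEED is not given in the slides; the step function below, with its hypothetical parameters `k` and `boost`, is only one possible choice.

```python
def linear_weights(n):
    """Linear variant: every SERP element counts the same."""
    return [1.0] * n


def top_k_weights(n, k=3, boost=2.0):
    """Non-linear variant (assumed shape): the first k results get
    `boost`, the remaining ones get 1.0."""
    return [boost if rank < k else 1.0 for rank in range(n)]


def weighted_score(per_result_scores, weights):
    """Combine per-result match scores with rank weights."""
    return sum(w * s for w, s in zip(weights, per_result_scores))


# Two candidates with the same number of matching results,
# but at different positions in the result list.
matches_at_top = [1, 0, 0, 0, 1]
matches_at_bottom = [0, 0, 0, 1, 1]

linear_top = weighted_score(matches_at_top, linear_weights(5))
linear_bottom = weighted_score(matches_at_bottom, linear_weights(5))
nonlinear_top = weighted_score(matches_at_top, top_k_weights(5))
nonlinear_bottom = weighted_score(matches_at_bottom, top_k_weights(5))
```

Under the linear scheme the two candidates tie; under the non-linear one the candidate matched by the first result ranks higher.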
21. Named Entity Recognition Evaluation
Total F-measure around 81%
22. Relation Extraction Evaluation
F-measure around 70.2%
LINEAR: giving same importance to results
F-measure around 70.5%
NON-LINEAR: giving importance to top results
23. What we did so far
• Introduced a novel RE technique that extracts our predefined relations by exploiting the Social Web, for a real-world application
• Developed a framework called SEED implementing our strategy
• Evaluated SEED together with two baselines
Future work
• Improving the NER phase
• Evaluating RE when an optimal NER is used, and vice versa
• Exploiting other sources of social knowledge