Grouping business news stories based on salience of named entities
1. 1
Grouping business news stories
based on
salience of named entities
Llorenç Escoter, Lidia Pivovarova, Mian Du, Anisia Katiskaya
and Roman Yangarber
EACL 2017, Valencia, Spain
2. 2
Task definition
● PULS – an online news monitoring system for the
business domain (puls.cs.helsinki.fi)
● 4000-6000 news articles daily
● Some stories appear multiple times
● Our task: cluster articles into a set of stories:
– to minimize redundancy
– to identify trending stories
4. 4
The event grouping task differs from
topical text clustering:
- fine-grained
- named entities are crucial
- group size distribution is skewed
5. 5
Dataset
● Popular clustering data sets target much coarser
categorization tasks
● The dataset and the interface are publicly available
● Dataset based on the PULS business corpus
– A “typical” day, ~4000 documents
– Manually annotated using a command-line interface that displays documents pairwise
– Initialization: two documents that do not mention the same name cannot be grouped together
– Decision propagation: only one member of a group needs to be shown against the other groups
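The annotation procedure above can be sketched in a few lines. This is only an illustration under stated assumptions, not the PULS annotation tool: the function `annotate`, the `ask` callback (standing in for the human annotator), and the toy documents are all hypothetical.

```python
def annotate(doc_names, ask):
    """Group documents from pairwise "same story?" judgements.

    doc_names: dict mapping a document id to the set of names it mentions.
    ask: callback (doc_a, doc_b) -> bool, the annotator's decision (hypothetical).
    Returns the groups and the number of pairs shown to the annotator.
    """
    groups = []      # the first member of each group acts as its representative
    questions = 0
    for doc, names in doc_names.items():
        for group in groups:
            # Initialization: skip the group if the document shares no name
            # with one of its members; such pairs cannot be grouped.
            if any(not (names & doc_names[member]) for member in group):
                continue
            # Decision propagation: show the representative only.
            questions += 1
            if ask(doc, group[0]):
                group.append(doc)
                break
        else:
            groups.append([doc])   # no group accepted the document
    return groups, questions


# Toy data: four documents and a gold labelling that simulates the annotator.
doc_names = {
    1: {"Nokia", "Microsoft"},
    2: {"Nokia"},
    3: {"Apple"},
    4: {"Nokia", "Microsoft"},
}
gold = {1: "a", 2: "b", 3: "c", 4: "a"}
groups, questions = annotate(doc_names, lambda a, b: gold[a] == gold[b])
print(groups, questions)
```

In this toy run only 2 of the 6 possible pairs are shown to the annotator.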
9. 9
Dataset
[Figure: pairwise annotation matrix for documents 1 to 9, with judged pairs marked + or -]
11. 11
Named Entity Recognition
● Part of the PULS news monitoring system:
– extracts named entities
– assigns type: company, person, location, etc.
– computes salience
12. 12
Salience
● Our definition of salience relies on the general
nature of news articles:
– Authors typically mention the main event in the title;
then, the main information is elaborated in the first few
sentences, followed by further detail and background
13. 13
Salience
● The most important NEs are mentioned early in
the text and then repeated.
● Less important NEs are mentioned in the later
paragraphs and are less frequent.
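A minimal sketch of a salience score in this spirit follows. The formula is an assumption made for illustration (each mention contributes 1 / (1 + sentence index), so early and repeated mentions score higher); the slides do not give the formula PULS actually uses, and the function name and toy mentions are hypothetical.

```python
from collections import defaultdict


def salience(mentions):
    """mentions: list of (entity, sentence_index) pairs; the title is sentence 0.

    Assumed scoring: every mention adds 1 / (1 + sentence_index), so the score
    grows with frequency and with prominence (early position).
    Returns scores normalised to sum to 1.
    """
    scores = defaultdict(float)
    for entity, sentence_index in mentions:
        scores[entity] += 1.0 / (1 + sentence_index)
    if not scores:
        return {}
    total = sum(scores.values())
    return {entity: score / total for entity, score in scores.items()}


# Toy article: "Nokia" is in the title and repeated, "Reuters" appears once, late.
mentions = [("Nokia", 0), ("Nokia", 1), ("Microsoft", 1), ("Nokia", 4), ("Reuters", 9)]
scores = salience(mentions)
print(sorted(scores, key=scores.get, reverse=True))
```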
16. 16
Clustering method
● Features:
– Word-based
I. TF-IDF
II. “standard” CBOW embeddings, built on Google News (Mikolov et al. 2013)
III. CBOW embeddings, built on our corpus of business news (4.5M documents)
● Hierarchical clustering of document vectors
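As an illustration of the word-based pipeline, here is a small pure-Python sketch of TF-IDF document vectors clustered hierarchically with a cosine distance threshold. The linkage criterion (average linkage), the IDF variant, the threshold value, and the toy documents are assumptions; the slides do not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists -> list of sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{tok: count * (math.log(n / df[tok]) + 1.0)
             for tok, count in Counter(doc).items()} for doc in docs]


def cosine_distance(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def cluster(vectors, theta):
    """Agglomerative clustering; merging stops once the closest pair of
    clusters is at average cosine distance >= theta."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):  # average linkage (assumed)
        return sum(cosine_distance(vectors[i], vectors[j])
                   for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        dist, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if dist >= theta:
            break
        merged = clusters.pop(j)      # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


docs = [
    "nokia sells phone unit to microsoft".split(),
    "microsoft buys nokia phone unit".split(),
    "oil prices fall again".split(),
]
result = cluster(tfidf_vectors(docs), theta=0.6)
print(result)
```

The quadratic search for the closest pair is fine for a toy example; a corpus of thousands of documents per day would call for an optimised library implementation.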
20. 20
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
I. Concatenation
- a document vector consists of both word-based and NE-based features
● Hierarchical clustering of document vectors
21. 21
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
I. Concatenation
II. Combination using the AND function
- both the word distance and the NE distance must be sufficiently small
● Hierarchical clustering of document vectors
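The difference between the two combination strategies can be sketched as follows. The function names, the thresholds, and the toy vectors are hypothetical, and the boolean `can_link` test is one simple reading of the AND function, not necessarily the exact formulation used in the system.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def concat_distance(word_a, ne_a, word_b, ne_b):
    """Concatenation: one distance over the joined word and NE features."""
    return cosine_distance(word_a + ne_a, word_b + ne_b)


def can_link(word_a, ne_a, word_b, ne_b, theta_word, theta_ne):
    """AND combination: both distances must be under their own threshold."""
    return (cosine_distance(word_a, word_b) < theta_word
            and cosine_distance(ne_a, ne_b) < theta_ne)


# Two stories with identical wording about two different companies.
word_a, ne_a = [1.0, 1.0, 0.0], [1.0, 0.0]
word_b, ne_b = [1.0, 1.0, 0.0], [0.0, 1.0]

d_concat = concat_distance(word_a, ne_a, word_b, ne_b)
linked = can_link(word_a, ne_a, word_b, ne_b, theta_word=0.5, theta_ne=0.5)
print(round(d_concat, 3), linked)
```

With a threshold of 0.5, concatenation would merge the two stories because the shared wording dominates the joined vector, while the AND test keeps them apart because their named entities do not overlap.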
22. 22
Evaluation
● Measures
– V-measure (combination of completeness and
homogeneity)
– Rand Index (ratio of correctly classified pairs)
● Adjustment against naïve strategy (doing nothing):
– naïve strategy scores: V-measure 0.96, RI 0.99
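The measures above can be sketched in pure Python. Two points are assumptions, since the slide does not spell them out: "doing nothing" is read as leaving every document in its own singleton group, and the adjustment is taken to be (score - naive) / (1 - naive). The toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def rand_index(gold, pred):
    """Share of document pairs on which gold and predicted grouping agree."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((gold[i] == gold[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given)."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marginal = Counter(given)
    return -sum(c / n * math.log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(gold, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_gold, h_pred = entropy(gold), entropy(pred)
    homogeneity = 1.0 if h_gold == 0 else 1 - conditional_entropy(gold, pred) / h_gold
    completeness = 1.0 if h_pred == 0 else 1 - conditional_entropy(pred, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def adjust(score, naive_score):
    """Assumed adjustment: 0 at the naive baseline, 1 for a perfect score."""
    return (score - naive_score) / (1 - naive_score)


# Skewed toy data: one two-document story, four singleton stories.
gold = [0, 0, 1, 2, 3, 4]
naive = list(range(len(gold)))   # assumed naive strategy: all singletons
ri_naive = rand_index(gold, naive)
v_naive = v_measure(gold, naive)
print(round(ri_naive, 3), round(v_naive, 3))
```

Even on this tiny example the naive strategy scores above 0.9 on both measures, which illustrates why raw scores are uninformative when the group size distribution is skewed.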
23. 23
Individual features
Rand Index and V-measure adjusted for naïve strategy.
θ – cosine distance threshold for hierarchical clustering
25. 25
Combination using AND function
Rand Index adjusted for naïve strategy
(improvement from 0.25 to 0.4)
V-measure adjusted for naïve strategy
(improvement from 0.37 to 0.49)
26. 26
Conclusions
● automatically extracted named entities are better features than keywords for event-based clustering
● salience of NEs, a weighting scheme that combines the frequency and prominence of the NE, is the best document representation
● corpus-specific word embeddings alone give lower performance than embeddings built on a bigger corpus
● combining NE salience with domain-specific embeddings yields the best performance
● the AND-function combination strategy works better than concatenation
27. 27
Thanks for your attention!
data and code: puls.cs.helsinki.fi/grouping
contacts: first_name.last_name@cs.helsinki.fi