Grouping business news stories based on salience of named entities

1
Grouping business news stories
based on
salience of named entities
Llorenç Escoter, Lidia Pivovarova, Mian Du, Anisia Katiskaya
and Roman Yangarber
EACL 2017, Valencia, Spain

2
Task definition
● PULS – an online news monitoring system for the
business domain (puls.cs.helsinki.fi)
● 4000-6000 news articles daily
● Some stories appear multiple times
● Our task: cluster articles into a set of stories:
– to minimize redundancy
– to identify trending stories

4
Event grouping task is different from
topical text clustering:
- fine-grained
- named entities are crucial
- group size distribution is skewed

5
Dataset
● Popular clustering data sets target much coarser
categorization tasks
● The dataset and the interface are publicly available
● Dataset based on PULS business
corpus
– “Typical” day, ~4000 documents
– Manually annotated using a
command-line interface, which
displays documents pairwise
– Initialization: if two documents do
not mention any name in common, they
cannot be grouped together
– Decision propagation: only one
member of a group needs to be
shown against each other group
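
The two annotation shortcuts above can be illustrated with a minimal sketch. It is not the PULS annotation interface: the function name `annotate`, the input format, and the `same_story` callback (standing in for the human annotator) are all hypothetical.

```python
def annotate(doc_names, same_story):
    """Group documents by pairwise questions, using the two shortcuts.

    doc_names:  dict doc_id -> set of names mentioned (hypothetical input format)
    same_story: callable(doc_a, doc_b) -> bool, standing in for the annotator
    Returns (groups, number_of_questions_asked).
    """
    groups = []   # each group is a list of doc ids; group[0] is its representative
    asked = 0
    for doc in sorted(doc_names):
        placed = False
        for group in groups:
            rep = group[0]
            # Initialization: no shared name, so the pair cannot be grouped.
            if not (doc_names[doc] & doc_names[rep]):
                continue
            # Decision propagation: only the representative is shown.
            asked += 1
            if same_story(doc, rep):
                group.append(doc)
                placed = True
                break
        if not placed:
            groups.append([doc])
    return groups, asked


docs = {1: {"Nokia"}, 2: {"Nokia", "Microsoft"}, 3: {"Apple"}, 4: {"Microsoft", "Nokia"}}
truth = {1: "a", 2: "a", 3: "b", 4: "c"}   # toy gold stories
groups, asked = annotate(docs, lambda a, b: truth[a] == truth[b])
```

In this toy run only two questions reach the annotator, compared with six if every pair of the four documents were shown.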

6
Dataset
[Figure, built up over slides 6 to 9: a 9×9 pairwise annotation matrix over documents 1 to 9, whose cells are filled in step by step with "+" and "-" judgements.]

11
Named Entity Recognition
● Part of the PULS news monitoring system:
– extracts named entities
– assigns type: company, person, location, etc.
– computes salience

12
Salience
● Our definition of salience relies on the general
nature of news articles:
– Authors typically mention the main event in the title;
then, the main information is elaborated in the first few
sentences, followed by further detail and background

13
Salience
● The most important NEs are mentioned early in
the text and then repeated.
● Less important NEs are mentioned in the later
paragraphs and are less frequent.
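
The slides do not give the PULS salience formula, so the following is only a toy sketch of the idea that early and repeated mentions should count more. The decay function `1 / (1 + sentence_index)` and the name `salience_scores` are assumptions made for illustration.

```python
from collections import defaultdict


def salience_scores(mentions):
    """Toy salience: each mention adds 1 / (1 + sentence_index).

    mentions: list of (entity, sentence_index) pairs, with the title as sentence 0.
    The decay function is an illustrative assumption, not the PULS formula.
    """
    scores = defaultdict(float)
    for entity, sentence_index in mentions:
        scores[entity] += 1.0 / (1 + sentence_index)
    total = sum(scores.values())
    # Normalise so that the scores within one document sum to 1.
    return {entity: score / total for entity, score in scores.items()}


mentions = [("Nokia", 0), ("Nokia", 1), ("Microsoft", 1), ("Nokia", 3), ("Espoo", 7)]
scores = salience_scores(mentions)
```

Here "Nokia" (in the title and repeated) outranks "Microsoft" (one early mention), which in turn outranks "Espoo" (one late mention).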

15
Clustering method
● Features:
– Word-based
– NE-based
● Hierarchical clustering of document vectors using
cosine similarity
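
A minimal sketch of threshold-based hierarchical clustering over cosine distance follows. The slides do not state the linkage criterion; average linkage is assumed here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, theta):
    """Merge the two closest clusters until no pair is closer than theta."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average linkage: mean pairwise distance between members (an assumption).
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > theta:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]


docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clusters = agglomerative(docs, theta=0.3)
```

With a low threshold most documents stay in singleton clusters, which fits the skewed group size distribution mentioned earlier.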

16
Clustering method
● Features:
– Word-based
I. TF-IDF
II. “standard” CBOW embeddings, built on Google News
(Mikolov et al. 2013)
III. CBOW embeddings, built on our (4.5 M documents)
corpus of business news
● Hierarchical clustering of document vectors
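
As a rough illustration of the word-based representations, the sketch below builds TF-IDF vectors and averages word embeddings into a document vector. The slides do not say how embeddings are combined into document vectors, so averaging is an assumption, and the two-dimensional embeddings are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def mean_embedding(tokens, embeddings):
    """Average the vectors of known tokens (averaging is an assumption here)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return []
    return [sum(dim) / len(known) for dim in zip(*known)]


docs = [["nokia", "sells", "unit"], ["nokia", "buys", "firm"]]
vectors = tfidf_vectors(docs)
toy_embeddings = {"nokia": [1.0, 0.0], "sells": [0.0, 1.0]}   # made-up 2-d vectors
doc_vector = mean_embedding(docs[0], toy_embeddings)
```

A term that occurs in every document, such as "nokia" here, receives a TF-IDF weight of zero.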

18
Clustering method
● Features:
– Word-based
– NE-based
I. Counts
II. TF-IDF
III. Salience

19
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:

20
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
I. Concatenation
- a document vector consists of both
word-based and NE-based features
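
Concatenation can be sketched as joining the two feature vectors end to end. Normalising each block first, so that neither dominates the cosine distance, is an assumption not stated on the slides; the values are toy data.

```python
import math


def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def concatenate(word_vec, ne_vec):
    """Normalise each block, then join them (the normalisation is an assumption)."""
    return l2_normalise(word_vec) + l2_normalise(ne_vec)


word_vec = [3.0, 4.0]        # e.g. an averaged word embedding (toy values)
ne_vec = [0.0, 2.0, 0.0]     # e.g. salience of three named entities (toy values)
doc_vec = concatenate(word_vec, ne_vec)
```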

21
Clustering method
● Features:
– Word-based
– NE-based
● Combining features:
I. Concatenation
II. Combination using
AND function
- both the word distance
and the NE distance
should be sufficiently
small
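
One way to read the AND function is as a merge test with two thresholds. The sketch below assumes separate thresholds `theta_word` and `theta_ne` and a hypothetical document layout with one vector per feature type; the slides do not spell out these details.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def close_enough(doc_a, doc_b, theta_word, theta_ne):
    """AND combination: a pair qualifies only if both distances pass their threshold."""
    word_ok = cosine_distance(doc_a["word"], doc_b["word"]) <= theta_word
    ne_ok = cosine_distance(doc_a["ne"], doc_b["ne"]) <= theta_ne
    return word_ok and ne_ok


# Toy documents: a and b share wording and entities; c shares wording only.
a = {"word": [1.0, 0.2], "ne": [1.0, 0.0]}
b = {"word": [0.9, 0.3], "ne": [0.8, 0.1]}
c = {"word": [1.0, 0.25], "ne": [0.0, 1.0]}

same_ab = close_enough(a, b, theta_word=0.2, theta_ne=0.2)
same_ac = close_enough(a, c, theta_word=0.2, theta_ne=0.2)
```

Unlike concatenation, where a very small word distance can compensate for a large NE distance, this test rejects the pair (a, c) outright because the entities differ.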

22
Evaluation
● Measures
– V-measure (combination of completeness and
homogeneity)
– Rand Index (ratio of correctly classified pairs)
● Adjustment against naïve strategy (doing nothing):
– naïve strategy scores: V-measure 0.96, RI 0.99
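
The sketch below shows why the naïve strategy scores so high on the Rand Index when most documents are singletons. "Doing nothing" is interpreted here as leaving every document in its own group, and the rescaling in `adjust` is an assumed formula; the slides do not state how the adjustment is computed.

```python
from itertools import combinations


def rand_index(gold, predicted):
    """Share of document pairs on which two labelings agree (same vs. different group)."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum(
        (gold[i] == gold[j]) == (predicted[i] == predicted[j]) for i, j in pairs
    )
    return agree / len(pairs)


def adjust(score, naive_score):
    """Rescale so the naive strategy maps to 0 and a perfect score to 1 (assumed formula)."""
    return (score - naive_score) / (1.0 - naive_score)


gold = [0, 0, 0, 1, 2, 3]          # toy labels: one group of three, three singletons
naive = list(range(len(gold)))     # "doing nothing": every document on its own
system = [0, 0, 0, 1, 2, 2]        # finds the real group, adds one wrong merge

naive_ri = rand_index(gold, naive)
system_ri = rand_index(gold, system)
adjusted = adjust(system_ri, naive_ri)
```

Even in this six-document example the naïve strategy reaches a Rand Index of 0.8, so raw scores leave little room to separate systems.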

23
Individual features
Rand Index and V-measure adjusted for naïve strategy.
θ – cosine distance threshold for hierarchical clustering

25
Combination using AND function
Rand Index adjusted for naïve strategy
(improvement from 0.25 to 0.4)
V-measure adjusted for naïve strategy
(improvement from 0.37 to 0.49)

26
Conclusions
● automatically extracted named entities are better
features than keywords for event-based clustering
● salience of NEs (a weighting scheme that
combines frequency and prominence of the NE) is
the best document representation
● corpus-specific word embeddings alone give lower
performance than embeddings built using a bigger
corpus
● combining NE salience with domain-specific
embeddings yields the best performance
● AND-function combination strategy works better
than concatenation

27
Thanks for your attention!
data and code: puls.cs.helsinki.fi/grouping
contacts: first_name.last_name@cs.helsinki.fi
