1. The document analyzes news reporting patterns from over 800,000 news articles from 62 sources over approximately 1 year.
2. It finds that major news agencies such as Reuters are often the first to report on new events, but that they are not the only ones: regional outlets such as All Africa and France24 also break a large share of stories, and trusted regional sources rank highly when reporting is measured by bursts.
3. It also measures how long sources take to report on an event after the initial break (median lags of roughly 21 to 28 hours among the sources listed) and examines possible reasons for these delays.
2. Two Opinions
There exists a group of news sources X such that at least one of the
following is true:
“All news comes from X and the rest just comment on it”
“A story receives wide public attention only if X reports on it”
4. Remainder of this Talk
1 Method: how we did it
2 Results: what we found
7. The Setting
Given a list of sources, crawl periodically all new articles and find events
Event
“An event is a particular thing that happens at a specific time and place”
We have known how to do this for 15+ years (TDT)
A Study on Retrospective and On-Line Event Detection
Yiming Yang, Tom Pierce, Jaime Carbonell
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3702, USA
www.cs.cmu.edu/~yiming
SIGIR'98, Melbourne, Australia. © 1998 ACM 1-58113-015-5 8/98 $5.00.
Abstract
This paper investigates the use and extension of text retrieval and clustering techniques for event detection. The task is to automatically detect novel events from a temporally-ordered stream of news stories, either retrospectively or as the stories arrive. We applied hierarchical and non-hierarchical document clustering algorithms to a corpus of 15,836 stories, focusing on the exploitation of both content and temporal information. We found the resulting cluster hierarchies highly informative for retrospective detection of previously unidentified events, effectively supporting both query-free and query-driven retrieval. We also found that temporal distribution patterns of document clusters provide useful information for improvement in both retrospective detection and on-line detection of novel events. In an evaluation using manually labelled events to judge the system-detected events, we obtained a result of 82% in the F1 measure for retrospective detection, and a F1 value of 42% for on-line detection.
1 Introduction
The rapidly-growing amount of electronically available information threatens to overwhelm human attention, raising new challenges for information retrieval technology. Although traditional query-driven retrieval is useful for content-focused queries, it is deficient for generic queries such as “What happened?” or “What's new?”. Browsing without guidance or a conceptual structure of the search space is useful only in minuscule information spaces.
Consider a person who returns from an extended vacation and needs to find out quickly what happened in the world during her absence. Reading the entire news collection is a daunting task, and generating specific queries about unknown facts is rather unrealistic. Thus, intelligent assistance from the computer is clearly desirable. Such assistance could take the form of a content summary of a corpus for a quick review, the temporal evolution of past events of interest, or a listing of automatically detected new events which demonstrate a significant content shift from any previously known events. It would also be useful to have structured guidelines for navigation through document clusters. Table 1 shows a sample summary of a corpus obtained by applying our hierarchical content-based clustering algorithm to a few thousand news stories (CNN news and Reuters articles from January to February in 1995) and presenting each cluster using a few statistically significant key terms. As the table shows, domestic politics reigns supreme as usual, the OJ trial still receives media attention, etc. However, the table also reveals that disasters struck Kobe, Japan and Malibu, California, and Chechnia has flared up again, events which were not present the month before. The key terms provide content information, and the story counts imply significance, as measured by media attention. If further detail is desired, the sub-clusters can be examined via query-driven retrieval, browsing individual documents or synthetic summaries across documents [2]. The utility of such computer assistance is evident even though some clusters may be imperfect and the current user interface is rudimentary.
This paper reports our work in event detection, a new research topic initiated by the Topic Detection and Tracking (TDT) project (1). The objective is to identify stories in several continuous news streams that pertain to new or previously unidentified events. To be more precise, detection consists of two tasks: retrospective detection and on-line detection. The former entails the discovery of previously unidentified events in an accumulated collection, and the latter strives to identify the onset of new events from live news feeds in real-time.
(1) The TDT project is supported by the U.S. Government, consisting of segmentation of stories in a continuous news-stream, temporal event tracking and event detection. Our event tracking work will be reported in a separate paper.
Table 1. Corpus summary using keywords of automatically generated clusters of news stories
Size*  Top-ranking words (stemmed)
330    republ clinton congress hous amend
217    simpson o prosecut trial jury
98     israel palestin gaza peac arafat
97     japan kobe earthquak quak toky
93     russian chech chechny grozn yeltsin
56     somal u mogadishu iraq marin
55     flood rain californ malibu rive
48     serb bosnian bosnia croat u
35     game leagu play basebal season
33     crash airlin flight airport passeng
28     clinic sav abort massachuset norfolk
27     shuttl spac astronaut mir discov
26     patient drug virus holtz infect
24     chin beij deng trad copyright
...
* Size means the number of documents included.
8. The Particularities
We wanted to be global:
Source               Region        Source          Region         Source                Region
ABC News             US            EHealthNews     Europe         North Africa Journal  North Africa
Al Jazeera           Arabic World  EUbusiness      Europe         Novaya Gazeta         Russia
All Africa           Africa        EUobserver      Europe         Novinite              Bulgaria
ANSA                 Italy         EurActiv        Europe         NPR                   US
Antara News          Indonesia     Euronews        Europe         NY Post               US
AOL news             Global        EuropeanAgenda  Europe         NY Times              US
AP                   Global        EuroTopics      Europe         Reuters               Global
BBC                  UK            Fox News        US             RFERL                 Asia, Middle East
Boston Globe         US            France24        France         RIAN                  Russia
Budapest Business J  Hungary       FT              Global         The Australian        Australia
Businessweek         Global        Helsinki Times  Finland        The Globe and Mail    Canada
CBS News             US            Kyodo News      Japan          The Guardian          UK
China News Service   China         The WSJ         US             The Herald (Glasgow)  Scotland
Chosun               South Korea   Irish Examiner  Ireland        The Star              Malaysia
CNN                  US            Le Monde dipl   France         The Sun               UK
Cyprus Mail          Cyprus        Mercopress      Latin America  The Telegraph         UK
Daily Mail           UK            Moscow News     Russia         Times of India        India
Daily Mirror         UK            MSNBC           Global         Times of Malta        Malta
Der Spiegel          Germany       New Europe      Europe         Voice of America      US
DW-World             Germany       New Scientist   Global
9. The Particularities
Danger of falling into one of two extremes:
- get drowned by local events
- consider only the most salient events
11. Our Solution
Two-stage approach:
1 Scaffold given by main-segments of primary sources
2 Fill in all the remaining articles
Main-segment
Sentences containing the first 100 words
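The main-segment definition above can be sketched as follows. This is only an illustration: the function name `main_segment`, the `word_limit` parameter, and the naive punctuation-based sentence splitter are assumptions, not the talk's implementation.

```python
import re
from typing import List


def main_segment(text: str, word_limit: int = 100) -> List[str]:
    """Return the leading sentences that together contain the first `word_limit` words."""
    # Naive sentence splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    selected: List[str] = []
    words_seen = 0
    for sentence in sentences:
        if words_seen >= word_limit:
            break  # the first `word_limit` words are already covered
        selected.append(sentence)
        words_seen += len(sentence.split())
    return selected
```

Because whole sentences are kept, the segment may run slightly past 100 words: the sentence containing the 100th word is included in full.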
12. The Particularities
Primary Sources:
(same source table as shown above under “We wanted to be global”)
13. Our Solution
1 Get new articles
2 Scaffold:
   1 S = main-segments from primary sources
   2 Update existing clusters with S
   3 Consider as events those clusters that have ≥ 3 articles from ≥ 2 sources
3 Fill-in:
   1 Consider all articles of the last 48 hours and try to match them to one of the existing events
4 Archive old events (average time ≥ 3 days)
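The thresholds in these steps can be sketched as three small predicates. The class and function names are hypothetical, and reading "average time ≥ 3 days" as "the mean age of an event's articles is at least 3 days" is an assumption; the clustering and matching steps themselves are not shown.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

MIN_ARTICLES = 3                      # ">= 3 articles"
MIN_SOURCES = 2                       # "from >= 2 sources"
FILL_IN_WINDOW = timedelta(hours=48)  # "articles of the last 48 hours"
ARCHIVE_AGE = timedelta(days=3)       # "average time >= 3 days" (interpretation assumed)


@dataclass
class Article:
    source: str
    published: datetime


@dataclass
class Cluster:
    articles: List[Article] = field(default_factory=list)


def is_event(cluster: Cluster) -> bool:
    """A cluster counts as an event once it has enough articles from enough sources."""
    sources = {a.source for a in cluster.articles}
    return len(cluster.articles) >= MIN_ARTICLES and len(sources) >= MIN_SOURCES


def fill_in_candidates(articles: List[Article], now: datetime) -> List[Article]:
    """Articles recent enough to be matched against existing events."""
    return [a for a in articles if now - a.published <= FILL_IN_WINDOW]


def should_archive(cluster: Cluster, now: datetime) -> bool:
    """Archive when the mean article age reaches the threshold (assumed reading)."""
    if not cluster.articles:
        return False
    ages = [(now - a.published).total_seconds() for a in cluster.articles]
    return timedelta(seconds=sum(ages) / len(ages)) >= ARCHIVE_AGE
```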
15. Algorithms used
Scaffolding: Star-EM [1]
Results are a bit better than traditional TDT results (0.8 micro F1 measure).
Fill-in: tried several alternatives, used one that dynamically classifies segments:
Scoring function
Given an article $d = s_1 \dots s_n$, find (meta-)segments $T_1, \dots, T_k$ such that $d = T_1 \dots T_k$, $T_i = s_j \dots s_{j+\ell}$ for some $j$ and $\ell \ge 0$, and
$$\frac{1}{k} \sum_{i=1}^{k} \max_{e \in E \cup \{u\}} \mathrm{sim}(T_i, e) - \beta \times k$$
is maximised.
[1] M. Gallé, J.-M. Renders. “Full and mini-batch clustering of news articles with Star-EM”. ECIR 2012.
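One way to maximise a score of this shape is dynamic programming over (number of segments, end position), then picking the best number of segments. The sketch below is a hedged illustration, not the authors' implementation: `jaccard` is a stand-in for sim, events are represented by plain strings, and the label u is assumed to be an "unknown" fallback with a fixed score `unknown_score`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start index, end index exclusive, event label)


def jaccard(a: str, b: str) -> float:
    """Hypothetical similarity: word-set overlap, standing in for sim."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def segment_article(
    sentences: Sequence[str],
    events: Dict[str, str],
    sim: Callable[[str, str], float] = jaccard,
    beta: float = 0.05,
    unknown_score: float = 0.1,
) -> Tuple[float, List[Segment]]:
    """Split an article into contiguous segments maximising mean best-sim minus beta * k."""
    n = len(sentences)
    if n == 0:
        return 0.0, []

    def best_label(i: int, j: int) -> Tuple[str, float]:
        text = " ".join(sentences[i:j])
        label, score = "u", unknown_score  # "u": assumed unknown-event fallback
        for event_id, event_text in events.items():
            s = sim(text, event_text)
            if s > score:
                label, score = event_id, s
        return label, score

    labels = {(i, j): best_label(i, j) for i in range(n) for j in range(i + 1, n + 1)}

    # total[k][j]: best summed similarity when sentences[:j] form exactly k segments.
    neg_inf = float("-inf")
    total = [[neg_inf] * (n + 1) for _ in range(n + 1)]
    back = [[0] * (n + 1) for _ in range(n + 1)]
    total[0][0] = 0.0
    for k in range(1, n + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cand = total[k - 1][i] + labels[(i, j)][1]
                if cand > total[k][j]:
                    total[k][j], back[k][j] = cand, i

    best_k = max(range(1, n + 1), key=lambda k: total[k][n] / k - beta * k)
    score = total[best_k][n] / best_k - beta * best_k

    # Walk the back-pointers to recover the segment boundaries.
    segments: List[Segment] = []
    j = n
    for k in range(best_k, 0, -1):
        i = back[k][j]
        segments.append((i, j, labels[(i, j)][0]))
        j = i
    segments.reverse()
    return score, segments
```

The penalty term `beta * k` discourages splitting an article into many tiny segments; with `beta = 0` the mean term alone decides the split.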
16. Some numbers
sources: 62
time: ≈ 1 year
batch size: 1 hour
articles crawled: ≈ 820 000
articles assigned to events: 541 546
events: 10 752
17. Remainder of this Talk
1 Method: how we did it
2 Results: what we found
18. Two Opinions
There exists a group of news sources X such that at least one of the
following is true:
“All news comes from X and the rest just comment on it”
“A story receives wide public attention only if X reports on it”
19. First reports
Percentage of all events that source x was the first to report:
source percentage
Reuters 12.96%
All Africa 11.68%
France24 10.41%
The Globe and Mail 5.47%
BBC 4.91%
CNN 4.89%
Businessweek 3.01%
RIAN 2.67%
Daily Mirror 2.59%
CBS News 2.32%
Daily Mail 2.21%
The Telegraph 2.21%
NY Post 2.19%
Kyodo 2.07%
NY Times 2.06%
Fox News 1.78%
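A percentage of this kind can be computed as sketched below, assuming each event is available as a list of (source, timestamp) pairs. The function name and data layout are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Report = Tuple[str, float]  # (source, publication timestamp)


def first_report_share(events: Iterable[List[Report]]) -> Dict[str, float]:
    """Percentage of events for which each source published the earliest article."""
    firsts: Counter = Counter()
    total = 0
    for reports in events:
        if not reports:
            continue
        source, _ = min(reports, key=lambda r: r[1])  # earliest article wins
        firsts[source] += 1
        total += 1
    return {source: 100.0 * count / total for source, count in firsts.items()}
```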
20. Relative first reports
North Africa Journal 4.61 %
AP 1.73 %
France24 1.61 %
Helsinki Times 1.45 %
ANSA 1.38 %
NY Times 1.35 %
Reuters 1.27 %
Chosun 1.22 %
CBS News 1.17 %
The Herald 1.14 %
MSNBC 1.07 %
RFERL 1.06 %
The Australian 1.03 %
New Europe 0.95 %
Le Monde dipl 0.95 %
Al Jazeera 0.95 %
Der Spiegel 0.94 %
21. Two Opinions
There exists a group of news sources X such that at least one of the
following is true:
“All news comes from X and the rest just comment on it”
“A story receives wide public attention only if X reports on it”
23. How to detect bursts
“detect features that occur with high density over a limited time period”
Bursty and Hierarchical Structure in Streams ∗
Jon Kleinberg †
Abstract
A fundamental problem in text data mining is to extract meaningful structure
from document streams that arrive continuously over time. E-mail and news articles
are two natural examples of such streams, each characterized by topics that appear,
grow in intensity for a period of time, and then fade away. The published literature
in a particular research field can be seen to exhibit similar phenomena over a much
longer time scale. Underlying much of the text mining work in this area is the following
intuitive premise — that the appearance of a topic in a document stream is signaled
by a “burst of activity,” with certain features rising sharply in frequency as the topic
emerges.
The goal of the present work is to develop a formal approach for modeling such
“bursts,” in such a way that they can be robustly and efficiently identified, and can
provide an organizational framework for analyzing the underlying content. The
approach is based on modeling the stream using an infinite-state automaton, in which
bursts appear naturally as state transitions; it can be viewed as drawing an analogy
with models from queueing theory for bursty network traffic. The resulting algorithms
are highly efficient, and yield a nested representation of the set of bursts that imposes
a hierarchical structure on the overall stream. Experiments with e-mail and research
paper archives suggest that the resulting structures have a natural meaning in terms
of the content that gave rise to them.
∗ This work appears in the Proceedings of the 8th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2002.
† Department of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinber@cs.cornell.edu.
Supported in part by a David and Lucile Packard Foundation Fellowship, an ONR Young Investigator Award,
NSF ITR/IM Grant IIS-0081334, and NSF Faculty Early Career Development Award CCR-9701399.
Black box: input an arrival series (with timestamps) and get, for each event, zero or more periods where it was bursty (plus a score)
A bit more technical: model this with a two-state HMM whose states are normal and bursty emissions
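The black box can be sketched as a simplified two-state automaton in the spirit of Kleinberg's model: gaps between arrivals are scored under a normal rate and a higher bursty rate, entering the bursty state costs a fixed penalty, and the cheapest state sequence is found with Viterbi. The parameter names `s` and `gamma`, their defaults, and the scoring of a burst as the cost saved are assumptions of this sketch, not details confirmed by the talk.

```python
import math
from typing import List, Sequence, Tuple

Burst = Tuple[float, float, float]  # (start time, end time, score)


def detect_bursts(timestamps: Sequence[float], s: float = 2.0, gamma: float = 1.0) -> List[Burst]:
    """Two-state burst detection over inter-arrival gaps (simplified sketch)."""
    times = sorted(timestamps)
    gaps = [b - a for a, b in zip(times, times[1:])]
    n = len(gaps)
    if n == 0 or times[-1] == times[0]:
        return []
    base_rate = n / (times[-1] - times[0])
    rates = (base_rate, s * base_rate)  # state 0: normal, state 1: bursty

    def emit_cost(state: int, gap: float) -> float:
        # Negative log-density of an exponential gap with the state's rate.
        return rates[state] * gap - math.log(rates[state])

    up_cost = gamma * math.log(n)  # price of switching normal -> bursty
    cost = [0.0, float("inf")]     # the automaton starts in the normal state
    back: List[Tuple[int, int]] = []
    for gap in gaps:
        new_cost, ptr = [], []
        for state in (0, 1):
            from_normal = cost[0] + (up_cost if state == 1 else 0.0)
            from_bursty = cost[1]  # switching down is free
            prev = 0 if from_normal <= from_bursty else 1
            new_cost.append(min(from_normal, from_bursty) + emit_cost(state, gap))
            ptr.append(prev)
        cost = new_cost
        back.append((ptr[0], ptr[1]))

    # Backtrack the cheapest state sequence.
    state = 0 if cost[0] <= cost[1] else 1
    states = [0] * n
    for i in range(n - 1, -1, -1):
        states[i] = state
        state = back[i][state]

    # Merge consecutive bursty gaps into bursts; score = cost saved vs. the normal state.
    bursts: List[Burst] = []
    i = 0
    while i < n:
        if states[i] == 1:
            j, weight = i, 0.0
            while j < n and states[j] == 1:
                weight += emit_cost(0, gaps[j]) - emit_cost(1, gaps[j])
                j += 1
            bursts.append((times[i], times[j], weight))
            i = j
        else:
            i += 1
    return bursts
```

A larger `gamma` makes the detector more reluctant to declare a burst; a larger `s` demands a sharper rise in arrival rate.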
25. Bursty sources
1 Compute burst(s) for each event
2 Take for each burst the first article
3 Add the burst's score to the total score of that article's source
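These three steps can be sketched as below, assuming each event comes with its articles as (source, timestamp) pairs and its bursts as (start, end, score) triples. The function names are hypothetical, and rescaling so that the top source scores 100 is an assumption about how the totals are presented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Article = Tuple[str, float]       # (source, timestamp)
Burst = Tuple[float, float, float]  # (start, end, score)


def burst_scores_by_source(events: Iterable[Tuple[List[Article], List[Burst]]]) -> Dict[str, float]:
    """Credit each burst's score to the source of the first article inside the burst."""
    totals: Dict[str, float] = defaultdict(float)
    for articles, bursts in events:
        ordered = sorted(articles, key=lambda a: a[1])
        for start, end, score in bursts:
            first = next((src for src, ts in ordered if start <= ts <= end), None)
            if first is not None:
                totals[first] += score
    return dict(totals)


def normalise_to_100(totals: Dict[str, float]) -> Dict[str, float]:
    """Rescale so the highest-scoring source gets 100 (assumed presentation)."""
    top = max(totals.values(), default=0.0)
    if top <= 0:
        return {source: 0.0 for source in totals}
    return {source: 100.0 * value / top for source, value in totals.items()}
```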
26. Bursty sources
source total burst score
Reuters 100.0
The Globe and Mail 83.9
CNN 72.7
Al Jazeera 58.0
France24 53.1
RIAN 47.0
The Star 45.6
CBS News 43.8
MSNBC 42.4
NPR 38.6
The Sun 37.5
DW 34.7
The Guardian 32.1
BBC 30.9
Businessweek 26.8
All Africa 22.1
AP 21.6
27. Bursty sources: explanations
1 News agencies: it's their job
2 Regional news sources which are trusted (RIAN, Al Jazeera)
3 Good “journalistic nose”
28. Lag to report
1 For each source, consider the events it reported on but was not the first to report
2 Take the time delta to the event's first article
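The lag computation can be sketched as follows, assuming timestamps in seconds and each event given as a list of (source, timestamp) pairs. Measuring the lag from a source's earliest article on the event is an assumption; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

Report = Tuple[str, float]  # (source, timestamp in seconds)


def median_lag_hours(events: Iterable[List[Report]]) -> Dict[str, float]:
    """Median delay, in hours, between an event's first article and each later source."""
    lags: Dict[str, List[float]] = defaultdict(list)
    for reports in events:
        if not reports:
            continue
        # Earliest article per source for this event.
        first_seen: Dict[str, float] = {}
        for source, ts in reports:
            if source not in first_seen or ts < first_seen[source]:
                first_seen[source] = ts
        first_source, first_ts = min(first_seen.items(), key=lambda kv: kv[1])
        for source, ts in first_seen.items():
            if source != first_source:  # the breaking source has no lag
                lags[source].append((ts - first_ts) / 3600.0)
    return {source: median(values) for source, values in lags.items()}
```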
29. Lag to report
source hours (median)
France24 20.85
Reuters 20.87
BBC 21.41
Antara News 21.78
All Africa 22.69
Kyodo 23.47
Fox News 24.74
Al Jazeera 24.86
ANSA 24.95
CNN 24.96
RIAN 25.38
RFERL 26.29
The Telegraph 26.53
Daily Mirror 26.70
Euronews 27.40
The Globe and Mail 27.45
NPR 27.94
30. Lag to report: Explanations
Online-first policy: you can't argue (in most cases) that this is due to nighttime
Intrinsic delay: waiting for confirmation and assessing news (?)
Lingering conservatism (?)
Everybody chooses their own priorities
36. Conclusions
Data-driven analysis to study which news outlets break news
Some important fine-tuning deviations from classical TDT to scale up and capture many global events
(In our data) It is not true that only Big Agencies break news
Regarding hotness, many trusted regional outlets rank at the top
Lag to report remains big, but may hide in-house priorities / strategies.