Two Opinions
There exists a group of news sources X such that at least one of the
following is true:
“All news comes from X, and the rest just comment on it”
“A story receives wide public attention only if X reports on it”
Remainder of this Talk
1 Method: how we did it
2 Results: what we found
The Setting
Given a list of sources, periodically crawl all new articles and find events
Event
“An event is a particular thing that happens at a specific time and place”
We have known how to do this for 15+ years (TDT: Topic Detection and Tracking)
Y. Yang, T. Pierce, J. Carbonell. “A Study on Retrospective and On-Line Event Detection.” School of Computer Science, Carnegie Mellon University. SIGIR'98, Melbourne, Australia.
Abstract: This paper investigates the use and extension of text retrieval and clustering techniques for event detection. The task is to automatically detect novel events from a temporally-ordered stream of news stories, either retrospectively or as the stories arrive. We applied hierarchical and non-hierarchical document clustering algorithms to a corpus of 15,836 stories, focusing on the exploitation of both content and temporal information. We found the resulting cluster hierarchies highly informative for retrospective detection of previously unidentified events, effectively supporting both query-free and query-driven retrieval. We also found that temporal distribution patterns of document clusters provide useful information for improvement in both retrospective detection and on-line detection of novel events. In an evaluation using manually labelled events to judge the system-detected events, we obtained a result of 82% in the F1 measure for retrospective detection, and a F1 value of 42% for on-line detection.
The Particularities
We wanted to be global:
Source               Region        Source          Region         Source                Region
ABC News             US            EHealthNews     Europe         North Africa Journal  North Africa
Al Jazeera           Arabic World  EUbusiness      Europe         Novaya Gazeta         Russia
All Africa           Africa        EUobserver      Europe         Novinite              Bulgaria
ANSA                 Italy         EurActiv        Europe         NPR                   US
Antara News          Indonesia     Euronews        Europe         NY Post               US
AOL news             Global        EuropeanAgenda  Europe         NY Times              US
AP                   Global        EuroTopics      Europe         Reuters               Global
BBC                  UK            Fox News        US             RFERL                 Asia, M East
Boston Globe         US            France24        France         RIAN                  Russia
Budapest Business J  Hungary       FT              Global         The Australian        Australia
Businessweek         Global        Helsinki Times  Finland        The Globe and Mail    Canada
CBS News             US            Kyodo News      Japan          The Guardian          UK
China News Service   China         The WSJ         US             The Herald (Glasgow)  Scotland
Chosun               South Korea   Irish Examiner  Ireland        The Star              Malaysia
CNN                  US            Le Monde dipl   France         The Sun               UK
Cyprus Mail          Cyprus        Mercopress      Latin America  The Telegraph         UK
Daily Mail           UK            Moscow News     Russia         Times of India        India
Daily Mirror         UK            MSNBC           Global         Times of Malta        Malta
Der Spiegel          Germany       New Europe      Europe         Voice of America      US
DW-World             Germany       New Scientist   Global
The Particularities
Danger of falling into one of two extremes:
getting drowned by local events, or considering only the most salient events
Our Solution
Two-stage approach:
1 Scaffold given by main-segments of primary sources
2 Fill in all the remaining articles
Main-segment
The sentences containing the first 100 words of an article
The Particularities
Primary Sources: (the slide repeats the source table shown above)
Our Solution
1 Get new articles
2 Scaffold:
   2.1 S = main-segments from primary sources
   2.2 Update existing clusters with S
   2.3 Consider as events those clusters that have ≥ 3 articles from ≥ 2 sources
3 Fill-in:
   3.1 Consider all articles of the last 48 hours and try to match them to one of the existing events
4 Archive old events (average time ≥ 3 days)
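The two thresholds in this pipeline (when a cluster counts as an event, and when an event is archived) can be sketched as follows. This is a minimal illustration, not the system from the talk: the `Cluster` class and its method names are hypothetical, and reading "average time" as the mean age of a cluster's articles is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple

# Thresholds taken from the slide; all names below are hypothetical.
MIN_ARTICLES = 3
MIN_SOURCES = 2
ARCHIVE_AFTER = timedelta(days=3)


@dataclass
class Cluster:
    # Each article is stored as (publication time, source name).
    articles: List[Tuple[datetime, str]] = field(default_factory=list)

    def is_event(self) -> bool:
        """A cluster is an event once it has >= 3 articles from >= 2 sources."""
        sources = {source for _, source in self.articles}
        return len(self.articles) >= MIN_ARTICLES and len(sources) >= MIN_SOURCES

    def should_archive(self, now: datetime) -> bool:
        """Assumption: 'average time' is the mean age of the cluster's articles."""
        if not self.articles:
            return False
        total_age = sum((now - t for t, _ in self.articles), timedelta())
        return total_age / len(self.articles) >= ARCHIVE_AFTER
```

In a crawl loop, `is_event` would be checked after each scaffold update and `should_archive` once per batch.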
Algorithms used
Scaffolding: Star-EM [1]
Results are a bit better than traditional TDT results (around 0.8 micro F1 measure).
[1] M. Gallé & J.-M. Renders. “Full and mini-batch clustering of news articles with Star-EM.” ECIR 2012.
Fill-in: tried several alternatives, used one that dynamically classifies & segments:
Scoring function
Given an article d = s_1 … s_n, find (meta-)segments T_1, …, T_k such that d = T_1 … T_k, T_i = s_j … s_{j+ℓ} for some j and ℓ ≥ 0, and
$$\frac{1}{k} \sum_{i=1}^{k} \max_{e \in E \cup \{u\}} \mathrm{sim}(T_i, e) \; - \; \beta \times k$$
is maximised.
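One way to maximise this objective exactly is dynamic programming over the number of segments: for each k, find the contiguous split with the best summed score, then pick the k with the best value of mean score minus β × k. The sketch below assumes a caller-supplied `score(i, j)` that returns the max over E ∪ {u} of sim for the candidate segment s_i … s_{j-1}. The function name and signature are hypothetical, and the talk does not say which optimiser was actually used.

```python
from typing import Callable, List, Tuple


def best_segmentation(n: int, score: Callable[[int, int], float],
                      beta: float) -> Tuple[List[Tuple[int, int]], float]:
    """Split units 0..n-1 into contiguous segments maximising
    (1/k) * sum(score(segment)) - beta * k.

    score(i, j) is assumed to return max_{e in E ∪ {u}} sim(T, e)
    for the half-open segment [i, j).
    """
    if n <= 0:
        raise ValueError("need at least one unit")
    neg_inf = float("-inf")
    # best[k][j]: best summed score for splitting the first j units into k segments.
    best = [[neg_inf] * (n + 1) for _ in range(n + 1)]
    back = [[0] * (n + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for k in range(1, n + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                if best[k - 1][i] == neg_inf:
                    continue
                candidate = best[k - 1][i] + score(i, j)
                if candidate > best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    def objective(k: int) -> float:
        return best[k][n] / k - beta * k

    k_star = max(range(1, n + 1), key=objective)
    # Walk the back-pointers to recover the segment boundaries.
    bounds: List[Tuple[int, int]] = []
    j = n
    for k in range(k_star, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds, objective(k_star)
```

The table costs O(n³) calls to `score`, which is acceptable for the short unit sequences of a single article. A larger β favours fewer, longer segments.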
Some numbers
sources: 62
time: ≈ 1 year
batch size: 1 hour
articles crawled: ≈ 820 000
articles assigned to events: 541 546
events: 10 752
Remainder of this Talk
1 Method: how we did it
2 Results: what we found
First reports
Percentage of all events that source x was the first to report:
source percentage
Reuters 12.96%
All Africa 11.68%
France24 10.41%
The Globe and Mail 5.47%
BBC 4.91%
CNN 4.89%
Businessweek 3.01%
RIAN 2.67%
Daily Mirror 2.59%
CBS News 2.32%
Daily Mail 2.21%
The Telegraph 2.21%
NY Post 2.19%
Kyodo 2.07%
NY Times 2.06%
Fox News 1.78%
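A table like this can be derived directly from the event clusters. The sketch below assumes each event is a list of (timestamp, source) pairs. The function name is hypothetical, and ties on the timestamp are broken alphabetically by source name, which is a simplification.

```python
from collections import Counter
from typing import Dict, List, Tuple

Article = Tuple[float, str]  # (timestamp, source); hypothetical representation


def first_report_share(events: List[List[Article]]) -> Dict[str, float]:
    """Percentage of events for which each source published the earliest article."""
    firsts = Counter(min(articles)[1] for articles in events if articles)
    total = sum(firsts.values())
    return {source: 100.0 * count / total
            for source, count in firsts.most_common()}
```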
Relative first reports
North Africa Journal 4.61 %
AP 1.73 %
France24 1.61 %
Helsinki Times 1.45 %
ANSA 1.38 %
NY Times 1.35 %
Reuters 1.27 %
Chosun 1.22 %
CBS News 1.17 %
The Herald 1.14 %
MSNBC 1.07 %
RFERL 1.06 %
The Australian 1.03 %
New Europe 0.95 %
Le Monde dipl 0.95 %
Al Jazeera 0.95 %
Der Spiegel 0.94 %
Bursty events
How to detect bursts
“detect features that occur with high density over a limited time period”
Bursty and Hierarchical Structure in Streams
Jon Kleinberg
Abstract
A fundamental problem in text data mining is to extract meaningful structure
from document streams that arrive continuously over time. E-mail and news articles
are two natural examples of such streams, each characterized by topics that appear,
grow in intensity for a period of time, and then fade away. The published literature
in a particular research field can be seen to exhibit similar phenomena over a much
longer time scale. Underlying much of the text mining work in this area is the following
intuitive premise — that the appearance of a topic in a document stream is signaled
by a “burst of activity,” with certain features rising sharply in frequency as the topic
emerges.
The goal of the present work is to develop a formal approach for modeling such
“bursts,” in such a way that they can be robustly and efficiently identified, and can
provide an organizational framework for analyzing the underlying content. The ap-
proach is based on modeling the stream using an infinite-state automaton, in which
bursts appear naturally as state transitions; it can be viewed as drawing an analogy
with models from queueing theory for bursty network traffic. The resulting algorithms
are highly efficient, and yield a nested representation of the set of bursts that imposes
a hierarchical structure on the overall stream. Experiments with e-mail and research
paper archives suggest that the resulting structures have a natural meaning in terms
of the content that gave rise to them.
This work appears in the Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
Jon Kleinberg, Department of Computer Science, Cornell University, Ithaca NY 14853.
Black box: input an arrival series (with timestamps) and get, for each event, zero or more periods in which it was bursty (plus a score)
A bit more technical: model this with a two-state HMM whose states emit at a normal and a bursty rate
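The black box can be sketched as a simplified two-state version of Kleinberg's automaton. Gaps between consecutive articles are modelled as exponential, with rate n/T in the normal state and s·n/T in the bursty state. Entering the bursty state costs γ·ln n, and a Viterbi pass finds the cheapest state sequence. The parameters `s` and `gamma` and the function name are assumptions here; the talk does not state its configuration.

```python
import math
from typing import List, Sequence, Tuple


def detect_bursts(timestamps: Sequence[float], s: float = 2.0,
                  gamma: float = 1.0) -> List[Tuple[float, float, float]]:
    """Return (start, end, score) for every maximal bursty period."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    n = len(gaps)
    total = sum(gaps)
    if n == 0 or total <= 0:
        return []
    rates = (n / total, s * n / total)   # normal rate, bursty rate
    up_cost = gamma * math.log(n)        # price of entering the bursty state

    def emit(state: int, gap: float) -> float:
        # Negative log-density of an exponential gap at the state's rate.
        return rates[state] * gap - math.log(rates[state])

    # Viterbi over two states; the automaton starts in the normal state.
    cost = [0.0, math.inf]
    back = []
    for gap in gaps:
        prev_for_normal = 0 if cost[0] <= cost[1] else 1
        prev_for_bursty = 0 if cost[0] + up_cost <= cost[1] else 1
        new_normal = cost[prev_for_normal] + emit(0, gap)
        enter = cost[0] + up_cost if prev_for_bursty == 0 else cost[1]
        new_bursty = enter + emit(1, gap)
        back.append((prev_for_normal, prev_for_bursty))
        cost = [new_normal, new_bursty]

    # Backtrack the cheapest state sequence (one state per gap).
    state = 0 if cost[0] <= cost[1] else 1
    states = [0] * n
    for t in range(n - 1, -1, -1):
        states[t] = state
        state = back[t][state]

    # Collect maximal runs of the bursty state; the score of a burst is the
    # cost saved by explaining its gaps with the bursty rate.
    bursts = []
    t = 0
    while t < n:
        if states[t] == 1:
            start, score = t, 0.0
            while t < n and states[t] == 1:
                score += emit(0, gaps[t]) - emit(1, gaps[t])
                t += 1
            bursts.append((ts[start], ts[t], score))
        else:
            t += 1
    return bursts
```

For example, an event with one article every 10 hours and a stretch of hourly articles in the middle yields a single burst covering that stretch. Raising `gamma` makes the detector more reluctant to declare a burst.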
Bursty sources
1 Compute the burst(s) for each event
2 For each burst, take its first article
3 Add the burst's score to the total score of that article's source
source total burst score
Reuters 100.0
The Globe and Mail 83.9
CNN 72.7
Al Jazeera 58.0
France24 53.1
RIAN 47.0
The Star 45.6
CBS News 43.8
MSNBC 42.4
NPR 38.6
The Sun 37.5
DW 34.7
The Guardian 32.1
BBC 30.9
Businessweek 26.8
All Africa 22.1
AP 21.6
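The three steps behind this table can be sketched as below. Each event is assumed to come with its articles as (timestamp, source) pairs and its bursts as (start, end, score) triples. Rescaling so that the top source gets 100.0 is inferred from the table rather than stated in the talk, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Article = Tuple[float, str]          # (timestamp, source)
Burst = Tuple[float, float, float]   # (start, end, score)


def source_burst_scores(
        events: List[Tuple[List[Article], List[Burst]]]) -> Dict[str, float]:
    """Credit each burst's score to the source of the first article inside it."""
    totals: Dict[str, float] = defaultdict(float)
    for articles, bursts in events:
        for start, end, score in bursts:
            in_burst = [a for a in articles if start <= a[0] <= end]
            if in_burst:
                _, first_source = min(in_burst)
                totals[first_source] += score
    top = max(totals.values(), default=0.0)
    if top <= 0:
        return {}
    # Assumption: scores are rescaled so that the best source has 100.0.
    return {source: round(100.0 * total / top, 1)
            for source, total in totals.items()}
```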
Bursty sources: explanations
1 News agencies: it's their job
2 Regional news sources which are trusted (RIAN, Al Jazeera)
3 Good “journalistic nose”
Lag to report
1 For each source, consider the events it reported on but was not the first to report
2 Take the time difference to the first article on the event
source hours (median)
France24 20.85
Reuters 20.87
BBC 21.41
Antara News 21.78
All Africa 22.69
Kyodo 23.47
Fox News 24.74
Al Jazeera 24.86
ANSA 24.95
CNN 24.96
RIAN 25.38
RFERL 26.29
The Telegraph 26.53
Daily Mirror 26.70
Euronews 27.40
The Globe and Mail 27.45
NPR 27.94
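The median lag per source can be computed along these lines. The sketch assumes timestamps are expressed in hours and that a source's lag on an event is measured from the event's first article to that source's own earliest article. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

Article = Tuple[float, str]  # (timestamp in hours, source)


def median_lag_hours(events: List[List[Article]]) -> Dict[str, float]:
    """Median delay of each source behind the first report, over the
    events it covered without being first."""
    lags: Dict[str, List[float]] = defaultdict(list)
    for articles in events:
        if not articles:
            continue
        first_time, first_source = min(articles)
        # Keep only each source's earliest article on this event.
        earliest: Dict[str, float] = {}
        for time, source in articles:
            if source not in earliest or time < earliest[source]:
                earliest[source] = time
        for source, time in earliest.items():
            if source != first_source:
                lags[source].append(time - first_time)
    return {source: median(values) for source, values in lags.items()}
```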
Lag to report: Explanations
Online-first policy: in most cases you can't argue that the lag is due to nighttime
Intrinsic delay: waiting for confirmation and assessing the news (?)
Lingering conservatism (?)
Everybody chooses their own priorities
Differences in reporting time
Conclusions
Data-driven analysis to study which news outlets break news
Some important fine-tuning deviations from classical TDT to scale up and capture many global events
(In our data) It is not true that only Big Agencies break news
Regarding hotness, many trusted regional outlets rank at the top
Lag to report remains big, but may hide in-house priorities /
strategies.

Two Opinions on Event Detection Methods

  • 1.
  • 2. Two Opinions There exists a group of news sources X such that at least one of the following is true: (a) “All news comes from X and the rest just comment on it”; (b) “A story receives wide public attention only if X reports on it”
  • 3. 1. Method: how we did it; 2. Results: what we found
  • 4. Remainder of this Talk 1. Method: how we did it; 2. Results: what we found
  • 5. The Setting Given a list of sources, periodically crawl all new articles and find events
  • 6. The Setting Given a list of sources, crawl periodically all new articles and find events Event “An event is a particular thing that happens at a specific time and place”
  • 7. The Setting Given a list of sources, periodically crawl all new articles and find events. Event: “An event is a particular thing that happens at a specific time and place”. We have known how to do this for 15+ years (TDT). [Slide shows the first page of Yiming Yang, Tom Pierce, Jaime Carbonell, “A Study on Retrospective and On-Line Event Detection”, SIGIR'98, Melbourne, Australia. Its abstract reports clustering a corpus of 15,836 stories, with an F1 measure of 82 for retrospective detection and 42 for on-line detection.]
  • 8. The Particularities We wanted to be global. Sources (region): ABC News (US); Al Jazeera (Arabic World); All Africa (Africa); ANSA (Italy); Antara News (Indonesia); AOL news (Global); AP (Global); BBC (UK); Boston Globe (US); Budapest Business J (Hungary); Businessweek (Global); CBS News (US); China News Service (China); Chosun (South Korea); CNN (US); Cyprus Mail (Cyprus); Daily Mail (UK); Daily Mirror (UK); Der Spiegel (Germany); DW-World (Germany); EHealthNews (Europe); EUbusiness (Europe); EUobserver (Europe); EurActiv (Europe); Euronews (Europe); EuropeanAgenda (Europe); EuroTopics (Europe); Fox News (US); France24 (France); FT (Global); Helsinki Times (Finland); Kyodo News (Japan); The WSJ (US); Irish Examiner (Ireland); Le Monde dipl (France); Mercopress (Latin America); Moscow News (Russia); MSNBC (Global); New Europe (Europe); New Scientist (Global); North Africa Journal (North Africa); Novaya Gazeta (Russia); Novinite (Bulgaria); NPR (US); NY Post (US); NY Times (US); Reuters (Global); RFERL (Asia, Middle East); RIAN (Russia); The Australian (Australia); The Globe and Mail (Canada); The Guardian (UK); The Herald, Glasgow (Scotland); The Star (Malaysia); The Sun (UK); The Telegraph (UK); Times of India (India); Times of Malta (Malta); Voice of America (US)
  • 9. The Particularities Danger of falling into one of two extremes: getting drowned by local events, or considering only the most salient events
  • 10. Our Solution Two-stage approach: 1. Scaffold given by the main-segments of primary sources; 2. Fill in all the remaining articles
  • 11. Our Solution Main-segment: the sentences containing the first 100 words
  • 12. The Particularities Primary Sources: the same source list as on slide 8
  • 13. Our Solution 1. Get new articles; 2. Scaffold: (a) S = main-segments from primary sources, (b) update existing clusters with S, (c) consider as events those clusters that have ≥ 3 articles from ≥ 2 sources; 3. Fill-in: consider all articles of the last 48 hours and try to match them to one of the existing events; 4. Archive old events (average time ≥ 3 days)
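The thresholds on this slide can be sketched as small predicates. This is only an illustration: the `Article` type and the function names are hypothetical, and reading "average time ≥ 3 days" as the mean age of a cluster's articles is an assumption, since the slide does not define it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Article:
    source: str          # name of the news outlet
    published: datetime  # crawl or publication time


def is_event(cluster: List[Article], min_articles: int = 3, min_sources: int = 2) -> bool:
    """Slide rule: a cluster counts as an event with >= 3 articles from >= 2 sources."""
    distinct_sources = {a.source for a in cluster}
    return len(cluster) >= min_articles and len(distinct_sources) >= min_sources


def in_fill_in_window(article: Article, now: datetime,
                      window: timedelta = timedelta(hours=48)) -> bool:
    """Slide rule: only articles of the last 48 hours are matched to events."""
    return timedelta(0) <= now - article.published <= window


def should_archive(cluster: List[Article], now: datetime,
                   max_age: timedelta = timedelta(days=3)) -> bool:
    """ASSUMPTION: 'average time' is taken as the mean age of the cluster's articles."""
    mean_age = sum((now - a.published for a in cluster), timedelta(0)) / len(cluster)
    return mean_age >= max_age
```

In a crawl loop, `is_event` would be checked after each scaffold update and `should_archive` once per batch.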
  • 14. Algorithms used Scaffolding: Star-EM [1]. Results are a bit better than traditional TDT results (0.8 micro F1 measure). [1] M. Gallé, J.-M. Renders. “Full and mini-batch clustering of news articles with Star-EM”, ECIR 2012
  • 15. Algorithms used Fill-in: tried several alternatives, used one that dynamically classifies segments. Scoring function: given an article $d = s_1 \dots s_n$, find (meta-)segments $T_1, \dots, T_k$ such that $d = T_1 \dots T_k$, $T_i = s_j \dots s_{j+\ell}$ for some $j$ and $\ell \ge 0$, and $\frac{1}{k} \sum_{i=1}^{k} \max_{e \in E \cup \{u\}} \mathrm{sim}(T_i, e) - \beta \times k$ is maximised.
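One way to maximise a score of this shape is dynamic programming over the number of segments. The sketch below is not the authors' implementation: `best_segmentation` and `seg_score` are hypothetical names, and `seg_score` stands in for the inner term, i.e. the maximum similarity of a segment to any known event or to the extra element u. The toy scorer uses Jaccard overlap with an assumed constant of 0.1 for u.

```python
from typing import Callable, List, Sequence, Tuple


def best_segmentation(sentences: Sequence[str],
                      seg_score: Callable[[Sequence[str]], float],
                      beta: float) -> Tuple[List[Tuple[int, int]], float]:
    """Split `sentences` into contiguous segments maximising
    (1/k) * sum_i seg_score(T_i) - beta * k. Returns (boundaries, score)."""
    n = len(sentences)
    if n == 0:
        return [], 0.0
    neg = float("-inf")
    # best[k][i]: best total score for the first i sentences in exactly k segments
    best = [[neg] * (n + 1) for _ in range(n + 1)]
    back = [[0] * (n + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for k in range(1, n + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                if best[k - 1][j] == neg:
                    continue
                cand = best[k - 1][j] + seg_score(sentences[j:i])
                if cand > best[k][i]:
                    best[k][i], back[k][i] = cand, j
    # The 1/k average and the beta*k penalty depend on k, so pick k at the end.
    k_best = max(range(1, n + 1), key=lambda k: best[k][n] / k - beta * k)
    bounds, i = [], n
    for k in range(k_best, 0, -1):
        j = back[k][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return bounds, best[k_best][n] / k_best - beta * k_best


# Toy usage with hypothetical events represented as word sets.
EVENTS = [{"quake", "kobe", "rescue"}, {"election", "vote", "count"}]
U_SCORE = 0.1  # ASSUMPTION: constant similarity to the extra element u


def toy_score(segment: Sequence[str]) -> float:
    words = {w for s in segment for w in s.split()}
    jaccard = [len(words & e) / len(words | e) for e in EVENTS]
    return max(jaccard + [U_SCORE])


bounds, score = best_segmentation(
    ["quake kobe", "kobe rescue", "election vote", "vote count"], toy_score, beta=0.05)
```

Here `bounds` holds half-open sentence index ranges, one per segment. The table costs O(n³) calls to `seg_score`, which is acceptable for main-segments of about 100 words.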
  • 16. Some numbers Sources: 62; time: ≈ 1 year; batch size: 1 h; articles crawled: ≈ 820 000; assigned to events: 541 546; events: 10 752
  • 17. Remainder of this Talk 1. Method: how we did it; 2. Results: what we found
  • 18. Two Opinions There exists a group of news sources X such that at least one of the following is true: (a) “All news comes from X and the rest just comment on it”; (b) “A story receives wide public attention only if X reports on it”
  • 19. First reports Percentage of total events that source x reported on first (source: percentage): Reuters: 12.96%; All Africa: 11.68%; France24: 10.41%; The Globe and Mail: 5.47%; BBC: 4.91%; CNN: 4.89%; Businessweek: 3.01%; RIAN: 2.67%; Daily Mirror: 2.59%; CBS News: 2.32%; Daily Mail: 2.21%; The Telegraph: 2.21%; NY Post: 2.19%; Kyodo: 2.07%; NY Times: 2.06%; Fox News: 1.78%
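A first-report share like the one on this slide can be computed as follows. This is a minimal sketch under an assumed data layout: each event is a list of `(source, timestamp)` pairs, and `first_report_share` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def first_report_share(events: Iterable[List[Tuple[str, float]]]) -> Dict[str, float]:
    """Percentage of events on which each source published the earliest article."""
    firsts: Counter = Counter()
    total = 0
    for articles in events:
        if not articles:
            continue  # skip empty events rather than guessing a first reporter
        source, _ = min(articles, key=lambda a: a[1])
        firsts[source] += 1
        total += 1
    return {source: 100.0 * count / total for source, count in firsts.items()}
```

For example, with four events of which Reuters opens three and BBC one, the function returns 75.0 for Reuters and 25.0 for BBC.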
  • 20. Relative first reports North Africa Journal: 4.61%; AP: 1.73%; France24: 1.61%; Helsinki Times: 1.45%; ANSA: 1.38%; NY Times: 1.35%; Reuters: 1.27%; Chosun: 1.22%; CBS News: 1.17%; The Herald: 1.14%; MSNBC: 1.07%; RFERL: 1.06%; The Australian: 1.03%; New Europe: 0.95%; Le Monde dipl: 0.95%; Al Jazeera: 0.95%; Der Spiegel: 0.94%
  • 21. Two Opinions There exists a group of news sources X such that at least one of the following is true: (a) “All news comes from X and the rest just comment on it”; (b) “A story receives wide public attention only if X reports on it”
  • 23. How to detect bursts “detect features that occur with high density over a limited time period” [Slide shows the first page of Jon Kleinberg, “Bursty and Hierarchical Structure in Streams”, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Its abstract describes modelling the stream with an infinite-state automaton in which bursts appear as state transitions.]
  • 24. How to detect bursts “detect features that occur with high density over a limited time period” Black box: input an arrival series (with timestamps), and get for each event zero or more periods where it was bursty (plus a score). A bit more technical: model this with a 2-state HMM whose states, normal and bursty, produce the emissions
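The black box can be sketched with a two-state model in the spirit of Kleinberg's automaton: gaps between arrivals are exponentially distributed, the bursty state has a higher rate, and entering it costs a penalty. The talk does not give its exact costs, so the parameters `s` and `gamma`, the function name `detect_bursts`, and the burst score (cost saved by being in the bursty state) are all assumptions here.

```python
import math
from typing import List, Sequence, Tuple


def detect_bursts(timestamps: Sequence[float], s: float = 2.0,
                  gamma: float = 1.0) -> List[Tuple[float, float, float]]:
    """Return (start, end, score) for each bursty period, via Viterbi over 2 states."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    n, total = len(gaps), sum(gaps)
    if n == 0 or total <= 0:
        return []
    rates = (n / total, s * n / total)  # state 0 = normal, state 1 = bursty

    def cost(state: int, gap: float) -> float:
        # negative log-density of an exponential gap under this state's rate
        return rates[state] * gap - math.log(rates[state])

    up = gamma * math.log(n)           # penalty for moving normal -> bursty
    prev = [0.0, float("inf")]         # the sequence starts in the normal state
    back: List[List[int]] = []
    for g in gaps:
        cur, ptr = [0.0, 0.0], [0, 0]
        for st in (0, 1):
            from_normal = prev[0] + (up if st == 1 else 0.0)
            from_bursty = prev[1]      # moving back down is free
            if from_normal <= from_bursty:
                cur[st], ptr[st] = from_normal + cost(st, g), 0
            else:
                cur[st], ptr[st] = from_bursty + cost(st, g), 1
        back.append(ptr)
        prev = cur
    # Backtrack the cheapest state sequence.
    state = 0 if prev[0] <= prev[1] else 1
    states = [0] * n
    for t in range(n - 1, -1, -1):
        states[t] = state
        state = back[t][state]
    # Group consecutive bursty gaps into periods.
    bursts, t = [], 0
    while t < n:
        if states[t] == 1:
            start, score = t, 0.0
            while t < n and states[t] == 1:
                score += cost(0, gaps[t]) - cost(1, gaps[t])
                t += 1
            bursts.append((ts[start], ts[t], score))
        else:
            t += 1
    return bursts
```

With hourly timestamps such as `[0, 10, 20, 30, 30.1, 30.2, 30.3, 30.4, 30.5, 40, 50, 60]`, the dense run between 30 and 30.5 comes back as a single burst.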
  • 25. Bursty sources 1. Compute the burst(s) for each event; 2. For each burst, take the first article; 3. Add the burst’s score to the total score of that article's source
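These three steps can be sketched as a small aggregation. The data layout is assumed: each event is a pair of its articles, as `(source, timestamp)`, and its bursts, as `(start, end, score)`. Rescaling so that the top source gets 100 is also an assumption, suggested only by the fact that the ranking on the next slide tops out at 100.0.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Article = Tuple[str, float]          # (source, timestamp)
Burst = Tuple[float, float, float]   # (start, end, score)


def bursty_source_scores(events: Iterable[Tuple[List[Article], List[Burst]]],
                         normalise: bool = True) -> Dict[str, float]:
    """Credit each burst's score to the source of the first article inside it."""
    totals: Dict[str, float] = defaultdict(float)
    for articles, bursts in events:
        for start, end, score in bursts:
            inside = [(ts, src) for src, ts in articles if start <= ts <= end]
            if not inside:
                continue  # no article falls in this burst; nothing to credit
            _, first_source = min(inside)
            totals[first_source] += score
    if normalise and totals:
        top = max(totals.values())
        if top > 0:
            # ASSUMPTION: rescale so the highest-scoring source gets 100
            return {src: 100.0 * value / top for src, value in totals.items()}
    return dict(totals)
```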
  • 26. Bursty sources (source: total burst score) Reuters: 100.0; The Globe and Mail: 83.9; CNN: 72.7; Al Jazeera: 58.0; France24: 53.1; RIAN: 47.0; The Star: 45.6; CBS News: 43.8; MSNBC: 42.4; NPR: 38.6; The Sun: 37.5; DW: 34.7; The Guardian: 32.1; BBC: 30.9; Businessweek: 26.8; All Africa: 22.1; AP: 21.6
  • 27. Bursty sources: explanations 1. News agencies: it's their job; 2. Regional news sources which are trusted (RIAN, Al Jazeera); 3. Good “journalistic nose”
  • 28. Lag to report 1. For each source, consider the events it reported on, but not first; 2. Take the time delta to the first article
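The lag computation can be sketched as below. The layout is assumed again: each event is a list of `(source, timestamp_in_hours)` pairs. Using each source's earliest article on an event is an assumption, as the slide does not say how multiple articles from one source are handled.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_lag_hours(events: Iterable[List[Tuple[str, float]]]) -> Dict[str, float]:
    """Median delay, per source, between an event's first article and that
    source's own first article, over events the source did not break."""
    lags: Dict[str, List[float]] = defaultdict(list)
    for articles in events:
        if not articles:
            continue
        ordered = sorted(articles, key=lambda a: a[1])
        first_source, first_ts = ordered[0]
        seen = {first_source}  # the breaking source contributes no lag
        for source, ts in ordered[1:]:
            if source in seen:
                continue       # ASSUMPTION: only a source's earliest article counts
            seen.add(source)
            lags[source].append(ts - first_ts)
    return {source: median(values) for source, values in lags.items()}
```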
  • 29. Lag to report (source: median hours) France24: 20.85; Reuters: 20.87; BBC: 21.41; Antara News: 21.78; All Africa: 22.69; Kyodo: 23.47; Fox News: 24.74; Al Jazeera: 24.86; ANSA: 24.95; CNN: 24.96; RIAN: 25.38; RFERL: 26.29; The Telegraph: 26.53; Daily Mirror: 26.70; Euronews: 27.40; The Globe and Mail: 27.45; NPR: 27.94
  • 30. Lag to report: Explanations Online-first policy: you can’t argue (in most cases) that this is due to nighttime; intrinsic delay; waiting for confirmation, and assessing news (?); lingering conservatism (?); everybody chooses his priorities
  • 35. Conclusions Data-driven analysis to study which news outlets break news. Some important fine-tuning deviations from classical TDT, in order to scale up and to capture many global events
  • 36. Conclusions (In our data) it is not true that only Big Agencies break news. Regarding hotness, many trusted regional outlets rank at the top. Lag to report remains big, but may hide in-house priorities / strategies.