Using Wikipedia Concurrent Edit Spikes With Social Network Plausibility Checks For Breaking News Detection
1. MJ no more:
Using Wikipedia Concurrent Edit Spikes
With Social Network Plausibility Checks
For Breaking News Detection
Thomas Steiner (tomac@google.com, @tomayac)
Seth van Hooland (svhoolan@ulb.ac.be, @sethvanhooland)
Ed Summers (edsu@loc.gov, @edsu)
3. First Story Detection on Realtime Social Networks
Typically based on Twitter because of its Streaming API [Twitter2012].
Try to detect spikes in time, locality, or text (oftentimes restricted to one
domain, e.g., earthquake prediction).
A typical representative of this kind of approach is [Petrović2010].
High recall
Low precision
[Twitter2012] https://dev.twitter.com/docs/streaming-apis/streams/public
[Petrović2010] Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with
application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American
Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics,
Stroudsburg, PA, USA, 181–189.
4. Curation based on Wikipedia
Wikipedia page view logs are publicly available [Wikipedia2012] and updated
on an hourly basis.
Osborne et al. have shown that there is a relation between Wikipedia page
views and news events [Osbourne2012].
Their approach improves on [Petrović2010] by using Wikipedia page view logs.
Key findings:
Wikipedia lags about 2h behind the news.
Newly created pages add noise.
[Wikipedia2012] http://dumps.wikimedia.org/other/pagecounts-raw/
[Osbourne2012] Miles Osborne, Saša Petrović, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2012. Bieber no
more: First Story Detection using Twitter and Wikipedia. In SIGIR 2012 Workshop on Time-aware Information
Access (#TAIA2012), Portland, Oregon, USA.
5. Key idea: invert the process
Use the live Wikipedia IRC stream of recent changes [WikipediaIRC2012] to
find candidates first, then do a sanity check on social networks.
[WikipediaIRC2012] http://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds
6. Introducing Wikipedia Live Monitor
Hooks into the Wikipedia recent changes IRC channels for all Wikipedia
locales.
Channel names follow the pattern #language.project, e.g., #de.wikipedia.
When an article gets edited, retrieve all language versions and treat them
as a cluster.
E.g., en:Albert_Einstein is in the same cluster as de:Albert_Einstein.
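The channel pattern and the clustering step can be sketched as follows. This is a minimal illustration, not the Wikipedia Live Monitor code: parse_channel, langlinks_url, and cluster_id are hypothetical helper names, and the sketch assumes the language versions are looked up via the MediaWiki API's prop=langlinks query (the slides do not say which lookup is used).

```python
from urllib.parse import urlencode


def parse_channel(channel: str) -> tuple[str, str]:
    """Split a channel name such as '#de.wikipedia' into (language, project)."""
    language, project = channel.lstrip("#").split(".", 1)
    return language, project


def langlinks_url(language: str, title: str) -> str:
    """Build a MediaWiki API URL that lists the other language versions of an article.

    Assumption: the lookup uses action=query with prop=langlinks.
    """
    query = urlencode({
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    })
    return f"https://{language}.wikipedia.org/w/api.php?{query}"


def cluster_id(versions: list[str]) -> str:
    """Hypothetical cluster key: the sorted 'lang:Title' identifiers, joined."""
    return "|".join(sorted(versions))
```

With a key like this, an edit to de:Albert_Einstein and an edit to en:Albert_Einstein map to the same cluster regardless of which version was edited first.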
7. Breaking News Conditions
1) ≥5 Occurrences
An article cluster must have at least n edits before it is considered a
breaking news candidate.
2) ≤60 Seconds Between Edits
An article cluster may have at most n seconds between consecutive edits in
order to be regarded as a breaking news candidate.
3) ≥2 Concurrent Editors
An article cluster must be edited by at least n concurrent editors before it
is considered a breaking news candidate.
4) ≤240 Seconds Since Last Edit
An article cluster is thrown out of the monitoring loop if its last edit was
more than n seconds ago.
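The four conditions can be sketched as a small check per article cluster. This is one possible reading, not the system's actual implementation: the Cluster class and the constant names are hypothetical, "concurrent editors" is approximated as distinct editor names, and the 60-second rule is applied to the gaps among the most recent five edits.

```python
from dataclasses import dataclass, field

MIN_OCCURRENCES = 5   # 1) at least this many edits
MAX_EDIT_GAP_S = 60   # 2) at most this many seconds between consecutive edits
MIN_EDITORS = 2       # 3) at least this many distinct editors
MAX_IDLE_S = 240      # 4) drop the cluster after this many seconds without edits


@dataclass
class Cluster:
    edit_times: list = field(default_factory=list)  # seconds, in arrival order
    editors: set = field(default_factory=set)

    def add_edit(self, timestamp: float, editor: str) -> None:
        self.edit_times.append(timestamp)
        self.editors.add(editor)

    def is_expired(self, now: float) -> bool:
        # Condition 4: the last edit is too long ago, stop monitoring.
        return bool(self.edit_times) and now - self.edit_times[-1] > MAX_IDLE_S

    def is_candidate(self) -> bool:
        # Conditions 1 and 3: enough edits by enough distinct editors.
        if len(self.edit_times) < MIN_OCCURRENCES:
            return False
        if len(self.editors) < MIN_EDITORS:
            return False
        # Condition 2 (assumed reading): check the gaps among the latest edits.
        recent = self.edit_times[-MIN_OCCURRENCES:]
        gaps = [later - earlier for earlier, later in zip(recent, recent[1:])]
        return all(gap <= MAX_EDIT_GAP_S for gap in gaps)
```

A monitoring loop would call add_edit for every edit in a cluster, report the cluster once is_candidate returns True, and discard it once is_expired returns True.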
11. Evaluation: How well does it work?
Lag time for global events: <5 min
Resignation of Pope Benedict XVI
(http://en.wikipedia.org/wiki/Resignation_of_Pope_Benedict_XVI)
First three edit times (UTC) after the news broke on Feb 11, 2013:
● English Wikipedia article: 10:58, 10:59, 11:02
● French Wikipedia article: 11:00, 11:00, 11:01
This implies that by looking at only two language versions of the Pope
article (the actual number of monitored versions is 42), the system would
have reported the news at 11:01, when the cluster reached its fifth edit.
The Twitter account of Reuters announced the news at 10:59.
Vatican Radio’s announcement was made at 10:57:47.
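The lag figures can be checked with a few lines of arithmetic. The sketch below merges the English and French edit times listed above, takes the fifth edit as the report time, and compares it with the two announcements. The slide gives edit times at minute resolution only, so treating each as hh:mm:00 is an assumption and the resulting lags are approximate.

```python
from datetime import datetime

DAY = "2013-02-11 "
FMT = "%Y-%m-%d %H:%M:%S"

vatican_radio = datetime.strptime(DAY + "10:57:47", FMT)
reuters_tweet = datetime.strptime(DAY + "10:59:00", FMT)

# English and French edit times merged into one cluster (seconds assumed :00).
edit_minutes = ["10:58", "10:59", "11:02", "11:00", "11:00", "11:01"]
edits = sorted(datetime.strptime(DAY + t + ":00", FMT) for t in edit_minutes)

# The cluster needs five edits before it counts as a candidate.
fifth_edit = edits[4]

lag_vs_vatican = (fifth_edit - vatican_radio).total_seconds()
lag_vs_reuters = (fifth_edit - reuters_tweet).total_seconds()

print(fifth_edit.strftime("%H:%M"), lag_vs_vatican, lag_vs_reuters)
```

Under these assumptions the report comes roughly 3 minutes after the Vatican Radio announcement and 2 minutes after the Reuters tweet, consistent with the stated lag of under 5 minutes.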
12. Future Work
Work with realtime page view logs in addition to page edit logs
(API format currently being defined by Wikimedia)
News categorization and classification
E.g., category Living-Persons removed from a person's article implies (sad)
news
Reduce the false-positive rate; strengthen the connection between social
networks and actual article edits
Auto notification system upon breaking news candidates
Pre-announcement: follow @WikiLiveMon
13. Demo and thank you
Play with the system at
http://wikipedia-irc.herokuapp.com/
Read the paper at
http://arxiv.org/abs/1303.4702
Ask questions here or via
tomac@google.com & @tomayac