
Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation



An overview of using Wikipedia time signal data. These are the slides for the TempWeb workshop paper: http://www.stewh.com/wp-content/uploads/2014/02/w14temp07-whiting.pdf

Published in: Internet, Technology


  1. Wikipedia as a Time Machine
     Stewart Whiting and Joemon M. Jose (University of Glasgow, Scotland, UK); Omar Alonso (Microsoft Bing, Mountain View, CA, USA)
     Temporal Web Analytics Workshop 2014
  2. Understanding Wikipedia
     Talk sections: Introduction, Wiki Characteristics, Time Signals, Final Remarks, Data
     • Anyone can create and edit content
     • Moderator-curated
     • Reflects time-based news, culture and phenomena
     • Wikipedia English started in 2001
     • Now contains 4.5M+ articles
     • ~20.4 revisions per article
     • Vast amounts of open data
     • Rich structure (article hierarchy, linking, taxonomies, semantics)
     • 6th most visited website on the internet [Alexa]
     • Huge collaborative encyclopaedic effort
  3. Wikipedia as a Time Machine
     Wikipedia offers a great deal of time information:
     • Text content: people write about the past/present/future; explicit/implicit structure
     • Meta-data signals: pulse of real-time activity; side-effects of temporal user interest, without needing a query log!
     Insight into:
     • Story
     • Temporal sequencing
     • Entity relationships
     • Impact
  4. This Talk
     • How can we discover, understand and track past, present and future temporal topics using Wikipedia?
     • And how can this knowledge be exploited in time-aware information retrieval tasks?
  5. IR & Wikipedia
     Wikipedia text and structure are used extensively in many non-temporal IR tasks:
     • Semantic Similarity/Relatedness Measures [GabrilovichEtAl2007: Wiki. Explicit Semantic Analysis] [StrubeEtAl2006: WikiRelate!]
     • External Collection Query Expansion [XuEtAl2009]
     • Query Intent Modelling [HuEtAl2009]
     • Cross-Lingual IR [PotthastEtAl2008]
     • Entity Tasks: Recognition, Disambiguation etc. [Many!]
  6. Time-aware IR & Wikipedia
     Using Wikipedia temporal signals in time-aware IR tasks:
     • Event/Topic Detection & Tracking
       Detection/tracking: [CiglanNorvag2010, OsborneEtAl2012, SteinerEtAl2013]
       Summarisation: [GeorgescuEtAl2013, WhitingEtAl2012]
       Evaluation (ground-truth): [McMinnEtAl2013]
     • Event Visualisation [WattenbergEtAl2007]
     • Temporal Semantics: Entity/Fact Extraction [WangEtAl2010, BalogNorvag2012]
     • Temporal Query Intent Modelling
       Ambiguous intents: [ZhouEtAl2013]
       Multi-faceted intents: [WhitingEtAl2013]
     There are many opportunities…
  7. Wikipedia Characteristics
     • How quickly does Wikipedia reflect the world?
     • What topic coverage does it offer?
     • Is Wikipedia content high-quality? Can it be trusted?
  8. Freshness/Timeliness
     • Latency for 'main-stream' events: very small (<30 mins? <2 hours? Depends who you ask…)
     • KBA filtering task at TREC: improve event coverage/speed
     • Pope Benedict XVI's Resignation
       EN and FR articles updated at 10:58 and 11:00
       Reuters broke the news at 10:59, following the Vatican announcement at 10:57:47
     • Whitney Houston's Death
       Reported on Twitter at 00:15 UTC by the niece of the hotel worker who found her
       Spread through Twitter, confirmed by AP via Twitter at 00:57 UTC
       WH's article updated to 'has died' at 01:01 UTC
  9. Topic Coverage
     • Not all topics are covered representatively
     • Events may only appear as a sentence or sub-section of a main article (e.g. a celebrity in a scandal)
     • Separate article(s) are created for major events, e.g. 39th G8 Summit, 2013 North India Floods
     • See Also: Response to…, Criticisms of… etc.
     • Meta-data signals quantify impact
     Reference: An Analysis of Topical Coverage of Wikipedia, Halavais and Lackaff, 2008
  10. Content Quality
     • Ideally, facts are verified by a third party through citations
     • Plenty of editorial guidelines: "Wikipedia is not a newspaper"
     • Bots make lots of changes
     • Talk pages contain temporal discourse
     • Sometimes prominent articles are locked: far fewer edits (but pre-verified)
     Period / Digest (revision text quoted as-is):
       1  {{death}} (Refers to the article 'infobox' with birth and death dates.)
       2  Houston died on February 11, 2012. Publicist Kristen Foster said Saturday that the singer had died, but the cause of her death was unknown. She died in [[Ottawa]], [[Canada]].
       3  [Similar to previous.]
       4  On February 11, 2012, publicist Kristen Foster revealed Houston had died aged 48. A cause of death was not immediately given. She died in her Beverly Hills home.
       5  [Similar to previous.]
       6  [Similar to previous.]
       7  On February 11, 2012, publicist Kristen Foster revealed Houston had died from unspecified causes at the age of 48, with unconfirmed reports suggesting her death occurred in her room at the [[Beverly Hilton Hotel]].
       8  Houston released her new album, "[[I Look to You]]", on August 2009. The album's first two singles are "I Look to You" and "Million Dollar Bill". The album entered the [[Billboard 200]] at No. 1...
       9  Local police said there were "no obvious signs of criminal intent." Two days prior to her death, witnesses reported seeing Houston behave erratically. They were rumored that she died of drug overdose.
  11. Data Sources
     Several openly available Wikipedia data sources:
     • Page APIs: easy random access to revisions etc. (slow!)
     • Article Creation/Change IRC Channels: all updates, no full-text
     • Article Creation/Change RSS/Atom Feeds: not all updates, but includes full-text content
     • XML Article Dumps (monthly): all article/page revisions (EN is 7TB decompressed!), or current article revision only; need a cluster to derive more useful datasets
     • Page View Dumps (hourly): measure of article popularity, since end 2007; see stats.grok.se for an easier interface
     Figure: May 2013 daily article changes RSS feed volume (in log scale) for Wikipedia EN, FR, IT, DE and ES
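The hourly page view dumps listed on the slide above can be aggregated with very little code. The sketch below is a minimal illustration, assuming each dump line is space-separated as `project page_title view_count bytes` (for example `en Arab_Spring 42 123456`); that layout, the function names and the sample lines are assumptions made for this example and are not taken from the slides.

```python
from collections import Counter
from urllib.parse import unquote


def parse_pagecount_line(line):
    """Parse one dump line, ASSUMED to look like 'en Arab_Spring 42 123456'."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None  # skip malformed lines rather than guess at their meaning
    project, title, views, _size = parts
    if not views.isdigit():
        return None
    # Titles in the dumps may be percent-encoded, so decode them
    return project, unquote(title), int(views)


def views_for_project(lines, project="en"):
    """Sum the view counts per article title for one project code."""
    totals = Counter()
    for line in lines:
        parsed = parse_pagecount_line(line)
        if parsed and parsed[0] == project:
            totals[parsed[1]] += parsed[2]
    return totals


# Made-up sample lines, for illustration only
sample = [
    "en Arab_Spring 42 123456",
    "en Arab_Spring 8 2345",
    "fr Printemps_arabe 5 999",
    "broken line",
]
hourly = views_for_project(sample)
print(hourly["Arab_Spring"])
```

Summing such hourly totals per day gives the daily page view series used later in the talk.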
  12. Current Events Portal
     • Manually curated list of recent/ongoing mainstream events
     • Ad-hoc taxonomy, e.g. finance, sports, deaths, politics etc.
     • Used as a ground-truth for automated TDT evaluation
     • May 2013: Avg. 15 (±6) articles per day
  13. Temporal Expressions
     • Using a temporal tagger (e.g. HeidelTime)
     • Extracted dates in article content: YEAR, MONTH-YEAR and DAY-MONTH-YEAR
     • Visualises past and future time coverage
     • 9/11, 2001 is a large spike
     • 1st/2nd World Wars also prominent
     • 'Recentism': biased coverage of recent information
     Figure: Year mentions in Wikipedia English from 1900 to 2020
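A rough feel for the year-mention histogram on the slide above can be had without a full temporal tagger. The sketch below is not HeidelTime; it is a crude regular-expression stand-in that only counts bare four-digit years between 1900 and 2020, and the sample text is invented for illustration.

```python
import re
from collections import Counter

# Years 1900-2020, not embedded in a longer run of digits
YEAR_RE = re.compile(r"(?<!\d)(19\d{2}|20[01]\d|2020)(?!\d)")


def count_year_mentions(text):
    """Count YEAR-granularity mentions; MONTH-YEAR and DAY-MONTH-YEAR are ignored."""
    return Counter(int(year) for year in YEAR_RE.findall(text))


# Invented sample text, for illustration only
sample = (
    "The 2001 attacks came decades after 1914 and 1939. "
    "In 2001 the article was created; projections run to 2020. "
    "Catalogue number 1920012345."
)
counts = count_year_mentions(sample)
print(sorted(counts.items()))
```

A real tagger also resolves relative and implicit expressions ("last year", "the following March"), which a pattern like this cannot do.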
  14. Page Edit Stream
     • Derived from historic revision dumps, RSS or IRC feeds
     • Changed text can be mined for summaries, inc. references
     • Look for links, sections, images in markup
     Figure: 'Arab Spring' daily article edit frequency and length (in characters) since 27th January 2011 (to 23rd March 2012)
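The daily edit frequency and article length series described on the slide above can be derived from a list of revisions. The sketch below is a minimal illustration, assuming each revision is available as a `(timestamp, length_in_characters)` pair with timestamps formatted like `2011-01-27T10:00:00Z`; the function name and the sample values are made up for this example.

```python
from datetime import datetime


def daily_edit_stats(revisions):
    """Map each day to (number of edits, article length after the day's last edit)."""
    stats = {}
    # ISO-style timestamps in one fixed format sort chronologically as strings
    for timestamp, length in sorted(revisions, key=lambda rev: rev[0]):
        day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date()
        edits_so_far, _previous_length = stats.get(day, (0, 0))
        stats[day] = (edits_so_far + 1, length)
    return stats


# Made-up revisions, deliberately out of order
revisions = [
    ("2011-01-27T18:30:00Z", 1850),
    ("2011-01-27T10:00:00Z", 1200),
    ("2011-01-28T09:15:00Z", 2400),
]
stats = daily_edit_stats(revisions)
for day, (edits, length) in sorted(stats.items()):
    print(day, edits, length)
```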
  15. Temporal Article Structure
     • Changes in article (sub-)sections
     • Finer-grained interest over time: people edit what is changing
     • Evolving section hierarchy: a temporal directed acyclic graph
     Figure: Cumulative 'Arab Spring' article section edit frequency since 27th January 2011
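Section changes between two revisions, as discussed on the slide above, can be approximated by comparing heading outlines. The sketch below assumes standard wiki heading markup (`== Title ==`, `=== Subtitle ===`); it only reports added and removed headings and does not build the full temporal graph. Function names and sample text are hypothetical.

```python
import re

# A heading line: the same run of 2-6 '=' characters on both sides of the title
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def section_outline(wikitext):
    """Return (level, title) pairs in document order."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(wikitext)]


def section_changes(old_text, new_text):
    """Return (added, removed) headings between two revisions."""
    old = set(section_outline(old_text))
    new = set(section_outline(new_text))
    return sorted(new - old), sorted(old - new)


# Invented revision texts, for illustration only
old_rev = "Intro\n== Background ==\ntext\n== Protests ==\n"
new_rev = (
    "Intro\n== Background ==\n== Protests ==\n"
    "=== Tunisia ===\n=== Egypt ===\n== Aftermath ==\n"
)
added, removed = section_changes(old_rev, new_rev)
print(added, removed)
```

Applying this to consecutive revisions and counting how often each heading's text changes would give a per-section edit frequency of the kind plotted in the figure.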
  16. Temporal Link Graph
     • Links created using [[article/redirect|name]] Wiki markup
     • Need to be careful with namespaces, languages, link naming and redirects
     • Can also include external 'citation' links
     Figure: Cumulative 'Arab Spring' article in- and out-link degree since 27th January 2011
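The cautions on the slide above (namespaces, link naming, redirects) show up as soon as links are pulled out of markup. The sketch below extracts the out-links of one revision, assuming a small illustrative set of namespace prefixes to skip and a caller-supplied redirect map; the prefix list, the redirect entry and the sample text are all hypothetical, and language prefixes are not handled.

```python
import re

# [[target]], [[target|label]] or [[target#section|label]]
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Illustrative, incomplete list of non-article namespaces to ignore
SKIP_PREFIXES = {"file", "image", "category", "template", "help", "wikipedia"}


def normalise_title(title):
    """Underscores become spaces and the first letter is capitalised."""
    title = title.strip().replace("_", " ")
    return title[:1].upper() + title[1:]


def out_links(wikitext, redirects=None):
    """Return the set of article titles linked from one revision's markup."""
    redirects = redirects or {}
    targets = set()
    for match in LINK_RE.finditer(wikitext):
        raw = match.group(1).strip()
        if not raw:
            continue
        prefix, sep, _rest = raw.partition(":")
        if sep and prefix.strip().lower() in SKIP_PREFIXES:
            continue  # e.g. [[File:...]] or [[Category:...]]
        title = normalise_title(raw)
        targets.add(redirects.get(title, title))  # follow a known redirect
    return targets


# Invented markup and redirect map, for illustration only
sample = (
    "The [[Arab Spring]] began in [[Tunisia|Tunisian]] towns; see "
    "[[Egyptian Revolution of 2011#Timeline|timeline]], "
    "[[File:Protest.jpg|thumb|A protest]], [[Category:2011 protests]], "
    "[[tunisia]] and the [[Jasmine Revolution]]."
)
links = out_links(sample, redirects={"Jasmine Revolution": "Tunisian Revolution"})
print(sorted(links))
```

Running this over successive revisions and taking the size of the accumulated set gives a cumulative out-degree series; in-degree needs the same pass over every other article.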
  17. Page View Stream
     • Page views are very sensitive
     • Little correlation between page edit and viewing activity
     • More edits than interest at first
     • Correlations between articles are interesting [CiglanNorvag2010]
     Figure: 'Arab Spring' article daily edit frequency and page views since 27th January 2011 (to 23rd March 2012)
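One simple way to quantify the relationship between edit and view activity mentioned on the slide above is a Pearson correlation over aligned daily series. The slides do not say which measure was used, so this is only an illustrative choice, and the two series below are toy values, not real Wikipedia data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


# Toy daily counts for one article, aligned by day
daily_edits = [40, 35, 20, 12, 8, 5, 4]
daily_views = [900, 1500, 4000, 3800, 2500, 2600, 2400]
r = pearson(daily_edits, daily_views)
print(round(r, 3))
```

The same function applied to the view series of two different articles gives the kind of article-to-article correlation the slide calls interesting.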
  18. Final Remarks
     • Wikipedia temporal datasets cover a wide range of events, culture and phenomena
     • Temporal meta-data and content signals are openly available
     • Informative power: hugely valuable for time-aware IR research
     • Probably won't beat Twitter for speed, but Wiki has structure and quality control
     • Many open research questions and opportunities for time-aware IR!
     I have various distilled datasets with me (and can arrange download + C# MapReduce code), 120 GB in total, or a selection:
     • ArticleEditTimestamps
     • SampleEventSummarisation
     • DisambiguationPages
     • TemporalLinkGraphWithSections
     • RedirectPages
     • TemporalSectionChanges
     • TimeExpressions
  19. Some Research Questions
     1. How fast does Wikipedia respond to events of different types and in different countries?
     2. How can Wikipedia data supplement query log, Twitter and news feed streams to improve time-aware IR?
     3. What do temporal correlations between linked article page views mean, and is this reflected in the text content?
     4. Can event similarity be measured on temporal and topical dimensions?
     5. Can this temporal knowledge be used to predict interest in topics that become associated in similar ways? (E.g. actors selected by famous shows, or directors etc.)
