
Automated News Suggestions for Populating Wikipedia Entity Pages

Presentation given at the Wikimedia Research Showcases, September 2015.

Published in: Science


  1. 1. Automated News Suggestions for Populating Wikipedia Entity Pages
     Besnik Fetahu, Katja Markert and Avishek Anand
  2. 2. Outline
     1. Introduction
     2. Problem Definition & Motivation
     3. News Suggestion Approach
        • Article-Entity Placement
        • Article-Section Placement
     4. Experimental Setup & Evaluation
     5. Results
     6. Conclusions
  3. 3. Introduction
     • Human fatalities: 10k (Odisha cyclone) vs. 1.8k (Hurricane Katrina)
     • Estimated damages: $4.5 billion vs. $108 billion
     • ‘Odisha cyclone’ has no coverage in the entity page of its location, ‘Odisha’
     • ‘Hurricane Katrina’ finds broad coverage in the entity page of its location, ‘New Orleans’
     [Images: New Orleans, Odisha, Hurricane Katrina, Odisha Cyclone]
  4. 4. Introduction
     • Entities consist of facts and statements supported by external references.
     • News articles are authoritative sources of emerging facts and events.
     • There is a delay between the reporting of an event in news and its inclusion in entity pages [1].
     • Long-tail entities have an incomplete section structure.
     • This has several implications for real-world applications that make use of Wikipedia, e.g. KB maintenance and entity disambiguation.
     [1] “How much is Wikipedia lagging behind news?” Besnik Fetahu, Abhijit Anand and Avishek Anand, WebSci’15, Oxford, UK.
  5. 5. Motivation: News Density in Wikipedia
     [Figure: share of each citation type (news, book, court, journal, web, thesis) per entity type, from ComicsCreator through Location; y-axis from 0 to 1]
     • Citation templates (‘news’, ‘books’, ‘web’, ‘journal’ etc.)
     • ‘web’ citations account for ~60% and ‘news’ citations for ~20% of all citations
     • On average there are ~6.5 news citations per entity
     • On average a news article is assigned to ~1.3 entities
     • The most cited news article is cited by 81 entities
     [1] “How much is Wikipedia lagging behind news?” Besnik Fetahu, Abhijit Anand and Avishek Anand, WebSci’15, Oxford, UK.
  9. 9. Problem Definition
     • News (pub. date: t_k) vs. entity pages (rev. date: t_{k-1})
     • News article: news title, headline, paragraphs, named entities
     • Entity page: section template, categories, entities (anchors), …
     • Suggest news n to entity e?
     • Specify the section in e for n
  13. 13. Approach: Automated news suggestion to entity pages
      • Input news article (example): “Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival.”
      • Feature extraction over the entities of the news article and the sections of the Wikipedia entity page
      • Task#1, article-entity placement: Odisha, Bay of Bengal, Phailin
      • Task#2, article-section placement, one classifier per entity type: [state]:geography, [city]:climate, …
  14. 14. News Suggestion Attributes: Task#1 Entity Salience
      Entity Salience: Relative Entity Frequency
      • reward entities appearing throughout the text
      • reward entities appearing in the top paragraphs
      • weigh an entity w.r.t. its co-occurring entities
      Example entities: Nikola Tesla, Elon Musk, Larry Page, John B. Kennedy. Tesla is a central concept in the given news article.
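The three salience bullets can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the formula from the talk: the function name `relative_entity_frequency`, the `decay` parameter, and the position weighting `1 / (1 + decay * position)` are all hypothetical choices made here.

```python
from collections import Counter


def relative_entity_frequency(paragraph_entities, decay=0.5):
    """Score entities by position-weighted frequency, normalised over all entities.

    paragraph_entities: list of lists, the entity mentions found in each
    paragraph, in document order (hypothetical input format).
    """
    weighted = Counter()
    for position, mentions in enumerate(paragraph_entities):
        # Earlier paragraphs get a larger weight (assumed weighting scheme).
        weight = 1.0 / (1.0 + decay * position)
        for entity in mentions:
            weighted[entity] += weight
    total = sum(weighted.values())
    if total == 0:
        return {}
    # Dividing by the total weighs each entity against its co-occurring entities.
    return {entity: score / total for entity, score in weighted.items()}


paragraphs = [["Tesla", "Elon Musk"], ["Tesla"], ["Tesla", "Larry Page"]]
scores = relative_entity_frequency(paragraphs)
most_salient = max(scores, key=scores.get)
```

An entity mentioned in every paragraph accumulates weight from each one, while an entity mentioned only near the end receives a small share of the total.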
  15. 15. News Suggestion Attributes: Task#1 Relative Entity Authority
      • entities with ‘low authority’ have a lower entry barrier for a news article
      • a news article in which an entity co-occurs with ‘high authority’ entities conveys the importance of the news
      • entity authority as an a priori probability or any centrality-based measure
      Example entities: Elias Taban, Hillary Clinton
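One way to read "authority as an a priori probability" is the fraction of corpus documents that mention an entity. The sketch below follows that reading; the function names, the corpus format (a list of entity sets), and the choice to report the share of co-occurring entities with a higher prior are assumptions for illustration only.

```python
def prior_probability(entity, corpus):
    """A priori probability: fraction of corpus documents mentioning the entity."""
    if not corpus:
        return 0.0
    return sum(1 for doc in corpus if entity in doc) / len(corpus)


def relative_authority(entity, co_occurring, corpus):
    """Share of co-occurring entities whose prior exceeds the target entity's prior."""
    others = [other for other in co_occurring if other != entity]
    if not others:
        return 0.0
    own_prior = prior_probability(entity, corpus)
    higher = sum(1 for other in others if prior_probability(other, corpus) > own_prior)
    return higher / len(others)


# Toy corpus: each document is the set of entities it mentions (made-up data).
corpus = [
    {"Hillary Clinton", "Elias Taban"},
    {"Hillary Clinton"},
    {"Hillary Clinton", "Barack Obama"},
    {"Barack Obama"},
]
taban_score = relative_authority("Elias Taban", ["Hillary Clinton"], corpus)
```

A high value means the entity appears alongside entities that are more prominent than itself, which is the situation the slide describes as lowering the entry barrier.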
  16. 16. News Suggestion Attributes: Task#1 Novelty & Redundancy
      • novelty is measured w.r.t. previously added news articles in an entity page
      • major events have wide coverage in news media
      • place the news article into the correct section
      [Figure: Novelty and Redundancy Measure, computed against previously added news articles]
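Novelty with respect to previously added articles could be sketched as one minus the highest similarity to any of them. The talk does not state the measure used, so the Jaccard token overlap below is a stand-in, and `tokenize` and `novelty` are hypothetical helper names.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately simple tokenizer)."""
    return set(text.lower().split())


def novelty(new_article, previous_articles):
    """1 minus the maximum Jaccard similarity to any previously added article."""
    new_tokens = tokenize(new_article)
    max_similarity = 0.0
    for previous in previous_articles:
        prev_tokens = tokenize(previous)
        union = new_tokens | prev_tokens
        if not union:
            continue
        similarity = len(new_tokens & prev_tokens) / len(union)
        max_similarity = max(max_similarity, similarity)
    return 1.0 - max_similarity


already_cited = ["cyclone hits coast"]
score = novelty("cyclone hits odisha", already_cited)
```

Taking the maximum rather than the mean means a single near-duplicate among the existing references is enough to mark the new article as redundant.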
  18. 18. Approach: When to suggest a news article to an entity page?
      • Entity Salience: the entity is a central concept in a given news article
      • Relative Entity Authority: the entity co-occurs with entities of higher authority (e.g. a priori probability on a given corpus)
      • Novelty: the news article in which an entity occurs brings novel information to its entity page in Wikipedia
      • Section: the news article fits one of the main sections of an entity in Wikipedia. Expand the entity page with a new section if that section doesn't exist already (Task#2)
  19. 19. Approach: News Suggestion
      • Input news article (example): “Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival.”
      • Feature extraction over the entities of the news article and the sections of the Wikipedia entity page
      • Task#1, article-entity placement (e.g. Odisha, Bay of Bengal, Phailin): 1. salience 2. relative authority 3. novelty
      • Task#2, article-section placement (e.g. [state]:geography, [city]:climate, …), one classifier for each entity type: 1. section templates 2. overall section fit
  22. 22. Task#2: Section-template generation
      • Section templates per entity type
      • Pre-determined number of main sections
      • Canonicalize sections
      • Generate ‘complete’ section templates based on similar entities
      • Cluster based on the X-means [3] algorithm
      Example entities: Germanwings, Adria, Lufthansa
      [3] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.
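The canonicalization and "pre-determined number of main sections" steps can be sketched as follows. Note that this swaps in plain frequency counting across pages of the same entity type and does not implement the X-means clustering named on the slide; `canonicalize`, `section_template`, and the toy section lists are hypothetical.

```python
from collections import Counter


def canonicalize(title):
    """Lowercase a section title and collapse surrounding or repeated whitespace."""
    return " ".join(title.lower().split())


def section_template(pages, num_sections):
    """Pick the most common canonical sections across entity pages of one type.

    pages: dict mapping an entity name to its list of section titles.
    """
    counts = Counter()
    for titles in pages.values():
        # Count each canonical section at most once per page.
        counts.update({canonicalize(title) for title in titles})
    # Sort by descending count, then alphabetically for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:num_sections]]


# Made-up section lists for three airline entity pages.
airline_pages = {
    "Germanwings": ["History", "Fleet ", " destinations"],
    "Adria": ["history", "Fleet"],
    "Lufthansa": ["History", "Destinations", "Fleet", "Awards"],
}
template = section_template(airline_pages, 3)
```

A page that lacks one of the template sections, such as "destinations" for Adria in this toy data, is then a candidate for entity profile expansion.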
  23. 23. Task#2: Overall news-section fit
      • What is the best section to which a given news article can be appended?
      • Measure the overall similarity between n and the pre-computed sections in the section templates
      • Similarity aspects between news articles and sections:
        • Topic similarity (LDA models over the sections and news documents)
        • Syntactic similarity
        • Lexical similarity
        • Entity-based similarity (overlap of named entities)
        • Frequency
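A reduced sketch of the section-fit idea, covering only two of the listed aspects (lexical and entity-based similarity) and combining them by a plain average. The talk trains a classifier over such features, so the averaging, the function names, and the toy section data are assumptions; topic, syntactic, and frequency features are left out.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_section(news_tokens, news_entities, sections):
    """Return (section name, score) for the section most similar to the article.

    sections: dict mapping a section name to a (token set, entity set) pair.
    """
    best_name, best_score = None, -1.0
    for name, (section_tokens, section_entities) in sections.items():
        lexical = jaccard(news_tokens, section_tokens)
        entity_based = jaccard(news_entities, section_entities)
        score = (lexical + entity_based) / 2.0  # assumed equal weighting
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Made-up section profiles for a location entity.
sections = {
    "geography": ({"coast", "bay", "river"}, {"Bay of Bengal"}),
    "climate": ({"cyclone", "storm", "monsoon"}, {"Phailin", "Bay of Bengal"}),
}
name, score = best_section(
    {"cyclone", "storm", "coast"}, {"Phailin", "Bay of Bengal"}, sections
)
```

In a learned setting each aspect would be a separate feature, letting the classifier for an entity type decide how much weight each one deserves.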
  24. 24. Evaluation Strategy
      What constitutes the ground truth for such a task?
      Challenges
      • ‘Invasive’: add news articles and wait for a time period until they are either accepted or deleted by the Wikipedia editors
      • Long-tail vs. trunk entities: long-tail entities might not be of particular interest to editors, hence many ‘false positives’ will go unnoticed
      • Crowdsourcing: it is challenging to find knowledgeable workers for long-tail entities
      Approach
      • Use already referenced news articles from entity pages
      • Avoid the uncertainty of judgements and expertise of crowd workers
      • Non-invasive approach for entity pages
      • Reusable test bed for similar approaches
  25. 25. Experimental Setup
      Datasets
      • [Figure: distribution of news articles, entities, and sections across the years]
      Evaluation Plan
      • train on years [t_0, t_i], test on (t_i, t_k]
      • P/R/F1 metrics
      Baselines
      • Task#1: AEP
        • B1: AEP based on Dunietz and Gillick
        • B2: AEP if the entity appears in the news title
      • Task#2: ASP
        • S1: ASP based on max similarity to one of the sections
        • S2: ASP to the most frequent section
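The evaluation plan and baseline B2 can be sketched in a few lines. The record format (dicts with a `year` key), the function names, and the case-insensitive substring match for B2 are assumptions made for illustration; the talk does not give these details.

```python
def temporal_split(records, train_end_year):
    """Train on records up to and including train_end_year, test on later ones."""
    train = [r for r in records if r["year"] <= train_end_year]
    test = [r for r in records if r["year"] > train_end_year]
    return train, test


def precision_recall_f1(predicted, relevant):
    """Set-based precision, recall and F1 for suggested vs. ground-truth pairs."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def title_baseline(entity, news_title):
    """B2-style rule: suggest the article if the entity name occurs in the title."""
    return entity.lower() in news_title.lower()


records = [{"year": 2009}, {"year": 2010}, {"year": 2011}]
train, test = temporal_split(records, 2010)
metrics = precision_recall_f1({"a", "b", "c"}, {"b", "c", "d"})
```

Splitting by year rather than at random keeps the test articles strictly later than the training articles, matching the setting where a suggestion is made against an earlier revision of the entity page.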
  26. 26. Task#1: Article-Entity Placement
      [Figure: precision-recall curves comparing B1 with B1+F_e (2009) and with the feature groups B1+All, B1+Authority, B1+Novelty, B1+Salience; panel titles: Performance, Robustness, Feature Analysis, Number Instances]
  27. 27. Task#2: Article-Section Placement
      [Figure: average precision for ASP across entity classes, 2009 to 2014, comparing Fs and S2]
      [Figure: entity profile expansion, missing section ratio from 2010 to 2014 for Person, Organization, Creative Work and Location]
  28. 28. Conclusions
      • Two-stage news suggestion approach for Wikipedia entity pages
      • Model and define what makes a good news suggestion
      • Model functions for salience, relative authority, novelty and section placement defined as attributes for a ‘good news suggestion’
      • Entity profile expansion
      • Extensive evaluation over 350k news articles, 73k entity pages and for the different Wikipedia states between 2009 and 2014
      • A publicly available and reusable test bed for similar tasks
  29. 29. Thank you! Questions?
      For more: Twitter: @FetahuBesnik, Web: http://l3s.de/~fetahu/
