Approaches for Improving and Enriching Textual Knowledge Bases

These are the slides of my PhD defense presentation at the Leibniz University of Hannover, 8.11.2017.

Published in: Science

  1. 1. Approaches for Improving and Enriching Textual Knowledge Bases. Besnik Fetahu. PhD Defense, 8th of November 2017, Hannover, Germany
  2. 2. What is a textual knowledge base?
  3. 3. Wikipedia as a textual knowledge base. Wikipedia is a free online encyclopedia with the aim to allow anyone to edit articles. It is the largest and most popular general reference work on the Internet, is ranked the 5th most popular website, and is owned by the nonprofit Wikimedia Foundation. [Figure: the Wikipedia article "University of Hannover", annotated with its infobox, section template, and section text]
  4. 4. Wikipedia Editor Collaboration Dynamics
      • Localized Wikipedias: ~40 mill. articles, 293 languages
      • Wikipedia editors: 32 mill. editors (English Wikipedia only)
      • Editor profiles, e.g. lang: {English}, topic: {Education, Politics}
      • Wikipedia revisions (example revision history):
          02:51, 5 October 2017, Brilliantwiki2: 14,516 bytes (+69), →Rankings
          00:30, 21 August 2017, Blueclaw: 14,447 bytes (+83), →Alumni: added Flügge-Lotz
          09:03, 18 June 2017, 77.23.196.148: 14,364 bytes (-2), →History: pupils -> students, today and now in same sentence corrected
          05:35, 10 June 2017, AnomieBOT: 14,366 bytes (+319), Rescuing orphaned refs ("Mitarbeiter und Etat" from rev 782668206; "Studierende" from rev 782668206)
          01:24, 10 June 2017, Mephistolus (minor edit): 14,047 bytes (+9), Tag: Visual edit
          01:21, 10 June 2017, Mephistolus: 14,038 bytes (-89), Update infobox, Tag: Visual edit
  5. 5. Wikipedia Dynamics and Growth • Wikipedia and its sister projects develop at a rate of over 10 edits per second, performed by editors from all over the world. • English Wikipedia has an average growth rate of 600 new articles per day. [Figure: Wikipedia’s Daily Growth Rate]
  6. 6. Editorial Policies in Wikipedia • Wikipedia is written from a neutral point of view. • Content in Wikipedia must be verifiable. The burden of evidence lies with the editor who adds content to a page. • No original research, i.e. no content (such as facts, allegations, and ideas) for which no reliable, published sources exist.
  7. 7. Why Wikipedia?
  8. 8. Quality in Wikipedia • Comparable quality to Britannica [1] • Verifiable content through third-party external sources • Up-to-date information on emerging entities and events [2]
      [1] Giles, Jim. "Internet encyclopaedias go head to head." Nature (2005): 900-901.
      [2] Keegan, Brian, Darren Gergle, and Noshir Contractor. "Hot off the wiki: Structures and dynamics of Wikipedia’s coverage of breaking news events." American Behavioral Scientist 57, no. 5 (2013).
  10. 10. Importance and Use of Wikipedia
      Rank  Site           Daily Time on Site  Daily Pageviews per Visitor  % of Traffic from Search  Total Sites Linking In
      1     google.com     8:02                8.93                         4.30%                     3.56 mill.
      2     youtube.com    8:27                4.98                         15.40%                    2.69 mill.
      3     facebook.com   9:48                4.01                         8.30%                     7.6 mill.
      4     baidu.com      7:56                6.36                         8.50%                     1.3 mill.
      5     wikipedia.org  4:11                3.28                         68.40%                    1.7 mill.
  12. 12. From Wikipedia to Structured Data and Search • Web Search (Google Knowledge Cards) • Voice Search
  13. 13. Use Cases: Question Answering. Q: When did the 1973 oil crisis begin? A: October 1973. Reference: Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes: Reading Wikipedia to Answer Open-Domain Questions. ACL 2017.
  14. 14. What are the issues in Wikipedia?
  15. 15. Issues with Verifiability in Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia
  17. 17. Issues with Verifiability in Wikipedia. Unreliable news source: https://www.theguardian.com/technology/2017/feb/08/wikipedia-bans-daily-mail-as-unreliable-source-for-website
  18. 18. Issues with Verifiability in Wikipedia. Q: Where was Obama born?
      • Passage: "In 2012, Breitbart.com published a copy of a promotional booklet that Obama's literary agency, Acton & Dystel, printed in 1991 (and later posted to their website, in a biography in place until April 2007) which misidentified Obama's birthplace and states that Obama was "born in Kenya and raised in Indonesia and Hawaii."" Answer: Kenya.
      • Passage: "Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii. He is the first President to have been born in Hawaii, making him the first President born outside of the contiguous 48 states. He was born to a white mother and a black father. His mother, Ann Dunham (1942–1995), was born in Wichita, Kansas, of mostly English descent, with some German, Irish, Scottish, Swiss, and Welsh ancestry." Answer: Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii.
      Reference: Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi: Bidirectional Attention Flow for Machine Comprehension. CoRR abs/1611.01603 (2016)
  20. 20. Misalignment of Editor Efforts in Wikipedia: “Odisha Cyclone” vs. “Hurricane Katrina” • Human fatalities: 10k vs. 1.8k • Estimated damages: $4.5 billion vs. $108 billion • “Odisha Cyclone” has no coverage or mention in the Wikipedia article “Odisha” • “Hurricane Katrina” finds broad coverage in the Wikipedia article “New Orleans”
  21. 21. Challenges addressed in this thesis
  22. 22. Challenges and Contributions • For an arbitrary statement in Wikipedia, how can we find citations which provide evidence for it? • For a paragraph in Wikipedia and an existing citation, how can we determine the exact span of the citation? • For a Wikipedia page and a given news corpus, how can we find and suggest important and novel information for a page?
  23. 23. Part (I): Finding news citations for Wikipedia entity pages. Besnik Fetahu, Katja Markert, Wolfgang Nejdl, Avishek Anand: “Finding News Citations for Wikipedia”. CIKM 2016: 337-346
  24. 24. [Overview figure: a news collection and a textual knowledge base, both evolving over time (t1, t2, …, tn), connected by the tasks Citation Recommendation, Citation Span, and News Suggestion (Entity Placement, Section Placement). Running example: the entity “Barack Obama”. The statement "Obama was born on August 4, 1961, at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii." is used as a query for news, and for determining the span of citation c4. A news article published after t2 on the 2009 Nobel Peace Prize is suggested for the entity, and the section list of the page gains a new section “Nobel Peace Prize” between t2 and t3. News articles carry: publish date, headline, body, entity mentions (e.g. “Barack Obama, Nobel Prize”). Knowledge base entities carry: revision date, entity title, sections, section text, categories, citations.]
  27. 27. Finding news citations for Wikipedia entity pages
      • Wikipedia articles, section chunking: 1. Early life and career 2. Presidential campaigns 3. Presidency (2009-2017) 4. Post-presidency (2017-present) 5. Legacy 6. Books written
      • Statements extraction: "Obama was born on August 4, 1961,[c] at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii.[c] He is the first President to have been born in Hawaii,[c] making him the first President born outside of the contiguous 48 states.[c] […] His father, Barack Obama Sr. (1936–1982), was a married Luo Kenyan man from Nyang'oma Kogelo. Obama's parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on scholarship.[c] […]"
      • Example statement: "Obama's parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on scholarship.[c]"
      • Task 1, Statement Categorization: does the statement require a news citation? Yes.
      • Task 2, Citation Discovery: find a citation for the statement.
  28. 28. Task#1: Statement Categorization
  30. 30. Why Statement Categorization?
      Citation types in Wikipedia:
      type            description
      arXiv           arXiv preprints
      AV media        audio and visual
      AV media notes  audio and visual liner notes
      bioRxiv         bioRxiv preprints
      book            books
      conference      conference papers
      encyclopedia    edited collections
      episode         radio or television collections
      interview       interviews
      journal         academic journals and papers
      magazine        magazines, periodicals
      mailing list    public mailing lists
      map             maps
      news            news articles
      newsgroup       online newsgroups
      podcast         audio or video podcast
      press release   press releases
      report          reports
      serial          audio or video serials
      sign            sign, plaques
      speech          speeches
      techreport      technical report
      thesis          theses
      web             any resource accessible through a [...]
      [Figure: Citation Types in Wikipedia. Share (0 to 1) of the citation types news, book, court, journal, web, and thesis per entity type, e.g. Politician, Athlete, Film, Scientist, Location.]
      Reference: Besnik Fetahu, Abhijit Anand, Avishek Anand: “How much is Wikipedia Lagging Behind News?” WebSci 2015: 28:1-28:9
      • Citations of type web and news account for the absolute majority of citations in Wikipedia
      • Citations of type news are considered a “reliable, published source”
      • Depending on the context and added information in Wikipedia, different citation types are preferred
  31. 31. Why Statement Categorization?
      • cite type = “news”: "Obama emphasized issues of rapidly ending the Iraq War, increasing energy independence, and reforming the health care system,[1] in a campaign that projected themes of hope and change.[2]"
          1. “Barack Obama on the Issues: What Would Be Your Top Three Overall Priorities If Elected?". The Washington Post.
          2. “The Obama promise of hope and change". The Independent. London. November 1, 2008.
      • cite type = “report”: "On June 3, 2008, Senator Obama, along with Senators Tom Carper, Tom Coburn, and John McCain, introduced follow-up legislation: Strengthening Transparency and Accountability in Federal Spending Act of 2008.[1]"
          1. "S. 3077: Strengthening Transparency and Accountability in Federal Spending Act of 2008: 2007–2008 (110th Congress)". Govtrack.us. June 3, 2008.
      • cite type = “book”: "In mid-1988, he traveled for the first time in Europe for three weeks and then for five weeks in Kenya, where he met many of his paternal relatives for the first time.[1][2]"
          1. Obama, Auma (2012). And then life happens: a memoir. New York: St. Martin's Press. pp. 189–208, 212–216. ISBN 978-1-250-01005-6.
  36. 36. Task#1: Statement Categorization. For a given Wikipedia statement, categorize it through a supervised model into one of the predefined citation types (e.g. “news”, “web” etc.).
      • Example statement: "Obama emphasized issues of rapidly ending the Iraq War, increasing energy independence, and reforming the health care system,[c] in a campaign that projected themes of hope and change.[c]"
      • Pipeline components: feature extraction (Wikipedia language style, Wikipedia entity structure), feature representation, multi-class classification
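
The categorization step above can be sketched as a small supervised multi-class classifier. This is only an illustration under loud assumptions: the slides report RandomForests over language-style and entity-structure features, whereas the sketch below swaps in a bag-of-words multinomial Naive Bayes so that it runs with the Python standard library alone. The class name `NaiveBayesCategorizer` and the toy training statements are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens; a stand-in for the richer features on the slide."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one smoothing (not the slides' model)."""

    def fit(self, statements, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(statements, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, statement):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(statement) if t in self.vocab]
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data: (statement, citation type).
TRAIN = [
    ("The senator announced his candidacy at a rally on Saturday", "news"),
    ("The minister announced new sanctions on Friday", "news"),
    ("In his memoir he recalls his childhood in Hawaii", "book"),
    ("The memoir describes his early childhood and family", "book"),
]
model = NaiveBayesCategorizer().fit(
    [statement for statement, _ in TRAIN],
    [label for _, label in TRAIN],
)
print(model.predict("She announced her candidacy on Friday"))
```

With real data, the token counts would be replaced by the language-style and entity-structure features named on the slide, and one model would be trained per entity type.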
  37. 37. Task#2: Citation Discovery
  45. 45. Task#2: Citation Discovery. For a Wikipedia statement that requires a citation of type news, find one or more news article(s) as a citation from a given news corpus.
      • Example Wikipedia statement: "On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois."
      • Pipeline: Wikipedia statement → query → news index → top-k retrieval of ranked news (1 doc1, 2 doc2, 3 doc3, …, 100 doc100) → citation discovery
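
The query and top-k retrieval step can be illustrated with a minimal sketch. The slides do not name the retrieval model behind the news index, so the scoring here (a plain sum of TF-IDF weights for the statement's terms) is an assumption, and the three-document `NEWS` corpus and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document term counts and inverse document frequencies."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    idf = {t: math.log(1 + len(docs) / df) for t, df in doc_freq.items()}
    return term_counts, idf


def top_k(statement, term_counts, idf, k=100):
    """Rank documents by the summed TF-IDF weight of the statement's terms."""
    query_terms = set(tokenize(statement))
    scores = {}
    for doc_id, counts in term_counts.items():
        score = sum(counts[t] * idf[t] for t in query_terms if t in counts)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


# Hypothetical miniature news corpus.
NEWS = {
    "doc1": "Obama announces presidential candidacy at the Old State Capitol in Springfield",
    "doc2": "Springfield prepares for winter storm this weekend",
    "doc3": "Stock markets close higher on Friday",
}
statement = (
    "On February 10, 2007, Obama announced his candidacy for President of the "
    "United States in front of the Old State Capitol building in Springfield, Illinois."
)
term_counts, idf = build_index(NEWS)
ranked = top_k(statement, term_counts, idf, k=2)
print(ranked)
```

The ranked candidates are only the input to citation discovery; deciding which of them is a suitable citation is a separate, supervised step.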
  51. 51. Task#2: Citation Discovery Properties
      • Wikipedia statement: "On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois."
      • News sentence (from the candidate news article): "Speaking in a single-digit, morning chill and sunshine to thousands of supporters outside the Old State Capitol, the first-term Democratic senator delivered an address that built upon his biography as a community organizer in Chicago, state legislator and U.S. senator to call for quick action on issues ranging from bringing a close to the Iraq war to the need for universal health care and an end to foreign-oil dependence."
      Properties of a good citation:
      1. Entailment: the statement should be entailed by the news article
          • Lexical similarity between sentence and statement • Language model similarity • TreeKernel similarity • Query and news article
      2. Centrality: the statement is central in the news article
          • TextRank for measuring sentence centrality in a news article • Entailment feature scores w.r.t. the most central sentence in a news article
      3. Authority: the cited news article should be from an authoritative source
          • Entity type specific news citation suggestion • Authority of news domains on specific entity types
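
Two of the feature groups above, lexical similarity and TextRank-style centrality, can be sketched as follows. This is a simplified illustration, not the feature set from the thesis: Jaccard token overlap stands in for the lexical similarity, the sentence graph is weighted with the same Jaccard score rather than the original TextRank similarity, and language-model, tree-kernel, and authority features are left out. All function names and the example sentences are hypothetical.

```python
import re


def tokens(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Lexical similarity: shared tokens over all tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Sentence centrality via PageRank over a Jaccard-weighted sentence graph."""
    n = len(sentences)
    weights = [
        [jaccard(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def citation_features(statement, article_sentences):
    """Entailment-style and centrality-style features for one candidate article."""
    scores = textrank(article_sentences)
    central = max(range(len(scores)), key=scores.__getitem__)
    return {
        "max_lexical_similarity": max(jaccard(statement, s) for s in article_sentences),
        "similarity_to_central": jaccard(statement, article_sentences[central]),
        "central_sentence": central,
    }


# Hypothetical candidate article, already split into sentences.
article = [
    "Obama announced his candidacy in Springfield",
    "Obama spoke about health care",
    "Crowds gathered in Springfield",
]
features = citation_features(
    "Obama announced his candidacy for President in Springfield", article
)
print(features)
```

In this toy article the first sentence shares tokens with both other sentences, so it ends up as the most central one, and the statement's similarity to it serves as the centrality-aware feature.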
  52. 52. Evaluation
  53. 53. Evaluation Datasets
      Task#1: Statement Categorization Data
      • 6.9 million Wikipedia statements
      • 8.8 million citations to external references
      • 1.6 million Wikipedia entities
      • 1.88 million news articles cited from Wikipedia statements
      Task#2: Citation Discovery Data
      • 20 million news articles from a real-world news collection (GDelt), between 2013 and 2015
      • 27k news articles cited from Wikipedia statements within the range of GDelt
      GDelt domain stats:
      news domain      news articles
      yahoo.com        1244781
      allafrica.com    1035646
      reuters.com      828133
      dailymail.co.uk  815372
      indiatimes.com   743991
      wn.com           587607
      [Figure: Wikipedia statement distribution by citation type]
  54. 54. Task#1: Statement Categorization Results. Statement categorization results based on RandomForests, with separate models per entity type along the YAGO type hierarchy (yagoLegalActorGeo).
      Parent Type      Child Type        | 1 ≤ τ ≤ 10        | 10 < τ ≤ 50       | 50 < τ ≤ 90
                                         | P     R     F1    | P     R     F1    | P     R     F1
      owl:Thing        Legal Actor Geo   | 0.48  0.36  0.41  | 0.51  0.43  0.47  | 0.53  0.47  0.50
      Legal Actor Geo  Legal Actor       | 0.51  0.34  0.41  | 0.54  0.41  0.47  | 0.56  0.45  0.50
                       location          | 0.30  0.29  0.29  | 0.34  0.40  0.37  | 0.36  0.45  0.40
      location         region            | 0.30  0.28  0.29  | 0.35  0.40  0.37  | 0.37  0.44  0.40
                       point             | 0.30  0.10  0.14  | 0.38  0.22  0.28  | 0.39  0.26  0.32
      Legal Actor      person            | 0.53  0.36  0.43  | 0.56  0.43  0.49  | 0.58  0.46  0.51
      person           preserver         | 0.63  0.31  0.42  | 0.67  0.46  0.54  | 0.67  0.49  0.57
                       authority         | 0.53  0.20  0.29  | 0.62  0.24  0.35  | 0.65  0.33  0.44
                       contestant        | 0.59  0.43  0.50  | 0.62  0.52  0.57  | 0.64  0.56  0.60
                       leader            | 0.53  0.26  0.34  | 0.59  0.34  0.43  | 0.61  0.37  0.46
                       wc Living people  | 0.55  0.37  0.44  | 0.58  0.44  0.50  | 0.59  0.47  0.52
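
As a reading aid for the P, R, and F1 columns: F1 is the harmonic mean of precision and recall. The short sketch below recomputes two F1 values from the precision and recall reported on this slide; the function name `f1_score` is only illustrative. Because the slide's P and R are already rounded to two decimals, a recomputed F1 can differ from the reported one by 0.01 in some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# owl:Thing -> Legal Actor Geo, first column group: P = 0.48, R = 0.36
print(round(f1_score(0.48, 0.36), 2))
# person -> contestant, first column group: P = 0.59, R = 0.43
print(round(f1_score(0.59, 0.43), 2))
```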
  60. 60. Task#2: Citation Discovery Results
      • B1: top-1 retrieved article from the retrieval model
      • B2: supervised model based on rank and similarity score from the search engine
      • E1: our approach with entailment, centrality, and authority features; for a statement, the correct citations are the news articles originally cited by that statement in the Wikipedia page
      • E1+FP: like E1, but false positives that have very high similarity to ground-truth articles are also considered relevant
      • E2: builds on top of E1+FP by additionally assessing the relevance of FP articles below the similarity threshold
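
The difference between the E1 and E1+FP settings can be made concrete with a small sketch. It assumes Jaccard token overlap as the similarity measure and a threshold of 0.8; neither is given on the slides, and the function names and the three article texts are hypothetical.

```python
import re


def jaccard(a, b):
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def precision_with_fp(suggested, ground_truth, texts, threshold=0.8):
    """Precision where near-duplicates of ground-truth articles count as relevant."""
    if not suggested:
        return 0.0
    relevant = 0
    for doc_id in suggested:
        if doc_id in ground_truth:
            relevant += 1  # originally cited from the statement (E1 criterion)
        elif any(jaccard(texts[doc_id], texts[gt]) >= threshold for gt in ground_truth):
            relevant += 1  # false positive that is highly similar to ground truth
    return relevant / len(suggested)


# Hypothetical article texts; "a" is the article cited in Wikipedia.
texts = {
    "a": "Obama announces candidacy in Springfield",
    "b": "Obama announces candidacy in Springfield Illinois",
    "c": "Markets close higher",
}
strict = precision_with_fp(["b", "c"], {"a"}, texts, threshold=1.1)  # E1-style
relaxed = precision_with_fp(["b", "c"], {"a"}, texts)  # E1+FP-style
print(strict, relaxed)
```

Setting the threshold above 1 disables the relaxation, which reproduces the strict E1-style count in this sketch.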
  61. 61. Part (I): Conclusion • Specific citation types are preferred based on the statement, its context, and the Wikipedia page • Statement categorization works fairly well for some entity types • It is challenging to distinguish between the citation types web and news • Citation discovery can be performed accurately across all entity types
  62. 62. Part (II): Fine Grained Citation Span for References in Wikipedia Besnik Fetahu, Katja Markert, Avishek Anand: “Fine Grained Citation Span for References in Wikipedia”. EMNLP 2017: 1980-1989
  64. 64. Citations in Wikipedia
  65. 65. Citation Span Cases
      Examples:
      • Obama was born on August 4, 1961,[5] at Kapiʻolani Maternity & Gynecological Hospital in Honolulu, Hawaii.[6][7][8]
      • On February 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol building in Springfield, Illinois.[158][159] […] Obama emphasized issues of rapidly ending the Iraq War, increasing energy independence, and reforming the health care system,[161] in a campaign that projected themes of hope and change.[162]
      • At the Democratic National Convention in Charlotte, North Carolina, Obama and Joe Biden were formally nominated by former President Bill Clinton as the Democratic Party candidates for president and vice president in the general election. Their main opponents were Republicans Mitt Romney, the former governor of Massachusetts, and Representative Paul Ryan of Wisconsin.[183]
      Cases:
      • Citation marker placed at a sub-sentence level
      • Citation marker placed at the end of a sentence
      • Citation marker placed after multiple sentences in a paragraph
  66. 66. Citation Span Task
  69. 69. Citation Span Task
      Citing paragraph: He was reelected to the Illinois Senate in 1998, defeating Republican Yesse Yehudah in the general election, and was re-elected again in 2002.[117] In 2000, he lost a Democratic primary race for Illinois's 1st congressional district in the United States House of Representatives to four-term incumbent Bobby Rush by a margin of two to one.[118]
      Textual fragments: chunk the paragraph at punctuation symbols
      Citing span: determine the citation span for reference [117] among the textual fragments
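The chunking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides say the paragraph is chunked at punctuation symbols but do not give the exact rule, so the regular expression, and the helper names `chunk_fragments` and `citation_markers`, are hypothetical.

```python
import re

# Assumed rule: a fragment ends at a punctuation symbol (or at a trailing
# citation marker such as "[117]") that is followed by whitespace.
FRAGMENT_BOUNDARY = re.compile(r'(?<=[,;:.!?\]])\s+')
CITATION_MARKER = re.compile(r'\[(\d+)\]')


def chunk_fragments(paragraph):
    """Split a citing paragraph into textual fragments."""
    return [f for f in FRAGMENT_BOUNDARY.split(paragraph.strip()) if f]


def citation_markers(fragment):
    """Return the citation numbers that appear in a fragment."""
    return CITATION_MARKER.findall(fragment)


paragraph = (
    "He was reelected to the Illinois Senate in 1998, defeating Republican "
    "Yesse Yehudah in the general election, and was re-elected again in "
    "2002.[117] In 2000, he lost a Democratic primary race for Illinois's 1st "
    "congressional district in the United States House of Representatives to "
    "four-term incumbent Bobby Rush by a margin of two to one.[118]"
)

fragments = chunk_fragments(paragraph)
for index, fragment in enumerate(fragments):
    print(index, citation_markers(fragment), fragment)
```

With this rule the example paragraph yields five fragments, and the task is then to decide, per fragment, whether it is covered by reference [117].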
  70. 70. Citation Span Approach
      Determine the span for citation c = [117]: extract features for each text fragment of the citing paragraph
      Models: Sequence Classification (linear-chain CRF), Plain Classification
      Paragraph Structure
      • Citations other than c
      • Same sentence as c
      • Same sentence as previous text fragment
      • Distance in terms of text fragments to c
      Citation Features
      • Language models per paragraph in the cited document
      • Language model similarity between the fragment and each paragraph’s LM
      Discourse/Temporal Features
      • Explicit discourse sense annotation of sentences
      • Fragments in a sentence with explicit discourse (e.g. comparison) are likely to have the same label
      • Fragments with different time points are unlikely to have the same label
  71. 71. Evaluation
  72. 72. Citation Span Dataset
      • 500 citing paragraphs, pointing to either web or news citations
      • Manual annotation of whether each textual fragment is explicitly supported or implied by the corresponding citation
      • High inter-rater agreement on a 10% sample with κ = 0.84

      span      news                               web
                dist.   skip frag.   skip sent.    dist.   skip frag.   skip sent.
      <= 0.5    11%     6%           -             6.8%    -            -
      (.5,1]    63%     -            -             63%     -            1%
      (1,2]     17%     -            8%            14%     -            19%
      (2,5]     7%      5%           18%           13.1%   -            21%
      > 5       1.8%    -            20%           3.1%    -            67%
  73. 73. Evaluation Metrics
      Mean Average Precision (MAP): $MAP = \frac{1}{|N|} \sum_{p \in N} \frac{|S' \cap S^t|}{|S'|}$
      Recall (R): $R = \frac{1}{|N|} \sum_{p \in N} \frac{|S' \cap S^t|}{|S^t|}$
      Erroneous span (word and text-fragment level), $\Delta$: $\Delta_w = \frac{1}{|N|} \sum_{p \in N} \frac{\sum_{\delta \in S' \setminus S^t} words(\delta)}{\sum_{\delta \in S^t} words(\delta)}$
      • $S'$: text fragments marked as covered by one of the approaches
      • $S^t$: text fragments marked as covered by the citation in our ground truth
      • $words(\delta)$: number of words in a text fragment $\delta$
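A small Python sketch of how these three metrics could be computed. It assumes that N is the set of evaluated citing paragraphs, that every ground-truth span is non-empty, and that each fragment is represented as an `(index, text)` pair; the function name `span_metrics` and the toy data are hypothetical.

```python
def span_metrics(paragraphs):
    """Average precision, recall and word-level erroneous span.

    paragraphs: list of (predicted, truth) pairs, one per citing paragraph.
    Each element is a set of (fragment_index, fragment_text) tuples.
    Assumes every ground-truth set is non-empty.
    """
    def words(fragments):
        return sum(len(text.split()) for _, text in fragments)

    n = len(paragraphs)
    precision = recall = delta_w = 0.0
    for predicted, truth in paragraphs:
        overlap = predicted & truth
        # An empty prediction contributes zero precision.
        precision += len(overlap) / len(predicted) if predicted else 0.0
        recall += len(overlap) / len(truth)
        # Words in fragments that were predicted but are not in the ground truth.
        delta_w += words(predicted - truth) / words(truth)
    return precision / n, recall / n, delta_w / n


# Toy example with two paragraphs.
truth_1 = {(0, "a b c,"), (1, "d e.")}
predicted_1 = {(0, "a b c,"), (2, "f g h i.")}
truth_2 = {(0, "x y.")}
predicted_2 = {(0, "x y.")}

map_score, recall_score, erroneous_span = span_metrics(
    [(predicted_1, truth_1), (predicted_2, truth_2)]
)
print(map_score, recall_score, erroneous_span)
```

Because the erroneous span is normalized by the size of the ground-truth span, it can exceed 100% when an approach marks far more text than the citation actually covers.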
  78. 78. Citation Span Analysis: Accuracy
      Figure: citation span results decomposed across the different span cases.
      • MRF: computes two functions, (i) potential and (ii) compatibility. In (i) it measures the similarity between sentences in a citing paragraph, and in (ii) the similarity between a sentence and the sentences in the citing document.
      • IC: the span of a citation is the text between two citations; it can start at the beginning of a paragraph and end at the end of a paragraph.
      • CS/CSW: for CS the span is the citing sentence; for CSW the span is the citing sentence +/- 2 sentences, depending on whether they contain specific cue words in specific locations in the sentence.
      • CSPC/CSPS: CSPC is our plain classifier with the proposed features; for CSPS we train a structured prediction model on the same feature set.
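The IC baseline is simple enough to sketch directly. The following Python is a minimal reading of the description above, not the evaluated implementation: it only covers the case where the span ends at the target marker, and the function name `ic_span` and the hard-coded fragments are hypothetical.

```python
import re

CITATION_MARKER = re.compile(r'\[(\d+)\]')


def ic_span(fragments, target):
    """Inter-citation span: all fragments after the previous citation marker
    (or from the paragraph start) up to the fragment carrying `target`."""
    start = 0
    for index, fragment in enumerate(fragments):
        markers = CITATION_MARKER.findall(fragment)
        if target in markers:
            return fragments[start:index + 1]
        if markers:
            # Another citation closes the previous span; the next one starts here.
            start = index + 1
    return []  # target marker not present in this paragraph


example_fragments = [
    "He was reelected to the Illinois Senate in 1998,",
    "defeating Republican Yesse Yehudah in the general election,",
    "and was re-elected again in 2002.[117]",
    "In 2000,",
    "he lost a Democratic primary race to Bobby Rush by a margin of two to one.[118]",
]

print(ic_span(example_fragments, "117"))
print(ic_span(example_fragments, "118"))
```

A heuristic like this always assigns every fragment before a marker to that marker, which is one way such a baseline can overshoot when the true span is shorter than a sentence.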
  79. 79. Citation Span Analysis: Erroneous Span
      Figure: bar charts of the erroneous span Δw (%) per citation span bucket; bars per approach (CSPS, CSPC, CS, CSW, IC, MRF).
      • ≤ 0.5: 9%, 11%, 872%, 274%, 258%, 480%
      • (0.5,1]: 6%, 5%, 313%, 14%, 12%, 80%
      • (1,2]: 11%, 7%, 114%, 11%, 10%, 65%
      • > 2: 45%, 26%, 96%, 17%, 16%, 68%
  80. 80. Part (II): Conclusion
      • Citation span can be accurately determined for web and news citations
      • Sequence classification achieves slightly better performance than plain classification and outperforms baseline approaches
      • Baseline approaches from the scientific domain do not generalize to Wikipedia’s language style
  81. 81. Part (III): Automated News Suggestion for Populating Wikipedia Pages Besnik Fetahu, Katja Markert, Avishek Anand: “Automated News Suggestion for Populating Wikipedia Pages”. CIKM 2015: 323-332
  96. 96. Automated News Suggestion to Entity Pages
      Daily News Articles, e.g. WSJ: As it Happened: Cyclone Reaches Orissa
      Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival. […]
      Task#1: Article Placement
      • Entity Mentions: Bay of Bengal, Odisha, India, Cyclone Phailin
      Task#2: Section Placement
      • Entity: Odisha
      • Sections: 1. Etymology 2. History 3. Geography 4. Government and politics 5. Economy 6. Transportation 7. Demographics 8. Education
      • Add section in case it is missing: Catastrophes
  97. 97. News Suggestion Attributes
      • The entity should be a central concept in the news article
      • The information in the news article should be important for the Wikipedia entity
      • The news article should contain novel or missing information for the Wikipedia entity
      • A news article should be suggested to the exact section; if no such section exists, it needs to be added
  98. 98. Task#1: Article — Placement
  99. 99. Article-Placement: News Suggestion Attributes
      Entity Salience
      • Reward entities appearing throughout the news article
      • Reward entities appearing in top-paragraphs
      • Weigh the entities w.r.t. the score of the co-occurring entities
      Relative Authority
      • Entry barrier is lower for information from news articles for entities with low authority
      • Important information for an entity can be unveiled by measuring the relative importance of its co-occurring entities
      Novelty
      • Information from a news article should be novel w.r.t. the entity under consideration
      • Measure information novelty against already cited news sources
      • Measure information novelty against the already existing content in a Wikipedia entity
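To make the salience intuition concrete, here is a toy Python score. It is not the feature definition used in this work: the weighting scheme (paragraph coverage times a position-discounted share of mentions), the function name `salience`, and the example mentions are all assumptions chosen only to mirror the three bullets under Entity Salience.

```python
from collections import Counter


def salience(entity, paragraphs):
    """Toy salience score for `entity` in a news article.

    paragraphs: list of lists of entity mentions, in document order.
    """
    if not paragraphs:
        return 0.0
    # Reward entities that appear throughout the article.
    coverage = sum(1 for mentions in paragraphs if entity in mentions) / len(paragraphs)
    score = 0.0
    for position, mentions in enumerate(paragraphs, start=1):
        counts = Counter(mentions)
        if counts[entity] == 0:
            continue
        # Weigh the entity against the co-occurring entities in the paragraph.
        share = counts[entity] / len(mentions)
        # Reward mentions in top paragraphs by discounting later positions.
        score += share / position
    return coverage * score


article = [
    ["Cyclone Phailin", "India", "Cyclone Phailin"],
    ["Odisha", "Cyclone Phailin"],
    ["India"],
]

for name in ("Cyclone Phailin", "India", "Odisha"):
    print(name, round(salience(name, article), 4))
```

In this toy article "Cyclone Phailin" scores highest because it is mentioned early and often, while "Odisha" appears once in a later paragraph.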
  100. 100. Task#2: Section Placement
  104. 104. Section-Placement: Template Generation and Section Fit
      Input: News Articles [Germanwings incident], after Article-Entity Placement to the Wikipedia Entity
      Sections of entities with typeOf Airline:
      • Germanwings: 1. History 2. Corporate affairs 3. Destinations 4. Fleet 5. Services 6. References
      • Adria: 1. History 2. Corporate affairs and Identity 3. Destinations 4. Codeshare agreements 5. Fleet 6. Services 7. Incidents and accidents 8. References
      • Lufthansa: 1. History 2. Corporate affairs and identity 3. Miles & More 4. Lounges 5. Accidents and incidents 6. Criticism 7. See also 8. Citations 9. External Links
      Section Template [Airline]: 1. History 2. Corporate affairs and Identity 3. Destinations 4. Codeshare agreements 5. Fleet 6. Services / Lounges 7. Criticism 8. Incidents and accidents 9. References
      Section Fit:
      • Content similarity of the news article w.r.t. sections in the template
      • Topic similarity of the news article w.r.t. sections in the template
      Resulting section: Incidents and Accidents
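The content-similarity part of section fit can be illustrated with a bag-of-words cosine similarity. This is a sketch under assumptions, not the model from the slides: the section texts, the news sentence, and the names `cosine` and `best_section` are made up for illustration, and a real system would combine this with topic and other features.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_section(news_text, section_texts):
    """Return the template section whose text is most similar to the news."""
    news = Counter(news_text.lower().split())
    scores = {
        title: cosine(news, Counter(text.lower().split()))
        for title, text in section_texts.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical text aggregated per template section for the type Airline.
template_sections = {
    "Fleet": "aircraft fleet airbus orders deliveries",
    "Destinations": "routes airports destinations cities",
    "Incidents and accidents": "crash accident aircraft passengers killed flight",
}
news = "germanwings flight crash killed all passengers in the accident"

section, section_scores = best_section(news, template_sections)
print(section, section_scores)
```

Scoring against a type-level template rather than the entity's own page is what allows a section that the entity does not yet have to be proposed.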
  105. 105. Evaluation
  106. 106. Evaluation Datasets
      year   #news   #entities   #sections
      2009   42707   13550       3510
      2010   78328   24953       8416
      2011   73491   23144       6581
      2012   81473   25980       8455
      2013   69079   22121       8183
      2014   29961   11088       4694
      Evaluation Plan
      Task#1 AEP: Baselines
      • B1: baseline for AEP (Dunietz and Gillick)
      Task#2 AES: Baselines
      • S2: baseline for AES (most frequent section)
      Dunietz, Jesse, and Daniel Gillick. "A New Entity Salience Task with Millions of Training Examples." In EACL, p. 205. 2014.
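      One possible reading of the S2 baseline (most frequent section) is sketched below. The slide does not say whether the frequency is counted globally or per entity type, so the global count here is an assumption, and the training pairs are hypothetical.

```python
from collections import Counter


def most_frequent_section(training_pairs):
    """Return the section label that occurs most often in the training data.

    ASSUMPTION: frequency is counted over all (article, section) pairs;
    the baseline on the slide may be computed per entity type instead.
    """
    counts = Counter(section for _, section in training_pairs)
    return counts.most_common(1)[0][0]


# Hypothetical (article id, section) training pairs.
pairs = [
    ("news-1", "History"),
    ("news-2", "Incidents and accidents"),
    ("news-3", "History"),
]
baseline_section = most_frequent_section(pairs)
print(baseline_section)
```

      Such a baseline predicts the same section for every news article, which makes it a useful lower bound for the feature-based AES model.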
  107. 107. Article-Entity and Article-Section Placement Results
      [Figure: precision-recall curves for AEP on the 2009 data; groups B1 and B1+F_e]
      • B1: baseline approach for AEP
      • AEP: B1 + entity salience, relative authority, novelty
      [Figure: Avg. Precision for AES per year, 2009 to 2014; Fs and S2]
      • S2: most frequent section
      • AES: topic, content, lexical features
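      For readers unfamiliar with the measures in these plots, the sketch below shows how precision and recall at a score threshold are computed, and how sweeping the threshold yields the points of a precision-recall curve. The scores and labels are hypothetical and are not results from the evaluation.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when items scoring >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(scores, labels, thresholds):
    """One (threshold, precision, recall) point per threshold."""
    return [(t, *precision_recall(scores, labels, t)) for t in thresholds]


# Hypothetical classifier scores and gold labels (True = article is relevant).
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
for t, p, r in pr_curve(scores, labels, [0.3, 0.5]):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```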
  108. 108. Part (III): Conclusions • Three main properties of a good news suggestion: entity salience, relative authority, and novelty • Through the AEP and AES tasks, we can suggest important and novel information for Wikipedia entities • Entity profile expansion through section templates generated at the entity type level
  109. 109. Conclusions and Future Work
  110. 110. Conclusions • We account for the evolving nature of Wikipedia entities as novel information becomes available on the Web • We present a holistic approach for enriching and improving Wikipedia entities • Through our approach we enforce core principles of Wikipedia, such as the “verifiability” principle • Our automated approach provides accurate enrichments and improvements, and also accounts for long-tail entities, where editor interest is low.
  111. 111. Future Work • Wikipedia is a collaboratively created and edited data source; as such, it can have pitfalls like “echo chambers”. We want to investigate how such “echo chambers” are established, and which factors (e.g. editors, sources, topic interests) cause them. • Quality issues such as NPOV violations in Wikipedia are coarse-grained, and such quality indicators are nonexistent for long-tail entities. Investigating the editor, language, and source biases that cause such NPOV violations is therefore an important quality assurance step. • Editor dynamics reflect the quality of Wikipedia pages. How can we provide a mechanism for distributing Wikipedia pages “uniformly” across editors, such that we satisfy their interests and at the same time increase the overall quality of Wikipedia pages?
  112. 112. Thank you for your attention.
 
 Questions?
