Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Evaluating the Helpfulness of Linked Entities to Readers

1,076 views

Published on

Published in: Data & Analytics
  • Be the first to comment

Evaluating the Helpfulness of Linked Entities to Readers

  1. 1. Evaluating the Helpfulness of Linked Entities to Readers Ikuya Yamada1,2,3 Tomotaka Ito1,2 Shinnosuke Usami1,2 Shinsuke Takagi1,2 Hideaki Takeda3 Yoshiyasu Takefuji2 1Studio Ousia 2Keio University 3National Institute of Informatics 14年9月1日月曜日
  2. 2. STUDIO OUSIA Background ‣ We frequently encounter unfamiliar entities while reading text ‣ Tedious steps are required to retrieve detailed information about an entity ✦ Select, copy, and paste text into the search box... ✦ Especially problematic on recent touchscreen devices 2 Kyary Pamyu Pamyu is a Japanese model and singer. Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo. 14年9月1日月曜日
  3. 3. STUDIO OUSIA Entity linking ‣ Entity Linking: The task of linking entity mentions to entries in a knowledge base (KB) (e.g., Wikipedia) ‣ Recently the task has received considerable attention ✦ Many research papers (2006-) [Cucerzan 2007, Milne et al. 2008, etc.] ✦ Implementations: Wikipedia Miner, TAGME, AIDA, DBpedia Spotlight, Apache Stanbol, Illinois Wikifier, etc. 3 Kyary Pamyu Pamyu is a Japanese model and singer. Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo. Harajuku wikipedia/Harajuku wikipedia/Kawaii Kyary Pamyu Pamyu Kawaii wikipedia/Kyary_Pamyu_Pamyu 14年9月1日月曜日
  4. 4. STUDIO OUSIA Entity linking ‣ The text reading experience can be enhanced using entity linking, which links entity mentions to relevant web sources 4 Kyary Pamyu Pamyu is a Japanese model and singer. Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo. Harajuku wikipedia/Harajuku wikipedia/Kawaii Kyary Pamyu Pamyu Kawaii wikipedia/Kyary_Pamyu_Pamyu 14年9月1日月曜日
  5. 5. STUDIO OUSIA Problem ‣ Current entity linking systems attempt to detect all entities in the text ✦ They are intended to be components of other text analysis tasks ‣ KBs typically contain many unhelpful entities 5 Kyary Pamyu Pamyu is a Japanese model and singer. Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo. 14年9月1日月曜日
  6. 6. STUDIO OUSIA Problem ‣ Wikipedia instructs its contributors to insert links only where they are relevant and helpful ‣ The quality of links matters when detected entities are directly presented to users ✦ How to automatically detect helpful entities? 6 http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#Wikilinks 14年9月1日月曜日
  7. 7. STUDIO OUSIA Approach ‣ Formulate the problem as a binary classification that determines whether an entity is helpful to readers ‣ Adopt supervised machine-learning with a broad set of features ‣ Wikipedia is used as the KB 7 Evaluate the helpfulness of linked entities from the reader’s perspective 14年9月1日月曜日
  8. 8. Implemented as a post-processing step of entity linking STUDIO OUSIA Approach ‣ Removes unhelpful entities from entities detected by an entity linking system ‣ Can be used with any existing entity linking system 8 Entity Linking Helpfulness Evaluation (Our Method) Input Text Entities 14年9月1日月曜日
  9. 9. STUDIO OUSIA Task Definition ‣ A binary classification task to detect whether an entity is helpful to readers ‣ We use supervised machine-learning to assign binary labels to entities detected by an entity linking system 9 Kyary Pamyu Pamyu is a Japanese model and singer. Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo. TRUE / FALSE 14年9月1日月曜日
  10. 10. STUDIO OUSIA Terminologies ‣ Document: a document to be analyzed ‣ Entity: an entity (an entry in Wikipedia) ‣ Mention: an entity mention (entity text) that occurs in the text of the document 10 14年9月1日月曜日
  11. 11. Machine-learning Features List of machine-learning features STUDIO OUSIA ‣ Link probability ‣ Entity ‣ Entity class ‣ Topical coherence ‣ Textual ‣ Mention occurrence 11 Features are classified into six categories 14年9月1日月曜日
  12. 12. Link Probability Features Link probability is the probability that a mention is used as an anchor text in Wikipedia ‣ It reflects Wikipedia contributors’ judgment of whether or not the mention is helpful to readers ‣ Three derivative features: ✦ LINK_PROB: The link probability of the corresponding mention ✦ LINK_PROB_AVG: The average link probability of all mentions that link STUDIO OUSIA to the entity ✦ LINK_PROB_MAX: The maximum link probability in all mentions that link to the entity 12 Ca(m): A set of entities that contain mention as an anchor Ct(m): A set of entities that contain mention 14年9月1日月曜日
  13. 13. Link Probability Features STUDIO OUSIA 13 Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo Takeshita Street is a street lined with fashion boutiques, cafes in Harajuku in Tokyo, Japan. Department Store and Museum is a department store located in the Harajuku... Takeshita Street Kyary Pamyu Pamyu Laforet Link Plain text LINK_PROB(Harajuku) = 2/3 14年9月1日月曜日
  14. 14. Link Probability Features Link probability is the probability that a mention is used as an anchor text in Wikipedia ‣ It reflects Wikipedia contributors’ judgment of whether or not the mention is helpful to readers ‣ Three derivative features: ✦ LINK_PROB: The link probability of the corresponding mention ✦ LINK_PROB_AVG: The average link probability of all mentions that link STUDIO OUSIA to the entity ✦ LINK_PROB_MAX: The maximum link probability of all mentions that link to the entity 14 Ca(m): A set of entities that contain mention as an anchor Ct(m): A set of entities that contain mention 14年9月1日月曜日
  15. 15. STUDIO OUSIA Entity Features ‣ WIKISTATS: average number of page views ‣ IN_LINKS: number of inbound links ‣ OUT_LINKS: number of outbound links ‣ CATEGORIES: number of associated Wikipedia categories ‣ GENERALITY: minimum depth at which the entity is located in Wikipedia’s category tree 15 Entity features represent several statistics of the entity in Wikipedia They represent the popularity and specificity of an entity 14年9月1日月曜日
  16. 16. Entity class features are derived from the classes of an entity in STUDIO OUSIA Entity Class Features 16 DBpedia, Freebase, and Schema.org Kyary Pamyu Pamyu wikipedia/Kyary_Pamyu_Pamyu DBpedia:/ontology/Person DBpedia:/ontology/Artist Freebase:/people/person Freebase:/music/artist Harajuku wikipedia/Harajuku … DBpedia:/ontology/Place Freebase:/location/location Freebase:/location/neighborhood … Each class is used as a binary feature 14年9月1日月曜日
  17. 17. Topical Coherence Feature ‣ Measured by how a target entity is related to other detected entities ‣ Relatedness between two entities is calculated based on the Wikipedia in-link-based similarity ‣ The average relatedness between the entity and other entities in the document [Milne 2008] is added as a feature STUDIO OUSIA 17 Topical coherence feature represents how an entity is related to the topics of the document e: Entity; KB: Entities in KB; c(e): Entities having a link to e e: Entity; E: Set of entities in the document 14年9月1日月曜日
  18. 18. Topical Coherence Feature STUDIO OUSIA 18 Kyary Pamyu Pamyu is a Japanese model and singer. Her public image is associated with Japan's kawaisa culture centered in the Harajuku, Tokyo. 0.2 0.4 REL(Kyary Pamyu Pamyu, kawaisa culture) = 0.4 REL(Kyary Pamyu Pamyu, Harajuku) = 0.2 COHERENCE(Kyary Pamyu Pamyu, E) = 0.4 * 0.2 / 2 = 0.3 14年9月1日月曜日
  19. 19. STUDIO OUSIA Other Features ‣ Textual Features ✦ CAPITALIZED: whether m is capitalized ✦ MENTION_LEN: length of the mention ✦ TITLE_LEN: length of the entity title in KB ✦ MENTION_IN_TITLE: whether the entity title contains the mention ✦ TITLE_IN_MENTION: whether the mention contains entity title ✦ MENTION_EQ_TITLE: whether the mention equals to the entity title ✦ EDIT_DISTANCE: edit distance between the mention and the entity title ‣ Mention Occurrence Features ✦ SPREAD: number of tokens between the first occurrence and the last occurrence of the mention in the document ✦ FREQUENCY: number of occurrence of the mention in the document 19 14年9月1日月曜日
  20. 20. STUDIO OUSIA Dataset The dataset for our task was developed using crowd-sourcing service ‣ Based on the IITB EL dataset ✦ Contains annotated entities on news articles for entity linking tasks ✦ Popular in entity linking studies ‣ We added an annotation to each entity in the IITB EL dataset that represents whether it is helpful to readers ‣ Amazon Mechanical Turk was used to hire annotators online ‣ We posed the following question to the annotators: 20 “Do you think converting the highlighted keyword into a link is sufficiently helpful to readers and could it improve the readers’ overall experience?” 14年9月1日月曜日
  21. 21. STUDIO OUSIA Dataset ‣ To avoid the bias of annotators’ subjectivity: ✦ Various annotators were used (60 annotators) ✦ Three annotators were assigned to each entity ‣ A total number of 33,567 annotations were obtained ‣ Annotations for 72.8% of the entities are identical in binary judgement ‣ The dataset is publicly available for further studies https://github.com/studio-ousia/el-helpfulness-dataset/ 21 14年9月1日月曜日
  22. 22. Annotation screen displayed to the annotators STUDIO OUSIA 22 14年9月1日月曜日
  23. 23. STUDIO OUSIA Evaluation Criteria 23 ‣ Precision, recall, and F1 (harmonic mean of precision and recall) ‣ Scores are calculated using 5-fold cross validation on our dataset 14年9月1日月曜日
  24. 24. Machine-learning Algorithms C4.5 0.8281 0.8069 0.8173 SVMLINEAR 0.8581 0.7892 0.8221 SVMRBF 0.8676 0.7806 0.8217 AdaBoost 0.8484 0.8383 0.8433 Random Forest 0.8697 0.8419 0.8554 ‣ Algorithms: Precision Recall F1 ✦ Decision Tree (C4.5) ✦ Support Vector Machine (Linear Kernel & RBF Kernel) ✦ AdaBoost ✦ Random Forest ‣ Summary: ✦ Random forest achieves the highest performance ✦ AdaBoost performs competitively ✦ Decision Tree and SVM perform poorly STUDIO OUSIA 24 We chose Random Forest as our machine-learning algorithm 14年9月1日月曜日
  25. 25. STUDIO OUSIA Baselines ‣ LinkProb (Similar to Wikify! [Mihalcea & Csomai 2007]) ✦ Only uses link probability ✦ The best threshold is estimated in our dataset (10.6%) ‣ TAGME [Ferragina & Scaiella 2010] ✦ Uses link probability & topical coherence ‣ Milne&Witten [Milne & Witten 2008] ✦ Uses the following features: - link probability - topical coherence - generality (entity features) - spread and frequency (mention occurrence features) 25 14年9月1日月曜日
  26. 26. STUDIO OUSIA Main Results ‣ Our method performs significantly better than existing methods by 5.7-12% F1 26 Precision Recall F1 LinkProb 0.6980 0.7761 0.7349 TAGME 0.7936 0.7782 0.7857 Milne&Witten 0.8079 0.7904 0.7989 Our Method 0.8697 0.8419 0.8554 14年9月1日月曜日
  27. 27. Effectiveness of Each Feature STUDIO OUSIA ‣ Extracted 15 most effective features (measured using the averaged information gain over RF instances) ‣ Evaluated feature contributions by incrementally adding each feature to the model 27 Feature name F1 ΔF1 LINK_PROB_AVG (LP) LINK_PROB_MAX (LP) LINK_PROB (LP) 0.8158 COHERENCE (TC) 0.8201 0.43% SPREAD (MO) 0.8243 0.43% IN_LINKS (ENT) 0.8314 0.70% TITLE_LEN (TXT) 0.8353 0.39% WIKISTATS (ENT) 0.8346 -0.07% CATEGORIES (ENT) 0.8449 1.03% OUT_LINKS (ENT) 0.8427 -0.22% FREQUENCY (MO) 0.8438 0.11% CAPITALIZED (TXT) 0.8461 0.23% GENERALITY (ENT) 0.8459 -0.03% MENTION_LEN (TXT) 0.8482 0.24% EDIT_DISTANCE (TXT) 0.8488 0.06% Top 15 Most Effective Features 14年9月1日月曜日
  28. 28. Effectiveness of Each Feature STUDIO OUSIA ‣ Link probability features are strong features 28 Feature name F1 ΔF1 LINK_PROB_AVG (LP) LINK_PROB_MAX (LP) LINK_PROB (LP) 0.8158 COHERENCE (TC) 0.8201 0.43% SPREAD (MO) 0.8243 0.43% IN_LINKS (ENT) 0.8314 0.70% TITLE_LEN (TXT) 0.8353 0.39% WIKISTATS (ENT) 0.8346 -0.07% CATEGORIES (ENT) 0.8449 1.03% OUT_LINKS (ENT) 0.8427 -0.22% FREQUENCY (MO) 0.8438 0.11% CAPITALIZED (TXT) 0.8461 0.23% GENERALITY (ENT) 0.8459 -0.03% MENTION_LEN (TXT) 0.8482 0.24% EDIT_DISTANCE (TXT) 0.8488 0.06% Top 15 Most Effective Features 14年9月1日月曜日
  29. 29. Effectiveness of Each Feature STUDIO OUSIA ‣ Link probability features are strong features ‣ The entity’s relatedness to the document topics and helpfulness of the entity are highly correlated 29 Feature name F1 ΔF1 LINK_PROB_AVG (LP) LINK_PROB_MAX (LP) LINK_PROB (LP) 0.8158 COHERENCE (TC) 0.8201 0.43% SPREAD (MO) 0.8243 0.43% IN_LINKS (ENT) 0.8314 0.70% TITLE_LEN (TXT) 0.8353 0.39% WIKISTATS (ENT) 0.8346 -0.07% CATEGORIES (ENT) 0.8449 1.03% OUT_LINKS (ENT) 0.8427 -0.22% FREQUENCY (MO) 0.8438 0.11% CAPITALIZED (TXT) 0.8461 0.23% GENERALITY (ENT) 0.8459 -0.03% MENTION_LEN (TXT) 0.8482 0.24% EDIT_DISTANCE (TXT) 0.8488 0.06% Top 15 Most Effective Features 14年9月1日月曜日
  30. 30. Effectiveness of Each Feature STUDIO OUSIA ‣ Link probability features are strong features ‣ The entity’s relatedness to the document topics and helpfulness of the entity are highly correlated ‣ Entity features contribute to the performance 30 Feature name F1 ΔF1 LINK_PROB_AVG (LP) LINK_PROB_MAX (LP) LINK_PROB (LP) 0.8158 COHERENCE (TC) 0.8201 0.43% SPREAD (MO) 0.8243 0.43% IN_LINKS (ENT) 0.8314 0.70% TITLE_LEN (TXT) 0.8353 0.39% WIKISTATS (ENT) 0.8346 -0.07% CATEGORIES (ENT) 0.8449 1.03% OUT_LINKS (ENT) 0.8427 -0.22% FREQUENCY (MO) 0.8438 0.11% CAPITALIZED (TXT) 0.8461 0.23% GENERALITY (ENT) 0.8459 -0.03% MENTION_LEN (TXT) 0.8482 0.24% EDIT_DISTANCE (TXT) 0.8488 0.06% Top 15 Most Effective Features 14年9月1日月曜日
  31. 31. Effectiveness of Features by Categories STUDIO OUSIA ‣ The initial model is built using link probability features ‣ All feature categories contribute to the performance 31 Feature Types Precision Recall F1 Link probability only 0.8228 0.8093 0.8158 LP+Entity 0.8511 0.8266 0.8385 LP+Entity class 0.8487 0.8341 0.8412 LP+Topical coherence 0.8281 0.8117 0.8197 LP+Textual 0.8268 0.8266 0.8266 LP+Mention Occurrence 0.8279 0.8117 0.8196 Effectiveness of features by their categories Category name Description Link probability (LP) Probability that mention is used as anchor Entity (ENT) Several statistics of entity in Wikipedia Entity class (EC) Entity classes in DBpedia and Freebase Topical coherence (TC) How entity is related to topics of document Textual (TXT) Several textual features Mention occurrence (MO) How mention is appeared in document Summary of Feature Categories 14年9月1日月曜日
  32. 32. Effectiveness of Features by Categories STUDIO OUSIA ‣ The initial model is built using link probability features ‣ All feature categories contribute to the performance ‣ Entity class features significantly contribute to the performance, especially improving recall 32 Feature Types Precision Recall F1 Link probability only 0.8228 0.8093 0.8158 LP+Entity 0.8511 0.8266 0.8385 LP+Entity class 0.8487 0.8341 0.8412 LP+Topical coherence 0.8281 0.8117 0.8197 LP+Textual 0.8268 0.8266 0.8266 LP+Mention Occurrence 0.8279 0.8117 0.8196 Effectiveness of features by their categories Category name Description Link probability (LP) Probability that mention is used as anchor Entity (ENT) Several statistics of entity in Wikipedia Entity class (EC) Entity classes in DBpedia and Freebase Topical coherence (TC) How entity is related to topics of document Textual (TXT) Several textual features Mention occurrence (MO) How mention is appeared in document Summary of Feature Categories 14年9月1日月曜日
  33. 33. Effectiveness of Features by Categories STUDIO OUSIA ‣ The initial model is built using link probability features ‣ All feature categories contribute to the performance ‣ Entity class features significantly contribute to the performance, especially improving recall ‣ Entity features also contribute to the performance, particularly improving precision 33 Feature Types Precision Recall F1 Link probability only 0.8228 0.8093 0.8158 LP+Entity 0.8511 0.8266 0.8385 LP+Entity class 0.8487 0.8341 0.8412 LP+Topical coherence 0.8281 0.8117 0.8197 LP+Textual 0.8268 0.8266 0.8266 LP+Mention Occurrence 0.8279 0.8117 0.8196 Effectiveness of features by their categories Category name Description Link probability (LP) Probability that mention is used as anchor Entity (ENT) Several statistics of entity in Wikipedia Entity class (EC) Entity classes in DBpedia and Freebase Topical coherence (TC) How entity is related to topics of document Textual (TXT) Several textual features Mention occurrence (MO) How mention is appeared in document Summary of Feature Categories 14年9月1日月曜日
  34. 34. Effectiveness of Entity Class Features 1. DBPEDIA/ontology/Agent 2. FREEBASE/people/person 3. FREEBASE/business/employer 4. FREEBASE/organization/organization 5. SCHEMA ORG/Person 6. DBPEDIA/ontology/Person 7. FREEBASE/book/book subject 8. DBPEDIA/ontology/Disease 9. FREEBASE/business/business operation 10. SCHEMA ORG/Organization 11. DBPEDIA/ontology/Organization 12. FREEBASE/sports/proathlete 13. FREEBASE/location/location 14. DBPEDIA/ontology/Athlete 15. DBPEDIA/base/consumermedical/medical term STUDIO OUSIA ‣ 15 most effective entity class features are extracted 34 List of top 15 entity class features 14年9月1日月曜日
  35. 35. Effectiveness of Entity Class Features 1. DBPEDIA/ontology/Agent 2. FREEBASE/people/person 3. FREEBASE/business/employer 4. FREEBASE/organization/organization 5. SCHEMA ORG/Person 6. DBPEDIA/ontology/Person 7. FREEBASE/book/book subject 8. DBPEDIA/ontology/Disease 9. FREEBASE/business/business operation 10. SCHEMA ORG/Organization 11. DBPEDIA/ontology/Organization 12. FREEBASE/sports/proathlete 13. FREEBASE/location/location 14. DBPEDIA/ontology/Athlete 15. DBPEDIA/base/consumermedical/medical term STUDIO OUSIA ‣ 15 most effective entity class features are extracted ‣ Entity classes that indicate that the entity is a named entity, (e.g., agent, person, or organization) seem highly significant 35 List of top 15 entity class features 14年9月1日月曜日
  36. 36. Effectiveness of Entity Class Features 1. DBPEDIA/ontology/Agent 2. FREEBASE/people/person 3. FREEBASE/business/employer 4. FREEBASE/organization/organization 5. SCHEMA ORG/Person 6. DBPEDIA/ontology/Person 7. FREEBASE/book/book subject 8. DBPEDIA/ontology/Disease 9. FREEBASE/business/business operation 10. SCHEMA ORG/Organization 11. DBPEDIA/ontology/Organization 12. FREEBASE/sports/proathlete 13. FREEBASE/location/location 14. DBPEDIA/ontology/Athlete 15. DBPEDIA/base/consumermedical/medical term STUDIO OUSIA ‣ 15 most effective entity class features are extracted ‣ Entity classes that indicate that the entity is a named entity, (e.g., agent, person, or organization) seem highly significant ‣ Some specific entity classes (e.g., disease, medical term, and athlete) are also effective 36 List of top 15 entity class features 14年9月1日月曜日
  37. 37. STUDIO OUSIA Conclusions ‣ We proposed a method to evaluate the helpfulness of linked entities ✦ Supervised machine-learning is used with a broad set of features ✦ A dataset is developed for this task and publicly available for further studies 37 14年9月1日月曜日
  38. 38. STUDIO OUSIA Conclusions ‣ We proposed a method to evaluate the helpfulness of linked entities ✦ Supervised machine-learning is used with a broad set of features ✦ A dataset is developed for this task and publicly available for further studies ‣ What we learned: ✦ A method for estimating the helpfulness of entities can be improved using machine-learning with a broad set of features ✦ Knowledge extracted from the Wikipedia Hypertext structure is highly useful for evaluating the helpfulness of entities 38 14年9月1日月曜日
  39. 39. THANK YOU! 14年9月1日月曜日

×