Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oranges - Extending GERBIL for a fine grained NEL evaluation


  1. Don’t compare Apples to Oranges - Extending GERBIL for a fine grained NEL Evaluation
     Jörg Waitelonis, Henrik Jürges, Harald Sack
     Hasso-Plattner-Institute for IT-Systems Engineering, University of Potsdam
     Semantics 2016, Leipzig, Germany, September 12-15, 2016
  2. Agenda
     1. NEL and NEL evaluation
     2. Dataset properties and evaluation drawbacks
     3. Extending GERBIL
        ● Building conditional datasets
        ● Measuring dataset characteristics
     4. Results
     5. Demonstration
     6. Summary & future work
  3. Named Entity Linking (NEL)
     [Figure: example text containing the ambiguous mention "Armstrong"]
  4. Named Entity Linking (NEL), Principle
     ● Example: "Armstrong landed on the moon."
        ■ Candidates for "Armstrong": dbr:Neil_Armstrong, dbr:Lance_Armstrong, dbr:Louis_Armstrong, ...
        ■ Candidates for "moon": dbr:Moon, dbr:Lunar, ...
     ● Basic procedure (see the sketch below):
        1. Tokenize the text and identify entity mentions (surface forms)
        2. Find candidate entities in the knowledge base
        3. Score the candidates with a "magic algorithm" and select the best one as the correct entity
     ● Typical scoring techniques: string distance, fuzzy string matching, link analysis, vector space models, context similarity matching, surface form aggregation, word embeddings, conditional random fields, random forests, RankSVM / learning to rank, ...
     ● Example annotators: KEA, Wikifier
     ● The algorithm only approximates the correct entities
     ● Need for verification and testing
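To make the three steps concrete, here is a minimal, purely illustrative sketch. The candidate dictionary and the overlap-based scoring function are invented for this example; real annotators such as KEA or Wikifier rely on the far more elaborate techniques listed above.

```python
from typing import Dict, List

# Hypothetical surface-form dictionary: lowercased mention -> candidate KB entities.
CANDIDATES: Dict[str, List[str]] = {
    "armstrong": ["dbr:Neil_Armstrong", "dbr:Lance_Armstrong", "dbr:Louis_Armstrong"],
    "moon": ["dbr:Moon"],
}

def score(candidate: str, context: List[str]) -> float:
    """Stand-in for the 'magic algorithm': naive overlap between entity name and context."""
    entity_tokens = candidate.lower().replace("dbr:", "").split("_")
    return sum(1.0 for tok in entity_tokens if tok in context)

def link(text: str) -> Dict[str, str]:
    tokens = [t.strip(".,!?").lower() for t in text.split()]       # 1. tokenize
    linked = {}
    for tok in tokens:
        candidates = CANDIDATES.get(tok, [])                       # 2. find candidates in the KB
        if candidates:
            # 3. score candidates and select the best one (ties: first candidate wins)
            linked[tok] = max(candidates, key=lambda c: score(c, tokens))
    return linked

print(link("Armstrong landed on the moon."))
# -> {'armstrong': 'dbr:Neil_Armstrong', 'moon': 'dbr:Moon'}
```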
  5. Named Entity Linking, Evaluation
     ● A dataset consists of:
        ■ Documents (strings/sentences)
        ■ Annotations (ground truth)
     ● Example datasets: ACE2004, AIDA/CoNLL, DBpedia Spotlight, IITB, KORE50, MSNBC, Microposts2014, N3-RSS-500, N3-Reuters-128, WES2015
     ● Traditional measures (see the sketch below):
        ■ Precision: how accurate the annotator's results are
        ■ Recall: how complete the results are
        ■ F1-measure: the harmonic mean of precision and recall
        ■ And more, cf. Rizzo et al. [1]
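A minimal sketch of these measures on a single document, assuming both the ground truth and the annotator output are represented as sets of (mention, entity) pairs. GERBIL's actual implementation additionally distinguishes several matching modes and aggregates over whole datasets.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Plain set-based precision, recall and F1 over (mention, entity) pairs."""
    tp = len(gold & predicted)                                   # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("Armstrong", "dbr:Neil_Armstrong"), ("moon", "dbr:Moon")}
predicted = {("Armstrong", "dbr:Lance_Armstrong"), ("moon", "dbr:Moon")}
print(precision_recall_f1(gold, predicted))   # -> (0.5, 0.5, 0.5)
```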
  6. Named Entity Linking, Benchmarking
     ● GERBIL - a general entity annotation system (AKSW Leipzig), cf. Usbeck et al. [2]
     ● Used for testing, optimizing, and benchmarking annotators
     ● Neat web interface
     ● 13 annotators / 20 datasets
     ● The F-measure alone is too coarse for a detailed evaluation
     ● Developers need insights into the datasets
  7. Properties of Datasets
     ● Size of a dataset
     ● Number of annotations/documents/words
     ● Which types of entities are used? E.g. persons, places, events, ...
     ● Are there documents without annotations? E.g. Microposts 2014
     ● How popular are the entities? E.g. PageRank, indegree
     ● How ambiguous are the entities and surface forms?
     ● How diverse are the entities and surface forms?
     ● ... Cf. van Erp et al. [3]
  8. Research Questions and Drawbacks
     ● How do dataset characteristics influence the evaluation results?
     ● How does the popularity of entities influence the evaluation results?
     ● How can a general dataset be used to evaluate domain-specific NEL tools?
     ● How can datasets be compared? Is there something like a general difficulty?
     ● Limited comparability between benchmark results
     ● Good annotators are penalized by inappropriate datasets
     Cf. van Erp et al. [3]
  9. Extending GERBIL
     ● Approach for a solution:
        ■ An adjustable filter system for GERBIL
        ■ Expose dataset characteristics
        ■ Datasets and annotators added at runtime are also included
        ■ Visualize the results
  10. Extending GERBIL, Conditional Datasets
     [Figure: the dataset is partitioned into type- and popularity-specific datasets via rdf:type constraints and a PageRank threshold PR(e) > t; the annotator processes the documents, and each specific dataset is evaluated against the corresponding annotator results.]
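The following is only an illustrative sketch of this filtering idea, not GERBIL's actual Java code. The Annotation fields (types, pagerank) are assumed to be precomputed, e.g. from DBpedia.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Annotation:
    mention: str
    entity: str              # e.g. "dbr:Neil_Armstrong"
    types: Set[str]          # e.g. {"dbo:Person"}, assumed precomputed from the KB
    pagerank: float          # assumed precomputed popularity score PR(e)

def conditional_subset(gold: List[Annotation],
                       required_type: Optional[str] = None,
                       min_pagerank: Optional[float] = None) -> List[Annotation]:
    """Keep only annotations whose entity satisfies the rdf:type and PR(e) > t conditions."""
    subset = []
    for a in gold:
        if required_type is not None and required_type not in a.types:
            continue
        if min_pagerank is not None and not a.pagerank > min_pagerank:
            continue
        subset.append(a)
    return subset

# e.g. restrict the ground truth to popular persons (threshold value is arbitrary here):
# persons = conditional_subset(gold, required_type="dbo:Person", min_pagerank=1e-5)
```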
  11. Results, Types
     [Figure: evaluation results broken down by entity type]
  12. Results, Popularity
     [Figure: evaluation results broken down by entity popularity]
  13. Extending GERBIL, Not Annotated Documents
     ● Not annotated documents: the relative amount of empty documents (documents without any annotation) within a dataset
     ● Only affects annotators that detect entity mentions themselves
  14. Extending GERBIL, Density
     ● Density: the ratio between the number of annotations and the number of words in a document
     ● Only affects annotators that detect entity mentions themselves
     ● Both document-level measures are sketched below
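A short sketch of how these two document-level statistics could be computed, assuming a document is given as its text plus the list of its gold annotations. This is an illustration, not the extension's actual code.

```python
from typing import List, Tuple

Document = Tuple[str, list]   # (text, gold annotations) - assumed representation

def not_annotated_ratio(documents: List[Document]) -> float:
    """Relative amount of documents without any gold annotation."""
    empty = sum(1 for _, annotations in documents if not annotations)
    return empty / len(documents) if documents else 0.0

def density(text: str, annotations: list) -> float:
    """Number of annotations relative to the number of words in the document."""
    words = text.split()
    return len(annotations) / len(words) if words else 0.0
```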
  15. Extending GERBIL, Likelihood of Confusion
     ● Likelihood of confusion (level of ambiguity): how many entities share a surface form (homonyms) and how many surface forms an entity has (synonyms)
     ● The true values are unknown, since no exhaustive surface-form collection exists
     ● Gives a rough overview of how difficult a dataset is to disambiguate
     ● Examples: the surface form "Bruce" may refer to Bruce Lee, Bruce Willis, ...; the entity Airport Tegel has the surface forms "Tegel", "TXL", "Otto Lilienthal", ...
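The counts behind this measure could be derived from a surface-form dictionary (e.g. built from Wikipedia anchor texts or DBpedia labels); since no such dictionary is exhaustive, the numbers remain approximations. A possible sketch:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def ambiguity_counts(dictionary: Iterable[Tuple[str, str]]):
    """dictionary: (surface form, entity) pairs, e.g. ("TXL", "dbr:Berlin_Tegel_Airport")."""
    entities_per_sf = defaultdict(set)   # "Bruce"   -> {Bruce_Lee, Bruce_Willis, ...}
    sfs_per_entity = defaultdict(set)    # airport   -> {"Tegel", "TXL", "Otto Lilienthal"}
    for sf, entity in dictionary:
        entities_per_sf[sf].add(entity)
        sfs_per_entity[entity].add(sf)
    sf_ambiguity: Dict[str, int] = {sf: len(es) for sf, es in entities_per_sf.items()}
    entity_synonyms: Dict[str, int] = {e: len(sfs) for e, sfs in sfs_per_entity.items()}
    return sf_ambiguity, entity_synonyms
```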
  16. Results, Likelihood of Confusion
     ● A high red bar indicates an entity with a high number of homonyms
     ● A high blue bar indicates a surface form with a high number of synonyms
     [Figure: likelihood of confusion per dataset, for entities and for surface forms]
  17. Extending GERBIL, Dominance of Entities
     ● dominance(e) = e(t) / e(v): the number of surface forms used for entity e in the test data, relative to the number of its surface forms in the vocabulary
     ● Example: dbr:Bruce_Willis appears in the test data as "Bruce", "Bruci", "Bruce Willis", while the vocabulary also contains "Bruce Walter Willis", ...
     ● The true values are unknown (the vocabulary is not exhaustive)
     ● A high dominance prevents overfitting and avoids mere repetition of the same surface form
  18. Extending GERBIL, Dominance of Surface Forms
     ● dominance(s) = s(t) / s(v): the number of entities the surface form s is linked to in the test data, relative to the number of entities it can refer to in the vocabulary
     ● Example: "Angelina" may refer to dbr:Irene_Angelina, dbr:Angelina_Jordan, dbr:Angelina_Jolie, ...
     ● The true values are unknown
     ● A high dominance prevents overfitting
     ● Indicates how context-dependent the disambiguation of a surface form is
     ● Both dominance measures are sketched in code below
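A sketch of one way to compute both dominance values, following the reading of the two slides above: the fraction of an entity's known surface forms (respectively a surface form's known entities) that actually occur in the test data. Both the test data and the vocabulary are assumed to be given as (surface form, entity) pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Pairs = Iterable[Tuple[str, str]]   # (surface form, entity)

def dominance(test_data: Pairs, vocabulary: Pairs):
    """dominance(e) = e(t)/e(v) and dominance(s) = s(t)/s(v), as far as the vocabulary allows."""
    used_sf, vocab_sf = defaultdict(set), defaultdict(set)   # surface forms per entity
    used_e, vocab_e = defaultdict(set), defaultdict(set)     # entities per surface form
    for sf, e in test_data:
        used_sf[e].add(sf)
        used_e[sf].add(e)
    for sf, e in vocabulary:
        vocab_sf[e].add(sf)
        vocab_e[sf].add(e)
    dom_entity: Dict[str, float] = {e: len(used_sf[e]) / len(vocab_sf[e])
                                    for e in used_sf if vocab_sf[e]}
    dom_surface: Dict[str, float] = {sf: len(used_e[sf]) / len(vocab_e[sf])
                                     for sf in used_e if vocab_e[sf]}
    return dom_entity, dom_surface
```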
  19. Results, Dominance
     ● A blue bar indicates that a variety of surface forms is used for an entity (dominance of entities)
     ● A red bar indicates how context-dependent the disambiguation of a surface form is (dominance of surface forms)
     [Figure: dominance of entities and dominance of surface forms per dataset]
  20. Demo
     ● http://gerbil.s16a.org/
     ● https://github.com/santifa/gerbil/
  21. Summary & Future Work
     ■ Summary:
        □ Implemented a domain-specific filter system
        □ Measured dataset characteristics
        □ Annotator results are nearly the same across entities of different popularity
        □ Enables specific analyses and optimization of annotators
        □ Enables users to select the tool that performs best for a specific domain
     ■ Future work:
        □ Keep up with GERBIL development, increase performance
        □ More measurements, e.g. max_recall
        □ Dataset remixing ≙ assembling new customized datasets, e.g. unpopular companies
  22. References
     [1] Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca Pereira, and Andrea Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In: 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44–53. CEUR-WS.org, 2015.
     [2] M. Röder, R. Usbeck, and A.-C. Ngonga Ngomo. GERBIL's New Stunts: Semantic Annotation Benchmarking Improved. Technical report, Leipzig University, 2016.
     [3] M. van Erp, P. Mendes, H. Paulheim, F. Ilievski, J. Plu, G. Rizzo, and J. Waitelonis. Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In: Proc. of the 10th Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, 2016.
  23. Questions?
     Thank you for your attention!
