1. Don’t compare Apples to Oranges -
Extending GERBIL for a fine grained NEL Evaluation
Jörg Waitelonis, Henrik Jürges, Harald Sack
Hasso-Plattner-Institute for IT-Systems Engineering, University of Potsdam
Semantics 2016, Leipzig, Germany, September 12-15th, 2016
2. Agenda
1. NEL and NEL evaluation
2. Dataset properties and evaluation drawbacks
3. Extending GERBIL
● Building conditional datasets
● Measure dataset characteristics
4. Results
5. Demonstration
6. Summary & Future work
4. Named Entity Linking (NEL), Principle
Example: “Armstrong landed on the moon.”
● Entity mentions with surface forms: “Armstrong”, “moon”
● Candidates for “Armstrong”: dbr:Neil_Armstrong, dbr:Lance_Armstrong, dbr:Louis_Armstrong, …
● Candidates for “moon”: dbr:Moon, dbr:Lunar, …
● Correct entities: dbr:Neil_Armstrong, dbr:Moon
Typical annotator pipeline:
1. Tokenize text
2. Find candidates in KB
3. Score candidates with a magic algorithm and select the best one
Scoring approaches: String Distance, Link Analysis, Vector Space, Fuzzy String Matching, Conditional Random Fields, Random Forest, RankSVM, Learning to Rank, Surface Aggregation, Word Embeddings, Context Similarity Matching
Example annotators: KEA, Wikifier
● Algorithms only approximate the correct entities
● Need for verification and testing
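As an illustration of these three steps (not the actual KEA or Wikifier implementation), here is a minimal toy sketch: the candidate index is a hypothetical dictionary standing in for a real knowledge-base lookup, and plain string similarity stands in for the "magic" scoring algorithm.

```python
from difflib import SequenceMatcher

# Toy candidate index: surface form -> candidate KB entities (hypothetical data).
CANDIDATE_INDEX = {
    "armstrong": ["dbr:Neil_Armstrong", "dbr:Lance_Armstrong", "dbr:Louis_Armstrong"],
    "moon": ["dbr:Moon", "dbr:Lunar"],
}

def link_entities(text):
    """1. tokenize, 2. look up candidates, 3. score and pick the best one."""
    links = {}
    for token in text.lower().replace(".", "").split():      # 1. tokenize
        candidates = CANDIDATE_INDEX.get(token, [])           # 2. candidate lookup in the "KB"
        if not candidates:
            continue
        # 3. "magic" scoring: here just string similarity between token and entity label
        best = max(candidates,
                   key=lambda c: SequenceMatcher(None, token, c.split(":")[1].lower()).ratio())
        links[token] = best
    return links

print(link_entities("Armstrong landed on the moon."))
# {'armstrong': 'dbr:Neil_Armstrong', 'moon': 'dbr:Moon'}
```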
5. Named Entity Linking, Evaluation
● A dataset consists of:
■ Documents (strings/sentences)
■ Annotations (ground truth)
● Example datasets: ACE2004, AIDA/CoNLL, DBpedia Spotlight, IITB, KORE50, MSNBC, Microposts2014, N3-RSS-500, N3-Reuters-128, WES2015
● Traditional measures are:
■ Precision: defines how well an annotator works
■ Recall: defines how complete the results are
■ F1-measure: harmonic mean between precision and recall
■ And more, cf. Rizzo et al. [1]
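As a hedged illustration (GERBIL's actual implementation additionally distinguishes micro/macro averaging and different matching modes), precision, recall and F1 for a single document can be computed like this, treating the gold standard and the annotator output as sets of (surface form, entity) pairs:

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for one document.

    gold, predicted: sets of (surface_form, entity_iri) pairs.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("Armstrong", "dbr:Neil_Armstrong"), ("moon", "dbr:Moon")}
predicted = {("Armstrong", "dbr:Lance_Armstrong"), ("moon", "dbr:Moon")}
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
```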
6. Named Entity Linking, Benchmarking
● GERBIL - a general entity annotation system (AKSW Leipzig), cf. Usbeck et al. [2]
● Used for testing/optimizing/benchmarking annotators
● Neat web interface
● 13 annotators / 20 datasets
● F-measure too coarse for a detailed evaluation
● Developers need dataset insights
7. Properties of Datasets
● Size of a dataset
● Number of annotations/documents/words
● What types of entities are used? E.g. persons, places, events, …
● Are there documents without annotations? E.g. Microposts 2014
● How popular are the entities? E.g. PageRank, indegree
● How ambiguous are the entities and surface forms?
● How diverse are the entities and surface forms?
● …
Cf. van Erp et al. [3]
8. Research Questions and Drawbacks
● How do dataset characteristics influence the evaluation results?
● How does the popularity of entities influence the evaluation results?
● How can a general dataset be used for domain-specific NEL tools?
● How can datasets be compared? Is there something like a general difficulty?
● Limited comparability between benchmark results
● Penalization of good annotators with inappropriate datasets
Cf. van Erp et al. [3]
9. Extending GERBIL
● Approach for a solution:
■ Adjustable filter system for GERBIL
■ Expose dataset characteristics
■ Datasets and annotators added at runtime are also included
■ Visualize the results
10. Extending GERBIL, Conditional Datasets
[Figure: the original dataset is split into type- and popularity-specific sub-datasets using rdf:type constraints and a popularity threshold PR(e) > t; the annotator runs on the documents, and each specific sub-dataset is evaluated against the corresponding annotator results, yielding separate benchmark results.]
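The sketch below illustrates the idea behind these conditional datasets; the document representation and the pagerank()/rdf_types() helpers are hypothetical stand-ins for GERBIL's actual data structures and knowledge-base lookups.

```python
# Toy lookup tables standing in for knowledge-base queries (hypothetical data).
POPULARITY_SCORES = {"dbr:Neil_Armstrong": 0.8, "dbr:Moon": 0.9}
ENTITY_TYPES = {"dbr:Neil_Armstrong": {"dbo:Person"}, "dbr:Moon": {"dbo:CelestialBody"}}

def pagerank(entity_iri):
    return POPULARITY_SCORES.get(entity_iri, 0.0)

def rdf_types(entity_iri):
    return ENTITY_TYPES.get(entity_iri, set())

def conditional_dataset(dataset, entity_type=None, min_pagerank=None):
    """Keep only gold annotations whose entity satisfies the type/popularity condition.

    dataset: list of documents, each a dict {"text": str, "annotations": [(mention, entity_iri)]}
    """
    filtered = []
    for doc in dataset:
        kept = [(mention, entity) for mention, entity in doc["annotations"]
                if (entity_type is None or entity_type in rdf_types(entity))
                and (min_pagerank is None or pagerank(entity) > min_pagerank)]  # PR(e) > t
        filtered.append({"text": doc["text"], "annotations": kept})
    return filtered

# Each conditional dataset is then benchmarked separately, e.g.:
# persons = conditional_dataset(gold, entity_type="dbo:Person")
# popular = conditional_dataset(gold, min_pagerank=0.5)
```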
11. Results, Types
13. Extending GERBIL, Not Annotated Documents
● Not annotated documents: the relative amount of empty documents (documents without any annotations) within a dataset
● Only relevant if the annotator searches for entity mentions by itself
14. Extending GERBIL, Density
● Density: the ratio between the number of annotations and the number of words in a document
● Only relevant if the annotator searches for entity mentions by itself
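Both measures are straightforward to compute; a small sketch, assuming the same simple document representation as in the earlier examples (a dict with "text" and "annotations"), could look like this:

```python
def not_annotated_ratio(dataset):
    """Relative amount of documents that carry no gold annotations at all."""
    empty = sum(1 for doc in dataset if not doc["annotations"])
    return empty / len(dataset) if dataset else 0.0

def annotation_density(doc):
    """Number of annotations divided by the number of words in a document
    (whitespace tokenization used as a rough word count)."""
    words = doc["text"].split()
    return len(doc["annotations"]) / len(words) if words else 0.0
```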
15. Extending GERBIL, Likelihood of Confusion
● Likelihood of Confusion (level of ambiguity)
● True measures are unknown due to missing exhaustive collections
● Gives a rough overview of how difficult a dataset is to disambiguate
[Figure: entities vs. surface forms. The entity Airport Tegel is referred to by the surface forms “Tegel”, “TXL”, and “Otto Lilienthal” (synonyms); the surface form “Bruce” can refer to both Bruce Lee and Bruce Willis (homonyms).]
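A small sketch of how such ambiguity counts can be approximated from a surface-form dictionary (e.g. built from labels and anchor texts); the dictionary below is a hypothetical toy example and, like any such collection, understates the true ambiguity:

```python
from collections import defaultdict

# Toy surface-form dictionary: surface form -> entities it may refer to (hypothetical data).
SF_DICT = {
    "Tegel": {"dbr:Berlin_Tegel_Airport"},
    "TXL": {"dbr:Berlin_Tegel_Airport"},
    "Otto Lilienthal": {"dbr:Berlin_Tegel_Airport", "dbr:Otto_Lilienthal"},
    "Bruce": {"dbr:Bruce_Lee", "dbr:Bruce_Willis"},
}

def surface_form_ambiguity(sf_dict):
    """Number of possible entities (homonym count) per surface form."""
    return {sf: len(entities) for sf, entities in sf_dict.items()}

def entity_ambiguity(sf_dict):
    """Number of known surface forms (synonym count) per entity."""
    counts = defaultdict(int)
    for entities in sf_dict.values():
        for entity in entities:
            counts[entity] += 1
    return dict(counts)
```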
16. Results, Likelihood of Confusion
● A high red bar indicates that an entity has a high number of homonyms
● A high blue bar indicates that a surface form has a high number of synonyms
[Charts: likelihood of confusion per entity and per surface form]
17. Extending GERBIL, Dominance of Entities
● dominance(e) = e(t) / e(v)
● Expresses the relation between the surface forms used for an entity in the test data and all its surface forms in the vocabulary
● True measures unknown
● High rates prevent overfitting
● Prevents repetition of surface forms
[Figure: in the test data the entity dbr:Bruce_Willis is mentioned as “Bruce” and “Bruci”, while the vocabulary also contains the surface forms “Bruce Willis” and “Bruce Walter Willis”.]
18. Extending GERBIL, Dominance of Surface Forms
● dominance(s) = s(t) / s(v)
● Expresses the relation between the entities a surface form is linked to in the test data and all entities it may refer to in the vocabulary
● True measures unknown
● High rates prevent overfitting
● Indicates how context dependent a disambiguation is
[Figure: in the test data the surface form “Angelina” is linked to only part of its possible meanings, while the vocabulary also contains dbr:Irene_Angelina, dbr:Angelina_Jordan, and dbr:Angelina_Jolie.]
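Under the reading above (surface forms per entity, and entities per surface form, in the test data versus the vocabulary), the two dominance values could be sketched as follows; the vocabulary and surface-form dictionary are hypothetical inputs, and the exact definition in the paper may differ in detail:

```python
def entity_dominance(entity, test_annotations, vocabulary):
    """dominance(e) = e(t) / e(v): distinct surface forms used for the entity in the
    test data divided by all surface forms the vocabulary knows for it.

    test_annotations: iterable of (surface_form, entity_iri) pairs from the dataset.
    vocabulary: dict entity_iri -> set of known surface forms.
    """
    used = {sf for sf, e in test_annotations if e == entity}
    known = vocabulary.get(entity, set())
    return len(used & known) / len(known) if known else 0.0

def surface_form_dominance(surface_form, test_annotations, sf_dict):
    """dominance(s) = s(t) / s(v): distinct entities the surface form is linked to in the
    test data divided by all entities it may refer to according to the dictionary.

    sf_dict: dict surface_form -> set of entity IRIs it may denote.
    """
    used = {e for sf, e in test_annotations if sf == surface_form}
    known = sf_dict.get(surface_form, set())
    return len(used & known) / len(known) if known else 0.0
```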
19. Results, Dominance
● A blue bar indicates that a variety of surface forms is used for an entity
● A red bar indicates how context dependent the disambiguation of a surface form is
[Charts: dominance of entities and dominance of surface forms]
21. Summary & Future work
■ Summary:
□ Implemented a domain-specific filter system
□ Measure dataset characteristics
□ Annotator results are nearly the same on entities of different popularity
□ Enable specific analyses and optimization of annotators
□ Enable users to select the tool that performs best for a specific domain
■ Future work:
□ Keep up with GERBIL development, increase performance
□ More measurements, e.g. max_recall
□ Dataset remixing ≙ assembling new customized datasets
– E.g. unpopular companies
22. References
[1] Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca Pereira, and Andrea Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44–53. CEUR-WS.org, 2015.
[2] M. Röder, R. Usbeck, and A.-C. Ngonga Ngomo. GERBIL's New Stunts: Semantic Annotation Benchmarking Improved. Technical report, Leipzig University, 2016.
[3] M. van Erp, P. Mendes, H. Paulheim, F. Ilievski, J. Plu, G. Rizzo, and J. Waitelonis. Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In Proc. of the 10th Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, 2016.
Hello and welcome to this talk. Thank you all very much for coming today.
My name is Henrik Jürges and I'm a student research assistant in the Semantic Web Technologies research group at the HPI.
This is joint work with Jörg Waitelonis and Harald Sack.
The title of my presentation is "Don't compare Apples to Oranges - Extending GERBIL for a fine grained NEL evaluation".
And as the name suggests, my purpose is to give you a brief overview of our approach to follow new trends in named entity linking.
(So I'm hoping to cover three points. First I will introduce GERBIL and its shortcomings in covering new trends, after that we will look at the work we have done to lay a foundation for future work in this direction, and finally I will present some results.)
The presentation is structured as follows:
First, I give you a brief overview of named entity linking and its evaluation.
After that, I state some questions regarding current datasets and evaluation methods.
Then I show you our approach to deal with these questions and present some of the results we got.
Finally, I sum up this presentation and give a quick view of future work.
A more complete example is the next sentence: "Armstrong landed on the moon."
Generally, the same three steps are applied by almost every NEL tool, or so-called annotator.
The text is split into tokens by whitespace and important entity mentions are located. These are called surface forms.
After that, possible candidates are retrieved from a formal knowledge base like DBpedia, e.g. by comparing the surface forms and entity labels.
Then some magic algorithm ranks these candidates. As you can see, there are many possible algorithms, from simple ones like string distance to complex ones like random forests. These are all implemented by the various tools you can see on the right side.
The highest ranked candidate will hopefully be our correct entity.
This leads us to the evaluation of named entity linking.
Let me show you a more enlightening example of named entity linking.
First, we have some text fragment, usually a sentence or a larger text. Here the phrase "Armstrong landed on the moon." is our fragment.
Then we have the application fulfilling the task of named entity linking.
Almost all named entity linking tools, or so-called annotators, have three basic stages:
First, they tokenize the text into words. After that, they search for candidates in some formal knowledge base like DBpedia.
Finally, they apply a magic algorithm to these candidates to find the best one.
Here we should distinguish between two common tasks in named entity linking.
One task only disambiguates provided entity mentions with the help of the text fragment; it is called disambiguate to knowledge base.
The other task provides only the text fragment, leaving the annotator to find the expected entity mentions and disambiguate them.
This task is called annotate to knowledge base and encapsulates the former task.
Ok, taking this slowly: we have our text fragment and tokenize the text, searching for entity mentions.
We find two possible entity mentions: "Armstrong" and "the moon". The textual fragments for the two mentions are called surface forms,
which are in the end just words that match syntactically.
After that, we search one or multiple formal knowledge bases for possible candidates for each surface form, leading us to the candidates Neil Armstrong, Lance Armstrong, Louis Armstrong and so on.
Having all candidates together, the annotator uses some magic algorithm for scoring them.
These algorithms vary in their complexity, from simple ones like string distance to more complex ones
like random forests or context similarity matching.
Luckily for us, there are many annotators implementing these algorithms. Some of them are shown on the right side.
After scoring the candidates, the one with the highest score is hopefully the correct entity we're searching for.
This leads us to the entities Neil Armstrong and the Moon for our entity mentions.
Since these algorithms only approximate the correct entities, we need evaluation to verify and test the results.
To do so we need datasets, which are the models of our expectations.
A dataset contains two things: first, the documents or text fragments, and second, the expected annotations for these documents, which are called the ground truth.
For the evaluation we need some measures, which are borrowed from other research areas.
The most common measures are precision, which defines how accurate the results are, and recall, which defines how complete the results are.
And to combine these, the F1-measure gives an overview of the quality of the annotator's results.
Since we have the matrix for comparing results against our expectations, we need something that actually represents these expectations.
For this we use datasets.
A dataset models our expectations with two parts.
First, it contains the documents or text fragments which we use for evaluation. These documents are mostly strings of one or more sentences.
Second, it contains a set of annotations, the so-called gold standard or ground truth, which is our expectation of the annotator results for these documents.
The ground truth is either hand-crafted by multiple researchers or taken from annotators and corrected later.
Some of the well-known datasets are mentioned on the right.
With the help of our expectation matrix we can now count the correct, missing and false results.
But these raw values are hard to compare and carry little meaning on their own, so some measures are borrowed from other research areas.
I will present the three most common ones here, which are all based on our expectation matrix.
First, precision defines how well an annotator works.
Second, recall describes how complete the results are.
And lastly, the F1-measure is the harmonic mean of precision and recall. It gives an overview of the quality an annotator produces.
But doing this by hand is a quite error-prone and annoying task, so benchmarking and evaluation systems evolved over time, and the successor of them all is GERBIL.
GERBIL is work done by our colleagues from AKSW Leipzig and is a general entity annotation system which can be used for testing, optimizing and benchmarking annotators.
It provides a nice web interface for configuring, benchmarking and visualizing datasets and annotators.
At the moment there are 13 annotators and 20 datasets provided, but new datasets and annotators can be added at runtime.
The results are presented as a spider diagram.
As you see in the spider diagram, every dot is the F1-measure for an annotator on the specific dataset that owns this line.
But when we look further, there are minor or major changes for an annotator between the datasets, which leads us to the question: why?
Some changes are relatively clear, for example the Microposts dataset contains short sentences with only a little bit of context, but other changes are not quite clear.
This is a major drawback when developing an annotator and you don't know all the characteristics of each dataset.
With all the measures and datasets we have a foundation for the evaluation, but for now this is all done by hand, and the results are not comparable across single evaluations or between annotators.
So GERBIL evolved, which is a general entity annotation system.
It can be used for testing, optimizing and comparing annotators.
It provides a neat web interface and the possibility to add new datasets and annotators at runtime.
From a developer's perspective: I need the characteristics of the datasets in order to improve an annotator.
This leads us to the question of measuring dataset properties.
There are basic properties like the size of a dataset or the number of annotations and documents in it.
And there are some advanced properties: what types of entities are used, are there documents without annotations, and how popular are the entities?
And also: how ambiguous or diverse are the used entities and surface forms?
Considering these properties and the whole evaluation process, we came up with some questions and drawbacks with regard to named entity linking evaluation.
Can we show the influence of certain dataset characteristics?
Are annotators better on popular entities?
Can we use the existing datasets for annotators that are focused on some domain, especially on certain entity types?
Is there a way to compare datasets, or something like a difficulty level?
In general we found that results are less comparable between datasets, and a quite good annotator could be penalized when used in conjunction with inappropriate datasets.
So to sum these problems up:
We have a general benchmarking approach facing domain-specific annotators with a focus on geotagging, persons, organizations, tweets and so on.
The datasets are sparsely annotated with old knowledge bases in mind, but modern NEL tools are more precise and the knowledge bases are more complete.
The assumption was that all datasets have the same difficulty level, but they don't.
All this leads to results which are not comparable and which penalize good annotators.
Besides that, there are more problems growing out of this:
How do we get and manage all the domain-specific datasets?
How could we define a difficulty level?
And how can we keep up with upcoming trends?
To tackle these questions we have done multiple things.
We introduced an adjustable filter system which can be used for domain-specific evaluation.
We implemented measures for several dataset characteristics.
Both things are applied automatically and are extensible.
And we provide some visualizations of the results.
To provide a domain-specific evaluation, we built an adjustable filter system.
A general schematic can be seen on the right side.
As you can see, we get a list of annotations. First we unwrap them, leaving only the IRIs of the entities; then we clean the list of IRIs which are not compliant with the standard.
For performance reasons we cache the results and split them into chunks.
At this time we have implemented two basic filters, one dedicated to SPARQL queries and one to popularity.
A configuration example is shown on the left side.
Every filter takes a name and a backend service, like DBpedia, a file, or something else.
The actual filter here is a SPARQL query which returns all entity links that are persons. The double cross (##) placeholder is replaced with the real links.
And to cope with the amount of data, the filter can define a chunk size.
The popularity filter follows the same conventions, only using another service and filter query.
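The original configuration syntax is not reproduced here, but a hypothetical filter definition along the lines just described might look like this (the keys, values and the ## placeholder convention are illustrative, not GERBIL's actual format):

```python
# Hypothetical filter definitions mirroring the description above (not GERBIL's actual syntax).
PERSON_FILTER = {
    "name": "persons-only",
    "service": "https://dbpedia.org/sparql",   # backend service the filter queries
    "chunk_size": 5000,                        # number of entity IRIs sent per request
    # "##" is the placeholder that gets replaced with the current chunk of entity IRIs:
    "filter": """
        SELECT DISTINCT ?e WHERE {
            VALUES ?e { ## }
            ?e a <http://dbpedia.org/ontology/Person> .
        }
    """,
}

POPULARITY_FILTER = {
    "name": "popular-entities",
    "service": "pagerank-scores.tsv",          # e.g. a local file with popularity scores
    "chunk_size": 5000,
    "filter": "pagerank > 0.01",               # keep only entities above the threshold
}
```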
We came up with two measures for sparse datasets.
The first one shows the relative amount of empty documents within a dataset, meaning documents that don't have any annotations at all.
And the second one is called missing annotations, which is a little bit misleading, so density is the better name.
Since we cannot measure what is missing, we decided to measure how many expectations are in a document with respect to all possible expectations for that document, also known as words.
It is quite important to mention that these measures only affect a certain subtask in named entity linking. If an annotator only gets a text fragment and searches for entity mentions by itself, it may find entity mentions in not annotated documents, or new ones in datasets with a low density. This leads to a penalization, since the annotator results are counted as false positives although they could be right.
The ambiguity can be described in both directions, for either surface forms or entities.
For example, the entity Airport Tegel can be referred to by three surface forms, which makes them all synonyms.
The other way round, the surface form Bruce is linked to two entities, which makes it a homonym.
Since no exhaustive collection of relations between surface forms and entities exists, the true measures remain unknown.
But the level of ambiguity gives a rough indication of how hard a dataset is to disambiguate.