1. Don’t compare Apples to Oranges -
Extending GERBIL for a fine grained NEL Evaluation
Jörg Waitelonis, Henrik Jürges, Harald Sack
Hasso-Plattner-Institute for IT-Systems Engineering, University of Potsdam
Semantics 2016, Leipzig, Germany, September 12-15th, 2016
2. Agenda
1. NEL and NEL evaluation
2. Dataset properties and evaluation drawbacks
3. Extending GERBIL
● Building conditional datasets
● Measure dataset characteristics
4. Results
5. Demonstration
6. Summary & Future work
4. Named Entity Linking (NEL), Principle
Example: “Armstrong landed on the moon.”
● Entity mentions with surface forms: “Armstrong”, “moon”
● Candidates for “Armstrong”: dbr:Neil_Armstrong, dbr:Lance_Armstrong, dbr:Louis_Armstrong, …
● Candidates for “moon”: dbr:Moon, dbr:Lunar, …
● Correct entities: dbr:Neil_Armstrong, dbr:Moon
Typical annotator pipeline:
1. Tokenize text
2. Find candidates in KB
3. Score candidates with a magic algorithm and select the best one
Scoring approaches: String Distance, Link Analysis, Vector Space, Fuzzy String Matching, Conditional Random Fields, Random Forest, RankSVM, Learning to Rank, Surface Aggregation, Word Embeddings, Context Similarity Matching
Example annotators: KEA, Wikifier
● Algorithms only approximate the correct entities
● Need for verification and testing
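As an illustration of these three steps (not the actual KEA or Wikifier implementation), here is a minimal toy sketch: the candidate index is a hypothetical dictionary standing in for a real knowledge-base lookup, and plain string similarity stands in for the "magic" scoring algorithm.

```python
from difflib import SequenceMatcher

# Toy candidate index: surface form -> candidate KB entities (hypothetical data).
CANDIDATE_INDEX = {
    "armstrong": ["dbr:Neil_Armstrong", "dbr:Lance_Armstrong", "dbr:Louis_Armstrong"],
    "moon": ["dbr:Moon", "dbr:Lunar"],
}

def link_entities(text):
    """1. tokenize, 2. look up candidates, 3. score and pick the best one."""
    links = {}
    for token in text.lower().replace(".", "").split():      # 1. tokenize
        candidates = CANDIDATE_INDEX.get(token, [])           # 2. candidate lookup in the "KB"
        if not candidates:
            continue
        # 3. "magic" scoring: here just string similarity between token and entity label
        best = max(candidates,
                   key=lambda c: SequenceMatcher(None, token, c.split(":")[1].lower()).ratio())
        links[token] = best
    return links

print(link_entities("Armstrong landed on the moon."))
# {'armstrong': 'dbr:Neil_Armstrong', 'moon': 'dbr:Moon'}
```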
5. Named Entity Linking, Evaluation
● A dataset consists of:
■ Documents (strings/sentences)
■ Annotations (ground truth)
● Example datasets: ACE2004, AIDA/CoNLL, DBpedia Spotlight, IITB, KORE50, MSNBC, Microposts2014, N3-RSS-500, N3-Reuters-128, WES2015
● Traditional measures are:
■ Precision: defines how well an annotator works
■ Recall: defines how complete the results are
■ F1-measure: harmonic mean between precision and recall
■ And more, cf. Rizzo et al. [1]
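As a hedged illustration (GERBIL's actual implementation additionally distinguishes micro/macro averaging and different matching modes), precision, recall and F1 for a single document can be computed like this, treating the gold standard and the annotator output as sets of (surface form, entity) pairs:

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for one document.

    gold, predicted: sets of (surface_form, entity_iri) pairs.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("Armstrong", "dbr:Neil_Armstrong"), ("moon", "dbr:Moon")}
predicted = {("Armstrong", "dbr:Lance_Armstrong"), ("moon", "dbr:Moon")}
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
```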
6. Named Entity Linking, Benchmarking
● GERBIL - a general entity annotation system (AKSW Leipzig), cf. Usbeck et al. [2]
● Used for testing/optimizing/benchmarking annotators
● Neat web interface
● 13 annotators / 20 datasets
● F-measure too coarse for a detailed evaluation
● Developers need dataset insights
7. Properties of Datasets
● Size of a dataset
● Number of annotations/documents/words
● What types of entities are used? E.g. persons, places, events, …
● Are there documents without annotations? E.g. Microposts 2014
● How popular are the entities? E.g. PageRank, indegree
● How ambiguous are the entities and surface forms?
● How diverse are the entities and surface forms?
● …
Cf. van Erp et al. [3]
8. Research Questions and Drawbacks
● How do dataset characteristics influence the evaluation results?
● How does the popularity of entities influence the evaluation results?
● How can a general dataset be used for domain-specific NEL tools?
● How can datasets be compared? Is there something like a general difficulty?
● Limited comparability between benchmark results
● Penalization of good annotators with inappropriate datasets
Cf. van Erp et al. [3]
9. Extending GERBIL
● Approach for a solution:
■ Adjustable filter system for GERBIL
■ Expose dataset characteristics
■ Datasets and annotators added at runtime are also included
■ Visualize the results
10. Extending GERBIL, Conditional Datasets
[Figure: the original dataset is split into type- and popularity-specific sub-datasets using rdf:type constraints and a popularity threshold PR(e) > t; the annotator runs on the documents, and each specific sub-dataset is evaluated against the corresponding annotator results, yielding separate benchmark results.]
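The sketch below illustrates the idea behind these conditional datasets; the document representation and the pagerank()/rdf_types() helpers are hypothetical stand-ins for GERBIL's actual data structures and knowledge-base lookups.

```python
# Toy lookup tables standing in for knowledge-base queries (hypothetical data).
POPULARITY_SCORES = {"dbr:Neil_Armstrong": 0.8, "dbr:Moon": 0.9}
ENTITY_TYPES = {"dbr:Neil_Armstrong": {"dbo:Person"}, "dbr:Moon": {"dbo:CelestialBody"}}

def pagerank(entity_iri):
    return POPULARITY_SCORES.get(entity_iri, 0.0)

def rdf_types(entity_iri):
    return ENTITY_TYPES.get(entity_iri, set())

def conditional_dataset(dataset, entity_type=None, min_pagerank=None):
    """Keep only gold annotations whose entity satisfies the type/popularity condition.

    dataset: list of documents, each a dict {"text": str, "annotations": [(mention, entity_iri)]}
    """
    filtered = []
    for doc in dataset:
        kept = [(mention, entity) for mention, entity in doc["annotations"]
                if (entity_type is None or entity_type in rdf_types(entity))
                and (min_pagerank is None or pagerank(entity) > min_pagerank)]  # PR(e) > t
        filtered.append({"text": doc["text"], "annotations": kept})
    return filtered

# Each conditional dataset is then benchmarked separately, e.g.:
# persons = conditional_dataset(gold, entity_type="dbo:Person")
# popular = conditional_dataset(gold, min_pagerank=0.5)
```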
11. Results, Types
13. Extending GERBIL, Not Annotated Documents
● Not annotated documents: the relative amount of empty documents (documents without any annotations) within a dataset
● Only relevant if the annotator searches for entity mentions by itself
14. Extending GERBIL, Density
● Density: the ratio between the number of annotations and the number of words in a document
● Only relevant if the annotator searches for entity mentions by itself
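Both measures are straightforward to compute; a small sketch, assuming the same simple document representation as in the earlier examples (a dict with "text" and "annotations"), could look like this:

```python
def not_annotated_ratio(dataset):
    """Relative amount of documents that carry no gold annotations at all."""
    empty = sum(1 for doc in dataset if not doc["annotations"])
    return empty / len(dataset) if dataset else 0.0

def annotation_density(doc):
    """Number of annotations divided by the number of words in a document
    (whitespace tokenization used as a rough word count)."""
    words = doc["text"].split()
    return len(doc["annotations"]) / len(words) if words else 0.0
```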
15. Extending GERBIL, Likelihood of Confusion
● Likelihood of Confusion (level of ambiguity)
● True measures are unknown due to missing exhaustive collections
● Gives a rough overview of how difficult a dataset is to disambiguate
[Figure: entities vs. surface forms. The entity Airport Tegel is referred to by the surface forms “Tegel”, “TXL”, and “Otto Lilienthal” (synonyms); the surface form “Bruce” can refer to both Bruce Lee and Bruce Willis (homonyms).]
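A small sketch of how such ambiguity counts can be approximated from a surface-form dictionary (e.g. built from labels and anchor texts); the dictionary below is a hypothetical toy example and, like any such collection, understates the true ambiguity:

```python
from collections import defaultdict

# Toy surface-form dictionary: surface form -> entities it may refer to (hypothetical data).
SF_DICT = {
    "Tegel": {"dbr:Berlin_Tegel_Airport"},
    "TXL": {"dbr:Berlin_Tegel_Airport"},
    "Otto Lilienthal": {"dbr:Berlin_Tegel_Airport", "dbr:Otto_Lilienthal"},
    "Bruce": {"dbr:Bruce_Lee", "dbr:Bruce_Willis"},
}

def surface_form_ambiguity(sf_dict):
    """Number of possible entities (homonym count) per surface form."""
    return {sf: len(entities) for sf, entities in sf_dict.items()}

def entity_ambiguity(sf_dict):
    """Number of known surface forms (synonym count) per entity."""
    counts = defaultdict(int)
    for entities in sf_dict.values():
        for entity in entities:
            counts[entity] += 1
    return dict(counts)
```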
16. Results, Likelihood of Confusion
● A high red bar indicates that an entity has a high number of homonyms
● A high blue bar indicates that a surface form has a high number of synonyms
[Charts: likelihood of confusion per entity and per surface form]
17. Extending GERBIL, Dominance of Entities
● dominance(e) = e(t) / e(v)
● Expresses the relation between the surface forms used for an entity in the test data and all its surface forms in the vocabulary
● True measures unknown
● High rates prevent overfitting
● Prevents repetition of surface forms
[Figure: in the test data the entity dbr:Bruce_Willis is mentioned as “Bruce” and “Bruci”, while the vocabulary also contains the surface forms “Bruce Willis” and “Bruce Walter Willis”.]
18. Extending GERBIL, Dominance of Surface Forms
● dominance(s) = s(t) / s(v)
● Expresses the relation between the entities a surface form is linked to in the test data and all entities it may refer to in the vocabulary
● True measures unknown
● High rates prevent overfitting
● Indicates how context dependent a disambiguation is
[Figure: in the test data the surface form “Angelina” is linked to only part of its possible meanings, while the vocabulary also contains dbr:Irene_Angelina, dbr:Angelina_Jordan, and dbr:Angelina_Jolie.]
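Under the reading above (surface forms per entity, and entities per surface form, in the test data versus the vocabulary), the two dominance values could be sketched as follows; the vocabulary and surface-form dictionary are hypothetical inputs, and the exact definition in the paper may differ in detail:

```python
def entity_dominance(entity, test_annotations, vocabulary):
    """dominance(e) = e(t) / e(v): distinct surface forms used for the entity in the
    test data divided by all surface forms the vocabulary knows for it.

    test_annotations: iterable of (surface_form, entity_iri) pairs from the dataset.
    vocabulary: dict entity_iri -> set of known surface forms.
    """
    used = {sf for sf, e in test_annotations if e == entity}
    known = vocabulary.get(entity, set())
    return len(used & known) / len(known) if known else 0.0

def surface_form_dominance(surface_form, test_annotations, sf_dict):
    """dominance(s) = s(t) / s(v): distinct entities the surface form is linked to in the
    test data divided by all entities it may refer to according to the dictionary.

    sf_dict: dict surface_form -> set of entity IRIs it may denote.
    """
    used = {e for sf, e in test_annotations if sf == surface_form}
    known = sf_dict.get(surface_form, set())
    return len(used & known) / len(known) if known else 0.0
```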
19. Results, Dominance
● A blue bar indicates that a variety of surface forms is used for an entity
● A red bar indicates how context dependent the disambiguation of a surface form is
[Charts: dominance of entities and dominance of surface forms]
21. Summary & Future work
■ Summary:
□ Implemented a domain-specific filter system
□ Measure dataset characteristics
□ Annotator results are nearly the same on entities of different popularity
□ Enable specific analyses and optimization of annotators
□ Enable users to select the tool that performs best for a specific domain
■ Future work:
□ Keep up with GERBIL development, increase performance
□ More measurements, e.g. max_recall
□ Dataset remixing ≙ assembling new customized datasets
– E.g. unpopular companies
22. References
[1] Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca Pereira, and Andrea Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44–53. CEUR-WS.org, 2015.
[2] M. Röder, R. Usbeck, and A.-C. Ngonga Ngomo. GERBIL's New Stunts: Semantic Annotation Benchmarking Improved. Technical report, Leipzig University, 2016.
[3] M. van Erp, P. Mendes, H. Paulheim, F. Ilievski, J. Plu, G. Rizzo, and J. Waitelonis. Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In Proc. of the 10th Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, 2016.
Hello and welcome to this talk. Thank you all very much for coming today.
My name is Henrik Jürges and I'm a student research assistant in the Semantic Web Technologies research group at the HPI.
This is joint work with Jörg Waitelonis and Harald Sack.
The title of my presentation is "Don't compare Apples to Oranges - Extending GERBIL for a fine grained NEL evaluation".
And as the name suggests, my purpose is to give you a brief overview of our approach to follow new trends in named entity linking.
(So I'm hoping to cover three points. First I will introduce GERBIL and its shortcomings in covering new trends, after that we will look at the work we have done to lay a foundation for future work in this direction, and finally I will present some results.)
The presentation is structured as follows:
First, I give you a brief overview of named entity linking and its evaluation.
After that, I state some questions regarding current datasets and evaluation methods.
Then I show you our approach to deal with these questions and present some of the results we got.
Finally, I sum up this presentation and give a quick view of future work.
A more complete example is the next sentence: "Armstrong landed on the moon."
Generally, the same three steps are applied by almost every NEL tool, or so-called annotator.
The text is split into tokens by whitespace and important entity mentions are located. These are called surface forms.
After that, possible candidates are retrieved from a formal knowledge base like DBpedia, e.g. by comparing the surface forms and entity labels.
Then some magic algorithm ranks these candidates. As you can see, there are many possible algorithms, from simple ones like string distance to complex ones like random forests. These are all implemented by the various tools you can see on the right side.
The highest ranked candidate will hopefully be our correct entity.
This leads us to the evaluation of named entity linking.
Let me show you a more enlightening example of named entity linking.
First, we have some text fragment, usually a sentence or a larger text. Here the phrase "Armstrong landed on the moon." is our fragment.
Then we have the application fulfilling the task of named entity linking.
Almost all named entity linking tools, or so-called annotators, have three basic stages:
First, they tokenize the text into words. After that, they search for candidates in some formal knowledge base like DBpedia.
Finally, they apply a magic algorithm to these candidates to find the best one.
Here we should distinguish between two common tasks in named entity linking.
One task only disambiguates provided entity mentions with the help of the text fragment; it is called disambiguate to knowledge base.
The other task provides only the text fragment, leaving the annotator to find the expected entity mentions and disambiguate them.
This task is called annotate to knowledge base and encapsulates the former task.
Ok, taking this slowly: we have our text fragment and tokenize the text, searching for entity mentions.
We find two possible entity mentions: "Armstrong" and "the moon". The textual fragments for the two mentions are called surface forms,
which are in the end just words that match syntactically.
After that, we search one or multiple formal knowledge bases for possible candidates for each surface form, leading us to the candidates Neil Armstrong, Lance Armstrong, Louis Armstrong and so on.
Having all candidates together, the annotator uses some magic algorithm for scoring them.
These algorithms vary in their complexity, from simple ones like string distance to more complex ones
like random forests or context similarity matching.
Luckily for us, there are many annotators implementing these algorithms. Some of them are shown on the right side.
After scoring the candidates, the one with the highest score is hopefully the correct entity we're searching for.
This leads us to the entities Neil Armstrong and the Moon for our entity mentions.
Since these algorithms only approximate the correct entities, we need evaluation to verify and test the results.
To do so we need datasets, which are the models of our expectations.
A dataset contains two things: first, the documents or text fragments, and second, the expected annotations for these documents, which are called the ground truth.
For the evaluation we need some measures, which are borrowed from other research areas.
The most common measures are precision, which defines how accurate the results are, and recall, which defines how complete the results are.
And to combine these, the F1-measure gives an overview of the quality of the annotator's results.
Since we have the matrix for comparing results against our expectations, we need something that actually represents these expectations.
For this we use datasets.
A dataset models our expectations with two parts.
First, it contains the documents or text fragments which we use for evaluation. These documents are mostly strings of one or more sentences.
Second, it contains a set of annotations, the so-called gold standard or ground truth, which is our expectation of the annotator results for these documents.
The ground truth is either hand-crafted by multiple researchers or taken from annotators and corrected later.
Some of the well-known datasets are mentioned on the right.
With the help of our expectation matrix we can now count the correct, missing and false results.
But these raw values are hard to compare and carry little meaning on their own, so some measures are borrowed from other research areas.
I will present the three most common ones here, which are all based on our expectation matrix.
First, precision defines how well an annotator works.
Second, recall describes how complete the results are.
And lastly, the F1-measure is the harmonic mean of precision and recall. It gives an overview of the quality an annotator produces.
But doing this by hand is a quite error-prone and annoying task, so benchmarking and evaluation systems evolved over time, and the successor of them all is GERBIL.
GERBIL is work done by our colleagues from AKSW Leipzig and is a general entity annotation system which can be used for testing, optimizing and benchmarking annotators.
It provides a nice web interface for configuring, benchmarking and visualizing datasets and annotators.
At the moment there are 13 annotators and 20 datasets provided, but new datasets and annotators can be added at runtime.
The results are presented as a spider diagram.
As you see in the spider diagram, every dot is the F1-measure for an annotator on the specific dataset that owns this line.
But when we look further, there are minor or major changes for an annotator between the datasets, which leads us to the question: why?
Some changes are relatively clear, for example the Microposts dataset contains short sentences with only a little bit of context, but other changes are not quite clear.
This is a major drawback when developing an annotator and you don't know all the characteristics of each dataset.
With all the measures and datasets we have a foundation for the evaluation, but for now this is all done by hand, and the results are not comparable across single evaluations or between annotators.
So GERBIL evolved, which is a general entity annotation system.
It can be used for testing, optimizing and comparing annotators.
It provides a neat web interface and the possibility to add new datasets and annotators at runtime.
From a developer's perspective: I need the characteristics of the datasets in order to improve an annotator.
This leads us to the question of measuring dataset properties.
There are basic properties like the size of a dataset or the number of annotations and documents in it.
And there are some advanced properties: what types of entities are used, are there documents without annotations, and how popular are the entities?
And also: how ambiguous or diverse are the used entities and surface forms?
Considering these properties and the whole evaluation process, we came up with some questions and drawbacks with regard to named entity linking evaluation.
Can we show the influence of certain dataset characteristics?
Are annotators better on popular entities?
Can we use the existing datasets for annotators that are focused on some domain, especially on certain entity types?
Is there a way to compare datasets, or something like a difficulty level?
In general we found that results are less comparable between datasets, and a quite good annotator could be penalized when used in conjunction with inappropriate datasets.
So to sum these problems up:
We have a general benchmarking approach facing domain-specific annotators with a focus on geotagging, persons, organizations, tweets and so on.
The datasets are sparsely annotated with old knowledge bases in mind, but modern NEL tools are more precise and the knowledge bases are more complete.
The assumption was that all datasets have the same difficulty level, but they don't.
All this leads to results which are not comparable and which penalize good annotators.
Besides that, there are more problems growing out of this:
How do we get and manage all the domain-specific datasets?
How could we define a difficulty level?
And how can we keep up with upcoming trends?
To tackle these questions we have done multiple things.
We introduced an adjustable filter system which can be used for domain-specific evaluation.
We implemented measures for several dataset characteristics.
Both things are applied automatically and are extensible.
And we provide some visualizations of the results.
To provide a domain-specific evaluation, we built an adjustable filter system.
A general schematic can be seen on the right side.
As you can see, we get a list of annotations. First we unwrap them, leaving only the IRIs of the entities; then we clean the list of IRIs which are not compliant with the standard.
For performance reasons we cache the results and split them into chunks.
At this time we have implemented two basic filters, one dedicated to SPARQL queries and one to popularity.
A configuration example is shown on the left side.
Every filter takes a name and a backend service, like DBpedia, a file, or something else.
The actual filter here is a SPARQL query which returns all entity links that are persons. The double cross (##) placeholder is replaced with the real links.
And to cope with the amount of data, the filter can define a chunk size.
The popularity filter follows the same conventions, only using another service and filter query.
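The original configuration syntax is not reproduced here, but a hypothetical filter definition along the lines just described might look like this (the keys, values and the ## placeholder convention are illustrative, not GERBIL's actual format):

```python
# Hypothetical filter definitions mirroring the description above (not GERBIL's actual syntax).
PERSON_FILTER = {
    "name": "persons-only",
    "service": "https://dbpedia.org/sparql",   # backend service the filter queries
    "chunk_size": 5000,                        # number of entity IRIs sent per request
    # "##" is the placeholder that gets replaced with the current chunk of entity IRIs:
    "filter": """
        SELECT DISTINCT ?e WHERE {
            VALUES ?e { ## }
            ?e a <http://dbpedia.org/ontology/Person> .
        }
    """,
}

POPULARITY_FILTER = {
    "name": "popular-entities",
    "service": "pagerank-scores.tsv",          # e.g. a local file with popularity scores
    "chunk_size": 5000,
    "filter": "pagerank > 0.01",               # keep only entities above the threshold
}
```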
We came up with two measures for sparse datasets.
The first one shows the relative amount of empty documents within a dataset, meaning documents that don't have any annotations at all.
And the second one is called missing annotations, which is a little bit misleading, so density is the better name.
Since we cannot measure what is missing, we decided to measure how many expectations are in a document with respect to all possible expectations for that document, also known as words.
It is quite important to mention that these measures only affect a certain subtask in named entity linking. If an annotator only gets a text fragment and searches for entity mentions by itself, it may find entity mentions in not annotated documents, or new ones in datasets with a low density. This leads to a penalization, since the annotator results are counted as false positives although they could be right.
The ambiguity can be described in both directions, for either surface forms or entities.
For example, the entity Airport Tegel can be referred to by three surface forms, which makes them all synonyms.
The other way round, the surface form Bruce is linked to two entities, which makes it a homonym.
Since no exhaustive collection of relations between surface forms and entities exists, the true measures remain unknown.
But the level of ambiguity gives a rough indication of how hard a dataset is to disambiguate.