NERD: Evaluating Named Entity Recognition Tools in the Web of Data
Giuseppe Rizzo <giuseppe.rizzo@eurecom.fr>
Raphaël Troncy <raphael.troncy@eurecom.fr>
24 October 2011 – Workshop on Web Scale Knowledge Extraction (WEKEX'11)
What is a Named Entity recognition task?
A task that aims to locate and classify, within a textual document, the names of
people, organizations, locations, brands and products, as well as numeric
expressions such as times, dates, monetary amounts and percentages
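To make the task concrete, here is a minimal sketch using NLTK's off-the-shelf chunker. NLTK is only an illustration, not one of the extractors compared in these slides, and it assumes the standard NLTK models have been downloaded.

    # Illustration of the NER task: locate and classify names in a text.
    # Assumes: pip install nltk, plus nltk.download() of 'punkt', the default
    # POS tagger, 'maxent_ne_chunker' and 'words'.
    import nltk

    text = "Google cars drive themselves around the streets of Mountain View, California."
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)

    for subtree in tree.subtrees():
        if subtree.label() in ("PERSON", "ORGANIZATION", "GPE", "LOCATION"):
            entity = " ".join(word for word, pos in subtree.leaves())
            print(entity, "->", subtree.label())
    # Rough expected output (labels depend on the model):
    #   Google -> ORGANIZATION, Mountain View -> GPE, California -> GPE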
Named Entity recognition tools
Differences among those NER extractors
Granularity
  extraction of NEs from single sentences vs. from the entire document
Technologies used
  algorithms used to extract NEs
  supported languages
  taxonomy of the NE types recognized
  disambiguation (dataset used to provide links)
  content request size
Response format
And ...
What about precision and recall?
Which extractor best fits my needs?
What is NERD?
NERD seeks to find the pros and cons of those extractors
REST API: http://nerd.eurecom.fr/api/application.wadl
UI: http://nerd.eurecom.fr/
Ontology: http://nerd.eurecom.fr/ontology
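A minimal sketch of how a client might discover the service: fetch the WADL descriptor listed above and print the resource paths it declares. Only the WADL URL comes from these slides; the WADL namespace and the continued availability of the service are assumptions.

    # Minimal sketch: list the resources declared by the NERD WADL descriptor.
    # Assumes the service is reachable and uses the standard 2009 WADL namespace.
    import urllib.request
    import xml.etree.ElementTree as ET

    WADL_URL = "http://nerd.eurecom.fr/api/application.wadl"
    WADL_NS = "{http://wadl.dev.java.net/2009/02}"

    with urllib.request.urlopen(WADL_URL) as response:
        root = ET.fromstring(response.read())

    # Print every declared resource path and the HTTP methods it supports.
    for resource in root.iter(WADL_NS + "resource"):
        methods = [m.get("name") for m in resource.findall(WADL_NS + "method")]
        print(resource.get("path"), methods)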
Showcase
http://nerd.eurecom.fr
Science: "Google Cars Drive Themselves", http://bit.ly/oTj8md (part
of the original resource found at http://nyti.ms/9p19i8)
Evaluation
5 extractors using default configurations
Controlled experiment
4 human raters
10 English news articles (5 from BBC and 5 from The New York Times)
each rater evaluated each article with all the extractors
200 evaluations in total
Uncontrolled experiment
17 human raters
53 English news articles (sources: CNN, BBC, The New York Times and
Yahoo! News)
free selection of articles
Each human rater received training1
1 http://nerd.eurecom.fr/help
Evaluation output
t = (NE, type, URI, relevant)
The assessment consists of rating each of these criteria with a Boolean value
If no type or no disambiguation URI is provided by the extractor, that criterion is
considered false by default
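As a minimal sketch, one assessment could be encoded as follows. The field names and the helper function are illustrative, not part of the NERD API; the default-false rule above is applied explicitly.

    # Sketch of one evaluation tuple t = (NE, type, URI, relevant) rated with Booleans.
    from collections import namedtuple

    Assessment = namedtuple("Assessment", ["ne", "type_ok", "uri_ok", "relevant"])

    def assess(ne, extracted_type, extracted_uri, type_ok, uri_ok, relevant):
        # A missing type or disambiguation URI counts as false by default.
        return Assessment(
            ne=ne,
            type_ok=bool(type_ok) if extracted_type else False,
            uri_ok=bool(uri_ok) if extracted_uri else False,
            relevant=bool(relevant),
        )

    print(assess("Google", "Organization", None, type_ok=True, uri_ok=True, relevant=True))
    # Assessment(ne='Google', type_ok=True, uri_ok=False, relevant=True)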
Controlled experiment - dataset1
Categories: World, Business, Sport, Science, Health
1 BBC article and 1 NYT article for each category
Average number of words per article: 981
The final number of unique entities detected is 4641, with an average number
of named entities per article equal to 23.2
Some of the extractors (e.g. DBpedia Spotlight and Extractiv) produce duplicate
NEs. We removed all duplicates so as not to bias the statistics (see the sketch below)
1 http://nerd.eurecom.fr/ui/evaluation/wekex2011-goldenset.tar.gz
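A minimal sketch of that de-duplication step, assuming each detection is kept as a (surface form, type, URI) triple; the field layout and example values are illustrative.

    # Remove duplicate detections so each one is counted once in the statistics.
    detections = [
        ("Google", "Organization", "http://dbpedia.org/resource/Google"),
        ("Google", "Organization", "http://dbpedia.org/resource/Google"),  # duplicate
        ("New York", "Location", "http://dbpedia.org/resource/New_York"),
    ]

    unique = sorted(set(detections))
    print(len(detections), "->", len(unique))   # 3 -> 2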
Controlled experiment – agreement score
Fleiss's kappa score1
[Charts: Fleiss' kappa scores grouped by extractor, by source and by category]
1 Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971
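As a minimal sketch of the agreement computation (standard Fleiss' kappa, not code from the NERD project), where each row counts how many raters put one rated item into each category:

    # Fleiss' kappa for N items rated by the same number of raters into k categories.
    def fleiss_kappa(counts):
        n_items = len(counts)
        n_raters = sum(counts[0])                     # assumes equal raters per item
        n_categories = len(counts[0])
        total = n_items * n_raters
        # Chance agreement from the marginal proportions of each category.
        p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
        p_e = sum(p * p for p in p_j)
        # Observed agreement, averaged over items.
        p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
        p_bar = sum(p_i) / n_items
        return (p_bar - p_e) / (1 - p_e)

    # Example: 4 raters judging 3 named entities as relevant / not relevant.
    print(round(fleiss_kappa([[4, 0], [2, 2], [3, 1]]), 3))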
Controlled experiment – statistical results
[Charts: overall statistics, and results grouped by extractor (showing different behavior for different sources) and by category]
Uncontrolled experiment - dataset
17 raters were free to select English news articles from CNN, BBC,
The New York Times and Yahoo! News
53 news articles selected
Total number of assessments: 94, with an average of 5.2 assessments per rater
Each article was assessed with at least 2 different tools
The final number of unique entities detected is 1616, with an average number
of named entities per article equal to 34
Some of the extractors (e.g. DBpedia Spotlight and Extractiv) produce duplicate
NEs. In order not to bias the statistics, we removed all duplicates
Uncontrolled experiment – statistical results (I)
[Charts: overall precision, and precision grouped by extractor]
Uncontrolled experiment – statistical results (II)
[Chart: results grouped by category]
Conclusion
Q. Which are the best NER tools? A. They are ...
AlchemyAPI obtained the best results in NE extraction and categorization
DBpedia Spotlight and Zemanta showed the ability to disambiguate NEs in the
LOD cloud
Experiments across categories of articles did not show significant differences
in the analysis
Published the WEKEX'11 ground-truth
http://nerd.eurecom.fr/ui/evaluation/wekex2011-goldenset.tar.gz
Future Work (NERD Timeline)
beginning: core application
then: uncontrolled experiment
then: controlled experiment
today: REST API, release of the WEKEX'11 ground truth
future: release of the ISWC'11 ground truth
future: NERD "smart" service combining the best of all NER tools
ISWC'11 golden-set
Do you believe it's easy to reach agreement among all raters?
We'd like to invite you to create a new golden set during the
ISWC 2011 poster and demo session. We will kindly ask each
rater to evaluate two short excerpts from two English news
articles with all the extractors supported by NERD
Thanks for your time and your attention
http://nerd.eurecom.fr
@giusepperizzo @rtroncy #nerd
http://www.slideshare.net/giusepperizzo
Fleiss' Kappa
K = (P − Pe) / (1 − Pe), where P is the mean observed agreement and Pe the agreement expected by chance
K = 1: full agreement among all raters
K = 0 (or lower): poor agreement, no better than chance
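Written out in full (standard notation from Fleiss, 1971, with N items, n raters per item, k categories, and n_ij the number of raters assigning item i to category j):

    P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right), \qquad
    p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij}

    \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad
    \bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad
    \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}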