1. Effective Named Entity Recognition for Idiosyncratic Web Collections
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
WWW 2014
April 10, 2014
2. Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate named entity selection
• Dataset description
• Features description
• Experimental setup & Evaluation
3. Problem Definition
Given a document from an idiosyncratic Web collection (e.g., a scientific paper), identify the n-grams that are valid named entities, such as:
• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …
Entity type: scientific concept
4. Traditional NER
Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)
Properties:
• Require extensive training
• Usually domain-specific: new collections require retraining on their own domain
• Very good at detecting types such as Location, Person, and Organization
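To make the contrast concrete, here is a minimal sketch of what such an off-the-shelf tagger produces, using NLTK's bundled NE chunker on an invented sentence: classic types like PERSON and GPE are recognized, while a scientific concept such as "named entity recognition" is not.

```python
import nltk

# One-time downloads for the bundled tokenizer, POS tagger and NE chunker.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = ("John Smith of Google presented a named entity "
            "recognition system in Zurich.")
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Prints classic entity types (PERSON, ORGANIZATION, GPE);
# the scientific concept "named entity recognition" stays undetected.
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```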
5. Proposed Approach
We cast the problem as a classification task.
Two-step classification:
• Extract candidate named entities using a frequency filtration algorithm.
• Classify candidate named entities with a supervised classifier.
Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.
7. Candidate Selection: Part I
Consider all bigrams with frequency > k (k = 2):
• candidate named: 5
• entity are: 4
• entity candidate: 3
• entity in: 18
• entity recognition: 12
• named entity: 101
• of named: 10
• that named: 3
• the named: 4
After the NLTK stop-word filter (bigrams containing stop words such as "are", "in", "of", "that", "the" are dropped):
• candidate named: 5
• entity candidate: 3
• entity recognition: 12
• named entity: 101
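A sketch of this selection step, assuming documents arrive as lowercased token lists (the function name and interface are ours, not from the slides):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def candidate_bigrams(tokens, k=2):
    """Bigrams occurring more than k times and containing no stop word."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items()
            if n > k and not set(bigram) & STOP}
```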
8. Candidate Selection: Part II
Trigram frequencies are looked up from the n-gram index:
• candidate named entity: 5
• named entity candidate: 3
• named entity recognition: 12
Each surviving bigram's frequency is then reduced by the frequencies of the trigrams that contain it:
• named entity: 101 - (5 + 3 + 12) = 81
• candidate named: 5 - 5 = 0
• entity candidate: 3 - 3 = 0
• entity recognition: 12 - 12 = 0
Bigrams fully subsumed by frequent trigrams drop out of the candidate set, while the trigrams themselves remain as candidates.
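Our reading of the subtraction above, as a hedged sketch (the helper name and threshold handling are assumptions):

```python
def subtract_trigram_counts(bigram_counts, trigram_counts, k=2):
    """Reduce each bigram's frequency by the frequencies of the trigrams
    containing it, then drop bigrams that fall back under the threshold."""
    adjusted = dict(bigram_counts)
    for trigram, n in trigram_counts.items():
        for bigram in (trigram[:2], trigram[1:]):  # both bigrams inside it
            if bigram in adjusted:
                adjusted[bigram] -= n
    return {bigram: n for bigram, n in adjusted.items() if n > k}
```

On the slide's example this leaves only "named entity" (81) among the bigrams.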
14. Features: POS Tags, part II
Two feature schemes:
• Raw POS tag patterns: each observed tag pattern is a binary feature
• Regex POS tag patterns:
  • First-tag match, e.g. JJ NNS, JJ NN NN, JJ NN, ... → JJ*
  • Last-tag match, e.g. NN VB, NN NN VB, JJ NN VB, ... → *VB
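A sketch of both schemes; note that it tags each n-gram out of sentence context, which is a simplification (the paper presumably tags within the source sentence):

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_pattern(ngram_tokens):
    """Raw scheme: the tag pattern itself is a binary feature,
    e.g. ('information', 'retrieval') -> 'NN NN'."""
    return " ".join(tag for _, tag in nltk.pos_tag(ngram_tokens))

def regex_pos_features(pattern):
    """Regex scheme: collapse a raw pattern into first-tag and
    last-tag features, e.g. 'JJ NN' -> {'JJ*': True, '*NN': True}."""
    tags = pattern.split()
    return {tags[0] + "*": True, "*" + tags[-1]: True}
```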
15. Features: External Knowledge Bases
Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned keywords for papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info
We perform exact string matching against these KBs.
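The matching itself reduces to set membership; a minimal sketch, assuming the KB term lists are loaded and normalized beforehand:

```python
def kb_features(ngram, dblp_keywords, sciencewise_concepts):
    """Binary features: exact match against each knowledge base.
    Both arguments are assumed to be sets of normalized terms."""
    return {"in_dblp": ngram in dblp_keywords,
            "in_sciencewise": ngram in sciencewise_concepts}
```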
16. Features: DBPedia, part I
DBPedia pages essentially represent valid entities.
But there are a few problems when:
• the n-gram is not an entity
• the n-gram is not a scientific concept ("Tom Cruise" in an IR paper)

                         CS Collection          Physics Collection
                         Precision  Recall      Precision  Recall
Exact string matching    0.9045     0.2394      0.7063     0.0155
Matching with redirects  0.8457     0.4229      0.7768     0.5843
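A sketch of the two matching strategies from the table, assuming DBpedia page titles and the redirect mapping were loaded from the public dumps (the data structures here are our assumption):

```python
def dbpedia_match(ngram, page_titles, redirects, use_redirects=True):
    """Exact match against DBpedia page titles, optionally following
    a redirect title to its target page. page_titles: set of normalized
    titles; redirects: dict mapping redirect title -> target title."""
    if ngram in page_titles:
        return True
    return use_redirects and redirects.get(ngram) in page_titles
```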
17. Features: DBPedia, part II
[Figure: number of components per component size, with a log-scale y-axis; left panel: without redirects, right panel: with redirects.]
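The slides show only the resulting distributions; one plausible way to obtain them (our assumption about the graph construction, which the slide does not spell out) is to connect the DBpedia pages matched in a document via DBpedia page links and measure connected-component sizes:

```python
import networkx as nx

def component_sizes(matched_pages, page_links):
    """Size of the connected component of each matched DBpedia page,
    in the subgraph induced by DBpedia page links (pairs of titles)."""
    g = nx.Graph()
    g.add_nodes_from(matched_pages)
    g.add_edges_from((a, b) for a, b in page_links
                     if a in matched_pages and b in matched_pages)
    return {page: len(component)
            for component in nx.connected_components(g)
            for page in component}
```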
18. Features: Syntactic
Set of common syntactic features:
• N-gram length in words
• Whether the n-gram is uppercased
• The number of other n-grams the given n-gram is a part of
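A direct sketch of these three features; the slide does not specify whether "uppercased" means capitalized or all-caps, so we test for a leading capital, and we use a substring test as an approximation for n-gram containment:

```python
def syntactic_features(ngram, all_ngrams):
    """Surface features for a candidate n-gram given as a string."""
    return {
        "length": len(ngram.split()),
        "uppercased": ngram[:1].isupper(),
        "superngram_count": sum(1 for other in all_ngrams
                                if other != ngram and ngram in other),
    }
```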
19. Experiments: Overview
1. Regex POS patterns vs. normal POS tags
2. Redirects vs. non-redirects
3. Feature importance scores
4. MaxEntropy comparison
All results are averaged over 10-fold cross-validation.
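The evaluation loop itself is straightforward; a sketch using scikit-learn (our choice of toolkit, as the slides do not name the decision tree implementation used):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    """Mean F1 over 10-fold cross-validation.
    X: one feature vector per candidate n-gram; y: binary entity labels."""
    return cross_val_score(DecisionTreeClassifier(), X, y,
                           cv=10, scoring="f1").mean()
```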
20. Experiments: Comparison I

CS Collection                       Precision  Recall   F1 score  Accuracy  # features
Normal POS + Components             0.8794     0.8058*  0.8409*   0.8429*   54
Regex POS + Components              0.8475*    0.8524*  0.8499*   0.8448*   9
Normal POS + Components-Redirects   0.8678*    0.8305*  0.8487*   0.8473    50
Regex POS + Components-Redirects    0.8406*    0.8769   0.8584    0.8509    7

The symbol * indicates a statistically significant difference as compared to the approach in bold.
21. Experiments: Comparison II

Physics Collection                  Precision  Recall   F1 score  Accuracy  # features
Normal POS + Components             0.8253*    0.6567*  0.7311*   0.7567    53
Regex POS + Components              0.7941*    0.6781   0.7315*   0.7492*   4
Normal POS + Components-Redirects   0.8339     0.6674*  0.7412    0.7653    50
Regex POS + Components-Redirects    0.8375     0.6479*  0.7305*   0.7592*   6

The symbol * indicates a statistically significant difference as compared to the approach in bold.
23. Experiments: MaxEntropy

                   Precision  Recall   F1 score
Maximum Entropy    0.6566     0.7196   0.6867
Decision Trees     0.8121     0.8742   0.8420

The MaxEnt classifier receives the full text as input (we used the classifier from the NLTK package).
Comparison experiment: 80% of the CS Collection as training data, 20% as the test set.
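For reference, a minimal sketch of training the NLTK MaxEnt baseline mentioned above; the feature-set format is NLTK's standard list of (feature dict, label) pairs, and the iteration cap is an arbitrary choice of ours:

```python
from nltk.classify import MaxentClassifier

def train_maxent(train_set):
    """train_set: list of (feature_dict, label) pairs."""
    return MaxentClassifier.train(train_set, max_iter=10)
```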
24. Lessons Learned
• Classic NER approaches are not good enough for idiosyncratic Web collections
• Leveraging the graph of scientific concepts is a key feature
• Domain-specific KBs and POS patterns work well
• Experimental results show up to 85% accuracy over different scientific collections

http://iner.exascale.info/
eXascale Infolab, http://exascale.info