Effective Named Entity Recognition for Idiosyncratic Web Collections
1.
Effective Named Entity Recognition for
Idiosyncratic Web Collections
Roman Prokofyev, Gianluca Demartini, Philippe Cudre-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
WWW 2014
April 10, 2014
2.
Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate Named Entity Selection
• Dataset description
• Feature description
• Experimental setup & Evaluation
3.
Problem Definition
Given a document collection, identify the n-grams that represent valid domain entities, e.g.:
• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …
Entity type: scientific concept
4.
Traditional NER
Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)
Properties:
• Require extensive training
• Usually domain-specific: each new collection requires training data from its own domain
• Very good at detecting types such as Location, Person, and Organization
5.
Proposed Approach
We cast the problem as a classification task.
Two-step classification:
• Extract candidate named entities using a frequency-filtering algorithm.
• Classify the candidates with a supervised classifier.
Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.
6.
Pipeline
[Pipeline diagram: text extraction (Apache Tika) → n-gram indexing (with lemmatization, POS tagging, n+1-gram merging, and frequency reweighting) → list of extracted n-grams → candidate selection, applied to each extracted n-gram → list of selected n-grams → feature extraction → supervised classifier → ranked list of n-grams]
7.
Candidate Selection: Part I
Consider all bigrams with frequency > k (k=2):
candidate named: 5
entity are: 4
entity candidate: 3
entity in: 18
entity recognition: 12
named entity: 101
of named: 10
that named: 3
the named: 4

After the NLTK stop word filter:
candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101
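A minimal sketch of this filtering step, assuming the input text is already tokenized; the threshold k and the NLTK English stop word list follow the slide, the rest is illustrative:

```python
from collections import Counter
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP = set(stopwords.words('english'))

def candidate_bigrams(tokens, k=2):
    """Count all bigrams and keep those with frequency > k that
    survive the NLTK stop word filter."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bg: n for bg, n in counts.items()
            if n > k and STOP.isdisjoint(bg)}
```

Applied to the toy counts above, this keeps exactly the four bigrams in the filtered list.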
8.
Candidate Selection: Part II
Trigram frequencies are looked up from the n-gram index.

Selected bigrams:
candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101

Merged trigrams (frequencies from the index):
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12

After frequency reweighting (each trigram's frequency is subtracted from both of its constituent bigrams):
named entity: 101 - 5 - 3 - 12 = 81
candidate named: 0
entity candidate: 0
entity recognition: 0
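A sketch of the merging and reweighting step, under the assumption that the n-gram index can be queried like a plain dict mapping trigram tuples to frequencies:

```python
def merge_and_reweight(bigrams, trigram_index):
    """Merge each pair of selected bigrams (w1, w2) and (w2, w3) into
    the trigram (w1, w2, w3), look up its frequency in the index, and
    subtract that frequency from both constituent bigrams."""
    bigrams = dict(bigrams)
    trigrams = {}
    for (a1, a2) in list(bigrams):
        for (b1, b2) in list(bigrams):
            if a2 != b1:
                continue
            freq = trigram_index.get((a1, a2, b2), 0)
            if freq > 0:
                trigrams[(a1, a2, b2)] = freq
                bigrams[(a1, a2)] -= freq
                bigrams[(b1, b2)] -= freq
    return trigrams, bigrams
```

Run on the slide's counts, this yields the three trigrams, drives their fully covered bigrams to 0, and leaves "named entity" with 101 - 5 - 3 - 12 = 81.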
9.
Candidate Selection: Discussion
The merging step also makes it possible to extract n-grams (n > 2) whose own frequency is ≤ k, as long as their constituent bigrams pass the filter.
10.
After Candidate Selection
"TwiNER: named entity recognition in targeted twitter stream", SIGIR 2012
11.
Classifier: Overview
Machine learning algorithm:
Decision Trees from the scikit-learn package (a toy sketch follows below).
Feature types:
• POS tags and their derivatives
• External knowledge bases (DBLP, DBPedia)
• DBPedia relation graphs
• Syntactic features
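A toy sketch of the training call with scikit-learn; the feature encoding below is purely illustrative, not the authors' exact one (the real feature vectors are built from the types listed above):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding per candidate n-gram: [first tag is JJ, last tag
# is NN, in DBLP, in DBPedia, component size, length in words].
X_train = [[1, 1, 1, 1, 14, 2],
           [0, 0, 0, 0,  0, 3]]
y_train = [1, 0]  # 1 = valid named entity

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[1, 1, 1, 1, 10, 2]]))  # -> [1]
```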
13.
Features: POS Tags, part I
The candidate n-grams exhibit 100+ different POS tag patterns.
14.
Features: POS Tags, part II
Two feature schemes:
• Raw POS tag patterns: each distinct pattern is a binary feature
• Regex POS tag patterns (a sketch follows below):
  • First-tag match, e.g. JJ NNS, JJ NN NN, JJ NN, … → JJ*
  • Last-tag match, e.g. NN VB, NN NN VB, JJ NN VB, … → *VB
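A sketch of how the two schemes can be encoded; treating the first-tag and last-tag matches as two separate binary features is our reading of the slide, not a confirmed detail:

```python
def pos_pattern_features(tags):
    """tags: POS tag sequence of a candidate n-gram, e.g. ['JJ', 'NN'].
    Returns both schemes: the raw pattern as a single binary feature,
    and the regex-style first-tag / last-tag matches."""
    raw = {'raw=' + ' '.join(tags): 1}
    regex = {'first=' + tags[0] + '*': 1, 'last=*' + tags[-1]: 1}
    return raw, regex

print(pos_pattern_features(['JJ', 'NN', 'NN']))
# ({'raw=JJ NN NN': 1}, {'first=JJ*': 1, 'last=*NN': 1})
```

The regex scheme collapses the 100+ raw patterns into a handful of features, which matches the much smaller feature counts in the experiment tables later on.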
15.
Features: External Knowledge Bases
Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned paper keywords
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info
We perform exact string matching against these KBs (a sketch follows below).
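A minimal sketch of these KB features, assuming the DBLP keyword list and the ScienceWISE concept list have already been loaded into sets (the entries below are toy values; lowercasing is an assumption):

```python
dblp_keywords = {'named entity recognition', 'information retrieval'}
sciencewise_concepts = {'dark matter', 'gauge theory'}

def kb_features(ngram):
    """Binary features from exact string matching against each KB."""
    s = ngram.lower()
    return {'in_dblp': int(s in dblp_keywords),
            'in_sciencewise': int(s in sciencewise_concepts)}

print(kb_features('Named Entity Recognition'))
# {'in_dblp': 1, 'in_sciencewise': 0}
```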
16.
Features: DBPedia, part I
DBPedia pages essentially represent valid entities.
But problems arise when:
• the n-gram is not an entity
• the n-gram is not a scientific concept (e.g., "Tom Cruise" in an IR paper)
                         CS Collection         Physics Collection
                         Precision   Recall    Precision   Recall
Exact string matching    0.9045      0.2394    0.7063      0.0155
Matching with redirects  0.8457      0.4229    0.7768      0.5843
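The two matching modes in the table can be sketched as follows, assuming the DBPedia page titles and the redirect map are preloaded from the public dumps (toy entries below):

```python
dbpedia_pages = {'Named-entity recognition', 'Web search engine'}
redirects = {'Named entity recognition': 'Named-entity recognition'}

def dbpedia_match(ngram, use_redirects=True):
    """Exact match against page titles, optionally following one
    redirect hop."""
    if ngram in dbpedia_pages:
        return True
    return use_redirects and redirects.get(ngram) in dbpedia_pages

print(dbpedia_match('Named entity recognition'))         # True
print(dbpedia_match('Named entity recognition', False))  # False
```

As the table shows, following redirects trades some precision for a large gain in recall, especially on the Physics collection.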
17.
Features: DBPedia, part II
[Figure: distributions of connected component sizes in the DBPedia relation graph, without redirects (left) and with redirects (right). Axes: component size (x) vs. number of components (y, log scale).]
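The component-based feature can be sketched with networkx: build a graph over the candidates' matched DBPedia pages and use the size of each candidate's connected component. Deriving the edges from DBPedia links between matched pages is our reading of the slides:

```python
import networkx as nx

# Toy relation graph; in the real pipeline, edges would come from
# DBPedia links between the pages matched by the collection's candidates.
g = nx.Graph([('search engine', 'web search engine'),
              ('web search engine', 'information retrieval')])

def component_size(entity):
    """Size of the connected component containing the entity's
    DBPedia page; 0 for unmatched candidates."""
    return len(nx.node_connected_component(g, entity)) if entity in g else 0

print(component_size('search engine'))  # 3
```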
18.
Features: Syntactic
Set of common syntactic features (a sketch follows below):
• n-gram length in words
• whether the n-gram is uppercased
• the number of other n-grams the given n-gram is a part of
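A sketch of the three syntactic features; "uppercased" (taken here as all-caps) and substring containment are our interpretations of the slide's wording:

```python
def syntactic_features(ngram, all_ngrams):
    return {
        'length': len(ngram.split()),        # n-gram length in words
        'uppercased': int(ngram.isupper()),  # e.g. acronyms like 'NER'
        'contained_in': sum(1 for other in all_ngrams
                            if other != ngram and ngram in other),
    }

print(syntactic_features('named entity',
                         ['named entity', 'named entity recognition']))
# {'length': 2, 'uppercased': 0, 'contained_in': 1}
```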
19.
Experiments: Overview
1. Regex POS Patterns vs Normal POS tags
2. Redirects vs Non-redirects
3. Feature importance scores
4. Comparison with a Maximum Entropy classifier
All results are averaged over 10-fold cross-validation.
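This protocol maps directly onto scikit-learn's cross-validation helper; X and y below are toy stand-ins for the encoded candidate features and gold labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 0], [1, 1], [0, 0]] * 5  # toy feature rows
y = [1, 0, 1, 0] * 5                      # toy gold labels

scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=10)
print(scores.mean())
```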
20.
Experiments: Comparison I
CS Collection                       Precision   Recall    F1 score   Accuracy   N# features
Normal POS + Components             0.8794      0.8058*   0.8409*    0.8429*    54
Regex POS + Components              0.8475*     0.8524*   0.8499*    0.8448*    9
Normal POS + Components-Redirects   0.8678*     0.8305*   0.8487*    0.8473     50
Regex POS + Components-Redirects    0.8406*     0.8769    0.8584     0.8509     7
The symbol * indicates a statistically significant difference compared to the best value in each column (shown in bold on the original slide).
21.
Experiments: Comparison II
Physics Collection                  Precision   Recall    F1 score   Accuracy   N# features
Normal POS + Components             0.8253*     0.6567*   0.7311*    0.7567     53
Regex POS + Components              0.7941*     0.6781    0.7315*    0.7492*    4
Normal POS + Components-Redirects   0.8339      0.6674*   0.7412     0.7653     50
Regex POS + Components-Redirects    0.8375      0.6479*   0.7305*    0.7592*    6
The symbol * indicates a statistically significant difference compared to the best value in each column (shown in bold on the original slide).
23.
Experiments: MaxEntropy
                  Precision   Recall    F1 score
Maximum Entropy   0.6566      0.7196    0.6867
Decision Trees    0.8121      0.8742    0.8420
The MaxEnt classifier receives the full text as input (we used the classifier from the NLTK package; a toy invocation follows below).
Comparison setup: 80% of the CS collection as training data, 20% as test data.
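For reference, a toy invocation of NLTK's MaxEnt classifier; the (featureset, label) encoding below is illustrative and does not reproduce the full-text input used in the actual comparison:

```python
from nltk.classify import MaxentClassifier

train = [({'pos=JJ NN': True}, 'entity'),
         ({'pos=DT NN': True}, 'other')] * 5

clf = MaxentClassifier.train(train, max_iter=10)  # IIS by default
print(clf.classify({'pos=JJ NN': True}))          # -> 'entity'
```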
24.
Lessons Learned
Classic NER approaches are not good enough for
Idiosyncratic Web Collections
Leveraging the graph of scientific concepts is a key feature
Domain specific KBs and POS patterns work well
Experimental results show up to 85% accuracy over
different scientific collections
http://iner.exascale.info/
eXascale Infolab, http://exascale.info