
Effective Named Entity Recognition for Idiosyncratic Web Collections

Presentation at WWW 2014



  1. Effective Named Entity Recognition for Idiosyncratic Web Collections
     Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux
     eXascale Infolab, University of Fribourg, Switzerland
     WWW 2014, April 10, 2014
  2. Outline
     • Introduction
     • Problem definition
     • Existing approaches and applicability
     • Overview
     • Candidate named entities selection
     • Dataset description
     • Features description
     • Experimental setup & evaluation
  3. Problem Definition
     • search engine
     • web search engine
     • navigational query
     • user intent
     • information need
     • web content
     • …
     Entity type: scientific concept
  4. Traditional NER
     Types:
     • Maximum Entropy (Mallet, NLTK)
     • Conditional Random Fields (Stanford NER, Mallet)
     Properties:
     • Require extensive training
     • Usually domain-specific: different collections require training on their own domain
     • Very good at detecting types such as Location, Person, and Organization
  5. Proposed Approach
     We define our problem as a classification task with two steps:
     • Extract candidate named entities using a frequency filtration algorithm.
     • Classify the candidate named entities using a supervised classifier.
     Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.
  6. Pipeline
     [Pipeline diagram] Text extraction (Apache Tika) → POS tagging, lemmatization, and n-gram indexing → candidate selection over the extracted n-grams (with n+1-gram merging and frequency reweighting) → feature extraction → supervised classifier → ranked list of n-grams.
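
As a rough illustration of how these stages chain together, here is a hypothetical skeleton; only tika and nltk are real library calls, and every other helper (select_candidates, merge_ngrams, extract_features, rank_candidates) is a placeholder for a box of the original diagram, not code from the paper:

```python
from tika import parser   # Apache Tika bindings for text extraction
import nltk

def run_pipeline(pdf_path):
    text = parser.from_file(pdf_path)['content']   # text extraction
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)                  # POS tagging
    candidates = select_candidates(tokens)         # frequency filtration (slide 7)
    candidates = merge_ngrams(candidates)          # n+1-gram reweighting (slide 8)
    features = [extract_features(c, tagged) for c in candidates]
    return rank_candidates(features)               # supervised classifier
```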
  7. Candidate Selection: Part I
     Consider all bigrams with frequency > k (here k = 2), before and after applying the NLTK stop-word filter:
     Before: candidate named: 5 · entity are: 4 · entity candidate: 3 · entity in: 18 · entity recognition: 12 · named entity: 101 · of named: 10 · that named: 3 · the named: 4
     After:  candidate named: 5 · entity candidate: 3 · entity recognition: 12 · named entity: 101
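
A minimal sketch of this filter, assuming a plain token list as input: it keeps bigrams above the frequency threshold that contain no stop word (requires a one-time nltk.download('stopwords')):

```python
from collections import Counter
from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))

def candidate_bigrams(tokens, k=2):
    """Bigrams with frequency > k that contain no stop word."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bg: c for bg, c in counts.items()
            if c > k and not any(t.lower() in STOP for t in bg)}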
  8. Candidate Selection: Part II
     Trigram frequencies are looked up in the n-gram index, and each trigram's count is subtracted from the counts of its constituent bigrams:
     Bigrams before: candidate named: 5 · entity candidate: 3 · entity recognition: 12 · named entity: 101
     Trigrams: candidate named entity: 5 · named entity candidate: 3 · named entity recognition: 12
     Bigrams after: candidate named: 0 · entity candidate: 0 · entity recognition: 0 · named entity: 81
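
A hedged reconstruction of this merging step; with the counts above, "named entity" drops from 101 to 81 (101 − 5 − 3 − 12) and the fully subsumed bigrams drop to 0:

```python
def reweight_bigrams(bigram_counts, trigram_counts):
    """Subtract each trigram's frequency from its two constituent bigrams."""
    adjusted = dict(bigram_counts)
    for (w1, w2, w3), freq in trigram_counts.items():
        for bigram in ((w1, w2), (w2, w3)):
            if bigram in adjusted:
                adjusted[bigram] = max(adjusted[bigram] - freq, 0)
    return adjusted
```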
  9. Candidate Selection: Discussion
     This also makes it possible to extract n-grams (n > 2) with frequency ≤ k.
  10. After Candidate Selection
      TwiNER: Named Entity Recognition in Targeted Twitter Stream (SIGIR 2012)
  11. Classifier: Overview
      Machine learning algorithm: Decision Trees from the scikit-learn package.
      Feature types:
      • POS tags and their derivatives
      • External knowledge bases (DBLP, DBPedia)
      • DBPedia relation graphs
      • Syntactic features
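
A toy illustration of this classification step with scikit-learn; the four feature columns are invented placeholders, not the paper's exact feature matrix:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# columns: [starts_with_NN, found_in_DBLP, DBPedia_component_size, length]
X = np.array([[1, 1, 3, 2],    # e.g. "named entity"      -> valid entity
              [0, 0, 0, 2],    # e.g. "entity are"        -> invalid
              [1, 1, 5, 3],    # e.g. "web search engine" -> valid entity
              [0, 0, 1, 2]])   # e.g. "that named"        -> invalid
y = np.array([1, 0, 1, 0])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict_proba([[1, 0, 2, 2]]))  # score a new candidate n-gram
```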
  12. Datasets
      Two collections:
      • CS collection (SIGIR 2012 Research Track): 100 papers
      • Physics collection: 100 papers randomly selected from the arXiv.org High Energy Physics category

                             CS Collection   Physics Collection
      # Candidate n-grams    21,531          18,129
      # Judged n-grams       15,057          11,421
      # Valid entities        8,145           5,747
      # Invalid n-grams       6,912           5,674

      Available at: github.com/XI-lab/scientific_NER_dataset
  13. Features: POS Tags, Part I
      100+ different tag patterns
  14. Features: POS Tags, Part II
      Two feature schemes:
      • Raw POS tag patterns: each full pattern (e.g. JJ NNS, JJ NN, …) is a binary feature
      • Regex POS tag patterns:
        – first tag match, e.g. JJ* fires for any pattern starting with JJ
        – last tag match, e.g. *VB fires for patterns ending in VB (NN VB, NN NN VB, JJ NN VB, …)
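
A sketch of the regex-pattern scheme: instead of one binary feature per full tag pattern, only the first and last tags are matched. The exact pattern set is an assumption on my part; the feature names mirror the feature-importance slide later (NN STARTS, VB ENDS, …):

```python
def regex_pos_features(pos_tags):
    """pos_tags: e.g. ['JJ', 'NN'] for an adjective-noun bigram."""
    return {
        'JJ_STARTS': pos_tags[0] == 'JJ',
        'NN_STARTS': pos_tags[0].startswith('NN'),
        'NN_ENDS':   pos_tags[-1].startswith('NN'),
        'VB_ENDS':   pos_tags[-1].startswith('VB'),
    }
```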
  15. Features: External Knowledge Bases
      Domain-specific knowledge bases:
      • DBLP (computer science): contains author-assigned keywords for the papers
      • ScienceWISE: high-quality scientific concepts, mostly for the Physics domain (http://sciencewise.info)
      We perform exact string matching against these KBs.
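
A minimal sketch of the KB features, with case normalization added here for illustration; loading the DBLP keywords and the ScienceWISE concept list into label sets is assumed to happen elsewhere:

```python
def kb_features(ngram, dblp_labels, sciencewise_labels):
    """One binary feature per knowledge base, by exact label match."""
    key = ngram.lower()
    return {'DBLP': key in dblp_labels,
            'ScienceWISE': key in sciencewise_labels}
```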
  16. Features: DBPedia, Part I
      DBPedia pages essentially represent valid entities, but problems arise when:
      • an n-gram is not an entity
      • an n-gram is not a scientific concept (e.g. "Tom Cruise" in an IR paper)

                                CS Collection          Physics Collection
                                Precision  Recall      Precision  Recall
      Exact string matching     0.9045     0.2394      0.7063     0.0155
      Matching with redirects   0.8457     0.4229      0.7768     0.5843
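
A sketch of the two matching modes in the table, assuming a set of page titles and a redirect table (alias → canonical title) loaded from a DBPedia dump; following redirects is what lifts recall, especially on the Physics collection:

```python
def dbpedia_match(ngram, titles, redirects, use_redirects=True):
    """True if the n-gram matches a DBPedia title, directly or via redirect."""
    label = ngram.lower()
    if label in titles:
        return True
    return use_redirects and redirects.get(label) in titles
```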
  17. Features: DBPedia, Part II
      [Figure: two histograms of the number of DBPedia relation-graph components by component size, without redirects and with redirects]
  18. Features: Syntactic
      Set of common syntactic features:
      • n-gram length in words
      • whether the n-gram is uppercased
      • the number of other n-grams that the given n-gram is part of
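
A sketch of these three features; `all_candidates` stands for the full set of candidate n-gram strings and is used for the participation count:

```python
def syntactic_features(ngram, all_candidates):
    return {
        'length': len(ngram.split()),
        'uppercased': ngram.isupper(),
        # number of other candidate n-grams this n-gram is part of
        'participation_count': sum(1 for other in all_candidates
                                   if other != ngram and ngram in other),
    }
```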
  19. Experiments: Overview
      1. Regex POS patterns vs. normal POS tags
      2. Redirects vs. non-redirects
      3. Feature importance scores
      4. Comparison with Maximum Entropy
      All results are averages over 10-fold cross-validation.
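
The averaged 10-fold scores can be reproduced along these lines with scikit-learn (a sketch; X and y stand for the full feature matrix and entity labels, not the toy example above):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# mean F1 over 10 folds; X, y are the candidate features and labels
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=10, scoring='f1')
print(scores.mean())
```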
  20. Experiments: Comparison I

      CS Collection                       Precision  Recall   F1 score  Accuracy  # features
      Normal POS + Components             0.8794     0.8058*  0.8409*   0.8429*   54
      Regex POS + Components              0.8475*    0.8524*  0.8499*   0.8448*    9
      Normal POS + Components-Redirects   0.8678*    0.8305*  0.8487*   0.8473    50
      Regex POS + Components-Redirects    0.8406*    0.8769   0.8584    0.8509     7

      The symbol * indicates a statistically significant difference compared to the approach shown in bold on the original slide.
  21. Experiments: Comparison II

      Physics Collection                  Precision  Recall   F1 score  Accuracy  # features
      Normal POS + Components             0.8253*    0.6567*  0.7311*   0.7567    53
      Regex POS + Components              0.7941*    0.6781   0.7315*   0.7492*    4
      Normal POS + Components-Redirects   0.8339     0.6674*  0.7412    0.7653    50
      Regex POS + Components-Redirects    0.8375     0.6479*  0.7305*   0.7592*    6

      The symbol * indicates a statistically significant difference compared to the approach shown in bold on the original slide.
  22. Experiments: Feature Importance

      CS Collection (7 features)          Physics Collection (6 features)
      Feature             Importance      Feature                   Importance
      NN STARTS           0.3091          ScienceWISE               0.2870
      DBLP                0.1442          Component + ScienceWISE   0.1948
      Components + DBLP   0.1125          Wikipedia redirect        0.1104
      Components          0.0789          Components                0.1093
      VB ENDS             0.0386          Wikilinks                 0.0439
      NN ENDS             0.0380          Participation count       0.0370
      JJ STARTS           0.0364
  23. Experiments: MaxEntropy
      Comparison setup: 80% of the CS collection as training data, 20% as test data. The MaxEnt classifier (from the NLTK package) receives the full text as input.

                         Precision  Recall   F1 score
      Maximum Entropy    0.6566     0.7196   0.6867
      Decision Trees     0.8121     0.8742   0.8420
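
A sketch of the NLTK MaxEnt baseline in its (feature_dict, label) format; these featuresets are toy stand-ins for the features extracted from the full text in the 80/20 split above:

```python
import nltk

train = [({'NN_STARTS': True,  'DBLP': True},  'entity'),
         ({'NN_STARTS': False, 'DBLP': False}, 'not_entity')]
test  = [({'NN_STARTS': True,  'DBLP': False}, 'entity')]

maxent = nltk.MaxentClassifier.train(train, max_iter=10)
print(nltk.classify.accuracy(maxent, test))
```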
  24. Lessons Learned
      • Classic NER approaches are not good enough for idiosyncratic web collections.
      • Leveraging the graph of scientific concepts is a key feature.
      • Domain-specific KBs and POS patterns work well.
      • Experimental results show up to 85% accuracy over different scientific collections.
      http://iner.exascale.info/
      eXascale Infolab, http://exascale.info
