Presentation at NLDB 2012

Two-stage Named Entity Recognition using averaged perceptrons
Lars Buitinck, Maarten Marx
Information and Language Processing Systems, Informatics Institute, University of Amsterdam
17th Int’l Conf. on Applications of NLP to Information Systems
Buitinck, Marx Two-stage NER
Outline
Named Entity Recognition
- Find names in text and classify them as belonging to persons, locations, organizations, events, products, or “miscellaneous”
- Use machine learning
Named Entity Recognition for Dutch
- State-of-the-art algorithm for Dutch by Desmet and Hoste (2011): voting classifiers with a genetic algorithm (GA) to train the voting weights
- Good training sets are only just becoming available
- Many practitioners retrain the Stanford CRF-NER tagger
Overview
- Realize that NER is two problems in one: recognition and classification
- Pipeline solution with two classifiers
- Use custom feature sets for each
- Do not use a precompiled list of names (“gazetteer”)
- Work at the sentence level (because of how training sets are set up)
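The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper’s code: the `recognizer` and `classifier` callables and their interfaces are hypothetical stand-ins for the two trained classifiers.

```python
# Sketch of the two-stage NER pipeline: stage 1 tags tokens B/I/O,
# stage 2 assigns an entity type to each recognized span.
# The recognizer/classifier interfaces are invented for illustration.

def bio_to_spans(tags):
    """Group token-level B/I/O tags into (start, end) entity spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                # a new entity begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":              # outside any entity
            if start is not None:
                spans.append((start, i))
            start = None
        # tag == "I": the current entity continues
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def two_stage_ner(tokens, recognizer, classifier):
    """Stage 1: B/I/O tag per token. Stage 2: label each span."""
    tags = [recognizer(tokens, i) for i in range(len(tokens))]
    return [(s, e, classifier(tokens, s, e)) for s, e in bio_to_spans(tags)]
```

The point of the split is visible in the signatures: the recognizer sees one token position at a time, while the classifier sees a whole span at once.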
Recognition stage
- Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
- Features:
  - Word window wi−2, . . . , wi+2
  - POS tags for words in window
  - Conjunctions of words and POS tags in the window, e.g. (wi−1, pi−1)
  - Capitalization of tokens in window
  - (Character) prefixes and suffixes of wi and wi−1
  - Regular expressions for digits, Roman numerals and punctuation
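A feature extractor along these lines might look like the sketch below. The feature names, padding token, and prefix/suffix lengths are assumptions for illustration; only the kinds of features (window words, POS tags, word/POS conjunctions, capitalization, affixes, regex shape cues) come from the slide.

```python
# Illustrative token-level feature extraction for the recognition stage.
# `words` and `pos` are parallel lists for one sentence; `i` is the token index.
import re

def recognition_features(words, pos, i):
    feats = {}
    for d in range(-2, 3):                          # window w_{i-2} .. w_{i+2}
        j = i + d
        w = words[j] if 0 <= j < len(words) else "<PAD>"
        p = pos[j] if 0 <= j < len(pos) else "<PAD>"
        feats[f"w[{d}]"] = w                        # word in window
        feats[f"p[{d}]"] = p                        # POS tag in window
        feats[f"w+p[{d}]"] = f"{w}/{p}"             # word/POS conjunction
        feats[f"cap[{d}]"] = w[:1].isupper()        # capitalization
    prev = words[i - 1] if i > 0 else "<PAD>"
    for w, name in [(words[i], "cur"), (prev, "prev")]:
        for n in (1, 2, 3):                         # character affixes
            feats[f"{name}_pre{n}"] = w[:n]
            feats[f"{name}_suf{n}"] = w[-n:]
    w = words[i]
    feats["digits"] = bool(re.search(r"\d", w))     # regex shape cues
    feats["roman"] = bool(re.fullmatch(r"[IVXLCDM]+", w))
    feats["punct"] = bool(re.fullmatch(r"\W+", w))
    return feats
```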
Classification stage
- Don’t do this at token level; we know the entity spans!
- Input is the list of tokens considered an entity by the recognition stage
- Features:
  - The tokens we got from recognition
  - The four surrounding tokens
  - Their prefixes and suffixes up to length four
  - Capitalization pattern, as a string over the alphabet (L|U|O)∗
  - The occurrence of capitalized tokens, digits and dashes in the entire sentence
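Two of these entity-level features can be sketched concretely; the function names and the exact U/L/O convention are our reading of the slide, not the paper’s code.

```python
# Illustrative entity-level features for the classification stage:
# the (L|U|O)* capitalization pattern and the surrounding-token window.

def cap_pattern(entity_tokens):
    """One letter per token: U = upper-initial, L = lower-initial, O = other."""
    out = []
    for t in entity_tokens:
        if t[:1].isupper():
            out.append("U")
        elif t[:1].islower():
            out.append("L")
        else:                       # digits, punctuation, empty string
            out.append("O")
    return "".join(out)

def surrounding_tokens(words, start, end, k=2):
    """The k tokens on each side of the span (four total for k=2)."""
    left = words[max(0, start - k):start]
    right = words[end:end + k]
    return left, right
```

A pattern like "UUU" suggests an organization name ("Royal Dutch Shell") while "LU" often signals a Dutch surname particle ("van Gogh"), which is why span-level capitalization is more informative than per-token capitalization.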
Learning algorithm
- Use an averaged perceptron for both stages
- Learns an approximation of the max-margin solution (linear SVM)
- 40 iterations
- Used the LBJ machine learning toolkit
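The averaged perceptron itself is simple enough to write down. The paper used the LBJ toolkit; this is a from-scratch binary sketch for illustration, with features as `name -> value` dicts and labels in {−1, +1}.

```python
# Minimal averaged perceptron for binary classification. Training is the
# usual mistake-driven perceptron; prediction uses the average of the
# weight vectors over all updates, which approximates a large-margin solution.
from collections import defaultdict

class AveragedPerceptron:
    def __init__(self):
        self.w = defaultdict(float)       # current weight vector
        self.totals = defaultdict(float)  # running sum of weight vectors
        self.t = 0                        # number of examples seen

    def score(self, feats, weights=None):
        weights = self.w if weights is None else weights
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())

    def fit(self, data, iterations=40):   # 40 iterations, as on the slide
        for _ in range(iterations):
            for feats, y in data:         # y in {-1, +1}
                self.t += 1
                if y * self.score(feats) <= 0:        # mistake: update
                    for f, v in feats.items():
                        self.w[f] += y * v
                for f in self.w:                       # accumulate for average
                    self.totals[f] += self.w[f]

    def predict(self, feats):
        avg = {f: tot / self.t for f, tot in self.totals.items()}
        return 1 if self.score(feats, avg) > 0 else -1
```

(A production implementation would accumulate the averages lazily per feature instead of summing the whole vector each step; the eager version above keeps the idea visible.)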
Evaluation
- Aim for F1 score, as defined in the CoNLL 2002 shared task on NER
- Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy of Desmet and Hoste)
- Compare against Stanford and Desmet and Hoste’s algorithm
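The CoNLL-style score is entity-level, not token-level: a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. A small sketch, with entities as sets of (start, end, type) triples:

```python
# Entity-level F1 in the CoNLL 2002 style: exact span + type match.

def conll_f1(gold, predicted):
    tp = len(gold & predicted)                        # exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This strictness matters for the two-stage design: a correct type on a slightly wrong span earns no credit, so recognition errors cap what classification can achieve.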
Results on CoNLL 2002
- 309,686 tokens containing 19,901 names, four categories
- 65% training, 22% validation and 12% test sets
- Stanford achieves F1 = 74.72; the “miscellaneous” category is hard (< 0.7)
- We achieve F1 = 75.14; the “organization” category is hard
Results on SoNaR
- New, large corpus with manual annotations
- Used a 200k-token subset of a preliminary version, three-fold cross-validation
- State of the art is Desmet and Hoste (2011) with F1 = 84.44
- Best individual classifier from that paper (a CRF) gets 83.77
- Our system: 83.56
- Here, the “product” and “miscellaneous” categories are hard
Conclusion
- Near-state-of-the-art performance from simple learners with good feature sets
- No gazetteers, so the system should be fairly reusable
- (Side conclusion: SoNaR is more easily learnable than CoNLL)
Future work
- Being integrated into UvA’s xTAS text analysis pipeline
- Will be used to find entities in the Dutch Hansard corpus (forthcoming) and to link entities to Wikipedia
- The full SoNaR corpus is now available; a new evaluation is needed
