Two-stage Named Entity Recognition using
          averaged perceptrons

            Lars Buitinck           Maarten Marx

          Information and Language Processing Systems
                       Informatics Institute
                     University of Amsterdam


 17th Int’l Conf. on Applications of NLP to Information
                        Systems




                   Buitinck, Marx   Two-stage NER
Named Entity Recognition




     Find names in text and classify them as belonging to
     persons, locations, organizations, events, products or
     “miscellaneous”
     Use machine learning




Named Entity Recognition for Dutch




     State-of-the-art algorithm for Dutch by Desmet and Hoste
     (2011): voting classifiers, with a genetic algorithm (GA) to train the weights
     Good training sets are just becoming available
     Many practitioners retrain Stanford CRF-NER tagger




Overview




     Realize that NER is two problems in one: recognition and
     classification
     Pipeline solution with two classifiers
     Use custom feature sets for each
     Do not use a precompiled list of names (“gazetteer”)
     Work at the sentence level (because of how training sets
     are set up)
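
     A minimal sketch of how the two stages could be chained, assuming the
     recognition stage emits one B/I/O tag per token. The names bio_to_spans
     and two_stage_ner, and the recognize/classify callables, are hypothetical
     and not taken from the authors' system.

```python
def bio_to_spans(tags):
    """Convert a list of "B"/"I"/"O" tags into (start, end) spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new name begins; close any name that was still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Assumption: a stray "I" without a preceding "B" opens a span.
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def two_stage_ner(tokens, recognize, classify):
    """Stage 1 tags tokens; stage 2 assigns one category per recognized span."""
    tags = recognize(tokens)
    return [(s, e, classify(tokens, s, e)) for s, e in bio_to_spans(tags)]
```

     The second stage sees whole spans rather than single tokens, which is what
     lets each stage use its own feature set.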




Recognition stage



     Token-level task: is a token the Beginning of, Inside, or
     Outside any entity name?
     Features:
         Word window w_{i-2}, ..., w_{i+2}
         POS tags for words in window
         Conjunction of words and POS tags in window, e.g.
         (w_{i-1}, p_{i-1})
         Capitalization of tokens in window
         (Character) prefixes and suffixes of w_i and w_{i-1}
         REs for digits, Roman numerals and punctuation
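
     The feature list above could be extracted roughly as follows. This is a
     sketch under assumptions: the feature names, the padding symbol, the
     maximum affix length (max_affix) and the exact regular expressions are
     illustrative choices, since the slides do not specify them.

```python
import re

# Assumed patterns; the slides only say that regular expressions are used.
DIGITS = re.compile(r"\d+")
ROMAN = re.compile(r"[IVXLCDM]+")  # loose: also matches invalid numerals
PUNCT = re.compile(r"\W+")


def recognition_features(words, pos, i, max_affix=3):
    """Return a set of string features for token i of one sentence."""
    feats = set()
    n = len(words)
    for d in range(-2, 3):  # window of two tokens on either side
        j = i + d
        if 0 <= j < n:
            w, p = words[j], pos[j]
            feats.add(f"w[{d}]={w}")         # word
            feats.add(f"p[{d}]={p}")         # POS tag
            feats.add(f"wp[{d}]={w}|{p}")    # word/POS conjunction
            if w[:1].isupper():
                feats.add(f"cap[{d}]")       # capitalization
        else:
            feats.add(f"w[{d}]=<pad>")       # position outside the sentence
    for d in (0, -1):  # character affixes of the current and previous word
        j = i + d
        if 0 <= j < n:
            w = words[j]
            for k in range(1, min(max_affix, len(w)) + 1):
                feats.add(f"pre[{d}]={w[:k]}")
                feats.add(f"suf[{d}]={w[-k:]}")
    w = words[i]
    if DIGITS.fullmatch(w):
        feats.add("is_digits")
    if ROMAN.fullmatch(w):
        feats.add("is_roman")
    if PUNCT.fullmatch(w):
        feats.add("is_punct")
    return feats
```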




Classification stage



     Don’t do this at token-level; we know the entity spans!
     Input is a list of tokens considered an entity by the
     recognition stage
     Features:
         The tokens we got from recognition
         The four surrounding tokens
         Their pre- and suffixes up to length four
         Capitalization pattern, as a string on the alphabet (L|U|O)∗
         The occurrence of capitalized tokens, digits and dashes in
         the entire sentence
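
     A sketch of span-level features in the spirit of the list above. Several
     details are assumptions: the capitalization pattern uses one symbol per
     token (U = upper-case initial, L = lower-case initial, O = other), "the
     four surrounding tokens" is read as two on each side, and affixes are
     taken from both entity and context tokens. Function and feature names are
     hypothetical.

```python
def cap_pattern(tokens):
    """Map each token to U, L or O according to its first character."""
    out = []
    for t in tokens:
        c = t[:1]
        if c.isupper():
            out.append("U")
        elif c.islower():
            out.append("L")
        else:
            out.append("O")
    return "".join(out)


def classification_features(words, start, end):
    """Features for the entity words[start:end] within one sentence."""
    entity = words[start:end]
    # Assumed reading: two tokens before and two after the entity.
    context = words[max(0, start - 2):start] + words[end:end + 2]
    feats = {f"tok={t}" for t in entity}
    feats |= {f"ctx={t}" for t in context}
    for t in entity + context:
        for k in range(1, min(4, len(t)) + 1):  # affixes up to length four
            feats.add(f"pre={t[:k]}")
            feats.add(f"suf={t[-k:]}")
    feats.add("caps=" + cap_pattern(entity))
    # Sentence-level indicators
    if any(t[:1].isupper() for t in words):
        feats.add("sent_has_cap")
    if any(c.isdigit() for t in words for c in t):
        feats.add("sent_has_digit")
    if any("-" in t for t in words):
        feats.add("sent_has_dash")
    return feats
```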




Learning algorithm




     Use averaged perceptron for both stages
     Learns an approximation of max-margin solution (linear
     SVM)
     40 iterations
     Used the LBJ machine learning toolkit
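
     For illustration, a minimal multiclass averaged perceptron over sparse
     string features. This is a generic textbook-style sketch, not the LBJ
     implementation; the class name and its methods are hypothetical, and only
     the default of 40 iterations comes from the slide.

```python
from collections import defaultdict


class AveragedPerceptron:
    def __init__(self, labels):
        self.labels = list(labels)
        self.w = defaultdict(float)       # (label, feature) -> current weight
        self.totals = defaultdict(float)  # running sum of each weight over steps
        self.stamps = defaultdict(int)    # step at which each weight last changed
        self.step = 0

    def score(self, feats, label):
        return sum(self.w.get((label, f), 0.0) for f in feats)

    def predict(self, feats):
        # Ties are broken in favour of the label listed first.
        return max(self.labels, key=lambda y: self.score(feats, y))

    def _update(self, key, delta):
        # Lazily account for the steps during which this weight was unchanged.
        self.totals[key] += (self.step - self.stamps[key]) * self.w[key]
        self.stamps[key] = self.step
        self.w[key] += delta

    def fit(self, data, iterations=40):
        """data: list of (feature_set, gold_label) pairs. Call once per model."""
        for _ in range(iterations):
            for feats, gold in data:
                self.step += 1
                guess = self.predict(feats)
                if guess != gold:
                    for f in feats:
                        self._update((gold, f), 1.0)
                        self._update((guess, f), -1.0)
        # Replace each weight by its average over the weight vectors
        # that were used for prediction at each training step.
        for key in list(self.w):
            self.totals[key] += (self.step - self.stamps[key]) * self.w[key]
            self.stamps[key] = self.step
            self.w[key] = self.totals[key] / self.step
        return self
```

     Averaging damps the influence of late updates, which is why the result
     tends to behave more like a large-margin classifier than the plain
     perceptron does.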




Evaluation




     Aim for F1 score, as defined in the CoNLL 2002 shared
     task on NER
     Two corpora: CoNLL 2002 and a subset of SoNaR
     (courtesy Desmet and Hoste)
     Compare against Stanford and Desmet and Hoste’s
     algorithm
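
     In CoNLL 2002 scoring, a predicted entity counts as correct only if both
     its span and its category match the gold annotation exactly. A sketch of
     that F1 computation, assuming entities are represented as hypothetical
     (sentence_id, start, end, category) tuples:

```python
def f1_score(gold, predicted):
    """Entity-level F1 over exact (sentence_id, start, end, category) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

     Under this measure an entity with the right boundaries but the wrong
     category costs both precision and recall, so errors in either stage of the
     pipeline lower the score.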




Results on CoNLL 2002




     309,686 tokens containing 19,901 names, four categories
     65% training, 22% validation and 12% test sets
     Stanford achieves F1 = 74.72; "miscellaneous" category is
     hard (< 0.7)
     We achieve F1 = 75.14; "organization" category is hard




Results on SoNaR



     New, large corpus with manual annotations
     Used a 200k-token subset of a preliminary version,
     three-fold cross validation
     State of the art is Desmet and Hoste (2011) with
     F1 = 84.44
     Best individual classifier from that paper (CRF) gets 83.77
     Our system: 83.56
     Here, “product” and “miscellaneous” categories are hard




Conclusion




     Near-state-of-the-art performance from simple learners
     with good feature sets
     No gazetteers, so should be fairly reusable
     (Side conclusion: SoNaR is more easily learnable than
     CoNLL)




Future work




     Being integrated in UvA’s xTAS text analysis pipeline
     Used to find entities in Dutch Hansard corpus
     (forthcoming) and link entities to Wikipedia
     Full SoNaR is now available; new evaluation needed




Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 

Presentation at NLDB 2012

  • 1. Two-stage Named Entity Recognition using averaged perceptrons
      Lars Buitinck, Maarten Marx
      Information and Language Processing Systems, Informatics Institute, University of Amsterdam
      17th Int’l Conf. on Applications of NLP to Information Systems
      Buitinck, Marx Two-stage NER
  • 2. Outline
  • 3. Named Entity Recognition
      • Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”
      • Use machine learning
  • 5. Named Entity Recognition for Dutch
      • State-of-the-art algorithm for Dutch by Desmet and Hoste (2011): voting classifiers, with a genetic algorithm (GA) to train the weights
      • Good training sets are just becoming available
      • Many practitioners retrain the Stanford CRF-NER tagger
  • 8. Overview
      • NER is two problems in one: recognition and classification
      • Pipeline solution with two classifiers
      • Use a custom feature set for each
      • Do not use a precompiled list of names (“gazetteer”)
      • Work at the sentence level (because of how the training sets are set up)
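
The two-stage pipeline from the overview can be sketched roughly as follows. This is an illustration, not the authors' implementation: `bio_to_spans`, `two_stage_ner`, and the `recognize` / `classify` callables are hypothetical names, and treating a stray `I` tag as the start of a new entity is an assumption.

```python
def bio_to_spans(tags):
    """Turn a per-token B/I/O tag sequence into (start, end) spans, end exclusive."""
    spans = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity begins here; close the previous one first.
            # Assumption: an I with no open entity is treated like a B.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def two_stage_ner(tokens, recognize, classify):
    """Stage 1 tags tokens with B/I/O; stage 2 labels each recognized span."""
    tags = recognize(tokens)  # hypothetical stage-1 model: tokens -> B/I/O tags
    return [(s, e, classify(tokens, s, e)) for s, e in bio_to_spans(tags)]
```

Because the second stage receives whole spans, it can use features of the complete name rather than of single tokens.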
  • 13. Recognition stage
      • Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
      • Features:
          • Word window w_{i-2}, ..., w_{i+2}
          • POS tags for words in window
          • Conjunction of words and POS tags in window, e.g. (w_{i-1}, p_{i-1})
          • Capitalization of tokens in window
          • (Character) prefixes and suffixes of w_i and w_{i-1}
          • REs for digits, Roman numerals and punctuation
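
A minimal sketch of a feature extractor along the lines of the recognition-stage list above. The feature names, the `<PAD>` marker, the affix length limit `max_affix=3`, and the exact regular expressions are assumptions; the slides do not specify them.

```python
import re

# Assumed patterns; the slides only say that REs for these classes are used.
DIGIT_RE = re.compile(r"\d")
ROMAN_RE = re.compile(r"^[IVXLCDM]+$")
PUNCT_RE = re.compile(r"^\W+$")


def token_features(words, pos_tags, i, window=2, max_affix=3):
    """Return a set of string features for the token at position i."""
    feats = set()
    n = len(words)
    for d in range(-window, window + 1):
        j = i + d
        # Positions outside the sentence get a padding symbol.
        w, p = (words[j], pos_tags[j]) if 0 <= j < n else ("<PAD>", "<PAD>")
        feats.add(f"w[{d}]={w}")        # word in window
        feats.add(f"p[{d}]={p}")        # POS tag in window
        feats.add(f"wp[{d}]={w}|{p}")   # word/POS conjunction
        if w[:1].isupper():
            feats.add(f"cap[{d}]")      # capitalization in window
    for d in (0, -1):                   # affixes of w_i and w_{i-1}
        j = i + d
        if 0 <= j < n:
            w = words[j]
            for k in range(1, min(max_affix, len(w)) + 1):
                feats.add(f"pre[{d}]={w[:k]}")
                feats.add(f"suf[{d}]={w[-k:]}")
    w = words[i]
    if DIGIT_RE.search(w):
        feats.add("has_digit")
    if ROMAN_RE.match(w):
        feats.add("roman")
    if PUNCT_RE.match(w):
        feats.add("punct")
    return feats
```

Representing features as strings in a set keeps the extractor independent of the learner; any sparse linear model can consume them.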
  • 21. Classification stage
      • Don’t do this at token level; we know the entity spans!
      • Input is a list of tokens considered an entity by the recognition stage
      • Features:
          • The tokens we got from recognition
          • The four surrounding tokens
          • Their pre- and suffixes up to length four
          • Capitalization pattern, as a string on the alphabet (L|U|O)*
          • The occurrence of capitalized tokens, digits and dashes in the entire sentence
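
The span-level features listed for the classification stage might look like this in code. Several readings here are assumptions: that the capitalization pattern has one letter per token (U for an uppercase initial, L for a lowercase initial, O for anything else), that “the four surrounding tokens” means two on each side, and that affixes are taken from both the entity and the context tokens. All function and feature names are hypothetical.

```python
def cap_pattern(tokens):
    """Encode a token list as a string over {L, U, O}, one letter per token."""
    def code(tok):
        if tok[:1].isupper():
            return "U"
        if tok[:1].islower():
            return "L"
        return "O"  # digits, punctuation, empty strings
    return "".join(code(t) for t in tokens)


def span_features(words, start, end, context=2, max_affix=4):
    """Features for the candidate entity words[start:end] within one sentence."""
    entity = words[start:end]
    before = words[max(0, start - context):start]
    after = words[end:end + context]
    feats = set()
    for t in entity:
        feats.add(f"tok={t}")
    for k, t in enumerate(reversed(before), 1):  # left[1] is the nearest token
        feats.add(f"left[{k}]={t}")
    for k, t in enumerate(after, 1):
        feats.add(f"right[{k}]={t}")
    for t in entity + before + after:            # affixes up to length four
        for n in range(1, min(max_affix, len(t)) + 1):
            feats.add(f"pre={t[:n]}")
            feats.add(f"suf={t[-n:]}")
    feats.add(f"caps={cap_pattern(entity)}")
    # Sentence-wide indicators.
    if any(t[:1].isupper() for t in words):
        feats.add("sent_has_cap")
    if any(c.isdigit() for t in words for c in t):
        feats.add("sent_has_digit")
    if any("-" in t for t in words):
        feats.add("sent_has_dash")
    return feats
```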
  • 29. Learning algorithm
      • Use averaged perceptron for both stages
      • Learns an approximation of the max-margin solution (linear SVM)
      • 40 iterations
      • Used the LBJ machine learning toolkit
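
For readers unfamiliar with the learner, here is a small multiclass averaged perceptron over sparse string features. It is a generic textbook-style sketch, not the LBJ toolkit's implementation; the class name, the unit update size, and the lazy averaging bookkeeping are choices made for this illustration. Only the default of 40 iterations is taken from the slide.

```python
from collections import defaultdict


class AveragedPerceptron:
    def __init__(self, classes):
        self.classes = list(classes)
        self.w = defaultdict(float)        # weights keyed by (feature, label)
        self._totals = defaultdict(float)  # accumulated weight * steps held
        self._stamps = defaultdict(int)    # step at which a weight last changed
        self._step = 0

    def score(self, feats, label):
        return sum(self.w.get((f, label), 0.0) for f in feats)

    def predict(self, feats):
        # Ties are broken in favour of the class listed first.
        return max(self.classes, key=lambda c: self.score(feats, c))

    def _update(self, key, delta):
        # Credit the old value for every step it was in force, then change it.
        self._totals[key] += (self._step - self._stamps[key]) * self.w[key]
        self._stamps[key] = self._step
        self.w[key] += delta

    def fit(self, data, iterations=40):
        """data: list of (feature_set, gold_label) pairs."""
        for _ in range(iterations):
            for feats, gold in data:
                self._step += 1
                guess = self.predict(feats)
                if guess != gold:
                    for f in feats:
                        self._update((f, gold), 1.0)
                        self._update((f, guess), -1.0)
        # Replace each weight by its mean over the start of every training step.
        for key, weight in list(self.w.items()):
            total = self._totals[key] + (self._step - self._stamps[key]) * weight
            self.w[key] = total / self._step
        return self
```

Averaging damps the influence of late updates, which is what makes the plain perceptron behave more like a large-margin classifier in practice.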
  • 33. Evaluation
      • Aim for F1 score, as defined in the CoNLL 2002 shared task on NER
      • Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy Desmet and Hoste)
      • Compare against Stanford and Desmet and Hoste’s algorithm
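
In the CoNLL 2002 setting, a predicted entity counts as correct only if both its span and its type match a gold entity exactly, and F1 is the harmonic mean of precision and recall. A minimal sketch, assuming entities are represented as hypothetical `(sentence_id, start, end, type)` tuples:

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 with exact span-and-type matching."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Note that a name with the right boundaries but the wrong category is penalized twice: once as a false positive and once as a missed gold entity.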
  • 36. Results on CoNLL 2002
      • 309,686 tokens containing 19,901 names, four categories
      • 65% training, 22% validation and 12% test sets
      • Stanford achieves F1 = 74.72; the “miscellaneous” category is hard (< 0.7)
      • We achieve F1 = 75.14; the “organization” category is hard
  • 40. Results on SoNaR
      • New, large corpus with manual annotations
      • Used a 200k-token subset of a preliminary version, with three-fold cross-validation
      • State of the art is Desmet and Hoste (2011) with F1 = 84.44
      • Best individual classifier from that paper (CRF) gets 83.77
      • Our system: 83.56
      • Here, the “product” and “miscellaneous” categories are hard
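
Three-fold cross-validation as used for SoNaR can be sketched as below. The slides do not say how the folds were drawn; assigning sentences to folds in an interleaved fashion is an assumption, and `k_fold_splits` is a hypothetical helper.

```python
def k_fold_splits(items, k=3):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]  # interleaved assignment (assumed)
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Splitting by sentence rather than by token matches the sentence-level setup of the system and avoids cutting an entity in half across folds.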
  • 46. Conclusion
      • Near-state-of-the-art performance from simple learners with good feature sets
      • No gazetteers, so should be fairly reusable
      • (Side conclusion: SoNaR is more easily learnable than CoNLL)
  • 49. Future work
      • Being integrated in UvA’s xTAS text analysis pipeline
      • Used to find entities in Dutch Hansard corpus (forthcoming) and link entities to Wikipedia
      • Full SoNaR is now available; new evaluation needed