Build your own Named Entity Recognizer with Python
Bogdan @ http://nlpforhackers.io
What is NER?
Short for Named Entity Recognition
Probably the first step towards information extraction
Finds the spans of text that refer to real-world entities
Entity types include Person, Organization, Event, etc.
NLTK NER Chunker
# Assumes the NLTK tokenizer, POS tagger and NE chunker data are installed (see nltk.download)
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Mark and John are working at Google."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
"""
(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
"""
IOB Tagging
nltk.Tree is great for processing such information in Python, but it’s not the standard way of annotating
chunks. The topic deserves an article of its own, so we’ll only cover it briefly here.
The IOB tagging system contains tags of the form:
1. B-{CHUNK_TYPE} – for the first word of a chunk (Beginning)
2. I-{CHUNK_TYPE} – for the remaining words Inside the chunk
3. O – for words Outside any chunk
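To make the three tag types concrete, here is a minimal sketch that groups IOB-tagged words back into chunks. The function name `iob_to_chunks` and the example sentence are made up for illustration and are not part of NLTK; treating a stray I- tag as the start of a new chunk is an assumption of this sketch.

```python
def iob_to_chunks(tagged):
    """Group (word, iob_tag) pairs into (chunk_type, [words]) chunks."""
    chunks = []
    current_type, current_words = None, []
    for word, tag in tagged:
        # An I- tag continues the open chunk only if the chunk types match
        if tag.startswith('I-') and current_type == tag[2:]:
            current_words.append(word)
            continue
        # Anything else closes the open chunk, if there is one
        if current_type is not None:
            chunks.append((current_type, current_words))
            current_type, current_words = None, []
        # B- opens a new chunk (a stray I- is treated the same way); O opens nothing
        if tag != 'O':
            current_type, current_words = tag[2:], [word]
    if current_type is not None:
        chunks.append((current_type, current_words))
    return chunks


example = [('Mark', 'B-PERSON'), ('works', 'O'), ('at', 'O'),
           ('New', 'B-ORGANIZATION'), ('York', 'I-ORGANIZATION'),
           ('Times', 'I-ORGANIZATION'), ('.', 'O')]
chunks = iob_to_chunks(example)
print(chunks)
```

Note that two neighbouring entities of the same type stay separate only because the second one starts with its own B- tag.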
IOB Tagging
Here’s how to convert between the nltk.Tree and IOB format:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import conlltags2tree, tree2conlltags

sentence = "Mark and John are working at Google."
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))

# nltk.Tree -> list of (word, pos, iob) triplets
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
"""[('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'), ...]"""

# list of triplets -> nltk.Tree
ne_tree = conlltags2tree(iob_tagged)
print(ne_tree)
"""(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)"""
GMB Corpus
NLTK doesn’t ship a proper English corpus for NER. It has the CoNLL 2002 Named Entity corpus, but that
one covers only Spanish and Dutch. You can definitely try the method presented here on that corpus. In fact,
doing so would be easier because NLTK provides a good corpus reader for it. We are going with the Groningen
Meaning Bank (GMB) though.
GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect. It is not a gold
standard corpus, meaning that it’s not completely human annotated and it’s not considered 100%
correct. The corpus was created by running existing automatic annotators and then having humans correct
the output where needed.
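The slides do not show how the corpus is read. As a rough sketch, assume each sentence has already been loaded as a list of (word, pos, entity) triples, where the entity is a plain label such as 'geo' or 'per' and 'O' marks non-entities. That in-memory layout, the label 'per' and the helper name `to_iob` are assumptions for illustration, not the GMB file format itself. The labels could then be turned into IOB tags in the [((word, pos), iob), ...] shape used for training like this:

```python
def to_iob(annotated_sentence):
    """Turn [(word, pos, entity), ...] into [((word, pos), iob_tag), ...]."""
    result = []
    previous = 'O'
    for word, pos, entity in annotated_sentence:
        if entity == 'O':
            iob = 'O'
        elif entity == previous:
            # Same label as the previous token: we are inside the same chunk
            iob = 'I-' + entity
        else:
            # Label changed: a new chunk begins here
            iob = 'B-' + entity
        result.append(((word, pos), iob))
        previous = entity
    return result


sentence = [('Mark', 'NNP', 'per'), ('Smith', 'NNP', 'per'), ('flew', 'VBD', 'O'),
            ('to', 'TO', 'O'), ('Germany', 'NNP', 'geo'), ('.', '.', 'O')]
iob_sentence = to_iob(sentence)
print(iob_sentence)
```

One limitation of this simple rule: two distinct entities of the same type that sit next to each other are merged into a single chunk, because plain labels carry no boundary information.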
Feature Extraction
import string
from nltk.stem.snowball import SnowballStemmer

# init the stemmer once; rebuilding it for every token is wasteful
stemmer = SnowballStemmer('english')

def features(tokens, index, history):
    """
    `tokens` = a POS-tagged sentence [(w1, t1), ...]
    `index` = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """
    # Pad the sequence with placeholders
    tokens = ([('[START2]', '[START2]'), ('[START1]', '[START1]')] + list(tokens) +
              [('[END1]', '[END1]'), ('[END2]', '[END2]')])
    history = ['[START2]', '[START1]'] + list(history)
    # shift the index by 2, to accommodate the padding
    index += 2
Feature Extraction
    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob = history[index - 1]
    contains_dash = '-' in word
    contains_dot = '.' in word
    # True only if every character is an ASCII letter
    allascii = all(c in string.ascii_letters for c in word)
    # upper(), not capitalize(): capitalize() only checks for "Titlecase"
    allcaps = word == word.upper()
    capitalized = word[0] in string.ascii_uppercase
    prevallcaps = prevword == prevword.upper()
    prevcapitalized = prevword[0] in string.ascii_uppercase
    # the "next" features must look at nextword, not prevword
    nextallcaps = nextword == nextword.upper()
    nextcapitalized = nextword[0] in string.ascii_uppercase
Feature Extraction
    return {'word': word,
            'lemma': stemmer.stem(word),
            'pos': pos,
            'all-ascii': allascii,
            'next-word': nextword,
            'next-lemma': stemmer.stem(nextword),
            'next-pos': nextpos,
            'next-next-word': nextnextword,
            'next-next-pos': nextnextpos,
            'prev-word': prevword,
            'prev-lemma': stemmer.stem(prevword),
            'prev-pos': prevpos,
            'prev-prev-word': prevprevword,
            'prev-prev-pos': prevprevpos,
            'prev-iob': previob,
            'contains-dash': contains_dash,
            'contains-dot': contains_dot,
            'all-caps': allcaps,
            'capitalized': capitalized,
            'prev-all-caps': prevallcaps,
            'prev-capitalized': prevcapitalized,
            'next-all-caps': nextallcaps,
            'next-capitalized': nextcapitalized}
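The padding trick used by the feature function is easy to miss, so here it is in isolation. This is a small sketch with a hypothetical helper, `context_window`, that works on plain words instead of (word, pos) pairs. It shows why looking two positions left or right never runs off the end of the sentence.

```python
def context_window(words, index, size=2):
    """Return the words from index-size to index+size, padded at the edges."""
    pad_left = ['[START%d]' % i for i in range(size, 0, -1)]
    pad_right = ['[END%d]' % i for i in range(1, size + 1)]
    padded = pad_left + list(words) + pad_right
    # Shift the index so it still points at the same word after padding
    shifted = index + size
    return padded[shifted - size: shifted + size + 1]


words = ['Mark', 'works', 'at', 'Google']
first = context_window(words, 0)
last = context_window(words, 3)
print(first)
print(last)
```

For the first word the two left neighbours are the START placeholders, and for the last word the two right neighbours are the END placeholders, so every token gets a feature set of the same shape.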
Training the system
from collections.abc import Iterable
from nltk.tag import ClassifierBasedTagger
from nltk.chunk import ChunkParserI, conlltags2tree

class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, **kwargs):
        # train_sents: sentences in the [((w1, t1), iob1), ...] format
        assert isinstance(train_sents, Iterable)
        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=train_sents,
            feature_detector=features,
            **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)
        # Transform the result from [((w1, t1), iob1), ...]
        # to the preferred list of triplets format [(w1, t1, iob1), ...]
        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]
        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)
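The `history` argument of the feature function makes more sense once you see the tagging loop. The sketch below roughly imitates how a sequential tagger feeds its own earlier predictions back in. It is not NLTK's implementation: `greedy_tag`, `toy_features` and `toy_classify` are hypothetical stand-ins, and the "classifier" is a hard-coded rule rather than a trained model.

```python
def greedy_tag(tokens, feature_detector, classify):
    """Tag tokens left to right, passing earlier predictions along as history."""
    history = []
    for index in range(len(tokens)):
        featureset = feature_detector(tokens, index, history)
        history.append(classify(featureset))
    return list(zip(tokens, history))


def toy_features(tokens, index, history):
    # Stand-in for the real feature function: only the POS tag and previous IOB tag
    word, pos = tokens[index]
    return {'pos': pos, 'prev-iob': history[-1] if history else '[START1]'}


def toy_classify(featureset):
    # Stand-in for a trained classifier: proper nouns are entities
    if featureset['pos'] != 'NNP':
        return 'O'
    inside = featureset['prev-iob'] not in ('O', '[START1]')
    return 'I-ENT' if inside else 'B-ENT'


tokens = [('New', 'NNP'), ('York', 'NNP'), ('is', 'VBZ'), ('big', 'JJ')]
tagged = greedy_tag(tokens, toy_features, toy_classify)
print(tagged)
```

The output has the same [((w1, t1), iob1), ...] shape that `parse` converts into triplets.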
Taking it for a spin
# training_samples / test_samples: GMB sentences in the [((w, t), iob), ...] format (loading not shown)
from nltk import pos_tag, word_tokenize
from nltk.chunk import conlltags2tree

chunker = NamedEntityChunker(training_samples[:2000])
print(chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday."))))
"""
(S
  I/PRP
  'm/VBP
  going/VBG
  to/TO
  (geo Germany/NNP)
  this/DT
  (tim Monday/NNP)
  ./.)
"""
score = chunker.evaluate([conlltags2tree([(w, t, iob) for (w, t), iob in iobs]) for iobs in test_samples[:500]])
print(score.accuracy())  # the original run reported 0.931132334092 - Awesome :D
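A note on reading that number: the accuracy is computed per IOB tag, so the many O tokens count too. The sketch below uses a hypothetical helper, `tag_accuracy`, written in plain Python rather than NLTK's scorer, to show how far a "tagger" gets on the example sentence by predicting O everywhere.

```python
def tag_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted IOB tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Gold tags for "I 'm going to Germany this Monday ."
gold = [['O', 'O', 'O', 'O', 'B-geo', 'O', 'B-tim', 'O']]
all_outside = [['O'] * 8]
baseline = tag_accuracy(gold, all_outside)
print(baseline)
```

Six of the eight tags are O, so this do-nothing baseline already scores 0.75. The score object returned by the chunker also exposes precision(), recall() and f_measure(), which are computed over chunks and are probably worth checking alongside accuracy.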
Conclusions
• Chunking can be reduced to a tagging problem.
• Named Entity Recognition is a form of chunking.
• We explored a freely available corpus that can be used for real-world applications.
• The NLTK classifier can be replaced with any classifier you can think of. Try replacing it with a
scikit-learn classifier.
