1. Build your own Named Entity
Recognizer with Python
Bogdan @ http://nlpforhackers.io
2. What is NER?
Short for Named Entity Recognition
Usually the first step towards information extraction
Extracts mentions of real-world entities from text
Person, Organization, Event, etc.
3. NLTK NER Chunker
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "Mark and John are working at Google."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
"""
(S
(PERSON Mark/NNP)
and/CC
(PERSON John/NNP)
are/VBP
working/VBG
at/IN
(ORGANIZATION Google/NNP)
./.)
"""
4. IOB Tagging
nltk.Tree is great for processing such information in Python, but it’s not the standard way of annotating
chunks. This could be an article of its own, but we’ll cover it here really quickly.
The IOB Tagging system contains tags of the form:
1. B-{CHUNK_TYPE} – for the word that begins a chunk
2. I-{CHUNK_TYPE} – for words inside a chunk
3. O – for words outside any chunk
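For example, here is how a sentence with a multi-word entity looks in IOB format (GPE, short for geo-political entity, is one of the chunk types NLTK’s `ne_chunk` emits). Without the B-/I- distinction we couldn’t tell where one entity ends and an adjacent one begins:

```python
# A hand-annotated sentence with a multi-word entity. "New" opens the
# GPE chunk (B-GPE), "York" continues it (I-GPE), and every token
# outside a chunk is tagged O.
iob_sentence = [
    ('John',  'NNP', 'B-PERSON'),
    ('lives', 'VBZ', 'O'),
    ('in',    'IN',  'O'),
    ('New',   'NNP', 'B-GPE'),
    ('York',  'NNP', 'I-GPE'),
    ('.',     '.',   'O'),
]
```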
5. IOB Tagging
Here’s how to convert between the nltk.Tree and IOB format:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import conlltags2tree, tree2conlltags
sentence = "Mark and John are working at Google."
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
"""[('Mark', 'NNP', u'B-PERSON'), ('and', 'CC', u'O'), ('John', 'NNP', u'B-PERSON'), ('are', 'VBP', u'O'), ...]"""
ne_tree = conlltags2tree(iob_tagged)
print(ne_tree)
"""(S
(PERSON Mark/NNP)
and/CC
(PERSON John/NNP)
are/VBP
working/VBG
at/IN
(ORGANIZATION Google/NNP)
./.)"""
6. GMB Corpus
NLTK doesn’t have a proper English corpus for NER. It does include the CoNLL 2002 Named Entity corpus, but
that one covers only Spanish and Dutch. You can definitely try the method presented here on those corpora; in
fact, doing so would be easier because NLTK provides a good corpus reader. We are going with the Groningen
Meaning Bank (GMB) though.
GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect. It is not a gold
standard corpus, meaning that it’s not fully human-annotated and it’s not considered 100%
correct. The corpus was created by running existing annotators and then correcting the output by hand
where needed.
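One practical consequence: GMB annotates tokens with labels like `geo-nam` or `per-giv` rather than IOB tags, so a small conversion step is needed before training. A minimal sketch of that step (the exact column layout varies between GMB releases, and `gmb_to_iob` is a name made up here, not part of NLTK):

```python
def gmb_to_iob(tagged_tokens):
    """Convert GMB-style NER labels (e.g. 'geo-nam', 'per-giv', 'O') to IOB.

    GMB labels carry a subcategory after the dash; we keep only the main
    class and add B-/I- prefixes by comparing each label to the previous one.
    Note: two distinct same-class entities that touch would merge into one
    chunk under this simple scheme.
    """
    iob_tokens = []
    prev_label = 'O'
    for word, pos, label in tagged_tokens:
        main = label.split('-')[0] if label != 'O' else 'O'
        if main == 'O':
            iob = 'O'
        elif main == prev_label:
            iob = 'I-' + main   # continuing the current chunk
        else:
            iob = 'B-' + main   # starting a new chunk
        iob_tokens.append((word, pos, iob))
        prev_label = main
    return iob_tokens
```

For example, `gmb_to_iob([('New', 'NNP', 'geo-nam'), ('York', 'NNP', 'geo-nam')])` yields `[('New', 'NNP', 'B-geo'), ('York', 'NNP', 'I-geo')]`.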
7. Feature Extraction
import string
from nltk.stem.snowball import SnowballStemmer
def features(tokens, index, history):
    """
    `tokens` = a POS-tagged sentence [(w1, t1), ...]
    `index` = the index of the token we want to extract features for
    `history` = the previous predicted IOB tags
    """
    # init the stemmer
    stemmer = SnowballStemmer('english')
    # Pad the sequence with placeholders
    tokens = [('[START2]', '[START2]'), ('[START1]', '[START1]')] + list(tokens) \
        + [('[END1]', '[END1]'), ('[END2]', '[END2]')]
    history = ['[START2]', '[START1]'] + list(history)
    # shift the index with 2, to accommodate the padding
    index += 2
8. Feature Extraction
    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prevprevword, prevprevpos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    nextnextword, nextnextpos = tokens[index + 2]
    previob = history[index - 1]
    contains_dash = '-' in word
    contains_dot = '.' in word
    allascii = all(c in string.ascii_lowercase for c in word.lower())
    allcaps = word == word.upper()
    capitalized = word[0] in string.ascii_uppercase
    prevallcaps = prevword == prevword.upper()
    prevcapitalized = prevword[0] in string.ascii_uppercase
    nextallcaps = nextword == nextword.upper()
    nextcapitalized = nextword[0] in string.ascii_uppercase
    # Collect the indicators into the feature dict the classifier consumes
    return {
        'word': word, 'lemma': stemmer.stem(word), 'pos': pos,
        'all-ascii': allascii, 'all-caps': allcaps, 'capitalized': capitalized,
        'contains-dash': contains_dash, 'contains-dot': contains_dot,
        'prev-word': prevword, 'prev-pos': prevpos, 'prev-iob': previob,
        'prev-all-caps': prevallcaps, 'prev-capitalized': prevcapitalized,
        'prev-prev-word': prevprevword, 'prev-prev-pos': prevprevpos,
        'next-word': nextword, 'next-pos': nextpos,
        'next-all-caps': nextallcaps, 'next-capitalized': nextcapitalized,
        'next-next-word': nextnextword, 'next-next-pos': nextnextpos,
    }
10. Training the system
from collections.abc import Iterable
from nltk.tag import ClassifierBasedTagger
from nltk.chunk import ChunkParserI
class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, **kwargs):
        assert isinstance(train_sents, Iterable)
        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=train_sents,
            feature_detector=features,
            **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)
        # Transform the result from [((w1, t1), iob1), ...]
        # to the preferred list of triplets format [(w1, t1, iob1), ...]
        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]
        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)
11. Taking it for a spin
chunker = NamedEntityChunker(training_samples[:2000])
from nltk import pos_tag, word_tokenize
print(chunker.parse(pos_tag(word_tokenize("I'm going to Germany this Monday."))))
"""
(S
I/PRP
'm/VBP
going/VBG
to/TO
(geo Germany/NNP)
this/DT
(tim Monday/NNP)
./.)
"""
score = chunker.evaluate([conlltags2tree([(w, t, iob) for (w, t), iob in iobs]) for iobs in test_samples[:500]])
print(score.accuracy())  # 0.931132334092 - Awesome :D
12. Conclusions
Chunking can be reduced to a tagging problem.
Named Entity Recognition is a form of chunking.
We explored a freely available corpus that can be used for real-world applications.
The NLTK classifier can be replaced with any classifier you can think of. Try swapping in a
scikit-learn classifier, for example.
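For that last point, here is a minimal sketch of what the scikit-learn route could look like, with toy feature dicts standing in for the real output of the `features()` function (assumes scikit-learn is installed; the feature names and tiny training set are purely illustrative):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: one feature dict per token, paired with its IOB tag.
# In practice these dicts would come from features() applied over the
# whole GMB training set.
X = [
    {'word': 'mark',   'capitalized': True,  'prev-word': '[START1]'},
    {'word': 'and',    'capitalized': False, 'prev-word': 'mark'},
    {'word': 'john',   'capitalized': True,  'prev-word': 'and'},
    {'word': 'at',     'capitalized': False, 'prev-word': 'working'},
    {'word': 'google', 'capitalized': True,  'prev-word': 'at'},
]
y = ['B-per', 'O', 'B-per', 'O', 'B-org']

# DictVectorizer turns the feature dicts into sparse one-hot vectors,
# so any scikit-learn classifier can train on them directly.
model = make_pipeline(DictVectorizer(sparse=True), LogisticRegression())
model.fit(X, y)
```

To keep the greedy left-to-right decoding that `ClassifierBasedTagger` performs (where each prediction feeds the `history` of the next), you would still need a small wrapper loop around `model.predict`.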