Jeffrey Williams presented on using the spaCy library for natural language processing. He began with an overview of key NLP concepts like tokenization, lemmatization, named entity recognition and part-of-speech tagging. He then demonstrated spaCy's features for these tasks and how to visualize outputs. Williams also showed how to apply spaCy to problems by extending its pipeline and training custom models. Finally, he reviewed alternatives to spaCy like NLTK, CoreNLP and cloud-based services from Microsoft and Google.
1. Industrial Strength
Natural Language Processing
I am Jeffrey Williams
I am here to provide meaning to unstructured text
I work @ Label Insight
You can find me at @jeffxor
Label Insight is Hiring!
https://www.labelinsight.com/careers/topic/engineering
2. Caveats
◇ I am not a linguistics specialist
◇ I am not a natural language specialist
◇ I am not a data scientist
◇ I am a software engineer
This talk is aimed at software engineers trying to tackle
text problems by extracting meaning or understanding
4. Let’s review some
NLP concepts
Sentence Boundary Detection
Sentence boundaries are often
marked by periods or other
punctuation marks, but these
same characters can serve other
purposes
Tokens/Word Segmentation
Separate a chunk of continuous
text into separate words. Text
segmentation is a significant task
requiring knowledge of the
vocabulary and morphology.
Stemming/Lemmatization
reduce inflectional forms of a
word to a common base form
am, are, is -> be
car, cars, car's, cars' -> car
Named Entity Recognition
Given a stream of text, determine
which items in the text map to
proper names, such as people or
places, and what the type of each
such name is (e.g. person,
location, organization).
Parts of Speech Tagging
Given a sentence, determine the
part of speech for each word.
Many words, especially common
ones, can serve as multiple parts
of speech.
Word sense disambiguation
Many words have more than one
meaning; we have to select the
meaning which makes the most
sense in context.
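The sentence boundary problem above can be illustrated with a minimal, standard-library sketch. The abbreviation list and the helper name `split_sentences` are hypothetical and only cover this sample text; a real segmenter needs far more than this.

```python
import re

text = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. sharp."

# Naive rule: a sentence ends at every period that is followed by whitespace.
naive = re.split(r"(?<=\.)\s+", text)
# naive == ["Dr.", "Smith paid $3.50 for coffee.", "He left at 5 p.m.", "sharp."]

# Hypothetical, deliberately tiny abbreviation list for this example only.
ABBREVIATIONS = {"Dr.", "p.m."}


def split_sentences(text):
    """Split on tokens ending in a period unless they are known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a final period
        sentences.append(" ".join(current))
    return sentences


better = split_sentences(text)
# better == ["Dr. Smith paid $3.50 for coffee.", "He left at 5 p.m. sharp."]
```

The naive rule splits after "Dr." and "p.m.", which is exactly the ambiguity the slide describes: the same character marks both abbreviations and sentence ends.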
5. spaCy.io Introduction
◇ Open-source library for advanced Natural Language Processing (NLP) in Python
◇ Opinionated NLP library (not an API/Service)
◇ Number of pretrained models for common
languages
◇ Great documentation and example code
◇ Helps build information extraction & natural
language understanding systems
spaCy.io is a very powerful library that has many extension
points, allowing for training and pipeline configuration
6. spaCy.io Features
Lemmatization
Assigning the base forms of
words. For example, the lemma of
"was" is "be", and the lemma of
"rats" is "rat".
Rule-based Matching
Finding sequences of tokens
based on their texts and linguistic
annotations, similar to regular
expressions.
Similarity
Comparing words, text spans and
documents and how similar they
are to each other.
(POS) Part-of-speech Tagging
Assigning word types to tokens,
like verb or noun.
(NER) Named Entity Recognition
Labelling named "real-world"
objects, like persons, companies
or locations.
Dependency Parsing
Assigning syntactic dependency
labels, describing the relations
between individual tokens, like
subject or object.
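As a rough illustration of the rule-based matching feature listed above, the sketch below matches a token sequence on a blank English pipeline, so no pretrained model is needed. It assumes spaCy v3 (or v2.2.2 and later), where `Matcher.add` takes a list of patterns; the v2.0 release covered in this talk used a different `add` signature. The pattern name `HELLO_WORLD` is made up for the example.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is enough for text-based patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello" (any case), a punctuation token, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
# Only the first phrase has punctuation between the two words.
```

Patterns can also reference linguistic annotations such as part-of-speech tags, but those require a pipeline with a trained tagger.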
7. Language Support
spaCy v2.0 features new neural models for
tagging, parsing and entity recognition. The
models have been designed and implemented
from scratch specifically for spaCy, to give you
an unmatched balance of speed, size and
accuracy.
Each model name combines the language (e.g. English),
the training data (web, news, etc.) and the size of the
model (sm, md, lg)
https://spacy.io/usage/models
8. Provided Named Entities
In my experience, spaCy's location entities are not as
well trained as those of Google Cloud Natural Language
https://spacy.io/api/annotation#section-named-entities
9. Parts-of-Speech Tagging
Maps all language-specific part-of-speech tags
to a small, fixed set of word type tags following
the Universal Dependencies scheme.
https://spacy.io/api/annotation#section-pos-tagging
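11. Dependency Visualization
import spacy
from spacy import displacy
# 'en' is the spaCy v2 shortcut name for the default English model
nlp = spacy.load('en')
doc = nlp(u'This is a sentence.')
# Start a local web server that renders the dependency parse
displacy.serve(doc, style='dep')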
12. Named Entity Visualization
import spacy
from spacy import displacy
text = """But Google is starting from
behind. The company made a late push
into hardware, and Apple’s Siri,
available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot
devices, have clear leads in
consumer adoption."""
# Load a custom-trained NER model
nlp = spacy.load('custom_ner_model')
doc = nlp(text)
# Start a local web server that highlights the recognized entities
displacy.serve(doc, style='ent')
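The custom model in this slide is not publicly available, so here is a hedged sketch of the same entity visualization using displaCy's manual mode, which renders hand-supplied spans without loading any model. The entity offsets and labels are written by hand for illustration and are not model output. `displacy.render` returns the markup as a string instead of starting a server.

```python
from spacy import displacy

# Hand-annotated example: character offsets into "text", not model predictions.
example = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}

# manual=True tells displaCy to render the dict as-is instead of a Doc.
# page=False returns only the markup fragment, not a full HTML page.
html = displacy.render(example, style="ent", manual=True, page=False)
```

The resulting string can be written to a file or embedded in a web page.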
14. Navigating Parse Trees
◇ navigate the parse tree including subtrees attached
to a word
◇ Noun chunks (noun plus the words describing the
noun)
◇ The terms head and child describe the words
connected by a single arc
◇ The term dep is used for the arc label (the type of
syntactic relation)
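The navigation attributes above can be tried without a trained parser by building a Doc with hand-written heads and labels. This sketch assumes spaCy v3, whose `Doc` constructor accepts `heads` and `deps`. The tree below is an illustrative annotation written for this example, not parser output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["Autonomous", "cars", "shift", "insurance", "liability"]
# heads[i] is the index of the head of token i; the root points at itself.
heads = [1, 2, 2, 4, 2]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

root = doc[2]                                  # "shift"
children = [t.text for t in root.children]     # words one arc below the root
subtree = [t.text for t in doc[4].subtree]     # "liability" plus its dependents
arc = (doc[1].text, doc[1].dep_, doc[1].head.text)  # child, arc label, head
```

With a pipeline that includes a trained parser, the same attributes are filled in automatically when you call `nlp(text)`.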
15. Phrase Matcher
◇ efficiently match large terminology lists
◇ match sequences based on lists of token
descriptions
◇ accepts match patterns in the form of Doc objects
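A small sketch of the phrase matcher on a blank English pipeline. It assumes spaCy v3 (or v2.2.2 and later), where `PhraseMatcher.add` takes a list of Doc patterns, and uses `attr="LOWER"` for case-insensitive matching. The terminology list and the `NLP_TOOL` key are made up for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercase forms

# Hypothetical terminology list; real lists can hold many thousands of terms.
terms = ["Google Cloud Natural Language", "Stanford CoreNLP", "spaCy"]
# make_doc only tokenizes, which keeps pattern creation cheap.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("NLP_TOOL", patterns)

doc = nlp("We compared spacy with Stanford CoreNLP last year.")
# Sort by start token so the results follow document order.
found = [doc[start:end].text
         for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]
```

The patterns are Doc objects rather than token descriptions, so a term list can be passed through the tokenizer once and reused.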
17. Training Data
Provide additional data to
either adjust an existing
model or build your own
model.
https://prodi.gy/
spaCy.io Extensions
Functionality
Number of extension points to
add customizations
◇ Adjust pipeline
◇ Add new pipeline features
◇ Add functionality to core
components
◇ Add callback functions into
pipeline processes
18. spaCy.io Pipeline
Disabling/Modifying
If you don't need a particular
component of the pipeline (for
example, the tagger or the parser),
you can disable loading it.
This can sometimes make a big
difference and improve loading
speed.
Custom Components
Custom components can be
added to the pipeline.
You can add one before or
after an existing component, tell
spaCy to add it first or last in the
pipeline, or define a custom name.
E.g. add spell checking (hunspell)
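A minimal sketch of a custom component and its placement, assuming spaCy v3, where components are registered by name with `@Language.component`. In the v2 release covered here, the function itself was passed to `add_pipe`. The component name `token_counter` and the `n_tokens` key are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")
def token_counter(doc):
    """Toy component: record the number of tokens on the Doc and pass it on."""
    doc.user_data["n_tokens"] = len(doc)
    return doc  # every component must return the Doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")                          # built-in rule-based splitter
nlp.add_pipe("token_counter", before="sentencizer")  # position relative to another
pipeline = nlp.pipe_names

doc = nlp("This is a sentence. This is another one.")
n_tokens = doc.user_data["n_tokens"]
```

`add_pipe` also accepts `after=`, `first=True` and `last=True` for the other placement options mentioned above.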
Extension Attributes
Allows you to set any custom
attributes and methods on the
Doc, Span and Token.
Use them to attach additional
information relevant to your
application, add new features
and functionality to spaCy, and
implement your own models.
E.g. improve spaCy's sentence
boundary detection
https://spacy.io/usage/processing-pipelines
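Extension attributes can be sketched as follows. The attribute names `is_brand` and `brands` and the brand list are hypothetical. `set_extension` and the `._.` namespace are part of spaCy's API from v2.0 onwards, and `force=True` only avoids an error if the snippet is run twice in the same session.

```python
import spacy
from spacy.tokens import Doc, Token

BRANDS = {"google", "apple", "amazon"}  # hypothetical lookup list

# Getter-based attributes are computed on access.
Token.set_extension(
    "is_brand", getter=lambda token: token.lower_ in BRANDS, force=True
)
Doc.set_extension(
    "brands", getter=lambda doc: [t.text for t in doc if t._.is_brand], force=True
)

nlp = spacy.blank("en")
doc = nlp("Google and Apple compete with Amazon.")
brands = doc._.brands  # custom attributes live under the ._ namespace
```

Extensions can also be declared with a `default=` value and then written to by a custom pipeline component.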
19. Processing Pipeline
The Language object coordinates
these components. It takes raw text
and sends it through the pipeline,
returning an annotated document. It
also orchestrates training and
serialization.
https://spacy.io/usage/processing-pipelines
20. Named Entity Extension
Adding Additional Entity Types
Need a few hundred labeled sentences
for a good start; mix in examples of other
entity types
Actual training is performed by looping
over the examples: the model makes a
prediction, which is compared against the
gold-standard annotations
train_data = [
("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
("Google rebrands its business apps", [(0, 6, "ORG")]),
("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
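Entity annotations in this format are character offsets into the text, with an exclusive end, so off-by-one mistakes are easy to make. The following standard-library sketch flags suspicious spans before training. The helper name `find_suspicious_spans` is hypothetical, and two of the sample spans are deliberately wrong to show what gets caught.

```python
def find_suspicious_spans(examples):
    """Return (text, start, end, label, reason) for spans that look misaligned."""
    problems = []
    for text, spans in examples:
        for start, end, label in spans:
            if not 0 <= start < end <= len(text):
                problems.append((text, start, end, label, "out of range"))
            elif text[start:end] != text[start:end].strip():
                problems.append((text, start, end, label, "surrounding whitespace"))
    return problems


examples = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    # Deliberate off-by-one: "Spotify" is 7 characters, so (0, 8) includes a space.
    ("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")]),
    # Deliberately out of range: the text is only 29 characters long.
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 30, "GPE")]),
]

problems = find_suspicious_spans(examples)
reasons = [reason for *_, reason in problems]
```

A check like this is cheap to run on every annotation batch and catches spans that would otherwise fail to align with token boundaries.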
Update a pre-trained Model
Need to provide many examples (a few hundred)
to meaningfully improve the system
https://spacy.io/usage/training#section-ner
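The training loop described on this slide could look roughly like the sketch below. It assumes spaCy v3, which uses `Example` objects and `nlp.initialize` in place of the v2 API this talk was written against. It starts from a blank pipeline so it runs without downloading a model, and three sentences are far too few to learn anything useful.

```python
import random

import spacy
from spacy.training import Example

# v3-style annotations: a dict with an "entities" list of (start, end, label).
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
    ("Google Maps launches location sharing", {"entities": [(0, 11, "PRODUCT")]}),
]

random.seed(0)
nlp = spacy.blank("en")       # assumption: start blank instead of pre-trained
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

examples = [Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)

for epoch in range(5):
    random.shuffle(examples)  # avoid learning the order of the examples
    losses = {}
    for example in examples:
        # Predict, compare with the gold annotations, and update the weights.
        nlp.update([example], sgd=optimizer, losses=losses)
```

When updating an existing model rather than a blank one, it helps to mix in examples of the entity types the model already knows, as the slide suggests, so that it does not forget them.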
21. Custom Semantics
◇ spaCy's parser can be trained to
predict any type of tree
structure over your input text
◇ Can be useful for
conversational applications
◇ Train spaCy's parser to label
intents and their targets, like
attributes, quality, time and
locations
https://spacy.io/usage/training#section-tagger-parser
22. spaCy.io Lessons Learnt
An attempt to summarize my learning curve, both from
implementation and from business buy-in
26. spaCy.io Alternatives
There are many alternatives available; they tend to fall into two
categories: alternative libraries and hosted solutions
27. Alternate Libraries
◇ NLTK Natural Language Toolkit (Python)
◇ Stanford CoreNLP (Java)
◇ NLP4J (Java)
Libraries allow you to configure, extend and train for your
problem domain
28. Alternate Hosted Solutions
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language
Hosted solutions provide a generic solution
◇ Well trained models
◇ Basic/Generic Named Entities
◇ Unable to model/train for your domain (yet!)
29. Thanks!
Any questions?
You can find me at:
◇ @jeffxor
◇ jwilliams@labelinsight.com
◇ https://speakerrate.com/speakers/181771 (Feedback)
30. Useful Information
This presentation used the following resources:
◇ spacy.io
◇ spacy.io github
◇ explosion.ai/demos/
◇ Natural Language Processing Wikipedia
◇ Stanford CoreNLP
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language