Crash Course in Natural Language Processing (2016)

Crash Course in
Natural Language Processing
Vsevolod Dyomkin
04/2016

A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io

A Bit about Grammarly
The best English language writing app
Spellcheck - Grammar check - Style
improvement - Synonyms and word choice
Plagiarism check

Plan
* Overview of NLP
* Where to get Data
* Common NLP problems
and approaches
* How to develop an NLP
system

What Is NLP?
Transforming free-form text
into structured data and back
Intersection of:
* Computational Linguistics
* CompSci & AI
* Stats & Information Theory

Linguistic Basis
* Syntax (form)
* Semantics (meaning)
* Pragmatics (intent/logic)

Natural Language
* ambiguous
* noisy
* evolving

Time flies like an arrow.
Fruit flies like a banana.
I read a story about evolution in ten minutes.
I read a story about evolution in the last million years.

NLP & Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs

Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* User Data

Where to Get Data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites &
the academic community:
Stanford, Oxford, CMU, ...

Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB

Classic NLP Problems
* Linguistically-motivated:
segmentation, tagging, parsing
* Analytical:
classification, sentiment analysis
* Transformation:
translation, correction, generation
* Conversation:
question answering, dialog

Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't"
"so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital -
Finland Finlands Finland’s
* what’re, I’m, isn’t -
what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.

Regular Expressions
Simplest regex: [^\s]+
More advanced regex:
\w+|[!"#$%&'*+,./:;<=>?@^`~…() {}\[|\]⟨⟩ ‒–—―«»“”‘’-]
Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…() {}\[|\]⟨⟩ ‒–—―«»“”‘’']
|[.!?]+
|-+

Post-processing
* concatenate abbreviations and decimals
* split contractions with regexes
2-character:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
3-character:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
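A minimal sketch of the two steps above (regex tokenization, then splitting contractions), in Python. The pattern is a deliberately simplified stand-in for the regexes on the slides, and the names TOKEN_RE, NT_RE and tokenize are made up for this illustration.

```python
import re

# Simplified token pattern (an assumption, not the full regex from the slide).
TOKEN_RE = re.compile(r"""
    [+-]?[0-9](?:[0-9,.]*[0-9])?   # numbers with inner separators
  | \w+(?:['’-]\w+)*               # words with inner apostrophes or hyphens
  | [.!?]+                         # runs of sentence-final punctuation
  | \S                             # any other single non-space symbol
""", re.VERBOSE)

# Post-processing rule: split a trailing "n't" into its own token.
NT_RE = re.compile(r"^(\w+)(n['’]t)$", re.IGNORECASE)

def tokenize(text):
    """Return a list of tokens, with "n't" contractions split off."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        match = NT_RE.match(tok)
        if match:
            tokens.extend(match.groups())
        else:
            tokens.append(tok)
    return tokens

print(tokenize("This is a test that isn't so simple: 1.23."))
```

On the example sentence from the Tokenization slide this yields the same twelve tokens, ending in "1.23" and ".". The other contraction patterns ('m, 's, 'd, 'll, 're, 've) would be handled by further rules of the same shape.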

Rule-based Approach
* easy to understand and
reason about
* can be arbitrarily precise
* iterative, can be used to
gather more data
Limitations:
* recall problems
* poor adaptability

Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE

Statistical Approach
“Probability theory
is nothing but
common sense
reduced to calculation.”
-- Pierre-Simon Laplace

Language Models
Question: what is the probability of a
sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …

Ngrams
Apply Markov assumption: each word depends
only on N previous words (in practice
N=1..4 which results in bigrams-fivegrams,
because we include the current word also).
If N=2 (trigrams):
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w1 w2) * …
According to the chain rule:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
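The estimate above can be sketched with plain counts. For brevity this example uses N=1 (bigrams) rather than N=2, applies no smoothing, and pads each sentence with a "<s>" start marker; the function names and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words          # "<s>" marks the sentence start
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_prob(words, unigrams, bigrams):
    """P(S) as a product of P(cur|prev) = count(prev cur) / count(prev)."""
    prob = 1.0
    padded = ["<s>"] + words
    for prev, cur in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0                    # unseen history: no estimate
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = [["time", "flies"], ["fruit", "flies"], ["time", "flies", "fast"]]
uni, bi = train_bigrams(corpus)
print(sentence_prob(["time", "flies"], uni, bi))
```

Any sentence containing an unseen bigram gets probability 0 here, which is why practical models add smoothing or back-off.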

Spelling Correction
Problem: given an out-of-dictionary
word return a list of most probable
in-dictionary corrections.
http://norvig.com/spell-correct.html

Edit Distance
Minimum-edit (Levenshtein) distance: the
minimum number of
insertions/deletions/substitutions needed
to transform string A into B.
Other distance metrics:
* the Damerau-Levenshtein distance adds
another operation: transposition
* the longest common subsequence (LCS)
metric allows only insertion and deletion,
not substitution
* the Hamming distance allows only
substitution, hence, it only applies to
strings of the same length

Dynamic Programming
Initialization (unit weights):
D(i,0) = i
D(0,j) = j
Recurrence relation:
For each i = 1..M
  For each j = 1..N
    D(i,j) = D(i-1,j-1), if X(i) = Y(j)
    otherwise:
    D(i,j) = min( D(i-1,j) + w_del(X(i)),
                  D(i,j-1) + w_ins(Y(j)),
                  D(i-1,j-1) + w_subst(X(i),Y(j)) )
Result: D(M,N)
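The recurrence can be written out directly in Python. This sketch assumes constant weights (one number per operation instead of per-character weight functions); the name edit_distance and its parameters are chosen for this example.

```python
def edit_distance(x, y, w_del=1, w_ins=1, w_subst=1):
    """Minimum cost of turning string x into string y."""
    m, n = len(x), len(y)
    # d[i][j] holds the distance between x[:i] and y[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del      # delete all of x[:i]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # characters match: no cost
            else:
                d[i][j] = min(d[i - 1][j] + w_del,
                              d[i][j - 1] + w_ins,
                              d[i - 1][j - 1] + w_subst)
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

Setting w_subst=2 makes a substitution cost as much as a deletion plus an insertion, which gives the insertion/deletion-only variant mentioned on the previous slide.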

Noisy Channel Model
Given an alphabet A, let A* be the set of all finite
strings over A. Let the dictionary D of valid words be
some subset of A*.
The noisy channel is the matrix G = P(s|w) where w in D is
the intended word and s in A* is the scrambled word that
was actually received.
P(s|w) = prod(P(x(i)|y(i)))
(a sum, if log-probabilities are used)
for x(i) in s* (s aligned with w)
and y(i) in w* (w aligned with s)
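A toy sketch of correction with a noisy channel: pick the dictionary word w that maximizes log P(s|w) + log P(w). The channel here is a hypothetical stand-in for the matrix G: it only handles equal-length strings (substitutions), with a fixed probability of keeping a character and a uniform probability for any other one. The word probabilities and all names are made up for the example.

```python
import math

def channel_logprob(s, w, p_correct=0.95, alphabet_size=26):
    """log P(s|w): sum of per-character log-probabilities over the alignment."""
    if len(s) != len(w):
        return float("-inf")   # this toy channel cannot insert or delete
    p_error = (1 - p_correct) / (alphabet_size - 1)
    return sum(math.log(p_correct if a == b else p_error)
               for a, b in zip(s, w))

def correct(s, word_probs):
    """Return the dictionary word with the highest posterior score."""
    return max(word_probs,
               key=lambda w: channel_logprob(s, w) + math.log(word_probs[w]))

# Hypothetical dictionary with prior probabilities P(w).
word_probs = {"the": 0.6, "then": 0.1, "tie": 0.1, "toe": 0.2}
print(correct("thw", word_probs))  # the
```

A realistic channel would estimate per-character confusion probabilities from observed misspellings and would also align insertions and deletions.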

Spam Filtering
A 2-class classification problem with a
bias towards minimizing FPs.
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of
complex features?

Bag-of-words Models
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%
http://www.paulgraham.com/spam.html
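One common bag-of-words classifier is naive Bayes; the sketch below is an illustration of the three assumptions listed above, not the exact scoring scheme from the linked essay. It uses add-one smoothing, and the training messages and function names are invented for the example.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns word counts per label and label counts."""
    word_counts, label_counts = {}, Counter()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    return word_counts, label_counts

def classify_nb(tokens, word_counts, label_counts):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)
        total = sum(counts.values())
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("buy cheap pills now".split(), "spam"),
        ("cheap pills cheap offer".split(), "spam"),
        ("meeting agenda for monday".split(), "ham"),
        ("lunch on monday".split(), "ham")]
wc, lc = train_nb(docs)
print(classify_nb("cheap pills".split(), wc, lc))  # spam
```

Word order never enters the score, so "pills cheap" and "cheap pills" are classified identically.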

Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

ML-based Parsing
The parser starts with an empty stack, and a buffer index at 0, with no
dependencies recorded. It chooses one of the valid actions, and applies it to
the state. It continues choosing actions and applying them until the stack is
empty and the buffer index is at the end of the input.
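SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]
def parse(words, tags):
    # init_deps, extract_features, score, get_valid_moves and transition
    # are helpers that are not shown here.
    n = len(words)
    deps = init_deps(n)
    # The first SHIFT is already applied: word 0 is on the stack.
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        # Greedily take the best-scoring move that is valid in this state.
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps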

Averaged Perceptron
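import random
def train(model, number_iter, examples):
    # model must provide predict(features) and a weights[feature][tag] table.
    # Only the perceptron update is shown; weight averaging is omitted.
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # Reward the true tag and penalize the wrong guess.
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)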
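For reference, a self-contained sketch of a model that fits this training loop and adds the averaging step. The Perceptron class, its method names and the toy examples are assumptions for illustration. The weights are summed naively after every example; real implementations usually track timestamps so that averaging is cheap.

```python
import random
from collections import defaultdict

class Perceptron:
    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(lambda: defaultdict(float))
        self._steps = 0

    def predict(self, features):
        scores = dict.fromkeys(self.classes, 0.0)
        for f in features:
            for tag, w in self.weights[f].items():
                scores[tag] += w
        return max(self.classes, key=lambda tag: scores[tag])

    def update(self, features, true_tag, guess):
        for f in features:
            self.weights[f][true_tag] += 1
            self.weights[f][guess] -= 1

    def accumulate(self):
        """Add the current weights to the running totals (one step)."""
        self._steps += 1
        for f, per_tag in self.weights.items():
            for tag, w in per_tag.items():
                self._totals[f][tag] += w

    def average(self):
        """Replace each weight by its mean over all training steps."""
        for f, per_tag in self._totals.items():
            for tag, total in per_tag.items():
                self.weights[f][tag] = total / self._steps

def train(model, number_iter, examples, seed=0):
    rng = random.Random(seed)
    examples = list(examples)
    for _ in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                model.update(features, true_tag, guess)
            model.accumulate()
        rng.shuffle(examples)
    model.average()

examples = [(["suffix=ing"], "VERB"), (["suffix=ed"], "VERB"),
            (["prev=the"], "NOUN"), (["prev=a"], "NOUN")]
model = Perceptron({"NOUN", "VERB"})
train(model, 5, examples)
print(model.predict(["suffix=ing"]))  # VERB
```

Averaging damps the effect of late updates, so the final weights depend less on the order in which examples happened to be seen.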

Features
* Word and tag unigrams, bigrams, trigrams
* The first three words of the buffer
* The top three words of the stack
* The two leftmost children of the top of
the stack
* The two rightmost children of the top of
the stack
* The two leftmost children of the first
word in the buffer
* Distance between top of buffer and stack

Discriminative ML Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic
Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests
* Other ensemble classifiers
* Neural networks

Semantics
Question: how to model relationships
between words?
Answer: build a graph
Wordnet
Freebase
DBPedia

Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
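PMI can be estimated from co-occurrence counts, with the joint probability divided by the product of the two marginals. The sketch below assumes the input is a list of observed (x, y) pairs; the function name and the toy data are made up for the example.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Map each observed (x, y) pair to log(p(x,y) / (p(x) * p(y)))."""
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    n = len(pairs)
    return {
        (x, y): math.log((c / n) / ((x_counts[x] / n) * (y_counts[y] / n)))
        for (x, y), c in pair_counts.items()
    }

pairs = [("new", "york"), ("new", "york"), ("new", "car"), ("old", "car")]
table = pmi_table(pairs)
for pair, value in sorted(table.items()):
    print(pair, round(value, 3))
```

Pairs that co-occur more often than chance get a positive score and pairs that co-occur less often get a negative one. With counts this small the estimates are very noisy.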

Distributional Semantics
Distributional hypothesis:
"You shall know a word by
the company it keeps"
--John Rupert Firth
Word representations:
* Explicit representation
Number of nonzero dimensions:
max:474234, min:3, mean:1595, median:415
* Dense representation (word2vec, GloVe)
* Hierarchical representation
(Brown clustering)
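A small sketch of the explicit (sparse) representation: each word is described by the counts of words seen near it, and two words are compared by the cosine of their count vectors. The window size, function names and toy sentences are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within the window."""
    vectors = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

sentences = [["i", "drink", "tea"], ["i", "drink", "coffee"],
             ["i", "drive", "cars"]]
vecs = cooccurrence_vectors(sentences)
print(cosine(vecs["tea"], vecs["coffee"]), cosine(vecs["tea"], vecs["cars"]))
```

Here "tea" and "coffee" share all of their contexts and come out as more similar than "tea" and "cars". Dense representations such as word2vec compress the same kind of co-occurrence information into a few hundred dimensions.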

Steps to Develop an NLP System
* Translate real-world requirements
into a measurable goal
* Find a suitable level and
representation
* Find initial data for experiments
* Find and utilize existing tools and
frameworks where possible
* Don't trust research results
* Set up and perform a proper
experiment (series of experiments)

Going into Prod
* NLP tasks are usually CPU-intensive
but stateless
* General-purpose NLP frameworks are
(mostly) not production-ready
* Value pre- and post-processing
* Gather user feedback

Final Words
We have discussed:
* linguistic basis of NLP
- although some people manage to do NLP
without it:
http://arxiv.org/pdf/1103.0398.pdf
* rule-based & statistical/ML approaches
* different concrete tasks
We haven't covered:
* all the different tasks, such as MT,
question answering, etc.
(but they use the same techniques)
* deep learning for NLP
* natural language understanding
(which remains an unsolved problem)
