NLP Project Full Cycle
Vsevolod Dyomkin
10/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Plan
* Overview of NLP
* NLP Data
* Common NLP problems
and approaches
* Example NLP application:
text language identification
What Is NLP?
Transforming free-form text
into structured data and back
What Is NLP?
Transforming free-form text
into structured data and back
Intersection of:
* Computational Linguistics
* CompSci & AI
* ML, Stats, Information Theory
Natural Language
* ambiguous
* noisy
* evolving
Roles
linguist [noun]
1. A specialist in linguistics
linguist [noun]
1. A specialist in linguistics
linguistics [noun]
1. The scientific study of
language.
NLP Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* Internet/user Data
Where to Get Data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites &
the academic community:
Stanford, Oxford, CMU, ...
Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically-motivated:
segmentation, tagging, parsing
* Analytical:
classification, sentiment analysis
* Transformation:
translation, correction, generation
* Conversation:
question answering, dialog
engineer [noun]
5. A person skilled in the
design and programming of
computer systems
Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't"
"so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital -
Finland Finlands Finland’s
* what’re, I’m, isn’t -
what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^\s]+
More advanced regex:
w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|]⟨⟩ ‒–—
«»“”‘’-]―
Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…(){}\[\]|«»“”‘’'⟨⟩‒–—―]
|[.!?]+
|-+
In fact, it works:
https://github.com/lang-uk/ner-uk/blob/master/doc
/tokenization.md
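As a rough illustration of the approach (a simplified stand-in, not the actual patterns above), a regex tokenizer is a single re.findall call in Python; note this sketch keeps "isn't" whole instead of splitting off "n't":

import re

# Simplified pattern: numbers, words with internal
# apostrophes/hyphens, or single punctuation characters.
TOKEN_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*"
                      r"|\w+(?:['’-]\w+)*"
                      r"|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("This is a test that isn't so simple: 1.23."))
# ['This', 'is', 'a', 'test', 'that', "isn't", 'so', 'simple', ':', '1.23', '.']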
Rule-based Approach
* easy to understand and
reason about
* can be arbitrarily precise
* iterative, can be used to
gather more data
Limitations:
* recall problems
* poor adaptability
Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE
researcher [noun]
1. One who researches
researcher [noun]
1. One who researches
research [noun]
1. Diligent inquiry or
examination to seek or revise
facts, principles, theories,
applications, etc.; laborious
or continued search after
truth
Models
Statistical Approach
“Probability theory
is nothing but
common sense
reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a
sequence of words/sentence?
Language Models
Question: what is the probability of a
sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …
Ngrams
Apply Markov assumption: each word depends
only on N previous words (in practice
N=1..4 which results in bigrams-fivegrams,
because we include the current word also).
If n=2:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w1 w2) * …
By the definition of conditional probability:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
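These probabilities are just ratios of ngram counts. A minimal bigram (N=1) sketch over a toy corpus, without smoothing:

from collections import Counter

def bigram_model(sentences):
    # sentences: list of token lists
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    # P(w1|w0) = count(w0 w1) / count(w0)
    def prob(w1, w0):
        return bigrams[(w0, w1)] / unigrams[w0] if unigrams[w0] else 0.0
    return prob

prob = bigram_model([["this", "is", "a", "test"],
                     ["this", "is", "fine"]])
print(prob("is", "this"))  # 1.0: "is" always follows "this" here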
Spam Filtering
A 2-class classification problem with a
bias towards minimizing false positives (FPs).
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of
complex features
Bag-of-words Model
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
Bag-of-words Model
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
http://www.paulgraham.com/spam.html - A Plan for Spam
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%
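The feature extraction behind such a filter is essentially a word counter; a minimal sketch (illustrative only, not Graham's actual features):

from collections import Counter

def bag_of_words(tokens):
    # Each word is a feature; order and position are discarded.
    return Counter(tokens)

print(bag_of_words("buy cheap pills buy now".split()))
# Counter({'buy': 2, 'cheap': 1, 'pills': 1, 'now': 1})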
Naive Bayes
Classifier
P(Y|X) = P(Y) * P(X|Y) / P(X)
select Y = argmax P(Y|X)
Naive step:
P(Y|X) ∝ P(Y) * prod(P(x|Y))
for all x in X
(P(X) can be dropped because it's the
same for all Y)
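A toy spam/ham classifier along these lines, worked in log space to avoid underflow (made-up training data; add-one smoothing is an extra assumption, not part of the formula above):

import math
from collections import Counter, defaultdict

def train_nb(examples):
    # examples: list of (tokens, label) pairs
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, class_counts, word_counts, vocab):
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # log P(Y) + sum of log P(x|Y), with add-one smoothing
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train_nb([("buy cheap pills".split(), "spam"),
                  ("meeting at noon".split(), "ham")])
print(predict("cheap pills now".split(), *model))  # -> spam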
Machine Learning
Approach
Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fas
t-algorithm-for-natural-language-dependency-parsing/
Shift-reduce Parsing
Shift-reduce Parsing
Averaged Perceptron
def train(model, number_iter, examples):
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # reward the correct tag's weights, penalize the guess
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)
ML-based Parsing
The parser starts with an empty stack, and a buffer index at 0, with no
dependencies recorded. It chooses one of the valid actions, and applies it to
the state. It continues choosing actions and applying them until the stack is
empty and the buffer index is at the end of the input.
SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

def parse(words, tags):
    n = len(words)
    deps = init_deps(n)
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps
The Hierarchy of
ML Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic
Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests, Boosted
Trees
* Artificial Neural networks
Semantics
Question: how to model relationships
between words?
Semantics
Question: how to model relationships
between words?
Answer: build a graph
Wordnet
Freebase
DBPedia
Word Similarity
Next question: now, how do we measure those
relations?
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
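PMI can be computed straight from co-occurrence counts; a small sketch with hypothetical counts:

import math

def pmi(cooc_xy, count_x, count_y, total_pairs, total_words):
    # PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
    p_xy = cooc_xy / total_pairs
    p_x = count_x / total_words
    p_y = count_y / total_words
    return math.log(p_xy / (p_x * p_y))

# e.g. two words co-occurring far more often than chance
print(pmi(cooc_xy=500, count_x=2000, count_y=800,
          total_pairs=1000000, total_words=1000000))  # ~5.7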
Distributional
Semantics
Distributional hypothesis:
"You shall know a word by
the company it keeps"
--John Rupert Firth
Word representations:
* Explicit representation
Number of nonzero dimensions:
max:474234, min:3, mean:1595, median:415
* Dense representation (word2vec, GloVe, …)
* Hierarchical repr (Brown clusters)
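With dense representations, relatedness is usually measured as the cosine between word vectors; a minimal sketch with toy vectors standing in for trained word2vec/GloVe embeddings:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings"; real ones have 100-300 dimensions.
cat = [0.2, 0.9, 0.1, 0.4]
dog = [0.3, 0.8, 0.2, 0.5]
car = [0.9, 0.1, 0.8, 0.2]
print(cosine(cat, dog))  # ~0.98: similar contexts
print(cosine(cat, car))  # ~0.35: much less similar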
Steps to Develop
an NLP System
* Translate real-world requirements
into a measurable goal
* Find a suitable level and
representation
* Find initial data for experiments
* Find and utilize existing tools and
frameworks where possible
* Set up and perform a proper
experiment (series of experiments)
* Optimize the system for production
Going into Prod
* NLP tasks are usually CPU-intensive
but stateless
* General-purpose NLP frameworks are
(mostly) not production-ready
* Don't trust research results
* Value pre- and post- processing
* Gather user feedback
Text Language
Identification
Not an unsolved problem:
* https://github.com/CLD2Owners/cld2 - C++
* https://github.com/saffsd/langid.py - Python
* https://github.com/shuyo/language-detection/ - Java
To read:
https://blog.twitter.com/2015/evaluating-language-identifi
cation-performance
http://blog.mikemccandless.com/2011/10/accuracy-and-perfor
mance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
WILD Challenges
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary —
150+ languages,
always evolving
* Wanted to do it in Lisp
WILD Linguistics
* Scripts vs languages
http://www.omniglot.com/writing/langalph.htm
* Languages distribution
https://en.wikipedia.org/wiki/Languages_used_o
n_the_Internet#Content_languages_for_websites
* Frequency word lists
https://invokeit.wordpress.com/frequency-word-
lists/
* Word segmentation?
WILD Data
Wiktionary Wikipedia data:
used abstracts, ~175 languages
- download & store
- process (SAX parsing)
- set up learning & test data sets
10,778,404 unique words
481,581 unique character trigrams
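Character trigrams like these can be extracted in a few lines; a sketch (the padding convention is an assumption, not necessarily what WILD does):

from collections import Counter

def char_trigrams(word):
    # Pad so word-initial and word-final characters
    # show up in distinctive trigrams.
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(char_trigrams("language"))
# ['^la', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age', 'ge$']

# Per-language trigram counts over the word list give
# the P(trigram|language) estimates for the model.
counts = Counter(t for w in ["language", "identification"]
                   for t in char_trigrams(w))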
WILD Engineering
* Initial model size ~1G -
script hacks & Huffman coding
to the rescue
* Model pruning
* Proper probability calculations
* Efficient testing
* Properly saving the model
* Library & public API
