Practical NLP with Lisp

   Vsevolod Dyomkin
       Grammarly
Topics

*   Overview of NLP practice
*   Getting Data
*   Using Lisp: pros & cons
*   A couple of examples
A bit about Grammarly




        (c) xkcd
An example of what
   we deal with
NLP practice
R - research work:
set a goal →
devise an algorithm →
train the algorithm →
test its accuracy

D - development work:
implement the algorithm as an API with
sufficient performance and scaling
characteristics
Research
1. Set a goal
Business goal:

* Develop a best/good enough/better-
than-Word spellchecker

* Develop a set of grammar rules that
will catch errors according to MLA Style

* Develop a thesaurus that will produce
synonyms relevant to the context
Translate it to a measurable goal
* On a test corpus of 10,000 sentences with
common errors, achieve a smaller number of
FNs (and FPs) than other spellcheckers/the
Word spellchecker/etc.

* On a corpus of example sentences with
each kind of error (and similar sentences
without that kind of error), find all
sentences with errors and flag no errors
in the correct sentences

* On a test corpus of 1000 sentences
suggest synonyms for all meaningful words
that will be considered relevant by human
linguists in 90% of the cases
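Goals like the first one can be checked mechanically. A minimal Common Lisp sketch, assuming a hypothetical CHECK-FN predicate (true = the spellchecker flags the sentence) and a corpus represented as (sentence . has-error-p) pairs:

```lisp
;; Count false-negative and false-positive rates of a spellchecker.
;; CHECK-FN is a hypothetical predicate: true means the checker
;; flags the sentence. CORPUS is a list of (sentence . has-error-p).
(defun fn-fp-rates (corpus check-fn)
  (let ((fns 0) (fps 0) (pos 0) (neg 0))
    (dolist (pair corpus)
      (let ((flagged (funcall check-fn (car pair)))
            (has-error (cdr pair)))
        (if has-error
            (progn (incf pos) (unless flagged (incf fns)))
            (progn (incf neg) (when flagged (incf fps))))))
    (values (/ fns (max pos 1))      ; FN rate = 1 - recall on errors
            (/ fps (max neg 1)))))   ; FP rate on correct sentences
```

Running it on the same labeled corpus for two spellcheckers gives directly comparable FN/FP numbers.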
A Note on
       Terminology
FN and FP instead of
precision (P), recall (R)

FN = 1-R
FP = 1-P or ???
F1 = 2*P*R/(P+R) =
2*(1-FN-FP+FN*FP)/(2-(FN+FP))
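The two forms agree once the factor of 2 in F1 = 2PR/(P+R) is kept. A quick Common Lisp check under the slide's conventions FN = 1-R and FP = 1-P:

```lisp
;; F1 via precision/recall...
(defun f1-from-pr (p r)
  (/ (* 2 p r) (+ p r)))

;; ...and via the FN/FP rates: substituting P = 1-FP, R = 1-FN
;; gives F1 = 2*(1 - FN - FP + FN*FP) / (2 - (FN + FP)).
(defun f1-from-rates (fn fp)
  (/ (* 2 (+ (- 1 fn fp) (* fn fp)))
     (- 2 (+ fn fp))))
```

With exact rationals the two functions return identical values, e.g. (f1-from-pr 8/10 9/10) = (f1-from-rates 1/10 2/10).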
Research contd.
2. Devise an algorithm
3. Train & improve the
algorithm

http://nlp-class.org
4. Test its performance
ML: one corpus, divided into
training, development, and test sets

Often — different corpora:
* for training some part (not
whole) of the algorithm
* for testing the whole
system
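The single-corpus split can be sketched in a few lines of Common Lisp (the 80/10/10 proportions are a common convention, not a rule):

```lisp
;; Split a corpus into training, development, and test sets.
;; Shuffle SENTENCES beforehand so the three parts are comparable.
(defun split-corpus (sentences &key (train 8/10) (dev 1/10))
  (let* ((n (length sentences))
         (end-train (floor (* train n)))
         (end-dev (+ end-train (floor (* dev n)))))
    (values (subseq sentences 0 end-train)
            (subseq sentences end-train end-dev)
            (subseq sentences end-dev))))
```

Tune on the development part; touch the test part only once, at the end.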
Theoretical maxima
Theoretical maxima are rarely
achievable. Why?

* Because you'd need the original
authors' data. (And data is key)

* Domains might differ
Pre/post-processing
What ultimately matters is
not crude performance, but...

Acceptance by users (much
harder to measure & depends
on the domain).

The real world is messier than
any lab set-up.
Examples of
    pre-processing
For spellcheck:

* some people tend to write
words separated by slashes,
like: spell/grammar check

* handling of abbreviations
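The slash case is a one-line tokenization rule. A dependency-free Common Lisp sketch (in practice a regex library such as cl-ppcre would do the splitting):

```lisp
;; Split STRING on every occurrence of CHAR.
(defun split-on (char string)
  (loop for start = 0 then (1+ pos)
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos))

;; Pre-processing: break slash-joined words into separate tokens
;; before spellchecking, so "spell/grammar check" yields three words.
(defun tokenize (sentence)
  (mapcan (lambda (word) (split-on #\/ word))
          (split-on #\Space sentence)))
```

(tokenize "spell/grammar check") returns ("spell" "grammar" "check"), which the spellchecker can then look up word by word.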
Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* Web1T Google N-gram Corpus
* Linguistic Data Consortium
  (http://www.ldc.upenn.edu/)
More data
Also well-known sources, but
with a twist:
* Wikipedia & Wiktionary,
DBPedia
* OpenWeb Common Crawl
(updated: 2010)
* Public APIs of some
services: Twitter, Wordnik
Obscure corpora
Academic resources:
* Stanford
* CoNLL
* Oxford (http://www.ota.ox.ac.uk/)
* CMU, MIT,...
* LingPipe, OpenNLP, NLTK,...
Human-powered?


http://goo.gl/hs4qB
Beyond corpora?

* Bootstrapping
* Seeding
And remember...
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig, “The Unreasonable
Effectiveness of Data.”
http://youtu.be/yvDCzhbjYWs
Using Lisp for NLP




      (c) xkcd
Why Lisp?
Lisp is a carefully crafted
tool for:

*   Engineers
*   Practical researchers
*   Artists
*   Entrepreneurs
Some examples
*   Piano.aero
*   ITA Software
*   Secure Outcomes
*   Impromptu

* Land of Lisp
http://youtu.be/HM1Zb3xmvMc
Research
       requirements
*   Interactivity
*   Mathematical basis
*   Expressiveness
*   Agility / Malleability
*   Advanced tools
Specific NLP
     requirements
* Good support for statistics
& number-crunching (matrices)
– Statistical AI

* Good support for working
with trees & symbols
– Symbolic AI
Production
       requirements
*   Scalability
*   Maintainability
*   Integrability
*   ...
...eventually

* Speed
* Speed
* Speed
Heterogeneous
        systems
You have to split the system
and communicate:

“Java” way vs. “Unix” way

* Sockets, Redis, ZeroMQ, etc
for communication
* JSON, SEXPs, etc for data
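With SEXPs as the data format, the Lisp reader and printer already are the (de)serializer. A minimal sketch, with *READ-EVAL* disabled so untrusted peers can't inject code through the #. reader macro:

```lisp
;; Serialize a message onto any stream (socket, pipe, string stream).
(defun write-message (message stream)
  (let ((*print-readably* t))
    (prin1 message stream)
    (terpri stream)
    (finish-output stream)))

;; Read one message back; binding *READ-EVAL* to NIL blocks
;; read-time evaluation of attacker-supplied input.
(defun read-message (stream)
  (let ((*read-eval* nil))
    (read stream)))
```

The same pair of functions works over a socket stream from usocket or over a pipe, which is the "Unix"-way glue the slide refers to.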
Lisp drawbacks
There's no OpenNLP or SciPy &,
in general, there are fewer
libraries.

But...
*   github: eslick/cl-langutils
*   github: mathematical-systems/clml
*   github: tpapp/lla
*   github: blindglobe/common-lisp-stat
*   … and http://quicklisp.org
But #2
Porter stemmer:
http://tartarus.org/~martin/PorterStemmer
& http://www.cliki.net/PorterStemmer

or Soundex:
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/0.html

are irrelevant with good data
More drawbacks

Lisp is a fringe language

   Not a specialized language
  (like R, J, or Octave)
Example #1


API interaction
Example #2
Lisp FTW
* truly interactive
environment
* very flexible => DSLs
* native tree support
* fast and solid
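The tree support is concrete: a parse tree is just an sexp, so plain list operations traverse it. A small illustration (the tree shape here is made up for the example):

```lisp
;; A toy constituency parse tree as a plain sexp.
(defparameter *tree*
  '(S (NP (DT "the") (NN "dog")) (VP (VBZ "barks"))))

;; Collect the leaf words of a parse tree: a string is a leaf,
;; anything else is (LABEL child...) and we recurse on the children.
(defun leaves (tree)
  (if (stringp tree)
      (list tree)
      (mapcan #'leaves (rest tree))))
```

(leaves *tree*) returns ("the" "dog" "barks") with no tree library involved.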
Take-aways
* Take nlp-class

* Data is key, collect it, build tools
to work with it easily and efficiently

* A good language for R&D should be
first of all interactive & malleable,
with as few barriers as possible

* ... it also helps if you don't need to
port your code for production

* Lisp is one of the good examples
Thanks!

Vsevolod Dyomkin
    @vseloved
