Natural Language Processing
in practice
Topics
* Overview of NLP
* Getting Data
* Models & Algorithms
* Building an NLP system
* A practical example
A bit about me
* Lisp programmer
* Architect and research lead at Grammarly
(3+ years of NLP work)
* Teacher at KPI: Operating Systems
* Links:
http://lisp-univ-etc.blogspot.com
http://github.com/vseloved
http://twitter.com/vseloved
A bit about Grammarly
The best English language writing
enhancement app:
Spellcheck - Grammar check - Style
improvement - Synonyms and word choice -
Plagiarism check
What is NLP?
Transforming free-form text
into structured data and back
Intersection of Comp Sci &
Linguistics & Software Eng
Based on Algorithms, Machine
Learning, and Statistics
Popular NLP problems
* Spam Filtering
* Spelling Correction
* Sentiment Analysis
* Question Answering
* Machine Translation
* Text Summarization
* Search (also IR)
http://www.paulgraham.com/spam.html
http://norvig.com/spell-correct.html
Levels of NLP
* data & tools
* models
* production-ready systems
Role of Linguistics
NLP Data
structured – semi-structured – unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable
Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of data
* Dictionaries
* Corpora
* User Data
Where to get data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Google ngrams, book ngrams,
syntactic ngrams
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites: Stanford,
Oxford, CMU, ...
Create your own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB
Tools
* analysis tools
* processing tools
* Unix command line
* XML processing
* Map-reduce systems
* R, Python, Lisp
Algorithms
* Dynamic Programming
* Search Algorithms
* Tree Algorithms
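To illustrate dynamic programming in NLP, here is a minimal sketch of Levenshtein edit distance, a measure commonly used when ranking spelling-correction candidates. The function name `edit_distance` is an assumption made for this sketch and does not come from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping only two table rows."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Each cell depends only on its left, upper, and upper-left neighbours, so two rows are enough instead of the full table.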
Beyond Algorithms
* CKY constituency parsing
* Noisy channel spelling
correction
* TF-IDF document
classification
* Bayesian filtering
Models
* generative vs discriminative
* statistical vs rule-based
Language Models
Ngrams
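A minimal sketch of an ngram language model, assuming bigrams with add-one (Laplace) smoothing. The name `train_bigram_lm` and the `<s>` / `</s>` boundary markers are assumptions for this example; production models typically use better smoothing.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """sentences: list of token lists. Returns a function prob(prev, word)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])             # count each token as a context
        bigrams.update(zip(tokens, tokens[1:]))  # count adjacent pairs
    vocab = {w for sent in sentences for w in sent} | {"</s>"}

    def prob(prev, word):
        # Add-one smoothing: unseen pairs get a small non-zero probability
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    return prob
```

For any context, the probabilities over the vocabulary (including `</s>`) sum to 1, so sentence scores can be obtained by multiplying, or by adding logs.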
Generative ML models:
* Bayesian inference
(bag-of-words model)
* Hidden Markov model
(sequence model)
* Neural networks
(holistic model)
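To make the bag-of-words Bayesian model concrete, here is a small naive Bayes classifier sketch with add-one smoothing. The class name `NaiveBayes` and its methods are hypothetical and chosen for this example only.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.doc_counts = Counter()              # label -> number of docs
        self.vocab = set()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())

        def log_score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed likelihood
            return score

        return max(self.doc_counts, key=log_score)
```

Word order is ignored entirely, which is what "bag of words" means; working in log space avoids underflow on long documents.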
LM + Domain Model
Discriminative Models
* Heuristic
* Maximum Entropy
* “Advanced” LM Models
Going Into Prod
* Translate real-world requirements
into a measurable goal
* Pre- and post-processing
* Don't trust research results
* Gather user feedback
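One common way to turn a requirement into a measurable goal is to track precision, recall, and F1 for the label you care about. The sketch below assumes that choice of metrics; the function name `precision_recall_f1` is illustrative and not from the talk.

```python
def precision_recall_f1(gold, predicted, positive):
    """gold, predicted: equal-length label sequences; positive: label of interest."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators by reporting 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Which of the three to optimize depends on the product: a spam filter that must never lose real mail favours precision, for example.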
Practical Example:
Language Detection
Idea
Standard approach:
character LM
Let's try an alternative:
word LM
Data – from Wiktionary
Test data – from Wikipedia
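A minimal sketch of the word-based idea, assuming the simplest possible model: vocabulary coverage rather than word probabilities. The tiny `WORDS` table is a hypothetical stand-in for per-language word lists extracted from a dictionary, and `detect_language` is an illustrative name.

```python
from collections import Counter

# Hypothetical toy vocabularies; a real system would load full word lists
WORDS = {
    "en": {"the", "is", "and", "cat", "on", "table"},
    "de": {"der", "die", "ist", "und", "katze", "auf", "dem", "tisch"},
    "fr": {"le", "la", "est", "et", "chat", "sur", "table"},
}

def detect_language(text, vocabularies=WORDS):
    """Return the language whose vocabulary covers the most tokens, or None."""
    tokens = text.lower().split()
    hits = Counter()
    for lang, vocab in vocabularies.items():
        hits[lang] = sum(1 for t in tokens if t in vocab)
    if not hits:
        return None
    lang, count = hits.most_common(1)[0]
    return lang if count > 0 else None
```

Shared words such as "table" count for several languages at once, which is one reason a real word LM would weight words by frequency instead of counting them equally.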
Practical ML System
* Training
* Evaluation
* Production
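The evaluation step can be sketched as a loop over a held-out test set that reports accuracy and keeps the errors for inspection. The name `evaluate` and the `(text, label)` pair format are assumptions made for this example.

```python
def evaluate(classify, test_set):
    """classify: function text -> label; test_set: list of (text, label) pairs.
    Returns (accuracy, errors), where errors holds (text, gold, predicted)."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    errors = []
    for text, gold in test_set:
        predicted = classify(text)
        if predicted != gold:
            errors.append((text, gold, predicted))
    accuracy = 1 - len(errors) / len(test_set)
    return accuracy, errors
```

Keeping the misclassified examples around tends to be more useful than the number itself, since they show what to fix in the next training round.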
Thanks!
Questions?
Vsevolod Dyomkin
@vseloved

Vsevolod Dyomkin, "Natural Language Processing in Practice"