Vsevolod Dyomkin, "Natural Language Processing in Practice"

"AI&BigData Lab" conference, April 12, 2014

  1. Natural Language Processing in practice
  2. Topics
     * Overview of NLP
     * Getting Data
     * Models & Algorithms
     * Building an NLP system
     * A practical example
  3. A bit about me
     * Lisp programmer
     * Architect and research lead at Grammarly (3+ years of NLP work)
     * Teacher at KPI: Operating Systems
     * Links:
       http://lisp-univ-etc.blogspot.com
       http://github.com/vseloved
       http://twitter.com/vseloved
  4. A bit about Grammarly
     The best English language writing enhancement app:
     * Spellcheck
     * Grammar check
     * Style improvement
     * Synonyms and word choice
     * Plagiarism check
     (c) xkcd
  5. What is NLP?
     * Transforming free-form text into structured data and back
     * Intersection of Comp Sci & Linguistics & Software Eng
     * Based on Algorithms, Machine Learning, and Statistics
  6. Popular NLP problems
     * Spam Filtering
     * Spelling Correction
     * Sentiment Analysis
     * Question Answering
     * Machine Translation
     * Text Summarization
     * Search (also IR)
     http://www.paulgraham.com/spam.html
     http://norvig.com/spell-correct.html
     (c) gettyimages
  7. Levels of NLP
     * data & tools
     * models
     * production-ready systems
  8. Role of Linguistics
  9. NLP Data
     * structured
     * semi-structured
     * unstructured
     “Data is ten times more powerful than algorithms.” -- Peter Norvig, The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs
  10. Kinds of data
      * Dictionaries
      * Corpora
      * User Data
  11. Where to get data?
      * Linguistic Data Consortium http://www.ldc.upenn.edu/
      * Google ngrams, book ngrams, syntactic ngrams
      * Wikimedia
      * Wordnet
      * APIs: Twitter, Wordnik, ...
      * University sites: Stanford, Oxford, CMU, ...
  12. Create your own!
      * Linguists
      * Crowdsourcing
      * By-product
      -- Jonathan Zittrain http://goo.gl/hs4qB
  13. Tools
      * analysis tools
      * processing tools
      * Unix command line
      * XML processing
      * Map-reduce systems
      * R, Python, Lisp
      (c) O'Reilly Media
  14. Algorithms
      * Dynamic Programming
      * Search Algorithms
      * Tree Algorithms
  15. Beyond Algorithms
      * CKY constituency parsing
      * Noisy channel spelling correction
      * TF-IDF document classification
      * Bayesian filtering
  16. Models
      * generative vs discriminative
      * statistical vs rule-based
  17. Language Models
      Ngrams
      Generative ML models:
      * Bayesian inference (bag-of-words model)
      * Hidden Markov model (sequence model)
      * Neural networks (holistic model)
      LM + Domain Model
  18. Discriminative Models
      * Heuristic
      * Maximum Entropy
      * “Advanced” LM Models
  19. Going Into Prod
      * Translate real-world requirements into a measurable goal
      * Pre- and post-processing
      * Don't trust research results
      * Gather user feedback
  20. Practical Example: Language Detection
  21. Idea
      * Standard approach: character LM
      * Let's try an alternative: word LM
      * Data: from Wiktionary
      * Test data: from Wikipedia
  22. Practical ML System
      * Training
  23. ML System
      * Training
      * Evaluation
  24. ML System
      * Training
      * Evaluation
      * Production
  25. Thanks! Questions?
      Vsevolod Dyomkin @vseloved
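
The word-LM idea from slide 21, together with the training and evaluation steps from slides 22-24, might look roughly like the sketch below. This is not the code from the talk: the tiny inline sentences are a made-up stand-in for Wiktionary word lists, the unigram model with add-one smoothing is just one simple choice of word LM, and all function names (train, log_prob, detect, accuracy) are hypothetical.

```python
import math
from collections import Counter

# Toy training text per language. This is an assumption for illustration only;
# the talk mentions Wiktionary as the real data source.
TRAINING_TEXT = {
    "en": "the cat is on the table and the dog is in the house",
    "de": "die katze ist auf dem tisch und der hund ist in dem haus",
    "fr": "le chat est sur la table et le chien est dans la maison",
}


def train(samples):
    """Training: build a unigram word-count table for each language."""
    return {lang: Counter(text.lower().split()) for lang, text in samples.items()}


def log_prob(words, counts, vocab_size):
    """Log-probability of the words under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)


def detect(text, models):
    """Return the language whose word model gives the text the highest score."""
    words = text.lower().split()
    # Shared vocabulary across all languages, plus one slot for unseen words.
    vocab_size = len(set().union(*models.values())) + 1
    return max(models, key=lambda lang: log_prob(words, models[lang], vocab_size))


def accuracy(labelled, models):
    """Evaluation: share of (text, language) pairs that are detected correctly."""
    hits = sum(detect(text, models) == lang for text, lang in labelled)
    return hits / len(labelled)


MODELS = train(TRAINING_TEXT)

if __name__ == "__main__":
    print(detect("the dog is on the table", MODELS))
```

A character LM would follow the same shape with character n-grams in place of words. The word-based variant needs a vocabulary for every language it supports, so the smoothing term is what keeps unseen words from zeroing out a score.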
