Vsevolod Dyomkin, "Natural Language Processing in Practice"

Conference "AI&BigData Lab", April 12, 2014

  1. 1. Natural Language Processing in practice
  2. 2. Topics * Overview of NLP * Getting Data * Models & Algorithms * Building an NLP system * A practical example
  3. 3. A bit about me * Lisp programmer * Architect and research lead at Grammarly (3+ years of NLP work) * Teacher at KPI: Operating Systems * Links: http://lisp-univ-etc.blogspot.com http://github.com/vseloved http://twitter.com/vseloved
  4. 4. A bit about Grammarly * The best English language writing enhancement app * Spellcheck * Grammar check * Style improvement * Synonyms and word choice * Plagiarism check (c) xkcd
  5. 5. What is NLP? * Transforming free-form text into structured data and back * Intersection of Computer Science, Linguistics, and Software Engineering * Based on Algorithms, Machine Learning, and Statistics
  6. 6. Popular NLP problems * Spam Filtering (http://www.paulgraham.com/spam.html) * Spelling Correction (http://norvig.com/spell-correct.html) * Sentiment Analysis * Question Answering * Machine Translation * Text Summarization * Search (also IR) (c) gettyimages
  7. 7. Levels of NLP * data & tools * models * production-ready systems
  8. 8. Role of Linguistics
  9. 9. NLP Data * structured * semi-structured * unstructured * “Data is ten times more powerful than algorithms.” -- Peter Norvig, The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs
  10. 10. Kinds of data * Dictionaries * Corpora * User Data
  11. 11. Where to get data? * Linguistic Data Consortium http://www.ldc.upenn.edu/ * Google ngrams, book ngrams, syntactic ngrams * Wikimedia * Wordnet * APIs: Twitter, Wordnik, ... * University sites: Stanford, Oxford, CMU, ...
  12. 12. Create your own! * Linguists * Crowdsourcing * By-product -- Jonathan Zittrain http://goo.gl/hs4qB
  13. 13. Tools * analysis tools * processing tools * Unix command line * XML processing * Map-reduce systems * R, Python, Lisp (c) O'Reilly Media
  14. 14. Algorithms * Dynamic Programming * Search Algorithms * Tree Algorithms
  15. 15. Beyond Algorithms * CKY constituency parsing * Noisy channel spelling correction * TF-IDF document classification * Bayesian filtering
  16. 16. Models * generative vs discriminative * statistical vs rule-based
  17. 17. Language Models * Ngrams * Generative ML models: Bayesian inference (bag-of-words model), Hidden Markov model (sequence model), Neural networks (holistic model) * LM + Domain Model
  18. 18. Discriminative Models * Heuristic * Maximum Entropy * “Advanced” LM Models
  19. 19. Going Into Prod * Translate real-world requirements into a measurable goal * Pre- and post- processing * Don't trust research results * Gather user feedback
  20. 20. Practical Example: Language Detection
  21. 21. Idea * Standard approach: character LM * Let's try an alternative: word LM * Data: from Wiktionary * Test data: from Wikipedia
  22. 22. Practical ML System * Training
  23. 23. ML System * Training * Evaluation
  24. 24. ML System * Training * Evaluation * Production
  25. 25. Thanks! Questions? Vsevolod Dyomkin @vseloved
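
As a rough illustration of the word-LM idea from slide 21, here is a minimal Python sketch of language detection. It assumes a unigram word model with add-one smoothing, which the slides do not specify. The tiny training sentences in `TRAINING_TEXT` and all function names are hypothetical stand-ins for Wiktionary-derived data, not the speaker's implementation.

```python
import math
import re
from collections import Counter

# Toy training text per language: hypothetical stand-ins for the word data
# that the talk says would come from Wiktionary.
TRAINING_TEXT = {
    "en": "the cat is on the table and the dog is in the house",
    "de": "die katze ist auf dem tisch und der hund ist in dem haus",
    "fr": "le chat est sur la table et le chien est dans la maison",
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train(training_text):
    """Build one unigram word-count model per language."""
    return {lang: Counter(tokenize(text)) for lang, text in training_text.items()}


def log_prob(words, counts, vocab_size):
    """Log-probability of the words under a unigram model with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)


def detect(text, models):
    """Return the language whose word model assigns the text the highest score."""
    words = tokenize(text)
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot shared by all unseen words
    return max(models, key=lambda lang: log_prob(words, models[lang], vocab_size))


models = train(TRAINING_TEXT)
print(detect("the dog is on the table", models))
print(detect("der hund ist auf dem tisch", models))
```

Following the Training, Evaluation, Production split on slides 22-24, `train` would be run once over the dictionary data, `detect` would be scored against held-out test sentences (the talk mentions Wikipedia), and only then would the model be wrapped for production use.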
