Natural Language Processing in Practice

Transcript

  • 1. Natural Language Processing in practice
  • 2. Topics * Overview of NLP * Getting Data * Models & Algorithms * Building an NLP system * A practical example
  • 3. A bit about me * Lisp programmer * Architect and research lead at Grammarly (3+ years of NLP work) * Teacher at KPI: Operating Systems * Links: http://lisp-univ-etc.blogspot.com http://github.com/vseloved http://twitter.com/vseloved
  • 4. A bit about Grammarly (c) xkcd The best English language writing enhancement app: Spellcheck - Grammar check - Style improvement - Synonyms and word choice - Plagiarism check
  • 5. What is NLP? Transforming free-form text into structured data and back Intersection of Comp Sci & Linguistics & Software Eng Based on Algorithms, Machine Learning, and Statistics
  • 6. Popular NLP problems * Spam Filtering * Spelling Correction * Sentiment Analysis * Question Answering * Machine Translation * Text Summarization * Search (also IR) http://www.paulgraham.com/spam.html http://norvig.com/spell-correct.html (c) gettyimages
  • 7. Levels of NLP * data & tools * models * production-ready systems
  • 8. Role of Linguistics
  • 9. NLP Data * structured * semi-structured * unstructured “Data is ten times more powerful than algorithms.” -- Peter Norvig, The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs
  • 10. Kinds of data * Dictionaries * Corpora * User Data
  • 11. Where to get data? * Linguistic Data Consortium http://www.ldc.upenn.edu/ * Google ngrams, book ngrams, syntactic ngrams * Wikimedia * Wordnet * APIs: Twitter, Wordnik, ... * University sites: Stanford, Oxford, CMU, ...
  • 12. Create your own! * Linguists * Crowdsourcing * By-product -- Jonathan Zittrain http://goo.gl/hs4qB
  • 13. Tools * analysis tools * processing tools * Unix command line * XML processing * Map-reduce systems * R, Python, Lisp (c) O'Reilly Media
  • 14. Algorithms * Dynamic Programming * Search Algorithms * Tree Algorithms
  • 15. Beyond Algorithms * CKY constituency parsing * Noisy channel spelling correction * TF-IDF document classification * Bayesian filtering
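Of the techniques on this slide, TF-IDF is the easiest to show in a few lines. The following is a minimal sketch, not the speaker's implementation: the function name `tf_idf` and the toy documents are assumptions made for illustration. It computes term frequency times inverse document frequency for each term; the resulting weights could then feed a document classifier.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (term count / document length) * log(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy documents (hypothetical data, for illustration only).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
weights = tf_idf(docs)
```

In this toy set, "cat" occurs in one document and "the" in two, so "cat" gets the higher weight in the first document even though "the" is more frequent there. A term that occurs in every document gets weight zero.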
  • 16. Models * generative vs discriminative * statistical vs rule-based
  • 17. Language Models Ngrams Generative ML models: * Bayesian inference (bag-of-words model) * Hidden Markov model (sequence model) * Neural networks (holistic model) LM + Domain Model
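As an illustration of the n-gram language models named on this slide, here is a minimal bigram model with add-one (Laplace) smoothing. It is a sketch under stated assumptions: the class name `BigramLM`, the `<s>`/`</s>` boundary markers, and the choice of add-one smoothing are not from the talk, and the training sentences are toy data.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often each token appears as a context
        self.bigrams = Counter()   # how often each (previous, next) pair appears
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Everything that can follow a context: all words plus end-of-sentence.
        self.vocab = {w for sent in sentences for w in sent} | {"</s>"}

    def prob(self, prev, word):
        """Smoothed estimate of P(word | prev)."""
        return (self.bigrams[(prev, word)] + 1) / (
            self.contexts[prev] + len(self.vocab)
        )

    def log_prob(self, sent):
        """Log-probability of a whole tokenized sentence."""
        tokens = ["<s>"] + sent + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))


# Toy training data (hypothetical).
lm = BigramLM([["the", "cat", "sleeps"], ["the", "dog", "sleeps"]])
seen_order = lm.log_prob(["the", "cat", "sleeps"])
scrambled = lm.log_prob(["sleeps", "cat", "the"])
```

The model scores the word order it saw in training higher than the scrambled one. Smoothing keeps unseen bigrams from getting zero probability, which would otherwise make the log-probability undefined.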
  • 18. Discriminative Models * Heuristic * Maximum Entropy * “Advanced” LM Models
  • 19. Going Into Prod * Translate real-world requirements into a measurable goal * Pre- and post- processing * Don't trust research results * Gather user feedback
  • 20. Practical Example: Language Detection
  • 21. Idea Standard approach: character LM. Let's try an alternative: word LM. Data: from Wiktionary. Test data: from Wikipedia.
  • 22. Practical ML System * Training
  • 23. ML System * Training * Evaluation
  • 24. ML System * Training * Evaluation * Production
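The three stages above (training, evaluation, production) can be sketched for the word-LM language detector as follows. This is a rough illustration, not the system from the talk: the function names `train`, `detect`, and `evaluate` are hypothetical, the tiny inline word lists stand in for the Wiktionary training data and Wikipedia test data, and the unigram model with add-one smoothing is an assumed simplification.

```python
import math
from collections import Counter


def train(corpora):
    """Training: build a unigram word model per language.

    corpora maps a language code to a list of words.
    """
    return {lang: (Counter(words), len(words)) for lang, words in corpora.items()}


def detect(models, text):
    """Production: return the language whose model scores the text highest."""
    words = text.lower().split()
    # Size of the combined vocabulary, used for add-one smoothing.
    vocab = len({w for counts, _ in models.values() for w in counts})

    def score(lang):
        counts, total = models[lang]
        # The extra +1 reserves probability mass for unknown words.
        return sum(
            math.log((counts[w] + 1) / (total + vocab + 1)) for w in words
        )

    return max(models, key=score)


def evaluate(models, labeled):
    """Evaluation: accuracy over (text, expected language) pairs."""
    correct = sum(detect(models, text) == lang for text, lang in labeled)
    return correct / len(labeled)


# Toy stand-ins for the real training and test data (hypothetical).
corpora = {
    "en": "the cat is on the table and the dog is in the house".split(),
    "de": "die katze ist auf dem tisch und der hund ist im haus".split(),
    "fr": "le chat est sur la table et le chien est dans la maison".split(),
}
test_set = [
    ("the dog is on the table", "en"),
    ("der hund ist im haus", "de"),
    ("le chat est dans la maison", "fr"),
]
models = train(corpora)
accuracy = evaluate(models, test_set)
```

Keeping evaluation separate from training, on data from a different source, follows the earlier advice about not trusting research results: accuracy measured on the training words themselves would say little about behavior on real text.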
  • 25. Thanks! Questions? Vsevolod Dyomkin @vseloved