Aspects of NLP Practice

Some notes on applying NLP research in an industrial environment.


  1. Practical Aspects of NLP Work. Vsevolod Dyomkin, Grammarly. TAAC 2012, Kyiv, Ukraine
  2. Topics:
     * Practical vs theoretical NLP work
     * Working with data for NLP
     * NLP tools
  3. A bit about Grammarly (c) xkcd
  4. An example of what we deal with
  5. Research vs Development
     "Trick for productionizing research: read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that." --Jay Kreps
     https://twitter.com/jaykreps/status/219977241839411200
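
To make the quote concrete, here is a minimal sketch of such a "stupid simple" baseline for the spellchecking task discussed in the following slides: a unigram-frequency corrector that proposes the most frequent known word within edit distance one. This is an illustrative sketch (the corpus path and function names are assumptions), not Grammarly's actual approach.

    # A minimal baseline spelling corrector: pick the most frequent known
    # word within edit distance 1 of the input word.
    import re
    from collections import Counter

    def build_vocab(text):
        """Count word frequencies in a plain-text corpus."""
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def edits1(word):
        """All strings one deletion, transposition, replacement, or insertion away."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word, vocab):
        """Return the in-vocabulary candidate with the highest corpus frequency."""
        if word in vocab:
            return word
        candidates = [w for w in edits1(word) if w in vocab] or [word]
        return max(candidates, key=lambda w: vocab[w])

    # Usage sketch (assumes some plain-text corpus file):
    # vocab = build_vocab(open("corpus.txt").read())
    # correct("speling", vocab)   # -> "spelling", given a reasonable corpus
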
  6. NLP practice
     R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy
     D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics
  7. Research. 1. Set a goal
     Business goal:
     * Develop the best / a good-enough / a better-than-Word / etc. spellchecker
     * Develop a set of grammar rules that will catch errors according to MLA Style
     * Develop a thesaurus that will produce synonyms relevant to context
  8. Translate it into a measurable goal:
     * On a test corpus of 10,000 sentences with common errors, achieve a smaller number of FNs (and FPs) than other spellcheckers / the Word spellchecker / etc.
     * On a corpus of example sentences with each kind of error (and similar sentences without this kind of error), find all sentences with errors and do not find errors in correct sentences
     * On a test corpus of 1,000 sentences, suggest synonyms for all meaningful words that will be considered relevant by human linguists in 90% of the cases
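
One way to make such goals operational is a small evaluation harness that counts false positives and false negatives over a labeled test corpus, as in this sketch (the `checker` callable and the data format are illustrative assumptions):

    # Count true/false positives and negatives of an error checker over a
    # labeled test corpus; checker(sentence) is assumed to return True
    # when it reports an error in the sentence.
    def evaluate(checker, sentences, has_error):
        """`sentences`: list of strings; `has_error`: parallel list of bools."""
        counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
        for sent, gold in zip(sentences, has_error):
            flagged = checker(sent)
            if flagged and gold:
                counts["TP"] += 1
            elif flagged and not gold:
                counts["FP"] += 1
            elif gold:
                counts["FN"] += 1
            else:
                counts["TN"] += 1
        return counts

    # Two spellcheckers can then be compared on the same 10,000-sentence
    # corpus by their FP and FN counts, as in the goals above.
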
  9. Research: 1. Set a goal; 2. Devise an algorithm; 3. Train & improve the algorithm
  10. Research: 1. Set a goal; 2. Devise an algorithm; 3. Train & improve the algorithm
      http://nlp-class.org
  11. 4. Test its performance
      ML: one corpus, divided into training, development, and test sets
  12. 4. Test its performance
      ML: one corpus, divided into training, development, and test sets
      Often, different corpora are used:
      * for training some part of the algorithm
      * for testing the whole system
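
A minimal sketch of that standard split; the 80/10/10 proportions and the fixed shuffling seed are common conventions, not something prescribed by the slides:

    # Split one corpus into training, development, and test portions.
    import random

    def split_corpus(examples, train=0.8, dev=0.1, seed=42):
        examples = list(examples)
        random.Random(seed).shuffle(examples)   # reproducible shuffle
        n_train = int(len(examples) * train)
        n_dev = int(len(examples) * dev)
        return (examples[:n_train],                     # training set
                examples[n_train:n_train + n_dev],      # development set
                examples[n_train + n_dev:])             # test set
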
  13. Theoretical maxima
      Theoretical maxima are rarely achievable. Why?
  14. Theoretical maxima
      Theoretical maxima are rarely achievable. Why?
      * because you need their data
  15. Theoretical maxima
      Theoretical maxima are rarely achievable. Why?
      * because you need their data
      * domains might differ
  16. Pre/post-processing
      What ultimately matters is not crude performance, but...
  17. Pre/post-processing
      What ultimately matters is not crude performance, but...
      Acceptance by users (much harder to measure & depends on the domain).
  18. Pre/post-processing
      What ultimately matters is not crude performance, but...
      Acceptance by users (much harder to measure & depends on the domain).
      The real world is messier than any lab set-up.
  19. Examples of pre-processing
      For spellcheck:
      * some people tend to use words separated by slashes, like: spell/grammar check
      * handling of abbreviations
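
A hedged sketch of what such pre-processing might look like: splitting slash-joined words while leaving known abbreviations intact (the abbreviation list and function name are illustrative):

    # Pre-processing sketch for a spellchecker: expand slash-joined words
    # such as "spell/grammar check" and keep common abbreviations as-is.
    ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs."}   # small sample

    def preprocess(text):
        tokens = []
        for raw in text.split():
            if raw in ABBREVIATIONS:
                tokens.append(raw)                  # do not split abbreviations
            elif "/" in raw and not raw.startswith("http"):
                tokens.extend(t for t in raw.split("/") if t)   # spell/grammar -> spell, grammar
            else:
                tokens.append(raw)
        return tokens

    # preprocess("Run a spell/grammar check, e.g. on drafts.")
    # -> ['Run', 'a', 'spell', 'grammar', 'check,', 'e.g.', 'on', 'drafts.']
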
  20. Data
      "Data is the next Intel Inside." --Tim O'Reilly, What Is Web 2.0
      http://oreilly.com/web2/archive/what-is-web-20.html?page=3
  21. Categorization of Data
      * Structured: small
      * Semi-structured: medium
      * Unstructured: big
  22. Where to get data? Well-known sources:
      * Penn Treebank
      * WordNet
      * BNC
      * Web1T Google N-gram Corpus
      * Linguistic Data Consortium (http://www.ldc.upenn.edu/)
  23. More data. Also well-known sources, but with a twist:
      * Wikipedia & Wiktionary, DBpedia
      * Open web: Common Crawl
      * Public APIs of some services: Twitter, Wordnik
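
As an illustration of pulling data from such public sources, here is a small sketch that fetches the plain-text introduction of a Wikipedia article through the public MediaWiki API (the endpoint and parameters follow the standard API; the `requests` library is assumed to be installed, and error handling is omitted):

    # Fetch the plain-text lead section of a Wikipedia article via the
    # public MediaWiki API.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def wiki_intro(title):
        params = {
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,    # plain text instead of HTML
            "exintro": 1,        # only the lead section
            "titles": title,
            "format": "json",
        }
        data = requests.get(API, params=params, timeout=10).json()
        pages = data["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    # print(wiki_intro("Natural language processing")[:200])
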
  24. Academic resources
      * Stanford
      * CoNLL
      * Oxford (http://www.ota.ox.ac.uk/)
      * CMU, MIT, ...
      * LingPipe, OpenNLP, NLTK, ...
  25. Crowd-sourced data
      Jonathan Zittrain, The Future of the Internet
      http://goo.gl/hs4qB
  26. And remember...
      "Data is ten times more powerful than algorithms." --Peter Norvig,
      The Unreasonable Effectiveness of Data
      http://youtu.be/yvDCzhbjYWs
  27. Tools
  28. Levels of NLP tools
      High-level: user services
      Middle-level: NLP algorithms
      Low-level: data-crunching
  29. Choosing a language. Requirement types:
      * Research
      * NLP-specific
      * Production
  30. Research requirements
      * Interactivity
      * Mathematical basis
      * Expressiveness
      * Agility / malleability
      * Advanced tools
  31. Specific NLP requirements
      * Good support for statistics & number-crunching (statistical AI)
      * Good support for working with trees & symbols (symbolic AI)
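
To make those two requirements a bit more concrete, here is a toy sketch of each side: counting bigrams (the statistical, number-crunching side) and collecting the leaves of a parse-like tree of nested tuples (the symbolic, tree-handling side); the data structures are illustrative:

    # Statistical side: count adjacent token pairs (bigrams).
    from collections import Counter

    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))

    # Symbolic side: walk a (label, child, child, ...) tree and collect leaves.
    def leaves(tree):
        if isinstance(tree, str):
            return [tree]
        _label, *children = tree
        result = []
        for child in children:
            result.extend(leaves(child))
        return result

    # bigrams("the cat sat on the mat".split())
    # leaves(("S", ("NP", "the", "cat"),
    #              ("VP", "sat", ("PP", "on", ("NP", "the", "mat")))))
    # -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
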
  32. Production requirements
      * Scalability
      * Maintainability
      * Integrability
      * ...
  33. Choose Lisp (c) xkcd
  34. Lisp FTW
      * Truly interactive environment
      * Very flexible => DSLs
      * Native tree support
      * Fast and solid
      Minus: no OpenNLP/NLTK
  35. Heterogeneous systems
      "Java way" vs. "Unix way"
      Create language-agnostic systems that can easily communicate!
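
One hedged sketch of the "Unix way": wrap an NLP component in a small HTTP/JSON service so that callers written in any language can talk to it. The port, paths, and the trivial stand-in tokenizer are illustrative; only the Python standard library is used:

    # A tiny language-agnostic HTTP/JSON service around an NLP component
    # (a trivial whitespace tokenizer stands in for the real algorithm).
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def analyze(text):
        return {"tokens": text.split()}     # stand-in for the real NLP algorithm

    class NLPHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps(analyze(payload.get("text", ""))).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), NLPHandler).serve_forever()

    # Any client, in any language, can now call it, e.g.:
    # curl -X POST localhost:8080 -d '{"text": "NLP as a service"}'
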
  36. Take-aways
      * As they say, in theory research and practice are the same, but in practice...
      * Data is key. There are 3 types of it. Collect it, build tools to work with it easily and efficiently
      * Choose a good language for R&D: interactive & malleable, with as few barriers as possible
  37. Thanks! Vsevolod Dyomkin, @vseloved
