Practical NLP with Lisp

* Overview of NLP practice
* Getting Data
* Using Lisp: pros & cons

Published in: Technology, Education

Practical NLP with Lisp

  1. 1. Practical NLP with Lisp
      Vsevolod Dyomkin
      Grammarly
  2. 2. Topics
      * Overview of NLP practice
      * Getting Data
      * Using Lisp: pros & cons
      * A couple of examples
  3. 3. A bit about Grammarly (c) xkcd
  4. 4. An example of what we deal with
  5. 5. NLP practice
      R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy
  6. 6. NLP practice
      R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy
      D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics
  7. 7. Research
      1. Set a goal
      Business goal:
      * Develop the best / a good enough / a better-than-Word / etc. spellchecker
      * Develop a set of grammar rules that will catch errors according to MLA Style
      * Develop a thesaurus that will produce synonyms relevant to context
  8. 8. Translate it into a measurable goal
      * On a test corpus of 10,000 sentences with common errors, achieve a smaller number of FNs (and FPs) than other spellcheckers (Word's spellchecker, etc.)
      * On a corpus of examples of sentences with each kind of error (and similar sentences without this kind of error), find all sentences with errors and do not find errors in correct sentences
      * On a test corpus of 1,000 sentences, suggest synonyms for all meaningful words that will be considered relevant by human linguists in 90% of the cases
  9. 9. A Note on Terminology
      FN and FP instead of precision (P), recall (R)
      FN = 1-R
      FP = 1-P or ???
      f1 = 2*P*R/(P+R) = 2*(1-FN-FP+FN*FP)/(2-(FN+FP))
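
A minimal Python sketch of the conversion on this slide, using the standard definition F1 = 2PR/(P+R). It assumes FN = 1-R and FP = 1-P, taking the slide's tentative "FP = 1-P" reading at face value; the function name is hypothetical.

```python
def f1_from_error_rates(fn_rate: float, fp_rate: float) -> float:
    """Return F1 given FN = 1 - recall and FP = 1 - precision (assumed reading)."""
    recall = 1.0 - fn_rate
    precision = 1.0 - fp_rate
    if precision + recall == 0:
        # Both precision and recall are zero; F1 is defined as 0 here.
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_from_error_rates(0.2, 0.1)` combines a recall of 0.8 with a precision of 0.9.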
  10. 10. Research contd.
      2. Devise an algorithm
      3. Train & improve the algorithm
  11. 11. Research contd.
      2. Devise an algorithm
      3. Train & improve the algorithm
      http://nlp-class.org
  12. 12. 4. Test its performance
      ML: one corpus, divided into training, development, test
  13. 13. 4. Test its performance
      ML: one corpus, divided into training, development, test
      Often, different corpora:
      * for training some part (not the whole) of the algorithm
      * for testing the whole system
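
The single-corpus case can be sketched in a few lines of Python. This is an illustration under assumptions, not the speaker's procedure: the 80/10/10 proportions, the fixed seed, and the function name are all hypothetical choices.

```python
import random


def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a corpus and split it into (train, dev, test) lists."""
    items = list(sentences)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_dev = round(len(items) * dev_fraction)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test
```

The development set is used for tuning while the test set is held back for the final measurement, so the three parts must not overlap.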
  14. 14. Theoretical maxima
      Theoretical maxima are rarely achievable. Why?
  15. 15. Theoretical maxima
      Theoretical maxima are rarely achievable. Why?
      * Because you need their data. (And data is key)
  16. 16. Theoretical maxima
      Theoretical maxima are rarely achievable. Why?
      * Because you need their data. (And data is key)
      * Domains might differ
  17. 17. Pre/post-processing
      What ultimately matters is not crude performance, but...
  18. 18. Pre/post-processing
      What ultimately matters is not crude performance, but...
      Acceptance by users (much harder to measure & depends on domain).
  19. 19. Pre/post-processing
      What ultimately matters is not crude performance, but...
      Acceptance by users (much harder to measure & depends on domain).
      The real world is messier than any lab set-up.
  20. 20. Examples of pre-processing
      For spellcheck:
      * some people tend to use words separated by slashes, like: spell/grammar check
      * handling of abbreviations
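
As an illustration of the slash case, here is a small Python sketch that splits tokens such as `spell/grammar` before they reach a spellchecker. The rule used (only purely alphabetic parts are split, so URLs and fractions are left alone) and the function name are assumptions, not something the slides specify.

```python
import re

# Matches tokens made only of alphabetic words joined by slashes, e.g. "spell/grammar".
_SLASHED_WORDS = re.compile(r"^[A-Za-z]+(?:/[A-Za-z]+)+$")


def split_slashed_words(tokens):
    """Expand slash-joined words into separate tokens; leave everything else as is."""
    result = []
    for token in tokens:
        if _SLASHED_WORDS.match(token):
            result.extend(token.split("/"))
        else:
            result.append(token)
    return result
```

Each resulting word can then be checked on its own, instead of the whole `spell/grammar` token being flagged as unknown.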
  21. 21. Where to get data?
      Well-known sources:
      * Penn Treebank
      * WordNet
      * Web1T Google N-gram Corpus
      * Linguistic Data Consortium (http://www.ldc.upenn.edu/)
  22. 22. More data
      Also well-known sources, but with a twist:
      * Wikipedia & Wiktionary, DBpedia
      * OpenWeb Common Crawl (updated: 2010)
      * Public APIs of some services: Twitter, Wordnik
  23. 23. Obscure corpora
      Academic resources:
      * Stanford
      * CoNLL
      * Oxford (http://www.ota.ox.ac.uk/)
      * CMU, MIT,...
      * LingPipe, OpenNLP, NLTK,...
  24. 24. Human-powered?
      http://goo.gl/hs4qB
  25. 25. Beyond corpora?
      * Bootstrapping
      * Seeding
  26. 26. And remember...
      “Data is ten times more powerful than algorithms.”
      -- Peter Norvig, “The Unreasonable Effectiveness of Data.”
      http://youtu.be/yvDCzhbjYWs
  27. 27. Using Lisp for NLP (c) xkcd
  28. 28. Why Lisp?
      Lisp is a carefully crafted tool for:
      * Engineers
      * Practical researchers
      * Artists
      * Entrepreneurs
  29. 29. Some examples
      * Piano.aero
      * ITA Software
      * Secure Outcomes
      * Impromptu
      * Land of Lisp
      http://youtu.be/HM1Zb3xmvMc
  30. 30. Research requirements
      * Interactivity
      * Mathematical basis
      * Expressiveness
      * Agility Malleability
      * Advanced tools
  31. 31. Specific NLP requirements
      * Good support for statistics & number-crunching (matrices) – Statistical AI
      * Good support for working with trees & symbols – Symbolic AI
  32. 32. Production requirements
      * Scalability
      * Maintainability
      * Integrability
      * ...
  33. 33. ...eventually
      * Speed
  34. 34. ...eventually
      * Speed
      * Speed
  35. 35. ...eventually
      * Speed
      * Speed
      * Speed
  36. 36. Heterogeneous systems
      You have to split the system and communicate: “Java” way vs. “Unix” way
      * Sockets, Redis, ZeroMQ, etc for communication
      * JSON, SEXPs, etc for data
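
To make the data-format point concrete, here is a hedged Python sketch that renders one small parse tree both as JSON and as an s-expression, the kind of payload that could be passed between components. The tree, its labels, and the `to_sexp` helper are illustrative assumptions; the transport layer (sockets, Redis, ZeroMQ) is left out.

```python
import json


def to_sexp(tree):
    """Render a nested list/tuple as an s-expression string.

    Simplification: atoms are written as-is, so atoms containing spaces
    or parentheses would need escaping in a real system.
    """
    if isinstance(tree, (list, tuple)):
        return "(" + " ".join(to_sexp(child) for child in tree) + ")"
    return str(tree)


# A toy constituency tree: the label comes first, the children follow.
parse = ["S", ["NP", "I"], ["VP", "like", ["NP", "Lisp"]]]

as_json = json.dumps(parse)  # readable by almost any non-Lisp component
as_sexp = to_sexp(parse)     # readable directly by a Lisp reader
```

The s-expression form keeps the Lisp side trivial, while JSON is the easier choice when the other components are not written in Lisp.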
  37. 37. Lisp drawbacks
      There's no OpenNLP or SciPy & generally there are fewer libraries.
  38. 38. Lisp drawbacks
      There's no OpenNLP or SciPy & generally there are fewer libraries.
      But...
      * github: eslick/cl-langutils
      * github: mathematical-systems/clml
      * github: tpapp/lla
      * github: blindglobe/common-lisp-stat
      * … and http://quicklisp.org
  39. 39. But #2
      Porter stemmer: http://tartarus.org/~martin/PorterStemmer & http://www.cliki.net/PorterStemmer
      or Soundex: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/0.html
      are irrelevant with good data
  40. 40. More drawbacks
      * Lisp is a fringe language
      * Not a specialized language (like R, J or Octave)
  41. 41. Example #1API interaction
  42. 42. Example #2
  43. 43. Lisp FTW
      * truly interactive environment
      * very flexible => DSLs
      * native tree support
      * fast and solid
  44. 44. Take-aways
      * Take nlp-class
      * Data is key, collect it, build tools to work with it easily and efficiently
      * A good language for R&D should be first of all interactive & malleable, with as few barriers as possible
      * ... it also helps if you don't need to port your code for production
      * Lisp is one of the good examples
  45. 45. Thanks!
      Vsevolod Dyomkin @vseloved
