Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

nlp_compromise

113 views

Published on

Spencer Kelly presents nlp_compromise at the Summer 2016 Toronto Node.js meetup.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

nlp_compromise

  1. 1. The pope is catholic. language as an interfacelanguage as data introduction
  2. 2. (@spencermountain) introduction
  3. 3. problem
  4. 4. problem
  5. 5. problem
  6. 6. london in the rain london in the in the rain london in in the the rain london in the rain 4-gram: 3-gram: 2-gram: 1-gram: 10 requests per keystroke 4-words - 5-words : 15 6-words : 21 7-words : 28 8-words : 36 problem
  7. 7. london in the rain london in the in the rain london in in the the rain London in the rain in the london in the in the rain london in the rain Stopwords blacklist: Edge gram filter: Redundancy check: #1 #2 #3 in the “rain” “london” “london in the rain” problem
  8. 8. ● NLTK - excellent, huge, python ● Stanford parser - excellent, huge, java Alchemy, TextRazor, OpenCalais, Embedly, Zemanta Or an offsite API?● Freeling - excellent, huge, C++ ● Illinois tagger - excellent, huge, java When all you’ve got is a jackhammer.. niche
  9. 9. Can it be hacked? tldr: yes. niche
  10. 10. How big is a language? Shakespeare - 35,000 niche
  11. 11. Zipfs law The top 10 words account for 25% of language. The top 100 words account for 50% of language. The top 50,000 words account for 95% of language. niche
  12. 12. 602 kb uncompressed 50,000 different words An average person will ever hear ~4 lookups in binary search niche
  13. 13. first, let’s kill the nouns 70% process 180 kb uncompressed
  14. 14. Noun Verb Adjective Adverb Tomato Tomatoes Toronto Torontonian Speak Spoke Speaking will speak have spoken had spoken ... nice nicer nicest quickly quicklier quickliest “awesome”“awesomeify” improveify your vocabularies “quickly”“quick” n/2.3 Each word *not handsome *not truly*not is*not economics “tomatoey”“tomato” “speak”“speaker” “aggressive”“agressiveness” “civil”“civilize” niche
  15. 15. then, let’s conjugate our verbs process 110 kb uncompressed
  16. 16. process jQuery 256kb d3js 330kb react 653kb 110 kb uncompressed lodash 503kb the whole english language 110kb
  17. 17. Ok, let’s roll our own POS tagger.. (what could go rong?) process 1) ☑ Lexicon 2) ◻ Suffix regexes 3) ◻ Sentence
  18. 18. Suffix rules process 1) ☑ Lexicon 2) ☑ Suffix regexes 3) ◻ Sentence
  19. 19. Grammar rules - markov She could walk the walk . before: Verb - Det - Verb after: Verb - Det - Noun process 1) ☑ Lexicon 2) ☑ Suffix regexes 3) ☑ Sentence
  20. 20. “Unreasonable effectiveness” of rule-based taggers- ● a 1,000 word lexicon - 45% precision ● fallback to [Noun] - 70% precision ● a little regex - 74% precision ● a little grammar in it - 81% precision process
  21. 21. t.text(“keep on rocking in the free world”) t.negate() //“don’t keep on rocking in the free world.” outcome
  22. 22. t.text(“it is a cool library”) t.toValleyGirl() //“so, it is like, a cool library.” outcome
  23. 23. We gave the monkeys the bananas, ..because they were ripe. ..because they were hungry. outcome
  24. 24. We gave the monkeys the bananas [Pr] [Verb] [Dt] [Noun] [Dt] [Noun] list of letters POS-tagging Dependency parser We give [Noun] [Noun] [act / transfer / voluntary] [genus / monkey] Knowledge engine [plant / banana] outcome
  25. 25. #TODOFML ● Mutable/Immutable API ● Speed, performance testing ● Romantic-language verb conjugations ● ‘bl.ocks.org’ of demos and docs outcome
  26. 26. npm install --wooyeah @spencermountain Slack group, mailing list, github, Toronto/coffee

×