Natural Language Processing in Ruby


An introduction to performing natural language processing (NLP) tasks in Ruby. Video is here: https://skillsmatter.com/skillscasts/4883-how-to-parse-go#video

Published in: Technology


  1. How to parse ‘go’: Natural Language Processing in Ruby. Tom Cartwright (@tomcartwrightuk), keepmebooked, giveaiddirect.com
  2. Python, surely? Yes. The NLTK is awesome. But you have a Ruby-based app.
  3. Extracting meaning from human input: summarisation, extracting entities, tagging text, sentiment analysis, filtering text.
  4. From document level to word level: document, sentence, word.
  5. Chunking & segmenting: breaking text into paragraphs, sentences and other zones. Start with a document/some text: “The second nonabsolute number is the given time of arrival, which is now known to be one of those most bizarre of mathematical concepts, a recipriversexclusion, a number whose existence can only be defined as being anything other than itself…”
  6. Punkt sentence tokenizer to the rescue…
  7. Sentence segmentation with Punkt:
      require "punkt-segmenter"
      text = "The second nonabsolute number is the given time of arrival..."
      tokenizer = Punkt::SentenceTokenizer.new(text)
      result = tokenizer.sentences_from_text(text, :output => :sentences_text)
  8. Training:
      trainer = Punkt::Trainer.new()
      trainer.train(bistromatic_text)
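To see why a trained segmenter such as Punkt is worth the trouble, it may help to compare it with a baseline. The following is a minimal sketch, not the Punkt algorithm: `naive_sentences` is a hypothetical helper that splits after any full stop, question mark or exclamation mark followed by whitespace. It handles simple text but breaks on abbreviations.

```ruby
# Naive baseline (NOT Punkt): split wherever ., ? or ! is followed by whitespace.
# `naive_sentences` is a hypothetical helper used only for illustration.
def naive_sentences(text)
  text.strip.split(/(?<=[.?!])\s+/)
end

p naive_sentences("Time is an illusion. Lunchtime doubly so.")
# => ["Time is an illusion.", "Lunchtime doubly so."]

# An abbreviation is wrongly treated as a sentence boundary:
p naive_sentences("Dr. Smith arrived. He was late.")
# => ["Dr.", "Smith arrived.", "He was late."]
```

The second call shows the failure mode that a trained segmenter is meant to avoid.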
  9. Tokenising: breaking text into words, phrases and symbols.
      "Time is an illusion. Lunchtime doubly so.".split(" ")
      #=> ["Time", "is", "an", "illusion.", "Lunchtime", "doubly", "so."]
  10. Tokenizer gem: regexes and rules.
      class Tokenizer
        FS = Regexp.new('[[:blank:]]+')
        PAIR_PRE = ['(', '{', '[']
        SIMPLE_POST = ['!', '?', ',', ':', ';', '.']
        PAIR_POST = [')', '}', ']']
        PRE_N_POST = ['"', "'"]
        …
  11. Using the Tokenizer gem:
      require "tokenizer"
      tokenizer = Tokenizer::Tokenizer.new
      tokenizer.tokenize("Time is an illusion. Lunchtime doubly so.")
      #=> ["Time", "is", "an", "illusion", ".", "Lunchtime", "doubly", "so", "."]
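The difference between splitting on spaces and rule-based tokenising can be sketched without any gem. This is an illustration only, not the Tokenizer gem's implementation: `simple_tokenize` is a hypothetical helper that uses a single regular expression to emit runs of letters and digits (allowing an internal apostrophe) and each punctuation mark as its own token.

```ruby
# Hypothetical helper, not part of the Tokenizer gem.
# Words are runs of alphanumerics (with optional internal apostrophes);
# every other non-space character becomes a token of its own.
def simple_tokenize(text)
  text.scan(/[[:alnum:]]+(?:'[[:alnum:]]+)*|[^[:alnum:][:space:]]/)
end

p simple_tokenize("Time is an illusion. Lunchtime doubly so.")
# => ["Time", "is", "an", "illusion", ".", "Lunchtime", "doubly", "so", "."]
```

Unlike `split(" ")`, the full stops are separated from "illusion" and "so", which is the behaviour the slides show for the gem.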
  12. Stemming: Jogging => Jog.
      "jogging".gsub(/.ing/, "")
      #=> "jog"
      "bring".gsub(/.ing/, "")
      #=> "b"
  13. Two stemming gems: 1. Ruby-Stemmer (multi-language Porter stemmer); 2. Text (Porter stemmer).
      require "lingua/stemmer"
      stemmer = Lingua::Stemmer.new(:language => "en")
      stemmer.stem("programming") #=> program
      stemmer.stem("vimming") #=> vim
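The "bring" => "b" result on slide 12 shows why a bare regex is not enough. As a rough sketch of the kind of guard a real stemmer adds, the hypothetical `crude_stem` below only strips "-ing" when the remaining stem still contains a vowel, and then undoubles a trailing double consonant (other than l, s or z). It is far cruder than the Porter algorithm and handles only this one suffix.

```ruby
# Hypothetical, deliberately crude stemmer for the "-ing" suffix only.
# It is NOT the Porter algorithm; it just adds two guards to the naive regex.
def crude_stem(word)
  stem = word.sub(/ing\z/, "")
  return word if stem == word             # no "-ing" suffix to strip
  return word unless stem =~ /[aeiou]/    # "bring" -> "br": no vowel left, keep the word
  stem = stem[0..-2] if stem =~ /([^aeiouslz])\1\z/ # "jogg" -> "jog"
  stem
end

p crude_stem("jogging")     # => "jog"
p crude_stem("bring")       # => "bring"
p crude_stem("programming") # => "program"
p crude_stem("vimming")     # => "vim"
```

For anything beyond a demonstration, one of the stemming gems listed on slide 13 is the safer choice.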
  14. Parts-of-speech tagging:
      CC    conjunction                 and, but
      DET   determiner                  this, some
      IN    preposition / conjunction   above, about
      JJ    adjective                   orange, tiny
      NNP   proper noun                 Camden Pale Ale
  15. A couple of methods. Regex tagger: /.*ing/ => VBG, /.*ed/ => VBD. Lookup on words, e.g. calculating: { VBG: 6 }, orange: { JJ: 2, NN: 5 }
  16. A tale of two taggers. EngTagger: probabilistic (uses the look-up table from the previous slide); pure Ruby. rb-brill-tagger: rule based; C extensions; Brown corpus trained.
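The two tagging methods from slide 15 can be combined in a few lines. This is a minimal sketch under stated assumptions, not how EngTagger or rb-brill-tagger work internally: `LEXICON`, `SUFFIX_RULES` and `tag_word` are hypothetical names, the counts are the ones shown on the slide, and falling back to "NN" for unknown words is an assumption made for the example.

```ruby
# Lookup table of tag counts per word (values taken from slide 15).
LEXICON = {
  "calculating" => { "VBG" => 6 },
  "orange"      => { "JJ" => 2, "NN" => 5 }
}.freeze

# Suffix rules used when a word is not in the lexicon.
SUFFIX_RULES = [
  [/ing\z/, "VBG"],
  [/ed\z/,  "VBD"]
].freeze

# Hypothetical tagger: most frequent tag from the lexicon, then suffix rules,
# then "NN" as an assumed default for unknown words.
def tag_word(word)
  counts = LEXICON[word.downcase]
  return counts.max_by { |_tag, count| count }.first if counts

  rule = SUFFIX_RULES.find { |pattern, _tag| word =~ pattern }
  rule ? rule.last : "NN"
end

p tag_word("orange")  # => "NN"  (NN: 5 beats JJ: 2)
p tag_word("parsing") # => "VBG" (suffix rule)
p tag_word("jogged")  # => "VBD" (suffix rule)
```

Picking the most frequent tag ignores context, which is why "orange" always comes out as a noun here; a probabilistic tagger would also weigh the neighbouring tags.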
  17. Treat gem: bundles many of the gems shown and wraps them in a DSL.
      s = sentence("A really good sentence.")
      s.do(:chunk, :segment, :tokenize, :parse)
      Covers stemming; tokenising; chunking; serialising; tagging; text extraction from pdfs and html.
  18. LRUG Sentiments. A tag: {NN}. Pass in a regex
      /({JJ}|{JJS})({NNS}|{NNP})/
      and some tagged tokens:
      #=> [(Word @tag="JJ", @text="jolly"), (Word @tag="NN", @text="face")]
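One way to run a tag regex over tagged tokens, as slide 18 describes, is to encode the tag sequence as a string and map each match back to the words it covers. The sketch below is an assumption about how this could be done, not the code used in the talk: `Word` and `match_tag_pattern` are hypothetical, and `{NN}` has been added to the noun alternatives so that the pattern matches the "jolly face" output shown on the slide.

```ruby
# Hypothetical token type: a word and its part-of-speech tag.
Word = Struct.new(:text, :tag)

# Encode the tags as "{DET}{JJ}{NN}", scan with the pattern, then translate
# each match's character offset back into a slice of the word list.
def match_tag_pattern(words, pattern)
  encoded = words.map { |word| "{#{word.tag}}" }.join
  matches = []
  encoded.scan(pattern) do
    match  = Regexp.last_match
    first  = encoded[0...match.begin(0)].count("{") # tags before the match
    length = match[0].count("{")                    # tags inside the match
    matches << words[first, length]
  end
  matches
end

words = [Word.new("a", "DET"), Word.new("jolly", "JJ"), Word.new("face", "NN")]
# Adjective followed by a noun; {NN} is an addition made for this example.
pattern = /(?:\{JJ\}|\{JJS\})(?:\{NN\}|\{NNS\}|\{NNP\})/

p match_tag_pattern(words, pattern).map { |group| group.map(&:text) }
# => [["jolly", "face"]]
```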
  19. Sentimental value:
      epic         1.0
      good         1.0
      chance       0.21875
      brisk        0.21875
      slanderous  -1.0
      piteous     -1.0
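A word-level lexicon like the one on slide 19 can be turned into a text-level score in several ways. The following is a minimal sketch that assumes the simplest one, averaging the values of the words found in the lexicon; `SENTIMENT` and `sentiment_score` are hypothetical names, and the pairing of words to values follows the order in which they appear on the slide.

```ruby
# Word scores as listed on slide 19 (pairing inferred from slide order).
SENTIMENT = {
  "epic"       => 1.0,
  "good"       => 1.0,
  "chance"     => 0.21875,
  "brisk"      => 0.21875,
  "slanderous" => -1.0,
  "piteous"    => -1.0
}.freeze

# Hypothetical scorer: mean score of the words that appear in the lexicon,
# or 0.0 when none of them do.
def sentiment_score(text)
  scores = text.downcase.scan(/[a-z']+/).map { |word| SENTIMENT[word] }.compact
  return 0.0 if scores.empty?

  scores.sum / scores.size
end

p sentiment_score("An epic talk")         # => 1.0
p sentiment_score("A good chance")        # => 0.609375
p sentiment_score("Slanderous but brisk") # => -0.390625
```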
  20. Results, in three groups:
      • Ruby; Practical Object-Oriented Design in Ruby; Doctors; Lrug; recruiters (!)
      • dedicated servers; pdfs; Surrey
      • unsolicited phone calls from r********s; clients; Paypal; XML; geeks
  21. Gems: Text (Paul Battley’s box of tricks); Treat; Tokenizer; Punkt segmenter; Chronic (for extracting dates).
  22. Other things you can do/I didn’t talk about: calculate text edit distance; extract entities using the Stanford libraries via the RJB; extract topic words (LDA); keyword extraction (TfIdf); JRuby.
  23. Thank you for processing. Questions? @tomcartwrightuk. Thanks to Tim Cowlishaw and the HT dev team for specialised rubber duck support.
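Text edit distance, the first item on slide 22, is small enough to sketch in plain Ruby. This is a standard dynamic-programming Levenshtein distance written for illustration; `levenshtein` is a hypothetical helper, not the API of any gem mentioned in the talk.

```ruby
# Levenshtein distance: the minimum number of single-character insertions,
# deletions and substitutions needed to turn string a into string b.
# Only two rows of the dynamic-programming table are kept at a time.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

p levenshtein("kitten", "sitting") # => 3
p levenshtein("ruby", "rugby")     # => 1
```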
