
NLTK in 20 minutes


A sprint thru Python's Natural Language ToolKit, presented at SFPython on 9/14/2011. Covers tokenization, part of speech tagging, chunking & NER, text classification, and training text classifiers with nltk-trainer.

Published in: Technology, Education

  1. NLTK in 20 minutes: A sprint thru Python's Natural Language ToolKit
  2. Jacob Perkins: Co-founder/CTO @ Weotta (we're hiring :); "Python Text Processing with NLTK 2.0 Cookbook"; NLTK Contributor; Blog: http://streamhacker.com; NLTK Demos & APIs:
  3. Why Text Processing? sentiment analysis; spam filtering; plagiarism detection / document similarity; document categorization / topic detection; phrase extraction, summarization; smarter search; simple keyword frequency analysis
  4. Some NLTK Features: sentence & word tokenization; part-of-speech tagging; chunking & named entity recognition; text classification; many included corpora
  5. Sentence Tokenization

         >>> from nltk.tokenize import sent_tokenize
         >>> sent_tokenize("Hello SF Python. This is NLTK.")
         ['Hello SF Python.', 'This is NLTK.']
         >>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
         ['Hello, Mr. Anderson.', 'We missed you!']
  6. Word Tokenization

         >>> from nltk.tokenize import word_tokenize
         >>> word_tokenize('This is NLTK.')
         ['This', 'is', 'NLTK', '.']
  7. What's a Word?

         >>> word_tokenize("What's up?")
         ['What', "'s", 'up', '?']
         >>> from nltk.tokenize import wordpunct_tokenize
         >>> wordpunct_tokenize("What's up?")
         ['What', "'", 's', 'up', '?']

     Learn More:
  8. Part-of-Speech Tagging

         >>> words = word_tokenize("And now for something completely different")
         >>> from nltk.tag import pos_tag
         >>> pos_tag(words)
         [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

     Tags List:
  9. Why Part-of-Speech Tag? word definition lookup (WordNet, WordNik); fine-grained text analytics; part-of-speech specific keyword analysis; chunking & named entity recognition (NER)
  10. Chunking & NER

          >>> from nltk.chunk import ne_chunk
          >>> ne_chunk(pos_tag(word_tokenize('My name is Jacob Perkins.')))
          Tree('S', [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), Tree('PERSON', [('Jacob', 'NNP'), ('Perkins', 'NNP')]), ('.', '.')])
  11. NER not perfect

          >>> ne_chunk(pos_tag(word_tokenize('San Francisco is foggy.')))
          Tree('S', [Tree('GPE', [('San', 'NNP')]), Tree('PERSON', [('Francisco', 'NNP')]), ('is', 'VBZ'), ('foggy', 'NN'), ('.', '.')])
  12. Text Classification

          def bag_of_words(words):
              return dict([(word, True) for word in words])

          >>> feats = bag_of_words(word_tokenize("great movie"))
          >>> import
          >>> classifier =
          >>> classifier.classify(feats)
          'pos'
  13. Classification Algos in NLTK: Naive Bayes; Maximum Entropy / Logistic Regression; Decision Tree; SVM (coming soon)
  14. NLTK-Trainer: command-line scripts; train custom models; analyze corpora; analyze models against corpora
  15. Train a Sentiment Classifier

          $ ./ movie_reviews --instances paras
          loading movie_reviews
          2 labels: ['neg', 'pos']
          2000 training feats, 2000 testing feats
          training NaiveBayes classifier
          accuracy: 0.967000
          neg precision: 1.000000
          neg recall: 0.934000
          neg f-measure: 0.965874
          pos precision: 0.938086
          pos recall: 1.000000
          pos f-measure: 0.968054
          dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
  16. Notable Included Corpora: movie_reviews: pos & neg categorized IMDb reviews; treebank: tagged and parsed WSJ text; treebank_chunk: tagged and chunked WSJ text; brown: tagged & categorized English text; 60 other corpora in many languages
  17. Other NLTK Features: clustering; metrics; parsing; stemming; WordNet; ... and a lot more
  18. Other Python NLP Libraries: pattern:
  19. Learn More: mailing list
  20. NLTK Tutorial @ PyCon: What would you want to learn in 3 hours? What kinds of NLP problems do you face at work? What do you want to do with text?
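
The bag-of-words classification on slide 12 can be tried without NLTK or a pickled model. What follows is a minimal standard-library sketch under stated assumptions, not NLTK's implementation: `tokenize` reuses the regular expression that `wordpunct_tokenize` is built on, `bag_of_words` is the function from the slide, and `train_naive_bayes`, `classify`, and the four training sentences are hypothetical stand-ins for a trained NLTK Naive Bayes classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Same pattern as NLTK's wordpunct_tokenize: runs of word characters,
    # or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


def bag_of_words(words):
    # From slide 12: every word that occurs becomes a True-valued feature.
    return {word: True for word in words}


def train_naive_bayes(labeled_feats):
    # Hypothetical helper: count labels and per-label word occurrences.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in labeled_feats:
        label_counts[label] += 1
        for word in feats:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, feats):
    # Hypothetical helper: pick the label with the highest log-probability,
    # using add-one smoothing and ignoring words never seen in training.
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in feats:
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
train = [
    ("great movie", "pos"),
    ("wonderful acting, great fun", "pos"),
    ("terrible movie", "neg"),
    ("boring plot, awful acting", "neg"),
]
model = train_naive_bayes(
    [(bag_of_words(tokenize(text)), label) for text, label in train]
)

print(classify(model, bag_of_words(tokenize("great movie"))))
```

With real data, the same feature dictionaries could be passed to an NLTK classifier instead of these helpers; the toy corpus here is far too small to say anything about accuracy.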