NLTK in 20 minutes


Published on

A sprint thru Python's Natural Language ToolKit, presented at SFPython on 9/14/2011. Covers tokenization, part of speech tagging, chunking & NER, text classification, and training text classifiers with nltk-trainer.

Published in: Technology, Education
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • \n
  • \n
  • text processing is very useful in a number of areas, and there's tons of unstructured text flooding the internet nowadays, and NLP/ML is one of the best ways to deal with it\n
  • this is what I'll cover today, but there's a lot more I won't be covering\n
  • loads a trained sentence tokenizer, then calls its tokenize() method. has sentence tokenizers for 16 languages. Smarter than just splitting on punctuation.\n
  • loads a word tokenizer trained on treebank, then calls the tokenize() method\n
  • non-ascii characters are also a problem for word_tokenize(). wordpunct_tokenize() can often be better, but you need to first decide what a word is for your specific case. do contractions matter? can you replace them with two words? Demo shows the results from 4 different tokenizers\n
  • loads a pos tagger trained on treebank - first call will take a few seconds to load the pickle file off disk, every subsequent call will use in-memory tagger. can find tables of pos tag definitions online.\n
  • pos tags might not be useful by themselves, but they are useful metadata for other NLP tasks like dictionary lookup, pos specific keyword analysis, and they are essential for chunking & NER\n
  • every Tree has a draw() method that uses TKinter\n
  • \n
  • bag-of-words is the simplest model, but ignores frequency. good for small text, but frequency can be very important for larger documents. other algorithms, like SVM, create sparse arrays of 1 or 0 depending on word presence, but require knowning full vocabulary beforehand. this classifier is one I trained with nltk-trainer, and can be used for sentiment analysis because it's categories are "pos" and "neg".\n
  • \n
  • can train taggers, chunkers, and text classifiers, and is great for analyzing corpora and how a model performs against a labeled corpus. I use nltk-trainer to train all my models nowadays.\n
  • this trains a very basic sentiment analysis classifier on the movie_reviews corpus, which has reviews categorized into pos or neg\n
  • treebank is a very standard corpus for testing taggers and chunkers\n
  • NLP isn't black magic, but you can treat it as a black box until the defaults aren't good enough. Then you need to dig in and learn how it works so you can make it do what you want. At that point, the best thing you can do is find/make good data, then use existing algos to learn from it.\n
  • \n
  • the original NLTK is very good, available for free online, but takes "textbook" approach. I tried to be a lot more practical in my cookbook. nltk-users mailing list is pretty active, and you can also try stackoverflow\n
  • \n
  • NLTK in 20 minutes

    1. 1. NLTK in 20 minutesA sprint thru Pythons Natural Language ToolKit
    2. 2. Jacob PerkinsCo-founder/CTO @ Weotta (were hiring :)"Python Text Processing with NLTK 2.0 Cookbook"NLTK ContributorBlog: http://streamhacker.comNLTK Demos & APIs:
    3. 3. Why Text Processing?sentiment analysisspam filteringplagariasm detection / document similaritydocument categorization / topic detectionphrase extraction, summarizationsmarter searchsimple keyword frequency analysis
    4. 4. Some NLTK Featuressentence & word tokenizationpart-of-speech taggingchunking & named entity recognitiontext classificationmany included corpora
    5. 5. Sentence Tokenization>>> from nltk.tokenize import sent_tokenize>>> sent_tokenize("Hello SF Python. This is NLTK.")[Hello SF Python., This is NLTK.]>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")[Hello, Mr. Anderson., We missed you!]
    6. 6. Word Tokenization>>> from nltk.tokenize import word_tokenize>>> word_tokenize(This is NLTK.)[This, is, NLTK, .]
    7. 7. Whats a Word?>>> word_tokenize("Whats up?")[What, "s", up, ?]>>> from nltk.tokenize import wordpunct_tokenize>>> wordpunct_tokenize("Whats up?")[What, "", s, up, ?]Learn More:
    8. 8. Part-of-Speech Tagging>>> words = word_tokenize("And now for somethingcompletely different")>>> from nltk.tag import pos_tag>>> pos_tag(words)[(And, CC), (now, RB), (for, IN),(something, NN), (completely, RB), (different,JJ)]Tags List:
    9. 9. Why Part-of-Speech Tag?word definition lookup (WordNet, WordNik)fine-grained text analyticspart-of-speech specific keyword analysischunking & named entity recognition (NER)
    10. 10. Chunking & NER>>> from nltk.chunk import ne_chunk>>> ne_chunk(pos_tag(word_tokenize(My name is JacobPerkins.)))Tree(S, [(My, PRP$), (name, NN), (is, VBZ),Tree(PERSON, [(Jacob, NNP), (Perkins, NNP)]),(., .)])
    11. 11. NER not perfect>>> ne_chunk(pos_tag(word_tokenize(San Francisco isfoggy.)))Tree(S, [Tree(GPE, [(San, NNP)]), Tree(PERSON,[(Francisco, NNP)]), (is, VBZ), (foggy, NN),(., .)])
    12. 12. Text Classificationdef bag_of_words(words): return dict([(word, True) for word in words])>>> feats = bag_of_words(word_tokenize("great movie"))>>> import>>> classifier =>>> classifier.classify(feats)pos
    13. 13. Classification Algos in NLTK Naive Bayes Maximum Entropy / Logistic Regression Decision Tree SVM (coming soon)
    14. 14. NLTK-Trainer line scriptstrain custom modelsanalyze corporaanalyze models against corpora
    15. 15. Train a Sentiment Classifier$ ./ movie_reviews --instances parasloading movie_reviews2 labels: [neg, pos]2000 training feats, 2000 testing featstraining NaiveBayes classifieraccuracy: 0.967000neg precision: 1.000000neg recall: 0.934000neg f-measure: 0.965874pos precision: 0.938086pos recall: 1.000000pos f-measure: 0.968054dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
    16. 16. Notable Included Corporamovie_reviews: pos & neg categorized IMDb reviewstreebank: tagged and parsed WSJ texttreebank_chunk: tagged and chunked WSJ textbrown: tagged & categorized english text60 other corpora in many languages
    17. 17. Other NLTK FeaturesclusteringmetricsparsingstemmingWordNet... and a lot more
    18. 18. Other Python NLP Librariespattern:
    19. 19. Learn More mailing list
    20. 20. NLTK Tutorial @ PyConWhat would you want to learn in 3 hours?What kinds of NLP problems do you face at work?What do you want to do with text?