NLTK: the Good, the Bad, and the Awesome


Presented by Jacob Perkins at the first meeting of the Bay Area NLP group.

Published in: Technology, Education

  1. NLTK: The Good, the Bad, and the Awesome
  2. Jacob Perkins
     • Python Text Processing with NLTK 2.0 Cookbook
     • @japerk
  3. The Good
     • Makes NLProc easier and more accessible
     • Python (great learning language)
     • Lots of documentation (and 2 books!)
     • Designed for training custom models
     • Includes many training corpora
     • Many algorithms to experiment with
  4. The Bad
     • NLProc is hard
     • Few out-of-the-box solutions (see Pattern)
     • Not designed for big data (see Mahout)
     • Doesn’t have the latest algorithms (see Scikits-Learn)
     • No online or active learning algorithms
  5. More Bad
     • Doesn’t play nice with pip or easy_install
     • Python (Java: StanfordNLP, OpenNLP, GATE, Mahout)
     • Models can use a lot of memory (& disk if pickled)
  6. The Awesome
     • Great for education and research
     • Lots of users & active community
     • Extensible interfaces
     • Training algorithms span human languages
  7. More Awesome
     • Trained models can be very fast
     • Well known algorithms can be very accurate
     • NLTK-Trainer (train models with 0 code)
     • Corpus bootstrapping
  8. Some Numbers
     • 3 Classification Algorithms
     • 9 Part-of-Speech Tagging Algorithms
     • Stemming Algorithms for 15 Languages
     • 5 Word Tokenization Algorithms
     • Sentence Tokenizers for 16 Languages
     • 60 included corpora
  9. • NLTK Demos & APIs
     • Sentiment Analysis
     • Part-of-Speech Tagging & Chunking / NER
     • Stemming
     • Tokenization
  10. Memory
  11. CPU
  12. NLTK-Trainer
      • 3 Training Command Scripts
      • Easy to tweak training parameters
      • Duck-Typed corpus reading
  13. Training Classifiers
      • movie_reviews --instances paras
      • movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
      • movie_reviews --instances paras --classifier MEGAM
      • movie_reviews --instances paras --cross-fold 10
      • Pickled models are saved in ~/nltk_data/classifiers/
  14. Training Taggers
      • treebank
      • treebank --sequential ubt --brill
      • treebank --sequential '' --classifier NaiveBayes
      • mac_morpho --simplify_tags
      • Pickled models are saved in ~/nltk_data/taggers/
  15. Training Chunkers
      • treebank_chunk
      • treebank_chunk --classifier NaiveBayes
      • conll2000 --fileids train.txt
      • Pickled models are saved in ~/nltk_data/chunkers/
  16. Corpus Bootstrapping
      • Guess & Correct easier than starting from scratch
      • Use an existing model for initial guesses
      • emoticons
        ‣ :) = “pos”
        ‣ :( = “neg”
      • ratings
        ‣ 5 stars = “pos”
        ‣ 1 star = “neg”
  17. Portuguese Phrase Extraction & Classification
      • similar to
      • Brazilian Portuguese
      • aspect classification is easy with training corpus
      • need chunked corpus for phrase extraction
      • use mac_morpho & nltk-trainer to train initial tagger
      • part-of-speech tag annotation is time consuming
      • simplified tags are much easier
      • bracketed phrases w/out pos tags
  18. treebank_chunk
      [ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.
  19. Just Brackets
      [ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
  20. NLP at Weotta
      • Parsing & information extraction
      • Text cleaning & normalization (more parsing)
      • Text & keyword classification
      • De-duplication
      • Search indexing / IR
      • Sentiment analysis
      • Human integration
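
The step from slide 18 (treebank_chunk) to slide 19 (Just Brackets) can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk or from NLTK-Trainer: the function names `strip_pos_tags` and `bracketed_phrases` are hypothetical, and the sketch assumes tokens are separated by whitespace, chunks are delimited by standalone `[` and `]`, and the part-of-speech tag follows the last `/` in each token.

```python
from typing import List

# Example sentence in the treebank_chunk style shown on slide 18.
SAMPLE = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)


def strip_pos_tags(chunked: str) -> str:
    """Drop the /TAG suffix from every token, keeping the chunk brackets.

    Hypothetical helper: assumes the tag is whatever follows the last slash.
    """
    out = []
    for token in chunked.split():
        if token in ("[", "]"):
            out.append(token)
            continue
        # rpartition splits on the LAST slash, so "1/2/CD" keeps "1/2".
        word, _sep, _tag = token.rpartition("/")
        # A token with no usable word part is passed through unchanged.
        out.append(word if word else token)
    return " ".join(out)


def bracketed_phrases(bracketed: str) -> List[str]:
    """Collect the text of each [ ... ] chunk from a brackets-only sentence."""
    phrases = []
    current = None  # None means "not inside a chunk"
    for token in bracketed.split():
        if token == "[":
            current = []
        elif token == "]":
            if current is not None:
                phrases.append(" ".join(current))
            current = None
        elif current is not None:
            current.append(token)
    return phrases


brackets_only = strip_pos_tags(SAMPLE)
print(brackets_only)
print(bracketed_phrases(brackets_only))
```

Going in this direction is mechanical. The slides make the reverse point: annotators only need to supply the brackets, which is much less work than also assigning a part-of-speech tag to every word.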