NLTK: the Good, the Bad, and the Awesome

Presented by Jacob Perkins at the first meeting of the Bay Area NLP group.


Presentation Transcript

  • NLTK: The Good, the Bad, and the Awesome
  • Jacob Perkins
      • Python Text Processing with NLTK 2.0 Cookbook
      • @japerk
  • The Good
      • Makes NLProc easier and more accessible
      • Python (great learning language)
      • Lots of documentation (and 2 books!)
      • Designed for training custom models
      • Includes many training corpora
      • Many algorithms to experiment with
  • The Bad
      • NLProc is hard
      • Few out-of-the-box solutions (see Pattern)
      • Not designed for big-data (see Mahout)
      • Doesn’t have latest algorithms (see Scikits-Learn)
      • No online or active learning algorithms
  • More Bad
      • Doesn’t play nice with pip or easy_install
      • Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)
      • Models can use a lot of memory (& disk if pickled)
  • The Awesome
      • Great for education and research
      • Lots of users & active community
      • Extensible interfaces
      • Training algorithms span human languages
  • More Awesome
      • Trained models can be very fast
      • Well known algorithms can be very accurate
      • NLTK-Trainer (train models with 0 code)
      • Corpus bootstrapping
  • Some Numbers
      • 3 Classification Algorithms
      • 9 Part-of-Speech Tagging Algorithms
      • Stemming Algorithms for 15 Languages
      • 5 Word Tokenization Algorithms
      • Sentence Tokenizers for 16 Languages
      • 60 included corpora
  • NLTK Demos & APIs
      • Sentiment Analysis
      • Part-of-Speech Tagging & Chunking / NER
      • Stemming
      • Tokenization
  • Memory
  • CPU
  • NLTK-Trainer
      • 3 Training Command Scripts
      • Easy to tweak training parameters
      • Duck-typed corpus reading
  • Training Classifiers
      • movie_reviews --instances paras
      • movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
      • movie_reviews --instances paras --classifier MEGAM
      • movie_reviews --instances paras --cross-fold 10
      • Pickled models are saved in ~/nltk_data/classifiers/
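The `--ngrams 1 --ngrams 2` options above request both unigram and bigram features. A minimal plain-Python sketch of what such a bag-of-ngrams feature extractor might look like follows; `ngram_features` is a hypothetical helper written for illustration, not NLTK-Trainer's own code, and it assumes the text is already tokenized.

```python
def ngram_features(words, sizes=(1, 2)):
    """Return a bag-of-ngrams feature dict: every n-gram seen maps to True."""
    features = {}
    for n in sizes:
        # Slide a window of length n over the token list.
        for i in range(len(words) - n + 1):
            features[tuple(words[i:i + n])] = True
    return features


# Hypothetical example input; a real run would use movie_reviews paragraphs.
feats = ngram_features("a truly great movie".split())
```

Feature dicts of this shape (feature name mapped to a value) are what a Naive Bayes or maximum entropy classifier would then be trained on.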
  • Training Taggers
      • treebank
      • treebank --sequential ubt --brill
      • treebank --sequential '' --classifier NaiveBayes
      • mac_morpho --simplify_tags
      • Pickled models are saved in ~/nltk_data/taggers/
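Assuming `ubt` in the commands above stands for a unigram, bigram, trigram backoff chain, the idea of sequential backoff tagging can be sketched in plain Python. The classes below are illustrative stand-ins, not NLTK's tagger classes, and the training sentences are made up.

```python
from collections import Counter, defaultdict


class DefaultTagger:
    """Last link in the chain: always answers with the same tag."""

    def __init__(self, tag):
        self.default = tag

    def tag_word(self, word, previous_tags):
        return self.default


class BackoffTagger:
    """Look up (previous n-1 tags, word) and defer to `backoff` when unseen."""

    def __init__(self, tagged_sents, n=1, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self._context(word, tags[:i])][tag] += 1
        # Keep only the most frequent tag for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, previous_tags):
        k = self.n - 1
        history = tuple(previous_tags[-k:]) if k else ()
        return (history, word)

    def tag_word(self, word, previous_tags):
        tag = self.table.get(self._context(word, previous_tags))
        if tag is None and self.backoff is not None:
            return self.backoff.tag_word(word, previous_tags)
        return tag

    def tag(self, words):
        tags = []
        for word in words:
            tags.append(self.tag_word(word, tags))
        return list(zip(words, tags))


# Made-up training data for illustration only.
train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")],
    [("a", "DT"), ("director", "NN")],
]
unigram = BackoffTagger(train, n=1, backoff=DefaultTagger("NN"))
bigram = BackoffTagger(train, n=2, backoff=unigram)
tagged = bigram.tag(["the", "director", "will", "resign"])
```

Here "resign" never appears in training, so the bigram and unigram lookups both miss and the default tag is used.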
  • Training Chunkers
      • treebank_chunk
      • treebank_chunk --classifier NaiveBayes
      • conll2000 --fileids train.txt
      • Pickled models are saved in ~/nltk_data/chunkers/
  • Corpus Bootstrapping
      • Guess & Correct easier than starting from scratch
      • Use an existing model for initial guesses
      • emoticons
          ‣ :) = “pos”
          ‣ :( = “neg”
      • ratings
          ‣ 5 stars = “pos”
          ‣ 1 star = “neg”
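The emoticon and rating heuristics above can be sketched as a small labeling function. This is a minimal illustration under simple assumptions (substring matching for emoticons, integer star ratings); `bootstrap_label` and the sample texts are hypothetical, and anything left unlabeled would go to a human for correction.

```python
def bootstrap_label(text, stars=None):
    """Guess a sentiment label from weak signals; return None when unsure."""
    # An extreme star rating is treated as the stronger signal.
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None  # ambiguous or no signal: leave for manual labeling


# Made-up examples: (text, star rating or None).
raw = [
    ("Loved it :)", None),
    ("Never again :(", None),
    ("It was fine", 3),
    ("Terrible service", 1),
]
labeled = []
for text, stars in raw:
    label = bootstrap_label(text, stars)
    if label is not None:
        labeled.append((text, label))
```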
  • Portuguese Phrase Extraction & Classification
      • similar to
      • Brazilian Portuguese
      • aspect classification is easy with training corpus
      • need chunked corpus for phrase extraction
      • use mac_morpho & nltk-trainer to train initial tagger
      • part-of-speech tag annotation is time consuming
      • simplified tags are much easier
      • bracketed phrases w/out pos tags
  • treebank_chunk
      [ Pierre/NNP Vinken/NNP ]
      ,/,
      [ 61/CD years/NNS ]
      old/JJ ,/, will/MD join/VB
      [ the/DT board/NN ]
      as/IN
      [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
      ./.
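The bracketed format above can be read into (word, tag, IOB) triples with a few lines of plain Python. This is a sketch that assumes every bracket marks a noun-phrase chunk and that tokens are whitespace-separated `word/TAG` pairs; `chunk_string_to_iob` is a hypothetical helper, not an NLTK function.

```python
def chunk_string_to_iob(line):
    """Convert '[ word/TAG ... ] word/TAG' text into (word, tag, iob) triples."""
    triples = []
    inside = False  # currently between '[' and ']'
    first = False   # the next token starts a new chunk
    for token in line.split():
        if token == "[":
            inside, first = True, True
        elif token == "]":
            inside = False
        else:
            # Split on the last '/' so tokens like ',/,' parse correctly.
            word, _, tag = token.rpartition("/")
            if not inside:
                iob = "O"
            elif first:
                iob, first = "B-NP", False
            else:
                iob = "I-NP"
            triples.append((word, tag, iob))
    return triples


sentence = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
triples = chunk_string_to_iob(sentence)
```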
  • Just Brackets
      [ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
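Going from a tagged chunk string to the brackets-only form is a matter of dropping each `/TAG` suffix. A minimal sketch, assuming whitespace-separated tokens; `strip_pos_tags` is a hypothetical helper for illustration.

```python
def strip_pos_tags(line):
    """Drop the '/TAG' suffix from every token, leaving brackets untouched."""
    words = []
    for token in line.split():
        if token in ("[", "]"):
            words.append(token)
        else:
            # Keep everything before the last '/', so ',/,' becomes ','.
            words.append(token.rpartition("/")[0])
    return " ".join(words)


tagged_line = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
brackets_only = strip_pos_tags(tagged_line)
```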
  • NLP at Weotta
      • Parsing & information extraction
      • Text cleaning & normalization (more parsing)
      • Text & keyword classification
      • De-duplication
      • Search indexing / IR
      • Sentiment analysis
      • Human integration