NLTK: the Good, the Bad, and the Awesome

Presented at the first meeting of the Bay Area NLP group: http://www.meetup.com/Bay-Area-NLP/events/16522295/ by Jacob Perkins.

Presentation Transcript

    • NLTK: The Good, the Bad, and the Awesome
    • Jacob Perkins
      • Python Text Processing with NLTK 2.0 Cookbook
      • streamhacker.com
      • weotta.com
      • text-processing.com
      • @japerk
    • The Good
      • Makes NLProc easier and more accessible
      • Python (great learning language)
      • Lots of documentation (and 2 books!)
      • Designed for training custom models
      • Includes many training corpora
      • Many algorithms to experiment with
    • The Bad
      • NLProc is hard
      • Few out-of-the-box solutions (see Pattern)
      • Not designed for big-data (see Mahout)
      • Doesn’t have latest algorithms (see Scikits-Learn)
      • No online or active learning algorithms
    • More Bad
      • Doesn’t play nice with pip or easy_install
      • Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)
      • Models can use a lot of memory (& disk if pickled)
    • The Awesome
      • Great for education and research
      • Lots of users & active community
      • Extensible interfaces
      • Training algorithms span human languages
    • More Awesome
      • Trained models can be very fast
      • Well known algorithms can be very accurate
      • NLTK-Trainer (train models with 0 code)
      • Corpus bootstrapping
    • Some Numbers
      • 3 Classification Algorithms
      • 9 Part-of-Speech Tagging Algorithms
      • Stemming Algorithms for 15 Languages
      • 5 Word Tokenization Algorithms
      • Sentence Tokenizers for 16 Languages
      • 60 included corpora
    • Text-Processing.com
      • NLTK Demos & APIs
      • Sentiment Analysis
      • Part-of-Speech Tagging & Chunking / NER
      • Stemming
      • Tokenization
    • Memory Usage (text-processing.com)
    • CPU Usage (text-processing.com)
    • NLTK-Trainer
      • https://github.com/japerk/nltk-trainer
      • 3 Training Command Scripts
        ‣ train_classifier.py
        ‣ train_tagger.py
        ‣ train_chunker.py
      • Easy to tweak training parameters
      • Duck-Typed corpus reading
    • Training Classifiers
      • train_classifier.py movie_reviews --instances paras
      • train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
      • train_classifier.py movie_reviews --instances paras --classifier MEGAM
      • train_classifier.py movie_reviews --instances paras --cross-fold 10
      • Pickled models are saved in ~/nltk_data/classifiers/
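The `--ngrams 1 --ngrams 2` options above ask for both unigram and bigram features. As a rough illustration of what a bag-of-n-grams feature set looks like, here is a minimal sketch in plain Python. It is not nltk-trainer's actual implementation, and the function name `bag_of_ngrams` is hypothetical.

```python
def bag_of_ngrams(words, sizes=(1, 2)):
    """Map every n-gram of the requested sizes to True (presence features)."""
    features = {}
    for n in sizes:
        # Slide a window of length n across the word list.
        for i in range(len(words) - n + 1):
            features[tuple(words[i:i + n])] = True
    return features


feats = bag_of_ngrams("a truly great movie".split())
```

For the four-word input this yields four unigram keys and three bigram keys, which is the kind of feature dictionary a Naive Bayes or maximum-entropy classifier can be trained on.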
    • Training Taggers
      • train_tagger.py treebank
      • train_tagger.py treebank --sequential ubt --brill
      • train_tagger.py treebank --sequential '' --classifier NaiveBayes
      • train_tagger.py mac_morpho --simplify_tags
      • Pickled models are saved in ~/nltk_data/taggers/
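The `--sequential` option builds a chain of taggers in which each one falls back to the next when it has no answer; `ubt` presumably stands for a unigram, bigram, trigram chain. The sketch below shows the backoff idea with a two-level chain in plain Python. It is an illustration under those assumptions, not NLTK's tagger classes, and every class and function name in it is hypothetical.

```python
from collections import Counter, defaultdict


class DefaultTagger:
    """Last resort: always returns the same tag."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word, prev_tag):
        return self.tag


class UnigramTagger:
    """Picks the most frequent tag seen for a word, else backs off."""
    def __init__(self, tagged_sents, backoff):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word, prev_tag):
        if word in self.best:
            return self.best[word]
        return self.backoff.tag_word(word, prev_tag)


class BigramTagger:
    """Uses (previous tag, word) as context, else backs off."""
    def __init__(self, tagged_sents, backoff):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            prev = None
            for word, tag in sent:
                counts[(prev, word)][tag] += 1
                prev = tag
        self.best = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word, prev_tag):
        key = (prev_tag, word)
        if key in self.best:
            return self.best[key]
        return self.backoff.tag_word(word, prev_tag)


def tag_sentence(tagger, words):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev = [], None
    for word in words:
        tag = tagger.tag_word(word, prev)
        tagged.append((word, tag))
        prev = tag
    return tagged


# Tiny hand-made training set, for illustration only.
train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")],
    [("a", "DT"), ("director", "NN")],
]
tagger = BigramTagger(train, UnigramTagger(train, DefaultTagger("NN")))
result = tag_sentence(tagger, ["Vinken", "join"])
```

Here "Vinken" is unknown to both trained taggers and falls through to the default tag, while "join" is unseen in that bigram context but is resolved by the unigram tagger.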
    • Training Chunkers
      • train_chunker.py treebank_chunk
      • train_chunker.py treebank_chunk --classifier NaiveBayes
      • train_chunker.py conll2000 --fileids train.txt
      • Pickled models are saved in ~/nltk_data/chunkers/
    • Corpus Bootstrapping
      • Guess & Correct easier than starting from scratch
      • Use an existing model for initial guesses
      • emoticons
        ‣ :) = “pos”
        ‣ :( = “neg”
      • ratings
        ‣ 5 stars = “pos”
        ‣ 1 star = “neg”
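The emoticon and rating heuristics above can be sketched in a few lines of plain Python. This is only an illustration of the guess step: the function names are hypothetical, and skipping texts that contain both emoticons is an assumption of this sketch, not something the slides specify. The guessed labels would still need human correction before training.

```python
def guess_from_emoticons(text):
    """Return "pos", "neg", or None when the text gives no clear signal."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None  # no emoticon, or conflicting emoticons


def guess_from_rating(stars):
    """Only the extremes are treated as reliable labels."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None


def bootstrap(texts):
    """Keep only the texts for which a label could be guessed."""
    labeled = []
    for text in texts:
        label = guess_from_emoticons(text)
        if label is not None:
            labeled.append((text, label))
    return labeled


corpus = bootstrap([
    "great show :)",
    "missed my train :(",
    "no opinion",
    "mixed feelings :) :(",
])
```

The resulting `corpus` holds only the first two texts with their guessed labels; the unlabeled and conflicting ones are left for manual annotation.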
    • Portuguese Phrase Extraction & Classification
      • similar to condensr.com
      • Brazilian Portuguese
      • aspect classification is easy with training corpus
      • need chunked corpus for phrase extraction
      • use mac_morpho & nltk-trainer to train initial tagger
      • part-of-speech tag annotation is time consuming
      • simplified tags are much easier
      • bracketed phrases w/out pos tags
    • treebank_chunk
      [ Pierre/NNP Vinken/NNP ]
      ,/,
      [ 61/CD years/NNS ]
      old/JJ ,/, will/MD join/VB
      [ the/DT board/NN ]
      as/IN
      [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
      ./.
    • Just Brackets
      [ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
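Going from the tagged treebank_chunk form to the brackets-only form is a mechanical transformation. A minimal sketch in plain Python, assuming whitespace-separated `word/TAG` tokens with `[` and `]` as standalone tokens; the function name `strip_pos_tags` is hypothetical:

```python
def strip_pos_tags(chunked):
    """Drop the /TAG suffix from each token, keeping chunk brackets."""
    out = []
    for token in chunked.split():
        if token in ("[", "]") or "/" not in token:
            out.append(token)
        else:
            # Split on the last slash so ",/," becomes "," and "./." becomes "."
            word, _, _tag = token.rpartition("/")
            out.append(word)
    return " ".join(out)


tagged = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
brackets_only = strip_pos_tags(tagged)
```

Applied to the treebank_chunk sentence, this reproduces the "Just Brackets" line above.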
    • NLP at Weotta
      • Parsing & information extraction
      • Text cleaning & normalization (more parsing)
      • Text & keyword classification
      • De-duplication
      • Search indexing / IR
      • Sentiment analysis
      • Human integration