NLTK: the Good, the Bad, and the Awesome



Presented by Jacob Perkins at the first meeting of the Bay Area NLP group.


Published in: Technology, Education



  • 1. NLTK: The Good, the Bad, and the Awesome
  • 2. Jacob Perkins
      • Python Text Processing with NLTK 2.0 Cookbook
      • @japerk
  • 3. The Good
      • Makes NLProc easier and more accessible
      • Python (great learning language)
      • Lots of documentation (and 2 books!)
      • Designed for training custom models
      • Includes many training corpora
      • Many algorithms to experiment with
  • 4. The Bad
      • NLProc is hard
      • Few out-of-the-box solutions (see Pattern)
      • Not designed for big data (see Mahout)
      • Doesn’t have the latest algorithms (see Scikits-Learn)
      • No online or active learning algorithms
  • 5. More Bad
      • Doesn’t play nice with pip or easy_install
      • Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)
      • Models can use a lot of memory (& disk if pickled)
  • 6. The Awesome
      • Great for education and research
      • Lots of users & active community
      • Extensible interfaces
      • Training algorithms span human languages
  • 7. More Awesome
      • Trained models can be very fast
      • Well-known algorithms can be very accurate
      • NLTK-Trainer (train models with 0 code)
      • Corpus bootstrapping
  • 8. Some Numbers
      • 3 Classification Algorithms
      • 9 Part-of-Speech Tagging Algorithms
      • Stemming Algorithms for 15 Languages
      • 5 Word Tokenization Algorithms
      • Sentence Tokenizers for 16 Languages
      • 60 included corpora
  • 9.
      • NLTK Demos & APIs
      • Sentiment Analysis
      • Part-of-Speech Tagging & Chunking / NER
      • Stemming
      • Tokenization
  • 10. Memory
  • 11. CPU
  • 12. NLTK-Trainer
      • 3 Training Command Scripts
      • Easy to tweak training parameters
      • Duck-typed corpus reading
  • 13. Training Classifiers
      • movie_reviews --instances paras
      • movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
      • movie_reviews --instances paras --classifier MEGAM
      • movie_reviews --instances paras --cross-fold 10
      • Pickled models are saved in ~/nltk_data/classifiers/
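
The `--ngrams 1 --ngrams 2` options above suggest that both single words and word pairs are used as classification features. The following is a minimal pure-Python sketch of that idea, not NLTK-Trainer's actual code: the function name `ngram_features` and the choice of tuple keys are assumptions made for illustration. It builds the kind of feature dictionary that NLTK classifiers train on.

```python
def ngram_features(words, ns=(1, 2)):
    """Return a bag-of-n-grams feature dict for a list of tokens.

    Each n-gram present in the text becomes a feature set to True.
    `ns` lists the n-gram sizes to include (hypothetical parameter name).
    """
    features = {}
    for n in ns:
        # Slide a window of size n across the token list.
        for i in range(len(words) - n + 1):
            features[tuple(words[i:i + n])] = True
    return features


feats = ngram_features("a great movie".split())
for gram in sorted(feats, key=lambda g: (len(g), g)):
    print(gram)
```

With three tokens this yields three unigram features and two bigram features. Adding bigrams lets a classifier see short phrases such as "not good" instead of treating every word independently.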
  • 14. Training Taggers
      • treebank
      • treebank --sequential ubt --brill
      • treebank --sequential '' --classifier NaiveBayes
      • mac_morpho --simplify_tags
      • Pickled models are saved in ~/nltk_data/taggers/
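
Assuming `--sequential ubt` refers to a chain of unigram, bigram, and trigram taggers that back off to one another, the backoff idea can be sketched in plain Python. This is an illustration only and not NLTK's implementation: the class names, the `choose_tag` method, and the tiny training set are all hypothetical, and the chain stops at bigrams to stay short.

```python
from collections import Counter, defaultdict


class DefaultTagger:
    """Last resort: assign the same tag to every word."""

    def __init__(self, tag):
        self.tag = tag

    def choose_tag(self, word, prev_tag):
        return self.tag


class ContextTagger:
    """Pick the most frequent tag seen for a context, else ask the backoff."""

    def __init__(self, tagged_sents, backoff, use_prev_tag):
        self.backoff = backoff
        self.use_prev_tag = use_prev_tag
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            prev_tag = None
            for word, tag in sent:
                counts[self._context(word, prev_tag)][tag] += 1
                prev_tag = tag
        # Keep only the most common tag for each observed context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, prev_tag):
        # Unigram context is the word; bigram context adds the previous tag.
        return (prev_tag, word) if self.use_prev_tag else word

    def choose_tag(self, word, prev_tag):
        tag = self.table.get(self._context(word, prev_tag))
        if tag is None:
            # Unseen context: defer to the next tagger in the chain.
            tag = self.backoff.choose_tag(word, prev_tag)
        return tag


def tag_sentence(tagger, words):
    """Tag words left to right, feeding each predicted tag forward."""
    tagged, prev_tag = [], None
    for word in words:
        prev_tag = tagger.choose_tag(word, prev_tag)
        tagged.append((word, prev_tag))
    return tagged


# Toy training data (made up for this example).
train = [
    [("the", "DT"), ("board", "NN")],
    [("a", "DT"), ("board", "NN")],
    [("will", "MD"), ("board", "VB")],
]
unigram = ContextTagger(train, DefaultTagger("NN"), use_prev_tag=False)
bigram = ContextTagger(train, unigram, use_prev_tag=True)

print(tag_sentence(bigram, ["will", "board"]))
print(tag_sentence(bigram, ["every", "board"]))
```

In the first sentence the bigram tagger has seen "board" after a modal and tags it as a verb. In the second, "every" is unknown, so the chain falls through to the default tag, and "board" is resolved by the unigram tagger.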
  • 15. Training Chunkers
      • treebank_chunk
      • treebank_chunk --classifier NaiveBayes
      • conll2000 --fileids train.txt
      • Pickled models are saved in ~/nltk_data/chunkers/
  • 16. Corpus Bootstrapping
      • Guess & Correct is easier than starting from scratch
      • Use an existing model for initial guesses
      • emoticons
          ‣ :) = “pos”
          ‣ :( = “neg”
      • ratings
          ‣ 5 stars = “pos”
          ‣ 1 star = “neg”
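
The emoticon and rating rules above can be sketched as simple labeling functions that produce an initial, noisy training corpus for later manual correction. This is one possible reading of the slide, not code from the talk: the function names are hypothetical, and skipping texts with mixed or missing signals is an assumption.

```python
def label_from_emoticons(text):
    """Guess a sentiment label from emoticons, or None if unclear."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None  # no emoticon, or conflicting emoticons


def label_from_rating(stars):
    """Guess a sentiment label from a star rating, or None if unclear."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None  # middling ratings are too ambiguous to use


def bootstrap(texts):
    """Keep only the texts that received a confident guess."""
    labeled = []
    for text in texts:
        label = label_from_emoticons(text)
        if label is not None:
            labeled.append((text, label))
    return labeled


corpus = bootstrap(["loved it :)", "meh", "never again :(", "hmm :) :("])
print(corpus)
```

Only the clear cases are kept, so the bootstrapped corpus is smaller than the raw data but needs far less correction than labels written from scratch.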
  • 17. Portuguese Phrase Extraction & Classification
      • similar to
      • Brazilian Portuguese
      • aspect classification is easy with a training corpus
      • need a chunked corpus for phrase extraction
      • use mac_morpho & nltk-trainer to train an initial tagger
      • part-of-speech tag annotation is time-consuming
      • simplified tags are much easier
      • bracketed phrases w/out pos tags
  • 18. treebank_chunk
      [ Pierre/NNP Vinken/NNP ]
      ,/,
      [ 61/CD years/NNS ]
      old/JJ ,/, will/MD join/VB
      [ the/DT board/NN ]
      as/IN
      [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
      ./.
  • 19. Just Brackets
      [ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
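
The relationship between the tagged treebank_chunk format and the brackets-only format can be sketched with two small helpers. This is a minimal illustration assuming whitespace-separated tokens, `word/TAG` pairs, and no nested brackets. The names `strip_tags` and `extract_chunks` are hypothetical and not part of NLTK.

```python
TAGGED = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)


def strip_tags(tagged_line):
    """Drop the /TAG suffix from every token, keeping the brackets."""
    out = []
    for token in tagged_line.split():
        if token in ("[", "]"):
            out.append(token)
        else:
            # Split on the last slash so words containing "/" survive.
            out.append(token.rsplit("/", 1)[0])
    return " ".join(out)


def extract_chunks(bracketed_line):
    """Return each bracketed phrase as a list of tokens (no nesting)."""
    chunks, current = [], None
    for token in bracketed_line.split():
        if token == "[":
            current = []
        elif token == "]":
            if current is not None:
                chunks.append(current)
            current = None
        elif current is not None:
            current.append(token)
    return chunks


brackets_only = strip_tags(TAGGED)
print(brackets_only)
print(extract_chunks(brackets_only))
```

Going in this direction is mechanical. The annotation effort lies in the opposite direction, which is why bracketing phrases without part-of-speech tags is the cheaper way to start a chunked corpus.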
  • 20. NLP at Weotta
      • Parsing & information extraction
      • Text cleaning & normalization (more parsing)
      • Text & keyword classification
      • De-duplication
      • Search indexing / IR
      • Sentiment analysis
      • Human integration