NLTK: the Good, the Bad, and the Awesome

Presented at the first meeting of the Bay Area NLP group: http://www.meetup.com/Bay-Area-NLP/events/16522295/ by Jacob Perkins.

Published in: Technology, Education

Transcript of "NLTK: the Good, the Bad, and the Awesome"

  1. NLTK: The Good, the Bad, and the Awesome
  2. Jacob Perkins
     • Python Text Processing with NLTK 2.0 Cookbook
     • streamhacker.com
     • weotta.com
     • text-processing.com
     • @japerk
  3. The Good
     • Makes NLProc easier and more accessible
     • Python (great learning language)
     • Lots of documentation (and 2 books!)
     • Designed for training custom models
     • Includes many training corpora
     • Many algorithms to experiment with
  4. The Bad
     • NLProc is hard
     • Few out-of-the-box solutions (see Pattern)
     • Not designed for big-data (see Mahout)
     • Doesn’t have latest algorithms (see Scikits-Learn)
     • No online or active learning algorithms
  5. More Bad
     • Doesn’t play nice with pip or easy_install
     • Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)
     • Models can use a lot of memory (& disk if pickled)
  6. The Awesome
     • Great for education and research
     • Lots of users & active community
     • Extensible interfaces
     • Training algorithms span human languages
  7. More Awesome
     • Trained models can be very fast
     • Well known algorithms can be very accurate
     • NLTK-Trainer (train models with 0 code)
     • Corpus bootstrapping
  8. Some Numbers
     • 3 Classification Algorithms
     • 9 Part-of-Speech Tagging Algorithms
     • Stemming Algorithms for 15 Languages
     • 5 Word Tokenization Algorithms
     • Sentence Tokenizers for 16 Languages
     • 60 included corpora
  9. Text-Processing.com
     • NLTK Demos & APIs
     • Sentiment Analysis
     • Part-of-Speech Tagging & Chunking / NER
     • Stemming
     • Tokenization
  10. Memory Usage (text-processing.com)
  11. CPU Usage (text-processing.com)
  12. NLTK-Trainer
     • https://github.com/japerk/nltk-trainer
     • 3 Training Command Scripts
        ‣ train_classifier.py
        ‣ train_tagger.py
        ‣ train_chunker.py
     • Easy to tweak training parameters
     • Duck-Typed corpus reading
  13. Training Classifiers
     • train_classifier.py movie_reviews --instances paras
     • train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
     • train_classifier.py movie_reviews --instances paras --classifier MEGAM
     • train_classifier.py movie_reviews --instances paras --cross-fold 10
     • Pickled models are saved in ~/nltk_data/classifiers/
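The `--ngrams 1 --ngrams 2` options on slide 13 request both unigram and bigram features. The sketch below shows one plain-Python way to build such features. It is an illustration only, not nltk-trainer's implementation, and `ngrams` and `bag_of_ngrams` are hypothetical helper names.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, sizes=(1, 2)):
    """Build a presence-feature dict from all n-grams of the given sizes.

    Hypothetical helper: each n-gram becomes a key mapped to True, the
    usual "bag of words" shape for a text classifier's feature set.
    """
    features = {}
    for n in sizes:
        for gram in ngrams(tokens, n):
            features[" ".join(gram)] = True
    return features


# Toy input; a real run would tokenize paragraphs from a corpus.
features = bag_of_ngrams("not a good movie".split())
```

Each key in the resulting dict is one feature. Adding bigrams lets a classifier see pairs such as `good movie` that single words cannot express, at the cost of a larger feature space.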
  14. Training Taggers
     • train_tagger.py treebank
     • train_tagger.py treebank --sequential ubt --brill
     • train_tagger.py treebank --sequential '' --classifier NaiveBayes
     • train_tagger.py mac_morpho --simplify_tags
     • Pickled models are saved in ~/nltk_data/taggers/
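Assuming `--sequential ubt` on slide 14 selects a chain of unigram, bigram, and trigram taggers that back off to one another (an interpretation of the flag, not something the slide spells out), the backoff idea can be sketched in plain Python. The classes below are hypothetical toys, not NLTK's tagger classes, and the chain stops at the bigram level for brevity.

```python
from collections import Counter, defaultdict


class ToyDefaultTagger:
    """Last resort in the chain: give every word the same tag."""

    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word, prev_tag):
        return self.tag


class ToyContextTagger:
    """Choose the most frequent tag seen for a context, else ask the backoff.

    With use_prev_tag=False the context is the word alone (unigram);
    with use_prev_tag=True it is (previous tag, word) (bigram).
    """

    def __init__(self, train_sents, use_prev_tag, backoff):
        self.use_prev_tag = use_prev_tag
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            prev_tag = None
            for word, tag in sent:
                counts[self._context(word, prev_tag)][tag] += 1
                prev_tag = tag
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, prev_tag):
        return (prev_tag, word) if self.use_prev_tag else word

    def tag_word(self, word, prev_tag):
        ctx = self._context(word, prev_tag)
        if ctx in self.table:
            return self.table[ctx]
        # Unseen context: defer to the next, less specific tagger.
        return self.backoff.tag_word(word, prev_tag)


def tag_sentence(tagger, words):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev_tag = [], None
    for word in words:
        tag = tagger.tag_word(word, prev_tag)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


# Tiny hand-made training set (not from any corpus).
train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")],
    [("the", "DT"), ("will", "NN"), ("to", "TO"), ("join", "VB")],
]
default = ToyDefaultTagger("NN")
unigram = ToyContextTagger(train, use_prev_tag=False, backoff=default)
bigram = ToyContextTagger(train, use_prev_tag=True, backoff=unigram)

result = tag_sentence(bigram, ["the", "will", "join", "Vinken"])
```

In this toy run the bigram level tags "will" after a determiner, the unigram level handles "join" in a context the bigram table has never seen, and the default tag covers the unknown word "Vinken".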
  15. Training Chunkers
     • train_chunker.py treebank_chunk
     • train_chunker.py treebank_chunk --classifier NaiveBayes
     • train_chunker.py conll2000 --fileids train.txt
     • Pickled models are saved in ~/nltk_data/chunkers/
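The training slides note that models are saved as pickles, and an earlier slide warns that pickled models can take a lot of disk. The round trip itself can be sketched with the standard library alone. The dict below is only a stand-in for a trained model, and the file name and temporary directory are assumptions made for illustration, not paths that nltk-trainer writes.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: a plain dict mapping words to tags.
model = {"the": "DT", "board": "NN", "join": "VB"}

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name; real models live under ~/nltk_data/.
    path = os.path.join(tmpdir, "toy_model.pickle")

    # Serialize the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # The on-disk size is what grows with large trained models.
    size_on_disk = os.path.getsize(path)

    # Load it back; the restored object is a separate but equal copy.
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Unpickling a real model also requires that the classes it was built from are importable, so the same libraries need to be installed wherever the pickle is loaded.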
  16. Corpus Bootstrapping
     • Guess & Correct easier than starting from scratch
     • Use an existing model for initial guesses
     • emoticons
        ‣ :) = “pos”
        ‣ :( = “neg”
     • ratings
        ‣ 5 stars = “pos”
        ‣ 1 star = “neg”
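The emoticon and rating rules on slide 16 can be sketched as simple labeling functions. This is a minimal illustration under stated assumptions: the function names are hypothetical, texts with both emoticons are treated as ambiguous, and ratings other than 1 or 5 stars are left unlabeled, since the slide does not say how to handle those cases.

```python
def label_from_emoticons(text):
    """Guess 'pos' or 'neg' from emoticons; None if absent or mixed."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos == has_neg:
        # Neither emoticon, or both: leave it for a human to correct.
        return None
    return "pos" if has_pos else "neg"


def label_from_rating(stars):
    """Guess a label from a star rating, using only the extremes."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None


def bootstrap(texts):
    """Split raw texts into guessed (text, label) pairs and the rest."""
    labeled, unlabeled = [], []
    for text in texts:
        label = label_from_emoticons(text)
        if label is None:
            unlabeled.append(text)
        else:
            labeled.append((text, label))
    return labeled, unlabeled


# Made-up example texts.
texts = [
    "great food :)",
    "cold fries :(",
    "it was fine",
    "loved it :) hated the wait :(",
]
labeled, unlabeled = bootstrap(texts)
```

The guessed labels are only a starting point. In the guess-and-correct workflow they would be reviewed and fixed before training a model on them.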
  17. Portuguese Phrase Extraction & Classification
     • similar to condensr.com
     • Brazilian Portuguese
     • aspect classification is easy with training corpus
     • need chunked corpus for phrase extraction
     • use mac_morpho & nltk-trainer to train initial tagger
     • part-of-speech tag annotation is time consuming
     • simplified tags are much easier
     • bracketed phrases w/out pos tags
  18. treebank_chunk
     [ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.
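The bracketed, tagged format on slide 18 can be read into per-token chunk labels. The sketch below converts it to IOB-style triples, assuming tokens are separated by whitespace and that every bracket marks a noun phrase (`NP`). `chunked_to_iob` is a hypothetical helper, not an NLTK or nltk-trainer function.

```python
def chunked_to_iob(chunked, chunk_label="NP"):
    """Turn '[ word/TAG ... ] word/TAG' text into (word, tag, iob) triples.

    Tokens inside brackets get B-<label> (first) or I-<label> (rest);
    tokens outside any bracket get O.
    """
    triples = []
    inside = False
    first = False
    for token in chunked.split():
        if token == "[":
            inside, first = True, True
        elif token == "]":
            inside = False
        else:
            # Split on the last slash so words containing '/' survive.
            word, tag = token.rsplit("/", 1)
            if not inside:
                iob = "O"
            elif first:
                iob = "B-" + chunk_label
                first = False
            else:
                iob = "I-" + chunk_label
            triples.append((word, tag, iob))
    return triples


# Opening of the slide's sentence, with spaces between all tokens.
sample = "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ"
triples = chunked_to_iob(sample)
```

Per-token labels like these are a common shape for training a chunker as a classifier, since each token becomes one labeled instance.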
  19. Just Brackets
     [ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
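Going from the tagged format to the brackets-only format on slide 19, and then pulling out the bracketed phrases, can be sketched with the standard library. Both helper names are hypothetical, and the code assumes whitespace-separated tokens with brackets that are never nested.

```python
import re


def strip_pos_tags(chunked):
    """Drop the '/TAG' suffix from each token, keeping the brackets."""
    out = []
    for token in chunked.split():
        if token in ("[", "]"):
            out.append(token)
        else:
            # Keep everything before the last slash as the word.
            out.append(token.rsplit("/", 1)[0])
    return " ".join(out)


def bracketed_phrases(text):
    """Return the contents of each '[ ... ]' span, in order."""
    return re.findall(r"\[ (.*?) \]", text)


# End of the slide's sentence in the tagged format.
tagged = (
    "[ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
brackets_only = strip_pos_tags(tagged)
phrases = bracketed_phrases(brackets_only)
```

Annotating text in the brackets-only form needs no part-of-speech knowledge, which fits the earlier point that tag annotation is time consuming while bracketing phrases is much easier.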
  20. NLP at Weotta
     • Parsing & information extraction
     • Text cleaning & normalization (more parsing)
     • Text & keyword classification
     • De-duplication
     • Search indexing / IR
     • Sentiment analysis
     • Human integration