NLTK: the Good, the Bad, and the Awesome

Presented by Jacob Perkins at the first meeting of the Bay Area NLP group: http://www.meetup.com/Bay-Area-NLP/events/16522295/


Transcript

  • 1. NLTK: The Good, the Bad, and the Awesome
  • 2. Jacob Perkins
      • Python Text Processing with NLTK 2.0 Cookbook
      • streamhacker.com
      • weotta.com
      • text-processing.com
      • @japerk
  • 3. The Good
      • Makes NLProc easier and more accessible
      • Python (great learning language)
      • Lots of documentation (and 2 books!)
      • Designed for training custom models
      • Includes many training corpora
      • Many algorithms to experiment with
  • 4. The Bad
      • NLProc is hard
      • Few out-of-the-box solutions (see Pattern)
      • Not designed for big data (see Mahout)
      • Doesn’t have the latest algorithms (see Scikits-Learn)
      • No online or active learning algorithms
  • 5. More Bad
      • Doesn’t play nice with pip or easy_install
      • Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)
      • Models can use a lot of memory (& disk if pickled)
  • 6. The Awesome
      • Great for education and research
      • Lots of users & active community
      • Extensible interfaces
      • Training algorithms span human languages
  • 7. More Awesome
      • Trained models can be very fast
      • Well known algorithms can be very accurate
      • NLTK-Trainer (train models with 0 code)
      • Corpus bootstrapping
  • 8. Some Numbers
      • 3 Classification Algorithms
      • 9 Part-of-Speech Tagging Algorithms
      • Stemming Algorithms for 15 Languages
      • 5 Word Tokenization Algorithms
      • Sentence Tokenizers for 16 Languages
      • 60 included corpora
  • 9. Text-Processing.com
      • NLTK Demos & APIs
      • Sentiment Analysis
      • Part-of-Speech Tagging & Chunking / NER
      • Stemming
      • Tokenization
  • 10. Memory Usage
      • text-processing.com
  • 11. CPU Usage
      • text-processing.com
  • 12. NLTK-Trainer
      • https://github.com/japerk/nltk-trainer
      • 3 Training Command Scripts
          ‣ train_classifier.py
          ‣ train_tagger.py
          ‣ train_chunker.py
      • Easy to tweak training parameters
      • Duck-Typed corpus reading
  • 13. Training Classifiers
      • train_classifier.py movie_reviews --instances paras
      • train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
      • train_classifier.py movie_reviews --instances paras --classifier MEGAM
      • train_classifier.py movie_reviews --instances paras --cross-fold 10
      • Pickled models are saved in ~/nltk_data/classifiers/
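The `--ngrams 1 --ngrams 2` options above ask for both single words and word pairs as classification features. As a rough illustration of that idea only (a minimal sketch, not nltk-trainer's actual code; the function name `ngram_features` is hypothetical), a bag-of-n-grams feature dictionary can be built like this:

```python
def ngram_features(words, ns=(1, 2)):
    """Return a {feature: True} dict containing every n-gram for each n in ns.

    Hypothetical helper for illustration; not part of NLTK or nltk-trainer.
    """
    feats = {}
    for n in ns:
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            # Unigrams are keyed by the bare word, longer n-grams by a tuple.
            feats[gram[0] if n == 1 else gram] = True
    return feats


words = "a truly great movie".split()
features = ngram_features(words)
```

Dictionaries of this shape (feature name mapped to a value) are the usual input to a feature-based classifier; the bigram keys let the model see short phrases such as ("truly", "great") in addition to isolated words.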
  • 14. Training Taggers
      • train_tagger.py treebank
      • train_tagger.py treebank --sequential ubt --brill
      • train_tagger.py treebank --sequential '' --classifier NaiveBayes
      • train_tagger.py mac_morpho --simplify_tags
      • Pickled models are saved in ~/nltk_data/taggers/
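The `--sequential` option refers to sequential backoff tagging: each tagger handles the words it knows and hands the rest to the next tagger in the chain. Assuming `ubt` stands for a unigram, bigram, trigram chain, the core mechanism can be sketched in plain Python with just a unigram stage and a constant fallback. The class names below are hypothetical and deliberately simpler than NLTK's own tagger classes:

```python
from collections import Counter, defaultdict


class ConstantTagger:
    """Last link in the chain: assigns the same tag to every word."""

    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag


class SimpleUnigramTagger:
    """Tags a word with its most frequent training tag, else backs off."""

    def __init__(self, tagged_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        # Keep only the single most common tag seen for each word.
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.best:
            return self.best[word]
        return self.backoff.tag_word(word) if self.backoff else None


def tag_sentence(tagger, words):
    return [(w, tagger.tag_word(w)) for w in words]


train = [
    [("the", "DT"), ("board", "NN")],
    [("will", "MD"), ("join", "VB"), ("the", "DT"), ("board", "NN")],
]
tagger = SimpleUnigramTagger(train, backoff=ConstantTagger("NN"))
tagged = tag_sentence(tagger, ["will", "join", "the", "director"])
```

Here "director" never appears in the training data, so the unigram stage defers to the constant tagger, which labels it NN. Bigram and trigram stages would sit in front of the unigram stage in the same way, each using more context and each backing off when that context is unseen.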
  • 15. Training Chunkers
      • train_chunker.py treebank_chunk
      • train_chunker.py treebank_chunk --classifier NaiveBayes
      • train_chunker.py conll2000 --fileids train.txt
      • Pickled models are saved in ~/nltk_data/chunkers/
  • 16. Corpus Bootstrapping
      • Guess & Correct is easier than starting from scratch
      • Use an existing model for initial guesses
      • emoticons
          ‣ :) = “pos”
          ‣ :( = “neg”
      • ratings
          ‣ 5 stars = “pos”
          ‣ 1 star = “neg”
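The emoticon rule above can be sketched as a small labeling pass that produces initial guesses for later human correction. This is only an illustration of the idea: the function names and sample texts are made up, and skipping texts with mixed or missing emoticons is an assumption rather than something the slides specify.

```python
def guess_label(text):
    """Return "pos", "neg", or None using the emoticon rule from the slide."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    # Ambiguous or emoticon-free text is left unlabeled (an assumption).
    return None


def bootstrap(texts):
    """Group texts by guessed label, dropping the ones with no clear guess."""
    corpus = {"pos": [], "neg": []}
    for text in texts:
        label = guess_label(text)
        if label is not None:
            corpus[label].append(text)
    return corpus


# Made-up sample texts for illustration.
texts = [
    "loved the tacos :)",
    "terrible service :(",
    "it was fine",
    "great food :) but slow :(",
]
corpus = bootstrap(texts)
```

The resulting groups are guesses, not gold labels; the point of bootstrapping is that reviewing and correcting them is cheaper than annotating from nothing.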
  • 17. Portuguese Phrase Extraction & Classification
      • similar to condensr.com
      • Brazilian Portuguese
      • aspect classification is easy with a training corpus
      • need a chunked corpus for phrase extraction
      • use mac_morpho & nltk-trainer to train an initial tagger
      • part-of-speech tag annotation is time consuming
      • simplified tags are much easier
      • bracketed phrases w/out pos tags
  • 18. treebank_chunk
      [ Pierre/NNP Vinken/NNP ]
      ,/,
      [ 61/CD years/NNS ]
      old/JJ ,/, will/MD join/VB
      [ the/DT board/NN ]
      as/IN
      [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
      ./.
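To make the bracketed word/TAG format concrete, here is a minimal sketch that converts such a sentence into (word, tag, IOB label) triples. It assumes whitespace-separated tokens and that every bracket marks a noun phrase (NP); the function name is hypothetical and this is not NLTK's own corpus reader.

```python
def chunk_line_to_iob(line, chunk_type="NP"):
    """Convert '[ w/T w/T ] w/T ...' text into (word, tag, IOB label) triples."""
    triples = []
    inside = False  # True while between "[" and "]"
    first = False   # True until the first token of a chunk has been emitted
    for token in line.split():
        if token == "[":
            inside, first = True, True
        elif token == "]":
            inside = False
        else:
            # Split on the last slash so tokens such as ",/," and "./." work.
            word, _, tag = token.rpartition("/")
            if inside:
                label = ("B-" if first else "I-") + chunk_type
                first = False
            else:
                label = "O"
            triples.append((word, tag, label))
    return triples


sentence = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
iob = chunk_line_to_iob(sentence)
```

The first word of each bracketed group gets a B- label, the remaining words in the group get I-, and everything outside brackets gets O, which is the token-level view a chunker is typically trained on.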
  • 19. Just Brackets
      [ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
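With brackets only and no part-of-speech tags, pulling out the annotated phrases is a simple pattern match. A minimal sketch, assuming brackets are never nested (the function name is hypothetical):

```python
import re


def bracketed_phrases(text):
    """Return the text inside each [ ... ] pair, with outer spaces trimmed."""
    return [m.strip() for m in re.findall(r"\[([^\[\]]*)\]", text)]


text = (
    "[ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] "
    "as [ a nonexecutive director Nov. 29 ] ."
)
phrases = bracketed_phrases(text)
```

This shows why bracket-only annotation is lighter work for annotators: they mark phrase boundaries and nothing else, and the phrases are still trivially recoverable.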
  • 20. NLP at Weotta
      • Parsing & information extraction
      • Text cleaning & normalization (more parsing)
      • Text & keyword classification
      • De-duplication
      • Search indexing / IR
      • Sentiment analysis
      • Human integration