Statistical Machine Learning for Text Classification with scikit-learn and NLTK
1. Statistical Learning for Text Classification with scikit-learn and NLTK
   Olivier Grisel
   http://twitter.com/ogrisel
   PyCon – 2011
2. Outline
   ● Why text classification?
   ● What is text classification?
   ● How?
     ● scikit-learn
     ● NLTK
     ● Google Prediction API
   ● Some results
3. Applications of Text Classification
   Task                                       Predicted outcome
   Spam filtering                             Spam, Ham, Priority
   Language guessing                          English, Spanish, French, ...
   Sentiment analysis for product reviews     Positive, Neutral, Negative
   News feed topic categorization             Politics, Business, Technology, Sports, ...
   Pay-per-click optimal ads placement        Will yield clicks / Won't
   Recommender systems                        Will I buy this book? / I won't
4. Supervised Learning Overview
   ● Convert training data to a set of vectors of features
   ● Build a model based on the statistical properties of features in the training set, e.g.
     ● Naïve Bayesian Classifier
     ● Logistic Regression
     ● Support Vector Machines
   ● For each new text document to classify
     ● Extract features
     ● Ask the model to predict the most likely outcome
5. Summary
   [Diagram: training phase — text documents, images, sounds, etc. are turned into
   feature vectors which, together with labels, feed a machine learning algorithm;
   prediction phase — a new document's feature vector goes through the predictive
   model to produce the expected label.]
6. Bags of Words
   ● Tokenize document: list of uni-grams
     [the, quick, brown, fox, jumps, over, the, lazy, dog]
   ● Binary occurrences / counts: {the: True, quick: True...}
   ● Frequencies: {the: 0.22, quick: 0.11, brown: 0.11, fox: 0.11...}
   ● TF-IDF: {the: 0.001, quick: 0.05, brown: 0.06, fox: 0.24...}
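The three representations above can be sketched with the standard library alone (a toy illustration on the slide's example sentence, not the scikit-learn or NLTK API):

```python
from collections import Counter

tokens = "the quick brown fox jumps over the lazy dog".split()

# Binary occurrences: did the token appear at all?
binary = {t: True for t in tokens}

# Raw counts per token
counts = Counter(tokens)

# Frequencies: counts normalized by document length
freqs = {t: c / len(tokens) for t, c in counts.items()}

print(round(freqs["the"], 2))  # "the" occurs 2 times out of 9 tokens -> 0.22
```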
7. Better than frequencies: TF-IDF
   ● Term Frequency
   ● Inverse Document Frequency
   ● Non-informative words such as "the" are scaled down
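The intuition can be shown with the classic tf × idf product (one common variant; scikit-learn's actual weighting adds smoothing and normalization, and the corpus below is made up for illustration):

```python
import math

docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["a", "quick", "dog"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency in this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # rare terms get a higher weight
    return tf * idf

# "the" appears in 2 of 3 documents, "fox" in only 1,
# so "fox" outweighs "the" in the first document:
print(tf_idf("the", docs[0], docs) < tf_idf("fox", docs[0], docs))  # True
```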
8. Even better features
   ● bi-grams of words:
     ● "New York", "very bad", "not good"
   ● n-grams of chars:
     ● "the", "ed ", " a " (useful for language guessing)
   ● Combine with:
     ● Binary occurrences
     ● Frequencies
     ● TF-IDF
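Character n-grams are easy to extract by hand (a toy sketch of the idea, not the CharNGramAnalyzer shown later):

```python
def char_ngrams(text, n):
    """All contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("lazy dog", 3))
# ['laz', 'azy', 'zy ', 'y d', ' do', 'dog']
```

Note how the grams spanning the space (`'zy '`, `'y d'`, `' do'`) capture word boundaries, which is exactly what makes them useful for language guessing.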
9. scikit-learn
10. scikit-learn
   ● BSD licensed
   ● numpy / scipy / cython / C++ wrappers
   ● Many state-of-the-art implementations
   ● A new release every 3 months
   ● 17 contributors on release 0.7
   ● Not just for text classification
11. Features Extraction in scikit-learn
from scikits.learn.features.text import WordNGramAnalyzer
text = (u"J'ai mang\xe9 du kangourou ce midi,"
        u" c'\xe9tait pas tr\xe8s bon.")
WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
[u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
 u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
 u'kangourou ce', u'ce midi', u'midi etait', u'etait pas',
 u'pas tres', u'tres bon']
12. Features Extraction in scikit-learn
from scikits.learn.features.text import CharNGramAnalyzer
analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
char_ngrams = analyzer.analyze(text)
print char_ngrams[:5] + char_ngrams[-5:]
[u"j'a", u"'ai", u"ai ", u"i m", u" ma",
 u"s tres", u" tres ", u"tres b", u"res bo", u"es bon"]
13. TF-IDF features & SVMs
from scikits.learn.features.text.sparse import Vectorizer
from scikits.learn.sparse.svm.sparse import LinearSVC

vec = Vectorizer(analyzer=analyzer)
features = vec.fit_transform(list_of_documents)
clf = LinearSVC(C=100).fit(features, labels)

clf2 = pickle.loads(pickle.dumps(clf))
predicted_labels = clf2.predict(features_of_new_docs)
14. [Diagram: the same training/prediction pipeline as slide 5, annotated with the API
   calls — vec.transform(docs) produces the feature vectors, clf.fit(X, y) trains the
   model, and clf.predict(vec.transform(new_docs)) yields the expected labels for new
   documents.]
15. NLTK
   ● Code: ASL 2.0 & Book: CC-BY-NC-ND
   ● Tokenizers, Stemmers, Parsers, Classifiers, Clusterers, Corpus Readers
16. NLTK Corpus Downloader
>>> import nltk
>>> nltk.download()
17. Using an NLTK corpus
>>> from nltk.corpus import movie_reviews as reviews
>>> pos_ids = reviews.fileids('pos')
>>> neg_ids = reviews.fileids('neg')
>>> len(pos_ids), len(neg_ids)
(1000, 1000)
>>> reviews.words(pos_ids[0])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
18. Common data cleanup operations
   ● Lower-case & remove accentuated chars:
import unicodedata
s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
            if unicodedata.category(c) != 'Mn')
   ● Extract only word tokens of at least 2 chars
     ● Using NLTK tokenizers & stemmers
     ● Using a simple regexp:
re.compile(r"\b\w\w+\b", re.U).findall(s)
19. Feature Extraction with NLTK: Unigram features
def word_features(words):
    return dict((word, True) for word in words)
20. Feature Extraction with NLTK: Bigram Collocations
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM
from itertools import chain

def bigram_features(words, score_fn=BAM.chi_sq):
    bg_finder = BigramCollocationFinder.from_words(words)
    bigrams = bg_finder.nbest(score_fn, 100000)
    return dict((bg, True) for bg in chain(words, bigrams))
21. The NLTK Naïve Bayes Classifier
from nltk.classify import NaiveBayesClassifier

neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
train_set = pos_examples + neg_examples

classifier = NaiveBayesClassifier.train(train_set)
22. Most informative features
>>> classifier.show_most_informative_features()
 magnificent = True    pos : neg = 15.0 : 1.0
 outstanding = True    pos : neg = 13.6 : 1.0
   insulting = True    neg : pos = 13.0 : 1.0
  vulnerable = True    pos : neg = 12.3 : 1.0
   ludicrous = True    neg : pos = 11.8 : 1.0
      avoids = True    pos : neg = 11.7 : 1.0
 uninvolving = True    neg : pos = 11.7 : 1.0
  astounding = True    pos : neg = 10.3 : 1.0
 fascination = True    pos : neg = 10.3 : 1.0
     idiotic = True    neg : pos =  9.8 : 1.0
23. Training NLTK classifiers
   ● Try nltk-trainer
   ● python train_classifier.py --instances paras
       --classifier NaiveBayes --bigrams
       --min_score 3 movie_reviews
24. REST services
25. NLTK – Online demos
26. NLTK – REST APIs
% curl -d "text=Inception is the best movie ever" \
    http://text-processing.com/api/sentiment/
{
  "probability": {
    "neg": 0.36647424288117808,
    "pos": 0.63352575711882186
  },
  "label": "pos"
}
27. Google Prediction API
28. Typical performance results: movie reviews
   ● nltk:
     ● unigram occurrences
     ● Naïve Bayesian Classifier          ~ 70%
   ● Google Prediction API                ~ 83%
   ● scikit-learn:
     ● TF-IDF unigram features
     ● LinearSVC                          ~ 87%
   ● nltk:
     ● Collocation features selection
     ● Naïve Bayesian Classifier          ~ 97%
29. Typical results: newsgroups topics classification
   ● 20 newsgroups dataset
     ● ~ 19K short text documents
     ● 20 categories
     ● By-date train / test split
   ● Bigram TF-IDF + LinearSVC: ~ 87%
30. Confusion Matrix (20 newsgroups)
00 alt.atheism
01 comp.graphics
02 comp.os.ms-windows.misc
03 comp.sys.ibm.pc.hardware
04 comp.sys.mac.hardware
05 comp.windows.x
06 misc.forsale
07 rec.autos
08 rec.motorcycles
09 rec.sport.baseball
10 rec.sport.hockey
11 sci.crypt
12 sci.electronics
13 sci.med
14 sci.space
15 soc.religion.christian
16 talk.politics.guns
17 talk.politics.mideast
18 talk.politics.misc
19 talk.religion.misc
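A confusion matrix just counts (true label, predicted label) pairs, with rows for true classes and columns for predictions; a minimal sketch on hypothetical spam/ham data:

```python
def confusion_matrix(y_true, y_pred, labels):
    """m[i][j] = number of samples with true label i predicted as label j."""
    index = {label: i for i, label in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

print(confusion_matrix(["spam", "ham", "spam"],
                       ["spam", "ham", "ham"],
                       ["spam", "ham"]))
# [[1, 1], [0, 1]]  -- one spam message was misclassified as ham
```

Off-diagonal cells are the errors; in the 20 newsgroups results, the largest off-diagonal counts typically sit between closely related categories (e.g. the comp.* groups or talk.politics.* vs talk.religion.misc).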
31. Typical results: Language Identification
   ● 15 Wikipedia articles
   ● [p.text_content() for p in html_tree.findall('//p')]
   ● CharNGramAnalyzer(min_n=1, max_n=3)
   ● TF-IDF
   ● LinearSVC
32. Typical results: Language Identification
   [Confusion matrix figure for the language identification experiment.]
33. Scaling to many possible outcomes
   ● Example: possible outcomes are all the categories of Wikipedia (565,108)
   ● From Document Categorization to Information Retrieval
   ● Fulltext index for TF-IDF similarity queries
   ● Smart way to find the top 30 search keywords
   ● Use Apache Lucene / Solr MoreLikeThisQuery
34. Some pointers
   ● http://scikit-learn.sf.net                doc & examples
     http://github.com/scikit-learn            code
   ● http://www.nltk.org                       code & doc & PDF book
   ● http://streamhacker.com/
     ● Jacob Perkins' blog on NLTK & APIs
   ● https://github.com/japerk/nltk-trainer
   ● http://www.slideshare.net/ogrisel         these slides
   ● http://twitter.com/ogrisel / http://github.com/ogrisel
   Questions?
