Statistical Machine Learning for Text Classification with scikit-learn and NLTK
1. Statistical Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
http://twitter.com/ogrisel
PyCon – 2011
2. Outline
● Why text classification?
● What is text classification?
● How?
● scikit-learn
● NLTK
● Google Prediction API
● Some results
3. Applications of Text Classification
Task                                       Predicted outcome
Spam filtering                             Spam, Ham, Priority
Language guessing                          English, Spanish, French, ...
Sentiment Analysis for Product Reviews     Positive, Neutral, Negative
News Feed Topic Categorization             Politics, Business, Technology, Sports, ...
Pay-per-click optimal ads placement        Will yield clicks / Won't
Recommender systems                        Will I buy this book? / I won't
4. Supervised Learning Overview
● Convert training data to a set of vectors of features
● Build a model based on the statistical properties of
features in the training set, e.g.
● Naïve Bayesian Classifier
● Logistic Regression
● Support Vector Machines
● For each new text document to classify:
● Extract its features
● Ask the model to predict the most likely outcome
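To make the first step concrete, here is a minimal bag-of-words sketch in plain Python (not from the original slides): each document becomes a vector of term counts over a shared vocabulary.

from collections import Counter

docs = ["the spam is spam", "the ham is good"]
vocabulary = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    # count term occurrences and order them by the shared vocabulary
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]

print(vocabulary)          # ['good', 'ham', 'is', 'spam', 'the']
print(to_vector(docs[0]))  # [0, 0, 1, 2, 1]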
5. Summary
[Diagram] Training: text documents / images / sounds are converted to feature vectors and, together with their labels, fed to a machine learning algorithm. Prediction: a new document is converted to a feature vector and passed to the resulting predictive model, which returns the expected label.
10. scikit-learn
● BSD licensed
● numpy / scipy / cython / C++ wrappers
● Many state-of-the-art implementations
● A new release every 3 months
● 17 contributors on release 0.7
● Not just for text classification
11. Features Extraction in scikit-learn
from scikits.learn.features.text import WordNGramAnalyzer
# French for: "I ate kangaroo at noon, it wasn't very good."
text = (u"J'ai mangé du kangourou ce midi,"
        u" c'était pas très bon.")
WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)

[u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
 u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
 u'kangourou ce', u'ce midi', u'midi etait', u'etait pas',
 u'pas tres', u'tres bon']
12. Features Extraction in scikit-learn
from scikits.learn.features.text import CharNGramAnalyzer
analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
char_ngrams = analyzer.analyze(text)
print char_ngrams[:5] + char_ngrams[-5:]
[u"j'a", u"'ai", u'ai ', u'i m', u' ma', u's tres', u' tres ', u'tres b',
u'res bo', u'es bon']
13. TF-IDF features & SVMs
import pickle

from scikits.learn.features.text.sparse import Vectorizer
from scikits.learn.svm.sparse import LinearSVC

vec = Vectorizer(analyzer=analyzer)
features = vec.fit_transform(list_of_documents)
clf = LinearSVC(C=100).fit(features, labels)

# the trained model survives a serialization round-trip
clf2 = pickle.loads(pickle.dumps(clf))
predicted_labels = clf2.predict(features_of_new_docs)
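A possible follow-up (not on the original slide): evaluate on held-out documents, assuming Vectorizer also exposes transform() for reusing the fitted vocabulary; list_of_test_documents and true_test_labels are hypothetical placeholders.

# reuse the fitted vocabulary on unseen documents (transform, not fit_transform)
features_test = vec.transform(list_of_test_documents)
predicted = clf.predict(features_test)

# plain accuracy: fraction of correctly predicted labels
n_correct = sum(1 for p, t in zip(predicted, true_test_labels) if p == t)
print n_correct / float(len(true_test_labels))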
14. )
cs
(do
form
ns
.tra
Training features
Text c
Documents, ve vectors
Images,
Sounds...
) Machine
X,y
fit(
Learning
.
clf Algorithm
Labels
w)
_ ne w)
cs _n
e
(do X
m ct(
sfor ed
i
New an . pr
Text c.t
r clf
e features
Document, v vector Predictive Expected
Image, Model Label
Sound
15. NLTK
● Code: Apache License 2.0 & Book: CC-BY-NC-ND
● Tokenizers, Stemmers, Parsers, Classifiers, Clusterers, Corpus Readers
17. Using an NLTK corpus
>>> from nltk.corpus import movie_reviews as reviews
>>> pos_ids = reviews.fileids('pos')
>>> neg_ids = reviews.fileids('neg')
>>> len(pos_ids), len(neg_ids)
(1000, 1000)
>>> reviews.words(pos_ids[0])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
18. Common data cleanup operations
● Lower-case & remove accented chars:
import unicodedata
s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
            if unicodedata.category(c) != 'Mn')
● Extract only word tokens of at least 2 chars
● Using NLTK tokenizers & stemmers
● Using a simple regexp (combined with the snippet above in the sketch below):
re.compile(r"\b\w\w+\b", re.U).findall(s)
19. Feature Extraction with NLTK
Unigram features
def word_features(words):
    # every word becomes a boolean "presence" feature
    return dict((word, True) for word in words)
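For example (a usage note, not on the slide), this produces the dict-style featureset that the NLTK classifiers expect:

>>> sorted(word_features(['nice', 'little', 'film']).items())
[('film', True), ('little', True), ('nice', True)]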
20. Feature Extraction with NLTK
Bigram Collocations
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM
from itertools import chain
def bigram_features(words, score_fn=BAM.chi_sq):
    # find the 100,000 best bigram collocations by chi-squared score
    bg_finder = BigramCollocationFinder.from_words(words)
    bigrams = bg_finder.nbest(score_fn, 100000)
    # keep both the unigrams and the selected bigrams as features
    return dict((bg, True) for bg in chain(words, bigrams))
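A quick usage note (not on the slide): with a short text every bigram survives the nbest cut, so collocations appear alongside the unigrams:

>>> feats = bigram_features("the movie was not bad at all".split())
>>> ('not', 'bad') in feats
True
>>> 'movie' in feats
True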
21. The NLTK Naïve Bayes Classifier
from nltk.classify import NaiveBayesClassifier

# `features` is one of the extractors from the previous slides,
# e.g. word_features or bigram_features
neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
train_set = pos_examples + neg_examples
classifier = NaiveBayesClassifier.train(train_set)
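A hedged evaluation sketch (not on the original slide): hold out the last 100 examples of each class, retrain on the rest, then inspect accuracy and the most informative features:

from nltk.classify.util import accuracy

train_set = pos_examples[:900] + neg_examples[:900]
test_set = pos_examples[900:] + neg_examples[900:]
classifier = NaiveBayesClassifier.train(train_set)
print accuracy(classifier, test_set)
classifier.show_most_informative_features(10)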
33. Typical results: Language Identification
● 15 Wikipedia articles
● [p.text_content() for p in html_tree.findall('//p')]
● CharNGramAnalyzer(min_n=1, max_n=3)
● TF-IDF
● LinearSVC
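A sketch wiring the components listed above together, reusing the 0.x scikits.learn API from the earlier slides; paragraphs, languages and new_paragraphs are hypothetical lists of text snippets, their language labels, and unseen snippets:

from scikits.learn.features.text import CharNGramAnalyzer
from scikits.learn.features.text.sparse import Vectorizer
from scikits.learn.svm.sparse import LinearSVC

# character n-grams (n = 1..3) make robust language fingerprints
analyzer = CharNGramAnalyzer(min_n=1, max_n=3)
vec = Vectorizer(analyzer=analyzer)  # TF-IDF weighting
features = vec.fit_transform(paragraphs)
clf = LinearSVC().fit(features, languages)
predicted = clf.predict(vec.transform(new_paragraphs))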
35. Scaling to many possible outcomes
● Example: possible outcomes are all the categories of Wikipedia (565,108)
● From Document Categorization to Information Retrieval
● Fulltext index for TF-IDF similarity queries
● Smart way to find the top 30 search keywords
● Use Apache Lucene / Solr MoreLikeThisQuery
36. Some pointers
● http://scikit-learn.sf.net (doc & examples)
● http://github.com/scikit-learn (code)
● http://www.nltk.org (code, doc & PDF book)
● http://streamhacker.com/ (Jacob Perkins' blog on NLTK & APIs)
● https://github.com/japerk/nltk-trainer
● http://www.slideshare.net/ogrisel (these slides)
● http://twitter.com/ogrisel / http://github.com/ogrisel
Questions?