Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 

Presentation Transcript

    • Statistical Learning for Text Classification with scikit-learn and NLTK
      Olivier Grisel
      http://twitter.com/ogrisel
      PyCon 2011
    • Outline
      ● Why text classification?
      ● What is text classification?
      ● How?
        ● scikit-learn
        ● NLTK
        ● Google Prediction API
      ● Some results
    • Applications of Text Classification
      Task                                      Predicted outcome
      Spam filtering                            Spam, Ham, Priority
      Language guessing                         English, Spanish, French, ...
      Sentiment Analysis for Product Reviews    Positive, Neutral, Negative
      News Feed Topic Categorization            Politics, Business, Technology, Sports, ...
      Pay-per-click optimal ads placement       Will yield clicks / Won't
      Recommender systems                       Will I buy this book? / I won't
    • Supervised Learning Overview
      ● Convert training data to a set of vectors of features
      ● Build a model based on the statistical properties of the features in the training set, e.g.
        ● Naïve Bayesian Classifier
        ● Logistic Regression
        ● Support Vector Machines
      ● For each new text document to classify
        ● Extract its features
        ● Ask the model to predict the most likely outcome
    • Summary
      [Diagram: training documents (text, images, sounds, ...) are converted to feature
      vectors and, together with their labels, fed to a machine learning algorithm that
      produces a predictive model; a new document is converted to a feature vector in the
      same way and the model predicts its expected label.]
    • Bags of Words
      ● Tokenize document: list of uni-grams
        [the, quick, brown, fox, jumps, over, the, lazy, dog]
      ● Binary occurrences / counts: {the: True, quick: True...}
      ● Frequencies: {the: 0.22, quick: 0.11, brown: 0.11, fox: 0.11…}
      ● TF-IDF: {the: 0.001, quick: 0.05, brown: 0.06, fox: 0.24…}
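      A minimal sketch in plain Python (the helper name and sample sentence are
      illustrations only, not part of the slides) of the first three representations
      above; TF-IDF is the subject of the next slide:

      from collections import Counter

      def bag_of_words(tokens):
          # Term counts, binary occurrences and term frequencies for one document.
          counts = Counter(tokens)
          binary = dict((t, True) for t in counts)
          n = float(len(tokens))
          freqs = dict((t, c / n) for t, c in counts.items())
          return binary, counts, freqs

      tokens = "the quick brown fox jumps over the lazy dog".split()
      binary, counts, freqs = bag_of_words(tokens)
      print(freqs["the"])   # 2 / 9, i.e. the ~0.22 shown above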
    • Better than frequencies: TF-IDF
      ● Term Frequency
      ● Inverse Document Frequency
      ● Non-informative words such as “the” are scaled down
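      As an illustration only (the exact weighting, smoothing and normalization vary by
      implementation), a toy TF-IDF computation in which a word that appears in every
      document ends up with zero weight:

      import math
      from collections import Counter

      def tf_idf(documents):
          # Toy weighting: tf(t, d) * log(n_docs / df(t)).
          n_docs = len(documents)
          df = Counter(t for doc in documents for t in set(doc))   # document frequency
          weighted = []
          for doc in documents:
              n = float(len(doc))
              weighted.append(dict((t, (c / n) * math.log(n_docs / float(df[t])))
                                   for t, c in Counter(doc).items()))
          return weighted

      docs = [d.split() for d in ("the quick brown fox",
                                  "the lazy dog",
                                  "the fox jumps over the dog")]
      print(tf_idf(docs)[0])   # "the" occurs in every document, so its weight is 0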
    • Even better features
      ● bi-grams of words:
        ● “New York”, “very bad”, “not good”
      ● n-grams of chars:
        ● “the”, “ed ”, “ a ” (useful for language guessing)
      ● Combine with:
        ● Binary occurrences
        ● Frequencies
        ● TF-IDF
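      A plain-Python sketch of these two kinds of n-grams (illustrative helpers only;
      the library analyzers appear on the following slides):

      def word_ngrams(tokens, n=2):
          # Adjacent word n-grams, e.g. bi-grams such as "not good".
          return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

      def char_ngrams(text, min_n=1, max_n=3):
          # All character n-grams of length min_n to max_n.
          return [text[i:i + n]
                  for n in range(min_n, max_n + 1)
                  for i in range(len(text) - n + 1)]

      print(word_ngrams("this is not good at all".split()))
      # ['this is', 'is not', 'not good', 'good at', 'at all']
      print(char_ngrams("the fox", 3, 3)[:3])
      # ['the', 'he ', 'e f']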
    • scikit-learn
    • scikit-learn
      ● BSD
      ● numpy / scipy / cython / C++ wrappers
      ● Many state-of-the-art implementations
      ● A new release every 3 months
      ● 17 contributors on release 0.7
      ● Not just for text classification
    • Feature Extraction in scikit-learn
      from scikits.learn.features.text import WordNGramAnalyzer

      text = (u"J'ai mang\xe9 du kangourou ce midi,"
              u" c'\xe9tait pas tr\xe8s bon.")

      WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)

      [u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
       u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
       u'kangourou ce', u'ce midi', u'midi etait', u'etait pas',
       u'pas tres', u'tres bon']
    • Feature Extraction in scikit-learn
      from scikits.learn.features.text import CharNGramAnalyzer

      analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
      char_ngrams = analyzer.analyze(text)

      print char_ngrams[:5] + char_ngrams[-5:]

      [u"j'a", u"'ai", u'ai ', u'i m', u' ma',
       u's tres', u' tres ', u'tres b', u'res bo', u'es bon']
    • TF-IDF features & SVMs
      import pickle

      from scikits.learn.features.text.sparse import Vectorizer
      from scikits.learn.sparse.svm.sparse import LinearSVC

      vec = Vectorizer(analyzer=analyzer)
      features = vec.fit_transform(list_of_documents)

      clf = LinearSVC(C=100).fit(features, labels)
      clf2 = pickle.loads(pickle.dumps(clf))

      predicted_labels = clf2.predict(features_of_new_docs)
    • [Diagram: the same summary pipeline annotated with the scikit-learn calls:
      vec.transform(docs) builds the training feature vectors, clf.fit(X, y) trains the
      model on vectors and labels, then vec.transform(new_docs) and clf.predict(X_new)
      produce the expected label for a new document.]
    • NLTK
      ● Code: ASL 2.0 & Book: CC-BY-NC-ND
      ● Tokenizers, Stemmers, Parsers, Classifiers, Clusterers, Corpus Readers
    • NLTK Corpus Downloader
      >>> import nltk
      >>> nltk.download()
    • Using an NLTK corpus
      >>> from nltk.corpus import movie_reviews as reviews

      >>> pos_ids = reviews.fileids('pos')
      >>> neg_ids = reviews.fileids('neg')
      >>> len(pos_ids), len(neg_ids)
      (1000, 1000)

      >>> reviews.words(pos_ids[0])
      ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
    • Common data cleanup operations
      ● Lower case & remove accentuated chars:

        import unicodedata
        s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
                    if unicodedata.category(c) != 'Mn')

      ● Extract only word tokens of at least 2 chars
        ● Using NLTK tokenizers & stemmers
        ● Using a simple regexp:
          re.compile(r"\b\w\w+\b", re.U).findall(s)
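      Put together, the two cleanup steps look roughly like this (a sketch; the function
      name and sample string are made up for illustration):

      # -*- coding: utf-8 -*-
      import re
      import unicodedata

      TOKEN_RE = re.compile(r"\b\w\w+\b", re.U)

      def clean_tokens(s):
          # Lowercase, strip accents, then keep word tokens of at least 2 chars.
          s = u''.join(c for c in unicodedata.normalize('NFD', s.lower())
                       if unicodedata.category(c) != 'Mn')
          return TOKEN_RE.findall(s)

      print(clean_tokens(u"C'\xe9tait PAS tr\xe8s bon !"))
      # ['etait', 'pas', 'tres', 'bon']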
    • Feature Extraction with NLTK: Unigram features
      def word_features(words):
          return dict((word, True) for word in words)
    • Feature Extraction with NLTK: Bigram Collocations
      from nltk.collocations import BigramCollocationFinder
      from nltk.metrics import BigramAssocMeasures as BAM
      from itertools import chain

      def bigram_features(words, score_fn=BAM.chi_sq):
          bg_finder = BigramCollocationFinder.from_words(words)
          bigrams = bg_finder.nbest(score_fn, 100000)
          return dict((bg, True) for bg in chain(words, bigrams))
    • The NLTK Naïve Bayes Classifier
      from nltk.classify import NaiveBayesClassifier

      neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
      pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
      train_set = pos_examples + neg_examples

      classifier = NaiveBayesClassifier.train(train_set)
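      To turn this into a score, a held-out split plus nltk.classify.util.accuracy can
      be used (a sketch reusing the example lists built above; the 100-document split
      size is an arbitrary choice, not from the slides):

      from nltk.classify import NaiveBayesClassifier
      from nltk.classify.util import accuracy

      # Hold out the first 100 documents of each class for testing (arbitrary split).
      test_set = pos_examples[:100] + neg_examples[:100]
      train_set = pos_examples[100:] + neg_examples[100:]

      classifier = NaiveBayesClassifier.train(train_set)
      print(accuracy(classifier, test_set))   # around 0.7 with the unigram features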
    • Most informative features
      >>> classifier.show_most_informative_features()
      magnificent = True    pos : neg  =  15.0 : 1.0
      outstanding = True    pos : neg  =  13.6 : 1.0
      insulting   = True    neg : pos  =  13.0 : 1.0
      vulnerable  = True    pos : neg  =  12.3 : 1.0
      ludicrous   = True    neg : pos  =  11.8 : 1.0
      avoids      = True    pos : neg  =  11.7 : 1.0
      uninvolving = True    neg : pos  =  11.7 : 1.0
      astounding  = True    pos : neg  =  10.3 : 1.0
      fascination = True    pos : neg  =  10.3 : 1.0
      idiotic     = True    neg : pos  =   9.8 : 1.0
    • Training NLTK classifiers
      ● Try nltk-trainer
      ● python train_classifier.py --instances paras --classifier NaiveBayes --bigrams --min_score 3 movie_reviews
    • REST services
    • NLTK – Online demos
    • NLTK – REST APIs
      % curl -d "text=Inception is the best movie ever" \
          http://text-processing.com/api/sentiment/

      {
        "probability": {
          "neg": 0.36647424288117808,
          "pos": 0.63352575711882186
        },
        "label": "pos"
      }
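      The same call from Python (a sketch using the requests package, which is not part
      of the slides; the endpoint and form field are taken from the curl example above):

      import requests

      r = requests.post("http://text-processing.com/api/sentiment/",
                        data={"text": "Inception is the best movie ever"})
      print(r.json()["label"])   # "pos"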
    • Google Prediction API
    • Typical performance results: movie reviews
      ● nltk:
        ● unigram occurrences
        ● Naïve Bayesian Classifier          ~ 70%
      ● Google Prediction API                ~ 83%
      ● scikit-learn:
        ● TF-IDF unigram features
        ● LinearSVC                          ~ 87%
      ● nltk:
        ● Collocation features selection
        ● Naïve Bayesian Classifier          ~ 97%
    • Typical results: newsgroups topic classification
      ● 20 newsgroups dataset
        ● ~ 19K short text documents
        ● 20 categories
        ● By-date train / test split
      ● Bigram TF-IDF + LinearSVC: ~ 87%
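      A sketch of that experiment written against today's scikit-learn package names
      (the slides use the older scikits.learn 0.7 layout, so treat the imports here as a
      modern equivalent rather than the original code):

      from sklearn.datasets import fetch_20newsgroups
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC
      from sklearn.metrics import accuracy_score

      # By-date train / test split, as on the slide.
      train = fetch_20newsgroups(subset='train')
      test = fetch_20newsgroups(subset='test')

      # Unigram + bigram TF-IDF features.
      vec = TfidfVectorizer(ngram_range=(1, 2))
      X_train = vec.fit_transform(train.data)
      X_test = vec.transform(test.data)

      clf = LinearSVC().fit(X_train, train.target)
      print(accuracy_score(test.target, clf.predict(X_test)))   # close to the ~87% quoted above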
    • Confusion Matrix (20 newsgroups)
      [Figure: confusion matrix over the 20 categories listed below]
      00 alt.atheism
      01 comp.graphics
      02 comp.os.ms-windows.misc
      03 comp.sys.ibm.pc.hardware
      04 comp.sys.mac.hardware
      05 comp.windows.x
      06 misc.forsale
      07 rec.autos
      08 rec.motorcycles
      09 rec.sport.baseball
      10 rec.sport.hockey
      11 sci.crypt
      12 sci.electronics
      13 sci.med
      14 sci.space
      15 soc.religion.christian
      16 talk.politics.guns
      17 talk.politics.mideast
      18 talk.politics.misc
      19 talk.religion.misc
    • Typical results: Language Identification
      ● 15 Wikipedia articles
      ● [p.text_content() for p in html_tree.findall('//p')]
      ● CharNGramAnalyzer(min_n=1, max_n=3)
      ● TF-IDF
      ● LinearSVC
    • Typical results: Language Identification
    • Scaling to many possible outcomes
      ● Example: possible outcomes are all the categories of Wikipedia (565,108)
      ● From Document Categorization to Information Retrieval
      ● Fulltext index for TF-IDF similarity queries
      ● Smart way to find the top 30 search keywords
      ● Use Apache Lucene / Solr MoreLikeThisQuery
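      For small corpora the "TF-IDF similarity queries" idea can be sketched directly in
      scikit-learn (Lucene / Solr is what actually scales to Wikipedia size; the data
      below is purely hypothetical):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import linear_kernel

      # Hypothetical data: one reference text per category, plus a document to categorize.
      category_texts = ["text describing category 0", "text describing category 1"]
      new_doc = "document to categorize"

      vec = TfidfVectorizer()
      index = vec.fit_transform(category_texts)     # the "fulltext index" as a TF-IDF matrix
      query = vec.transform([new_doc])

      # TF-IDF vectors are L2-normalized, so the dot product is the cosine similarity.
      similarities = linear_kernel(query, index).ravel()
      print(similarities.argsort()[::-1][:30])      # indices of the 30 most similar categories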
    • Some pointers
      ● http://scikit-learn.sf.net (doc & examples)
        http://github.com/scikit-learn (code)
      ● http://www.nltk.org (code & doc & PDF book)
      ● http://streamhacker.com/
        ● Jacob Perkins' blog on NLTK & APIs
      ● https://github.com/japerk/nltk-trainer
      ● http://www.slideshare.net/ogrisel (these slides)
      ● http://twitter.com/ogrisel / http://github.com/ogrisel

      Questions?