Statistical Machine Learning for Text Classification with scikit-learn and NLTK

Olivier Grisel, Software Engineer at Inria
Statistical Learning for Text Classification
         with scikit-learn and NLTK

                Olivier Grisel
         http://twitter.com/ogrisel

              PyCon – 2011
Outline
●   Why text classification?
●   What is text classification?
●   How?
    ●   scikit-learn
    ●   NLTK
    ●   Google Prediction API
●   Some results
Applications of Text Classification
Task                                      Predicted outcome

Spam filtering                            Spam, Ham, Priority
Language guessing                         English, Spanish, French, ...
Sentiment analysis for product reviews    Positive, Neutral, Negative
News feed topic categorization            Politics, Business, Technology, Sports, ...
Pay-per-click optimal ads placement       Will yield clicks / Won't
Recommender systems                       Will I buy this book? / I won't
Supervised Learning Overview
●   Convert training data to a set of vectors of features
●   Build a model based on the statistical properties of
    features in the training set, e.g.
    ●   Naïve Bayesian Classifier
    ●   Logistic Regression
    ●   Support Vector Machines
●   For each new text document to classify
    ●   Extract features
    ●   Ask the model to predict the most likely outcome
Summary
●   Training: text documents, images, sounds... -> feature vectors
●   Feature vectors + labels -> machine learning algorithm -> predictive model
●   Prediction: new text document, image, sound -> feature vector -> predictive model -> expected label
Bags of Words
●   Tokenize document: list of uni-grams
      ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
●   Binary occurrences / counts:
      {'the': True, 'quick': True...}
●   Frequencies:
       {'the': 0.22, 'quick': 0.11, 'brown': 0.11, 'fox': 0.11…}
●   TF-IDF
      {'the': 0.001, 'quick': 0.05, 'brown': 0.06, 'fox': 0.24…}
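The first three representations above can be reproduced with the Python standard library alone. The following is a minimal sketch, assuming a naive whitespace split of an already lower-cased sentence; the variable names are illustrative and are not part of scikit-learn or NLTK.

```python
from collections import Counter

# Tokenize: a naive whitespace split gives the list of uni-grams.
tokens = "the quick brown fox jumps over the lazy dog".split()

# Binary occurrences: is the word present in the document at all?
occurrences = {word: True for word in tokens}

# Counts: how many times each word appears.
counts = Counter(tokens)

# Frequencies: counts normalized by the document length.
frequencies = {word: count / len(tokens) for word, count in counts.items()}

print(round(frequencies["the"], 2), round(frequencies["quick"], 2))
```

With nine tokens, "the" (two occurrences) gets a frequency of about 0.22 and every other word about 0.11, matching the figures on the slide.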
Better than frequencies: TF-IDF
●   Term Frequency



●   Inverse Document Frequency




●   Non-informative words such as “the” are scaled down
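The formulas on this slide did not survive extraction, so the sketch below assumes one common textbook variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). The function name `tf_idf` and the toy documents are hypothetical; libraries such as scikit-learn use smoothed and normalized variants that give different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            # Assumed variant: tf = count / len(doc), idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox and the dog".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Because "the" occurs in every document, its inverse document frequency is log(1) = 0 and its weight vanishes, while a rarer word such as "quick" keeps a higher weight than "fox", which appears in two documents.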
Even better features
●   bi-grams of words:
    ●   “New York”, “very bad”, “not good”
●   n-grams of chars:
    ●   “the”, “ed ”, “ a ” (useful for language guessing)
●   Combine with:
    ●   Binary occurrences
    ●   Frequencies
    ●   TF-IDF
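Both kinds of n-grams can be extracted with a sliding window. This is a minimal standard-library sketch; the helper names `word_ngrams` and `char_ngrams` are hypothetical and do not refer to the scikit-learn analyzers shown later.

```python
def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return every run of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(word_ngrams("this is not good".split(), 2))
print(char_ngrams("the cat", 3))
```

The resulting n-grams can then be turned into binary occurrences, frequencies, or TF-IDF weights exactly as for single words.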
scikit-learn
●   BSD
●   numpy / scipy / cython / c++ wrappers
●   Many state of the art implementations
●   A new release every 3 months
●   17 contributors on release 0.7
●   Not just for text classification
Feature Extraction in scikit-learn
from scikits.learn.features.text import WordNGramAnalyzer
text = (u"J'ai mang\xe9 du kangourou ce midi,"
        u" c'\xe9tait pas tr\xe8s bon.")

WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
[u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
 u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
 u'kangourou ce', u'ce midi', u'midi etait', u'etait pas', u'pas tres',
 u'tres bon']
Feature Extraction in scikit-learn
from scikits.learn.features.text import CharNGramAnalyzer


analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
char_ngrams = analyzer.analyze(text)


print char_ngrams[:5] + char_ngrams[-5:]
[u"j'a", u"'ai", u'ai ', u'i m', u' ma', u's tres', u' tres ', u'tres b',
u'res bo', u'es bon']
TF-IDF features & SVMs
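import pickle
from scikits.learn.features.text.sparse import Vectorizer
from scikits.learn.sparse.svm.sparse import LinearSVC

# Turn raw documents into sparse TF-IDF vectors using the analyzer above
vec = Vectorizer(analyzer=analyzer)
features = vec.fit_transform(list_of_documents)

# Train a linear SVM on the vectorized documents
clf = LinearSVC(C=100).fit(features, labels)

# A fitted model can be serialized and restored with pickle
clf2 = pickle.loads(pickle.dumps(clf))
predicted_labels = clf2.predict(features_of_new_docs)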
Summary, annotated with scikit-learn calls
●   Training text documents, images, sounds... -> vec.transform(docs) -> feature vectors
●   Feature vectors + labels -> clf.fit(X, y) -> machine learning algorithm -> predictive model
●   New text document, image, sound -> vec.transform(docs_new) -> feature vector
●   Feature vector -> clf.predict(X_new) -> predictive model -> expected label
NLTK
●   Code: ASL 2.0 & Book: CC-BY-NC-ND
●   Tokenizers, Stemmers, Parsers, Classifiers,
    Clusterers, Corpus Readers
NLTK Corpus Downloader
>>> import nltk
>>> nltk.download()
Using an NLTK corpus
>>> from nltk.corpus import movie_reviews as reviews


>>> pos_ids = reviews.fileids('pos')
>>> neg_ids = reviews.fileids('neg')
>>> len(pos_ids), len(neg_ids)
(1000, 1000)


>>> reviews.words(pos_ids[0])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
Common data cleanup operations
●   Lower case & remove accented chars:
import re
import unicodedata
s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
             if unicodedata.category(c) != 'Mn')
●   Extract only word tokens of at least 2 chars
    ●   Using NLTK tokenizers & stemmers
    ●   Using a simple regexp:
        re.compile(r"\b\w\w+\b", re.U).findall(s)
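The two cleanup steps can be combined into a single helper. This is a minimal Python 3 sketch using only the standard library; the name `clean_tokens` is hypothetical, and the sample sentence is the French one used in the scikit-learn slides.

```python
import re
import unicodedata

# Word tokens of at least two characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b", re.UNICODE)


def clean_tokens(text):
    """Lower-case, strip accents, and keep word tokens of 2+ characters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category 'Mn' (nonspacing mark) covers the combining accents
    # that NFD splits off from their base letters.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return TOKEN_PATTERN.findall(stripped)


sample = "J'ai mang\u00e9 du kangourou ce midi, c'\u00e9tait pas tr\u00e8s bon."
print(clean_tokens(sample))
```

Single-letter fragments such as the "j" of "j'ai" are dropped by the two-character minimum, which is why the unigram list starts at "ai".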
Feature Extraction with NLTK
                   Unigram features




def word_features(words):
   return dict((word, True) for word in words)
Feature Extraction with NLTK
                     Bigram Collocations
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM
from itertools import chain


def bigram_features(words, score_fn=BAM.chi_sq):
   bg_finder = BigramCollocationFinder.from_words(words)
   bigrams = bg_finder.nbest(score_fn, 100000)
   return dict((bg, True) for bg in chain(words, bigrams))
The NLTK Naïve Bayes Classifier
from nltk.classify import NaiveBayesClassifier

# features is one of the extractors defined on the previous slides,
# e.g. word_features or bigram_features
features = word_features

neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
train_set = pos_examples + neg_examples

classifier = NaiveBayesClassifier.train(train_set)
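To make the training step less of a black box, here is a simplified standard-library sketch of a Naive Bayes classifier over the same kind of (feature dict, label) pairs. It is not NLTK's implementation: the function names, the add-one smoothing, and the choice to score only the features present in a document are all assumptions made for illustration, and the four training examples are toy data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(train_set):
    """Count labels and per-label feature occurrences from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in train_set:
        label_counts[label] += 1
        for name in features:
            feature_counts[label][name] += 1
    return label_counts, feature_counts


def classify(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        # Log prior: share of training documents carrying this label.
        score = math.log(n_label / total)
        for name in features:
            # Add-one smoothing (an assumption of this sketch) so that
            # unseen features never produce a zero probability.
            score += math.log((feature_counts[label][name] + 1) / (n_label + 2))
        scores[label] = score
    return max(scores, key=scores.get)


train_set = [
    ({"magnificent": True, "film": True}, "pos"),
    ({"outstanding": True, "film": True}, "pos"),
    ({"idiotic": True, "film": True}, "neg"),
    ({"ludicrous": True, "plot": True}, "neg"),
]
model = train_naive_bayes(train_set)
print(classify(model, {"magnificent": True, "film": True}))
print(classify(model, {"idiotic": True}))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.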
Most informative features
>>> classifier.show_most_informative_features()
     magnificent = True              pos : neg    =     15.0 : 1.0
     outstanding = True              pos : neg    =     13.6 : 1.0
       insulting = True              neg : pos    =     13.0 : 1.0
      vulnerable = True              pos : neg    =     12.3 : 1.0
       ludicrous = True              neg : pos    =     11.8 : 1.0
          avoids = True              pos : neg    =     11.7 : 1.0
     uninvolving = True              neg : pos    =     11.7 : 1.0
      astounding = True              pos : neg    =     10.3 : 1.0
     fascination = True              pos : neg    =     10.3 : 1.0
         idiotic = True              neg : pos    =      9.8 : 1.0
Training NLTK classifiers
●   Try nltk-trainer

●   python train_classifier.py --instances paras \
    --classifier NaiveBayes --bigrams \
    --min_score 3 movie_reviews
REST services
NLTK – Online demos
NLTK – REST APIs
% curl -d "text=Inception is the best movie ever" \
            http://text-processing.com/api/sentiment/


{
    "probability": {
         "neg": 0.36647424288117808,
         "pos": 0.63352575711882186
    },
    "label": "pos"
}
Google Prediction API
Typical performance results:
                   movie reviews
●   nltk:
    ● unigram occurrences
    ● Naïve Bayesian Classifier          ~ 70%
●   Google Prediction API                ~ 83%
●   scikit-learn:
    ●  TF-IDF unigram features
    ● LinearSVC                          ~ 87%
●   nltk:
    ●   Collocation features selection
    ●   Naïve Bayesian Classifier        ~ 97%
Typical results:
    newsgroups topics classification

●   20 newsgroups dataset
    ●   ~ 19K short text documents
    ●   20 categories
    ●   By date train / test split

●   Bigram TF-IDF + LinearSVC:       ~ 87%
Confusion Matrix (20 newsgroups)
00 alt.atheism
01 comp.graphics
02 comp.os.ms-windows.misc
03 comp.sys.ibm.pc.hardware
04 comp.sys.mac.hardware
05 comp.windows.x
06 misc.forsale
07 rec.autos
08 rec.motorcycles
09 rec.sport.baseball
10 rec.sport.hockey
11 sci.crypt
12 sci.electronics
13 sci.med
14 sci.space
15 soc.religion.christian
16 talk.politics.guns
17 talk.politics.mideast
18 talk.politics.misc
19 talk.religion.misc
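The confusion matrix image itself is not part of this transcript. As a reminder of what it contains, the sketch below counts (true label, predicted label) pairs and derives the accuracy; the helper names and the four toy predictions are made up for illustration and are not results from the talk.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label matches the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy labels, not measured results.
y_true = ["sci.med", "sci.space", "sci.space", "rec.autos"]
y_pred = ["sci.med", "sci.space", "sci.med", "rec.autos"]
matrix = confusion_matrix(y_true, y_pred)
print(matrix[("sci.space", "sci.med")], accuracy(y_true, y_pred))
```

Off-diagonal cells such as ("sci.space", "sci.med") show which categories the classifier confuses with each other.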
Typical results:
           Language Identification


●   15 Wikipedia articles
●   [p.text_content() for p in html_tree.findall('//p')]
●   CharNGramAnalyzer(min_n=1, max_n=3)
●   TF-IDF
●   LinearSVC
Scaling to many possible outcomes
●   Example: possible outcomes are all the
    categories of Wikipedia (565,108)
●   From Document Categorization
                           to Information Retrieval
●   Fulltext index for TF-IDF similarity queries
●   Smart way to find the top 30 search keywords
●   Use Apache Lucene / Solr MoreLikeThisQuery
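The idea of treating categorization as a similarity query can be sketched without a search engine. The example below ranks a tiny in-memory index by cosine similarity between sparse TF-IDF-style vectors; it only illustrates the principle and does not reproduce how Lucene or Solr implement MoreLikeThisQuery. The function names, category names, and weights are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, top_n=3):
    """Rank indexed category names by similarity to the query vector."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical index: one weighted term vector per category.
index = {
    "Astronomy": {"star": 0.8, "planet": 0.6},
    "Cooking": {"recipe": 0.9, "oven": 0.4},
}
query = {"planet": 1.0, "orbit": 0.5}
print(most_similar(query, index, top_n=1))
```

A linear scan like this does not scale to hundreds of thousands of categories, which is where an inverted full-text index becomes useful.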
Some pointers
●   http://scikit-learn.sf.net               doc & examples
    http://github.com/scikit-learn                     code
●   http://www.nltk.org              code & doc & PDF book
●   http://streamhacker.com/
    ●   Jacob Perkins' blog on NLTK & APIs
●   https://github.com/japerk/nltk-trainer
●   http://www.slideshare.net/ogrisel             these slides
●   http://twitter.com/ogrisel / http://github.com/ogrisel

                             Questions?