Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Big data analysis relies on handy tools that make it easy to gain insight from data. In this talk, the speaker demonstrates a data mining flow for text classification built from several Python tools. The flow consists of feature extraction and selection, model training and tuning, and evaluation. The tools used are Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.

  • 1. Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
      Jimmy Lai
      r97922028 [at] ntu.edu.tw
      http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
      2013/02/17
  • 2. Critical Technologies for Big Data Analysis
      • Data sources: User Generated Content, Machine Generated Data
      • Stages: Collecting, Storage, Computing, Analysis, Visualization
      • Languages: C/JAVA (Infrastructure), Python/R (Analysis), Javascript (Visualization)
      • Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
  • 3. Fast prototyping - IPython Notebook
      • Write Python code in the browser:
          – Exploit the remote server resources
          – View the graphical results in the web page
          – Sketch code pieces as blocks
          – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for more introduction.
  • 4. Demo Code
      • Demo Code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare
      • IPython Notebook:
          – Install
            $ pip install ipython
          – Execution (under ipython_demo dir)
            $ ipython notebook --pylab=inline
          – Open notebook with browser, e.g.
  • 5. Machine learning classification
      • $X_i = [x_1, x_2, \ldots, x_n],\ x_n \in \mathbb{R}$
      • $y_i \in \mathbb{N}$
      • $dataset = (X, Y)$
      • classifier $f$: $y_i = f(X_i)$
  • 6. Text classification
      • Feature Generation
      • Feature Selection
      • Classification Model Training
      • Model Parameter Tuning
  • 7. Dataset: 20 newsgroups dataset
      • Structured Data:
          From: zyeh@caspian.usc.edu (zhenghao yeh)
          Subject: Re: Newsgroup Split
          Organization: University of Southern California, Los Angeles, CA
          Lines: 18
          Distribution: world
          NNTP-Posting-Host: caspian.usc.edu
      • Text:
          In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu (Chris Herringshaw) writes:
          |> Concerning the proposed newsgroup split, I personally am not in favor of
          |> doing this. I learn an awful lot about all aspects of graphics by reading
          |> this group, from code to hardware to algorithms. I just think making 5
          |> different groups out of this is a wate, and will only result in a few posts
          |> a week per group. I kind of like the convenience of having one big forum
          |> for discussing all aspects of graphics. Anyone else feel this way?
          |> Just curious.
          |>
          |>
          |> Daemon
          |>
          I agree with you. Of cause Ill try to be a daemon :-)
          Yeh
          USC
  • 8. Dataset in sklearn
      • sklearn.datasets
          – Toy datasets
          – Download data from the http://mldata.org repository
      • Data format of a classification problem
          – Dataset
              • data: [raw_data or numerical]
              • target: [int]
              • target_names: [str]
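The dataset format on this slide can be inspected with one of the toy datasets bundled with scikit-learn. This is a minimal sketch, assuming a reasonably recent scikit-learn; the iris dataset is chosen here only because it needs no download, and it is not the dataset used in the talk.

```python
from sklearn.datasets import load_iris

# load_iris() ships with scikit-learn, so nothing is downloaded.
dataset = load_iris()

# data: numerical feature matrix, one row per sample
print(dataset.data.shape)
# target: integer class label for each sample
print(dataset.target[:5])
# target_names: class names, indexed by the integer label
print(list(dataset.target_names))
```

The same three attributes (data, target, target_names) are what the rest of the flow consumes; for text datasets, data holds raw strings instead of numbers.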
  • 9. Feature extraction from structured data (1/2)
      • Count the frequency of each keyword and select the keywords as features: [From, Subject, Organization, Distribution, Lines]
      • E.g.
          From: lerxst@wam.umd.edu (wheres my thing)
          Subject: WHAT car is this!?
          Organization: University of Maryland, College Park
          Distribution: None
          Lines: 15
      • Keyword counts:
          Keyword         Count
          Distribution     2549
          Summary           397
          Disclaimer        125
          File              257
          Expires           116
          Subject         11612
          From            11398
          Keywords          943
          Originator        291
          Organization    10872
          Lines           11317
          Internet          140
          To                106
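One way to produce such a keyword count is sketched below with pandas. The two articles reuse the headers quoted in the slides, with placeholder bodies. The regular expression and the helper `header_keywords` are hypothetical, and the sketch assumes that the header block ends at the first blank line; the talk's demo notebook may do this differently.

```python
import re

import pandas as pd

# Two tiny articles built from the headers quoted in the slides (bodies are placeholders).
articles = [
    "From: lerxst@wam.umd.edu (wheres my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Organization: University of Maryland, College Park\n"
    "Distribution: None\n"
    "Lines: 15\n"
    "\n"
    "body text",
    "From: zyeh@caspian.usc.edu (zhenghao yeh)\n"
    "Subject: Re: Newsgroup Split\n"
    "Organization: University of Southern California, Los Angeles, CA\n"
    "Lines: 18\n"
    "Distribution: world\n"
    "NNTP-Posting-Host: caspian.usc.edu\n"
    "\n"
    "I agree with you.",
]

# Hypothetical pattern: a header keyword is a word at the start of a line, followed by ":".
HEADER = re.compile(r"^([A-Za-z-]+):", re.MULTILINE)


def header_keywords(article):
    """Return the header keywords of one article (assumes headers end at the first blank line)."""
    head = article.split("\n\n", 1)[0]
    return HEADER.findall(head)


# Count how often each keyword appears over the whole collection.
keyword_counts = pd.Series(
    [keyword for article in articles for keyword in header_keywords(article)]
).value_counts()
print(keyword_counts)
```

Keywords that appear in nearly every article (From, Subject, Organization, Lines) are the natural candidates for structured features.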
  • 10. Feature extraction from structured data (2/2)
      • Separate structured data and text data
          – Text data starts from the “Lines:” header
      • Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer
          • E.g. [{‘a’: 1, ‘b’: 1}, {‘c’: 1}] => [[1, 1, 0], [0, 0, 1]]
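The DictVectorizer example from this slide can be reproduced directly. A minimal sketch, assuming default DictVectorizer settings (features sorted by name, sparse output):

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per article, mapping a token (or header keyword) to its value.
records = [{"a": 1, "b": 1}, {"c": 1}]

vectorizer = DictVectorizer()
matrix = vectorizer.fit_transform(records)  # sparse matrix, one column per key

dense = matrix.toarray()
print(vectorizer.feature_names_)  # column order
print(dense)
```

The result is sparse by default, which matters for text data where most articles contain only a small fraction of all tokens.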
  • 11. Text Feature extraction in sklearn
      • sklearn.feature_extraction.text
      • CountVectorizer
          – Transform articles into a token-count matrix
      • TfidfVectorizer
          – Transform articles into a token-TFIDF matrix
      • Usage:
          – fit(): construct the token dictionary from a given dataset
          – transform(): generate the numerical matrix
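The fit() and transform() usage can be sketched as follows. The three training sentences and the new article are made up for illustration; they are not from the 20 newsgroups dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative articles (made up for this sketch).
train_articles = [
    "the car is fast",
    "the graphics card is fast",
    "graphics code and graphics hardware",
]
new_articles = ["fast graphics car"]

# fit() builds the token dictionary; transform() maps articles to a count matrix.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(train_articles)
counts = count_vectorizer.transform(new_articles)

# fit_transform() does both steps at once on the same data.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(train_articles)

print(sorted(count_vectorizer.vocabulary_))
print(counts.toarray())
print(tfidf.shape)
```

Fitting on the training articles only, then calling transform() on new articles, keeps the column layout identical between training and prediction.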
  • 12. Text Feature extraction
      • Analyzer
          – Preprocessor: str -> str
              • Default: lowercase
              • Extra: strip_accents – handle unicode chars
          – Tokenizer: str -> [str]
              • Default: re.findall(ur"(?u)\b\w\w+\b", string)
          – Analyzer: str -> [str]
              1. Call preprocessor and tokenizer
              2. Filter stopwords
              3. Generate n-gram tokens
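The three stages can be inspected one at a time through the builder methods of a vectorizer. A minimal sketch, assuming the built-in English stop word list and unigram plus bigram features; both settings are choices made for this example, not the talk's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed settings for illustration: lowercase, English stop words, 1-grams and 2-grams.
vectorizer = CountVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))

preprocess = vectorizer.build_preprocessor()  # str -> str
tokenize = vectorizer.build_tokenizer()       # str -> [str]
analyze = vectorizer.build_analyzer()         # str -> [str], the full pipeline

text = "The Quick brown fox"
preprocessed = preprocess(text)   # lowercased string
tokens = tokenize(preprocessed)   # words of two or more characters
features = analyze(text)          # stop words removed, then n-grams generated

print(preprocessed)
print(tokens)
print(features)
```

Note that the stop word "the" is dropped before n-grams are built, so no bigram contains it.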
  • 14. Feature Selection
      • Decrease the number of features:
          – Reduce the resource usage for faster learning
          – Remove the most common tokens and the rarest tokens (words that carry less information):
              • Parameters for the Vectorizer:
                  – max_df
                  – min_df
                  – max_features
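The effect of these three parameters can be seen on a tiny corpus. This is a sketch with made-up documents and arbitrarily chosen thresholds; suitable values for real data have to be found by experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: "graphics" is in every one, several words appear only once.
docs = [
    "graphics card review",
    "graphics code sample",
    "graphics hardware news",
    "graphics card news",
]

# No selection: every token becomes a feature.
full = CountVectorizer().fit(docs)

# max_df=0.9 drops tokens found in more than 90% of the documents,
# min_df=2 drops tokens found in fewer than 2 documents.
pruned = CountVectorizer(max_df=0.9, min_df=2).fit(docs)

# max_features=2 keeps only the 2 most frequent tokens in the corpus.
capped = CountVectorizer(max_features=2).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
print(sorted(capped.vocabulary_))
```

A float value for max_df or min_df is read as a fraction of documents, an integer as an absolute document count.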
  • 15. Classification Model Training
      • Common classifiers in sklearn:
          – sklearn.linear_model
          – sklearn.svm
      • Usage:
          – fit(X, Y): train the model
          – predict(X): get the predicted Y
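The fit() and predict() usage can be sketched end to end on a toy problem. The texts, the two class names, and the choice of LinearSVC with TF-IDF features are assumptions made for this example; the talk does not state which classifier the demo uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training articles with integer labels, following the dataset format
# (target: [int], target_names: [str]).
target_names = ["autos", "graphics"]
train_texts = [
    "car engine wheels",
    "car wheels dealer",
    "graphics driver card",
    "graphics driver code",
]
train_labels = [0, 0, 1, 1]

# Feature extraction: articles -> numerical matrix X.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# fit(X, Y) trains the model.
classifier = LinearSVC()
classifier.fit(X_train, train_labels)

# predict(X) returns the predicted Y for new articles,
# which must go through the same fitted vectorizer.
X_new = vectorizer.transform(["graphics code"])
predicted = classifier.predict(X_new)
print([target_names[label] for label in predicted])
```

Any classifier from sklearn.linear_model or sklearn.svm can be swapped in here, since they share the same fit() and predict() interface.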
  • 16. Cross Validation
      • When tuning the parameters of a model, use each article alternately as training data and as testing data, so that the tuned parameters are not specific to a particular subset of articles.
          – from sklearn.cross_validation import KFold
          – for train_index, test_index in KFold(10, 2):
              • train_index = [5 6 7 8 9]
              • test_index = [0 1 2 3 4]
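The slide imports KFold from sklearn.cross_validation, which was the module name when the talk was given. In later scikit-learn releases the class lives in sklearn.model_selection and takes the number of folds at construction and the data in split(). A minimal sketch of the same 10-sample, 2-fold split under the newer API:

```python
from sklearn.model_selection import KFold

# Ten articles, represented here only by their indices.
articles = list(range(10))

# Two folds without shuffling: each half serves once as test data.
folds = list(KFold(n_splits=2).split(articles))

for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

The first iteration matches the slide: indices 5 to 9 for training and 0 to 4 for testing; the second iteration swaps the two halves.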
  • 17. Performance Evaluation
      • sklearn.metrics
          – precision_score
          – recall_score
          – f1_score
      • $\mathrm{precision} = \frac{tp}{tp + fp}$
      • $\mathrm{recall} = \frac{tp}{tp + fn}$
      • $\mathrm{f1score} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
      Source: http://en.wikipedia.org/wiki/Precision_and_recall
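The three scores can be checked against the formulas on a small binary example. The labels below are made up so that tp = 2, fp = 1 and fn = 1.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: class 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Count the outcomes by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Apply the formulas directly.
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# Compare with sklearn.metrics.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(manual_precision, precision)
print(manual_recall, recall)
print(manual_f1, f1)
```

For multi-class problems such as 20 newsgroups, these functions need an explicit `average` argument (for example "macro") to combine the per-class scores.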
  • 18. Visualization
      1. Matplotlib
      2. plot() function of Series, DataFrame
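Both options can be combined: pandas draws through matplotlib and returns a matplotlib Axes for further tweaking. A minimal sketch that plots five of the keyword counts from slide 9; the non-interactive "Agg" backend is an assumption for running outside a notebook, and is unnecessary with inline plotting.

```python
import matplotlib

matplotlib.use("Agg")  # assumed headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Five of the keyword counts listed on slide 9.
keyword_counts = pd.Series(
    {
        "Subject": 11612,
        "From": 11398,
        "Lines": 11317,
        "Organization": 10872,
        "Distribution": 2549,
    }
)

# Series.plot() draws with matplotlib and returns the Axes object.
ax = keyword_counts.plot(kind="bar")
ax.set_xlabel("Keyword")
ax.set_ylabel("Count")
plt.tight_layout()
```

In a notebook the figure appears below the cell; in a script, `plt.savefig(...)` or `plt.show()` would be the final step.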
  • 19. Experiment Result
      • Future work:
          – Feature selection by statistics or dimension reduction
          – Parameter tuning
          – Ensemble models