Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Big data analysis relies on handy tools that make it easy to gain insight from data. In this talk, the speaker demonstrates a data mining flow for text classification using several Python tools. The flow consists of feature extraction/selection, model training/tuning and evaluation. The tools used in the flow include Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.

Transcript

  • 1. Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
    Jimmy Lai, r97922028 [at] ntu.edu.tw
    http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
    2013/02/17
  • 2. Critical Technologies for Big Data Analysis
    • Data sources: User Generated Content, Machine Generated Data
    • Stages: Collecting, Storage, Computing, Analysis, Visualization
    • Languages shown alongside the stages: C/JAVA (labelled Infrastructure), Python/R, Javascript
    • Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
  • 3. Fast prototyping - IPython Notebook
    • Write Python code in the browser:
      – Exploit the remote server resources
      – View the graphical results in a web page
      – Sketch code pieces as blocks
      – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for more introduction.
  • 4. Demo Code
    • Demo code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare
    • IPython Notebook:
      – Install: $ pip install ipython
      – Execution (under the ipython_demo dir): $ ipython notebook --pylab=inline
      – Open the notebook with a browser, e.g. http://127.0.0.1:8888
  • 5. Machine learning classification
    • $X_i = [x_1, x_2, \ldots, x_n],\ x_n \in \mathbb{R}$
    • $y_i \in \mathbb{N}$
    • $\mathit{dataset} = (X, Y)$
    • $\mathit{classifier}\ f: y_i = f(X_i)$
  • 6. Text classification
    • Feature Generation
    • Feature Selection
    • Classification Model Training
    • Model Parameter Tuning
  • 7. Dataset: 20 newsgroups dataset
    • Structured Data:
        From: zyeh@caspian.usc.edu (zhenghao yeh)
        Subject: Re: Newsgroup Split
        Organization: University of Southern California, Los Angeles, CA
        Lines: 18
        Distribution: world
        NNTP-Posting-Host: caspian.usc.edu
    • Text:
        In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu (Chris Herringshaw) writes:
        |> Concerning the proposed newsgroup split, I personally am not in favor of
        |> doing this. I learn an awful lot about all aspects of graphics by reading
        |> this group, from code to hardware to algorithms. I just think making 5
        |> different groups out of this is a wate, and will only result in a few posts
        |> a week per group. I kind of like the convenience of having one big forum
        |> for discussing all aspects of graphics. Anyone else feel this way?
        |> Just curious.
        |>
        |>
        |> Daemon
        |>
        I agree with you. Of cause Ill try to be a daemon :-)
        Yeh
        USC
  • 8. Dataset in sklearn
    • sklearn.datasets
      – Toy datasets
      – Download data from the http://mldata.org repository
    • Data format of a classification problem
      – Dataset
        • data: [raw_data or numerical]
        • target: [int]
        • target_names: [str]
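The data/target/target_names layout described on this slide can be inspected on one of the bundled toy datasets. This is a minimal sketch that assumes a current scikit-learn install and uses `load_iris` only because it ships with the library and needs no download; the talk's own dataset, 20 newsgroups, is exposed through `fetch_20newsgroups`, which returns the same fields with raw text in `data` but has to download the corpus first.

```python
from sklearn.datasets import load_iris

# load_iris is a bundled toy dataset; it stands in here for any sklearn dataset object.
dataset = load_iris()

# data: numerical feature matrix, one row per sample
n_samples, n_features = dataset.data.shape

# target: integer class label for each sample
label_values = sorted(set(int(label) for label in dataset.target))

# target_names: the string name behind each integer label
label_names = [str(name) for name in dataset.target_names]

print(n_samples, n_features, label_values, label_names)
```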
  • 9. Feature extraction from structured data (1/2)
    • Count the frequency of each keyword and select the keywords as features: [From, Subject, Organization, Distribution, Lines]
    • E.g.
        From: lerxst@wam.umd.edu (wheres my thing)
        Subject: WHAT car is this!?
        Organization: University of Maryland, College Park
        Distribution: None
        Lines: 15
    • Keyword counts:
        Keyword         Count
        Distribution     2549
        Summary           397
        Disclaimer        125
        File              257
        Expires           116
        Subject         11612
        From            11398
        Keywords          943
        Originator        291
        Organization    10872
        Lines           11317
        Internet          140
        To                106
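Counting header keywords as described on this slide might look like the sketch below. It is not the demo notebook's code: the two posts are made up, the helper name `header_fields` is hypothetical, and it assumes the header block ends at the first blank line.

```python
from collections import Counter

# Two made-up posts in a header-then-body layout (assumption: a blank line ends the header).
posts = [
    "From: a@example.com\nSubject: hello\nLines: 2\n\nbody text\nmore text",
    "From: b@example.com\nOrganization: Example Org\nLines: 1\n\nSubject: not a header",
]

def header_fields(post):
    """Return {keyword: value} for the lines before the first blank line."""
    header, _, _body = post.partition("\n\n")
    fields = {}
    for line in header.split("\n"):
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that look like "Keyword: value"
            fields[key.strip()] = value.strip()
    return fields

# Count in how many posts each header keyword occurs.
keyword_counts = Counter()
for post in posts:
    keyword_counts.update(header_fields(post).keys())

print(keyword_counts.most_common())
```

Keywords that occur in nearly every post (From, Subject, Lines, and so on in the slide's table) are the natural candidates to keep as features.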
  • 10. Feature extraction from structured data (2/2)
    • Separate structured data and text data
      – Text data starts from “Line:”
    • Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer
      – E.g. [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
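The slide's DictVectorizer example can be reproduced directly. A minimal sketch, assuming a current scikit-learn install; `sparse=False` is chosen here only so the result prints as a dense array.

```python
from sklearn.feature_extraction import DictVectorizer

# The same two records as in the slide's example.
records = [{"a": 1, "b": 1}, {"c": 1}]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

# feature_names_ lists which dictionary key each column stands for.
print(vectorizer.feature_names_)
print(matrix)
```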
  • 11. Text Feature extraction in sklearn
    • sklearn.feature_extraction.text
    • CountVectorizer
      – Transform articles into a token-count matrix
    • TfidfVectorizer
      – Transform articles into a token-TFIDF matrix
    • Usage:
      – fit(): construct the token dictionary from a given dataset
      – transform(): generate the numerical matrix
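The fit()/transform() usage on this slide can be sketched on a few made-up sentences. The sentences and variable names are illustrative assumptions, not taken from the demo notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up "articles".
articles = ["the cat sat", "the dog sat down", "a bird flew"]

# fit(): build the token dictionary; transform(): produce the token-count matrix.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(articles)
counts = count_vectorizer.transform(articles)

# The default tokenizer keeps tokens of two or more characters, so "a" is dropped.
vocabulary = sorted(count_vectorizer.vocabulary_)

# fit_transform() does both steps in one call; here it yields TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(articles)

print(vocabulary)
print(counts.toarray())
print(tfidf.shape)
```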
  • 12. Text Feature extraction
    • Analyzer
      – Preprocessor: str -> str
        • Default: lowercase
        • Extra: strip_accents – handle unicode chars
      – Tokenizer: str -> [str]
        • Default: re.findall(ur"(?u)\b\w\w+\b", string)
      – Analyzer: str -> [str]
        1. Call preprocessor and tokenizer
        2. Filter stopwords
        3. Generate n-gram tokens
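The three analyzer steps can be observed through `CountVectorizer.build_analyzer()`. In this sketch the sentence and the one-word stop list `["with"]` are made up for illustration, and the regular expression is written as a Python 3 raw string rather than the slide's Python 2 `ur"..."` form.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

text = "I agree with You"

# Preprocessor (lowercase) followed by the default tokenizer pattern:
# tokens are runs of two or more word characters, so "I" is dropped.
tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

# The analyzer chains preprocessor and tokenizer, filters stop words,
# then appends n-grams (here unigrams and bigrams).
analyzer = CountVectorizer(stop_words=["with"], ngram_range=(1, 2)).build_analyzer()
analyzed = analyzer(text)

print(tokens)
print(analyzed)
```

Stop words are removed before n-grams are generated, so a bigram can join two words that were not adjacent in the original sentence.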
  • 14. Feature Selection
    • Decrease the number of features:
      – Reduce the resource usage for faster learning
      – Remove the most common tokens and the rarest tokens (words that carry less information)
    • Parameters for the Vectorizer:
      – max_df
      – min_df
      – max_features
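A small sketch of how these vectorizer parameters prune the vocabulary, using four made-up articles. The thresholds are arbitrary values chosen so the effect is visible on such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "common" is in all 4 articles, "banana" in 3, every other token in 1.
articles = [
    "apple banana common",
    "banana cherry common",
    "banana common date",
    "common egg",
]

# min_df=2 (an integer is an absolute count) drops tokens seen in fewer than 2 articles;
# max_df=0.9 (a float is a proportion) drops tokens seen in more than 90% of articles.
df_filtered = CountVectorizer(min_df=2, max_df=0.9).fit(articles)
kept_by_df = sorted(df_filtered.vocabulary_)

# max_features=2 keeps only the 2 tokens with the highest total count in the corpus.
top_k = CountVectorizer(max_features=2).fit(articles)
kept_by_count = sorted(top_k.vocabulary_)

print(kept_by_df)
print(kept_by_count)
```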
  • 15. Classification Model Training
    • Common classifiers in sklearn:
      – sklearn.linear_model
      – sklearn.svm
    • Usage:
      – fit(X, Y): train the model
      – predict(X): get the predicted Y
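The fit(X, Y) / predict(X) usage can be sketched end to end on a made-up two-class corpus. The texts, labels and the choice of a linear-kernel `SVC` are illustrative assumptions; classifiers in `sklearn.linear_model`, such as `LogisticRegression`, expose the same two methods.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training articles: label 0 = graphics, label 1 = cars.
train_texts = [
    "graphics card rendering image",
    "image rendering algorithm graphics",
    "car engine wheel",
    "engine oil car wheel",
]
train_labels = [0, 0, 1, 1]

# Turn the articles into a numerical matrix X.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# fit(X, Y): train the model.
classifier = SVC(kernel="linear")
classifier.fit(X_train, train_labels)

# predict(X): new articles must go through the same fitted vectorizer.
X_new = vectorizer.transform(["graphics image", "car wheel"])
predicted = classifier.predict(X_new).tolist()

print(predicted)
```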
  • 16. Cross Validation
    • When tuning the parameters of a model, use each article alternately as training and testing data, so that the parameters are not fitted to a few specific articles.
      – from sklearn.cross_validation import KFold
      – for train_index, test_index in KFold(10, 2):
        • train_index = [5 6 7 8 9]
        • test_index = [0 1 2 3 4]
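The `sklearn.cross_validation` module used on this slide was removed in later scikit-learn releases, and `KFold` now lives in `sklearn.model_selection` with a slightly different interface. A minimal sketch of the same 10-article, 2-fold split with the newer API, assuming a current scikit-learn install:

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for 10 articles; only the number of samples matters for the split.
articles = np.arange(10)

# n_splits=2 plays the role of the second argument in the slide's KFold(10, 2).
folds = []
for train_index, test_index in KFold(n_splits=2).split(articles):
    folds.append((train_index.tolist(), test_index.tolist()))

for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

Without shuffling, the first fold tests on the first half of the articles and trains on the second half, and the second fold swaps the two roles.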
  • 17. Performance Evaluation
    • sklearn.metrics
      – precision_score
      – recall_score
      – f1_score
    • $\mathit{precision} = \frac{tp}{tp + fp}$
    • $\mathit{recall} = \frac{tp}{tp + fn}$
    • $\mathit{f1score} = 2 \cdot \frac{\mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}$
    Source: http://en.wikipedia.org/wiki/Precision_and_recall
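The three metrics can be checked against their definitions on a small made-up set of labels. In this sketch the true and predicted labels are invented so that tp = 2, fp = 1 and fn = 1.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels: two true positives, one false negative, one false positive.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```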
  • 18. Visualization
    1. Matplotlib
    2. plot() function of Series, DataFrame
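The plot() method of a pandas Series draws through matplotlib. The sketch below plots a few of the header keyword counts from slide 9 as a bar chart; the title, axis label and the non-interactive "Agg" backend are choices made here, not taken from the demo notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Header keyword counts taken from slide 9.
keyword_counts = pd.Series(
    {"Subject": 11612, "From": 11398, "Lines": 11317,
     "Organization": 10872, "Distribution": 2549}
)

# Series.plot returns the matplotlib Axes it drew on.
ax = keyword_counts.plot(kind="bar", title="Header keyword counts")
ax.set_ylabel("count")
ax.figure.tight_layout()
```

In a notebook started with inline plotting, the figure appears under the cell; elsewhere, `ax.figure.savefig(...)` writes it to a file.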
  • 19. Experiment Result
    • Future work:
      – Feature selection by statistics or dimension reduction
      – Parameter tuning
      – Ensemble models