Text classification in scikit-learn


Introduction to scikit-learn, including installation, tutorial and text classification.

Published in: Technology, Education

  1. 1. Text classification in Scikit-learn
     Jimmy Lai
     r97922028 [at] ntu.edu.tw
     http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
     2013/06/17
  2. 2. Outline
     1. Introduction to Data Analysis
     2. Setup packages
     3. Scikit-learn tutorial
     4. Text Classification in Scikit-learn
  3. 3. Critical Technologies for Big Data Analysis
     • Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
     [Diagram labels: Collecting, User Generated Content, Machine Generated Data, Storage, Computing, Analysis, Visualization, Infrastructure, C/JAVA, Python/R, Javascript]
  4. 4. Setup all packages on Ubuntu
     • Packages required:
       – pip
       – Numpy
       – Scipy
       – Matplotlib
       – Scikit-learn
       – Psutil
       – IPython
     • Commands:
         sudo apt-get install python-pip
         sudo apt-get build-dep python-numpy
         sudo apt-get build-dep python-scipy
         sudo apt-get build-dep python-matplotlib
         # install packages in a virtualenv
         pip install numpy
         pip install scipy
         pip install matplotlib
         pip install scikit-learn
         pip install psutil
         pip install ipython
  5. 5. Setup IPython Notebook
     • Install:
         $ pip install ipython
     • Create config:
         $ ipython profile create
     • Edit config:
       – c.NotebookApp.certfile = u'cert_file'
       – c.NotebookApp.password = u'hashed_password'
       – c.IPKernelApp.pylab = 'inline'
     • Run server:
         $ ipython notebook --ip=* --port=9999
     • Generate cert_file:
         $ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
     • Generate hashed_password:
         In [1]: from IPython.lib import passwd
         In [2]: passwd()
     Via http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  6. 6. Fast prototyping - IPython Notebook
     • Write Python code in the browser:
       – Exploit the remote server's resources
       – View graphical results in the web page
       – Sketch pieces of code as blocks
       – See http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a fuller introduction.
  7. 7. Scikit-learn Cheat-sheet
     Via http://peekaboo-vision.blogspot.tw/2013/01/machine-learning-cheat-sheet-for-scikit.html
  8. 8. Scikit-learn Tutorial
     • https://github.com/ogrisel/parallel_ml_tutorial
  9. 9. Demo Code
     • Demo Code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare
     • IPython Notebook:
       – Install:
           $ pip install ipython
       – Execution (under the ipython_demo dir):
           $ ipython notebook --pylab=inline
       – Open notebook with browser, e.g.
  10. 10. Machine learning classification
      • $X_i = [x_1, x_2, \ldots, x_n]$, $x_n \in \mathbb{R}$
      • $y_i \in \mathbb{N}$
      • $dataset = (X, Y)$
      • classifier $f$: $y_i = f(X_i)$
  11. 11. Text classification
      [Diagram stages: Feature Generation, Feature Selection, Classification Model Training, Model Parameter Tuning]
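The stages above can be chained in one object. The following is a minimal sketch, not the demo notebook's code: it assumes scikit-learn's `Pipeline`, folds feature selection into the vectorizer's `min_df`/`max_df` parameters, leaves parameter tuning out, and uses a tiny hypothetical corpus in place of the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical four-article corpus: label 0 = graphics, label 1 = hockey.
texts = [
    "the gpu renders graphics fast",
    "graphics card and image rendering",
    "the team won the hockey game",
    "hockey players score a goal",
]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    # Feature generation; min_df / max_df act as a simple feature selection.
    ("features", TfidfVectorizer(min_df=1, max_df=1.0)),
    # Classification model training.
    ("classifier", LinearSVC()),
])
pipeline.fit(texts, labels)

predicted = pipeline.predict(["new graphics hardware", "hockey season starts"])
print(predicted)
```

Keeping the vectorizer and the classifier in one pipeline means the same token dictionary is applied at training and prediction time.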
  12. 12. Dataset: 20 newsgroups dataset
      [Slide labels: Text, Structured Data]
      Sample article:
        From: zyeh@caspian.usc.edu (zhenghao yeh)
        Subject: Re: Newsgroup Split
        Organization: University of Southern California, Los Angeles, CA
        Lines: 18
        Distribution: world
        NNTP-Posting-Host: caspian.usc.edu

        In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu (Chris Herringshaw) writes:
        |> Concerning the proposed newsgroup split, I personally am not in favor of
        |> doing this. I learn an awful lot about all aspects of graphics by reading
        |> this group, from code to hardware to algorithms. I just think making 5
        |> different groups out of this is a wate, and will only result in a few posts
        |> a week per group. I kind of like the convenience of having one big forum
        |> for discussing all aspects of graphics. Anyone else feel this way?
        |> Just curious.
        |>
        |>
        |> Daemon
        |>

        I agree with you. Of cause Ill try to be a daemon :-)

        Yeh
        USC
  13. 13. Dataset in sklearn
      • sklearn.datasets
        – Toy datasets
        – Download data from the http://mldata.org repository
      • Data format of a classification problem
        – Dataset
          • data: [raw_data or numerical]
          • target: [int]
          • target_names: [str]
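The three fields above can be inspected on any bundled dataset. This sketch assumes the iris toy dataset (`load_iris`) rather than the 20 newsgroups data used in the talk, only because iris ships with scikit-learn and needs no download; for text datasets `data` holds raw strings instead of numbers.

```python
from sklearn.datasets import load_iris

dataset = load_iris()

# data: one row of numerical features per sample
print(dataset.data.shape)
# target: one integer class label per sample
print(dataset.target[:5])
# target_names: the class name behind each integer label
print(list(dataset.target_names))
```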
  14. 14. Feature extraction from structured data (1/2)
      • Count the frequency of each keyword and select keywords as features: [From, Subject, Organization, Distribution, Lines]
      • E.g.
          From: lerxst@wam.umd.edu (wheres my thing)
          Subject: WHAT car is this!?
          Organization: University of Maryland, College Park
          Distribution: None
          Lines: 15

          Keyword        Count
          Distribution    2549
          Summary          397
          Disclaimer       125
          File             257
          Expires          116
          Subject        11612
          From           11398
          Keywords         943
          Originator       291
          Organization   10872
          Lines          11317
          Internet         140
          To               106
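One way to produce such a keyword count is sketched below. It is an illustration, not the talk's code: the regular expression (a capitalized word followed by a colon at the start of a line) and the two sample posts with placeholder bodies are assumptions.

```python
import re
from collections import Counter

# Hypothetical sample posts; the headers follow the slides, the bodies are placeholders.
posts = [
    "From: lerxst@wam.umd.edu (wheres my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Organization: University of Maryland, College Park\n"
    "Distribution: None\n"
    "Lines: 15\n"
    "\n"
    "Placeholder body text.\n",
    "From: zyeh@caspian.usc.edu (zhenghao yeh)\n"
    "Subject: Re: Newsgroup Split\n"
    "Organization: University of Southern California, Los Angeles, CA\n"
    "Lines: 18\n"
    "\n"
    "I agree with you.\n",
]

# Assumed definition of a keyword: a capitalized word at the start of a line
# that is immediately followed by a colon.
KEYWORD = re.compile(r"^([A-Z][\w-]*):", re.MULTILINE)

def count_keywords(documents):
    """Count how often each header-style keyword occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(KEYWORD.findall(doc))
    return counts

counts = count_keywords(posts)
print(counts.most_common())
```

Keywords that occur in nearly every article (From, Subject, Organization, Lines) are the natural candidates for structured features.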
  15. 15. Feature extraction from structured data (2/2)
      • Separate structured data and text data
        – Text data starts after the "Lines:" header
      • Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer
      • E.g.
          [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
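The example mapping can be reproduced directly. This sketch passes `sparse=False` so the result prints as a dense array; that choice is for readability here and is not something the slide specifies.

```python
from sklearn.feature_extraction import DictVectorizer

records = [{"a": 1, "b": 1}, {"c": 1}]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

# Column index assigned to each dictionary key.
print(sorted(vectorizer.vocabulary_.items()))
# One row per record, one column per key (values are floats by default).
print(matrix.tolist())
```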
  16. 16. Text Feature extraction in sklearn
      • sklearn.feature_extraction.text
      • CountVectorizer
        – Transform articles into a token-count matrix
      • TfidfVectorizer
        – Transform articles into a token-TFIDF matrix
      • Usage:
        – fit(): construct the token dictionary from a dataset
        – transform(): generate the numerical matrix
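A short sketch of the fit()/transform() usage on both vectorizers follows. The two articles are hypothetical stand-ins for newsgroup posts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-article corpus.
articles = ["graphics cards render graphics", "hockey players play hockey"]

# fit() builds the token dictionary; transform() produces the count matrix.
count_vec = CountVectorizer()
count_vec.fit(articles)
counts = count_vec.transform(articles)
print(sorted(count_vec.vocabulary_))
print(counts.toarray())

# fit_transform() does both steps in one call; values are TF-IDF weights.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(articles)
print(tfidf.shape)
```

Both vectorizers return sparse matrices, which is why `toarray()` is called before printing.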
  17. 17. Text Feature extraction
      • Analyzer
        – Preprocessor: str -> str
          • Default: lowercase
          • Extra: strip_accents (handles unicode chars)
        – Tokenizer: str -> [str]
          • Default: re.findall(ur"(?u)\b\w\w+\b", string)
        – Analyzer: str -> [str]
          1. Call preprocessor and tokenizer
          2. Filter stopwords
          3. Generate n-gram tokens
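The three analyzer steps can be sketched in plain Python. This is an illustration of the flow, not scikit-learn's implementation: the function names, the stop-word set, and the default `ngram_range` below are assumptions.

```python
import re

# Tokens are runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Hypothetical stop-word list for the example.
STOP_WORDS = {"the", "is", "of"}

def preprocess(text):
    """Preprocessor: str -> str (lowercase only)."""
    return text.lower()

def tokenize(text):
    """Tokenizer: str -> [str]."""
    return TOKEN_PATTERN.findall(text)

def analyze(text, ngram_range=(1, 2)):
    """Analyzer: preprocess and tokenize, filter stop words, build n-grams."""
    tokens = [t for t in tokenize(preprocess(text)) if t not in STOP_WORDS]
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

result = analyze("The split of a Newsgroup")
print(result)  # ['split', 'newsgroup', 'split newsgroup']
```

The single-letter token "a" is dropped by the token pattern itself, before the stop-word filter runs.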
  18. 18. [Figure]
  19. 19. Feature Selection
      • Decrease the number of features:
        – Reduce resource usage for faster learning
        – Remove the most common tokens and the rarest tokens (words that carry less information)
      • Parameters for the Vectorizer:
        – max_df
        – min_df
        – max_features
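The effect of these parameters can be seen on a small corpus. The four articles and the threshold values below are hypothetical, chosen so that one token is too common and four are too rare.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "graphics" is in every article, "driver" in two,
# and every other token in exactly one.
articles = [
    "graphics card driver",
    "graphics hardware driver",
    "graphics algorithm",
    "graphics forum",
]

# max_df=0.75 drops tokens found in more than 75% of the articles;
# min_df=2 drops tokens found in fewer than 2 articles.
pruned = CountVectorizer(max_df=0.75, min_df=2).fit(articles)
print(sorted(pruned.vocabulary_))

# max_features=2 keeps only the 2 most frequent tokens in the corpus.
top = CountVectorizer(max_features=2).fit(articles)
print(sorted(top.vocabulary_))
```

A float value for max_df or min_df is read as a fraction of documents, while an integer is read as an absolute document count.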
  20. 20. Classification Model Training
      • Common classifiers in sklearn:
        – sklearn.linear_model
        – sklearn.svm
      • Usage:
        – fit(X, Y): train the model
        – predict(X): get the predicted Y
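The fit()/predict() usage is the same across classifiers. The sketch below picks `LogisticRegression` and `LinearSVC` as one representative from each module (the slide names only the modules) and uses a hypothetical two-feature dataset.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hypothetical numerical features: class 0 near the origin, class 1 near (1, 1).
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
Y = [0, 0, 1, 1]
X_new = [[0.1, 0.1], [1.0, 1.0]]

predictions = {}
for classifier in (LogisticRegression(), LinearSVC()):
    classifier.fit(X, Y)                          # train the model
    name = type(classifier).__name__
    predictions[name] = classifier.predict(X_new).tolist()  # predicted Y

print(predictions)
```

Because every estimator shares this interface, swapping one classifier for another requires changing only the constructor call.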
  21. 21. Cross Validation
      • When tuning model parameters, use each article alternately as training data and as testing data, so that the parameters are not tailored to specific articles.
        – from sklearn.cross_validation import KFold
        – for train_index, test_index in KFold(10, 2):
          • train_index = [5 6 7 8 9]
          • test_index = [0 1 2 3 4]
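The slide uses the `sklearn.cross_validation` import path from 2013. Later scikit-learn releases moved `KFold` to `sklearn.model_selection`, where the number of folds is given as `n_splits` and the data is passed to `split()`. A sketch of the same 10-article, 2-fold split under that newer API, with placeholder articles:

```python
from sklearn.model_selection import KFold

# Ten placeholder articles; only their count matters for the split.
articles = ["article %d" % i for i in range(10)]

kfold = KFold(n_splits=2)
folds = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in kfold.split(articles)
]

for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

Each article appears in the test set of exactly one fold and in the training set of the other.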
  22. 22. Performance Evaluation
      • $precision = \frac{tp}{tp + fp}$
      • $recall = \frac{tp}{tp + fn}$
      • $f1score = 2 \cdot \frac{precision \times recall}{precision + recall}$
      • sklearn.metrics
        – precision_score
        – recall_score
        – f1_score
      Source: http://en.wikipedia.org/wiki/Precision_and_recall
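A worked example of the three formulas in plain Python follows. The label vectors are hypothetical, and the helper assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (non-zero denominators assumed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical binary labels: 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, precision, recall, f1)
```

Here tp = 3, fp = 1 and fn = 1, so precision and recall are both 3/4 and the F1 score is also 3/4.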
  23. 23. Visualization
      1. Matplotlib
      2. plot() function of Series, DataFrame
  24. 24. Experiment Result
      • Future work:
        – Feature selection by statistics or dimension reduction
        – Parameter tuning
        – Ensemble models