Text classification in scikit-learn

Introduction to scikit-learn, including installation, tutorial and text classification.

  1. Text classification in Scikit-learn
     Jimmy Lai
     r97922028 [at] ntu.edu.tw
     http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
     2013/06/17
  2. Outline
     1. Introduction to Data Analysis
     2. Setup packages
     3. Scikit-learn tutorial
     4. Text Classification in Scikit-learn
  3. Critical Technologies for Big Data Analysis
     • Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
     • Diagram labels: Collecting, User Generated Content, Machine Generated Data, Storage, Computing, Analysis, Visualization, Infrastructure, C/JAVA, Python/R, Javascript
  4. Setup all packages on Ubuntu
     • Packages required:
       – pip
       – Numpy
       – Scipy
       – Matplotlib
       – Scikit-learn
       – Psutil
       – IPython
     • Commands:
       sudo apt-get install python-pip
       sudo apt-get build-dep python-numpy
       sudo apt-get build-dep python-scipy
       sudo apt-get build-dep python-matplotlib
       # install packages in a virtualenv
       pip install numpy
       pip install scipy
       pip install matplotlib
       pip install scikit-learn
       pip install psutil
       pip install ipython
  5. Setup IPython Notebook
     • Install: $ pip install ipython
     • Create config: $ ipython profile create
     • Edit config:
       – c.NotebookApp.certfile = u'cert_file'
       – c.NotebookApp.password = u'hashed_password'
       – c.IPKernelApp.pylab = 'inline'
     • Run server: $ ipython notebook --ip=* --port=9999
     • Generate cert_file: $ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
     • Generate hashed_password:
       In [1]: from IPython.lib import passwd
       In [2]: passwd()
     Via http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  6. Fast prototyping - IPython Notebook
     • Write Python code in the browser:
       – Exploit the remote server's resources
       – View graphical results in the web page
       – Sketch code pieces as blocks
       – Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for more introduction.
  7. Scikit-learn Cheat-sheet
     Via http://peekaboo-vision.blogspot.tw/2013/01/machine-learning-cheat-sheet-for-scikit.html
  8. Scikit-learn Tutorial
     • https://github.com/ogrisel/parallel_ml_tutorial
  9. Demo Code
     • Demo Code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare
     • IPython Notebook:
       – Install: $ pip install ipython
       – Execution (under the ipython_demo dir): $ ipython notebook --pylab=inline
       – Open the notebook with a browser, e.g. http://127.0.0.1:8888
  10. Machine learning classification
      • $X_i = [x_1, x_2, \ldots, x_n],\ x_n \in \mathbb{R}$
      • $y_i \in \mathbb{N}$
      • $\mathit{dataset} = (X, Y)$
      • $\mathit{classifier}\ f: y_i = f(X_i)$
  11. Text classification
      Feature Generation → Feature Selection → Classification Model Training → Model Parameter Tuning
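The four stages above could be wired together roughly as follows. This is only a minimal sketch, not the code from the talk: the toy corpus, the step names `tfidf` and `clf`, and the parameter grid are all assumptions made for illustration. It uses `Pipeline` and `GridSearchCV` from current scikit-learn releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four short documents per class.
docs = [
    "the team won the match",
    "the team lost the game",
    "a great match for the team",
    "the game was a great win",
    "the new phone has a fast chip",
    "the laptop has a new chip",
    "a fast laptop and a new phone",
    "the phone and the laptop are fast",
]
labels = ["sports"] * 4 + ["tech"] * 4

# Feature generation (and min_df-based feature selection) + model training.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Parameter tuning: try every combination with 2-fold cross validation.
param_grid = {"tfidf__min_df": [1, 2], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

prediction = search.predict(["the team won a great game"])
print(search.best_params_, prediction)
```

Keeping the vectorizer inside the pipeline means the vocabulary is rebuilt on each training fold, so the held-out fold does not leak into feature generation.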
  12. Dataset: 20 newsgroups dataset
      Structured Data:
        From: zyeh@caspian.usc.edu (zhenghao yeh)
        Subject: Re: Newsgroup Split
        Organization: University of Southern California, Los Angeles, CA
        Lines: 18
        Distribution: world
        NNTP-Posting-Host: caspian.usc.edu
      Text:
        In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu (Chris Herringshaw) writes:
        |> Concerning the proposed newsgroup split, I personally am not in favor of
        |> doing this. I learn an awful lot about all aspects of graphics by reading
        |> this group, from code to hardware to algorithms. I just think making 5
        |> different groups out of this is a wate, and will only result in a few posts
        |> a week per group. I kind of like the convenience of having one big forum
        |> for discussing all aspects of graphics. Anyone else feel this way?
        |> Just curious.
        |>
        |>
        |> Daemon
        |>
        I agree with you. Of cause I'll try to be a daemon :-)
        Yeh
        USC
  13. Dataset in sklearn
      • sklearn.datasets
        – Toy datasets
        – Download data from the http://mldata.org repository
      • Data format of a classification problem
        – Dataset
          • data: [raw_data or numerical]
          • target: [int]
          • target_names: [str]
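To see the `data` / `target` / `target_names` layout without downloading anything, one of the bundled toy datasets can be inspected. The sketch below uses `load_iris` purely as a stand-in, which is an assumption on my part; the talk itself works with the 20 newsgroups data, which `sklearn.datasets.fetch_20newsgroups` should return in the same layout after downloading it on first use.

```python
from sklearn.datasets import load_iris

# A bundled toy dataset, used here only to show the common layout.
dataset = load_iris()

print(dataset.data.shape)          # numerical feature matrix, one row per sample
print(dataset.target[:5])          # integer class label per sample
print(list(dataset.target_names))  # class names, indexed by the integer label
```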
  14. Feature extraction from structured data (1/2)
      • Count the frequency of each keyword and select keywords as features: ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
      • E.g.
          From: lerxst@wam.umd.edu (where's my thing)
          Subject: WHAT car is this!?
          Organization: University of Maryland, College Park
          Distribution: None
          Lines: 15
      • Keyword counts:
          Keyword         Count
          Distribution     2549
          Summary           397
          Disclaimer        125
          File              257
          Expires           116
          Subject         11612
          From            11398
          Keywords          943
          Originator        291
          Organization    10872
          Lines           11317
          Internet          140
          To                106
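A keyword count like the one on this slide could be produced along these lines. This is a rough sketch under stated assumptions, not the author's code: the helper `count_header_keywords` is hypothetical, headers are assumed to end at the first blank line, and each keyword is counted once per post.

```python
import re
from collections import Counter

# A header line looks like "Keyword: value" (assumed format).
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z-]*):")


def count_header_keywords(posts):
    """Count, for each header keyword, how many posts contain it."""
    counts = Counter()
    for post in posts:
        # Assumption: the header block ends at the first blank line.
        header = post.split("\n\n", 1)[0]
        keywords = set()
        for line in header.splitlines():
            match = HEADER_LINE.match(line)
            if match:
                keywords.add(match.group(1))
        counts.update(keywords)
    return counts


posts = [
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Organization: University of Maryland, College Park\n"
    "Lines: 15\n"
    "\n"
    "body text",
    "From: someone@example.com\n"
    "Subject: Re: hello\n"
    "Lines: 3\n"
    "\n"
    "To: this line is in the body, so it is not counted",
]
keyword_counts = count_header_keywords(posts)
print(keyword_counts.most_common())
```

Keywords that appear in most posts (such as From, Subject and Lines in the table above) are the natural candidates to keep as features.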
  15. Feature extraction from structured data (2/2)
      • Separate structured data and text data
        – Text data starts from "Lines:"
      • Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer
      • E.g. [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
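The example on this slide can be reproduced with `DictVectorizer` directly. A minimal sketch follows; `sparse=False` is chosen here only so the result prints as a dense array, and the values come back as floats rather than the integers shown on the slide.

```python
from sklearn.feature_extraction import DictVectorizer

records = [{"a": 1, "b": 1}, {"c": 1}]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

# Column order: feature names sorted by their assigned column index.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(matrix)
```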
  16. Text feature extraction in sklearn
      • sklearn.feature_extraction.text
      • CountVectorizer
        – Transform articles into a token-count matrix
      • TfidfVectorizer
        – Transform articles into a token-TFIDF matrix
      • Usage:
        – fit(): construct the token dictionary from a given dataset
        – transform(): generate the numerical matrix
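As a small illustration of the `fit()` / `transform()` usage, the sketch below runs both vectorizers with default settings on two made-up articles. The articles are an assumption; nothing here comes from the talk's demo notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up articles for illustration.
articles = ["the cat sat", "the dog sat down"]

# fit() builds the token dictionary; transform() produces the matrix.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(articles)
counts = count_vectorizer.transform(articles)  # SciPy sparse matrix

# Tokens in column order (columns are assigned alphabetically).
vocabulary = sorted(count_vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())

# fit_transform() does both steps at once; values are TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(articles)
print(tfidf.toarray().round(2))
```

Both vectorizers return sparse matrices, which matters for real corpora where most token counts are zero.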
  17. Text feature extraction
      • Analyzer
        – Preprocessor: str -> str
          • Default: lowercase
          • Extra: strip_accents (handles unicode chars)
        – Tokenizer: str -> [str]
          • Default: re.findall(r"(?u)\b\w\w+\b", string)
        – Analyzer: str -> [str]
          1. Call preprocessor and tokenizer
          2. Filter stopwords
          3. Generate n-gram tokens
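The three analyzer steps can be imitated in plain Python to make the data flow concrete. This is a simplified sketch, not scikit-learn's actual implementation: the function `analyze` and its parameters are hypothetical, and only the default token pattern is taken from the library.

```python
import re

# Default token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def analyze(text, stop_words=frozenset(), ngram_range=(1, 1)):
    """Lowercase, tokenize, drop stop words, then build word n-grams."""
    text = text.lower()                    # 1a. preprocessor
    tokens = TOKEN_PATTERN.findall(text)   # 1b. tokenizer
    tokens = [t for t in tokens if t not in stop_words]  # 2. stop words
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):      # 3. n-grams, shortest first
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


features = analyze("I agree with you", stop_words={"with"}, ngram_range=(1, 2))
print(features)
```

Note that the single-character token "I" is dropped by the pattern itself, before any stop-word filtering happens.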
  19. Feature Selection
      • Decrease the number of features:
        – Reduce resource usage for faster learning
        – Remove the most common tokens and the rarest tokens (words that carry less information)
      • Parameters for the Vectorizer:
        – max_df
        – min_df
        – max_features
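The effect of these three parameters can be checked on a tiny corpus. The sketch below is illustrative only: the four articles and the threshold values are assumptions. A float `max_df` is read as a fraction of documents, while an integer `min_df` is an absolute document count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "the" occurs in every article, several words occur once.
articles = ["the cat sat", "the cat ran", "the dog ran", "the bird flew"]

# No pruning: every token with at least two characters is kept.
full_vocab = sorted(CountVectorizer().fit(articles).vocabulary_)

# Drop tokens found in more than 75% of articles or in fewer than 2 articles.
pruned = CountVectorizer(max_df=0.75, min_df=2).fit(articles)
pruned_vocab = sorted(pruned.vocabulary_)

# Keep only the single most frequent token across the corpus.
top_vocab = sorted(CountVectorizer(max_features=1).fit(articles).vocabulary_)

print(full_vocab)
print(pruned_vocab)
print(top_vocab)
```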
  20. Classification Model Training
      • Common classifiers in sklearn:
        – sklearn.linear_model
        – sklearn.svm
      • Usage:
        – fit(X, Y): train the model
        – predict(X): get the predicted Y
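To illustrate the shared `fit` / `predict` interface, here is a small sketch with one classifier from each module named on the slide. The choice of `LogisticRegression` and `LinearSVC`, and the toy points, are assumptions for illustration; in the text setting, `X` would be the matrix produced by a vectorizer.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy, linearly separable points: class 0 on one side, class 1 on the other.
X_train = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
y_train = [0, 0, 1, 1]
X_new = [[-3.0, -3.0], [3.0, 3.0]]

predictions = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_train, y_train)  # train the model
    predictions[type(model).__name__] = model.predict(X_new).tolist()

print(predictions)
```

Because every estimator exposes the same two methods, swapping one classifier for another is a one-line change.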
  21. Cross Validation
      • When tuning the parameters of a model, use each article alternately as training and testing data, so that the parameters are not tailored to some specific articles.
        – from sklearn.cross_validation import KFold
        – for train_index, test_index in KFold(10, 2):
          • train_index = [5 6 7 8 9]
          • test_index = [0 1 2 3 4]
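The `sklearn.cross_validation` module used on this slide was removed in later scikit-learn releases, where `KFold` lives in `sklearn.model_selection` and the sample count is no longer passed to the constructor. A sketch of the equivalent two-fold split over ten samples with the newer API (the variable names are illustrative):

```python
from sklearn.model_selection import KFold

samples = list(range(10))  # ten placeholder samples

# Two folds, no shuffling: each half serves once as the test set.
folds = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in KFold(n_splits=2).split(samples)
]

for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

Without `shuffle=True` the folds are contiguous blocks, so data that is sorted by class should be shuffled or split with a stratified splitter instead.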
  22. Performance Evaluation
      • $\mathit{precision} = \frac{tp}{tp + fp}$
      • $\mathit{recall} = \frac{tp}{tp + fn}$
      • $\mathit{f1score} = 2 \cdot \frac{\mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}$
      • sklearn.metrics
        – precision_score
        – recall_score
        – f1_score
      Source: http://en.wikipedia.org/wiki/Precision_and_recall
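The three formulas can be worked through by hand on a small binary example. The sketch below is plain Python with made-up labels; the helper `precision_recall_f1` is hypothetical and exists only to mirror the definitions, while `sklearn.metrics` provides the functions one would use in practice.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With tp = 3, fp = 1 and fn = 1, precision and recall are both 3/4, so their harmonic mean (F1) is also 3/4.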
  23. Visualization
      1. Matplotlib
      2. plot() function of Series, DataFrame
  24. Experiment Result
      • Future work:
        – Feature selection by statistics or dimension reduction
        – Parameter tuning
        – Ensemble models