Fast data mining flow prototyping using IPython Notebook

Big data analysis requires fast prototyping of the data mining process to gain insight into the data. In these slides, the author shows how to use IPython Notebook to sketch code for each data mining stage and make quick observations.


  1. Fast data mining flow prototyping using IPython Notebook
     2013/01/31
     Jimmy Lai, r97922028 [at] ntu.edu.tw
  2. Outline
     1. Workflow for data mining
     2. What IPython Notebook provides
     3. Exemplified by text classification
     4. Demo code and Notebook usage
  3. Workflow for data mining
     • Traditional programming workflow:
       – Edit -> Compile -> Run
     • Data mining workflow:
       – Execute -> Explore
       – Consists of many data processing stages, and we may do trials in each stage with different methods.
       – Stages: data parsing, feature extraction, feature selection, model training, model predicting, post processing, etc.
  4. What IPython Notebook provides
     • Interactive web IDE
       – Displays rich data, such as matplotlib plots and LaTeX math symbols
       – Code cells for sketching
       – Executes pieces of code in arbitrary order
       – Browser interface for programming remotely
       – Makes it easy to present code and execution results as HTML or PDF
     • IPython Notebook makes sketching data analysis easy.
  5. Demo code and Notebook usage
     • Demo code: ipython_demo directory in https://bitbucket.org/noahsark/slideshare
     • IPython Notebook:
       – Install: $ pip install ipython
       – Execution (under the ipython_demo directory): $ ipython notebook --pylab=inline
       – Open the notebook in a browser, e.g. http://127.0.0.1:8888
  6. IPython Notebook interface
  7. Exemplified by text classification
     • Text classification on the newsgroup dataset.
     • Dataset:
       – Built into sklearn.datasets
       – Each article belongs to one of the 20 groups
     • Goal: classify each article into one of the newsgroup names.
     • Experiment: feature generation using different n-gram parameters.
  8. Example article (talk.politics.mideast)
  9. [Slide with no extracted text]
  10. Sample result of feature extraction
  11. Table of experiment setups
  12. [Slide with no extracted text]
  13. Experiment Result
  14. [Slide with no extracted text]
  15. Observation from plots
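
The staged workflow and the n-gram experiment described in the slides can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the author's demo code: the four-sentence corpus, the two labels, the overlap-based classifier, and all function names (`parse`, `ngrams`, `train`, `predict`, `vocabulary_size`) are assumptions made for the example. The slides themselves use the 20 newsgroups dataset from sklearn.datasets, which is not loaded here. Each function stands for one stage, so a single stage can be re-run or swapped in its own notebook cell.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the 20 newsgroups articles.
DOCS = [
    ("the rocket reached orbit", "sci.space"),
    ("the shuttle left orbit", "sci.space"),
    ("the team won the game", "rec.sport"),
    ("the team lost the game", "rec.sport"),
]


def parse(text):
    """Data parsing stage: lowercase the text and split it on whitespace."""
    return text.lower().split()


def ngrams(tokens, n_min, n_max):
    """Feature extraction stage: count word n-grams for n_min <= n <= n_max."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def train(docs, n_min, n_max):
    """Model training stage: accumulate n-gram counts per label."""
    profiles = {}
    for text, label in docs:
        feats = ngrams(parse(text), n_min, n_max)
        profiles.setdefault(label, Counter()).update(feats)
    return profiles


def predict(profiles, text, n_min, n_max):
    """Prediction stage: choose the label whose profile overlaps the text most."""
    feats = ngrams(parse(text), n_min, n_max)

    def score(label):
        # Counter returns 0 for n-grams the label has never seen.
        return sum(min(c, profiles[label][f]) for f, c in feats.items())

    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(profiles), key=score)


def vocabulary_size(docs, n_min, n_max):
    """Number of distinct n-gram features generated by one experiment setup."""
    vocab = set()
    for text, _ in docs:
        vocab.update(ngrams(parse(text), n_min, n_max))
    return len(vocab)


# Experiment: the same flow run with different n-gram parameters.
SETUPS = [(1, 1), (1, 2), (2, 2)]
results = {setup: vocabulary_size(DOCS, *setup) for setup in SETUPS}

if __name__ == "__main__":
    for setup, size in results.items():
        print("ngram range", setup, "-> distinct features:", size)
    model = train(DOCS, 1, 2)
    print(predict(model, "the shuttle reached orbit", 1, 2))
```

In a notebook, each stage would typically live in its own cell, so changing the n-gram range only requires re-executing the cells from feature extraction onward.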
