Fast data mining flow prototyping using IPython Notebook


Big data analysis requires fast prototyping of the data mining process to gain insight into the data. In these slides, the author shows how to use IPython Notebook to sketch code for each data mining stage and make quick observations.

  1. Fast data mining flow prototyping using IPython Notebook. 2013/01/31, Jimmy Lai, r97922028 [at]
  2. Outline
     1. Workflow for data mining
     2. What IPython Notebook provides
     3. Exemplified by text classification
     4. Demo code and Notebook usage
  3. Workflow for data mining
     • Traditional programming workflow:
       – Edit -> Compile -> Run
     • Data mining workflow:
       – Execute -> Explore
       – Consists of many data processing stages, and each stage may be tried with different methods.
       – Stages: data parsing, feature extraction, feature selection, model training, model prediction, post-processing, etc.
  4. What IPython Notebook provides
     • Interactive Web IDE
       – Displays rich data, such as plots by matplotlib and math symbols by LaTeX
       – Code cells for sketching
       – Execute pieces of code in arbitrary order
       – Browser interface for programming remotely
       – Easy to demonstrate code and execution results in HTML or PDF
     • IPython Notebook makes sketching data analysis easy.
  5. Demo code and Notebook usage
     • Demo code: ipython_demo directory in
     • IPython Notebook:
       – Install: $ pip install ipython
       – Execution (under the ipython_demo dir): $ ipython notebook --pylab=inline
       – Open the notebook with a browser, e.g.
  6. IPython Notebook interface
  7. Exemplified by text classification
     • Text classification on the newsgroup dataset.
     • Dataset:
       – Built into sklearn.datasets
       – Each article belongs to one of the 20 groups
     • Goal: classify each article into one of the newsgroup names.
     • Experiment: feature generation using different n-gram parameters.
  8. Example article (talk.politics.mideast)
  10. Sample result of feature extraction
  11. Table of experiment setups
  13. Experiment result
  15. Observation from plots
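
The stages listed on slide 3 can be sketched as one small function per stage, roughly the way each would occupy its own notebook cell and be re-run or swapped independently. This is a minimal standard-library sketch, not the author's demo code: the toy corpus, the function names, and the word-count classifier are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for a real dataset: (label, text) pairs.
RAW = [
    ("sports", "the team won the game"),
    ("sports", "a great game for the home team"),
    ("politics", "the senate passed the bill"),
    ("politics", "a vote on the bill in the senate"),
]

# Stage 1, data parsing: split raw records into labels and token lists.
def parse(records):
    labels = [label for label, _ in records]
    docs = [text.lower().split() for _, text in records]
    return labels, docs

# Stage 2, feature extraction: bag-of-words counts per document.
def extract_features(docs):
    return [Counter(tokens) for tokens in docs]

# Stage 3, feature selection: drop words that appear in too many documents.
def select_features(features, max_doc_freq):
    doc_freq = Counter(word for f in features for word in f)
    vocabulary = {w for w, df in doc_freq.items() if df <= max_doc_freq}
    kept = [Counter({w: c for w, c in f.items() if w in vocabulary})
            for f in features]
    return kept, vocabulary

# Stage 4, model training: accumulate word counts per label.
def train(labels, features):
    model = defaultdict(Counter)
    for label, f in zip(labels, features):
        model[label].update(f)
    return model

# Stage 5, model prediction: pick the label whose counts best match the text.
def predict(model, vocabulary, text):
    tokens = [t for t in text.lower().split() if t in vocabulary]
    scores = {label: sum(counts[t] for t in tokens)
              for label, counts in model.items()}
    return max(scores, key=scores.get)

# Each intermediate result is kept in a variable, so a single stage can be
# re-executed with a different method without repeating the earlier ones.
labels, docs = parse(RAW)
features = extract_features(docs)
selected, vocabulary = select_features(features, max_doc_freq=2)
model = train(labels, selected)
print(predict(model, vocabulary, "the home team played a game"))
```

Keeping every stage's output in its own variable is what makes the Execute -> Explore loop cheap: changing `max_doc_freq`, for example, only requires re-running the selection, training, and prediction steps.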
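
The experiment on slide 7 varies the n-gram parameters used for feature generation. The transcript does not show the author's code, so the following is only a standard-library sketch of what an n-gram range controls. The function `ngram_features` and the sample sentence are hypothetical. scikit-learn's `CountVectorizer` exposes a comparable `ngram_range` parameter, but whether the slides use it is an assumption.

```python
from collections import Counter

def ngram_features(text, ngram_range=(1, 1)):
    """Count word n-grams for every n in the inclusive range (n_min, n_max)."""
    tokens = text.lower().split()
    n_min, n_max = ngram_range
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample text; a real run would loop over the dataset's articles.
sentence = "the senate passed the bill"

# Compare how the feature space grows under different n-gram settings.
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    features = ngram_features(sentence, ngram_range)
    print(ngram_range, len(features), "distinct features")
```

Wider ranges add more specific features (for example, word pairs) at the cost of a larger, sparser feature space, which is the trade-off an experiment over n-gram settings would explore.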