[FIT2016 Tutorial] Getting Started with Information Processing: Machine Learning Edition


This is the machine learning part of the FIT2016 (Toyama) event session
"Getting Started with Information Processing: Images, Audio, Text, Search, and Learning, All in One Tutorial".

This is my tutorial slide on machine learning for FIT2016.


  1. Getting Started with Information Processing: Machine Learning (and Related Topics) Edition. Toshihiko Yamasaki, Associate Professor, Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo
  2. Today’s agenda • How can we start? • Which algorithm should you choose? • How can you find real data? • Making full use of SVM
  3. Today’s agenda • How can we start? • Which algorithm should you choose? • How can you find real data? • Making full use of SVM
  4. Tools: Python http://scikit-learn.org/
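
  To make the scikit-learn entry concrete, here is a minimal sketch of a first experiment (assuming scikit-learn 0.18 or later for model_selection and return_X_y): train an SVM on the bundled iris dataset and report held-out accuracy. The dataset, split ratio, and default parameters are illustrative choices, not ones prescribed by the slide.

      from sklearn.datasets import load_iris
      from sklearn.model_selection import train_test_split
      from sklearn.svm import SVC

      # Load a small bundled dataset and hold out 30% for testing
      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, random_state=0)

      # Fit an RBF-kernel SVM with default parameters and score it
      clf = SVC(kernel='rbf')
      clf.fit(X_train, y_train)
      print(clf.score(X_test, y_test))  # mean accuracy on the held-out split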
  5. Tools: MATLAB http://jp.mathworks.com/products/statistics/
  6. MATLAB Student Suite http://jp.mathworks.com/academia/student_version/
  7. Tools: R https://www.r-project.org/ http://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html
  8. Tools: Weka
  9. No available PC? Use cloud computers • Such as Amazon EC2, Microsoft Azure, etc. • A virtualized PC on the Internet • See the "Cloud Computing from Scratch" tutorial http://aws.amazon.com/jp/ec2/ https://azure.microsoft.com/ja-jp/
  10. Today’s agenda • How can we start? • Which algorithm should you choose? • How can you find real data? • Making full use of SVM
  11. Which ML algorithm is better? [Caruana, ICML06]
  12. In short • BST-DT > RF > BAG-DT > SVMs > ANN > KNN > BST-STMP > DT > LOGREG > NB • Boosted Trees: tree ensembles like RF, but trained with a boosting technique • Note!! Feature dims. in the study were 10-100 • RF usually requires about 10 × dim training vectors [Caruana, ICML06]
  13. Random Forests and Boosted Trees • RF: www.habe-lab.org/habe/RFtutorial/SSII2013_RFtutorial_Slides.pdf http://www.slideshare.net/HitoshiHabe/ss-58784309 • Boosted Trees: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
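
  As a hedged sketch of the two families, the snippet below trains scikit-learn’s RandomForestClassifier and GradientBoostingClassifier (standing in for generic boosted trees; the Caruana results above used different implementations) on synthetic data and compares cross-validated accuracy.

      from sklearn.datasets import make_classification
      from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      # Synthetic 20-dimensional binary classification problem
      X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

      for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
                  GradientBoostingClassifier(n_estimators=100, random_state=0)):
          scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross validation
          print(type(clf).__name__, scores.mean())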
  14. What should you use anyway? http://scikit-learn.org/stable/modules/kernel_approximation.html
  15. Today’s agenda • How can we start? • Which algorithm should you choose? • How can you find real data? • Making full use of SVM
  16. If you want real data to play with https://www.kaggle.com/
  17. Today’s agenda • How can we start? • Which algorithm should you choose? • How can you find real data? • Making full use of SVM
  18. How do you usually use SVM? • Through Python/Matlab/R/… • In many cases, you are using libSVM • By downloading the binary code • Why not download the source code? http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  19. Which kernel should you use? • The Gaussian kernel is the best in many cases • But it takes a lot of time • The linear kernel performs as well as the Gaussian one • When the data size is large • When the feature dimension is large • You may also consider using liblinear • What else? • You can use your own kernel
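
  The following sketch illustrates the linear-vs-Gaussian trade-off described above: LinearSVC (backed by liblinear) against an RBF-kernel SVC on the same high-dimensional data. The data sizes are arbitrary and the timing is only indicative; the point is that the linear model trains much faster and is often nearly as accurate in this regime.

      import time
      from sklearn.datasets import make_classification
      from sklearn.svm import SVC, LinearSVC

      # Fairly large, high-dimensional synthetic data
      X, y = make_classification(n_samples=5000, n_features=300, random_state=0)

      for clf in (LinearSVC(C=1.0), SVC(kernel='rbf', C=1.0)):
          t0 = time.time()
          clf.fit(X, y)
          print(type(clf).__name__, 'training took %.2f s' % (time.time() - t0))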
  20. Scale the data • Sometimes scaling helps http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f407
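
  A small sketch of why scaling matters, using scikit-learn instead of libSVM’s svm-scale tool: the wine dataset (an assumed stand-in, not from the slide; load_wine needs scikit-learn 0.19+) has features on very different scales, and an RBF-kernel SVM typically scores much higher once each feature is standardized.

      from sklearn.datasets import load_wine
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      X, y = load_wine(return_X_y=True)

      # Without scaling: features with large ranges dominate the RBF distance
      print(cross_val_score(SVC(), X, y, cv=5).mean())

      # With scaling: standardize inside a pipeline so test folds reuse
      # the training folds' statistics
      print(cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean())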
  21. Optimize the parameters • With the command-line binaries use grid.py; for MATLAB, grid.m • They optimize C and g (the RBF gamma) for the Gaussian kernel • You should check the source code • Use n-fold cross validation • Sometimes, split into train, validation, and test data http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  22. Scikit-learn is easier http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a
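
  As a sketch of what grid.py does, expressed with scikit-learn: GridSearchCV cross-validates every (C, gamma) pair and reports the best one. The grid values below are typical log-spaced guesses, not values recommended by the slide.

      from sklearn.datasets import load_iris
      from sklearn.model_selection import GridSearchCV
      from sklearn.svm import SVC

      X, y = load_iris(return_X_y=True)

      # Log-spaced search over the two RBF-kernel hyperparameters
      param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
      search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
      search.fit(X, y)
      print(search.best_params_, search.best_score_)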
  23. Unbalanced data • What if you have unbalanced data? For example, +1: 1,000 items, -1: 10,000 items. An SVM can achieve about 91% accuracy (10,000/11,000) just by always predicting “-1” http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f410
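
  One common remedy for the imbalance above is per-class penalty weights: libSVM exposes this through its -wi option, and scikit-learn through class_weight. A minimal sketch on synthetic data with roughly the same 1:10 ratio:

      from sklearn.datasets import make_classification
      from sklearn.svm import SVC

      # ~10:1 imbalance, echoing the 10,000 vs. 1,000 example above
      X, y = make_classification(n_samples=2200, weights=[0.909], random_state=0)

      # 'balanced' reweights C per class by inverse class frequency;
      # an explicit dict such as class_weight={1: 10} also works
      clf = SVC(kernel='rbf', class_weight='balanced')
      clf.fit(X, y)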
  24. Non-numerical data • libSVM can handle only numerical data • × Sun: 0, Mon: 1, Tue: 2 (the magnitude relation has no meaning) • Convert to categorical/one-hot encoding • Sun: (1, 0, 0, 0…) • Mon: (0, 1, 0, 0…) • Tue: (0, 0, 1, 0…)
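
  A sketch of the one-hot conversion above with scikit-learn’s OneHotEncoder (string categories and the categories_ attribute assume scikit-learn 0.20 or later; with older versions, pandas.get_dummies does the same job):

      from sklearn.preprocessing import OneHotEncoder

      days = [['Sun'], ['Mon'], ['Tue'], ['Mon']]

      enc = OneHotEncoder()
      X = enc.fit_transform(days).toarray()
      print(enc.categories_)  # learned category order (alphabetical)
      print(X)                # each day becomes a 0/1 indicator vector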
  25. Missing data? • There is no golden rule • Eliminate such vectors • Use the average or median value • Use the most frequently appearing value
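
  A sketch of the three imputation strategies listed above, using scikit-learn’s SimpleImputer (added in 0.20; earlier versions provided sklearn.preprocessing.Imputer with the same idea):

      import numpy as np
      from sklearn.impute import SimpleImputer

      # NaN marks the missing entries
      X = np.array([[1.0, 2.0],
                    [np.nan, 3.0],
                    [7.0, np.nan]])

      for strategy in ('mean', 'median', 'most_frequent'):
          print(strategy)
          print(SimpleImputer(strategy=strategy).fit_transform(X))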
  26. Use OpenMP • As introduced in my first lecture, it is very easy http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f432
  27. SVM (except for the linear kernel) is slow http://mklab.iti.gr/project/GPU-LIBSVM
  28. *If you do not have GPUs, just use cloud computers
  29. Decision value http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f415
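
  For readers using scikit-learn rather than the libSVM binaries, the decision value from the FAQ entry above is exposed as decision_function; a minimal sketch:

      from sklearn.datasets import make_classification
      from sklearn.svm import SVC

      X, y = make_classification(n_samples=200, random_state=0)
      clf = SVC(kernel='rbf').fit(X, y)

      # Sign gives the predicted class; magnitude reflects distance to the boundary
      print(clf.decision_function(X[:5]))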
  30. You can obtain probabilities • You can obtain probability estimates instead of +1/-1 labels or continuous values • Use the “-b 1” option • This is useful for further processing
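
  The scikit-learn counterpart of libSVM’s “-b 1” is probability=True, which fits Platt scaling on top of the SVM; a minimal sketch:

      from sklearn.datasets import make_classification
      from sklearn.svm import SVC

      X, y = make_classification(n_samples=200, random_state=0)

      # probability=True enables probability estimates (extra internal
      # cross-validation inside fit, so training is slower)
      clf = SVC(kernel='rbf', probability=True).fit(X, y)
      print(clf.predict_proba(X[:3]))  # one column per class; rows sum to 1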
  31. Look at the model file • When using a linear kernel or liblinear • The weight vector w will be saved • You can also see the support vectors http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f433
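
  With liblinear via scikit-learn, the weight vector w that the FAQ entry describes in the model file is available directly as coef_; a minimal sketch:

      from sklearn.datasets import make_classification
      from sklearn.svm import LinearSVC

      X, y = make_classification(n_samples=500, n_features=10, random_state=0)
      clf = LinearSVC().fit(X, y)

      # w and b of the decision function f(x) = w.x + b
      print(clf.coef_)
      print(clf.intercept_)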
  32. You can use your own kernel • See the README in libSVM • In some cases, using other kernels is recommended: χ² (chi-squared), histogram intersection, etc.
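
  A hedged sketch of a custom kernel: the histogram intersection kernel mentioned above, implemented by hand (hist_intersection is my own name, not a libSVM function) and passed to scikit-learn as a precomputed Gram matrix, which corresponds to libSVM’s -t 4 option:

      import numpy as np
      from sklearn.svm import SVC

      def hist_intersection(A, B):
          # Gram matrix K[i, j] = sum_k min(A[i, k], B[j, k])
          return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

      rng = np.random.RandomState(0)
      X = rng.rand(100, 16)            # toy "histogram" features
      y = (X[:, 0] > 0.5).astype(int)  # toy labels

      clf = SVC(kernel='precomputed')
      clf.fit(hist_intersection(X, X), y)              # train-vs-train Gram matrix
      print(clf.predict(hist_intersection(X[:5], X)))  # test-vs-train Gram matrix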
