- 1. Machine Learning and Data Mining: case studies 2013, April 02nd, 14:00 Dmitry Efimov http://mech.math.msu.su/~efimov/
- 2. Outline 1. Machine Learning problems 2. Methods: Regression, Distance, Probability 3. Case studies 4. How to solve problems?
- 3. Essay grading • How to teach a computer to grade students' essays?
- 4. Heavy Machines sales • How to predict next year's prices?
- 5. Molecule response • How to predict molecule response for medicines?
- 6. People relationships • How to restore missing connections? • How to assign weights to connections?
- 7. What is Kaggle?
- 8. Definitions
- 9. • Regression
- 10. But… • What about this case? • Or if there are many features? • Powerful method: Neural Networks
- 11. Distance approach: SVM (Vapnik, 1995)
- 12. SVM (non-linear case) (Vapnik, 1995)
- 13. Probability approach: decision trees
- 14. Ensembling: Random Forests • Bagging = average of many simple algorithms, each trained on a random resample of the data • Simple algorithm = one decision tree • Bagging + decision trees = Random Forests (Breiman, 2001)
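The averaging idea on this slide can be illustrated with a small sketch. Averaging trees that were each fitted to a bootstrap resample is usually called bagging. The sketch below assumes a single numeric feature and one-split trees ("stumps"), and it omits the per-split random feature selection that full Random Forests add. The names `fit_stump` and `fit_bagged_stumps` are hypothetical and do not come from the talk.

```python
import random

def fit_stump(rows):
    """Fit a one-split regression tree on (x, y) rows; return a predictor."""
    mean_y = sum(y for _, y in rows) / len(rows)
    best = (float("inf"), None, mean_y, mean_y)
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # every x in the sample is identical
        return lambda x: mean_y
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged_stumps(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample and average the predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        resample = [rng.choice(rows) for _ in rows]  # sample with replacement
        stumps.append(fit_stump(resample))
    return lambda x: sum(stump(x) for stump in stumps) / len(stumps)

# Toy data (made up for illustration): a step from 0 to 10 at x = 5.
toy_rows = [(x, 0.0 if x < 5 else 10.0) for x in range(10)]
forest = fit_bagged_stumps(toy_rows)
print(forest(0), forest(9))
```

Each stump alone is a weak model; the ensemble prediction is simply the mean of the individual predictions.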
- 15. Case 1. Social ties strength
- 16. Description of problem • Organized by Panjia (www.panjiaco.com) • Problem: predict the strength of social ties • The prize pool: $75,000 • Training set size: 50,000 • Test set size: 40,000
- 17. Feature engineering • Number of features: more than 500! • Feature examples: 1) Number of friends (node feature) 2) Number of common friends (edge feature) 3) Number of common albums / Number of all albums (combined feature)
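As a rough illustration of the three feature types named on this slide, the sketch below computes them for one pair of users. The data layout (dictionaries of sets), the toy users, and the function name `edge_features` are assumptions for illustration only. Reading the third feature as a ratio, with "all albums" taken to mean the union of both users' albums, is also an assumption.

```python
def edge_features(u, v, friends, albums):
    """Compute node, edge, and combined features for the pair (u, v)."""
    common_albums = albums[u] & albums[v]
    all_albums = albums[u] | albums[v]  # assumed meaning of "all albums"
    return {
        # Node features: depend on one user only.
        "friends_u": len(friends[u]),
        "friends_v": len(friends[v]),
        # Edge feature: depends on the pair.
        "common_friends": len(friends[u] & friends[v]),
        # Combined feature: ratio of an edge count to a total count.
        "album_overlap": len(common_albums) / len(all_albums) if all_albums else 0.0,
    }

# Toy social graph (made up for illustration).
friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
albums = {
    "ann": {"a1", "a2"},
    "bob": {"a2", "a3"},
    "cat": set(),
    "dan": set(),
}

features = edge_features("ann", "bob", friends, albums)
print(features)
```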
- 18. Stochastic gradient descent in decision trees (GBM) (Ridgeway, 2007)
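GBM usually refers to gradient boosting machines. A minimal sketch of the stochastic variant for squared-error loss follows: each stage fits a small tree to the residuals (the negative gradient) on a random subsample and adds a shrunken copy of it to the model. It assumes a single numeric feature and one-split trees. The function names and the parameters `n_stages`, `learning_rate`, and `subsample` are illustrative choices, not the settings used in the talk.

```python
import random

def fit_stump(rows):
    """Fit a one-split regression tree on (x, target) rows."""
    mean_t = sum(t for _, t in rows) / len(rows)
    best = (float("inf"), None, mean_t, mean_t)
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in rows if x <= threshold]
        right = [t for x, t in rows if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # every x in the sample is identical
        return lambda x: mean_t
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbm(rows, n_stages=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared error with stumps."""
    rng = random.Random(seed)
    base = sum(y for _, y in rows) / len(rows)  # stage 0: constant model
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    batch_size = max(2, int(len(rows) * subsample))
    for _ in range(n_stages):
        batch = rng.sample(rows, batch_size)  # random subsample per stage
        # For squared error the negative gradient is the residual.
        residuals = [(x, y - predict(x)) for x, y in batch]
        stumps.append(fit_stump(residuals))
    return predict

def mean_squared_error(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

# Toy data (made up for illustration): y = x^2.
toy_rows = [(float(x), float(x * x)) for x in range(10)]
mean_y = sum(y for _, y in toy_rows) / len(toy_rows)
baseline_mse = mean_squared_error(lambda x: mean_y, toy_rows)
gbm_mse = mean_squared_error(fit_gbm(toy_rows), toy_rows)
print(baseline_mse, gbm_mse)
```

The small learning rate means no single tree dominates; accuracy comes from summing many small corrections.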
- 19. Obtained accuracy
- 20. Case 2. Biological Response prediction
- 21. Functional Ensembling (Efimov & Nikulin, 2012)
- 22. Functional Ensembling: Example (Efimov & Nikulin, 2012)
- 23. Functional Ensembling: Algorithm
- 24. Final ensembling (diagram combining min, mean, and max; values shown: 0.55, 0.1, 0.9, 0.75)
- 25. Obtained accuracy • Winner result: 0.37356 • Our result: 0.37363 • Our best result: 0.37093
- 26. How to solve problems?
- 27. Overfitting • The algorithm works perfectly on the training set • But the algorithm does not work on the test set!
- 28. Cross-validation • The target is unknown for the test set • Split the training set into two parts: • 1st part: new training set • 2nd part: new test set (with known target)
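The split described on this slide can be sketched as follows. The sketch uses a single hold-out split, as the slide describes, and a nearest-neighbour predictor that memorizes the training rows, so the gap between the two errors is easy to see. The toy data, the 25% hold-out fraction, and all function names are assumptions for illustration.

```python
import random

def split_training_set(rows, holdout_fraction=0.25, seed=0):
    """Shuffle labelled rows and split them into (new_train, new_test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def predict_nearest(train, x):
    """Return the target of the training row whose x is closest."""
    _, nearest_y = min(train, key=lambda row: abs(row[0] - x))
    return nearest_y

def mean_squared_error(train, rows):
    return sum((predict_nearest(train, x) - y) ** 2 for x, y in rows) / len(rows)

# Toy labelled data (made up for illustration): y = 2x plus noise.
noise = random.Random(1)
labelled = [(float(x), 2.0 * x + noise.gauss(0.0, 1.0)) for x in range(20)]

new_train, new_test = split_training_set(labelled)
train_error = mean_squared_error(new_train, new_train)   # memorized rows
holdout_error = mean_squared_error(new_train, new_test)  # unseen rows
print(train_error, holdout_error)
```

The model reproduces its own training rows exactly, so only the error on the held-out part says anything about how it will behave on the real test set.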
- 29. What's next? If you are interested in this topic… • Read papers and books about Machine Learning • Communicate with people (Kaggle, LinkedIn) • Participate in competitions • Study Mathematics
- 30. Thank you! Any questions? Dmitry Efimov defimov@aus.edu