Introduction to Machine Learning (case studies)
An introduction for beginners.

Published in: Education, Technology

  1. Machine Learning and Data Mining: case studies
     April 2, 2013, 14:00
     Dmitry Efimov
     http://mech.math.msu.su/~efimov/
  2. Outline
     1. Machine Learning problems
     2. Methods: Regression, Distance, Probability
     3. Case studies
     4. How to solve problems?
  3. Essay grading
     How can a computer be taught to grade students' essays?
  4. Heavy machine sales
     How to predict prices for the next year?
  5. Molecule response
     How to predict a molecule's response for medicines?
  6. People relationships
     How to repair missing connections?
     How to assign weights to connections?
  7. What is Kaggle?
  8. Definitions
  9. Regression
  10. • What about this case?
      • Or if there are many features?
      • Powerful method: Neural Networks
      But…
  11. Distance approach: SVM (Vapnik, 1995)
  12. SVM (non-linear case) (Vapnik, 1995)
  13. Probability approach: decision trees
  14. Ensembling: Random Forests (Breiman, 2001)
      • Bagging = average of many simple algorithms
      • Simple algorithm = one decision tree
      • Bagging + decision trees = Random Forests
  15. Case 1. Social ties strength
  16. Description of problem
      • Organized by Panjia (www.panjiaco.com)
      • Problem: predict the strength of social ties
      • Prize pool: $75,000
      • Training set size: 50,000
      • Test set size: 40,000
  17. Feature engineering
      • Number of features: more than 500!
      • Feature examples:
        1) Number of friends (node feature)
        2) Number of common friends (edge feature)
        3) Number of common albums / Number of all albums (combined feature)
  18. Stochastic gradient descent in decision trees (GBM) (Ridgeway, 2007)
  19. Obtained accuracy
  20. Case 2. Biological Response prediction
  21. Functional Ensembling (Efimov & Nikulin, 2012)
  22. Functional Ensembling: Example (Efimov & Nikulin, 2012)
  23. Functional Ensembling: Algorithm
  24. Final ensembling
      [Diagram; surviving labels: min, mean, max; surviving values: 0.55, 0.1, 0.9, 0.75]
  25. Obtained accuracy
      • Winner result: 0.37356
      • Our result: 0.37363
      • Our best result: 0.37093
      [Chart; axis range 0.3705 to 0.374]
  26. How to solve problems?
  27. Overfitting
      • The algorithm works perfectly on the training set
      • But the algorithm does not work on the test set!
  28. Cross-validation
      • The target is unknown for the test set
      • Split the training set into two parts:
        • 1st part: new training set
        • 2nd part: new test set (with known target)
  29. What's next?
      If you are interested in this topic…
      • Read papers and books about Machine Learning
      • Communicate with people (Kaggle, LinkedIn)
      • Participate in competitions
      • Study Mathematics
  30. Thank you! Any questions?
      Dmitry Efimov
      defimov@aus.edu
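
The cross-validation slide (item 28) describes splitting the labelled training set into a new training set and a new test set whose target is known. A minimal sketch of that split follows, using only the Python standard library. The function names, the 30% test fraction and the toy data are assumptions made for illustration, not taken from the slides. The k-fold helper is a common extension of the same idea and is likewise an assumption.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Split labelled rows into (new_train, new_test); both keep their targets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, valid_idx) pairs; each row is used for validation once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


# Toy labelled data (hypothetical): pairs of (feature, target).
rows = [(x, 2 * x + 1) for x in range(10)]
new_train, new_test = holdout_split(rows, test_fraction=0.3, seed=42)
```

A model is then fitted on `new_train` only and scored on `new_test`. A large gap between the two scores is the overfitting symptom described in item 27.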

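The ensembling slide (item 14) describes an ensemble as the average of many simple algorithms, each one a decision tree. The sketch below illustrates that averaging with one-split regression trees (stumps) fitted on bootstrap samples. It is plain bagging on a single feature, not Breiman's full Random Forest procedure, which also samples features at random. All names and the toy data are hypothetical.

```python
import random


def fit_stump(points):
    """Fit a one-split regression tree on (x, y) pairs by minimising squared error."""
    def mean(values):
        return sum(values) / len(values)

    xs = sorted({x for x, _ in points})
    overall = mean([y for _, y in points])
    best_error, best_threshold = float("inf"), None
    left_value = right_value = overall
    # Candidate thresholds lie halfway between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        error = (sum((y - l_mean) ** 2 for y in left)
                 + sum((y - r_mean) ** 2 for y in right))
        if error < best_error:
            best_error, best_threshold = error, threshold
            left_value, right_value = l_mean, r_mean
    if best_threshold is None:  # only one distinct x: predict a constant
        return lambda x: overall
    return lambda x: left_value if x <= best_threshold else right_value


def fit_bagged_stumps(points, n_models=25, seed=0):
    """Average stumps trained on bootstrap samples (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return lambda x: sum(model(x) for model in models) / len(models)


# Toy step-shaped data (hypothetical): the target jumps from 0 to 10 at x = 5.
data = [(x, 0.0 if x < 5 else 10.0) for x in range(10)]
predict = fit_bagged_stumps(data, n_models=25, seed=1)
```

Each stump sees a slightly different resampled data set, so averaging their predictions smooths out the errors of any single tree.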