Machine learning for developers

Slides from the talk given by Mariusz Gil at the Boiling Frogs 2016 conference.

Published in: Technology

  1. 1. @SrcMinistry @MariuszGil Machine Learning for Developers Data processing
  2. 2. @SrcMinistry
  3. 3. My story
  4. 4. Data classification
  5. 5. Bot detection
  6. 6. Minimize risk of error
  7. 7. Predictions
  8. 8. Click probability
  9. 9. Maximize CTR or eCPM
  10. 10. A lot of data
  11. 11. data + algo = result
  12. 12. Real problem
  13. 13. + value estimator
  14. 14. + chance of sale
  15. 15. + $ optimization
  16. 16. Tens of thousands of historical transactions
  17. 17. Tens of data components
  18. 18. Hundreds of data components
  19. 19. HOW?
  20. 20. Machine Learning Theory
  21. 21. "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom M. Mitchell)
  22. 22. Task
  23. 23. Typical ML techniques: classification, regression, clustering, dimensionality reduction, association learning
  24. 24. [Scatter plot: data points plotted against feature 1 and feature 2]
  25. 25. [Scatter plot: data points plotted against feature 1 and feature 2]
  26. 26. [Scatter plot: data points plotted against feature 1 and feature 2]
  27. 27. Experience
  28. 28. Typical ML paradigms: supervised learning, unsupervised learning, reinforcement learning
  29. 29. Accuracy
  30. 30. [Venn diagram] Circles: Substantive Expertise, Hacking Skills, Math & Statistics Knowledge. Overlaps: Traditional Research, Danger Zone!, Machine Learning, Data Science
  31. 31. [Venn diagram with a fourth circle, Evil] Circles: Substantive Expertise, Hacking Skills, Math & Statistics Knowledge, Evil. Overlaps: Outside Committee Member; Not that dangerous, in retrospect; Machine Learning; James Bond Villain; NSA; Data Science; That Guy Who Stole Your Online Identity; Thesis Advisor; Grad School Mate
  32. 32. Machine Learning Practice
  33. 33. data + algo = result
  34. 34.
 +-------+--------+------+---------+---------+-------+
 | brand | model  | year | mileage | service | price |
 +-------+--------+------+---------+---------+-------+
 | ford  | mondeo | 2005 |  123000 |    9900 | 67000 |
 | ford  | mondeo | 2005 |  175000 |    9900 | 30000 |
 | ford  | focus  | 2010 |   45000 |    6700 | 30000 |
 +-------+--------+------+---------+---------+-------+
 …
  35. 35. [Diagram] Learning Data, Algorithm Learning, Classifier Model, Real Data, Classification
  36. 36. Failure recipe
  37. 37.
 +-------+--------+------+---------+---------+-------+
 | brand | model  | year | mileage | service | price |
 +-------+--------+------+---------+---------+-------+
 | ford  | mondeo | 2005 |  123000 |    9900 | 67000 |
 | ford  | mondeo | 2005 |  175000 |    9900 | 30000 |
 | ford  | focus  | 2010 |   45000 |    6700 | 30000 |
 +-------+--------+------+---------+---------+-------+
 …
  38. 38.
 +-------+--------+------+---------+---------+--------+-------+
 | brand | model  | year | mileage | service | repair | price |
 +-------+--------+------+---------+---------+--------+-------+
 | ford  | mondeo | 2005 |  123000 |    9000 |    900 | 67000 |
 | ford  | mondeo | 2005 |  175000 |     900 |   9000 | 30000 |
 | ford  | focus  | 2010 |   45000 |    3700 |   3000 | 30000 |
 +-------+--------+------+---------+---------+--------+-------+
 …
  39. 39.
 +-------+--------+------+---------+---------+--------+-------+
 | brand | model  | year | mileage | service | repair | price |
 +-------+--------+------+---------+---------+--------+-------+
 | ford  | mondeo | 2005 |  123000 |    9000 |    900 | 67000 |
 | ford  | mondeo | 2005 |  175000 |     900 |   9000 | 30000 |
 | ford  | mondeo | 2005 |  175000 |     900 |   9000 | 45000 |
 | ford  | focus  | 2010 |   45000 |    3700 |   3000 | 30000 |
 +-------+--------+------+---------+---------+--------+-------+
 …
  40. 40.
 +-------+--------+-----+------+---------+---------+--------+-------+
 | brand | model  | gen | year | mileage | service | repair | price |
 +-------+--------+-----+------+---------+---------+--------+-------+
 | ford  | mondeo |   4 | 2005 |  123000 |    9000 |    900 | 67000 |
 | ford  | mondeo |   3 | 2005 |  175000 |     900 |   9000 | 30000 |
 | ford  | mondeo |   4 | 2005 |  175000 |     900 |   9000 | 45000 |
 | ford  | focus  |   4 | 2010 |   45000 |    3700 |   3000 | 30000 |
 +-------+--------+-----+------+---------+---------+--------+-------+
 …
  41. 41.
 +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
 | brand | model  | gen | year | mileage | service | repair | igla | crying German | price |
 +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
 | ford  | mondeo |   4 | 2005 |  123000 |    9000 |    900 |    0 |             0 | 67000 |
 | ford  | mondeo |   3 | 2005 |  175000 |     900 |   9000 |    1 |             1 | 30000 |
 | ford  | mondeo |   4 | 2005 |  175000 |     900 |   9000 |    0 |             0 | 45000 |
 | ford  | focus  |   4 | 2010 |   45000 |    3700 |   3000 |    1 |             0 | 30000 |
 +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
 …
  42. 42. Understand your data first
  43. 43. Exploratory analysis
  44. 44. http://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/aq2.jpg
  45. 45. ML pipeline
  46. 46. [ML pipeline diagram] Labels: Raw Data Collection, Pre-processing, Sampling, Training Dataset, Algorithm Training, Optimization, Post-processing, Final Model, Feature Selection, Feature Scaling, Dimensionality Reduction, Performance Metrics, Model Selection, Test Dataset, Cross Validation, Final Model Evaluation, Classification, Missing Data, Feature Extraction, Split, Data
  47. 47. [ML pipeline diagram] Labels: Raw Data Collection, Pre-processing, Sampling, Training Dataset, Algorithm Training, Optimization, Final Model, Feature Selection, Feature Scaling, Dimensionality Reduction, Performance Metrics, Model Selection, Test Dataset, Cross Validation, Final Model Evaluation, Classification, Missing Data, Feature Extraction, Split, Post-processing, Data
  48. 48. Classification algorithms. Linear Classification: Logistic Regression, Linear Discriminant Analysis, PLS Discriminant Analysis. Non-Linear Classification: Mixture Discriminant Analysis, Quadratic Discriminant Analysis, Regularized Discriminant Analysis, Neural Networks, Flexible Discriminant Analysis, Support Vector Machines, k-Nearest Neighbor, Naive Bayes. Decision Trees for Classification: Classification and Regression Trees, C4.5, PART, Bagging CART, Random Forest, Gradient Boosted Machines, Boosted C5.0
  49. 49. Regression algorithms. Linear Regression: Ordinary Least Squares Regression, Stepwise Linear Regression, Principal Component Regression, Partial Least Squares Regression. Non-Linear Regression / Penalized Regression: Ridge Regression, Least Absolute Shrinkage, ElasticNet, Multivariate Adaptive Regression, Support Vector Machines, k-Nearest Neighbor, Neural Network. Decision Trees for Regression: Classification and Regression Trees, Conditional Decision Tree, Rule System, Bagging CART, Random Forest, Gradient Boosted Machine, Cubist
  50. 50. The algorithm is only one element in the ML chain
  51. 51. Demo #1
  52. 52.
 > data(iris)
 > head(iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 1          5.1         3.5          1.4         0.2  setosa
 2          4.9         3.0          1.4         0.2  setosa
 3          4.7         3.2          1.3         0.2  setosa
 4          4.6         3.1          1.5         0.2  setosa
 5          5.0         3.6          1.4         0.2  setosa
 6          5.4         3.9          1.7         0.4  setosa
 > tail(iris)
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
 145          6.7         3.3          5.7         2.5 virginica
 146          6.7         3.0          5.2         2.3 virginica
 147          6.3         2.5          5.0         1.9 virginica
 148          6.5         3.0          5.2         2.0 virginica
 149          6.2         3.4          5.4         2.3 virginica
 150          5.9         3.0          5.1         1.8 virginica
 > plot(iris[,1:4])
  53. 53.
 > library(mclust)
 > class = iris$Species
 > mod2 = MclustDA(iris[,1:4], class, modelType = "EDDA")
 > table(class)
 class
     setosa versicolor  virginica
         50         50         50
  54. 54.
 > summary(mod2)
 ------------------------------------------------
 Gaussian finite mixture model for classification
 ------------------------------------------------
 EDDA model summary:
  log.likelihood   n df       BIC
       -187.7097 150 36 -555.8024
 Classes       n Model G
   setosa     50   VEV 1
   versicolor 50   VEV 1
   virginica  50   VEV 1
 Training classification summary:
             Predicted
 Class        setosa versicolor virginica
   setosa         50          0         0
   versicolor      0         47         3
   virginica       0          0        50
 Training error = 0.02
 > plot(mod2, what = "scatterplot")
  55. 55. Demo #2
  56. 56.
 > head(titanic.raw)
   Class  Sex   Age Survived
 1   3rd Male Child       No
 2   3rd Male Child       No
 3   3rd Male Child       No
 4   3rd Male Child       No
 5   3rd Male Child       No
 6   3rd Male Child       No
 > tail(titanic.raw)
      Class    Sex   Age Survived
 2196  Crew Female Adult      Yes
 2197  Crew Female Adult      Yes
 2198  Crew Female Adult      Yes
 2199  Crew Female Adult      Yes
 2200  Crew Female Adult      Yes
 2201  Crew Female Adult      Yes
 > summary(titanic.raw)
   Class         Sex           Age       Survived
  1st :325   Female: 470   Adult:2092   No :1490
  2nd :285   Male  :1731   Child: 109   Yes: 711
  3rd :706
  Crew:885
  57. 57.
 > library(arules)
 Loading required package: Matrix
 Attaching package: ‘arules’
 > rules <- apriori(titanic.raw)
 Apriori
 Parameter specification:
  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
         0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE
 Algorithmic control:
  filter tree heap memopt load sort verbose
     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
 Absolute minimum support count: 220
 set item appearances ...[0 item(s)] done [0.00s].
 set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
 sorting and recoding items ... [9 item(s)] done [0.00s].
 creating transaction tree ... done [0.00s].
 checking subsets of size 1 2 3 4 done [0.00s].
 writing ... [27 rule(s)] done [0.00s].
 creating S4 object ... done [0.00s].
  58. 58.
 > rules <- apriori(titanic.raw,
 +     parameter = list(minlen=2, supp=0.005, conf=0.8),
 +     appearance = list(rhs=c("Survived=No", "Survived=Yes"),
 +                       default="lhs"),
 +     control = list(verbose=F))
 > rules.sorted <- sort(rules, by="lift")
 > subset.matrix <- is.subset(rules.sorted, rules.sorted)
 > subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
 > redundant <- colSums(subset.matrix, na.rm=T) >= 1
 > which(redundant)
 {Class=2nd,Sex=Female,Age=Child,Survived=Yes}   2
 {Class=1st,Sex=Female,Age=Adult,Survived=Yes}   4
 {Class=Crew,Sex=Female,Age=Adult,Survived=Yes}  7
 {Class=2nd,Sex=Female,Age=Adult,Survived=Yes}   8
 > rules.pruned <- rules.sorted[!redundant]
  59. 59.
 > inspect(rules.pruned)
    lhs                               rhs            support     confidence lift
 1  {Class=2nd,Age=Child}          => {Survived=Yes} 0.010904134 1.0000000  3.095640
 4  {Class=1st,Sex=Female}         => {Survived=Yes} 0.064061790 0.9724138  3.010243
 2  {Class=2nd,Sex=Female}         => {Survived=Yes} 0.042253521 0.8773585  2.715986
 5  {Class=Crew,Sex=Female}        => {Survived=Yes} 0.009086779 0.8695652  2.691861
 9  {Class=2nd,Sex=Male,Age=Adult} => {Survived=No}  0.069968196 0.9166667  1.354083
 3  {Class=2nd,Sex=Male}           => {Survived=No}  0.069968196 0.8603352  1.270871
 12 {Class=3rd,Sex=Male,Age=Adult} => {Survived=No}  0.175829169 0.8376623  1.237379
 6  {Class=3rd,Sex=Male}           => {Survived=No}  0.191731031 0.8274510  1.222295
  60. 60. Applications
  61. 61. Tools
  62. 62. Benefits & Problems
  63. 63. [Scatter plot: data points plotted against feature 1 and feature 2, with two additional points]
  64. 64. [Flowchart, by Andrew Ng] Does it do well on the training data? No: better features / better parameters. Does it do well on the test data? No: more data. Yes: Done!
  65. 65. Understand your needs first
  66. 66. Tools will change. Ideas are immortal.
  67. 67. @SrcMinistry Thanks! @MariuszGil
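
Slides 39 and 40 show two otherwise identical Mondeo rows with different prices, which only become distinguishable once the gen column is added. A minimal sketch of how such conflicting rows could be spotted before training is given below. It is written in Python rather than the R used in the demos, and the helper name `conflicting_groups` is hypothetical, not something from the talk.

```python
from collections import defaultdict

# Rows copied from slide 39:
# (brand, model, year, mileage, service, repair, price)
rows = [
    ("ford", "mondeo", 2005, 123000, 9000, 900, 67000),
    ("ford", "mondeo", 2005, 175000, 900, 9000, 30000),
    ("ford", "mondeo", 2005, 175000, 900, 9000, 45000),
    ("ford", "focus", 2010, 45000, 3700, 3000, 30000),
]


def conflicting_groups(rows):
    """Return feature tuples that occur with more than one distinct price.

    The last element of each row is treated as the target (price),
    everything before it as the features.
    """
    prices_by_features = defaultdict(set)
    for *features, price in rows:
        prices_by_features[tuple(features)].add(price)
    return {
        features: sorted(prices)
        for features, prices in prices_by_features.items()
        if len(prices) > 1
    }


conflicts = conflicting_groups(rows)
for features, prices in conflicts.items():
    print(features, "->", prices)
```

A non-empty result suggests that some feature needed to explain the target is missing from the data, which is the situation slide 40 resolves by adding gen.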

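The support, confidence and lift columns on slide 59 can be checked by hand for the first rule, {Class=2nd,Age=Child} => {Survived=Yes}. The sketch below uses Python rather than R. The totals (2201 passengers, 711 survivors) come from slide 56; the count of 24 second-class children is not stated in the slides and is an assumption inferred from the reported support (0.010904134 × 2201 ≈ 24).

```python
# Totals from summary(titanic.raw) on slide 56.
N_TOTAL = 2201
N_SURVIVED = 711

# Assumption: 24 passengers match {Class=2nd, Age=Child}. The slides do not
# state this count; it is inferred from the reported support of the rule.
N_LHS = 24
# Confidence 1.0 on slide 59 means every one of them survived.
N_LHS_AND_RHS = 24


def rule_metrics(n_total, n_lhs, n_rhs, n_both):
    """Support, confidence and lift of a rule lhs => rhs from raw counts."""
    support = n_both / n_total               # P(lhs and rhs)
    confidence = n_both / n_lhs              # P(rhs | lhs)
    lift = confidence / (n_rhs / n_total)    # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


support, confidence, lift = rule_metrics(N_TOTAL, N_LHS, N_SURVIVED, N_LHS_AND_RHS)
print(f"support={support:.9f} confidence={confidence:.7f} lift={lift:.6f}")
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its base rate; here survival is roughly three times more likely than for a passenger picked at random.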
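
The flowchart on slide 64 (credited to Andrew Ng) reduces to two questions asked in order. A minimal Python sketch follows; the function name `next_step`, the use of a single score per dataset, and the 0.9 target are all assumptions made for illustration, since the slide does not say what "doing well" means numerically.

```python
def next_step(train_score: float, test_score: float, target: float = 0.9) -> str:
    """Walk the two questions of the slide 64 flowchart in order.

    `target` is a hypothetical threshold for "doing well"; pick one that
    matches the needs of the actual problem.
    """
    # Question 1: does it do well on the training data?
    if train_score < target:
        return "better features / better parameters"
    # Question 2: does it do well on the test data?
    if test_score < target:
        return "more data"
    return "done"


print(next_step(0.70, 0.65))  # poor fit even on the training data
print(next_step(0.98, 0.75))  # good on training data, poor on test data
print(next_step(0.98, 0.95))  # good on both
```

The order matters: the test score is only consulted once the training score is acceptable, so a model that fails on its own training data is never sent off to collect more data first.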