Holistic approach to machine learning

Slides from a talk delivered at the 4Developers conference in Warsaw, covering basic machine learning concepts and possibilities.

Published in: Software

  1. 1. @SrcMinistry @MariuszGil Holistic approach to Machine Learning Data processing
  2. 2. @SrcMinistry
  3. 3. We are developers
  4. 4. We love to…
  5. 5. Write code
  6. 6. Write tests
  7. 7. Use DDD/OOP/AOP/SOLID/GRASP/XYZ
  8. 8. What for?
  9. 9. Write code
  10. 10. Make money
  11. 11. Make users happy
  12. 12. Solve problems
  13. 13. Solve problems by writing code, to make users happy and make money
  14. 14. Solve problems by writing code, to make users happy and make money Solve problems
  15. 15. Solve problems by writing code, to make users happy and make money Solve
  16. 16. Solve problems by writing code, to make users happy and make money problems
  17. 17. Mapping all problems to DDD/OOP/AOP/SOLID/GRASP/XYZ
  18. 18. Test first
  19. 19. Understand the problem first
  20. 20. Domain knowledge
  21. 21. Ask an expert
  22. 22. Real problems
  23. 23. Data classification
  24. 24. Bot detection
  25. 25. Minimize risk of error
  26. 26. + value estimator
  27. 27. + chance of sale
  28. 28. + $ optimization
  29. 29. Tens of thousands of historical transactions
  30. 30. Tens of data components
  31. 31. Hundreds of data components
  32. 32. IF-Unsolvable
  33. 33. Machine Learning
  34. 34. The theory
  35. 35. "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom M. Mitchell)
  36. 36. Task
  37. 37. Typical ML techniques: classification, regression, clustering, dimensionality reduction, association learning
  38. 38. [Scatter plot: data points on axes "feature 1" and "feature 2"]
  39. 39. [Scatter plot: data points on axes "feature 1" and "feature 2"]
  40. 40. [Scatter plot: data points on axes "feature 1" and "feature 2"]
  41. 41. Experience
  42. 42. Typical ML paradigms: supervised learning, unsupervised learning, reinforcement learning
  43. 43. Accuracy
  44. 44. The practice
  45. 45. data + algo = result
  46. 46.
      +-------+--------+------+---------+---------+-------+
      | brand | model  | year | mileage | service | price |
      +-------+--------+------+---------+---------+-------+
      | ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
      +-------+--------+------+---------+---------+-------+
      | ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
      +-------+--------+------+---------+---------+-------+
      | ford  | focus  | 2010 | 45000   | 6700    | 30000 |
      +-------+--------+------+---------+---------+-------+
      …
  47. 47. Learning Data Algorithm Learning Classifier Model Real Data Classification
  48. 48. Failure recipe
  49. 49.
      +-------+--------+------+---------+---------+-------+
      | brand | model  | year | mileage | service | price |
      +-------+--------+------+---------+---------+-------+
      | ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
      +-------+--------+------+---------+---------+-------+
      | ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
      +-------+--------+------+---------+---------+-------+
      | ford  | focus  | 2010 | 45000   | 6700    | 30000 |
      +-------+--------+------+---------+---------+-------+
      …
  50. 50.
      +-------+--------+------+---------+---------+--------+-------+
      | brand | model  | year | mileage | service | repair | price |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
      +-------+--------+------+---------+---------+--------+-------+
      …
  51. 51.
      +-------+--------+------+---------+---------+--------+-------+
      | brand | model  | year | mileage | service | repair | price |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | mondeo | 2005 | 175000  | 900     | 9000   | 45000 |
      +-------+--------+------+---------+---------+--------+-------+
      | ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
      +-------+--------+------+---------+---------+--------+-------+
      …
  52. 52.
      +-------+--------+-----+------+---------+---------+--------+-------+
      | brand | model  | gen | year | mileage | service | repair | price |
      +-------+--------+-----+------+---------+---------+--------+-------+
      | ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 67000 |
      +-------+--------+-----+------+---------+---------+--------+-------+
      | ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 30000 |
      +-------+--------+-----+------+---------+---------+--------+-------+
      | ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 45000 |
      +-------+--------+-----+------+---------+---------+--------+-------+
      | ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 30000 |
      +-------+--------+-----+------+---------+---------+--------+-------+
      …
  53. 53.
      +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
      | brand | model  | gen | year | mileage | service | repair | igla | crying German | price |
      +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
      | ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 0    | 0             | 67000 |
      +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
      | ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 1    | 1             | 30000 |
      +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
      | ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 0    | 0             | 45000 |
      +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
      | ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 1    | 0             | 30000 |
      +-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
      …
  54. 54. Understand your data first
  55. 55. Exploratory analysis
  56. 56. http://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/aq2.jpg
  57. 57. ML pipeline
  58. 58. [Pipeline diagram, labels in extraction order] Raw Data Collection, Pre-processing, Sampling, Training Dataset, Algorithm Training, Optimization, Post-processing, Final model, Pre-processing, Feature Selection, Feature Scaling, Dimensionality Reduction, Performance Metrics, Model Selection, Test Dataset, Cross Validation, Final Model Evaluation, Pre-processing, Classification, Missing Data, Feature Extraction, Data, Split, Data
  59. 59. [Pipeline diagram, labels in extraction order] Raw Data Collection, Pre-processing, Sampling, Training Dataset, Algorithm Training, Optimization, Final model, Pre-processing, Feature Selection, Feature Scaling, Dimensionality Reduction, Performance Metrics, Model Selection, Test Dataset, Cross Validation, Final Model Evaluation, Pre-processing, Classification, Missing Data, Feature Extraction, Data, Split, Post-processing, Data
  60. 60. Classification algorithms. Linear Classification: Logistic Regression, Linear Discriminant Analysis, PLS Discriminant Analysis. Non-Linear Classification: Mixture Discriminant Analysis, Quadratic Discriminant Analysis, Regularized Discriminant Analysis, Neural Networks, Flexible Discriminant Analysis, Support Vector Machines, k-Nearest Neighbor, Naive Bayes. Decision Trees for Classification: Classification and Regression Trees, C4.5, PART, Bagging CART, Random Forest, Gradient Boosted Machines, Boosted C5.0
  61. 61. Regression algorithms. Linear Regression: Ordinary Least Squares Regression, Stepwise Linear Regression, Principal Component Regression, Partial Least Squares Regression. Non-Linear Regression / Penalized Regression: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), ElasticNet, Multivariate Adaptive Regression Splines, Support Vector Machines, k-Nearest Neighbor, Neural Network. Decision Trees for Regression: Classification and Regression Trees, Conditional Decision Tree, Rule System, Bagging CART, Random Forest, Gradient Boosted Machine, Cubist
  62. 62. The algorithm is only one element in the ML chain
  63. 63. Everything may be important for ML
  64. 64. Testing
  65. 65. Test datasets
  66. 66. 60% 20% 20%
  67. 67. Andrew Ng's rule of ML
  68. 68. Does it do well on the training data? No: better features / better parameters. Yes: does it do well on the test data? No: more data. Yes: done! (by Andrew Ng)
  69. 69. Calculate, measure, apply later
  70. 70. The code
  71. 71.
      import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
      import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
      import org.apache.spark.mllib.util.MLUtils
      // Load training data in LIBSVM format.
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
      // Split data into training (60%) and test (40%).
      val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
      val training = splits(0).cache()
      val test = splits(1)
      // Run training algorithm to build the model
      val numIterations = 100
      val model = SVMWithSGD.train(training, numIterations)
      // Clear the default threshold.
      model.clearThreshold()
      // Compute raw scores on the test set.
      val scoreAndLabels = test.map { point =>
        val score = model.predict(point.features)
        (score, point.label)
      }
      // Get evaluation metrics.
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      val auROC = metrics.areaUnderROC()
      println("Area under ROC = " + auROC)
      // Save and load model
      model.save(sc, "myModelPath")
      val sameModel = SVMModel.load(sc, "myModelPath")
  72. 72. The art of asking the right questions about the right data
  73. 73. @SrcMinistry Thanks! @MariuszGil
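
Slide 66 shows only the proportions 60% / 20% / 20%. A minimal sketch of such a split in plain Scala follows, assuming the three parts are training, validation and test sets (the slide does not name them). The object `SplitExample`, the helper `splitDataset` and the sample rows are hypothetical and not taken from the talk.

```scala
import scala.util.Random

object SplitExample {
  // Hypothetical helper: shuffle the rows with a fixed seed, then cut them
  // into three consecutive parts sized by the first two fractions; the third
  // part receives whatever remains.
  def splitDataset[A](rows: Seq[A], first: Double, second: Double, seed: Long): (Seq[A], Seq[A], Seq[A]) = {
    require(first > 0 && second > 0 && first + second < 1.0, "fractions must leave room for a third part")
    val shuffled = new Random(seed).shuffle(rows)
    val firstEnd = math.round(rows.size * first).toInt
    val secondEnd = firstEnd + math.round(rows.size * second).toInt
    (shuffled.take(firstEnd), shuffled.slice(firstEnd, secondEnd), shuffled.drop(secondEnd))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in data: 100 row identifiers instead of real transactions.
    val rows = (1 to 100).toVector
    // Assumed naming of the three parts: training / validation / test.
    val (training, validation, test) = splitDataset(rows, 0.6, 0.2, seed = 11L)
    println(s"${training.size} / ${validation.size} / ${test.size}") // prints "60 / 20 / 20"
  }
}
```

Fixing the seed keeps the split reproducible, so later measurements are compared on the same rows.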

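The flowchart on slide 68 can be read as a two-question decision rule. The sketch below encodes one reading of it in plain Scala: poor results on the training data call for better features or parameters, good training results with poor test results call for more data, and good results on both mean done. The names `NextStepExample`, `nextStep` and the `threshold` parameter are hypothetical; the slide does not say what score counts as doing well.

```scala
object NextStepExample {
  // The three outcomes named on the slide.
  sealed trait NextStep
  case object BetterFeaturesOrParameters extends NextStep
  case object MoreData extends NextStep
  case object Done extends NextStep

  // `threshold` is an assumed cut-off for "does well"; scores are any
  // metric where higher is better (for example accuracy).
  def nextStep(trainingScore: Double, testScore: Double, threshold: Double): NextStep =
    if (trainingScore < threshold) BetterFeaturesOrParameters
    else if (testScore < threshold) MoreData
    else Done

  def main(args: Array[String]): Unit = {
    println(nextStep(trainingScore = 0.70, testScore = 0.65, threshold = 0.90)) // BetterFeaturesOrParameters
    println(nextStep(trainingScore = 0.95, testScore = 0.70, threshold = 0.90)) // MoreData
    println(nextStep(trainingScore = 0.95, testScore = 0.93, threshold = 0.90)) // Done
  }
}
```

The training check comes first on purpose: a model that does not fit its own training data is unlikely to benefit from more data.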