It's Not Magic - Explaining classification algorithms

As organizations increasingly leverage data and machine learning methods, people throughout those organizations need to build basic "data literacy" in these topics. In this session, data scientist and instructor Brian Lange provides simple, visual, equation-free explanations of a variety of classification algorithms, geared toward helping anyone understand how they work. Now with Python code examples!

Published in: Data & Analytics

  1. 1. It’s Not Magic: Explaining classification algorithms. Brian Lange, Data Scientist + Partner at Datascope
  2. 2. HEADS UP
  3. 3. I work with some really freakin’ smart people.
  4. 4. classification algorithms
  5. 5. popular examples
  6. 6. popular examples -spam filters
  7. 7. popular examples -spam filters
  8. 8. popular examples -spam filters -the Sorting Hat
  9. 9. things to know
  10. 10. things to know - you need data labeled with the correct answers to “train” these algorithms before they work
  11. 11. things to know - you need data labeled with the correct answers to “train” these algorithms before they work - feature = dimension = column = attribute of the data
  12. 12. things to know - you need data labeled with the correct answers to “train” these algorithms before they work - feature = dimension = column = attribute of the data - class = category = label = Harry Potter house
  13. 13. BIG CAVEAT Oftentimes, choosing/creating good features or gathering more data will help more than changing algorithms...
  14. 14. (scatter plot of emails: % of email body that is all-caps vs. # of mentions of brand names, points labeled spam / not spam)
  15. 15. Linear discriminants
  16–22. (the same scatter plot, % of email body that is all-caps vs. # of mentions of brand names, with candidate dividing lines drawn on it; the lines get 1 wrong, 5 wrong, and 4 wrong; the 4-wrong line is y = .01x + 4)
  23–26. (a map of “terribleness” over slope and intercept, used to find the least terrible line; walking downhill on this map is “gradient descent”)
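
To make the “gradient descent” idea a bit more concrete, here is a minimal sketch of it in Python. The toy points and the use of mean squared error as the “terribleness” score are assumptions for illustration (the pictures on the slides count wrong points, which you can't take a gradient of), so read it as the mechanism rather than the exact method shown.

    # gradient descent over (slope, intercept): measure the terribleness,
    # figure out which way is downhill, take a small step, repeat
    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([4.00, 4.01, 4.02, 4.03, 4.04])   # made-up points to fit

    slope, intercept = 0.0, 0.0
    learning_rate = 0.05

    for step in range(2000):
        errors = (slope * x + intercept) - y
        grad_slope = 2 * np.mean(errors * x)        # slope of the terribleness map
        grad_intercept = 2 * np.mean(errors)        # in each direction
        slope -= learning_rate * grad_slope         # step downhill
        intercept -= learning_rate * grad_intercept

    print(slope, intercept)   # ends up near the least terrible line, y = .01x + 4
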
  27. 27. training data
  28. 28. training data import numpy as np X = np.array([[1, 0.1], [3, 0.2], [5, 0.1]…]) y = np.array([1, 2, 1])
  29. 29. training data
  30. 30. training data
  31. 31. training data from sklearn.discriminant_analysis import LinearDiscriminantAnalysis model = LinearDiscriminantAnalysis() model.fit(X, y)
  32. 32. training data trained model
  33. 33. new data point trained model
  34. 34. trained model
  35. 35. trained model new_point = np.array([[1, .3]])
  36. 36. trained model new_point = np.array([[1, .3]]) print(model.predict(new_point))
  37. 37. trained model new_point = np.array([[1, .3]]) print(model.predict(new_point)) [1]
  38. 38. trained model new_point = np.array([[1, .3]]) print(model.predict(new_point)) [1] not spam prediction
  39. 39. trained model not spam prediction
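
Pulling slides 27–39 together into one self-contained sketch. The six toy emails, their numbers, and the label meanings (1 = not spam, 2 = spam) are made up for illustration; note that predict() expects a 2-D array with one row per point.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # columns: # of mentions of brand names, fraction of the body that is all-caps
    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])
    y = np.array([1, 1, 1, 2, 2, 2])        # 1 = not spam, 2 = spam (assumed labels)

    model = LinearDiscriminantAnalysis()
    model.fit(X, y)                         # training data in, trained model out

    new_point = np.array([[1, 0.3]])        # a new email, as a 1-row 2-D array
    print(model.predict(new_point))         # [1] -> "not spam" for data like this
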
  40. 40. Logistic regression
  41. 41. logistic regression “divide it with a logistic function”
  42. 42. logistic regression “divide it with a logistic function”
  43. 43. logistic regression “divide it with a logistic function” from sklearn.linear_model import LogisticRegression model = LogisticRegression() model.fit(X,y) predicted = model.predict(z)
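
Because the logistic function squashes the score into a 0-to-1 range, logistic regression can report how confident it is, not just which side of the line a point falls on. A small sketch reusing the same made-up toy data; predict_proba is the relevant scikit-learn call:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])   # same toy emails as above
    y = np.array([1, 1, 1, 2, 2, 2])                  # 1 = not spam, 2 = spam

    model = LogisticRegression()
    model.fit(X, y)

    new_point = [[1, 0.3]]
    print(model.predict(new_point))         # the winning class
    print(model.predict_proba(new_point))   # probability of each class, e.g. [[0.9, 0.1]]
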
  44. 44. Support Vector Machines (SVM)
  45. 45. SVMs (support vector machines) “*advanced* draw a line through it”
  46. 46. SVMs (support vector machines) “*advanced* draw a line through it” - better definition of “terrible”
  47. 47. SVMs (support vector machines) “*advanced* draw a line through it” - better definition of “terrible” - lines can turn into non-linear shapes if you transform your data
  48. 48. 💩
  49. 49. 💩
  50. 50. “the kernel trick”
  51. 51. “the kernel trick”
  52. 52. SVMs (support vector machines) “*advanced* draw a line through it” figure credit: scikit-learn documentation
  53–58. (scatter plots on the same axes: % of email body that is all-caps vs. # of mentions of brand names)
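
Before the kernel'd SVM code on the next two slides, a bare-bones illustration of why transforming the data lets a straight line act like a curve. The one-dimensional points are made up, and explicitly adding an x² column is a stand-in for what the kernel trick does implicitly:

    # points near zero are one class, points far from zero are the other:
    # no single threshold on x can separate them...
    import numpy as np
    from sklearn.svm import SVC

    x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
    y = np.array([1, 1, 0, 0, 0, 1, 1])

    # ...but add a squared feature, and a plain line in (x, x**2) space does it,
    # which looks like a curved boundary back in the original space
    X_transformed = np.column_stack([x, x ** 2])
    model = SVC(kernel='linear')
    model.fit(X_transformed, y)

    print(model.predict([[0.2, 0.2 ** 2]]))   # -> [0], the "inner" class
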
  59. 59. SVMs (support vector machines) “*advanced* draw a line through it” from sklearn.svm import SVC model = SVC(kernel='poly', degree=2) model.fit(X,y) predicted = model.predict(z)
  60. 60. SVMs (support vector machines) “*advanced* draw a line through it” from sklearn.svm import SVC model = SVC(kernel='rbf') model.fit(X,y) predicted = model.predict(z)
  61. 61. KNN (k-nearest neighbors)
  62. 62. KNN (k-nearest neighbors) “what do similar cases look like?”
  63. 63. KNN (k-nearest neighbors) “what do similar cases look like?” k=1
  64. 64. KNN (k-nearest neighbors) “what do similar cases look like?” k=2
  65. 65. KNN (k-nearest neighbors) “what do similar cases look like?” figure credit: scikit-learn documentation
  66. 66. KNN (k-nearest neighbors) “what do similar cases look like?” k=1
  67. 67. KNN (k-nearest neighbors) “what do similar cases look like?” k=1
  68. 68. KNN (k-nearest neighbors) “what do similar cases look like?” k=2
  69. 69. KNN (k-nearest neighbors) “what do similar cases look like?” k=3
  70. 70. KNN (k-nearest neighbors) “what do similar cases look like?” figure credit: Burton DeWilde
  71. 71. KNN (k-nearest neighbors) “what do similar cases look like?” from sklearn.neighbors import KNeighborsClassifier model = KNeighborsClassifier(n_neighbors=5) model.fit(X,y) predicted = model.predict(z)
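
The scikit-learn call above hides how little is going on; here is a from-scratch sketch of the same vote, with made-up points and plain Euclidean distance standing in for “similarity”:

    import numpy as np
    from collections import Counter

    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])
    y = np.array(['not spam', 'not spam', 'not spam', 'spam', 'spam', 'spam'])

    def knn_predict(new_point, k=3):
        distances = np.linalg.norm(X - new_point, axis=1)   # how far is every known email?
        nearest = np.argsort(distances)[:k]                 # the k most similar ones
        votes = Counter(y[nearest])                         # let them vote
        return votes.most_common(1)[0][0]

    print(knn_predict(np.array([1, 0.3])))   # -> 'not spam'
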
  72. 72. Decision tree learners
  73. 73. decision tree learners make a flow chart of it
  74. 74. decision tree learners make a flow chart of it x < 3? yes no 3
  75. 75. decision tree learners make a flow chart of it x < 3? yes no y < 4? yes no 3 4
  76. 76. decision tree learners make a flow chart of it x < 3? yes no y < 4? yes no x < 5? yes no 3 5 4
  77. 77. decision tree learners make a flow chart of it x < 3? yes no y < 4? yes no x < 5? yes no 3 5 4
  78. 78. decision tree learners make a flow chart of it from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier() model.fit(X,y) predicted = model.predict(z)
  79. 79. decision tree learners make a flow chart of it sklearn.tree.export_graphviz() + pydot
  80. 80. decision tree learners make a flow chart of it
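
Slide 79's sklearn.tree.export_graphviz() + pydot hint, spelled out as a sketch. The feature names, class names, and output file name are made up, and rendering the PNG requires the Graphviz binaries to be installed:

    import numpy as np
    import pydot
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])   # same made-up toy data as before
    y = np.array([1, 1, 1, 2, 2, 2])                  # 1 = not spam, 2 = spam

    model = DecisionTreeClassifier()
    model.fit(X, y)

    dot_data = export_graphviz(
        model,
        out_file=None,                                # return the graph as a string
        feature_names=['brand name mentions', 'fraction all-caps'],
        class_names=['not spam', 'spam'],             # in the same order as the classes
        filled=True,
    )
    (graph,) = pydot.graph_from_dot_data(dot_data)
    graph.write_png('tree.png')                       # the literal flow chart
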
  81. 81. Ensemble models (make a bunch of models and combine them)
  82. 82. bagging split training set, train one model each, models “vote”
  83. 83. bagging split training set, train one model each, models “vote”
  84. 84. bagging split training set, train one model each, models “vote”
  85. 85. bagging split training set, train one model each, models “vote” new data point
  86. 86. bagging split training set, train one model each, models “vote” new data point
  87. 87. bagging split training set, train one model each, models “vote” new data point not spam spam not spam
  88. 88. bagging split training set, train one model each, models “vote” new data point not spam spam not spam Final Answer: not spam
  89. 89. bagging split training set, train one model each, models “vote”
  90. 90. bagging split training set, train one model each, models “vote”
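
In scikit-learn terms, the picture above is roughly BaggingClassifier: a pile of decision trees, each trained on a random resample of the data, voting at the end. The toy data is the same made-up set as before:

    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])
    y = np.array([1, 1, 1, 2, 2, 2])              # 1 = not spam, 2 = spam

    model = BaggingClassifier(n_estimators=10)    # 10 models, each on a resampled subset
    model.fit(X, y)
    print(model.predict([[1, 0.3]]))              # the ensemble's final answer
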
  91. 91. other spins on this
  92. 92. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from
  93. 93. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non-optimally. Compensate by training a ton of trees
  94. 94. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non-optimally. Compensate by training a ton of trees Voting - combine a bunch of different models of your design, have them “vote” on the correct answer.
  95. 95. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non-optimally. Compensate by training a ton of trees Voting - combine a bunch of different models of your design, have them “vote” on the correct answer. Boosting - train models in order, make the later ones focus on the points the earlier ones missed
  96. 96. voting example figure credit: scikit-learn documentation
  97. 97. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non-optimally. Compensate by training a ton of trees Voting - combine a bunch of different models of your design, have them “vote” on the correct answer. Boosting - train models in order, make the later ones focus on the points the earlier ones missed
  98. 98. from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier, AdaBoostClassifier, GradientBoostingClassifier
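
All of these follow the same fit/predict pattern. As one example, a sketch of the “voting” flavor with three arbitrarily chosen component models:

    import numpy as np
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])   # same made-up toy data
    y = np.array([1, 1, 1, 2, 2, 2])

    model = VotingClassifier(estimators=[
        ('logreg', LogisticRegression()),
        ('svm', SVC()),
        ('tree', DecisionTreeClassifier()),
    ])                                                # default: hard majority vote
    model.fit(X, y)
    print(model.predict([[1, 0.3]]))
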
  99. 99. which one do I pick?
  100. 100. which one do I pick? try a few!
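
“Try a few” in practice usually means looping over candidate models and comparing cross-validated scores. A template sketch; the particular candidates and the tiny made-up dataset are placeholders (with six points the numbers mean nothing, so swap in real data):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X = np.array([[1, 0.05], [2, 0.10], [1, 0.15],
                  [7, 0.60], [8, 0.70], [9, 0.80]])
    y = np.array([1, 1, 1, 2, 2, 2])

    candidates = {
        'logistic regression': LogisticRegression(),
        'SVM (rbf kernel)': SVC(),
        'KNN (k=3)': KNeighborsClassifier(n_neighbors=3),
        'decision tree': DecisionTreeClassifier(),
        'random forest': RandomForestClassifier(),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=3)   # 3-fold accuracy
        print(name, scores.mean())
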
  101–108.   | nonlinear decision boundary? | provide probability estimates? | tell how important a feature is to the model?
  Logistic Regression | no | yes | yes, if you scale
  SVMs | yes, with kernel | no | no
  KNN | yes | kinda (percent of nearby points) | no
  Naïve Bayes | yes | yes | no
  Decision Tree | yes | no | yes (number of times that feature is used)
  Ensemble models | yes | kinda (% of models that agree) | yes, depending on component parts
  Boosted models | yes | kinda (% of models that agree) | yes, depending on component parts
  109–116.   | can be updated with new training data? | easy to parallelize?
  Logistic Regression | kinda | kinda
  SVMs | kinda, depending on kernel | yes for some kernels, no for others
  KNN | yes | yes
  Naïve Bayes | yes | yes
  Decision Tree | no | no (but it’s very fast)
  Ensemble models | kinda, by adding new models to the ensemble | yes
  Boosted models | kinda, by adding new models to the ensemble | no
  117. 117. Other quirks
  118–123.
  SVMs | have to pick a kernel
  KNN | you need to define what “similarity” is in a good way. Fast to train, slow to classify (compared to other methods)
  Naïve Bayes | have to choose the distribution. Can deal with missing data
  Decision Tree | can provide literal flow charts. Very sensitive to outliers
  Ensemble models | less prone to overfitting than their component parts
  Boosted models | many parameters to tweak. More prone to overfit than normal ensembles. Most popular Kaggle winners use these
  124. 124. if this sounds cool datascope.co/careers
  125. 125. thanks! question time… datascope.co @bjlange
