
# It's Not Magic - Explaining classification algorithms

As organizations increasingly leverage data and machine learning methods, people throughout those organizations need to build a basic "data literacy" in those topics. In this session, data scientist and instructor Brian Lange provides simple, visual, and equation free explanations for a variety of classification algorithms, geared towards helping anyone understand how they work. Now with Python code examples!

Published in: Data & Analytics
### It's Not Magic - Explaining classification algorithms

1. 1. It’s Not Magic: Explaining classification algorithms. Brian Lange, Data Scientist + Partner at Datascope
3. 3. I work with some really freakin’ smart people.
4. 4. classiﬁcation algorithms
5. 5. popular examples
6. 6. popular examples -spam ﬁlters
7. 7. popular examples -spam ﬁlters
8. 8. popular examples -spam ﬁlters -the Sorting Hat
9. 9. things to know
10. 10. things to know - you need data labeled with the correct answers to “train” these algorithms before they work
11. 11. things to know - you need data labeled with the correct answers to “train” these algorithms before they work - feature = dimension = column = attribute of the data
12. 12. things to know - you need data labeled with the correct answers to “train” these algorithms before they work - feature = dimension = column = attribute of the data - class = category = label = Harry Potter house
13. 13. BIG CAVEAT Oftentimes choosing/creating good features or gathering more data will help more than changing algorithms...
14. 14. % of email body that is all-caps # mentions of brand names spam not spam
15. 15. Linear discriminants
16. 16. [Slides 16-22: scatter plots of emails by % of email body that is all-caps vs. # mentions of brand names; candidate separating lines are scored by how many points they misclassify: 1 wrong, 5 wrong, 4 wrong; the last line shown is y = .01x + 4]
23. 23. [Slides 23-26: plots with slope and intercept on the axes and "terribleness" as the height: a map of terribleness to find the least terrible line; walking downhill on that map is "gradient descent"]
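The "map of terribleness" idea can be made concrete in a few lines of code. The sketch below is not from the talk: it invents a tiny toy dataset, uses a smooth logistic loss as a stand-in for "terribleness", and walks the line's slope and intercept downhill with gradient descent.

```python
import numpy as np

# toy data, invented for illustration:
# column 0 = # mentions of brand names, column 1 = % of body in all-caps
X = np.array([[1.0, 0.10], [3.0, 0.20], [5.0, 0.90], [6.0, 1.20]])
y = np.array([0, 0, 1, 1])  # 0 = not spam, 1 = spam

slope, intercept = 0.0, 0.0  # the line we are trying to learn
step = 0.1                   # how far to move downhill each iteration

for _ in range(2000):
    # signed distance of each point above the line y = slope*x + intercept
    margin = X[:, 1] - (slope * X[:, 0] + intercept)
    # squash to a probability of "spam" with a logistic function
    p = 1.0 / (1.0 + np.exp(-margin))
    # gradient of the loss ("terribleness") with respect to slope and intercept
    grad_slope = np.mean((p - y) * -X[:, 0])
    grad_intercept = np.mean((p - y) * -1.0)
    # take one step downhill on the terribleness map
    slope -= step * grad_slope
    intercept -= step * grad_intercept

print(f"learned line: y = {slope:.2f}x + {intercept:.2f}")
```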
27. 27. training data
28. 28. training data import numpy as np X = np.array([[1, 0.1], [3, 0.2], [5, 0.1]…]) y = np.array([1, 2, 1])
29. 29. training data
30. 30. training data
31. 31. training data from sklearn.discriminant_analysis import LinearDiscriminantAnalysis model = LinearDiscriminantAnalysis() model.fit(X, y)
32. 32. training data trained model
33. 33. new data point trained model
34. 34. trained model
35. 35. trained model new_point = np.array([[1, .3]])
36. 36. trained model new_point = np.array([[1, .3]]) print(model.predict(new_point))
37. 37. trained model new_point = np.array([[1, .3]]) print(model.predict(new_point)) 1
38. 38. trained model new_point = np.array([[1, .3]]) print(model.predict(new_point)) 1 not spam prediction
39. 39. trained model not spam prediction
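Stitching the fragments from the last few slides into one runnable script (the numbers here are placeholders, not the talk's real data; note that `predict` expects a 2-D array with one row per point):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# training data: one row per email, one column per feature
# (e.g. # brand-name mentions, % of body in all-caps); values are made up
X = np.array([[1, 0.1], [3, 0.2], [5, 0.1], [4, 0.8], [6, 0.9]])
y = np.array([1, 1, 1, 2, 2])  # labels: 1 = not spam, 2 = spam

model = LinearDiscriminantAnalysis()
model.fit(X, y)  # "training"

# classify a new email
new_point = np.array([[1, 0.3]])
print(model.predict(new_point))  # -> [1], i.e. "not spam"
```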
40. 40. Logistic regression
41. 41. logistic regression “divide it with a logistic function”
42. 42. logistic regression “divide it with a logistic function”
43. 43. logistic regression “divide it with a logistic function” from sklearn.linear_model import LogisticRegression model = LogisticRegression() model.fit(X,y) predicted = model.predict(z)
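Because the dividing function is a logistic curve, the model can also say how confident it is. A small sketch, not from the slides, with made-up data, using scikit-learn's `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 0.1], [3, 0.2], [5, 0.1], [4, 0.8], [6, 0.9]])  # made-up features
y = np.array([0, 0, 0, 1, 1])                                      # 0 = not spam, 1 = spam

model = LogisticRegression()
model.fit(X, y)

z = np.array([[2, 0.5]])           # a new email to classify
print(model.predict(z))            # the hard class label
print(model.predict_proba(z))      # probability of each class, from the logistic function
```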
44. 44. Support Vector Machines (SVM)
45. 45. SVMs (support vector machines) “*advanced* draw a line through it”
46. 46. SVMs (support vector machines) “*advanced* draw a line through it” - better deﬁnition of “terrible”
47. 47. SVMs (support vector machines) “*advanced* draw a line through it” - better deﬁnition of “terrible” - lines can turn into non-linear shapes if you transform your data
48. 48. 💩
49. 49. 💩
50. 50. “the kernel trick”
51. 51. “the kernel trick”
52. 52. SVMs (support vector machines) “*advanced* draw a line through it” ﬁgure credit: scikit-learn documentation
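"Lines can turn into non-linear shapes if you transform your data" can be seen directly. The sketch below is not from the talk: it adds a squared copy of a feature by hand, and a boundary that is straight in the transformed space comes back as a curve in the original one. Real SVM kernels do this implicitly, without building the extra columns.

```python
import numpy as np
from sklearn.svm import SVC

# made-up 1-D data where the positive class sits in the middle,
# so no single threshold (a straight cut in 1-D) can separate it
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# transform: add x**2 as a second feature
X_transformed = np.column_stack([x, x ** 2])

# a plain linear SVM separates the transformed data perfectly...
model = SVC(kernel="linear")
model.fit(X_transformed, y)
print(model.score(X_transformed, y))  # 1.0

# ...and in the original 1-D space that straight boundary corresponds to
# a threshold on x**2, i.e. an interval around zero rather than a single cut
```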
53. 53. [Slides 53-58: scatter plots of emails by % of email body that is all-caps vs. # mentions of brand names]
59. 59. SVMs (support vector machines) “*advanced* draw a line through it” from sklearn.svm import SVC model = SVC(kernel='poly', degree=2) model.fit(X,y) predicted = model.predict(z)
60. 60. SVMs (support vector machines) “*advanced* draw a line through it” from sklearn.svm import SVC model = SVC(kernel='rbf') model.fit(X,y) predicted = model.predict(z)
61. 61. KNN (k-nearest neighbors)
62. 62. KNN (k-nearest neighbors) “what do similar cases look like?”
63. 63. KNN (k-nearest neighbors) “what do similar cases look like?” k=1
64. 64. KNN (k-nearest neighbors) “what do similar cases look like?” k=2
65. 65. KNN (k-nearest neighbors) “what do similar cases look like?” ﬁgure credit: scikit-learn documentation
66. 66. KNN (k-nearest neighbors) “what do similar cases look like?” k=1
67. 67. KNN (k-nearest neighbors) “what do similar cases look like?” k=1
68. 68. KNN (k-nearest neighbors) “what do similar cases look like?” k=2
69. 69. KNN (k-nearest neighbors) “what do similar cases look like?” k=3
70. 70. KNN (k-nearest neighbors) “what do similar cases look like?” ﬁgure credit: Burton DeWilde
71. 71. KNN (k-nearest neighbors) “what do similar cases look like?” from sklearn.neighbors import KNeighborsClassifier model = KNeighborsClassifier(n_neighbors=5) model.fit(X,y) predicted = model.predict(z)
72. 72. Decision tree learners
73. 73. decision tree learners make a ﬂow chart of it
74. 74. decision tree learners make a ﬂow chart of it x < 3? yes no 3
75. 75. decision tree learners make a ﬂow chart of it x < 3? yes no y < 4? yes no 3 4
76. 76. decision tree learners make a ﬂow chart of it x < 3? yes no y < 4? yes no x < 5? yes no 3 5 4
77. 77. decision tree learners make a ﬂow chart of it x < 3? yes no y < 4? yes no x < 5? yes no 3 5 4
78. 78. decision tree learners make a ﬂow chart of it from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier() model.fit(X,y) predicted = model.predict(z)
79. 79. decision tree learners make a ﬂow chart of it sklearn.tree.export_graphviz() + pydot
80. 80. decision tree learners make a ﬂow chart of it
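A hedged sketch of the flow-chart export mentioned on slide 79 (the feature names and data are made up; rendering the `.dot` file needs Graphviz or pydot installed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = np.array([[1, 0.1], [3, 0.2], [5, 0.1], [4, 0.8], [6, 0.9]])  # made-up features
y = np.array([0, 0, 0, 1, 1])                                      # 0 = not spam, 1 = spam

model = DecisionTreeClassifier()
model.fit(X, y)

# write the learned flow chart as a Graphviz file, then render it, e.g.:
#   dot -Tpng tree.dot -o tree.png      (or load tree.dot with pydot)
export_graphviz(model, out_file="tree.dot",
                feature_names=["brand_mentions", "pct_all_caps"],
                class_names=["not spam", "spam"], filled=True)
```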
81. 81. Ensemble models (make a bunch of models and combine them)
82. 82. bagging split training set, train one model each, models “vote”
83. 83. bagging split training set, train one model each, models “vote”
84. 84. bagging split training set, train one model each, models “vote”
85. 85. bagging split training set, train one model each, models “vote” new data point
86. 86. bagging split training set, train one model each, models “vote” new data point
87. 87. bagging split training set, train one model each, models “vote” new data point not spam spam not spam
88. 88. bagging split training set, train one model each, models “vote” new data point not spam spam not spam Final Answer: not spam
89. 89. bagging split training set, train one model each, models “vote”
90. 90. bagging split training set, train one model each, models “vote”
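scikit-learn wraps this split-train-vote recipe up as `BaggingClassifier`. A minimal sketch with invented data (by default each of the bagged models is a decision tree, and the samples are drawn randomly rather than by a strict split):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

X = np.array([[1, 0.1], [3, 0.2], [5, 0.1], [4, 0.8], [6, 0.9], [2, 0.7]])  # made-up
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = not spam, 1 = spam

# train 10 models, each on its own random sample of the training set;
# predictions are combined by letting the models "vote"
model = BaggingClassifier(n_estimators=10)
model.fit(X, y)

new_point = np.array([[2, 0.6]])
print(model.predict(new_point))        # the winning vote
print(model.predict_proba(new_point))  # share of the vote for each class
```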
91. 91. other spins on this
92. 92. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from
93. 93. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non- optimally. Compensate by training a ton of trees
94. 94. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non- optimally. Compensate by training a ton of trees Voting - combine a bunch of diﬀerent models of your design, have them “vote” on the correct answer.
95. 95. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non- optimally. Compensate by training a ton of trees Voting - combine a bunch of diﬀerent models of your design, have them “vote” on the correct answer. Boosting- train models in order, make the later ones focus on the points the earlier ones missed
96. 96. voting example ﬁgure credit: scikit-learn documentation
97. 97. other spins on this Random Forest - like bagging, but at each split randomly constrain features to choose from Extra Trees - for each split, make it randomly, non- optimally. Compensate by training a ton of trees Voting - combine a bunch of diﬀerent models of your design, have them “vote” on the correct answer. Boosting- train models in order, make the later ones focus on the points the earlier ones missed
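Each of those spins has a ready-made class in scikit-learn. A rough sketch, not from the talk, of what reaching for them looks like (the component models in the voting example are just illustrative choices):

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              VotingClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# bagging + randomly constrained features to choose from at each split
random_forest = RandomForestClassifier(n_estimators=100)

# fully random, non-optimal splits, compensated for with many trees
extra_trees = ExtraTreesClassifier(n_estimators=500)

# a handful of different models of your own design that "vote"
voting = VotingClassifier(estimators=[
    ("logreg", LogisticRegression()),
    ("svm", SVC()),
    ("knn", KNeighborsClassifier()),
])

# boosting: models trained in order, later ones focus on earlier mistakes
boosted = GradientBoostingClassifier(n_estimators=100)

# each of these is fit and used the same way as the single models:
# model.fit(X, y); model.predict(z)
```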
99. 99. which one do I pick?
100. 100. which one do I pick? try a few!
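"Try a few" usually means fitting several candidates on the same data and comparing held-out accuracy. A sketch using cross-validation (the synthetic dataset here is just a stand-in for your own labeled data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# stand-in data; in practice this is your labeled training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.2f}")
```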
101. 101. Comparison of the algorithms (built up one row at a time across slides 101-108):

| | Nonlinear decision boundary | Provides probability estimates | Tells how important a feature is to the model |
| --- | --- | --- | --- |
| Logistic Regression | no | yes | yes, if you scale |
| SVMs | yes, with kernel | no | no |
| KNN | yes | kinda (percent of nearby points) | no |
| Naïve Bayes | yes | yes | no |
| Decision Tree | yes | no | yes (number of times that feature is used) |
| Ensemble models | yes | kinda (% of models that agree) | yes, depending on component parts |
| Boosted models | yes | kinda (% of models that agree) | yes, depending on component parts |
109. 109. More comparisons (built up one row at a time across slides 109-116):

| | Can be updated with new training data | Easy to parallelize? |
| --- | --- | --- |
| Logistic Regression | kinda | kinda |
| SVMs | kinda, depending on kernel | yes for some kernels, no for others |
| KNN | yes | yes |
| Naïve Bayes | yes | yes |
| Decision Tree | no | no (but it's very fast) |
| Ensemble models | kinda, by adding new models to the ensemble | yes |
| Boosted models | kinda, by adding new models to the ensemble | no |
117. 117. Other quirks
118. 118. Other quirks by algorithm (built up across slides 118-123):

| | Quirks |
| --- | --- |
| SVMs | have to pick a kernel |
| KNN | you need to define what "similarity" is in a good way; fast to train, slow to classify (compared to other methods) |
| Naïve Bayes | have to choose the distribution; can deal with missing data |
| Decision Tree | can provide literal flow charts; very sensitive to outliers |
| Ensemble models | less prone to overfitting than their component parts |
| Boosted models | many parameters to tweak; more prone to overfitting than normal ensembles; most popular Kaggle winners use these |
124. 124. if this sounds cool datascope.co/careers
125. 125. thanks! question time… datascope.co @bjlange