- 1. Data Science from Scratch. Joel Grus, Seattle DAML Meetup, June 23, 2015
- 2. About me: Old-school DAML-er. Wrote a book [arrow pointing to an image on the slide]. SWE at Google. Formerly data science at VoloMetrix, Decide, Farecast.
- 3. The Road to Data Science
- 4. My Road to Data Science
- 5. Grad School
- 6. Fareology
- 7. Data Science Is A Broad Field [Diagram labels: "Some Stuff"; "More Stuff"; "Even More Stuff"; "Data Science"; "People who think they're data scientists, but they're not really data scientists"; "People who are a danger to everyone around them"; "People who say 'machine learnings'"]
- 28. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers. JOEL GRUS
- 29. A lot of stuff!
- 30. What Are Hiring Managers Looking For?
- 31. What Are Hiring Managers Looking For? Let's check LinkedIn
- 32. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers. JOEL GRUS grad students!
- 33. Learning Data Science
- 34. I want to be a data scientist. Great!
- 35. The Math Way I like to start with matrix decompositions. How's your measure theory?
- 36. The Math Way. The Good: Solid foundation. Math is the noblest known pursuit.
- 37. The Math Way. The Good: Solid foundation. Math is the noblest known pursuit. The Bad: Some weirdos don't think math is fun. Can be pretty forbidding. Can miss practical skills.
- 38. So, did you count the words in that document? No, but I have an elegant proof that the number of words is finite!
- 39. OK, Let's Try Again
- 40. I want to be a data scientist. Great!
- 41. The Tools Way Here's a list of the 25 libraries you really ought to know. How's your R programming?
- 42. The Tools Way. The Good: Don't have to understand the math. Practical. Can get started doing fun stuff right away.
- 43. The Tools Way. The Good: Don't have to understand the math. Practical. Can get started doing fun stuff right away. The Bad: Don't have to understand the math. Can get started doing bad science right away.
- 44. So, did you build that model? Yes, and it fits the training data almost perfectly!
- 45. OK, Maybe Not That Either
- 46. So Then What?
- 47. Example: k-means clustering. Unsupervised machine learning technique. Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares, i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small").
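
To make the objective concrete, here is a minimal Python sketch of the within-cluster sum-of-squares for a given grouping of points. The names `squared_distance`, `vector_mean`, and `within_cluster_ss` are illustrative assumptions, not functions from the talk, and points are assumed to be equal-length tuples of numbers.

```python
def squared_distance(p, q):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(p, q))

def vector_mean(points):
    # componentwise mean of a non-empty list of points
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def within_cluster_ss(clusters):
    # sum, over all clusters, of squared distances from each point
    # to the mean of its own cluster (empty clusters contribute 0)
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        center = vector_mean(cluster)
        total += sum(squared_distance(p, center) for p in cluster)
    return total

# the same four points, grouped two different ways
tight = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
loose = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]

print(within_cluster_ss(tight))  # the "small" clustering
print(within_cluster_ss(loose))  # a much "larger" clustering
```

Under this measure the first grouping scores far lower than the second, which is the sense of "small" that k-means tries to minimize.
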
- 48. The Math Way
- 49. The Math Way
- 50. The Tools Way

      # a 2-dimensional example
      x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                 matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
      colnames(x) <- c("x", "y")
      (cl <- kmeans(x, 2))
      plot(x, col = cl$cluster)
      points(cl$centers, col = 1:2, pch = 8, cex = 2)

- 51. The Tools Way

      >>> from sklearn import cluster, datasets
      >>> iris = datasets.load_iris()
      >>> X_iris = iris.data
      >>> y_iris = iris.target
      >>> k_means = cluster.KMeans(n_clusters=3)
      >>> k_means.fit(X_iris)
      KMeans(copy_x=True, init='k-means++', ...
      >>> print(k_means.labels_[::10])
      [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
      >>> print(y_iris[::10])
      [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

- 52. So What To Do?
- 53. Bootcamps?
- 54. Certificate Programs? Certificate text: "Data Science from Scratch. This is to certify that Joel Grus has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining. Joel Grus, June 23, 2015"
- 55. Hey! Data scientists!
- 56. Learning By Building. You don't really understand something until you build it. For example, I understand garbage disposals much better now that I had to replace one that was leaking water all over my kitchen. More relevantly, I thought I understood hypothesis testing, until I tried to write a book chapter + code about it.
- 57. Learning By Building Functional Programming
- 58. Break Things Down Into Small Functions
- 59. So you don't end up with something like this
- 60. Don't Mutate
- 61. Example: k-means clustering. Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares. Global optimization is hard, so use a greedy iterative approach.
- 62. Fun Motivation: Image Posterization. Image consists of pixels. Each pixel is a triplet (R,G,B). Imagine pixels as points in space. Find k clusters of pixels. Recolor each pixel to its cluster mean. I think it's fun, anyway. [Image caption: 8 colors]
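
The recoloring step can be sketched in a few lines of Python, assuming the cluster means have already been found (for instance by k-means). The names `closest_mean` and `posterize`, the pixel values, and the two means below are made up for illustration; a real image would also need the means rounded back to integer color values.

```python
def squared_distance(p, q):
    # squared Euclidean distance between two (R, G, B) triplets
    return sum((x - y) ** 2 for x, y in zip(p, q))

def closest_mean(pixel, means):
    # the mean nearest to this pixel in color space
    return min(means, key=lambda m: squared_distance(pixel, m))

def posterize(pixels, means):
    # recolor each pixel to the mean of the cluster it falls in
    return [closest_mean(pixel, means) for pixel in pixels]

# hypothetical data: two reddish and two bluish pixels
pixels = [(250, 10, 10), (240, 0, 20), (5, 5, 200), (0, 20, 220)]
means = [(245, 5, 15), (2, 12, 210)]  # assumed output of a clustering step

posterized = posterize(pixels, means)
print(posterized)
```

With k means, the recolored image uses at most k distinct colors.
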
- 63. Example: k-means clustering. Given some points, find k clusters by:

      choose k "means"
      repeat:
          assign each point to cluster of closest "mean"
          recompute mean of each cluster

  Sounds simple! Let's code!

- 73. k_means, first attempt (the slide's step-by-step annotations appear as comments):

      import random

      def k_means(points, k, num_iters=10):
          # start with k randomly chosen points
          means = list(random.sample(points, k))
          # start with no cluster assignments
          assignments = [None for _ in points]
          # for each iteration
          for _ in range(num_iters):
              # assign each point to closest mean
              # for each point
              for i, point_i in enumerate(points):
                  d_min = float('inf')
                  # for each mean
                  for j, mean_j in enumerate(means):
                      # compute the distance
                      d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                      # assign the point to the cluster of the mean
                      # with the smallest distance
                      if d < d_min:
                          d_min = d
                          assignments[i] = j
              # recompute means
              for j in range(k):
                  # find the points in each cluster
                  cluster = [point for i, point in enumerate(points)
                             if assignments[i] == j]
                  # and compute the new means
                  means[j] = mean(cluster)
          return means

- 75. The same k_means code as on the preceding slides. Not impenetrable, but a lot less helpful than it could be. Can we make it simpler?
- 76. Break Things Down Into Small Functions
- 77. k_means

      def k_means(points, k, num_iters=10):
          # start with k of the points as "means"
          means = random.sample(points, k)
          # and iterate finding new means
          for _ in range(num_iters):
              means = new_means(points, means)
          return means

- 78. new_means

      def new_means(points, means):
          # assign points to clusters
          # each cluster is just a list of points
          clusters = assign_clusters(points, means)
          # return the cluster means
          return [mean(cluster) for cluster in clusters]

- 79. assign_clusters

      def assign_clusters(points, means):
          # one cluster for each mean
          # each cluster starts empty
          clusters = [[] for _ in means]
          # assign each point to cluster
          # corresponding to closest mean
          for point in points:
              index = closest_index(point, means)
              clusters[index].append(point)
          return clusters

- 80. closest_index and argmin

      def closest_index(point, means):
          # return index of closest mean
          return argmin(distance(point, mean) for mean in means)

      def argmin(xs):
          # return index of smallest element
          return min(enumerate(xs), key=lambda pair: pair[1])[0]

- 81. To Recap
  - k_means(points, k, num_iters=10), mean(points)
  - k_means(points, k, num_iters=10), new_means(points, means), assign_clusters(points, means), closest_index(point, means), argmin(xs), distance(point1, point2), mean(points), add(point1, point2), scalar_multiply(c, point)
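
The recap names four vector helpers whose bodies are not shown on the slides: `distance`, `mean`, `add`, and `scalar_multiply`. The sketch below is one plausible way to write them, assuming points are equal-length tuples of numbers; the bodies are assumptions, not the talk's actual code.

```python
import math
from functools import reduce

def add(point1, point2):
    # componentwise sum of two points
    return tuple(x + y for x, y in zip(point1, point2))

def scalar_multiply(c, point):
    # multiply every coordinate by the scalar c
    return tuple(c * x for x in point)

def mean(points):
    # componentwise mean; assumes points is non-empty
    # (reduce raises TypeError on an empty list)
    total = reduce(add, points)
    return scalar_multiply(1 / len(points), total)

def distance(point1, point2):
    # Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))

print(add((1, 2), (3, 4)))
print(scalar_multiply(2, (1, 2)))
print(mean([(0, 0), (2, 4)]))
print(distance((0, 0), (3, 4)))
```

Note that `mean` as sketched here fails on an empty cluster, so a version of `new_means` meant for real data would probably need to handle the case where no point is closest to some mean.
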
- 82. As a Pedagogical Tool
  - Can be used "top down" (as we did here): implement high-level logic, then implement the details. Nice for exposition.
  - Can also be used "bottom up": implement small pieces, build up to high-level logic. Good for workshops.
- 83. Example: Decision Trees. Want to predict whether a given Meetup is worth attending (True) or not (False). Inputs are dictionaries describing each Meetup:

      { "group" : "DAML",
        "date" : "2015-06-23",
        "beer" : "free",
        "food" : "dim sum",
        "speaker" : "@joelgrus",
        "location" : "Google",
        "topic" : "shameless self-promotion" }

      { "group" : "Seattle Atheists",
        "date" : "2015-06-23",
        "location" : "Round the Table",
        "beer" : "none",
        "food" : "none",
        "topic" : "Godless Game Night" }

- 84. Example: Decision Trees (the same two Meetup dictionaries as on the previous slide) [Tree diagram: a "beer?" node with branches "free", "none", and "paid"; a "speaker?" node with branches "@jakevdp" and "@joelgrus"; leaves are True or False]
- 85. Example: Decision Trees

      class LeafNode:
          def __init__(self, prediction):
              self.prediction = prediction

          def predict(self, input_dict):
              return self.prediction

      class DecisionNode:
          def __init__(self, attribute, subtree_dict):
              self.attribute = attribute
              self.subtree_dict = subtree_dict

          def predict(self, input_dict):
              value = input_dict.get(self.attribute)
              subtree = self.subtree_dict[value]
              return subtree.predict(input_dict)

- 86. Example: Decision Trees. Again inspiration from functional programming:

      type Input = Map.Map String String

      data Tree = Predict Bool
                | Subtrees String (Map.Map String Tree)

  Annotations on the slide: "always predict a specific value" (Predict Bool); "look at the "beer" entry" (the String in Subtrees); "a map from each possible "beer" value to a subtree" (the Map.Map String Tree in Subtrees).

- 87. Example: Decision Trees

      type Input = Map.Map String String

      data Tree = Predict Bool
                | Subtrees String (Map.Map String Tree)

      predict :: Tree -> Input -> Bool
      predict (Predict b) _ = b
      predict (Subtrees a subtrees) input =
          predict subtree input
        where subtree = subtrees Map.! (input Map.! a)

- 88. Example: Decision Trees

      type Input = Map.Map String String

      data Tree = Predict Bool
                | Subtrees String (Map.Map String Tree)

  We can do the same. We'll say a decision tree is either
  - True
  - False
  - (attribute, subtree_dict)

      ("beer", { "free" : True,
                 "none" : False,
                 "paid" : ("speaker", {...})})

- 89. Example: Decision Trees

      predict :: Tree -> Input -> Bool
      predict (Predict b) _ = b
      predict (Subtrees a subtrees) input =
          predict subtree input
        where subtree = subtrees Map.! (input Map.! a)

      def predict(tree, input_dict):
          # leaf node predicts itself
          if tree in (True, False):
              return tree
          else:
              # destructure tree
              attribute, subtree_dict = tree
              # find appropriate subtree
              value = input_dict[attribute]
              subtree = subtree_dict[value]
              # classify using subtree
              return predict(subtree, input_dict)

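
As a usage sketch, the tuple-based `predict` can be run against the two Meetup dictionaries from the earlier slide. The "beer" branches follow the slide's example tree, but the slide elides the "speaker" subtree, so the two speaker leaves below are placeholder values chosen only to make the example runnable.

```python
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    # destructure tree, then recurse into the matching subtree
    attribute, subtree_dict = tree
    value = input_dict[attribute]  # raises KeyError if the attribute is missing
    return predict(subtree_dict[value], input_dict)

tree = ("beer", {
    "free": True,
    "none": False,
    # hypothetical subtree: these leaf values are NOT from the slides
    "paid": ("speaker", {"@jakevdp": True, "@joelgrus": False}),
})

daml = {"group": "DAML", "date": "2015-06-23", "beer": "free",
        "food": "dim sum", "speaker": "@joelgrus", "location": "Google",
        "topic": "shameless self-promotion"}

atheists = {"group": "Seattle Atheists", "date": "2015-06-23",
            "location": "Round the Table", "beer": "none", "food": "none",
            "topic": "Godless Game Night"}

print(predict(tree, daml))      # follows the "free" branch
print(predict(tree, atheists))  # follows the "none" branch
```

Because `predict` indexes the dictionaries directly, an input with an unseen attribute value (or a missing attribute on the path taken) raises `KeyError`; how to handle that case is a design choice the slides leave open.
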
- 90. Not Just For Data Science
- 91. In Conclusion. Teaching data science is fun, if you're smart about it. Learning data science is fun, if you're smart about it. Writing a book is not that much fun. Having written a book is pretty fun. Making slides is actually kind of fun. Functional programming is a lot of fun.
- 92. Thanks! @joelgrus joelgrus@gmail.com joelgrus.com