Clustering: A Scikit Learn Tutorial

A brief introduction to clustering with scikit-learn. This presentation provides an overview, with real examples, of how to use and tune k-means clustering.


1. Clustering: A Scikit-Learn Tutorial, Damian Mingle
2. About Me • Chief Data Scientist, WPC Healthcare • Speaker • Researcher • Writer
3. Outline • What is k-means clustering? • How does it work? • When is it appropriate to use it? • K-means clustering in scikit-learn • Basic • Basic with adjustments
4. Clustering • Unsupervised learning (inferring a function that describes hidden structure in unlabeled data) • Groups data objects • Measures distance between data points • Helps in examining the data
5. K-means Clustering • Formally: a method of vector quantization • Informally: a mapping of a large set of inputs to a smaller (countable) set • Separates data into groups of equal variance • Uses the Euclidean distance metric
6. K-means Clustering Iterative refinement Three basic steps: • Step 1: Choose k (how many groups) • Repeat: • Step 2: Assignment (label each data point with a group) • Step 3: Update (recompute each group's center) This process continues until convergence
7. K-means Clustering • Assignment • Update
8. K-means Clustering • Advantages • Accepts large data • Fast • Will always find a solution • Disadvantages • Choosing the wrong number of groups • Can reach a local optimum, not the global one
9. K-means Clustering • When to use • Normally distributed data • Large number of samples • Not too many clusters • Distance can be measured in a linear fashion
10. Scikit-Learn • Python • Open-source machine learning library • Very well documented
11. Scikit-Learn • model = EstimatorObject() • Unsupervised: • model.fit(dataset.data) • dataset.data = the dataset's feature matrix (no labels needed)
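The estimator pattern on the slide, fleshed out with a concrete choice of estimator and dataset (KMeans and the iris data are illustrative stand-ins for "EstimatorObject" and "dataset"):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

dataset = load_iris()          # any object exposing a .data feature matrix
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(dataset.data)        # unsupervised: only the data, no targets

print(model.cluster_centers_.shape)  # one centroid per cluster
```

The same construct-then-fit pattern applies to every scikit-learn estimator; only unsupervised ones take the data alone.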
12. K-means in Scikit-Learn • Very fast • Data scientist: picks the number of clusters • Scikit-learn's KMeans: finds the initial centroids of the groups
13. Dataset • Name: Household Power Consumption by Individuals • Number of attributes: 9 • Number of instances: 2,075,259 • Missing values: Yes
14. K-means in Scikit-Learn
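A sketch of the basic fit, using a small synthetic stand-in for the household power-consumption data (the real file has missing values, which must be dropped or imputed before clustering; the sizes and column count here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))       # stand-in for the numeric attributes
X[rng.choice(1000, 50, replace=False), 0] = np.nan  # simulate missing values

X_clean = X[~np.isnan(X).any(axis=1)]   # KMeans cannot handle NaNs
model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_clean)
print(model.inertia_)                    # within-cluster sum of squares
```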
15. K-means in Scikit-Learn • Results
16. K-means Parameters • n_clusters • Number of clusters to form • max_iter • Maximum number of iterations of the algorithm in a single run • n_init • Number of times the k-means algorithm runs with different initialization points • init • Method used to initialize the centroids • precompute_distances • Yes, no, or let the machine decide • tol • Tolerance used to declare convergence • n_jobs • How many CPUs to engage when running the algorithm • random_state • Seed that determines the starting point of the algorithm
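Most of the parameters above can be set explicitly on the constructor. Note that `precompute_distances` and `n_jobs` were removed from `KMeans` in recent scikit-learn releases, so this sketch sets only the parameters that remain; the values shown are illustrative, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))

model = KMeans(
    n_clusters=4,        # number of clusters to form
    init="k-means++",    # centroid initialization method
    n_init=10,           # restarts with different initial centroids
    max_iter=300,        # iteration cap for a single run
    tol=1e-4,            # convergence tolerance on centroid movement
    random_state=42,     # reproducible starting point
).fit(X)
```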
17. n_clusters: choosing k • View the variance • cdist computes distances between two sets of observations • pdist computes pairwise distances between observations in the same set
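The difference between the two SciPy helpers in a tiny example:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0]])

# cdist: distances between two *different* sets of observations
print(cdist(A, B))      # [[0.], [5.]]

# pdist: pairwise distances within a *single* set (condensed 1-D form)
print(pdist(A))         # [5.]
```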
18. n_clusters: choosing k Step 1: Determine your k range Step 2: Fit the k-means model for each n_clusters = k Step 3: Pull out the cluster centers for each model
19. n_clusters: choosing k Step 4: Calculate the Euclidean distance from each point to each cluster center Step 5: Total within-cluster sum of squares Step 6: Total sum of squares Step 7: Between-cluster sum of squares (the difference of the previous two)
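The seven steps above can be sketched as follows. The variable names are my own, and the synthetic blobs stand in for the power-consumption data:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

K = range(1, 10)                                        # Step 1: k range
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
          for k in K]                                   # Step 2: fit each k
centers = [m.cluster_centers_ for m in models]          # Step 3: centers

d_min = [cdist(X, c, "euclidean").min(axis=1)
         for c in centers]                              # Step 4: nearest-center distances
wcss = [(d ** 2).sum() for d in d_min]                  # Step 5: within-cluster SS
tss = (cdist(X, X.mean(axis=0, keepdims=True)) ** 2).sum()  # Step 6: total SS
bss = [tss - w for w in wcss]                           # Step 7: between-cluster SS
```

Plotting `wcss` (or `bss`) against `K` gives the elbow curve used on the next slides.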
20. n_clusters: choosing k • Graphing the variance
21. n_clusters: choosing k n_clusters = 4 n_clusters = 7
22. n_clusters: choosing k • n_clusters = 8 (default)
23. init Methods and their meanings: • k-means++ • Selects initial clusters in a way that speeds up convergence • random • Chooses k rows at random for the initial centroids • ndarray giving the initial centers • Shape (n_clusters, n_features)
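All three `init` options, side by side (the data and explicit centers are illustrative; scikit-learn requires `n_init=1` when explicit centers are passed):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# ndarray init: explicit starting centers of shape (n_clusters, n_features)
centers = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
km_arr = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
```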
24. K-means (8) n_clusters = 8, init = k-means++ n_clusters = 8, init = random
25. K-means (7) n_clusters = 7, init = k-means++ n_clusters = 7, init = random
26. Comparing Results: Silhouette Score • Silhouette coefficient • Not black and white; lots of gray • Average distance between a data observation and the other points in its cluster • Average distance between a data observation and all points in the next-nearest cluster • Silhouette score in scikit-learn • Average silhouette coefficient over all data observations • The closer to 1, the better the fit • Computation time increases with larger datasets
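Computing the score in scikit-learn takes one call. The two tight, well-separated blobs here are illustrative and chosen so the score lands near 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)   # mean coefficient over all points
print(score)                          # near 1 for well-separated blobs
```

Comparing this score across candidate values of `n_clusters` and `init` is how the deck's result-comparison slide was produced.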
27. Result Comparison: Silhouette Score
28. What Do the Results Say? • Data patterns may in fact exist • Similar observations can be grouped • We need additional discovery
29. A Few Hacks • Clustering is a great way to explore your data and develop intuition • Too many features create a problem for understanding • Use dimensionality reduction • Use clustering with other methods
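One way to combine the last two hacks: reduce dimensionality before clustering. PCA is an illustrative choice here, not the only option, and the feature count is made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X = np.random.default_rng(0).normal(size=(300, 20))   # many features

# Project to 2 components, then cluster in the reduced space
pipeline = make_pipeline(
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
```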
30. Let's Connect • Twitter: @DamianMingle • LinkedIn: DamianRMingle • Sign up for Data Science Hacks