Given at PyData SV 2014

In machine learning, clustering is a good way to explore your data and pull out patterns and relationships. Scikit-learn has some great clustering functionality, including the k-means clustering algorithm, which is among the easiest to understand. Let's take an in-depth look at k-means clustering and how to use it. This mini-tutorial/talk will cover what sort of problems k-means clustering is good at solving, how the algorithm works, how to choose k, how to tune the algorithm's parameters, and how to implement it on a set of data.


- 1. K-Means Clustering with Scikit-Learn Sarah Guido PyData SV 2014
- 2. About Me • Today: graduated from the University of Michigan! • Soon: data scientist at Reonomy • PyLadies co-organizer • @sarah_guido
- 3. Outline • What is k-means clustering? • How it works • When to use it • K-means clustering in scikit-learn • Basic implementation • Implementation with tuned parameters
- 4. Clustering • Unsupervised learning • Unlabeled data • Split observations into groups • Distance between data points • Exploring the data
- 5. K-means clustering • Formally: a method of vector quantization • Partition space into Voronoi cells • Separate samples into n groups of equal variance • Uses the Euclidean distance metric
- 6. K-means clustering • Iterative refinement • Three basic steps • Step 1: Choose k • Iterate over: • Step 2: Assignment • Step 3: Update • Repeats until convergence has been reached
- 7. K-means clustering • Assignment • Update
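The assignment and update formulas on slide 7 were an image and are not in this scrape. The two steps can be sketched in plain NumPy (an illustrative sketch with Euclidean distance, not scikit-learn's internals; the guard for empty clusters is my addition):

```python
import numpy as np

def assign(X, centroids):
    # Assignment step: label each point with the index of its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update(X, labels, centroids):
    # Update step: move each centroid to the mean of its assigned points
    # (keep the old centroid if a cluster ends up empty).
    new = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members):
            new[j] = members.mean(axis=0)
    return new

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centroids = X[rng.choice(len(X), 3, replace=False)]  # step 1: choose k = 3
for _ in range(10):                                  # iterate steps 2 and 3
    labels = assign(X, centroids)
    centroids = update(X, labels, centroids)
```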
- 8. K-means clustering • Advantages • Scales well • Efficient • Will always converge • Disadvantages • Choosing the wrong k • Convergence to local minimum
- 9. K-means clustering • When to use • Normally distributed data • Large number of samples • Not too many clusters • Distance can be measured in a linear fashion
- 10. Scikit-Learn • Machine learning module • Open-source • Built-in datasets • Good resources for learning
- 11. Scikit-Learn • Model = EstimatorObject() • Unsupervised: • Model.fit(dataset.data) • dataset.data is the feature matrix • Supervised would use the labels as a second parameter
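The estimator pattern on slide 11 might look like this with one of scikit-learn's built-in datasets (iris used here as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

dataset = load_iris()                      # a built-in dataset
model = KMeans(n_clusters=3, n_init=10)    # Model = EstimatorObject()
model.fit(dataset.data)                    # unsupervised: features only, no labels
# A supervised estimator would instead be fit with the labels as well:
# model.fit(dataset.data, dataset.target)
```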
- 12. K-means in scikit-learn • Efficient and fast • You: pick n clusters, kmeans: finds n initial centroids • Run clustering jobs in parallel
- 13. Dataset • UC Irvine (UCI) Machine Learning Repository • Individual household electric power consumption
- 14. K-means in scikit-learn
- 15. K-means in scikit-learn • Results
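Slides 14 and 15 showed the code and resulting plot as images. A minimal sketch of such a basic run, under the assumption of synthetic stand-in data rather than the household-power dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))    # hypothetical stand-in for the power features

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_    # one row per cluster centroid
sizes = np.bincount(km.labels_)  # number of observations in each cluster
```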
- 16. K-means parameters • n_clusters • max_iter • n_init • init • precompute_distances • tol • n_jobs • random_state
- 17. n_clusters: choosing k • Graphing the variance • Information criterion • Cross-validation
- 18. n_clusters: choosing k • Graphing the variance • from scipy.spatial.distance import cdist, pdist • cdist: distance computation between sets of observations • pdist: pairwise distances between observations in the same set
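The variance-graphing code on slides 18 to 21 was an image; a sketch of the elbow method using `cdist` as on slide 18 (synthetic blob data assumed for illustration):

```python
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

avg_within = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    d = cdist(X, km.cluster_centers_)        # distances to every centroid
    avg_within.append(d.min(axis=1).mean())  # mean distance to nearest centroid
# Plot k against avg_within and pick k at the "elbow" where the curve flattens.
```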
- 19. n_clusters: choosing k • Graphing the variance
- 20. n_clusters: choosing k • Graphing the variance
- 21. n_clusters: choosing k • Graphing the variance
- 22. n_clusters: choosing k n_clusters = 4 n_clusters = 7
- 23. n_clusters: choosing k • n_clusters = 8 (default)
- 24. init • k-means++ • Default • Selects initial centers in a way that speeds up convergence • random • Choose k rows at random for initial centroids • An ndarray giving the initial centers • Shape: (n_clusters, n_features)
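The three `init` options on slide 24 can be sketched as follows (blob data assumed; note `n_init=1` when passing an explicit ndarray, since there is nothing to re-randomize):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=1)

km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10).fit(X)  # the default
km_rand = KMeans(n_clusters=3, init='random', n_init=10).fit(X)   # random rows
seeds = X[:3]                                # ndarray: (n_clusters, n_features)
km_arr = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
```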
- 25. K-means revised • Set n_clusters • 7, 8 • Set init • k-means++, random
- 26. K-means revised n_clusters = 8, init = k-means++ n_clusters = 8, init = random
- 27. K-means revised n_clusters = 7, init = k-means++ n_clusters = 7, init = random
- 28. Comparing results: silhouette score • Silhouette coefficient • No ground truth • Mean distance between an observation and all other points in its cluster • Mean distance between an observation and all other points in the next nearest cluster • Silhouette score in scikit-learn • Mean of silhouette coefficient for all of the observations • Closer to 1, the better the fit • Large dataset == long time
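The silhouette comparison on slide 28 boils down to one call in scikit-learn (sketched here on synthetic blobs, not the slides' dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # mean coefficient; closer to 1 is better
```

On a large dataset the pairwise distances make this expensive, which is the "large dataset == long time" caveat on the slide.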
- 29. Comparing results: silhouette score • n_clusters=8, init=k-means++ • 0.8117 • n_clusters=8, init=random • 0.6511 • n_clusters=7, init=k-means++ • 0.7719 • n_clusters=7, init=random • 0.7037
- 30. What does this tell us? • Patterns exist • Groups of similar observations exist • Sometimes, the defaults work • We need more exploration!
- 31. A few tips • Clustering is a good way to explore your data • Intuition fails in high dimensions • Use dimensionality reduction • Combine with other models • Know your data
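The "use dimensionality reduction" tip on slide 31 is commonly realized by running PCA before k-means; a sketch of that pairing (my example, not from the slides):

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                       # 4 features per observation
X2 = PCA(n_components=2).fit_transform(X)  # project down to 2 dimensions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
```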
- 32. Materials and resources • Scikit-learn documentation • scikit-learn.org/stable/documentation.html • Datasets • http://archive.ics.uci.edu/ml/datasets.html • Mldata.org • Blogs • http://datasciencelab.wordpress.com/
- 33. Contact me! • Twitter: @sarah_guido • www.linkedin.com/in/sarahguido/ • https://github.com/sarguido
