A brief introduction to clustering with scikit-learn. This presentation gives an overview, with real examples, of how to use k-means clustering and tune its parameters.

Chief Data Scientist | Speaker | Author | Helping Others Extract Knowledge from Data.


- 1. Clustering: A Scikit-Learn Tutorial Damian Mingle
- 2. About Me • Chief Data Scientist, WPC Healthcare • Speaker • Researcher • Writer
- 3. Outline • What is k-means clustering? • How does it work? • When is it appropriate to use it? • K-means clustering in scikit-learn • Basic • Basic with adjustments
- 4. Clustering • It is unsupervised learning (inferring a function that describes hidden structure in unlabeled data) • Groups data objects • Measures distance between data points • Helps in examining the data
- 5. K-means Clustering • Formally: a method of vector quantization • Informally: a mapping of a large set of inputs to a countable, smaller set • Separates data into groups with equal variance • Makes use of the Euclidean distance metric
- 6. K-means Clustering • Iterative refinement in three basic steps: • Step 1: Choose k (how many groups) • Repeat: • Step 2: Assignment (label each data point as part of the group with the nearest centroid) • Step 3: Update (recompute each centroid as the mean of its group) • The process continues until the assignments stop changing or the iteration limit is reached
- 7. K-means Clustering • Assignment • Update
- 8. K-means Clustering • Advantages • Handles large datasets • Fast • Always converges to a solution • Disadvantages • The number of groups must be chosen in advance and can be chosen wrongly • May reach a local optimum rather than the global optimum
- 9. K-means Clustering • When to use • Normally distributed data • Large number of samples • Not too many clusters • Distance can be measured in a linear fashion
- 10. Scikit-Learn • Python • Open-source machine learning library • Very well documented
- 11. Scikit-Learn • Model = EstimatorObject() • Unsupervised: • Model.fit(dataset.data) • dataset.data = dataset
- 12. K-means in Scikit-Learn • Very fast • Data scientist: picks the number of clusters • Scikit-learn k-means: finds the initial centroids of the groups
- 13. Dataset Name: Household Power Consumption by Individuals Number of attributes: 9 Number of instances: 2,075,259 Missing values: Yes
- 14. K-means in Scikit-Learn
- 15. K-means in Scikit-Learn • Results
- 16. K-means Parameters • n_clusters • Number of clusters to form • max_iter • Maximum number of iterations of the algorithm in a single run • n_init • Number of times the k-means algorithm is run with different initial centroids; the best run is kept • init • Initialization method • precompute_distances • True, False, or 'auto' to let the library decide (deprecated and later removed in newer scikit-learn releases) • tol • Tolerance used to declare convergence • n_jobs • Number of CPUs to use when running the algorithm (also removed in newer scikit-learn releases) • random_state • Seed for the random number generator used to initialize the centroids
- 17. n_clusters: choosing k • View the variance • cdist (scipy.spatial.distance) computes the distance between each pair of observations drawn from two sets • pdist (scipy.spatial.distance) computes the pairwise distances between observations in the same set
- 18. n_clusters: choosing k Step 1: Determine your k range Step 2: Fit the k-means model for each n_clusters = k Step 3: Pull out the cluster centers for each model
- 19. n_clusters: choosing k Step 4: Calculate Euclidean distance from each point to each cluster center Step 5: Total within-cluster sum of squares Step 6: Total sum of squares Step 7: Between-cluster sum of squares (total minus within-cluster)
- 20. n_clusters: choosing k • Graphing the variance
- 21. n_clusters: choosing k n_clusters = 4 n_clusters = 7
- 22. n_clusters: choosing k • n_clusters = 8 (default)
- 23. init Methods and their meaning: • k-means++ • Selects initial cluster centers in a way that speeds up convergence • random • Chooses k rows at random as the initial centroids • ndarray • Array of shape (n_clusters, n_features) that gives the initial centers
- 24. K-means (8) n_clusters = 8, init = k-means++ n_clusters = 8, init = random
- 25. K-means (7) n_clusters = 7, init = k-means++ n_clusters = 7, init = random
- 26. Comparing Results: Silhouette Score • Silhouette coefficient • Not black and white, lots of gray • a: average distance between an observation and the other observations in its own cluster • b: average distance between an observation and all points in the NEXT nearest cluster • Coefficient for one observation: (b - a) / max(a, b), ranging from -1 to 1 • Silhouette score in scikit-learn • Average silhouette coefficient over all data observations • The closer to 1, the better the fit • Computation time increases with larger datasets
- 27. Result Comparison: Silhouette Score
- 28. What Do the Results Say? • Data patterns may in fact exist • Similar observations can be grouped • We need additional discovery
- 29. A Few Hacks • Clustering is a great way to explore your data and develop intuition • Too many features make the results hard to understand • Use dimensionality reduction • Use clustering with other methods
- 30. Let’s Connect • Twitter: @DamianMingle • LinkedIn: DamianRMingle • Sign-up for Data Science Hacks
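
The assignment and update steps from slides 6 and 7 can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function names, the toy points, the "random rows" initialization, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, k, max_iter=100, seed=0):
    """Repeat assignment and update until the labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # "random" init: k distinct rows of the data
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.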

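The "choosing k" steps on slides 18 and 19 and the silhouette comparison on slide 26 might look roughly like the following with scikit-learn and SciPy. This is a sketch under stated assumptions: the slides' household power consumption data is not reproduced here, so synthetic blobs from `make_blobs` stand in for it, and the k range, `n_init`, and `random_state` values are arbitrary choices for the example rather than the presenter's settings.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic data stands in for the dataset used in the slides.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Step 1: determine the k range.
k_range = range(2, 8)

# Step 2: fit a k-means model for each n_clusters = k.
models = [
    KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    for k in k_range
]

# Step 3: pull out the cluster centers for each model.
centers = [m.cluster_centers_ for m in models]

# Step 4: Euclidean distance from each point to each cluster center,
# keeping the distance to the nearest center.
nearest = [cdist(X, c, "euclidean").min(axis=1) for c in centers]

# Step 5: total within-cluster sum of squares for each k.
wcss = np.array([np.sum(d ** 2) for d in nearest])

# Step 6: total sum of squares, from the pairwise distances within X.
tss = np.sum(pdist(X) ** 2) / X.shape[0]

# Step 7: between-cluster sum of squares and the share of variance explained.
bss = tss - wcss
explained = bss / tss

# Silhouette score: average silhouette coefficient over all observations.
sil = [silhouette_score(X, m.labels_) for m in models]

for k, e, s in zip(k_range, explained, sil):
    print(f"k={k}  variance explained={e:.3f}  silhouette={s:.3f}")
```

Plotting `explained` against `k_range` gives the variance graph described on slide 20; swapping `init="k-means++"` for `init="random"` allows the kind of side-by-side comparison shown on slides 24 and 25.
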