Clustering:
A Scikit-Learn Tutorial
Damian Mingle
A brief introduction to clustering with scikit-learn. In this presentation, we provide an overview, with real examples, of how to use and tune k-means clustering.
About Me
• Chief Data Scientist, WPC Healthcare
• Speaker
• Researcher
• Writer
Outline
• What is k-means clustering?
• How does it work?
• When is it appropriate to use it?
• K-means clustering in scikit-learn
  • Basic
  • Basic with adjustments
Clustering
• Unsupervised learning (inferring a function to describe not-so-obvious structure in unlabeled data)
• Groups data objects
• Measures distance between data points
• Helps in examining the data
K-means Clustering
• Formally: a method of vector quantization
• Informally: a mapping of a large set of inputs to a smaller, countable set
• Separates the data into groups of equal variance
• Uses the Euclidean distance metric
K-means Clustering
Repeated refinement in three basic steps:
• Step 1: Choose k (how many groups)
• Repeat:
  • Step 2: Assignment (label each data point as part of a group)
  • Step 3: Update (recompute each group's center)
The process continues until the assignments stop changing.
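The assignment/update loop above can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm, not scikit-learn's implementation; the toy data and the choice of k = 2 are made up:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest center, then update centers."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centers at random from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2 (assignment): label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # assignments have stabilized
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs should come out as two groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans(X, k=2)
```

Note this sketch omits the edge cases a real implementation handles, such as clusters that lose all their points.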
K-means Clustering
• Assignment
• Update
K-means Clustering
• Advantages
  • Accepts large datasets
  • Fast
  • Will always find a solution
• Disadvantages
  • Easy to choose the wrong number of groups
  • Can converge to a local optimum rather than the global one
K-means Clustering
• When to use
  • Normally distributed data
  • Large number of samples
  • Not too many clusters
  • Distance can be measured in a linear fashion
Scikit-Learn
• Python
• Open-source machine learning library
• Very well documented
Scikit-Learn
• Model = EstimatorObject()
• Unsupervised:
  • Model.fit(dataset.data)
  • dataset.data holds the dataset (the feature matrix)
K-means in Scikit-Learn
• Very fast
• The data scientist picks the number of clusters
• scikit-learn's KMeans finds the initial centroids of the groups
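Following the Model = EstimatorObject() pattern above, a minimal KMeans run looks like this. A small toy array stands in for dataset.data here; the cluster count and values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for dataset.data (the feature matrix)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# The data scientist picks the number of clusters...
model = KMeans(n_clusters=2, n_init=10, random_state=42)
# ...and scikit-learn finds the centroids
model.fit(X)

print(model.labels_)           # group label for each row of X
print(model.cluster_centers_)  # coordinates of the two centroids
```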
Dataset
Name: Household Power Consumption by Individuals
Number of attributes: 9
Number of instances: 2,075,259
Missing values: Yes
K-means in Scikit-Learn
• Results
K-means Parameters
• n_clusters: number of clusters to form
• max_iter: maximum number of iterations of the algorithm in a single run
• n_init: number of times the k-means algorithm will run with different initialization points
• init: method used to initialize the centroids
• precompute_distances: yes, no, or let the machine decide
• tol: how tolerant the algorithm should be when declaring convergence
• n_jobs: how many CPUs to engage when running the algorithm
• random_state: seed that fixes the starting point for the algorithm
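These parameters map directly onto the KMeans constructor. The values below are illustrative defaults; note that precompute_distances and n_jobs were removed in newer scikit-learn versions, so they are omitted here:

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=8,        # number of clusters to form (the default)
    init="k-means++",    # centroid initialization method
    n_init=10,           # runs with different initialization points
    max_iter=300,        # iteration cap for a single run
    tol=1e-4,            # convergence tolerance
    random_state=0,      # seed for reproducible starting points
)
print(model.get_params()["n_clusters"])
```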
n_clusters: choosing k
• View the variance
• cdist computes the distances between two sets of observations
• pdist computes the pairwise distances between observations in the same set
n_clusters: choosing k
Step 1: Determine your k range
Step 2: Fit the k-means model for each n_clusters = k
Step 3: Pull out the cluster centers for each model
n_clusters: choosing k
Step 4: Calculate the Euclidean distance from each point to each cluster center
Step 5: Total within-cluster sum of squares
Step 6: Total sum of squares
Step 7: Between-cluster sum of squares (the difference of Steps 6 and 5)
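Steps 1 through 7 can be sketched with the cdist/pdist functions mentioned earlier. The synthetic blobs and variable names are my own; the deck applied this to the household-power data:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

# Three synthetic, well-separated blobs (90 points, 2 features)
X = np.vstack([np.random.RandomState(0).normal(loc, 0.3, size=(30, 2))
               for loc in (0.0, 4.0, 8.0)])

k_range = range(1, 8)                                    # Step 1: k range
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
          for k in k_range]                              # Step 2: fit each k
centers = [m.cluster_centers_ for m in models]           # Step 3: centers

d_point_center = [cdist(X, c, "euclidean") for c in centers]          # Step 4
within_ss = [np.sum(np.min(d, axis=1) ** 2) for d in d_point_center]  # Step 5
total_ss = np.sum(pdist(X) ** 2) / len(X)                             # Step 6
between_ss = [total_ss - w for w in within_ss]                        # Step 7

# within_ss drops sharply up to the "elbow" (k = 3 here), then flattens
```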
n_clusters: choosing k
• Graphing the variance
n_clusters: choosing k
• Plots: n_clusters = 4 vs. n_clusters = 7
n_clusters: choosing k
• n_clusters = 8 (default)
init
Methods and their meaning:
• k-means++
  • Selects initial clusters in a way that speeds up convergence
• random
  • Chooses k rows at random for the initial centroids
• ndarray of shape (n_clusters, n_features)
  • Gives the initial centers explicitly
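All three init options above can be passed straight to KMeans. A quick illustration on made-up data; with points this cleanly separated, every option should converge to the same centers:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])

# k-means++ (the default): spread-out initial centers, faster convergence
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# random: choose k rows of X at random as the initial centroids
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

# ndarray of shape (n_clusters, n_features): explicit initial centers
start = np.array([[0.0, 0.0], [4.0, 4.0]])
km_arr = KMeans(n_clusters=2, init=start, n_init=1, random_state=0).fit(X)
```

Note that with an explicit ndarray init there is nothing to re-randomize, so n_init is set to 1.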
K-means (8)
• Plots: n_clusters = 8, init = k-means++ vs. init = random
K-means (7)
• Plots: n_clusters = 7, init = k-means++ vs. init = random
Comparing Results: Silhouette Score
• Silhouette coefficient
  • Not black and white, lots of gray
  • Mean distance between an observation and the other points in its cluster
  • Mean distance between an observation and all points in the NEXT nearest cluster
• Silhouette score in scikit-learn
  • Average silhouette coefficient over all observations
  • The closer to 1, the better the fit
  • Computation time increases with larger datasets
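scikit-learn exposes this as sklearn.metrics.silhouette_score. A minimal comparison on synthetic blobs (the data and candidate k values are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three tight, well-separated blobs
X = np.vstack([rng.normal(center, 0.2, size=(25, 2))
               for center in ([0, 0], [5, 0], [0, 5])])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all observations, in [-1, 1]
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # the true structure (3 blobs) should win
```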
Result Comparison: Silhouette Score
What Do the Results Say?
• Data patterns may in fact exist
• Similar observations can be grouped
• We need additional discovery
A Few Hacks
• Clustering is a great way to explore your data and develop intuition
• Too many features create a problem for understanding
  • Use dimensionality reduction
• Use clustering with other methods
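The dimensionality-reduction hack is commonly done with PCA ahead of KMeans. A sketch on synthetic data, assuming two underlying dimensions; the component count and mixing setup are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# 200 samples with 20 features, but the real structure lives in 2 dimensions
low_dim = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [4, 4])])
mixing = rng.normal(size=(2, 20))
X = low_dim @ mixing + rng.normal(0, 0.05, size=(200, 20))

# Reduce to a handful of components, then cluster in the reduced space
X_reduced = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
```

Clustering in the reduced space keeps the grouping interpretable and cheaper to compute.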
Let’s Connect
• Twitter: @DamianMingle
• LinkedIn: DamianRMingle
• Sign-up for Data Science Hacks