Social @tsunamiide tsunami.io Earthquake Enterprises
K-Means Clustering

Machine Learning - Matt Moloney

Introduction to machine learning covering k-means clustering and support vector machines.

Published in: Technology, Education

  1. K-Means Clustering
  2. • Two parts
     • Simple Clustering Algorithm
     • Using ML with Large Datasets
  3. • Very elegant
     • Scales to large datasets
     • It is simple and easy to learn
     • Works with unsupervised data
  4. • Competitive Analysis
     • Compare products from Company A with Company B by clustering them into groups
     • Semi-Structured Search Engine
     • Show different results to different users depending on how they are classified
         ▪ What Google thinks about you: https://www.google.com/settings/ads/onweb/
  5. • Multivariate data set
     • (i.e. each row is a float[])
     • Classification is labeled
     • Not linearly separable
     • Popular for testing ML Algorithms
  6. • Iris data in n(n-1)/2 charts (one per pair of features; 6 for the 4 Iris features)
  7. • E.g. Classifying text documents
     • Charting no longer makes sense
     • Need to rely on derived metrics
  8. • Euclidean
     • Manhattan Distance
     • Angle between
     • Correlation
  9. • Many ML algorithms rely on the features being in the range [-1,1] or [0,1]
     • K-means will work with any range, but for many distance functions larger ranges will crowd out smaller ones
     • We can use this to emphasize some factors over others
  10. • select the number of clusters (K)
      • select a seed for each cluster (centroid)
      • do {
          • assign each item in the training set to the closest centroid
          • update each centroid to the mean of the assigned items
        } while (any of the centroids have moved)
  11. • Number of clusters is known (3)
      • Pick seeds by randomly selecting 3 rows from the dataset
      • We intentionally pick 3 close together for demonstration
  12. • Number of clusters
      • Distance functions
      • Feature scaling
      • Datasets
      • E.g. included abalone and breast cancer datasets
  13. (no text on this slide)
  14. • Faster algorithms with more data will often beat slower algorithms with less data.
  15. • Some algorithms do not scale well
      • e.g. Layered NN
      • can take many days (not suited to tutorials)
      • ML algorithms need to be run repeatedly
      • Tuning hyper-parameters
      • K-fold cross validation
      • Feature discovery
  16. • Random Forest
      • Built in, popular and effective
      • Leave one out
      • My preferred
  17. • Use a fast algorithm for factor discovery
      • Use a slow algorithm for final solution
      • Many competitions are won by starting the slow algorithm as soon as possible
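
The distance functions on slide 8 and the scaling remark on slide 9 can be sketched in plain Python. This is only an illustration: the slides name the measures but do not define them, so reading "angle between" as cosine distance and "correlation" as one minus the Pearson correlation is an assumption, and all function names here are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b). Assumes neither row is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def correlation_distance(a, b):
    """1 - Pearson correlation, i.e. cosine distance after mean-centering.
    Assumes neither row is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])

def min_max_scale(rows):
    """Rescale every column to [0, 1] so no feature dominates by range alone.
    A constant column is mapped to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]
```

With unscaled rows such as `[1, 100]` and `[3, 200]`, the second column contributes almost all of the Euclidean distance; after `min_max_scale` both columns carry comparable weight. Multiplying a scaled column by a constant is one way to emphasize that factor again.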
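
The loop on slide 10 can be written as a short Python sketch. It assumes Euclidean distance and rows given as sequences of floats; the `seeds` and `max_iter` parameters and the rule that an empty cluster keeps its old centroid are additions for illustration, not part of the slides.

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans(rows, k, seeds=None, rng_seed=0, max_iter=100):
    """Return (centroids, labels) for rows, a list of equal-length float rows."""
    if seeds is None:
        # Seed each cluster with a randomly selected row, as on slide 11.
        seeds = random.Random(rng_seed).sample(rows, k)
    centroids = [list(s) for s in seeds]
    for _ in range(max_iter):
        # Assign each item to the closest centroid.
        clusters = [[] for _ in range(k)]
        for row in rows:
            clusters[closest(row, centroids)].append(row)
        # Update each centroid to the mean of its assigned items.
        # An empty cluster keeps its previous centroid (an assumption).
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: stop
            break
        centroids = new_centroids
    labels = [closest(row, centroids) for row in rows]
    return centroids, labels
```

Passing `seeds` explicitly makes it possible to reproduce the demonstration on slide 11, where the starting centroids are deliberately chosen close together.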
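
Slide 15 lists K-fold cross validation among the reasons an algorithm has to be run repeatedly, and slide 16 mentions leave one out. A minimal sketch of how the folds might be generated follows; the helper name `k_fold_indices` and its parameters are hypothetical, and the slides do not say which implementation the talk used.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) index lists, one pair per fold.

    Every row index lands in exactly one test fold, so the model is
    trained and evaluated k times in total.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Leave one out is the special case `k == n_rows`, which makes the cost of repeated runs especially visible: one training run per row.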
