Machine Learning - Matt Moloney


  1. K-Means Clustering (Social: @tsunamiide, tsunami.io; Earthquake Enterprises)
  2. - Two parts
     - Simple Clustering Algorithm
     - Using ML with Large Datasets
  3. - Very elegant
     - Scales to large datasets
     - It is simple and easy to learn
     - Works with unsupervised data
  4. - Competitive Analysis
     - Compare products from Company A with Company B by clustering them into groups
     - Semi-Structured Search Engine
     - Show different results to different users depending on how they are classified
       ▪ What Google thinks about you: https://www.google.com/settings/ads/onweb/
  5. - Multivariate data set (i.e. each row is a float[])
     - Classification is labeled
     - Not linearly separable
     - Popular for testing ML Algorithms
  6. - Iris data in n(n-1)/2 pairwise charts (6 charts for its n = 4 features)
  7. - E.g. classifying text documents
     - Charting no longer makes sense
     - Need to rely on derived metrics
  8. - Euclidean distance
     - Manhattan distance
     - Angle between vectors
     - Correlation
  9. - Many ML algorithms rely on the features being in the range [-1,1] or [0,1]
     - K-means will work with any range, but for many distance functions larger ranges will crowd out smaller ones
     - We can use this to emphasize some factors over others
  10. select the number of clusters (K)
      select a seed for each cluster (centroid)
      do {
          assign each item in the training set to the closest centroid
          update each centroid to the mean of the assigned items
      } while (any of the centroids have moved)
  11. - Number of clusters is known (3)
      - Pick seeds by randomly selecting 3 rows from the dataset
      - We intentionally pick 3 close together for demonstration
  12. - Number of clusters
      - Distance functions
      - Feature scaling
      - Datasets
      - E.g. included abalone and breast cancer datasets
  13. (no text on this slide)
  14. - Faster algorithms with more data will often beat slower algorithms with less data.
  15. - Some algorithms do not scale well
      - e.g. Layered NN
      - can take many days (not suited to tutorials)
      - ML algorithms need to be run repeatedly
      - Tuning hyper-parameters
      - K-fold cross validation
      - Feature discovery
  16. - Random Forest
      - Built in, popular and effective
      - Leave one out
      - My preferred
  17. - Use a fast algorithm for factor discovery
      - Use a slow algorithm for final solution
      - Many competitions are won on starting the slow algorithm as soon as possible
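
The four distance measures listed on slide 8 can be sketched in plain Python as follows. This is an illustrative sketch, not code from the talk: the function names are hypothetical, and turning correlation into a distance as `1 - r` is one common convention assumed here.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def angle_between(a, b):
    """Angle in radians between two vectors (raises ZeroDivisionError for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))


def correlation_distance(a, b):
    """Assumed convention: 1 minus the Pearson correlation (undefined for constant vectors)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    covariance = sum(x * y for x, y in zip(da, db))
    spread = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 1.0 - covariance / spread
```

Any of these can be passed to a clustering routine as the "closest centroid" measure; which one suits a dataset depends on whether magnitude or direction of the feature vector matters more.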
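
Slide 9's point about feature ranges can be illustrated with min-max scaling, which maps every feature to [0, 1] so that no single large-range feature dominates the distance. The sketch below is an assumption about how one might do this, not the talk's code; `min_max_scale` and its `weights` parameter are hypothetical names, and the weights show how a feature could be deliberately emphasized afterwards.

```python
def min_max_scale(rows, weights=None):
    """Scale each feature column to [0, 1], then multiply by an optional per-feature weight."""
    n_features = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_features)]
    highs = [max(row[j] for row in rows) for j in range(n_features)]
    if weights is None:
        weights = [1.0] * n_features

    scaled = []
    for row in rows:
        new_row = []
        for j in range(n_features):
            span = highs[j] - lows[j]
            # A constant column carries no information; map it to 0.0.
            unit = (row[j] - lows[j]) / span if span > 0 else 0.0
            new_row.append(weights[j] * unit)
        scaled.append(new_row)
    return scaled


raw = [[0, 100], [5, 300], [10, 200]]
equal = min_max_scale(raw)                            # both features in [0, 1]
emphasized = min_max_scale(raw, weights=[1.0, 2.0])   # second feature counts double
```

With a weight of 2.0 the second feature spans [0, 2], so differences in it contribute more to most distance functions than differences in the first.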
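
The loop on slide 10, with seeds picked as random rows as on slide 11, can be sketched as below. This is a minimal illustration under stated assumptions rather than the speaker's implementation: it uses squared Euclidean distance, keeps the old centroid when a cluster ends up empty, and adds a `max_iter` guard; the names `k_means` and `closest` are hypothetical.

```python
import random


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(row, centroids):
    """Index of the centroid nearest to row (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda c: squared_euclidean(row, centroids[c]))


def k_means(rows, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Seed each cluster with a randomly selected row from the dataset.
    centroids = [list(row) for row in rng.sample(rows, k)]
    assignments = [0] * len(rows)

    for _ in range(max_iter):
        # Assign each item to the closest centroid.
        assignments = [closest(row, centroids) for row in rows]

        # Update each centroid to the mean of its assigned items.
        new_centroids = []
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # assumption: empty cluster keeps its centroid

        moved = new_centroids != centroids
        centroids = new_centroids
        if not moved:  # stop once no centroid has moved
            break

    return centroids, assignments


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = k_means(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; on real data the result can depend on the seeds, which is why the run is usually repeated with different starting rows.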
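
Slide 15 notes that algorithms are run repeatedly, for example under K-fold cross validation. As a rough sketch of why that multiplies the running time, the hypothetical helper below splits row indices into K folds; a model would then be trained K times, once per split. It assumes the rows have already been shuffled.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Every row is used for testing exactly once across the k runs.
splits = list(k_fold_indices(5, 2))
```

Each split requires a full training run, so an algorithm that takes days per run becomes impractical once cross validation and hyper-parameter tuning are layered on top.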
