Machine Learning - Matt Moloney

Introduction to machine learning covering k-means clustering and support vector machines.

1. K-Means Clustering. Social: @tsunamiide, tsunami.io. Earthquake Enterprises.
2. Two parts:
   - Simple clustering algorithm
   - Using ML with large datasets
3.
   - Very elegant
   - Scales to large datasets
   - Simple and easy to learn
   - Works with unlabeled data (unsupervised learning)
4.
   - Competitive analysis
     - Compare products from Company A with Company B by clustering them into groups
   - Semi-structured search engine
     - Show different results to different users depending on how they are classified
       - What Google thinks about you: https://www.google.com/settings/ads/onweb/
5.
   - Multivariate data set (i.e. each row is a float[])
   - Each row is labeled with its class
   - Not linearly separable
   - Popular for testing ML algorithms
6. Iris data in n(n-1)/2 pairwise charts (6 charts for its 4 features)
7.
   - E.g. classifying text documents
   - Charting no longer makes sense
   - Need to rely on derived metrics
8.
   - Euclidean distance
   - Manhattan distance
   - Angle between
   - Correlation
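The four measures listed on this slide can be sketched in plain Python. This is a minimal illustration, not the code from the talk: the function names are hypothetical, and turning correlation into a distance as `1 - r` is one common convention assumed here.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-feature differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def angle_between(a, b):
    # Angle in radians between the vectors; undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp rounding error
    return math.acos(cosine)

def correlation_distance(a, b):
    # 1 - Pearson correlation (assumed convention); undefined for constant vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)
```

Euclidean and Manhattan distance depend on the magnitude of each feature, while the angle and correlation measures depend only on direction or shape, which is one reason the choice interacts with feature scaling.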
9.
   - Many ML algorithms rely on the features being in the range [-1,1] or [0,1]
   - K-means will work with any range, but for many distance functions features with larger ranges will crowd out those with smaller ranges
   - We can use this to emphasize some factors over others
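One way to picture the scaling point is min-max scaling to [0,1] with an optional per-feature weight. The sketch below is an assumption about how this could be done, not the talk's implementation; `min_max_scale` and its `weights` parameter are hypothetical names.

```python
def min_max_scale(rows, weights=None):
    # Scale every column to [0, 1], then multiply by an optional weight.
    # A weight above 1 makes that feature count for more in a distance.
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    if weights is None:
        weights = [1.0] * len(columns)
    scaled = []
    for row in rows:
        scaled.append([
            # Constant columns carry no information, so map them to 0.
            w * ((v - lo) / (hi - lo) if hi > lo else 0.0)
            for v, lo, hi, w in zip(row, lows, highs, weights)
        ])
    return scaled
```

For example, `min_max_scale(rows, weights=[1.0, 2.0])` would let differences in the second feature contribute twice as much to a Manhattan distance as differences in the first.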
10.
   - Select the number of clusters (K)
   - Select a seed for each cluster (centroid)
   - Do {
       assign each item in the training set to the closest centroid
       update each centroid to the mean of the assigned items
     } while (any of the centroids have moved)
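The loop on this slide can be sketched in Python roughly as follows. It assumes Euclidean distance (via `math.dist`, Python 3.8+), and it adds two things the slide does not mention: a `max_iterations` safety cap, and keeping the old centroid when a cluster ends up empty. The function names are hypothetical.

```python
import math

def closest(point, centroids):
    # Index of the centroid nearest to the point (Euclidean distance).
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    # Component-wise mean of a non-empty list of points.
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(rows, seeds, max_iterations=100):
    # K is implied by the number of seeds.
    centroids = [list(s) for s in seeds]
    assignments = []
    for _ in range(max_iterations):  # safety cap (assumption, not on the slide)
        # Assign each item to the closest centroid.
        assignments = [closest(row, centroids) for row in rows]
        # Update each centroid to the mean of its assigned items.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [r for r, a in zip(rows, assignments) if a == i]
            new_centroids.append(mean(members) if members else old)
        # Stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments
```

Calling `k_means(rows, seeds)` returns the final centroids together with the cluster index assigned to each row.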
11.
   - Number of clusters is known (3)
   - Pick seeds by randomly selecting 3 rows from the dataset
   - We intentionally pick 3 close together for demonstration
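Picking seeds by sampling rows at random might look like the sketch below. It does not reproduce the deliberately close-together seeds used in the demonstration; `pick_seeds` and its `seed` parameter are hypothetical names.

```python
import random

def pick_seeds(rows, k, seed=None):
    # Choose k rows at distinct positions to act as the initial centroids.
    # Passing a fixed seed makes the choice repeatable between runs.
    rng = random.Random(seed)
    return [list(row) for row in rng.sample(rows, k)]
```

Because the result of k-means depends on where the centroids start, fixing `seed` is useful when comparing other settings such as the distance function.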
12.
   - Number of clusters
   - Distance functions
   - Feature scaling
   - Datasets
     - E.g. the included abalone and breast cancer datasets
14. Faster algorithms with more data will often beat slower algorithms with less data.
15.
   - Some algorithms do not scale well
     - E.g. layered neural networks (NN)
     - Can take many days (not suited to tutorials)
   - ML algorithms need to be run repeatedly
     - Tuning hyper-parameters
     - K-fold cross validation
     - Feature discovery
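K-fold cross validation shows why repeated runs add up: the model is trained K times, each time holding out a different fold. A minimal sketch of the index split follows, assuming a simple round-robin assignment of rows to folds; `k_fold_indices` is a hypothetical helper, not part of the talk.

```python
def k_fold_indices(n_rows, k):
    # Deal row indices into k folds round-robin, then yield one
    # (train, test) pair per fold, holding that fold out for testing.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With K folds and several hyper-parameter settings to try, the number of training runs is their product, so the speed of a single run matters.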
16.
   - Random Forest
     - Built in, popular and effective
   - Leave one out
     - My preferred
17.
   - Use a fast algorithm for factor discovery
   - Use a slow algorithm for the final solution
   - Many competitions are won by starting the slow algorithm as soon as possible