Machine Learning - Matt Moloney


Published on

Introduction to machine learning covering k-means clustering and support vector machines.


Published in: Technology, Education

  • 1. K-Means Clustering (Social @tsunamiide, Earthquake Enterprises)
  • 2. Two parts: a simple clustering algorithm, and using ML with large datasets.
  • 3. Why k-means? It is very elegant, scales to large datasets, is simple and easy to learn, and works with unsupervised data.
  • 4. Example uses: competitive analysis (compare products from Company A with Company B by clustering them into groups) and semi-structured search engines (show different results to different users depending on how they are classified, e.g. what Google thinks about you).
  • 5. The iris dataset: a multivariate data set (i.e. each row is a float[]); classification is labelled; not linearly separable; popular for testing ML algorithms.
  • 6. Iris data in (n-1)! charts.
  • 7. In higher dimensions (e.g. classifying text documents) charting no longer makes sense; we need to rely on derived metrics.
  • 8. Distance functions: Euclidean, Manhattan distance, angle between vectors, correlation.
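The slides list the distance functions without code; here is a minimal Python sketch of all four. The function names are my own, and "angle between" is implemented as the common cosine distance (1 minus the cosine of the angle).

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between the vectors); 0 when they point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def correlation_distance(a, b):
    # 1 - Pearson correlation: cosine distance on mean-centred vectors.
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return cosine_distance([x - ma for x in a], [y - mb for y in b])
```

Note that cosine and correlation distance ignore vector magnitude, which is why they behave well on the high-dimensional text-document features from the previous slide.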
  • 9. Feature scaling: many ML algorithms rely on the features being in the range [-1, 1] or [0, 1]. K-means will work with any range, but with many distance functions larger ranges will crowd out smaller ones. We can use this to emphasise some factors over others.
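A minimal sketch of what this slide describes: min-max scaling of a feature column to [0, 1], plus a weight to deliberately emphasise one feature over another in the distance calculation. Both function names are my own, not from the slides.

```python
def minmax_scale(column):
    # Rescale a list of floats to [0, 1]; a constant column maps to all 0.0.
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def weighted(column, weight):
    # Multiply a scaled column by a weight so it contributes more (or less)
    # to the distance function than other features.
    return [x * weight for x in minmax_scale(column)]
```

For example, `minmax_scale([2.0, 4.0, 6.0])` gives `[0.0, 0.5, 1.0]`, and `weighted(..., 2.0)` doubles that feature's influence on a Euclidean distance.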
  • 10. The algorithm: select the number of clusters (K); select a seed for each cluster (centroid); then do { assign each item in the training set to the closest centroid; update each centroid to the mean of the assigned items } while (any of the centroids have moved).
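The pseudocode above translates almost line for line into Python. This is a sketch, not the presenter's implementation: it assumes squared Euclidean distance for the assignment step and seeds the centroids by sampling rows from the dataset, as the next slide suggests.

```python
import random

def kmeans(rows, k, seed=0):
    # rows: list of equal-length float lists. Returns (centroids, assignments).
    rnd = random.Random(seed)
    # Seed each cluster's centroid with a randomly selected row.
    centroids = [list(r) for r in rnd.sample(rows, k)]
    assignments = [0] * len(rows)
    moved = True
    while moved:  # loop until no centroid has moved
        # Assignment step: each row goes to its nearest centroid.
        for i, row in enumerate(rows):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned rows.
        moved = False
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assignments[i] == c]
            if not members:
                continue  # empty cluster: leave its centroid where it is
            mean = [sum(col) / len(members) for col in zip(*members)]
            if mean != centroids[c]:
                centroids[c] = mean
                moved = True
    return centroids, assignments
```

On well-separated data such as `[[0.0], [0.1], [10.0], [10.1]]` with k = 2, this converges in a couple of iterations to the two obvious groups.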
  • 11. Iris demo: the number of clusters is known (3); pick seeds by randomly selecting 3 rows from the dataset; we intentionally pick 3 close together for demonstration.
  • 12. Things to vary: number of clusters, distance functions, feature scaling, and datasets (e.g. the included abalone and breast cancer datasets).
  • 13. (demo slide, no text)
  • 14. Faster algorithms with more data will often beat slower algorithms with less data.
  • 15. Some algorithms do not scale well (e.g. layered neural networks can take many days, so they are not suited to tutorials), and ML algorithms need to be run repeatedly: tuning hyper-parameters, k-fold cross validation, feature discovery.
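K-fold cross validation, mentioned above as one reason algorithms get re-run many times, splits the data into k folds and trains k times, each time holding one fold out for testing. A minimal index-based sketch (the function name is my own):

```python
def kfold_splits(n, k):
    # Yield (train_indices, test_indices) pairs for k-fold cross validation
    # over a dataset of n rows: each fold is held out exactly once.
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test
```

With n = 10 and k = 5 this yields five splits, each testing on 2 rows and training on the other 8, which is exactly why a slow algorithm's cost is multiplied by k.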
  • 16. Feature discovery: Random Forest importance (built in, popular and effective); leave-one-out (my preferred).
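The slide names leave-one-out without detail; one common reading in a feature-discovery context is leave-one-feature-out importance: drop each feature in turn, re-score the model, and rank features by how much the score falls. A hypothetical sketch, where `score` is an assumed callback that evaluates a model on a given feature subset:

```python
def leave_one_out_importance(score, features):
    # score: hypothetical callable mapping a list of feature names to a
    # validation score. A feature's importance is the score drop observed
    # when that single feature is removed from the set.
    baseline = score(features)
    return {
        f: baseline - score([g for g in features if g != f])
        for f in features
    }
```

Unlike Random Forest's built-in importance, this works with any model, at the cost of one extra training run per feature.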
  • 17. Use a fast algorithm for feature discovery and a slow algorithm for the final solution; many competitions are won by starting the slow algorithm as soon as possible.