Machine Learning - Matt Moloney

Introduction to machine learning covering k-means clustering and support vector machines.

Published in: Technology, Education

Transcript

  • 1. Social @tsunamiide tsunami.io Earthquake Enterprises K-Means Clustering
  • 2. Two parts
      ▪ Simple Clustering Algorithm
      ▪ Using ML with Large Datasets
  • 3.
      ▪ Very elegant
      ▪ Scales to large datasets
      ▪ Simple and easy to learn
      ▪ Works with unlabeled data (unsupervised learning)
  • 4.
      ▪ Competitive Analysis
          ▪ Compare products from Company A with Company B by clustering them into groups
      ▪ Semi-Structured Search Engine
          ▪ Show different results to different users depending on how they are classified
          ▪ What Google thinks about you: https://www.google.com/settings/ads/onweb/
  • 5.
      ▪ Multivariate data set (i.e. each row is a float[])
      ▪ Each row is labeled with its class
      ▪ Not linearly separable
      ▪ Popular for testing ML algorithms
  • 6. Iris data in n(n-1)/2 pairwise charts (6 charts for the 4 Iris features)
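
As a rough illustration of how many pairwise charts the Iris data needs, the sketch below enumerates every unordered pair of its four measurements. The feature names are those of the classic Iris dataset; the variable names are only illustrative and do not come from the slides.

```python
from itertools import combinations

# The four measurements recorded for each flower in the Iris dataset.
features = ["sepal length", "sepal width", "petal length", "petal width"]

# Each unordered pair of features gives one 2-D scatter chart.
pairs = list(combinations(features, 2))

# For n features there are n * (n - 1) / 2 such pairs.
n = len(features)
assert len(pairs) == n * (n - 1) // 2

for x_axis, y_axis in pairs:
    print(f"{x_axis} vs {y_axis}")
```

With four features this gives six charts, which is still easy to inspect by eye; the count grows quadratically as features are added.
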
  • 7.
      ▪ E.g. classifying text documents
      ▪ Charting no longer makes sense
      ▪ Need to rely on derived metrics
  • 8.
      ▪ Euclidean distance
      ▪ Manhattan distance
      ▪ Angle between vectors
      ▪ Correlation
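
The four measures listed on this slide can be sketched in plain Python as below. The function names are assumptions made for illustration, and the correlation measure is written as one minus the Pearson correlation so that, like the others, smaller values mean "more similar". The sketch assumes non-zero vectors of equal length (and non-constant vectors for the correlation measure).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def _cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_between(a, b):
    """Angle in radians between two vectors; ignores their lengths."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, _cosine(a, b))))

def correlation_distance(a, b):
    """One minus the Pearson correlation of the two vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return 1.0 - _cosine(centered_a, centered_b)

print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(angle_between([1, 0], [0, 1]))
print(correlation_distance([1, 2, 3], [2, 4, 6]))
```

Euclidean and Manhattan distance are sensitive to the magnitude of each feature, while the angle and correlation measures compare only the direction or shape of the two rows.
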
  • 9.
      ▪ Many ML algorithms expect features to be in the range [-1,1] or [0,1]
      ▪ K-means will work with any range, but with many distance functions features with larger ranges will crowd out those with smaller ranges
      ▪ We can use this to emphasize some factors over others
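
One common way to bring features into [0, 1] is min-max scaling; the sketch below is an assumption about how that could look, not the code used in the talk. The `weights` parameter is a hypothetical addition that shows how stretching one column afterwards makes it count for more in a distance calculation.

```python
def min_max_scale(rows, weights=None):
    """Rescale each column to [0, 1], then multiply it by an optional weight."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    if weights is None:
        weights = [1.0] * len(columns)
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0.
            w * ((v - lo) / (hi - lo) if hi > lo else 0.0)
            for v, lo, hi, w in zip(row, lows, highs, weights)
        ])
    return scaled

# The second column has a far larger range and would dominate a
# Euclidean distance if the rows were left unscaled.
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
print(min_max_scale(rows))
# Doubling the weight of the first column emphasizes it over the second.
print(min_max_scale(rows, weights=[2.0, 1.0]))
```
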
  • 10.
      ▪ Select the number of clusters (K)
      ▪ Select a seed for each cluster (centroid)
      ▪ do {
            assign each item in the training set to the closest centroid
            update each centroid to the mean of the assigned items
        } while (any of the centroids has moved)
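
The loop on this slide can be sketched in Python roughly as follows. This is a minimal illustration rather than the speaker's implementation: it assumes Euclidean distance, seeds the centroids with randomly chosen rows, leaves an empty cluster's centroid where it was, and adds a `max_iter` safety cap that the slide does not mention. All function and parameter names are hypothetical.

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Column-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(rows, k, rng=None, max_iter=100):
    rng = rng or random.Random()
    # Seed each cluster with a randomly selected row.
    centroids = [list(row) for row in rng.sample(rows, k)]
    assignment = []
    for _ in range(max_iter):
        # Assign each item to the closest centroid.
        assignment = [closest(row, centroids) for row in rows]
        # Update each centroid to the mean of its assigned items.
        new_centroids = []
        for i in range(k):
            members = [r for r, a in zip(rows, assignment) if a == i]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean(members) if members else centroids[i])
        # Stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

rows = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignment = k_means(rows, k=2, rng=random.Random(0))
print(centroids)
print(assignment)
```

On this toy data the two well-separated groups end up in different clusters; which group receives label 0 depends on the random seeds.
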
  • 11.
      ▪ Number of clusters is known (3)
      ▪ Pick seeds by randomly selecting 3 rows from the dataset
      ▪ We intentionally pick 3 close together for demonstration
  • 12.
      ▪ Number of clusters
      ▪ Distance functions
      ▪ Feature scaling
      ▪ Datasets
          ▪ E.g. the included abalone and breast cancer datasets
  • 13.
  • 14. Faster algorithms with more data will often beat slower algorithms with less data.
  • 15.
      ▪ Some algorithms do not scale well
          ▪ E.g. layered neural networks (NN) can take many days (not suited to tutorials)
      ▪ ML algorithms need to be run repeatedly
          ▪ Tuning hyper-parameters
          ▪ K-fold cross validation
          ▪ Feature discovery
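
To see why K-fold cross validation multiplies the running time, the sketch below generates the train/test splits: the model has to be trained once per fold. The helper name `k_fold_indices` is hypothetical, and the sketch assumes the rows have already been shuffled, since it splits them in order.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Each fold means one full training run on `train` and one evaluation on `test`.
for train, test in k_fold_indices(10, 3):
    print(len(train), test)
```

Combined with a hyper-parameter search, the number of training runs is the number of folds times the number of settings tried, which is why a slow algorithm quickly becomes impractical.
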
  • 16.
      ▪ Random Forest
          ▪ Built in, popular and effective
      ▪ Leave one out
          ▪ My preferred
  • 17.
      ▪ Use a fast algorithm for factor discovery
      ▪ Use a slow algorithm for the final solution
      ▪ Many competitions are won by starting the slow algorithm as soon as possible