Machine learning


A brief overview of making recommendations using the k-nearest neighbours algorithm and Euclidean distance. Given at a Forward First Tuesday evening.

Published in: Technology, Business

  • Hope to show that it's not too complicated, very interesting, potentially valuable, and that the various parts are quite similar. What kinds of things does machine learning cover?
  • Increasing piles of data. Machine learning is complementary to data mining: evolving behaviours from empirical data.
  • Classic classification.
  • Product suggestions. A list from my Kindle suggestions; there are over 850,000 Kindle titles alone. Recommendations based on my purchases and content?
  • Google employs all kinds of machine learning: query result ranking, news story clustering.
  • 2 searches, one immediately after the other: one in Chrome, one in Safari. There's a difference!?
  • Social sites make use of recommendations. Instead of products, it's users to other users. This time it's pretty good.
  • Social sites make use of recommendations. Instead of products, it's users to other users.
  • Going to cover a high-level description of these 3 topics, and then explore some of the details through a classification example.
  • How much something is or isn't part of a group. Assign class labels using a classifier built from predictor values.
  • 16 things; we know there are 4 categories or labels. We want to automate the way we find a category for each thing.
  • Clustering: group a large number of things into groups of similar things.
  • 24 blobs; not sure what the categories are, we just want groups of similar things.
  • We've got 4 categories.
  • Let's take an example of recommending items to users.
  • 3 items, and 2 users.
  • We can see recommendations for items from those users.
  • For example, the red user shares 2 items...
  • ...with the blue user. We can use the blue user's preferences to identify things that the red user would be interested in...
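  • The idea above can be sketched minimally in Python (the deck's own code is Clojure; the user and item names here are invented for illustration): find users whose item sets overlap with the target user's, then suggest the items they have that the target user lacks.

    ```python
    def recommend(target, preferences):
        """preferences maps each user to the set of items they like."""
        target_items = preferences[target]
        suggestions = set()
        for user, items in preferences.items():
            # Any other user sharing at least one item is a candidate source.
            if user != target and items & target_items:
                # Offer the items they have that the target doesn't.
                suggestions |= items - target_items
        return suggestions

    prefs = {
        "red":  {"item1", "item2"},
        "blue": {"item1", "item2", "item3"},
    }
    print(recommend("red", prefs))  # → {'item3'}
    ```

    Here the red user shares 2 items with the blue user, so the blue user's remaining item is the recommendation.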
  • And for things like Twitter and Facebook, these graphs would be users to users.
  • This brings up an interesting point: how do we model the problem? The first thing we need to look at...
  • I mentioned it quite a lot, but what does that mean?
  • An interesting example: 2 films, how similar are they? Both star Jim Carrey.
  • Collaborative filtering: based on the behaviour of multiple people (for example).
  • How do we measure similarity? We can calculate distance...
  • One way is Euclidean distance, similar to the Pythagorean formula for calculating the sides of a triangle. What are q and p?
  • p and q are our vectors: 1) first we calculate the difference, 2) then square those differences (ensuring all numbers carry the same sign), 3) then sum the squares, 4) then take the square root of the sum. So, let's look at the results for our data.
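  • The four steps above, sketched in Python (the deck's own code is Clojure; this is just an illustration) using the item ratings from the slides:

    ```python
    import math

    def euclidean_distance(p, q):
        # 1) difference, 2) square, 3) sum, 4) square root
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    # Ratings per item across users A, B, C, as on the slides.
    item1 = [1.0, 3.0, 5.0]
    item2 = [2.0, 5.0, 2.0]
    item3 = [1.0, 3.0, 1.0]

    print(euclidean_distance(item1, item3))            # 4.0
    print(round(euclidean_distance(item2, item3), 2))  # 2.45
    ```

    These match the d column on the slides: item 3 is distance 4 from item 1 and roughly 2.45 from item 2.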
  • We can see that item 3 is closer to item 2 than to item 1. This can be seen in how the ratings for items 2 and 3 from all users have a similar shape. How does this look in code?
  • How about content-based calculations? Well, we break the content down into feature vectors.
  • This is our previous matrix of user and item ratings. What do we swap the users for?
  • We swap them for features. For example, if the items were documents, the features might be the words in those documents. Movies might break down into running length, actors, etc. Importantly, we measure similarity in the same way: with distance calculations. Let's put this in practice.
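  • One sketch of the swap just described, in Python (the vocabulary and documents here are invented for illustration): turn documents into word-count feature vectors, then apply the same distance calculation as before.

    ```python
    import math
    from collections import Counter

    def feature_vector(text, vocabulary):
        # Count how often each vocabulary word occurs in the document.
        counts = Counter(text.lower().split())
        return [float(counts[word]) for word in vocabulary]

    def euclidean_distance(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    vocab = ["free", "offer", "meeting"]
    doc_a = feature_vector("free offer free offer", vocab)    # [2.0, 2.0, 0.0]
    doc_b = feature_vector("meeting about the offer", vocab)  # [0.0, 1.0, 1.0]
    print(euclidean_distance(doc_a, doc_b))
    ```

    The items are now rows of features instead of rows of user ratings, but the similarity measure is unchanged.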
  • We've looked at how to represent data, and how to measure similarity. How do we turn that into an algorithm that can classify things?
  • One really simple one is k-nearest neighbours: find the most common category for our item among the k nearest items.
  • Our matrix from before shows the calculated distance of items 1 and 2 from item 3. But if we're classifying, we need to know what the categories are!
  • We've added the labels, so we can see that item 1 was spam and item 2 was ham. Items 1 and 2 represent our trained model: data and their labels. Let's drop the stuff we don't need any more.
  • We have just the labels and the distances from all other items to our new item. Back to our algorithm, k-nn. Method: find the most common label among the k nearest items to our item (in this case item 3). Given the above information, we'd classify it into the "Ham" category. If we had more data we'd just compare more neighbours. Time for some code...
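  • The method just described — take the k nearest items and pick their most common label — as a minimal Python sketch (the deck's actual implementation is the Clojure knn-classify shown in the slides; the data here matches the slides):

    ```python
    import math
    from collections import Counter

    def euclidean_distance(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def knn_classify(xs, k, rows, labels):
        # Sort the trained rows by distance to xs, take the k nearest,
        # and return the most common label among them.
        distances = sorted(zip((euclidean_distance(xs, row) for row in rows), labels))
        nearest = [label for _, label in distances[:k]]
        return Counter(nearest).most_common(1)[0][0]

    rows = [[1.0, 3.0, 5.0], [2.0, 5.0, 2.0]]  # Items 1 and 2 (trained)
    labels = ["Spam", "Ham"]
    print(knn_classify([1.0, 3.0, 1.0], 1, rows, labels))  # → Ham
    ```

    With k=1 the nearest trained item is item 2 (distance ≈ 2.45), so item 3 is classified as "Ham", just as in the slides.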
  • xs is the vector we're trying to classify, k is the number of nearest neighbours we'll measure the distance of, m is our trained matrix of data, and labels are the labels for the items in the matrix.
  • All very well, but how do we know our model is accurately categorising things?
  • A similar matrix to before: how can we use the empirical data to measure the effectiveness of the algorithm? We can take our data and consider part of it to be testing data...
  • Item 3 now becomes our test data: we have a calculated label and an observed label, so we can measure how well they match. This is the same for rating movies (for example) as well: how close is our estimated score to the actual measured score? Anyway, that brings us to the end of a whistle-stop tour.
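  • The hold-out idea above as a small Python sketch: keep part of the labelled data aside, classify it, and compare the calculated labels against the observed ones (plain accuracy here; the data matches the slides, the helper name is ours):

    ```python
    def accuracy(calculated, observed):
        # Fraction of test items whose calculated label matches the observed one.
        matches = sum(1 for c, o in zip(calculated, observed) if c == o)
        return matches / len(observed)

    # Item 3 was held out as test data: its observed label was "Ham",
    # and the classifier also calculated "Ham".
    print(accuracy(["Ham"], ["Ham"]))  # 1.0
    ```

    For a ratings predictor the comparison would be numeric instead: how close the estimated score is to the actual measured score.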

    1. Machine Learning
    2. Machine Learning: An Introduction
    3. Automated Insights
    4. Spam
    5. You might like ...
    6. The World
    7. People you should follow...
    8. People you may know...
    9. People you may know...
    10. Classifying / Clustering / Recommending
    11. Classifying
    12. Clustering
    13. Recommending
    14. Items / Users
    15. Items / Users
    16. Items / Users
    17. Items / Users
    18. Items / Users
    19. Modeling
    20. Similarity
    21. Movies
    22. Collaborative
    23. How to represent our data?
    24. Data

                User A   User B   User C
        Item 1  1.0      3.0      5.0

    25. Similarity?

                User A   User B   User C
        Item 1  1.0      3.0      5.0
        Item 2  2.0      5.0      2.0
        Item 3  1.0      3.0      1.0

    26. Euclidean Distance
    27. Euclidean Distance

        q  1.0  2.0  1.0
        p  2.0  5.0  3.0

    28. Euclidean Distance

                User A   User B   User C   d
        Item 1  1.0      3.0      5.0      4
        Item 2  2.0      5.0      2.0      2.45
        Item 3  1.0      3.0      1.0

    29. Euclidean Distance

        (defn euclidean-distance [v m]
          (let [num-of-rows (first (dim m))
                difference  (minus (matrix (repeat num-of-rows v)) m)]
            (sqrt (map sum-of-squares difference))))

        Clojure #ftw
    30. Content Based
    31. Distance

                User A   User B   User C
        Item 1  1.0      3.0      5.0
        Item 2  2.0      5.0      2.0
        Item 3  1.0      3.0      1.0

    32. Distance

                Feature A   Feature B   Feature C
        Item 1  1.0         3.0         5.0
        Item 2  2.0         5.0         2.0
        Item 3  1.0         3.0         1.0

    33. Classification Algorithm
    34. k-nearest neighbours
    35. Our Data

                A     B     C     d
        Item 1  1.0   3.0   5.0   4
        Item 2  2.0   5.0   2.0   2.45
        Item 3  1.0   3.0   1.0

    36. Our Model

                           A     B     C     d      Label
        Trained   Item 1   1.0   3.0   5.0   4      Spam
        Trained   Item 2   2.0   5.0   2.0   2.45   Ham
                  Item 3   1.0   3.0   1.0

    37. Our Model

                           Label   d
        Trained   Item 1   Spam    4
        Trained   Item 2   Ham     2.45
                  Item 3

    38. k-nn Classifier

        (defn knn-classify [xs k m labels]
          (let [sorted-labels (take k (map (partial nth labels)
                                           (sorted-indexes (euclidean-distance xs m))))
                category      (mode sorted-labels)]
            (if (seq? category)
              (first category)
              category)))

        Clojure #ftw
    39. Evaluation
    40. Our Model

                           Label   d
        Trained   Item 1   Spam    4
        Trained   Item 2   Ham     2.45
                  Item 3

    41. Our Model

                           Observed Label   Calculated Label
        Trained   Item 1   Spam
        Trained   Item 2   Ham
        Test      Item 3   Ham              Ham

    42. kʼthx