
DutchMLSchool. Clusters and Anomalies

Cluster Analysis and Anomaly Detection (Unsupervised I) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.

Published in: Data & Analytics


  1. 1. 1st edition | July 8-11, 2019
  2. 2. BigML, Inc #DutchMLSchool Clusters Finding Similarities Poul Petersen CIO, BigML, Inc 2
  3. 3. BigML, Inc #DutchMLSchool What is Clustering? 3 • An unsupervised learning technique • No labels necessary • Useful for finding similar instances • Smart sampling/labelling • Finds “self-similar" groups of instances • Customer: groups with similar behavior • Medical: patients with similar diagnostic measurements • Defines each group by a “centroid” • Geometric center of the group • Represents the “average” member • Number of centroids (k) can be specified or determined
  4. 4. BigML, Inc #DutchMLSchool Cluster Centroids 4 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  5. 5. BigML, Inc #DutchMLSchool Cluster Centroids 5 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 (highlighted rows: similar) Same: auth = pin; amount ~ $100. Different: date: Mon != Wed; customer: Sally != Bob; account: 6788 != 3421; class: clothes != gas; zip: 26339 != 46140. Centroid: date = Wed (2 out of 3); customer = Bob; account = 3421; auth = pin; class = gas; zip = 46140; amount = $104.
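The centroid on this slide can be reproduced with a short sketch. It assumes the three "similar" rows are the pin transactions near $100 (amounts 135, 94 and 83), which is consistent with the slide's centroid amount of $104, and it treats account and zip as categorical fields. The function name `centroid` is hypothetical.

```python
from collections import Counter
from statistics import mean

# Assumed members of the "similar" group: the pin transactions near $100.
rows = [
    {"date": "Mon", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "clothes", "zip": "46140", "amount": 135},
    {"date": "Wed", "customer": "Sally", "account": "6788", "auth": "pin",
     "class": "gas", "zip": "26339", "amount": 94},
    {"date": "Wed", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "gas", "zip": "46140", "amount": 83},
]

def centroid(members):
    """Mean for numeric fields, most common value for categorical fields."""
    result = {}
    for field in members[0]:
        values = [m[field] for m in members]
        if all(isinstance(v, (int, float)) for v in values):
            result[field] = mean(values)
        else:
            result[field] = Counter(values).most_common(1)[0][0]
    return result

center = centroid(rows)
print(center)
```

The numeric field is averaged while every other field takes the majority value, which is why the centroid need not match any single row.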
  6. 6. BigML, Inc #DutchMLSchool Use Cases 6 • Customer segmentation • Which customers are similar? • How many natural groups are there? • Item discovery • What other items are similar to this one? • Similarity • What other instances share a specific property? • Recommender (almost) • If you like this item, what other items might you like? • Active learning • Labelling unlabelled data efficiently
  7. 7. BigML, Inc #DutchMLSchool Customer Segmentation 7 GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell. • Dataset of mobile game users. • Data for each user consists of usage statistics and an LTV based on in-game purchases • Assumption: Usage correlates to LTV 0% 3% 1%
  8. 8. BigML, Inc #DutchMLSchool Similarity 8 GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population • Dataset of Lending Club Loans • Mark any loan that is currently or has ever been late as “trouble” 0% 3% 7% 1%
  9. 9. BigML, Inc #DutchMLSchool Active Learning 9 GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data. • Dataset of diagnostic measurements of 768 patients. • Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
  10. 10. BigML, Inc #DutchMLSchool Active Learning 10 *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud. Or a million images which need to be labelled as cat/not-cat.
  11. 11. BigML, Inc #DutchMLSchool Item Discovery 11 GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste. • Dataset of 86 whiskies • Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics. Smoky Fruity
  12. 12. BigML, Inc #DutchMLSchool Clusters Demo #1 12
  13. 13. BigML, Inc #DutchMLSchool Human Expert 13 Cluster into 3 groups…
  14. 14. BigML, Inc #DutchMLSchool Human Expert 14
  15. 15. BigML, Inc #DutchMLSchool Human Expert 15 • Jesa used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “edges”, “hard”, etc • Items were then clustered based on the chosen features • Separation quality was then tested to ensure: • met criteria of K=3 • groups were sufficiently “distant” • no crossover
  16. 16. BigML, Inc #DutchMLSchool Human Expert 16 • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count Create features that capture these object differences
  17. 17. BigML, Inc #DutchMLSchool Clustering Features 17
      Object   | Length / Width | Num Surfaces
      penny    | 1              | 3
      dime     | 1              | 3
      knob     | 1              | 4
      eraser   | 2.75           | 6
      box      | 1              | 6
      block    | 1.6            | 6
      screw    | 8              | 3
      battery  | 5              | 3
      key      | 4.25           | 3
      bead     | 1              | 2
  18. 18. BigML, Inc #DutchMLSchool Plot by Features 18 Num Surfaces Length / Width box block eraser knob penny dime bead key battery screw K-Means Key Insight: We can find clusters using distances in n-dimensional feature space K=3
  19. 19. BigML, Inc #DutchMLSchool Plot by Features 19 Num Surfaces Length / Width box block eraser knob penny dime bead key battery screw K-Means Find “best” (minimum distance) circles that include all points
  20. 20. BigML, Inc #DutchMLSchool K-Means Algorithm 20 K=3
  21. 21. BigML, Inc #DutchMLSchool K-Means Algorithm 21 K=3 Repeat until centroids stop moving
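The "repeat until centroids stop moving" loop can be sketched in a few lines. This is a minimal illustration of plain k-means on the object features from the "Clustering Features" slide, not BigML's implementation. The starting centroids (penny, box, screw) are a hypothetical choice, and the features are left unscaled for simplicity.

```python
import math

# (length/width, num surfaces) from the "Clustering Features" slide.
objects = {
    "penny": (1.0, 3.0), "dime": (1.0, 3.0), "knob": (1.0, 4.0),
    "eraser": (2.75, 6.0), "box": (1.0, 6.0), "block": (1.6, 6.0),
    "screw": (8.0, 3.0), "battery": (5.0, 3.0), "key": (4.25, 3.0),
    "bead": (1.0, 2.0),
}

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, move each centroid to the
    mean of its members, and repeat until the centroids stop moving."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

names = list(objects)
points = [objects[n] for n in names]
start = [objects["penny"], objects["box"], objects["screw"]]  # hypothetical seeds
centroids, labels = kmeans(points, start)
groups = [sorted(n for n, label in zip(names, labels) if label == i) for i in range(3)]
print(groups)
```

With these seeds the loop settles on a "round" group, a six-surface group and a "skinny" group. Different seeds can converge to different groupings.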
  22. 22. BigML, Inc #DutchMLSchool Features Matter 22 Metal Other Wood
  23. 23. BigML, Inc #DutchMLSchool Convergence 23 Convergence guaranteed but not necessarily unique Starting points important (K++)
  24. 24. BigML, Inc #DutchMLSchool Starting Points 24 • Random points or instances in n-dimensional space • Might start "too close" • Risk of sub-optimal convergence
  25. 25. BigML, Inc #DutchMLSchool Sub-Optimal Convergence 25 (Figure panels: Sub-Optimal, Optimal; annotation: Arbitrarily Far Apart)
  26. 26. BigML, Inc #DutchMLSchool Starting Points 26 • Random points or instances in n-dimensional space • Might start "too close" • Risk of sub-optimal convergence • Choose points “farthest” away from each other • but this is sensitive to outliers • k++ • the first point is chosen randomly from instances • each subsequent point is chosen from the remaining instances with a probability proportional to the squared distance from the point's closest existing cluster center
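The k++ seeding rule described on this slide can be sketched as follows. The function name `kmeans_pp_seeds` is hypothetical, the points are the object features from the earlier slide, and the sketch assumes the data contains at least k distinct points.

```python
import math
import random

points = [
    (1.0, 3.0), (1.0, 3.0), (1.0, 4.0), (2.75, 6.0), (1.0, 6.0),
    (1.6, 6.0), (8.0, 3.0), (5.0, 3.0), (4.25, 3.0), (1.0, 2.0),
]

def kmeans_pp_seeds(points, k, rng):
    """Pick k starting centers: the first uniformly at random, each later one
    with probability proportional to its squared distance from the nearest
    center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        # Points that coincide with a chosen center get weight 0, so they
        # are never picked again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

seeds = kmeans_pp_seeds(points, 3, random.Random(0))
print(seeds)
```

Because far-away points are only more likely to be picked, not guaranteed, a single outlier has less influence than it would under a strict "farthest point" rule.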
  27. 27. BigML, Inc #DutchMLSchool K++ Initial Centers 27 K=3 (Figure annotations: Low Probability, High Probability, Highest Probability)
  28. 28. BigML, Inc #DutchMLSchool K++ Initial Centers 28 K=3 (Figure annotations: Low Probability, Low Probability)
  29. 29. BigML, Inc #DutchMLSchool K++ Initial Centers 29 K=3
  30. 30. BigML, Inc #DutchMLSchool Scaling Matters 30 price number of bedrooms d = 160,000 d = 1
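A small sketch shows why the slide contrasts d = 160,000 with d = 1. The three houses below are hypothetical, and min-max scaling is only one of several reasonable choices (standardizing by the standard deviation is another).

```python
import math

# Hypothetical houses as (price, number of bedrooms).
houses = [(240_000, 2), (400_000, 3), (320_000, 5)]

def min_max_scale(rows):
    """Rescale every column to the range 0..1 so no field dominates."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1 for c in columns]  # avoid dividing by 0
    return [
        tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]

raw_distance = math.dist(houses[0], houses[1])
scaled = min_max_scale(houses)
scaled_distance = math.dist(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

On the raw values the distance is essentially the price difference, so the bedroom count has no say. After scaling, both fields contribute on a comparable footing.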
  31. 31. BigML, Inc #DutchMLSchool Other Tricks 31 • What is the distance to a “missing value”? • What is the distance between categorical values? • How far is “red” from “green”? • What is the distance between text features? • Does it have to be Euclidean distance? • Unknown ideal number of clusters, “K”?
  32. 32. BigML, Inc #DutchMLSchool Distance to Missing? 32 • Nonsense! Try replacing missing values with: • Maximum • Mean • Median • Minimum • Zero • Ignore instances with missing values
  33. 33. BigML, Inc #DutchMLSchool Distance to Categorical? 33 • Define special distance function: For two instances 𝑥 and 𝑦 and the categorical field 𝑎: • if 𝑥_𝑎 = 𝑦_𝑎 then distance(𝑥, 𝑦) = 0 (or field scaling value), else distance(𝑥, 𝑦) = 1 • Approach: similar to “k-prototypes”
  34. 34. BigML, Inc #DutchMLSchool Distance to Categorical? 34 animal favorite toy toy color cat ball red cat ball green d=0 d=0 d=1 cat laser red dog squeaky red d=1 d=1 d=0 D = 1 Then compute Euclidean distance between vectors D = √2 Note: the centroid is assigned the most common category of the member instances
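The two worked distances on this slide (D = 1 and D = √2) can be checked with a minimal sketch. It uses a plain 0/1 mismatch per field with no field scaling, and the function name `categorical_distance` is hypothetical.

```python
import math

def categorical_distance(x, y):
    """Per-field distance is 0 on a match and 1 on a mismatch; the per-field
    values are then combined with the Euclidean norm."""
    mismatches = [0 if a == b else 1 for a, b in zip(x, y)]
    return math.sqrt(sum(m * m for m in mismatches))

# (animal, favorite toy, toy color) pairs from the slide.
first = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
second = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))
print(first, second)
```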
  35. 35. BigML, Inc #DutchMLSchool Text Vectors 35 1 Cosine Similarity 0 -1 "hippo" "safari" "zebra" …. 1 0 1 … 1 1 0 … 0 1 1 … Text Field #1 Text Field #2 Features(thousands) • Cosine Similarity • cos() between two vectors • 1 if collinear, 0 if orthogonal • only positive vectors: 0 ≤ CS ≤ 1 • Cosine Distance=1-Cosine Similarity • CD(TF1, TF2) = 0.5
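The slide's CD(TF1, TF2) = 0.5 can be reproduced with a short sketch. It assumes the two text fields correspond to the indicator rows (1, 0, 1) and (1, 1, 0) over the terms "hippo", "safari" and "zebra". Treating an all-zero vector as having similarity 0 is also an assumption made here.

```python
import math

def cosine_similarity(u, v):
    """cos of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty text shares nothing with any other
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)

text_field_1 = [1, 0, 1]  # assumed: contains "hippo" and "zebra"
text_field_2 = [1, 1, 0]  # assumed: contains "hippo" and "safari"
distance = cosine_distance(text_field_1, text_field_2)
print(round(distance, 6))
```

The two fields share one of their two terms, so the similarity is 1 / (√2 · √2) = 0.5 and the distance is 0.5 as well.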
  36. 36. BigML, Inc #DutchMLSchool Finding K: G-Means 36
  37. 37. BigML, Inc #DutchMLSchool Finding K: G-Means 37
  38. 38. BigML, Inc #DutchMLSchool Finding K: G-Means 38 Let K=2 Keep 1, Split 1 New K=3
  39. 39. BigML, Inc #DutchMLSchool Finding K: G-Means 39 Let K=3 Keep 1, Split 2 New K=5
  40. 40. BigML, Inc #DutchMLSchool Finding K: G-Means 40 Let K=5 K=5
  41. 41. BigML, Inc #DutchMLSchool Clusters Demo #2 41
  42. 42. BigML, Inc #DutchMLSchool Summary 42 • Cluster Purpose • Unsupervised technique for finding self-similar groups of instances • Number of centroids (k) can be specified or computed • Outputs list of centroids • Configuration: • Algorithm: K-means / G-means • Cluster Parameter: k or critical value • Default missing / Summary fields / Scales / Weights • Model Clusters • Centroid / Batchcentroids
  43. 43. BigML, Inc #DutchMLSchool Anomaly Detection Finding the Unusual Poul Petersen CIO, BigML, Inc 43
  44. 44. BigML, Inc #DutchMLSchool What is Anomaly Detection? 44 • An unsupervised learning technique • No labels necessary • Useful for finding unusual instances • Filtering, finding mistakes, 1-class classifiers • Finds instances that do not match • Customer: big or small spender for profile • Medical: healthy patient despite indicative diagnostics • Defines each unusual instance by an “anomaly score” • in BigML: 0=normal, 1=unusual, and 0.7 ≫ 0.6 ﹥0.5 • Standard deviation, distributions, etc
  45. 45. BigML, Inc #DutchMLSchool Clusters 45 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  46. 46. BigML, Inc #DutchMLSchool Clusters 46 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 similar
  47. 47. BigML, Inc #DutchMLSchool Anomaly Detection 47 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  48. 48. BigML, Inc #DutchMLSchool Anomaly Detection 48 date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 anomaly • Amount $2,459 is higher than all other transactions • It is the only transaction • In zip 21350 • for the purchase class "tech"
  49. 49. BigML, Inc #DutchMLSchool Use Cases 49 • Unusual instance discovery - "exploration" • Intrusion Detection - "looking for unusual usage patterns" • Fraud - "looking for unusual behavior" • Identify Incorrect Data - "looking for mistakes" • Remove Outliers - "improve model quality" • Model Competence / Input Data Drift
  50. 50. BigML, Inc #DutchMLSchool Removing Outliers 50 • Models need to generalize • Outliers negatively impact generalization GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling. DATASET FILTERED DATASET ANOMALY DETECTOR CLEAN MODEL
  51. 51. BigML, Inc #DutchMLSchool Diabetes Anomalies 51 DIABETES SOURCE DIABETES DATASET TRAIN SET TEST SET ALL MODEL CLEAN DATASET FILTER ALL MODEL ALL EVALUATION CLEAN EVALUATION COMPARE EVALUATIONS ANOMALY DETECTOR
  52. 52. BigML, Inc #DutchMLSchool Title 52
  53. 53. BigML, Inc #DutchMLSchool Intrusion Detection 53 GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion. • Dataset of command line history for users • Data for each user consists of commands, flags, working directories, etc. • Assumption: Users typically issue the same flag patterns and work in certain directories Per User Per Dir All User All Dir
  54. 54. BigML, Inc #DutchMLSchool Fraud 54 • Dataset of credit card transactions • Additional user profile information GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels. Card Level User Level Similar User Level
  55. 55. BigML, Inc #DutchMLSchool Model Competence 55 • After putting a model into production, data that is being predicted can become statistically different from the training data. • Train an anomaly detector at the same time as the model. GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted. Prediction T T Confidence 0.86 0.84 Anomaly Score 0.5367 0.7124 Competent? Y N At Prediction Time At Training Time DATASET MODEL ANOMALY DETECTOR
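The competence check amounts to gating each prediction on its anomaly score. The sketch below uses the two predictions from the slide. The cutoff of 0.6 is a hypothetical value chosen only so that the example reproduces the slide's Y/N column, and the names `ScoredPrediction` and `is_competent` are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ScoredPrediction:
    prediction: str
    confidence: float
    anomaly_score: float

def is_competent(item, max_anomaly_score=0.6):
    """Trust a prediction only if its input resembles the training data.
    The 0.6 cutoff is an assumed value, not a BigML default."""
    return item.anomaly_score < max_anomaly_score

scored = [
    ScoredPrediction("T", 0.86, 0.5367),
    ScoredPrediction("T", 0.84, 0.7124),
]
competent = ["Y" if is_competent(s) else "N" for s in scored]
print(competent)
```

Both predictions have similar confidence, so the anomaly score is the only signal here that separates them.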
  56. 56. BigML, Inc #DutchMLSchool Benford’s Law 56 • In real-life numeric sets the small digits occur disproportionately often as leading significant digits. • Applications include: • accounting records • electricity bills • street addresses • stock prices • population numbers • death rates • lengths of rivers • Available in BigML API
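Benford's law gives the expected share of each leading digit d as log10(1 + 1/d), so digit 1 leads about 30% of the time. The sketch below compares that expectation with the leading digits of the first hundred powers of two, a sequence known to follow the law closely. The helper names are hypothetical and this is not the BigML API.

```python
import math
from collections import Counter

def benford_expected():
    """Expected frequency of each leading digit d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(value):
    """First significant digit of a non-zero number."""
    return int(f"{abs(value):.15e}"[0])

def leading_digit_frequencies(values):
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = benford_expected()
observed = leading_digit_frequencies([2 ** n for n in range(1, 101)])
largest_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
print(round(expected[1], 4), observed[1], round(largest_gap, 4))
```

A dataset whose leading digits stray far from the expected shares is worth a closer look, though a deviation alone does not prove the data is wrong.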
  57. 57. BigML, Inc #DutchMLSchool Univariate Approach 57 • Single variable: heights, test scores, etc • Assume the value is distributed “normally” • Compute standard deviation • a measure of how “spread out” the numbers are • the square root of the variance (The average of the squared differences from the Mean.) • Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~ 3 outliers.
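The univariate rule can be sketched as follows, using the transaction amounts from the earlier slides as sample data. The function name `univariate_outliers` is hypothetical. The standard deviation is the population form, matching the slide's definition (the square root of the average squared difference from the mean).

```python
from statistics import mean, pstdev

def univariate_outliers(values, multiple=3.0):
    """Return values more than `multiple` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mu) > multiple * sigma]

amounts = [135, 401, 234, 94, 2459, 83, 51]
with_multiple_2 = univariate_outliers(amounts, multiple=2)
with_multiple_3 = univariate_outliers(amounts, multiple=3)
print(with_multiple_2, with_multiple_3)
```

With only seven instances, no value can sit three standard deviations from the mean, so the multiple has to be lowered to flag the $2,459 transaction. That is the sense in which the multiple depends on the number of instances.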
  58. 58. BigML, Inc #DutchMLSchool Univariate Approach 58 (Figure axes: measurement, frequency; regions labeled: outliers, outliers) • Available in BigML API
  59. 59. BigML, Inc #DutchMLSchool Multivariate Matters 59
  60. 60. BigML, Inc #DutchMLSchool Multivariate Matters 60
  61. 61. BigML, Inc #DutchMLSchool Human Expert 61 Most Unusual?
  62. 62. BigML, Inc #DutchMLSchool Human Expert 62 “Round” “Skinny” “Corners” “Skinny” but not “smooth” No “Corners” Not “Round” Key Insight: The “most unusual” object is different in some way from every partition of the features. Most unusual
  63. 63. BigML, Inc #DutchMLSchool Human Expert 63 • Human used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “smooth”, “corners” • Items were then separated based on the chosen features • Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
  64. 64. BigML, Inc #DutchMLSchool Human Expert 64 • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count • Smooth - true or false Create features that capture these object differences
  65. 65. BigML, Inc #DutchMLSchool Anomaly Features 65
      Object   | Length / Width | Num Surfaces | Smooth
      penny    | 1              | 3            | TRUE
      dime     | 1              | 3            | TRUE
      knob     | 1              | 4            | TRUE
      eraser   | 2.75           | 6            | TRUE
      box      | 1              | 6            | TRUE
      block    | 1.6            | 6            | TRUE
      screw    | 8              | 3            | FALSE
      battery  | 5              | 3            | TRUE
      key      | 4.25           | 3            | FALSE
      bead     | 1              | 2            | TRUE
  66. 66. BigML, Inc #DutchMLSchool Random Splits 66 Know that “splits” matter - don’t know the order (Figure: decision tree with splits length/width > 5, smooth?, num surfaces = 6, length/width = 1, length/width < 2, each with True/False branches; leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw)
  67. 67. BigML, Inc #DutchMLSchool Isolation Forest 67 Grow a random decision tree until each instance from a sample is in its own leaf “easy” to isolate “hard” to isolate Depth Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
  68. 68. BigML, Inc #DutchMLSchool Isolation Forest Scoring 68 D = 3 D = 6 D = 2 S=0.45 Map avg depth to final score f1 f2 f3 i1 red cat ball i2 red cat ball i3 red cat box i4 blue dog pen For the instance, i2 Find the depth in each tree
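The idea of growing random trees and mapping average depth to a score can be sketched on the objects from the "Anomaly Features" slide. The depth-to-score mapping below is the normalisation from the original Isolation Forest paper, 2^(-average depth / c(n)). Whether BigML uses exactly this mapping is an assumption, so the sketch is not expected to reproduce the slide's S = 0.45. All function names are hypothetical.

```python
import math
import random

# (length/width, num surfaces, smooth as 1/0) from the "Anomaly Features" slide.
objects = {
    "penny": (1.0, 3, 1), "dime": (1.0, 3, 1), "knob": (1.0, 4, 1),
    "eraser": (2.75, 6, 1), "box": (1.0, 6, 1), "block": (1.6, 6, 1),
    "screw": (8.0, 3, 0), "battery": (5.0, 3, 1), "key": (4.25, 3, 0),
    "bead": (1.0, 2, 1),
}

def expected_depth(n):
    """Average depth of an unsuccessful binary-search-tree lookup over n
    points; used to normalise depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def grow(rows, rng, depth=0, max_depth=8):
    """Split on a random feature at a random value until rows are isolated."""
    if len(rows) <= 1 or depth >= max_depth or all(r == rows[0] for r in rows):
        return {"size": len(rows)}
    varying = [f for f in range(len(rows[0])) if len({r[f] for r in rows}) > 1]
    feature = rng.choice(varying)
    values = [r[feature] for r in rows]
    split = rng.uniform(min(values), max(values))
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": grow(left, rng, depth + 1, max_depth),
        "right": grow(right, rng, depth + 1, max_depth),
    }

def depth_of(row, node, depth=0):
    """Depth at which the row lands, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + expected_depth(node["size"])
    branch = "left" if row[node["feature"]] < node["split"] else "right"
    return depth_of(row, node[branch], depth + 1)

def anomaly_score(row, trees, n):
    """Map average depth to a score: shallow (easy to isolate) -> near 1."""
    average = sum(depth_of(row, t) for t in trees) / len(trees)
    return 2.0 ** (-average / expected_depth(n))

rng = random.Random(0)
rows = list(objects.values())
trees = [grow(rows, rng) for _ in range(200)]
scores = {name: anomaly_score(row, trees, len(rows)) for name, row in objects.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:8s} {score:.3f}")
```

Objects that differ from the rest on several features, such as the screw, are separated after few splits and score higher. The penny and dime have identical features, so they always share a leaf and receive the same, lower score.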
  69. 69. BigML, Inc #DutchMLSchool Model Competence 69 • A low anomaly score means the loan is similar to the modeled loans. • A high anomaly score means you should not trust the model. Prediction T T Confidence 0.86 0.84 Anomaly Score 0.5367 0.7124 Competent? Y N OPEN LOANS PREDICTION ANOMALY SCORE CLOSED LOAN MODEL CLOSED LOAN ANOMALY DETECTOR
  70. 70. BigML, Inc #DutchMLSchool Title 70
  71. 71. BigML, Inc #DutchMLSchool 1-Class Classifier? 71 • You place an advertisement in a local newspaper • You collect demographic information about all responders • Now you want to market in a new locality with direct letters • To optimize mailing costs, you need to predict who will respond • But you cannot distinguish “not interested” from “didn’t see the ad” • Train an anomaly detector on the 1-class data • Pick the households with the lowest scores for mailing: • If a household has a low anomaly score, then it is “similar” to enough of your positive responders and therefore may respond as well • If a household has a high anomaly score, then it is dissimilar from all previous responders and therefore is less likely to respond.
  72. 72. BigML, Inc #DutchMLSchool Summary 72 • Anomaly detection is the process of finding unusual instances • Some techniques and how they work: • Univariate: standard deviation • Benford’s law • Isolation Forest • Applications • Filtering to improve models • Finding mistakes, fraud, and intruders • Knowing when to retrain a model (competence) • 1-class classifiers • In general… unsupervised learning techniques: • Require more finesse and interpretation • Are more commonly part of a multistep workflow
