VSSML16 L3. Clusters and Anomaly Detection

Valencian Summer School in Machine Learning 2016
Day 1 VSSML16
Lecture 3
Clusters and Anomaly Detection
Poul Petersen (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016

Published in: Data & Analytics

  1. September 8-9, 2016
  2. Cluster Analysis: Finding Similarities
     Poul Petersen, CIO, BigML, Inc
  3. Unsupervised Learning: Trees vs Clusters
     Trees (Supervised Learning)
       • Provide: labeled data
       • Learning Task: be able to predict label
     Clusters (Unsupervised Learning)
       • Provide: unlabeled data
       • Learning Task: group data by similarity
  4. Trees vs Clusters
     Trees: inputs “X” and label “Y”
       sepal length  sepal width  petal length  petal width  species
       5.1           3.5          1.4           0.2          setosa
       5.7           2.6          3.5           1.0          versicolor
       6.7           2.5          5.8           1.8          virginica
       …             …            …             …            …
     Learning Task: Find function “f” such that: f(X)≈Y
     Clusters: inputs only
       sepal length  sepal width  petal length  petal width
       5.1           3.5          1.4           0.2
       5.7           2.6          3.5           1.0
       6.7           2.5          5.8           1.8
       …             …            …             …
     Learning Task: Find “k” clusters such that the data in each cluster is self similar
  5. Use Cases
       • Customer segmentation
       • Item discovery
       • Similarity
       • Recommender
       • Active learning
  6. Customer Segmentation
       • Dataset of mobile game users.
       • Data for each user consists of usage statistics and an LTV based on in-game purchases.
       • Assumption: Usage correlates to LTV.
     GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.
     [Figure: clusters labeled 0%, 3%, 1%]
  7. Item Discovery
       • Dataset of 86 whiskies.
       • Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
     GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
     [Figure labels: Smoky, Fruity]
  8. Clustering Demo #1
  9. Similarity
       • Dataset of Lending Club Loans.
       • Mark any loan that is currently or has ever been late as “trouble”.
     GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population.
     [Figure: clusters labeled 0%, 3%, 7%, 1%]
  10. Active Learning
       • Dataset of diagnostic measurements of 768 patients.
       • Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*.
     GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
  11. Active Learning
     *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
  12. Human Example
     Cluster into 3 groups…
  13. Human Example
  14. Learning from Humans
       • Jesa used prior knowledge to select possible features that separated the objects.
           • “round”, “skinny”, “edges”, “hard”, etc.
       • Items were then clustered based on the chosen features
       • Separation quality was then tested to ensure:
           • met criteria of K=3
           • groups were sufficiently “distant”
           • no crossover
  15. Learning from Humans
     Create features that capture these object differences
       • Length/Width
           • greater than 1 => “skinny”
           • equal to 1 => “round”
           • less than 1 => invert
       • Number of Surfaces
           • distinct surfaces require “edges” which have corners
           • easier to count
  16. Cluster Features
       Object   Length / Width  Num Surfaces
       penny    1               3
       dime     1               3
       knob     1               4
       eraser   2.75            6
       box      1               6
       block    1.6             6
       screw    8               3
       battery  5               3
       key      4.25            3
       bead     1               2
  17. Plot by Features (K=3)
     [Plot: Num Surfaces vs. Length / Width; points: box, block, eraser, knob, penny, dime, bead, key, battery, screw]
     K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
  18. Plot by Features
     [Plot: Num Surfaces vs. Length / Width; points: box, block, eraser, knob, penny, dime, bead, key, battery, screw]
     K-Means: Find “best” (minimum distance) circles that include all points
  19. K-Means Algorithm (K=3)
  20. K-Means Algorithm (K=3)
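
The K-Means slides above show the algorithm only as figures, so here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm), run on the feature table from the "Cluster Features" slide. The function name `kmeans` and the hand-picked starting centers are assumptions made for illustration; they are not taken from the slides or from BigML's implementation, and the features are left unscaled.

```python
import math

# Feature table from the "Cluster Features" slide: (length/width, num surfaces).
OBJECTS = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4),
    "eraser": (2.75, 6), "box": (1.0, 6), "block": (1.6, 6),
    "screw": (8.0, 3), "battery": (5.0, 3), "key": (4.25, 3),
    "bead": (1.0, 2),
}

def kmeans(points, centers, max_iter=100):
    """Alternate an assignment step and a centroid update until stable."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

names = list(OBJECTS)
points = [OBJECTS[n] for n in names]
# Hypothetical starting centers: three instances picked by hand.
start = [OBJECTS["penny"], OBJECTS["box"], OBJECTS["screw"]]
assignment, centers = kmeans(points, start)
clusters = {
    c: sorted(n for n, a in zip(names, assignment) if a == c) for c in range(3)
}
print(clusters)
```

With these particular starting centers the loop settles on one group of round objects (bead, dime, knob, penny), one of six-surface objects (block, box, eraser), and one of skinny objects (battery, key, screw). Other starting centers can converge to a different grouping.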
  21. Features Matter
     [Figure labels: Metal, Other, Wood]
  22. Convergence
       • Convergence guaranteed but not necessarily unique
       • Starting points important (K++)
  23. Starting Points
       • Random points or instances in n-dimensional space
       • Choose points “farthest” away from each other
           • but this is sensitive to outliers
       • k++
           • the first center is chosen randomly from instances
           • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center
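
The "k++" bullets above match the seeding scheme commonly called k-means++. A minimal sketch of that seeding follows; `kmeans_pp_init` and the `rng` parameter are hypothetical names introduced here, and the sample points reuse the length/width and surface counts from the "Cluster Features" slide.

```python
import math
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k starting centers using k-means++ style seeding."""
    rng = rng or random.Random()
    # First center: chosen uniformly at random from the instances.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its closest existing center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Sample the next center with probability proportional to d2.
        # Points equal to an existing center have weight 0 and are never picked.
        # Assumes at least k distinct points, otherwise all weights become 0.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(1.0, 3), (1.0, 3), (1.0, 4), (2.75, 6), (1.0, 6),
          (1.6, 6), (8.0, 3), (5.0, 3), (4.25, 3), (1.0, 2)]
seeds = kmeans_pp_init(points, 3, random.Random(0))
print(seeds)
```

Because far-away points get larger weights, the seeds tend to spread out, while the random draw keeps a single outlier from always being chosen.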
  24. Scaling
     [Figure: axes price and number of bedrooms; distances labeled d = 160,000 and d = 1]
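
To make the scaling point concrete, the sketch below compares a raw Euclidean distance with the same distance after min-max scaling. The three listings and the `min_max_scale` helper are hypothetical; only the idea of a 160,000 price gap next to a 1 bedroom gap comes from the slide, and min-max scaling is just one possible choice of scaling.

```python
import math

# Hypothetical listings: (price, number of bedrooms).
houses = [(200_000, 3), (360_000, 4), (205_000, 1)]

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

# Unscaled: the 160,000 price gap swamps the 1 bedroom gap.
raw = math.dist(houses[0], houses[1])

# Scaled: both features now contribute on a comparable footing.
scaled = min_max_scale(houses)
rescaled = math.dist(scaled[0], scaled[1])
print(raw, rescaled)
```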
  25. Other Tricks
       • What is the distance to a “missing value”?
       • What is the distance between categorical values?
       • What is the distance between text features?
       • Does it have to be Euclidean distance?
       • Unknown “K”?
  26. Distance to Missing Value?
       • Nonsense! Try replacing missing values with:
           • Maximum
           • Mean
           • Median
           • Minimum
           • Zero
       • Ignore instances with missing values
  27. Distance to Categorical?
       • Special distance function
           if valA == valB then
             distance = 0 (or scaling value)
           else
             distance = 1
       • Assign centroid the most common category of the member instances
     Approach: similar to “k-prototypes”
  28. Distance to Categorical?
                   feature_1  feature_2  feature_3
       instance_1  red        cat        ball
       instance_2  red        cat        ball
       instance_3  red        cat        box
       instance_4  blue       dog        fridge
     Distances shown: D = 0, D = 1, D = sqrt(3)
     Compute Euclidean distance between discrete vectors
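
The two categorical slides above can be sketched as follows: a per-field 0/1 mismatch combined in a Euclidean fashion, plus a centroid built from the most common category per field. The function names are hypothetical, and pairing the slide's three D values with consecutive rows is an assumption that happens to reproduce them.

```python
import math
from collections import Counter

def categorical_distance(a, b):
    """Euclidean distance where each field adds 0 if equal and 1 if not."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

def categorical_centroid(members):
    """Most common category per field among the member instances."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

instance_1 = ("red", "cat", "ball")
instance_2 = ("red", "cat", "ball")
instance_3 = ("red", "cat", "box")
instance_4 = ("blue", "dog", "fridge")

d_12 = categorical_distance(instance_1, instance_2)  # identical rows
d_23 = categorical_distance(instance_2, instance_3)  # one field differs
d_34 = categorical_distance(instance_3, instance_4)  # all three differ
centroid = categorical_centroid([instance_1, instance_2, instance_3, instance_4])
print(d_12, d_23, d_34, centroid)
```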
  29. Text Vectors
     Features (thousands): "hippo", "safari", "zebra", …
     Example vectors (rows labeled Text Field #1, Text Field #2): 1 0 0 …; 1 1 0 …; 0 0 0 …
     [Scale: Cosine Similarity from 1 to 0]
     Cosine Distance = 1 - Cosine Similarity
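
The formula on the text-vector slide, Cosine Distance = 1 - Cosine Similarity, can be sketched directly. The vectors below are truncated to the three term columns visible on the slide, and returning 1.0 for an all-zero vector is an assumption made here to avoid dividing by zero.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty text is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

# Columns: "hippo", "safari", "zebra" (the slide elides the remaining features).
text_a = [1, 0, 0]
text_b = [1, 1, 0]
d_ab = cosine_distance(text_a, text_b)
d_aa = cosine_distance(text_a, text_a)
print(d_ab, d_aa)
```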
  30. Finding K: G-Means
  31. Finding K: G-Means
  32. Finding K: G-Means
     Let K=2. Keep 1, Split 1. New K=3
  33. Finding K: G-Means
     Let K=3. Keep 1, Split 2. New K=5
  34. Finding K: G-Means
     Let K=5. K=5
  35. Clustering Demo #2
  36. Anomaly Detection: Finding the Unusual
     Poul Petersen, CIO, BigML, Inc
  37. Clusters vs Anomalies
     Clusters (Unsupervised Learning)
       • Provide: unlabeled data
       • Learning Task: group data by similarity
     Anomalies (Unsupervised Learning)
       • Provide: unlabeled data
       • Learning Task: rank data by dissimilarity
  38. Clusters vs Anomalies
     Clusters
       sepal length  sepal width  petal length  petal width
       5.1           3.5          1.4           0.2
       5.7           2.6          3.5           1.0
       6.7           2.5          5.8           1.8
       …             …            …             …
     Learning Task: Find “k” clusters such that the data in each cluster is self similar
     Anomalies
       sepal length  sepal width  petal length  petal width
       5.1           3.5          1.4           0.2
       5.7           2.6          3.5           1.0
       6.7           2.5          5.8           1.8
       …             …            …             …
     Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.
  39. Use Cases
       • Unusual instance discovery
       • Intrusion Detection
       • Fraud
       • Identify Incorrect Data
       • Remove Outliers
       • Model Competence / Input Data Drift
  40. Removing Outliers
       • Models need to generalize
       • Outliers negatively impact generalization
     GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling.
     [Diagram labels: DATASET, FILTERED DATASET, ANOMALY DETECTOR, CLEAN MODEL]
  41. Anomaly Demo #1
  42. Intrusion Detection
       • Dataset of command line history for users
       • Data for each user consists of commands, flags, working directories, etc.
       • Assumption: Users typically issue the same flag patterns and work in certain directories
     GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion.
     [Figure labels: Per User, Per Dir, All User, All Dir]
  43. Fraud
       • Dataset of credit card transactions
       • Additional user profile information
     GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels.
     [Figure labels: Card Level, User Level, Similar User Level]
  44. Model Competence
       • After putting a model into production, data that is being predicted can become statistically different from the training data.
       • Train an anomaly detector at the same time as the model.
     GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be
     [Diagram labels: Training Data, PREDICTION, ANOMALY SCORE, MODEL, ANOMALY DETECTOR]
  45. Univariate Approach
       • Single variable: heights, test scores, etc.
       • Assume the value is distributed “normally”
       • Compute standard deviation
           • a measure of how “spread out” the numbers are
           • the square root of the variance (the average of the squared differences from the mean)
       • Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~3 outliers.
  46. Univariate Approach
     [Plot: frequency vs. measurement, with two regions labeled “outliers”]
       • Available in BigML API
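
A minimal sketch of the univariate approach described above: compute the mean and the population standard deviation (the square root of the average squared difference from the mean), then flag values more than a chosen multiple of standard deviations away. The `flag_outliers` name and the sample heights are hypothetical; this is not the BigML API call mentioned on the slide.

```python
import statistics

def flag_outliers(values, multiple=3.0):
    """Return the values farther than `multiple` standard deviations from the mean."""
    mean = statistics.mean(values)
    # pstdev is the population standard deviation: sqrt(mean squared deviation).
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > multiple * std]

# Hypothetical heights in cm, with one implausible entry.
heights = [168, 170, 172, 169, 171, 170, 173, 167, 170, 171,
           169, 172, 170, 168, 171, 170, 169, 172, 170, 250]
outliers = flag_outliers(heights)
print(outliers)
```

On very small samples a single extreme value inflates the standard deviation enough to hide itself, which is one reason the slide ties the multiple to the number of instances.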
  47. Benford’s Law
       • In real-life numeric sets the small digits occur disproportionately often as leading significant digits.
       • Applications include:
           • accounting records
           • electricity bills
           • street addresses
           • stock prices
           • population numbers
           • death rates
           • lengths of rivers
       • Available in BigML API
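
The slide states Benford's Law only qualitatively. Its standard form gives leading digit d the probability log10(1 + 1/d), and the sketch below compares that expectation with the observed leading digits of a data set. The helper names are hypothetical, the powers of two are simply a convenient sequence known to follow the law closely, and none of this is the BigML API call the slide refers to.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. '3.2...e+04'.
    return int(f"{abs(x):.15e}"[0])

def leading_digit_frequencies(values):
    """Observed share of each leading digit; assumes some nonzero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

expected = benford_expected()
observed = leading_digit_frequencies([2 ** n for n in range(1, 101)])
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

A large gap between the observed and expected shares is a hint that the numbers deserve a closer look, not proof that they are wrong.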
  48. Human Example
     Most Unusual?
  49. Human Example
     [Figure labels: “Round”; “Skinny”; “Corners”; “Skinny” but not “smooth”; No “Corners”; Not “Round”; Most unusual]
     Key Insight: The “most unusual” object is different in some way from every partition of the features.
  50. Human Example
       • Human used prior knowledge to select possible features that separated the objects.
           • “round”, “skinny”, “smooth”, “corners”
       • Items were then separated based on the chosen features
       • Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
  51. Learning from Humans
     Create features that capture these object differences
       • Length/Width
           • greater than 1 => “skinny”
           • equal to 1 => “round”
           • less than 1 => invert
       • Number of Surfaces
           • distinct surfaces require “edges” which have corners
           • easier to count
       • Smooth - true or false
  52. Anomaly Features
       Object   Length / Width  Num Surfaces  Smooth
       penny    1               3             TRUE
       dime     1               3             TRUE
       knob     1               4             TRUE
       eraser   2.75            6             TRUE
       box      1               6             TRUE
       block    1.6             6             TRUE
       screw    8               3             FALSE
       battery  5               3             TRUE
       key      4.25            3             FALSE
       bead     1               2             TRUE
  53. Random Splits
     [Figure: objects (box, block, eraser, knob, penny, dime, bead, key, battery, screw) partitioned by the splits smooth = True; length/width > 5; num surfaces = 6; length/width = 1; length/width < 2]
     Know that “splits” matter - don’t know the order
  54. Isolation Forest
     Grow a random decision tree until each instance is in its own leaf
     [Figure labels: “easy” to isolate, “hard” to isolate, Depth]
     Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
  55. Isolation Forest Scoring
            f_1   f_2  f_3
       i_1  red   cat  ball
       i_2  red   cat  ball
       i_3  red   cat  box
       i_4  blue  dog  pen
     Depths shown: D = 3, D = 6, D = 2
     [Figure label: Score]
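
The isolation forest slides above can be sketched in a few lines: isolate an instance with random splits, average the depth over many random trees, and map that average to a score between 0 and 1. Several things here are assumptions rather than content from the slides: the normalization 2^(-average depth / c(n)) is borrowed from the original isolation forest paper, only the path for one instance is grown instead of a full tree, and TRUE/FALSE in the "Anomaly Features" table is encoded as 1/0. This is not BigML's implementation.

```python
import math
import random

# "Anomaly Features" table: (length/width, num surfaces, smooth as 1/0).
OBJECTS = {
    "penny": (1.0, 3, 1), "dime": (1.0, 3, 1), "knob": (1.0, 4, 1),
    "eraser": (2.75, 6, 1), "box": (1.0, 6, 1), "block": (1.6, 6, 1),
    "screw": (8.0, 3, 0), "battery": (5.0, 3, 1), "key": (4.25, 3, 0),
    "bead": (1.0, 2, 1),
}

def isolation_depth(x, data, rng, depth=0):
    """Number of random splits needed before x is alone (or only with duplicates)."""
    if len(data) <= 1:
        return depth
    # Only features that still vary in this subset can be split on.
    splittable = [j for j in range(len(x))
                  if min(r[j] for r in data) < max(r[j] for r in data)]
    if not splittable:
        return depth  # every remaining row is identical
    j = rng.choice(splittable)
    split = rng.uniform(min(r[j] for r in data), max(r[j] for r in data))
    # Keep only the side of the split that contains x.
    if x[j] < split:
        side = [r for r in data if r[j] < split]
    else:
        side = [r for r in data if r[j] >= split]
    return isolation_depth(x, side, rng, depth + 1)

def c(n):
    """Normalizer: average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n

def anomaly_score(x, data, n_trees=200, seed=0):
    """Average the isolation depth over many random trees, then map to (0, 1)."""
    rng = random.Random(seed)
    avg = sum(isolation_depth(x, data, rng) for _ in range(n_trees)) / n_trees
    return 2 ** (-avg / c(len(data)))

rows = list(OBJECTS.values())
scores = {name: anomaly_score(row, rows) for name, row in OBJECTS.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {score:.3f}")
```

Instances that are easy to isolate end up at shallow depths and score closer to 1; instances buried among near-duplicates, such as the penny next to the dime, need more splits and score lower.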
  56. Model Competence
       • A low anomaly score means the loan is similar to the modeled loans.
       • A high anomaly score means you cannot trust the model.
       Prediction     T       T
       Confidence     86%     84%
       Anomaly Score  0.5367  0.7124
       Competent?     Y       N
     [Diagram labels: OPEN LOANS, PREDICTION, ANOMALY SCORE, CLOSED LOAN MODEL, CLOSED LOAN ANOMALY DETECTOR]
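
One way to act on the competence table above is to gate each prediction on its anomaly score. The sketch below is only illustrative: the `check_competence` name and the 0.6 cutoff are assumptions, with the cutoff chosen so that the two example rows come out as Y and N. The slides do not state which threshold is used.

```python
def check_competence(prediction, confidence, anomaly_score, threshold=0.6):
    """Attach a competence flag to a prediction based on its anomaly score."""
    return {
        "prediction": prediction,
        "confidence": confidence,
        "anomaly_score": anomaly_score,
        # Hypothetical rule: a score at or above the threshold means the input
        # looks unlike the training data, so the prediction is not trusted.
        "competent": anomaly_score < threshold,
    }

# The two example rows from the slide.
first = check_competence("T", 0.86, 0.5367)
second = check_competence("T", 0.84, 0.7124)
print(first["competent"], second["competent"])
```

Both rows carry similar confidence values, so the anomaly score is what separates them.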
  57. Anomaly Demo #2
