VSSML17 L4. Association Discovery and Latent Dirichlet Allocation

Valencian Summer School in Machine Learning 2017 - Day 1
Lecture 4: Association Discovery and Latent Dirichlet Allocation. By Poul Petersen and Charles Parker (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017

Published in: Data & Analytics

  1. 1. Valencian Summer School in Machine Learning 3rd edition September 14-15, 2017
  2. 2. BigML, Inc 2 Association Discovery Finding Meaningful Correlations Poul Petersen CIO, BigML, Inc
  3. 3. BigML, Inc 3Association Discovery Association Discovery • An unsupervised learning technique • No labels necessary • Useful for data discovery • Finds "significant" correlations/associations/relations • Shopping cart: Coffee and sugar • Medical: High plasma glucose and diabetes • Expresses them as "if then rules" • If "antecedent" then "consequent" • Significance measures • BigML: “Magnum Opus” from Geoff Webb
  4. 4. BigML, Inc 4Association Discovery Clusters date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  5. 5. BigML, Inc 5Association Discovery Clusters date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 similar
  6. 6. BigML, Inc 6Association Discovery Anomaly Detection date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  7. 7. BigML, Inc 7Association Discovery Anomaly Detection date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 anomaly
  8. 8. BigML, Inc 8Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  9. 9. BigML, Inc 9Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 {customer = Bob, account = 3421}
  10. 10. BigML, Inc 10Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140{customer = Bob, account = 3421}
  11. 11. BigML, Inc 11Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140{customer = Bob, account = 3421} {class = gas}
  12. 12. BigML, Inc 12Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140 amount < 100 {customer = Bob, account = 3421} {class = gas}
  13. 13. BigML, Inc 13Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140 amount < 100 Rules: Antecedent Consequent {customer = Bob, account = 3421} {class = gas}
  14. 14. BigML, Inc 14Association Discovery Use Cases • Market Basket Analysis: Items that go together • Data Discovery: how do instances relate? • Behaviors that occur together • Web usage patterns • Intrusion detection • Fraud detection • Bioinformatics • gene expression associated with outcomes • Medical risk factors
  15. 15. BigML, Inc 15Association Discovery What is interesting? • Infrequent patterns can be strong, but are they interesting? • Vodka and caviar • Storms and high water sales • Frequent patterns can be strong, but are they interesting? • Coffee and milk • High plasma glucose and diabetes • “Frequency” isn’t the answer… • Depends on the data and domain • We need better metrics to define what is interesting
  16. 16. BigML, Inc 16Association Discovery Association Metrics • Coverage: Percentage of instances which match antecedent “A”
  17. 17. BigML, Inc 17Association Discovery Association Metrics • Support: Percentage of instances which match antecedent “A” and consequent “C”
  18. 18. BigML, Inc 18Association Discovery Association Metrics • Confidence: Percentage of instances in the antecedent which also contain the consequent (Support / Coverage)
  19. 19. BigML, Inc 19Association Discovery Association Metrics • Confidence ranges from 0% to 100% • 0%: A never implies C • In between: A sometimes implies C • 100%: A always implies C
  20. 20. BigML, Inc 20Association Discovery Association Metrics • Lift: Ratio of observed support to support if A and C were statistically independent. • Lift = Support / [ p(A) * p(C) ] = Confidence / p(C) • Problem: if p(C) is "small" then… lift may be large.
  21. 21. BigML, Inc 21Association Discovery Association Metrics • Lift < 1: Negative Correlation (observed overlap of A and C is smaller than if independent) • Lift = 1: No Correlation (observed overlap matches independent) • Lift > 1: Positive Correlation (observed overlap is larger than if independent)
  22. 22. BigML, Inc 22Association Discovery Association Metrics • Leverage: Difference of observed support and support if A and C were statistically independent. • Leverage = Support - [ p(A) * p(C) ]
  23. 23. BigML, Inc 23Association Discovery Association Metrics • Leverage < 0: Negative Correlation • Leverage = 0: No Correlation • Leverage > 0: Positive Correlation • Scale: -1…
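The five metrics defined on slides 16-23 can be checked by hand against the seven-row transaction table from the earlier slides. The following is a minimal sketch, not BigML's implementation: the function name `rule_metrics` and the idea of encoding a rule as two Python predicates are assumptions made purely for illustration.

```python
from fractions import Fraction

# The seven transactions shown on the slides.
COLUMNS = ("date", "customer", "account", "auth", "class", "zip", "amount")
ROWS = [dict(zip(COLUMNS, r)) for r in [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]]


def rule_metrics(rows, antecedent, consequent):
    """Compute the slide metrics for the rule A -> C.

    `antecedent` and `consequent` are predicates over a row
    (an illustrative encoding, not a BigML API).
    """
    n = len(rows)
    n_a = sum(1 for r in rows if antecedent(r))
    n_c = sum(1 for r in rows if consequent(r))
    n_ac = sum(1 for r in rows if antecedent(r) and consequent(r))
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each match a row")
    coverage = Fraction(n_a, n)      # p(A)
    p_c = Fraction(n_c, n)           # p(C)
    support = Fraction(n_ac, n)      # p(A and C)
    confidence = support / coverage  # p(C | A)
    return {
        "coverage": coverage,
        "support": support,
        "confidence": confidence,
        "lift": confidence / p_c,
        "leverage": support - coverage * p_c,
    }


# The two rules highlighted on slide 13.
gas_rule = rule_metrics(
    ROWS, lambda r: r["class"] == "gas", lambda r: r["amount"] < 100)
bob_rule = rule_metrics(
    ROWS,
    lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    lambda r: r["zip"] == 46140)

for name, metrics in (("gas -> amount < 100", gas_rule),
                      ("Bob, 3421 -> zip 46140", bob_rule)):
    print(name, {k: str(v) for k, v in metrics.items()})
```

On this table the rule {class = gas} → amount < 100 has coverage 2/7, support 2/7, confidence 1, lift 7/3 and leverage 8/49. Exact fractions are used so the values can be compared with a hand calculation.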
  24. 24. BigML, Inc 24Association Discovery Magnum Opus • Select measure of interest: Leverage, Lift, etc • System finds the top-k associations on that measure within constraints • Must be statistically significant interaction between antecedent and consequent • Every item in the antecedent must increase the strength of association
  25. 25. BigML, Inc 25Association Discovery Basic AD Configuration 1. Search Strategy: Support/Coverage/Confidence/Lift/Leverage 2. Max Number of Associations: 1 to 500 (default 100) 3. Max Items in Antecedent: 1 to 10 (default 4) 4. Complement Items: True / False • False: Coffee and… • True: Not Coffee and… 5. Missing Items: True / False • False: Loan Description contains "Ferrari" and… • True: Loan Description is missing and…
  26. 26. BigML, Inc 26Association Discovery Data Types • numeric: 1, 2.0, 3, -5.4 • categorical: true, yes, red, mammal (classes A, B, C) • date-time: 2013-09-25 10:02, expanded into YEAR (2013), MONTH (September), DAY-OF-MONTH (25), DAY-OF-WEEK (Wednesday), HOUR (10), MINUTE (02) • text: Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. • text tokens: “great” appears 2 times, “afraid” appears 1 time, “born” appears 1 time, “some” appears 2 times
  27. 27. BigML, Inc 27Association Discovery Items Type • items: coffee, sugar, milk, honey, dish soap, bread • Canonical example: shopping cart contents • Single feature describing a list of items • Each item separated by a comma (default)
  28. 28. BigML, Inc 28Association Discovery Use Cases GOAL: Discover “interesting” rules about what store items are typically purchased together. • Dataset of 9,834 grocery cart transactions • Each row is a list of all items in a cart at checkout
  29. 29. BigML, Inc 29Association Discovery Association Demo #1
  30. 30. BigML, Inc 30Association Discovery Use Cases GOAL: Find general rules that indicate diabetes. • Dataset of diagnostic measurements of 768 patients. • Each patient labelled True/False for diabetes.
  31. 31. BigML, Inc 31Association Discovery Association Demo #2
  32. 32. BigML, Inc 32Association Discovery Medical Risks Decision Tree If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE Association Rule If plasma glucose > 146 then diabetes often TRUE
  33. 33. BigML, Inc 33Association Discovery Advanced AD Config 1. Measures: Set minimum criteria for AD measures 2. Minimum Significance: lower values reduce spurious rules 3. Consequent: Restrict rules to a specific consequent criterion 4. Discretization: How numeric values are handled 1. Pretty: rounds off discretized values: 20 instead of 20.234 2. Size: the number of ranges (default 5) 3. Type: equal population / width 4. Trim: Removes percentage of values from the tails
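The difference between the two discretization types (equal width and equal population) is easiest to see on skewed data such as the `amount` column of the transaction table. The sketch below is an illustration only and does not reproduce BigML's discretizer: the function names are hypothetical, and 3 ranges are used instead of the default 5 so that the output stays readable with only seven values.

```python
# The "amount" column from the transaction table on the earlier slides.
AMOUNTS = [135, 401, 234, 94, 2459, 83, 51]


def equal_width_counts(values, size):
    """Count values per range when the ranges all have the same width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / size
    counts = [0] * size
    for v in values:
        # The maximum would fall just past the last range; clamp it in.
        idx = min(int((v - lo) / width), size - 1)
        counts[idx] += 1
    return counts


def equal_population_bins(values, size):
    """Split sorted values into ranges holding (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // size:(i + 1) * n // size] for i in range(size)]


print(equal_width_counts(AMOUNTS, 3))
print(equal_population_bins(AMOUNTS, 3))
```

With equal width, the single large purchase (2459) stretches the ranges so that six of the seven amounts share the first range and the middle range is empty. Equal population spreads the values across all three ranges, at the cost of ranges with very different widths.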
  34. 34. BigML, Inc 34Association Discovery Association Demo #3
  35. 35. BigML, Inc 35Association Discovery Summary • Association Discovery Purpose • Unsupervised technique for discovering interesting associations • Outputs antecedent/consequent rules • Metrics: Support / Coverage / Confidence / Lift / Leverage • Items type: • Configuration: • Search strategy / Minimum Measures • Complementary rules / Missing • Consequent Filtering • Discretization • Additional Uses • Understanding clusters and anomaly detectors
  36. 36. Topic Modeling
  37. 37. BigML, Inc 2Topic Modeling - September 2017 Topic Modeling • Method for discovering structure in "unstructured" text. • Based on LDA, introduced by David Blei, Andrew Ng, and Michael I. Jordan in 2003. • Now "BigML Easy"
  38. 38. BigML, Inc 3Topic Modeling - September 2017 BigML Resources SOURCE DATASET CORRELATION STATISTICAL TEST MODEL ENSEMBLE LOGISTIC REGRESSION EVALUATION ANOMALY DETECTOR ASSOCIATION DISCOVERY SINGLE/BATCH PREDICTION SCRIPT LIBRARY EXECUTION Data Exploration Supervised Learning Unsupervised Learning Automation CLUSTER Scoring TOPIC MODEL
  39. 39. BigML, Inc 4Topic Modeling - September 2017 Unsupervised Learning Features Instances • Learn from instances • Each instance has features • There is no label Clustering Find similar instances Anomaly Detection Find unusual instances Association Discovery Find feature rules
  40. 40. BigML, Inc 5Topic Modeling - September 2017 Topic Model Text Fields • Unsupervised algorithm • Learns only from text fields • Finds hidden topics that model the text Questions: • How is this different from the Text Analysis that BigML already offers? • What does it output and how do we use it? • Unsupervised… model?
  41. 41. BigML, Inc 6Topic Modeling - September 2017 Text Analysis Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. great: appears 4 times Bag of Words
  42. 42. BigML, Inc 7Topic Modeling - September 2017 Text Analysis • Text: Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ‘em. • Token counts: great 4, afraid 1, born 1, achieve 1, … • Model: The token “great” occurs more than 3 times • The token “afraid” occurs no more than once
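The bag-of-words counts on the two Text Analysis slides can be reproduced with a few lines of code. This is a rough sketch rather than BigML's tokenizer: the `crude_stem` helper is a hypothetical stand-in for a real stemmer and only strips a trailing "ness", which is enough to fold "greatness" into "great" for this one sentence.

```python
import re
from collections import Counter

TEXT = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strip a trailing "ness".
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def bag_of_words(text):
    """Lowercase, split on non-letters, stem, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)


counts = bag_of_words(TEXT)
print(counts["great"], counts["afraid"], counts["born"], counts["achieve"])
# prints: 4 1 1 1
```

Each count then becomes one numeric feature per token, which is what lets a model state conditions such as "the token great occurs more than 3 times".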
  43. 43. BigML, Inc 8Topic Modeling - September 2017 Hodor!
  44. 44. Text Analysis Demo #1
  45. 45. BigML, Inc 10Topic Modeling - September 2017 Text Analysis vs Topic Models • Text Analysis: Creates thousands of hidden token counts • Token counts are independently uninteresting • No semantic importance • No measure of co-occurrence • Topic Model: Creates tens of topics that model the text • Topics are independently interesting • Semantic meaning extracted • Support for bigrams
  46. 46. BigML, Inc 11Topic Modeling - September 2017 Generative Modeling • Decision trees are discriminative models • Aggressively model the classification boundary • Parsimonious: Don’t consider anything you don’t have to • Topic Models are generative models • Come up with a theory of how the data is generated • Tweak the theory to fit your data
  47. 47. BigML, Inc 12Topic Modeling - September 2017 Generating Documents cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… shoe asteroid flashlight pizza… plate giraffe purple jump… Be not afraid of greatness: some are born great, some achieve greatness… • "Machine" that generates a random word with equal probability with each pull. • Pull random number of times to generate a document. • All documents can be generated, but most are nonsense. word probability shoe ϵ asteroid ϵ flashlight ϵ pizza ϵ … ϵ
  48. 48. BigML, Inc 13Topic Modeling - September 2017 Topic Model • Written documents have meaning - one way to describe meaning is to assign a topic. • For our random machine, the topic can be thought of as increasing the probability of certain words. Intuition: Topic: travel cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… airplane passport pizza … word probability travel 23,55 % airplane 2,33 % mars 0,003 % mantle ϵ … ϵ Topic: space cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… mars quasar lightyear soda word probability space 38,94 % airplane ϵ mars 13,43 % mantle 0,05 % … ϵ
  49. 49. BigML, Inc 14Topic Modeling - September 2017 Topic Model plate giraffe purple jump… Topic: "1" cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… word probability travel 23,55 % airplane 2,33 % mars 0,003 % mantle ϵ … ϵ Topic: "k" cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… word probability shoe 12,12 % coffee 3,39 % telephone 13,43 % paper 4,11 % … ϵ …Topic: "2" cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… word probability space 38,94 % airplane ϵ mars 13,43 % mantle 0,05 % … ϵ airplane passport pizza … plate giraffe purple jump… • Each text field in a row is concatenated into a document • The documents are analyzed to generate "k" related topics • Each topic is represented by a distribution of term probabilities
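The "random word machine" intuition can be written down directly: pick a topic for each word position, then draw a word from that topic's term distribution. The sketch below makes several assumptions for illustration: the two tiny topics and their probabilities are invented (they are not the values from the slides), the document length is fixed rather than random, and the function name `generate_document` is hypothetical.

```python
import random

# Hypothetical, hand-made topics: each maps a term to its probability.
TOPICS = {
    "travel": {"travel": 0.5, "airplane": 0.3, "passport": 0.15, "mars": 0.05},
    "space": {"space": 0.5, "mars": 0.3, "quasar": 0.15, "lightyear": 0.05},
}


def generate_document(topic_mix, length, rng):
    """Generate `length` words from a mixture of topics.

    `topic_mix` maps topic name -> probability of using that topic
    for any single word; `rng` is a random.Random instance.
    """
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topic_names]
    words = []
    for _ in range(length):
        # First pull: which topic is "speaking" for this word.
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # Second pull: which word that topic emits.
        terms = list(TOPICS[topic])
        probs = [TOPICS[topic][w] for w in terms]
        words.append(rng.choices(terms, weights=probs)[0])
    return words


rng = random.Random(0)
print(" ".join(generate_document({"travel": 0.11, "space": 0.89}, 12, rng)))
```

Fitting a topic model runs this story in reverse: given only the documents, it searches for the term distributions (and per-document mixtures) that would make the observed text likely.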
  50. 50. Topic Model Demo #1
  51. 51. BigML, Inc 16Topic Modeling - September 2017 Uses • As a preprocessor for other techniques • Bootstrapping categories for classification • Recommendation • Discovery in large, heterogeneous text datasets
  52. 52. BigML, Inc 17Topic Modeling - September 2017 Topic Distribution • Any given document is likely a mixture of the modeled topics… • This can be represented as a distribution of topic probabilities Intuition: Will 2020 be the year that humans will embrace space exploration and finally travel to Mars? Topic: travel cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… word probability travel 23,55 % airplane 2,33 % mars 0,003 % mantle ϵ … ϵ 11% Topic: space cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… word probability space 38,94 % airplane ϵ mars 13,43 % mantle 0,05 % … ϵ 89%
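To make the idea of a per-document topic distribution concrete, the sketch below estimates mixture weights for one document when the topics are already known. It is a deliberately simplified stand-in, not LDA inference and not BigML's method: it runs a plain expectation-maximization loop over fixed term distributions. The topics, the `EPSILON` floor (playing the role of "ϵ" on the slides), and the function name `topic_distribution` are all assumptions made for illustration.

```python
EPSILON = 1e-6  # stand-in for the tiny "ϵ" probability of unlisted terms

# Hypothetical, hand-made topics: each maps a term to its probability.
TOPICS = {
    "travel": {"travel": 0.5, "airplane": 0.3, "passport": 0.15, "mars": 0.05},
    "space": {"space": 0.5, "mars": 0.3, "quasar": 0.15, "travel": 0.05},
}


def topic_distribution(words, topics, iterations=50):
    """Estimate topic proportions for one document by simple EM."""
    if not words:
        raise ValueError("document must contain at least one word")
    names = list(topics)
    theta = {t: 1.0 / len(names) for t in names}  # start from uniform
    for _ in range(iterations):
        totals = {t: 0.0 for t in names}
        for w in words:
            # E-step: how responsible is each topic for this word?
            scores = {t: theta[t] * topics[t].get(w, EPSILON) for t in names}
            norm = sum(scores.values())
            for t in names:
                totals[t] += scores[t] / norm
        # M-step: new proportions are the average responsibilities.
        theta = {t: totals[t] / len(words) for t in names}
    return theta


sentence = ("will 2020 be the year that humans will embrace space "
            "exploration and finally travel to mars")
print(topic_distribution(sentence.split(), TOPICS))
```

Words that neither topic lists get the same tiny probability under both, so they do not pull the estimate in either direction. Only "space", "travel" and "mars" carry information in this sentence.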
  53. 53. Topic Model Demo #2
  54. 54. BigML, Inc 19Topic Modeling - September 2017 Clustering? • Clustering: Unlabelled Data → Batch Centroid → Centroid Label • Topic Model: Unlabelled Data (Text Fields) → Batch Topic Distribution → topic 1 prob, topic 3 prob, … topic k prob
  55. 55. Topic Model Demo #3
  56. 56. BigML, Inc 21Topic Modeling - September 2017 Some Tips • Setting k • Much like k-means, the best value is data specific • Too few will agglomerate unrelated topics, too many will partition highly related topics • I tend to find the latter more annoying than the former • Tuning the Model • Remove common, useless terms • Set term limit higher, use bigrams
