DutchMLSchool. Associations and Topic Models

DutchMLSchool. Association Discovery and Topic Modeling (Unsupervised II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.

Published in: Data & Analytics

  1. 1st edition | July 8-11, 2019
  2. BigML, Inc #DutchMLSchool
     Association Rules and Topic Modeling
     Charles Parker, VP, Machine Learning Algorithms
  3. Association Rules
  4. Association Discovery
     An unsupervised learning technique
       • No labels necessary
       • Useful for data discovery
     Finds "significant" correlations/associations/relations
       • Shopping cart: Coffee and sugar
       • Medical: High plasma glucose and diabetes
     Expresses them as "if then rules"
       • If "antecedent" then "consequent"
  5. Review of methods: clustering

     date  customer  account  auth  class    zip    amount
     Mon   Bob       3421     pin   clothes  46140     135
     Tue   Bob       3421     sign  food     46140     401
     Tue   Alice     2456     pin   food     12222     234
     Wed   Sally     6788     pin   gas      26339      94
     Wed   Bob       3421     pin   tech     21350    2459
     Wed   Bob       3421     pin   gas      46140      83
     Thr   Sally     6788     sign  food     26339      51

  6. Review of methods: clustering (same table, annotated "similar")
  7. Review: anomaly detection (same table)
  8. Review: anomaly detection (same table, annotated "anomaly")
  9. Association Discovery (same table)
  10. Association Discovery (same table, highlighting {customer = Bob, account = 3421})
  11. Association Discovery (same table, adding zip = 46140)
  12. Association Discovery (same table, adding {class = gas})
  13. Association Discovery (same table, adding amount < 100)
  14. Association Discovery (same table, with the highlights read as rules)
      Rules: Antecedent -> Consequent
       • {customer = Bob, account = 3421} -> zip = 46140
       • {class = gas} -> amount < 100
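
To make the two rules on slide 14 concrete, the sketch below encodes the transaction table and counts, for each rule, how many rows match the antecedent and how many of those also match the consequent. The helper names (`matching`, `rule_counts`) are made up for this illustration and are not part of any BigML API.

```python
# Transaction table from the slides, one tuple per row.
COLUMNS = ["date", "customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
transactions = [dict(zip(COLUMNS, row)) for row in ROWS]


def matching(rows, predicate):
    """Rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def rule_counts(rows, antecedent, consequent):
    """Return (rows matching antecedent, rows matching antecedent and consequent)."""
    antecedent_rows = matching(rows, antecedent)
    both_rows = matching(antecedent_rows, consequent)
    return len(antecedent_rows), len(both_rows)


# {customer = Bob, account = 3421} -> zip = 46140
bob_rule = rule_counts(
    transactions,
    lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    lambda r: r["zip"] == 46140,
)

# {class = gas} -> amount < 100
gas_rule = rule_counts(
    transactions,
    lambda r: r["class"] == "gas",
    lambda r: r["amount"] < 100,
)

print(bob_rule)  # (4, 3): three of Bob's four transactions are in zip 46140
print(gas_rule)  # (2, 2): both gas purchases are under 100
```

Neither rule has to hold for every row to be worth reporting; the metrics on the following slides quantify how strong each one is.
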
  15. Use Cases
       • Data Discovery: how do instances relate?
       • Market Basket Analysis: Items that go together
       • Behaviors that occur together
       • Web usage patterns
       • Intrusion detection
       • Fraud detection
       • Medical risk factors
  16. Association Metrics (example: associations between grocery items)
       • Coverage
       • Support
       • Confidence
       • Lift
       • Leverage
  17. Association Metrics: coverage
      Coverage: percentage of instances which match the antecedent “A”.
      [Diagram: Instances, A, C]
  18. Association Metrics: support
      Support: percentage of instances which match both the antecedent “A” and the consequent “C”.
      [Diagram: Instances, A, C]
  19. Association Metrics: confidence
      Confidence: percentage of instances matching the antecedent which also match the consequent, that is, Support / Coverage.
      [Diagram: Instances, A, C]
  20. Association Metrics: confidence
      Confidence ranges from 0% to 100%:
       • 0%: A never implies C
       • in between: A sometimes implies C
       • 100%: A always implies C
      [Diagrams: Instances, A, C; panels labeled A >> C, A = C, A << C]
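
The three definitions above can be sketched in a few lines of Python. The baskets are made-up toy data (not the grocery dataset used in the talk), and the function names are chosen for this illustration only.

```python
# Hypothetical toy baskets; each basket is the set of items in one cart.
baskets = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "bread"},
    {"milk", "bread"},
    {"sugar", "honey"},
]


def fraction(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def coverage(baskets, antecedent):
    """Share of instances matching the antecedent A."""
    return fraction(baskets, antecedent)


def support(baskets, antecedent, consequent):
    """Share of instances matching both A and C."""
    return fraction(baskets, antecedent | consequent)


def confidence(baskets, antecedent, consequent):
    """Share of the instances matching A that also match C (support / coverage)."""
    cov = coverage(baskets, antecedent)
    return support(baskets, antecedent, consequent) / cov if cov else 0.0


A, C = {"coffee"}, {"sugar"}
cov = coverage(baskets, A)       # coffee is in 3 of 5 baskets
sup = support(baskets, A, C)     # coffee and sugar are together in 2 of 5
conf = confidence(baskets, A, C)  # 2 of the 3 coffee baskets contain sugar
print(cov, sup, conf)
```
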
  21. Association Metrics: lift
      Lift: ratio of observed support to the support expected if A and C were statistically independent.
      Lift = Support / (p(A) * p(C)) = Confidence / p(C)
      Problem: if p(C) is "small" then… lift may be large.
      [Diagrams: Independent A, C; Observed A, C]
  22. Association Metrics: lift
       • Lift < 1: Negative Correlation
       • Lift = 1: No Correlation
       • Lift > 1: Positive Correlation
      [Diagrams: Independent A, C and Observed A, C for each case]
  23. Association Metrics: leverage
      Leverage: difference between observed support and the support expected if A and C were statistically independent.
      Leverage = Support - [ p(A) * p(C) ]
      [Diagrams: Independent A, C; Observed A, C]
  24. Association Metrics: leverage
       • Leverage < 0: Negative Correlation
       • Leverage = 0: No Correlation
       • Leverage > 0: Positive Correlation
      [Diagrams: Independent A, C and Observed A, C for each case; scale: -1…]
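
A small sketch of the two formulas, using invented probabilities, shows why the slides flag lift as problematic when p(C) is small: a rare pair that always occurs together gets a very large lift but a leverage close to zero. The function names and all numbers are assumptions for illustration.

```python
def lift(support, p_a, p_c):
    """Observed support divided by the support expected under independence."""
    return support / (p_a * p_c)


def leverage(support, p_a, p_c):
    """Observed support minus the support expected under independence."""
    return support - p_a * p_c


# Hypothetical common rule: A and C each appear in 60% of instances,
# and together in 40%.
common_lift = lift(0.40, 0.60, 0.60)
common_leverage = leverage(0.40, 0.60, 0.60)

# Hypothetical rare rule: A and C each appear in 1% of instances,
# and always together.
rare_lift = lift(0.01, 0.01, 0.01)
rare_leverage = leverage(0.01, 0.01, 0.01)

print(round(common_lift, 3), round(common_leverage, 4))
print(round(rare_lift, 3), round(rare_leverage, 4))
```

Under these numbers the rare rule has a far higher lift than the common one, while its leverage is smaller, because leverage is bounded by how many instances the rule can affect.
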
  25. Items Type
      Example field "items": coffee, sugar, milk, honey, dish soap, bread
       • Canonical example: shopping cart contents
       • Single feature describing a list of items
       • Each item separated by a comma (default)
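
As a rough illustration of what an items field holds, the sketch below splits the example value on the default comma separator. `parse_items` is a hypothetical helper written for this example, not BigML's parser.

```python
def parse_items(field, separator=","):
    """Split an items field into a set of trimmed, non-empty item names."""
    return {item.strip() for item in field.split(separator) if item.strip()}


cart = parse_items("coffee, sugar, milk, honey, dish soap, bread")
print(sorted(cart))
```

Note that "dish soap" stays a single item: the separator, not whitespace, delimits items.
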
  26. Use Cases
      GOAL: Discover “interesting” rules about what store items are typically purchased together.
       • Dataset of 9,834 grocery cart transactions
       • Each row is a list of all items in a cart at checkout
  27. Association Demo
  28. Summary
       • Unsupervised learning technique for discovering interesting associations
       • Outputs antecedent/consequent rules
       • Metrics: Support / Coverage / Confidence / Lift / Leverage
       • Useful for “items” type and market basket analysis
       • Applicable to understanding clusters and anomaly detectors
  29. Topic Modeling
  30. What is Topic Modeling?
       • Unsupervised algorithm
       • Learns only from text fields
       • Finds hidden topics that model the text
      [Diagram: Text Fields]
      Questions:
       • How is this different from the Text Analysis that BigML already offers?
       • What does it output and how do we use it?
  31. What is Topic Modeling?
       • Finds topics in your text fields
       • A topic is a distribution over terms
       • Terms with high probability in the same topic often occur together in the same document
       • Topics often correspond to real-world things that the document may be “about” (e.g., sports, cooking, technology)
       • Each document is “about” one or more topics
       • Usually each document is only about one or two topics
       • But in practice we assign a probability to every topic for every document
  32. Text Analysis
      "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
      great: appears 4 times
       1. Stem Words -> Tokens
       2. Remove tokens that occur too often
       3. Remove tokens that do not occur often enough
       4. Count occurrences of remaining “interesting” tokens
  33. Text Analysis
      "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
      Token counts:
        …  great  afraid  born  achieve  …
        …  4      1       1     1        …
      Model:
       • The token “great” occurs more than 3 times
       • The token “afraid” occurs no more than once
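
The four text-analysis steps can be sketched on the quotation from the slides. Everything here is a simplification: the stemmer only strips a trailing "ness", and a small hand-written stop-word set stands in for "tokens that occur too often" (in practice that decision is made across the whole dataset). None of these names come from BigML.

```python
import re
from collections import Counter

# Hypothetical stand-in for "tokens that occur too often".
STOP_WORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}


def stem(token):
    """Very crude stemmer for this example: strips a trailing 'ness' only."""
    if token.endswith("ness") and len(token) > 6:
        return token[:-4]
    return token


def token_counts(text, stop_words=STOP_WORDS, min_count=1):
    """Stem, drop too-frequent tokens, drop too-rare tokens, count the rest."""
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(t for t in tokens if t not in stop_words)
    return {t: n for t, n in counts.items() if n >= min_count}


QUOTE = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")

counts = token_counts(QUOTE)
print(counts)
```

With this stemmer, the three occurrences of "greatness" and the one of "great" collapse into a single token with count 4, matching the slide.
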
  34. Text Analysis
  35. Hodor!
  36. Text Analysis vs. Topic Modeling
      Text:
       • Creates thousands of hidden token counts
       • Token counts are independently uninteresting
       • No semantic importance
       • Co-occurrence limited to consecutive n-grams
      Topic Model:
       • Creates tens of topics that model the text
       • Topics are independently interesting
       • Semantic meaning extracted
       • Topics indicate broader co-occurrences
  37. Generating Documents
       • "Machine" that generates a random word with equal probability with each pull.
       • Pull random number of times to generate a document.
       • All documents can be generated, but most are nonsense.
      Vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
      Word probabilities: shoe ϵ, asteroid ϵ, flashlight ϵ, pizza ϵ, … ϵ
      Example documents:
       • shoe asteroid flashlight pizza…
       • plate giraffe purple jump…
       • Be not afraid of greatness: some are born great, some achieve greatness…
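
The uniform word "machine" is easy to simulate. The sketch below uses the vocabulary words visible on the slide; the `max_pulls` limit and the fixed random seed are assumptions added so the example is bounded and repeatable.

```python
import random

# Vocabulary words shown on the slide (the real vocabulary is much larger).
VOCABULARY = [
    "cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
    "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight",
]


def generate_document(vocabulary, rng, max_pulls=10):
    """Pull the machine a random number of times; every word is equally likely."""
    pulls = rng.randint(1, max_pulls)
    return [rng.choice(vocabulary) for _ in range(pulls)]


rng = random.Random(0)  # seeded only to make the example repeatable
doc = generate_document(VOCABULARY, rng)
print(" ".join(doc))
```

Any sequence of vocabulary words can come out of this machine, which is exactly why almost all of its documents are nonsense.
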
  38. Topic Model
      Intuition:
       • Written documents have meaning - one way to describe meaning is to assign a topic.
       • For our random machine, the topic can be thought of as increasing the probability of certain words.
      Vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
      Topic: travel (example output: airplane passport pizza …)
        word      probability
        travel    23,55 %
        airplane  2,33 %
        mars      0,003 %
        mantle    ϵ
        …         ϵ
      Topic: space (example output: mars quasar lightyear soda)
        word      probability
        space     38,94 %
        airplane  ϵ
        mars      13,43 %
        mantle    0,05 %
        …         ϵ
  39. Topic Model
       • Each text field in a row is concatenated into a document
       • The documents are analyzed to generate "k" related topics
       • Each topic is represented by a distribution of term probabilities
      Example documents: airplane passport pizza …; plate giraffe purple jump…
      Topic: "1"
        word      probability
        travel    23,55 %
        airplane  2,33 %
        mars      0,003 %
        mantle    ϵ
        …         ϵ
      Topic: "2"
        word      probability
        space     38,94 %
        airplane  ϵ
        mars      13,43 %
        mantle    0,05 %
        …         ϵ
      …
      Topic: "k"
        word       probability
        shoe       12,12 %
        coffee     3,39 %
        telephone  13,43 %
        paper      4,11 %
        …          ϵ
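
To illustrate "a topic is a distribution of term probabilities", the sketch below turns the travel and space tables into weighted word machines. Two assumptions are made: ϵ is replaced by a tiny constant, and because the tables list only a few words, sampling is renormalized over the listed words rather than over a full vocabulary.

```python
import random
from collections import Counter

EPSILON = 1e-6  # stand-in for the slides' "ϵ" (assumed value)

# Probabilities copied from the slides, written as fractions.
TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003, "mantle": EPSILON},
    "space": {"space": 0.3894, "airplane": EPSILON, "mars": 0.1343, "mantle": 0.0005},
}


def sample_words(topic, n, rng):
    """Draw n words from one topic; weights are relative, so they need not sum to 1."""
    words = list(topic)
    weights = [topic[word] for word in words]
    return rng.choices(words, weights=weights, k=n)


rng = random.Random(42)  # seeded only to make the example repeatable
travel_counts = Counter(sample_words(TOPICS["travel"], 10_000, rng))
space_counts = Counter(sample_words(TOPICS["space"], 10_000, rng))

print(travel_counts.most_common(2))
print(space_counts.most_common(2))
```

The same vocabulary underlies both machines; only the weights differ, so "mars" is common under the space topic and almost never drawn under travel.
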
  40. Training Topic Models
  41. Topic Distribution
      Intuition:
       • Any given document is likely a mixture of the modeled topics…
       • This can be represented as a distribution of topic probabilities
      Example document: "Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?"
       • Topic: travel (travel 23,55 %, airplane 2,33 %, mars 0,003 %, mantle ϵ, … ϵ): 11%
       • Topic: space (space 38,94 %, airplane ϵ, mars 13,43 %, mantle 0,05 %, … ϵ): 89%
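
One way to see how a document can be scored against fixed topics is the expectation-maximization sketch below, which estimates mixing proportions from the few words listed in the two tables. This is not BigML's inference procedure, and with such a tiny vocabulary it will not reproduce the 11% / 89% split on the slide; the ϵ value and the function name are assumptions.

```python
import re

EPSILON = 1e-6  # stand-in for the slides' "ϵ" (assumed value)

# Probabilities copied from the slides, written as fractions.
TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003, "mantle": EPSILON},
    "space": {"space": 0.3894, "airplane": EPSILON, "mars": 0.1343, "mantle": 0.0005},
}


def infer_topic_mix(text, topics, iterations=100):
    """Estimate topic proportions for one document with the topics held fixed."""
    known = set().union(*topics.values())  # every word listed in any topic
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t in known]
    names = list(topics)
    mix = {name: 1.0 / len(names) for name in names}
    if not tokens:
        return mix  # nothing to go on: stay with the uniform guess
    for _ in range(iterations):
        totals = dict.fromkeys(names, 0.0)
        for token in tokens:
            # E-step: how responsible is each topic for this token?
            weights = {n: mix[n] * topics[n].get(token, EPSILON) for n in names}
            norm = sum(weights.values())
            for n in names:
                totals[n] += weights[n] / norm
        # M-step: new proportions are the average responsibilities.
        mix = {n: totals[n] / len(tokens) for n in names}
    return mix


DOCUMENT = ("Will 2020 be the year that humans will embrace space "
            "exploration and finally travel to Mars?")
mix = infer_topic_mix(DOCUMENT, TOPICS)
print({name: round(p, 3) for name, p in mix.items()})
```

Only "space", "travel", and "mars" are recognized here; two of the three point to the space topic, so space receives roughly two thirds of the weight.
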
  42. Topic Distributions
  43. Prediction?
      [Diagram: two parallel pipelines]
       • Clustering: Unlabelled Data -> Batch Centroid -> Centroid Label
       • Topic Model: Unlabelled Data (Text Fields) -> Batch Topic Distribution -> topic 1 prob, topic 3 prob, …, topic k prob
  44. Topic Model Use Cases
       • As a preprocessor for other techniques
       • Building better models
       • Bootstrapping categories for classification
       • Recommendation
       • Discovery in large, heterogeneous text datasets
  45. Co-organized by: Sponsor: Business Partners:
