
BSSML16 L4. Association Discovery and Topic Modeling

Brazilian Summer School in Machine Learning 2016
Day 1 - Lecture 4: Association Discovery and Topic Modeling
Lecturer: Poul Petersen (BigML)

Published in: Data & Analytics

  1. December 8-9, 2016
  2. Poul Petersen, CIO, BigML, Inc. Association Discovery: Finding Interesting Correlations
  3. Association Discovery
      • Algorithm: “Magnum Opus” from Geoff Webb
      • Unsupervised Learning: Works with unlabelled data, like clustering and anomaly detection.
      • Learning Task: Find “interesting” relations between variables.
  4. Association Discovery: Unsupervised Learning
      date  customer  account  auth  class    zip    amount
      Mon   Bob       3421     pin   clothes  46140  135
      Tue   Bob       3421     sign  food     46140  401
      Tue   Alice     2456     pin   food     12222  234
      Wed   Sally     6788     pin   gas      26339  94
      Wed   Bob       3421     pin   tech     21350  2459
      Wed   Bob       3421     pin   gas      46140  83
      Thu   Sally     6788     sign  food     26339  51
      • Clustering: similar instances
      • Anomaly Detection: unusual instances
  5. Association Discovery: Association Rules
      Rules found in the same transaction table as slide 4:
      Antecedent                          Consequent
      {customer = Bob, account = 3421}    zip = 46140
      {class = gas}                       amount < 100
  6. Association Discovery: Use Cases
      • Market Basket Analysis
      • Web usage patterns
      • Intrusion detection
      • Fraud detection
      • Bioinformatics
      • Medical risk factors
  7. Association Discovery: Magnum Opus
      • Select measure of interest: Leverage, Lift, etc.
      • System finds the top-k associations on that measure within constraints
      • There must be a statistically significant interaction between antecedent and consequent
      • Every item in the antecedent must increase the strength of association
  8. Association Discovery: Association Metrics
      Coverage: Percentage of instances which match antecedent “A”
  9. Association Discovery: Association Metrics
      Support: Percentage of instances which match antecedent “A” and consequent “C”
  10. Association Discovery: Association Metrics
      Confidence: Percentage of instances in the antecedent which also contain the consequent.
      Confidence = Support / Coverage
  11. Association Discovery: Association Metrics
      Confidence ranges from 0% to 100%:
      • 0%: A never implies C
      • in between: A sometimes implies C
      • 100%: A always implies C
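
The three metrics above can be checked by hand on the example transaction table. The following is a minimal sketch, assuming the rule {class = gas} → amount < 100 from the rules slide; the function names and the row encoding are illustrative choices, not BigML's API.

```python
# Rows of the example transaction table (only the columns used below).
ROWS = [
    {"customer": "Bob",   "class": "clothes", "amount": 135},
    {"customer": "Bob",   "class": "food",    "amount": 401},
    {"customer": "Alice", "class": "food",    "amount": 234},
    {"customer": "Sally", "class": "gas",     "amount": 94},
    {"customer": "Bob",   "class": "tech",    "amount": 2459},
    {"customer": "Bob",   "class": "gas",     "amount": 83},
    {"customer": "Sally", "class": "food",    "amount": 51},
]


def antecedent(row):
    """A: the purchase class is gas."""
    return row["class"] == "gas"


def consequent(row):
    """C: the amount is below 100."""
    return row["amount"] < 100


def coverage(rows, a):
    """Fraction of instances that match the antecedent."""
    return sum(1 for r in rows if a(r)) / len(rows)


def support(rows, a, c):
    """Fraction of instances that match both antecedent and consequent."""
    return sum(1 for r in rows if a(r) and c(r)) / len(rows)


def confidence(rows, a, c):
    """Support divided by coverage (assumes the antecedent matches at least one row)."""
    return support(rows, a, c) / coverage(rows, a)


cov = coverage(ROWS, antecedent)
sup = support(ROWS, antecedent, consequent)
conf = confidence(ROWS, antecedent, consequent)
print(f"coverage={cov:.3f} support={sup:.3f} confidence={conf:.3f}")
```

Two of the seven rows are gas purchases and both are under 100, so coverage and support are both 2/7 and confidence is 100%: in this small table, A always implies C.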
  12. Association Discovery: Association Metrics
      Lift: Ratio of observed support to support if A and C were statistically independent.
      Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
  13. Association Discovery: Association Metrics
      • Lift < 1: Negative Correlation (observed overlap of A and C is smaller than if independent)
      • Lift = 1: No Association (observed overlap equals the independent case)
      • Lift > 1: Positive Correlation (observed overlap is larger than if independent)
  14. Association Discovery: Association Metrics
      Leverage: Difference of observed support and support if A and C were statistically independent.
      Leverage = Support - [ p(A) * p(C) ]
  15. Association Discovery: Association Metrics
      • Leverage < 0: Negative Correlation
      • Leverage = 0: No Association
      • Leverage > 0: Positive Correlation
      -1…
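
Lift and leverage can be worked out on the same example table. This is a small sketch under the same assumption as before (rule {class = gas} → amount < 100); it uses exact fractions so the arithmetic is easy to follow, and the variable names are illustrative.

```python
from fractions import Fraction

# (class, amount) pairs from the example transaction table.
ROWS = [
    ("clothes", 135), ("food", 401), ("food", 234), ("gas", 94),
    ("tech", 2459), ("gas", 83), ("food", 51),
]


def is_a(row):
    """Antecedent A: class = gas."""
    return row[0] == "gas"


def is_c(row):
    """Consequent C: amount < 100."""
    return row[1] < 100


n = len(ROWS)
p_a = Fraction(sum(1 for r in ROWS if is_a(r)), n)        # coverage, p(A)
p_c = Fraction(sum(1 for r in ROWS if is_c(r)), n)        # p(C)
support = Fraction(sum(1 for r in ROWS if is_a(r) and is_c(r)), n)
confidence = support / p_a

# Lift: observed support relative to the support expected under independence.
lift = support / (p_a * p_c)
# Leverage: observed support minus the support expected under independence.
leverage = support - p_a * p_c

print(f"lift={lift} leverage={leverage}")
```

Here p(A) = 2/7, p(C) = 3/7 and support = 2/7, which gives a lift of 7/3 (greater than 1) and a leverage of 8/49 (greater than 0), both indicating a positive correlation.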
  16. Association Discovery: Use Cases
      GOAL: Discover “interesting” rules about what store items are typically purchased together.
      • Dataset of 9,834 grocery cart transactions
      • Each row is a list of all items in a cart at checkout
  17. Association Discovery: Demo #1
  18. Association Discovery: Use Cases
      GOAL: Find general rules that indicate diabetes.
      • Dataset of diagnostic measurements of 768 patients.
      • Each patient labelled True/False for diabetes.
  19. Association Discovery: Demo #2
  20. Association Discovery: Medical Risks
      Decision Tree: If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE
      Association Rule: If plasma glucose > 146 then diabetes = TRUE
  21. Poul Petersen, CIO, BigML, Inc. Topic Modeling: Discovering Meaning in Text
  22. Topic Modeling: Unsupervised Learning
      • Learn from instances
      • Each instance has features
      • There is no label
      Clustering: Find similar instances
      Anomaly Detection: Find unusual instances
      Association Discovery: Find feature rules
  23. Topic Modeling: Topic Model
      • Unsupervised algorithm
      • Learns only from text fields
      • Finds hidden topics that model the text
      Questions:
      • How is this different from the Text Analysis that BigML already offers?
      • What does it output and how do we use it?
      • Unsupervised… model?
  24. Topic Modeling: Text Analysis
      Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
      Bag of Words: “great” appears 4 times
  25. Topic Modeling: Text Analysis
      Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
      token   great  afraid  born  achieve  …
      count   4      1       1     1        …
      Model:
      • The token “great” occurs more than 3 times
      • The token “afraid” occurs no more than once
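
The token counts above can be reproduced with a few lines of code. This is a minimal sketch, assuming a deliberately crude stemmer that only strips a trailing "ness" (so that "greatness" counts as "great"); real text analysis uses a proper stemmer, and the function names here are made up for illustration.

```python
import re
from collections import Counter

TEXT = (
    "Be not afraid of greatness: some are born great, some achieve "
    "greatness, and some have greatness thrust upon 'em."
)


def stem(token):
    """Crude stand-in for a stemmer: strip a trailing 'ness' only."""
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def bag_of_words(text):
    """Lowercase, split into alphabetic tokens, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens)


counts = bag_of_words(TEXT)
print(counts["great"], counts["afraid"], counts["born"], counts["achieve"])

# The kind of split a model could learn on these counts:
great_rule = counts["great"] > 3     # "great" occurs more than 3 times
afraid_rule = counts["afraid"] <= 1  # "afraid" occurs no more than once
```

Each count is an independent feature: the model sees that "great" occurs four times but nothing about which other words it occurs with.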
  26. Topic Model Demo #1
  27. Topic Modeling: TA vs TM
      Text Analysis:
      • Creates thousands of hidden token counts
      • Token counts are independently uninteresting
      • No semantic importance
      • No measure of co-occurrence
      Topic Model:
      • Creates tens of topics that model the text
      • Topics are independently interesting
      • Semantic meaning extracted
      • Support for bigrams
  28. Topic Modeling: Generative Modeling
      • Decision trees are discriminative models
          • Aggressively model the classification boundary
          • Parsimonious: Don’t consider anything you don’t have to
      • Topic Models are generative models
          • Come up with a theory of how the data is generated
          • Tweak the theory to fit your data
  29. Topic Modeling: Generating Documents
      • "Machine" that generates a random word with equal probability with each pull.
      • Pull a random number of times to generate a document.
      • All documents can be generated, but most are nonsense.
      Machine vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
      Generated documents:
      • shoe asteroid flashlight pizza…
      • plate giraffe purple jump…
      • Be not afraid of greatness: some are born great, some achieve greatness…
      word        probability
      shoe        ϵ
      asteroid    ϵ
      flashlight  ϵ
      pizza       ϵ
      …           ϵ
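
The "machine" on this slide is easy to simulate. The sketch below assumes a tiny vocabulary copied from the slide and an arbitrary cap of 12 words per document; both are illustrative choices.

```python
import random

# A small vocabulary taken from the words shown on the slide.
VOCABULARY = [
    "cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
    "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight",
]


def generate_document(rng, vocabulary, max_words=12):
    """Pull the machine a random number of times; each pull picks any word with equal probability."""
    length = rng.randint(1, max_words)
    return [rng.choice(vocabulary) for _ in range(length)]


rng = random.Random(0)  # fixed seed so runs are repeatable
doc = generate_document(rng, VOCABULARY)
print(" ".join(doc))
```

Every word sequence has a nonzero chance of appearing, so a meaningful sentence is possible in principle, but almost every output reads as nonsense.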
  30. Topic Modeling: Topic Model
      Intuition:
      • Written documents have meaning; one way to describe meaning is to assign a topic.
      • For our random machine, the topic can be thought of as increasing the probability of certain words.
      Machine vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
      Topic: travel (generates: airplane passport pizza …)
      word      probability
      travel    23.55%
      airplane  2.33%
      mars      0.003%
      mantle    ϵ
      …         ϵ
      Topic: space (generates: mars quasar lightyear soda)
      word      probability
      space     38.94%
      airplane  ϵ
      mars      13.43%
      mantle    0.05%
      …         ϵ
  31. Topic Modeling: Topic Model
      • Each text field in a row is concatenated into a document
      • The documents are analyzed to generate "k" related topics that can model the documents
      • Each topic is represented by a distribution of term probabilities
      Example documents: "plate giraffe purple jump…", "airplane passport pizza …"
      Topic "1": travel 23.55%, airplane 2.33%, mars 0.003%, mantle ϵ, … ϵ
      Topic "2": space 38.94%, airplane ϵ, mars 13.43%, mantle 0.05%, … ϵ
      …
      Topic "k": shoe 12.12%, coffee 3.39%, telephone 13.43%, paper 4.11%, … ϵ
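
A topic turns the uniform machine into a weighted one. The following sketch samples words from the "travel" topic, assuming only the four words listed on the slide and a made-up stand-in value for ϵ; the listed percentages do not sum to 100%, so they are used as relative weights.

```python
import random
from collections import Counter

EPSILON = 1e-6  # hypothetical stand-in for the slide's "ϵ" (a very small probability)

# Word weights for the "travel" topic, in percent, as listed on the slide.
TRAVEL_TOPIC = {
    "travel": 23.55,
    "airplane": 2.33,
    "mars": 0.003,
    "mantle": EPSILON,
}


def generate_from_topic(rng, topic, n_words):
    """Draw words with probability proportional to their topic weight."""
    words = list(topic)
    weights = [topic[w] for w in words]
    return rng.choices(words, weights=weights, k=n_words)


rng = random.Random(0)  # fixed seed so runs are repeatable
sample = generate_from_topic(rng, TRAVEL_TOPIC, 1000)
counts = Counter(sample)
print(counts.most_common())
```

With these weights nearly every draw is "travel" or "airplane", while "mars" and "mantle" almost never appear, which is what it means for a topic to raise the probability of certain words.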
  32. Topic Model Demo #2
  33. Topic Modeling: Use Cases
      • As a preprocessor for other techniques
      • Bootstrapping categories for classification
      • Recommendation
      • Discovery in large, heterogeneous text datasets
  34. Topic Modeling: Topic Distribution
      Intuition:
      • Any given document is likely a mixture of the modeled topics…
      • This can be represented as a distribution of topic probabilities
      Document: Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?
      Topic: travel (travel 23.55%, airplane 2.33%, mars 0.003%, mantle ϵ, … ϵ): 11%
      Topic: space (space 38.94%, airplane ϵ, mars 13.43%, mantle 0.05%, … ϵ): 89%
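
Under a topic distribution, the probability of a word in the document is the topic-weighted average of its per-topic probabilities. The sketch below works this out for the 11% travel / 89% space mixture, assuming ϵ entries and unlisted words can be treated as 0; it illustrates the arithmetic only, not how a topic model infers the mixture.

```python
# Per-topic word probabilities from the slide, as fractions (23.55% -> 0.2355).
# Entries shown as "ϵ" on the slide are omitted and treated as 0 (an assumption).
TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

# Topic distribution of the example document.
DOC_TOPICS = {"travel": 0.11, "space": 0.89}


def word_probability(word, doc_topics, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_topics.items()
    )


p_mars = word_probability("mars", DOC_TOPICS, TOPICS)
p_travel = word_probability("travel", DOC_TOPICS, TOPICS)
print(f"p(mars)={p_mars:.7f} p(travel)={p_travel:.6f}")
```

For "mars" this is 0.11 × 0.00003 + 0.89 × 0.1343 ≈ 0.1195, almost all of it contributed by the space topic.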
  35. Topic Model Demo #3
  36. Topic Modeling: Batch Topic Distribution
      • Clustering: Unlabelled Data → Batch Centroid → Centroid Label
      • Topic Model (Text Fields): Unlabelled Data → Batch Topic Distribution → topic 1 prob, topic 3 prob, …, topic k prob
  37. Topic Model Demo #4
  38. Topic Modeling: Tips
      • Setting k
          • Much like k-means, the best value is data specific
          • Too few will agglomerate unrelated topics, too many will partition highly related topics
          • I tend to find the latter more annoying than the former
      • Tuning the Model
          • Remove common, useless terms
          • Set term limit higher, use bigrams
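
The two tuning tips can be illustrated with a small preprocessing sketch. It assumes a tiny hand-picked stop-word list and a hypothetical `extract_terms` helper; it is not BigML's tokenizer, only an illustration of removing common terms and adding bigrams.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "of", "and", "to", "be", "will", "that"}


def extract_terms(text, use_bigrams=True):
    """Return unigrams without stop words, plus bigrams of adjacent non-stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    unigrams = [t for t in tokens if t not in STOP_WORDS]
    if not use_bigrams:
        return unigrams
    bigrams = [
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    return unigrams + bigrams


terms = extract_terms(
    "Will 2020 be the year that humans will embrace space exploration "
    "and finally travel to Mars?"
)
print(terms)
```

Bigrams are built from words that were adjacent in the original text, so "space exploration" becomes a term while "travel mars" (separated by "to") does not.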
