
VSSML16 L4. Association Discovery and Latent Dirichlet Allocation

VSSML16 L4. Association Discovery and Latent Dirichlet Allocation
Valencian Summer School in Machine Learning 2016
Day 1 VSSML16
Lecture 4
Association Discovery and Latent Dirichlet Allocation
Geoff Webb (Monash University) & Charles Parker (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016

VSSML16 L4. Association Discovery and Latent Dirichlet Allocation

  1. 1. September 8-9, 2016
  2. 2. Association Discovery. Geoff Webb, Professor of Information Technology Research, Monash University, Melbourne, Australia. Finding interesting correlations.
  3. 3. Association Discovery • Algorithm: “Magnum Opus” from Geoff Webb • Unsupervised Learning: works with unlabelled data, like clustering and anomaly detection • Learning Task: find “interesting” relations between variables
  4. 4. Unsupervised Learning. Example transaction data:
     date | customer | account | auth | class   | zip   | amount
     Mon  | Bob      | 3421    | pin  | clothes | 46140 | 135
     Tue  | Bob      | 3421    | sign | food    | 46140 | 401
     Tue  | Alice    | 2456    | pin  | food    | 12222 | 234
     Wed  | Sally    | 6788    | pin  | gas     | 26339 | 94
     Wed  | Bob      | 3421    | pin  | tech    | 21350 | 2459
     Wed  | Bob      | 3421    | pin  | gas     | 46140 | 83
     Thu  | Sally    | 6788    | sign | food    | 26339 | 51
     Clustering finds similar rows; anomaly detection finds unusual ones.
  5. 5. Unsupervised Learning: Association Rules. Example rules over the same transaction data:
     {class = gas} → amount < 100
     {customer = Bob, account = 3421} → zip = 46140
     Rules: Antecedent → Consequent
  6. 6. Use Cases • Market Basket Analysis • Web usage patterns • Intrusion detection • Fraud detection • Bioinformatics • Medical risk factors
  7. 7. Magnum Opus: What's wrong with frequent pattern mining?
  8. 8. Magnum Opus: What's wrong with frequent pattern mining? • Feast or famine: often results in too few or too many patterns • The vodka and caviar problem: some high-value patterns are infrequent • Cannot handle dense data • Minimum support may not be relevant: it cannot be low enough to capture all valid rules, nor high enough to exclude all spurious rules
  9. 9. Magnum Opus: Very infrequent patterns can be significant. Data file: Brijs retail.itl, 88162 cases / 16470 items.
     237 → 1          [Coverage=3032; Support=28; Lift=3.06; p=1.99E-007]
     237 & 4685 → 1   [Coverage=19; Support=9; Lift=157.00; p=5.03E-012]
     1159 → 1         [Coverage=197; Support=9; Lift=15.14; p=1.13E-008]
     4685 → 1         [Coverage=270; Support=9; Lift=11.05; p=1.68E-007]
     168 → 1          [Coverage=293; Support=9; Lift=10.18; p=3.33E-007]
     4382 → 1         [Coverage=72; Support=8; Lift=36.83; p=6.26E-011]
     168 & 4685 → 1   [Coverage=9; Support=7; Lift=257.78; p=6.66E-011]
  10. 10. Magnum Opus: Very high support patterns can be spurious. Data file: covtype.data, 581012 cases / 125 values.
     ST15=0 → ST07=0   [Coverage=581009; Support=580904; Confidence=1.000]
     ST07=0 → ST15=0   [Coverage=580907; Support=580904; Confidence=1.000]
     ST15=0 → ST36=0   [Coverage=581009; Support=580890; Confidence=1.000]
     ST36=0 → ST15=0   [Coverage=580893; Support=580890; Confidence=1.000]
     ST15=0 → ST08=0   [Coverage=581009; Support=580830; Confidence=1.000]
     ST08=0 → ST15=0   [Coverage=580833; Support=580830; Confidence=1.000]
     … 197,183,686 such rules have highest support
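Plugging the counts from the first rule above into the metric definitions shows why perfect confidence is not evidence of a real association here: the consequent ST07=0 covers 580,907 of the 581,012 cases (its coverage as an antecedent in the second rule), so under independence the expected support is almost exactly the observed support. A minimal Python check, using only numbers taken from the slide:

```python
# Back-of-the-envelope check on the rule ST15=0 -> ST07=0 from covtype.data.
n = 581_012        # total cases
cov_a = 581_009    # cases with ST15=0 (antecedent coverage)
cov_c = 580_907    # cases with ST07=0 (consequent coverage)
support = 580_904  # cases matching both

p_a, p_c = cov_a / n, cov_c / n
observed = support / n
expected = p_a * p_c                 # support if A and C were independent

confidence = support / cov_a         # ~1.000, as reported on the slide
lift = observed / expected           # ~1.000000 -> no real association
leverage = observed - expected       # ~0       -> the rule is spurious
print(confidence, lift, leverage)
```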
  11. 11. Magnum Opus • User selects a measure of interest • System finds the top-k associations on that measure, within constraints • There must be a statistically significant interaction between antecedent and consequent • Every item in the antecedent must increase the strength of the association
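The slides do not say which significance test Magnum Opus applies, but the "statistically significant interaction" requirement can be sketched with a one-sided Fisher exact test on the rule's 2x2 contingency table. This is an illustrative stand-in, not BigML's implementation; the counts reused below are the covtype numbers from the previous slide, for which the test (correctly) finds no evidence of a positive association.

```python
# Illustrative significance filter for a candidate rule A -> C (not the
# actual Magnum Opus test): one-sided Fisher exact test, computed exactly
# from the hypergeometric tail.
from math import comb

def rule_p_value(n, cov_a, cov_c, support):
    """P(overlap of A and C >= support) if A and C were independent,
    given marginal counts cov_a and cov_c (n = total instances)."""
    hi = min(cov_a, cov_c)
    total = comb(n, cov_a)
    tail = sum(comb(cov_c, k) * comb(n - cov_c, cov_a - k)
               for k in range(support, hi + 1))
    return tail / total

# High-support covtype rule ST15=0 -> ST07=0: the p-value is 1.0, so it
# would be rejected despite its confidence of 1.000.
print(rule_p_value(n=581_012, cov_a=581_009, cov_c=580_907, support=580_904))
```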
  12. 12. Association Metrics: Coverage. Percentage of instances which match the antecedent “A”.
  13. 13. Association Metrics: Support. Percentage of instances which match both the antecedent “A” and the consequent “C”.
  14. 14. Association Metrics: Confidence. Percentage of instances matching the antecedent which also contain the consequent: Confidence = Support / Coverage.
  15. 15. Association Metrics: Confidence ranges from 0% (A never implies C) to 100% (A always implies C), with values in between when A sometimes implies C.
  16. 16. Association Metrics: Lift. Ratio of observed support to the support expected if A and C were statistically independent: Lift = Support / (p(A) × p(C)) = Confidence / p(C).
  17. 17. Association Metrics: Lift < 1: negative correlation; Lift = 1: no association (A and C independent); Lift > 1: positive correlation.
  18. 18. Association Metrics: Leverage. Difference between observed support and the support expected if A and C were statistically independent: Leverage = Support − p(A) × p(C).
  19. 19. Association Metrics: Leverage < 0: negative correlation; Leverage = 0: no association (A and C independent); Leverage > 0: positive correlation.
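To make these definitions concrete, here is a minimal Python sketch that computes all of the metrics for the rule {class = gas} → amount < 100 from slide 5, using the seven transactions from slide 4 (the dictionary encoding of the rows is just for illustration):

```python
# Metrics for the rule {class = gas} -> {amount < 100} on the 7-row
# transaction table from slide 4.
rows = [
    {"class": "clothes", "amount": 135},
    {"class": "food",    "amount": 401},
    {"class": "food",    "amount": 234},
    {"class": "gas",     "amount": 94},
    {"class": "tech",    "amount": 2459},
    {"class": "gas",     "amount": 83},
    {"class": "food",    "amount": 51},
]
n = len(rows)

antecedent = lambda r: r["class"] == "gas"
consequent = lambda r: r["amount"] < 100

cov = sum(antecedent(r) for r in rows) / n                    # p(A)  = 2/7
p_c = sum(consequent(r) for r in rows) / n                    # p(C)  = 3/7
sup = sum(antecedent(r) and consequent(r) for r in rows) / n  # p(A,C) = 2/7

confidence = sup / cov        # 1.00  (every gas purchase is under 100)
lift = sup / (cov * p_c)      # ~2.33 (> 1 -> positive correlation)
leverage = sup - cov * p_c    # ~0.16 (> 0 -> positive correlation)
print(cov, sup, confidence, lift, leverage)
```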
  20. 20. Use Cases. GOAL: Discover “interesting” rules about what store items are typically purchased together. • Dataset of 9,834 grocery cart transactions • Each row is a list of all items in a cart at checkout
  21. 21. Association Discovery Demo #1
  22. 22. Use Cases. GOAL: Find general rules that indicate diabetes. • Dataset of diagnostic measurements of 768 patients • Each patient labelled True/False for diabetes
  23. 23. Association Discovery Demo #2
  24. 24. Medical Risks. Decision Tree: If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44, then diabetes = TRUE. Association Rule: If plasma glucose > 146, then diabetes = TRUE.
  25. 25. Latent Dirichlet Allocation, #VSSML16, September 2016
  26. 26. Outline: 1 Understanding the Limits of Simple Text Analysis; 2 Aside: Generative Processes; 3 Latent Dirichlet Allocation; 4 A Couple of Instructive Examples; 5 Applications
  27. 27. Outline: 1 Understanding the Limits of Simple Text Analysis; 2 Aside: Generative Processes; 3 Latent Dirichlet Allocation; 4 A Couple of Instructive Examples; 5 Applications
  28. 28. Bag of Words Analysis • Easiest way of analyzing a text field is just to treat it as a “bag of words” • Each word is a separate feature (usually an occurrence count) • When modeling, the features are treated in isolation from one another, essentially “one at a time”
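A minimal sketch of the bag-of-words representation in Python (the two example sentences are invented for illustration, and the tokenization is deliberately naive):

```python
# Minimal bag-of-words sketch: each document becomes a vector of word counts,
# and every word is treated as an independent feature.
from collections import Counter

docs = [
    "the ceo praised the management team",      # invented example documents
    "the reaction produced a lead compound",
]

# naive whitespace tokenization; real systems also lowercase, strip
# punctuation, remove stop words, etc.
counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(word for c in counts for word in c))

# one row per document, one column per word in the vocabulary
vectors = [[c[word] for word in vocabulary] for c in counts]
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```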
  29. 29. Limitations • Words are sometimes ambiguous • Both because of multiple definitions and differences in tone • How do we usually disambiguate words? Context
  30. 30. An Instructive Example • One way of looking at the usefulness of a machine learning feature is to think about how well it isolates unique and coherent subsets of the data • Suppose I have a collection of documents where some of them are about two different topics (via Ted Underwood’s blog): - Leadership (CEOs, organization, management) - Chemistry (elements, compounds, reactions) • If I do a keyword search for “lead” (or try to classify documents based on that word alone), I’ll get documents from either category and documents that are a mix of both • Can we build a feature that better isolates which set of documents we’re looking for?
  31. 31. Outline: 1 Understanding the Limits of Simple Text Analysis; 2 Aside: Generative Processes; 3 Latent Dirichlet Allocation; 4 A Couple of Instructive Examples; 5 Applications
  32. 32. Generative Modeling • Posit a parameterized structure that is responsible for generating the data • Use the data to fit the parameters • A notion of causality is important for these models
  33. 33. Example of a Generative Model • Consider a patient with some disease • Class: disease present / absent; Features: test results • Arrows indicate cause in this diagram; the symptoms (features) are caused by the disease • This generative process implies a structure, in this case the so-called “Naive Bayes” model
  34. 34. Generative vs. Discriminative • This is an important distinction in machine learning generally • Generative models try to model / assume a structure for the process generating the data • More mathematically, generative classifiers explicitly model the joint distribution p(x, y) of the data • Discriminative models don’t care; they “solve the prediction problem directly” and model only the conditional p(y|x) (Vapnik)
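In symbols (a standard formulation added for clarity, not taken from the slide): a generative classifier models the joint distribution and classifies through Bayes' rule, while a discriminative classifier models the conditional directly.

```latex
\text{Generative: } p(x, y) = p(y)\,p(x \mid y), \qquad
\hat{y} = \arg\max_y p(y \mid x) = \arg\max_y p(y)\,p(x \mid y)

\text{Discriminative: model } p(y \mid x) \text{ directly, } \quad
\hat{y} = \arg\max_y p(y \mid x)
```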
  35. 35. Which is Better? • No general answer to this question (not that we haven’t tried); see the paper “On Discriminative vs. Generative Classifiers” [1] • Discriminative models tend to be faster to fit, quicker to predict, and in the case of non-parametrics are often guaranteed to converge to the correct answer given enough data • Generative models tend to be more probabilistically sound and able to do more than just classify. [1] http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
  36. 36. Outline: 1 Understanding the Limits of Simple Text Analysis; 2 Aside: Generative Processes; 3 Latent Dirichlet Allocation; 4 A Couple of Instructive Examples; 5 Applications
  37. 37. A New Way of Thinking About Documents • Three entities: Documents, Terms, and Topics • A term is a single lexical token (usually one or more words, but can be any arbitrary string) • A document has many terms • A topic is a distribution over terms
  38. 38. A Generative Model for Documents • A document can be thought of as a distribution over topics, drawn from a distribution over possible distributions • To create a document, repeatedly draw a topic at random from that distribution, then draw a term from that topic (which, remember, is a distribution over terms) • The main thing we want to infer is the topic distribution
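A minimal sketch of that generative story in Python (the vocabulary, topic count, and Dirichlet parameters below are illustrative assumptions, not values from any particular implementation):

```python
# Sketch of the LDA generative process: a document is a distribution over
# topics (drawn from a Dirichlet), each topic is a distribution over terms,
# and every word is generated by picking a topic, then a term from that topic.
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["lead", "ceo", "team", "acid", "compound", "reaction"]
n_topics, alpha, doc_length = 2, 0.5, 10

# each topic is a distribution over the vocabulary
topics = rng.dirichlet(np.ones(len(vocabulary)), size=n_topics)

# each document is a distribution over topics
doc_topic_dist = rng.dirichlet(alpha * np.ones(n_topics))

words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=doc_topic_dist)   # draw a topic
    w = rng.choice(vocabulary, p=topics[z])      # draw a term from that topic
    words.append(w)

print(doc_topic_dist, words)
```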
  39. 39. Dirichlet Process Intuition: Rich Get Richer • We use a Dirichlet process to model the relationship between documents, topics, and terms • We’re more likely to think a word came from a topic if we’ve already seen a bunch of words from that topic • We’re more likely to think the topic was responsible for generating the document if we’ve already seen a bunch of words in the document from that topic • Here lies the disambiguation: if a word could have come from two different topics, we use the rest of the words in the document to decide which meaning it has • Note that there’s a little bit of self-fulfilling prophecy going on here (by design)
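The slide describes the intuition rather than a specific inference algorithm; in the commonly used collapsed Gibbs sampler for LDA, that rich-get-richer behaviour is visible directly in the update for the topic assignment z_i of word w_i in document d (a standard formula, with α and β the Dirichlet priors, V the vocabulary size, and n the current counts excluding word i):

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\underbrace{\left(n_{d,k}^{-i} + \alpha\right)}_{\text{words in doc } d \text{ already assigned to topic } k}
\cdot
\underbrace{\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta}}_{\text{how often topic } k \text{ already generates } w_i}
```

Both factors grow with the counts already assigned to topic k, which is exactly the self-reinforcing behaviour described above.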
  40. 40. Outline: 1 Understanding the Limits of Simple Text Analysis; 2 Aside: Generative Processes; 3 Latent Dirichlet Allocation; 4 A Couple of Instructive Examples; 5 Applications
  41. 41. Usenet Movie Reviews. Library of over 26,000 movie reviews. Example: A solid noir melodrama from Vincent Sherman, who takes a standard story and dresses it up with moving characterizations and beautifully expressionistic B&W photography from cinematographer James Wong Howe. The director took a songwriter Paul Webster's short magazine story called "The Man Who Died Twice" and improved the story by rounding out the characters to give them both strong and weak points, so that they would not be one-note characters as was the case in the original story. The film was made by Warner Brothers, who needed a film for their contract star Ann Sheridan and asked Sherman to change the story around so that her part as Nora Prentiss, a nightclub singer, is expanded
  42. 42. Supreme Court Cases. Library of about 7500 Supreme Court cases. Example: NO. 136. ARGUED DECEMBER 6, 1966. - DECIDED JANUARY 9, 1967. - 258 F. SUPP. 819, REVERSED. FOLLOWING THIS COURT'S DECISIONS IN SWANN V. ADAMS, INVALIDATING THE APPORTIONMENT OF THE FLORIDA LEGISLATURE (378 U.S. 553) AND THE SUBSEQUENT REAPPORTIONMENT WHICH THE DISTRICT COURT HAD FOUND UNCONSTITUTIONAL BUT APPROVED ON AN INTERIM BASIS (383 U.S. 210), THE FLORIDA LEGISLATURE ADOPTED STILL ANOTHER LEGISLATIVE REAPPORTIONMENT PLAN, WHICH APPELLANTS, RESIDENTS AND VOTERS OF DADE COUNTY, FLORIDA, ATTACKED AS FAILING TO MEET THE STANDARDS OF VOTER EQUALITY SET FORTH
  43. 43. Outline: 1 Understanding the Limits of Simple Text Analysis; 2 Aside: Generative Processes; 3 Latent Dirichlet Allocation; 4 A Couple of Instructive Examples; 5 Applications
  44. 44. Visualizing Changes in Topic Over Time • Plot changes in topic distribution over time • Especially nice for dated historical collections (e.g., novels, newspapers)
  45. 45. Search Without Keywords • Keyword search is great, if you know the keywords • Good for finding search terms • Great for, e.g., legal discovery • Nice for finding “outliers” • Surprise topics (from the recycle bin)
  46. 46. Feature Spaces for Classification • Just classify the documents in “topic space” rather than “bag space” • The topics that come out of LDA have some nice benefits as features: - Can reduce a feature space of thousands to a few dozen (faster to fit) - Nicely interpretable - Automatically tailored to the documents you’ve provided • Foreshadowing Alert: When using LDA in this way, we’re doing a form of feature engineering, which we’ll hear more about tomorrow.
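A minimal sketch of classifying in “topic space” (assuming scikit-learn; the tiny corpus, labels, and topic count below are placeholders for illustration, not the deck's demo data):

```python
# Classify documents on LDA topic proportions instead of raw word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented corpus, just to make the sketch runnable; labels mark
# leadership (1) vs. chemistry (0) documents.
docs = [
    "the ceo praised the management team for strong leadership",
    "the board hired a new ceo to lead the organization",
    "the reaction of the acid produced a lead compound",
    "elements and compounds react under controlled conditions",
]
labels = [1, 1, 0, 0]

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),       # bag of words
    LatentDirichletAllocation(n_components=2,    # in practice, a few dozen topics
                              random_state=0),   # documents -> topic proportions
    LogisticRegression(),                        # classify in topic space
)
pipeline.fit(docs, labels)
print(pipeline.predict(["the ceo announced a new leadership strategy"]))
```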
  47. 47. Some Caveats • You need to choose the number of topics beforehand • Takes forever, both to fit and to do inference • Takes a lot of text to make it meaningful • Tends to focus on “meaningless minutiae” • While it sometimes makes a nice classification space, it rarely provides a dramatic improvement over bag-of-words • I find it nice just for exploration
  48. 48. Thus Ends The Lesson. Questions?
