Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

BigML Education - Association Discovery

166 views

Published on

Learn how to find statistically significant rules in your data using BigML's Association Discovery.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

BigML Education - Association Discovery

  1. 1. BigML Education Association Discovery July 2017
  2. 2. BigML Education Program 2Association Discovery In This Video • Introduction • Typical workflow: 1-click creation / Purpose • Find correlations in your data • Unsupervised technique • Basic Features • Support / Coverage / Confidence / Lift / Leverage • Number of associations / antecedent terms • Search Strategy • Complementary and Missing items • Advanced Features • Minimum measures • Consequent search • Discretization
  3. 3. BigML Education Program 3Association Discovery Association Discovery Introduction
  4. 4. BigML Education Program 4Association Discovery What is Association Discovery? • An unsupervised learning technique • No labels necessary • Useful for data discovery • Finds "significant" correlations • Shopping cart: Coffee and sugar • Medical: High plasma glucose and diabetes • Expresses them as "if then rules" • If "antecedent" then "consequent" • Significance measures
  5. 5. BigML Education Program 5Association Discovery Association Rules date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140 amount < 100 Rules: Antecedent Consequent {customer = Bob, account = 3421} {class = gas}
  6. 6. BigML Education Program 6Association Discovery Association Discovery Basic Features
  7. 7. BigML Education Program 7Association Discovery Association Metrics Coverage Percentage of instances which match antecedent “A” Instances A C
  8. 8. BigML Education Program 8Association Discovery Association Metrics Support Percentage of instances which match antecedent “A” and Consequent “C” Instances A C
  9. 9. BigML Education Program 9Association Discovery Association Metrics Confidence Percentage of instances in the antecedent which also contain the consequent. Coverage Support Instances A C
  10. 10. BigML Education Program 10Association Discovery Association Metrics C Instances A C A Instances C Instances A Instances A C 0% 100% Instances A C Confidence A never implies C A sometimes implies C A always implies C
  11. 11. BigML Education Program 11Association Discovery Association Metrics Lift Ratio of observed support to support if A and C were statistically independent. Support == Confidence p(A) * p(C) p(C) Independent A C C Observed A
  12. 12. BigML Education Program 12Association Discovery Association Metrics C Observed A Observed A C < 1 > 1 Independent A C Lift = 1 Negative Correlation No Correlation Positive Correlation Independent A C Independent A C Observed A C
  13. 13. BigML Education Program 13Association Discovery Association Metrics Lift Ratio of observed support to support if A and C were statistically independent. Support == Confidence p(A) * p(C) p(C) Independent A C C Observed A Problem: if p(C) is "small" then… lift may be large.
  14. 14. BigML Education Program 14Association Discovery Association Metrics Leverage Difference of observed support and support if A and C were statistically independent. Support - [ p(A) * p(C) ] Independent A C C Observed A
  15. 15. BigML Education Program 15Association Discovery Association Metrics C Observed A Observed A C < 0 > 0 Independent A C Leverage = 0 Negative Correlation No Correlation Positive Correlation Independent A C Independent A C Observed A C -1…
  16. 16. BigML Education Program 16Association Discovery AD Configuration 1. Max Number of Associations: 1 to 500 (default 100) 2. Max Items in Antecedent: 1 to 10 (default 4) 3. Search Strategy: Support / Coverage / Confidence / Lift / Leverage 4. Complement Items: True / False • False: Coffee and… • True: Not Coffee and… 5. Missing Items: True / False • False: Loan Description contains "Ferrari" and… • True: Loan Description is missing and…
  17. 17. BigML Education Program 17Association Discovery Data Types numeric 1 2 3 1, 2.0, 3, -5.4 categoricaltrue, yes, red, mammal categoricalcategorical A B C date-time2013-09-25 10:02 DATE-TIME YEAR MONTH DAY-OF-MONTH YYYY-MM-DD DAY-OF-WEEK HOUR MINUTE YYYY-MM-DD YYYY-MM-DD M-T-W-T-F-S-D HH:MM:SS HH:MM:SS 2013 September 25 Wednesday 10 02 text Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. text “great” “afraid” “born” “some” appears 2 times appears 1 time appears 1 time appears 2 times
  18. 18. BigML Education Program 18Association Discovery Items Type itemscoffee, sugar, milk, honey, dish soap, bread items • Canonical example: shopping cart contents • Single feature describing a list of items • Each item separated by a comma (default)
  19. 19. BigML Education Program 19Association Discovery Association Discovery Advanced Features
  20. 20. BigML Education Program 20Association Discovery AD Configuration 1. Measures: Set a minimum criteria for AD measures 2. Minimum Significance: lower values reduce spurious rules 3. Consequent: Restrict rules to a specific consequent criteria 4. Discretization: How numeric values are handled • Pretty: rounds off discretized values: 20 instead of 20.234 • Size: the number of ranges (default 5) • Type: equal population / width • Trim: Removes percentage of values from the tails
  21. 21. BigML Education Program 21Association Discovery Summary • Association Discover Purpose • Unsupervised technique for discovering interesting associations • Outputs antecedent/consequent rules • Metrics: Support / Coverage / Confidence / Lift / Leverage • Items type: • Configuration: • Search strategy / Minimum Measures • Complementary rules / Missing • Consequent Filtering • Discretization • Additional Uses • Understanding clusters and anomaly detectors

×