BigML Education
Association Discovery
July 2017
BigML Education Program 2Association Discovery
In This Video
• Introduction
• Typical workflow: 1-click creation / Purpose
• Find correlations in your data
• Unsupervised technique
• Basic Features
• Support / Coverage / Confidence / Lift / Leverage
• Number of associations / antecedent terms
• Search Strategy
• Complementary and Missing items
• Advanced Features
• Minimum measures
• Consequent search
• Discretization
BigML Education Program 3Association Discovery
Association Discovery
Introduction
BigML Education Program 4Association Discovery
What is Association Discovery?
• An unsupervised learning technique
• No labels necessary
• Useful for data discovery
• Finds "significant" correlations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
• Expresses them as "if then rules"
• If "antecedent" then "consequent"
• Significance measures
BigML Education Program 5Association Discovery
Association Rules
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
zip = 46140
amount < 100
Rules:
Antecedent Consequent
{customer = Bob, account = 3421}
{class = gas}
BigML Education Program 6Association Discovery
Association Discovery
Basic Features
BigML Education Program 7Association Discovery
Association Metrics
Coverage
Percentage of instances
which match antecedent “A”
Instances
A
C
BigML Education Program 8Association Discovery
Association Metrics
Support
Percentage of instances
which match antecedent
“A” and Consequent “C”
Instances
A
C
BigML Education Program 9Association Discovery
Association Metrics
Confidence
Percentage of instances in
the antecedent which also
contain the consequent.
Coverage
Support
Instances
A
C
BigML Education Program 10Association Discovery
Association Metrics
C
Instances
A
C
A
Instances
C
Instances
A
Instances
A
C
0% 100%
Instances
A
C
Confidence
A never 

implies C
A sometimes 

implies C
A always 

implies C
BigML Education Program 11Association Discovery
Association Metrics
Lift
Ratio of observed support
to support if A and C were
statistically independent.
Support == Confidence
p(A) * p(C) p(C)
Independent
A
C
C
Observed
A
BigML Education Program 12Association Discovery
Association Metrics
C
Observed
A
Observed
A
C
< 1 > 1
Independent
A
C
Lift = 1
Negative
Correlation
No Correlation
Positive
Correlation
Independent
A
C
Independent
A
C
Observed
A
C
BigML Education Program 13Association Discovery
Association Metrics
Lift
Ratio of observed support
to support if A and C were
statistically independent.
Support == Confidence
p(A) * p(C) p(C)
Independent
A
C
C
Observed
A
Problem:
if p(C) is "small" then…
lift may be large.
BigML Education Program 14Association Discovery
Association Metrics
Leverage
Difference of observed
support and support if A
and C were statistically
independent. 

Support - [ p(A) * p(C) ]
Independent
A
C
C
Observed
A
BigML Education Program 15Association Discovery
Association Metrics
C
Observed
A
Observed
A
C
< 0 > 0
Independent
A
C
Leverage = 0
Negative
Correlation
No Correlation
Positive
Correlation
Independent
A
C
Independent
A
C
Observed
A
C
-1…
BigML Education Program 16Association Discovery
AD Configuration
1. Max Number of Associations: 1 to 500 (default 100)
2. Max Items in Antecedent: 1 to 10 (default 4)
3. Search Strategy: Support / Coverage / Confidence / Lift / Leverage
4. Complement Items: True / False
• False: Coffee and…
• True: Not Coffee and…
5. Missing Items: True / False
• False: Loan Description contains "Ferrari" and…
• True: Loan Description is missing and…
BigML Education Program 17Association Discovery
Data Types
numeric
1 2 3
1, 2.0, 3, -5.4 categoricaltrue, yes, red, mammal categoricalcategorical
A B C
date-time2013-09-25 10:02
DATE-TIME
YEAR
MONTH
DAY-OF-MONTH
YYYY-MM-DD
DAY-OF-WEEK
HOUR
MINUTE
YYYY-MM-DD
YYYY-MM-DD
M-T-W-T-F-S-D
HH:MM:SS
HH:MM:SS
2013
September
25
Wednesday
10
02
text
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
text
“great”
“afraid”
“born”
“some”
appears 2 times
appears 1 time
appears 1 time
appears 2 times
BigML Education Program 18Association Discovery
Items Type
itemscoffee, sugar, milk, honey,
dish soap, bread
items
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
BigML Education Program 19Association Discovery
Association Discovery
Advanced Features
BigML Education Program 20Association Discovery
AD Configuration
1. Measures: Set a minimum criteria for AD measures
2. Minimum Significance: lower values reduce spurious rules
3. Consequent: Restrict rules to a specific consequent criteria
4. Discretization: How numeric values are handled
• Pretty: rounds off discretized values: 20 instead of 20.234
• Size: the number of ranges (default 5)
• Type: equal population / width
• Trim: Removes percentage of values from the tails
BigML Education Program 21Association Discovery
Summary
• Association Discover Purpose
• Unsupervised technique for discovering interesting
associations
• Outputs antecedent/consequent rules
• Metrics: Support / Coverage / Confidence / Lift / Leverage
• Items type:
• Configuration:
• Search strategy / Minimum Measures
• Complementary rules / Missing
• Consequent Filtering
• Discretization
• Additional Uses
• Understanding clusters and anomaly detectors

BigML Education - Association Discovery

  • 1.
  • 2.
    BigML Education Program2Association Discovery In This Video • Introduction • Typical workflow: 1-click creation / Purpose • Find correlations in your data • Unsupervised technique • Basic Features • Support / Coverage / Confidence / Lift / Leverage • Number of associations / antecedent terms • Search Strategy • Complementary and Missing items • Advanced Features • Minimum measures • Consequent search • Discretization
  • 3.
    BigML Education Program3Association Discovery Association Discovery Introduction
  • 4.
    BigML Education Program4Association Discovery What is Association Discovery? • An unsupervised learning technique • No labels necessary • Useful for data discovery • Finds "significant" correlations • Shopping cart: Coffee and sugar • Medical: High plasma glucose and diabetes • Expresses them as "if then rules" • If "antecedent" then "consequent" • Significance measures
  • 5.
    BigML Education Program5Association Discovery Association Rules date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140 amount < 100 Rules: Antecedent Consequent {customer = Bob, account = 3421} {class = gas}
  • 6.
    BigML Education Program6Association Discovery Association Discovery Basic Features
  • 7.
    BigML Education Program7Association Discovery Association Metrics Coverage Percentage of instances which match antecedent “A” Instances A C
  • 8.
    BigML Education Program8Association Discovery Association Metrics Support Percentage of instances which match antecedent “A” and Consequent “C” Instances A C
  • 9.
    BigML Education Program9Association Discovery Association Metrics Confidence Percentage of instances in the antecedent which also contain the consequent. Coverage Support Instances A C
  • 10.
    BigML Education Program10Association Discovery Association Metrics C Instances A C A Instances C Instances A Instances A C 0% 100% Instances A C Confidence A never implies C A sometimes implies C A always implies C
  • 11.
    BigML Education Program11Association Discovery Association Metrics Lift Ratio of observed support to support if A and C were statistically independent. Support == Confidence p(A) * p(C) p(C) Independent A C C Observed A
  • 12.
    BigML Education Program12Association Discovery Association Metrics C Observed A Observed A C < 1 > 1 Independent A C Lift = 1 Negative Correlation No Correlation Positive Correlation Independent A C Independent A C Observed A C
  • 13.
    BigML Education Program13Association Discovery Association Metrics Lift Ratio of observed support to support if A and C were statistically independent. Support == Confidence p(A) * p(C) p(C) Independent A C C Observed A Problem: if p(C) is "small" then… lift may be large.
  • 14.
    BigML Education Program14Association Discovery Association Metrics Leverage Difference of observed support and support if A and C were statistically independent. Support - [ p(A) * p(C) ] Independent A C C Observed A
  • 15.
    BigML Education Program15Association Discovery Association Metrics C Observed A Observed A C < 0 > 0 Independent A C Leverage = 0 Negative Correlation No Correlation Positive Correlation Independent A C Independent A C Observed A C -1…
  • 16.
    BigML Education Program16Association Discovery AD Configuration 1. Max Number of Associations: 1 to 500 (default 100) 2. Max Items in Antecedent: 1 to 10 (default 4) 3. Search Strategy: Support / Coverage / Confidence / Lift / Leverage 4. Complement Items: True / False • False: Coffee and… • True: Not Coffee and… 5. Missing Items: True / False • False: Loan Description contains "Ferrari" and… • True: Loan Description is missing and…
  • 17.
    BigML Education Program17Association Discovery Data Types numeric 1 2 3 1, 2.0, 3, -5.4 categoricaltrue, yes, red, mammal categoricalcategorical A B C date-time2013-09-25 10:02 DATE-TIME YEAR MONTH DAY-OF-MONTH YYYY-MM-DD DAY-OF-WEEK HOUR MINUTE YYYY-MM-DD YYYY-MM-DD M-T-W-T-F-S-D HH:MM:SS HH:MM:SS 2013 September 25 Wednesday 10 02 text Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. text “great” “afraid” “born” “some” appears 2 times appears 1 time appears 1 time appears 2 times
  • 18.
    BigML Education Program18Association Discovery Items Type itemscoffee, sugar, milk, honey, dish soap, bread items • Canonical example: shopping cart contents • Single feature describing a list of items • Each item separated by a comma (default)
  • 19.
    BigML Education Program19Association Discovery Association Discovery Advanced Features
  • 20.
    BigML Education Program20Association Discovery AD Configuration 1. Measures: Set a minimum criteria for AD measures 2. Minimum Significance: lower values reduce spurious rules 3. Consequent: Restrict rules to a specific consequent criteria 4. Discretization: How numeric values are handled • Pretty: rounds off discretized values: 20 instead of 20.234 • Size: the number of ranges (default 5) • Type: equal population / width • Trim: Removes percentage of values from the tails
  • 21.
    BigML Education Program21Association Discovery Summary • Association Discover Purpose • Unsupervised technique for discovering interesting associations • Outputs antecedent/consequent rules • Metrics: Support / Coverage / Confidence / Lift / Leverage • Items type: • Configuration: • Search strategy / Minimum Measures • Complementary rules / Missing • Consequent Filtering • Discretization • Additional Uses • Understanding clusters and anomaly detectors