2. Explain the concept of association rules
Develop an association rules data mining model
Understand the Apriori algorithm
Interpret output generated by the model
Effectively employ the CRISP-DM method in your
assignment
This Week’s Learning Objectives
3. Association rules are a data mining methodology that
seeks to find frequent connections between attributes
in a data set
o An example of this could be a shopping basket analysis
where marketers and vendors try to find which products
are most frequently purchased together
Association Rules
4. Study of “what goes with what”
o “Customers who bought X also bought Y”
o What symptoms go with what diagnosis
Transaction-based or event-based
Also called “market basket analysis” and “affinity analysis”
Originated with the study of customer transaction databases
to determine associations among items purchased
What are Association Rules?
5. So what kinds of items are we talking about? There are many
applications of association rules:
o Product recommendation – like Amazon’s “customers who bought that,
also bought this”
o Medical diagnosis – like with diabetes
o Content optimization – magazine websites or blogs or even menu
design for a restaurant
o DNA genome analysis – patterns in cellular data
o Fraud detection in finance
Association Rules
6. Used in many recommender systems
Screenshot, Amazon.com
7. Key Point:
Association rules use “if/then” statements to
uncover relationships between seemingly
unrelated data.
“If a customer buys a dozen eggs, then he is 80%
more likely to also purchase milk.”
Association Rules
8. “IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the antecedent or
consequent
Antecedent and consequent are disjoint when there are no items in
common
Terms
9. Although association rule operators require
binominal data types, it’s helpful to evaluate
the average (avg) and standard deviation for
each attribute.
Association Rules
11. Standard deviations are measurements of how
dispersed or varied the values in an attribute are
o A good rule of thumb: any value that is more than two
standard deviations below the mean or more than two
standard deviations above the mean is a statistical outlier
For example, if Average = 36.731 and Standard Deviation =
10.647, the acceptable range of values is 15.437 to 58.025
Not a hard-and-fast rule
Association Rules
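A quick Python check of the arithmetic above, using the slide's illustrative mean and standard deviation (independent of RapidMiner):

```python
# Rule-of-thumb outlier range: mean +/- 2 standard deviations.
mean, sd = 36.731, 10.647

low = mean - 2 * sd    # 36.731 - 21.294 = 15.437
high = mean + 2 * sd   # 36.731 + 21.294 = 58.025

print(f"Acceptable range: {low:.3f} to {high:.3f}")  # 15.437 to 58.025
```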
12. RapidMiner uses binominal instead of binomial
o Binomial means one of two numbers (usually 0 and 1),
so the basic underlying data type is still numeric
o Binominal means one of two values which may be numeric
or character based
Association Rules
13. An example of a data type transformation of all attributes in
RapidMiner
Association Rules
14. Frequent pattern (FP) analysis is handy for many
kinds of data mining and is a necessary
component of association rule mining
o We use this to determine whether any of the
patterns in the data occur often enough to be
considered rules
Association Rules
15. Results of an FP-Growth operator in RapidMiner
Association Rules
16. Two main factors that dictate whether or not frequency
patterns get translated into association rules:
Confidence percent - how confident we are that when one
attribute is flagged as true, the associated attribute will also
be flagged as true
Support percent - the number of times that the rule did
occur, divided by the number of observations in the data set
Confidence & Support Percent
17. Out of 10 shopping baskets…
Cookies were purchased in 4 baskets
Milk was purchased in 7 baskets
In 3 of the 4 instances where cookies were purchased, milk
was also in the basket
Confidence % = how often, when one attribute = true, an associated
attribute is also true
Therefore, we have 75% confidence in the association rule cookies → milk
Confidence Percent - Example
18. Out of 10 shopping baskets…
Milk was purchased in 7 baskets
Cookies were purchased in 4 baskets
In 3 of the 7 instances where milk was purchased, cookies
were also in the basket
Confidence % = how often, when one attribute = true, an associated
attribute is also true
Therefore, we have 43% confidence in the association rule milk → cookies
Confidence Percent - Example
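The two worked examples above can be reproduced with a short Python sketch. The basket contents below are hypothetical, chosen only to match the slide's counts (cookies in 4 baskets, milk in 7, both together in 3):

```python
# Hypothetical 10-basket data matching the slide's counts.
baskets = (
    [{"cookies", "milk"}] * 3      # both together in 3 baskets
    + [{"cookies"}]                # cookies alone in 1 (4 total with cookies)
    + [{"milk"}] * 4               # milk alone in 4 (7 total with milk)
    + [{"bread"}, {"eggs"}]        # filler baskets (10 overall)
)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    with_ante = [t for t in transactions if antecedent in t]
    return sum(1 for t in with_ante if consequent in t) / len(with_ante)

print(confidence("cookies", "milk", baskets))  # 3/4 = 0.75
print(confidence("milk", "cookies", baskets))  # 3/7 ≈ 0.43
```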
19. Premise (or antecedent) → Conclusion (or consequent)
A low Confidence % suggests an association that happens only
by chance, and can be used to eliminate uninteresting rules
When evaluating associations among three or more
attributes, the confidence percentages are calculated
based on the first two attributes being found together with the third
Confidence Percent (cont.)
20. An example of association rules found with
a 50% confidence threshold in RapidMiner
Confidence Percent (cont.)
21. If cookies and milk were found together in 3 out of
the 10 shopping baskets, the support percentage is
calculated as 30% (3/10 = 0.3)
o There is no reciprocal for support percentages since
this metric is simply the number of times the
association occurred over the number of times it could
have occurred in the data set
Support Percent
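Support % follows the same pattern; a minimal sketch, reusing the hypothetical baskets from the confidence example above:

```python
# Same hypothetical 10-basket data as in the confidence sketch.
baskets = ([{"cookies", "milk"}] * 3 + [{"cookies"}]
           + [{"milk"}] * 4 + [{"bread"}, {"eggs"}])

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in the set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"cookies", "milk"}, baskets))  # 3/10 = 0.30
```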
22. The higher the Confidence %, the higher the likelihood that Y will be
present in purchases that include X
Confidence estimates the conditional probability of Y given X.
o Remember: “If a customer buys a dozen eggs, then he is 80% more likely to
also purchase milk.”
Support %, by contrast, has no reciprocal, since that metric is simply the
number of times the association occurred over the number of times it
could have occurred in the data set
Confidence & Support Percent
23. Minimum Confidence = 50%
Minimum Support = 20%
However, for large data sets, your minimum support
could be much lower and still be valid.
Hint: Your homework will contain a large data set.
Typical Ranges
24. 6 colors within 10 Transactions
Tiny Example: Phone Faceplates
Transaction   Faceplate Colors Purchased
1    Red, White, Green
2    White, Orange
3    White, Blue
4    Red, White, Orange
5    Red, Blue
6    White, Blue
7    Red, Blue
8    Red, White, Blue, Green
9    Red, White, Blue
10   Yellow
25. For example: Transaction 1 supports several
rules, such as
o “If red, then white” (“If a red faceplate is purchased,
then so is a white one”)
o “If white, then red”
o “If red and white, then green”
o + several more
Many Rules are Possible
26. Ideally, we want to create all possible combinations of items
Problem: computation time grows exponentially as # items
increases
Solution: consider only “frequent item sets”
Criterion for frequent: Support %
Frequent Item Sets
27. Support = # (or percent) of transactions that
include both the antecedent and the consequent
Example: support for the item set {red, white} is 4
out of 10 transactions, or 40%
Support
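The same support calculation applied to the faceplate data confirms the 40% figure (transactions transcribed from the table on slide 24):

```python
# The 10 faceplate transactions from slide 24.
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

print(support({"red", "white"}, transactions))  # 4/10 = 0.40
```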
29. Apriori Principle:
“If an item set is frequent, then all of its
subsets MUST also be frequent.”
Apriori Algorithm
30. The strategy of trimming the exponential search
space based on the Support measure is known
as “Support-based Pruning”
This is the foundational component of the Apriori
Algorithm.
Apriori Algorithm
31. For k products…
User sets a minimum support criterion
Next, generate list of one-item sets that meet the support criterion
Use the list of one-item sets to generate list of two-item sets that
meet the support criterion
Use list of two-item sets to generate list of three-item sets
Continue up through k-item sets
Generating Frequent Item Sets
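A minimal Python sketch of the level-wise generation described above (the idea, not RapidMiner's implementation): each pass keeps only candidates meeting the support criterion, and larger candidates are built only from surviving smaller ones.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise frequent item set generation (Apriori-style sketch)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {}  # frozenset -> support
    # Start with all one-item sets.
    level = [frozenset([item]) for item in {i for t in transactions for i in t}]
    while level:
        k = len(level[0])
        survivors = [s for s in level if support(s) >= min_support]
        frequent.update((s, support(s)) for s in survivors)
        # Join surviving k-item sets into (k+1)-item candidates, keeping only
        # those whose k-item subsets are all frequent (the Apriori principle).
        candidates = {a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k))]
    return frequent

# With the faceplate transactions and min_support=0.2, {red, white}
# survives at 40% support and {red, white, green} at 20%.
```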
32. Confidence: the % of antecedent transactions that also have the
consequent item set
Lift = confidence/(benchmark confidence)
Benchmark confidence = transactions with consequent as % of all
transactions
Lift > 1 indicates a rule that is useful in finding consequent item sets (i.e.,
more useful than just selecting transactions randomly)
Measures of Performance
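A sketch of the lift calculation as defined above, repeated helpers included so the snippet stands alone:

```python
# The 10 faceplate transactions from slide 24.
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def support(itemset, txns):
    return sum(1 for t in txns if itemset <= t) / len(txns)

def lift(antecedent, consequent, txns):
    """Confidence of antecedent -> consequent divided by the benchmark
    confidence (the consequent's share of all transactions)."""
    conf = support(antecedent | consequent, txns) / support(antecedent, txns)
    return conf / support(consequent, txns)

# confidence({red, white} -> {green}) = 0.2 / 0.4 = 0.5
# benchmark = support({green}) = 0.2, so lift = 2.5 (> 1: a useful rule)
print(lift({"red", "white"}, {"green"}, transactions))  # 2.5
```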
33. Generate all rules that meet specified support
& confidence
o Find frequent item sets (those with sufficient
support – see above)
o From these item sets, generate rules with
sufficient confidence
Process of Rule Selection
34. {red, white} → {green} with confidence = 2/4 = 50%
o [(support {red, white, green})/(support {red, white})]
{red, green} → {white} with confidence = 2/2 = 100%
o [(support {red, white, green})/(support {red, green})]
Plus 4 more with confidences of 100%, 33%, 29% & 100%
If the confidence criterion is 70%, report only rules 2, 3 and 6
Example: Rules from {red, white, green}
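All six rules can be enumerated directly; a sketch over the faceplate transactions from slide 24:

```python
from itertools import combinations

transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]
itemset = frozenset({"red", "white", "green"})

def count(s):
    return sum(1 for t in transactions if s <= t)

# Every nonempty proper subset is an antecedent; the rest is the consequent.
for r in range(1, len(itemset)):
    for ante in map(frozenset, combinations(sorted(itemset), r)):
        conf = count(itemset) / count(ante)
        print(f"{set(ante)} -> {set(itemset - ante)}: {conf:.0%}")
# The six confidences are 50%, 100%, 100%, 33%, 29% and 100%; only the
# three 100% rules clear a 70% confidence criterion.
```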
35. Lift ratio shows how effective the rule is in finding consequents
(useful if finding particular consequents is important)
Confidence shows the rate at which consequents will be found
(useful when estimating the costs of a promotion)
Support measures overall impact
Interpretation
36. Random data can generate apparently interesting
association rules
The more rules you produce, the greater this danger
Rules based on large numbers of records are less
subject to this danger
Caution: The Role of Chance
37. Association rules (or affinity analysis, or market basket analysis) produce
rules on associations between items from a database of transactions
Widely used in recommender systems
Most popular method is Apriori algorithm
To reduce computation, we consider only “frequent” item sets (those with sufficient support)
Performance is measured by confidence and lift
Can produce a profusion of rules; review is required to identify useful rules
and to reduce redundancy
Summary
38. Explain the concept of association rules
Develop an association rules data mining model
Understand the Apriori algorithm
Interpret output generated by the model
Effectively employ the CRISP-DM method in your
assignment
Summary - Learning Objectives
39. “This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information