1. Department of Geomatics, National Cheng Kung University
[106-2] Data Mining, Homework 3, Instructor: Prof. Hsueh-Chan Lu
Muhammad Irsyadi Firdaus (P66067055)
Using the R tool, analyze the association relationships of all multiple-check questions from all my collected
data.
- Try to set a suitable min_support and min_confidence threshold to mine the (Maximal / Closed) association
rules and show your mining results.
- Write a short report to summarize what you get / find after the association relationship analysis.
- Hint: You can sort your patterns by support / confidence / lift and try to explain why these events have
frequent positive / negative association relationships.
Answers
I use the data from HW1 and HW2 about hobbies on vacation. The number of records in my data is 68. I use
the question titled "What is the place you most like to visit?", which has ten options: Beach, Museum, Historic
Heritage, Natural Ecology, Mountains, Events & Festivals, Recreation Area, Department Stores, Night
Market, and Jungle.
To set the min_support and min_confidence thresholds for mining the association rules, the first step is to store
the data in .txt format (or another format) under the name Place_to_Visit.txt. I use the RStudio software to
do the calculation.
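In RStudio the file would be read as transactions for the arules package; purely as an illustration of the preprocessing step, a minimal Python sketch of turning multiple-check answers into transaction sets (the answer strings below are hypothetical examples, not my actual 68 records):

```python
# Hypothetical multiple-check answers, one respondent per line,
# options separated by commas (the real data lives in Place_to_Visit.txt).
raw_answers = [
    "Beach,Night Market",
    "Beach,Mountains,Natural Ecology",
    "Museum,Historic Heritage",
]

# Split each line into a transaction: the set of options that respondent chose.
transactions = [set(line.split(",")) for line in raw_answers]

for t in transactions:
    print(sorted(t))
```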
If I run apriori with its standard (default) parameters, I get the results below:
- I use min_support = 20% and min_confidence = 40%; the results can be seen below:
- I use min_support = 10%, min_confidence = 40%, minlen = 3, and target = frequent itemsets. The results
can be seen below:
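The mining itself is done by the apriori() call in R; purely to illustrate what the support threshold does, here is a brute-force Python sketch over a few hypothetical transactions (my real data has 68 records, so an efficient algorithm such as Apriori is used there instead of full enumeration):

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"Beach", "Night Market"},
    {"Beach", "Mountains"},
    {"Beach", "Night Market", "Mountains"},
    {"Museum"},
]
min_support = 0.5  # 50% for this tiny example

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

items = sorted(set().union(*transactions))
frequent = {}
# Enumerate every candidate itemset (fine for a handful of items).
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(frozenset(cand), transactions)
        if s >= min_support:
            frequent[frozenset(cand)] = s

for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```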
Maximal Frequent Itemsets
A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets
are frequent.
- I use min_support = 20%, min_confidence = 40%, minlen = 1, and target = maximal. The results can be
seen below:
Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other
words, they form the smallest set of itemsets from which all frequent itemsets can be derived. Maximal
frequent itemsets provide a valuable representation for data sets that can produce very long frequent
itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is
practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets without
having to enumerate all their subsets. Despite providing a compact representation, maximal frequent
itemsets do not contain the support information of their subsets. An additional pass over the data set is
therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it
might be desirable to have a minimal representation of frequent itemsets that preserves the support
information.
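The definition can be checked mechanically: a frequent itemset is maximal when none of its frequent proper supersets exists. A sketch over hypothetical frequent itemsets (not my actual mining output):

```python
# Hypothetical frequent itemsets (itemset -> support).
frequent = {
    frozenset({"Beach"}): 0.75,
    frozenset({"Night Market"}): 0.5,
    frozenset({"Mountains"}): 0.5,
    frozenset({"Beach", "Night Market"}): 0.5,
    frozenset({"Beach", "Mountains"}): 0.5,
}

def is_maximal(itemset, frequent):
    """Maximal: no frequent proper superset exists."""
    return not any(itemset < other for other in frequent)

maximal = {s for s in frequent if is_maximal(s, frequent)}
print([sorted(s) for s in maximal])
```

Every frequent itemset here is a subset of some maximal one, which is the compact-representation property described above.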
Closed Frequent Itemsets
Closed itemsets provide a minimal representation of itemsets without losing their support information. An
itemset X is closed if none of its immediate supersets has exactly the same support count as X. An itemset
is a closed frequent itemset if it is closed and its support is greater than or equal to min_support. We can
use the closed frequent itemsets to determine the support counts for the non-closed frequent itemsets.
- I use min_support = 20%, min_confidence = 40%, minlen = 1, and target = closed. The results can be
seen below:
Closed frequent itemsets are useful for removing some of the redundant association rules. An association
rule X → Y is redundant if there exists another rule X′ → Y′, where X is a subset of X′ and Y
is a subset of Y′, such that the support and confidence for both rules are identical. Such redundant rules
are not generated if closed frequent itemsets are used for rule generation. Finally, note that all maximal
frequent itemsets are closed because none of the maximal frequent itemsets can have the same support
count as their immediate supersets.
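Analogously to the maximal case, closedness compares each itemset's support with that of its supersets, and the support of any non-closed frequent itemset can be recovered from the closed ones. A sketch with hypothetical supports:

```python
# Hypothetical frequent itemsets (itemset -> support).
frequent = {
    frozenset({"Beach"}): 0.75,
    frozenset({"Night Market"}): 0.5,
    frozenset({"Mountains"}): 0.5,
    frozenset({"Beach", "Night Market"}): 0.5,
    frozenset({"Beach", "Mountains"}): 0.5,
}

def is_closed(itemset, frequent):
    """Closed: no proper superset has exactly the same support.
    For frequent itemsets it suffices to check frequent supersets,
    since an equal-support superset of a frequent itemset is frequent."""
    return not any(
        itemset < other and frequent[other] == frequent[itemset]
        for other in frequent
    )

closed = {s for s in frequent if is_closed(s, frequent)}

# Support of a non-closed frequent itemset = the largest support
# among its closed supersets.
nm = frozenset({"Night Market"})
recovered = max(frequent[c] for c in closed if nm <= c)
print([sorted(s) for s in closed], recovered)
```

Note that both maximal itemsets from the previous sketch also come out closed, matching the statement above, while {Beach} is closed but not maximal.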
Lift
One way to address this problem is by applying a metric known as lift:

    Lift = c(A → B) / s(B)  ...................... (3)

which computes the ratio between the rule's confidence and the support of the itemset in the rule
consequent. For binary variables, lift is equivalent to another objective measure called interest factor,
which is defined as follows:

    I(A, B) = s(A, B) / (s(A) × s(B)) = (N × f11) / (f1+ × f+1)  ...................... (4)
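Equations 3 and 4 can be checked numerically: computed from the supports or from the 2×2 contingency counts, lift and interest factor agree. The counts below are hypothetical, for illustration only:

```python
# Hypothetical 2x2 contingency counts for items A and B over N transactions.
N = 100
f11 = 30       # transactions containing both A and B
f1_plus = 50   # transactions containing A (row total f1+)
f_plus1 = 40   # transactions containing B (column total f+1)

# Supports and the confidence of the rule A -> B.
s_AB = f11 / N
s_A = f1_plus / N
s_B = f_plus1 / N
confidence = s_AB / s_A

# Equation 3: lift = confidence divided by the support of the consequent.
lift = confidence / s_B

# Equation 4: interest factor I(A, B) = s(A,B) / (s(A) * s(B)) = N*f11 / (f1+ * f+1).
interest = s_AB / (s_A * s_B)

print(lift, interest)
```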
Interest factor compares the frequency of a pattern against a baseline frequency computed under the
statistical independence assumption. The baseline frequency for a pair of mutually independent variables
is

    f11 / N = (f1+ / N) × (f+1 / N),  or equivalently  f11 = (f1+ × f+1) / N  ...................... (5)
Using Equations 4 and 5, we can interpret the measure as follows:

    I(A, B) = 1, if A and B are independent
    I(A, B) > 1, if A and B are positively correlated
    I(A, B) < 1, if A and B are negatively correlated
So when using min_support = 20% and min_confidence = 40%, the results can be summarized as follows:
- Lift equal to 1: there is no association rule / correlation (the items are independent)
- Lift larger than 1: there is a positive association rule
- Lift smaller than 1: there is a negative association rule.
No pattern in my data was found to have a negative correlation.
We can also sort the data patterns by support / confidence / lift as below:
- Support
- Confidence
- Lift
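In R this sorting is done on the mined rule set directly; conceptually it is just an ordering over the (support, confidence, lift) values of each rule. A sketch with hypothetical rules (the antecedents, consequents, and measures below are made up, not my mining output):

```python
# Hypothetical mined rules: (antecedent, consequent, support, confidence, lift).
rules = [
    ("Beach", "Night Market", 0.25, 0.55, 1.30),
    ("Mountains", "Natural Ecology", 0.21, 0.60, 1.80),
    ("Museum", "Historic Heritage", 0.20, 0.45, 0.90),
]

# Sort descending by each measure in turn.
by_support = sorted(rules, key=lambda r: r[2], reverse=True)
by_confidence = sorted(rules, key=lambda r: r[3], reverse=True)
by_lift = sorted(rules, key=lambda r: r[4], reverse=True)

for lhs, rhs, s, c, l in by_lift:
    print(f"{lhs} -> {rhs}: support={s}, confidence={c}, lift={l}")
```

Sorting by lift puts the most strongly positively associated pair first, while the rule with lift below 1 (a negative association) drops to the bottom, which is the kind of ranking used to explain the patterns above.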