1. Department of Geomatics, National Cheng Kung University
[106-2] Data Mining, Homework 3, Instructor: Prof. Hsueh-Chan Lu
Muhammad Irsyadi Firdaus (P66067055)
Using the R tool, analyze the association relationships of all multiple-check questions from all my collected
data.
- Try to set a suitable min_support and min_confidence threshold to mine the (Maximal / Closed) association
rules and show your mining results.
- Write a short report to summarize what you get / find after the association relationship analysis.
- Hint: You can sort your patterns by support / confidence / lift and try to explain why these events have
frequent positive / negative association relationships.
Answers
I use the data from HW1 and HW2 about hobbies on vacation. The number of records in my data is 68. I use
the question titled "What is the place you most like to visit?", which has ten options: Beach, Museum, Historic
Heritage, Natural Ecology, Mountains, Events & Festivals, Recreation Area, Department Stores, Night
Market, and Jungle.
To set the min_support and min_confidence thresholds for mining the association rules, the first step is to store
the data in .txt format (or another format) under the name Place_to_Visit.txt. I use the RStudio software to
do the calculation.
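In RStudio the file would be read as transactions for the arules package; purely as an illustration of the preprocessing step, a minimal Python sketch of turning multiple-check answers into transaction sets (the answer strings below are hypothetical examples, not my actual 68 records):

```python
# Hypothetical multiple-check answers, one respondent per line,
# options separated by commas (the real data lives in Place_to_Visit.txt).
raw_answers = [
    "Beach,Night Market",
    "Beach,Mountains,Natural Ecology",
    "Museum,Historic Heritage",
]

# Split each line into a transaction: the set of options that respondent chose.
transactions = [set(line.split(",")) for line in raw_answers]

for t in transactions:
    print(sorted(t))
```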
If I run apriori with its standard (default) parameters, I get the results below:
- I use min_support = 20% and min_confidence = 40%; the results can be seen below:
- I use min_support = 10%, min_confidence = 40%, minlen = 3, and target = frequent itemsets. The results
can be seen below:
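The mining itself is done by the apriori() call in R; purely to illustrate what the support threshold does, here is a brute-force Python sketch over a few hypothetical transactions (my real data has 68 records, so an efficient algorithm such as Apriori is used there instead of full enumeration):

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"Beach", "Night Market"},
    {"Beach", "Mountains"},
    {"Beach", "Night Market", "Mountains"},
    {"Museum"},
]
min_support = 0.5  # 50% for this tiny example

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

items = sorted(set().union(*transactions))
frequent = {}
# Enumerate every candidate itemset (fine for a handful of items).
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(frozenset(cand), transactions)
        if s >= min_support:
            frequent[frozenset(cand)] = s

for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```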
Maximal Frequent Itemsets
A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets
are frequent.
- I use min_support = 20%, min_confidence = 40%, minlen = 1, and target = maximal. The results can be
seen below:
Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other
words, they form the smallest set of itemsets from which all frequent itemsets can be derived. Maximal
frequent itemsets provide a valuable representation for data sets that can produce very long frequent
itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is
practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets without
having to enumerate all their subsets. Despite providing a compact representation, maximal frequent
itemsets do not contain the support information of their subsets. An additional pass over the data set is
therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it
might be desirable to have a minimal representation of frequent itemsets that preserves the support
information.
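The definition can be checked mechanically: a frequent itemset is maximal when none of its frequent proper supersets exists. A sketch over hypothetical frequent itemsets (not my actual mining output):

```python
# Hypothetical frequent itemsets (itemset -> support).
frequent = {
    frozenset({"Beach"}): 0.75,
    frozenset({"Night Market"}): 0.5,
    frozenset({"Mountains"}): 0.5,
    frozenset({"Beach", "Night Market"}): 0.5,
    frozenset({"Beach", "Mountains"}): 0.5,
}

def is_maximal(itemset, frequent):
    """Maximal: no frequent proper superset exists."""
    return not any(itemset < other for other in frequent)

maximal = {s for s in frequent if is_maximal(s, frequent)}
print([sorted(s) for s in maximal])
```

Every frequent itemset here is a subset of some maximal one, which is the compact-representation property described above.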
Closed Frequent Itemsets
Closed itemsets provide a minimal representation of itemsets without losing their support information. An
itemset X is closed if none of its immediate supersets has exactly the same support count as X. An itemset
is a closed frequent itemset if it is closed and its support is greater than or equal to min_support. We can
use the closed frequent itemsets to determine the support counts for the non-closed frequent itemsets.
- I use min_support = 20%, min_confidence = 40%, minlen = 1, and target = closed. The results can be
seen below:
Closed frequent itemsets are useful for removing some of the redundant association rules. An association
rule X → Y is redundant if there exists another rule X′ → Y′, where X is a subset of X′ and Y
is a subset of Y′, such that the support and confidence for both rules are identical. Such redundant rules
are not generated if closed frequent itemsets are used for rule generation. Finally, note that all maximal
frequent itemsets are closed because none of the maximal frequent itemsets can have the same support
count as their immediate supersets.
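Analogously to the maximal case, closedness compares each itemset's support with that of its supersets, and the support of any non-closed frequent itemset can be recovered from the closed ones. A sketch with hypothetical supports:

```python
# Hypothetical frequent itemsets (itemset -> support).
frequent = {
    frozenset({"Beach"}): 0.75,
    frozenset({"Night Market"}): 0.5,
    frozenset({"Mountains"}): 0.5,
    frozenset({"Beach", "Night Market"}): 0.5,
    frozenset({"Beach", "Mountains"}): 0.5,
}

def is_closed(itemset, frequent):
    """Closed: no proper superset has exactly the same support.
    For frequent itemsets it suffices to check frequent supersets,
    since an equal-support superset of a frequent itemset is frequent."""
    return not any(
        itemset < other and frequent[other] == frequent[itemset]
        for other in frequent
    )

closed = {s for s in frequent if is_closed(s, frequent)}

# Support of a non-closed frequent itemset = the largest support
# among its closed supersets.
nm = frozenset({"Night Market"})
recovered = max(frequent[c] for c in closed if nm <= c)
print([sorted(s) for s in closed], recovered)
```

Note that both maximal itemsets from the previous sketch also come out closed, matching the statement above, while {Beach} is closed but not maximal.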
Lift
One way to address this problem is by applying a metric known as lift:

    Lift = c(A → B) / s(B)  ...................... (3)

which computes the ratio between the rule's confidence and the support of the itemset in the rule
consequent. For binary variables, lift is equivalent to another objective measure called interest factor,
which is defined as follows:

    I(A, B) = s(A, B) / (s(A) × s(B)) = (N × f11) / (f1+ × f+1)  ...................... (4)
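Equations 3 and 4 can be checked numerically: computed from the supports or from the 2×2 contingency counts, lift and interest factor agree. The counts below are hypothetical, for illustration only:

```python
# Hypothetical 2x2 contingency counts for items A and B over N transactions.
N = 100
f11 = 30       # transactions containing both A and B
f1_plus = 50   # transactions containing A (row total f1+)
f_plus1 = 40   # transactions containing B (column total f+1)

# Supports and the confidence of the rule A -> B.
s_AB = f11 / N
s_A = f1_plus / N
s_B = f_plus1 / N
confidence = s_AB / s_A

# Equation 3: lift = confidence divided by the support of the consequent.
lift = confidence / s_B

# Equation 4: interest factor I(A, B) = s(A,B) / (s(A) * s(B)) = N*f11 / (f1+ * f+1).
interest = s_AB / (s_A * s_B)

print(lift, interest)
```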
Interest factor compares the frequency of a pattern against a baseline frequency computed under the
statistical independence assumption. The baseline frequency for a pair of mutually independent variables
is

    f11 / N = (f1+ / N) × (f+1 / N),  or equivalently  f11 = (f1+ × f+1) / N  ...................... (5)
Using Equations 4 and 5, we can interpret the measure as follows:

    I(A, B) = 1, if A and B are independent
    I(A, B) > 1, if A and B are positively correlated
    I(A, B) < 1, if A and B are negatively correlated
So when using min_support = 20% and min_confidence = 40%, the results can be summarized as follows:
- Lift equal to 1: there is no association rule / correlation (the items are independent)
- Lift larger than 1: there is a positive association rule
- Lift smaller than 1: there is a negative association rule.
No pattern in my data was found to have a negative correlation.
We can also sort the data patterns by support / confidence / lift as below:
- Support
- Confidence
- Lift
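In R this sorting is done on the mined rule set directly; conceptually it is just an ordering over the (support, confidence, lift) values of each rule. A sketch with hypothetical rules (the antecedents, consequents, and measures below are made up, not my mining output):

```python
# Hypothetical mined rules: (antecedent, consequent, support, confidence, lift).
rules = [
    ("Beach", "Night Market", 0.25, 0.55, 1.30),
    ("Mountains", "Natural Ecology", 0.21, 0.60, 1.80),
    ("Museum", "Historic Heritage", 0.20, 0.45, 0.90),
]

# Sort descending by each measure in turn.
by_support = sorted(rules, key=lambda r: r[2], reverse=True)
by_confidence = sorted(rules, key=lambda r: r[3], reverse=True)
by_lift = sorted(rules, key=lambda r: r[4], reverse=True)

for lhs, rhs, s, c, l in by_lift:
    print(f"{lhs} -> {rhs}: support={s}, confidence={c}, lift={l}")
```

Sorting by lift puts the most strongly positively associated pair first, while the rule with lift below 1 (a negative association) drops to the bottom, which is the kind of ranking used to explain the patterns above.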