Association Rule Mining with the Apriori Algorithm
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts
and Methods
A frequent pattern is a pattern that appears frequently in a data set.
Itemset: A collection of one or more items.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction
data set is a frequent itemset.
By identifying frequent patterns, we can observe which items are strongly correlated and easily
identify the shared characteristics and associations among them.
Frequent pattern mining searches for recurring relationships in a given data set.
Association rules are "if-then" statements that help show the probability of relationships
between data items within large data sets in various types of databases.
Association rule mining: Given a set of transactions, find rules that will predict the occurrence
of an item based on the occurrence of other items in the transaction.
Support count: The frequency of occurrence of an itemset, i.e., the number of transactions that contain it.
Example: σ({Milk, Bread, Diaper}) = 2 means the itemset {Milk, Bread, Diaper} appears in 2 transactions.
Support: The support of a rule x→y (where x and y are itemsets) is defined as the proportion of
transactions in the data set that contain both x and y. So, Support(x→y) = (no. of transactions
that contain both x and y) / (total no. of transactions).
Confidence: The confidence of a rule x→y is defined as Support(x→y) / Support(x). It is the
ratio of the number of transactions that include all items in the consequent (y) as well as the
antecedent (x) to the number of transactions that include all items in the antecedent (x).
support(A⇒B) = P(A∪B)
confidence(A⇒B) = P(B|A) = support(A∪B) / support(A) = support_count(A∪B) / support_count(A)
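As a quick illustration of these formulas, the minimal Python sketch below computes the support and confidence of a rule from a list of transactions. The transactions and item names here are invented for illustration only and are not the data set used in the worked problem later in these notes.

# Minimal sketch: computing support and confidence for a rule A => B.
# The transactions are made-up examples for illustration only.
transactions = [
    {"milk", "bread", "diaper"},
    {"milk", "bread"},
    {"bread", "diaper"},
    {"milk", "bread", "diaper"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule antecedent => consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule {milk, bread} => {diaper}
print(support({"milk", "bread", "diaper"}, transactions))      # 2/5 = 0.4
print(confidence({"milk", "bread"}, {"diaper"}, transactions))  # 2/3 ≈ 0.67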
For example, the information that customers who purchase computers also tend to buy antivirus
software at the same time is represented in the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%].
A support of 2% means that 2% of all the transactions under analysis show that computer and
antivirus software are purchased together. A confidence of 60% means that 60% of the customers
who purchased a computer also bought the software.
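To see how such figures arise, the short arithmetic sketch below uses assumed transaction counts chosen only to reproduce the 2% and 60% values above; the counts themselves are hypothetical.

# Hypothetical counts, assumed only to illustrate the 2% / 60% figures above.
total = 3000       # assumed total number of transactions
both = 60          # transactions containing a computer AND antivirus software
computer = 100     # transactions containing a computer

support = both / total          # 60 / 3000 = 0.02 -> 2%
confidence = both / computer    # 60 / 100  = 0.60 -> 60%
print(f"support = {support:.0%}, confidence = {confidence:.0%}")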
Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by users or domain
experts.
Apriori Algorithm
The Apriori algorithm is a sequence of steps for finding the frequent itemsets in a given
database. This data mining technique applies the join and prune steps iteratively until no new
frequent itemsets can be generated. A minimum support threshold is given in the problem or
assumed by the user. The steps are described below; a code sketch follows the list.
1. In the first iteration of the algorithm, each item is taken as a candidate 1-itemset. The
algorithm counts the occurrences of each item.
2. Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose
occurrence count satisfies min_sup is determined. Only the candidates whose count is
greater than or equal to min_sup are carried into the next iteration; the others are
pruned.
3. Next, the frequent 2-itemsets satisfying min_sup are discovered. For this, in the join step,
the candidate 2-itemsets are generated by joining the frequent 1-itemsets with each other,
forming groups of 2.
4. The candidate 2-itemsets are pruned using the min_sup threshold value. Now the table
contains only the 2-itemsets that meet min_sup.
5. The next iteration forms 3-itemsets using the join and prune steps. This iteration uses the
antimonotone (Apriori) property: every 2-itemset subset of a candidate 3-itemset must
satisfy min_sup. If all 2-itemset subsets of a candidate are frequent, the candidate is kept
and its support is counted; otherwise it is pruned.
6. The next step forms candidate 4-itemsets by joining the frequent 3-itemsets with each
other, and prunes any candidate whose subsets do not meet the min_sup criteria. The
algorithm stops when no new frequent itemsets can be generated.
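The numbered steps above can be collected into a short program. The sketch below is a minimal, unoptimized Python version of the join and prune loop; the function names and the representation of transactions as Python sets are assumptions of this sketch, and min_sup is treated as an absolute count, as in the worked problem that follows.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return every frequent itemset (frozenset -> support count) with count >= min_sup.

    transactions: list of sets of items; min_sup: absolute support count.
    """
    def count_and_prune(candidates):
        # Count occurrences of each candidate and keep only those meeting min_sup.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}

    # Steps 1-2: candidate 1-itemsets, counted and pruned by min_sup.
    items = {item for t in transactions for item in t}
    current = count_and_prune(frozenset([i]) for i in items)

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: combine pairs of frequent (k-1)-itemsets into k-itemset candidates.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step (antimonotone property): drop the candidate unless
            # every (k-1)-subset of it is already frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)
        current = count_and_prune(candidates)
    return frequent

Applied to the worked problem below with min_sup = 2, this loop stops after the 3-itemset pass: the only candidate 4-itemset the join step can produce, {I1, I2, I3, I5}, contains the infrequent subsets {I1, I3, I5} and {I2, I3, I5}, so it is pruned.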
Problem:
Find all frequent itemsets using the Apriori algorithm, given that min_sup = 2.
Solution:
[Tables C1 (candidate 1-itemsets with support counts) and L1 (frequent 1-itemsets)]
[Tables C2 (candidate 2-itemsets with support counts) and L2 (frequent 2-itemsets)]
C3:
Itemset         Sup_count
I1, I2, I3      2
I1, I2, I4      1
I1, I2, I5      2
I1, I3, I5      1
I2, I3, I4      0
I2, I3, I5      1
I2, I4, I5      0
L3:
Itemset         Sup_count
I1, I2, I3      2
I1, I2, I5      2
Only {I1, I2, I3} and {I1, I2, I5} are frequent.
Generate Association Rules: From the frequent itemsets discovered above, the association rules
could be:
{I1, I2} => {I3}
Confidence = support {I1, I2, I3} / support {I1, I2} = (2 / 4)* 100 = 50%
{I1, I3} => {I2}
Confidence = support {I1, I2, I3} / support {I1, I3} = (2 / 4)* 100 = 50%
{I2, I3} => {I1}
Confidence = support {I1, I2, I3} / support {I2, I3} = (2 / 4)* 100 = 50%
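The rule-generation step above can also be sketched in code. The support counts below are the ones used in the calculations above (count 2 for {I1, I2, I3} and count 4 for each of its 2-itemset subsets, as implied by the 2/4 confidences); the dictionary layout and variable names are assumptions of this sketch.

from itertools import combinations

# Support counts taken from the worked example above.
support_count = {
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

itemset = frozenset({"I1", "I2", "I3"})
for pair in combinations(sorted(itemset), 2):
    antecedent = frozenset(pair)
    consequent = itemset - antecedent
    conf = support_count[itemset] / support_count[antecedent]
    print(f"{set(antecedent)} => {set(consequent)}: confidence = {conf:.0%}")
# Each rule has confidence 2/4 = 50%, matching the calculations above.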