2. Association Rule Mining
• Association rules are if/then statements that help uncover
relationships between seemingly unrelated data in a relational
database or other information repository.
• Association rules are created by analysing data for frequent if/then
patterns and using the criteria support and confidence to identify the
most important relationships. Support is an indication of how
frequently the items appear in the database. Confidence indicates the
number of times the if/then statements have been found to be true.
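The two measures above can be computed directly. The following sketch uses a small hypothetical transaction database (the item names and the `support`/`confidence` helpers are illustrative, not from the slides):

```python
# Hypothetical transaction database to illustrate support and confidence.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule: {bread} -> {milk}
print(support({"bread", "milk"}, transactions))       # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3  (2 of the 3 bread transactions)
```

Here the rule "if bread, then milk" has 50% support and about 67% confidence.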
3. Association Rule Mining
• In data mining, association rules are useful for analysing and
predicting customer behaviour. They play an important part in
shopping basket data analysis, product clustering, catalogue design
and store layout.
• Programmers use association rules to build programs capable of
machine learning. Machine learning is a type of artificial intelligence
(AI) that seeks to build programs with the ability to become more
efficient without being explicitly programmed.
4. Association Rule Mining
• Several algorithms have been proposed in the literature to address
the problem of mining association rules; it is among the most widely
studied problems in data mining.
• The methods for generating frequent patterns fall into two
categories:
• Sequential Methods
• Parallel Methods
5. Sequential Methods
• Sequential pattern mining is a case of structured data mining.
• It forms the foundation of most known algorithms.
• A sequence database contains a set of sequences. For example,
consider a database containing four sequences named seq1, seq2,
seq3 and seq4.
6. Sequential Methods
• For our example, consider that the symbols “a”, “b”, “c”, “d”, “e”, “f”,
“g” and “h” respectively represent items sold in a supermarket. For
example, “a” could represent an “apple”, “b” could be some “bread”,
etc.
• Now, a sequence is an ordered list of sets of items. For our example,
we will assume that each sequence represents what a customer has
bought in our supermarket over time. For example, consider the
second sequence “seq2”. This sequence indicates that the second
customer bought items “a” and “d” together, then bought item “c”,
then bought “b”, and then bought “a”, “b”, “e” and “f” together.
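A sequence database like this can be modelled as a mapping from sequence names to ordered lists of itemsets. Only “seq2” is fully specified in the text, so the other three sequences below are hypothetical placeholders:

```python
# Sequence database: each sequence is an ordered list of itemsets.
# Only "seq2" is specified in the text; seq1, seq3 and seq4 are
# hypothetical placeholders for illustration.
seq_db = {
    "seq1": [{"a"}, {"b", "c"}],                       # hypothetical
    "seq2": [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    "seq3": [{"e"}, {"g"}],                            # hypothetical
    "seq4": [{"a"}, {"h"}],                            # hypothetical
}

# Each position in a sequence is one shopping visit; a set holds the
# items bought together on that visit.
second_visit = seq_db["seq2"][1]
print(second_visit)  # {'c'}
```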
7. Simple Apriori
• The simple Apriori algorithm is a two-step process.
• It was developed by Rakesh Agrawal and Ramakrishnan Srikant in 1994.
• The steps are as follows:
1. The join step: To find LK, a set of candidate K-itemsets CK is generated by
joining LK-1 with itself. The rule for the join is that the items in each itemset
are kept in sorted order so they can be compared item by item; two itemsets in
LK-1 are joinable only if their first (K-2) items are identical.
2. The prune step: The join step produces all candidate K-itemsets, but not all
of them are frequent. By the Apriori property, any candidate that has an
infrequent (K-1)-subset is pruned; the database is then scanned to count the
support of the remaining candidates.
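The join and prune steps above can be sketched as a candidate-generation function. This is a minimal illustration, not the authors' implementation; the function name `apriori_gen` follows the name used in the 1994 paper, and the sample L2 is made up:

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    Join step: two sorted (k-1)-itemsets join if their first k-2 items agree.
    Prune step: a candidate survives only if every one of its (k-1)-subsets
    is itself frequent (the Apriori property).
    """
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:                  # join on common prefix
                cand = tuple(sorted(set(a) | set(b)))
                if all(sub in prev_set
                       for sub in combinations(cand, k - 1)):  # prune
                    candidates.append(cand)
    return candidates

# Frequent 2-itemsets -> candidate 3-itemsets
L2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(L2, 3))  # [('a', 'b', 'c')]
```

Note that ("b", "c", "d") is generated by the join but pruned, because its subset ("c", "d") is not in L2.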
8. Hash Based Apriori
• A hash-based technique can be applied to reduce the size of the
candidate K-itemset CK for K > 1.
• Our main aim is to reduce the number of scans on the database.
• For example: while scanning each transaction in the dataset to generate the
frequent 1-itemsets L1 from the candidate 1-itemsets C1, we can also generate all
2-itemsets of each transaction and hash them into the corresponding buckets of a
hash table, incrementing the bucket counts.
H(A,B) = ((Order of A)*10 + (Order of B)) mod 8
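The bucket-counting idea can be sketched as follows, using the hash function from the slide. The transactions are hypothetical, and "order" is assumed to mean the item's position in the alphabet (a = 1, b = 2, …):

```python
from itertools import combinations

def order(item):
    """Assumed item order: position in the alphabet (a=1, b=2, ...)."""
    return ord(item) - ord("a") + 1

def h(a, b):
    """Bucket index from the slide: H(A,B) = ((order of A)*10 + order of B) mod 8."""
    return (order(a) * 10 + order(b)) % 8

# Hypothetical transactions (not from the text).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]

# One pass: count every 2-itemset of every transaction into its bucket.
buckets = [0] * 8
for t in transactions:
    for a, b in combinations(sorted(t), 2):
        buckets[h(a, b)] += 1

# Any 2-itemset hashing to a bucket whose count is below the minimum
# support count cannot be frequent, so it is dropped from C2 before
# the next database scan.
print(buckets)  # [0, 0, 0, 0, 2, 2, 0, 2]
```

Here buckets 4, 5 and 7 each hold count 2; with a minimum support count of 2, only 2-itemsets hashing into those buckets remain candidates.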
9. Partition Based Apriori
• A partition of the database refers to any subset of the transactions
contained in the database D.
• Partitioning reduces the number of database scans: it divides the
database into small partitions such that each partition can be handled
in main memory.
• The Partition algorithm scans the database only twice.
• Scan 1: Partition the database and find the local frequent patterns.
• Scan 2: Consolidate the global frequent patterns. Initially, the
database D is logically partitioned into n partitions.
10. Partition Based Apriori
• Phase I: Read the entire database once; this takes n iterations.
• Input: Pi where i = 1…n
• Output: Local large itemsets of all lengths
• Merge Phase: The set of global candidate itemsets of length J is
computed.
• Input: Local large itemsets of the same length from all n partitions
• Output: Combined global candidate itemsets
• Phase II: Read the entire database again; this also takes n iterations.
• Input: Pi where i = 1…n
• Output: A counter for each global candidate itemset, counting its support.
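The two-scan structure above can be sketched in miniature. This toy version handles only itemsets of size 1 and 2 and uses a made-up database split into two partitions; it is an illustration of the phase structure, not a full Partition implementation:

```python
from itertools import combinations

def local_frequent(partition, min_sup_frac):
    """Scan 1 (Phase I): frequent itemsets (here up to size 2) in one partition."""
    min_count = min_sup_frac * len(partition)
    counts = {}
    for t in partition:
        for size in (1, 2):
            for iset in combinations(sorted(t), size):
                counts[iset] = counts.get(iset, 0) + 1
    return {iset for iset, c in counts.items() if c >= min_count}

# Hypothetical database split into two in-memory partitions.
p1 = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
p2 = [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
db = p1 + p2
min_sup = 0.5  # 50% support

# Merge phase: the union of local frequent itemsets gives the global candidates.
candidates = local_frequent(p1, min_sup) | local_frequent(p2, min_sup)

# Scan 2 (Phase II): count every global candidate over the full database.
global_frequent = {
    c for c in candidates
    if sum(set(c) <= t for t in db) >= min_sup * len(db)
}
print(sorted(global_frequent))
```

The key property this relies on is that any globally frequent itemset must be locally frequent in at least one partition, so the merge phase cannot miss a true answer.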
11. Parallel Methods
• Parallel algorithms can be implemented over distributed-memory
systems.
• The parallel work proceeds as follows:
• Each processor gathers the locally frequent itemsets of all sizes in
one pass over its local database. All potentially frequent itemsets are
then broadcast to the other processors.
• Then each processor gathers the counts of these global candidate
itemsets. There are two major approaches to organising the
processors:
• Distributed memory machines
• Shared memory processor system
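The count-exchange step above can be sketched with the processors simulated sequentially; in a real distributed-memory system the final summation would happen via message passing (e.g. a broadcast or all-reduce). The local databases and candidates below are hypothetical:

```python
from collections import Counter

def local_counts(local_db, candidates):
    """Each 'processor' counts the global candidates over its local partition."""
    return Counter(
        c for t in local_db for c in candidates if set(c) <= t
    )

# Hypothetical local databases, one per simulated processor.
local_dbs = [
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b", "c"}, {"b", "c"}],
]
candidates = [("a",), ("b",), ("c",), ("a", "b")]

# Exchange phase simulated by summing the local counts; a real
# implementation would communicate these counts between processors.
total = Counter()
for db in local_dbs:
    total += local_counts(db, candidates)

print(total[("a", "b")])  # 2: global support count of {a, b}
```

After the exchange, every processor holds the global support count of every candidate and can decide frequency locally.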