Association Analysis: Definition. Association analysis is the task of uncovering relationships among data. Association rules: an association rule model identifies how data items are associated with each other. Ex: it is used in retail sales to identify items that are frequently purchased together.
Association rules take the form: IF (condition) THEN (result). Example: IF a customer purchases Coke, THEN the customer also purchases orange juice. The first part is the rule body and the second part is the rule head.
Strength of a rule: how certain is the rule? Confidence measures the certainty of a rule. It is the percentage of transactions containing all items stated in the condition that also contain the items in the result: Confidence(A, B) = P(B | A). Example: the rule "If Coke then orange juice" has a confidence of 100%.
Strength of a rule: how often does the rule occur? Support measures the usefulness of a rule. It is the percentage of transactions that contain all items in the rule: Support(A, B) = P(A, B). Example: for the rule "If Coke then orange juice", 2 of the 5 transactions contain both Coke and OJ, so the support of the rule is 40%.
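The two measures can be computed directly from a list of baskets. The sketch below is only an illustration: the five transactions are hypothetical (the slides do not list them) and were chosen so that they reproduce the 40% support and 100% confidence quoted for the Coke and orange juice rule. The function names `support` and `confidence` are likewise assumptions of this example.

```python
# Hypothetical transactions (not from the original slides), chosen so that
# 2 of 5 baskets contain both items and every Coke basket also contains OJ.
transactions = [
    {"coke", "orange juice", "milk"},
    {"coke", "orange juice"},
    {"milk", "bread"},
    {"orange juice", "bread"},
    {"milk", "detergent"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(body, head, transactions):
    """Estimate P(head | body) as support(body and head) / support(body)."""
    body, head = set(body), set(head)
    body_support = support(body, transactions)
    if body_support == 0:
        raise ValueError("rule body never occurs in the transactions")
    return support(body | head, transactions) / body_support


rule_support = support({"coke", "orange juice"}, transactions)
rule_confidence = confidence({"coke"}, {"orange juice"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
# prints: support = 40%, confidence = 100%
```

Note that the reverse rule (orange juice, then Coke) has the same support but a lower confidence on this data, because one orange juice basket contains no Coke.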
Association Rule Mining: a two-step process. 1. Find all frequent k-itemsets, k = 1, 2, 3, … The set of all items in a rule is referred to as an itemset, and a rule that contains k items forms a k-itemset. The occurrence frequency of a k-itemset is the number of transactions that contain all k items in the itemset. An itemset that satisfies a minimum support (or minimum occurrence frequency) is called a frequent itemset.
Association Rule Mining: 2. Generate strong association rules from the frequent k-itemsets. Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong rules.
Apriori Algorithm: find all frequent k-itemsets. Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, none of its supersets can be frequent.
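The principle is what lets Apriori discard candidates without counting them. As a minimal sketch, assuming a set of frequent 2-itemsets is already known, a 3-item candidate can be rejected as soon as one of its 2-item subsets is missing from that set. The helper name `has_infrequent_subset` and the item names are hypothetical.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_smaller):
    """Return True if any subset one item smaller than `candidate` is not frequent."""
    k = len(candidate) - 1
    return any(
        frozenset(subset) not in frequent_smaller
        for subset in combinations(candidate, k)
    )


# Hypothetical frequent 2-itemsets, for illustration only.
frequent_2 = {
    frozenset({"coke", "chips"}),
    frozenset({"coke", "salsa"}),
    frozenset({"chips", "salsa"}),
    frozenset({"coke", "bread"}),
}

keep = frozenset({"coke", "chips", "salsa"})  # every 2-item subset is frequent
drop = frozenset({"coke", "chips", "bread"})  # {"chips", "bread"} is not frequent

print(has_infrequent_subset(keep, frequent_2))  # False: must still be counted
print(has_infrequent_subset(drop, frequent_2))  # True: can be pruned unseen
```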
Apriori Algorithm, method:
1. Let k = 1 and generate the frequent itemsets of length 1.
2. Repeat until no new frequent itemsets are identified:
   a. Generate length (k+1) candidate itemsets from the length-k frequent itemsets.
   b. Prune candidate itemsets containing subsets of length k that are infrequent.
   c. Count the support of each candidate by scanning the DB.
   d. Eliminate candidates that are infrequent, leaving only those that are frequent.
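The level-wise loop above can be sketched in a few lines of Python. This is a plain, unoptimized illustration and not a reference implementation: the function name `apriori`, the join by pairwise union, and the sample transactions are all assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count_support(candidates):
        # One scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Eliminate candidates that are infrequent.
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # k = 1: frequent single items.
    items = {frozenset([item]) for t in transactions for item in t}
    current = count_support(items)
    frequent = dict(current)
    k = 1
    while current:
        prev = set(current)
        # Generate (k+1)-candidates by joining pairs of frequent k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        current = count_support(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions, for illustration only.
transactions = [
    {"coke", "orange juice", "milk"},
    {"coke", "orange juice"},
    {"milk", "bread"},
    {"orange juice", "bread"},
    {"milk", "detergent"},
]

frequent_itemsets = apriori(transactions, min_support=0.4)
for itemset, supp in sorted(
    frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), supp)
```

With a minimum support of 40%, this data yields four frequent single items and one frequent pair, Coke with orange juice. Detergent appears in only one basket and is dropped at k = 1.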
Generate strong association rules from the frequent k-itemsets: for each frequent k-itemset, generate all of its non-empty proper subsets. For every such subset, generate the rule "IF subset THEN remaining items of the itemset" and its associated confidence. Output the rule if the minimum confidence threshold is satisfied.
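A minimal sketch of this rule-generation step follows, assuming the frequent itemsets and their supports are already available as a dictionary. The function name `generate_rules` and the support values are hypothetical. Confidence is computed as the support of the whole itemset divided by the support of the rule body.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence):
    """Return (body, head, support, confidence) for rules meeting min_confidence.

    `frequent` maps frozenset itemsets to their support and must contain every
    subset of each itemset, which holds for the output of Apriori.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty body and a non-empty head
        for r in range(1, len(itemset)):
            for body in combinations(sorted(itemset), r):
                body = frozenset(body)
                head = itemset - body
                conf = supp / frequent[body]
                if conf >= min_confidence:
                    rules.append((body, head, supp, conf))
    return rules


# Hypothetical frequent itemsets with their supports, for illustration only.
frequent = {
    frozenset({"coke"}): 0.4,
    frozenset({"orange juice"}): 0.6,
    frozenset({"coke", "orange juice"}): 0.4,
}

rules = generate_rules(frequent, min_confidence=0.7)
for body, head, supp, conf in rules:
    print(f"IF {sorted(body)} THEN {sorted(head)}: support={supp:.0%}, confidence={conf:.0%}")
```

Only the rule from Coke to orange juice passes the 70% threshold here. The reverse rule has a confidence of about 67% and is not output.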
Multilevel association rules: it is difficult to find strong associations at very low or primitive levels of data.
Few people may buy "IBM desktop computer" and "Sony b/w printer" together, but many people may purchase "computer" and "printer" together.
Concept hierarchy: defines a sequence of mappings from a set of low-level concepts to higher-level concepts. Ex: IBM, Microsoft, HP, ……… (low level); computer, software, printer, accessory (higher level).
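One simple way to represent such a hierarchy is a child-to-parent dictionary. The sketch below generalizes each basket to the higher level and compares support before and after. The "IBM desktop computer" and "Sony b/w printer" items come from the slides; the other two products, the baskets, and the names `hierarchy`, `generalize` and `support` are hypothetical.

```python
# Child -> parent mapping. The HP and Dell entries are hypothetical additions.
hierarchy = {
    "IBM desktop computer": "computer",
    "Dell laptop computer": "computer",
    "Sony b/w printer": "printer",
    "HP laser printer": "printer",
}

# Hypothetical baskets, for illustration only.
transactions = [
    {"IBM desktop computer", "Sony b/w printer"},
    {"IBM desktop computer", "HP laser printer"},
    {"Dell laptop computer", "Sony b/w printer"},
    {"Dell laptop computer", "HP laser printer"},
]


def generalize(transaction, hierarchy):
    """Replace each item by its parent concept; items without a parent are kept."""
    return {hierarchy.get(item, item) for item in transaction}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


generalized = [generalize(t, hierarchy) for t in transactions]

low_level = support({"IBM desktop computer", "Sony b/w printer"}, transactions)
high_level = support({"computer", "printer"}, generalized)
print(low_level, high_level)  # 0.25 1.0
```

On this toy data the specific pair occurs in one basket out of four, while the generalized pair occurs in all of them, which is the effect the slides describe.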
Steps to be followed: use a top-down, progressive deepening approach. First mine high-level frequent items, then mine their lower-level frequent items, and so on. At each level, the Apriori algorithm is used. Either use a uniform minimum support for all levels, or use a reduced minimum support at lower levels.
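The progressive deepening idea can be sketched for a two-level hierarchy. To keep the example short it counts only single items at each level, whereas a full version would run Apriori per level as the slide says. The hierarchy, the baskets, the thresholds and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical two-level hierarchy (child -> parent), for illustration only.
parent = {
    "IBM desktop computer": "computer",
    "Dell laptop computer": "computer",
    "Sony b/w printer": "printer",
    "HP laser printer": "printer",
    "USB cable": "accessory",
}

# Hypothetical baskets.
transactions = [
    {"IBM desktop computer", "Sony b/w printer"},
    {"IBM desktop computer", "HP laser printer"},
    {"Dell laptop computer", "HP laser printer"},
    {"IBM desktop computer"},
    {"USB cable"},
]


def frequent_items(transactions, min_support):
    """Return {item: support} for single items meeting min_support."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items() if c / n >= min_support}


def mine_top_down(transactions, parent, high_min_support, low_min_support):
    """Mine the high level first, then only the children of frequent parents."""
    generalized = [{parent.get(i, i) for i in t} for t in transactions]
    level1 = frequent_items(generalized, high_min_support)
    # Progressive deepening: drop items whose parent was not frequent.
    filtered = [{i for i in t if parent.get(i) in level1} for t in transactions]
    level2 = frequent_items(filtered, low_min_support)
    return level1, level2


# Reduced minimum support at the lower level.
level1, level2_reduced = mine_top_down(transactions, parent, 0.5, 0.3)
# Uniform minimum support at both levels.
_, level2_uniform = mine_top_down(transactions, parent, 0.5, 0.5)

print(sorted(level1))
print(sorted(level2_reduced))
print(sorted(level2_uniform))
```

With the reduced threshold, both "IBM desktop computer" and "HP laser printer" survive at the lower level. With the uniform threshold only "IBM desktop computer" does, which shows why a lower threshold is often preferred for specific items.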
Sequential Association Rule: concerns sequences of events. Examples: new homeowners purchase shower curtains before purchasing furniture. When a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all of his or her accounts.
Sequential Association Rule: transactions must have two additional features: (1) a time stamp or sequencing information, to determine when transactions occurred relative to each other; (2) identifying information, such as an account number or ID number.
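These two features are what make it possible to turn a flat transaction table into one ordered sequence per customer. The sketch below assumes records of the form (customer id, date, item). The records themselves and the helper names `build_sequences` and `occurs_before` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (customer_id, purchase_date, item).
records = [
    ("C1", date(1999, 3, 1), "furniture"),
    ("C1", date(1999, 1, 15), "shower curtain"),
    ("C2", date(1999, 2, 10), "shower curtain"),
    ("C2", date(1999, 6, 5), "furniture"),
    ("C3", date(1999, 4, 20), "furniture"),
]


def build_sequences(records):
    """Group records by customer id and order each group by time stamp."""
    by_customer = defaultdict(list)
    for customer_id, when, item in records:
        by_customer[customer_id].append((when, item))
    return {cid: sorted(events) for cid, events in by_customer.items()}


def occurs_before(sequence, first, second):
    """True if some `first` event is strictly earlier than some `second` event."""
    first_times = [when for when, item in sequence if item == first]
    second_times = [when for when, item in sequence if item == second]
    if not first_times or not second_times:
        return False
    return min(first_times) < max(second_times)


sequences = build_sequences(records)
matches = [
    cid for cid, seq in sequences.items()
    if occurs_before(seq, "shower curtain", "furniture")
]
print(matches)  # ['C1', 'C2']
```

Without the customer id the purchases of different people could not be separated, and without the date the order "shower curtain, then furniture" could not be established.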
Some important parameters. Duration: may be the entire available sequence in the database, or a user-selected subsequence, such as the year 1999. Event folding window: a set of events occurring within a specified period of time, such as within the same day, can be viewed as occurring together.
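Both parameters amount to simple preprocessing of a time-stamped event list. In the sketch below, `select_duration` restricts the data to a chosen period and `fold_by_day` applies a one-day folding window. The event data and both function names are hypothetical.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical time-stamped events for one customer.
events = [
    (datetime(1999, 5, 1, 9, 30), "deposit"),
    (datetime(1999, 5, 1, 16, 0), "account reconciliation"),
    (datetime(1999, 5, 3, 11, 15), "close account"),
    (datetime(2000, 1, 7, 10, 0), "open account"),
]


def select_duration(events, start, end):
    """Keep events whose time stamp satisfies start <= time stamp < end."""
    return [(when, event) for when, event in events if start <= when < end]


def fold_by_day(events):
    """Merge events on the same calendar day into one unordered set."""
    ordered = sorted(events)
    return [
        (day, {event for _, event in group})
        for day, group in groupby(ordered, key=lambda pair: pair[0].date())
    ]


# Duration: the year 1999 only.
in_1999 = select_duration(events, datetime(1999, 1, 1), datetime(2000, 1, 1))
# Event folding window: one day.
folded = fold_by_day(in_1999)
for day, same_day_events in folded:
    print(day, sorted(same_day_events))
```

After folding, the deposit and the account reconciliation on 1 May 1999 are treated as occurring together, so no order between them is implied.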
Some important parameters. Interval between events in the discovered pattern: interval = 0 means finding strictly consecutive sequences; min_int <= interval <= max_int means finding patterns whose events are separated by at least min_int and at most max_int; interval = c means finding patterns carrying an exact interval c.
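The slides do not say how the interval is measured. The sketch below assumes it is the number of events that lie between the two matched events, so that an interval of 0 means the events are adjacent. A time-based gap would work the same way with time stamps in place of positions. The function name `find_pattern` and the event sequence are hypothetical.

```python
def find_pattern(sequence, first, second, min_int=0, max_int=0):
    """Return index pairs (i, j) with sequence[i] == first, sequence[j] == second,
    and min_int <= (number of events between i and j) <= max_int."""
    pairs = []
    for i, event in enumerate(sequence):
        if event != first:
            continue
        lowest = i + 1 + min_int
        highest = min(i + 1 + max_int, len(sequence) - 1)
        for j in range(lowest, highest + 1):
            if sequence[j] == second:
                pairs.append((i, j))
    return pairs


# Hypothetical ordered event sequence for one customer.
sequence = ["reconcile", "deposit", "close", "reconcile", "close"]

# interval = 0: strictly consecutive events only.
print(find_pattern(sequence, "reconcile", "close", 0, 0))  # [(3, 4)]
# 0 <= interval <= 1: at most one event in between.
print(find_pattern(sequence, "reconcile", "close", 0, 1))  # [(0, 2), (3, 4)]
# interval = 1: exactly one event in between.
print(find_pattern(sequence, "reconcile", "close", 1, 1))  # [(0, 2)]
```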
Some Practical Issues: time window of transactions; level of aggregation; level of support and confidence.
Time window of transactions: select a time window that covers at least 2 product cycles, e.g. if customers purchase a product every six months or less, select a 12-month window of customer transaction data. For frequently purchased products, a short time window is sufficient; for low-frequency items, a longer time window is necessary.
Level of aggregation: if product codes in the data are too specific (for example, based on product details such as size and flavour), few associations will be discovered. Group products into categories according to the product hierarchy, or create a new level manually.
Level of support and confidence: start with a high support and gradually reduce it. Set confidence to around 50% to reduce the number of permutations.
Conclusion: association rules, including multilevel and sequential association rules, are studied. The Apriori algorithm is described in detail. Various practical issues in association rules are analyzed.