Association Analysis
Association Analysis: Definition
- Association analysis is the task of uncovering relationships among data.
- Association rules: a model that identifies how data items are associated with each other.
- Example: in retail sales, association rules are used to identify items that are frequently purchased together.
What is a rule?
- Structure of a rule: IF (condition) THEN (result)
- Example: IF a customer purchases Coke, THEN the customer also purchases orange juice.
- The first part (the condition) is the rule body and the second part (the result) is the rule head.
Strength of a rule: confidence
- How certain is the rule? Confidence measures the certainty of a rule.
- It is the percentage of transactions containing all items stated in the condition that also contain the items in the result.
- Confidence(A, B) = P(B | A)
- Example: the rule "IF Coke THEN orange juice" has a confidence of 100%.
Strength of a rule: support
- How often does the rule occur? Support measures the usefulness of a rule.
- It is the percentage of transactions that contain all items in the rule.
- Support(A, B) = P(A, B)
- Example: for the rule "IF Coke THEN orange juice", 2 of the 5 transactions contain both Coke and orange juice, so the support of the rule is 40%.
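
The two measures can be checked with a few lines of Python. This is a minimal sketch: the transaction list is hypothetical, chosen only so that the figures match the example (5 transactions, 2 containing both Coke and orange juice), and the function names `support` and `confidence` are illustrative.

```python
# Hypothetical transactions, chosen to match the figures in the example:
# 5 transactions, 2 of which contain both Coke and orange juice.
transactions = [
    {"coke", "orange juice", "milk"},
    {"coke", "orange juice"},
    {"milk", "bread"},
    {"orange juice", "bread"},
    {"milk", "detergent"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(body, head, transactions):
    """P(head | body): support of body and head together over support of body.

    Assumes the body occurs in at least one transaction.
    """
    body, head = set(body), set(head)
    return support(body | head, transactions) / support(body, transactions)


print(support({"coke", "orange juice"}, transactions))       # 0.4
print(confidence({"coke"}, {"orange juice"}, transactions))  # 1.0
```

In this data every transaction with Coke also has orange juice, which is why the confidence is 100% while the support is only 40%.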
Association Rule Mining: a two-step process
1. Find all frequent k-itemsets, k = 1, 2, 3, ...
   - The set of all items in a rule is referred to as an itemset.
   - A rule that contains k items forms a k-itemset.
   - The occurrence frequency of a k-itemset is the number of transactions that contain all k items in the itemset.
   - An itemset that satisfies a minimum support (or minimum occurrence frequency) is called a frequent itemset.
Association Rule Mining (continued)
2. Generate strong association rules from the frequent k-itemsets.
   - Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong rules.
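Apriori Algorithm: find all frequent k-itemsets
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- Equivalently, if an itemset is infrequent, every superset of it is infrequent as well, so such supersets never need to be counted.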
Illustrating Apriori Principle
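
One way to illustrate the principle in code is to check it on a small data set. The sketch below is an assumption-laden example, not the original slide's figure: the transactions, the threshold `MIN_COUNT`, and the helper names are all hypothetical.

```python
from itertools import combinations

# Hypothetical transactions (not taken from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
MIN_COUNT = 2  # assumed minimum occurrence frequency


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def subsets_are_frequent(itemset, min_count):
    """True if every non-empty proper subset of `itemset` is frequent."""
    items = sorted(itemset)
    return all(
        count(sub) >= min_count
        for r in range(1, len(items))
        for sub in combinations(items, r)
    )


frequent = {"bread", "milk", "diapers"}
# A frequent itemset implies frequent subsets.
print(count(frequent) >= MIN_COUNT, subsets_are_frequent(frequent, MIN_COUNT))

# Contrapositive: an infrequent itemset makes every superset infrequent.
print(count({"eggs"}), count({"bread", "eggs"}))
```

The second print shows why pruning is safe: once {eggs} falls below the threshold, no itemset containing eggs can reach it.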
Apriori Algorithm: method
1. Let k = 1.
2. Generate frequent itemsets of length 1.
3. Repeat until no new frequent itemsets are identified:
   - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
   - Prune candidate itemsets that contain any infrequent subset of length k.
   - Count the support of each candidate by scanning the database.
   - Eliminate candidates that are infrequent, leaving only those that are frequent.
   - Increment k.
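
The method above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than a reference implementation: the function name `apriori`, the use of an absolute count `min_count` as the support threshold, and the sample transactions are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset occurring in at least
    `min_count` transactions. Itemsets are frozensets."""
    transactions = [frozenset(t) for t in transactions]

    def count_and_filter(candidates):
        # One scan of the database per level: count, then drop infrequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    # k = 1: frequent itemsets of length 1.
    items = {item for t in transactions for item in t}
    current = count_and_filter({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 1
    while current:
        # Generate length-(k+1) candidates from pairs of frequent k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune candidates that have an infrequent subset of length k.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        current = count_and_filter(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions (not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(transactions, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum count of 3, this data yields four frequent 1-itemsets and four frequent 2-itemsets; the only 3-itemset candidate that survives pruning, {bread, milk, diapers}, occurs just twice and is eliminated.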
Generate strong association rules from the frequent k-itemsets
- For each frequent k-itemset, generate all non-empty proper subsets.
- For every such subset s, generate the rule "IF s THEN (itemset minus s)" and compute its confidence.
- Output the rule if the minimum confidence threshold is satisfied.
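
A sketch of this rule-generation step is shown below. It assumes the frequent itemsets are already available as a dictionary mapping each itemset to its transaction count; the name `generate_rules`, the dictionary layout, and the sample counts are hypothetical.

```python
from itertools import combinations


def generate_rules(frequent, min_conf):
    """Generate (body, head, confidence) rules from frequent itemsets.

    `frequent` maps frozenset -> count and must contain every frequent
    itemset together with all of its subsets (which Apriori guarantees).
    """
    rules = []
    for itemset, n in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty body and a non-empty head
        for r in range(1, len(itemset)):
            for body in combinations(sorted(itemset), r):
                body = frozenset(body)
                head = itemset - body
                conf = n / frequent[body]  # count(body and head) / count(body)
                if conf >= min_conf:
                    rules.append((body, head, conf))
    return rules


# Hypothetical counts over 5 transactions.
frequent = {
    frozenset({"coke"}): 2,
    frozenset({"orange juice"}): 3,
    frozenset({"coke", "orange juice"}): 2,
}
rules = generate_rules(frequent, min_conf=0.8)
for body, head, conf in rules:
    print(sorted(body), "->", sorted(head), conf)
```

Here "IF Coke THEN orange juice" passes with confidence 1.0, while the reverse rule has confidence 2/3 and is dropped at the assumed 0.8 threshold.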
Multilevel association rules
- It is difficult to find strong associations at very low (primitive) levels of data.
- Few people may buy "IBM desktop computer" and "Sony b/w printer" together.
- Many people may purchase "computer" and "printer" together.
Concept hierarchy
- Defines a sequence of mappings from a set of low-level concepts to higher-level concepts.
- Example:
    Low level:     IBM       Microsoft  HP       ...
    Higher level:  computer  software   printer  accessory
Steps to be followed
- Use a top-down, progressive deepening approach.
- First mine high-level frequent items.
- Then mine their lower-level frequent items, and so on.
- At each level, the Apriori algorithm is used.
- Use either a uniform minimum support for all levels or a reduced minimum support at lower levels.
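
The effect of moving between levels can be sketched with a small concept hierarchy. Everything in this example is hypothetical: the hierarchy entries (including the "Dell" and "HP color" items), the transactions, and the two thresholds are assumptions chosen only to show the idea.

```python
# Hypothetical concept hierarchy: low-level item -> higher-level concept.
hierarchy = {
    "IBM desktop computer": "computer",
    "Dell laptop computer": "computer",
    "Sony b/w printer": "printer",
    "HP color printer": "printer",
}

# Hypothetical low-level transactions.
transactions = [
    {"IBM desktop computer", "Sony b/w printer"},
    {"IBM desktop computer", "HP color printer"},
    {"Dell laptop computer", "Sony b/w printer"},
    {"Dell laptop computer", "HP color printer"},
    {"IBM desktop computer"},
]


def generalize(transaction):
    """Replace each item by its higher-level concept (items without one stay as-is)."""
    return {hierarchy.get(item, item) for item in transaction}


def support(itemset, txns):
    """Fraction of transactions in `txns` containing every item in `itemset`."""
    return sum(1 for t in txns if set(itemset) <= t) / len(txns)


high_level = [generalize(t) for t in transactions]
high = support({"computer", "printer"}, high_level)
low = support({"IBM desktop computer", "Sony b/w printer"}, transactions)
print(high, low)  # 0.8 0.2

UNIFORM_MIN_SUPPORT = 0.5   # assumed threshold used at every level
REDUCED_MIN_SUPPORT = 0.15  # assumed lower threshold for the low level
print(low >= UNIFORM_MIN_SUPPORT, low >= REDUCED_MIN_SUPPORT)  # False True
```

The high-level pair is frequent at 80%, while the specific low-level pair reaches only 20%: it is missed under a uniform threshold of 50% but found when the minimum support is reduced at the lower level.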
Sequential Association Rule
- Concerns sequences of events.
- Example: new homeowners purchase shower curtains before purchasing furniture.
- Example: when a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all of his or her accounts.
Sequential Association Rule: transaction requirements
- Transactions must have two additional features:
  - a time stamp or sequencing information, to determine when transactions occurred relative to each other
  - identifying information, such as an account number or ID number
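
These two features are what allow raw transactions to be turned into one ordered sequence per customer. The sketch below assumes a hypothetical record layout of (customer ID, date, item); the data and the name `build_sequences` are illustrative only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (customer_id, purchase_date, item), in arbitrary order.
records = [
    ("c2", date(1999, 3, 1), "furniture"),
    ("c1", date(1999, 1, 5), "shower curtain"),
    ("c1", date(1999, 2, 20), "furniture"),
    ("c2", date(1999, 1, 10), "shower curtain"),
]


def build_sequences(records):
    """Group records by customer and order each customer's items by date."""
    by_customer = defaultdict(list)
    for customer_id, when, item in records:
        by_customer[customer_id].append((when, item))
    return {
        customer_id: [item for _, item in sorted(events)]
        for customer_id, events in by_customer.items()
    }


sequences = build_sequences(records)
print(sequences["c1"])  # ['shower curtain', 'furniture']
print(sequences["c2"])  # ['shower curtain', 'furniture']
```

Without the identifier the purchases could not be attributed to one customer, and without the time stamp their order could not be recovered.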
Some important parameters
- Duration: may be the entire available sequence in the database, or a user-selected subsequence, such as the year 1999.
- Event folding window: a set of events occurring within a specified period of time, such as within the same day, can be viewed as occurring together.
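
Both parameters amount to simple preprocessing of a time-stamped event list. The following sketch assumes a one-day folding window and hypothetical events named "A" to "D"; the function names are illustrative.

```python
from datetime import date
from itertools import groupby

# Hypothetical time-stamped events for one customer.
events = [
    (date(1998, 12, 30), "A"),
    (date(1999, 1, 5), "B"),
    (date(1999, 1, 5), "C"),
    (date(1999, 1, 9), "D"),
]


def select_duration(events, start, end):
    """Keep only events whose date falls within [start, end]."""
    return [(d, e) for d, e in events if start <= d <= end]


def fold_by_day(events):
    """Event folding with a one-day window: same-day events form one set."""
    ordered = sorted(events)
    return [
        {e for _, e in group}
        for _, group in groupby(ordered, key=lambda ev: ev[0])
    ]


in_1999 = select_duration(events, date(1999, 1, 1), date(1999, 12, 31))
folded = fold_by_day(in_1999)
print(folded)  # two steps: B and C folded together, then D
```

Restricting the duration to 1999 drops event A, and folding treats B and C as occurring together, so the resulting sequence has two steps instead of three.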
Some important parameters (continued)
- Interval between events in the discovered pattern:
  - interval = 0: find strictly consecutive sequences
  - min_int <= interval <= max_int: find patterns whose events are separated by at least min_int and at most max_int
  - interval = c: find patterns carrying an exact interval c
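
The interval constraint can be sketched as a check on an ordered sequence. This example makes an explicit assumption: the interval is measured as the number of events lying between the two pattern events, so that an interval of 0 means strictly consecutive. It could equally be measured in time units. The function name `matches` and the sample sequence are hypothetical.

```python
def matches(sequence, first, second, min_int, max_int):
    """True if `second` follows `first` in `sequence` with
    min_int <= (number of events in between) <= max_int."""
    for i, event in enumerate(sequence):
        if event != first:
            continue
        for j in range(i + 1, len(sequence)):
            gap = j - i - 1  # events strictly between positions i and j
            if sequence[j] == second and min_int <= gap <= max_int:
                return True
    return False


# Hypothetical purchase sequence for one customer.
seq = ["shower curtain", "towels", "furniture"]

print(matches(seq, "shower curtain", "towels", 0, 0))     # True: strictly consecutive
print(matches(seq, "shower curtain", "furniture", 0, 0))  # False: towels in between
print(matches(seq, "shower curtain", "furniture", 0, 1))  # True: within [0, 1]
print(matches(seq, "shower curtain", "furniture", 1, 1))  # True: exact interval c = 1
```

Setting `min_int` equal to `max_int` covers both the strictly consecutive case (0) and the exact-interval case (c).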
Some Practical Issues
- Time window of transactions
- Level of aggregation
- Level of support and confidence
Time window of transactions
- Select a time window that covers at least two product cycles.
- Example: if customers purchase a product every six months or less, select a 12-month window of customer transaction data.
- For frequently purchased products, a short time window is sufficient.
- For low-frequency items, a longer time window is necessary.
Level of aggregation
- If product codes in the data are too specific (for example, based on product details such as size and flavour), few associations will be discovered.
- Group products into categories according to the product hierarchy, or create a new level manually.
Level of support and confidence
- Start with a high support and gradually reduce it.
- Set confidence to around 50% to reduce the number of rule permutations.
Conclusion
- Association rules, including multilevel and sequential association rules, are studied.
- The Apriori algorithm is described in detail.
- Various practical issues in association rules are analyzed.
Visit more self-help tutorials
- Pick a tutorial of your choice and browse through it at your own pace.
- The tutorials section is free, self-guiding and will not involve any additional support.
- Visit us at www.dataminingtools.net
