Summary: What is Data Mining?
 Data mining: semi-automatic discovery of interesting patterns from large data sets
 Knowledge discovery is a process:
◦ Preprocessing
◦ Data mining
◦ Postprocessing
Summary: Typical KDD Process
[Figure: the KDD process pipeline. Raw data from an operational database undergoes time-based selection and preprocessing (yielding cleaned, verified, focused input data), then data mining, then postprocessing (evaluation of interestingness), producing selected usable patterns for utilization.]
Association Rules
◦ Basics
◦ Apriori Algorithm
◦ FP Growth Algorithm
◦ Examples
◦ Applications
• A set of items together is called an itemset. An itemset with k items is called a k-itemset.
• An itemset may contain one or more items. An itemset that occurs frequently is called a frequent itemset. Thus frequent itemset mining is a data mining technique for identifying the items that often occur together.
• For example: bread and butter, laptop and antivirus software, etc.
• A set of items is called frequent if it satisfies a minimum threshold value for support; association rules derived from it must additionally satisfy a minimum confidence.
• Support measures how often the items are purchased together in a single transaction.
• Confidence measures how often the consequent is purchased in transactions that already contain the antecedent.
• In frequent itemset mining, we consider only those itemsets and rules that meet the minimum support and confidence requirements.
• Insights from these mining algorithms offer many benefits, such as cost reduction and improved competitive advantage.
 Frequent pattern mining (FPM) is one of the most important techniques in data mining for discovering relationships between different items in a dataset.
 FPM has many applications in data analysis, software bug detection, cross-marketing, sales campaign analysis, market basket analysis, etc.
 Frequent itemsets discovered through Apriori serve many data mining tasks, such as finding interesting patterns in the database and discovering sequences; mining association rules is the most important of them.
 Association rules apply to supermarket transaction data, that is, they examine customer behavior in terms of the purchased products. Association rules describe how often items are purchased together.
 This is the most typical example of association mining. Data is collected using barcode scanners in most supermarkets.
 This database, known as the “market basket” database, consists of a large number of records on past transactions.
 A single record lists all the items bought by a customer in one sale. Knowing which groups of customers are inclined towards which sets of items gives these shops the freedom to adjust the store layout and the store catalogue to place items optimally with respect to one another.
When people buy green tea, they may well also buy honey with it. This relationship is depicted as a conditional rule, as given below.
IF {green tea} THEN {honey}
The rule states that the item on the right is likely to be ordered together with the items on the left. Market basket analysis in data mining helps us understand such relationships and how helpful it would be to alter our decisions based on the analysis.
 Typical representation formats for association rules:
◦ Green Tea → Honey [0.5%, 60%]
◦ buys: Green Tea → buys: Honey [0.5%, 60%]
◦ "Green Tea and Honey are bought together in 0.5% of the rows in the database."
◦ "IF buys Green Tea, THEN buys Honey in 60% of the cases."
Green Tea → Honey [0.5%, 60%]
1. Antecedent, left-hand side (LHS), body: Green Tea
2. Consequent, right-hand side (RHS), head: Honey
3. Support, frequency ("in how big a part of the data the things on the left- and right-hand sides occur together"): 0.5%
4. Confidence, strength ("if the left-hand side occurs, how likely the right-hand side occurs"): 60%
Reading: "IF buys Green Tea, THEN buys Honey in 60% of the cases, in 0.5% of the rows."
For a rule A => B:
Support is a measure of how frequently an itemset (here A) appears in a dataset. It indicates the proportion of transactions or records in the dataset that contain the itemset.
Confidence denotes the probability that a transaction containing A also contains B.
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence(X => Y) = (Number of transactions containing X and Y) / (Number of transactions containing X)
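To make the two measures concrete, here is a minimal Python sketch that computes support and confidence over a toy transaction list (the transactions and item names are made up for illustration):

```python
# Toy transaction database; each transaction is a set of items.
transactions = [
    {"green tea", "honey"},
    {"green tea", "honey", "lemon"},
    {"green tea"},
    {"coffee"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs) / support(lhs)

print(support({"green tea", "honey"}))       # 2/4 = 0.5
print(confidence({"green tea"}, {"honey"}))  # 2/3 ≈ 0.67
```

For this toy database the rule would be written Green Tea → Honey [50%, 67%].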
 The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules.
 The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
 The Apriori algorithm is used to calculate the association rules between objects.
 This algorithm uses two steps, “join” and “prune”, to reduce the search space. It is an iterative approach to discovering the most frequent itemsets.
 Association here means how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning method that analyses whether people who bought product A also bought product B.
 Method:
◦ Initially, scan the DB once to get the frequent 1-itemsets.
◦ Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
◦ Test the candidates against the minimum support threshold.
◦ Terminate when no frequent or candidate set can be generated.
• First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that
satisfy minimum support. The resulting set is denoted by L1.
• Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used
to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
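The level-wise join-prune-count loop can be sketched in a few lines of Python. This is a minimal illustration of the method described above, not an optimized implementation; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, mined level by level."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Scan 1: count individual items and keep the frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        # Join: form k-itemset candidates from the items seen in L(k-1).
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        # Test: one full scan of the database per level, as noted above.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, apriori([{"Coke", "Chips"}, {"Coke", "Chips"}, {"Coke"}], 2/3) returns {Coke}, {Chips}, and {Coke, Chips} together with their support counts.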
Find the frequent itemsets and generate association rules for this dataset. Assume a minimum support threshold (s = 33.33%) and a minimum confidence threshold (c = 60%).
Let’s start.
There is only one itemset with the minimum support count of 2, so the only frequent itemset is
Frequent Itemset (I) = {Coke, Chips}
Advantages of the Apriori Algorithm
• It can be used to compute large itemsets.
• It is simple to understand and apply.
Disadvantages of the Apriori Algorithm
• Apriori is an expensive method for finding support, since the calculation has to pass through the whole database.
• Sometimes a huge number of candidate itemsets is needed, so the algorithm becomes computationally expensive.
 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates
lots of candidates
◦ To find frequent itemset i1i2…i100
 # of scans: 100
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?
 The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance.
 To do so, it uses a divide-and-conquer strategy. The core of this method is the use of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
What is the FP Growth Algorithm?
The FP Growth algorithm in data mining is a popular method for frequent pattern mining. The algorithm is efficient for mining frequent itemsets in large datasets.
 It works by constructing a frequent pattern tree (FP-tree) from the input dataset.
The FP Growth algorithm was developed by Han et al. in 2000 and is a powerful tool for frequent pattern mining in data mining. It is widely used in various applications such as market basket analysis, bioinformatics, and web usage mining.
FP-Growth allows frequent itemset discovery without candidate itemset generation, using a two-step approach:
◦ Step 1: Build a compact data structure called the FP-tree.
 Built using 2 passes over the dataset.
◦ Step 2: Extract frequent itemsets directly from the FP-tree.
Working of the FP Growth Algorithm
The working of the FP Growth algorithm in data mining can be summarized in the
following steps:
Scan the database:
In this step, the algorithm scans the input dataset to determine the frequency of each
item. This determines the order in which items are added to the FP tree, with the most
frequent items added first.
Sort items:
In this step, the items in the dataset are sorted in descending order of frequency. The
infrequent items that do not meet the minimum support threshold are removed from the
dataset. This helps to reduce the dataset's size and improve the algorithm's efficiency.
Construct the FP-tree:
In this step, the FP-tree is constructed. The FP-tree is a compact data structure that stores the
frequent itemsets and their support counts.
Generate frequent itemsets:
Once the FP-tree has been constructed, frequent itemsets can be generated by recursively
mining the tree. Starting at the bottom of the tree, the algorithm finds all combinations of
frequent itemsets that satisfy the minimum support threshold.
Generate association rules:
Once all frequent itemsets have been generated, the algorithm post-processes them to
generate association rules, which can be used to identify interesting relationships between
the items in the dataset.
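As a usage illustration, the following sketch runs FP-Growth end to end with the third-party mlxtend library (an assumption, not something the slides mention: mlxtend and pandas are installed, and the API shown matches the commonly documented versions; the transactions are made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["green tea", "honey"],
    ["green tea", "honey", "lemon"],
    ["green tea"],
    ["coffee", "honey"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Steps 1-4: mlxtend builds the FP-tree internally and mines frequent itemsets.
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)

# Step 5: post-process the frequent itemsets into association rules.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```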
Now, for each item, the Conditional Pattern Base is computed. It consists of the path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the conditional pattern base table are arranged in ascending order of their frequencies.
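A minimal sketch of this step, using made-up transactions over items Y, M, and O: each frequency-ordered transaction traces a path in the FP-tree, and the conditional pattern base of an item collects the prefix of every path that reaches that item. (In a real FP-tree, shared prefixes are merged and carry aggregated counts; here each toy path simply counts 1.)

```python
from collections import defaultdict

# Hypothetical toy transactions (illustrative only).
transactions = [["Y", "M", "O"], ["Y", "M"], ["Y", "O"], ["Y", "M"]]

# Count frequencies and order each transaction by descending frequency,
# as in the "sort items" step above.
freq = defaultdict(int)
for t in transactions:
    for item in t:
        freq[item] += 1
ordered = [sorted(t, key=lambda i: (-freq[i], i)) for t in transactions]

def conditional_pattern_base(item):
    """Prefix paths leading to `item`, each with its count (1 per toy path)."""
    return [(t[:t.index(item)], 1)
            for t in ordered if item in t and t.index(item) > 0]

print(conditional_pattern_base("O"))  # [(['Y', 'M'], 1), (['Y'], 1)]
```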
Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
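Continuing the toy sketch above, the conditional FP-tree for item O keeps the items common to all of O's prefix paths, with support equal to the sum of the path counts (the data and names are illustrative, following the step exactly as described above):

```python
base = conditional_pattern_base("O")  # [(['Y', 'M'], 1), (['Y'], 1)]
# Items common to every prefix path, each with the summed path count.
common = set.intersection(*(set(p) for p, _ in base)) if base else set()
support_count = sum(c for _, c in base)
print({item: support_count for item in common})  # {'Y': 2}
```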
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item.
Advantages of FP Growth Algorithm
The FP Growth algorithm in data mining has several advantages over other frequent itemset mining
algorithms, as mentioned below:
Efficiency:
FP Growth algorithm is faster and more memory-efficient than other frequent itemset mining
algorithms such as Apriori, especially on large datasets with high dimensionality. This is because it
generates frequent itemsets by constructing the FP-Tree, which compresses the database and
requires only two scans.
Scalability:
FP Growth algorithm scales well with increasing database size and itemset dimensionality, making
it suitable for mining frequent itemsets in large datasets.
Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than other frequent itemset mining
algorithms, as it generates only frequent itemsets and ignores infrequent itemsets that may be
caused by noise.
Parallelization:
FP Growth algorithm can be easily parallelized, making it suitable for distributed computing
environments and allowing it to take advantage of multi-core processors.
Disadvantages of FP Growth Algorithm
While the FP Growth algorithm in data mining has several advantages, it also has
some limitations and disadvantages, as mentioned below:
Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other frequent
itemset mining algorithms, storing the FP-Tree and the conditional pattern bases can
still require a significant amount of memory, especially for large datasets.
Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset mining
algorithms, making it more difficult to understand and implement.
 Association rule learning is a type of unsupervised learning method that tests for the dependence of one data element on another data element and creates rules accordingly so that they can be applied more effectively.
 It tries to discover interesting relations among the variables of the dataset. It relies on a number of rules to find interesting relations between variables in the database.
There are various applications of association rules, including the following:
• Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the next products that customers are likely to buy.
• Optional services purchased by telecommunication users (call waiting, call forwarding, DSL, speed dialing, etc.) help decide how to bundle these services to maximize revenue.
• Banking services used by retail customers (money market accounts, CDs, investment services, car loans, etc.) identify customers likely to need other services.
• Unusual groups of insurance claims can be an indication of fraud and can spark further investigation.
• Medical patient histories can support predictions of likely complications based on a definite set of treatments.