The document provides an overview of advanced data mining concepts covered in the semester, including frequent pattern mining methods like the Apriori algorithm and FP-Growth algorithm, association rule mining, and correlation analysis. It discusses techniques for mining frequent itemsets, generating association rules, and measuring correlation between variables. It also covers topics like mining multilevel association rules and multidimensional association rules from relational databases.
2. Chapter 01
Review of Data Mining principles and Preprocessing methods
• Mining Frequent Patterns: basic concepts
• Efficient and scalable frequent itemset mining methods
• Apriori Algorithm
• FP-Growth algorithm
• Associations: mining various kinds of association rules
• Correlations: from association mining to correlation analysis
• Constraint-Based Mining
3. The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, ...
• Science: Remote sensing, bioinformatics, scientific simulation, ...
• Society and everyone: news, digital cameras, YouTube
• "We are drowning in data, but starving for knowledge!"
• "Necessity is the mother of invention": data mining, the automated analysis of massive data
Why Data Mining?
4. Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amount of data
Alternative names:
• Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
Watch out: Is everything "data mining"?
• Simple search and query processing
• (Deductive) expert systems
What Is Data Mining?
6. • Tremendous amount of data
• Algorithms must be highly scalable to handle terabytes of data
• High-dimensionality of data
• Micro-array may have tens of thousands of dimensions
• High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structured data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
• New and sophisticated applications
Why Not Traditional Data Analysis?
7. • Database-oriented data sets and applications
• Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structured data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web
Data Mining: On What Kinds of Data?
11. • Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
Frequent Patterns
12. (Absolute) support, or support count, of X: frequency or occurrence of an itemset X
{Diaper}: 4
{Beer, Diaper}: 3
(Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
{Diaper}: 80%
{Beer, Diaper}: 60%
An itemset X is frequent if X's support is no less than a minsup threshold
Cont.,
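As a quick check of these definitions, absolute and relative support can be computed in a few lines of Python. The five transactions below are a hypothetical database chosen to be consistent with the counts on this slide ({Diaper} in 4 of 5 transactions, {Beer, Diaper} in 3 of 5):

```python
# Hypothetical five-transaction database consistent with the slide's counts.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def absolute_support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def relative_support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return absolute_support(itemset, transactions) / len(transactions)

print(absolute_support({"Diaper"}, transactions))          # 4
print(relative_support({"Diaper"}, transactions))          # 0.8
print(absolute_support({"Beer", "Diaper"}, transactions))  # 3
print(relative_support({"Beer", "Diaper"}, transactions))  # 0.6
```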
13. Find all the rules X → Y with minimum support and confidence
Support, s: probability that a transaction contains X ∪ Y
Confidence, c: conditional probability that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%.
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
Association Rules
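The two rule measures above can be sketched directly in Python. The transaction list is the same hypothetical five-transaction database consistent with the slide's frequent-pattern counts; the helper names are my own:

```python
# Hypothetical five-transaction database consistent with the counts above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(antecedent, consequent, transactions):
    """(support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    both = support_count(antecedent | consequent, transactions)
    return both / n, both / support_count(antecedent, transactions)

print(rule_metrics({"Beer"}, {"Diaper"}, transactions))   # (0.6, 1.0)
print(rule_metrics({"Diaper"}, {"Beer"}, transactions))   # (0.6, 0.75)
```

The printed pairs match the slide's rules: Beer → Diaper (60%, 100%) and Diaper → Beer (60%, 75%).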
14. Given a minimum support s and a minimum confidence c, find all the rules that satisfy:
• The support of the rule is at least s
• The confidence of the rule is at least c
• Support: "How useful is the rule?"
• The percentage of transactions in the dataset that contain the itemset
• Confidence: "How true is the rule?"
• The strength of the relationship between two itemsets A and B
Cont.,
15. • A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
Closed Patterns and Max-Patterns
17. Cont.,
A frequent itemset is closed if its support is not equal to the support of any of its supersets.
Here A = 3/5 and AC = 3/5 are equal, so A is not closed.
Here C = 4/5 while AC = 3/5, BC = 3/5, and CE = 3/5; none is equal, so C is closed.
C, AC, BE, BCE, and ABCE are closed.
A max-pattern is a frequent itemset that has no frequent superset; max-patterns are the maximal itemsets in the frequent-itemset lattice.
In this dataset, ABCE is frequent (it is one of the closed itemsets above) and has no frequent superset, so ABCE is the max-pattern.
Note that the itemsets with the highest support counts among all frequent itemsets, {B}, {C}, {E}, and {BE} (support = 4 each), are frequent but not maximal, since each is contained in a larger frequent itemset such as {B, C, E}.
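The two definitions (closed: no superset with the same support; maximal: no frequent superset) can be sketched in a few lines, assuming the frequent itemsets and their support counts have already been mined. The small `freq` table below is a hypothetical example, not the slide's dataset:

```python
def closed_and_max(freq):
    """freq maps frozenset itemsets to support counts (all assumed frequent).
    Closed: no superset has the same support; maximal: no frequent superset."""
    closed, maximal = set(), set()
    for x, support in freq.items():
        supersets = [y for y in freq if x < y]
        if all(freq[y] < support for y in supersets):
            closed.add(x)
        if not supersets:
            maximal.add(x)
    return closed, maximal

# Hypothetical mined result: A and AC share support 3, C has support 4.
freq = {
    frozenset({"A"}): 3,
    frozenset({"C"}): 4,
    frozenset({"A", "C"}): 3,
}
closed, maximal = closed_and_max(freq)
print(sorted("".join(sorted(s)) for s in closed))   # ['AC', 'C']
print(sorted("".join(sorted(s)) for s in maximal))  # ['AC']
```

A is not closed because its superset AC has the same support; AC is both closed and maximal because no frequent superset exists.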
18. Apriori algorithm:
1. Initialize frequent itemsets by finding the support of each individual item in the dataset.
2. Generate candidate itemsets of size k+1 from frequent itemsets of size k by taking the
union of each pair of frequent itemsets of size k, and then prune resulting itemsets that
contain any infrequent subset.
3. Count the support of candidate itemsets by scanning the transaction database.
4. Prune infrequent itemsets that do not meet the minimum support threshold.
5. Repeat steps 2-4 until no more frequent itemsets can be generated.
6. Return the frequent itemsets found during the iterations.
Efficient and Scalable Frequent Itemset Mining Methods
21. In C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Here {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5} are removed because their subsets are not all frequent.
For example, {I1, I3, I5} is removed: its subsets are {I1, I3}, {I1, I5}, and {I3, I5}, and {I3, I5} is not in L2.
So L3 = {{I1, I2, I3}, {I1, I2, I5}}. Joining L3 with itself yields the candidate {I1, I2, I3, I5}.
Its 3-item subsets include {I1, I2, I3}, {I1, I2, I5}, and {I1, I3, I5}; {I1, I3, I5} is not frequent, so {I1, I2, I3, I5} is pruned.
Hence C4 = L3 ⋈ L3 = ∅ and the algorithm terminates, having found all of the frequent itemsets.
Cont.,
22. Pseudo-Code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Cont.,
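The pseudo-code above can be sketched as a small, self-contained Python implementation. The five-transaction database is the hypothetical example consistent with the earlier beer/diaper slides, and `min_support` is taken as an absolute count:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search following the Apriori pseudo-code.
    `min_support` is an absolute transaction count."""
    # L1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Join: union pairs of frequent k-itemsets into (k+1)-candidates,
        # then prune any candidate with an infrequent k-subset.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One database scan counts all surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# Hypothetical five-transaction database, minsup = 3 (60%):
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
freq = apriori(transactions, 3)
print(freq[frozenset({"Beer", "Diaper"})])  # 3
```

The result matches the earlier slide: Beer:3, Nuts:3, Diaper:4, Eggs:3, and {Beer, Diaper}:3 are the only frequent itemsets.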
23. • "Mining Frequent Patterns without Candidate Generation" is a research paper published in 2000 by Jiawei Han, Jian Pei, and Yiwen Yin.
• The paper proposed the FP-Growth algorithm, a frequent pattern mining algorithm that avoids generating candidate itemsets, which is a bottleneck in other frequent itemset mining algorithms such as the Apriori algorithm.
• The key idea of the FP-Growth algorithm is to represent the database as an FP-tree, which allows efficient computation of frequent itemsets.
• The algorithm works in two main steps: constructing the FP-tree and generating frequent itemsets from the FP-tree.
Frequent Pattern-Growth Algorithm
24. FP-Growth Algorithm Steps:
• Scan the database to find the support count of each item.
• Sort the items in descending order of support count.
• Construct the FP-tree by traversing the database again, inserting each transaction into
the tree in a way that preserves the order of the items.
• Generate frequent itemsets from the FP-tree.
Cont.,
25. Step by Step Process:
1. Scan the database.
2. Find the support count for each item.
3. Arrange the items in descending order of support count.
4. Reorder the items in each transaction of the database according to step 3.
5. Build the FP-tree with a null root from the reordered transactions, incrementing the count of each node visited during insertion.
Cont.,
26. 6. Compute the conditional pattern base for each item from the FP-tree: find the prefix paths for the item and label each path with its count.
7. Build the conditional FP-tree by taking the set of elements common to all paths in the item's conditional pattern base, calculating each element's support count by summing the support counts of all the paths in the conditional pattern base.
8. Generate all combinations of frequent patterns using each item and its conditional FP-tree. When two counts are available along a path, take the minimum.
Cont.,
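Steps 1 through 5 (counting, ordering, and tree insertion) can be sketched as follows. The `Node` class and the transaction data are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Steps 1-3: scan once, count items, keep the frequent ones.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    support = {i: c for i, c in counts.items() if c >= min_support}
    # Steps 4-5: reorder each transaction by descending support (ties broken
    # alphabetically) and insert it, incrementing every node on the path.
    root = Node(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in support),
                      key=lambda i: (-support[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root

# Hypothetical five-transaction database, minsup = 3:
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
tree = build_fp_tree(transactions, 3)
print(tree.children["Diaper"].count)  # 4
```

Because Diaper has the highest support, it sits directly under the root with count 4; the shared prefix Diaper→Beer compresses three transactions into one path.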
30. Steps to Generate Association Rules from Frequent Pattern generated by FP-Growth
• Generate frequent itemsets using FP-Growth algorithm.
• Determine the minimum support and confidence levels for association rules.
• Generate all possible association rules for each frequent itemset.
• Calculate the support and confidence values for each association rule.
• Filter out the association rules that do not meet the minimum support and confidence
levels.
• Evaluate the remaining association rules based on their interestingness measures such
as lift, conviction, and leverage to determine their significance.
• Present the significant association rules to the user for further analysis or decision-
making.
Cont.,
31. Generate association rules for the frequent pattern {I2, I4 : 2}
• A minimum support of 20% and a minimum confidence of 80% are fixed.
• All possible association rules for the frequent itemset:
• I2 → I4 (I2 implies I4)
• I4 → I2
• Calculate the support and confidence values for the I2 → I4 association rule:
Support({I2, I4}) = (Frequency of {I2, I4}) / (Total number of transactions)
Support = 2/9 ≈ 0.22
Confidence(I2 → I4) = (Frequency of {I2, I4}) / (Frequency of {I2})
Confidence = 2/7 ≈ 0.29
• Calculate the confidence value for the I4 → I2 association rule:
Confidence(I4 → I2) = (Frequency of {I2, I4}) / (Frequency of {I4}) = 2/2 = 1.0
Cont.,
32. Filter out the association rules that do not meet the minimum support and confidence
levels.
Since the minimum support is 20% and the minimum confidence is 80%, the
only association rule that meets both criteria is:
I4 → I2
with a support of 0.222 and a confidence of 1.0
Continue the same process for further frequent pattern to generate association rules
Cont.,
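The arithmetic in the worked example can be checked in code, using only the frequency counts stated on the slides (9 transactions; {I2, I4} occurs 2 times, I2 occurs 7 times, I4 occurs 2 times):

```python
def rule_stats(freq_both, freq_antecedent, n_transactions):
    """Support of the rule and confidence of antecedent -> consequent."""
    return freq_both / n_transactions, freq_both / freq_antecedent

# Counts from the worked example: 9 transactions; {I2, I4}: 2, I2: 7, I4: 2.
s, c = rule_stats(2, 7, 9)           # I2 -> I4
print(round(s, 3), round(c, 3))      # 0.222 0.286 -> fails minconf = 80%
s, c = rule_stats(2, 2, 9)           # I4 -> I2
print(round(s, 3), round(c, 3))      # 0.222 1.0   -> meets both thresholds
```

Only I4 → I2 survives the 80% confidence filter, matching the slide's conclusion.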
33. A. Mining Multilevel Association Rules
It is difficult to find strong associations among data items at low or primitive levels of
abstraction
1. Using uniform minimum support for all levels (referred to as uniform support)
• The same minimum support threshold is used at each level of abstraction.
• When a uniform minimum support threshold is used, the search procedure is simplified, and users are required to specify only one minimum support threshold.
• If the minimum support threshold is set too high, it could miss some meaningful
associations occurring at low abstraction levels. If the threshold is set too low, it may
generate many uninteresting associations occurring at high abstraction levels.
Mining Various Kinds of Association rules
34. 2. Using reduced minimum support at lower levels (referred to as reduced support)
• Each level of abstraction has its own minimum support threshold.
• The deeper the level of abstraction, the smaller the corresponding threshold.
• In Figure, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
Cont.,
35. 3. Using item or group-based minimum support (referred to as group-based support)
• Users or experts often decide that some groups are more important than others.
• Users set minimum support thresholds based on this importance.
• For example, a user could set up the minimum support thresholds based on product
price, or on items of interest, such as by setting particularly low support thresholds for
laptop computers and flash drives in order to pay particular attention to the association
patterns containing items in these categories
Cont.,
36. B. Mining Multidimensional Association Rules from Relational Databases and DataWarehouses
• When data are stored in a relational database or data warehouse, the mining is called multidimensional.
• Relational information regarding the customers who purchased the items, such as customer age, occupation, credit rating, income, and address, may also be stored. Considering each database attribute or warehouse dimension as a predicate, we can mine association rules containing multiple predicates, such as
age(X, "20…29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
• Association rules that involve two or more dimensions or predicates are referred to as multidimensional association rules.
• The rule above contains three predicates (age, occupation, and buys), each of which occurs only once, so it has no repeated predicates. Multidimensional association rules with no repeated predicates are called interdimensional association rules.
• We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. For example:
age(X, "20…29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")
Cont.,
37. C. Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes
• Quantitative attributes are handled with data discretization techniques, in which numeric values are replaced by interval labels.
• Alternatively, the transformed multidimensional data may be used to construct a data cube.
• Data cubes are well suited to the mining of multidimensional association rules: they store aggregates (such as counts) in multidimensional space.
• Figure shows the lattice of cuboids defining a data cube for the dimensions age, income, and
buys.
• The cells of an n-dimensional cuboid can be used to store the support counts of the
corresponding n-predicate sets.
38. Correlation
• Correlation is a statistical measure that shows the strength and direction of the relationship between
two variables. It is commonly used in data mining to identify patterns and relationships between
variables in a dataset.
• The correlation coefficient is a statistical measure used to quantify the strength and direction of the
relationship between two variables.
• The coefficient ranges from -1 to +1, with -1 indicating a perfectly negative correlation, +1 indicating a
perfectly positive correlation, and 0 indicating no correlation
• Example: analyzing a dataset of housing prices in a city, with two variables: the size of the house (in square feet) and the sale price of the house.
• Plotting the data points for the two variables on a scatter plot shows a positive linear relationship, with larger houses generally selling for higher prices.
Association mining to Correlation analysis
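The correlation coefficient described above can be computed directly. The house-size and price figures below are made-up values chosen to illustrate a strong positive linear relationship, as in the scatter-plot example:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical house sizes (sq ft) and sale prices showing a positive
# linear relationship, as in the scatter-plot example.
sizes = [900, 1200, 1500, 1800, 2400]
prices = [150_000, 190_000, 240_000, 280_000, 370_000]
print(pearson_r(sizes, prices))  # just below +1: strong positive correlation
```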
39. From Association Analysis to Correlation Analysis
• Support and confidence measures are insufficient for filtering out uninteresting association rules.
• To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules:
A ⇒ B [support, confidence, correlation]
Lift is a simple correlation measure, given as follows.
The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B))
If the result is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
If the result is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
If the result is equal to 1, then A and B are independent and there is no correlation between them.
Cont.,
40. Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos.
Correlation analysis using lift with the above example:
we need to study how the two itemsets, A and B, are correlated
probability of purchasing a computer game is P({game}) = 0.60
probability of purchasing a video is P({video}) = 0.75
probability of purchasing both is P({game; video}) = 0.40
P({game, video}) / (P({game}) P({video})) = 0.40 / (0.60 × 0.75) = 0.89
Since the value is less than 1, there is a negative correlation between the occurrence of {game} and {video}.
This negative correlation cannot be identified by the support-confidence framework.
Cont.,
41. Correlation analysis using χ²:
• To compute the correlation using χ² analysis, we need the observed value and expected value (displayed in parentheses) for each slot of the contingency table.
• Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4,000 is less than the expected value of 4,500, buying game and buying video are negatively correlated. This is consistent with the conclusion derived from the lift analysis.
Cont.,
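A sketch of the χ² computation, filling in the full 2×2 contingency table implied by the example's figures (both = 4,000; game only = 6,000 − 4,000 = 2,000; video only = 7,500 − 4,000 = 3,500; neither = 500):

```python
def chi_square(observed, row_totals, col_totals, n):
    """Chi-square statistic for a 2x2 contingency table; the expected value
    of cell (i, j) is row_totals[i] * col_totals[j] / n."""
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Rows: video / no video; columns: game / no game (from the example figures).
observed = [[4000, 3500],
            [2000, 500]]
row_totals = [7500, 2500]   # video, no video
col_totals = [6000, 4000]   # game, no game
print(round(chi_square(observed, row_totals, col_totals, 10000), 1))  # 555.6
```

The (game, video) cell's expected value is 7,500 × 6,000 / 10,000 = 4,500, larger than the observed 4,000, which is what signals the negative correlation.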
43. • A data mining process may uncover thousands of rules from a given set of data, most
of which end up being unrelated or uninteresting to the users.
• Users have a good sense of which “direction” of mining may lead to interesting
patterns and the “form” of the patterns or rules they would like to find.
• Thus, a good heuristic is to have the users specify such intuition or expectations as
constraints to confine the search space.
The constraints can include the following:
• Knowledge type constraints: These specify the type of knowledge to be mined, such as
association or correlation.
• Data constraints: These specify the set of task-relevant data.
Constraint-Based Association Mining
44. • Dimension/level constraints: These specify the desired dimensions (or attributes)
of the data, or levels of the concept hierarchies, to be used in mining.
• Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
• Rule constraints: These specify the form of rules to be mined. Such constraints may
be expressed as metarules (rule templates), as the maximum or minimum number
of predicates that can occur in the rule consequent, or as relationships among
attributes, attribute values, and/or aggregates.
Cont.,