Advanced Data Mining Concepts
Semester II
2022-23
Chapter 01
Review of Data Mining principles and Preprocessing methods
• Mining Frequent Patterns: basic concepts
• Efficient and scalable frequent itemset mining methods
• Apriori Algorithm
• FP-Growth Algorithm
• Associations: mining various kinds of association rules
• Correlations: from association mining to correlation analysis
• Constraint-Based Mining
The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, ...
• Science: Remote sensing, bioinformatics, scientific simulation, ...
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining, the automated analysis of massive data
Why Data Mining?
Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amount of data
Alternative names:
• Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
Watch out: Is everything "data mining"?
• Simple search and query processing
• (Deductive) expert systems
What Is Data Mining?
• Tremendous amounts of data
• Algorithms must be highly scalable to handle data on the order of terabytes
• High dimensionality of data
• Microarray data may have tens of thousands of dimensions
• High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structured data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
• New and sophisticated applications
Why Not Traditional Data Analysis?
• Database-oriented data sets and applications
• Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structured data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web
Data Mining: On What Kinds of Data?
KDD Process
Architecture: Data Mining System
Market Basket Analysis
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
Frequent Patterns
(Absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
{Diaper}: 4
{Beer, Diaper}: 3
(Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
{Diaper}: 80%
{Beer, Diaper}: 60%
An itemset X is frequent if X's support is no less than a minsup threshold
Cont.,
Find all the rules X→Y with minimum support and confidence
Support, s, probability that a transaction contains X U Y
Confidence, c, conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
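These support and confidence figures can be checked in a few lines of Python. The five-transaction database below is an assumed reconstruction, chosen only so that it is consistent with the counts on this slide (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3):

```python
# Assumed five-transaction database, consistent with the counts above.
transactions = [
    {'Beer', 'Nuts', 'Diaper'},
    {'Beer', 'Coffee', 'Diaper'},
    {'Beer', 'Diaper', 'Eggs'},
    {'Nuts', 'Eggs', 'Milk'},
    {'Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'},
]

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also has rhs."""
    return support(lhs | rhs) / support(lhs)

print(round(support({'Beer', 'Diaper'}), 2))       # 0.6
print(round(confidence({'Beer'}, {'Diaper'}), 2))  # 1.0
print(round(confidence({'Diaper'}, {'Beer'}), 2))  # 0.75
```

Note that confidence(Beer → Diaper) = support({Beer, Diaper}) / support({Beer}), matching the (60%, 100%) and (60%, 75%) figures above.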
Association Rules
Given a minimum support s and a minimum confidence c, find all the rules that
satisfy:
• The support of the rule is no less than s
• The confidence of the rule is no less than c
• Support: "How useful is the rule?"
• The percentage of transactions in the dataset that contain the itemset
• Confidence: "How true is the rule?"
• The strength of the relationship between the two itemsets A and B
Cont.,
• A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
Closed Patterns and Max-Patterns
Cont.,
A frequent itemset is closed if its support is not equal to the support of any of its supersets.
Here A = 3/5 and AC = 3/5 are equal, so A is not closed.
Here C = 4/5, while AC = 3/5, BC = 3/5, and CE = 3/5; none is equal, so C is closed.
C, AC, BE, BCE and ABCE are closed.
A max-pattern is a frequent itemset none of whose proper supersets is frequent; maximality is determined by supersets, not by having the highest support.
The frequent itemsets with the highest support counts in the given dataset are:
{B} (support = 4)
{C} (support = 4)
{E} (support = 4)
{BE} (support = 4)
None of these is maximal: {B} and {E} have the frequent superset {BE}, while {C} and {BE} have the frequent superset {BCE}. Since every max-pattern is also closed and {ABCE} is the largest closed itemset here, {ABCE} is the only max-pattern in this dataset.
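Both definitions can be checked mechanically. The sketch below takes any table of frequent itemsets with their support counts and classifies each itemset as closed and/or maximal; the tiny example table is hypothetical, not the slide's dataset:

```python
def closed_and_maximal(freq):
    """freq maps frozenset -> support count of every frequent itemset.
    Closed:  no proper superset with the same support.
    Maximal: no frequent proper superset at all."""
    closed, maximal = set(), set()
    for x, sx in freq.items():
        supersets = [y for y in freq if x < y]   # frequent proper supersets
        if not any(freq[y] == sx for y in supersets):
            closed.add(x)
        if not supersets:
            maximal.add(x)
    return closed, maximal

# Hypothetical frequent itemsets with support counts.
freq = {frozenset('a'): 3, frozenset('b'): 4, frozenset('ab'): 3}
closed, maximal = closed_and_maximal(freq)
# {a} is not closed (its superset {a,b} has the same support 3);
# {b} and {a,b} are closed; only {a,b} is maximal.
```

This also illustrates the standard fact used above: every maximal itemset is closed, but not every closed itemset is maximal.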
Apriori algorithm:
1. Initialize frequent itemsets by finding the support of each individual item in the dataset.
2. Generate candidate itemsets of size k+1 from frequent itemsets of size k by taking the
union of each pair of frequent itemsets of size k, and then prune resulting itemsets that
contain any infrequent subset.
3. Count the support of candidate itemsets by scanning the transaction database.
4. Prune infrequent itemsets that do not meet the minimum support threshold.
5. Repeat steps 2-4 until no more frequent itemsets can be generated.
6. Return the frequent itemsets found during the iterations.
Efficient and Scalable Frequent Itemset Mining Methods
Cont.,
In C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}},
the candidates {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5} are removed because some of their subsets are not frequent.
For example, {I1, I3, I5} is removed: its subsets are {I1, I3}, {I1, I5}, and {I3, I5}, and {I3, I5} is not in L2.
Thus L3 = {{I1, I2, I3}, {I1, I2, I5}}, and joining L3 with itself yields the candidate {I1, I2, I3, I5}.
The subsets of {I1, I2, I3, I5} include {I1, I2, I3}, {I1, I2, I5}, and {I1, I3, I5}.
{I1, I3, I5} is not frequent, so {I1, I2, I3, I5} is pruned.
Hence C4 = L3 x L3 = ɸ and the algorithm terminates, having found all of the frequent itemsets.
Cont.,
Pseudo-Code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ɸ; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
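The pseudo-code above can be turned into a short runnable sketch. This is a minimal illustration, not an optimized implementation: candidates are generated by unioning pairs of frequent k-itemsets rather than the classic sorted-prefix join, and the five-transaction database is an assumed example matching the earlier beer/diaper counts:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {c: n for c, n in count(items).items() if n >= min_support}
    result, k = dict(freq), 1
    while freq:
        prev = set(freq)
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        freq = {c: n for c, n in count(cands).items() if n >= min_support}
        result.update(freq)
        k += 1
    return result

db = [{'Beer', 'Nuts', 'Diaper'}, {'Beer', 'Coffee', 'Diaper'},
      {'Beer', 'Diaper', 'Eggs'}, {'Nuts', 'Eggs', 'Milk'},
      {'Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'}]
print(apriori(db, min_support=3))
# Frequent: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3 and {Beer, Diaper}:3
```

With min_support = 3 the loop terminates after k = 2, since {Beer, Diaper} is the only frequent pair and no 3-candidate can be formed from it.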
Cont.,
• "Mining Frequent Patterns without Candidate Generation" is a research paper published in 2000 by Jiawei Han, Jian Pei, and Yiwen Yin.
• The paper proposed the FP-Growth algorithm, a frequent pattern mining algorithm that avoids generating candidate itemsets, the bottleneck in other frequent itemset mining algorithms such as Apriori.
• The key idea of FP-Growth is to represent the database as an FP-tree, which allows efficient computation of frequent itemsets.
• The algorithm works in two main steps: constructing the FP-tree and generating frequent itemsets from the FP-tree.
Frequent Pattern-Growth Algorithm
FP-Growth Algorithm Steps:
• Scan the database to find the support count of each item.
• Sort the items in descending order of support count.
• Construct the FP-tree by traversing the database again, inserting each transaction into
the tree in a way that preserves the order of the items.
• Generate frequent itemsets from the FP-tree.
Cont.,
Step by Step Process:
1. Scan the database.
2. Find the support count for each item.
3. Arrange the items in descending order of support count.
4. Reorder the items in each transaction according to step 3.
5. Build the FP-tree with a null root by inserting each transaction into the tree and incrementing the count of each node visited during the traversal.
Cont.,
6. Compute the conditional pattern base for each item from the FP-tree, i.e., find the prefix paths leading to the item and record each path with its count.
7. Build the conditional FP-tree. This is done by taking the set of elements common to all the paths in the item's conditional pattern base and calculating each element's support count by summing the support counts of those paths.
8. Generate all combinations of frequent patterns from each item and its conditional FP-tree. When two counts are available for a node, take the minimum.
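Steps 1 to 5 (FP-tree construction) can be sketched as follows. This is a minimal illustration under assumptions: the node layout and header table are one common design, and the five transactions are a made-up example, not the slide's dataset:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Steps 1-2: scan once to count supports; keep only frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    # Steps 3-4: order a transaction's items by descending support
    # (ties broken alphabetically, so the ordering is deterministic).
    def ordered(t):
        return sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))

    # Step 5: insert each reordered transaction, sharing common prefixes
    # and incrementing the count of every node visited.
    root = FPNode(None, None)
    header = defaultdict(list)   # item -> nodes holding it (header table)
    for t in transactions:
        node = root
        for item in ordered(t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Made-up five-transaction example.
transactions = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'],
                ['a', 'd', 'e'], ['a', 'b', 'c']]
root, header = build_fp_tree(transactions, min_support=2)
# 'a' (support 4) ends up in a single shared node directly under the root.
```

The header table is what step 6 would then use: following the nodes listed for an item, and walking each node's parent links, yields that item's conditional pattern base.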
Cont.,
Steps to Generate Association Rules from Frequent Pattern generated by FP-Growth
• Generate frequent itemsets using FP-Growth algorithm.
• Determine the minimum support and confidence levels for association rules.
• Generate all possible association rules for each frequent itemset.
• Calculate the support and confidence values for each association rule.
• Filter out the association rules that do not meet the minimum support and confidence
levels.
• Evaluate the remaining association rules based on their interestingness measures such
as lift, conviction, and leverage to determine their significance.
• Present the significant association rules to the user for further analysis or decision-
making.
Cont.,
Generate Association rules for Frequent Pattern {I2, I4 : 2}
• Fix the minimum support at 20% and the minimum confidence at 80%.
• All possible association rules for the frequent itemset:
• I2 → I4 (I2 implies I4)
• I4 → I2
• Calculate the support and confidence values for the I2 → I4 association rule.
• Support({I2, I4}) = (Frequency of {I2, I4}) / (Total number of transactions)
Support = 2/9 ≈ 0.22
• Confidence(I2 → I4) = (Frequency of {I2, I4}) / (Frequency of {I2})
Confidence = 2/7 ≈ 0.29
• Calculate the confidence value for the I4 → I2 association rule.
Confidence(I4 → I2) = (Frequency of {I2, I4}) / (Frequency of {I4}) = 2/2 = 1.0
Cont.,
Filter out the association rules that do not meet the minimum support and confidence
levels.
Since the minimum support is 20% and the minimum confidence is 80%, the
only association rule that meets both criteria is:
I4 → I2
with a support of 0.222 and a confidence of 1.0
Continue the same process for the remaining frequent patterns to generate association rules.
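The filtering above can be reproduced directly from the counts in the example (frequency of {I2, I4} = 2, of {I2} = 7, of {I4} = 2, over 9 transactions):

```python
n = 9                                               # total transactions
freq = {('I2',): 7, ('I4',): 2, ('I2', 'I4'): 2}    # counts from the example

support = freq[('I2', 'I4')] / n                    # 2/9, about 0.22
candidates = [
    ('I2', 'I4', freq[('I2', 'I4')] / freq[('I2',)]),  # conf 2/7, about 0.29
    ('I4', 'I2', freq[('I2', 'I4')] / freq[('I4',)]),  # conf 2/2 = 1.0
]

min_support, min_conf = 0.20, 0.80
rules = [(lhs, rhs) for lhs, rhs, conf in candidates
         if support >= min_support and conf >= min_conf]
print(rules)  # [('I4', 'I2')]
```

Only I4 → I2 survives: both rules meet the 20% support threshold, but I2 → I4 fails the 80% confidence threshold.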
Cont.,
A. Mining Multilevel Association Rules
It is difficult to find strong associations among data items at low or primitive levels of
abstraction
1. Using uniform minimum support for all levels (referred to as uniform support)
• The same minimum support threshold is used at each level of abstraction.
• When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold.
• If the minimum support threshold is set too high, it could miss some meaningful
associations occurring at low abstraction levels. If the threshold is set too low, it may
generate many uninteresting associations occurring at high abstraction levels.
Mining Various Kinds of Association rules
2. Using reduced minimum support at lower levels (referred to as reduced support)
• Each level of abstraction has its own minimum support threshold.
• The deeper the level of abstraction, the smaller the corresponding threshold.
• In Figure, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
Cont.,
3. Using item or group-based minimum support (referred to as group-based support)
• Users or experts often decide that some groups are more important than others.
• Users set minimum support thresholds according to this importance.
• For example, a user could set up the minimum support thresholds based on product
price, or on items of interest, such as by setting particularly low support thresholds for
laptop computers and flash drives in order to pay particular attention to the association
patterns containing items in these categories
Cont.,
B. Mining Multidimensional Association Rules from Relational Databases and DataWarehouses
• When the data are stored in a relational database or data warehouse, the mining is called multidimensional.
• Relational information regarding the customers who purchased the items, such as customer age, occupation, credit rating, income, and address, may also be stored. Considering each database attribute or warehouse dimension as a predicate, we can mine association rules containing multiple predicates, such as
age(X, "20...29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
• Association rules that involve two or more dimensions or predicates are referred to as multidimensional association rules.
• The rule above contains three predicates (age, occupation, and buys), each of which occurs only once; hence it has no repeated predicates. Multidimensional association rules with no repeated predicates are called interdimensional association rules.
• We can also mine multidimensional association rules with repeated predicates, i.e., with multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules, for example:
age(X, "20...29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")
Cont.,
C. Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes
• Quantitative attributes are handled with data discretization techniques, where numeric values are replaced by interval labels.
• Alternatively, the transformed multidimensional data may be used to construct a data cube.
• Data cubes are well suited to the mining of multidimensional association rules: they store aggregates (such as counts) in multidimensional space.
• Figure shows the lattice of cuboids defining a data cube for the dimensions age, income, and
buys.
• The cells of an n-dimensional cuboid can be used to store the support counts of the
corresponding n-predicate sets.
Correlation
• Correlation is a statistical measure that shows the strength and direction of the relationship between
two variables. It is commonly used in data mining to identify patterns and relationships between
variables in a dataset.
• The correlation coefficient is a statistical measure used to quantify the strength and direction of the
relationship between two variables.
• The coefficient ranges from -1 to +1, with -1 indicating a perfectly negative correlation, +1 indicating a
perfectly positive correlation, and 0 indicating no correlation
• Example: analyzing a dataset of housing prices in a city, with two variables: the size of the house (in square feet) and the sale price of the house.
• Plotting the data points for the two variables on a scatter plot shows a positive linear relationship, with larger houses generally selling for higher prices.
Association mining to Correlation analysis
From Association Analysis to Correlation Analysis
• Support and confidence measures are insufficient for filtering out uninteresting association rules.
• To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules:
A ⇒ B [support, confidence, correlation]
Lift is a simple correlation measure, given as follows.
The occurrence of itemset A is independent of the occurrence of itemset B
if P(A ∪ B) = P(A)P(B);
otherwise, itemsets A and B are dependent and correlated as events. The lift between the occurrences of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B))
If the result is less than 1, the occurrence of A is negatively correlated with the occurrence of B.
If the result is greater than 1, A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
If the result is equal to 1, A and B are independent and there is no correlation between them.
Cont.,
Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, 7,500 included videos, and 4,000 included both computer games and videos.
Correlation analysis using lift for the above example:
We need to study how the two itemsets, A and B, are correlated.
The probability of purchasing a computer game is P({game}) = 0.60.
The probability of purchasing a video is P({video}) = 0.75.
The probability of purchasing both is P({game, video}) = 0.40.
lift = P({game, video}) / (P({game})P({video})) = 0.40 / (0.60 × 0.75) = 0.89
Since this value is less than 1, there is a negative correlation between the occurrences of {game} and {video}.
Such a negative correlation cannot be identified by the support-confidence framework.
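The lift arithmetic is easy to verify directly from the transaction counts:

```python
n = 10_000
p_game = 6_000 / n    # P({game})  = 0.60
p_video = 7_500 / n   # P({video}) = 0.75
p_both = 4_000 / n    # P({game, video}) = 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))  # 0.89, i.e. less than 1: negative correlation
```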
Cont.,
Correlation analysis using χ²:
• To compute the correlation using χ² analysis, we need the observed value and the expected value (displayed in parentheses) for each slot of the contingency table.
• Because the χ² value is greater than one, and the observed value of the slot (game, video) = 4,000 is less than the expected value of 4,500, buying game and buying video are negatively correlated. This is consistent with the conclusion derived from the analysis of the lift.
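The χ² value itself can be computed from the 2×2 contingency table implied by the transaction counts, using expected count = row total × column total / n for each cell:

```python
n = 10_000
game, video, both = 6_000, 7_500, 4_000

# Observed counts for the four cells of the contingency table.
observed = {
    ('game', 'video'): both,                            # 4,000
    ('game', 'no_video'): game - both,                  # 2,000
    ('no_game', 'video'): video - both,                 # 3,500
    ('no_game', 'no_video'): n - game - video + both,   #   500
}
row = {'game': game, 'no_game': n - game}
col = {'video': video, 'no_video': n - video}

chi2 = 0.0
for (r, c), obs in observed.items():
    expected = row[r] * col[c] / n    # e.g. (game, video): 4,500
    chi2 += (obs - expected) ** 2 / expected
print(round(chi2, 1))  # 555.6
```

The (game, video) cell contributes (4,000 - 4,500)² / 4,500 ≈ 55.6; summing all four cells gives χ² ≈ 555.6, far above the independence value, confirming the dependence detected by lift.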
Cont.,
Reading Assignment:
• Analyze the correlation measures all_confidence and cosine
Cont.,
• A data mining process may uncover thousands of rules from a given set of data, most
of which end up being unrelated or uninteresting to the users.
• Users have a good sense of which “direction” of mining may lead to interesting
patterns and the “form” of the patterns or rules they would like to find.
• Thus, a good heuristic is to have the users specify such intuition or expectations as
constraints to confine the search space.
The constraints can include the following:
• Knowledge type constraints: These specify the type of knowledge to be mined, such as
association or correlation.
• Data constraints: These specify the set of task-relevant data.
Constraint-Based Association Mining
• Dimension/level constraints: These specify the desired dimensions (or attributes)
of the data, or levels of the concept hierarchies, to be used in mining.
• Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
• Rule constraints: These specify the form of rules to be mined. Such constraints may
be expressed as metarules (rule templates), as the maximum or minimum number
of predicates that can occur in the rule consequent, or as relationships among
attributes, attribute values, and/or aggregates.
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 

Advanced Data Mining Concepts and Techniques

  • 1. Advanced Data Mining Concepts Semester II 2022-23
  • 2. Chapter 01 Review of Data Mining Principles and Preprocessing Methods • Mining frequent patterns: basic concepts • Efficient and scalable frequent itemset mining methods • Apriori algorithm • FP-Growth algorithm • Associations: mining various kinds of association rules • Correlations: from association mining to correlation analysis • Constraint-based mining
  • 3. The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, ... • Science: remote sensing, bioinformatics, scientific simulation, ... • Society and everyone: news, digital cameras, YouTube • "We are drowning in data, but starving for knowledge!" • "Necessity is the mother of invention": data mining, the automated analysis of massive data Why Data Mining?
  • 4. Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data Alternative names: • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: Is everything "data mining"? • Simple search and query processing • (Deductive) expert systems What Is Data Mining?
  • 6. • Tremendous amount of data • Algorithms must be highly scalable to handle terabytes of data • High dimensionality of data • Microarray data may have tens of thousands of dimensions • High complexity of data • Data streams and sensor data • Time-series data, temporal data, sequence data • Structured data, graphs, social networks and multi-linked data • Heterogeneous databases and legacy databases • Spatial, spatiotemporal, multimedia, text and Web data • Software programs, scientific simulations • New and sophisticated applications Why Not Traditional Data Analysis?
  • 7. • Database-oriented data sets and applications • Relational database, data warehouse, transactional database • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bio-sequences) • Structured data, graphs, social networks and multi-linked data • Object-relational databases • Heterogeneous databases and legacy databases • Spatial data and spatiotemporal data • Multimedia databases • Text databases • The World-Wide Web Data Mining: On What Kinds of Data?
  • 11. • Itemset: a set of one or more items • k-itemset: X = {x1, …, xk} • (Absolute) support, or support count, of X: the frequency or number of occurrences of an itemset X • (Relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X's support is no less than a minsup threshold. Frequent Patterns
  • 12. (Absolute) support, or support count, of X: the frequency or occurrence of an itemset X. {Diaper}: 4; {Beer, Diaper}: 3. (Relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X). {Diaper}: 80%; {Beer, Diaper}: 60%. An itemset X is frequent if X's support is no less than a minsup threshold. Cont.,
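The counts above can be reproduced with a small sketch; the transaction table below is a hypothetical reconstruction chosen to match the counts quoted on the slide, not data given in it.

```python
# Sketch: absolute and relative support over a toy transaction database
# (hypothetical five transactions matching the slide's counts).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def absolute_support(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)

def relative_support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return absolute_support(itemset, db) / len(db)

print(absolute_support({"Diaper"}, transactions))          # 4
print(relative_support({"Diaper"}, transactions))          # 0.8
print(absolute_support({"Beer", "Diaper"}, transactions))  # 3
print(relative_support({"Beer", "Diaper"}, transactions))  # 0.6
```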
  • 13. Find all the rules X → Y with minimum support and confidence. Support, s: probability that a transaction contains X ∪ Y. Confidence, c: conditional probability that a transaction having X also contains Y. Let minsup = 50%, minconf = 50%. Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3. Association rules: (many more!) Beer → Diaper (60%, 100%), Diaper → Beer (60%, 75%) Association Rules
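A minimal sketch of how the (support, confidence) pairs for the two rules follow from the itemset counts on this slide (5 transactions assumed, consistent with Diaper appearing in 4 of 5 = 80%):

```python
# Sketch: rule support and confidence derived from itemset counts
# (counts taken from the slide; 5 transactions assumed).
n_transactions = 5
count = {frozenset({"Beer"}): 3,
         frozenset({"Diaper"}): 4,
         frozenset({"Beer", "Diaper"}): 3}

def rule_metrics(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    both = count[frozenset(antecedent | consequent)]
    support = both / n_transactions
    confidence = both / count[frozenset(antecedent)]
    return support, confidence

print(rule_metrics({"Beer"}, {"Diaper"}))   # (0.6, 1.0)
print(rule_metrics({"Diaper"}, {"Beer"}))   # (0.6, 0.75)
```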
  • 14. Given a minimum support 's' and a minimum confidence 'c', find all the rules that satisfy: • The support of the rule is no less than s • The confidence of the rule is no less than c • Support: "How useful is the rule?" • the percentage of transactions in the dataset that contain the itemset • Confidence: "How true is the rule?" • the strength of the relationship between two itemsets A and B Cont.,
  • 15. • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! • Solution: mine closed patterns and max-patterns instead • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X Closed Patterns and Max-Patterns
  • 17. Cont., A frequent itemset is closed if its support is not equal to the support of any of its frequent supersets. Here A = 3/5 and AC = 3/5 are equal, so A is not closed. Here C = 4/5 while AC = 3/5, BC = 3/5, and CE = 3/5; none are equal, so C is closed. C, AC, BE, BCE, and ABCE are closed. A max-pattern is a frequent itemset with no frequent superset at all. In the given dataset, the only max-pattern is {A, B, C, E} (support = 2), since every other frequent itemset is contained in it.
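The closed and max patterns above can be verified by brute force. The five transactions below are a hypothetical reconstruction consistent with the supports quoted on the slide (A = 3/5, C = 4/5, AC = BC = CE = 3/5), with minsup = 2 assumed:

```python
from itertools import combinations

# Sketch: brute-force enumeration of closed and max patterns
# (hypothetical database consistent with the slide's supports).
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
                {"B", "E"}, {"A", "B", "C", "E"}]
minsup = 2

items = sorted(set().union(*transactions))

def support(s):
    return sum(1 for t in transactions if s <= t)

# All frequent itemsets with their support counts.
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= minsup}

# Closed: no frequent superset has the same support.
closed = {s for s, sup in frequent.items()
          if not any(s < t and frequent[t] == sup for t in frequent)}
# Max: no frequent superset at all.
maximal = {s for s in frequent
           if not any(s < t for t in frequent)}

print(sorted("".join(sorted(s)) for s in closed))   # ['ABCE', 'AC', 'BCE', 'BE', 'C']
print(sorted("".join(sorted(s)) for s in maximal))  # ['ABCE']
```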
  • 18. Apriori algorithm: 1. Initialize frequent itemsets by finding the support of each individual item in the dataset. 2. Generate candidate itemsets of size k+1 from frequent itemsets of size k by taking the union of each pair of frequent itemsets of size k, and then prune resulting itemsets that contain any infrequent subset. 3. Count the support of candidate itemsets by scanning the transaction database. 4. Prune infrequent itemsets that do not meet the minimum support threshold. 5. Repeat steps 2-4 until no more frequent itemsets can be generated. 6. Return the frequent itemsets found during the iterations. Efficient and Scalable Frequent Itemset Mining Methods
  • 21. C3 = L2 ⨯ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Here {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5} are removed because some of their subsets are not frequent. For example, {I1, I3, I5} is removed: its 2-subsets are {I1, I3}, {I1, I5}, and {I3, I5}, and {I3, I5} is not in L2. Thus L3 = {{I1, I2, I3}, {I1, I2, I5}}. Joining L3 with itself yields the candidate {I1, I2, I3, I5}, whose subsets include {I1, I3, I5}; {I1, I3, I5} is not frequent, so {I1, I2, I3, I5} is pruned. C4 = L3 ⨯ L3 = ∅ and the algorithm terminates, having found all of the frequent itemsets. Cont.,
  • 22. Pseudo-Code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk
Cont.,
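The pseudo-code above can be sketched in Python. This is a brute-force illustration (the join and prune steps follow the earlier slides), not an implementation tuned for scale; the toy transactions match the counts quoted in the Association Rules slide:

```python
from itertools import combinations

# Sketch: Apriori on a toy database (brute-force, for illustration only).
def apriori(transactions, min_support):
    """Return {frozenset itemset: absolute support} for all frequent itemsets."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    # L1: frequent 1-itemsets
    current = {frozenset({i}) for i in items
               if support(frozenset({i})) >= min_support}
    frequent = {s: support(s) for s in current}
    k = 1
    while current:
        # Join step: Lk x Lk, keeping only unions of size k+1
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

transactions = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
                {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
                {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
result = apriori(transactions, min_support=3)
print(result[frozenset({"Beer", "Diaper"})])  # 3
```

With min_support = 3 this reproduces the frequent patterns quoted earlier (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3).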
  • 23. • "Mining Frequent Patterns without Candidate Generation" is a research paper published in 2000 by Jiawei Han, Jian Pei, and Yiwen Yin. • The paper proposed the FP-Growth algorithm, a frequent pattern mining algorithm that avoids generating candidate itemsets, which is the bottleneck in other frequent itemset mining algorithms such as the Apriori algorithm • The key idea of FP-Growth is to represent the database as an FP-tree, which allows for efficient computation of frequent itemsets • The algorithm works in two main steps: constructing the FP-tree and generating frequent itemsets from the FP-tree. Frequent Pattern-Growth Algorithm
  • 24. FP-Growth Algorithm Steps: • Scan the database to find the support count of each item. • Sort the items in descending order of support count. • Construct the FP-tree by traversing the database again, inserting each transaction into the tree in a way that preserves the order of the items. • Generate frequent itemsets from the FP-tree. Cont.,
  • 25. Step by Step Process: 1. Scan the database 2. Find the support count for each item 3. Arrange the items in descending order of support count 4. Reorder the items in each transaction of the database according to step 3 5. Build the FP-tree with a null root by inserting each reordered transaction, incrementing the count of every node visited along the path. Cont.,
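Steps 1-4 can be sketched as follows. The nine I1..I5 transactions below are an assumed illustrative dataset (the same one the later {I2, I4} slides are consistent with), not data given on this slide:

```python
from collections import Counter

# Sketch of steps 1-4: count supports, keep frequent items, and reorder
# each transaction by descending support before FP-tree insertion.
transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
                ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
                ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
min_support = 2

# Step 1-2: scan the database and count the support of each item.
counts = Counter(item for t in transactions for item in t)
frequent = {i: c for i, c in counts.items() if c >= min_support}

def reorder(t):
    """Steps 3-4: keep frequent items, sorted by descending support (ties by name)."""
    return sorted((i for i in t if i in frequent),
                  key=lambda i: (-frequent[i], i))

ordered = [reorder(t) for t in transactions]
print(sorted(frequent.items()))  # [('I1', 6), ('I2', 7), ('I3', 6), ('I4', 2), ('I5', 2)]
print(ordered[0])                # ['I2', 'I1', 'I5']
```

Each `ordered` transaction is then inserted into the FP-tree (step 5), sharing prefixes with previously inserted transactions.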
  • 26. 6. Compute the conditional pattern base for each item from the FP-tree, i.e., find the prefix paths leading to that item and record each path with its count. 7. Build the conditional FP-tree by taking the set of elements common to all paths in the item's conditional pattern base and computing their support counts by summing the counts of those paths. 8. Generate all combinations of frequent patterns from each item and its conditional FP-tree. When two counts are available for a node, take the minimum. Cont.,
  • 30. Steps to Generate Association Rules from Frequent Pattern generated by FP-Growth • Generate frequent itemsets using FP-Growth algorithm. • Determine the minimum support and confidence levels for association rules. • Generate all possible association rules for each frequent itemset. • Calculate the support and confidence values for each association rule. • Filter out the association rules that do not meet the minimum support and confidence levels. • Evaluate the remaining association rules based on their interestingness measures such as lift, conviction, and leverage to determine their significance. • Present the significant association rules to the user for further analysis or decision- making. Cont.,
  • 31. Generate association rules for the frequent pattern {I2, I4 : 2} • A minimum support of 20% and a minimum confidence of 80% are fixed. • All possible association rules for the frequent itemset: • I2 → I4 (I2 implies I4) • I4 → I2 • Calculate the support and confidence values for the I2 → I4 rule: • Support({I2, I4}) = (Frequency of {I2, I4}) / (Total number of transactions) = 2/9 ≈ 0.22 • Confidence(I2 → I4) = (Frequency of {I2, I4}) / (Frequency of {I2}) = 2/7 ≈ 0.29 • Calculate the confidence value for the I4 → I2 rule: Confidence(I4 → I2) = (Frequency of {I2, I4}) / (Frequency of {I4}) = 2/2 = 1.0 Cont.,
  • 32. Filter out the association rules that do not meet the minimum support and confidence levels. Since the minimum support is 20% and the minimum confidence is 80%, the only association rule that meets both criteria is: I4 → I2 with a support of 0.222 and a confidence of 1.0 Continue the same process for further frequent pattern to generate association rules Cont.,
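A quick check of the filtering step for the two {I2, I4} rules, using the counts from the previous slide:

```python
# Sketch: support/confidence filtering for the rules over {I2, I4}.
# Counts from the slides: 9 transactions, I2 in 7, I4 in 2, {I2, I4} in 2.
n = 9
freq = {"I2": 7, "I4": 2, ("I2", "I4"): 2}

support = freq[("I2", "I4")] / n              # 2/9, about 0.22
conf_i2_i4 = freq[("I2", "I4")] / freq["I2"]  # 2/7, about 0.29
conf_i4_i2 = freq[("I2", "I4")] / freq["I4"]  # 2/2 = 1.0

min_sup, min_conf = 0.20, 0.80
print(support >= min_sup and conf_i2_i4 >= min_conf)  # False: I2 -> I4 rejected
print(support >= min_sup and conf_i4_i2 >= min_conf)  # True: I4 -> I2 kept
```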
  • 33. A. Mining Multilevel Association Rules It is difficult to find strong associations among data items at low or primitive levels of abstraction. 1. Using uniform minimum support for all levels (referred to as uniform support) • The same minimum support threshold is used at each level of abstraction. • When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold • If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels. Mining Various Kinds of Association Rules
  • 34. 2. Using reduced minimum support at lower levels (referred to as reduced support) • Each level of abstraction has its own minimum support threshold. • The deeper the level of abstraction, the smaller the corresponding threshold. • In Figure, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent. Cont.,
  • 35. 3. Using item or group-based minimum support (referred to as group-based support) • Users or experts often decide that some groups are more important than others. • Users set minimum support thresholds based on this importance • For example, a user could set the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to association patterns containing items in these categories Cont.,
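The reduced-support scheme (5% at level 1, 3% at level 2, as in the earlier slide) can be sketched as a per-level threshold lookup; the item supports below are made up for illustration:

```python
# Sketch: reduced (per-level) minimum support.
# Level 1 uses 5%, level 2 uses 3%; item supports are hypothetical.
level_minsup = {1: 0.05, 2: 0.03}
item_support = {("computer", 1): 0.10,
                ("laptop computer", 2): 0.06,
                ("desktop computer", 2): 0.04}

# An item is frequent if it meets its own level's threshold.
frequent = [item for (item, level), sup in item_support.items()
            if sup >= level_minsup[level]]
print(frequent)  # all three items pass their own level's threshold
```

Under a uniform 5% threshold, "desktop computer" (4%) would be missed, which is exactly the failure mode the uniform-support slide describes.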
  • 36. B. Mining Multidimensional Association Rules from Relational Databases and Data Warehouses • When data are stored in a relational database or data warehouse, the setting is called multidimensional • Relational information about the customers who purchased the items, such as customer age, occupation, credit rating, income, and address, may also be stored. Considering each database attribute or warehouse dimension as a predicate, we can mine association rules containing multiple predicates, such as age(X, "20..29") ∧ occupation(X, "student") ⇒ buys(X, "laptop"). • Association rules that involve two or more dimensions or predicates are referred to as multidimensional association rules. • The rule above contains three predicates (age, occupation, and buys), each of which occurs only once; hence it has no repeated predicates. Multidimensional association rules with no repeated predicates are called interdimensional association rules. • We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules, e.g., age(X, "20..29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer") Cont.,
  • 37. C. Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes • Data discretization techniques replace numeric values with interval labels • Alternatively, the transformed multidimensional data may be used to construct a data cube • Data cubes are well suited to the mining of multidimensional association rules: they store aggregates (such as counts) in multidimensional space • The figure shows the lattice of cuboids defining a data cube for the dimensions age, income, and buys. • The cells of an n-dimensional cuboid can be used to store the support counts of the corresponding n-predicate sets.
  • 38. Correlation • Correlation is a statistical measure that shows the strength and direction of the relationship between two variables. It is commonly used in data mining to identify patterns and relationships between variables in a dataset. • The correlation coefficient quantifies the strength and direction of the relationship between two variables. • The coefficient ranges from −1 to +1, with −1 indicating a perfectly negative correlation, +1 indicating a perfectly positive correlation, and 0 indicating no correlation • Example: analyzing a dataset of housing prices in a city using the size of each house (in square feet) and its sale price. • Plotting the data points for the two variables on a scatter plot shows a positive linear relationship, with larger houses generally selling for higher prices Association mining to Correlation analysis
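A minimal sketch of the correlation coefficient (Pearson's r) for the house-size vs. sale-price example; the five data points are invented for illustration:

```python
import math

# Sketch: Pearson correlation coefficient for house size vs. sale price
# (hypothetical data points).
size  = [1400, 1600, 1700, 1875, 2350]   # square feet
price = [245, 312, 279, 308, 405]        # sale price, in thousands

def pearson(x, y):
    """Correlation coefficient in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(size, price)
print(round(r, 2))  # close to +1: larger houses generally sell for more
```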
  • 39. From Association Analysis to Correlation Analysis • Support and confidence measures are insufficient to filter out uninteresting association rules. • To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules: A ⇒ B [support, confidence, correlation]. Lift is a simple correlation measure, given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. The lift between the occurrences of A and B can be measured by computing lift(A, B) = P(A ∪ B) / (P(A)P(B)). If the result is less than 1, then the occurrence of A is negatively correlated with the occurrence of B. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the result is equal to 1, then A and B are independent and there is no correlation between them. Cont.,
  • 40. Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos. Correlation analysis using lift with the above example: we need to study how the two itemsets, A and B, are correlated. The probability of purchasing a computer game is P({game}) = 0.60; the probability of purchasing a video is P({video}) = 0.75; the probability of purchasing both is P({game, video}) = 0.40. P({game, video}) / (P({game})P({video})) = 0.40 / (0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}; such a negative correlation cannot be identified by the support-confidence framework. Cont.,
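The lift computation on this slide, as a short sketch:

```python
# Sketch: lift for the game/video example
# (10,000 transactions; 6,000 game, 7,500 video, 4,000 both).
p_game = 6000 / 10000
p_video = 7500 / 10000
p_both = 4000 / 10000

lift = p_both / (p_game * p_video)
print(round(lift, 2))  # 0.89 -> less than 1, so negatively correlated
```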
  • 41. Correlation analysis using χ²: • To compute the correlation using χ² analysis, we need the observed value and the expected value (displayed in parentheses) for each slot of the contingency table • Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4,000 is less than the expected value of 4,500, buying game and buying video are negatively correlated. This is consistent with the conclusion derived from the lift analysis. Cont.,
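A sketch of the χ² computation for the same contingency table. The observed counts are taken from the example's totals (4,000 game-and-video; the remaining cells follow from the row/column totals), and each expected value is row_total × col_total / n:

```python
# Sketch: chi-square statistic over the game/video contingency table
# (observed counts reconstructed from the slide's totals).
observed = {("game", "video"): 4000, ("game", "no_video"): 2000,
            ("no_game", "video"): 3500, ("no_game", "no_video"): 500}
n = 10000
row = {"game": 6000, "no_game": 4000}
col = {"video": 7500, "no_video": 2500}

# chi^2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum((obs - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
           for (r, c), obs in observed.items())
print(round(chi2, 2))  # 555.56
```

The expected value for (game, video) is 6000 × 7500 / 10000 = 4,500, matching the slide's comparison of observed 4,000 against expected 4,500.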
  • 42. Reading Assignment: • Analyze the correlation measures using all confidence and cosine Cont.,
  • 43. • A data mining process may uncover thousands of rules from a given set of data, most of which end up being unrelated or uninteresting to the users. • Users have a good sense of which “direction” of mining may lead to interesting patterns and the “form” of the patterns or rules they would like to find. • Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. The constraints can include the following: • Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation. • Data constraints: These specify the set of task-relevant data. Constraint-Based Association Mining
  • 44. • Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the concept hierarchies, to be used in mining. • Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation. • Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule consequent, or as relationships among attributes, attribute values, and/or aggregates. Cont.,