SlideShare a Scribd company logo
1 of 25
Topic for the class: Pattern Evaluation Methods
Unit _3 : Title- Mining frequent patterns, associations and correlations
Date & Time : 14.3.2022 9.00 AM- 9.50 AM
Dr. Bhramaramba Ravi
Professor
Department of Computer Science and Engineering
GITAM Institute of Technology (GIT)
Visakhapatnam – 530045
Email: bravi@gitam.edu
1
Department of CSE, GIT EID 356 Data Mining and Data warehousing
17 August 2023
Course objectives
• To understand the importance of data mining and its
applications
• To introduce various types of data and preprocessing
techniques
• To learn various multi-dimensional data models and
OLAP processing
• To study concepts of Association Analysis
• To learn various Classification methods
• To learn basics of Cluster Analysis
2
Department of computer science and engineering, GIT Course Code EID 356
Course Title: Data Mining and Data Warehousing
17 August 2023
Learning Outcomes
• At the end of this lecture/ session, Students will be able
to
• Understand the pattern evaluation methods
3
Department of computer science and engineering, GIT Course Code EID
356 and Course Title: Data Mining and Data warehousing
17 August 2023
Module 3
• Mining frequent patterns, associations and
correlations: Basic concepts, Applications of
frequent pattern and associations, Frequent
pattern and association mining: A road map,
mining various kinds of association rules, Apriori
algorithm, FP growth algorithm, Pattern
evaluation methods
4
Department of computer science and engineering, GIT Course Code EID
356 and Course Title: Data Mining and Data warehousing
17 August 2023
• Most association rule mining algorithms employ a support-
confidence framework. Although minimum support and
confidence thresholds help weed out or exclude the exploration
of a good number of uninteresting rules, many of the rules
generated are still not interesting to the users. This is especially
true when mining at low support thresholds or mining for long
patterns. This has been a major bottleneck for successful
application of association rule mining.
5
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
Which Patterns are interesting?- Pattern Evaluation
Methods
Strong rules are not necessarily
interesting
• Whether or not a rule is interesting can be assessed either
subjectively or objectively. Ultimately only the user can judge
if a given rule is interesting, and this judgment, being
subjective, may differ from one user to another. However,
objective interestingness measures based on the statistics behind
the data, can be used as one step toward the goal of weeding out
uninteresting rules that would otherwise be presented to the user.
Example: A misleading strong association rule. Suppose we are
interested in analyzing transactions at AllElectronics with respect
to the purchase of computer games and videos.
6
Department of computer science and engineering, GIT Course Code EID
356 and Course Title: Data Mining and Data warehousing
17 August 2023
Strong rules are not necessarily
interesting contd.
7
Department of computer science and engineering, GIT Course Code EID
356 and Course Title: Data Mining and Data warehousing
17 August 2023
• Let game refer to the transactions containing computer games, and
video refer to those containing videos. Of the 10,000 transactions
analyzed, the data show that 6000 of the customer transactions
included computer games, while 7500 included videos and 4000
included both computer games and videos. Suppose that a data
mining program for discovering association rules is run on the data,
using a minimum support of say, 30% and a minimum confidence of
60%. The following association rule is discovered:
buys(X,”computer games”)=> buys(X,”Videos”)
[support =40%, Confidence = 66%]
Strong rules are not necessarily
interesting contd.
8
Department of computer science and engineering, GIT Course Code EID
356 and Course Title: Data Mining and Data warehousing
17 August 2023
• The earlier rule is a strong association rule since its support value of
4000/10000 =40% and confidence value of 4000/6000 =66% satisfy the
minimum support and minimum confidence thresholds. However the
rule is misleading because the probability of purchasing videos is 75%,
which is even larger than 66%. Computer games and videos are
negatively associated because the purchase of one of these items
actually decreases the likelihood of purchasing the other
From Association Analysis to Correlation
Analysis
9
Department of computer science and engineering, GIT Course Code EID
356 and Course Title: Data Mining and Data warehousing
17 August 2023
• As we have seen so far, the support and confidence measures are
insufficient at filtering out uninteresting association rules. To tackle
this weakness, a correlation measure can be used to augment the
support-confidence framework for association rules. This leads to
correlation rules of the form
A => B[Support, confidence, correlation]
That is, a correlation rule is measured not only by its support and
confidence but also by the correlation between itemsets A and B.
• Lift is a simple correlation measure that is given as follows:
• The occurrence of itemset A is independent of the occurrence of
itemset B if P(AUB) = P(A)P(B) otherwise itemsets A and B are
dependent and correlated as events. This definition can be
extended to more than two itemsets. The lift between the
occurrence of A and B can be measured by computing
lift(A,B)=P(AUB)/P(A)P(B)
• If the resulting value of the above equation is less than 1, then
the occurrence of A is negatively correlated with the occurrence
of B, meaning that the occurrence of one likely leads to the
absence of the other.
10
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
From Association Analysis to Correlation Analysis
contd.
11
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
From Association Analysis to Correlation Analysis
contd.
• If the resulting value is equal to 1, then A and B are independent and
there is no correlation between them. The earlier equation is
equivalent to P(B/A)/P(B) or conf(A=>B)/sup(B) which is also referred
to as the lift of the association(or correlation) rule A=>B
• Example – Correlation analysis using lift. To help filter out misleading “
strong” associations of the form A=>B from the data of earlier example
we need to study how the two itemsets A and B, are correlated. Let
game refer to the transactions of example that do not contain
computer games, and video refer to those that do not contain videos.
• The Transactions can be summarized in a contingency table
as shown in Table
2x2 Contingency Table Summarizing the transactions with
respect to game and video purchases
12
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
From Association Analysis to Correlation Analysis
contd.
• From the table we can see that the probabiiity of purchasing a computer
game is P({game}) =0.60, the probability of purchasing a video is
P({video})=0.75 and the probability of purchasing both is
P({game,video})=0.40.
• By the equation for lift lift is calculated as
• P({game,video})/(P({game})XP({video}))=0.40/(0.60x0.75) =0.89.
• Because this value is less than 1, there is a negative correlation between
the occurrence of {game} and {video}.
• The numerator is the likelihood of a customer purchasing both, while the
denominator is what the likelihood would have been if the two purchases
were completely independent. Such a negative correlation cannot be
identified by a support-confidence framework.
13
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
From Association Analysis to Correlation Analysis
contd.
• The second correlation measure that we study is the χ2 measure.
• To compute the χ2 value, we take the squared difference between the
observed and expected value for a slot( A and B pair) in the contingency
table, divided by the expected value. This amount is summed for all
slots of the contingency table.
14
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
From Association Analysis to Correlation Analysis
contd.
game game ∑row
video 4000(4500) 3500(3000) 7500
video 2000(1500) 500(1000) 2500
∑col 6000 4000 10000
15
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
Table Contingency Table, now with the expected
Values
•
To compute the correlation using χ2 analysis for nominal data, we need
the observed value and expected value for each slot of the contingency
table.
• From the table we can compute the χ2 value as follows:
• χ2 = ∑(observed-expected)2 = (4000-4500)2 + (3500-3000) 2
expected 4500 3000
+(2000-1500)2 + (500-1000)2 = 555.6
1500 1000
• Because the χ2 value is greater than 1, and the observed value of the
slot(game, video)=4000, which is less than the expected value of 4500,
buying game and video are negatively correlated.
16
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
From Association Analysis to Correlation Analysis
contd.
• Instead of using the simple support-confidence framework to evaluate
frequent patterns, other measures such as lift and χ2 often disclose more
intrinsic pattern relationships.
• Several other pattern evaluation measures exist.
• Four such measures are all_confidence, max_confidence, kulczynski and
cosine.
• We’ll then compare their effectiveness with respect to one another and
with respect to the lift and χ2 measures.
• Given two itemsets, A and B, the all_confidence measure of A and B is
defined as
all_conf(A,B)= sup(AUB)/max{ sup(a), sup(B)} = min{P(A|B), P(B|A)}
Where max{sup(A), sup(B)} is the maximum support of the itemsets A and B.
Thus all_conf(A,B) is also the minimum confidence of the two association
rules related to A and B namely “A=>B” and “B=>A”
17
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
A comparison of pattern evaluation measures
•
• Given two itemsets, A and B, the max_confidence measure of A and B is
defined as max_conf(A,B)= max{P(A|B), P(B|A)}
• The max_conf measure is the maximum confidence of the two association
rules, “A=>B” and “B=>A”.
• Given two itemsets, A and B, the kulczynski measure of A and
B(abbreviated as Kulc) is defined as
• Kulc(A,B)=1/2(P(A|B) +P(B|A)). It was proposed in 1927 by Polish
mathematician S.Kulczynski. It can be viewed as the average of two
confidence measures. That is, it is the average of two conditional
probabilities: the probability of itemset B given itemset A, and the
probability of itemset A given itemset B.
18
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
A comparison of pattern evaluation measures contd.
• Finally, given two itemsets, A and B, the cosine measure of A and B is
defined as
• Cosine(A,B) = P(AUB)/√P(A)xP(B)= sup(AUB)/√sup(A)xsup(B)
= √P(A|B)xP(B|A)
• The cosine measure can be viewed as a harmonized lift measure. The two
formulae are similar except that for cosine, the square root is taken on the
product of the probabilities of A and B. This is an important difference,
because by taking the square root, the cosine value is only influenced by
the supports of A,B, and AUB and not by the total number of transactions.
19
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
A comparison of pattern evaluation measures contd.
20
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
A comparison of pattern evaluation measures contd.
• Each of these four measures defined has the following property: Its
value is only influenced by the supports of A,B, and AUB or more
exactly by the conditional probabilities of P(A|B) and P(B|A) but not by
the total number of transactions.
• Another common property is that each measure ranges from 0 to 1,
and the higher the value, the closer the relationship between A and B.
•
• A null transaction is a transaction that does not contain any of the itemsets
being examined.
• A measure is null-invariant if its value is free from the influence of null-
transactions.
• Null invariance is an important property for measuring association patterns
in large transaction databases.
• Among the 6 measures, lift and χ2 are not null-invariant measures.
• Among the four null-invariant measures, namely all_confidence,
max_confidence, Kulc and cosine, we recommend using Kulc in conjunction
with the imbalance ratio.
21
Department of computer science and engineering, Data warehousingGIT
Course Code EID 356 and Course Title: Data Mining and
17 August 2023
A comparison of pattern evaluation measures contd.
Text Book(s)
1. Jiawei Han, Micheline Kamber, Jian Pei , Data Mining: Concepts
and Techniques, 2/e, Morgan Kaufmann publishers, 2006.
References
1. Correlation Analysis mgaub.ac.in
22
Department of computer science and engineering, GIT Course
Code EID 356 and Course Title: Data Mining and Data Warehousing
17 August 2023
References
Session Quiz (5-10 MCQ/True false questions)
(marks to be counted for continuous evaluation)
23
Department of Computer science and engineering, GIT Course Code EID 356
and Course Title: Data Mining and Data Warehousing
17 August 2023
Session Quiz
• Understood the pattern evaluation methods
24
Department of computer science and engineering, GIT Course
Code EID 356 and Course Title: Data Mining and Data Warehousing
17 August 2023
Recap- Summary
25
Department of computer sciience and engineering GIT Course Code
EID 356 and Course Title: Data Mining and Data Warehousing
17 August 2023
THANK YOU

More Related Content

Similar to pattern_evaluaiton_methods.ppt

GSU IT Program Fact Sheet
GSU IT Program Fact SheetGSU IT Program Fact Sheet
GSU IT Program Fact Sheet
William Roman
 
2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...
2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...
2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...
Siddharth Modi
 

Similar to pattern_evaluaiton_methods.ppt (20)

GSU IT Program Fact Sheet
GSU IT Program Fact SheetGSU IT Program Fact Sheet
GSU IT Program Fact Sheet
 
DESIGN OF 8-BIT COMPARATORS
DESIGN OF 8-BIT COMPARATORSDESIGN OF 8-BIT COMPARATORS
DESIGN OF 8-BIT COMPARATORS
 
Practical Challenges ML Workflows
Practical Challenges ML WorkflowsPractical Challenges ML Workflows
Practical Challenges ML Workflows
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
 
Homomorphic encryption scheme
Homomorphic encryption schemeHomomorphic encryption scheme
Homomorphic encryption scheme
 
Stock Market Analysis and Prediction
Stock Market Analysis and PredictionStock Market Analysis and Prediction
Stock Market Analysis and Prediction
 
Se notes
Se notesSe notes
Se notes
 
2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...
2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...
2018,sem7,ala,pm,160993119017,160993119018,160993119019,160993119020,16099311...
 
Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j Data Lineage, Property Based Testing & Neo4j
Data Lineage, Property Based Testing & Neo4j
 
Medical Image Compression
Medical Image CompressionMedical Image Compression
Medical Image Compression
 
ML master class
ML master classML master class
ML master class
 
HANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNING
HANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNINGHANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNING
HANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNING
 
HANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNING
HANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNINGHANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNING
HANDWRITTEN DIGIT RECOGNITION USING MACHINE LEARNING
 
Data Structures and Algorithm - Week 8 - Minimum Spanning Trees
Data Structures and Algorithm - Week 8 - Minimum Spanning TreesData Structures and Algorithm - Week 8 - Minimum Spanning Trees
Data Structures and Algorithm - Week 8 - Minimum Spanning Trees
 
The Computation Complexity Reduction of 2-D Gaussian Filter
The Computation Complexity Reduction of 2-D Gaussian FilterThe Computation Complexity Reduction of 2-D Gaussian Filter
The Computation Complexity Reduction of 2-D Gaussian Filter
 
Pro-active Thermal MonitoringFINALPRESENTATIONCOLLEGE.pptx
Pro-active Thermal MonitoringFINALPRESENTATIONCOLLEGE.pptxPro-active Thermal MonitoringFINALPRESENTATIONCOLLEGE.pptx
Pro-active Thermal MonitoringFINALPRESENTATIONCOLLEGE.pptx
 
IRJET- Security, Issues and Algorithm and their Performance Analysis
IRJET- Security, Issues and Algorithm and their Performance AnalysisIRJET- Security, Issues and Algorithm and their Performance Analysis
IRJET- Security, Issues and Algorithm and their Performance Analysis
 
IRJET - Encoded Polymorphic Aspect of Clustering
IRJET - Encoded Polymorphic Aspect of ClusteringIRJET - Encoded Polymorphic Aspect of Clustering
IRJET - Encoded Polymorphic Aspect of Clustering
 

Recently uploaded

Personalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaPersonalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
EADTU
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
EADTU
 
Orientation Canvas Course Presentation.pdf
Orientation Canvas Course Presentation.pdfOrientation Canvas Course Presentation.pdf
Orientation Canvas Course Presentation.pdf
Elizabeth Walsh
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
Peter Brusilovsky
 

Recently uploaded (20)

Introduction to TechSoup’s Digital Marketing Services and Use Cases
Introduction to TechSoup’s Digital Marketing  Services and Use CasesIntroduction to TechSoup’s Digital Marketing  Services and Use Cases
Introduction to TechSoup’s Digital Marketing Services and Use Cases
 
diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....
 
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaPersonalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
 
Diuretic, Hypoglycemic and Limit test of Heavy metals and Arsenic.-1.pdf
Diuretic, Hypoglycemic and Limit test of Heavy metals and Arsenic.-1.pdfDiuretic, Hypoglycemic and Limit test of Heavy metals and Arsenic.-1.pdf
Diuretic, Hypoglycemic and Limit test of Heavy metals and Arsenic.-1.pdf
 
Orientation Canvas Course Presentation.pdf
Orientation Canvas Course Presentation.pdfOrientation Canvas Course Presentation.pdf
Orientation Canvas Course Presentation.pdf
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
Ernest Hemingway's For Whom the Bell Tolls
Ernest Hemingway's For Whom the Bell TollsErnest Hemingway's For Whom the Bell Tolls
Ernest Hemingway's For Whom the Bell Tolls
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17
 
What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptx
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of Play
 

pattern_evaluaiton_methods.ppt

  • 1. Topic for the class: Pattern Evaluation Methods Unit _3 : Title- Mining frequent patterns, associations and correlations Date & Time : 14.3.2022 9.00 AM- 9.50 AM Dr. Bhramaramba Ravi Professor Department of Computer Science and Engineering GITAM Institute of Technology (GIT) Visakhapatnam – 530045 Email: bravi@gitam.edu 1 Department of CSE, GIT EID 356 Data Mining and Data warehousing 17 August 2023
  • 2. Course objectives • To understand the importance of data mining and its applications • To introduce various types of data and preprocessing techniques • To learn various multi-dimensional data models and OLAP processing • To study concepts of Association Analysis • To learn various Classification methods • To learn basics of Cluster Analysis 2 Department of computer science and engineering, GIT Course Code EID 356 Course Title: Data Mining and Data Warehousing 17 August 2023
  • 3. Learning Outcomes • At the end of this lecture/ session, Students will be able to • Understand the pattern evaluation methods 3 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data warehousing 17 August 2023
  • 4. Module 3 • Mining frequent patterns, associations and correlations: Basic concepts, Applications of frequent pattern and associations, Frequent pattern and association mining: A road map, mining various kinds of association rules, Apriori algorithm, FP growth algorithm, Pattern evaluation methods 4 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data warehousing 17 August 2023
  • 5. • Most association rule mining algorithms employ a support- confidence framework. Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many of the rules generated are still not interesting to the users. This is especially true when mining at low support thresholds or mining for long patterns. This has been a major bottleneck for successful application of association rule mining. 5 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 Which Patterns are interesting?- Pattern Evaluation Methods
  • 6. Strong rules are not necessarily interesting • Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately only the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another. However, objective interestingness measures based on the statistics behind the data, can be used as one step toward the goal of weeding out uninteresting rules that would otherwise be presented to the user. Example: A misleading strong association rule. Suppose we are interested in analyzing transactions at AllElectronics with respect to the purchase of computer games and videos. 6 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data warehousing 17 August 2023
  • 7. Strong rules are not necessarily interesting contd. 7 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data warehousing 17 August 2023 • Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6000 of the customer transactions included computer games, while 7500 included videos and 4000 included both computer games and videos. Suppose that a data mining program for discovering association rules is run on the data, using a minimum support of say, 30% and a minimum confidence of 60%. The following association rule is discovered: buys(X,”computer games”)=> buys(X,”Videos”) [support =40%, Confidence = 66%]
  • 8. Strong rules are not necessarily interesting contd. 8 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data warehousing 17 August 2023 • The earlier rule is a strong association rule since its support value of 4000/10000 =40% and confidence value of 4000/6000 =66% satisfy the minimum support and minimum confidence thresholds. However the rule is misleading because the probability of purchasing videos is 75%, which is even larger than 66%. Computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other
  • 9. From Association Analysis to Correlation Analysis 9 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data warehousing 17 August 2023 • As we have seen so far, the support and confidence measures are insufficient at filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form A => B[Support, confidence, correlation] That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. • Lift is a simple correlation measure that is given as follows:
  • 10. • The occurrence of itemset A is independent of the occurrence of itemset B if P(AUB) = P(A)P(B) otherwise itemsets A and B are dependent and correlated as events. This definition can be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing lift(A,B)=P(AUB)/P(A)P(B) • If the resulting value of the above equation is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other. 10 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 From Association Analysis to Correlation Analysis contd.
  • 11. 11 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 From Association Analysis to Correlation Analysis contd. • If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. The earlier equation is equivalent to P(B/A)/P(B) or conf(A=>B)/sup(B) which is also referred to as the lift of the association(or correlation) rule A=>B • Example – Correlation analysis using lift. To help filter out misleading “ strong” associations of the form A=>B from the data of earlier example we need to study how the two itemsets A and B, are correlated. Let game refer to the transactions of example that do not contain computer games, and video refer to those that do not contain videos.
  • 12. • The Transactions can be summarized in a contingency table as shown in Table 2x2 Contingency Table Summarizing the transactions with respect to game and video purchases 12 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 From Association Analysis to Correlation Analysis contd.
  • 13. • From the table we can see that the probabiiity of purchasing a computer game is P({game}) =0.60, the probability of purchasing a video is P({video})=0.75 and the probability of purchasing both is P({game,video})=0.40. • By the equation for lift lift is calculated as • P({game,video})/(P({game})XP({video}))=0.40/(0.60x0.75) =0.89. • Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}. • The numerator is the likelihood of a customer purchasing both, while the denominator is what the likelihood would have been if the two purchases were completely independent. Such a negative correlation cannot be identified by a support-confidence framework. 13 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 From Association Analysis to Correlation Analysis contd.
  • 14. • The second correlation measure that we study is the χ2 measure. • To compute the χ2 value, we take the squared difference between the observed and expected value for a slot( A and B pair) in the contingency table, divided by the expected value. This amount is summed for all slots of the contingency table. 14 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 From Association Analysis to Correlation Analysis contd.
  • 15. game game ∑row video 4000(4500) 3500(3000) 7500 video 2000(1500) 500(1000) 2500 ∑col 6000 4000 10000 15 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 Table Contingency Table, now with the expected Values
  • 16. • To compute the correlation using χ2 analysis for nominal data, we need the observed value and expected value for each slot of the contingency table. • From the table we can compute the χ2 value as follows: • χ2 = ∑(observed-expected)2 = (4000-4500)2 + (3500-3000) 2 expected 4500 3000 +(2000-1500)2 + (500-1000)2 = 555.6 1500 1000 • Because the χ2 value is greater than 1, and the observed value of the slot(game, video)=4000, which is less than the expected value of 4500, buying game and video are negatively correlated. 16 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 From Association Analysis to Correlation Analysis contd.
  • 17. • Instead of using the simple support-confidence framework to evaluate frequent patterns, other measures such as lift and χ2 often disclose more intrinsic pattern relationships. • Several other pattern evaluation measures exist. • Four such measures are all_confidence, max_confidence, kulczynski and cosine. • We’ll then compare their effectiveness with respect to one another and with respect to the lift and χ2 measures. • Given two itemsets, A and B, the all_confidence measure of A and B is defined as all_conf(A,B)= sup(AUB)/max{ sup(a), sup(B)} = min{P(A|B), P(B|A)} Where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus all_conf(A,B) is also the minimum confidence of the two association rules related to A and B namely “A=>B” and “B=>A” 17 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 A comparison of pattern evaluation measures
  • 18. • • Given two itemsets, A and B, the max_confidence measure of A and B is defined as max_conf(A,B)= max{P(A|B), P(B|A)} • The max_conf measure is the maximum confidence of the two association rules, “A=>B” and “B=>A”. • Given two itemsets, A and B, the kulczynski measure of A and B(abbreviated as Kulc) is defined as • Kulc(A,B)=1/2(P(A|B) +P(B|A)). It was proposed in 1927 by Polish mathematician S.Kulczynski. It can be viewed as the average of two confidence measures. That is, it is the average of two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B. 18 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 A comparison of pattern evaluation measures contd.
  • 19. • Finally, given two itemsets, A and B, the cosine measure of A and B is defined as • Cosine(A,B) = P(AUB)/√P(A)xP(B)= sup(AUB)/√sup(A)xsup(B) = √P(A|B)xP(B|A) • The cosine measure can be viewed as a harmonized lift measure. The two formulae are similar except that for cosine, the square root is taken on the product of the probabilities of A and B. This is an important difference, because by taking the square root, the cosine value is only influenced by the supports of A,B, and AUB and not by the total number of transactions. 19 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 A comparison of pattern evaluation measures contd.
  • 20. 20 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 A comparison of pattern evaluation measures contd. • Each of these four measures defined has the following property: Its value is only influenced by the supports of A,B, and AUB or more exactly by the conditional probabilities of P(A|B) and P(B|A) but not by the total number of transactions. • Another common property is that each measure ranges from 0 to 1, and the higher the value, the closer the relationship between A and B.
  • 21. • • A null transaction is a transaction that does not contain any of the itemsets being examined. • A measure is null-invariant if its value is free from the influence of null- transactions. • Null invariance is an important property for measuring association patterns in large transaction databases. • Among the 6 measures, lift and χ2 are not null-invariant measures. • Among the four null-invariant measures, namely all_confidence, max_confidence, Kulc and cosine, we recommend using Kulc in conjunction with the imbalance ratio. 21 Department of computer science and engineering, Data warehousingGIT Course Code EID 356 and Course Title: Data Mining and 17 August 2023 A comparison of pattern evaluation measures contd.
  • 22. Text Book(s) 1. Jiawei Han, Micheline Kamber, Jian Pei , Data Mining: Concepts and Techniques, 2/e, Morgan Kaufmann publishers, 2006. References 1. Correlation Analysis mgaub.ac.in 22 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data Warehousing 17 August 2023 References
  • 23. Session Quiz (5-10 MCQ/True false questions) (marks to be counted for continuous evaluation) 23 Department of Computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data Warehousing 17 August 2023 Session Quiz
  • 24. • Understood the pattern evaluation methods 24 Department of computer science and engineering, GIT Course Code EID 356 and Course Title: Data Mining and Data Warehousing 17 August 2023 Recap- Summary
  • 25. 25 Department of computer sciience and engineering GIT Course Code EID 356 and Course Title: Data Mining and Data Warehousing 17 August 2023 THANK YOU

Editor's Notes

  1. i
  2. i
  3. i
  4. i
  5. i
  6. i
  7. i
  8. i
  9. i
  10. i
  11. i
  12. i
  13. i
  14. i
  15. i
  16. i
  17. i
  18. i
  19. i