Hypothesis Testing and
Statistically-sound Pattern Mining
Tran Quang Thien
These slides are mainly based on the tutorial “Hypothesis testing and statistically-sound Pattern Mining”
by Leonardo Pellegrina, Matteo Riondato and Fabio Vandin at KDD2019.
Outline
● Introduction to Significant Pattern Mining
● Statistical Hypothesis Testing
● Testing for Independence
● Multiple Hypothesis Testing
● Selecting Hypotheses
● Hypothesis Testability
● Summary
Introduction
Data mining is the process of extracting patterns and
knowledge from data
https://www.guru99.com/data-mining-tutorial.html
The ideal and the real worlds
● Knowledge extracted from our data might not match the real patterns
● Our data is just a sample from the population of interest
○ Typically only a very small proportion of the total population
○ May contain noise
Img: A Tutorial on Statistically Sound Pattern Discovery, Wilhelmiina Hämäläinen and Geoffrey I. Webb
Limitation of most data mining methods
● Most data mining methods give no guarantee on their results
○ It is not clear how reliable the results are
○ This limitation restricts their use in many applied fields such as
bioinformatics, medical research ...
Img: A Tutorial on Statistically Sound Pattern Discovery, Wilhelmiina Hämäläinen and Geoffrey I. Webb
A traditional method to offer reliability
How do we efficiently identify patterns in data with guarantees on the
underlying generative process?
We use the statistical hypothesis testing framework
In other words
● Data mining and (inferential) statistics have traditionally taken
two different points of view
○ data mining: the data is the complete representation of the world and of
the phenomena we are studying
○ statistics: the data is obtained from an underlying generative process,
that is what we really care about
● Data: information from two online communities C1 and C2, regarding
whether each post is in a given topic T.
○ Data mining: “what fraction of posts in C1 are related to T? What fraction
of posts in C2 are related to T?”
○ Statistics: “What is the probability that a post from C1 is related to T?
What is the probability that a post from C2 is related to T?”
>> Similar questions but different flavours!
Outline
● Introduction to Significant Pattern Mining
● Statistical Hypothesis Testing
● Testing for Independence
● Multiple Hypothesis Testing
● Selecting Hypotheses
● Hypothesis Testability
● Summary
Statistical Hypothesis Testing
● The process of testing some hypothesis of interest
○ Sometimes called confirmatory data analysis
● The test tells the analyst whether the data supports their hypothesis
○ Hypotheses that pass the test come with a statistical guarantee
● Widely used as a gate-keeper for science publications
Hypothesis A, Hypothesis B, ..., Hypothesis D, Hypothesis E
→ Statistical Hypothesis Testing (with Data) → Significant? → Publish
A real world example
Problem: Checking if a coin is biased (prob. of head is not 0.5)
● Make the hypothesis
○ Null hypothesis H0: P(head) = 0.5
○ Alternative hypothesis H1: P(head) != 0.5
● Choose the proper test (a z-test or t-test in this case)
○ In fact, this is not simple, since there are many types of test
○ Each test has a different purpose, different assumptions ...
● Collect data for testing
○ For example, we flip the coin 100 times and count the heads (n=100)
● Conduct the test to decide whether to reject the null hypothesis
(and accept the alternative hypothesis)
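The steps above can be sketched in code. As a small illustration we use an exact binomial test instead of the z-test/t-test named on the slide, so that only the standard library is needed; the helper name `binomial_p_value` is ours:

```python
from math import comb

def binomial_p_value(heads: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value for H0: P(head) = p0."""
    # Probability of every possible head count under the null hypothesis
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    observed = pmf[heads]
    # Sum the probabilities of all outcomes at least as unlikely as ours
    return sum(p for p in pmf if p <= observed + 1e-12)

# 100 flips, 61 heads: is the coin biased at level alpha = 0.05?
p = binomial_p_value(61, 100)
print(p <= 0.05)  # True: reject H0, about 0.035 < 0.05
```

With 61 heads in 100 flips the two-sided p-value is roughly 0.035, so H0 is rejected at α = 0.05 but not at α = 0.01.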
Statistical Hypothesis Testing (in pattern mining)
● We are given:
○ a dataset D
○ a question we want to answer => a pattern S to consider
● EXAMPLE
○ Dataset D
1000 patients using drug S (cases), did they get cured (YES/NO)
1000 patients not using drug S (controls), did they get cured (YES/NO)
○ Question: does drug S have an effect?
Example: market basket analysis
● Dataset D: transactions = set of items, label (student / professor)
● Pattern S: a subset of items (orange, tomato, broccoli)
● Question: is S associated with one of the two labels?
○ Is being a professor related to buying orange, tomato and broccoli?
Example: genomic analysis
● Dataset D: transactions = set of genes, label (cancer / not cancer)
● Pattern S: subset of genes (Gene 2, Gene 100)
● Question: is S associated with one of the two labels?
○ Is having gene 2 and gene 100 related to having cancer?
Gene        1   2  ...  10000   label
Person 1    1   0  ...    0     Cancer
Person 2    0   1  ...    0     Not cancer
...
Person 500  0   1  ...    1     Cancer
Example: Subgraph mining
http://cbrc3.cbrc.jp/twsdm2015/pdf/sugiyama.pdf
Question: Which subgraphs are associated with the active/inactive status of the disease?
Statistical Hypothesis Testing
● Frame the question in terms of a null hypothesis H0
○ The hypothesis corresponds to “nothing interesting” for pattern S
○ i.e. H0: having genes 2, 100 and having cancer are independent
● The goal is to use the data to either
○ reject H0 (“S is interesting!”) → New discovery
○ or not reject H0 (“S is not interesting”).
● This is decided based on a test statistic xS = fS(D), which is a
summary of S in D
○ ex: the fraction of heads among our coin flips
○ ex: the number of patients having gene set S who got cancer
Statistical Hypothesis Testing: p-value
● Let xS = fS(D) be the value of the test statistic for our dataset D.
● Let XS be the random variable describing the value of the test statistic
under the null hypothesis H0 (i.e., when H0 is true)
● p-value = Pr[XS more extreme than xS | H0 is true]
○ “XS more extreme than xS”: depends on the test
○ may be XS >= xS or XS <= xS or something else ...
● Rejection rule:
○ Given a significance level α in (0, 1); 0.05 or 0.01 are widely used
○ Reject H0 if p-value <= α (S is significant! new discovery!)
[Img: the p-value as the tail area beyond the critical level]
https://scientistseessquirrel.wordpress.com/2015/02/09/n-defence-of-the-p-value/
Statistical Hypothesis Testing: Errors
● There are two types of errors we can make:
○ type I error: reject H0 when H0 is true
=> flag S as significant when it is not (false discovery)
○ type II error: do not reject H0 when H0 is false
=> do not flag S as significant when it is
● Using the rejection rule, the probability of a type I error is at most α
○ Hypothesis testing offers us a guarantee on false discoveries
Statistical Hypothesis Testing: Errors Guarantees
● Type I errors and type II errors trade off against each other
○ No type I error when we never flag a pattern as significant
○ No type II error if we always flag a pattern as significant
● Power of a test
○ Power of the test = Pr[H0 is rejected | H0 is false] = 1 - β
○ Not easy to evaluate (we don’t know the truth)
○ Pr[type II error] = β = 1 - power
● A lower critical level α makes the test stricter
○ Fewer false discoveries, but more misses
Outline
● Introduction to Significant Pattern Mining
● Statistical Hypothesis Testing
● Testing for Independence
● Multiple Hypothesis Testing
● Selecting Hypotheses
● Hypothesis Testability
● Summary
Example: Testing for independence
● An important test for pattern mining
● Given:
○ Dataset D = {t1, . . . , tn} containing n transactions
○ Each transaction ti has a label l(ti) in {c0, c1}
○ A pattern S
● Goal:
○ Understand if the appearance of S and the transaction labels l(ti)
are independent
● Statistical hypothesis testing
○ Null hypothesis H0: the events “S in ti” and “l(ti) = c1” are independent
○ Alternative hypothesis H1: there is a dependency between
“S in ti” and “l(ti) = c1”
Example: market basket analysis
● S = {orange, tomato, broccoli}
● H0: S is independent of (not associated with) label “professor”
Example: Testing for Independence
Useful representation of the data: contingency table
● σ1(S) = support of S with label c1 in D
● σ0(S) = support of S with label c0 in D
● σ(S) = σ1(S) + σ0(S) = support of S in D
● n1 = number of transactions with label c1
● n0 = n - n1 = number of transactions with label c0
Example: Testing for Independence
● Value of the test statistic: xS = σ1(S) = 3
● p-value: how do we compute it?
○ Fisher’s exact test
○ χ² test
Fisher’s exact test
● Assumption: the column marginals and the row marginals are fixed
● Null distribution: under the null hypothesis (independence),
○ the support of S with label c1 follows Hypergeometric(n, n1, σ(S))
● The p-value is the tail probability of this distribution
Example
● XS follows a hypergeometric distribution with parameters 8, 4, 4
● Probability of the observed table = 16/70 ≈ 0.229
● p-value = Pr[XS >= 3] = 17/70 ≈ 0.243
● For α = 0.05, p-value > α
=> S is not associated with label
‘professor’
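This hypergeometric tail can be computed directly with Python’s standard library; a minimal sketch (the helper name `fisher_exact_p` is ours, not from any package):

```python
from math import comb

def fisher_exact_p(a: int, n1: int, n: int, sigma: int) -> float:
    """One-sided Fisher p-value Pr[X >= a], X ~ Hypergeometric(n, n1, sigma):
    a = support of S in class c1, n1 = #transactions labelled c1,
    n = total #transactions, sigma = total support of S in D."""
    def pmf(k: int) -> float:
        return comb(n1, k) * comb(n - n1, sigma - k) / comb(n, sigma)
    return sum(pmf(k) for k in range(a, min(n1, sigma) + 1))

# The table from the slides: n = 8, n1 = 4, sigma(S) = 4, observed sigma1(S) = 3
p = fisher_exact_p(3, 4, 8, 4)
print(round(p, 3))  # 0.243 > 0.05, so S is not significant
```

The two tail terms are 16/70 (the observed table) and 1/70 (the even more extreme table), summing to 17/70 ≈ 0.243 as on the slide.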
χ² test
● In the old days: “Fisher’s exact test is computationally expensive...”
○ We have to calculate many binomial coefficients
● Use the χ² test instead, which is an asymptotic version of Fisher’s test
○ The test statistic is easier to compute
○ The test statistic follows the χ² distribution when n is large
○ The p-value can also be looked up in a precomputed table
○ But it is still just an asymptotic approximation
http://uregina.ca/~gingrich/appchi.pdf
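For a 2×2 table the χ² statistic has a standard closed form, and with one degree of freedom its survival function can be written with `erfc`, so no statistics library is needed; a sketch (helper name ours):

```python
from math import erfc, sqrt

def chi2_2x2(a: int, b: int, c: int, d: int):
    """Chi-square test of independence for a 2x2 table [[a, b], [c, d]].
    Returns (statistic, asymptotic p-value with 1 degree of freedom)."""
    n = a + b + c + d
    # chi2 = n (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1: P[chi2 > x] = P[|Z| > sqrt(x)] = erfc(sqrt(x / 2))
    p = erfc(sqrt(stat / 2))
    return stat, p

stat, p = chi2_2x2(3, 1, 1, 3)  # same 2x2 table as the Fisher example
print(round(stat, 1), round(p, 3))
```

On this tiny table the asymptotic p-value (≈ 0.157) differs noticeably from Fisher’s exact 0.243, illustrating that the approximation only holds when n is large.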
Outline
● Introduction to Significant Pattern Mining
● Statistical Hypothesis Testing
● Testing for Independence
● Multiple Hypothesis Testing
● Selecting Hypotheses
● Hypothesis Testability
● Summary
Pattern mining and statistical hypothesis testing
● Previous part: we had one pattern we were interested in
● In the Knowledge Discovery and Data mining setting:
we have to consider multiple hypotheses given by our dataset D
● We want to find the interesting patterns
(We don’t know which pattern to test a priori)
Fourth paradigm: data-intensive science
● Data-driven workflow:
data collection → hypothesis determination → hypothesis testing
● “Classical” workflow:
hypothesis determination → data collection → hypothesis testing
data-intensive workflow (more data available today):
data collection → hypothesis determination → hypothesis testing
traditional workflow (it was costly to collect data):
hypothesis determination → data collection → hypothesis testing
Big data, many variables:
a huge number of patterns, a huge number of tests to conduct
Fourth paradigm: data-intensive science
Every pattern S appearing in the data gives one hypothesis to test,
via its contingency table:
                 professor   student   total
contains S           A          B        n
not contains S       C          D      N - n
total              n(T)     N - n(T)     N
H0: pattern S and ‘professor’ are independent
H1: pattern S and ‘professor’ are associated
Multiple hypothesis testing
So what happens if we use the rejection rule from before?
● Let m be the number of hypotheses to test
● Proposition: E[num. false discoveries] = α · m
(when all m null hypotheses are true)
● m is extremely high in the problem before
○ 1000 variables, 2^1000 - 1 possible patterns to consider
○ A large m makes the expected number of false discoveries very high
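The proposition is easy to check empirically. The small simulation below (ours, under the idealized assumption that p-values of true null hypotheses are Uniform[0,1]) draws m null p-values per trial and counts how many fall below α:

```python
import random

def expected_false_discoveries(m: int, alpha: float, trials: int = 2000) -> float:
    """Simulate m true null hypotheses (p-values ~ Uniform[0,1]) and count
    how many are falsely rejected at level alpha, averaged over trials."""
    random.seed(0)
    total = 0
    for _ in range(trials):
        total += sum(1 for _ in range(m) if random.random() <= alpha)
    return total / trials

# With m = 1000 true nulls and alpha = 0.05 we expect about 50 false discoveries
avg = expected_false_discoveries(1000, 0.05)
print(round(avg, 1))
```

The average lands near α · m = 50: testing many hypotheses at a fixed level guarantees false discoveries.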
Multiple hypothesis testing
We want guarantees on the (expected) number of false discoveries
● Let V be the number of false discoveries
● Family-Wise Error Rate (FWER): Pr[V >= 1] (We focus on this)
○ Bonferroni correction
○ Bonferroni-Holm procedure
○ minP method (permutation test)
○ Tarone and LAMP
● False Discovery Rate (FDR): let R be the number of discoveries;
FDR = E[V / R]
○ Benjamini-Hochberg procedure
○ Benjamini-Yekutieli procedure
Bonferroni correction
● Let H be the set of hypotheses (patterns) to be tested, and m = |H|.
● Requirement: control the FWER under a significance level α
● Bonferroni rejection rule:
reject a hypothesis H_S if m · p_S <= α, i.e. p_S <= α / m (corrected p-value)
● Intuition: for each S, Pr[S is a false discovery] <= α / m,
so, by the union bound, FWER <= m · (α / m) = α
The most widely used method, simple but possibly too conservative
(the test has weak power when m is large)
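The rejection rule is a one-liner; a minimal sketch with made-up p-values:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_S iff p_S <= alpha / m; guarantees FWER <= alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

p_vals = [0.0001, 0.008, 0.012, 0.04, 0.3]
print(bonferroni_reject(p_vals))  # threshold 0.05 / 5 = 0.01
# [True, True, False, False, False]
```

Note how 0.012 and 0.04, both nominally significant at 0.05, fail the corrected threshold: this is the conservativeness mentioned above.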
Bonferroni-Holm correction
● Let H be the set of hypotheses (patterns) to be tested, and m = |H|.
● Sort the p-values in increasing order; compare the k-th smallest
against α / (m - k + 1), and stop at the first one that fails
○ e.g. for m = 5 and α = 0.05, the successive thresholds are
0.05/5, 0.05/4, 0.05/3, 0.05/2, 0.05/1
More powerful than the Bonferroni correction: α / (m - k + 1) vs α / m.
However: it still requires very small p-values when m is large.
http://www.compbio.dundee.ac.uk/user/mgierlinski/talks/p-values1/p-values8.pdf
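The step-down procedure above can be sketched as follows (helper name ours; `k` is zero-based, so the threshold α / (m - k) matches α / (m - k + 1) for one-based ranks):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: sort the p-values; compare the k-th smallest
    (k zero-based) against alpha / (m - k); stop at the first failure.
    Uniformly more powerful than Bonferroni, same FWER guarantee."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values also fail
    return reject

p_vals = [0.0001, 0.008, 0.012, 0.04, 0.3]
print(holm_reject(p_vals))
# [True, True, True, False, False]
```

On the same p-values, plain Bonferroni (threshold 0.01 for all) rejects only the first two; Holm additionally rejects 0.012, since by its turn the threshold has relaxed to 0.05/3.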
minP method
● The FWER depends only on the smallest p-value among the true null
hypotheses
○ FWER = Pr[minP of true hypotheses <= δ]
○ If we knew the distribution of ‘minP of the true hypotheses’, we could
find a significance threshold that controls the FWER
● However, this distribution is unknown ... but we can estimate it.
● Westfall-Young permutation
○ Shuffle the labels to create datasets where all null hypotheses are true
(this procedure removes all associations between items and labels)
○ Calculate the minP value for each shuffled dataset
● Take the α quantile of the obtained empirical distribution of minP
[Img: probability density of minP with its α quantile marked]
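The quantile step can be sketched as follows. For brevity this simulation (ours) draws the null rounds directly as m independent Uniform[0,1] p-values; in the actual Westfall-Young procedure each round would instead come from one label permutation of the dataset:

```python
import random

def minp_threshold(m: int, alpha: float = 0.05, rounds: int = 2000) -> float:
    """Estimate the corrected per-test threshold as the alpha-quantile of
    the min-p distribution under the null. Each round simulates a null
    dataset (here: m independent uniform p-values) and records its min-p."""
    random.seed(1)
    min_ps = sorted(min(random.random() for _ in range(m)) for _ in range(rounds))
    return min_ps[int(alpha * rounds)]  # empirical alpha-quantile

m, alpha = 100, 0.05
delta = minp_threshold(m, alpha)
print(round(delta, 5), "vs Bonferroni", alpha / m)
```

For independent uniforms the estimate lands near 1 − 0.95^(1/100) ≈ 0.00051, only slightly above Bonferroni’s 0.0005; with dependent hypotheses (e.g. overlapping patterns) the min-p quantile can be much larger, which is where the method gains power.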
Outline
● Introduction
○ Introduction to Significant Pattern Mining
○ Statistical Hypothesis Testing
○ Testing for Independence
○ Multiple Hypothesis Testing
○ Selecting Hypotheses
○ Hypothesis Testability
● Recent developments and advanced topics
Reduce the number of hypotheses?
● The corrected significance threshold of the methods introduced so far
depends on the size of the hypothesis set H
● A smaller H may lead to a higher corrected significance threshold,
and thus to higher power.
● Q: can we shrink H a posteriori?
○ I.e., can we use D to select a subset H1 of H such that the discarded
hypotheses are all non-significant? (then test only H1)
○ Yes, but you must do it ‘right’.
Example: The ‘wrong’ way
● Dataset D:
○ 10 transactions with label c1, 10 transactions with label c0
○ There are 15 items. We are interested only in patterns of size 6.
● Number of hypotheses m = C(15, 6) = 5005
● “Let’s select some hypotheses first, and then do the testing...”
○ find the pattern S with the highest value σ1(S) - σ0(S): σ1(S) - σ0(S) = 10
○ “I am going to test only pattern S!”
○ Fisher’s exact test p-value = 0.0001
Example: The ‘wrong’ way
● “S is very significant!!!” BUT IT IS NOT!
● Assume that D is generated as follows: for each pattern S
○ consider each of its 10 occurrences
○ place it in a transaction with label c0 with probability 1/2, and in a
transaction with label c1 otherwise
● S is not associated with the class labels!
● For a given S, the probability that σ1(S) = 10 and σ0(S) = 0 is (1/2)^10 =
1/1024
○ In expectation, there will be about 5005/1024 ≈ 5 patterns with
σ1(S) = 10 and σ0(S) = 0, and they are all false discoveries!
How not to select hypotheses
Do not do this:
1) Perform each individual test for each hypothesis using D.
2) Use the test results to select which hypotheses to include in H1.
3) Use your favorite MHC procedure to bound the FWER/FDR on H1.
Selecting H1 must be done without performing the tests on D.
More concisely, the selection must be independent of the p-values.
Hold out approach
1. Partition D into D1 and D2: D1 ∪ D2 = D and D1 ∩ D2 = ∅
2. Apply some selection procedure to D1 to select H1
(you can also perform the tests on D1).
3. Perform the tests for hypotheses in H1 on D2, using any MHC method.
Splitting D is similar to splitting a labeled set into training and test sets
G. Webb, Discovering Significant Patterns, Mach. Learn. 2007
● There are still 2 drawbacks:
○ We cannot utilize all of our data
○ We might falsely discard significant hypotheses (hard to evaluate)
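Step 1 is an ordinary random partition; a minimal sketch (the `holdout_split` helper is ours):

```python
import random

def holdout_split(dataset, frac=0.5, seed=0):
    """Split D into an exploratory half D1 (for selecting hypotheses)
    and a confirmatory half D2 (for testing the selected ones)."""
    rng = random.Random(seed)
    idx = list(range(len(dataset)))
    rng.shuffle(idx)
    cut = int(len(dataset) * frac)
    d1 = [dataset[i] for i in idx[:cut]]
    d2 = [dataset[i] for i in idx[cut:]]
    return d1, d2

data = list(range(20))         # stand-in for 20 transactions
d1, d2 = holdout_split(data)
print(len(d1), len(d2))        # 10 10
print(sorted(d1 + d2) == data) # True: a partition of D
```

Because D2 never touches the selection step, p-values computed on D2 are independent of how H1 was chosen, which is exactly the condition from the previous slide.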
Outline
● Introduction to Significant Pattern Mining
● Statistical Hypothesis Testing
● Testing for Independence
● Multiple Hypothesis Testing
● Selecting Hypotheses
● Hypothesis Testability
● Summary
Tarone
● Some test statistics are discrete
○ Fisher’s exact test statistic is discrete
● There is a minimum attainable p-value for a pattern S
● Example: Fisher’s exact test
○ What is the smallest p-value S can achieve?
○ minimum attainable p-value = 3 x 10^-4
Fisher’s exact test minimum p-value
Let p_S(a) be the p-value from Fisher’s exact test for pattern S
with support σ(S) and σ1(S) = a
● Note that a <= min(σ(S), n1)
● and the p-value is obtained by p_S(a) = Pr[XS >= a]
● So, the minimum attainable p-value for pattern S will be
ψ(S) = p_S(min(σ(S), n1))
NOTE: this function is monotonically decreasing in σ(S) (while σ(S) <= n1)
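Under the one-sided Fisher test from earlier, this minimum attainable p-value has a simple closed form: it is the single most extreme table, where as many occurrences of S as possible fall in class c1. A sketch (helper names are ours):

```python
from math import comb

def min_attainable_p(sigma: int, n1: int, n: int) -> float:
    """Minimum attainable one-sided Fisher p-value psi(S) for a pattern S
    with support sigma: reached when sigma1(S) = min(sigma, n1), i.e. the
    occurrences of S are as concentrated in class c1 as possible."""
    a = min(sigma, n1)  # most extreme achievable value of the test statistic
    return comb(n1, a) * comb(n - n1, sigma - a) / comb(n, sigma)

def is_testable(sigma: int, n1: int, n: int, delta: float) -> bool:
    """Tarone: S is testable at corrected level delta iff psi(S) <= delta."""
    return min_attainable_p(sigma, n1, n) <= delta

# Toy dataset with n = 8 transactions, n1 = 4 in class c1:
for sigma in range(1, 5):
    print(sigma, round(min_attainable_p(sigma, 4, 8), 4))
```

On this toy table psi drops from 0.5 (support 1) to 1/70 ≈ 0.0143 (support 4): only support-4 patterns can ever reach significance at δ = 0.05, so every lower-support pattern can be excluded from the correction factor.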
Tarone’s improved Bonferroni correction
Tarone’s result:
If you are testing hypotheses at a significance level δ, then hypotheses
that cannot be significant have no effect on the FWER!
● S cannot be significant at significance level δ if its p-value is always
larger than δ, or in other words: ψ(S) > δ
● These are called ‘untestable hypotheses’
● Set of testable hypotheses (for significance level δ): T(δ) = {S : ψ(S) <= δ}
● Rejection rule: given a statistical level α, for a value δ with
δ · |T(δ)| <= α, reject H0 if p_S <= δ (S is significant!)
● Theorem: FWER <= δ · |T(δ)| <= α
TASK: find δ*,
the largest δ that still controls the FWER.
LAMP
● However, when the data is big, δ* is hard to evaluate
○ => Need to calculate ψ(S) for every pattern (exponentially many patterns to look at)
○ => Need to calculate σ(S) (have to look at all the transactions)
● There are some properties that can speed up this calculation
○ The support of an itemset decreases as the number of items increases
○ The minimum attainable p-value increases when the support decreases
○ If an itemset A is untestable,
then any itemset containing A is also untestable
● LAMP is a method that uses these properties to quickly find the
optimal δ*
http://cbrc3.cbrc.jp/twsdm2015/pdf/tsuda.pdf
Outline
● Introduction to Significant Pattern Mining
● Statistical Hypothesis Testing
● Testing for Independence
● Multiple Hypothesis Testing
● Selecting Hypotheses
● Hypothesis Testability
● Summary
Summary
● Significant pattern mining is a framework for finding patterns with
statistical guarantees on the results
● In the KDD setting, the number of hypotheses to test is
enormous due to the number of items in the data
● This multiple testing problem is difficult in both its statistical and
computational aspects
● Traditional methods like Bonferroni or Holm are not effective
enough to deal with problems at this scale
● Recent developments are mostly based on Tarone’s result, which
removes untestable hypotheses from the candidate pool
More Related Content

Similar to Hypothesis testing and statistically sound-pattern mining

Introduction to Quantitative Research Methods
Introduction to Quantitative Research MethodsIntroduction to Quantitative Research Methods
Introduction to Quantitative Research Methods
Iman Ardekani
 
SMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignment
SMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignmentSMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignment
SMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignment
rahul kumar verma
 

Similar to Hypothesis testing and statistically sound-pattern mining (20)

Introduction to Quantitative Research Methods
Introduction to Quantitative Research MethodsIntroduction to Quantitative Research Methods
Introduction to Quantitative Research Methods
 
Learn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelLearn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic Model
 
ISSTA'16 Summer School: Intro to Statistics
ISSTA'16 Summer School: Intro to StatisticsISSTA'16 Summer School: Intro to Statistics
ISSTA'16 Summer School: Intro to Statistics
 
2주차
2주차2주차
2주차
 
Chapter 9 Fundamental of Hypothesis Testing.ppt
Chapter 9 Fundamental of Hypothesis Testing.pptChapter 9 Fundamental of Hypothesis Testing.ppt
Chapter 9 Fundamental of Hypothesis Testing.ppt
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learning
 
Two Proportions
Two Proportions  Two Proportions
Two Proportions
 
Hypothesis testing interview
Hypothesis testing interviewHypothesis testing interview
Hypothesis testing interview
 
hypothesis test
 hypothesis test hypothesis test
hypothesis test
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
 
Regression shrinkage: better answers to causal questions
Regression shrinkage: better answers to causal questionsRegression shrinkage: better answers to causal questions
Regression shrinkage: better answers to causal questions
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Setting up an A/B-testing framework
Setting up an A/B-testing frameworkSetting up an A/B-testing framework
Setting up an A/B-testing framework
 
Testing a claim about a mean
Testing a claim about a mean  Testing a claim about a mean
Testing a claim about a mean
 
Chi square analysis
Chi square analysisChi square analysis
Chi square analysis
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
 
06StatisticalInference.pptx
06StatisticalInference.pptx06StatisticalInference.pptx
06StatisticalInference.pptx
 
Step Up Your Survey Research - Dawn of the Data Age Lecture Series
Step Up Your Survey Research - Dawn of the Data Age Lecture SeriesStep Up Your Survey Research - Dawn of the Data Age Lecture Series
Step Up Your Survey Research - Dawn of the Data Age Lecture Series
 
SMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignment
SMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignmentSMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignment
SMU DRIVE SPRING 2017 MBA 103- Statistics for Management solved free assignment
 
Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)
 

More from Thien Q. Tran

More from Thien Q. Tran (6)

LLM Threats: Prompt Injections and Jailbreak Attacks
LLM Threats: Prompt Injections and Jailbreak AttacksLLM Threats: Prompt Injections and Jailbreak Attacks
LLM Threats: Prompt Injections and Jailbreak Attacks
 
Introduction to FAST-LAMP
Introduction to FAST-LAMPIntroduction to FAST-LAMP
Introduction to FAST-LAMP
 
Finding statistically significant interactions between continuous features (I...
Finding statistically significant interactions between continuous features (I...Finding statistically significant interactions between continuous features (I...
Finding statistically significant interactions between continuous features (I...
 
Introduction to TCAV (ICML2018)
Introduction to TCAV (ICML2018)Introduction to TCAV (ICML2018)
Introduction to TCAV (ICML2018)
 
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
 
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

Hypothesis testing and statistically sound-pattern mining

  • 1. Hypothesis Testing and Statistically-sound Pattern Mining Tran Quang Thien This slide mainly based on the tutorial “Hypothesis testing and statistically-sound Pattern Mining” by Leonardo Pellegrina, Matteo Riondato and Fabio Vandin at KDD2019.
  • 2. Outline ● Introduction to Significant Pattern mining ● Statistical Hypothesis Testing ● Testing for independence ● Multiple Hypothesis testing ● Selecting Hypothesis ● Hypothesis Testability ● Summary
  • 3. Introduction Data mining is the process of extracting patterns and knowledge from data https://www.guru99.com/data-mining-tutorial.html
  • 4. The ideal and the real worlds ● Knowledge extracted from our data might not match the real patterns ● Our data is just a sample from the population of interest ○ Typically only a very small proportion of the total population ○ May contain noise Img: A Tutorial on Statistically Sound Pattern Discovery, Wilhelmiina Hämäläinen and Geoffrey I. Webb
  • 5. Limitation of most data mining methods ● Most data mining methods don’t give a guarantee on results ○ It is not clear how reliable the result is ○ This limitation constrains the use in many applied fields such as bioinformatics, medical research ... Img: A Tutorial on Statistically Sound Pattern Discovery, Wilhelmiina Hämäläinen and Geoffrey I. Webb
  • 6. A traditional method to offer reliability How do we efficiently identify patterns in data with guarantees on the underlying generative process? We use the statistical hypothesis testing framework
  • 7. In other words ● Typically, data mining and (inferential) statistics have traditionally two different point of views ○ data mining: the data is the complete representation of the world and of the phenomena we are studying ○ statistics: the data is obtained from an underlying generative process, that is what we really care about ● Data: information from two online communities C1 and C2, regarding whether each post is in a given topic T. ○ Data mining: “what fraction of posts in C1 are related to T? What fraction of posts in C2 are related to T?” ○ Statistics: “What is the probability that a post from C1 is related to T? What is the probability that a post from C2 is related to T?” >> Similar questions but different flavours!
  • 8. Outline ● Introduction to Significant Pattern mining ● Statistical Hypothesis Testing ● Testing for independence ● Multiple Hypothesis testing ● Selecting Hypothesis ● Hypothesis Testability ● Summary
  • 9. Statistical Hypothesis Testing ● The process of testing some hypothesis of interest ○ Sometimes called confirmatory data analysis ● The test tells the analyst whether or not his hypothesis is true ○ Hypothesis that passed the test will have some statistically guarantee ● Widely used as a gate-keeper for science publications Hypothesis A Hypothesis B ……. Hypothesis D Hypothesis E Publish Publish Statistical Hypothesis Testing Data Significant?
  • 10. A real world example Problem: checking whether a coin is biased (prob. of head is not 0.5) ● Make the hypotheses ○ Null hypothesis H0: P(head) = 0.5 ○ Alternative hypothesis H1: P(head) != 0.5 ● Choose the proper test (z-test or t-test in this case) ○ In fact, this is not simple, since there are many types of tests ○ and each test has a different purpose and different assumptions … ● Collect data for testing ○ For example, we flip the coin 100 times and count the heads (n = 100) ● Conduct the test to decide whether to reject the null hypothesis (and accept the alternative hypothesis)
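The coin check above can be sketched with an exact two-sided binomial test. This is a minimal stdlib sketch; the slide mentions a z-test or t-test, and in practice a library routine such as `scipy.stats.binomtest` would be the usual choice.

```python
# Exact two-sided binomial test for H0: P(head) = p0 (pure stdlib sketch).
from math import comb

def binom_pvalue_two_sided(heads: int, n: int, p0: float = 0.5) -> float:
    """Sum the probability of every outcome that is at most as likely as
    the observed one under H0 (the usual 'exact two-sided' definition)."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    observed = pmf[heads]
    return sum(p for p in pmf if p <= observed + 1e-12)

# 100 flips: 50 heads is perfectly consistent with a fair coin,
# 60 heads is borderline at the usual alpha = 0.05.
print(binom_pvalue_two_sided(50, 100))  # ~1.0   -> do not reject H0
print(binom_pvalue_two_sided(60, 100))  # ~0.057 -> do not reject at alpha = 0.05
```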
  • 11. Statistical Hypothesis Testing (in pattern mining) ● We are given: ○ a dataset D ○ a question we want to answer => a pattern S to consider ● EXAMPLE ○ Dataset D: 1000 patients using drug S (cases), did they get cured (YES/NO); 1000 patients not using drug S (controls), did they get cured (YES/NO) ○ Question: does drug S have an effect?
  • 12. Example: market basket analysis ● Dataset D: transactions = set of items, label (student / professor) ● Pattern S: a subset of items (orange, tomato, broccoli) ● Question: is S associated with one of the two labels? ○ Is being a professor associated with buying oranges, tomatoes and broccoli?
  • 13. Example: genomic analysis ● Dataset D: transactions = set of genes, label (cancer / not cancer) ● Pattern S: subset of genes (Gene 2, Gene 100) ● Question: is S associated with one of the two labels? ○ Is having genes 2 and 100 associated with having cancer? (Table: a binary person × gene matrix over genes 1 to 10000; each of the 500 persons is labeled ‘cancer’ or ‘not cancer’)
  • 14. Example: Subgraph mining http://cbrc3.cbrc.jp/twsdm2015/pdf/sugiyama.pdf Question: Which subgraphs are associated with the disease being active/inactive?
  • 18. Statistical Hypothesis Testing ● Frame the question in terms of a null hypothesis H0 ○ The hypothesis corresponds to “nothing interesting” for pattern S ○ e.g., H0: having genes 2 and 100 is independent of having cancer ● The goal is to use the data to either ○ reject H0 (“S is interesting!”) → new discovery ○ or not reject H0 (“S is not interesting”). ● This is decided based on a test statistic xS = fS(D), which is a summary of S in D ○ e.g., the empirical fraction of heads among the flips ○ e.g., the number of patients having pattern S who got cancer
  • 19. Statistical Hypothesis Testing: p-value ● Let xS = fS(D) be the value of the test statistic for our dataset D. ● Let XS be the random variable describing the value of the test statistic under the null hypothesis H0 (i.e., when H0 is true) ● p-value = Pr[XS more extreme than xS | H0 is true] ○ “XS more extreme than xS”: depends on the test ○ it may be XS >= xS, or XS <= xS, or something else. . . ● Rejection rule: ○ Given a statistical level α in (0, 1) (the critical level; 0.05 or 0.01 are widely used) ○ Reject H0 if p-value <= α (S is significant! new discovery!) https://scientistseessquirrel.wordpress.com/2015/02/09/n-defence-of-the-p-value/
  • 20. Statistical Hypothesis Testing: Errors ● There are two types of errors we can make: ○ type I error: reject H0 when H0 is true => flag S as significant when it is not (false discovery) ○ type II error: do not reject H0 when H0 is false => do not flag S as significant when it is ● Using the rejection rule, the probability of a type I error is at most α ○ Hypothesis testing thus offers us a guarantee on false discoveries
  • 21. Statistical Hypothesis Testing: Errors Guarantees ● Type I errors and type II errors trade off against each other ○ No type I errors if we never flag a pattern as significant ○ No type II errors if we always flag every pattern as significant ● Power of a test ○ Power of the test = Pr[H0 is rejected | H0 is false] ○ Not easy to evaluate (we don’t know the truth) ○ Pr[type II error] = 1 − power ● A lower critical level makes the test stricter ○ Fewer false discoveries, but more misses
  • 22. Outline ● Introduction to Significant Pattern mining ● Statistical Hypothesis Testing ● Testing for independence ● Multiple Hypothesis testing ● Selecting Hypothesis ● Hypothesis Testability ● Summary
  • 23. Example: Testing for independence ● An important test for pattern mining ● Given: ○ Dataset D = {t1, . . . , tn} contains n transactions ○ Each transaction ti has a label l(ti) in {c0, c1} ○ A pattern S ● Goal: ○ Understand whether the appearance of S and the transaction labels l(ti) are independent ● Statistical hypothesis testing ○ Null hypothesis H0: the events “S in ti” and “l(ti) = c1” are independent ○ Alternative hypothesis H1: there is a dependency between “S in ti” and “l(ti) = c1”
  • 24. Example: market basket analysis ● S = {orange, tomato, broccoli} ● H0: S is independent of (not associated with) label “professor”
  • 25. Example: Testing for Independence Useful representation of the data: contingency table ● σ1(S) = support of S with label c1 in D ● σ0(S) = support of S with label c0 in D ● σ(S) = σ1(S) + σ0(S) = support of S in D ● n1 = number of transactions with label c1 ● n0 = number of transactions with label c0
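As a minimal sketch, the contingency-table entries can be computed directly from a labeled transaction dataset. The toy transactions and labels below are made up so that they reproduce the (σ1, σ0, n1, n0) = (3, 1, 4, 4) table used in the following slides.

```python
# Build the contingency-table counts for a pattern S from labeled transactions.
def contingency(transactions, labels, S, c1):
    """Return (sigma1, sigma0, n1, n0) for pattern S and positive label c1."""
    S = set(S)
    sigma1 = sum(1 for t, l in zip(transactions, labels) if S <= t and l == c1)
    sigma0 = sum(1 for t, l in zip(transactions, labels) if S <= t and l != c1)
    n1 = sum(1 for l in labels if l == c1)
    return sigma1, sigma0, n1, len(labels) - n1

# Made-up toy data: 4 professors, 4 students.
transactions = [{"orange", "tomato", "broccoli"},
                {"orange", "tomato", "broccoli", "bread"},
                {"orange", "tomato", "broccoli", "milk"},
                {"milk"},
                {"orange", "broccoli", "tomato"},
                {"bread"}, {"bread", "milk"}, {"tomato"}]
labels = ["professor"] * 4 + ["student"] * 4
print(contingency(transactions, labels,
                  {"orange", "tomato", "broccoli"}, "professor"))  # (3, 1, 4, 4)
```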
  • 26. Example: Testing for Independence ● Value of the test statistic: xS = σ1(S) = 3 ● p-value: how do we compute it? ○ Fisher’s exact test ○ χ² test
  • 27. Fisher’s exact test ● Assumption: the column marginals and the row marginals are fixed ● Null distribution: under the null hypothesis (independence) ○ the support of S with label c1 follows hypergeometric(n, n1, σ(S)) ● The p-value is then computed from this distribution
  • 28. Example ● XS follows a hypergeometric distribution with parameters (n, n1, σ(S)) = (8, 4, 4) ● Probability of the observed table = 16/70 ≈ 0.229 ● p-value = Pr[XS >= 3] = 17/70 ≈ 0.243 ● For α = 0.05, p-value > α => S is not associated with label ‘professor’
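The numbers above can be reproduced with a few lines of stdlib Python, summing hypergeometric probabilities to get the one-sided Fisher p-value (a sketch, not a production implementation):

```python
# One-sided Fisher's exact test via the hypergeometric distribution.
from math import comb

def hypergeom_pmf(k, n, n1, sigma):
    """P[X = k] when X ~ hypergeometric(n, n1, sigma)."""
    return comb(n1, k) * comb(n - n1, sigma - k) / comb(n, sigma)

def fisher_pvalue(sigma1, n, n1, sigma):
    """One-sided p-value: P[X >= sigma1]."""
    return sum(hypergeom_pmf(k, n, n1, sigma)
               for k in range(sigma1, min(sigma, n1) + 1))

# The example table: n = 8, n1 = 4, sigma(S) = 4, observed sigma1(S) = 3.
print(round(hypergeom_pmf(3, 8, 4, 4), 3))   # 0.229: probability of the table
print(round(fisher_pvalue(3, 8, 4, 4), 3))   # 0.243 > 0.05 -> not significant
```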
  • 29. χ² test ● In the old days: “Fisher’s exact test is computationally expensive...” ○ We have to calculate many binomial coefficients ● Use the χ² test instead, which is an asymptotic version of Fisher’s test ○ The test statistic is easier to compute ○ The test statistic follows the χ² distribution when n is large ○ There are also tables for looking up the p-value ○ But it is still just an asymptotic approximation http://uregina.ca/~gingrich/appchi.pdf
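For a 2×2 table the χ² statistic has a closed form, and with 1 degree of freedom its survival function is erfc(sqrt(x/2)), so no table is needed. The sketch below (pure stdlib; `scipy.stats.chi2_contingency` is the usual library route) runs it on the same table as the Fisher example, showing how rough the asymptotic answer is on such a tiny sample:

```python
# Chi-squared test of independence for a 2x2 table [[a, b], [c, d]].
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Return (statistic, asymptotic p-value) without continuity correction."""
    n = a + b + c + d
    stat = 0.0
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        exp = row * col / n              # expected count under independence
        stat += (obs - exp) ** 2 / exp
    return stat, erfc(sqrt(stat / 2))    # survival function of chi2 with 1 df

stat, p = chi2_2x2(3, 1, 1, 3)
print(stat, p)  # 2.0, ~0.157 -- noticeably different from Fisher's exact 0.243
```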
  • 30. Outline ● Introduction to Significant Pattern mining ● Statistical Hypothesis Testing ● Testing for independence ● Multiple Hypothesis testing ● Selecting Hypothesis ● Hypothesis Testability ● Summary
  • 31. Pattern mining and statistical hypothesis testing ● Previous part: we had one pattern we were interested in ● In the Knowledge Discovery and Data Mining setting: we have to consider multiple hypotheses given by our dataset D ● We want to find the interesting patterns (we don’t know which patterns to test a priori)
  • 32. Fourth paradigm: Data-intensive science ● Traditional (“classical”) workflow, when it was costly to collect data: hypothesis determining → data collection → hypothesis testing ● Data-intensive workflow, now that more data is available: data collection → hypothesis determining → hypothesis testing ● Big data: many variables, a huge number of patterns, a huge number of tests to conduct
  • 33. Fourth paradigm: Data-intensive science For every pattern S that appears in the data, we test H0: “pattern S and ‘professor’ are independent” against H1: “pattern S and ‘professor’ are associated”, using the contingency table: contains pattern S: A (professor), B (student), row total n; does not contain pattern S: C (professor), D (student), row total N − n; column totals n(T) and N − n(T); overall total N
  • 34. Multiple hypothesis testing So what happens if we use the rejection rule from before? ● Let m be the number of hypotheses to test ● Proposition: E[num. false discoveries] can be as large as m · α ● m is extremely high in the problem above ○ 1000 variables => 2^1000 − 1 possible patterns to consider ○ A large m makes the expected number of false discoveries very high
  • 35. Multiple hypothesis testing We want guarantees on the (expected) number of false discoveries ● Let V be the number of false discoveries ● Family-Wise Error Rate (FWER): Pr[V >= 1] (we focus on this) ○ Bonferroni correction ○ Bonferroni-Holm procedure ○ minP method (permutation test) ○ Tarone and LAMP ● False Discovery Rate (FDR): letting R be the number of discoveries, FDR = E[V / R] ○ Benjamini-Hochberg procedure ○ Benjamini-Yekutieli procedure
  • 36. Bonferroni correction ● Let H be the set of hypotheses (patterns) to be tested, and m = |H|. ● Requirement: control the FWER under a significance level α ● Bonferroni rejection rule: reject a hypothesis HS if m · pS <= α (corrected p-value), i.e., if pS <= α/m ● Intuition: for each S, Pr[S is a false discovery] <= α/m, so by the union bound FWER <= m · α/m = α ● The most widely used method; simple, but might be too conservative (the test has weak power when m is large)
  • 37. Bonferroni-Holm correction ● Let H be the set of hypotheses (patterns) to be tested, and m = |H|. ● Sort the p-values in increasing order and compare the i-th smallest with α/(m − i + 1), rejecting until the first failure (e.g., for m = 5 and α = 0.05: 0.05/5, 0.05/4, 0.05/3, 0.05/2, 0.05/1) ● More powerful than the Bonferroni correction: thresholds α/(m − i + 1) vs. α/m. However: it still requires very small p-values when m is large. http://www.compbio.dundee.ac.uk/user/mgierlinski/talks/p-values1/p-values8.pdf
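Both rejection rules can be sketched in a few lines; the p-values below are made up for illustration, and the example shows Holm rejecting one hypothesis that Bonferroni misses:

```python
# Bonferroni vs. Bonferroni-Holm rejection rules over a list of p-values.
def bonferroni(pvalues, alpha=0.05):
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def holm(pvalues, alpha=0.05):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (m - step):   # alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                              # stop at the first non-rejection
    return reject

pvals = [0.001, 0.011, 0.02, 0.04, 0.6]        # made-up p-values
print(bonferroni(pvals))  # only 0.001 passes alpha/5 = 0.01
print(holm(pvals))        # 0.001 and 0.011 pass (0.011 <= 0.05/4 = 0.0125)
```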
  • 38. minP method ● The FWER depends only on the smallest p-value among the true null hypotheses ○ FWER = Pr[minP of true hypotheses <= significance threshold] ○ If we knew the distribution of the ‘minP of true hypotheses’, we could pick a good significance threshold to control the FWER ● However, this distribution is unknown … but we can estimate it. ● Westfall-Young permutation ○ Shuffle the labels to create datasets where all null hypotheses are true (this procedure removes every association between items and labels) ○ Calculate the minP value for each shuffled dataset ● Use the α-quantile of the obtained distribution of minP as the corrected significance threshold
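A rough sketch of the Westfall-Young estimate of the minP threshold. The random dataset, the choice of single items as "patterns", and the use of one-sided Fisher p-values are all illustrative assumptions:

```python
# Westfall-Young permutation sketch: estimate the alpha-quantile of minP.
import random
from math import comb

def fisher_pvalue(sigma1, n, n1, sigma):
    """One-sided Fisher p-value P[X >= sigma1], X ~ hypergeometric(n, n1, sigma)."""
    return sum(comb(n1, k) * comb(n - n1, sigma - k) / comb(n, sigma)
               for k in range(sigma1, min(sigma, n1) + 1))

def min_pvalue(matrix, labels):
    """Smallest Fisher p-value over all single-item 'patterns'."""
    n, n1 = len(labels), sum(labels)
    pvals = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        sigma = sum(col)
        sigma1 = sum(c for c, l in zip(col, labels) if l == 1)
        pvals.append(fisher_pvalue(sigma1, n, n1, sigma))
    return min(pvals)

rng = random.Random(0)
matrix = [[rng.randint(0, 1) for _ in range(8)] for _ in range(40)]  # toy data
labels = [1] * 20 + [0] * 20

minps = []
for _ in range(500):                    # Westfall-Young permutations
    rng.shuffle(labels)                 # break every item/label association
    minps.append(min_pvalue(matrix, labels))
threshold = sorted(minps)[int(0.05 * len(minps))]  # alpha-quantile of minP
print(threshold)
```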
  • 39. Outline ● Introduction ○ Introduction to Significant Pattern mining ○ Statistical Hypothesis Testing ○ Testing for independence ○ Multiple Hypothesis testing ○ Selecting Hypothesis ○ Hypothesis Testability ● Recent developments and advanced topics
  • 40. Reduce the number of hypotheses? ● The corrected significance threshold of the methods introduced above depends on the size of the hypothesis set H ● A smaller H may lead to a higher corrected significance threshold, and thus to higher power. ● Q: can we shrink H a posteriori? ○ I.e., can we use D to select H' ⊆ H such that H \ H' only contains non-significant hypotheses? (and then test only H') ○ Yes, but you must do it ‘right’.
  • 41. Example: The ‘wrong’ way ● Dataset D: ○ 10 transactions with label c1, 10 transactions with label c0 ○ There are 15 items. We are interested only in patterns of size 6. ● Number of hypotheses m = C(15, 6) = 5005 ● “Let’s select some hypotheses first, and then do the testing...” ○ Find the pattern S with the highest value of σ1(S) − σ0(S): here σ1(S) − σ0(S) = 10 ○ “I am going to test only pattern S!” ○ Fisher’s exact test p-value = 0.0001
  • 42. Example: The ‘wrong’ way ● “S is very significant!!!” BUT IT IS NOT! ● Assume that D is generated as follows: for each pattern S ○ consider each of its 10 occurrences ○ place it in a transaction with label c1 with probability 1/2, and in a transaction with label c0 otherwise ● S is not associated with the class labels! ● For a given S, the probability that σ1(S) = 10 and σ0(S) = 0 is (1/2)^10 = 1/1024 ○ In expectation, about 5 patterns (5005/1024 ≈ 4.9) will have σ1(S) = 10 and σ0(S) = 0, and they are all false discoveries!
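A quick back-of-the-envelope check of the expectation above:

```python
# Expected number of patterns with sigma1(S) = 10 and sigma0(S) = 0 under
# the random (null) generative process described in the example.
from math import comb

m = comb(15, 6)            # 5005 size-6 patterns
p_extreme = 0.5 ** 10      # all 10 occurrences land in class c1
expected = m * p_extreme
print(m, expected)         # 5005, ~4.9 expected false discoveries
```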
  • 43. How not to select hypotheses Do not do this: 1) Perform each individual test for each hypothesis using D. 2) Use the test results to select which hypotheses to include in H1. 3) Use your favorite MHC procedure to bound the FWER/FDR on H1. Selecting H1 must be done without performing the tests on D. More concisely, the selection must be independent of the p-values.
  • 44. Hold-out approach 1. Partition D into D1 and D2: D1 ∪ D2 = D and D1 ∩ D2 = ∅ 2. Apply some selection procedure to D1 to select H1 (you can also perform the tests on D1). 3. Perform the tests for the hypotheses in H1 on D2, using any MHC method. Splitting D is similar to splitting a labeled set into training and test sets. G. Webb, Discovering Significant Patterns, Mach. Learn. 2007 ● There are still 2 drawbacks: ○ We cannot utilize all of our data ○ We might falsely remove significant hypotheses (hard to evaluate)
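A minimal sketch of the hold-out flow. The toy data, the support-difference score, and the placeholder p-value function are illustrative assumptions, not part of the original method:

```python
# Hold-out sketch: select hypotheses on D1, test only those on D2 (Bonferroni).
import random

def holdout_test(data, hypotheses, score, pvalue, alpha=0.05, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    d1, d2 = shuffled[: len(data) // 2], shuffled[len(data) // 2:]
    # 1) select on D1 only (looking at D1 is allowed here)
    selected = [h for h in hypotheses if score(h, d1) > 0]
    # 2) test the selected hypotheses on D2, Bonferroni-corrected over |selected|
    m = max(len(selected), 1)
    return [h for h in selected if pvalue(h, d2) <= alpha / m]

# Toy usage: hypotheses are item indices, score = support difference between
# classes, and "fake_p" is a stand-in p-value function just to exercise the flow.
data = [([1, 0, 1], 1), ([1, 1, 0], 1), ([0, 0, 1], 0), ([0, 1, 1], 0)] * 5
support_diff = lambda h, d: (sum(t[h] for t, l in d if l == 1)
                             - sum(t[h] for t, l in d if l == 0))
fake_p = lambda h, d: 0.5
print(holdout_test(data, [0, 1, 2], support_diff, fake_p))  # [] (0.5 never passes)
```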
  • 45. Outline ● Introduction to Significant Pattern mining ● Statistical Hypothesis Testing ● Testing for independence ● Multiple Hypothesis testing ● Selecting Hypothesis ● Hypothesis Testability ● Summary
  • 46. Tarone ● Some test statistics are discrete ○ Fisher’s exact test statistic is discrete ● Hence there is a minimum attainable p-value for a pattern S ● Example: Fisher’s exact test ○ What is the smallest p-value S can ever attain? ○ minimum attainable p-value = 3 × 10^-4
  • 47. Fisher’s exact test minimum p-value Let pS be the p-value from Fisher’s exact test for pattern S, with support σ(S) and σ1(S) ● Note that σ1(S) <= min(σ(S), n1) ● and the p-value is obtained by summing the hypergeometric tail from σ1(S) upward ● So the minimum attainable p-value ψ(σ(S)) for pattern S is reached when all of its occurrences fall in class c1, i.e., σ1(S) = σ(S) NOTE: this function is monotonically decreasing in σ(S)
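Under these definitions the minimum attainable p-value collapses to a single hypergeometric term; the sketch below covers the case σ(S) <= n1 and uses the (n, n1) = (8, 4) setting from the earlier example:

```python
# Minimum attainable one-sided Fisher p-value: all sigma(S) occurrences in c1,
# so P[X >= sigma] has a single term, C(n1, sigma) / C(n, sigma) (sigma <= n1).
from math import comb

def psi(sigma, n, n1):
    """Minimum attainable Fisher p-value for a pattern with support sigma."""
    return comb(n1, sigma) / comb(n, sigma)

# Monotonically decreasing in sigma(S): more support -> smaller attainable p-value.
vals = [psi(s, 8, 4) for s in range(1, 5)]
print(vals)  # psi(4, 8, 4) = 1/70
```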
  • 48. Tarone’s improved Bonferroni correction Tarone’s result: if you are testing hypotheses with a significance level δ, then hypotheses that can never be significant have no effect on the FWER! ● S cannot be significant at significance level δ if its p-value is always larger than δ, in other words: ψ(σ(S)) > δ ● These are called ‘untestable hypotheses’ ● Set of testable hypotheses (for significance level δ): T(δ) = {S : ψ(σ(S)) <= δ} ● Rejection rule: given a statistical level α, for a value δ reject H0 for S if pS <= δ (S is significant!) ● Theorem: FWER <= |T(δ)| · δ ● TASK: find the largest δ that still controls the FWER, i.e., with |T(δ)| · δ <= α.
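The TASK can be sketched as Tarone's classic search: find the smallest k such that at most k hypotheses are testable at level α/k, then use α/k as the corrected threshold. The list of minimum attainable p-values below is made up for illustration:

```python
# Tarone's search for the corrected significance threshold.
def tarone_threshold(min_pvalues, alpha=0.05):
    """Return (threshold, num_testable) with num_testable * threshold <= alpha."""
    k = 1
    while True:
        testable = sum(1 for psi in min_pvalues if psi <= alpha / k)
        if testable <= k:           # FWER <= testable * (alpha/k) <= alpha
            return alpha / k, testable
        k += 1                      # terminates: for k >= m, testable <= m <= k

psis = [0.001, 0.002, 0.01, 0.2, 0.5]   # made-up minimum attainable p-values
print(tarone_threshold(psis))           # (0.05/3, 3): three testable hypotheses
```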
  • 49. LAMP ● However, when the data is big, it is hard to evaluate |T(δ)| ○ => need to calculate ψ(σ(S)) for every candidate pattern (up to 2^|items| − 1 patterns to look at) ○ => need to calculate σ(S) (have to look at all the transactions) ● There are some properties that can speed up this computation ○ The support of an itemset decreases when the number of items increases ○ The minimum attainable p-value increases when the support decreases ○ If an itemset A is untestable, then any superset of A is also untestable ● LAMP is a method that uses these properties to quickly find the optimal δ http://cbrc3.cbrc.jp/twsdm2015/pdf/tsuda.pdf
  • 51. Outline ● Introduction to Significant Pattern mining ● Statistical Hypothesis Testing ● Testing for independence ● Multiple Hypothesis testing ● Selecting Hypothesis ● Hypothesis Testability ● Summary
  • 52. Summary ● Significant pattern mining is a framework for finding patterns with statistical guarantees on the results ● In the KDD setting, the number of hypotheses to test is enormously large due to the number of items in the data ● This multiple testing problem is difficult in both its statistical and computational aspects ● Traditional methods like Bonferroni or Holm are not effective enough to deal with the scale of the problem ● Recent developments are mostly based on Tarone’s result, which removes untestable hypotheses from the candidate pool