The availability of large datasets has led to the need for methods that can extract patterns from data while providing statistical guarantees on the quality of the results, in particular with respect to false discoveries. In this talk, we first introduce the fundamental concepts behind statistical hypothesis testing. We then explain the computational and statistical challenges in statistically-sound pattern mining and how they have been tackled. Finally, we show some applications of these methods in areas such as subgraph mining, social network analysis, basket analysis, and cancer genomics.
1. Hypothesis Testing and
Statistically-sound Pattern Mining
Tran Quang Thien
These slides are mainly based on the tutorial “Hypothesis Testing and Statistically-sound Pattern Mining”
by Leonardo Pellegrina, Matteo Riondato and Fabio Vandin at KDD 2019.
3. Introduction
Data mining is the process of extracting patterns and
knowledge from data
https://www.guru99.com/data-mining-tutorial.html
4. The ideal and the real worlds
● Knowledge extracted from our data might not match the real patterns
● Our data is just a sample from the population of interest
○ Typically only a very small proportion of the total population
○ May contain noise
Img: A Tutorial on Statistically Sound Pattern Discovery, Wilhelmiina Hämäläinen and Geoffrey I. Webb
5. Limitations of most data mining methods
● Most data mining methods give no guarantees on their results
○ It is not clear how reliable the results are
○ This limitation constrains their use in many applied fields such as
bioinformatics, medical research ...
Img: A Tutorial on Statistically Sound Pattern Discovery, Wilhelmiina Hämäläinen and Geoffrey I. Webb
6. A traditional method to offer reliability
How do we efficiently identify patterns in data with guarantees on the
underlying generative process?
We use the statistical hypothesis testing framework
7. In other words
● Data mining and (inferential) statistics traditionally take
two different points of view
○ data mining: the data is the complete representation of the world and of
the phenomena we are studying
○ statistics: the data is obtained from an underlying generative process,
that is what we really care about
● Data: information from two online communities C1 and C2, regarding
whether each post is in a given topic T.
○ Data mining: “what fraction of posts in C1 are related to T? What fraction
of posts in C2 are related to T?”
○ Statistics: “What is the probability that a post from C1 is related to T?
What is the probability that a post from C2 is related to T?”
>> Similar questions but different flavours!
9. Statistical Hypothesis Testing
● The process of testing some hypothesis of interest
○ Sometimes called confirmatory data analysis
● The test tells the analyst whether the data supports the hypothesis
○ A hypothesis that passes the test comes with a statistical guarantee
● Widely used as a gate-keeper for science publications
[Diagram] Hypotheses A, B, ..., E and the data enter statistical hypothesis testing; each hypothesis found significant gets published.
10. A real world example
Problem: Checking if a coin is biased (prob. of head is not 0.5)
● Make the hypothesis
○ Null hypothesis H0: P(head) = 0.5
○ Alternative hypothesis H1: P(head) != 0.5
● Choose the proper test (z-test or t-test in this case)
○ In fact, this is not simple since there are many types of tests
○ Each test has a different purpose, different assumptions …
● Collect data for testing
○ For example, we flip the coin 100 times and count the number of heads (n = 100)
● Conduct the test to decide whether to reject the null hypothesis
(and accept the alternative hypothesis)
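The steps above can be sketched in code. This is a minimal, self-contained illustration (not from the talk): an exact two-sided binomial test stands in for the z/t-test, since it is easy to implement with `math.comb`; the observed count of 60 heads is a made-up example.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(head) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes at most as likely as the observed one."""
    p_obs = binom_pmf(k, n, p)
    return sum(binom_pmf(i, n, p) for i in range(n + 1)
               if binom_pmf(i, n, p) <= p_obs + 1e-12)

# H0: P(head) = 0.5.  Suppose we observe 60 heads in n = 100 flips.
p_value = binom_test_two_sided(60, 100)
print(f"p-value = {p_value:.4f}")   # ~0.057: do not reject H0 at alpha = 0.05
```

With 60 heads out of 100 the p-value is just above 0.05, so the null hypothesis of a fair coin is (narrowly) not rejected.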
11. Statistical Hypothesis Testing (in pattern mining)
● We are given:
○ a dataset D
○ a question we want to answer => a pattern S to consider
● EXAMPLE
○ Dataset D
1000 patients using drug S (cases), did they get cured (YES/NO)
1000 patients not using drug S (controls), did they get cured (YES/NO)
○ Question: does drug S have an effect?
12. Example: market basket analysis
● Dataset D: transactions = set of items, label (student / professor)
● Pattern S: a subset of items (orange, tomato, broccoli)
● Question: is S associated with one of the two labels?
○ Is being a professor related to buying orange, tomato and broccoli?
13. Example: genomic analysis
● Dataset D: transactions = set of genes, label (cancer / not cancer)
● Pattern S: subset of genes (Gene 2, Gene 100)
● Question: is S associated with one of the two labels?
○ Is having gene 2 and gene 100 related to having cancer?
Gene:        1   2  ...  10000   Label
Person 1     1   0  ...    0     Cancer
Person 2     0   1  ...    0     Not cancer
...
Person 500   0   1  ...    1     Cancer
16. Example: Subgraph mining
Question: which subgraphs are associated with the disease being active/inactive?
http://cbrc3.cbrc.jp/twsdm2015/pdf/sugiyama.pdf
18. Statistical Hypothesis Testing
● Frame the question in terms of a null hypothesis H0
○ The hypothesis corresponds to “nothing interesting” for pattern S
○ e.g. H0: having genes 2 and 100 and having cancer are independent
● The goal is to use the data to either
○ reject H0 (“S is interesting!”) → New discovery
○ or not reject H0 (“S is not interesting”).
● This is decided based on a test statistic xS = fS(D) which is the
summary of S in D
○ ex: the empirical frequency of heads over the flips
○ ex: the number of patients having gene set S who got cancer
19. Statistical Hypothesis Testing: p-value
● Let xS = fS(D) be the value of the test statistic for our dataset D.
● Let XS be the random variable describing the value of the test statistic
under the null hypothesis H0 (i.e., when H0 is true)
● p-value = Pr[XS more extreme than xS | H0 is true]
○ “XS more extreme than xS”: depends on the test
○ it may be XS >= xS, or XS <= xS, or something else . . .
● Rejection rule:
○ Given a significance level α in (0, 1); 0.05 or 0.01 are widely used
○ Reject H0 if p-value <= α (S is significant! new discovery!)
https://scientistseessquirrel.wordpress.com/2015/02/09/n-defence-of-the-p-value/
20. Statistical Hypothesis Testing: Errors
● There are two types of errors we can make:
○ type I error: reject H0 when H0 is true
=> flag S as significant when it is not (false discovery)
○ type II error: do not reject H0 when H0 is false
=> do not flag S as significant when it is
● Using the rejection rule, the probability of a type I error is at most α
○ Hypothesis testing thus offers a guarantee on false discoveries
21. Statistical Hypothesis Testing: Errors Guarantees
● Type I and type II errors trade off against each other
○ No type I errors if we never flag a pattern as significant
○ No type II errors if we always flag every pattern as significant
● Power of a test
○ Power of the test = Pr[H0 is rejected | H0 is false] = 1 − β
○ Not easy to evaluate (we don’t know the truth)
○ Pr[type II error] = β = 1 − power
● A lower critical level α makes the test stricter
○ Fewer false discoveries, but more misses
23. Example: Testing for independence
● An important test for pattern mining
● Given:
○ Dataset D = {t1, . . . , tn} contains n transactions
○ Each transaction ti has a label l(ti) in {c0, c1}
○ A pattern S
● Goal:
○ Understand if the appearance of S and the transactions labels l(ti)
are independent
● Statistical hypothesis testing
○ Null hypothesis H0: the events “S in ti” and “l(ti) = c1” are independent
○ Alternative hypothesis H1: there is a dependency between
“S in ti” and “l(ti) = c1”
24. Example: market basket analysis
● S = {orange, tomato, broccoli}
● H0: S is independent of (not associated with) label “professor”
25. Example: Testing for Independence
Useful representation of the data: contingency table
● σ1(S) = support of S with label c1 in D
● σ0(S) = support of S with label c0 in D
● σ(S) = σ1(S) + σ0(S) = support of S in D
● n1 = number of transactions with label c1
● n0 = number of transactions with label c0
26. Example: Testing for Independence
● Value of the test statistic: xS = σ1(S) = 3
● p-value: how do we compute it?
○ Fisher’s exact test
○ χ² test
27. Fisher’s exact test
● Assumption: the column marginals and the row marginals are fixed
● Null distribution: under the null hypothesis (independence),
○ the support of S with label c1 follows a hypergeometric(n, n1, σ(S)) distribution
● and so does the p-value: it is a tail probability of this distribution
28. Example
● XS follows a hypergeometric distribution with parameters 8, 4, 4
● Probability of the observed table = Pr[XS = 3] = 16/70 ≈ 0.228
● p-value = Pr[XS ≥ 3] = 17/70 ≈ 0.243
● For α = 0.05, p-value > α
=> S is not associated with label
‘professor’
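The numbers on this slide can be reproduced with a few lines of Python (a sketch, not the talk's code), by working with the hypergeometric null distribution directly:

```python
from math import comb

def hypergeom_pmf(a, n, n1, s):
    """Pr[X = a] where X ~ hypergeometric(n, n1, s): a of the s
    occurrences of pattern S fall among the n1 transactions with label c1."""
    return comb(n1, a) * comb(n - n1, s - a) / comb(n, s)

n, n1, s = 8, 4, 4          # 8 transactions, 4 with label c1, support of S = 4
a_obs = 3                   # observed support of S with label c1

table_prob = hypergeom_pmf(a_obs, n, n1, s)
# One-sided Fisher p-value: probability of a table at least as extreme.
p_value = sum(hypergeom_pmf(a, n, n1, s) for a in range(a_obs, min(n1, s) + 1))

print(f"Pr[table] = {table_prob:.3f}")   # 16/70, about 0.229
print(f"p-value   = {p_value:.3f}")      # 17/70, about 0.243
```

Since 0.243 > 0.05, the null hypothesis of independence is not rejected, matching the slide's conclusion.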
29. χ² test
● In the old days: “Fisher’s exact test is computationally expensive...”
○ We have to compute many binomial coefficients
● Use the χ² test instead, which is an asymptotic version of Fisher’s test
○ The test statistic is easier to compute
○ The test statistic follows the χ² distribution when the sample size is large
○ There are also tables for looking up the p-value
○ But it is still just an asymptotic approximation
http://uregina.ca/~gingrich/appchi.pdf
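As a sketch (the example table is mine, reusing the Fisher-example data above), the χ² statistic for a 2×2 contingency table is computed from expected counts under independence, then compared with the textbook critical value 3.841 (χ² with 1 degree of freedom, α = 0.05):

```python
def chi2_statistic(table):
    """Chi-squared statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]        # row marginals
    col = [a + c, b + d]        # column marginals
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row[i] * col[j] / n   # expected count under independence
            stat += (obs - expected) ** 2 / expected
    return stat

# Same data as the Fisher example: sigma1 = 3, n1 - sigma1 = 1, sigma0 = 1, n0 - sigma0 = 3
stat = chi2_statistic([[3, 1], [1, 3]])
CRITICAL_005_DF1 = 3.841  # from a chi-squared table, df = 1, alpha = 0.05
print(f"chi2 = {stat:.2f}, significant: {stat > CRITICAL_005_DF1}")
```

Here the statistic is 2.0 < 3.841, so the conclusion (not significant) agrees with Fisher's exact test, although on such a tiny table the asymptotic approximation is not really justified.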
31. Pattern mining and statistical hypothesis testing
● Previous part: We had one pattern we are interested in
● In Knowledge Discovery and Data mining setting:
we have to consider multiple hypotheses given by our dataset D
● We want to find the interesting patterns
(We don’t know which pattern to test a priori)
32. Fourth paradigm: Data-intensive science
● “Classical” workflow: hypothesis determination → data collection → hypothesis testing
○ It was costly to collect data
● Data-driven (data-intensive) workflow: data collection → hypothesis determination → hypothesis testing
○ More data is available today
● Big data, many variables:
a huge number of patterns, a huge number of tests to conduct
33. Fourth paradigm: Data-intensive science
Contingency table for a pattern S against the ‘professor’/‘student’ labels:

                             professor    student
contains pattern S               A           B        n
does not contain pattern S       C           D      N − n
                               n(T)      N − n(T)     N

● H0: pattern S and ‘professor’ are independent
● H1: pattern S and ‘professor’ are associated
● Every pattern appearing in the data yields one such hypothesis to test
34. Multiple hypothesis testing
So what happens if we use the rejection rule from before?
● Let m be the number of hypotheses to test
● Proposition: E[num. false discoveries] ≤ α · m
● m is extremely high in the problem above
○ 1000 variables, 2^1000 − 1 possible patterns to consider
○ A large m makes the expected number of false discoveries very high
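A quick simulation (a sketch with made-up numbers, not from the slides) illustrates the proposition: under a true null hypothesis the p-value of a continuous test is uniform on [0, 1], so testing m null hypotheses at level α yields about α·m false discoveries.

```python
import random

random.seed(0)
m, alpha = 10_000, 0.05

# Under H0, the p-value of a (continuous) test is uniform on [0, 1].
p_values = [random.random() for _ in range(m)]
false_discoveries = sum(p <= alpha for p in p_values)

print(f"false discoveries: {false_discoveries} (expected ~ {alpha * m:.0f})")
```

Even though every hypothesis is a true null, roughly 500 of the 10,000 tests come out "significant" at α = 0.05.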
35. Multiple hypothesis testing
We want guarantees on the (expected) number of false discoveries
● Let V be the number of false discoveries
● Family-Wise Error Rate (FWER): Pr[V >= 1] (We focus on this)
○ Bonferroni correction
○ Bonferroni-Holm procedure
○ minP method (permutation test)
○ Tarone and LAMP
● False Discovery Rate (FDR): let R be the number of discoveries;
FDR = E[V / R] (defining V / R = 0 when R = 0)
○ Benjamini-Hochberg procedure
○ Benjamini-Yekutieli procedure
36. Bonferroni correction
● Let H be the set of hypotheses (patterns) to be tested, and m = |H|.
● Requirement: control the FWER under a significance level α
● Bonferroni rejection rule:
reject a hypothesis HS if m · p(S) ≤ α (corrected p-value)
● Intuition: for each S, Pr[S is a false discovery] ≤ α/m,
so FWER ≤ α by the union bound
The most widely used method; simple, but it might be too conservative
(the test has weak power when m is large)
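A minimal implementation sketch of the Bonferroni rule (the function name and toy p-values are mine, not from the slides):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """For each hypothesis, decide rejection under the Bonferroni rule:
    reject H_S iff m * p(S) <= alpha."""
    m = len(p_values)
    return [m * p <= alpha for p in p_values]

p_values = [0.001, 0.004, 0.03, 0.2]      # toy p-values for m = 4 hypotheses
print(bonferroni_reject(p_values))        # [True, True, False, False]
```

With m = 4 the effective threshold is α/m = 0.0125, so 0.03 no longer passes even though it would be "significant" in a single test.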
37. Bonferroni-Holm correction
● Let H be the set of hypotheses (patterns) to be tested, and m = |H|.
● Sort the p-values increasingly; compare the i-th smallest with α/(m − i + 1),
stopping at the first one that fails
More powerful than the Bonferroni correction: thresholds α/(m − i + 1) vs. α/m.
However: it still requires very small p-values when m is large.
Example thresholds for m = 5, α = 0.05: 0.05/5, 0.05/4, 0.05/3, 0.05/2, 0.05/1
http://www.compbio.dundee.ac.uk/user/mgierlinski/talks/p-values1/p-values8.pdf
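The step-down procedure can be sketched as follows (helper name and toy p-values are mine):

```python
def holm_reject(p_values, alpha=0.05):
    """Bonferroni-Holm step-down procedure: sort p-values increasingly,
    compare the i-th smallest (1-based) with alpha / (m - i + 1),
    and stop at the first p-value that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):            # rank = 0, 1, ..., m - 1
        if p_values[i] <= alpha / (m - rank):   # threshold alpha / (m - i + 1)
            reject[i] = True
        else:
            break                               # all remaining hypotheses kept
    return reject

p_values = [0.30, 0.009, 0.013, 0.02]           # toy p-values, m = 4
print(holm_reject(p_values))                    # [False, True, True, True]
```

Plain Bonferroni would reject only the 0.009 hypothesis here (0.013 > 0.05/4 = 0.0125), illustrating Holm's extra power at the same FWER guarantee.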
38. minP method
● The FWER depends only on the smallest p-value among the true hypotheses
○ FWER = Pr[minP of the true hypotheses ≤ δ]
○ If we knew the distribution of ‘minP of the true hypotheses’, we could find a
good significance threshold δ to control the FWER
● However, this distribution is unknown … but we can estimate it.
● Westfall-Young permutation
○ Shuffle the labels to create datasets where all null hypotheses are true
(this procedure removes all associations between items and labels)
○ Calculate the minP value for each shuffled dataset
● Take the α quantile of the obtained distribution of minP as the threshold
[Figure: probability density of minP, with its α quantile marked]
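A Westfall-Young sketch (the toy dataset, helper names, and permutation count are all mine): shuffle the labels, recompute all p-values, record each permutation's minimum, and use the α quantile of those minima as the corrected significance threshold.

```python
import random
from math import comb

def fisher_p(a, n, n1, s):
    """One-sided Fisher p-value Pr[X >= a], X ~ hypergeometric(n, n1, s)."""
    hi = min(n1, s)
    denom = comb(n, s)
    return sum(comb(n1, x) * comb(n - n1, s - x) for x in range(a, hi + 1)) / denom

def min_p(dataset, labels):
    """Smallest p-value over all single-item patterns in the dataset."""
    n, n1 = len(labels), sum(labels)
    ps = []
    for j in range(len(dataset[0])):
        s = sum(row[j] for row in dataset)                           # support of item j
        a = sum(row[j] for row, l in zip(dataset, labels) if l == 1)  # support in class c1
        ps.append(fisher_p(a, n, n1, s))
    return min(ps)

random.seed(42)
# Toy dataset: 20 transactions x 5 items, 10 transactions per class.
data = [[random.randint(0, 1) for _ in range(5)] for _ in range(20)]
labels = [1] * 10 + [0] * 10

# Estimate the null distribution of minP via label permutations.
B, alpha = 1000, 0.05
min_ps = []
for _ in range(B):
    shuffled = labels[:]
    random.shuffle(shuffled)          # breaks every item-label association
    min_ps.append(min_p(data, shuffled))

min_ps.sort()
threshold = min_ps[int(alpha * B)]    # alpha quantile of the minP distribution
print(f"corrected significance threshold: {threshold:.4f}")
```

Because the minima are computed jointly over all hypotheses, this threshold accounts for their dependence, which Bonferroni ignores.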
40. Reduce the number of hypotheses?
● The corrected significance threshold of the methods introduced so far
depends on the size of the hypotheses set H
● A smaller H may lead to a higher corrected significance threshold,
and thus to higher power.
● Q: can we shrink H a posteriori?
○ I.e., can we use D to select H' ⊆ H such that H \ H' only contains
non-significant hypotheses? (and then test only H')
○ Yes, but you must do it ‘right’.
41. Example: The ‘wrong’ way
● Dataset D:
○ 10 transactions with label c1, 10 transactions with label c0
○ There are 15 items. We are interested only in patterns of size 6.
● Number of hypotheses m = C(15, 6) = 5005
● “Let’s select some hypotheses first, and then do the testing...”
○ find the pattern S with the highest value of σ1(S) − σ0(S): σ1(S) − σ0(S) = 10
○ “I am going to test only pattern S!”
○ Fisher’s exact test p-value = 0.0001
42. Example: The ‘wrong’ way
● “S is very significant!!!” BUT IT IS NOT!
● Assume that D is generated as follows: for each pattern S
○ consider each of its 10 occurrences
○ place it in a transaction with label c0 with probability 1/2, and in a
transaction with label c1 otherwise
● S is not associated with the class labels!
● For a given S, the probability that σ1(S) = 10 and σ0(S) = 0 is (1/2)^10 =
1/1024
○ In expectation, there will be about 5005/1024 ≈ 5 patterns with σ1(S) = 10 and σ0(S) = 0,
and they are all false discoveries!
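A simulation of this generative process (a sketch; the seed is mine) confirms that several patterns reach σ1(S) = 10 purely by chance:

```python
import random
from math import comb

random.seed(1)
m = comb(15, 6)          # 5005 candidate patterns of size 6
extreme = 0
for _ in range(m):
    # Each pattern has 10 occurrences; each lands in class c1 with prob. 1/2.
    sigma1 = sum(random.random() < 0.5 for _ in range(10))
    if sigma1 == 10:
        extreme += 1

print(f"{extreme} patterns with sigma1 = 10 out of {m}")
# In expectation 5005 / 1024, i.e. about 5 such patterns -- all false discoveries.
```

Picking the most extreme pattern after looking at the data and testing only that one is exactly the selection bias the slide warns against.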
43. How not to select hypotheses
Do not do this:
1) Perform each individual test for each hypothesis using D.
2) Use the test results to select which hypotheses to include in H1.
3) Use your favorite MHC (multiple hypothesis correction) method to bound the FWER/FDR on H1.
Selecting H1 must be done without performing the tests on D.
More concisely, the selection must be independent of the p-values.
44. Hold out approach
1. Partition D into D1 and D2: D1 ∪ D2 = D and D1 ∩ D2 = ∅
2. Apply some selection procedure to D1 to select H1
(you can also perform the tests on D1).
3. Perform the tests for the hypotheses in H1 on D2, using any MHC method.
Splitting D is similar to splitting a labeled set into training and test sets
G. Webb, Discovering Significant Patterns, Mach. Learn. 2007
● There are still 2 drawbacks:
○ We cannot use all of our data for testing
○ We might falsely discard significant hypotheses (hard to evaluate)
46. Tarone
● Some test statistics are discrete
○ Fisher’s exact test statistic is discrete
● So there is a minimum attainable p-value for each pattern S
● Example: Fisher’s exact test
○ What is the smallest p-value achievable for S?
○ minimum attainable p-value = 3 × 10^-4
47. Fisher’s exact test minimum p-value
Let p(S) be the p-value from Fisher’s exact test for pattern S,
which depends on its support σ(S) and on σ1(S)
● Note that σ1(S) can only range between max(0, σ(S) − n0) and min(σ(S), n1)
● and the p-value is obtained by summing hypergeometric tail probabilities
● So, the minimum attainable p-value ψ(σ(S)) for pattern S is the smallest
p-value over all admissible values of σ1(S)
NOTE: this function is monotonically decreasing in σ(S)
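The minimum attainable p-value can be computed by scanning the admissible values of σ1(S); a sketch (one-sided Fisher test; the marginals n = 8, n1 = 4 reuse the earlier toy example and are my choice):

```python
from math import comb

def min_attainable_p(s, n, n1):
    """Minimum attainable one-sided Fisher p-value psi(s) for a pattern
    with support s, over all admissible sigma1 in [max(0, s - n0), min(s, n1)]."""
    n0 = n - n1
    lo, hi = max(0, s - n0), min(s, n1)
    denom = comb(n, s)

    def p_value(a):
        # Pr[X >= a], X ~ hypergeometric(n, n1, s)
        return sum(comb(n1, x) * comb(n0, s - x) for x in range(a, hi + 1)) / denom

    return min(p_value(a) for a in range(lo, hi + 1))

n, n1 = 8, 4
for s in range(1, 5):
    print(f"psi({s}) = {min_attainable_p(s, n, n1):.4f}")
# psi shrinks as the support s grows: only patterns with enough
# support can ever reach small p-values.
```

For these marginals ψ(1) = 0.5 while ψ(4) = 1/70 ≈ 0.014, illustrating the monotonicity the slide states.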
48. Tarone’s improved Bonferroni correction
Tarone’s result:
if you are testing hypotheses at a significance level δ, then hypotheses
that can never be significant have no effect on the FWER!
● S cannot be significant at significance level δ if its p-value is always
larger than δ, in other words: ψ(σ(S)) > δ
● These are called ‘untestable hypotheses’
● Set of testable hypotheses (for significance level δ): T(δ) = {S : ψ(σ(S)) ≤ δ}
● Rejection rule: given a level δ = α/k for a value k such that |T(α/k)| ≤ k,
reject H0 if p(S) ≤ α/k (S is significant!)
● Theorem: FWER ≤ α
TASK: find the largest such δ that still controls the FWER.
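Tarone's correction can be sketched as a search for the smallest k such that at most k hypotheses are testable at level α/k (the function and the toy ψ values below are mine, not from the slides):

```python
def tarone_k(psi_values, alpha=0.05):
    """Find the smallest k >= 1 such that the number of testable
    hypotheses at level alpha/k, i.e. |{S : psi(S) <= alpha/k}|,
    is at most k.  The corrected threshold is then alpha/k."""
    k = 1
    while sum(psi <= alpha / k for psi in psi_values) > k:
        k += 1
    return k

# Toy minimum attainable p-values for 8 candidate patterns.
psi = [0.5, 0.2, 0.02, 0.01, 0.004, 0.003, 0.002, 0.001]
k = tarone_k(psi)
threshold = 0.05 / k
print(f"k* = {k}, corrected threshold = {threshold:.4f}")
# Plain Bonferroni over all 8 hypotheses would use 0.05 / 8 = 0.0062.
```

Here k* = 5 and the corrected threshold 0.01 is larger than Bonferroni's 0.0062, because the three hypotheses with large ψ can never be rejected and therefore need no correction budget.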
49. LAMP
● However, when the data is big, it is hard to evaluate this threshold
○ => need to compute ψ(σ(S)) for every pattern (2^|items| − 1 patterns to look at)
○ => need to compute σ(S) (have to look at all the transactions)
● There are some properties that can speed up this computation
○ The support of an itemset decreases as the number of items increases
○ The minimum attainable p-value increases when the support decreases
○ If an itemset A is untestable,
then any itemset B ⊇ A is also untestable
● LAMP is a method that uses these properties to quickly find the
optimal corrected significance threshold
http://cbrc3.cbrc.jp/twsdm2015/pdf/tsuda.pdf
52. Summary
● Significant pattern mining is a framework for finding patterns with
statistical guarantees on the results
● In the KDD setting, the number of hypotheses to test is
enormously large due to the number of items in the data
● This multiple testing problem is difficult in both its statistical and
computational aspects
● Traditional methods like Bonferroni or Holm are not effective
enough to deal with problems at this scale
● Recent developments are mostly based on Tarone’s result, which
removes untestable hypotheses from the candidate pool