Introduction to FAST-LAMP

A Fast Method of Statistical Assessment
for Combinatorial Hypotheses
Based on Frequent Itemset Enumeration
Shin-ichi Minato, Takeaki Uno, Koji Tsuda, Aika Terada and Jun Sese
Reason for selection: looking for hints on speeding up our proposed method
(PKDD2014)

Abstract (Introduction)
● Combinatorial hypothesis assessment is a hard problem
○ Large p-value correction factor due to multiple testing
● The LAMP method was proposed to exclude meaningless hypotheses
○ Based on frequent itemset enumeration
○ Gives a more accurate p-value correction
● However, the original implementation is time-consuming
○ The itemset mining algorithm is executed many times
● This work proposes a new, faster LAMP algorithm
○ Executes the itemset mining algorithm only once
○ 10 to 100 times faster than the original LAMP

Preliminary
● Let E = {1, ..., n} be a set of items. An itemset is a subset of E.
● A transaction database D is a dataset composed of transactions, each of which is an itemset.
● An occurrence of an itemset X is a transaction that includes X
● The occurrence set Occ(X) is the set of all occurrences of X in D
● The frequency of X, frq(X), is the number of occurrences of X in D
● An itemset X is called frequent for a constant σ if frq(X) ≥ σ
● 　　 is the number of itemsets X that are frequent for σ
Frequent itemsets for a lower σ:
(apple), (beer), (rice), (milk)
(apple, beer), (milk, beer), (beer, rice)
Frequent itemsets for a higher σ:
(apple), (beer), (rice), (milk)
(beer, rice)
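
The definitions above can be checked on a small example. The following Python sketch uses a hypothetical database (not the one from the slide), chosen so that σ = 2 and σ = 3 produce the two lists shown above. The names occ, frq and frequent_itemsets are illustrative, and the brute-force enumeration is only meant for tiny inputs.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of items.
D = [
    {"apple", "beer"}, {"apple", "beer"}, {"apple"},
    {"milk", "beer"}, {"milk", "beer"}, {"milk"},
    {"beer", "rice"}, {"beer", "rice"}, {"beer", "rice"},
]

def occ(X, D):
    """Occ(X): indices of the transactions that include every item of X."""
    X = set(X)
    return {i for i, T in enumerate(D) if X <= T}

def frq(X, D):
    """frq(X): number of occurrences of X in D."""
    return len(occ(X, D))

def frequent_itemsets(D, sigma):
    """All non-empty itemsets X with frq(X) >= sigma, found by brute force."""
    items = sorted(set().union(*D))
    return [
        X
        for r in range(1, len(items) + 1)
        for X in combinations(items, r)
        if frq(X, D) >= sigma
    ]
```

Calling frequent_itemsets(D, 2) and frequent_itemsets(D, 3) shows how raising σ shrinks the result: every itemset frequent for the larger σ is also frequent for the smaller one.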

Frequent Itemset Mining Algorithm
● TASK: Find all itemsets that are frequent for a constant σ
● Start from the empty set and recursively add items with depth-first search
● e > tail(X) is the condition that prevents duplicated solutions
○ tail(X) is the maximum item in X
● The heaviest computation is the function frq(X ∪ {e}) (frequency counting)
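
A minimal Python sketch of this depth-first search follows. It assumes items are the integers 1..n_items and that the duplicate-avoidance rule is the usual one, adding only items larger than the maximum item of X. The function name backtrack and its parameters are illustrative, and the frequency is counted naively rather than with the optimized data structures a real miner would use.

```python
def backtrack(X, D, sigma, n_items, out):
    """Depth-first enumeration of all itemsets that are frequent for sigma.

    X is a tuple of items in increasing order; D is a list of sets.
    The empty itemset is reported as well.
    """
    out.append(X)
    tail = X[-1] if X else 0  # maximum item in X (0 for the empty set)
    for e in range(tail + 1, n_items + 1):  # only e > tail(X): no duplicates
        Y = X + (e,)
        # Frequency counting: the most expensive step of the search.
        if sum(1 for T in D if set(Y) <= T) >= sigma:
            backtrack(Y, D, sigma, n_items, out)
    return out

# Hypothetical toy database over the items 1..4.
D = [{1, 2, 3}, {2, 4}, {1, 2}, {2, 3, 4}, {3}]
found = backtrack((), D, 2, 4, [])
```

Because an itemset is only ever extended with larger items, each itemset is reached along exactly one path, so no duplicate check is needed.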

Frequent Itemset Mining Algorithm (Update)
● An item is addible for itemset X if
● Let the set of addible items for X that satisfy
● We can add these items without calling

Statistical Assessment for Combinatorial Hypotheses
● Assume a classifier labels each transaction as positive or negative
● itemsets, transactions, positive transactions
● For an itemset X
○ the number of transactions containing X (= frq(X))
○ the number of positive transactions in Occ(X)
Please ignore the numbers.
p-value of Fisher’s exact test is calculated as
where
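
As a hedged illustration, the one-sided Fisher's exact test can be computed directly from the four counts involved. The notation here is an assumption, since the slide's symbols are not available: N is the number of transactions, n_pos the number of positive transactions, x = frq(X), and a the number of positive transactions among the occurrences of X. The one-sided form (enrichment of positives) is also an assumption.

```python
from math import comb

def fisher_p_value(N, n_pos, x, a):
    """One-sided Fisher's exact test p-value (hypergeometric upper tail).

    Probability of seeing at least `a` positives among the `x` occurrences
    of X when `n_pos` of the `N` transactions are positive.
    """
    upper = min(x, n_pos)
    tail = sum(
        comb(n_pos, k) * comb(N - n_pos, x - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(N, x)
```

For example, with N = 10, n_pos = 5 and an itemset occurring in x = 3 transactions that are all positive, the p-value is C(5, 3) / C(10, 3) = 1/12.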

Multiple testing and LAMP’s idea
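● The family-wise error rate (FWER) must be kept at or below the significance level
● LAMP ideas
○ Exclude meaninglessly infrequent itemsets, which can never be significant
○ Itemsets having exactly the same occurrence set can be counted as one
● For an itemset X, the p-value cannot be smaller than a lower bound determined by frq(X)
- This lower bound is monotonically decreasing in the frequency
- If the bound at σ already exceeds the corrected significance level, itemsets that are infrequent for σ can never be significant
- Consider the number of all closed itemsets that are frequent for σ
- LAMP finds the maximum σ that satisfies the LAMP condition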
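
The lower bound on the p-value can be sketched as follows, assuming the one-sided Fisher's exact test and the hypothetical notation N (transactions) and n_pos (positive transactions). The smallest p-value for an itemset of frequency x is reached when all x occurrences are positive, which gives C(n_pos, x) / C(N, x). The sketch is restricted to x ≤ n_pos, where this bound is monotonically decreasing; the function names are illustrative.

```python
from math import comb

def min_p_value(x, N, n_pos):
    """Smallest one-sided Fisher p-value reachable with frq(X) = x.

    Attained when every occurrence of X is a positive transaction.
    Only defined here for 0 <= x <= n_pos.
    """
    if not 0 <= x <= n_pos:
        raise ValueError("sketch only covers 0 <= x <= n_pos")
    return comb(n_pos, x) / comb(N, x)

def can_be_significant(x, N, n_pos, level):
    """False if no itemset of frequency x can reach a p-value <= level."""
    return min_p_value(x, N, n_pos) <= level
```

With N = 10, n_pos = 5 and a corrected level of 0.005, an itemset of frequency 4 can never be significant (bound 5/210), while one of frequency 5 still can (bound 1/252).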

Current implementation of LAMP
Approach 1:
- Intuitive approach
- Start from the most frequent itemset (the empty itemset)
- Conduct breadth-first search for each successively lower frequency parameter σ
- Large memory usage
Approach 2: (actually implemented)
- Depth-first search approach that requires less memory
- Has to call LCM repeatedly
- Time-consuming

Reformulating the problem
● Reformulate the problem using a threshold function
○ :
○ is monotonically decreasing in x and increasing in y
○ we reformulate our problem as
■
■
○ Our problem is then to find the largest σ that satisfies this condition

Support increasing algorithm
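● This algorithm generates itemsets starting from a small σ
● First observation
○ For a frequency σ, if we have found k itemsets that are frequent for σ and k is already large enough to pass the threshold for σ,
then σ satisfies the condition and the remaining itemsets need not be counted for σ
● Second observation
○ Assume that we are considering σ and have found k such itemsets:
we can skip σ and go to σ + 1
○ Here, we can reuse the current k itemsets;
we just need to remove the itemsets with frequency σ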

Support increasing algorithm
● if is relative small compared to on average, algorithm terminates fast
● S can be maintained with a heap that extracts the minimum-frequency itemset from S,
which takes O(log |S|) time per extraction
● However for large or is very large, the algorithm take very long time
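
The support increasing idea with a heap can be sketched in Python as below. Several points are assumptions made for illustration: g is a hypothetical threshold function (positive and non-decreasing in σ) standing in for the one on the slides, the condition is taken to be "at least g(σ) itemsets have frequency ≥ σ", and all non-empty frequent itemsets are counted rather than only closed ones. Items are the integers 1..n_items.

```python
import heapq

def support_increase(D, n_items, g):
    """Largest sigma with at least g(sigma) itemsets of frequency >= sigma.

    Returns 0 if the condition fails even for sigma = 1.
    g must be positive and non-decreasing.
    """
    sigma = 1
    heap = []  # frequencies of the itemsets found so far with frq >= sigma

    def visit(occ, tail):
        nonlocal sigma
        for e in range(tail + 1, n_items + 1):
            occ_e = [t for t in occ if e in D[t]]
            if len(occ_e) < sigma:
                continue  # infrequent for the current sigma: prune
            heapq.heappush(heap, len(occ_e))
            # Enough itemsets for this sigma: move on to sigma + 1 and
            # drop the itemsets that are no longer frequent.
            while len(heap) >= g(sigma):
                sigma += 1
                while heap and heap[0] < sigma:
                    heapq.heappop(heap)
            if len(occ_e) >= sigma:
                visit(occ_e, e)

    visit(list(range(len(D))), 0)
    return sigma - 1

# Hypothetical toy database over the items 1..4.
D = [{1, 2, 3}, {2, 4}, {1, 2}, {2, 3, 4}, {3}]
```

The search runs only once: σ never decreases, so pruning with the current σ never discards an itemset that would be needed for a later, larger σ.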

Faster implementation
● However, we do not need to maintain the whole of S using a heap
● We only need the size of S, so we can store only the size
○ This makes the step of removing infrequent itemsets take O(1) time
○ Moreover, adding the addible items also takes only O(1) time
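
A sketch of the heap-free variant, under the same assumptions as before (hypothetical positive, non-decreasing threshold function g; all non-empty frequent itemsets counted; items 1..n_items): instead of storing the itemsets, it keeps one counter per frequency plus the running total k, so raising σ costs a single subtraction.

```python
def support_increase_counts(D, n_items, g):
    """Same result as the heap-based search, storing only counters.

    cnt[f] is the number of itemsets found with frequency exactly f;
    k is the number of itemsets found with frequency >= sigma.
    """
    sigma, k = 1, 0
    cnt = [0] * (len(D) + 2)

    def visit(occ, tail):
        nonlocal sigma, k
        for e in range(tail + 1, n_items + 1):
            occ_e = [t for t in occ if e in D[t]]
            f = len(occ_e)
            if f < sigma:
                continue  # infrequent for the current sigma: prune
            cnt[f] += 1
            k += 1
            while k >= g(sigma):
                k -= cnt[sigma]  # itemsets of frequency sigma drop out: O(1)
                sigma += 1
            if f >= sigma:
                visit(occ_e, e)

    visit(list(range(len(D))), 0)
    return sigma - 1

# Hypothetical toy database over the items 1..4.
D = [{1, 2, 3}, {2, 4}, {1, 2}, {2, 3, 4}, {3}]
```

The counters are enough because the algorithm never needs to know which itemsets were found, only how many remain frequent for the current σ.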

Conclusion
● Proposed a fast itemset enumeration algorithm to find the frequency threshold satisfying the LAMP condition
● The proposed method is much faster than the original
● Future work:
○ It would be useful to compute the p-values for many combinatorial hypotheses efficiently and to discover the best or top-k significant ones (Our work)
○ Other tests such as the chi-squared test and the Mann-Whitney test
○ Extension to non-binary-valued databases (Our work)
Comment:
- Solid work that gave great insights into the problem we are currently dealing with
- A bit surprised when reading the future work part
