Lash

LASH:
Large-Scale Sequence
Mining with Hierarchies
Kaustubh Beedkar (University of Mannheim)
Rainer Gemulla (University of Mannheim)
SIGMOD (2015)
Presenter: 구한준

• Introduction
• Problem Definition
• Proposed Algorithm
• Experiment
• Conclusion
Contents

• Sequential pattern mining is used in many areas, such as market-basket analysis, web usage mining, and language modeling.
• Items are often arranged in a hierarchy, and an item's frequency can differ from the frequency of its generalization.
Ex)
Introduction
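[Figure: item hierarchy with Photography at the top, Analog Camera and Digital Camera below it, and Canon and Nikon at the bottom; one part of the figure is marked "Frequent!" and another "Not Frequent!"]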

• MG-FSM, a state-of-the-art frequent sequence miner, was proposed at SIGMOD 2013, but it does not support hierarchies.
• Other sequential pattern mining algorithms:
BFS: APRIORI, GSP, SPADE, ...
DFS: FP-Growth, PrefixSpan, SPAM, BIDE, GAP-BIDE, ...
Related Work

• Sequence database $\mathcal{D} = \{T_1, T_2, \ldots, T_{|\mathcal{D}|}\}$
• Each sequence $T = t_1 t_2 \cdots t_n$ is composed of items from the vocabulary
• Vocabulary $W = \{w_1, w_2, \ldots, w_{|W|}\}$
Problem Variables
𝑇1 𝑎 𝑏1 𝑎 𝑏1
𝑇2 𝑎 𝑏3 𝑐 𝑐 𝑏2
𝑇3 𝑎 𝑐
𝑇4 𝑏11 𝑎 𝑒 𝑎
𝑇5 𝑎 𝑏12 𝑑1 𝑐
𝑇6 𝑏13 𝑓 𝑑2

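• In generalized sequence mining (GSM), the vocabulary is arranged in a hierarchy. For $u, v \in W$:
• $u \to v$ if $u$ directly generalizes to $v$ (that is, $v$ is the parent of $u$)
• $u \to^* v$ if $u$ generalizes to $v$ in zero or more steps (so $u \to^* u$ always holds)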
Hierarchies
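[Figure: hierarchy of the B items. $b_{11}$, $b_{12}$, and $b_{13}$ directly generalize to $b_1$; $b_1$, $b_2$, and $b_3$ directly generalize to $B$.]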

• Extend the relation $\to$ to sequences
• For sequences $T = t_1 t_2 \cdots t_n$ and $S = s_1 s_2 \cdots s_{n'}$,
• $T$ directly generalizes to $S$, denoted $T \to S$, if
  • $n = n'$,
  • there exists $j$, $1 \le j \le n$, such that $t_j \to s_j$, and
  • $t_i = s_i$ for all $i \ne j$
Ex)
$T_1 = a\,b_1\,a\,b_1$ satisfies
$T_1 \to a\,B\,a\,b_1$
$T_1 \to a\,b_1\,a\,B$
Generalized Sequence

• Extend the relation $\to^*$ to sequences
• For sequences $T = t_1 t_2 \cdots t_n$ and $S = s_1 s_2 \cdots s_{n'}$,
• $T$ generalizes to $S$, denoted $T \to^* S$, if
  • $n = n'$ and
  • $t_i \to^* s_i$ for every $i$, $1 \le i \le n$ (each position is generalized zero or more times)
Ex)
$T_1 = a\,b_1\,a\,b_1$ satisfies
$T_1 \to^* a\,B\,a\,B$
Generalized Sequence
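The item-level and sequence-level relations can be sketched in Python. This is only an illustration, not code from the paper: the `PARENT` map encodes the B subtree of the example hierarchy, and all function names are assumptions.

```python
# Hypothetical encoding of the example hierarchy: child -> parent.
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
}

def ancestors_or_self(item):
    """Return item followed by its ancestors, i.e. every v with item ->* v."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def directly_generalizes(t, s):
    """T -> S: same length, exactly one position is replaced by its parent."""
    if len(t) != len(s):
        return False
    changed = [(u, v) for u, v in zip(t, s) if u != v]
    return len(changed) == 1 and PARENT.get(changed[0][0]) == changed[0][1]

def generalizes(t, s):
    """T ->* S: same length, every position generalizes in zero or more steps."""
    return len(t) == len(s) and all(
        v in ancestors_or_self(u) for u, v in zip(t, s)
    )

T1 = ["a", "b1", "a", "b1"]
print(directly_generalizes(T1, ["a", "B", "a", "b1"]))  # True
print(generalizes(T1, ["a", "B", "a", "B"]))            # True
print(directly_generalizes(T1, ["a", "B", "a", "B"]))   # False: two positions change
```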

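• $S$ is a subsequence of $T$ under gap constraint $\gamma$, denoted $S \subseteq_\gamma T$
• Gap constraint $\gamma \ge 0$: at most $\gamma$ items of $T$ may lie between consecutive items of $S$
Ex) $T_5 = a\,b_{12}\,d_1\,c$
$a\,b_{12} \subseteq_0 T_5$, $a\,d_1\,c \subseteq_1 T_5$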
Subsequence
𝑇5 𝑎 𝑏12 𝑑1 𝑐
⊆0 𝑎 𝑏12
⊆0 𝑏12 𝑑1
⊆1 𝑎 𝑏12 𝑐
⊆1 𝑎 𝑑1 𝑐
⊆2 𝑎 𝑐
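A gap-constrained subsequence test can be sketched as a small backtracking search. The function name and the list encoding of sequences are assumptions made for this illustration.

```python
def is_subsequence(s, t, gamma):
    """True if s occurs in t in order, skipping at most gamma items of t
    between consecutive items of s (the relation written S ⊆_γ T)."""
    def match(i, start, stop):
        # Try to place s[i] at some position p with start <= p < stop.
        if i == len(s):
            return True
        for p in range(start, min(stop, len(t))):
            if t[p] == s[i] and match(i + 1, p + 1, p + 2 + gamma):
                return True
        return False
    # The first item of s may match anywhere in t.
    return match(0, 0, len(t))

T5 = ["a", "b12", "d1", "c"]
print(is_subsequence(["a", "b12"], T5, 0))      # True
print(is_subsequence(["a", "d1", "c"], T5, 1))  # True
print(is_subsequence(["a", "c"], T5, 1))        # False: two items lie between a and c
print(is_subsequence(["a", "c"], T5, 2))        # True
```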

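• $S$ is a generalized subsequence of $T$, denoted $S \sqsubseteq_\gamma T$, if some subsequence $S' \subseteq_\gamma T$ satisfies $S' \to^* S$
Ex) $T_5 = a\,b_{12}\,d_1\,c$
$a\,b_{12} \sqsubseteq_0 T_5$, $a\,b_1 \sqsubseteq_0 T_5$, $a\,B \sqsubseteq_0 T_5$, $a\,D \sqsubseteq_1 T_5$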
Generalized Subsequences
𝑇5 𝑎 𝑏12 𝑑1 𝑐
⊑0 𝑎 𝑏12
⊑0 𝑎 𝑏1
⊑0 𝑎 𝐵
⊑1 𝑎 𝐷
⊑2 𝑎 𝑐
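The generalized subsequence test differs from the plain one only in how a single item is matched: an item of $S$ may equal the item of $T$ or any of its ancestors. The sketch below assumes a `PARENT` map for the example hierarchy, where the entries for `d1` and `d2` are inferred from the example $a\,D \sqsubseteq_1 T_5$; the names are illustrative.

```python
# Assumed hierarchy (child -> parent); the D subtree is inferred from the examples.
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}

def ancestors_or_self(item):
    """Return item followed by all of its ancestors."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def is_generalized_subsequence(s, t, gamma):
    """True if s matches t in order with gaps of at most gamma, where each
    item of s equals the matched item of t or one of its ancestors."""
    def match(i, start, stop):
        if i == len(s):
            return True
        for p in range(start, min(stop, len(t))):
            if s[i] in ancestors_or_self(t[p]) and match(i + 1, p + 1, p + 2 + gamma):
                return True
        return False
    return match(0, 0, len(t))

T5 = ["a", "b12", "d1", "c"]
print(is_generalized_subsequence(["a", "B"], T5, 0))  # True
print(is_generalized_subsequence(["a", "D"], T5, 1))  # True
print(is_generalized_subsequence(["a", "D"], T5, 0))  # False
```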

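• $Sup_\gamma(S, \mathcal{D}) = \{T \in \mathcal{D} : S \sqsubseteq_\gamma T\}$
Support set of sequence $S$ in the database $\mathcal{D}$: the sequences $T$ of which $S$ is a generalized subsequence
• $f_\gamma(S, \mathcal{D}) = |Sup_\gamma(S, \mathcal{D})|$
$S$ is frequent in $\mathcal{D}$ if $f_\gamma(S, \mathcal{D}) \ge \sigma$, where $\sigma > 0$ is the support threshold
Ex)
$Sup_1(aBc, \mathcal{D}) = \{T_2, T_5\}$
$Sup_0(aBc, \mathcal{D}) = \{T_2\}$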
Support
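The support example can be checked with a short script over the six example sequences. The dictionary layout, the `PARENT` map (including the D subtree), and the function names are assumptions for illustration.

```python
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
DB = {
    "T1": ["a", "b1", "a", "b1"],
    "T2": ["a", "b3", "c", "c", "b2"],
    "T3": ["a", "c"],
    "T4": ["b11", "a", "e", "a"],
    "T5": ["a", "b12", "d1", "c"],
    "T6": ["b13", "f", "d2"],
}

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def is_generalized_subsequence(s, t, gamma):
    def match(i, start, stop):
        if i == len(s):
            return True
        for p in range(start, min(stop, len(t))):
            if s[i] in ancestors_or_self(t[p]) and match(i + 1, p + 1, p + 2 + gamma):
                return True
        return False
    return match(0, 0, len(t))

def support_set(s, gamma):
    """Names of all database sequences that have s as a generalized subsequence."""
    return {name for name, t in DB.items() if is_generalized_subsequence(s, t, gamma)}

print(sorted(support_set(["a", "B", "c"], 1)))  # ['T2', 'T5']
print(sorted(support_set(["a", "B", "c"], 0)))  # ['T2']
```

The frequency $f_\gamma$ is then simply `len(support_set(s, gamma))`.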

• Given
  • $\sigma > 0$, a minimum support threshold
  • $\gamma \ge 0$, a maximum-gap constraint
  • $\lambda \ge 2$, a maximum-length constraint
• Find all frequent generalized sequences $S$ that satisfy
  • $2 \le |S| \le \lambda$
  • $f_\gamma(S, \mathcal{D}) \ge \sigma$
Problem Definition

• Generate all possible generalized subsequences of each sequence (map phase) and count them (reduce phase).
• $G_{\lambda,\gamma}(T) = \{S \mid S \sqsubseteq_\gamma T,\ 2 \le |S| \le \lambda\}$
Ex)
$T_4 = b_{11}\,a\,e\,a$
$G_{\lambda=3,\gamma=1}(T_4) = \{b_{11}a,\ b_{11}e,\ ae,\ aa,\ ea,\ b_{11}ae,\ b_{11}aa,\ b_{11}ea,\ aea,\ b_1a,\ b_1e,\ b_1ae,\ b_1aa,\ b_1ea,\ Ba,\ Be,\ Bae,\ Baa,\ Bea\}$
Naïve Algorithm
⊑1 𝑏11 𝑎
⊑1 𝑏1 𝑎
⊑1 𝐵 𝑎
⊑1 𝐵 𝑒
⊑1 𝑎 𝑎
… …
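The set $G_{\lambda,\gamma}(T)$ of the naïve algorithm can be enumerated directly, which makes it easy to see how quickly it grows. The sketch below is an illustration under assumed names (`PARENT`, `generalized_subsequences`); it picks gap-respecting position tuples and then expands every chosen item into itself and its ancestors.

```python
from itertools import product

PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def generalized_subsequences(t, lam, gamma):
    """All generalized subsequences of t with 2..lam items and gaps <= gamma."""
    results = set()

    def extend(positions):
        if len(positions) >= 2:
            # Each chosen position contributes the item itself or any ancestor.
            results.update(product(*(ancestors_or_self(t[p]) for p in positions)))
        if len(positions) == lam:
            return
        last = positions[-1]
        # The next position may skip at most gamma items.
        for nxt in range(last + 1, min(len(t), last + gamma + 2)):
            extend(positions + [nxt])

    for start in range(len(t)):
        extend([start])
    return results

T4 = ["b11", "a", "e", "a"]
G = generalized_subsequences(T4, lam=3, gamma=1)
print(len(G))  # 19
```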

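• In the preprocessing phase, build the f-list (the frequent items with their frequencies) and a total order on the items
• $w_1 < w_2$ when $f_0(w_1, \mathcal{D}) > f_0(w_2, \mathcal{D})$ (more frequent items are smaller)
• An ancestor is smaller than its descendants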
Preprocess
f-list ($\sigma = 2$)
a : 5
B : 5
$b_1$ : 4
c : 3
D : 2
total order: $a < B < b_1 < c < D$
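The f-list for the example database can be reproduced with a few lines of Python. This is a sketch under assumptions: the `PARENT` map encodes the example hierarchy, each item and each of its ancestors is counted once per sequence, and ties are broken by hierarchy depth and then by case-insensitive name, a choice made here only so that the output matches the order shown on the slide.

```python
from collections import Counter

PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
DB = [
    ["a", "b1", "a", "b1"],
    ["a", "b3", "c", "c", "b2"],
    ["a", "c"],
    ["b11", "a", "e", "a"],
    ["a", "b12", "d1", "c"],
    ["b13", "f", "d2"],
]

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def f_list(db, sigma):
    """Frequent items with their frequencies, most frequent first."""
    counts = Counter()
    for t in db:
        # Count every item and every ancestor at most once per sequence.
        counts.update({x for item in t for x in ancestors_or_self(item)})
    frequent = [w for w in counts if counts[w] >= sigma]

    def sort_key(w):
        depth = len(ancestors_or_self(w)) - 1
        # Assumed tie-break: shallower items first, then case-insensitive name.
        return (-counts[w], depth, w.lower())

    frequent.sort(key=sort_key)
    return [(w, counts[w]) for w in frequent]

print(f_list(DB, sigma=2))
# [('a', 5), ('B', 5), ('b1', 4), ('c', 3), ('D', 2)]
```

Because an ancestor is at least as frequent as each descendant, sorting by descending frequency with depth as tie-break keeps every ancestor ahead of its descendants.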

• Generate a subsequence only if all of its items are frequent
Ex) $T_4 = b_{11}\,a\,e\,a$
$G_{\lambda=3,\gamma=1}(T_4) = \{aa,\ b_1a,\ b_1aa,\ Ba,\ Baa\}$
Semi-Naïve Algorithm
f-list ($\sigma = 2$)
a : 5
B : 5
𝑏1: 4
c : 3
D : 2
⊑1 𝑏11 𝑎
⊑1 𝑏1 𝑎 𝑎
⊑1 𝐵 𝑎
⊑1 𝐵 𝑒
⊑1 𝑎 𝑎
… …
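The semi-naïve restriction can be sketched by filtering the candidate items at each position before enumerating. The frequent-item set is taken from the f-list on the slide; `PARENT` and the function names are assumptions for illustration.

```python
from itertools import product

PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
FREQUENT = {"a", "B", "b1", "c", "D"}  # items of the f-list for sigma = 2

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def frequent_generalized_subsequences(t, lam, gamma):
    """Generalized subsequences of t that consist of frequent items only."""
    # Frequent generalizations available at each position (empty if none).
    options = [[x for x in ancestors_or_self(item) if x in FREQUENT] for item in t]
    results = set()

    def extend(positions):
        if len(positions) >= 2:
            results.update(product(*(options[p] for p in positions)))
        if len(positions) == lam:
            return
        last = positions[-1]
        for nxt in range(last + 1, min(len(t), last + gamma + 2)):
            if options[nxt]:  # positions without a frequent item only count as gaps
                extend(positions + [nxt])

    for start in range(len(t)):
        if options[start]:
            extend([start])
    return results

T4 = ["b11", "a", "e", "a"]
print(sorted(frequent_generalized_subsequences(T4, lam=3, gamma=1)))
# [('B', 'a'), ('B', 'a', 'a'), ('a', 'a'), ('b1', 'a'), ('b1', 'a', 'a')]
```

For $T_4$ this shrinks the output from 19 sequences to 5.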

• Total order: $a < B < b_1 < c < D$ ($a$ is the most frequent)
• $p(S) = \max_{w \in S} w$ is the pivot item of $S$ (its largest item in the total order)
Ex) $T_1 = a\,b_1\,a\,b_1$, $p(T_1) = b_1$
• A partition $P_w$ is the set of sequences that have $w$ as pivot
Ex) $T_1 \in P_{b_1}$, $a \in P_a$, $aa \in P_a$, ...
• From $P_w$, mine all generalized sequences that contain $w$ but no larger (in the total order) item
Ex) sequences mined from $P_a$ consist of $a$'s only; sequences mined from $P_B$ consist of $a$'s and $B$'s
Partition

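• Total order: $a < B < b_1 < c < D$ ($a$ is the most frequent)
• $G_{\lambda=3,\gamma=1}(T_4) = \{aa,\ b_1a,\ b_1aa,\ Ba,\ Baa\}$
Partition
$P_a$: $\{aa\}$, items allowed: $\mathbf{a}$
$P_B$: $\{Ba,\ Baa\}$, items allowed: $a, \mathbf{B}$
$P_{b_1}$: $\{b_1a,\ b_1aa\}$, items allowed: $a, B, \mathbf{b_1}$
$P_c$: $\emptyset$, items allowed: $a, B, b_1, \mathbf{c}$
$P_D$: $\emptyset$, items allowed: $a, B, b_1, c, \mathbf{D}$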
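Pivot computation and partitioning amount to a maximum over item ranks. A minimal sketch, assuming the total order from the slide is given as a list (`ORDER`) and using illustrative function names:

```python
ORDER = ["a", "B", "b1", "c", "D"]          # total order, most frequent first
RANK = {w: i for i, w in enumerate(ORDER)}  # smaller rank = smaller item

def pivot(seq):
    """Largest item of seq with respect to the total order."""
    return max(seq, key=RANK.__getitem__)

def partition_by_pivot(sequences):
    """Group sequences by their pivot item; empty partitions are kept."""
    parts = {w: set() for w in ORDER}
    for s in sequences:
        parts[pivot(s)].add(s)
    return parts

G_T4 = {("a", "a"), ("b1", "a"), ("b1", "a", "a"), ("B", "a"), ("B", "a", "a")}
parts = partition_by_pivot(G_T4)
print(pivot(("a", "b1", "a", "b1")))  # b1
print(sorted(parts["B"]))             # [('B', 'a'), ('B', 'a', 'a')]
```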

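• Two sequences $T$ and $T'$ are $w$-equivalent if $G_{w,\lambda,\gamma}(T) = G_{w,\lambda,\gamma}(T')$, where
$G_{w,\lambda,\gamma}(T) = \{S \mid S \sqsubseteq_\gamma T,\ 2 \le |S| \le \lambda,\ p(S) = w\}$
Total order: $a < B < b_1 < c < D$
Ex) $T_4 = b_{11}\,a\,e\,a$
$G_{w=B,\lambda=3,\gamma=1}(T_4) = \{Baa,\ Ba\} = G_{w=B,\lambda=3,\gamma=1}(Baa)$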
w-equivalency
$P_a$: $aa$
$P_B$: $Ba$, $Baa$
$P_{b_1}$: $b_1a$, $b_1aa$
$P_c$, $P_D$: not necessary!
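The $w$-equivalence of $T_4$ and $Baa$ for pivot $B$ can be checked by computing $G_{w,\lambda,\gamma}$ for both. In this sketch, sequences containing an item outside the total order (an infrequent item) are skipped, which is an assumption made here because the pivot is only defined over ranked items; `PARENT`, `ORDER`, and the function names are likewise illustrative.

```python
from itertools import product

PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
ORDER = ["a", "B", "b1", "c", "D"]
RANK = {w: i for i, w in enumerate(ORDER)}

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def pivot_sequences(t, w, lam, gamma):
    """Generalized subsequences of t (2..lam items, gaps <= gamma) with pivot w."""
    results = set()

    def extend(positions):
        if len(positions) >= 2:
            for s in product(*(ancestors_or_self(t[p]) for p in positions)):
                # Keep s only if every item is ranked and its largest item is w.
                if all(x in RANK for x in s) and max(s, key=RANK.__getitem__) == w:
                    results.add(s)
        if len(positions) < lam:
            last = positions[-1]
            for nxt in range(last + 1, min(len(t), last + gamma + 2)):
                extend(positions + [nxt])

    for start in range(len(t)):
        extend([start])
    return results

def w_equivalent(t1, t2, w, lam, gamma):
    return pivot_sequences(t1, w, lam, gamma) == pivot_sequences(t2, w, lam, gamma)

T4 = ["b11", "a", "e", "a"]
print(sorted(pivot_sequences(T4, "B", 3, 1)))          # [('B', 'a'), ('B', 'a', 'a')]
print(w_equivalent(T4, ["B", "a", "a"], "B", 3, 1))    # True
print(w_equivalent(T4, ["B", "a", "a"], "b1", 3, 1))   # False
```

The last line shows why the shorter sequence $Baa$ may stand in for $T_4$ in partition $P_B$ but not in $P_{b_1}$.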

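• An item $w'$ is $w$-relevant if $w' \le w$ (it is not larger than the pivot $w$ in the total order, so it is at least as frequent)
• 1) Replace each irrelevant item that has a relevant ancestor $w' \le w$ by that ancestor (the most specific one, if there are several)
• 2) Replace each remaining irrelevant item, which has no relevant ancestor, by the blank symbol $\sqcup$
Ex) $a < B < b_1 < c < D$ (pivot $B$)
$T_2 = a\,b_3\,c\,c\,b_2 \to^* T_2' = a\,B\,\sqcup\,\sqcup\,B$ regarding pivot $B$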
w-generalization
1) $a\,B\,c\,c\,b_2$
1) $a\,B\,c\,c\,B$
2) $a\,B\,\sqcup\,c\,B$
$T_2'$: $a\,B\,\sqcup\,\sqcup\,B$
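The item-wise part of $w$-generalization (generalizing to a relevant ancestor or blanking) can be sketched as follows. This is an illustration, not the paper's implementation: `PARENT`, `ORDER`, and `w_generalize` are assumed names, and the sketch picks the most specific relevant item on the path from an item to its root.

```python
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
ORDER = ["a", "B", "b1", "c", "D"]
RANK = {w: i for i, w in enumerate(ORDER)}
BLANK = "⊔"

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def w_generalize(t, w):
    """Rewrite t for pivot w: keep relevant items, generalize irrelevant items
    to their most specific relevant ancestor, and blank everything else."""
    limit = RANK[w]
    rewritten = []
    for item in t:
        replacement = BLANK
        for x in ancestors_or_self(item):  # item first, then parent, and so on
            if x in RANK and RANK[x] <= limit:
                replacement = x
                break
        rewritten.append(replacement)
    return rewritten

T2 = ["a", "b3", "c", "c", "b2"]
print(w_generalize(T2, "B"))  # ['a', 'B', '⊔', '⊔', 'B']
```

Items that are not in the total order at all (infrequent items such as `b3`) are treated like irrelevant items, so they are generalized or blanked in the same way.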

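• Purpose: make each sequence as short as possible
• 3) Remove items that are located too far from every pivot occurrence to appear in a sequence together with the pivot
Ex) $\gamma = 1$, pivot $D$, $a < B < b_1 < c < D$
$\lambda = 2$: $a\,c\,D\,a\,D\,c\,\sqcup$;  $\lambda = 3$: $a\,b_1\,a\,c\,D\,a\,D\,c\,\sqcup\,B$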
w-generalization
$T$: $a\ b_1\ a\ c\ d_1\ a\ d_2\ c\ f\ b_2\ c$
$T'$: $a\ b_1\ a\ c\ \mathbf{D}\ a\ \mathbf{D}\ c\ \sqcup\ B\ c$
[Figure: further rows repeat $T'$ for $\lambda = 2$ and $\lambda = 3$ to mark which items are kept; the marking did not survive extraction.]

Proposed Algorithm
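For each sequence $T_i$ in the database:
  generate the rewritten sequence $T_i'$ for each frequent item (pivot) $f_j$ that occurs in $T_i$ or in a generalization of it
  send each $T_i'$ to the partition of its pivot
Perform local mining in each partition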
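The overall flow can be simulated in a single process. The sketch below is only an illustration under several assumptions: the "map" and "reduce" steps are plain loops rather than Hadoop jobs, rewriting uses the generalize-or-blank steps only (no removal of far-away items), and local mining is a naive enumeration rather than PSM. All names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

PARENT = {"b1": "B", "b2": "B", "b3": "B", "b11": "b1", "b12": "b1",
          "b13": "b1", "d1": "D", "d2": "D"}
DB = [["a", "b1", "a", "b1"], ["a", "b3", "c", "c", "b2"], ["a", "c"],
      ["b11", "a", "e", "a"], ["a", "b12", "d1", "c"], ["b13", "f", "d2"]]
BLANK = "⊔"

def ancestors_or_self(item):
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def total_order(db, sigma):
    """Rank of each frequent item (0 = smallest); ties broken by depth, then name."""
    counts = Counter()
    for t in db:
        counts.update({x for item in t for x in ancestors_or_self(item)})
    frequent = [w for w in counts if counts[w] >= sigma]
    frequent.sort(key=lambda w: (-counts[w], len(ancestors_or_self(w)), w.lower()))
    return {w: i for i, w in enumerate(frequent)}

def rewrite(t, w, rank):
    """Keep or generalize items to something not larger than w; blank the rest."""
    return [next((x for x in ancestors_or_self(item)
                  if x in rank and rank[x] <= rank[w]), BLANK) for item in t]

def generalized_subsequences(t, lam, gamma):
    results = set()
    def extend(positions):
        if len(positions) >= 2:
            results.update(product(*(ancestors_or_self(t[p]) for p in positions)))
        if len(positions) < lam:
            last = positions[-1]
            for nxt in range(last + 1, min(len(t), last + gamma + 2)):
                if t[nxt] != BLANK:  # blanks only count as gap positions
                    extend(positions + [nxt])
    for start in range(len(t)):
        if t[start] != BLANK:
            extend([start])
    return results

def mine(db, sigma, gamma, lam):
    rank = total_order(db, sigma)
    partitions = defaultdict(list)
    for t in db:  # "map": one rewritten copy of t per pivot it can produce
        for w in {x for item in t for x in ancestors_or_self(item) if x in rank}:
            partitions[w].append(rewrite(t, w, rank))
    frequent = {}
    for w, seqs in partitions.items():  # "reduce": local mining per partition
        counts = Counter()
        for t in seqs:
            counts.update(s for s in generalized_subsequences(t, lam, gamma)
                          if max(s, key=rank.__getitem__) == w)
        frequent.update({s: c for s, c in counts.items() if c >= sigma})
    return frequent

result = mine(DB, sigma=2, gamma=1, lam=3)
print(result[("a", "B", "c")])  # 2
```

Each partition only counts sequences whose pivot is its own item, so no sequence is counted in two partitions and the partitions can be mined independently.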

• Local mining can be done efficiently with the pivot sequence miner (PSM) instead of Apriori-style (BFS or DFS) miners
• Instead of searching for every frequent sequence, LASH can efficiently enumerate only the sequences that contain the pivot
Ex) pivot $c$: {abc, cab, ...}
There is no need to find {ab}, because it does not contain c
Pivot Sequence Miner

• Data sets: NYT, AMZN
• NYT (50M sentences from 1.8M articles)
• n-gram mining from textual data
• AMZN (35M reviews from 6M users)
• customer behavior mining from product sequences
• Cluster
• 11 Dell PowerEdge R720 machines
• 64 GB memory, 8 × 2 TB hard disks, 2 × Intel Xeon E5-2640 6-core CPUs
• Hadoop 0.20.2 (JDK 1.7)
Test Environment

• LASH is the first parallel algorithm for mining frequent sequences with hierarchies
• LASH partitions the sequences by pivot item and performs local mining (PSM) in each partition
• Because of PSM, LASH searches more efficiently than MG-FSM (the state-of-the-art frequent sequence miner without hierarchies)
Conclusion
