Top-k Interesting Phrase Mining in Ad-hoc Collections
using Sequence Pattern Indexing
Chuancong Gao, Sebastian Michel
Max-Planck Institute for Informatics & Saarland University
Saarbrücken, Germany
EDBT, March 28, 2012
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 1 / 31
Motivation
Lost track of the current presidential campaign topics?
Get an update on the key phrases of Mitt Romney!
• We want a concise summary.
• Solution: mine the top-k phrases w.r.t. a subset of all documents.
• The subset of documents is determined in an ad-hoc fashion,
• e.g., using keyword queries or restrictions on categorical attributes.
Motivation (Example)
1988 George Bush Kinder, Gentler Nation
1992 Bill Clinton Don't stop thinking about tomorrow
1992 Bill Clinton Putting People First
1992 Ross Perot Ross for Boss
1996 Bill Clinton Building a bridge to the 21st century
1996 Bob Dole The Better Man for a Better America
2000 Al Gore Prosperity and progress
2000 George W. Bush Compassionate conservatism
2000 George W. Bush Leave no child behind
2000 Ralph Nader Government of, by, and for the people...
2004 John Kerry Let America be America Again
2004 George W. Bush Yes, America Can!
2008 John McCain Country First
2008 Barack Obama Change We Can Believe In
2008 Barack Obama Yes We Can!
http://www.presidentsusa.net/campaignslogans.html
Problem Statement
Given: a document corpus
Task: Create an Index
• that organizes documents and phrases
• in a compact way
• to query for top-k phrases of arbitrary ad-hoc collections.
• Devise appropriate algorithms to use the index efficiently.
Problem Definition
Given a document corpus D and a subset D' ⊆ D.
Find the top-k frequent phrases with the highest interestingness in D'.
Ad-hoc Document Subset
D' is not known upfront.
Phrase
A phrase p is contained in document d iff p is a continuous sub-sequence
of d (notation p ⊑ d).
• p's length is restricted by min-len and max-len.
Problem Definition (Cont.)
Frequency
The frequency of a phrase p in D' is called the support value of p w.r.t.
D', denoted by $\mathrm{sup}^{D'}_{p}$.
• $\mathrm{sup}^{D'}_{p} = |\{d \mid d \in D' \wedge p \sqsubseteq d\}|$
• A phrase p is called frequent in a document collection D' iff p's support
is at least the user-specified threshold min-sup_{D'}.
Interestingness
The interestingness measure we adopted is the confidence value (or
relevance value), defined as
$$\mathrm{conf}^{D'}_{p} = \frac{\mathrm{sup}^{D'}_{p}}{\mathrm{sup}^{D}_{p}}$$
Example
Find the top-k (k = 3) interesting phrases for 4 documents,
with corpus D = {d1, d2, d3, d4} and ad-hoc subset D' = {d2, d3, d4}.
(a) Documents
ID   Document
d1   eabefcad
d2   dcbeafe
d3   fceacd
d4   acfdbeacd
(b) Top-k Phrases (rows p1 to p3 are the top-k result)
ID   Phrase   Conf.
p1   ac       2/2
p2   bea      2/2
p3   ea       3/4
p4   be       2/3
···  ···      ···
(min-sup_{D'} = 2, k = 3, min-len = 2, max-len = 4)
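The support and confidence definitions can be checked directly on the four example documents. The following is a minimal Python sketch, not the authors' implementation: the names DOCS, SUBSET, support, and confidence are illustrative, and D' = {d2, d3, d4} is an assumption inferred from the confidence values listed in the table.

```python
from fractions import Fraction

# The four example documents; each character is one item (word).
DOCS = {"d1": "eabefcad", "d2": "dcbeafe", "d3": "fceacd", "d4": "acfdbeacd"}
# Assumed ad-hoc subset D', inferred from the listed confidence values.
SUBSET = {"d2", "d3", "d4"}


def support(phrase, doc_ids):
    """Number of documents in doc_ids that contain phrase contiguously."""
    return sum(1 for d in doc_ids if phrase in DOCS[d])


def confidence(phrase, subset):
    """Local support in the subset divided by global support in the corpus.

    Assumes the phrase occurs at least once in the corpus.
    """
    return Fraction(support(phrase, subset), support(phrase, DOCS))


for p in ("ac", "bea", "ea", "be"):
    print(p, support(p, SUBSET), confidence(p, SUBSET))
```

This brute-force check scans every document per phrase, which is exactly the cost the indexing approaches on the following slides try to avoid.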
Outline
Existing Approaches
Sequence Pattern Indexing
Experiments
Conclusion
Existing Approaches - Phrase Inverted Indexing
(A. Simitsis et al. - Multidimensional content eXploration. VLDB’08)
Indexing Steps
For each phrase p, store an inverted list that contains the IDs of all
documents containing p.
Example index:
ID Document IDs
p1 {d1, d2}
p2 {d1, d3}
p3 {d1, d3, d4}
p4 {d1, d2, d3, d4}
Given an ad-hoc collection D', the algorithm requires a full scan of all
inverted lists.
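To illustrate why this approach needs a full scan, here is a small Python sketch of a phrase inverted index over the example documents from the earlier slide. It is an assumption-laden illustration rather than the method of Simitsis et al.: the function names are hypothetical, every phrase of length 2 to 4 is indexed, and no frequency threshold is applied.

```python
from collections import defaultdict

DOCS = {"d1": "eabefcad", "d2": "dcbeafe", "d3": "fceacd", "d4": "acfdbeacd"}


def build_inverted_index(docs, min_len=2, max_len=4):
    """Map each phrase to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                index[text[i:i + n]].add(doc_id)
    return index


def local_supports(index, subset):
    """Local support of every phrase in the subset.

    Note the loop over *all* inverted lists: the subset is not known
    upfront, so no list can be skipped.
    """
    result = {}
    for phrase, doc_ids in index.items():
        hits = len(doc_ids & subset)
        if hits:
            result[phrase] = hits
    return result
```

The query cost grows with the number of distinct phrases in the whole corpus, independent of how small the ad-hoc subset is.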
Existing Approaches - Forward Indexing
(S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics.
VLDB’10)
Indexing Steps
Builds a forward list for each document d ∈ D, containing all the IDs of
phrases contained in d.
Example index structure:
ID Forward List
d1 {p1:2, p2:2, p3:3, p4:4}
d2 {p1:2, p4:4}
d3 {p2:2, p3:3, p4:4}
d4 {p3:3, p4:4}
Numbers after “:” denote global support values.
Existing Approaches - Forward Indexing (Cont.)
Query Processing
Compute top-k phrases with a |D'|-way merge join.
• The global support value of p is stored explicitly in the list.
• The local support value and the confidence are computed during the join process.
• Nice: only the forward lists of documents in D' need to be scanned.
• Even better: early termination can be applied:
  • phrase IDs in each forward list are stored in ascending order of global
  support values
  • then, $\mathrm{conf}^{D'}_{p}$ has the upper bound $\min\left(1, \frac{|D'|}{\mathrm{sup}^{D}_{p}}\right)$.
  • We can stop once the top-k results found so far cannot be topped anymore.
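The merge join with early termination can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of Bedathur et al.: forward lists are modeled as lists of (global support, phrase ID) tuples sorted in ascending order, the min-sup filter is omitted, and the names FORWARD and top_k_forward are hypothetical. The data is the example index from the previous slide.

```python
import heapq
from itertools import groupby

# Forward lists from the example: (global support, phrase ID), ascending.
FORWARD = {
    "d1": [(2, "p1"), (2, "p2"), (3, "p3"), (4, "p4")],
    "d2": [(2, "p1"), (4, "p4")],
    "d3": [(2, "p2"), (3, "p3"), (4, "p4")],
    "d4": [(3, "p3"), (4, "p4")],
}


def top_k_forward(forward_lists, subset, k):
    """Top-k phrases by confidence via a |D'|-way merge with early stop."""
    streams = [forward_lists[d] for d in subset]
    merged = heapq.merge(*streams)  # global order: (global support, phrase)
    best = []                       # min-heap of (confidence, phrase)
    for (gsup, phrase), group in groupby(merged):
        # Upper bound for this phrase and for every phrase still to come,
        # because global support only grows along the merged stream.
        bound = min(1.0, len(subset) / gsup)
        if len(best) == k and best[0][0] >= bound:
            break                   # nothing left can beat the current top-k
        local_sup = sum(1 for _ in group)
        conf = local_sup / gsup
        if len(best) < k:
            heapq.heappush(best, (conf, phrase))
        elif conf > best[0][0]:
            heapq.heapreplace(best, (conf, phrase))
    return sorted(best, reverse=True)
```

For the subset {d1, d2} and k = 1, the scan stops right after p1, since p1 already reaches confidence 1 and the bound for all later phrases is at most 1.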
Prefix-Maximal Indexing
(S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics.
VLDB’10)
Key Idea
Store only prefix-maximal phrases (a small subset of all phrases) w.r.t. each
document in the corpus.
• A frequent phrase p is called maximal w.r.t. a document d ∈ D iff ∄p' ⊑ d
such that p' is frequent and p is a proper sub-phrase of p'.
• A frequent phrase p is called prefix-maximal w.r.t. a document d ∈ D iff
∄p' ⊑ d such that p' is frequent and p is a proper prefix of p'.
Significantly reduces the index size compared to Forward Indexing.
Prefix-Maximal Indexing: Comparison
Comparison: all phrases, maximal phrases, and prefix-maximal phrases:
ID Phrases
d1 {ad, be, bef, befc, ca, cad, ea, ef, efc, efca, fc, fca, fcad}
d2 {be, bea, cb, cbe, cbea, dc, dcb, dcbe, ea, fe}
d3 {ac, acd, cd, ce, cea, ceac, ea, eac, eacd, fc, fce, fcea}
d4 {ac, acd, acf, acfd, be, bea, cd, cf, cfd, ea, eac, fd}
· · · · · ·
ID Prefix-Maximal Phrases
d1 {ad, befc, cad, ea, efca, fcad}
d2 {bea, cbea, dcbe, ea, fe}
d3 {acd, cd, ceac, eacd, fcea}
d4 {acd, acfd, bea, cfd, eac, fd}
· · · · · ·
ID Maximal Phrases
d1 {befc, ea, efca, fcad}
d2 {cbea, dcbe, fe}
d3 {ceac, eacd, fcea}
d4 {acd, acfd, bea, eac}
· · · · · ·
Maximal phrases keep the index small, but computing the top-k phrases becomes difficult.
Existing Approaches - Prefix-Maximal Indexing (Cont.)
The top-k phrases are computed by a |D'|-way merge join.
• Support values are computed when merging all the forward lists in D'.
• The confidence values are calculated using a separate global
support value dictionary.
• Issue:
  • A full scan of the forward lists of D' is required, since no early
  termination can be applied.
Objectives
Devise an approach that
1. is as fast as Forward Indexing (i.e., we must devise
early stopping)
2. and has an index as small as (or smaller than) that of
Prefix-Maximal Indexing
Outline
Existing Approaches
Sequence Pattern Indexing
Experiments
Conclusion
Our Approach - Key Ideas
We also store prefix-maximal phrases for each document d ∈ D, but ...
Lexicographical order vs. matching position order:
ID Prefix-Maximal Phrases
d1 {ad, befc, cad, ea, efca, fcad}
d2 {bea, cbea, dcbe, ea, fe}
d3 {acd, cd, ceac, eacd, fcea}
d4 {acd, acfd, bea, cfd, eac, fd}
· · · · · ·
ID Prefix-Maximal Phrases (Ordered by Position)
d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7}
d2 {dcbe@1, cbea@2, bea@3, ea@4, fe@6}
d3 {fcea@1, ceac@2, eacd@3, acd@4, cd@5}
d4 {acfd@1, cfd@2, fd@3, bea@5, eacd@6, acd@7}
· · · · · ·
Highlighting on the slide marks two kinds of redundancy: a duplicate prefix sub-phrase compared to the previous phrase, and duplicate prefixes among phrases.
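Ordering by matching position makes the overlap between neighboring phrases visible. The Python sketch below is illustrative only: it uses the first occurrence of each phrase as its position (a simplifying assumption), and the helper names order_by_position and shared_with_previous are hypothetical.

```python
def order_by_position(doc, phrases):
    """Sort phrases by their first 1-based matching position in doc."""
    ordered = []
    for p in phrases:
        pos = doc.find(p)
        if pos < 0:
            raise ValueError(f"{p!r} does not occur in {doc!r}")
        ordered.append((pos + 1, p))
    return sorted(ordered)


def shared_with_previous(ordered):
    """Per phrase, the number of leading items already covered by the
    phrase right before it in position order."""
    shared, prev_end = [], 0
    for pos, p in ordered:
        shared.append(max(0, min(len(p), prev_end - pos + 1)))
        prev_end = pos + len(p) - 1  # last position covered by this phrase
    return shared


d1 = "eabefcad"
ordered = order_by_position(d1, {"ad", "befc", "cad", "ea", "efca", "fcad"})
print(ordered)
print(shared_with_previous(ordered))
```

For d1, the phrase efca@4 repeats the three items "efc" of befc@3, and cad@6 lies entirely inside fcad@5; this is the redundancy the index structure on the next slides removes.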
Our Approach - Indexing
ID Prefix-Maximal Phrases (Ordered by Position)
d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7}
[Figure: forest built from the phrases of d1, with nodes n1 to n12; each node is labeled with one item (a to f) and its possible lengths (1-4).]
• Sub-phrases are organized in a forest structure; each node contains
one item + length information.
• Links from outside point to start items of indexed prefix-maximal
phrases.
• Possible length information is stored in a bit array.
Memory Layout
[Figure: memory layout of the forest for d1 (nodes n1 to n12). Each cell holds a word or link together with possible-length bit arrays (labeled 1-4 and 1-6) and an end-of-branch flag; the field widths shown are 8 bits, 24 bits, 1+15 bits, and 16 bits. Numbers below the cells indicate relative memory addresses.]
Each node occupies 4 bytes (32 bits): the first byte stores flags and the
last 3 bytes store the item.
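A 32-bit node of this shape can be packed and unpacked with plain bit operations. The Python sketch below only fixes what the slide states (one flag byte followed by a 24-bit item); the meaning of the individual flag bits, the constant END_OF_BRANCH, and all function names are assumptions made for illustration.

```python
import struct

ITEM_BITS = 24
ITEM_MASK = (1 << ITEM_BITS) - 1

# Hypothetical flag layout: the top bit marks the end of a branch and the
# low four bits form the possible-length bit array (lengths 1 to 4).
END_OF_BRANCH = 0x80


def pack_node(flags, item_id):
    """Pack one flag byte and a 24-bit item ID into a 32-bit word."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit into one byte")
    if not 0 <= item_id <= ITEM_MASK:
        raise ValueError("item ID must fit into 24 bits")
    return (flags << ITEM_BITS) | item_id


def unpack_node(word):
    """Return (flags, item_id) of a packed 32-bit node."""
    return word >> ITEM_BITS, word & ITEM_MASK


def to_bytes(word):
    """Serialize big-endian so that the flag byte comes first."""
    return struct.pack(">I", word)
```

Three bytes allow 16,777,216 distinct item IDs, which comfortably covers the roughly two million distinct words of the datasets used in the evaluation.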
Our Approach - Indexing (Cont.) - Improving
[Figure: the forest for d1 from the previous slides, with nodes n1 to n12 and possible lengths (1-4).]
• Still some duplications like e → f → c → a and a → d.
• Further improvement: utilize specific matching positions (instead of
just matching orders).
• Idea: Use graph structure to eliminate more duplicates.
[Figure: graph structure for d1 after the improvement, with nodes n1 to n8 instead of n1 to n12.]
Our Approach - Top-k Phrase Computing
Computation is done in two interleaved steps: a |D'|-way merge join and
pattern growth.
Merge join approach is used on phrase prefixes of length 1
• The index contains an address list for the different items in each document d.
• Apply a |D'|-way merge join on these address lists to retrieve all
length-1 phrases.
Extend each length-1 phrase
In each step, the prefix is extended with one of the locally frequent items.
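The pattern-growth idea can be sketched in Python on the example documents. This is a simplified illustration, not the authors' algorithm: it scans the raw documents of the subset instead of merging the index's address lists, it computes local support only, and the name grow_frequent_phrases is hypothetical.

```python
from collections import defaultdict

DOCS = {"d1": "eabefcad", "d2": "dcbeafe", "d3": "fceacd", "d4": "acfdbeacd"}


def grow_frequent_phrases(docs, subset, min_sup, min_len, max_len):
    """Locally frequent phrases of the subset with their local support."""
    # Step 1: occurrences (doc ID, start offset) of every length-1 phrase.
    starts = defaultdict(list)
    for d in subset:
        for i, item in enumerate(docs[d]):
            starts[item].append((d, i))

    found = {}

    def extend(prefix, occurrences):
        sup = len({d for d, _ in occurrences})
        if sup < min_sup:
            return  # an infrequent prefix cannot have frequent extensions
        if len(prefix) >= min_len:
            found[prefix] = sup
        if len(prefix) == max_len:
            return
        # Step 2: group the occurrences by the item that follows the prefix.
        by_next = defaultdict(list)
        for d, i in occurrences:
            j = i + len(prefix)
            if j < len(docs[d]):
                by_next[docs[d][j]].append((d, i))
        for item, occs in by_next.items():
            extend(prefix + item, occs)

    for item, occs in starts.items():
        extend(item, occs)
    return found


print(grow_frequent_phrases(DOCS, ["d2", "d3", "d4"], 2, 2, 4))
```

Because a prefix is dropped as soon as its local support falls below min-sup, only locally frequent phrases are ever extended.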
Optimizations
Early Stopping
• Can tell when it is “OK” to stop retrieving more length-1 patterns.
Search Space Pruning
• During expansion of a length-1 pattern to longer patterns, stop when the
pattern becomes unpromising.
Our Approach - Summary
Our Method's Advantages:
• Index with little redundancy.
• Algorithms fully utilize this new index structure.
• Visiting only frequent phrases:
  • with early termination in the |D'|-way merge join process,
  • and search space pruning in the pattern-growth process.
Outline
Existing Approaches
Sequence Pattern Indexing
Experiments
Conclusion
Evaluation
Different versions of our approach, plus two competitors.
Using a workstation with an Intel Xeon W3520 (2.66 GHz) CPU and 24 GB of
memory.
Implementation in C#. Index in memory.
Algorithms under Comparison
Algorithm         Description
Baselines
ForwardIndex      Forward Indexing
PreMaxIndex       Prefix-Maximal Indexing
Our Approaches
SeqPattIndex      Our Indexing
SeqPattIndex_IM   Our Improved Indexing
SeqPattIndex_NP   SeqPattIndex with Early Termination and
                  Search Space Pruning Turned off
Datasets
                   Global                   Per Document
Dataset            #Doc.        #Word       Max. Len.   Avg. Len.   Avg. #Word
PubMed Titles      17,826,927   2,028,673   167         11.59       10.99
PubMed Abstracts   2,500,000    2,075,526   1599        149.72      92.1
Extracted from PubMed, the largest free publication database on life
sciences and biomedical topics.
• Titles are rather short, with little redundancy per title.
• Abstracts contain much more information and more redundancy per
document.
Great to study pros and cons of the discussed approaches.
Queries: Generated
• to simulate different kinds of queries (by size of D')
• grouped phrases by global support according to several ranges
Evaluation - In-Memory Index Size (in MB)
[Figure: in-memory index size (in MB) vs. minimum support (5 to 25) for SeqPattIndex, SeqPattIndex_IM, PreMaxIndex, and ForwardIndex; panels (a) PubMed Titles and (b) PubMed Abstracts.]
• PubMed Titles: the improved index is close in size to the original one, as phrases in titles have little
redundancy.
Evaluation - Average Querying Time (in sec)
PubMed Titles
[Figure: average querying time (in sec) for SeqPattIndex, SeqPattIndex_NP, PreMaxIndex, and ForwardIndex; panels (a) varying |D'| (k = 10, min-sup = 5) and (b) varying k (min-sup = 5).]
Evaluation - Average Querying Time (in sec)
PubMed Abstracts
[Figure: average querying time (in sec) for SeqPattIndex, SeqPattIndex_NP, PreMaxIndex, and ForwardIndex; panels (a) varying |D'| (k = 10, min-sup = 5) and (b) varying k (min-sup = 5).]
Outline
Existing Approaches
Sequence Pattern Indexing
Experiments
Conclusion
Conclusion
• Presented an approach to ad-hoc frequent phrase mining.
• It makes use of the redundancy of sequentially aligned
phrases.
• The index is very compact.
• High performance in query processing, based on early
stopping and search space pruning.
