SlideShare a Scribd company logo
Top-k Interesting Phrase Mining in Ad-hoc Collections
using Sequence Pattern Indexing
Chuancong Gao, Sebastian Michel
Max-Planck Institute for Informatics & Saarland University
Saarbr¨ucken, Germany
EDBT, March 28, 2012
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 1 / 31
Motivation
Lost overview on current presidential campaign topics?
Get an update on key phrases of Mitt Romney!
• Want concise summary.
• Solution: Mining top-k phrases w.r.t. subset of all documents.
• Sub-set of documents is determined in ad-hoc fashion.
• E.g., using keyword queries, restriction to categorical attributes.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 2 / 31
Motivation (Example)
1988 George Bush Kinder, Gentler Nation
1992 Bill Clinton Dont stop thinking about tomorrow
1992 Bill Clinton Putting People First
1992 Ross Perot Ross for Boss
1996 Bill Clinton Building a bridge to the 21st century
1996 Bob Dole The Better Man for a Better America
2000 Al Gore Prosperity and progress
2000 George W. Bush Compassionate conservatism
2000 George W. Bush Leave no child behind
2000 Ralph Nader Government of, by, and for the people...
2004 John Kerry Let America be America Again
2004 George W. Bush Yes, America Can!
2008 John McCain Country First
2008 Barack Obama Change We Can Believe In
2008 Barack Obama Yes We Can!
http://www.presidentsusa.net/campaignslogans.html
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 3 / 31
Problem Statement
Given: a document corpus
Task: Create an Index
• that organizes documents and phrases
• in a compact way
• to query for top-k phrases of arbitrary ad-hoc collections.
• Devise appropriate algorithms to use the index efficiently.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 4 / 31
Problem Definition
Given a document corpus D and a subset D .
Find the top-k frequent phrases with the highest interestingness in D .
Ad-hoc Document Subset
D is not known upfront
Phrase
A phrase p is contained in document d iff p is a continuous sub-sequence
of d (notation p d).
• p’s length is restricted by min-len and max-len.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 5 / 31
Problem Definition (Cont.)
Frequency
The frequency of a phrase p in D is called the support value of p w.r.t.
D , denoted by supD
p .
• supD
p = |{d|d ∈ D ∧ p d}|
• A phrase is called frequent in a document collection D iff p’s support
is larger than the user-specified threshold min-supD .
Interestingness
The interestingness measure we adopted is the confidence value (or
relevance value), defined as
conf D
p =
supD
p
supD
p
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 6 / 31
Example
Find the top-k (k = 3) interesting phrases for 4 documents.
D
D
ID Document
d1 eabefcad
d2 dcbeafe
d3 fceacd
d4 acfdbeacd
(a) Documents
k
ID Phrase Conf.
p1 ac 2/2
p2 bea 2/2
p3 ea 3/4
p4 be 2/3
· · · · · ·
(b) Top-k Phrases
(min-supD = 2, k = 3, min-len = 2, max-len = 4)
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 7 / 31
Outline
Existing Approaches
Sequence Patter Indexing
Experiments
Conclusion
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 8 / 31
Existing Approaches - Phrase Inverted Indexing
(A. Simitsis et al. - Multidimensional content eXploration. VLDB’08)
Indexing Steps
For each phrase p, store an inverted list that contains the IDs of all
documents containing p.
Example index:
ID Document IDs
p1 {d1, d2}
p2 {d1, d3}
p3 {d1, d3, d4}
p4 {d1, d2, d3, d4}
Given an ad-hoc collection D : Alg. requires a full scan of all the inverted
lists.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 9 / 31
Existing Approaches - Forward Indexing
(S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics.
VLDB’10)
Indexing Steps
Builds a forward list for each document d ∈ D, containing all the IDs of
phrases contained in d.
Example index structure:
ID Forward List
d1 {p1:2, p2:2, p3:3, p4:4}
d2 {p1:2, p4:4}
d3 {p2:2, p3:3, p4:4}
d4 {p3:3, p4:4}
Numbers after “:” denote global support values.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 10 / 31
Existing Approaches - Forward Indexing (Cont.)
Query Processing
Compute top-k phrases with |D |-way merge join.
• Global support value of p stored explicitly in list.
• Local support value + confidence is computed during the join process.
• Nice: only the forward lists of documents in D need be to scanned.
• Even better: can apply early termination:
• phrase IDs in each forward list stored in ascending order of global
support values
• then, conf D
p has an upper bound min 1,
|D |
supD
p
.
• Can stop if the top-k results we found can not be topped anymore.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 11 / 31
Prefix-Maximal Indexing
(S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics.
VLDB’10)
Key Idea
Store only prefix-maximal phrases (small subset of all phrases) w.r.t. each
document in the corpus.
• A phrase p is called maximal w.r.t. a document d ∈ D iff ∃p that p
is frequent and p p .
• A phrase p is called prefix-maximal w.r.t. a document d ∈ D iff p is
maximal and p that p is maximal and p is a prefix of p .
Significantly reduces the index size comparing to Forward Indexing.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 12 / 31
Phrase-Maximal Index. Comparison
Comparison: all phrases, maximal phrases, and prefix-maximal phrases:
ID Phrases
d1 {ad, be, bef, befc, ca, cad, ea, ef, efc, efca, fc, fca, fcad}
d2 {be, bea, cb, cbe, cbea, dc, dcb, dcbe, ea, fe}
d3 {ac, acd, cd, ce, cea, ceac, ea, eac, eacd, fc, fce, fcea}
d4 {ac, acd, acf, acfd, be, bea, cd, cf, cfd, ea, eac, fd}
· · · · · ·
ID Prefix-Maximal Phrases
d1 {ad, befc, cad, ea, efca, fcad}
d2 {bea, cbea, dcbe, ea, fe}
d3 {acd, cd, ceac, eacd, fcea}
d4 {acd, acfd, bea, cfd, eac, fd}
· · · · · ·
ID Maximal Phrases
d1 {befc, ea, efca, fcad}
d2 {cbea, dcbe, fe}
d3 {ceac, eacd, fcea}
d4 {acd, acfd, bea, eac}
· · · · · ·
Max. phrases keep index small, but top-k phrases computation is difficult.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 13 / 31
Existing Approaches - Phrase-Maximal Indexing (Cont.)
The top-k phrases are computed by |D |-way merge join.
• Support values computed when merging all the forward lists in D .
• The confidence values are calculated by using a separate global
support value dictionary.
• Issues:
• A full scan of the forward lists of D is required, since no early
termination can be applied.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 14 / 31
Objectives
Devise an approach that
1. is as fast as Forward Indexing (i.e., we must devise
early stopping)
2. and has an index as small (or smaller than)
Phrase-Maximal Indexing
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 15 / 31
Outline
Existing Approaches
Sequence Patter Indexing
Experiments
Conclusion
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 16 / 31
Our Approach - Key Ideas
We also store prefix-maximal phrases for each document d ∈ D, but ...
Lexicographical order vs. matching position order:
ID Prefix-Maximal Phrases
d1 {ad, befc, cad, ea, efca, fcad}
d2 {bea, cbea, dcbe, ea, fe}
d3 {acd, cd, ceac, eacd, fcea}
d4 {acd, acfd, bea, cfd, eac, fd}
· · · · · ·
ID Prefix-Maximal Phrases (Ordered by Position)
d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7}
d2 {dcbe@1, cbea@2, bea@3, ea@4, fe@6}
d3 {fcea@1, ceac@2, eacd@3, acd@4, cd@5}
d4 {acfd@1, cfd@2, fd@3, bea@5, eacd@6, acd@7}
· · · · · ·
– Duplicate Prefix Sub-Phrase Comparing to Previous Phrase – Duplicate Prefixes among Phrases
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 17 / 31
Our Approach - Indexing
ID Prefix-Maximal Phrases (Ordered by Position)
d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7}
a
1 2
e
1
b
1
e
2
f
1
3
c
2
4
f2
c3
a4
a3
d d4
1 2
3
2
n1 n2 n3 n4 n5 n6
n7 n8 n9
n10 n11n12
Possible Lengths (1-4)
• Sub-phrases are organized in a forest structure; each node contains
one item + length information.
• Links from outside point to start items of indexed prefix-maximal
phrases.
• Possible length information is stored in bit array.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 18 / 31
Memory Layout
a
1 2
e
1
b
1
e
2
f
1
3 c
2
4
f2
c3
a4
a3
d d4
1 2
3
2
n1 n2 n3 n4 n5 n6
n7 n8 n9
n10 n11n12
Possible Lengths (1-4)
ae d f c a e f cb
00
0010
0000
0011
0000
0001
000014
0101
0000
0010
1000
0011
0100
0001
0000
0000
1000
0100
0100
0010
0000
00 04 06 10 14 18 22 26 30 34 38
00
06 10 26 34
8 bits24 bits1+15 bits
16 bits
a d0001
1000
0100
1100
42 46
38
50
0100
1100
Word / Link
End of Branch
Possible Lengths (1-6)
(Number below cells indicate a relative memory address)
Each node is assigned with 4 bytes (32 bits); the first byte storing flags, the last 3 bytes store
the item.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 19 / 31
Our Approach - Indexing (Cont.) - Improving
a1 2
e1
b1
e2
f1
3
c2
4
f
2
c3 a4
a3
d d4
1 2
3
2
n1 n2 n3 n4 n5 n6
n7 n8 n9
n10 n11n12
Possible Lengths (1-4)
• Still some duplications like e → f → c → a and a → d.
• Further improvement: utilize specific matching positions (instead of
just matching orders).
• Idea: Use graph structure to eliminate more duplicates.
a
1 2
e1
b1
e2
f1
3
c
2
4 a3 d4
1 2
3
n1 n2 n3 n4 n5 n6 n7 n8
2
3 4
2
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 20 / 31
Our Approach - Top-k Phrase Computing
Computation is done in two steps (interleaved): |D |-way merge join and
growing patterns:
Merge join approach is used on phrase prefixes with length 1
• Index contains an address list for different items in document d.
• Apply |D |-way merge join on these address lists to retrieve all
length-1 phrases.
Extend each length-1 phrase
In each step: the prefix is extended with one of the local frequent items.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 21 / 31
Optimizations
Early Stopping
• Can tell when it is “OK” to stop retrieving more length 1 patterns
Search Space Pruning
• During expansion of a length-1 pattern to larger patterns, stop when
unpromising.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 22 / 31
Our Approach - Summary
Our Method’s Advantages:
• Index with little redundancy
• Algorithms fully utilize this new index structure.
• Visiting only frequent phrases:
• With early termination in |D |-merge join process.
• And search space pruning in pattern-growth process.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 23 / 31
Outline
Existing Approaches
Sequence Patter Indexing
Experiments
Conclusion
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 24 / 31
Evaluation
Different versions of own approach, plus two competitors.
Using an Intel Xeon W3520 (2.66 GHz) CPU and 24 GB memory
workstation.
Implementation in C#. Index in memory.
Algorithms under Comparison
Algorithm Description
Baselines
ForwardIndex Forward Indexing
PreMaxIndex Prefix-Maximal Indexing
Our Approaches
SeqPattIndex Our Indexing
SeqPattIndex IM Our Improved Indexing
SeqPattIndex NP SeqPattIndex With Early Termination and
Search Space Pruning Turned off
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 25 / 31
Datasets
Global Per Document
Dataset #Doc. #Word Max. Len. Avg. Len. Avg. #Word
PubMed Titles 17,826,927 2,028,673 167 11.59 10.99
PubMed Abstracts 2,500,000 2,075,526 1599 149.72 92.1
Extracted from PubMed – the largest free publication database on life
sciences and biomedical topics.
• Titles are rather short, little redundancy per title.
• Abstracts contain much more information, more redundancy
document.
Great to study pros and cons of the discussed approaches.
Queries: Generated
• to simulate different kind of queries (by size of D )
• grouped phrases by global support according to several ranges
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 26 / 31
Evaluation - In-Memory Index Size (in MB)
SeqPattIndex
SeqPattIndex_IM
PreMaxIndex
ForwardIndex
1000
2000
3000
4000
5000
5 10 15 20 25
IndexSize(inMB)
Minimum Support
(a) PubMed Titles
1000
2000
3000
4000
5000
6000
7000
5 10 15 20 25
IndexSize(inMB)
Minimum Support
(b) PubMed Abstracts
• PubMed Titles: improved approach close to original one, as phrases in titles have little
redundancy
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 27 / 31
Evaluation - Average Querying Time (in sec)
PubMed Titles
SeqPattIndex SeqPattIndex_NP PreMaxIndex ForwardIndex
0.001
0.01
0.1
1
10
100 1000 10000 100000
Avg.QueryingTime(insec)
|D’|
(a) Varying |D | (k = 10, min-sup = 5)
0.1
1
10
100
10 100 1000 10000
Avg.QueryingTime(insec)
k
(b) Varying k (min-sup = 5)
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 28 / 31
Evaluation - Average Querying Time (in sec)
PubMed Abstract
SeqPattIndex SeqPattIndex_NP PreMaxIndex ForwardIndex
0.01
0.1
1
10
100
100 1000 10000 100000
Avg.QueryingTime(insec)
|D’|
(a) Varying |D | (k = 10, min-sup = 5)
0.1
1
10
100
10000
10 100 1000 10000
Avg.QueryingTime(insec) k
(b) Varying k (min-sup = 5)
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 29 / 31
Outline
Existing Approaches
Sequence Patter Indexing
Experiments
Conclusion
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 30 / 31
Conclusion
• Presented approach to ad-hoc frequent phrases mining
• Making use of redundancy of sequentially aligned
phrases
• Index is very compact.
• High performance in query processing based on early
stopping and search space pruning.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 31 / 31

More Related Content

What's hot

Slides
SlidesSlides
Slidesbutest
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
Mehwish Alam
 
Introducing a Tool for Concurrent Argumentation
Introducing a Tool for Concurrent ArgumentationIntroducing a Tool for Concurrent Argumentation
Introducing a Tool for Concurrent Argumentation
Carlo Taticchi
 
Looking for Invariant Operators in Argumentation
Looking for Invariant Operators in ArgumentationLooking for Invariant Operators in Argumentation
Looking for Invariant Operators in Argumentation
Carlo Taticchi
 
A Matrix Based Approach for Weighted Argumentation Frameworks
A Matrix Based Approach for Weighted Argumentation FrameworksA Matrix Based Approach for Weighted Argumentation Frameworks
A Matrix Based Approach for Weighted Argumentation Frameworks
Carlo Taticchi
 
多媒體資料庫(New)3rd
多媒體資料庫(New)3rd多媒體資料庫(New)3rd
多媒體資料庫(New)3rdKevingo Tsai
 
Looking for Invariant Operators in Argumentation
Looking for Invariant Operators in ArgumentationLooking for Invariant Operators in Argumentation
Looking for Invariant Operators in Argumentation
Carlo Taticchi
 
A Concurrent Language for Argumentation
A Concurrent Language for ArgumentationA Concurrent Language for Argumentation
A Concurrent Language for Argumentation
Carlo Taticchi
 
ITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and Grammars
Tonny Madsen
 
Proposal for adding Named Dimensions to HDF5 Arrays
Proposal for adding Named Dimensions to HDF5 ArraysProposal for adding Named Dimensions to HDF5 Arrays
Proposal for adding Named Dimensions to HDF5 Arrays
The HDF-EOS Tools and Information Center
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
Sakthivel C R
 
FScaFi: A Core Calculus for Collective Adaptive Systems Programming
FScaFi: A Core Calculus for Collective Adaptive Systems ProgrammingFScaFi: A Core Calculus for Collective Adaptive Systems Programming
FScaFi: A Core Calculus for Collective Adaptive Systems Programming
Roberto Casadei
 
Algorithm
AlgorithmAlgorithm
Algorithm
Shakil Ahmed
 
A Concurrent Argumentation Language for Negotiation and Debating
A Concurrent Argumentation Language for Negotiation and DebatingA Concurrent Argumentation Language for Negotiation and Debating
A Concurrent Argumentation Language for Negotiation and Debating
Carlo Taticchi
 
Bit Vectors Siddhesh
Bit Vectors SiddheshBit Vectors Siddhesh
Bit Vectors Siddhesh
Siddhesh Bhobe
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
Shadi Saleh
 
Semidefinite programming, binary codes and a graph coloring problem
Semidefinite programming, binary codes and a graph coloring problemSemidefinite programming, binary codes and a graph coloring problem
Semidefinite programming, binary codes and a graph coloring problem
Chao Li
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slideMohd Iqbal Al-farabi
 

What's hot (20)

Slides
SlidesSlides
Slides
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
 
Introducing a Tool for Concurrent Argumentation
Introducing a Tool for Concurrent ArgumentationIntroducing a Tool for Concurrent Argumentation
Introducing a Tool for Concurrent Argumentation
 
Looking for Invariant Operators in Argumentation
Looking for Invariant Operators in ArgumentationLooking for Invariant Operators in Argumentation
Looking for Invariant Operators in Argumentation
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
A Matrix Based Approach for Weighted Argumentation Frameworks
A Matrix Based Approach for Weighted Argumentation FrameworksA Matrix Based Approach for Weighted Argumentation Frameworks
A Matrix Based Approach for Weighted Argumentation Frameworks
 
多媒體資料庫(New)3rd
多媒體資料庫(New)3rd多媒體資料庫(New)3rd
多媒體資料庫(New)3rd
 
Looking for Invariant Operators in Argumentation
Looking for Invariant Operators in ArgumentationLooking for Invariant Operators in Argumentation
Looking for Invariant Operators in Argumentation
 
A Concurrent Language for Argumentation
A Concurrent Language for ArgumentationA Concurrent Language for Argumentation
A Concurrent Language for Argumentation
 
ITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and Grammars
 
Proposal for adding Named Dimensions to HDF5 Arrays
Proposal for adding Named Dimensions to HDF5 ArraysProposal for adding Named Dimensions to HDF5 Arrays
Proposal for adding Named Dimensions to HDF5 Arrays
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
 
FScaFi: A Core Calculus for Collective Adaptive Systems Programming
FScaFi: A Core Calculus for Collective Adaptive Systems ProgrammingFScaFi: A Core Calculus for Collective Adaptive Systems Programming
FScaFi: A Core Calculus for Collective Adaptive Systems Programming
 
Algorithm
AlgorithmAlgorithm
Algorithm
 
Lec 4,5
Lec 4,5Lec 4,5
Lec 4,5
 
A Concurrent Argumentation Language for Negotiation and Debating
A Concurrent Argumentation Language for Negotiation and DebatingA Concurrent Argumentation Language for Negotiation and Debating
A Concurrent Argumentation Language for Negotiation and Debating
 
Bit Vectors Siddhesh
Bit Vectors SiddheshBit Vectors Siddhesh
Bit Vectors Siddhesh
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
 
Semidefinite programming, binary codes and a graph coloring problem
Semidefinite programming, binary codes and a graph coloring problemSemidefinite programming, binary codes and a graph coloring problem
Semidefinite programming, binary codes and a graph coloring problem
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slide
 

Similar to EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing

Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Vrije Universiteit Amsterdam
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
WrushabhShirsat3
 
Semantic Recognition of Ontology Refactoring
Semantic Recognition of Ontology RefactoringSemantic Recognition of Ontology Refactoring
Semantic Recognition of Ontology RefactoringGerd Groener
 
Topic modelling
Topic modellingTopic modelling
Topic modelling
Shubhmay Potdar
 
All good things
All good thingsAll good things
All good things
Pradipto Das
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Sean Golliher
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
WU (Vienna University of Economics and Business)
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
rusbase
 
Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008
Roman Stanchak
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
Liwei Ren任力偉
 
Make money fast! department of computer science-copypasteads.com
Make money fast!   department of computer science-copypasteads.comMake money fast!   department of computer science-copypasteads.com
Make money fast! department of computer science-copypasteads.com
jackpot201
 
话题模型2
话题模型2话题模型2
话题模型2
Bryan Gummibearehausen
 
Session 1.5 supporting virtual integration of linked data with just-in-time...
Session 1.5   supporting virtual integration of linked data with just-in-time...Session 1.5   supporting virtual integration of linked data with just-in-time...
Session 1.5 supporting virtual integration of linked data with just-in-time...
semanticsconference
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
Aaron Li
 
Chapter 10 ds
Chapter 10 dsChapter 10 ds
Chapter 10 ds
Hanif Durad
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
Mark Levy
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
Paolo Missier
 
Revealing Entities From Texts With a Hybrid Approach
Revealing Entities From Texts With a Hybrid ApproachRevealing Entities From Texts With a Hybrid Approach
Revealing Entities From Texts With a Hybrid Approach
Julien PLU
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
cscpconf
 

Similar to EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing (20)

Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
 
Lec1
Lec1Lec1
Lec1
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Semantic Recognition of Ontology Refactoring
Semantic Recognition of Ontology RefactoringSemantic Recognition of Ontology Refactoring
Semantic Recognition of Ontology Refactoring
 
Topic modelling
Topic modellingTopic modelling
Topic modelling
 
All good things
All good thingsAll good things
All good things
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
 
Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
 
Make money fast! department of computer science-copypasteads.com
Make money fast!   department of computer science-copypasteads.comMake money fast!   department of computer science-copypasteads.com
Make money fast! department of computer science-copypasteads.com
 
话题模型2
话题模型2话题模型2
话题模型2
 
Session 1.5 supporting virtual integration of linked data with just-in-time...
Session 1.5   supporting virtual integration of linked data with just-in-time...Session 1.5   supporting virtual integration of linked data with just-in-time...
Session 1.5 supporting virtual integration of linked data with just-in-time...
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
 
Chapter 10 ds
Chapter 10 dsChapter 10 ds
Chapter 10 ds
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
Revealing Entities From Texts With a Hybrid Approach
Revealing Entities From Texts With a Hybrid ApproachRevealing Entities From Texts With a Hybrid Approach
Revealing Entities From Texts With a Hybrid Approach
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 

More from Chuancong Gao

WI 2017 - Preference-driven Similarity Join
WI 2017 - Preference-driven Similarity JoinWI 2017 - Preference-driven Similarity Join
WI 2017 - Preference-driven Similarity Join
Chuancong Gao
 
IRI 2017 - Schemaless Join for Result Set Preferences
IRI 2017 - Schemaless Join for Result Set PreferencesIRI 2017 - Schemaless Join for Result Set Preferences
IRI 2017 - Schemaless Join for Result Set Preferences
Chuancong Gao
 
Master Thesis 2010 - Pattern Discovery Algorithms for Classification
Master Thesis 2010 - Pattern Discovery Algorithms for ClassificationMaster Thesis 2010 - Pattern Discovery Algorithms for Classification
Master Thesis 2010 - Pattern Discovery Algorithms for Classification
Chuancong Gao
 
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...
Chuancong Gao
 
WWW 2008 Poster - Efficient mining of frequent sequence generators
WWW 2008 Poster - Efficient mining of frequent sequence generatorsWWW 2008 Poster - Efficient mining of frequent sequence generators
WWW 2008 Poster - Efficient mining of frequent sequence generators
Chuancong Gao
 
CIKM 2009 - Efficient itemset generator discovery over a stream sliding window
CIKM 2009 - Efficient itemset generator discovery over a stream sliding windowCIKM 2009 - Efficient itemset generator discovery over a stream sliding window
CIKM 2009 - Efficient itemset generator discovery over a stream sliding window
Chuancong Gao
 
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
Chuancong Gao
 
KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...
KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...
KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...
Chuancong Gao
 

More from Chuancong Gao (8)

WI 2017 - Preference-driven Similarity Join
WI 2017 - Preference-driven Similarity JoinWI 2017 - Preference-driven Similarity Join
WI 2017 - Preference-driven Similarity Join
 
IRI 2017 - Schemaless Join for Result Set Preferences
IRI 2017 - Schemaless Join for Result Set PreferencesIRI 2017 - Schemaless Join for Result Set Preferences
IRI 2017 - Schemaless Join for Result Set Preferences
 
Master Thesis 2010 - Pattern Discovery Algorithms for Classification
Master Thesis 2010 - Pattern Discovery Algorithms for ClassificationMaster Thesis 2010 - Pattern Discovery Algorithms for Classification
Master Thesis 2010 - Pattern Discovery Algorithms for Classification
 
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...
CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column ...
 
WWW 2008 Poster - Efficient mining of frequent sequence generators
WWW 2008 Poster - Efficient mining of frequent sequence generatorsWWW 2008 Poster - Efficient mining of frequent sequence generators
WWW 2008 Poster - Efficient mining of frequent sequence generators
 
CIKM 2009 - Efficient itemset generator discovery over a stream sliding window
CIKM 2009 - Efficient itemset generator discovery over a stream sliding windowCIKM 2009 - Efficient itemset generator discovery over a stream sliding window
CIKM 2009 - Efficient itemset generator discovery over a stream sliding window
 
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
 
KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...
KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...
KDD 2010 - Direct mining of discriminative patterns for classifying uncertain...
 

Recently uploaded

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing

  • 1. Top-k Interesting Phrase Mining in Ad-hoc Collections using Sequence Pattern Indexing Chuancong Gao, Sebastian Michel Max-Planck Institute for Informatics & Saarland University Saarbr¨ucken, Germany EDBT, March 28, 2012 C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 1 / 31
  • 2. Motivation Lost overview on current presidential campaign topics? Get an update on key phrases of Mitt Romney! • Want concise summary. • Solution: Mining top-k phrases w.r.t. subset of all documents. • Sub-set of documents is determined in ad-hoc fashion. • E.g., using keyword queries, restriction to categorical attributes. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 2 / 31
  • 3. Motivation (Example) 1988 George Bush Kinder, Gentler Nation 1992 Bill Clinton Dont stop thinking about tomorrow 1992 Bill Clinton Putting People First 1992 Ross Perot Ross for Boss 1996 Bill Clinton Building a bridge to the 21st century 1996 Bob Dole The Better Man for a Better America 2000 Al Gore Prosperity and progress 2000 George W. Bush Compassionate conservatism 2000 George W. Bush Leave no child behind 2000 Ralph Nader Government of, by, and for the people... 2004 John Kerry Let America be America Again 2004 George W. Bush Yes, America Can! 2008 John McCain Country First 2008 Barack Obama Change We Can Believe In 2008 Barack Obama Yes We Can! http://www.presidentsusa.net/campaignslogans.html C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 3 / 31
  • 4. Problem Statement Given: a document corpus Task: Create an Index • that organizes documents and phrases • in a compact way • to query for top-k phrases of arbitrary ad-hoc collections. • Devise appropriate algorithms to use the index efficiently. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 4 / 31
  • 5. Problem Definition Given a document corpus D and a subset D . Find the top-k frequent phrases with the highest interestingness in D . Ad-hoc Document Subset D is not known upfront Phrase A phrase p is contained in document d iff p is a continuous sub-sequence of d (notation p d). • p’s length is restricted by min-len and max-len. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 5 / 31
  • 6. Problem Definition (Cont.) Frequency The frequency of a phrase p in D is called the support value of p w.r.t. D , denoted by supD p . • supD p = |{d|d ∈ D ∧ p d}| • A phrase is called frequent in a document collection D iff p’s support is larger than the user-specified threshold min-supD . Interestingness The interestingness measure we adopted is the confidence value (or relevance value), defined as conf D p = supD p supD p C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 6 / 31
  • 7. Example Find the top-k (k = 3) interesting phrases for 4 documents. D D ID Document d1 eabefcad d2 dcbeafe d3 fceacd d4 acfdbeacd (a) Documents k ID Phrase Conf. p1 ac 2/2 p2 bea 2/2 p3 ea 3/4 p4 be 2/3 · · · · · · (b) Top-k Phrases (min-supD = 2, k = 3, min-len = 2, max-len = 4) C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 7 / 31
  • 8. Outline Existing Approaches Sequence Patter Indexing Experiments Conclusion C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 8 / 31
  • 9. Existing Approaches - Phrase Inverted Indexing (A. Simitsis et al. - Multidimensional content eXploration. VLDB’08) Indexing Steps For each phrase p, store an inverted list that contains the IDs of all documents containing p. Example index: ID Document IDs p1 {d1, d2} p2 {d1, d3} p3 {d1, d3, d4} p4 {d1, d2, d3, d4} Given an ad-hoc collection D : Alg. requires a full scan of all the inverted lists. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 9 / 31
  • 10. Existing Approaches - Forward Indexing (S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics. VLDB’10) Indexing Steps Builds a forward list for each document d ∈ D, containing all the IDs of phrases contained in d. Example index structure: ID Forward List d1 {p1:2, p2:2, p3:3, p4:4} d2 {p1:2, p4:4} d3 {p2:2, p3:3, p4:4} d4 {p3:3, p4:4} Numbers after “:” denote global support values. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 10 / 31
  • 11. Existing Approaches - Forward Indexing (Cont.) Query Processing Compute top-k phrases with |D |-way merge join. • Global support value of p stored explicitly in list. • Local support value + confidence is computed during the join process. • Nice: only the forward lists of documents in D need be to scanned. • Even better: can apply early termination: • phrase IDs in each forward list stored in ascending order of global support values • then, conf D p has an upper bound min 1, |D | supD p . • Can stop if the top-k results we found can not be topped anymore. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 11 / 31
  • 12. Prefix-Maximal Indexing (S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics. VLDB’10) Key Idea Store only prefix-maximal phrases (small subset of all phrases) w.r.t. each document in the corpus. • A phrase p is called maximal w.r.t. a document d ∈ D iff ∃p that p is frequent and p p . • A phrase p is called prefix-maximal w.r.t. a document d ∈ D iff p is maximal and p that p is maximal and p is a prefix of p . Significantly reduces the index size comparing to Forward Indexing. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 12 / 31
  • 13. Phrase-Maximal Index. Comparison Comparison: all phrases, maximal phrases, and prefix-maximal phrases: ID Phrases d1 {ad, be, bef, befc, ca, cad, ea, ef, efc, efca, fc, fca, fcad} d2 {be, bea, cb, cbe, cbea, dc, dcb, dcbe, ea, fe} d3 {ac, acd, cd, ce, cea, ceac, ea, eac, eacd, fc, fce, fcea} d4 {ac, acd, acf, acfd, be, bea, cd, cf, cfd, ea, eac, fd} · · · · · · ID Prefix-Maximal Phrases d1 {ad, befc, cad, ea, efca, fcad} d2 {bea, cbea, dcbe, ea, fe} d3 {acd, cd, ceac, eacd, fcea} d4 {acd, acfd, bea, cfd, eac, fd} · · · · · · ID Maximal Phrases d1 {befc, ea, efca, fcad} d2 {cbea, dcbe, fe} d3 {ceac, eacd, fcea} d4 {acd, acfd, bea, eac} · · · · · · Max. phrases keep index small, but top-k phrases computation is difficult. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 13 / 31
  • 14. Existing Approaches - Phrase-Maximal Indexing (Cont.) The top-k phrases are computed by |D |-way merge join. • Support values computed when merging all the forward lists in D . • The confidence values are calculated by using a separate global support value dictionary. • Issues: • A full scan of the forward lists of D is required, since no early termination can be applied. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 14 / 31
  • 15. Objectives Devise an approach that 1. is as fast as Forward Indexing (i.e., we must devise early stopping) 2. and has an index as small (or smaller than) Phrase-Maximal Indexing C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 15 / 31
  • 16. Outline Existing Approaches Sequence Patter Indexing Experiments Conclusion C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 16 / 31
  • 17. Our Approach - Key Ideas We also store prefix-maximal phrases for each document d ∈ D, but ... Lexicographical order vs. matching position order: ID Prefix-Maximal Phrases d1 {ad, befc, cad, ea, efca, fcad} d2 {bea, cbea, dcbe, ea, fe} d3 {acd, cd, ceac, eacd, fcea} d4 {acd, acfd, bea, cfd, eac, fd} · · · · · · ID Prefix-Maximal Phrases (Ordered by Position) d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7} d2 {dcbe@1, cbea@2, bea@3, ea@4, fe@6} d3 {fcea@1, ceac@2, eacd@3, acd@4, cd@5} d4 {acfd@1, cfd@2, fd@3, bea@5, eacd@6, acd@7} · · · · · · – Duplicate Prefix Sub-Phrase Comparing to Previous Phrase – Duplicate Prefixes among Phrases C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 17 / 31
  • 18. Our Approach - Indexing ID Prefix-Maximal Phrases (Ordered by Position) d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7} a 1 2 e 1 b 1 e 2 f 1 3 c 2 4 f2 c3 a4 a3 d d4 1 2 3 2 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11n12 Possible Lengths (1-4) • Sub-phrases are organized in a forest structure; each node contains one item + length information. • Links from outside point to start items of indexed prefix-maximal phrases. • Possible length information is stored in bit array. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 18 / 31
  • 19. Memory Layout a 1 2 e 1 b 1 e 2 f 1 3 c 2 4 f2 c3 a4 a3 d d4 1 2 3 2 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11n12 Possible Lengths (1-4) ae d f c a e f cb 00 0010 0000 0011 0000 0001 000014 0101 0000 0010 1000 0011 0100 0001 0000 0000 1000 0100 0100 0010 0000 00 04 06 10 14 18 22 26 30 34 38 00 06 10 26 34 8 bits24 bits1+15 bits 16 bits a d0001 1000 0100 1100 42 46 38 50 0100 1100 Word / Link End of Branch Possible Lengths (1-6) (Number below cells indicate a relative memory address) Each node is assigned with 4 bytes (32 bits); the first byte storing flags, the last 3 bytes store the item. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 19 / 31
  • 20. Our Approach - Indexing (Cont.) - Improving a1 2 e1 b1 e2 f1 3 c2 4 f 2 c3 a4 a3 d d4 1 2 3 2 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11n12 Possible Lengths (1-4) • Still some duplications like e → f → c → a and a → d. • Further improvement: utilize specific matching positions (instead of just matching orders). • Idea: Use graph structure to eliminate more duplicates. a 1 2 e1 b1 e2 f1 3 c 2 4 a3 d4 1 2 3 n1 n2 n3 n4 n5 n6 n7 n8 2 3 4 2 C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 20 / 31
  • 21. Our Approach - Top-k Phrase Computing Computation is done in two steps (interleaved): |D |-way merge join and growing patterns: Merge join approach is used on phrase prefixes with length 1 • Index contains an address list for different items in document d. • Apply |D |-way merge join on these address lists to retrieve all length-1 phrases. Extend each length-1 phrase In each step: the prefix is extended with one of the local frequent items. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 21 / 31
  • 22. Optimizations Early Stopping • Can tell when it is “OK” to stop retrieving more length 1 patterns Search Space Pruning • During expansion of a length-1 pattern to larger patterns, stop when unpromising. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 22 / 31
  • 23. Our Approach - Summary Our Method’s Advantages: • Index with little redundancy • Algorithms fully utilize this new index structure. • Visiting only frequent phrases: • With early termination in |D |-merge join process. • And search space pruning in pattern-growth process. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 23 / 31
  • 24. Outline Existing Approaches Sequence Patter Indexing Experiments Conclusion C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 24 / 31
  • 25. Evaluation Different versions of own approach, plus two competitors. Using an Intel Xeon W3520 (2.66 GHz) CPU and 24 GB memory workstation. Implementation in C#. Index in memory. Algorithms under Comparison Algorithm Description Baselines ForwardIndex Forward Indexing PreMaxIndex Prefix-Maximal Indexing Our Approaches SeqPattIndex Our Indexing SeqPattIndex IM Our Improved Indexing SeqPattIndex NP SeqPattIndex With Early Termination and Search Space Pruning Turned off C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 25 / 31
  • 26. Datasets Global Per Document Dataset #Doc. #Word Max. Len. Avg. Len. Avg. #Word PubMed Titles 17,826,927 2,028,673 167 11.59 10.99 PubMed Abstracts 2,500,000 2,075,526 1599 149.72 92.1 Extracted from PubMed – the largest free publication database on life sciences and biomedical topics. • Titles are rather short, little redundancy per title. • Abstracts contain much more information, more redundancy document. Great to study pros and cons of the discussed approaches. Queries: Generated • to simulate different kind of queries (by size of D ) • grouped phrases by global support according to several ranges C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 26 / 31
  • 27. Evaluation - In-Memory Index Size (in MB) SeqPattIndex SeqPattIndex_IM PreMaxIndex ForwardIndex 1000 2000 3000 4000 5000 5 10 15 20 25 IndexSize(inMB) Minimum Support (a) PubMed Titles 1000 2000 3000 4000 5000 6000 7000 5 10 15 20 25 IndexSize(inMB) Minimum Support (b) PubMed Abstracts • PubMed Titles: improved approach close to original one, as phrases in titles have little redundancy C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 27 / 31
  • 28. Evaluation - Average Querying Time (in sec) PubMed Titles SeqPattIndex SeqPattIndex_NP PreMaxIndex ForwardIndex 0.001 0.01 0.1 1 10 100 1000 10000 100000 Avg.QueryingTime(insec) |D’| (a) Varying |D | (k = 10, min-sup = 5) 0.1 1 10 100 10 100 1000 10000 Avg.QueryingTime(insec) k (b) Varying k (min-sup = 5) C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 28 / 31
  • 29. Evaluation - Average Querying Time (in sec) PubMed Abstract SeqPattIndex SeqPattIndex_NP PreMaxIndex ForwardIndex 0.01 0.1 1 10 100 100 1000 10000 100000 Avg.QueryingTime(insec) |D’| (a) Varying |D | (k = 10, min-sup = 5) 0.1 1 10 100 10000 10 100 1000 10000 Avg.QueryingTime(insec) k (b) Varying k (min-sup = 5) C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 29 / 31
  • 30. Outline Existing Approaches Sequence Patter Indexing Experiments Conclusion C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 30 / 31
  • 31. Conclusion • Presented approach to ad-hoc frequent phrases mining • Making use of redundancy of sequentially aligned phrases • Index is very compact. • High performance in query processing based on early stopping and search space pruning. C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 31 / 31