SEQUEL: Query Completion via Pattern Mining on Multi-Column Structural Data
Chuancong Gao, Qingyan Yang, Jianyong Wang
Tsinghua University, Beijing, China
http://dbgroup.cs.tsinghua.edu.cn/chuancong/sequel
Structural Data Description
The DBLP Computer Science Bibliography (DBLP)
• > 1,400,000 Publication Entries
• Four Attributes for each Publication Entry:
  • Authors (e.g. Jiawei Han, Guozhu Dong, Yiwen Yin)
  • Title (e.g. Efficient Mining of Partial Periodic Patterns in Time Series Database)
  • Venue (e.g. ICDE)
  • Year (e.g. 1999)
Mined Pattern Structure
1. Title Phrase “frequent patterns” appears 17 times in Venue “icdm”
2. Title Phrase “pattern” appears 14 times for Authors “jian pei” and “jiawei han”
Advantages Compared to Other Systems
• Suggests Patterns mined from the underlying Data instead of Query Logs
  • More Accurate and Meaningful
  • Low Amount and Quality of Query Logs on Structural Data
• No need to Explicitly Specify Different Columns in the Query
• Suggests Phrases instead of Single Terms
• Fast for both Offline Pattern Mining and Online Suggestion
[Figure: Pattern Index Structure – Trie Tree, example on columns Title Phrase and Venue. A Title Phrase Index and a Venue Index, each a Trie built from blank, normal, and phrase-end nodes, shown alongside a table of selected patterns (columns: Title Phrase, Venue, id, sup), including “data”, “data icde”, “data www”, “data web www”, “database icde”, “icde”, and “www”.]
Suggestion Process
[Figure: system workflow. Offline Part: Structural Data → Formalize → Mine & Index → Mined Patterns and Indexes for Each Column. Online Part: Query → Preprocess → Try to Match Greedily on Each Column Index → Patterns for m Match Combinations → Top-k Selection on Last-Matched Column for m Combinations → Top-k Selection from m×k Candidates → Output. “≥” denotes a Ranking Score Comparison.]
STEP 1: Search the index of each column and find at least one combination (matching order) of columns that matches the input query.
E.g., the query “www da” is matched using the Title Phrase and Venue indexes in the Trie figure above.
STEP 2: Suggest on the last matched column of each matching order.
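The two online steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `PATTERNS` table and its support values, the `COLUMNS` tuple, and the helpers `column_phrases`, `matching_orders`, and `suggest` are all hypothetical. Plain sets stand in for the per-column indexes, and support is assumed to serve as the ranking score.

```python
import heapq

# Hypothetical mined patterns as ({column: phrase}, support); toy values, not the poster's data.
PATTERNS = [
    ({"title": "data"}, 50),
    ({"title": "data", "venue": "www"}, 14),
    ({"title": "database", "venue": "icde"}, 30),
    ({"title": "data mining", "venue": "www"}, 9),
    ({"venue": "www"}, 40),
]
COLUMNS = ("title", "venue")


def column_phrases(column):
    """Stand-in for one column index: every phrase seen in that column."""
    return {pattern[column] for pattern, _ in PATTERNS if column in pattern}


def matching_orders(tokens, used=()):
    """STEP 1: yield matching orders, i.e. lists of (column, text) covering all tokens.

    Every segment except the last must be a complete phrase of its column;
    the last segment only needs to be a prefix, since the user is still typing.
    """
    if not tokens:
        yield []
        return
    for column in COLUMNS:
        if column in used:
            continue
        phrases = column_phrases(column)
        for end in range(len(tokens), 0, -1):  # greedy: longest run of tokens first
            text = " ".join(tokens[:end])
            if end == len(tokens):
                matched = any(phrase.startswith(text) for phrase in phrases)
            else:
                matched = text in phrases
            if matched:
                for rest in matching_orders(tokens[end:], used + (column,)):
                    yield [(column, text)] + rest
                break  # keep only the longest match for this column


def suggest(query, k=3):
    """STEP 2: top-k per matching order on its last-matched column, then top-k overall."""
    tokens = query.lower().split()
    candidates = []
    for order in matching_orders(tokens):
        if not order:
            continue  # empty query: nothing to complete
        *complete, (last_column, prefix) = order
        matches = [
            (support, pattern)
            for pattern, support in PATTERNS
            if all(pattern.get(col) == text for col, text in complete)
            and pattern.get(last_column, "").startswith(prefix)
        ]
        candidates += heapq.nlargest(k, matches, key=lambda c: c[0])  # k per matching order
    return heapq.nlargest(k, candidates, key=lambda c: c[0])  # k out of the m×k candidates


for support, pattern in suggest("www da", k=2):
    print(support, pattern)
```

With the toy table, the query “www da” yields a single matching order (Venue “www”, then the Title Phrase prefix “da”), and the two suggestions are the patterns with support 14 and 9. The sketch does not deduplicate candidates that several matching orders might share.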
Pattern Mining Algorithm
Based on the frequent sequential pattern mining algorithm PrefixSpan:
• Treat Authors as an Itemset
• Treat Title as a Sequence
• Treat Venue & Year as Single Items
• Concatenate all the columns together as a new Sequence
• Mine and Index
Minimum Support (Frequency) Threshold used: 10
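The formalize-and-mine procedure can be illustrated with a compact PrefixSpan-style sketch. It makes several simplifying assumptions that are not from the poster: the Authors itemset is flattened into sorted single items, gaps between matched items are allowed, and the names `formalize` and `prefixspan`, the record layout, and the toy records are hypothetical. The toy run uses a threshold of 2 rather than the poster's 10.

```python
from collections import defaultdict


def formalize(record):
    """Concatenate all columns of one entry into a single sequence of (column, token) items."""
    # Simplification: the Authors itemset is flattened into sorted single items.
    sequence = [("author", name) for name in sorted(a.lower() for a in record["authors"])]
    sequence += [("title", word) for word in record["title"].lower().split()]
    sequence.append(("venue", record["venue"].lower()))
    sequence.append(("year", str(record["year"])))
    return sequence


def prefixspan(db, min_support):
    """Return {pattern: support} for every subsequence found in at least min_support sequences."""
    frequent = {}

    def mine(prefix, projected):
        # projected: (sequence id, start position) pairs, one per sequence containing the prefix
        first_seen = defaultdict(dict)  # item -> {sequence id: position after first occurrence}
        for seq_id, start in projected:
            for pos in range(start, len(db[seq_id])):
                first_seen[db[seq_id][pos]].setdefault(seq_id, pos + 1)
        for item, positions in first_seen.items():
            if len(positions) >= min_support:
                pattern = prefix + (item,)
                frequent[pattern] = len(positions)
                mine(pattern, list(positions.items()))  # recurse on the projected database

    mine((), [(seq_id, 0) for seq_id in range(len(db))])
    return frequent


records = [  # toy entries with placeholder authors
    {"authors": ["bob", "alice"], "title": "frequent patterns", "venue": "ICDM", "year": 2001},
    {"authors": ["alice"], "title": "mining frequent patterns", "venue": "ICDM", "year": 2002},
    {"authors": ["bob"], "title": "sequential patterns", "venue": "ICDE", "year": 2001},
]
patterns = prefixspan([formalize(r) for r in records], min_support=2)
print(patterns[(("title", "frequent"), ("title", "patterns"), ("venue", "icdm"))])
```

On the toy records, the title phrase “frequent patterns” followed by the venue “icdm” is found in two entries, which mirrors the shape of the mined pattern examples on the poster.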
• Trie trees serve as the pattern index, for fast column text matching
• Every column has one corresponding Trie tree
• All the indexes share a global table storing all the patterns
• The indexes take close to 2GB of memory in total
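A per-column Trie over a shared pattern table might look like the sketch below. The class names `TrieNode` and `ColumnTrie`, the `PATTERN_TABLE` layout, and the assignment of each phrase to a column are assumptions made for illustration; only the phrases themselves are taken from the poster's listed patterns.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.pattern_ids = []  # filled only at phrase-end nodes


class ColumnTrie:
    """Character Trie for one column; phrase-end nodes point into the shared pattern table."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase, pattern_id):
        node = self.root
        for char in phrase:
            node = node.children.setdefault(char, TrieNode())
        node.pattern_ids.append(pattern_id)

    def _walk(self, text):
        """Follow text from the root; return the node reached, or None if it falls off."""
        node = self.root
        for char in text:
            node = node.children.get(char)
            if node is None:
                return None
        return node

    def exact(self, phrase):
        """Pattern ids whose value in this column is exactly phrase."""
        node = self._walk(phrase)
        return [] if node is None else list(node.pattern_ids)

    def complete(self, prefix):
        """Pattern ids of every phrase in this column that starts with prefix."""
        start = self._walk(prefix)
        stack = [] if start is None else [start]
        ids = []
        while stack:
            node = stack.pop()
            ids.extend(node.pattern_ids)
            stack.extend(node.children.values())
        return sorted(ids)


# Global table shared by all column indexes (column assignment is an assumption).
PATTERN_TABLE = [
    {"title": "data"},
    {"title": "data", "venue": "icde"},
    {"title": "data", "venue": "www"},
    {"title": "database", "venue": "icde"},
    {"venue": "icde"},
    {"venue": "www"},
]

indexes = {}
for pattern_id, pattern in enumerate(PATTERN_TABLE):
    for column, phrase in pattern.items():
        indexes.setdefault(column, ColumnTrie()).insert(phrase, pattern_id)

print(indexes["title"].complete("da"))  # ids of patterns whose title phrase starts with "da"
print(indexes["venue"].exact("www"))    # ids of patterns whose venue is exactly "www"
```

Storing only pattern ids at phrase-end nodes keeps each column's Trie small and leaves the pattern data in one shared table, which is one plausible reading of the shared global table described above.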

CIKM 2010 Demo - SEQUEL: query completion via pattern mining on multi-column structural data
