KEYWORD SEARCH ON STRUCTURED DATABASES
Xiaoyu Chen, Min Li, Yihan Gao, Tianning Xu
Introduction
 Structured data
 Schema as a summary of the data
 Retrieve through structured language
 What would big data bring to structured data retrieval?
Introduction
 In terms of high volume of data
 Hadoop + Pig Latin came to rescue
 However, is this enough?
 Recall how you write a selection query. What do you need to know?
 Can you remember this?
Introduction
 Big data -> big and complicated schema
 Hard to remember and operate!
 May not even fit in main memory!
 What should we do about it?
 How does information retrieval deal with this?
Introduction
 Search based on keywords
 No need for schema
 Efficiency guaranteed using index
 It all seems straightforward and easy
 What are the challenges?
Introduction
 Search for “Apple + company”
 Match to “apple (fruit)”, “Apple Inc.”, “Adam’s apple”
 Which one is correct? How to filter?
Challenge 1: Filtering and disambiguation
Introduction
 Search for “Steve Jobs + Apple”
 Normalization. What to return ?
[Table 1 columns: ID | Name | Gender | Employer | Location]
[Table 2 columns: ID | Company | Location | Type | Product]
[Table 3 columns: ID | Street | City | State | Country]
Challenge 2: Automatic join back
Introduction
 Search for “Jordan”
 Match “Jordan (brand)”, “Michael Jordan (player)”, “Michael Jordan (professor)”, etc.
 All of them should match. Which one is better?
 Ranking
Challenge 3: Ranking of the result
Literature Overview
 Two kinds of approaches
 1. Interpretative approach
 Reuse database query language and index
 Translate the keywords into queries
 Will introduce 3 papers
 2. Un-interpretative approach (focus)
 Typically build own index and data structure
 Model as graph and use graph-based analysis
 Will introduce 3 papers
Literature Overview –
Interpretative approach
 DBXplorer Sanjay Agrawal et al.
 General: two steps
 Publish step: pre-computation, indexing etc.
 Search step: lookup, enumerate over join tree,
generate SQL etc.
 Efficiency:
 Symbol table (index) design
 Symbol table compaction
Literature Overview –
Interpretative approach
 Publish step:
 1: A database is identified, along with the set of
tables and columns within the database to be
published.
 2: Auxiliary tables are created for supporting
keyword searches. E.g. index table
 But, how to build efficient index ?
Literature Overview –
Interpretative approach
 Index goal: for each keyword, find the row_id and column_id of the cells that contain it.
 If the column (attribute) already has an index, we need only a column_id index (reuse the database index)
[Figure: table with columns ID, Name, Gender, Addr, Org and rows 1 to 3, annotated with a column index and a row index]
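The index goal above can be sketched as a small in-memory symbol table. This is a minimal illustration, not DBXplorer's actual storage layout: the function names, the whitespace tokenizer, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict


def build_symbol_table(table_name, columns, rows):
    """Map each lowercase token to the (table, column, row_id) cells containing it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows, start=1):
        for column, value in zip(columns, row):
            for token in str(value).lower().split():
                index[token].add((table_name, column, row_id))
    return index


def column_level(index):
    """Coarser variant: keep only (table, column), relying on an existing
    database index on that column to locate the rows later."""
    return {kw: {(t, c) for t, c, _ in cells} for kw, cells in index.items()}


# Hypothetical sample data shaped like the slide's table (rows 1 to 3).
person_columns = ["Name", "Gender", "Addr", "Org"]
person_rows = [
    ("Steve Jobs", "M", "Cupertino", "Apple"),
    ("Tim Cook", "M", "Cupertino", "Apple"),
    ("Ada Byron", "F", "London", "Analytical"),
]

symbol_table = build_symbol_table("Person", person_columns, person_rows)
column_table = column_level(symbol_table)
```

The column-level table is much smaller than the cell-level one, which is the trade-off the slide describes when a column already has a database index.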
Literature Overview –
Interpretative approach
 Compress index table
 Foreign key constraint etc.
 General Algorithm -- CP-Comp
[Figure: Sells table (Name, Product, …) and Person table (Name, Gender, …); Table 1: compressed table; Table 2: uncompressed table]
Literature Overview –
Interpretative approach
 Search step
 Step 1: Look up the index to find the columns/rows of the database that contain the query keywords.
 Step 2: Identify and enumerate all subsets of tables in the database that, if joined, might contain rows having all keywords (join trees).
 Step 3: For each enumerated join tree, construct (and execute) a SQL statement that joins the tables in the tree and selects those rows that contain all keywords. The final rows are ranked and presented to the user.
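Step 3 can be sketched with Python's built-in sqlite3 module. The schema, the sample rows, and the helper `join_tree_sql` are hypothetical, and the join tree is given by hand here rather than enumerated as in Step 2.

```python
import sqlite3

# Hypothetical two-table database, held in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, employer_id INTEGER);
CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO company VALUES (1, 'Apple'), (2, 'Pixar');
INSERT INTO person VALUES (1, 'Steve Jobs', 1), (2, 'Tim Cook', 1), (3, 'Ed Catmull', 2);
""")


def join_tree_sql(tables, join_edges, keyword_columns):
    """Build one SELECT for a join tree.

    tables          -- table names in the tree
    join_edges      -- (left_column, right_column) foreign-key equalities
    keyword_columns -- (column, keyword) pairs taken from the index lookup
    """
    joins = [f"{left} = {right}" for left, right in join_edges]
    matches = [f"{column} LIKE ?" for column, _ in keyword_columns]
    sql = f"SELECT * FROM {', '.join(tables)} WHERE " + " AND ".join(joins + matches)
    params = [f"%{keyword}%" for _, keyword in keyword_columns]
    return sql, params


# Query "jobs apple": the index lookup is assumed to have placed
# "jobs" in person.name and "apple" in company.name.
sql, params = join_tree_sql(
    tables=["person", "company"],
    join_edges=[("person.employer_id", "company.id")],
    keyword_columns=[("person.name", "jobs"), ("company.name", "apple")],
)
rows = conn.execute(sql, params).fetchall()
```

Keywords are passed as bound parameters; only table and column names, which come from the schema rather than from the user, are interpolated into the SQL text.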
Literature Overview –
Interpretative approach
 Join Tree example:
Literature Overview –
Interpretative approach
 Keyword Search in Databases: The Power of
RDBMS
 Lu Qin et al.
 SIGMOD 09
Integrating IR and DB
 DB techniques provide users with efficient
ways to access structured data in RDBMSs
 IR techniques allow users to use keywords to
access unstructured data
 E.g., structural keyword search finds how tuples that contain keywords in an RDB are interconnected (the structure). Three types of semantics:
Schema-based approach
Connected Tree Semantics: a query result is a minimal total joining network of tuples; adjacent tuples are joined by a foreign key reference, #tuples <= Tmax
Connected Tree Semantics
 1. Candidate Network (CN) generation: generate relational algebra expressions that create trees containing all keywords, up to a certain size
 2. CN evaluation: evaluate the generated CNs using SQL
Schema-based approach
Distinct Root Semantics: a query result is a collection of tuples all reachable from a root; the root uniquely determines the tuples, distance(any tuple, root) <= Dmax
Schema-based approach
Distinct Core Semantics: a query result is a multi-center subgraph (community); the keyword tuples uniquely define a community, distance(any keyword tuple, any center tuple) <= Dmax
Distinct Core/Root Semantics
 1. Create pairs between each tuple containing a keyword and every other tuple, recording the shortest distance between them
 2. Generate graphs with distinct cores/roots using SQL
Literature Overview –
Interpretative approach
 Keyword search over relational databases: a
metadata approach.
 Bergamaschi et al.
 SIGMOD 11
Problem Definition
 A database D is a collection of relational tables. Each relational table has a name, attributes, and value domains. All these elements together form the vocabulary.
 A keyword query q is an ordered list of keywords. Each keyword specifies an element of interest.
 A configuration of a keyword query on a database is an injective mapping from the keywords to the vocabulary of the database.
 Task: first derive the top configurations based on some metrics, then interpret them as SQL queries (select-project-join interpretations).
From Keywords to Queries
 Need to consider the inter-dependency of the query keywords: introduce two kinds of weights, the intrinsic weights and the contextual weights.
 Need to give a ranked list of all the configurations: develop an algorithm that is based on and extends the Hungarian (a.k.a. Munkres) algorithm.
 Need to separate the evaluation of schema terms from that of value terms: evaluate the value weights based on the schema mapping.
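To make the notion of a ranked list of configurations concrete, the sketch below enumerates every injective keyword-to-term mapping and sorts by total weight. It is a brute-force stand-in, not the extended Hungarian algorithm from the paper, and it uses only intrinsic weights (no contextual weights). The vocabulary terms and weight values are invented for illustration.

```python
from itertools import permutations


def rank_configurations(keywords, vocabulary, weight, top_k=3):
    """Return the top_k injective mappings keyword -> vocabulary term.

    permutations() never repeats a term within one tuple, so every
    candidate mapping is injective by construction.
    """
    scored = []
    for terms in permutations(vocabulary, len(keywords)):
        score = sum(weight.get((kw, term), 0.0) for kw, term in zip(keywords, terms))
        scored.append((score, dict(zip(keywords, terms))))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical intrinsic weights: how likely a keyword refers to a term.
weights = {
    ("jobs", "person.name"): 0.9,
    ("jobs", "company.name"): 0.2,
    ("apple", "company.name"): 0.8,
    ("apple", "company.product"): 0.5,
    ("apple", "person.name"): 0.1,
}
vocabulary = ["person.name", "company.name", "company.product"]

ranking = rank_configurations(["jobs", "apple"], vocabulary, weights)
best_score, best_config = ranking[0]
```

Enumeration grows factorially with the number of keywords, which is why an assignment algorithm such as the Hungarian method is attractive for anything beyond toy inputs.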
Contributions and Insights
 Formally define the problem of keyword querying over relational databases that lack a-priori access to the database instance.
 Introduce the notion of a weight as a measure of the likelihood that the semantics of a keyword are represented by a database structure. Need to consider both intrinsic weights and contextual weights.
 Extend and exploit the Hungarian (a.k.a. Munkres) algorithm to generate a ranking of different interpretations.
Literature Overview
 Two kinds of approaches
 1. Interpretative approach
 Reuse database query language and index
 Translate the keywords into queries
 2. Un-interpretative approach
 Typically build own index and data structure
 Model as graph and use graph-based analysis
Literature Overview –
Un-interpretative approach
 Effective Keyword Search in Relational
Databases
 Fang Liu et al.
 SIGMOD 06
Difficulties of Keyword Search
 Keyword search in text databases only needs to compute a score for each document
 Keyword search on an RDBMS is more complicated (relations, attributes, tuples):
 1. Generate tuple trees (answers) by joining tuples from different tables
 2. Rank the answers by computing a score
Generate Answer Tuple Trees
 Tuple tree answer rules:
1. Each leaf node in a tuple tree must contain at least one keyword
2. Each tuple appears at most once in a tree
 For each relation, separate tuples into tuple sets that contain keywords and tuple sets that contain all tuples; then join adjacent sets in the schema graph, within the constraints of the answer trees
Ranking Tuple Trees
 Treat the text of each tuple within an answer tree as a “document”
 Assign a similarity rating between each document and the query, normalizing for:
 Term Frequency
 Document Frequency
 Document Length
 Compute the score for a tuple tree as the average over all its documents
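The ranking idea can be sketched as follows. The exact formula is an assumption: a log-damped term frequency, a simple inverse document frequency, and a linear length penalty stand in for the normalizations listed above, and are not the paper's own weighting. All function names and the sample statistics are hypothetical.

```python
import math


def score_document(query_terms, doc_tokens, doc_freq, num_docs, avg_len):
    """Similarity between one tuple's text and the query."""
    score = 0.0
    length_norm = len(doc_tokens) / avg_len          # document length
    for term in query_terms:
        tf = doc_tokens.count(term)                  # term frequency
        if tf == 0:
            continue
        idf = math.log((num_docs + 1) / doc_freq[term])  # document frequency
        score += (1 + math.log(tf)) / length_norm * idf
    return score


def score_tuple_tree(query_terms, tree_docs, doc_freq, num_docs, avg_len):
    """Average the per-tuple scores over all tuples in the answer tree."""
    scores = [
        score_document(query_terms, doc, doc_freq, num_docs, avg_len)
        for doc in tree_docs
    ]
    return sum(scores) / len(scores)


# Hypothetical corpus statistics over all tuples in the database.
query = ["jordan", "bulls"]
doc_freq = {"jordan": 2, "bulls": 1}
num_docs, avg_len = 4, 3.0

tree_a = [["michael", "jordan"], ["chicago", "bulls"]]
tree_b = [["jordan", "brand"], ["nike", "shoe", "company"]]

score_a = score_tuple_tree(query, tree_a, doc_freq, num_docs, avg_len)
score_b = score_tuple_tree(query, tree_b, doc_freq, num_docs, avg_len)
```

Because the tree score is an average, a tree padded with tuples that match no keyword is penalized, which favors compact answers.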
Focused work
 Keyword Searching and Browsing in
Databases using BANKS
 Gaurav Bhalotia et al.
 ICDE 02
BANKS (Browsing And Keyword
Searching)
 A system that enables keyword-based search on relational databases, together with data and schema browsing
[Figure: architecture: User <-> (HTTP) <-> BANKS System <-> (JDBC) <-> Database]
Database and Query Model
 Relational Database -> Directed Graph
 Each Tuple in Database -> Node in Graph
 Foreign Key -> Directed Edge
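The mapping above can be sketched in a few lines. The table layout (every row has an `id` column), the `(source_table, fk_column, target_table)` triple format, and the sample rows are assumptions for the example, not part of BANKS itself.

```python
def build_graph(tables, foreign_keys):
    """Turn tuples into nodes and foreign-key references into directed edges.

    tables       -- {table_name: [row_dict, ...]}, each row has an "id"
    foreign_keys -- [(source_table, fk_column, target_table), ...]
    """
    nodes = {
        (table, row["id"]): row
        for table, rows in tables.items()
        for row in rows
    }
    edges = []
    for src_table, fk_column, dst_table in foreign_keys:
        for row in tables[src_table]:
            target = (dst_table, row[fk_column])
            if target in nodes:  # skip dangling references
                edges.append(((src_table, row["id"]), target))
    return nodes, edges


# Hypothetical bibliography database: writes links authors to papers.
tables = {
    "author": [{"id": 1, "name": "S. Sudarshan"}, {"id": 2, "name": "Prasan Roy"}],
    "paper": [{"id": 10, "title": "Keyword search in databases"}],
    "writes": [
        {"id": 100, "author_id": 1, "paper_id": 10},
        {"id": 101, "author_id": 2, "paper_id": 10},
    ],
}
foreign_keys = [("writes", "author_id", "author"), ("writes", "paper_id", "paper")]

nodes, edges = build_graph(tables, foreign_keys)
```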
Database and Query Model
 An answer to a query should be a
subgraph connecting nodes matching
the keywords.
 The importance of a link depends upon
the type of the link i.e. what relations it
connects and on its semantics
 Ignoring directionality would cause problems because of “hubs”, which are connected to a large number of nodes.
Database and Query Model
 We may restrict the information node to
be from a selected set of nodes of the
graph
 We incorporate another interesting
feature, namely node weights, inspired
by prestige rankings
 Node weights and tree weights need to
be combined to get an overall relevance
score
Formal Model
 Node Weight : N(u)
Depends on the prestige
Set the node prestige = the in-degree of
the node
Nodes that have multiple pointers to
them get a higher prestige
Formal Model
 Edge Weights
Some popular tuples can be connected to many other tuples -> each edge gets forward and backward edge weights
Weight of a forward link = the strength of the proximity relationship between two tuples (set to 1 by default)
Weight of a backward link = number of edges pointing to the node (its in-degree)
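The node and edge weights described on these two slides can be sketched as below. The function names are hypothetical, and the sketch assumes no pair of nodes has foreign keys in both directions, so a forward edge and a backward edge never collide on the same key.

```python
from collections import Counter


def node_prestige(edges):
    """Prestige of a node = its in-degree (number of edges pointing to it)."""
    return Counter(dst for _, dst in edges)


def weighted_edges(edges, forward_weight=1.0):
    """For each forward edge u -> v, add a backward edge v -> u.

    The forward edge keeps the default weight; the backward edge is
    weighted by the in-degree of v, so leaving a hub node is expensive.
    """
    in_degree = node_prestige(edges)
    weights = {}
    for u, v in edges:
        weights[(u, v)] = forward_weight
        weights[(v, u)] = float(in_degree[v])
    return weights


# Hypothetical graph: two "writes" tuples point to the same author.
edges = [("w1", "a1"), ("w2", "a1"), ("w1", "p1")]
prestige = node_prestige(edges)
weights = weighted_edges(edges)
```

Here `a1` has two incoming edges, so both backward edges out of `a1` cost 2.0 while every forward edge costs 1.0.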
Formal Model

Result
Result of query “sudarshan soumen”
Searching for the best answer
 Backward Expanding Search Algorithm
Intuition: find vertices from which a forward path exists to at least one node from each Si (the set of nodes matching keyword i).
Run concurrent single-source shortest-path algorithms from each node matching a keyword
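A simplified sketch of the backward expanding idea follows. It departs from the slide in two labeled ways: it runs one multi-source Dijkstra per keyword set rather than one concurrent iterator per matching node, and it only returns candidate roots with their total distance instead of building answer trees. The graph and its weights are hypothetical.

```python
import heapq


def backward_distances(graph, sources):
    """Dijkstra over reversed edges: distance from every node that can
    reach some source by a forward path to its nearest source."""
    reverse = {}
    for u, neighbors in graph.items():
        for v, w in neighbors.items():
            reverse.setdefault(v, {})[u] = w
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for prev, w in reverse.get(node, {}).items():
            nd = d + w
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return dist


def candidate_roots(graph, keyword_node_sets):
    """Nodes with a forward path to at least one node of every keyword set,
    sorted by the sum of their shortest distances."""
    per_keyword = [backward_distances(graph, s) for s in keyword_node_sets]
    common = set(per_keyword[0]).intersection(*per_keyword[1:])
    return sorted((sum(d[n] for d in per_keyword), n) for n in common)


# Hypothetical graph {node: {neighbor: weight}}. Forward edges cost 1.0;
# only the backward edges out of "paper" (cost 2.0) are included.
graph = {
    "writes1": {"sudarshan": 1.0, "paper": 1.0},
    "writes2": {"soumen": 1.0, "paper": 1.0},
    "paper": {"writes1": 2.0, "writes2": 2.0},
}
roots = candidate_roots(graph, [{"sudarshan"}, {"soumen"}])
root_nodes = {node for _, node in roots}
```

Without the backward edges out of `paper`, no node would reach both authors, which illustrates why the model adds backward edges in the first place.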
Searching for the best answer
[Figure: example graph with author nodes (S. Sudarshan, Prasan Roy, Charuta), a paper node (BANKS: Keyword search…), and writes tuples linking authors to the paper]
As an extension of BANKS
 BLINKS: ranked keyword searches on
graphs.
 Hao He et al.
 SIGMOD 07
Introduction
 Efficient ranked keyword searches on schemaless node-labeled graphs.
 Challenges:
 Lack of schema for optimization
 Hard to guarantee strong performance
 Proposed technique
 Backward search algorithm
 SLINKS: single-level index search *
 Extension for scalability: BLINKS ( bi-level index search )
 Contributions
 Cost-balanced expansion based backward search
 Combining indexing with search
 Partition-based indexing (bi-level indexing)
Problem Formulation

Backward search algorithm

A single level index

A single level index

SLINKS Algorithm

BLINKS ( brief idea)
 Problem: the index is too large to store and too expensive to construct for large graphs.
Use a divide-and-conquer approach to create a bi-level index
 Partition the data graph into multiple subgraphs, or blocks.
 Intra-Block Index
 indexes information inside a block
 4 kinds of indexes, 2 of them for separator nodes (important, so treated specially)
 Block Index
 2 simple indexes
Conclusion
 Keyword search challenges:
 Filtering and disambiguation
 Automatic join back
 Ranking of the result
 Additional consideration:
 Efficiency
 Space
Thank you and have fun
