Presentation

KEYWORDS SEARCH ON STRUCTURED DATABASE
Xiaoyu Chen, Min Li, Yihan Gao, Tianning Xu

Introduction
 Structured data
 Schema as a summary of the data
 Retrieve through structured language
 What would big data bring to structured data
retrieval?

Introduction
 For high volumes of data
 Hadoop + Pig Latin came to the rescue
 However, is this enough?
 Recall how you write a selection. What do you need to know?
 Can you remember this?

Introduction
 Big data -> big and complicated schema
 Hard to remember and operate!
 May not even fit in main memory!
 What should we do about it?
 How does information retrieval deal with this?

Introduction
 Search based on keywords
 No need for schema
 Efficiency guaranteed using index
 All seems to be straightforward and easy
 What are the challenges?
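To make "search based on keywords" and "efficiency guaranteed using index" concrete, here is a minimal sketch of an inverted index with AND semantics. The function names and the three toy documents are assumptions made for illustration, not taken from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, keywords):
    """Return ids of documents that contain every keyword (AND semantics)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

# Toy documents (hypothetical).
docs = {
    1: "apple pie recipe",
    2: "Apple company products",
    3: "company picnic",
}
index = build_inverted_index(docs)
result = search(index, ["Apple", "company"])  # only document 2 has both tokens
```

No schema is needed for the lookup, but the index alone cannot tell the fruit from the company, which is where the challenges start.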

Introduction
 Search for “Apple + company”
 Matches “apple (fruit)”, “Apple Inc.”, “Adam’s apple”
 Which one is correct? How to filter?
Challenge 1: Filtering and disambiguation

Introduction
 Search for “Steve Jobs + Apple”
 Normalization. What to return?
Three normalized tables (column headers only):
ID | Name | Gender | Employer | Location
ID | Company | Location | Type | Product
ID | Street | City | State | Country
Challenge 2: Automatic join back

Introduction
 Search for “Jordan”
 Matches “Jordan (brand)”, “Michael Jordan (player)”, “Michael Jordan (professor)”, etc.
 All of them should match. Which one is better?
 Ranking
Challenge 3: Ranking of the result

Literature Overview
 Two kinds of approaches
 1. Interpretative approach
 Reuse database query language and index
 Translate the keywords into queries
 Will introduce 3 papers
 2. Un-interpretative approach (focus)
 Typically build own index and data structure
 Model as graph and use graph-based analysis
 Will introduce 3 papers

Literature Overview –
Interpretative approach
 DBXplorer Sanjay Agrawal et al.
 General: two steps
 Publish step: pre-computation, indexing etc.
 Search step: lookup, enumerate over join tree,
generate SQL etc.
 Efficiency:
 Symbol table (index) design
 Symbol table compaction

 Publish step:
 1: A database is identified, along with the set of
tables and columns within the database to be
published.
 2: Auxiliary tables are created for supporting
keyword searches. E.g. index table
 But how do we build an efficient index?

 Index goal: for each keyword, find the row_id and column_id where it occurs.
 If the column (attribute) already has an index, we need only a column_id index (reuse the database index)
[Figure: example table with columns ID, Name, Gender, Addr, Org and rows 1 to 3, with a column index and a row index marked]
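The two index granularities can be sketched as follows. This is an illustrative sketch only: `build_symbol_table`, `column_level`, and the sample rows are hypothetical and do not reproduce DBXplorer's actual symbol-table layout. Only the column names come from the example table.

```python
from collections import defaultdict

def build_symbol_table(tables):
    """Map each keyword to the (table, column, row_id) cells containing it.

    `tables` maps a table name to a list of row dicts; each row has an "ID" key.
    """
    symbol_table = defaultdict(set)
    for table_name, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                if column == "ID":
                    continue
                for token in str(value).lower().split():
                    symbol_table[token].add((table_name, column, row["ID"]))
    return symbol_table

def column_level(symbol_table):
    """Coarser variant: keep only (table, column) and rely on the
    database's own column index to locate the rows."""
    return {kw: {(t, c) for t, c, _ in cells} for kw, cells in symbol_table.items()}

# Hypothetical rows for a table with the columns from the example.
tables = {
    "Person": [
        {"ID": 1, "Name": "Alice Smith", "Gender": "F", "Addr": "Urbana", "Org": "Acme"},
        {"ID": 2, "Name": "Bob Jones", "Gender": "M", "Addr": "Chicago", "Org": "Acme"},
    ]
}
row_index = build_symbol_table(tables)
col_index = column_level(row_index)
```

The column-level table is smaller because it stores one entry per (keyword, column) instead of one per matching cell.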

 Compress index table
 Foreign key constraint etc.
 General Algorithm -- CP-Comp
[Figure: a Sells table (Name, Product, …) and a Person table (Name, Gender, …); Table 1: compressed table, Table 2: uncompressed table]

 Search step
 Step 1: Look up the index to find the columns/rows of the database that contain the query keywords.
 Step 2: Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. Each such subset forms a join tree.
 Step 3: For each enumerated join tree, a SQL
statement is constructed (and executed) that joins
the tables in the tree and selects those rows that
contain all keywords. The final rows are ranked
and presented to the user.
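Steps 2 and 3 can be sketched for the simplest case, two tables that contain keywords. This is a sketch under stated assumptions: the table names, the schema graph `SCHEMA_EDGES`, and the helper names are hypothetical. The real system enumerates join trees over any number of tables, and real code should use parameterized queries rather than string interpolation.

```python
from collections import deque

# Hypothetical schema graph: each edge carries the join condition between two tables.
SCHEMA_EDGES = {
    ("Person", "Company"): "Person.Employer = Company.ID",
    ("Company", "Address"): "Company.Location = Address.ID",
}

def neighbors(table):
    """Yield (adjacent table, join condition) pairs, treating edges as undirected."""
    for (a, b), cond in SCHEMA_EDGES.items():
        if a == table:
            yield b, cond
        elif b == table:
            yield a, cond

def join_tree(start, goal):
    """Breadth-first search for the shortest chain of joins linking two tables."""
    queue = deque([(start, [start], [])])
    seen = {start}
    while queue:
        table, tables, conds = queue.popleft()
        if table == goal:
            return tables, conds
        for nxt, cond in neighbors(table):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, tables + [nxt], conds + [cond]))
    return None

def build_sql(tables, conds, keyword_hits):
    """keyword_hits: (table, column, keyword) triples found in the index lookup."""
    filters = [f"{t}.{c} LIKE '%{kw}%'" for t, c, kw in keyword_hits]
    where = " AND ".join(conds + filters)
    return f"SELECT * FROM {', '.join(tables)} WHERE {where}"

tables, conds = join_tree("Person", "Address")
sql = build_sql(tables, conds, [("Person", "Name", "Jobs"), ("Address", "City", "Cupertino")])
```

The generated statement joins every table on the path and keeps only the rows in which all keywords occur.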

 Join Tree example:

 Keyword Search in Databases: The Power of
RDBMS
 Lu Qin et al.
 SIGMOD 09

Integrating IR and DB
 DB techniques provide users with efficient
ways to access structured data in RDBMSs
 IR techniques allow users to use keywords to
access unstructured data
 E.g., structural keyword search finds how tuples that contain keywords in an RDB are interconnected (the structure). Three types of semantics:

Schema-based approach
Connected Tree Semantics: query
results in minimal total joining network
of tuples; adjacent tuples joined by
foreign key reference, #tuples <=
Tmax

Connected Tree Semantics
 1. Candidate Network (CN) generation: relational algebra expressions that create trees containing all keywords, up to a certain size
 2. CN evaluation: evaluates generated CNs
using SQL

Distinct Root Semantics: query
results in collection of tuples all
reachable from root; root uniquely
defines tuples, distance(any tuple,
root) <= Dmax

Distinct Core Semantics: query results in
multi-center subgraphs (communities);
keyword tuples uniquely defines a
community, distance(any keyword tuple, any
center tuple) <= Dmax

Distinct Core/Root Semantics
 1. Create pairs between each tuple containing a keyword and every other tuple, recording the shortest distance between them
 2. Generate graphs with distinct cores/roots using SQL
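The distinct root idea can be sketched in memory. The paper evaluates it with SQL inside the RDBMS, so this is only an illustration: it assumes an undirected tuple graph with unit edge weights and at least one keyword, and the function names and toy graph are hypothetical.

```python
from collections import deque

def distances_within(graph, sources, dmax):
    """Multi-source BFS: hop count from the nearest source to each node, capped at dmax."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == dmax:
            continue  # do not expand beyond the distance bound
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def distinct_roots(graph, keyword_tuples, dmax):
    """Return {root: total distance} for roots within dmax of every keyword."""
    per_keyword = [distances_within(graph, srcs, dmax) for srcs in keyword_tuples.values()]
    roots = set(per_keyword[0]).intersection(*per_keyword[1:])
    return {r: sum(d[r] for d in per_keyword) for r in roots}

# Toy tuple graph: author a - writes w1 - paper p - writes w2 - author b.
graph = {
    "a": ["w1"], "w1": ["a", "p"], "p": ["w1", "w2"],
    "w2": ["p", "b"], "b": ["w2"],
}
keyword_tuples = {"k1": ["a"], "k2": ["b"]}
roots = distinct_roots(graph, keyword_tuples, dmax=2)
```

With Dmax = 2 only the paper tuple is close enough to both keyword tuples, so it is the single root.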

 Keyword search over relational databases: a
metadata approach.
 Bergamaschi et al.
 SIGMOD 11

Problem Definition

A database D is a collection of relational tables. Each relational table
contains its name, attributes and value domains. All these elements
together form the vocabulary.

A keyword query q is an ordered list of keywords. Each keyword
specifies an element of interest.

A configuration of a keyword query on a database is an injective
mapping from the keywords to the vocabulary of the database.

Task: First derive the top configurations based on some metrics and
then interpret it as SQL query (select-project-join interpretations)

From Keywords to Queries

Need to consider inter-dependency of the query keywords:
Introduce two different kinds of weights: the intrinsic weights, and the
contextual weights

Need to give a ranked list of all the configurations
Develop an algorithm that is based on and extends the Hungarian (a.k.a.
Munkres) algorithm

Need to separate the process of evaluating the schema terms and
value terms
Evaluate the value weights based on the schema mapping
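To illustrate what a ranked list of configurations looks like, the sketch below enumerates every injective keyword-to-term mapping by brute force and sorts by total weight. This is a stand-in for the extended Hungarian algorithm, not an implementation of it. It uses intrinsic weights only and ignores contextual weights, and the terms and weight values are made up.

```python
from itertools import permutations

def ranked_configurations(keywords, terms, weight):
    """Enumerate every injective keyword -> term mapping, best total weight first.

    Brute force, so only suitable for tiny inputs.
    """
    configs = []
    for chosen in permutations(terms, len(keywords)):
        score = sum(weight[(k, t)] for k, t in zip(keywords, chosen))
        configs.append((score, dict(zip(keywords, chosen))))
    configs.sort(key=lambda c: c[0], reverse=True)
    return configs

# Hypothetical intrinsic weights between keywords and database terms.
keywords = ["jobs", "apple"]
terms = ["Person.Name", "Company.Name", "Product.Name"]
weight = {
    ("jobs", "Person.Name"): 0.9, ("jobs", "Company.Name"): 0.2, ("jobs", "Product.Name"): 0.1,
    ("apple", "Person.Name"): 0.1, ("apple", "Company.Name"): 0.8, ("apple", "Product.Name"): 0.5,
}
ranked = ranked_configurations(keywords, terms, weight)
best_score, best_config = ranked[0]
```

Each configuration in the ranked list would then be interpreted as a select-project-join SQL query.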

Contributions and Insights

Formally define the problem of keyword querying over relational
databases that lack a-priori access to the database instance

Introduce the notion of a weight as a measure of the likelihood that the
semantics of a keyword are represented by a database structure.
Need to consider both intrinsic weights and contextual weights

Extend and exploit the Hungarian (a.k.a., Munkres) algorithm to
generate a ranking of different interpretations.

Literature Overview
 Two kinds of approaches
 1. Interpretative approach
 Reuse database query language and index
 Translate the keywords into queries
 2. Un-interpretative approach
 Typically build own index and data structure
 Model as graph and use graph-based analysis

Un-interpretative approach
 Effective Keyword Search in Relational
Databases
 Fang Liu et al.
 SIGMOD 06

Difficulties of Keyword Search
 Keyword search in text databases only needs to
compute a score for each document
 Keyword search on an RDBMS is more complicated
(relations, attributes, tuples):
 1. Generate tuple trees (answers) by joining
tuples from different tables
 2. Rank the answers by computing score

Generate Answer Tuple Trees
 Tuple tree answer rules:
1. Each leaf node in a tuple tree must contain at
least one keyword
2. Each tuple appears at most once in a tree
 For each relation, separate tuples into a tuple set that
contains keywords and a tuple set that contains all tuples;
then join adjacent sets in the schema graph, within the
constraints on answer trees

Ranking Tuple Trees
 Treat the text of each tuple within an answer
set as a “document”
 Assign similarity rating between each
document and query, normalizing for:
 Term Frequency
 Document Frequency
 Document Length
 Compute score for tuple tree as average over
all documents
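The scoring idea above can be sketched with a generic tf-idf weight and a pivoted length normalization. This is an assumed, simplified formula for illustration and is not claimed to be the exact one in the paper. The slope `s`, the document-frequency table, and the sample texts are hypothetical.

```python
import math

def tuple_score(query, doc, doc_freq, n_docs, avg_len, s=0.2):
    """tf-idf score of one tuple's text, with a pivoted length normalization."""
    words = doc.lower().split()
    norm = (1 - s) + s * len(words) / avg_len  # longer texts are penalized
    score = 0.0
    for term in query:
        tf = words.count(term)
        if tf == 0:
            continue
        idf = math.log((n_docs + 1) / doc_freq[term])  # rarer terms weigh more
        score += tf / norm * idf
    return score

def tree_score(query, tree_docs, doc_freq, n_docs, avg_len):
    """Score of a tuple tree: the average of its tuples' scores."""
    scores = [tuple_score(query, d, doc_freq, n_docs, avg_len) for d in tree_docs]
    return sum(scores) / len(scores)

query = ["keyword", "search"]
doc_freq = {"keyword": 2, "search": 4}  # hypothetical collection statistics
tree_a = ["keyword search systems"]
tree_b = ["keyword search systems", "database tuning guide"]
score_a = tree_score(query, tree_a, doc_freq, n_docs=10, avg_len=3)
score_b = tree_score(query, tree_b, doc_freq, n_docs=10, avg_len=3)
```

Averaging means a tree padded with tuples that match no keyword scores lower than a compact tree with the same matches.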

Focused work
 Keyword Searching and Browsing in
Databases using BANKS
 Gaurav Bhalotia et al.
 ICDE 02

BANKS (Browsing And Keyword
Searching)
 a system which enables keyword-
based search on relational
databases, together with data and
schema browsing
[Figure: User <-> HTTP <-> BANKS System <-> JDBC <-> Database]

Database and Query Model
 Relational Database -> Directed
Graph
 Each Tuple in Database -> Node in
Graph
 Foreign Key -> Directed Edge
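The mapping from a relational database to a directed graph can be sketched as follows. The input layout (`tables` keyed by primary key, `foreign_keys` as triples) and the sample rows are assumptions made for this illustration, not BANKS's internal representation.

```python
def build_graph(tables, foreign_keys):
    """Turn tuples into nodes and foreign-key references into directed edges.

    tables: {table: {primary_key: row_dict}}
    foreign_keys: [(table, column, referenced_table)]
    """
    nodes = {(t, pk) for t, rows in tables.items() for pk in rows}
    edges = set()
    for table, column, ref_table in foreign_keys:
        for pk, row in tables[table].items():
            target = (ref_table, row[column])
            if target in nodes:  # skip dangling references
                edges.add(((table, pk), target))
    return nodes, edges

# Hypothetical bibliographic schema: Writes links an Author to a Paper.
tables = {
    "Author": {"a1": {"name": "S. Sudarshan"}},
    "Paper": {"p1": {"title": "BANKS"}},
    "Writes": {"w1": {"author": "a1", "paper": "p1"}},
}
foreign_keys = [("Writes", "author", "Author"), ("Writes", "paper", "Paper")]
nodes, edges = build_graph(tables, foreign_keys)
```

Each edge points from the referencing tuple to the referenced tuple, matching the direction of the foreign key.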

 An answer to a query should be a
subgraph connecting nodes matching
the keywords.
 The importance of a link depends upon
the type of the link i.e. what relations it
connects and on its semantics
 Ignoring directionality would cause
problems because of “hubs”, which are
connected to a large number of nodes.

 We may restrict the information node to
be from a selected set of nodes of the
graph
 We incorporate another interesting
feature, namely node weights, inspired
by prestige rankings
 Node weights and tree weights need to
be combined to get an overall relevance
score

Formal Model
 Node Weight : N(u)
Depends on the prestige
Set the node prestige = the in-degree of
the node
Nodes that have multiple pointers to
them get a higher prestige

Formal Model
 Edge Weights
Some popular tuples can be connected to
many other tuples, so each link gets a
forward and a backward edge weight
Weight of a forward link = the strength of
the proximity relationship between two
tuples (set to 1 by default)
Weight of a backward link = in-degree of
edges pointing to the node
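The node and edge weights described on these two slides can be sketched as follows. This follows the slide's simplified definitions (prestige = in-degree, backward weight = in-degree of the node pointed to). The function names and the toy edge list are hypothetical, and the sketch assumes no pair of nodes is linked in both directions.

```python
from collections import Counter

def node_prestige(edges):
    """Prestige of a node = its in-degree (number of edges pointing to it)."""
    return Counter(dst for _, dst in edges)

def weighted_edges(edges, forward_weight=1):
    """For each link u -> v, emit a forward edge and a backward edge v -> u.

    The forward edge gets the default weight; the backward edge is weighted by
    the in-degree of v, so paths that go backward through a hub are expensive.
    """
    indegree = node_prestige(edges)
    weights = {}
    for u, v in edges:
        weights[(u, v)] = forward_weight
        weights[(v, u)] = indegree[v]
    return weights

# Toy links: three Writes tuples point to author "a"; one also points to paper "p".
edges = [("w1", "a"), ("w2", "a"), ("w3", "a"), ("w1", "p")]
prestige = node_prestige(edges)
weights = weighted_edges(edges)
```

Here the author node is a hub: it has high prestige, and leaving it along a backward edge costs 3 instead of 1.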

Result
Result of query “sudarshan soumen”

Searching for the best answer
 Backward Expanding Search
Algorithm
Intuition: find vertices from which a
forward path exists to at least one node
from each Si.
Run concurrent single source shortest
path algorithm from each node matching
a keyword
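A simplified sketch of the backward expanding idea: run Dijkstra over reversed edges from every keyword node, then keep the vertices that reach at least one node of each Si. Unlike the concurrent iterators described above, this sketch runs the searches one after another and ignores node weights. The function names and the toy graph are assumptions, and at least one keyword is required.

```python
import heapq

def shortest_to_source(reverse_adj, source):
    """Dijkstra over reversed edges: cost of the forward path from each node to `source`."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for prev, w in reverse_adj.get(node, ()):
            nd = d + w
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return dist

def backward_search(edges, keyword_nodes):
    """edges: {(u, v): weight}; keyword_nodes: one set S_i of matching nodes per keyword.

    Returns (root, total_cost) pairs, cheapest first, for roots that reach every S_i.
    """
    reverse_adj = {}
    for (u, v), w in edges.items():
        reverse_adj.setdefault(v, []).append((u, w))
    best = []  # per keyword: node -> cost to the nearest node in S_i
    for nodes in keyword_nodes:
        merged = {}
        for src in nodes:
            for node, d in shortest_to_source(reverse_adj, src).items():
                merged[node] = min(d, merged.get(node, float("inf")))
        best.append(merged)
    roots = set(best[0]).intersection(*best[1:])
    return sorted(((r, sum(b[r] for b in best)) for r in roots),
                  key=lambda x: (x[1], str(x[0])))

# Toy graph (hypothetical): a paper node with forward paths to two author nodes.
edges = {
    ("paper", "w1"): 1, ("w1", "sudarshan"): 1,
    ("paper", "w2"): 1, ("w2", "soumen"): 1,
}
answers = backward_search(edges, [{"sudarshan"}, {"soumen"}])
```

In this toy graph only the paper node has a forward path to both authors, so it is the single answer root.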

Searching for the best answer
[Figure: example answer tree with nodes S. Sudarshan, Prasan Roy, Charuta, writes, author, paper, and "BANKS: Keyword search…"]

As an extension of BANKS
 BLINKS: ranked keyword searches on
graphs.
 He H et al.
 SIGMOD 07

Introduction
 Efficient ranked keyword searches on schemaless node-labeled
graphs.
 Challenges:
 Lack of schema for optimization
 Hard to guarantee strong performance
 Proposed technique
 Backward search algorithm
 SLINKS: single-level index search *
 Extension for scalability: BLINKS ( bi-level index search )
 Contributions
 Cost-balanced expansion based backward search
 Combining indexing with search
 Partition-based indexing (bi-level indexing)

BLINKS ( brief idea)
 The index is too large to store and too expensive to construct in large
graphs?
Use a divide and conquer approach to create a bi-level index
 Partition the data graph into multiple subgraphs, or blocks.
 Intra-Block Index
 indexes information inside a block
 4 kinds of indexes, 2 of them for separator nodes (important, so treated specially)
 Block Index
 2 simple indexes

Conclusion
 Keywords search challenges:
 Filtering and disambiguation
 Automatic join back
 Ranking of the result
 Additional consideration:
 Efficiency
 Space
