11a.ppt Presentation Transcript

  • Effective Keyword-Based Selection of Relational Databases, by Bei Yu, Guoliang Li, Karen Sollins, and Anthony K. H. Tung
  • Overview
    • What is unstructured retrieval?
    • Retrieving data from free-text documents such as journals and articles.
    • What is structured retrieval?
    • Retrieving data from sources such as databases and XML files, where structural relationships exist between data items.
  • Traditional IR approach
    • Uses keyword frequency and document frequency statistics of the query words to determine the relevance of a document
      • Keyword frequency: number of times a keyword appears in a document
      • Document frequency: number of documents in which a keyword appears
    • The two statistics are combined into a weighting factor
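One common way to combine the two statistics is a tf-idf style weight. The slides do not say which formula is meant, so the log-scaled inverse document frequency below, and the names `tf_idf` and `docs`, are assumptions made purely for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """Keyword frequency in `doc` times a log-scaled inverse document frequency."""
    tf = Counter(doc)[term]                      # times the term appears in this document
    df = sum(1 for d in docs if term in d)       # documents that contain the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Hypothetical three-document collection, already tokenized.
docs = [
    ["multimedia", "database", "database"],
    ["database", "systems"],
    ["network", "protocols"],
]
weight_db = tf_idf("database", docs[0], docs)    # tf = 2, df = 2
weight_mm = tf_idf("multimedia", docs[0], docs)  # tf = 1, df = 1
```

A term that is frequent in one document but rare across the collection gets the highest weight, which is why such a model looks only at counts and ignores how keywords are connected.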
  • Traditional IR techniques are inadequate for relational databases
    • They do not capture the relationships between data items that a normalized database spreads across tables
    • Need to take into account the relationships between keywords in a database
    • Examples of what must be considered:
      • Whether a keyword is in a tuple referenced by many other tuples
      • Number of joins that must be performed to bring together all keywords in a query
  • Example
    • DB1
    • Inproceedings
      id | inprocID      | title                           | procID | year | mon | annote
      t1 | Adiba1986     | Historical Multimedia Databases | 23     | 1988 | Aug | temporal
      t2 | Abarbanel1987 | ConnectionPerspective Reform    | 18     | 1987 | May | Intellicorp
    • Conferences
      id | procID | Conference
      t3 | 23     | The conference on Very Large Databases (VLDB)
      t4 | 18     | ACM Sigmod Conf on management of data
  • Example
    • DB2
  • Example
    • Query = (Multimedia, Database, VLDB)
    • DB1 gives good results, because the query keywords are closely related in it
    • But a traditional IR model ranks DB2 higher, because the term frequencies are higher in DB2
    • Hence we need to effectively summarize the relationships between keywords in databases
  • Contributions
    • Address the problem of selecting structured data sources for keyword-based queries
    • Propose a method for summarizing the relationships between keywords in a database
    • Define metrics, based on keyword relationships, to rank source databases for a given keyword query
    • Evaluate the proposed summarization using real datasets
  • Measuring Strength of Relationships Between Keywords
    • The strength of the relationship between two keywords is measured as a combination of two factors:
      • Proximity factor: inverse of distance (the worked examples in these slides use 1/(d+1) for distance d)
      • Frequency factor, given a distance d: number of combinations of exactly d+1 distinct tuples that can be joined in a sequence such that the two keywords appear in the end tuples
  • Modeling of an RDBMS
    • Let m = No. of distinct keywords in database DB
    • Let n = Total no. of tuples in DB.
    • Then D is an m × n matrix whose rows are the keywords k1, k2, …, km and whose columns are the tuples t1, t2, …, tn
    • D records the presence or absence of each keyword in each tuple (similar to the term-document incidence matrix in the vector space model)
  • Modeling of an RDBMS Cont’d
    • Matrix T represents relationships between tuples (for example, foreign key references)
    • T is an n × n matrix with rows and columns t1, t2, …, tn:
           t1  t2  …  tn
      t1    0   1
      t2    1   0
      :
      tn
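As a concrete illustration, the two matrices can be built for a toy version of DB1. This is only a sketch: the keyword sets are simplified by assumption (lowercased, singular, a few words per tuple), and the variable names are hypothetical.

```python
# Toy version of DB1; keyword sets per tuple are an assumed simplification.
tuple_keywords = {
    "t1": {"multimedia", "database"},   # Inproceedings row
    "t2": {"reform"},                   # Inproceedings row
    "t3": {"database", "vldb"},         # Conferences row
    "t4": {"sigmod"},                   # Conferences row
}
foreign_keys = [("t1", "t3"), ("t2", "t4")]  # procID 23 and procID 18

tuples = sorted(tuple_keywords)
keywords = sorted(set().union(*tuple_keywords.values()))

# D[i][j] = 1 if keyword i occurs in tuple j, else 0 (m x n).
D = [[1 if k in tuple_keywords[t] else 0 for t in tuples] for k in keywords]

# T[i][j] = 1 if tuples i and j are directly joinable (n x n, symmetric).
index = {t: i for i, t in enumerate(tuples)}
T = [[0] * len(tuples) for _ in tuples]
for a, b in foreign_keys:
    T[index[a]][index[b]] = 1
    T[index[b]][index[a]] = 1
```

Storing T symmetrically reflects that a foreign key lets the two tuples be joined in either direction.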
  • Mathematical representation of keyword relationships
  • Mathematical representation of keyword relationships Cont’d
    • A Keyword Relationship Matrix (KRM) R represents the relationship between any pair of keywords with respect to δ (the length of joining sequences considered) and K (the number of top results expected)
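The formula slides did not survive in this transcript. One form that agrees with the worked examples on the slides that follow is sketched here. It is a reconstruction under assumptions, not the authors' stated definition: ω_d(k_i, k_j) denotes the number of joining sequences connecting the two keywords at distance d, the proximity factor is taken as 1/(d+1), and at most K joining sequences are counted, nearest first.

```latex
\mathrm{rel}(k_i, k_j) \;=\; \sum_{d=0}^{\delta} \frac{1}{d+1}\,
  \min\!\Bigl(\omega_d(k_i, k_j),\;
  \max\bigl(0,\; K - \sum_{d' < d} \omega_{d'}(k_i, k_j)\bigr)\Bigr)
```

With K = 40, five sequences at distance 1 give 5 · (1/2) = 2.5 and forty sequences at distance 4 give 40 · (1/5) = 8, matching the example.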
  • Mathematical representation of keyword relationships Cont’d
  • Example
    • For two given keywords k1 and k2, and K = 40
    • Database A has 5 joining sequences connecting them at distance 1, so score = 5 * (1/2) = 2.5
    • Database B has 40 joining sequences connecting them at distance 4, so score = 40 * (1/5) = 8
    • Here B wins.
  • Example (cont’d)
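    • If we bring K down to 10, only 10 of B's joining sequences count, so B's score drops to 10 * (1/5) = 2 and A wins.
    • Thus one may prefer A to B because its results are of better quality (the keywords are closer together).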
    • K defines the number of top results users expect from the database.
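The two examples can be reproduced with a short sketch. It assumes, based only on the numbers in the examples, that the proximity factor is 1/(d+1) and that at most K joining sequences are counted, starting from the smallest distance; the function name `relationship_score` is hypothetical.

```python
def relationship_score(freq_by_distance, K):
    """Sum freq/(d+1) over increasing distance d, counting at most K joining sequences."""
    score = 0.0
    remaining = K
    for d in sorted(freq_by_distance):
        if remaining <= 0:
            break
        counted = min(freq_by_distance[d], remaining)  # cap by what is left of K
        score += counted / (d + 1)
        remaining -= counted
    return score

db_a = {1: 5}    # 5 joining sequences at distance 1
db_b = {4: 40}   # 40 joining sequences at distance 4

score_a_k40 = relationship_score(db_a, K=40)
score_b_k40 = relationship_score(db_b, K=40)
score_a_k10 = relationship_score(db_a, K=10)
score_b_k10 = relationship_score(db_b, K=10)
```

Lowering K favors databases whose keywords are connected at short distances, since distant connections no longer accumulate a large total.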
  • Computation of KRM
    • How to compute
    • Few definitions –
  • Three proven propositions aiding the computation of the KRM
  • Three proven propositions aiding the Computation of KRM Cont’d
  • Comparison of frequencies of keyword pairs in DB1 and DB2
    • Frequencies of keyword pairs in DB1
      Keyword pair        | d=0 | d=1 | d=2 | d=3 | d=4
      database:multimedia | 1   | 1   | -   | -   | -
      multimedia:VLDB     | 0   | 1   | -   | -   | -
      database:VLDB       | 1   | 1   | -   | -   | -
    • Frequencies of keyword pairs in DB2
      Keyword pair        | d=0 | d=1 | d=2 | d=3 | d=4
      database:multimedia | 0   | 0   | 0   | 0   | 2
      multimedia:VLDB     | 0   | 0   | 0   | 0   | 0
      database:VLDB       | 0   | 0   | 1   | 0   | 0
    • Our query was Q = (Multimedia, Database, VLDB)
    • The tables show that the query keywords are more closely related in DB1
  • Comparison of relationship scores of DB1 and DB2
    • Sample computation for DB1 (K = 10)
    • Rel[database, multimedia] = 1 * 1 + 0.5 * 1 = 1.5
      Keyword pair        | DB1 | DB2
      database:multimedia | 1.5 | 0.4
      multimedia:VLDB     | 0.5 | 0
      database:VLDB       | 1.5 | 0.33
  • Implementation with SQL
    • Relation R_D(kId, tId) represents the non-zero entries of the keyword incidence matrix D
    • kId is the keyword ID and tId is the tuple ID
    • R_K(kId, keyword) stores the keyword IDs and keywords (similar to a word dictionary in IR)
    • Matrices T1, T2, T3, … (tuple relationship matrices) are represented with relations R_T1, R_T2, R_T3, …
    • R_T1: produced by joining pairs of tables
    • R_T2: produced by self-joining R_T1
  • Implementation with SQL Cont’d
    • R_T3 is produced using the following SQL statements
    • INSERT INTO R_T3 (tId1, tId2)
      SELECT s1.tId1, s2.tId2
      FROM R_T2 s1, R_T1 s2
      WHERE s1.tId2 = s2.tId1;
    • INSERT INTO R_T3 (tId1, tId2)
      SELECT s1.tId1, s2.tId1
      FROM R_T2 s1, R_T1 s2
      WHERE s1.tId2 = s2.tId2 AND s1.tId1 < s2.tId1;
    • INSERT INTO R_T3 (tId1, tId2)
      SELECT s2.tId1, s1.tId2
      FROM R_T2 s1, R_T1 s2
      WHERE s1.tId1 = s2.tId2;
  • Implementation with SQL Cont’d
    • INSERT INTO R_T3 (tId1, tId2)
      SELECT s1.tId2, s2.tId2
      FROM R_T2 s1, R_T1 s2
      WHERE s1.tId1 = s2.tId1 AND s1.tId2 < s2.tId2;
    • DELETE a FROM R_T3 a, R_T2 b, R_T1 c
      WHERE (a.tId1 = b.tId1 AND a.tId2 = b.tId2) OR
            (a.tId1 = c.tId1 AND a.tId2 = c.tId2);
    • In general, R_Td is generated by joining R_T(d-1) with R_T1 and excluding the tuples already in R_T(d-1), R_T(d-2), …, R_T1
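The general rule can be sketched outside SQL with plain sets. This is an illustrative sketch, not the authors' implementation: pairs are assumed to be stored as (smaller ID, larger ID), joins are treated as undirected, and the function name `next_distance_pairs` is hypothetical.

```python
def next_distance_pairs(prev_pairs, direct_pairs, seen_pairs):
    """Join R_T(d-1) with R_T1, then drop pairs already seen at a shorter distance."""
    # Index direct neighbours in both directions, since joins are undirected.
    neighbours = {}
    for a, b in direct_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    result = set()
    for a, b in prev_pairs:
        for end, other in ((a, b), (b, a)):
            # Extend the sequence at `end`; `other` stays as the opposite endpoint.
            for c in neighbours.get(end, ()):
                if c != other:
                    pair = (min(c, other), max(c, other))
                    if pair not in seen_pairs:
                        result.add(pair)
    return result

# Hypothetical chain of four tuples: 1 - 2 - 3 - 4.
r_t1 = {(1, 2), (2, 3), (3, 4)}
r_t2 = next_distance_pairs(r_t1, r_t1, r_t1)
r_t3 = next_distance_pairs(r_t2, r_t1, r_t1 | r_t2)
```

Passing every shorter-distance relation in `seen_pairs` plays the role of the DELETE step: a pair is kept only at the smallest distance at which it first appears.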
  • Creation of W0, W1, W2, … (matrices representing frequencies)
    • W0 is represented with a relation R_W0(kId1, kId2, freq)
    • Each tuple (kId1, kId2, freq) records a keyword pair (kId1, kId2) with kId1 < kId2 and its frequency freq at distance 0; only pairs with freq > 0 are stored.
    • R_W0 is the result of self-joining R_D(kId, tId).
    • SQL for creating R_W0
    • INSERT INTO R_W0 (kId1, kId2, freq)
      SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
      FROM R_D s1, R_D s2
      WHERE s1.tId = s2.tId AND s1.kId < s2.kId
      GROUP BY kId1, kId2;
  • Creation of W0, W1, W2, … (matrices representing frequencies) Cont’d
    • SQL for creating R_Wd, d > 0
    • INSERT INTO R_Wd (kId1, kId2, freq)
      SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
      FROM R_D s1, R_D s2, R_Td r
      WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR
             (s1.tId = r.tId2 AND s2.tId = r.tId1)) AND s1.kId < s2.kId
      GROUP BY kId1, kId2;
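The two frequency queries can be tried on a toy version of DB1 using SQLite from Python. The data below is an assumption chosen to mirror the DB1 example (keyword IDs 1 = database, 2 = multimedia, 3 = vldb; tuple IDs 1 to 4), and the GROUP BY is written on the source columns rather than the aliases for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R_D  (kId INTEGER, tId INTEGER);
    CREATE TABLE R_T1 (tId1 INTEGER, tId2 INTEGER);
    CREATE TABLE R_W0 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
    CREATE TABLE R_W1 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
""")
# Assumed toy data: t1 holds {database, multimedia}, t3 holds {database, vldb}.
conn.executemany("INSERT INTO R_D VALUES (?, ?)",
                 [(1, 1), (2, 1), (1, 3), (3, 3)])
# Directly joinable tuple pairs (foreign keys): t1-t3 and t2-t4.
conn.executemany("INSERT INTO R_T1 VALUES (?, ?)", [(1, 3), (2, 4)])

# Distance 0: both keywords occur in the same tuple.
conn.execute("""
    INSERT INTO R_W0 (kId1, kId2, freq)
    SELECT s1.kId, s2.kId, COUNT(*)
    FROM R_D s1, R_D s2
    WHERE s1.tId = s2.tId AND s1.kId < s2.kId
    GROUP BY s1.kId, s2.kId
""")
# Distance 1: the keywords occur in two directly joinable tuples.
conn.execute("""
    INSERT INTO R_W1 (kId1, kId2, freq)
    SELECT s1.kId, s2.kId, COUNT(*)
    FROM R_D s1, R_D s2, R_T1 r
    WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2) OR
           (s1.tId = r.tId2 AND s2.tId = r.tId1)) AND s1.kId < s2.kId
    GROUP BY s1.kId, s2.kId
""")
w0 = sorted(conn.execute("SELECT kId1, kId2, freq FROM R_W0").fetchall())
w1 = sorted(conn.execute("SELECT kId1, kId2, freq FROM R_W1").fetchall())
```

On this data the d=0 and d=1 counts agree with the DB1 frequency table shown earlier in the slides.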
  • Final resulting KRM
    • The final KRM, R, is stored in a relation R_R(kId1, kId2), consisting of pairs of keywords and their relationship scores.
    • It is computed using the formula –
    • Update issues:
    • The tables storing these matrices can be updated dynamically.
  • Estimating multi-keyword relationships
    • Multiple keywords are connected with Steiner trees.
    • Finding a minimum Steiner tree is an NP-complete problem.
    • Most current keyword search algorithms rely on heuristics to find top-K results.
    • Hence the relationship among multiple keywords is estimated from the pairwise keyword relationships described above.
  • Estimating multi-keyword relationships Cont’d
  • Estimating multi-keyword relationships Cont’d
  • Estimating multi-keyword relationships Cont’d
  • Database ranking and indexing
    • With the KR summary, we can effectively rank a set of databases D = {DB1, DB2, …, DBN} for a given keyword query.
    • We can use either a global index or a local index
    • Global index:
      • Analogous to an inverted index in IR
      • Uses keyword pairs as keys and <database ID, relationship score> as postings entries
      • To evaluate a query, fetch the corresponding inverted lists and compute the score for each database.
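A global index of this kind can be sketched as a dictionary of postings lists. The postings below reuse the DB1 and DB2 scores from the comparison slide. Summing the pair scores per database is an assumption made for this sketch (the slides later compare several ways of combining them), and the names `global_index` and `rank_databases` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Global index: keyword pair -> postings list of (database ID, relationship score).
global_index = {
    ("database", "multimedia"): [("DB1", 1.5), ("DB2", 0.4)],
    ("multimedia", "vldb"):     [("DB1", 0.5)],
    ("database", "vldb"):       [("DB1", 1.5), ("DB2", 0.33)],
}

def rank_databases(query_keywords, index):
    """Fetch the postings list of every query keyword pair and sum scores per database."""
    totals = defaultdict(float)
    normalized = sorted(k.lower() for k in query_keywords)
    for pair in combinations(normalized, 2):   # pairs come out in sorted order
        for db_id, score in index.get(pair, []):
            totals[db_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = rank_databases(["Multimedia", "Database", "VLDB"], global_index)
```

Keys are stored with the two keywords in sorted order so that each pair has exactly one postings list regardless of the order in which the query lists them.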
  • Database ranking and indexing Cont’d
    • Decentralized index
      • Each machine can store a subset of the index (that is, keyword pairs and inverted lists)
      • When a query is received at a node, search messages are sent across nodes and the corresponding postings lists are retrieved.
  • Experiments done to evaluate efficiency of this system
    • The KR score is compared with the score from a brute-force method (real_rank) over 82 databases spread across 16 nodes.
    • The results establish the effectiveness of this technique over distributed databases
    • Definitions used for comparison:
  • Experiments done to evaluate efficiency of this system
  • Experiments done to evaluate efficiency of this system Cont’d
    • Effects of δ (length of joining sequence)
    • Selection performance of keyword queries generally gets better when δ grows larger.
    • Precision and recall values for different δ values tend to cluster into groups
    • There are big gaps in both precision and recall values between these groups (the specific δ values did not survive in this transcript)
  • Experiments done to evaluate efficiency of this system Cont’d
    • Recall and precision of 2-keyword queries using KR summaries and KF-summaries
  • Experiments done to evaluate efficiency of this system Cont’d
    • Effects of the number of query keywords:
      • 1) 2-keyword queries generally perform better than 3-keyword and 4-keyword queries
      • 5-keyword queries give better recall than 3-keyword and 4-keyword queries, as they are more selective
      • 2) Generally, the difference in recall between queries with different numbers of keywords is smaller than the difference in precision
      • This shows that the system is effective in assigning high ranks to useful databases, although less relevant or irrelevant databases may also be selected.
    • Comparison of four kinds of estimations (MIN, MAX, SUM, PROD)
      • SUM and PROD behave similarly and outperform the other two methods
      • Hence it is more effective to take into account the relationship information of every keyword pair in the query when estimating overall scores
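The slides that define the four estimations are missing from this transcript. As a rough sketch, assuming each estimation simply aggregates the relationship scores of all keyword pairs in the query (the exact formulas may differ), they can be compared on the DB1 and DB2 pair scores given earlier. The function name `estimate` is hypothetical.

```python
import math

def estimate(pair_scores, method):
    """Combine the relationship scores of all keyword pairs in a query into one score."""
    if method == "MIN":
        return min(pair_scores)
    if method == "MAX":
        return max(pair_scores)
    if method == "SUM":
        return sum(pair_scores)
    if method == "PROD":
        return math.prod(pair_scores)
    raise ValueError(f"unknown method: {method}")

# Pair scores for (database:multimedia, multimedia:VLDB, database:VLDB).
db1_pairs = [1.5, 0.5, 1.5]
db2_pairs = [0.4, 0.0, 0.33]

methods = ("MIN", "MAX", "SUM", "PROD")
db1_estimates = {m: estimate(db1_pairs, m) for m in methods}
db2_estimates = {m: estimate(db2_pairs, m) for m in methods}
```

MIN and MAX each depend on a single pair, while SUM and PROD use every pair, which fits the observation that the latter two take all keyword pair relationships into account.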
  • Experiments done to evaluate efficiency of this system Cont’d
    • Recall and precision of K-R summaries using different estimations ( )
  • Experiments done to evaluate efficiency of this system Cont’d