Your SlideShare is downloading. ×
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply



Published on

Published in: Technology, Business

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik 30th VLDB Conference Toronto ,Canada,2004 Presented By Abhishek Jamloki [email_address]
  • 2.
    • Realtor DB:
    • Table D=(TID, Price , City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)‏
    • SQL query:
    • Select * From D
    • Where City=Seattle AND View=Waterfront
  • 3. Consider a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am} a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs where each Xi is an attribute from A and xi is a value in its domain. specified attributes: X ={X1, …, Xs} unspecified attributes: Y = A – X Let S be the answer set of Q How to rank tuples in S and return top-k tuples to the user?
  • 4.
    • IR Treatment
      • Query Reformulation
      • Automatic Ranking
      • Correlations are ignored in high dimensional spaces of IR
      • Automated Ranking function proposed based on
      • A global score of unspecified attributes
      • A conditional score (strength of correlation between specified and unspecified attributes)
      • Automatic estimation using workload and data analysis
  • 5.  
  • 6.
    • Bayes’ Rule
    • Product Rule
    Document t , Query Q R : Relevant document set R = D - R : Irrelevant document set
  • 7.
    • Each tuple t is treated as a document
    • Partition t into two parts
    • t(X): contains specified attributes
    • t(Y): contains unspecified attributes
    • Replace t with X and Y
    • Replace R with D
  • 8.  
  • 9.
    • Comprehensive dependency models have unacceptable preprocessing and query processing costs
    • Choose a middle ground.
    • Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
  • 10.  
  • 11.
    • Workload W : a collection of ranking queries that have been executed on our system in the past.
    • Represented as a set of “tuples”, where each tuple represents a query and is a vector containing the corresponding values of the specified attributes.
    • We approximate R as all query “tuples” in W that also request for X (approximation is novel to this paper)‏
    • Properties of the set of relevant tuples R can be obtained by only examining the subset of the workload that contains queries that also request for X
    • Substitute p(y | R) as p(y | X,W)‏
  • 12.  
  • 13.
    • p(y | W) the relative frequencies of each distinct value y in the workload
    • p( y | D) relative frequencies of each distinct value y in the
    • database (similar to IDF concept in IR)‏
    • p(x | y,W) confidences of pair-wise association rules in the workload, that is: (#of tuples in W that contains x, y)/total # of tuples in W
    • p(x | y,D): (#of tuples in D that contains x, y)/total # of tuples in D
    • Stored as auxiliary tables in the intermediate knowledge representation layer
  • 14.
    • p(y | w) {AttName, AttVal, Prob}
      • B + Tree index on (AttName, AttVal)‏
    • p(y | D) {AttName, AttVal, Prob}
      • B + Tree index on (AttName, AttVal)‏
    • p(x | y,W) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
      • B + Tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)‏
    • p(x | y,D) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
      • B + Tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)‏
  • 15.
    • Preprocessing - Atomic Probabilities Module
    • Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W) , and P(x | y, D) for All Distinct Values x and y
    • Execution
    • Select Tuples that Satisfy the Query
    • Scan and Compute Score for Each Result-Tuple
    • Return Top- K Tuples
  • 16.
    • Trade off between pre-processing and query processing
    • Pre-compute ranked lists of the tuples for all possible “atomic” queries. Then at query time, given an actual query that specifies a set of values X, we “merge” the ranked lists corresponding to each x in X to compute the final Top-K tuples.
    • We should be able to perform merging without scanning the entire ranked lists
    • Threshold algorithm can be used for this purpose
    • A feasible adaptation of TA should keep the number of sorted streams small
    • Number of sorted streams will depend on number of attributes in database
  • 17.
    • At query time we do a TA-like merging of several ranked lists (i.e. sorted streams)‏
    • The required number of sorted streams depends only on the number of specified attribute values in the query and not on the total number of attributes in the database
    • Such a merge operation is only made possible due to the specific functional form of our ranking function resulting from our limited independence assumptions
  • 18.
    • Index Module: takes as inputs the association rules and the database, and for every distinct value x, creates two lists Cx and Gx, each containing the tuple-ids of all data tuples that contain x, ordered in specific ways.
    • Conditional List Cx: consists of pairs of the form <TID, CondScore>, ordered by descending CondScore
    • TID: tuple-id of a tuple t that contains x
    • Global List Gx: consists of pairs of the form <TID, GlobScore>, ordered by descending GlobScore, where TID is the tuple-id of a tuple t that contains x and
  • 19.
    • At query time we retrieve and multiply the scores of t in the lists Cx1,…,Cxs and in one of Gx1,…,Gxs. This requires only s +1 multiplications and results in a score2 that is proportional to the actual score. Two kinds of efficient access operations are needed:
    • First, given a value x, it should be possible to perform a GetNextTID operation on lists Cx and Gx in constant time, tuple-ids in the lists should be efficiently retrievable one-by-one in order of decreasing score. This corresponds to the sorted stream access of TA.
    • Second, it should be possible to perform random access on the lists, that is, given a TID, the corresponding score (CondScore or GlobScore) should be retrievable in constant time.
  • 20.
    • These lists are stored as database tables –
    • CondList C x
    • {AttName, AttVal, TID, CondScore}
    • B + Tree index on (AttName, AttVal, CondScore)‏
    • GlobList G x
    • {AttName, AttVal, TID, GlobScore}
    • B + Tree index on (AttName, AttVal, GlobScore)‏
  • 21.  
  • 22.  
  • 23.
    • Space consumed by the lists is O(mn) bytes (m is the number of attributes and n the number of tuples of the database table)‏
    • We can store only a subset of the lists at preprocessing time, at the expense of an increase in the query processing time.
    • Determining which lists to retain/omit at preprocessing time done by analyzing the workload.
    • Store the conditional lists Cx and the corresponding global lists Gx only for those attribute values x that occur most frequently in the workload
    • Probe the intermediate knowledge representation layer at query time to compute the missing information
  • 24.
    • The following Datasets were used:
      • MSR HomeAdvisor Seattle (‏
      • Internet Movie Database (‏
    • Software and Hardware:
      • Microsoft SQL Server2000 RDBMS
      • P4 2.8-GHz PC, 1 GB RAM
      • C#, Connected to RDBMS through DAO
  • 25.
    • Evaluated using two ranking methods
    • 1) Conditional
    • 2) Global
    • Several hundred workload queries were collected for both the datasets and ranking algorithm trained on this workload
  • 26.
    • For each query Q i , generate a set H i of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
    • Let each user mark 10 tuples in H i as most relevant to Q i
    • Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
  • 27.
    • Users were given the Top-5 results of the two ranking methods for 5 queries (different from the previous survey), and were asked to choose which rankings they preferred
  • 28.
    • Compared performance of the various implementations of the Conditional algorithm: List Merge, its space-saving variant and Scan
    • Datasets used:
  • 29.  
  • 30.  
  • 31.  
  • 32.  
  • 33.
    • Completely automated approach for the Many-Answers Problem which leverages data and workload statistics and correlation
    • Probabilistic IR models were adapted for structured data.
    • Experiments demonstrate efficiency as well as quality of the ranking system
  • 34.
    • Many relational databases contain text columns in addition to numeric and categorical columns. Whether correlations between text and non-text data can be leveraged in a meaningful way for ranking ?
    • Comprehensive quality benchmarks for database ranking need to be established