  1. Metasearch
     Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007
     Zhenyu (Victor) Liu, Software Engineer, Google Inc., [email_address]
  2. Roadmap
     - The problem
     - Database content modeling
     - Database selection
     - Summary
  3. Metasearch – the problem
     [Diagram: the query "applied mathematics" goes to a Metasearch Engine, which must decide which underlying databases to query and how to return the merged search results.]
  4. Subproblems
     - Database content modeling
       - How does a Metasearch engine "perceive" the content of each database?
     - Database selection
       - Selectively issue the query to the "best" databases
     - Query translation
       - Different databases have different query formats
         - "a+b" / "a AND b" / "title:a AND body:b" / etc.
     - Result merging
       - Query "applied mathematics"
       - Given top-10 results from both science.com and nature.com, how should they be presented?
  5. Database content modeling and selection: a simplified example
     - A "content summary" of each database
     - Selection based on the # of matching docs (see the sketch below)
     - Assuming independence between words
     - Database 1 (total # of docs: 10,000):
         Word w        # of documents that use w    Pr(w)
         applied       4,000                        0.4
         mathematics   2,500                        0.25
       Estimated matches for "applied mathematics": 10,000 × 0.4 × 0.25 = 1,000 documents
     - Database 2 (total # of docs: 60,000):
         Word w        # of documents that use w    Pr(w)
         applied       200                          0.00333
         mathematics   300                          0.005
       Estimated matches for "applied mathematics": 60,000 × 0.00333 × 0.005 = 1 document
     - Database 1 is selected (1,000 > 1)
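The selection rule on this slide fits in a few lines of code. Below is a minimal sketch, assuming the content summary is just a document-frequency table plus the total document count; the numbers are the ones from the slide.

```python
# Independence-based estimate of the number of matching documents,
# as in the slide's two-database example.

def estimated_matches(total_docs, summary, query_words):
    """Estimate # of docs matching all query words, assuming word independence."""
    est = total_docs
    for w in query_words:
        est *= summary.get(w, 0) / total_docs  # Pr(w) = df(w) / |db|
    return est

# Numbers taken from the slide.
db1 = (10_000, {"applied": 4_000, "mathematics": 2_500})
db2 = (60_000, {"applied": 200, "mathematics": 300})

query = ["applied", "mathematics"]
for name, (n, summary) in {"db1": db1, "db2": db2}.items():
    print(name, estimated_matches(n, summary, query))
# db1 -> 1000.0, db2 -> ~1.0, so db1 is selected.
```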
  6. Roadmap
     - The problem
     - Database content modeling
     - Database selection
     - Summary
  7. Database content modeling
     - Replicate the entire text database (most storage demanding; requires a fully cooperative database)
     - Download part of a text database (more storage demanding; works with a non-cooperative database)
     - Obtain a full content summary (less storage demanding; requires a fully cooperative database)
     - Approximate the content summary via sampling (least storage demanding; works with a non-cooperative database)
  8. Replicate the entire database
     - E.g., www.google.com/patents, a replica of the entire USPTO patent document database
  9. Download a non-cooperative database
     - Objective: download as much as possible
     - Basic idea: "probing" (querying with short queries) and downloading all results
     - Practically, one can only issue a fixed # of probes (e.g., 1,000 queries per day)
     [Diagram: the Metasearch Engine sends probe queries such as "applied" and "mathematics" through the search interface of a text database.]
  10. Harder than the "set-coverage" problem
     - All docs in a database db form the universe
       - assuming all docs are equally important
     - Each probe (e.g., "applied", "mathematics") corresponds to a subset
     - Find the least # of subsets (probes) that covers db
       - or, the maximum coverage with a fixed # of subsets (probes)
     - NP-complete
       - The greedy algorithm is provably the best-possible polynomial-time approximation (see the sketch below)
     - Harder here: the cardinality of each subset (# of matching docs for each probe) is unknown!
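For contrast with the harder real setting, here is a minimal sketch of plain greedy set coverage under the (unrealistic) assumption that the result set of every probe word is known in advance; the probe words and document ids are hypothetical.

```python
# Greedy maximum coverage with a fixed probe budget: repeatedly take the
# probe word whose result set adds the most not-yet-covered documents.

def greedy_probes(probe_results, budget):
    covered, chosen = set(), []
    for _ in range(budget):
        word, gain = max(
            ((w, len(docs - covered)) for w, docs in probe_results.items()),
            key=lambda x: x[1],
        )
        if gain == 0:
            break
        chosen.append(word)
        covered |= probe_results[word]
    return chosen, covered

# Toy example (hypothetical document ids).
probe_results = {
    "applied": {1, 2, 3, 4},
    "mathematics": {3, 4, 5},
    "biology": {6},
}
print(greedy_probes(probe_results, budget=2))
# -> (['applied', 'mathematics'], {1, 2, 3, 4, 5})
```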
  11. Pseudo-greedy algorithms [NPC05]
     - Greedy set coverage: choose the subset with the maximum "cardinality gain"
     - When the cardinality of the subsets is unknown:
       - Assume the (proportional) cardinality of subsets is the same across databases
         - e.g., build a reference database from Web pages crawled from the Internet and rank single words by their frequency there
       - Or start with certain "seed" queries and adaptively choose the next query words from within the docs already returned
         - The choice of probing words then varies from database to database
  12. An adaptive method
     - D(w_i): the subset of docs returned by probing with word w_i
     - Suppose w_1, w_2, ..., w_n have already been issued; the gain of a candidate probe w_{n+1} is the number of new docs it would return
     - This gain can be rewritten as |db|·Pr(w_{n+1}) − |db|·Pr(w_{n+1} ∧ (w_1 ∨ ... ∨ w_n))
       - Pr(w): probability of w appearing in a doc of db
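Spelled out, the quantity the adaptive method maximizes is the number of new documents a candidate probe would return, and the rewriting on the slide follows directly:

```latex
% Expected gain of a candidate probe word w_{n+1}: the new documents are
% those containing w_{n+1} minus those already retrieved by w_1, ..., w_n.
\begin{align*}
\mathrm{gain}(w_{n+1})
  &= \bigl|\, D(w_{n+1}) \setminus \bigl(D(w_1) \cup \dots \cup D(w_n)\bigr) \,\bigr| \\
  &= |db| \cdot \Pr(w_{n+1})
     \;-\; |db| \cdot \Pr\bigl(w_{n+1} \wedge (w_1 \vee \dots \vee w_n)\bigr).
\end{align*}
% The second term can be measured directly in the documents already
% downloaded; the first term needs the estimate of Pr(w_{n+1}) discussed
% on the next slide.
```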
  13. An adaptive method (cont'd)
     - How to estimate P̃r(w_{n+1})? (see the sketch below)
     - Zipf's law: Pr(w) = α·(R(w) + β)^(−γ), where R(w) is the rank of w when words are sorted by descending Pr(w)
     - Assume the ranking of w_1, w_2, ..., w_n and of the other words is the same in the downloaded subset as in db
     - Interpolate: fit the Zipf curve to the known Pr(w) values of w_1, ..., w_n, then read P̃r(w) for the remaining words off the fitted curve
     [Figure: single words ranked by Pr(w) in the downloaded documents; the known Pr(w) values of w_1, ..., w_n anchor a fitted Zipf's-law curve, from which P̃r(w) is interpolated for the other words.]
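A minimal sketch of the fit-and-interpolate step, assuming scipy is available; the ranks and probabilities of the probe words below are hypothetical.

```python
# Fit Pr(w) = alpha * (rank + beta)**(-gamma) to the probe words whose true
# Pr(w) is known, then interpolate Pr(w) for an unprobed word from its rank.
import numpy as np
from scipy.optimize import curve_fit

def zipf(rank, alpha, beta, gamma):
    return alpha * (rank + beta) ** (-gamma)

# Hypothetical data: sample ranks and true Pr(w) of the probe words.
probe_ranks = np.array([1, 3, 8, 20, 50], dtype=float)
probe_probs = np.array([0.40, 0.22, 0.10, 0.05, 0.02])

params, _ = curve_fit(zipf, probe_ranks, probe_probs,
                      p0=[0.5, 1.0, 1.0], bounds=(0, np.inf))

# Interpolated Pr(w) for an unprobed word ranked 12th in the sample.
print(zipf(12.0, *params))
```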
  14. Obtain an exact content summary
     - C(db) for a database db
       - Statistics about the words in db, e.g., df – document frequency
     - Standards and proposals for cooperative databases to export C(db):
       - STARTS [GCM97]
         - Initiated by Stanford; by 1997 had attracted the main search-engine players: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
       - SDARTS [GIG01]
         - Initiated by Columbia U.
     - Example summary (w, df): mathematics 2,500; applied 4,000; research 1,000
  15. Approximate the content summary
     - Objective: C̃(db) of a database db, with high vocabulary coverage and high accuracy
     - Basic idea: probe and download sample docs [CC01]
       - Example, with df as the content-summary statistic:
         1. Pick a single word as the query and probe the database
         2. Download a fraction of the results, e.g., the top-k
         3. If the terminating condition is not satisfied, go to 1
         4. Output <w, d̃f> pairs based on the sample docs downloaded
  16. Vocabulary coverage
     - Can a small sample of docs cover the vocabulary of a big database?
     - Yes, based on Heaps' law [Hea78]: |W| = K · n^β (illustrated below)
       - n: # of words scanned
       - W: set of distinct words encountered
       - K: constant, typically in [10, 100]
       - β: constant, typically in [0.4, 0.6]
     - Empirically verified [CC01]
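A quick numerical illustration of Heaps' law with constants inside the ranges quoted on the slide (the specific K and β are arbitrary choices, not from the source):

```python
# Heaps' law |W| = K * n**beta: vocabulary grows sublinearly, so a modest
# sample of documents already covers a large share of the distinct words.
K, beta = 50, 0.5  # hypothetical values inside the slide's typical ranges

for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"words scanned: {n:>10,}  distinct words: {K * n**beta:>10,.0f}")
# With beta = 0.5, scanning 100x more text grows the vocabulary only ~10x.
```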
  17. Estimate document frequency
     - How to obtain the d̃f of a word over the entire database?
       - For a word w used as a query during sampling: its df is typically revealed in the search results
       - For a word w' merely appearing in the sampled docs: its d̃f must be estimated from the doc sample
     - Apply Zipf's law and interpolate [IG02]:
       - Rank the words w and w' by their frequency in the sample
       - Curve-fit Zipf's law to the true df values of the query words w
       - Read the estimated d̃f of each w' off the fitted curve
  18. What if db changes over time?
     - Then so do its content summary C(db) and its approximation C̃(db) [INC05]
     - Empirical study:
       - 152 Web databases, a snapshot downloaded weekly, for 1 year
       - df as the statistic
       - Kullback-Leibler (KL) divergence as the "change" measure (see the sketch below)
         - between the "latest" snapshot and the snapshot taken time t ago
       - db does change!
       - How do we model the change?
       - When should we resample and compute a new C̃(db)?
     [Plot: KL divergence grows with t.]
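A minimal sketch of the change measure, assuming the content summaries are plain document-frequency tables; the smoothing constant is our choice and the df counts are hypothetical.

```python
# KL divergence between the word distributions induced by two content
# summaries (df counts) taken at two points in time.
import math

def kl_divergence(df_new, df_old, epsilon=1e-9):
    vocab = set(df_new) | set(df_old)
    n_new = sum(df_new.values())
    n_old = sum(df_old.values())
    kl = 0.0
    for w in vocab:
        p = df_new.get(w, 0) / n_new + epsilon   # latest snapshot (rough smoothing)
        q = df_old.get(w, 0) / n_old + epsilon   # snapshot time t ago
        kl += p * math.log(p / q)
    return kl

old = {"applied": 4000, "mathematics": 2500, "research": 1000}
new = {"applied": 4200, "mathematics": 2600, "research": 900, "genomics": 300}
print(kl_divergence(new, old))
```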
  19. Model the change
     - KL_db(t): the KL divergence between the current C̃(db) and C̃(db, t), the summary from time t ago
     - T: the time at which KL_db(t) exceeds a pre-specified threshold τ
     - Apply principles of Survival Analysis:
       - Survival function S_db(t) = 1 − Pr(T ≤ t)
       - Hazard function h_db(t) = −(dS_db(t)/dt) / S_db(t)
     - How to compute h_db(t), and from it S_db(t)?
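For reference, the standard relation between the hazard and the survival function, which is what lets the following slides recover S_db(t) from a fitted hazard model:

```latex
% Integrating the hazard recovers the survival function.
\begin{align*}
h_{db}(t) &= -\frac{\mathrm{d}S_{db}(t)/\mathrm{d}t}{S_{db}(t)}
           = -\frac{\mathrm{d}}{\mathrm{d}t}\,\ln S_{db}(t), \\
S_{db}(t) &= \exp\!\left(-\int_0^t h_{db}(u)\,\mathrm{d}u\right).
\end{align*}
% Example: a Weibull hazard h(t) = \lambda \gamma t^{\gamma-1} integrates to
% \lambda t^{\gamma}, giving S(t) = e^{-\lambda t^{\gamma}}, the form assumed
% on slide 21.
```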
  20. Learn the h_db(t) of database change
     - Cox proportional-hazards regression model:
       - ln(h_db(t)) = ln(h_base(t)) + β_1·x_1 + ..., where each x_i is a predictor variable
     - Predictors:
       - The pre-specified threshold τ
       - The Web domain of db: ".com", ".edu", ".gov", ".org", "others"
         - 5 binary "domain variables"
       - ln(|db|)
       - The average KL_db(1 week) measured over the training period
       - ...
  21. Train the Cox model
     - A stratified Cox model is applied (see the sketch below):
       - The domain variables did not satisfy the Cox proportionality assumption
       - Stratify on each domain, i.e., a separate h_base(t) / S_base(t) per domain
     - Train S_base(t) for each domain
       - Assuming a Weibull distribution: S_base(t) = e^(−λ·t^γ)
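A minimal sketch of fitting the Weibull-form baseline survival curve, assuming we already have an empirical survival curve for one domain stratum (the data points below are hypothetical) and that scipy is available.

```python
# Fit S(t) = exp(-lam * t**gam) to the fraction of databases in one stratum
# whose KL divergence has not yet exceeded tau by week t.
import numpy as np
from scipy.optimize import curve_fit

def weibull_survival(t, lam, gam):
    return np.exp(-lam * t**gam)

weeks = np.array([1, 2, 4, 8, 16, 32], dtype=float)
surviving_fraction = np.array([0.95, 0.90, 0.82, 0.70, 0.55, 0.35])

(lam, gam), _ = curve_fit(weibull_survival, weeks, surviving_fraction,
                          p0=[0.05, 1.0])
print(f"lambda = {lam:.3f}, gamma = {gam:.3f}")
# gamma != 1 means the baseline is not a plain exponential, matching the
# finding on slide 22.
```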
  22. Training result
     - γ ranges in (0.57, 1.08) ⇒ S_base(t) is not an exponential distribution
     [Plot: S_base(t) versus t for the different domain strata.]
  23. Training result (cont'd)
     - A larger db takes less time for KL_db(t) to exceed τ
     - Databases that change faster during a short period are more likely to change later on
     - Estimated coefficients:
         predictor             β value
         τ                     -1.305
         avg KL_db(1 week)      6.762
         ln(|db|)               0.094
  24. How to use the trained model?
     - The model gives S_db(t): the likelihood that db "has not changed much" by time t
     - An update policy periodically resamples each db
       - Intuitively, maximize ∑_db S_db(t)
       - More precisely, maximize the time average S̄ = lim_{t→∞} (1/t) · ∫_0^t [∑_db S_db(u)] du
     - A policy: {f_db}, where f_db is the update frequency of db, e.g., 2/week
     - Subject to practical constraints, e.g., a total update cap per week
  25. Derive an optimal update policy
     - Find {f_db} that maximizes S̄ under the constraint ∑_db f_db = F, where F is a global frequency limit
     - Solvable by the Lagrange-multiplier method (a numerical sketch follows below)
     - Sample results:
         db                  λ       f_db at F=4/week   f_db at F=15/week
         tomshardware.com    0.088   1/46               1/5
         usps.com            0.023   1/34               1/12
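A minimal numerical sketch of the constrained maximization, under the simplifying assumption of exponential survival S_db(t) = e^(−λ_db·t), so that the time-averaged survival of a database refreshed every 1/f time units has a closed form. The λ values are the two shown on the slide, the total budget F is illustrative (the slide's F is spread over many more databases), and scipy's SLSQP solver stands in for the Lagrange-multiplier derivation.

```python
# Maximize the summed time-averaged survival subject to a total update budget.
import numpy as np
from scipy.optimize import minimize

lambdas = np.array([0.088, 0.023])   # tomshardware.com, usps.com (from the slide)
F = 4.0                              # illustrative total update budget

def avg_survival(f, lam):
    # Average of exp(-lam * t) over one refresh cycle of length 1/f.
    return (f / lam) * (1.0 - np.exp(-lam / f))

def neg_objective(f):
    return -np.sum(avg_survival(f, lambdas))

res = minimize(neg_objective, x0=np.full(2, F / 2),
               constraints=[{"type": "eq", "fun": lambda f: np.sum(f) - F}],
               bounds=[(1e-6, None)] * 2, method="SLSQP")
print(res.x)  # optimal update frequencies for the two databases
```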
  26. Roadmap
     - The problem
     - Database content modeling
     - Database selection
     - Summary
  27. Database selection
     - Select the databases to which a given query is issued
       - Necessary when the Metasearch engine does not hold a full replica of each database – most likely it holds only a content summary
       - Reduces the query load on the entire system
     - Formalization:
       - Query q = <w_1, ..., w_m>, databases db_1, ..., db_n
       - Rank the databases according to their "relevancy score" r(db_i, q) for query q
  28. Relevancy score
     - # of matching docs in db
     - Similarity between q and the top docs returned by db
       - Typically the vector-space similarity (dot product) between q and a doc
       - Sum / average of the similarities of the top-k docs of each db, e.g., top-10
       - Sum / average of the similarities of the top docs of each db that exceed a similarity threshold
     - Relevancy of db as judged by users
       - Explicit relevance feedback
       - User click-behavior data
  29. Estimating r(db, q)
     - Typically, r(db, q) is not directly available
     - Estimate r̃(db, q) based on C(db), or on C̃(db)
  30. Estimating r(db, q), example 1 [GGT99]
     - r(db, q): # of matching docs in db
     - Independence assumption: the query words w_1, ..., w_m appear independently in db
     - r̃(db, q) = |db| · ∏_{j=1..m} df(db, w_j) / |db|
       - df(db, w_j): document frequency of w_j in db – could be the d̃f(db, w_j) from C̃(db)
  31. Estimating r(db, q), example 2 [GGT99]
     - r(db, q) = ∑_{d ∈ db, sim(d,q) > l} sim(d, q)
       - d: a doc in db
       - sim(d, q): vector dot product between d and q
         - each word in d and q weighted with the common tf·idf weighting
       - l: a pre-specified threshold
  32. Estimating r(db, q), example 2 (cont'd)
     - Content summary C(db) required:
       - df(db, w): document frequency of w
       - v̄(db, w) = ∑_{d ∈ db} (weight of w in d's vector)
         - <v̄(db, w_1), v̄(db, w_2), ...> is the "centroid" of the entire db viewed as a cluster of doc vectors
  33. Estimating r(db, q), example 2 (cont'd)
     - l = 0: the sum of all query-doc similarity values of db
       - r(db, q) = ∑_{d ∈ db} sim(d, q)
       - r̃(db, q) = r(db, q) = <v(q, w_1), ...> · <v̄(db, w_1), v̄(db, w_2), ...>
         - v(q, w): weight of w in the query vector
     - What about l > 0?
  34. Estimating r(db, q), example 2 (cont'd)
     - Assume a uniform weight of w among all docs that use w
       - i.e., the weight of w in any such doc = v̄(db, w) / df(db, w)
     - Highly-correlated query words scenario:
       - If df(db, w_i) < df(db, w_j), every doc using w_i also uses w_j
       - Sort the words in q so that df(db, w_1) ≤ df(db, w_2) ≤ ... ≤ df(db, w_m)
       - r̃(db, q) = ∑_{i=1..p} v(q, w_i)·v̄(db, w_i) + df(db, w_p)·[∑_{j=p+1..m} v(q, w_j)·v̄(db, w_j)/df(db, w_j)], where p is determined by some criteria [GGT99]
     - Disjoint query words scenario (see the sketch below):
       - No doc using w_i also uses w_j
       - r̃(db, q) = ∑ over i such that df(db, w_i) > 0 and v(q, w_i)·v̄(db, w_i)/df(db, w_i) > l of v(q, w_i)·v̄(db, w_i)
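A minimal sketch of the disjoint-query-words estimate: under the uniform-weight assumption, every doc containing w_i has per-doc similarity v(q, w_i)·v̄(db, w_i)/df(db, w_i); if that exceeds l, all df(db, w_i) such docs count, contributing v(q, w_i)·v̄(db, w_i) in total. All numbers below are hypothetical.

```python
# Disjoint-words estimate of r(db, q) from the content summary alone.

def r_disjoint(query_weights, db_df, db_v, threshold):
    total = 0.0
    for w, qw in query_weights.items():
        df = db_df.get(w, 0)
        if df == 0:
            continue
        per_doc_sim = qw * db_v[w] / df      # uniform-weight assumption
        if per_doc_sim > threshold:
            total += qw * db_v[w]            # df docs, each contributing per_doc_sim
    return total

query = {"applied": 1.2, "mathematics": 0.9}
df = {"applied": 4000, "mathematics": 2500}
v = {"applied": 5200.0, "mathematics": 4100.0}   # summed word weights over db
print(r_disjoint(query, df, v, threshold=1.3))
```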
  35. Estimating r(db, q), example 2 (cont'd)
     - The ranking of databases based on r̃(db, q) has been empirically evaluated [GGT99]
  36. A probabilistic model for errors in estimation [LLC04]
     - Any estimation makes errors
     - An error (observed) distribution for each db
       - the distribution of db_1 ≠ the distribution of db_2
     - Definition of error: relative
  37. Modeling the errors: a motivating experiment
     - db_PMC: PubMed Central, www.pubmedcentral.nih.gov
     - Two query sets, Q_1 and Q_2 (healthcare related)
       - |Q_1| = |Q_2| = 1000, Q_1 ∩ Q_2 = ∅
     - Compute err(db_PMC, q) for each sample query q ∈ Q_1 or Q_2
     - The two observed error distributions are very similar; further verified through statistical tests (Pearson χ²)
     [Figure: error probability distributions of err(db_PMC, q) over Q_1 and over Q_2.]
  38. Implications of the experiment
     - For a text database:
       - Similar error behavior across sample queries
       - One can sample a database and summarize the error behavior into an Error Distribution (ED)
       - Use the ED to predict the error for a future, unseen query
     - Sample-size study [LLC04]:
       - A few hundred sample queries are good enough
  39. From an Error Distribution (ED) to a Relevancy Distribution (RD)
     - Database: db_1. Query: q_new
     - Construction: ① the ED of db_1 comes from sampling; ② the existing estimation method gives r̃(db_1, q_new) = 1000; ③ each relative error in the ED is applied to the estimate; ④ the result is an RD for r(db_1, q_new)
     [Figure: the ED of db_1 (probabilities 0.1, 0.5, 0.4 over relative errors −50%, 0%, +50%) is mapped onto an RD for r(db_1, q_new) over the values 500, 1000, 1500.]
  40. RD-based selection
     - Estimation-based: db_1 > db_2, since r̃(db_1, q_new) = 1000 > r̃(db_2, q_new) = 650
     - RD-based: db_1 < db_2, since Pr(db_1 < db_2) = 0.85 (see the sketch below)
     [Figure: db_1's RD spreads r̃(db_1, q_new) = 1000 over 500/1000/1500 via errors −50%/0%/+50%; db_2's RD spreads r̃(db_2, q_new) = 650 over 650/1300 via errors 0%/+100% with probabilities 0.1/0.9. Comparing the two RDs gives Pr(db_1 < db_2) = 0.85.]
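A minimal sketch of the ED-to-RD construction and the RD-based comparison. The error distributions and point estimates below are hypothetical stand-ins (they happen to yield a 0.85 preference for db_2, the probability quoted on the slide).

```python
# Turn an error distribution into a relevancy distribution and compare
# two databases probabilistically.

def relevancy_distribution(estimate, error_dist):
    """ED -> RD: apply each relative error to the point estimate r~(db, q)."""
    return {estimate * (1 + err): p for err, p in error_dist.items()}

def prob_less_than(rd_a, rd_b):
    """Pr(r_a < r_b) for two independent discrete RDs."""
    return sum(pa * pb
               for va, pa in rd_a.items()
               for vb, pb in rd_b.items()
               if va < vb)

rd1 = relevancy_distribution(1000, {-0.5: 0.4, 0.0: 0.5, +0.5: 0.1})
rd2 = relevancy_distribution(650, {0.0: 0.1, +1.0: 0.9})

print(prob_less_than(rd1, rd2))  # -> 0.85 with these numbers
# A high value means the RD-based rule prefers db2 even though the point
# estimates alone (1000 vs. 650) would prefer db1.
```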
  41. Correctness metric
     - Terminology:
       - DB_k: the k databases returned by some selection method
       - DB_topk: the actual top-k answer
     - How correct is DB_k compared to DB_topk?
     - Absolute correctness: Cor_a(DB_k) = 1 if DB_k = DB_topk, and 0 otherwise
     - Partial correctness: Cor_p(DB_k) = |DB_k ∩ DB_topk| / k
     - Cor_a(DB_k) = Cor_p(DB_k) for k = 1
  42. Effectiveness of RD-based selection
     - 20 healthcare-related text databases on the Web
     - Q_1 (training, 1000 queries) used to learn the ED of each database
     - Q_2 (testing, 1000 queries) used to test the correctness of database selection
     - Results:
         Selection method                                  Avg(Cor_a) = Avg(Cor_p), k=1   Avg(Cor_a), k=3   Avg(Cor_p), k=3
         Estimation-based (term-independence estimator)    0.471                          0.301             0.699
         RD-based selection                                0.651 (+38.2%)                 0.478 (+58.8%)    0.815 (+30.9%)
  43. Probing to improve correctness
     - RD-based selection (k = 1):
       - 0.85 = Pr(db_2 > db_1)
              = Pr({db_2} = DB_top1)
              = 1·Pr({db_2} = DB_top1) + 0·Pr({db_2} ≠ DB_top1)
              = E[Cor_a({db_2})]
     - Probing db_i: contact db_i to obtain its exact relevancy
     - After probing db_1 and observing r(db_1, q) = 500:
       - E[Cor_a({db_2})] = Pr(db_2 > db_1) = 1, since both possible values of r(db_2, q) exceed 500
     [Figure: db_1's RD (500/1000/1500) collapses to the probed value r(db_1, q) = 500, while db_2's RD (650/1300 with probabilities 0.1/0.9) is unchanged.]
  44. Computing the expected correctness
     - Expected absolute correctness:
       - E[Cor_a(DB_k)] = 1·Pr(Cor_a(DB_k) = 1) + 0·Pr(Cor_a(DB_k) = 0) = Pr(Cor_a(DB_k) = 1) = Pr(DB_k = DB_topk)
     - Expected partial correctness: E[Cor_p(DB_k)], computed analogously from the RDs
  45. Adaptive probing algorithm: APro
     - User-specified correctness threshold: t
     - Loop: given the RDs of the probed and unprobed databases, check whether any DB_k has E[Cor(DB_k)] ≥ t
       - YES: return that DB_k
       - NO: probe one more database and repeat
     [Diagram: databases db_1, ..., db_i are probed; db_{i+1}, ..., db_n remain unprobed.]
  46. Which database to probe?
     - A greedy strategy (see the sketch below):
       - The stopping condition: E[Cor(DB_k)] ≥ t
       - Once probed, which database leads to the highest E[Cor(DB_k)]?
       - Suppose we probe db_3, whose RD has possible values r_a, r_b, r_c:
         - if r(db_3, q) = r_a, max E[Cor(DB_k)] = 0.85
         - if r(db_3, q) = r_b, max E[Cor(DB_k)] = 0.8
         - if r(db_3, q) = r_c, max E[Cor(DB_k)] = 0.9
       - Probe the database that leads to the largest "expected" max E[Cor(DB_k)]
     [Diagram: RDs of db_1, db_2, db_3, db_4; db_3's possible outcomes r_a, r_b, r_c branch into different max E[Cor(DB_k)] values.]
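A minimal sketch of the greedy probing choice for k = 1 with small discrete RDs, evaluating by exhaustive enumeration the expected best E[Cor_a] after probing each candidate database; the RDs are hypothetical.

```python
# For each candidate database, average over its possible relevancy values the
# best achievable E[Cor_a] once that value is pinned down; probe the database
# with the highest such expectation.
from itertools import product

rds = {                      # relevancy value -> probability, per database
    "db1": {500: 0.1, 1000: 0.5, 1500: 0.4},
    "db2": {650: 0.1, 1300: 0.9},
    "db3": {400: 0.5, 1600: 0.5},
}

def prob_is_top1(candidate, fixed):
    """Pr(candidate has the highest relevancy), with probed values pinned."""
    names = list(rds)
    dists = [[(fixed[n], 1.0)] if n in fixed else list(rds[n].items())
             for n in names]
    total = 0.0
    for combo in product(*dists):
        p, vals = 1.0, {}
        for n, (v, pr) in zip(names, combo):
            p *= pr
            vals[n] = v
        if max(vals, key=vals.get) == candidate:
            total += p
    return total

def best_expected_cor(fixed):
    # Best E[Cor_a] for k = 1: probability of the most likely top-1 database.
    return max(prob_is_top1(db, fixed) for db in rds)

for db in rds:
    exp_after = sum(p * best_expected_cor({db: v}) for v, p in rds[db].items())
    print(db, round(exp_after, 3))   # greedily probe the argmax
```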
  47. Effectiveness of adaptive probing
     - 20 healthcare-related text databases on the Web
     - Q_1 (training, 1000 queries) used to learn the RD of each database
     - Q_2 (testing, 1000 queries) used to test the correctness of database selection
     [Plots: avg Cor_a (k = 1), avg Cor_a (k = 3), and avg Cor_p (k = 3) versus the # of databases probed (0–5), comparing adaptive probing (APro) against the term-independence estimator.]
  48. The "lazy TA problem"
     - The same problem, generalized and "humanized"
     - After the final exam, the TA wants to find the top-scoring students
     - The TA is "lazy" and doesn't want to grade all exam sheets
     - Input: every student's score as a known distribution
       - Observed from previous quizzes and mid-term exams
     - Output: a grading strategy
       - Maximizes the correctness of the "guessed" top-k students
  49. Further study of this problem [LSC05]
     - Proves that greedy probing is optimal in special cases
     - More interesting factors to be explored:
       - An "optimal" probing strategy in the general case
       - Non-uniform probing costs
       - Time-variant distributions
  50. Roadmap
     - The problem
     - Database content modeling
     - Database selection
     - Summary
  51. Summary
     - Metasearch – a challenging problem
     - Database content modeling
       - Sampling enhanced by proper application of Zipf's law and Heaps' law
       - Content change modeled using Survival Analysis
     - Database selection
       - Estimation of database relevancy based on assumptions
       - A probabilistic framework that models the estimation error as a distribution
       - An "optimal" probing strategy taking a collection of distributions as input
  52. References
     - [CC01] J. P. Callan and M. Connell, "Query-Based Sampling of Text Databases," ACM Transactions on Information Systems, 19(2), 2001.
     - [GCM97] L. Gravano, C.-C. K. Chang, H. Garcia-Molina, A. Paepcke, "STARTS: Stanford Proposal for Internet Meta-searching," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1997.
     - [GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, "GlOSS: Text-Source Discovery over the Internet," ACM Transactions on Database Systems, 24(2), 1999.
     - [GIG01] N. Green, P. Ipeirotis, L. Gravano, "SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching," in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001.
     - [Hea78] H. S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978.
     - [IG02] P. Ipeirotis, L. Gravano, "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection," in Proc. of the 28th VLDB Conf., 2002.
  53. References (cont'd)
     - [INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, "Modeling and Managing Content Changes in Text Databases," in Proc. of the 21st IEEE Int'l Conf. on Data Engineering (ICDE), 2005.
     - [LLC04] Z. Liu, C. Luo, J. Cho, W. W. Chu, "A Probabilistic Approach to Metasearching with Adaptive Probing," in Proc. of the 20th IEEE Int'l Conf. on Data Engineering (ICDE), 2004.
     - [LSC05] Z. Liu, K. C. Sia, J. Cho, "Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty," in Proc. of the ACM Annual Symposium on Applied Computing, 2005.
     - [NPC05] A. Ntoulas, P. Zerfos, J. Cho, "Downloading Hidden Web Content," in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005.
