From federated to aggregated search
Fernando Diaz, Mounia Lalmas and Milad Shokouhi
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Introduction
- What is federated search?
- What is aggregated search?
  - Motivations
  - Challenges
  - Relationships
A classical example of federated search (www.theeuropeanlibrary.org): one query, several collections to be searched.
A classical example of federated search (www.theeuropeanlibrary.org): merged list of results.
Motivation for federated search
- Search a number of independent collections, with a focus on hidden web collections
  - Collections not easily crawlable (and often should not be)
- Access to up-to-date information and data
- Parallel search over several collections
- Effective tool for enterprise and digital library environments
Challenges for federated search
- How to represent collections, so that we know what documents each contains?
- How to select the collection(s) to be searched for relevant documents?
- How to merge results retrieved from several collections, to return one list of results to the users?
- In both cooperative and uncooperative environments
From federated search to aggregated search
- "Federated search on the web":
  - Peer-to-peer networks connect distributed peers (usually for file sharing), where each peer can be both server and client
  - Metasearch engines combine the results of different search engines into a single result list
  - Vertical search (also known as aggregated search) adds the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results
A classical example of aggregated search: news, homepage, Wikipedia, real-time results, video, Twitter, structured data.
Motivation for aggregated search
- Increasingly, different types of information are available, sought and relevant
  - e.g. news, image, wiki, video, audio, blog, map, tweet
- Search engines allow access to these through so-called verticals
- Two "ways" to search
  - Users can directly search the verticals
  - Or rely on so-called aggregated search
Google universal search, 2007: "[…] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results […] will incorporate information from a variety of previously separate sources, including videos, images, news, maps, books, and websites, into a single set of results." http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html
Motivation for aggregated search (Arguello et al, 09): 25K editorially classified queries.
Challenges in aggregated search
- Extremely heterogeneous collections
- What is/are the vertical intent(s)?
  - Handling ambiguous (query | vertical) intent
  - Handling non-stationary intent (e.g. news, local)
- How many results to return from each vertical, and where to position them on the result page?
  - Slotting results
  - Users mostly look at the first result page
- Page optimization and its evaluation
Ambiguous non-stationary intent. Example query with several senses (travel, mollusk, the name Paul); candidate verticals: Wikipedia, News, Image.
Recap – Introduction
                          federated search   aggregated search
heterogeneity             low                high
scale (documents, users)  small              large
user feedback             little             a lot
Terminology
- federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network
- resource, vertical, database, collection, source, server, domain, genre
- merging, blending, fusion, aggregation, slotted, tiled
Problem definition Present the “querier” with a summary of search results from one or more resources.
General architecture: the user issues a raw query to a search interface (portal / broker), which forwards queries to several sources / servers / verticals.
Peer-to-peer network: peers and a directory server.
Peer-to-peer (P2P) networks
- Broker-based: a single centralized broker with document lists shared from peers (e.g. Napster, original version)
- Decentralized: each peer acts as both client and server (e.g. Gnutella v0.4)
- Structure-based: use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
- Hierarchical: use local directory services for routing and merging (e.g. Swapper.NET)
Federated search: the broker sends the query to collections A–E, guided by their summaries (Sum A – Sum E), and returns merged results.
Federated search
- Also known as distributed information retrieval (DIR)
- Provides one portal for searching information from multiple sources
  - corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage
- Examples: Funnelback, Westlaw, FedStats, Cheshire (see also http://federatedsearchblog.com/)
http://funnelback.com/pdfs/brochures/enterprise.pdf
Metasearch: the user's raw query goes to a metasearch engine, which issues queries to several web search engines.
Metasearch
- A search engine that queries several other search engines and either combines their results (blended) or displays them separately (non-blended)
- Does not crawl the web, but relies on data gathered by other search engines
- Examples: Dogpile, Metacrawler, Search.com (see http://www.cryer.co.uk/resources/searchengines/meta.htm)
Aggregated search: the user's query (e.g. "Angelina Jolie") goes to the web text index and to several verticals; results are aggregated.
Aggregated search
- Specific to a web search engine
- Increasingly, more than one type of information is relevant to an information need
  - mostly web page + image, map, blog, etc.
- These types of information are indexed and ranked using dedicated approaches (verticals)
- Presenting the results from verticals in an aggregated way is believed to be more useful
- All major search engines do some level of aggregated search
Data fusion (e.g. Voorhees et al, 95): one document collection (e.g. GOV2), different document representations (anchor only, title only) and different retrieval models (BM25, KL, InQuery); the resulting rankings are merged into one ranked list of results.
Data fusion
- Search one collection
- Documents can be indexed in different ways
  - title index, abstract index, etc. (poly-representation)
  - weighting schemes
- Different retrieval models
- Rankings generated by different retrieval models (or different document representations) are merged to produce the final ranking
- Has often been shown to improve retrieval performance (TREC)
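The merging step can be illustrated with CombSUM and CombMNZ (Fox and Shaw's classic fusion rules, not named on the slide); a minimal sketch, assuming each run's scores are already normalized to a common range:

```python
def combsum(runs):
    """Merge rankings by summing each document's (normalized) scores.

    runs: list of dicts mapping doc_id -> score, one dict per system.
    Returns doc_ids sorted by fused score, best first.
    """
    fused = {}
    for run in runs:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused, key=fused.get, reverse=True)


def combmnz(runs):
    """Like CombSUM, but multiply by the number of runs retrieving the doc,
    rewarding documents found by several systems."""
    fused, hits = {}, {}
    for run in runs:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
            hits[doc] = hits.get(doc, 0) + 1
    return sorted(fused, key=lambda d: fused[d] * hits[d], reverse=True)
```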
Terminology - Resource
- Source
- Server
- Database
- Collection (federated search)
- Vertical (aggregated search)
- Domain
- Genre
Terminology - Aggregation
- Merging
- Blending
- Fusion
- Slotted
- Tiled
Aggregated search (tiled) http://au.alpha.yahoo.com/
Aggregated search (tiled) Naver.com
Aggregated search (slotted)
Others
- Clustering
- Faceted search
- Multi-document summarization
- Document generation
- Entity search
(see the special issue, in press, on "Current research in focused retrieval and result aggregation", Journal of Information Retrieval (Trotman et al, 10))
Yippy – Clustering search engine from Vivisimo clusty.com
Faceted search
Multi-document summarization http://newsblaster.cs.columbia.edu/
“Fictitious” document generation (Paris et al, 10)
Entity search http://sandbox.yahoo.com/Correlator
Recap
- Shown the relations between federated search, aggregated search, and others
- Exposed the various terminologies used
- In the rest of the tutorial, we concentrate on federated search and aggregated search
- Focus is on "effective search"
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Architecture: what are the general components of federated and aggregated search systems?
Federated search architecture
Aggregated search architecture
- Pre-retrieval aggregation: decide verticals before seeing results
- Post-retrieval aggregation: decide verticals after seeing results
- Pre-web aggregation: decide verticals before seeing web results
- Post-web aggregation: decide verticals after seeing web results
Post-retrieval, pre-web
Pre- and post-retrieval, pre-web
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Resource representation:  how to represent resources, so that we know what documents each contain.
Resource representation in federated search (Also known as resource summary/description)
Resource representation
- Cooperative environments
  - Comprehensive term statistics
  - Collection size information
- Uncooperative environments
  - Query-based sampling
  - Collection size estimation
Resource representation (cooperative environments)
- STARTS protocol (Gravano et al, 97)
  - Source metadata
  - Rich query language
Resource representation (cooperative environments)
- Different types of term statistics (Callan et al, 95; Gravano et al, 94a,b, 99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97)
- Anchor text: HARP (Hawking and Thomas, 05)
Resource representation (uncooperative environments)
- Query-based sampling (Callan and Connell, 01)
  1. Select a query and probe the collection
  2. Download the top n documents
  3. Select the next query; repeat
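The sampling loop above can be sketched as follows; `search` stands in for the uncooperative collection's query interface, and the random-term query selector is just one of the strategies discussed on the next slide:

```python
import random


def query_based_sample(search, seed_query, n=4, rounds=3):
    """Query-based sampling (after Callan & Connell, 01), minimal sketch.

    search(query, n): returns the top-n documents (here, plain strings) for
    a query; a stand-in for the collection's real search API.
    After the seed query, follow-up queries are drawn at random from the
    vocabulary of the documents sampled so far.
    """
    sampled, vocab = [], set()
    query = seed_query
    for _ in range(rounds):
        for doc in search(query, n):
            if doc not in sampled:
                sampled.append(doc)
                vocab.update(doc.lower().split())
        if not vocab:
            break  # probe returned nothing; stop sampling
        query = random.choice(sorted(vocab))
    return sampled, vocab
```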
Resource representation (uncooperative environments)
- Query selector
  - (Callan and Connell, 01): other resource description (ord); learned resource description (lrd): average tf, random, df, ctf
  - Query logs (Craswell, 00; Shokouhi et al, 07d)
  - Focused probing (Ipeirotis and Gravano, 02)
Resource representation (uncooperative environments)
- Adaptive sampling
  - (Shokouhi et al, 06a): rate of visiting new vocabulary
  - (Baillie et al, 06a): rate of sample quality improvement (reference query log)
  - (Caverlee et al, 06): proportional document ratio (PD), proportional vocabulary ratio (PV), vocabulary growth (VG)
Resource representation (uncooperative environments)
- Improving incomplete samples
  - Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04): topically related collections should share similar terms
  - Q-pilot (Sugiura and Etzioni, 00): sampled documents + backlinks + front page
Resource representation (collection size estimation)
- Capture-recapture (Liu et al, 01): sample A (capture), sample B (recapture)
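The capture-recapture idea reduces to the Lincoln-Petersen estimator: if two independent samples of sizes |A| and |B| share |A ∩ B| documents, the collection holds roughly |A|·|B| / |A ∩ B| documents. A minimal sketch:

```python
def capture_recapture(sample_a, sample_b):
    """Lincoln-Petersen estimate of collection size from two samples.

    sample_a, sample_b: iterables of document ids drawn (approximately
    uniformly) from the collection. Returns the estimated size, or None
    when the samples do not overlap (the estimate is then undefined).
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        return None
    return len(a) * len(b) / overlap


# e.g. two 100-document samples sharing 10 documents suggest a
# collection of about 100 * 100 / 10 = 1000 documents
```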
Resource representation (collection size estimation)
- Multiple queries sampler (Thomas and Hawking, 07)
- Random-walk sampler and pool-based sampler (Bar-Yossef and Gurevich, 06)
- Collection overlap estimation (Shokouhi and Zobel, 07)
Resource representation (updating summaries): (Ipeirotis et al, 05); (Shokouhi et al, 07a)
Resource representation in aggregated search
- Vertical content
  - samples or access to vertical API
  - represents content supply
- Vertical query logs
  - samples or access to historic vertical searches
  - represents content demand
Vertical content includes text NEWS
Vertical content includes structure SPORTS
Vertical content includes images IMAGES
Issues with vertical content
- Dynamics: some verticals become stale fast
- Heterogeneous content: heterogeneous ranking algorithms
- Non-free-text APIs: affect query-based sampling
Addressing content dynamics (Konig et al, 09)
- Sample the most recently indexed documents (Diaz, 09)
- Assumes users are more likely to be interested in recent content
- In practice, only a fraction of the corpus is needed to perform well
Addressing heterogeneous content (Arguello et al, 09)
- Use text available with documents (e.g. captions)
- Manually map to surrogates (e.g. Wikipedia pages)
(performance of two different methods of dealing with heterogeneous content)
Vertical query logs
- Queries issued directly to a vertical represent explicit vertical intent
- …
Issues with vertical query logs
- Dynamics: some verticals require temporally-sensitive sampling…
Hybrid approaches
- Should only sample documents likely to be useful for vertical selection/merging
- …
Recap – Resource representation
                             federated search   aggregated search
representation completeness  low                low-high
representation …
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Resource selection:  how to select the resource(s) to be searched for relevant documents.
Resource selection for federated search: the broker forwards the query to a subset of the collections (A, B, C, D, …).
- "Big-document" bag-of-words summaries
  - CORI (Callan et al, 95)
  - …
Resource selection (lexicon-based methods): CORI, GlOSS
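For reference, the CORI belief score of one query term for one collection can be sketched as below, with the default constants usually reported for it (50, 150, b = 0.4); a collection is ranked by the sum of its per-term beliefs:

```python
import math


def cori_score(df, cf, cw, avg_cw, n_collections, b=0.4):
    """CORI belief score of one query term for one collection
    (after Callan et al, 95), minimal sketch.

    df: documents in this collection containing the term
    cf: number of collections containing the term
    cw: word count of this collection; avg_cw: mean over all collections
    n_collections: number of collections being ranked
    """
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)              # frequency part
    i = math.log((n_collections + 0.5) / cf) / math.log(n_collections + 1.0)
    return b + (1.0 - b) * t * i
```

Summing `cori_score` over all query terms gives the collection's ranking score.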
- Sample documents with retained boundaries
  - ReDDE (Si and Callan, 03a)
  - …
Resource selection (document-surrogate methods)
- ReDDE
- ReDDE assumes that the top-ranked sampled…
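A simplified sketch of the ReDDE idea: rank the pooled samples with a central retrieval model, then estimate how many relevant documents each collection holds by scaling its share of the top-ranked samples by its collection-to-sample size ratio (the full method also estimates a relevance cutoff):

```python
def redde_scores(central_ranking, sample_sizes, collection_sizes, top_n=100):
    """ReDDE (after Si & Callan, 03a), minimal sketch.

    central_ranking: (collection, doc_id) pairs ranked by a retrieval model
    run over the pooled samples of all collections.
    sample_sizes / collection_sizes: dicts keyed by collection; the
    collection sizes may themselves be capture-recapture estimates.
    Returns each collection's estimated number of relevant documents.
    """
    scores = {c: 0.0 for c in sample_sizes}
    for collection, _doc in central_ranking[:top_n]:
        # each top-ranked sampled document stands for size/sample_size docs
        scores[collection] += collection_sizes[collection] / sample_sizes[collection]
    return scores
```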
Resource selection (document-surrogate methods)
- SUSHI
(http://www.monthly.se/nucleus/index.php?itemid=1464)
Resource selection (document-surrogate methods)
- SUSHI: different regression functions for each collection…
- Utility maximization techniques
  - Model the search effectiveness
  - DT…
Resource selection in aggregated search
- Content-based predictors
  - derived from (sampled) vertical…
Content-based predictors
- Distributed information retrieval (DIR) predictors
- Simple result-set pr…
Issues with content-based predictors
- DIR (usually) assumes homogeneous content types
- performance…
String-based predictors
- Dictionary lookups
  - terms correlated with a vertical (e.g., movie titles…
String-based predictors
- Issues
  - curating lists and expressions (manual or automatic)
  - …
Log-based predictors
- Classification approaches (Beitzel et al, 07; Li et al, 08)
- …
Comparing predictor performance (Arguello et al, 09)
Predictor cost
- Pre-retrieval predictors
  - computed without sending the query to the vertical
- …
Combining predictors
- Use predictors as features for a machine-learned model
- Training data…
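A toy sketch of the idea, using a hand-rolled logistic-regression learner over hypothetical predictor features (a production system would use its own predictors and learner; the feature meanings here are assumptions):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_vertical_classifier(examples, epochs=200, lr=0.5):
    """Learn weights over predictor features for one vertical.

    examples: list of (feature_vector, label) pairs, label 1 if the vertical
    is relevant to the query. Features might be e.g. a content-based score,
    a query-log score and a dictionary-match flag (all hypothetical).
    """
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)                     # w[0] is the bias term
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            g = y - p                          # log-likelihood gradient
            w[0] += lr * g
            for i, xi in enumerate(x):
                w[i + 1] += lr * g * xi
    return w


def predict(w, x):
    """Probability that the vertical is relevant, given the predictors."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
```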
Editorial data
- Data: <query, vertical, {+, -}>
- Features: predictors based on f(query, vertical)
- …
Combining predictors (Arguello et al, 09)
Click data
- Data: <query, vertical, {click, skip}>, <query, vertical, click-through rate>
- Features: pr…
Gathering click data
- Exploration bucket:
  - show suboptimal presentations in order to gather p…
Gathering click data
- Solutions
  - reduce impact to a small fraction of traffic/users
  - …
Click precision and recall (Konig et al, 09): ability to predict queries using thresholded click-through rate to infer rel…
Non-target data: some verticals have training data, others have no data.
Non-target data
- Data: <query, source vertical, {+, -}>
- Features: predictors based on f(query, target…
Non-target data (Arguello et al, 10)
Generic model
- Objective
  - train a single model that performs well for all source verticals
- …
Non-target data (Arguello et al, 10): adapted model
Adapted model
- Objective
  - learn the non-generic relationship between features and the target vertical…
Non-target query classification (Arguello et al, 10): average precision on target query classification…
Training set characteristics
- What is the cost of generating training data?
  - how much money?
  - …
Training set cost summary
Online adaptation
- Production vertical selection systems receive a variety of feedback signals
- …
Online adaptation
- Passive feedback: adjust predictions/parameters in response to feedback
  - all…
Online adaptation
- Issues
  - setting the learning rate for dynamic-intent verticals
  - …
Recap – Resource selection
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Result presentation: how to return results retrieved from several resources to users.
- Same source (web), different overlapping indexes
- Document scores may not be available
- …
- Same corpus
- Different retrieval models
- Document scores/positions available
- …
Result merging in federated search: the broker sends the query to the selected collections and merges the returned results for the user.
- CORI (Callan et al, 95)
  - Normalized collection score + normalized document score
  - …
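The CORI merging heuristic, as usually reported: min-max normalize the document score (within its collection's results) and the collection score (across selected collections), then boost documents from highly scored collections; a minimal sketch:

```python
def cori_merge_score(doc_score, coll_score, doc_min, doc_max, coll_min, coll_max):
    """CORI result-merging heuristic (after Callan et al, 95), minimal sketch.

    doc_min/doc_max: score range of the document's own result list.
    coll_min/coll_max: score range across the selected collections.
    Returns the merged score used to interleave all result lists.
    """
    d = (doc_score - doc_min) / (doc_max - doc_min)   # normalized doc score
    c = (coll_score - coll_min) / (coll_max - coll_min)  # normalized coll score
    return (d + 0.4 * d * c) / 1.4
```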
Result merging
- SSL (Si and Callan, 03b)
Result merging: source-specific scores are mapped to broker scores by linear regression (figure: http://upload.wikimedia.org/wikipedia/en/1/13/Linear_regression.png)
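SSL can be sketched as follows: for documents that occur both in a collection's result list and in the broker's sampled central index, fit a per-collection linear mapping from source scores to broker scores, then apply it to all of that collection's results (a minimal sketch with plain least squares):

```python
def fit_ssl(pairs):
    """Least-squares fit of broker_score ~ a * source_score + b
    (after Si & Callan, 03b), from (source_score, broker_score) pairs for
    the overlap documents present in the sampled central index."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b


def ssl_merge(results_by_collection, overlap_pairs_by_collection):
    """Map every collection's scores onto the broker scale, then sort."""
    merged = []
    for coll, results in results_by_collection.items():
        a, b = fit_ssl(overlap_pairs_by_collection[coll])
        merged += [(doc, a * s + b) for doc, s in results]
    return sorted(merged, key=lambda t: t[1], reverse=True)
```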
- Multi-lingual result merging
  - SSL with logistic regression (Si and Callan, 05a; Si et al, 08…
Images on top; images in the middle; images at the bottom; images at top-right; images on the left; images at the bottom-right; …
- Designers of aggregated search interfaces should account for the aggregation styles
- for both ve…
Recap – Result presentation
               federated search              aggregated search
content type   homogeneous (text documents)  heterogeneous
docu…
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Evaluation: how to measure the effectiveness of federated and aggregated search systems.
- CTF ratio (Callan and Connell, 01)
- Spearman rank correlation coefficient (SRCC) (Callan and…
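Both summary-quality measures are easy to state in code; a minimal sketch, taking the ctf ratio as the fraction of collection term occurrences covered by the sampled vocabulary, and Spearman without tie correction:

```python
def ctf_ratio(sample_vocab, collection_ctf):
    """CTF ratio (after Callan & Connell, 01): fraction of the collection's
    term occurrences covered by the sampled vocabulary.

    collection_ctf: dict term -> collection term frequency (full collection).
    """
    covered = sum(f for t, f in collection_ctf.items() if t in sample_vocab)
    return covered / sum(collection_ctf.values())


def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```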
Resource selection evaluation – Federated search
Result merging evaluation – Federated search
- Oracle
  - Correct merging (centralized index ranking…
Vertical selection evaluation – Aggregated search
- Majority of publications focus on single vertical selection
- …
Editorial data
- Guidelines
  - judge relevance based on vertical results (implicit judging of ret…
Behavioral data
- Infer relevance from behavioral data (e.g. click data)
- Evaluation metric…
Test collections (a la TREC). *On average, each video clip (document) contains more than 100 events/shots…
Test collections: ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track; topic t1: doc d1, d2, d3, …, dn; judgm…
Recap – Evaluation
                 federated search              aggregated search
editorial data   document relevance judgments  query labels
behavioral …
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Open problems in federated search
- Beyond big document
  - Classification-based server selection…
Open problems in aggregated search
- Evaluation metrics
  - slotted presentation
  - …
Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography
Bibliography
- J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. In…
- Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. Proceedings of WWW, …
- B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic combination of multiple ranked retrieval systems. ACM S…
- E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user in…
- D. Hawking and P. Thomas. Server selection methods in hybrid portal search. ACM SIGIR, pp 75-82, Salvador, Bra…
- X. Li, Y.-Y. Wang, and A. Acero. Learning query intent from regularized click graphs. ACM SIGIR, pp. 339-346. …
- S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a major Korean search engine. …
- M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative re…
- A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and experiments. WWW, pages 417-429…
- T. Tsikrika and M. Lalmas. Merging techniques for performing data fusion on the Web. ACM CIKM, pp 181-189, Atlanta…
SIGIR 2010 Tutorial, with Fernando Diaz & Milad Shokouhi

    1. 1. From federated to aggregated search Fernando Diaz, Mounia Lalmas and Milad Shokouhi [email_address] [email_address] [email_address]
    2. 2. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
    3. 3. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
    4. 4. Introduction <ul><li>What is federated search? </li></ul><ul><li>What is aggregated search? </li></ul><ul><ul><li>Motivations </li></ul></ul><ul><ul><li>Challenges </li></ul></ul><ul><ul><li>Relationships </li></ul></ul>
    5. 5. A classical example of federated search www.theeuropeanlibrary.org Collections to be searched One query
    6. 6. A classical example of federated search www.theeuropeanlibrary.org Merged list of results
    7. 7. Motivation for federated search <ul><li>Search a number of independent collections, with a focus on hidden web collections </li></ul><ul><ul><li>Collections not easily crawlable (and often should not) </li></ul></ul><ul><li>Access to up-to-date information and data </li></ul><ul><li>Parallel search over several collections </li></ul><ul><li>Effective tool for enterprise and digital library environments </li></ul>
    8. 8. Challenges for federated search <ul><li>How to represent collections, so that to know what documents each contain? </li></ul><ul><li>How to select the collection(s) to be searched for relevant documents? </li></ul><ul><li>How to merge results retrieved from several collections, to return one list of results to the users? </li></ul><ul><ul><li>Cooperative environment </li></ul></ul><ul><ul><li>Uncooperative environment </li></ul></ul>
    9. 9. From federated search to aggregated search <ul><li>“ Federated search on the web” </li></ul><ul><ul><li>Peer-to-peer network connects distributed peers (usually for file sharing), where each peer can be both server and client </li></ul></ul><ul><ul><li>Metasearch engine combines the results of different search engines into a single result list </li></ul></ul><ul><ul><li>Vertical search – also known as aggregated search – add the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results </li></ul></ul>
    10. 10. A classical example of aggregated search News Homepage Wikipedia Real-time results Video Twitter Structured Data
    11. 11. Motivation for aggregated search <ul><li>Increasingly different types of information being available, sough and relevant </li></ul><ul><ul><li>e.g. news, image, wiki, video, audio, blog, map, tweet </li></ul></ul><ul><li>Search engine allows accessing these through so-called verticals </li></ul><ul><li>Two “ways” to search </li></ul><ul><ul><li>Users can directly search the verticals </li></ul></ul><ul><ul><li>Or rely on so called aggregated search </li></ul></ul>Google universal search 2007 : [ … ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ … ] will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results. http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html
    12. 12. Motivation for aggregated search (Arguello et al , 09) 25K editorially classified queries
    13. 13. Motivation for aggregated search
    14. 14. Motivation for aggregated search
    15. 15. Challenges in aggregated search <ul><ul><li>Extremely heterogeneous collections </li></ul></ul><ul><li>What is/are the vertical intent(s)? </li></ul><ul><li>And </li></ul><ul><ul><ul><li>Handling ambiguous (query | vertical) intent </li></ul></ul></ul><ul><ul><ul><li>Handling non-stationary intent (e.g. news, local) </li></ul></ul></ul><ul><li>How many results from each to return and where to position them in the result page? </li></ul><ul><ul><ul><li>Slotting results </li></ul></ul></ul><ul><ul><ul><li>Users looking at 1 st result page </li></ul></ul></ul><ul><li>Page optimization and its evaluation </li></ul>
    16. 16. Ambiguous non-stationary intent Query - Travel - Molusk - Paul Vertical - Wikipedia - News - Image
    17. 17. Recap – Introduction federated search aggregated search heterogeneity low high scale (documents, users) small large user feedback little a lot
    18. 18. Terminology <ul><li>federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network </li></ul><ul><li>resource, vertical, database, collection, source, server, domain, genre </li></ul><ul><li>merging, blending, fusion, aggregation, slotted, tiled </li></ul>
    19. 19. Problem definition Present the “querier” with a summary of search results from one or more resources.
    20. 20. General architecture User Search Interface/ Portal/ Broker Source/ Server/ Vertical Source/ Server/ Vertical Source/ Server/ Vertical Source/ Server/ Vertical Raw Query Source/ Server/ Vertical Query Query Query Query Query
    21. 21. Peer-to-peer network Peer Directory Server
    22. 22. Peer to Peer (P2P) networks <ul><li>Broker-based </li></ul><ul><ul><li>Single centralized broker with documents lists shared from peer (e.g. Napster, original version ) </li></ul></ul><ul><li>Decentralized </li></ul><ul><ul><li>Each peer acts as both client and server (e.g. Gnutella v0.4) </li></ul></ul><ul><li>Structure-based </li></ul><ul><ul><li>Use distributed hash tables (DHT) (e.g. Chord (Stocia et al, 03) ) </li></ul></ul><ul><li>Hierarchical </li></ul><ul><ul><li>Use local directory services for routing and merging (e.g. Swapper.NET) </li></ul></ul>
    23. 23. Federated search Query Broker Collection A Query Query Query Query Query Collection B Collection C Collection D Collection E Sum A Sum B Sum C Sum D Sum E Merged results
    24. 24. Federated search <ul><li>Also known as distributed information retrieval (DIR) system </li></ul><ul><li>Provides one portal for searching information from multiple sources </li></ul><ul><ul><li>corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage </li></ul></ul><ul><li>Funnelback, Westlaw, FedStats, Cheshire, etc (see also http://federatedsearchblog.com/ ) </li></ul>
    25. 25. http://funnelback.com/pdfs/brochures/enterprise.pdf
    26. 26. Metasearch User Metasearch engine Raw Query WWW Query Query Query Query
    27. 27. Metasearch <ul><li>Search engine querying several different search engines and combines results from them (blended), or displays results separately (non-blended) </li></ul><ul><li>Does not crawl the web but rely on data gathered by other search engines </li></ul><ul><li>Dogpile,Metacrawler, Search.com, etc </li></ul><ul><ul><li>( see http://www.cryer.co.uk/resources/searchengines/meta.htm ) </li></ul></ul>
    28. 28. Aggregated search User Angelina Jolie Results WWW Index (text) Query Query Query Query
    29. 29. Aggregated search <ul><li>Specific to a web search engine </li></ul><ul><li>“ Increasingly” more than one type of information relevant to an information need </li></ul><ul><ul><li>mostly web page + image, map, blog, etc </li></ul></ul><ul><li>These types of information are indexed and ranked using dedicated approaches (verticals) </li></ul><ul><li>Presenting the results from verticals in an aggregated way believed to be more useful </li></ul><ul><li>All major search engines are doing some levels of aggregated search </li></ul>
    30. 30. Data fusion Query GOV2 BM25 KL Inquery Anchor only Title only One document collection Different document representations Different retrieval models Merging One ranked list of result (merged) (e.g. Voorhees etal, 95)
    31. 31. Data fusion <ul><li>Search one collection </li></ul><ul><li>Document can be indexed in different ways </li></ul><ul><ul><li>Title index, abstract index, etc (poly-representation) </li></ul></ul><ul><ul><li>Weighting scheme </li></ul></ul><ul><li>Different retrieval models </li></ul><ul><li>Rankings generated by different retrieval models (or different document representations) merged to produce the final rank </li></ul><ul><li>Has often been shown to improve retrieval performance (TREC) </li></ul>
    32. 32. Terminology - Resource <ul><li>Source </li></ul><ul><li>Server </li></ul><ul><li>Database </li></ul><ul><li>Collection (federated search) </li></ul><ul><li>Server </li></ul><ul><li>Vertical (aggregated search) </li></ul><ul><li>Domain </li></ul><ul><li>Genre </li></ul>
    33. 33. Terminology - Aggregation <ul><li>Merging </li></ul><ul><li>Blending </li></ul><ul><li>Fusion </li></ul><ul><li>Slotted </li></ul><ul><li>Tiled </li></ul>
    34. 34. Aggregated search (tiled) http://au.alpha.yahoo.com/
    35. 35. Aggregated search (tiled) Naver.com
    36. 36. Aggregated search (slotted)
    37. 37. Others <ul><li>Clustering </li></ul><ul><li>Faceted search </li></ul><ul><li>Multi-document summarization </li></ul><ul><li>Document generation </li></ul><ul><li>Entity search </li></ul><ul><ul><li>(see special issue – in press – on “Current research in focused retrieval and result aggregation”, Journal of Information Retrieval (Trotman etal, 10)) </li></ul></ul>
    38. 38. Yippy – Clustering search engine from Vivisimo clusty.com
    39. 39. Faceted search
    40. 40. Multi-document summarization http://newsblaster.cs.columbia.edu/
    41. 41. “ Fictitious” document generation (Paris et al, 10)
    42. 42. Entity search http://sandbox.yahoo.com/Correlator
    43. 43. Recap <ul><li>Shown the relations between federated, aggregated search, and others </li></ul><ul><li>Exposed the various terminologies used </li></ul><ul><li>In the rest of the tutorial, we concentrate on federated search and aggregated search </li></ul><ul><li>Focus is on “effective search” </li></ul>
    44. 44. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
    45. 45. Architecture: what are the general components of federated and aggregated search systems.
    46. 46. Federated search architecture
    47. 47. Aggregated search architecture <ul><li>Pre-retrieval aggregation: decide verticals before seeing results </li></ul><ul><li>Post-retrieval aggregation: decide verticals after seeing results </li></ul><ul><li>Pre-web aggregation: decide verticals before seeing web results </li></ul><ul><li>Post-web aggregation: decide verticals after seeing web results </li></ul>
    48. 48. Post-retrieval, pre-web
    49. 49. Pre and post-retrieval, pre-web
    50. 50. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
    51. 51. Resource representation: how to represent resources, so that we know what documents each contain.
    52. 52. Resource representation in federated search (Also known as resource summary/description)
    53. 53. Resource representation <ul><li>Cooperative environments </li></ul><ul><ul><li>Comprehensive term statistics </li></ul></ul><ul><ul><li>Collection size information </li></ul></ul><ul><li>Uncooperative environments </li></ul><ul><ul><li>Query-based sampling </li></ul></ul><ul><ul><li>Collection size estimation </li></ul></ul>
    54. 54. Resource representation (cooperative environments) <ul><li>STARTS Protocol (Gravano et al, 97) </li></ul><ul><ul><li>Source metadata </li></ul></ul><ul><ul><li>Rich query language </li></ul></ul>
    55. 55. <ul><li>Different types of term statistics </li></ul><ul><ul><li>(Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97 ) </li></ul></ul><ul><li>Anchor-text </li></ul><ul><ul><li>HARP (Hawking and Thomas, 05) </li></ul></ul>Resource representation (cooperative environments)
    56. 56. Resource representation (uncooperative environments) <ul><li>Query-based sampling ( Callan and Connell, 01 ) </li></ul><ul><ul><li>Select a query, probe collection </li></ul></ul><ul><ul><li>Download the top n documents </li></ul></ul><ul><ul><li>Select the next query, repeat </li></ul></ul>Query selector Query Sampled documents
    57. 57. <ul><li>Query selector </li></ul><ul><ul><li>(Callan and Connell, 01) </li></ul></ul><ul><ul><ul><li>Other resource description (ord) </li></ul></ul></ul><ul><ul><ul><li>Learned resource description (lrd) </li></ul></ul></ul><ul><ul><ul><ul><li>Average tf, random , df, ctf </li></ul></ul></ul></ul><ul><ul><li>Query logs </li></ul></ul><ul><ul><ul><li>(Craswell, 00; Shokouhi et al, 07d) </li></ul></ul></ul><ul><ul><li>Focused probing </li></ul></ul><ul><ul><ul><li>(Ipeirotis and Gravano, 02) </li></ul></ul></ul>Resource representation (uncooperative environments)
    58. 58. <ul><li>Adaptive sampling </li></ul><ul><ul><li>(Shokouhi et al, 06a) </li></ul></ul><ul><ul><ul><li>Rate of visiting new vocabulary </li></ul></ul></ul><ul><ul><li>(Baillie et al, 06a) </li></ul></ul><ul><ul><ul><li>Rate of sample quality improvement (reference query log) </li></ul></ul></ul><ul><ul><li>(Caverlee et al, 06) </li></ul></ul><ul><ul><ul><li>Proportional document ratio ( PD ) </li></ul></ul></ul><ul><ul><ul><li>Proportional vocabulary ratio ( PV ) </li></ul></ul></ul><ul><ul><ul><li>Vocabulary growth (VG) </li></ul></ul></ul>Resource representation (uncooperative environments)
    59. 59. <ul><li>Improving incomplete samples </li></ul><ul><ul><li>Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04) : topically related collections should share similar terms </li></ul></ul><ul><ul><li>Q-pilot (Sugiura and Etzioni, 00) : </li></ul></ul><ul><ul><li>sampled documents + backlinks + front page </li></ul></ul>Resource representation (uncooperative environments)
60. 60. <ul><li>Capture-recapture ( Liu et al, 01) </li></ul>Resource representation (Collection size estimation) Sample A (Capture) Sample B (recapture)
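The capture-recapture idea reduces to the Lincoln-Petersen estimator from ecology (cf. the Schumacher and Eschmeyer fish-population reference in the bibliography): if two independent random samples share some documents, the collection size is roughly |A|·|B|/|A∩B|. A minimal sketch:

```python
def capture_recapture_size(sample_a, sample_b):
    """Lincoln-Petersen collection-size estimate from two random samples
    of document ids: N is approximately |A| * |B| / |A intersect B|."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no recaptured documents; cannot estimate size")
    return len(a) * len(b) / overlap
```

With a 5-document capture and a 5-document recapture sharing 2 documents, the estimate is 5 × 5 / 2 = 12.5 documents.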
    61. 61. Resource representation (Collection size estimation)
    62. 62. <ul><li>Multiple queries sampler </li></ul><ul><li>( Thomas and Hawking, 07 ) </li></ul><ul><li>Random-walk sampler, and pool-based sampler </li></ul><ul><li>( Bar-Yossef and Gurevich, 06 ) </li></ul><ul><li>Collection overlap estimation </li></ul><ul><li>( Shokouhi and Zobel, 07 ) </li></ul>Resource representation (Collection size estimation)
    63. 63. Resource representation (Updating summaries) <ul><li>(Ipeirotis et al, 05) </li></ul><ul><li>(Shokouhi et al, 07a) </li></ul>
    64. 64. Resource representation in aggregated search <ul><li>Vertical content </li></ul><ul><ul><li>samples or access to vertical API </li></ul></ul><ul><ul><li>represents content supply </li></ul></ul><ul><li>Vertical query logs </li></ul><ul><ul><li>samples or access to historic vertical searches </li></ul></ul><ul><ul><li>represents content demand </li></ul></ul>
    65. 65. Vertical content includes text NEWS
    66. 66. Vertical content includes structure SPORTS
    67. 67. Vertical content includes images IMAGES
68. 68. Issues with vertical content <ul><li>Dynamics </li></ul><ul><ul><li>some verticals become stale quickly </li></ul></ul><ul><li>Heterogeneous content </li></ul><ul><ul><li>heterogeneous ranking algorithms </li></ul></ul><ul><li>Non-free-text APIs </li></ul><ul><ul><li>affects query-based sampling </li></ul></ul>
69. 69. Addressing content dynamics <ul><li>sample most recently indexed documents </li></ul><ul><ul><li>(Diaz, 09) </li></ul></ul><ul><li>assumes users are more likely to be interested in recent content </li></ul><ul><li>in practice, only a fraction of the corpus is needed to perform well </li></ul>(Konig et al, 09)
    70. 70. Addressing heterogeneous content <ul><li>use text available with documents (e.g. captions) </li></ul><ul><li>manually map to surrogates (e.g. wikipedia pages) </li></ul>(Arguello et al, 09) performance of two different methods of dealing with heterogeneous content
71. 71. Vertical query logs <ul><li>Queries issued directly to a vertical represent explicit vertical intent </li></ul><ul><li>This is similar to having a large body of labeled queries </li></ul>
    72. 72. Issues with vertical query logs <ul><li>Dynamics </li></ul><ul><ul><li>some verticals require temporally-sensitive sampling </li></ul></ul><ul><ul><li>for example, we do not want to sample news query logs for a whole year </li></ul></ul><ul><li>Non-free text APIs </li></ul><ul><ul><li>affects query modeling </li></ul></ul>
    73. 73. Hybrid approaches <ul><li>Should only sample documents likely to be useful for vertical selection/merging </li></ul><ul><ul><li>e.g. a document which is never requested is not useful for representing a vertical </li></ul></ul><ul><li>Suggests log-biased sampling </li></ul><ul><ul><li>(Shokouhi et al, 06; Arguello et al, 09) </li></ul></ul>
74. 74. Recap – Resource representation (federated search vs. aggregated search)
Representation completeness: low vs. low-high
Representation generation: sampling/shared dictionaries vs. sampling, API
Freshness: important vs. critical
    75. 75. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
    76. 76. Resource selection: how to select the resource(s) to be searched for relevant documents.
    77. 77. Resource selection for federated search Query Broker Collection A Query Query Query Collection B Collection C Collection D Collection E Sum A Sum B Sum C Sum D Sum E
    78. 78. <ul><li>“ Big-document” bag of word summaries </li></ul><ul><ul><ul><li>CORI ( Callan et al, 95) </li></ul></ul></ul><ul><ul><ul><li>GlOSS (Gravano et al, 94b) </li></ul></ul></ul><ul><ul><ul><li>CVV (Yuwono and Lee, 97) </li></ul></ul></ul>Resource selection (Lexicon-based methods) Collection C Collection A Collection B Sampling Sampling Sampling Broker
    79. 79. Resource selection (Lexicon-based methods) <ul><li>CORI </li></ul><ul><li>GlOSS </li></ul>
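As a rough illustration of the lexicon-based "big document" approach, here is a sketch of CORI-style collection scoring. The df·icf form and the default constants (belief 0.4, and 50/150 in the T component) follow the values commonly reported for CORI; the dictionary-based `stats` layout is purely illustrative, not from Callan et al, 95.

```python
import math

def cori_score(query_terms, coll, stats, b=0.4):
    """Sketch of CORI collection scoring (after Callan et al, 95).

    stats layout (illustrative): df[term][coll] = term's document frequency
    in the collection; cf[term] = number of collections containing the term;
    cw[coll] = total words in the collection; num_colls = collection count.
    """
    avg_cw = sum(stats["cw"].values()) / len(stats["cw"])
    beliefs = []
    for t in query_terms:
        df = stats["df"].get(t, {}).get(coll, 0)
        cf = stats["cf"].get(t, 0)
        T = df / (df + 50 + 150 * stats["cw"][coll] / avg_cw)
        I = math.log((stats["num_colls"] + 0.5) / max(cf, 1)) \
            / math.log(stats["num_colls"] + 1.0)
        beliefs.append(b + (1 - b) * T * I)
    return sum(beliefs) / len(beliefs)  # average belief over query terms
```

The broker ranks collections by this belief and forwards the query to the top-ranked ones.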
    80. 80. <ul><li>Sample documents with retained boundaries </li></ul><ul><ul><ul><li>ReDDE ( Si and Callan, 03a) </li></ul></ul></ul><ul><ul><ul><li>CRCS (Shokouhi, 07a) </li></ul></ul></ul><ul><ul><ul><li>SUSHI ( Thomas and Shokouhi, 09 ) </li></ul></ul></ul>Resource selection (Document-surrogate methods) Collection C Collection A Collection B Sampling Sampling Sampling Broker
    81. 81. Resource selection (Document-surrogate methods) <ul><li>ReDDE </li></ul><ul><li>ReDDE assumes that the top-ranked sampled documents are relevant. </li></ul><ul><li>ReDDE estimates the size of collections by sample-resample </li></ul><ul><li>Assuming that all collections have the same size we have: yellow > blue > red </li></ul><ul><li>CRCS is inspired by ReDDE but assigns different probability of relevance based on document position: red > yellow, blue </li></ul>Query Ranking Broker
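ReDDE's core estimate is easy to sketch: rank the pooled sampled documents with a centralized retrieval model, then let each top-ranked sampled document stand in for N_c/|S_c| documents of its source collection (with N_c estimated by sample-resample). The data layout below is illustrative, not the original implementation:

```python
def redde_scores(central_ranking, coll_sizes, sample_sizes, k=100):
    """Sketch of ReDDE (after Si and Callan, 03a).

    central_ranking: list of (doc_id, collection) pairs, i.e. the pooled
    sampled documents ranked by a centralized retrieval model. Each of the
    top-k sampled documents represents coll_sizes[c] / sample_sizes[c]
    documents of its source collection c.
    """
    scores = {c: 0.0 for c in coll_sizes}
    for _doc, coll in central_ranking[:k]:
        scores[coll] += coll_sizes[coll] / sample_sizes[coll]
    return scores  # estimated count of relevant documents per collection
```

Collections are then selected in decreasing order of this estimated relevant-document count.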
82. 82. <ul><li>SUSHI </li></ul>Resource selection (Document-surrogate methods)
83. 83. <ul><li>SUSHI </li></ul>Resource selection (Document-surrogate methods)
84. 84. <ul><li>SUSHI </li></ul>Resource selection (Document-surrogate methods) <ul><li>Different regression functions for each collection and query </li></ul><ul><li>Scores are comparable (estimated over the same index) </li></ul>
    85. 85. <ul><li>Utility maximization techniques </li></ul><ul><ul><li>Model the search effectiveness </li></ul></ul><ul><ul><li>DTF (Nottelmann and Fuhr, 03) , UUM (Si and Callan, 04a) , RUM (Si and Callan, 05b) </li></ul></ul><ul><li>Classification-based methods </li></ul><ul><ul><li>Classify collections/queries for better selection </li></ul></ul><ul><ul><li>Classification-aware server selection (Ipeirotis and Gravano, 08) , classification-based resource selection (Arguello et al, 09a) , learning from past queries (Cetintas et al, 09) </li></ul></ul>Resource selection (Supervised methods)
86. 86. Resource selection in aggregated search <ul><li>Content-based predictors </li></ul><ul><ul><li>derived from (sampled) vertical content </li></ul></ul><ul><li>Query string-based predictors </li></ul><ul><ul><li>derived from query text, independent of any resource associated with a vertical </li></ul></ul><ul><li>Query log-based predictors </li></ul><ul><ul><li>derived from previous requests issued by users to the vertical portal </li></ul></ul>
87. 87. Content-based predictors <ul><li>Distributed information retrieval (DIR) predictors </li></ul><ul><li>Simple result set predictors </li></ul><ul><ul><li>numresults, score distributions, etc </li></ul></ul><ul><ul><li>(Diaz, 09; Konig et al, 09) </li></ul></ul><ul><li>Complex result set predictors </li></ul><ul><ul><li>Clarity (Cronen-Townsend et al, 02) </li></ul></ul><ul><ul><li>Autocorrelation (Diaz, 07) </li></ul></ul><ul><ul><li>Many, many more (Hauff, 10) </li></ul></ul>
    88. 88. Issues with content-based predictors <ul><li>DIR (usually) assumes homogeneous content types </li></ul><ul><li>performance predictors (usually) assume text corpora </li></ul><ul><li>assumes ranking function consistency </li></ul><ul><ul><li>between verticals </li></ul></ul><ul><ul><li>between vertical selector machine and vertical ranker machine </li></ul></ul><ul><li>verticals have different dynamics (e.g. news vs. image) </li></ul>
    89. 89. String-based predictors <ul><li>Dictionary lookups </li></ul><ul><ul><li>terms correlated with a vertical (e.g., movie titles) </li></ul></ul><ul><li>Regular expressions </li></ul><ul><ul><li>patterns correlated with explicit vertical requests (e.g., obama news) </li></ul></ul><ul><li>Named entities </li></ul><ul><ul><li>automatically-detected entity types (e.g., geographic entities) </li></ul></ul>
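Dictionary and regular-expression predictors are straightforward to sketch. The patterns below are toy examples of explicit vertical requests; real systems curate far larger, manually vetted lists:

```python
import re

# Illustrative patterns for explicit vertical intent (toy examples only).
VERTICAL_PATTERNS = {
    "news":  re.compile(r"\b(news|headlines|latest)\b", re.I),
    "image": re.compile(r"\b(photos?|pictures?|images?)\b", re.I),
}

def string_predictors(query):
    """Return, per vertical, whether the query string explicitly requests it."""
    return {v: bool(p.search(query)) for v, p in VERTICAL_PATTERNS.items()}
```

For example, "obama news" fires the news pattern but not the image pattern, matching the high-precision/low-recall behavior discussed on the next slide.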
    90. 90. String-based predictors <ul><li>Issues </li></ul><ul><ul><li>curating lists and expressions (manual or automatic) </li></ul></ul><ul><ul><li>terms included in dictionary manually vetted for relevance </li></ul></ul><ul><ul><ul><li>high precision/low recall </li></ul></ul></ul>
91. 91. Log-based predictors <ul><li>Classification approaches </li></ul><ul><ul><li>(Beitzel et al, 07; Li et al, 08) </li></ul></ul><ul><li>Language model approaches </li></ul><ul><ul><li>(Arguello et al, 09) </li></ul></ul><ul><li>Issues </li></ul><ul><ul><li>verticals with structured queries (e.g. local) </li></ul></ul><ul><ul><li>query logs with dynamics (e.g. news) </li></ul></ul><ul><ul><li>(Diaz, 09) </li></ul></ul>
    92. 92. Comparing predictor performance (Arguello et al, 09)
    93. 93. Predictor cost <ul><li>Pre-retrieval predictors </li></ul><ul><ul><li>computed without sending the query to the vertical </li></ul></ul><ul><ul><li>no network cost </li></ul></ul><ul><li>Post-retrieval predictors </li></ul><ul><ul><li>computed on the results from the vertical </li></ul></ul><ul><ul><li>requires vertical support of web scale query traffic </li></ul></ul><ul><ul><li>incurs network latency </li></ul></ul><ul><ul><li>can be mitigated with vertical content caches </li></ul></ul>
94. 94. Combining predictors <ul><li>Use predictors as features for a machine-learned model </li></ul><ul><li>Training data </li></ul><ul><ul><li>editorial data </li></ul></ul><ul><ul><li>behavioral data (e.g. clicks) </li></ul></ul><ul><ul><li>other vertical data </li></ul></ul>(Diaz, 09; Arguello et al, 09; Konig et al, 09)
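A minimal sketch of the machine-learned combination: a log-linear (logistic) model over predictor features, trained by stochastic gradient ascent. The feature layout and training loop are illustrative, not taken from the cited papers:

```python
import math

def train_loglinear(examples, lr=0.1, epochs=200):
    """Toy log-linear vertical-selection model in the spirit of
    (Diaz, 09; Arguello et al, 09). examples: list of
    (feature_vector, label) pairs, label 1 = vertical relevant."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            p = predict(w, x)
            # gradient step on the log-likelihood
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    """Probability that the vertical is relevant to the query."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

Each feature here would be one of the content-, string-, or log-based predictors from the previous slides.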
95. 95. Editorial data <ul><li>Data: <query,vertical,{+,-}> </li></ul><ul><li>Features: predictors based on f(query,vertical) </li></ul><ul><li>Models: </li></ul><ul><ul><li>log-linear (Arguello et al, 09) </li></ul></ul><ul><ul><li>boosted decision trees (Arguello et al, 10) </li></ul></ul>
96. 96. Combining predictors (Arguello et al, 09)
97. 97. Click data <ul><li>Data: <query,vertical,{click,skip}>, <query,vertical,click-through rate> </li></ul><ul><li>Features: predictors based on f(query,vertical) </li></ul><ul><li>Models: </li></ul><ul><ul><li>log-linear (Diaz, 09) </li></ul></ul><ul><ul><li>boosted decision trees (Konig et al, 09) </li></ul></ul>
    98. 98. Gathering click data <ul><li>Exploration bucket: </li></ul><ul><ul><li>show suboptimal presentations in order to gather positive (and negative) click/skip data </li></ul></ul><ul><li>Cold start problem: </li></ul><ul><ul><li>without a basic model, the best exploration is random </li></ul></ul><ul><li>Random exploration results in poor user experience </li></ul>
    99. 99. Gathering click data <ul><li>Solutions </li></ul><ul><ul><li>reduce impact to small fraction of traffic/users </li></ul></ul><ul><ul><li>train a basic high-precision non-click model (perhaps with editorial data) </li></ul></ul><ul><li>Other issues </li></ul><ul><ul><li>Presentation bias: different verticals have different click-through rates a priori </li></ul></ul><ul><ul><li>Position bias: different presentation positions have different click-through rates a priori </li></ul></ul>
100. 100. Click precision and recall (Konig et al, 09) ability to predict vertical queries, using thresholded click-through rate to infer relevance
101. 101. Non-target data (figure: some verticals have training data; the target vertical has no data)
102. 102. Non-target data <ul><li>Data: <query,source vertical,{+,-}> </li></ul><ul><li>Features: predictors based on f(query,target vertical) </li></ul><ul><li>Models: </li></ul><ul><ul><li>generic model+adaptation </li></ul></ul><ul><ul><li>(Arguello et al, 10) </li></ul></ul>
103. 103. Non-target data <ul><ul><li>(Arguello et al, 10) </li></ul></ul>
104. 104. Generic model <ul><li>Objective </li></ul><ul><ul><li>train a single model that performs well for all source verticals </li></ul></ul><ul><li>Assumption </li></ul><ul><ul><li>if it performs well across all source verticals, it will perform well on the target vertical </li></ul></ul><ul><ul><li>(Arguello et al, 10) </li></ul></ul>
105. 105. Non-target data <ul><ul><li>(Arguello et al, 10) </li></ul></ul>adapted model
106. 106. Adapted model <ul><li>Objective </li></ul><ul><ul><li>learn non-generic relationship between features and the target vertical </li></ul></ul><ul><li>Assumption </li></ul><ul><ul><li>can bootstrap from labels generated by the generic model </li></ul></ul><ul><ul><li>(Arguello et al, 10) </li></ul></ul>
107. 107. Non-target query classification <ul><ul><li>(Arguello et al, 10) </li></ul></ul>average precision on target query classification; red (blue) indicates statistically significant improvements (degradations) compared to the single predictor
108. 108. Training set characteristics <ul><li>What is the cost of generating training data? </li></ul><ul><ul><li>how much money? </li></ul></ul><ul><ul><li>how much time? </li></ul></ul><ul><ul><li>how many negative impressions as a result of exploration? </li></ul></ul><ul><li>Are targets normalized? </li></ul><ul><ul><li>can we compare classifier output? </li></ul></ul>
    109. 109. Training set cost summary
    110. 110. Online adaptation <ul><li>Production vertical selection systems receive a variety of feedback signals </li></ul><ul><ul><li>clicks, skips </li></ul></ul><ul><ul><li>reformulations </li></ul></ul><ul><li>A machine-learned system can adjust predictions based on real time user feedback </li></ul><ul><ul><li>very important for dynamic verticals </li></ul></ul><ul><ul><li>(Diaz, 09; Diaz and Arguello, 09) </li></ul></ul>
    111. 111. Online adaptation <ul><li>Passive feedback: adjust prediction/parameters in response to feedback </li></ul><ul><ul><li>allows recovery from false positives </li></ul></ul><ul><ul><li>difficult to recover from false negatives </li></ul></ul><ul><li>Active feedback/explore-exploit: opportunistically present suboptimal verticals for feedback </li></ul><ul><ul><li>allows recovery from both errors </li></ul></ul><ul><ul><li>incurs exploration cost </li></ul></ul><ul><ul><li>(Diaz, 09; Diaz and Arguello, 09) </li></ul></ul>
    112. 112. Online adaptation <ul><li>Issues </li></ul><ul><ul><li>setting learning rate for dynamic intent verticals </li></ul></ul><ul><ul><li>normalizing feedback signal across verticals </li></ul></ul><ul><ul><li>resolving feedback and training signal (click≠relevance) </li></ul></ul><ul><ul><li>(Diaz, 09; Diaz and Arguello, 09) </li></ul></ul>
    113. 113. Recap – Resource selection
    114. 114. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
115. 115. Result presentation: how to return results retrieved from several resources to users.
116. 116. <ul><li>Same source (the web), different overlapping indexes </li></ul><ul><li>Document scores may not be available </li></ul><ul><li>Title, snippet, position and timestamps </li></ul><ul><ul><li>D-WISE (Yuwono and Lee, 96) </li></ul></ul><ul><ul><li>Inquirus (Glover et al., 99) </li></ul></ul><ul><ul><li>SavvySearch (Dreilinger and Howe, 97) </li></ul></ul>Result merging (Metasearch engines)
    117. 117. <ul><li>Same corpus </li></ul><ul><li>Different retrieval models </li></ul><ul><li>Document scores/positions available </li></ul><ul><ul><li>Unsupervised techniques </li></ul></ul><ul><ul><ul><li>CombSUM, CombMNZ (Fox and Shaw, 93, 94) </li></ul></ul></ul><ul><ul><ul><li>Borda fuse ( Aslam and Montague, 01 ) </li></ul></ul></ul><ul><ul><li>Supervised techniques </li></ul></ul><ul><ul><ul><li>Bayes-fuse, weighted Borda fuse ( Aslam and Montague, 01 ) </li></ul></ul></ul><ul><ul><ul><li>Segment-based fusion ( Lillis et al 06, 08; Shokouhi 07b) </li></ul></ul></ul>Result merging (Data fusion)
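The unsupervised fusion baselines are compact enough to show in full. After min-max normalizing each run's scores, CombSUM sums a document's scores across runs, and CombMNZ additionally multiplies by the number of runs that retrieved the document (Fox and Shaw, 93, 94). The dict-of-scores layout is illustrative:

```python
def comb_sum(runs):
    """CombSUM: sum each document's min-max-normalized scores across runs.
    Each run is a dict mapping doc id to retrieval score."""
    fused = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        for doc, s in run.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc] = fused.get(doc, 0.0) + norm
    return fused

def comb_mnz(runs):
    """CombMNZ: CombSUM times the number of runs retrieving the document."""
    sums = comb_sum(runs)
    counts = {}
    for run in runs:
        for doc in run:
            counts[doc] = counts.get(doc, 0) + 1
    return {doc: s * counts[doc] for doc, s in sums.items()}
```

CombMNZ rewards documents retrieved by many systems, which is exactly the "chorus effect" that makes it a strong data-fusion baseline.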
    118. 118. Result merging in federated search User Broker Collection A Query Query Collection B Collection C Collection D Collection E Sum A Sum B Sum C Sum D Sum E Merged results Query
    119. 119. <ul><li>CORI ( Callan et al, 95) </li></ul><ul><ul><li>Normalized collection score + Normalized document score. </li></ul></ul>Result merging
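The CORI merging heuristic normalizes both collection and document scores to [0,1] and combines them as D'' = (D' + 0.4 · D' · C') / 1.4, so documents from highly scored collections get boosted. A sketch under illustrative data structures (per-collection result dicts):

```python
def cori_merge(results, coll_scores):
    """Sketch of CORI result merging (after Callan et al, 95).

    results: {collection: {doc_id: source score}}
    coll_scores: {collection: selection score from the broker}
    Returns (doc_id, merged score) pairs, best first.
    """
    c_lo, c_hi = min(coll_scores.values()), max(coll_scores.values())
    merged = []
    for coll, docs in results.items():
        C = (coll_scores[coll] - c_lo) / (c_hi - c_lo) if c_hi > c_lo else 1.0
        d_lo, d_hi = min(docs.values()), max(docs.values())
        for doc, s in docs.items():
            D = (s - d_lo) / (d_hi - d_lo) if d_hi > d_lo else 1.0
            merged.append((doc, (D + 0.4 * D * C) / 1.4))
    return sorted(merged, key=lambda pair: -pair[1])
```

With this weighting, the top document of the best collection gets the maximum merged score of 1.0.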
    120. 120. Result merging <ul><li>SSL (Si and Callan, 2003b) </li></ul>A G B C D E F H Query Ranking Selected resources L R D F Q Broker
121. 121. Result merging: linear regression maps source-specific document scores to broker scores.
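SSL's key trick (Si and Callan, 03b): documents that appear both in a collection's result list and in the broker's centralized sample index yield (source score, centralized score) training pairs, and a per-collection regression maps all of that collection's scores onto the broker's comparable scale. A least-squares sketch with an illustrative pair-list input:

```python
def ssl_map(pairs):
    """Fit a least-squares line mapping source-specific scores to
    centralized-sample scores, using the overlap documents as training
    pairs; returns a function converting any source score to the
    broker's scale (sketch of the SSL merging idea)."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda score: slope * score + intercept
```

One such mapping is fitted per selected collection, and the converted scores are then directly comparable for merging.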
122. 122. <ul><li>Multi-lingual result merging </li></ul><ul><ul><li>SSL with logistic regression (Si and Callan, 05a; Si et al, 08) </li></ul></ul><ul><li>Personalized metasearch </li></ul><ul><ul><li>(Thomas, 08) </li></ul></ul><ul><li>Merging overlapping collections </li></ul><ul><ul><li>COSCO (Hernandez and Kambhampati, 05): exact duplicates </li></ul></ul><ul><ul><li>GHV (Bernstein et al, 06; Shokouhi et al, 07b): exact/near duplicates </li></ul></ul>Result merging - Miscellaneous scenarios
123. 123. Slotted vs tiled result presentation (Sushmita et al, 10): user study varying 3 verticals, 3 positions, and 3 degrees of vertical intent; image placements tested include top, middle, bottom, top-right, left, and bottom-right of the result page.
124. 124. <ul><li>Designers of aggregated search interfaces should account for the aggregation style </li></ul><ul><li>In both styles, vertical intent is key for deciding on the position and type of “vertical” results </li></ul><ul><li>slotted: requires accurate estimation of the best position for the “vertical” result </li></ul><ul><li>tiled: requires accurate selection of the type of “vertical” result </li></ul>Slotted vs tiled
125. 125. Recap – Result presentation (federated search vs. aggregated search)
Content type: homogeneous (text documents) vs. heterogeneous
Document scores: depends on environment vs. heterogeneous
Oracle: centralized index vs. none
    126. 126. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
127. 127. Evaluation: how to measure the effectiveness of federated and aggregated search systems.
128. 128. <ul><li>CTF ratio (Callan and Connell, 01) </li></ul><ul><li>Spearman rank correlation coefficient (SRCC) (Callan and Connell, 01) </li></ul><ul><li>Kullback-Leibler divergence (KL) (Baillie et al, 06b; Ipeirotis et al, 05), topical KL (Baillie et al, 09) </li></ul><ul><li>Predictive likelihood (Baillie et al, 06a) </li></ul>Resource representation (summaries) evaluation – Federated search
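The KL-divergence measure compares the summary's language model against the full collection's: KL(C‖S) = Σ_t P(t|C) log(P(t|C)/P(t|S)), with lower values indicating a better summary. A sketch with simple epsilon smoothing (the smoothing choice is illustrative; the cited work uses more principled estimators):

```python
import math

def kl_divergence(collection_lm, summary_lm, vocab, eps=1e-10):
    """KL(collection || summary): how far the sampled summary's language
    model diverges from the true collection model; lower is better.
    Both models are dicts of term probabilities."""
    total = 0.0
    for t in vocab:
        p = collection_lm.get(t, 0.0)
        if p > 0:
            q = max(summary_lm.get(t, 0.0), eps)  # smooth unseen summary terms
            total += p * math.log(p / q)
    return total
```

A summary identical to the collection model scores 0; any mismatch yields a positive divergence.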
    129. 129. Resource selection evaluation – Federated search
    130. 130. Result merging evaluation – Federated search <ul><li>Oracle </li></ul><ul><ul><li>Correct merging (centralized index ranking) ( Hawking and Thistlewaite, 99) </li></ul></ul><ul><ul><li>Perfect merging (ordered by relevance labels) (H awking and Thistlewaite, 99) </li></ul></ul><ul><li>Metrics </li></ul><ul><ul><li>Precision </li></ul></ul><ul><ul><li>Correct matches ( Chakravarthy and Haase, 95) </li></ul></ul>
    131. 131. Vertical Selection Evaluation – Aggregated search <ul><li>Majority of publications focus on single vertical selection </li></ul><ul><ul><li>vertical accuracy, precision, recall </li></ul></ul><ul><li>Evaluation data </li></ul><ul><ul><li>editorial data </li></ul></ul><ul><ul><li>behavioral data </li></ul></ul>single vertical selection
    132. 132. Editorial data <ul><li>Guidelines </li></ul><ul><ul><li>judge relevance based on vertical results (implicit judging of retrieval/content quality) </li></ul></ul><ul><ul><li>judge relevance based on vertical description (assumes idealized retrieval/content quality) </li></ul></ul><ul><li>Evaluation metric derived from binary or graded relevance judgments </li></ul><ul><ul><li>(Arguello etal, 09; Arguello et al, 10) </li></ul></ul>
133. 133. Behavioral data <ul><li>Infer relevance from behavioral data (e.g. click data) </li></ul><ul><li>Evaluation metric </li></ul><ul><ul><li>regression error on predicted CTR </li></ul></ul><ul><ul><li>infer binary or graded relevance </li></ul></ul><ul><ul><li>(Diaz, 09; Konig et al, 09) </li></ul></ul>
134. 134. Test collections (a la TREC) (Zhou & Lalmas, 10)
Statistics on topics: number of topics: 150; average relevant docs per topic: 110.3; average relevant verticals per topic: 1.75; ratio of “General Web” topics: 29.3%; ratio of topics with two vertical intents: 66.7%; ratio of topics with more than two vertical intents: 4.0%.
Quantity per medium: text: 2125 GB, 86,186,315 documents; image: 41.1 GB, 670,439 documents; video: 445.5 GB, 1,253* documents; total: 2611.6 GB, 86,858,007 documents.
* On average, each video clip (document) contains more than 100 events/shots.
135. 135. Test collections (a la TREC): existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, …) are mapped to simulated verticals (Blog, Reference/Encyclopedia, Image, General Web, Shopping, …); each topic's per-document relevance judgments (R/N) are reused as per-vertical judgments.
136. 136. Recap – Evaluation (federated search vs. aggregated search)
Editorial data: document relevance judgments vs. query labels
Behavioral data: none vs. critical
    137. 137. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
138. 138. Open problems in federated search <ul><li>Beyond big document </li></ul><ul><ul><li>Classification-based server selection (Arguello et al, 09a) </li></ul></ul><ul><ul><li>Topic modeling </li></ul></ul><ul><li>Query expansion </li></ul><ul><ul><li>Previous techniques had little success (Ogilvie and Callan, 01; Shokouhi et al, 09) </li></ul></ul><ul><li>Evaluating federated search </li></ul><ul><ul><li>Confounding factors </li></ul></ul><ul><li>Federated search in other contexts </li></ul><ul><ul><li>Blog search (Elsas et al, 08; Seo and Croft, 08) </li></ul></ul><ul><li>Effective merging </li></ul><ul><ul><li>Supervised techniques </li></ul></ul>
    139. 139. Open problems in aggregated search <ul><li>Evaluation metrics </li></ul><ul><ul><li>slotted presentation </li></ul></ul><ul><ul><li>tiled presentation </li></ul></ul><ul><ul><li>metrics based on behavioral signals </li></ul></ul><ul><li>Models for multiple verticals </li></ul><ul><li>Minimizing the cost for new verticals, markets </li></ul>
    140. 140. Outline <ul><li>Introduction and Terminology </li></ul><ul><li>Architecture </li></ul><ul><li>Resource Representation </li></ul><ul><li>Resource Selection </li></ul><ul><li>Result Presentation </li></ul><ul><li>Evaluation </li></ul><ul><li>Open Problems </li></ul><ul><li>Bibliography </li></ul>
141. 141. Bibliography <ul><ul><li>J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo, Sources of evidence for vertical selection. In SIGIR 2009 (2009). </li></ul></ul><ul><ul><li>J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceedings of the ACM CIKM, pages 1277--1286, Hong Kong, China, 2009a. </li></ul></ul><ul><ul><li>J. Arguello, F. Diaz, J.-F. Paiement, Vertical Selection in the Presence of Unlabeled Verticals. In SIGIR 2010 (2010). </li></ul></ul><ul><ul><li>J. Aslam and M. Montague. Models for metasearch, In Proceedings of ACM SIGIR, pages 276--284, New Orleans, LA, 2001. </li></ul></ul><ul><ul><li>M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of distributed collections, In Proceedings of SPIRE, pages 316--328, Glasgow, UK, 2006a. </li></ul></ul><ul><ul><li>M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of estimated resource description quality for distributed IR. In X. Jia, editor, Proceedings of the First International Conference on Scalable Information systems, page 41, Hong Kong, 2006b. </li></ul></ul><ul><ul><li>M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR, pages 485--496, Toulouse, France, 2009. </li></ul></ul>
142. 142. Bibliography <ul><ul><li>Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. Proceedings of WWW, pages 367--376, Edinburgh, UK, 2006. </li></ul></ul><ul><ul><li>S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, and O. Frieder, Automatic classification of web queries using very large unlabeled query logs. ACM Trans. Inf. Syst. 25, 2 (2007), 9. </li></ul></ul><ul><ul><li>Y. Bernstein, M. Shokouhi, and J. Zobel. Compact features for detection of near-duplicates in distributed retrieval. Proceedings of SPIRE, pages 110--121, Glasgow, UK, 2006. </li></ul></ul><ul><ul><li>J. Callan and M. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97--130, 2001. </li></ul></ul><ul><ul><li>J. Callan, Z. Lu, and B. Croft. Searching distributed collections with inference networks. In Proceedings of ACM SIGIR, pages 21--28. Seattle, WA, 1995 </li></ul></ul><ul><ul><li>J. Caverlee, L. Liu, and J. Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of ACM SIGIR, pages 340--347. Seattle, WA, 2006. </li></ul></ul><ul><ul><li>S. Cetintas, L. Si, and H. Yuan, Learning from past queries for resource selection, In Proceedings of ACM CIKM, pages 1867--1870, Hong Kong, China, 2009. </li></ul></ul>
143. 143. <ul><ul><li>B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic Combination of Multiple Ranked Retrieval Systems, ACM SIGIR, pp 173-181, 1994. </li></ul></ul><ul><ul><li>C. Baumgarten. A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval, ACM SIGIR, pp 246-253, 1999. </li></ul></ul><ul><ul><li>N. Craswell. Methods for Distributed Information Retrieval. PhD thesis, Australian National University, 2000. </li></ul></ul><ul><ul><li>S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. ACM SIGIR, pp 299-306, 2002. </li></ul></ul><ul><ul><li>A. Chakravarthy and K. Haase. NetSerf: using semantic knowledge to find internet information archives, ACM SIGIR, pp 4-11, Seattle, WA, 1995. </li></ul></ul><ul><ul><li>F. Diaz. Performance prediction using spatial autocorrelation. ACM SIGIR, pp 583-590, 2007. </li></ul></ul><ul><ul><li>F. Diaz. Integration of news content into web results. ACM International Conference on Web Search and Data Mining, 2009. </li></ul></ul><ul><ul><li>F. Diaz and J. Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback, ACM SIGIR, 2009. </li></ul></ul><ul><ul><li>D. Dreilinger and A. Howe. Experiences with selecting search engines using metasearch. ACM Transaction on Information Systems, 15(3):195-222, 1997. </li></ul></ul><ul><ul><li>J. Elsas, J. Arguello, J. Callan, and J. Carbonell. Retrieval and feedback models for blog feed search, ACM SIGIR, pp 347-354, Singapore, 2008. </li></ul></ul>Bibliography
144. 144. <ul><ul><li>E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user information needs, ACM CIKM, pp 210--216, 1999. </li></ul></ul><ul><ul><li>L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. Third International conference on Parallel and Distributed Information Systems, pp 103--106, Austin, TX, 1994a. </li></ul></ul><ul><ul><li>L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text database discovery problem. ACM SIGMOD, pp 126--137, Minneapolis, MN, 1994b. </li></ul></ul><ul><ul><li>L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for internet metasearching, ACM SIGMOD, pp 207--218, Tucson, AZ, 1997. </li></ul></ul><ul><ul><li>L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the internet, ACM Transactions on Database Systems, 24(2):229--264, 1999. </li></ul></ul><ul><ul><li>E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval Conference, pp 243-252, Gaithersburg, MD, 1993. </li></ul></ul><ul><ul><li>E. Fox and J. Shaw. Combination of multiple searches, Third Text REtrieval Conference, pp 105-108, Gaithersburg, MD, 1994. </li></ul></ul><ul><ul><li>J. French and A. Powell. Metrics for evaluating database selection techniques, World Wide Web, 3(3):153--163, 2000. </li></ul></ul><ul><ul><li>C. Hauff. Predicting the Effectiveness of Queries and Retrieval Systems, PhD thesis, University of Twente, 2010. </li></ul></ul>Bibliography
    145. 145. <ul><ul><li>D. Hawking and P. Thomas. Server selection methods in hybrid portal search, ACM SIGIR, pp 75-82, Salvador, Brazil, 2005. </li></ul></ul><ul><ul><li>D. Hawking and P. Thistlewaite. Methods for information server selection, ACM Transactions on Information Systems, 17(1):40-76, 1999. </li></ul></ul><ul><ul><li>T. Hernandez and S. Kambhampati. Improving text collection selection with coverage and overlap statistics. WWW, pp 1128-1129, Chiba, Japan, 2005. </li></ul></ul><ul><ul><li>P. Ipeirotis and L. Gravano. When one sample is not enough: improving text database selection using shrinkage. ACM SIGMOD, pp 767-778, Paris, France, 2004. </li></ul></ul><ul><ul><li>P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB, pages 394-405, Hong Kong, China, 2002. </li></ul></ul><ul><ul><li>P. Ipeirotis and L. Gravano. Classification-aware hidden-web text database selection. ACM Transactions on Information Systems, 26(2):1-66, 2008. </li></ul></ul><ul><ul><li>P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano. Modeling and managing content changes in text databases, 21st International Conference on Data Engineering, pp 606-617, Tokyo, Japan, 2005. </li></ul></ul><ul><ul><li>A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries, ACM SIGIR, 2009. </li></ul></ul>Bibliography
146. 146. <ul><ul><li>X. Li, Y.-Y. Wang, and A. Acero, Learning query intent from regularized click graphs, ACM SIGIR, pp 339--346, 2008. </li></ul></ul><ul><ul><li>D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion, ACM SIGIR, pp 139-146, Seattle, WA, 2006. </li></ul></ul><ul><ul><li>K. Liu, C. Yu, and W. Meng. Discovering the representative of a search engine. ACM CIKM, pp 652-654, McLean, VA, 2002. </li></ul></ul><ul><ul><li>N. Liu, J. Yan, W. Fan, Q. Yang, and Z. Chen. Identifying Vertical Search Intention of Query through Social Tagging Propagation, WWW, Madrid, 2009. </li></ul></ul><ul><ul><li>W. Meng, Z. Wu, C. Yu, and Z. Li. A highly scalable and effective method for metasearch, ACM Transactions on Information Systems, 19(3):310-335, 2001. </li></ul></ul><ul><ul><li>W. Meng, C. Yu, and K. Liu. Building efficient and effective metasearch engines. ACM Computing Surveys, 34(1):48-89, 2002. </li></ul></ul><ul><ul><li>V. Murdock and M. Lalmas. Workshop on aggregated search, SIGIR Forum 42(2): 80-83, 2008. </li></ul></ul><ul><ul><li>H. Nottelmann and N. Fuhr. Combining CORI and the decision-theoretic approach for advanced resource selection, ECIR, pp 138--153, Sunderland, UK, 2004. </li></ul></ul><ul><ul><li>P. Ogilvie and J. Callan. The effectiveness of query expansion for distributed information retrieval, ACM CIKM, pp 183--190, Atlanta, GA, 2001. </li></ul></ul><ul><ul><li>C. Paris, S. Wan and P. Thomas. Focused and aggregated search: a perspective from natural language generation, Journal of Information Retrieval, Special Issue, 2010. </li></ul></ul>Bibliography
    <ul><li>S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a major Korean search engine, Library & Information Science Research, 31(2):126-133, 2009. </li></ul><ul><li>F. Schumacher and R. Eschmeyer. The estimation of fish populations in lakes and ponds, Journal of the Tennessee Academy of Science, 18:228-249, 1943. </li></ul><ul><li>J. Seo and B. Croft. Blog site search using resource selection, ACM CIKM, pp 1053-1062, Napa Valley, CA, 2008. </li></ul><ul><li>M. Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval, ECIR, pp 160-172, Rome, Italy, 2007a. </li></ul><ul><li>M. Shokouhi. Segmentation of search engine results for effective data-fusion, ECIR, pp 185-197, Rome, Italy, 2007b. </li></ul><ul><li>M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates, ACM Transactions on Information Systems, 27(3):1-29, 2009. </li></ul><ul><li>M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped collections, ACM SIGIR, pp 495-502, Amsterdam, Netherlands, 2007. </li></ul><ul><li>M. Shokouhi, F. Scholer, and J. Zobel. Sample sizes for query probing in uncooperative distributed information retrieval, Eighth Asia Pacific Web Conference, pp 63-75, Harbin, China, 2006a. </li></ul>
    <ul><li>M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval, ACM SIGIR, pp 316-323, Seattle, WA, 2006b. </li></ul><ul><li>M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval, Information Processing and Management, 43(1):169-180, 2007d. </li></ul><ul><li>M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated search, ACM SIGIR, pp 427-434, Singapore, 2009. </li></ul><ul><li>L. Si and J. Callan. Unified utility maximization framework for resource selection, ACM CIKM, pp 32-41, Washington, DC, 2004a. </li></ul><ul><li>L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual ranked lists, Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria, 2005a. http://www.cs.purdue.edu/homes/lsi/publications.htm </li></ul><ul><li>L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments, Information Retrieval, 11(1):1-24, 2008. </li></ul><ul><li>L. Si and J. Callan. Relevant document distribution estimation method for resource selection, ACM SIGIR, pp 298-305, Toronto, Canada, 2003a. </li></ul><ul><li>L. Si and J. Callan. Modeling search engine effectiveness for federated search, ACM SIGIR, pp 83-90, Salvador, Brazil, 2005b. </li></ul><ul><li>L. Si and J. Callan. A semisupervised learning method to merge search engine results, ACM Transactions on Information Systems, 21(4):457-491, 2003b. </li></ul>
    <ul><li>A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and experiments, WWW, pp 417-429, Amsterdam, Netherlands, 2000. </li></ul><ul><li>S. Sushmita, H. Joho and M. Lalmas. A task-based evaluation of an aggregated search interface, SPIRE, Saariselkä, Finland, 2009. </li></ul><ul><li>S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors affecting click-through behavior in aggregated search interfaces, ACM CIKM, Toronto, Canada, 2010. </li></ul><ul><li>S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of genre and domain intents, Technical Report, University of Glasgow, 2010. </li></ul><ul><li>S. Sushmita, H. Joho, M. Lalmas and J.M. Jose. Understanding domain "relevance" in web search, WWW 2009 Workshop on Web Search Result Summarization and Presentation, Madrid, Spain, 2009. </li></ul><ul><li>P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections, ACM SIGIR, pp 503-510, Amsterdam, Netherlands, 2007. </li></ul><ul><li>P. Thomas. Server characterisation and selection for personal metasearch, PhD thesis, Australian National University, 2008. </li></ul><ul><li>P. Thomas and M. Shokouhi. SUSHI: scoring scaled samples for server selection, ACM SIGIR, pp 419-426, Singapore, 2009. </li></ul><ul><li>A. Trotman, S. Geva, J. Kamps, M. Lalmas and V. Murdock (eds). Current research in focused retrieval and result aggregation, Special Issue in the Journal of Information Retrieval, Springer, 2010. </li></ul>
    <ul><li>T. Tsikrika and M. Lalmas. Merging techniques for performing data fusion on the web, ACM CIKM, pp 181-189, Atlanta, GA, 2001. </li></ul><ul><li>E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning collection fusion strategies, ACM SIGIR, pp 172-179, 1995. </li></ul><ul><li>J. Xu and J. Callan. Effective retrieval with distributed collections, ACM SIGIR, pp 112-120, Melbourne, Australia, 1998. </li></ul><ul><li>B. Yuwono and D. Lee. WISE: A world wide web resource database system, IEEE Transactions on Knowledge and Data Engineering, 8(4):548-554, 1996. </li></ul><ul><li>B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet, Fifth International Conference on Database Systems for Advanced Applications, pp 41-50, Melbourne, Australia, 1997. </li></ul><ul><li>A. Zhou and M. Lalmas. Building a test collection for aggregated search, Technical Report, University of Glasgow, 2010. </li></ul><ul><li>J. Zobel. Collection selection via lexicon inspection, Australian Document Computing Symposium, pp 74-80, Melbourne, Australia, 1997. </li></ul>