From federated to aggregated search
Fernando Diaz, Mounia Lalmas and Milad Shokouhi
Increasingly, different types of information are available, sought and relevant
e.g. news, image, wiki, video, audio, blog, map, tweet
Search engines allow access to these through so-called verticals
Two “ways” to search
Users can directly search the verticals
Or rely on so-called aggregated search
Google universal search, 2007: “[ … ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ … ] will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results.”
http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html
Motivation for aggregated search (Arguello et al, 09)
25K editorially classified queries
“Increasingly”, more than one type of information is relevant to an information need
mostly web page + image, map, blog, etc.
These types of information are indexed and ranked using dedicated approaches (verticals)
Presenting the results from verticals in an aggregated way is believed to be more useful
All major search engines are doing some level of aggregated search
Data fusion (e.g. Voorhees et al, 95)
One document collection (GOV2), one query
Different retrieval models: BM25, KL, Inquery
Different document representations: anchor only, title only
Merging: one ranked list of results (merged)
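To make the merging step concrete, here is a minimal sketch of one common fusion rule, CombSUM (Fox and Shaw), with min-max score normalisation. It is not necessarily the method shown on the slide; the run names, document ids and scores are made up for illustration.

```python
from collections import defaultdict


def min_max_normalise(scores):
    """Rescale one run's scores to [0, 1] so that runs are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Merge several {doc_id: score} runs by summing normalised scores."""
    fused = defaultdict(float)
    for run in runs:
        for doc, s in min_max_normalise(run).items():
            fused[doc] += s
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores from three runs over the same collection.
bm25 = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
kl = {"d2": -1.0, "d3": -2.0, "d4": -4.0}
title_only = {"d1": 0.9, "d4": 0.5, "d2": 0.1}

merged = comb_sum([bm25, kl, title_only])
print(merged)
```

Normalising before summing matters here because the three runs score on different scales (BM25 scores, log-likelihoods, and values in [0, 1]); without it the run with the largest raw scores would dominate the merged list.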
SSL with logistic regression (Si and Callan, 05a; Si et al, 08)
Merging overlapped collections
COSCO (Hernandez and Kambhampati, 05):
GHV (Bernstein et al, 06; Shokouhi et al, 07b):
Result merging - Miscellaneous scenarios
Slotted vs tiled result presentation (Sushmita et al, 10)
Images on top, images in the middle, images at the bottom
Images at top-right, images on the left, images at the bottom-right
3 verticals, 3 positions, 3 degrees of vertical intent
Inferring relevance from behavioral data (e.g. click data)
regression error on predicted CTR
infer binary or graded relevance
(Diaz, 09; Konig et al, 09)
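The two uses of click data listed above can be sketched as follows. This is an illustrative sketch only, not the evaluation procedure of the cited papers: the log format (clicks, impressions), the query keys, the predicted values and the 0.1 threshold are all assumptions made for the example.

```python
def observed_ctr(clicks, impressions):
    """Click-through rate observed in the logs (0.0 if never shown)."""
    return clicks / impressions if impressions else 0.0


def ctr_mse(predicted, logs):
    """Mean squared error between predicted and observed CTR."""
    errors = [(predicted[q] - observed_ctr(*logs[q])) ** 2 for q in logs]
    return sum(errors) / len(errors)


def binary_relevance(logs, threshold=0.1):
    """Label the vertical relevant when observed CTR reaches the threshold.

    The threshold is a hypothetical value chosen for this example.
    """
    return {q: observed_ctr(*logs[q]) >= threshold for q in logs}


# Hypothetical logs for one vertical: query -> (clicks, impressions).
logs = {"q1": (30, 100), "q2": (1, 50), "q3": (0, 20)}
# Hypothetical CTR predictions from some vertical selection model.
predicted = {"q1": 0.25, "q2": 0.05, "q3": 0.0}

error = ctr_mse(predicted, logs)
labels = binary_relevance(logs)
print(error, labels)
```

Graded relevance could be obtained in the same way by replacing the single threshold with several CTR bands.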
Test collections (a la TREC) (Zhou & Lalmas, 10)

quantity/media         text          image      video     total
size (G)               2125          41.1       445.5     2611.6
number of documents    86,186,315    670,439    1,253*    86,858,007

* There are on average more than 100 events/shots contained in each video clip (document)

Statistics on topics
number of topics                                        150
average rel docs per topic                              110.3
average rel verticals per topic                         1.75
ratio of “General Web” topics                           29.3%
ratio of topics with two vertical intents               66.7%
ratio of topics with more than two vertical intents     4.0%
Test collections (a la TREC)
[Diagram: existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, ……), each with topics, documents d 1 … d n and judgments (R/N), are mapped to (simulated) verticals: Blog, Reference (Encyclopedia), Image, General Web, Shopping. For a topic t 1, each vertical V 1 … V k holds its own documents and judgments.]
Recap – Evaluation

                   federated search                aggregated search
Editorial data     document relevance judgments    query labels
Behavioral data    none                            critical
J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. ACM SIGIR, 2009.
J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. ACM CIKM, pp 1277--1286, Hong Kong, China, 2009a.
J. Arguello, F. Diaz, and J.-F. Paiement. Vertical selection in the presence of unlabeled verticals. ACM SIGIR, 2010.
J. Aslam and M. Montague. Models for metasearch. ACM SIGIR, pp 276--284, New Orleans, LA, 2001.
M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of distributed collections. SPIRE, pp 316--328, Glasgow, UK, 2006a.
M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of estimated resource description quality for distributed IR. In X. Jia, editor, First International Conference on Scalable Information Systems, p 41, Hong Kong, 2006b.
M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource description quality for distributed information retrieval. ECIR, pp 485--496, Toulouse, France, 2009.
E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user information needs. ACM CIKM, pp 210--216, 1999.
L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. Third International Conference on Parallel and Distributed Information Systems, pp 103--106, Austin, TX, 1994a.
L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text database discovery problem. ACM SIGMOD, pp 126--137, Minneapolis, MN, 1994b.
L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for internet metasearching. ACM SIGMOD, pp 207--218, Tucson, AZ, 1997.
L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the internet. ACM Transactions on Database Systems, 24(2):229--264, 1999.
E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval Conference, pp 243--252, Gaithersburg, MD, 1993.
E. Fox and J. Shaw. Combination of multiple searches. Third Text REtrieval Conference, pp 105--108, Gaithersburg, MD, 1994.
J. French and A. Powell. Metrics for evaluating database selection techniques. World Wide Web, 3(3):153--163, 2000.
C. Hauff. Predicting the effectiveness of queries and retrieval systems. PhD thesis, University of Twente, 2010.
M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval. ACM SIGIR, pp 316--323, Seattle, WA, 2006b.
M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval. Information Processing and Management, 43(1):169--180, 2007d.
M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated search. ACM SIGIR, pp 427--434, Singapore, 2009.
L. Si and J. Callan. Unified utility maximization framework for resource selection. ACM CIKM, pp 32--41, Washington, DC, 2004a.
L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual ranked lists. Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria, 2005a. http://www.cs.purdue.edu/homes/lsi/publications.htm
L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Information Retrieval, 11(1):1--24, 2008.
L. Si and J. Callan. Relevant document distribution estimation method for resource selection. ACM SIGIR, pp 298--305, Toronto, Canada, 2003a.
L. Si and J. Callan. Modeling search engine effectiveness for federated search. ACM SIGIR, pp 83--90, Salvador, Brazil, 2005b.
L. Si and J. Callan. A semisupervised learning method to merge search engine results. ACM Transactions on Information Systems, 21(4):457--491, 2003b.
T. Tsikrika and M. Lalmas. Merging techniques for performing data fusion on the web. ACM CIKM, pp 181--189, Atlanta, GA, 2001.
E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning collection fusion strategies. ACM SIGIR, pp 172--179, 1995.
B. Yuwono and D. Lee. WISE: a world wide web resource database system. IEEE Transactions on Knowledge and Data Engineering, 8(4):548--554, 1996.
B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. Fifth International Conference on Database Systems for Advanced Applications, 6, pp 41--50, Melbourne, Australia, 1997.
J. Xu and J. Callan. Effective retrieval with distributed collections. ACM SIGIR, pp 112--120, Melbourne, Australia, 1998.
A. Zhou and M. Lalmas. Building a test collection for aggregated search. Technical Report, University of Glasgow, 2010.
J. Zobel. Collection selection via lexicon inspection. Australian Document Computing Symposium, pp 74--80, Melbourne, Australia, 1997.