Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Presentation done by Ricardo Baeza-Yates

  1. 1. Challenges in Distributed Information Retrieval. Ricardo Baeza-Yates (1, 2). Joint work with C. Castillo (1), F. Junqueira (1), V. Plachouras (1), and F. Silvestri (3). Affiliations: (1) Yahoo! Research Barcelona, Catalunya, Spain; (2) Yahoo! Research Latin America, Santiago, Chile; (3) ISTI-CNR, Pisa, Italy. Running header on every slide: Challenges in Distributed IR, Ricardo Baeza-Yates; sections: Crawling, Indexing, Query Processing, Caching.
  2. 2. Outline: 1. Crawling; 2. Indexing; 3. Query Processing; 4. Caching
  3. 3. Main Modules and Issues
        Module   | Partition | Dependability | Communication (sync.) | External factors
        Crawling | URL assignment | Re-crawl | URL exchanges | Web growth, content change, network topology, bandwidth, DNS, QoS of servers
        Indexing | Doc. partition, term partition | Re-index | Partial indexing, updating, merging | Web growth, content change, global statistics
        Querying | Query routing, collection selection, load balancing | Replication, caching | Rank aggregation, personalization | Changing user needs, user base growth, DNS
  4. 4. Outline (repeated): 1. Crawling; 2. Indexing; 3. Query Processing; 4. Caching
  6. 6. Crawling
        • In theory it is simple: fetch, parse, fetch, parse, ...
        • In practice it is difficult: it implies using other people’s resources (web servers’ CPU and network)
  10. 10. Issues
        • How to partition the crawling task?
        • What to do when one agent fails?
        • How to communicate among agents?
        • How to deal with external factors?
  12. 12. Partitioning
        • Host-based partitioning exploits locality of links
        • Balance improves if large and small hosts are treated differently
        • Performance improves if geographic location is considered
        • Consistent hashing allows agents to be added to and removed from the pool [Boldi et al., 2004]
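A minimal sketch of how host-based partitioning and consistent hashing can fit together, assuming a hash ring with virtual points per agent. The class name `CrawlerRing`, the agent names, and the choice of SHA-1 are illustrative assumptions, not details taken from the slides or from UbiCrawler.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class CrawlerRing:
    """Consistent-hash ring that assigns hosts to crawler agents (hypothetical helper)."""

    def __init__(self, replicas: int = 64):
        self.replicas = replicas  # virtual points per agent, to smooth the balance
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> agent name

    def add_agent(self, agent: str) -> None:
        for i in range(self.replicas):
            point = _hash(f"{agent}#{i}")
            if point in self._owner:  # extremely unlikely collision: keep the first owner
                continue
            bisect.insort(self._points, point)
            self._owner[point] = agent

    def remove_agent(self, agent: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def agent_for(self, url: str) -> str:
        # Host-based partitioning: every URL of a host goes to the same agent.
        host = urlsplit(url).hostname or url
        i = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[i]]
```

With this layout, removing an agent only reassigns the hosts that agent owned; hosts owned by the surviving agents stay where they were. The sketch assumes at least one agent is present when `agent_for` is called.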
  13. 13. Communication
        • Host-based partitioning reduces communication
        • Highly-linked URLs should be cached
        • Communication with the server can be improved if the server cooperates
  14. 14. External factors
        • DNS can be a bottleneck
        • Varying quality of implementation of HTTP
        • Varying quality of HTML coding
        • Varying quality of service in general
        • Spam
  15. 15. Outline (repeated): 1. Crawling; 2. Indexing; 3. Query Processing; 4. Caching
  19. 19. What’s Indexing
        • Indexing in databases and IR is the process of building an index over a collection of documents
        • Inverted indexes are typically used in IR
        • Lexicon: contains the distinct terms appearing in the collection’s documents
        • Posting lists: contain descriptions of the occurrences of each term within the corresponding documents
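To make the lexicon and posting-list terminology concrete, here is a small sketch that builds a positional inverted index over a toy collection. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Return {term: [(doc_id, [positions, ...]), ...]} for a {doc_id: text} mapping."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        # Naive tokenization (lowercase, split on whitespace); real systems do far more.
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    # Each posting list is sorted by document identifier.
    return {term: sorted(postings.items()) for term, postings in index.items()}


docs = {
    1: "web search engines crawl the web",
    2: "search engines index documents",
}
index = build_inverted_index(docs)
lexicon = sorted(index)  # the distinct terms of the collection
```

Here `lexicon` plays the role of the lexicon from the slide, and each value in `index` is a posting list recording in which documents, and at which positions, a term occurs.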
  20. 20. Index and Distributed Indexing
        • [Figure: the T × D matrix, with terms T1 ... Tn on one axis and documents D1 ... Dm on the other; one panel shows the term partition, the other the document partition]
  28. 28. Document Partitioning
        • Split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)
        • Pros: higher throughput; new documents are easily added to existing indexes; load balanced
        • Cons: high number of disk operations; high volume of data read from disk
  35. 35. Term Partitioning
        • Split the terms of the lexicon (and the corresponding inverted lists) among search systems (corresponding to horizontally slicing the T × D matrix)
        • Pros: reduced number of disk accesses; reduced volume of exchanged data
        • Cons: requires the entire index to be built before slicing it into partitions; not scalable with large collections
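The two ways of slicing the T × D matrix can be sketched on a toy index that maps each term to the list of documents containing it. The round-robin assignment rules, the helper names, and the sample data below are assumptions chosen to keep the example short; they are not the partitioning rules discussed in the talk.

```python
from collections import defaultdict


def document_partition(index, num_parts):
    """Slice by documents: each partition indexes its own sub-collection."""
    parts = [defaultdict(list) for _ in range(num_parts)]
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            parts[doc_id % num_parts][term].append(doc_id)  # assumed rule: doc_id mod n
    return [dict(p) for p in parts]


def term_partition(index, num_parts):
    """Slice by terms: each partition holds whole posting lists for some terms."""
    parts = [{} for _ in range(num_parts)]
    for i, term in enumerate(sorted(index)):
        parts[i % num_parts][term] = list(index[term])  # assumed rule: round-robin over terms
    return parts


def nodes_holding(parts, query_terms):
    """Partitions that store postings for at least one query term."""
    return [i for i, p in enumerate(parts) if any(t in p for t in query_terms)]


index = {"web": [0, 1, 2, 3], "search": [1, 2], "crawl": [0, 3], "cache": [2]}
by_doc = document_partition(index, 2)
by_term = term_partition(index, 2)
```

For the single-term query `["web"]`, both document partitions hold part of the posting list, whereas only one term partition holds it, which is one way to see why term partitioning can reduce the number of nodes touched per query.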
  38. 38. Partitioning Goals
        • Partitioning is the first design issue to be faced in distributed indexing
        • A distributed index should allow for efficient query routing and resolution
        • Reducing the number of nodes queried is desirable too
  44. 44. Partitioning Techniques
        • Random partitioning: documents are assigned uniformly at random (u.a.r.) to the partitions
        • Topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004]): documents are first clustered, and each partition is then composed of one or more clusters
        • Usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006]): clustering is induced by the way users interact with the index
  45. 45. Load Balancing Issues
        • In document-partitioned indexes that do not adopt collection selection strategies, load is almost balanced among all the query processors
        • In term-partitioned indexes (even with the new pipelined scheme [Webber et al., 2006]) load balancing is an issue
        • In federated document-partitioned systems where collection selection is applied, balancing the load is still an unexplored issue
        • [Figure: two bar charts of load percentage (0 to 100) for query processors 1 to 8; panels: Document-distributed and Pipelined]
  50. 50. Exploiting Usage Information
        • Query logs contain features that are critical for optimizing the efficiency of different parts of search engines: query distribution; query arrival time; clickthrough information; ...
  53. 53. Usage Information in Term Partitioned Systems
        • The frequency of query terms can be exploited to partition a collection with the aim of balancing the load of query processors
        • Bin-packing approach [Moffat et al., 2006]
        • Data mining approach [Lucchese et al., 2007]
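As a rough illustration of the bin-packing idea, the sketch below places terms on nodes with a generic greedy heuristic: heaviest term first, always onto the currently least loaded node. This is a textbook heuristic used here as a stand-in, not the specific procedure of Moffat et al.; the per-term load values (for instance, query frequency multiplied by posting list length) and the function name `balance_terms` are assumptions.

```python
import heapq


def balance_terms(term_load, num_nodes):
    """Greedy bin packing: assign each term to the least loaded node so far."""
    heap = [(0.0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    # Heaviest terms first; ties broken alphabetically to keep the result deterministic.
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        node_load, node = heapq.heappop(heap)
        assignment[term] = node
        heapq.heappush(heap, (node_load + load, node))
    loads = [0.0] * num_nodes
    for term, node in assignment.items():
        loads[node] += term_load[term]
    return assignment, loads


# Hypothetical per-term loads estimated from a query log.
term_load = {"the": 8, "web": 7, "search": 6, "cache": 5, "index": 4}
assignment, loads = balance_terms(term_load, num_nodes=2)
```

The resulting `loads` list shows how evenly the estimated query work is spread; a skewed list would indicate that a few popular terms dominate one node.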
  56. 56. Usage Information in Document Partitioned Systems
        • Random partitioning does not ensure load balancing [Badue et al., 2006]
        • Broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
        • Usage-based mapping can be adopted to partition the collection into sub-collections that can be effectively discriminated when a query is received [Puppin et al., 2006]
  58. 58. Challenges in Distributed Indexing
        • In document-partitioned systems, partitioning strategies are needed that enhance the effectiveness of collection selection
        • In both kinds of system, finding effective load balancing strategies is a challenge
  59. 59. Query processing
        • System components: clients submitting queries; sites consisting of servers; servers are commodity computers
        • Query processing: the system receives a query; query routing forwards the query to appropriate sites; results are merged
        • Challenges: determine appropriate sites on the fly; WAN communication is costly
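The result-merging step can be sketched as a k-way merge of per-site ranked lists. The sketch assumes every site returns `(score, doc_id)` pairs already sorted by descending score, and that scores are comparable across sites (which in practice requires shared global statistics); the function name and sample data are illustrative.

```python
import heapq
from itertools import islice


def merge_results(site_results, k):
    """Merge per-site ranked lists into a single top-k list.

    site_results: list of lists of (score, doc_id), each sorted by descending score.
    """
    # heapq.merge expects ascending order, so the key negates the score.
    merged = heapq.merge(*site_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


site_a = [(0.9, "a1"), (0.4, "a2")]
site_b = [(0.8, "b1"), (0.7, "b2"), (0.1, "b3")]
top3 = merge_results([site_a, site_b], k=3)
```

Because the merge is lazy, only as many entries as needed for the top k are consumed from each site's list.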
  60. 60. Challenges in more detail
        • Large-scale systems: large amount of data; large data structures; large number of clients and servers
        • Partitioning of data structures: necessary due to very large data structures; parallel processing; e.g. document collection split by topic, language, region
        • Replication of data structures: for availability, throughput, and response time; conflicts with resource utilization
  61. 61. Framework for Distributed Query Processing
        • [Figure: a client connected over a WAN to Site A (Region X), Site B (Region Y), and Site C (Region Z), with steps numbered 1 to 3]
        • Query processor: matches documents to the received queries
        • Coordinator: receives queries and routes them to appropriate sites
        • Cache: stores results from previous queries
  62. 62. Currently...
        • Multiple sites
        • Sites are full replicas of each other
        • Simple query routing: dynamic DNS
        • According to the previous framework, there is an opportunity for: more efficient use of storage resources; more sophisticated query routing mechanisms; effective partitioning strategies (e.g., language-based strategies)
  63. 63. Partitioning
        • Goals: achieve cost-effective scalability; reduce response times
        • Potential solutions: partition large data structures by topic, language, etc.; route queries effectively, first to local sites, then to global sites; present results incrementally to alleviate network latencies
  64. 64. Dependability
        • Goals: availability of query processors; consistency of replicated query data (can be weak); consistency of user state, e.g. personalization and user preferences
        • Potential solutions: more network resources (multi-homed sites); replication within and across sites; techniques for weak consistency (replicas eventually converge); caching to improve availability when query processors are unavailable
  65. 65. Dependability
        • Achieving availability is not straightforward
        • BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]
        • Partitions are quite frequent
        • [Figure: bar chart of the average number of sites (axis from 0 to 12) against monthly availability (< 100, < 99.8, < 99, < 98, < 97)]
  66. 66. Communication
        • Message latency: communication is costly in wide-area networks; latency is not negligible; server capacity is reduced as the latency to process a query increases
        • Potential solutions: reduce as much as possible the number of sites contacted to process a query; have most queries processed by sites that are close according to network distance
  67. 67. Caching query results or postings [Baeza-Yates et al., 2007]
        • Caching query answers: 44% of queries are singletons (appear only once); 88% of the unique queries are singletons; an infinite cache would achieve a 56% hit ratio
        • Caching postings of terms: 4% of terms are singletons; 73% of the unique terms (the vocabulary) are singletons; an infinite cache would achieve a 96% hit ratio
        • Note: all statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
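The singleton statistics can be reproduced on any query stream with a few counters. The sketch below is illustrative: the function name, the dictionary keys, and the toy stream are assumptions. It reports two bounds: the share of the stream made of repeated queries (my reading of the slide's infinite-cache figure, since 56% = 100% minus 44%), and a stricter value that also counts the first occurrence of every repeated query as a miss.

```python
from collections import Counter


def stream_stats(queries):
    """Singleton statistics and infinite-cache bounds for a query stream."""
    counts = Counter(queries)
    total = len(queries)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "singleton_share_of_stream": singletons / total,
        "singleton_share_of_unique": singletons / len(counts),
        # Upper bound: every occurrence of a repeated query counted as a potential hit.
        "repeated_share_of_stream": 1 - singletons / total,
        # Stricter: the first occurrence of each distinct query is always a miss.
        "hit_ratio_with_compulsory_misses": 1 - len(counts) / total,
    }


stats = stream_stats(["a", "b", "a", "c", "a", "d", "b", "e"])
```

Plugging in the slide's own figures, singletons are 44% of the stream and 88% of the unique queries, so unique queries make up 0.44 / 0.88 = 50% of the stream and the stricter measure would come out at about 50%.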
  68. 68. Static or dynamic caching of postings
        • Static caching of postings (Qtf): cache the terms with the highest query log frequency fq(t)
        • However, there is a trade-off between fq(t) and fd(t): terms with high query log frequency fq(t) are good for the cache; terms with high document frequency fd(t) occupy too much space
        • Static caching of postings as a knapsack problem (QtfDf): cache the posting lists of the terms with the highest ratio fq(t)/fd(t)
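A minimal sketch of the QtfDf selection rule, assuming that a posting list occupies space proportional to its document frequency fd(t) and that the cache capacity is expressed in postings. The function name, the greedy skip-if-it-does-not-fit policy, and the sample frequencies are assumptions for illustration.

```python
def select_static_cache(terms, capacity):
    """Pick posting lists for a static cache by descending fq(t)/fd(t).

    terms: {term: (fq, fd)} with fd > 0; capacity: total number of postings that fit.
    """
    ranked = sorted(terms, key=lambda t: (-terms[t][0] / terms[t][1], t))
    cached, used = [], 0
    for term in ranked:
        fd = terms[term][1]  # assumed size of the posting list
        if used + fd <= capacity:
            cached.append(term)
            used += fd
    return cached


# Hypothetical (fq, fd) pairs: query log frequency and document frequency.
terms = {
    "the": (1000, 900),
    "yahoo": (500, 50),
    "barcelona": (200, 40),
    "zebra": (5, 10),
}
cached = select_static_cache(terms, capacity=100)
```

In this toy example the term with the highest query frequency, "the", is left out because its posting list alone would exceed the cache, which is the trade-off between fq(t) and fd(t) that the slide points at.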
  69. 69. Static or dynamic caching of postings
        • [Figure: content not preserved in the transcript]
  70. 70. Analysis of static caching
        • Trade-offs between caching postings and answers: caching postings results in more hits; caching answers is faster
        • To compare them, time and space parameters need to be considered
        • Problem: given a fixed amount of memory and the average response times for a system, how much memory should be allocated to caching answers and how much to caching postings?
  71. 71. Analysis of static caching
        • Scenario 1: centralized retrieval system, complete/partial query evaluation, un/compressed postings. The postings cache can answer more queries than the answers cache; most of the available memory goes to caching postings
        • Scenario 2: WAN distributed system, complete/partial query evaluation, un/compressed postings. Network time dominates; most of the available memory goes to caching answers
        • Query dynamics: slowly changing query dynamics make static caching viable
  72. 72. References
        • Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006). Analyzing imbalance among homogeneous index servers in a web search system. Information Processing & Management.
        • Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007). The impact of caching on search engines. In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.
        • Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). UbiCrawler: a scalable fully distributed web crawler. Software, Practice and Experience, 34(8):711–726.
  73. 73. References (continued)
        • Junqueira, F. and Marzullo, K. (2005). Coterie availability in sites. In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3–17, Krakow, Poland. Springer Verlag.
        • Larkey, L. S., Connell, M. E., and Callan, J. (2000). Collection selection and results merging with topically organized U.S. patents and TREC data. In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, pages 282–289, New York, NY, USA. ACM Press.
        • Liu, X. and Croft, W. B. (2004). Cluster-based retrieval using language models. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA. ACM Press.
  74. 74. References (continued)
        • Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007). Mining query logs to optimize index partitioning in parallel web search engines. To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).
        • Moffat, A., Webber, W., and Zobel, J. (2006). Load balancing for term-distributed parallel retrieval. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348–355, New York, NY, USA. ACM Press.
        • Puppin, D., Silvestri, F., and Laforenza, D. (2006). Query-driven document partitioning and collection selection. In InfoScale ’06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.
  75. 75. References (continued)
        • Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006). A pipelined architecture for distributed text query evaluation. Information Retrieval. Published online October 5, 2006.