
Comparing Distributed Indexing: To MapReduce or Not?

Presentation on indexing with MapReduce, given at the Large Scale Distributed Systems workshop at SIGIR 2009.


  1. Comparing Distributed Indexing: To MapReduce or Not?
     Richard McCreadie, Craig Macdonald, Iadh Ounis
  2. Talk Outline
     - Motivations
     - Classical Indexing
       - Single-Pass Indexing
       - Shared-Nothing/Corpus Indexing
     - MapReduce
       - What is MapReduce?
       - Indexing Strategies for MapReduce
     - Experimentation and Results
       - Measures and Environment
       - Using Shared-Corpus Indexing as a baseline
       - Comparing MapReduce Indexing Techniques
     - Conclusions
  3. MOTIVATIONS
     - Why is Efficient Indexing Important?
     - Contributions
  4. Why is Efficient Indexing Important?
     - Indexing is an essential part of any IR system
     - But test corpora have grown exponentially
     - This has reinvigorated the need for efficient indexing

     Collection   Data    Year   Docs   Size (GB)
     WT2G         Web     1999   240k   2.0
     GOV          Web     2002   1.8M   18.0
     Blogs06      Blogs   2006   3M     13.0
     GOV2         Web     2004   25M    425.0
     ClueWeb09    Web     2009   1.2B   25,000
  5. Solutions? MapReduce
     - Commercial organisations and research groups alike have embraced scale-out approaches
     - MapReduce has been widely adopted by the commercial search engines
     - No studies into its suitability for indexing
  6. Contributions
     - We examine the benefits to be gained from indexing with MapReduce
     - We discuss 4 indexing techniques in MapReduce:
       - 3 existing techniques
       - 1 novel technique inspired by single-pass indexing
     - We then compare them to a shared-corpus distributed indexing strategy
  7. CLASSICAL INDEXING
     - Classical Indexing
     - Single-Pass Indexing
     - Shared-Nothing & Shared-Corpus Distributed Indexing
  8. Classical Indexing
     - Need to build two important structures:
       - Inverted Index: a posting list for each term: <docid, frequency>
       - Lexicon: term information and a pointer to the correct posting list
     - State-of-the-art indexing uses a single-pass strategy (I.H. Witten, A. Moffat and T.C. Bell, 1999)
     [Diagram: a Lexicon entry holds <term, total docs, total frequency, pointer>, pointing to a Posting List of <document number, frequency> entries]
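     To make the two structures concrete, here is a minimal Java sketch of a lexicon and an
     in-memory inverted index. The class and field names are illustrative assumptions for this
     transcript, not Terrier's actual data structures.

     // Minimal sketch of the two core index structures (illustrative names only).
     import java.util.ArrayList;
     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;

     class Posting {
         final int docId;       // document number
         final int frequency;   // term frequency (tf) of the term in that document
         Posting(int docId, int frequency) { this.docId = docId; this.frequency = frequency; }
     }

     class LexiconEntry {
         int totalDocs;         // number of documents containing the term
         long totalFrequency;   // total occurrences of the term across the collection
         long pointer;          // offset of the term's posting list in the inverted file
     }

     class SimpleIndex {
         // Lexicon: term -> collection statistics plus a pointer to the correct posting list
         final Map<String, LexiconEntry> lexicon = new HashMap<>();
         // Inverted index: one posting list per term (kept in memory here for clarity)
         final Map<String, List<Posting>> invertedIndex = new HashMap<>();
     }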
  9. Single-Pass In-Memory Indexing
     - Parse the collection
     - Build postings in memory
     - Merge inverted indices
     [Diagram: the indexer builds posting lists for terms t1, t2, t3 in RAM; as the % of used RAM grows, compressed files are flushed to disk and later merged into the final inverted index]
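     A rough single-machine sketch of that loop follows; flushRun() and mergeRuns() are
     hypothetical placeholders for writing a compressed run to disk and merging the sorted
     runs, and the memory check is a crude stand-in for Terrier's actual flushing policy.

     // Single-pass indexing sketch: postings accumulate in memory, are flushed to disk as
     // compressed runs when RAM is nearly full, and the runs are merged at the end.
     import java.util.ArrayList;
     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;

     class SinglePassIndexer {
         // term -> list of (docId, tf) pairs built in memory
         private final Map<String, List<int[]>> partialIndex = new HashMap<>();
         private final List<String> runFiles = new ArrayList<>();

         void indexDocument(int docId, Map<String, Integer> termFrequencies) {
             for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
                 partialIndex.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                             .add(new int[] { docId, e.getValue() });
             }
             if (memoryNearlyFull()) {
                 runFiles.add(flushRun(partialIndex));   // write one compressed run to disk
                 partialIndex.clear();
             }
         }

         void finish() {
             if (!partialIndex.isEmpty()) runFiles.add(flushRun(partialIndex));
             mergeRuns(runFiles);                        // merge the sorted runs into the final inverted index
         }

         private boolean memoryNearlyFull() {
             Runtime rt = Runtime.getRuntime();
             return rt.totalMemory() - rt.freeMemory() > 0.8 * rt.maxMemory();
         }

         private String flushRun(Map<String, List<int[]>> run) { /* compress and write */ return "run-" + runFiles.size(); }
         private void mergeRuns(List<String> runs)             { /* k-way merge of runs */ }
     }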
  10. How can Indexing be Distributed?
     - Two classical approaches:
       - Shared-Nothing
         - Indexer on every machine
         - Indexes local data
         - Optimal!
       - Shared-Corpus
         - Indexer on every machine
         - Indexes remote data over NFS
  11. Problems with Classical Approaches
     - Shared-Nothing
       - No fault tolerance
       - Only a single copy of the data
       - Jobs have to be manually administered
       - Data may not be available locally
     - Shared-Corpus
       - No fault tolerance
       - Single point of failure (the server)
       - Constrained by network bandwidth and server speed
       - Jobs have to be manually administered
     - MapReduce reportedly solves these issues
  12. MAPREDUCE INDEXING
     - MapReduce
     - MapReduce Indexing Strategies
  13. MapReduce
     - Programming paradigm by Google
     - Splits jobs into map and reduce operations
       - Map: apply the map function (indexing) over each entry in the input
       - Sort the map output locally
       - Reduce: merge the map output
     - Provides:
       - Convenient programming API
       - Automatic job control
       - Fault tolerance
       - Distributed data storage (DFS)
       - Data replication
  14. Indexing with MapReduce
     - "Map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions."
       (Dean & Ghemawat, MapReduce: Simplified data processing on large clusters. OSDI 2004)
     - This implies that indexing is trivial: emit each word in the document, do a big intermediate sort, then lots of merging
     - Could a more complex approach perform better? We try multiple approaches:

     Approach      Emits          Sorting   Num emits per map   Emit size
     D&G_Token     Tokens         Lots      Lots                Tiny
     D&G_Term      Terms          Lots      Many                Tiny
     Nutch         Documents      Little    Some                Average
     Single-Pass   Posting lists  Some      Few                 Large
  15. D&G_Token & D&G_Term (based on Dean & Ghemawat's MapReduce paper, OSDI 2004)
     - D&G_Token (emits every token)
       - Map: for each token in the document, emit (token, document-ID)
       - Sort: by token and document-ID
       - Reduce: for each unique token (term),
           sum the repeated document-IDs to get term frequencies (tf)
           write the posting list for that term
     - D&G_Term (emits every term)
       - Map: for each term in the document, emit (term, document-ID, tf)
       - Sort: by term and document-ID
       - Reduce: for each term, write the posting list for that term
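     A hedged sketch of D&G_Term against the old Hadoop (0.18-era) org.apache.hadoop.mapred
     API used in the experiments below. The parsing helpers (extractDocId, termFrequencies),
     the "docId:tf" string encoding of postings, and the class names are assumptions for
     illustration, and the secondary sort by document-ID is omitted for brevity.

     // D&G_Term sketch: one emit per unique term per document; the reducer concatenates
     // the (docId, tf) pairs into a posting list for each term.
     import java.io.IOException;
     import java.util.HashMap;
     import java.util.Iterator;
     import java.util.Map;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapred.MapReduceBase;
     import org.apache.hadoop.mapred.Mapper;
     import org.apache.hadoop.mapred.OutputCollector;
     import org.apache.hadoop.mapred.Reducer;
     import org.apache.hadoop.mapred.Reporter;

     public class DGTermIndexer {

         public static class TermMapper extends MapReduceBase
                 implements Mapper<LongWritable, Text, Text, Text> {
             public void map(LongWritable offset, Text document,
                             OutputCollector<Text, Text> output, Reporter reporter)
                     throws IOException {
                 String docId = extractDocId(document);
                 // one emit per unique term in the document: (term, "docId:tf")
                 for (Map.Entry<String, Integer> e : termFrequencies(document).entrySet()) {
                     output.collect(new Text(e.getKey()), new Text(docId + ":" + e.getValue()));
                 }
             }
         }

         public static class PostingListReducer extends MapReduceBase
                 implements Reducer<Text, Text, Text, Text> {
             public void reduce(Text term, Iterator<Text> postings,
                                OutputCollector<Text, Text> output, Reporter reporter)
                     throws IOException {
                 // concatenate the (docId, tf) pairs into this term's posting list
                 StringBuilder postingList = new StringBuilder();
                 while (postings.hasNext()) postingList.append(postings.next()).append(' ');
                 output.collect(term, new Text(postingList.toString().trim()));
             }
         }

         static String extractDocId(Text doc)                  { /* placeholder parser */ return "unknown-doc"; }
         static Map<String, Integer> termFrequencies(Text doc) { /* placeholder parser */ return new HashMap<>(); }
     }

     D&G_Token would instead emit one (token, document-ID) pair per token occurrence, leaving the
     frequency counting to the reducer, which is what produces the much larger intermediate output
     noted in the table above.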
  16. Nutch-Style Indexing
     - Nutch (Lucene) provides MapReduce indexing; we investigated v0.9
     - Strategy (emits every document):
       - Map: for each document, analyse the document and emit (document-ID, analysed-document)
       - Sort: by document-ID (URL)
       - Reduce: for each document-ID, build the Nutch document and index it
     - Approximately equivalent to using a null mapper
     - Emits less than the D&G approaches
     - But we believe we can do better . . .
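     For contrast, a hedged sketch of the Nutch-style shape in the same old mapred API; the
     analyse() and indexDocument() helpers are placeholders rather than Nutch 0.9's actual
     Lucene-backed classes.

     // Nutch-style shape: the map emits whole analysed documents keyed by document-ID,
     // and the reduce side performs the actual indexing.
     import java.io.IOException;
     import java.util.Iterator;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapred.MapReduceBase;
     import org.apache.hadoop.mapred.Mapper;
     import org.apache.hadoop.mapred.OutputCollector;
     import org.apache.hadoop.mapred.Reducer;
     import org.apache.hadoop.mapred.Reporter;

     public class NutchStyleIndexer {

         public static class AnalyseMapper extends MapReduceBase
                 implements Mapper<LongWritable, Text, Text, Text> {
             public void map(LongWritable offset, Text document,
                             OutputCollector<Text, Text> output, Reporter reporter)
                     throws IOException {
                 // one emit per document: (document-ID, analysed document)
                 output.collect(new Text(extractDocId(document)), analyse(document));
             }
         }

         public static class IndexingReducer extends MapReduceBase
                 implements Reducer<Text, Text, Text, Text> {
             public void reduce(Text docId, Iterator<Text> analysedDocs,
                                OutputCollector<Text, Text> output, Reporter reporter)
                     throws IOException {
                 while (analysedDocs.hasNext()) {
                     // build and index the document on the reduce side (Lucene in real Nutch)
                     indexDocument(docId.toString(), analysedDocs.next().toString());
                 }
             }
         }

         static String extractDocId(Text doc)             { /* placeholder */ return "unknown-doc"; }
         static Text analyse(Text doc)                    { /* placeholder analysis */ return doc; }
         static void indexDocument(String id, String doc) { /* placeholder indexing step */ }
     }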
  17. Our Single-Pass Indexing Strategy
     - We propose a novel adaptation of single-pass indexing
     - Idea:
       - Use local machine memory
       - Build useful compressed structures
       - Emit less, therefore less I/O and less sorting
     - Single-Pass Indexing (emits a limited number of posting lists)
       - Map: for each document,
           add the document to a compressed in-memory partial index
           if memory is nearly full, flush:
             for each term in the partial index, emit (term, partial posting list)
  18. Our MapReduce Indexing Strategy (2)
     - Sort: by map, flush, and term
     - Reduce: for each term,
         merge the partial posting lists
         write out the merged posting list
     - Maps only emit compressed posting lists (less I/O)
     - Few emits, so sorting is easy
     - We now evaluate these indexing strategies . . .
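     A hedged sketch of the map side of this strategy, again on the old mapred API. The plain
     string postings stand in for the compressed structures described above, the parsing helpers
     are placeholders rather than Terrier's MapReduce indexer, and the tagging of emits with map
     and flush numbers (needed for the "sort by map, flush, and term" step) is omitted. The
     reduce side can reuse a posting-list merging reducer like the one sketched earlier.

     // Single-pass MapReduce mapper sketch: build a partial index in memory and emit
     // (term, partial posting list) pairs only when memory fills or the map task ends.
     import java.io.IOException;
     import java.util.ArrayList;
     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapred.MapReduceBase;
     import org.apache.hadoop.mapred.Mapper;
     import org.apache.hadoop.mapred.OutputCollector;
     import org.apache.hadoop.mapred.Reporter;

     public class SinglePassMapper extends MapReduceBase
             implements Mapper<LongWritable, Text, Text, Text> {

         // in-memory partial index: term -> list of "docId:tf" postings
         private final Map<String, List<String>> partialIndex = new HashMap<>();
         private OutputCollector<Text, Text> collector;   // kept so close() can flush the remainder

         public void map(LongWritable offset, Text document,
                         OutputCollector<Text, Text> output, Reporter reporter)
                 throws IOException {
             this.collector = output;
             String docId = extractDocId(document);
             for (Map.Entry<String, Integer> e : termFrequencies(document).entrySet()) {
                 partialIndex.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                             .add(docId + ":" + e.getValue());
             }
             if (memoryNearlyFull()) flush();
         }

         @Override
         public void close() throws IOException {
             if (collector != null) flush();   // emit whatever is left at the end of the map task
         }

         // one emit per term per flush: (term, partial posting list)
         private void flush() throws IOException {
             for (Map.Entry<String, List<String>> e : partialIndex.entrySet()) {
                 collector.collect(new Text(e.getKey()), new Text(String.join(" ", e.getValue())));
             }
             partialIndex.clear();
         }

         private boolean memoryNearlyFull() {
             Runtime rt = Runtime.getRuntime();
             return rt.totalMemory() - rt.freeMemory() > 0.8 * rt.maxMemory();
         }

         private String extractDocId(Text doc)                  { /* placeholder parser */ return "unknown-doc"; }
         private Map<String, Integer> termFrequencies(Text doc) { /* placeholder parser */ return new HashMap<>(); }
     }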
  19. EXPERIMENTATION & RESULTS
     - Measures and Setup
     - Indexing Throughput and Scaling
  20. Research Questions
     - Recall, we want to evaluate MapReduce for indexing large-scale collections
     - Questions:
       - Is Shared-Corpus indexing sufficient for large-scale collections? (baseline)
       - Can MapReduce perform close to Shared-Nothing indexing? (optimal)
  21. Evaluation of MapReduce Indexing
     - Two measures are used:
       - Speed is measured by throughput:
         throughput = (compressed) collection size / total time taken to index
       - Scaling is measured by speedup:
         speedup(m) = (time taken to index on 1 machine) / (time taken to index using m machines)
     - Optimal speedup would be where m machines are m times faster, known as linear speedup
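     Written as formulas (the last equality holds because the collection size is fixed, so
     speedup can be read directly off the throughput numbers in Table 1 below):

     \[
       \text{throughput}(m) = \frac{\text{(compressed) collection size}}{\text{time to index on } m \text{ machines}},
       \qquad
       \text{speedup}(m) = \frac{\text{time to index on 1 machine}}{\text{time to index on } m \text{ machines}}
                         = \frac{\text{throughput}(m)}{\text{throughput}(1)} .
     \]

     Linear speedup corresponds to \(\text{speedup}(m) = m\).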
  22. Experimental Setup
     - All indexing strategies are implemented in Hadoop
       - The only fully open-source MapReduce implementation
       - An Apache Software Foundation project
       - Reportedly used by Yahoo! for indexing
     - Cluster setup:
       - 4 (3) cores per machine, 2.4GHz, 4GB RAM, ~1TB of hdd
       - Single gigabit Ethernet rack
       - Hadoop v0.18.2 (HDFS)
       - Hadoop on Demand (HOD)
       - Torque Resource Manager (v2.1.9)
       - Also a RAID5 file server with 8 3GHz cores
     - We use the TREC .GOV2 corpus
       - 25 million documents, 425GB uncompressed
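     As a hedged sketch of how one of these strategies is wired into a Hadoop 0.18-era job:
     it uses the illustrative mapper/reducer classes from the earlier sketches, and the job
     name, paths and reducer count are placeholders, not the actual experimental configuration.

     // Job wiring on the old JobConf API; Hadoop supplies the scheduling, the intermediate
     // sort and the fault tolerance listed on the MapReduce slide.
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapred.FileInputFormat;
     import org.apache.hadoop.mapred.FileOutputFormat;
     import org.apache.hadoop.mapred.JobClient;
     import org.apache.hadoop.mapred.JobConf;

     public class IndexingJob {
         public static void main(String[] args) throws Exception {
             JobConf conf = new JobConf(IndexingJob.class);
             conf.setJobName("mapreduce-indexing");

             conf.setMapperClass(SinglePassMapper.class);                   // map: build in-memory partial indices
             conf.setReducerClass(DGTermIndexer.PostingListReducer.class);  // reduce: merge posting lists per term
             conf.setOutputKeyClass(Text.class);
             conf.setOutputValueClass(Text.class);

             FileInputFormat.setInputPaths(conf, new Path(args[0]));   // collection stored on HDFS
             FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // where the inverted index is written
             conf.setNumReduceTasks(8);                                // placeholder: e.g. one reducer per machine

             JobClient.runJob(conf);
         }
     }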
  23. Target Indexing Throughput
     - Terrier single-pass indexing has a throughput of 1MB/sec on a single core
     - We project Shared-Nothing as the optimal (linear speedup)

     Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
     Indexing Strategy             m=1    m=2    m=4    m=6    m=8
     Shared-Nothing Distributed    3      6      12     18     24
  24. Baseline Indexing Throughput
     - Use Shared-Corpus indexing as a baseline
     - No improvement after 4 machines
     - The file server acts as a bottleneck
     - Can't scale file servers indefinitely

     Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
     Indexing Strategy             m=1    m=2    m=4    m=6    m=8
     Shared-Nothing Distributed    3      6      12     18     24
     Shared-Corpus Distributed     2.44   4.6    12.8   12.4   12.8
  25. D&G_Token Indexing Throughput
     - No results: the runs failed because there was too much map output and the disks ran out of space

     Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
     Indexing Strategy             m=1    m=2    m=4    m=6    m=8
     Shared-Nothing Distributed    3      6      12     18     24
     Shared-Corpus Distributed     2.44   4.6    12.8   12.4   12.8
     MapReduce D&G_Token           -      -      -      -      -
  26. D&G_Term Indexing Throughput
     - Indexing was possible, but scaling was strongly sub-linear
     - Still too much emitting
     - Half the speed of the Shared-Corpus approach

     Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
     Indexing Strategy             m=1    m=2    m=4    m=6    m=8
     Shared-Nothing Distributed    3      6      12     18     24
     Shared-Corpus Distributed     2.44   4.6    12.8   12.4   12.8
     MapReduce D&G_Token           -      -      -      -      -
     MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
  27. Single-Pass Indexing Throughput
     - Faster than both the D&G_Term and Shared-Corpus approaches
     - Still scales sub-linearly
     - Data locality is the main concern

     Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
     Indexing Strategy             m=1    m=2    m=4    m=6    m=8
     Shared-Nothing Distributed    3      6      12     18     24
     Shared-Corpus Distributed     2.44   4.6    12.8   12.4   12.8
     MapReduce D&G_Token           -      -      -      -      -
     MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
     MapReduce Single-Pass         2.59   5.19   9.45   13.16  17.31
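     As a worked example from Table 1, the speedup of the MapReduce single-pass strategy on
     8 machines follows directly from the throughput figures, since the collection size cancels:

     \[
       \text{speedup}(8) = \frac{T_1}{T_8}
                         = \frac{\text{size}/2.59\ \text{MB/s}}{\text{size}/17.31\ \text{MB/s}}
                         = \frac{17.31}{2.59}
                         \approx 6.7\times
       \quad \text{versus the linear ideal of } 8\times .
     \]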
  28. CONCLUSION
  29. Conclusions
     - Shared-Corpus indexing is limited by the speed of the central file store
     - Performing efficient indexing in MapReduce is not trivial
       - Indeed, all indexing strategies implemented here scale sub-linearly
     - Single-pass indexing is only marginally sub-linear
       - A small price to pay for the advantages?
     - Using single-pass indexing, MapReduce is suitable for indexing the current generation of web corpora
       - We indexed the 'B' set of ClueWeb09 in just under 20 hours using 5 machines (15 cores)
       - It has been reported on the TREC Web mailing list that indexing with Indri on Amazon EC2 (4 cores) and S3 takes 31 hours with a Shared-Nothing approach
       - . . . plus 10 days of upload time!
  30. Questions?
     - http://terrier.org
