[@IndeedEng] From 1 To 1 Billion: Evolution of Indeed's Document Serving System

Video available: http://youtu.be/jwq_0mPNnN8
As Indeed’s traffic has grown to its current level of over 3 billion job searches per month worldwide, we have evolved our job data storage and serving architecture in order to maintain high levels of reliability and performance, including an average retrieval time per document of 31ms. This talk describes that evolution, from the initial direct-access MySQL-based solution to a dedicated service and custom data store built around a log-structured merge-tree (LSM-Tree) implementation.

Speakers:

Jack Humphrey is director of the engineering teams that build Indeed’s job search and resume products. Since joining Indeed in 2009, he has helped build the service architecture that now handles over 3 billion job searches monthly.

Jeff Plaisance is a software engineer at Indeed focused on data storage infrastructure and analysis tools, including the datastore that serves up billions of jobs daily for Indeed’s search results.


Transcript:

  1. From 1 to 1 Billion: Evolution of a Document Serving System. Jack Humphrey, Engineering Director. http://engineering.indeed.com/blog, Twitter @IndeedEng
  2. search engine for jobs: simple, fast, comprehensive, relevant
  3. *.indeed.com: over 50 countries, 26 languages, 100 million unique visitors, 3 billion searches
  4. 173 ms
  5. 87 ms
  6. fast. US: median server time ~90 ms
  7. fast. US: median perceived time 450 ms
  8. grabperf.org
  9. Where we started
  10. lucene
  11. Document Serving
  12. Document Serving: Title, Company, Location, Snippet, Age
  13. Let lucene do it: stored fields
  14. Let lucene do it. [diagram: job DB → index builder → lucene index → rsync → lucene index → search webapp]
  15. It worked. Job search was fast. Results were fresh.
  16. PROBLEMS: cache performance, data lifetime
  17. Get it from the DB! MySQL
  18. Get it from the DB! [diagram: the search webapp gets search results from its lucene index and job data from the job DB; index builder → lucene index → rsync → lucene index]
  19. It worked. Job search was fast. Results were fresh.
  20. PROBLEMS: write contention, replication delay
  21. Caching to the rescue: memcache
  22. Caching to the rescue [diagram: search webapps → memcached → docservice → job DB]. Only hit DB on cache misses; "tail" job data and pre-load into cache.
  23. It worked. Job search was fast. Results were fresh.
  24. PROBLEMS: still need DB sometimes; priming from DB; need more data from DB; need data around the world
  25. CHALLENGE: Get all the job data in one place, easy to replicate, and fast to access
  26. How about more caching? docstore: serialized file system cache, behind memcache, in front of DB; segment files of 100 jobs
  27. docstore example: job id 793627789 → directory, file, offset → /docstore/7/936/277.tdoc.dflt (see the path sketch after slide 29)
  28. docstore structure [diagram: segment files 000, 001, 002, ..., each holding 100 consecutive job_ids: 0-99, 100-199, 200-299, ...]
  29. docstore [diagram: search webapps → docservice, backed by memcached and the docstore; the DocStore Builder reads the job DBs, builds the docstore, and rsyncs it to the docservice]
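The path scheme on slide 27 looks like it groups jobs into 100-job segments and splits the segment number into directory levels. Here is a minimal Java sketch of that mapping, inferred only from the single example above; the class and method names are made up, not Indeed's code.

    public final class DocstorePath {
        private static final int JOBS_PER_SEGMENT = 100;   // segment files of 100 jobs (slide 26)

        /** Segment file that would hold the given job id. */
        static String segmentPath(long jobId) {
            long segment = jobId / JOBS_PER_SEGMENT;        // 793627789 -> 7936277
            long topDir  = segment / 1_000_000;             // 7
            long midDir  = (segment / 1_000) % 1_000;       // 936
            long file    = segment % 1_000;                 // 277
            return String.format("/docstore/%d/%d/%d.tdoc.dflt", topDir, midDir, file);
        }

        /** Offset of the job inside its 100-job segment. */
        static int offsetInSegment(long jobId) {
            return (int) (jobId % JOBS_PER_SEGMENT);
        }

        public static void main(String[] args) {
            System.out.println(segmentPath(793627789L));      // /docstore/7/936/277.tdoc.dflt
            System.out.println(offsetInSegment(793627789L));  // 89
        }
    }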
  30. It worked. Job search was fast. Results were fresh.
  31. PROBLEMS: random I/O; average 1.5 updates per job; slow replication; error detection & recovery difficult
  32. PROBLEMS: random I/O; updates slow (find segment file & decompress, update, recompress & write file)
  33. PROBLEMS: random I/O, updates slow, high write volume
  34. Time for v2
  35. From 1 to 1 Billion: Docstore v2. Jeff Plaisance, Software Engineer
  36. Docstore V1 Recap [diagram: segment files of 100 consecutive job_ids each, as on slide 28]
  37. Disk I/O Numbers:
        Disk Seek: 10 ms
        Sequential Disk Bandwidth: 30 MB/s
        Average Job Size, LZO Compressed: 1 KB
        Job Throughput, Sequential: 30,000 jobs/sec
        Job Throughput, Random: 100 jobs/sec
        Disk Bandwidth, Random Jobs: 100 KB/sec
  38. Overview of a Job Posting: 1 KB compressed with LZO. Optimal: 30,000 updates/sec. Docstore V1: 37.5 updates/sec. Docstore V1 only achieves 0.125% of optimal throughput!
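To spell out the arithmetic behind slides 37-38: at 30 MB/s of sequential bandwidth and ~1 KB per LZO-compressed job, sequential access moves about 30,000 jobs/sec, while one ~10 ms seek per job caps random access at about 100 jobs/sec, i.e. roughly 100 KB/sec of effective bandwidth. Against the 30,000 jobs/sec ceiling, Docstore V1's 37.5 updates/sec works out to 37.5 / 30,000 = 0.125% of optimal.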
  39. Scaling reads is easy
  40. Scaling writes is hard
  41. Key insight: our reads are random, so we might as well optimize for write performance
  42. So the trick is converting random writes to sequential writes
  43-44. Big picture [diagram: jobs db → aggregation queue → doc store queue → doc store index; the doc service answers doc requests from the doc store; doc stores are replicated via rsync]
  45. Doc Store Queue [diagram: sequentially numbered record files 000000000, 000000001, 000000002, 000000003, 000000004, 000000005, ...]
  46. Record File [diagram: each record has an address and a compressed size] (see the sketch after slide 47)
  47. Big picture [same diagram, now highlighting the doc store index]
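Slides 45-46 only tell us that the queue is made of numbered record files whose entries carry an address and a compressed size. A minimal sketch of reading one record under that assumption; the fixed (offset, length) framing is a guess, not Indeed's actual format, and decompression (the talk mentions LZO) is left to the caller.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public final class RecordReader {
        /** Reads the compressed bytes of one record, given its address and compressed size. */
        static byte[] readCompressedRecord(RandomAccessFile recordFile,
                                           long address,
                                           int compressedSize) throws IOException {
            byte[] buf = new byte[compressedSize];
            recordFile.seek(address);   // jump straight to the record's byte offset
            recordFile.readFully(buf);  // read exactly compressedSize bytes
            return buf;                 // caller decompresses (LZO in the talk)
        }
    }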
  48-53. In-Memory Index [animation: documents arriving in the Doc Store Queue are added, one batch at a time, to an in-memory index]
  54-74. Basic Merge [animation: with a single on-disk index, every batch from the Doc Store Queue is merged into it, rewriting the whole index each time; the "# of writes" annotation climbs 1, 2, 3, 4, 5, ... as the same data is rewritten on every merge]
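The naive strategy those frames illustrate, written out as a sketch (the Index type and its methods are illustrative stand-ins): after k batches of size b, the k-th flush rewrites about k*b documents, so total write volume grows quadratically with the data.

    // One on-disk index, every batch merged into it: each flush rewrites everything.
    interface Index {
        long size();                    // number of documents in this index
        Index mergeWith(Index newer);   // sequentially rewrites both inputs into a new index
    }

    final class BasicMergeWriter {
        private Index diskIndex;        // the one and only on-disk index

        void flushBatch(Index memoryIndex) {
            // the cost of this call grows with everything written so far
            diskIndex = (diskIndex == null) ? memoryIndex : diskIndex.mergeWith(memoryIndex);
        }
    }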
  75-80. Why Merge? [animation: without merging, every batch becomes its own on-disk index with a write count of 1, so writes stay cheap but the number of indexes a read has to check grows without bound]
  81. The Solution: LSM Tree
        Log Structured Merge Tree
        Hierarchical key/value index
        Built through incremental merging of similar-sized subindexes
        Efficient for reads and writes
  82-88. LSM Tree Write [animation: incoming documents fill the in-memory index, which is then flushed to disk as a new on-disk index with a write count of 1]
  89. Compaction. Basic idea: if we only merge all of our on-disk indexes when our total dataset doubles in size, we will have log(n) writes per batch.
  90. Compaction Heuristic:
        newIndex = new DiskIndex(memoryIndex)
        while newIndex.size >= diskIndexes.first.size
            newIndex.add(diskIndexes.removeFirst())
        newIndex.mergeAndWrite()
        diskIndexes.addFirst(newIndex)
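The same heuristic fleshed out as a self-contained Java sketch. Only the control flow comes from slide 90; DiskIndex, MemoryIndex, and mergeAndWrite() are stand-ins for whatever the real docstore uses.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    final class Compactor {

        /** Stand-in for an immutable in-memory index of recent documents. */
        static final class MemoryIndex {
            final long size;
            MemoryIndex(long size) { this.size = size; }
        }

        /** Stand-in for an on-disk subindex; only the bookkeeping is sketched. */
        static final class DiskIndex {
            private long size;
            private final List<DiskIndex> toMerge = new ArrayList<>();

            DiskIndex(MemoryIndex memoryIndex) { this.size = memoryIndex.size; }

            long size() { return size; }

            void add(DiskIndex older) {    // schedule an older index for merging
                toMerge.add(older);
                size += older.size;
            }

            void mergeAndWrite() {         // real code would stream-merge and write sequentially
                toMerge.clear();
            }
        }

        // on-disk indexes, newest first; sizes roughly double toward the tail
        private final Deque<DiskIndex> diskIndexes = new ArrayDeque<>();

        /** Slide 90's heuristic: fold in older indexes while the new one is at least as large. */
        void flush(MemoryIndex memoryIndex) {
            DiskIndex newIndex = new DiskIndex(memoryIndex);
            while (!diskIndexes.isEmpty() && newIndex.size() >= diskIndexes.peekFirst().size()) {
                newIndex.add(diskIndexes.removeFirst());
            }
            newIndex.mergeAndWrite();      // one sequential write of the merged data
            diskIndexes.addFirst(newIndex);
        }
    }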
  91-120. LSM Tree Write [animation of the heuristic: each flushed batch starts as an index of size 1 and is merged with the first on-disk index only when its size is >= that index's size (the MERGE / DONT MERGE decisions); merges cascade (1+1 → 2, 2+2 → 4, ...), so index sizes roughly double down the list and the per-document write counts stay around log(n)]
  121-122. Doc Store Queue [diagram: new jobs append to the queue's record files; doc requests are served from memcached and the on-disk indexes; the second frame adds a bloom filter bit array in front of each on-disk index]
  123-127. Bloom Filters:
        Probabilistic set
        contains(item) => (Maybe | No)
        Very small (~5%)
        Low false positives (~2%)
  128. [diagram: one bloom filter bit array per on-disk index level]
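A minimal sketch of the read path slides 121-128 describe: check each level's bloom filter, newest first, and only touch the on-disk index when the filter says "maybe". Guava's BloomFilter stands in for Indeed's own filters, and Level / lookupOnDisk() are illustrative, not the real docstore API.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.util.List;
    import java.util.Optional;

    final class DocReader {

        /** One on-disk LSM level plus its in-memory bloom filter. */
        static final class Level {
            final BloomFilter<Long> filter =
                    BloomFilter.create(Funnels.longFunnel(), 1_000_000, 0.02); // ~2% false positives

            void add(long jobId) { filter.put(jobId); }        // called when the level is built

            Optional<byte[]> lookupOnDisk(long jobId) {
                // real code would binary-search the level's index and read the record file
                return Optional.empty();
            }
        }

        private final List<Level> levels;   // newest first

        DocReader(List<Level> levels) { this.levels = levels; }

        Optional<byte[]> get(long jobId) {
            for (Level level : levels) {
                // "No" is the only useful answer: skip the level entirely on a definite miss
                if (!level.filter.mightContain(jobId)) {
                    continue;
                }
                Optional<byte[]> doc = level.lookupOnDisk(jobId);  // may be a false positive
                if (doc.isPresent()) {
                    return doc;
                }
            }
            return Optional.empty();
        }
    }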
  129. Only works if bloom filters are in memory
  130. We have to have an upper bound on memory usage
  131-135. Key Observations:
        Bloom filter on disk is slower than index
        No is the only useful answer
        Always safe to assume 1
        Bloom filters for newer levels are more likely to say no
  136. [diagram: one bloom filter per level, labeled with each level's share of all documents: 12.5%, 12.5%, 25%, 50%]
  137-140. Limited Memory:
        Page in most useful pages
        Assume all 1s for others
        Usefulness = requests for page * probability of no for filter (worked through after slide 144)
  141-144. [diagrams: worked example of the usefulness formula; per page, usefulness = probability of accessing that page of the filter * the filter's probability of answering no, e.g. 43.8% * 84% ≈ 36.8% and 18.8% * 65.3% ≈ 12.2%, while the oldest, largest level's filter never says no, so its pages have usefulness 0%]
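Slide 140's ranking in code, using the example figures from slides 141-144; everything beyond the multiplication is illustrative.

    public final class FilterPagePriority {

        /** Usefulness of keeping one bloom filter page in memory. */
        static double usefulness(double accessProbabilityPerPage, double probabilityOfNo) {
            return accessProbabilityPerPage * probabilityOfNo;
        }

        public static void main(String[] args) {
            // per-page access probability * chance that filter answers "no"
            System.out.printf("%.1f%%%n", 100 * usefulness(0.438, 0.840));   // ~36.8%
            System.out.printf("%.1f%%%n", 100 * usefulness(0.188, 0.653));   // ~12.2%
            System.out.printf("%.1f%%%n", 100 * usefulness(0.0625, 0.0));    // 0%: this filter never says no
        }
    }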
  145. Durability: we should be able to power off a docstore server at any time
  146-147. [diagram: the slide-122 architecture with a Log added to the write path, so data that has not yet reached an on-disk index survives a power-off]
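A minimal write-ahead-log sketch for that requirement, assuming records are length-prefixed, appended, and fsynced before the write is acknowledged; the framing is an assumption, not Indeed's actual log format.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public final class WriteAheadLog implements AutoCloseable {
        private final FileChannel channel;

        public WriteAheadLog(Path path) throws IOException {
            this.channel = FileChannel.open(path,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        /** Appends one record (length-prefixed) and forces it to disk before returning. */
        public synchronized void append(byte[] record) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
            buf.putInt(record.length).put(record).flip();
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            channel.force(false);   // flush data (not metadata) to the device
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }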
  148-150. Lies, Damn Lies, and Benchmarks. 10,000,000 operations using 8 byte keys and 96 byte values.
        Random Writes:
            Indeed LSM Tree: 272 seconds
            Google LevelDB: 375 seconds
            Kyoto Cabinet B-Tree: 375 seconds
        Random Reads:
            Indeed LSM Tree: 46 seconds
            Google LevelDB: 80 seconds
            Kyoto Cabinet B-Tree: 183 seconds
  151-152. Lies, Damn Lies, and Benchmarks. Same benchmark using cgroups to limit the page cache to 512 MB.
        Random Writes:
            Indeed LSM Tree: 454 seconds
            Google LevelDB: 464 seconds
            Kyoto Cabinet B-Tree: 50 hours
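The 50-hour figure is presumably the talk's point in miniature: Kyoto Cabinet's B-tree updates pages in place, so once the 512 MB page cache can no longer absorb the working set, most random writes become ~10 ms disk seeks (per the slide-37 numbers), while the LSM tree and LevelDB keep converting random writes into sequential ones.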
  153-155. [charts: block device I/O during the benchmark, Kyoto Cabinet vs. the LSM tree]
  156. Recap:
        All writes are sequential
        Each write is log(n)
        Each read is log(n)*log(n) (log(n) with bloom filters)
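To connect those numbers back to the heuristic: because an on-disk index is merged only when the newer data reaches its size, index sizes roughly double down the list, so n documents end up in about log2(n) indexes. Each document is rewritten once per level it cascades through (~log(n) writes), a read without filters does a log-time lookup in each of the ~log(n) indexes (log(n)*log(n)), and an in-memory bloom filter per index lets a read skip almost every level, leaving roughly one log(n) lookup.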
  157. fast. US: median total server time ~90 ms; US: median doc serving time ~5 ms
  158. Docstore V1 Comparison: Docstore V1 took 12 hours to build on 1 day's worth of data in 2011. The LSM-tree-based Docstore V2 takes 5 minutes to index 1 day's data. Docstore V2 has been in production for 1 year.
  159. It works. Job search is fast. Results are fresh.
  160. Questions? http://engineering.indeed.com/blog, Twitter @IndeedEng. Next @IndeedEng Talk: Wednesday, March 27
