[@IndeedEng] How to Get a Job 35 Million Times a Day Using RabbitMQ

@IndeedEng March: Wednesday, March 27th
Video available: http://www.youtube.com/watch?v=MeRHetCMiHg

The goal of Indeed's aggregation engine is to find and retrieve every job in the world, as quickly and accurately as possible. As we described in our previous tech talk, we strive to build products that are simple, fast, comprehensive, and relevant. The world's most comprehensive job search site is fueled by the more than 35 million job postings we process every day, which we deliver to jobseekers within minutes of discovery.

Our original aggregation architecture was implemented using standard patterns. Our growth demanded levels of scalability, performance, and resilience that this architecture simply could not deliver. In a case study of scaling for the web, we will discuss how we tackled this problem. We will cover the issues we saw with our original architecture, how we analyzed our options to guide a solution, how we used RabbitMQ as a key component in the new architecture, and the benchmarks we used to evaluate how successful we were.

Speaker Ketan Gangatirkar is the development manager responsible for Indeed's continuous deployment infrastructure as well as its aggregation system.

Speaker Cameron Davison is a software engineer on the aggregation team at Indeed and a graduate of UT Austin. He re-architected Indeed's aggregation pipeline using RabbitMQ to sustain high write volumes, and continues to improve products in the aggregation system to make it run more efficiently.

Published in: Technology, Business

  1. 1. How to Get a Job 35 Million Times a Day Using RabbitMQ. Ketan Gangatirkar and Cameron Davison
  2. 2. One search. All jobs.
  3. 3. Aggregation gets jobs
  4. 4. Aggregation gets jobs so jobseekers get jobs
  5. 5. Aggregation != Spidering. Spiders see pages. Aggregation sees jobs.
  6. 6. How spiders see job sites (diagram: many nodes, every one labeled Page)
  7. 7. How Indeed sees job sites (diagram: Start, Navigation, Job List, and Job nodes)
  8. 8. Aggregation != Spidering. Job sites have structure. Job pages have semantics. Navigation is more than following links.
  9. 9. Remember this
  10. 10. Agg every job
  11. 11. What's in a job: {Url: http://www.applytracking.com/track.aspx/3VYzR; Title: Senior Erlang Engineer; Company: Machine Zone; Location: Palo Alto,CA,US, 94301; Source Type: Employer; Job Type: Full-time; ...; Description: The Senior Erlang Engineer is an integral ......; Createdate: 2013-02-05 23:18:05; ...}
  12. 12. location, description, Company, Title
  13. 13. Title, salary, location, job type, description, Company
  14. 14. How we build products: simple, fast, comprehensive, relevant
  15. 15. Simple: Tough problems, simple solutions
  16. 16. Fast: Discover the jobs quickly. Get them to jobseekers in minutes.
  17. 17. 10% of jobseekers sort by date
  18. 18. Do you want only new jobs?
  19. 19. 20% of jobseekers want only new jobs
  20. 20. Daily new job emails
  21. 21. Speed matters
  22. 22. Comprehensive: Get every job
  23. 23. Relevant: Semantic extraction. The job is still available. Ignore non-jobs.
  24. 24. This is a hard problem: Flaky sites. Site redesigns. Javascript. Missing or bad information.
  25. 25. Big N makes it even harder: Examine 38M jobs every day
  26. 26. Do this in minutes (diagram: Search, 100M Jobseekers, Aggregation, Employers, Job Boards, Staffing firms, Recruiters)
  27. 27. Strawman* architecture (diagram: Datacenter A, Datacenter B, Engine x6, Job site x6, MySQL, Primary Datacenter)
  28. 28. Limitations
  29. 29. N connections (diagram: Datacenter A, Datacenter B, Engine x6, Job site x6, MySQL, Primary Datacenter)
  30. 30. N concurrent writers (diagram: Datacenter A, Datacenter B, Engine x6, Job site x6, MySQL, Primary Datacenter)
  31. 31. High latency (diagram: Datacenter A, Datacenter B, Engine x6, Job site x6, MySQL, Primary Datacenter)
  32. 32. Limitation: failure points (diagram: Datacenter A, Datacenter B, Engine x6, Job site x6, MySQL, Primary Datacenter; two points marked X)
  33. 33. Scaling Patterns: What has worked for us so far?
  34. 34. Service-Oriented Architecture (diagram: Engine x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter). See http://go.indeed.com/boxcar
  35. 35. Standard Service Interaction: Client, Service, Database
  36. 36. Our Interaction: Client, Service, Database
  37. 37. Does this do what we need? ● Lots of workers... ● Sending lots of results... ● Over a long distance... ● That need to get processed fast... ● Reliably?
  38. 38. Engine Failure (diagram: Engine x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  39. 39. Engine failure fix: Buffer to disk (diagram: Engine x3, disk x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  40. 40. Network Failure (diagram: Engine x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  41. 41. Network failure fix: Disks solve that too (diagram: Engine x3, disk x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  42. 42. Write Service Failure (diagram: Engine x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  43. 43. Write Service Failure fix: Disks solve that too (diagram: Engine x3, disk x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  44. 44. Write Service Failure fix: Redundancy (diagram: Engine x3, Job Write Service x3, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  45. 45. Database Failure (diagram: Engine x3, Job Write Service, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  46. 46. Database Failure fix: Buffer to disk (diagram: Engine x3, Job Write Service, disk, MySQL, Remote Datacenter, Primary Datacenter; failure marked X)
  47. 47. Our new architecture (diagram: Engine x3, Job Write Service x3, disk x6, MySQL, Remote Datacenter, Primary Datacenter)
  48. 48. We could build this... (diagram: Engine x3, Job Write Service x3, disk x6, MySQL, Remote Datacenter, Primary Datacenter)
  49. 49. ... maybe someone already has (diagram: Engine x3, Job Write Service x3, disk x6, MySQL, Remote Datacenter, Primary Datacenter)
  50. 50. We should use a message queue
  51. 51. Cameron Davison
  52. 52. Aggregation Requirements ● Durable ● Multi-Data Center (latency) ● 38 million jobs a day ● 2KB average job size ○ 76 GB a day ● Target peaks of 1000 jobs / second ● Programming language agnostic
  53. 53. Selection
  54. 54. What we found: High Availability, Open Source/Free, Self-hosted, Performant
  55. 55. Out-of-the-box Experience
  56. 56. Advanced Message Queuing Protocol (AMQP) ● Open Standard ● Wire protocol ● Existing Clients in Multiple Languages
  57. 57. Concepts ● Confirmation and Ack ● At least once ● Asynchronous Confirms ● Persistent ● Clustering
  58. 58. Confirmation and Ack (diagram: Producer, MQ, Consumer; arrows labeled msg, confirm, ack, msg, numbered 1 to 4)
  59. 59. At least once (diagram: MQ, Consumer, Message, Ack). At most once (diagram: MQ, Consumer, Message, Auto Ack)
  60. 60. Asynchronous Confirms (diagram: Producer; messages 1 to 16; confirm #6)
  61. 61. Persistent (diagram: Producer, MQ, Consumer)
  62. 62. Persistent (diagram: Producer, MQ, Consumer)
  63. 63. Persistent (diagram: Producer, MQ, Consumer; failure marked X)
  64. 64. Persistent (diagram: Producer, MQ, Consumer)
  65. 65. Persistent (diagram: Producer, MQ, Consumer)
  66. 66. Clustering (diagram: Producer, Master, Slave; steps numbered 1 to 4)
  67. 67. Testing
  68. 68. Test RabbitMQ ● Send millions of 2KB messages ● 20 producers and 20 consumers ● 1000 messages / second ● Simulate multiple failures
  69. 69. Test Consistency (diagram: Producers, RabbitMQ x2, Consumers; Slave, Master)
  70. 70. Test Consistency (diagram: Producers, RabbitMQ x2, Consumers; Master, Slave)
  71. 71. Test Consistency (diagram: Producers, RabbitMQ x2, Consumers; Master, Slave)
  72. 72. Test Consistency (diagram: Producers, RabbitMQ x2, Consumers; X, Master)
  73. 73. Test Consistency (diagram: Producers, RabbitMQ x2, Consumers; X, Master)
  74. 74. Test Consistency (diagram: Producers, RabbitMQ x2, Consumers; Master, Slave)
  75. 75. RabbitMQ Clustering (diagram: Master, Slave)
  76. 76. RabbitMQ Clustering (diagram: Master, Slave)
  77. 77. RabbitMQ Clustering (diagram: Master, X)
  78. 78. RabbitMQ Clustering (diagram: Master, X)
  79. 79. RabbitMQ Clustering (diagram: Master, Slave)
  80. 80. RabbitMQ Clustering (diagram: Master, Slave)
  81. 81. RabbitMQ Clustering (diagram: Master, Slave)
  82. 82. RabbitMQ Clustering (diagram: Master, Slave)
  83. 83. RabbitMQ Clustering (diagram: Master, Slave)
  84. 84. RabbitMQ Clustering (diagram: Master, Slave)
  85. 85. RabbitMQ Clustering (diagram: Master, X)
  86. 86. RabbitMQ Clustering (diagram: Master, Slave)
  87. 87. RabbitMQ Clustering (diagram: Master, Slave)
  88. 88. RabbitMQ Clustering (diagram: Master, Slave)
  89. 89. RabbitMQ Clustering (diagram: Master, X)
  90. 90. RabbitMQ Clustering (diagram: Master, X, X)
  91. 91. RabbitMQ Clustering (diagram: Master, X)
  92. 92. RabbitMQ Clustering (diagram: Master, X)
  93. 93. RabbitMQ Clustering (diagram: Master, Slave)
  94. 94. Non-persistent: 15990 Messages / Second, 30 MB/s
  95. 95. Persistent: 2781 Messages / Second, 5.5 MB/s
  96. 96. Clustered and Persistent: 1262 Messages / Second, 2.5 MB/s
  97. 97. Applying RabbitMQ
  98. 98. Unreliable High Latency Connections (diagram: Engine x3, Job Write Service, MySQL, Remote DC, Primary DC)
  99. 99. Replaced with RabbitMQ (diagram: Engine x3, Job Write Service, RabbitMQ, MySQL, Remote DC, Primary DC)
  100. 100. Replaced with RabbitMQ (diagram: Engine x3, Job Write Service, RabbitMQ, Remote DC, Primary DC)
  101. 101. Replaced with RabbitMQ (diagram: Engine x3, Job Write Service, RabbitMQ, Remote DC, Primary DC)
  102. 102. Replaced with RabbitMQ (diagram: Engine x3, Job Write Service, RabbitMQ, Remote DC, Primary DC)
  103. 103. Replaced with RabbitMQ (diagram: Engine x3, Job Write Service, RabbitMQ, Remote DC, Primary DC)
  104. 104. Rabbit can talk to Rabbit: Shovel Plugin (diagram: Producer, RabbitMQ 1, RabbitMQ 2, Consumer)
  105. 105. Replaced with RabbitMQ (diagram: Engine x3, RabbitMQ x4, Job Write Service, Remote DC, Primary DC)
  106. 106. Replaced with RabbitMQ (diagram: Engine x3, RabbitMQ x5, Job Write Service, Remote DC, Primary DC)
  107. 107. Parallelize Job Write Service (diagram: RabbitMQ, Job Write Service x3, Job A, Job B, Job C)
  108. 108. Replaced with RabbitMQ (diagram: Engine x3, RabbitMQ x5, Job Write Service x2, Remote DC, Primary DC)
  109. 109. Replaced with RabbitMQ (diagram: Engine x3, RabbitMQ x5, Job Write Service x2, Primary DC)
  110. 110. Message Flow
  111. 111. Message Flow (diagram: Engine x3, RabbitMQ x5, Job Write Service x2, Primary DC)
  112. 112. Message Flow (diagram: Engine x3, RabbitMQ x5, Job Write Service x2, Primary DC)
  113. 113. Message Flow (diagram: Engine x3, RabbitMQ x5, Job Write Service x2, Primary DC)
  114. 114. Message Flow (diagram: Engine x3, RabbitMQ x5, Job Write Service x2, Primary DC)
  115. 115. Jobs/minute
  116. 116. Jobs/minute from one site: 220,000 jobs, 6 hours, 611 jobs / minute
  117. 117. Jobs/minute from one site: 251,000 jobs, 20 minutes, 12550 jobs / minute
  118. 118. RabbitMQ Horizontal Scale (diagram: Engines, RabbitMQ nodes, Job Write Services)
  119. 119. Horizontal Scale
  120. 120. Horizontal Scale
  121. 121. Today 1000 messages / second
  122. 122. RabbitMQ 3: 2486 Messages / Second, 5 MB/s
  123. 123. RabbitMQ Configuration ● Confirmations - Fire and Forget ● Persistent Messages - Durable ● Shoveling - Multi-Data Center ● Mirrored Queues in Cluster - High Reliability
  124. 124. Can we do more with RabbitMQ?
  125. 125. Aggregation Viewer: Real-time browser-based view of job stream
  126. 126. Aggregation Viewer Architecture ● Almost real-time ● Exclusive queue ● Transient messages (diagram: Agg Jobs Rabbit MQ Cluster, Shovel*, Agg Viewer Rabbit MQ, Subscribe, Agg Viewer, Jobs, HTTP, Browser)
  127. 127. Resume Contacts Billing. Pay-per-contact: limited budget
  128. 128. Resume Contacts Billing: Original Path (diagram: Asia DC, Pacific, US DC, Resume Search, Log repo, MySQL). See http://go.indeed.com/logrepo
  129. 129. Resume Contacts Billing: Fast Path (diagram: Asia DC, Pacific, US DC, Resume Search, RabbitMQ x2, Log repo, MySQL; one path marked X)
  130. 130. Company Page Edits: User-contributed content about companies
  131. 131. Company Page
  132. 132. Company Page Edits: Implementation. Writing data AND reading it back
  133. 133. Company Page Edits: Single Datacenter (diagram: Browser, Web Server, MySQL)
  134. 134. Company Page Serving (diagram: Browser, Web Server, Memcached, LSM Tree, Asia Datacenter). See http://go.indeed.com/lsmtree
  135. 135. Company Page Edits (diagram: Browser, Web Server, Memcached, RabbitMQ x2, MySQL; Asia Datacenter, Pacific, Primary US Datacenter, Atlantic, EU Datacenter, [Et cetera])
  136. 136. Company Page Reads (diagram: MySQL, LSM Tree Builder, LSM Tree x2; Asia Datacenter, Pacific, Primary US Datacenter, Atlantic, EU Datacenter, [Et cetera])
  137. 137. Company Pages System (diagram: Browser, Web Server, Memcached, RabbitMQ x2, MySQL, LSM Tree Builder, LSM Tree x2; Asia Datacenter, Pacific, Primary US Datacenter, Atlantic, EU Datacenter, [Et cetera])
  138. 138. Other applications
  139. 139. Company Pages
  140. 140. Recap: The jobs must flow ● Durability ● High throughput ● Low latency ● Partition-tolerance ● Efficient use of the database ● Minimal points of failure
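
The "Confirmation and Ack" and "Asynchronous Confirms" slides (58 and 60) survive in this transcript only as diagram labels. As a rough illustration of the producer-side bookkeeping that asynchronous confirms imply, here is a minimal in-memory sketch. It is not RabbitMQ client code and not necessarily how Indeed implemented it: `ConfirmTracker` and its method names are hypothetical, and reading "confirm #6" as a cumulative confirm (in the spirit of the `multiple` flag on AMQP's `basic.ack`) is an assumption about what the slide shows.

```python
class ConfirmTracker:
    """Producer-side record of messages published but not yet confirmed (hypothetical helper)."""

    def __init__(self):
        self.next_seq = 1       # sequence number assigned to the next publish
        self.unconfirmed = {}   # seq -> message body still awaiting a confirm

    def publish(self, body):
        """Record a message as in flight and return its sequence number."""
        seq = self.next_seq
        self.unconfirmed[seq] = body
        self.next_seq += 1
        return seq

    def on_confirm(self, seq, multiple=False):
        """Handle a broker confirm.

        With multiple=True the confirm covers every outstanding message whose
        sequence number is <= seq; otherwise it covers only seq itself.
        Returns the sequence numbers that were settled by this confirm.
        """
        if multiple:
            settled = [s for s in self.unconfirmed if s <= seq]
        else:
            settled = [seq] if seq in self.unconfirmed else []
        for s in settled:
            del self.unconfirmed[s]
        return settled

    def to_resend(self):
        """Bodies to publish again if the connection drops before their confirm arrives."""
        return [self.unconfirmed[s] for s in sorted(self.unconfirmed)]


# Mirror the slide: 16 messages in flight, then a confirm for #6.
tracker = ConfirmTracker()
for n in range(1, 17):
    tracker.publish(f"job-{n}")

confirmed = tracker.on_confirm(6, multiple=True)
pending = tracker.to_resend()
print("confirmed:", confirmed)
print("still awaiting confirm:", len(pending))
```

The point of the pattern is that the producer never blocks on a single message: it keeps publishing, and only the messages still in `unconfirmed` need to be resent after a failure, which is what gives at-least-once delivery on the publishing side.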

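The figures quoted on slides 52, 94 to 96, and 116 to 117 can be cross-checked with a little arithmetic. The sketch below only reuses numbers from the slides; the variable names are illustrative, and the use of decimal units (1 GB = 1,000,000 KB, 1 MB = 1,000 KB) is an assumption, since the slides do not say which convention they use.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slides.
JOBS_PER_DAY = 38_000_000
AVG_JOB_KB = 2
PEAK_TARGET_PER_SEC = 1000
BENCHMARK_MSGS_PER_SEC = {
    "non-persistent": 15990,
    "persistent": 2781,
    "clustered and persistent": 1262,
}

# Daily volume and average rate implied by the requirements slide.
daily_gb = JOBS_PER_DAY * AVG_JOB_KB / 1_000_000
avg_jobs_per_sec = JOBS_PER_DAY / SECONDS_PER_DAY

# Bandwidth each benchmark implies at 2 KB per message,
# and how far each sits above the 1000 jobs/second peak target.
implied_mb_per_sec = {
    name: rate * AVG_JOB_KB / 1000 for name, rate in BENCHMARK_MSGS_PER_SEC.items()
}
headroom = {
    name: rate / PEAK_TARGET_PER_SEC for name, rate in BENCHMARK_MSGS_PER_SEC.items()
}

# Jobs per minute from one site, as quoted on slides 116 and 117.
slide_116_per_min = 220_000 / (6 * 60)   # 220,000 jobs in 6 hours
slide_117_per_min = 251_000 / 20         # 251,000 jobs in 20 minutes
ratio = slide_117_per_min / slide_116_per_min

print(f"daily volume: {daily_gb:.0f} GB, average rate: {avg_jobs_per_sec:.0f} jobs/s")
for name, mb in implied_mb_per_sec.items():
    print(f"{name}: {mb:.2f} MB/s implied, {headroom[name]:.2f}x the peak target")
print(f"jobs/minute: {slide_116_per_min:.0f} vs {slide_117_per_min:.0f} ({ratio:.1f}x)")
```

Under these assumptions the average rate is well under half of the 1000 jobs/second peak target, and even the slowest benchmarked configuration (clustered and persistent) stays above that target.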