[@IndeedEng] Redundant Array of Inexpensive Datacenters

Video available: http://youtu.be/hOsA5UpPUSU

Learn how Indeed built one of the fastest and most reliable websites in the world. Indeed Operations ensures indeed.com is always available and always fast for the jobseeker. Operations leaders Charles Valentine and Chris Graf will share how we configure and provision multiple datacenters around the world to provide a massively scalable platform for connecting job seekers with jobs. Charles and Chris will detail a simple and inexpensive method to build a platform that provides DNS-based global load balancing and failover, provider portability, and disposable datacenters.

Speakers:

Charles Valentine (VP of Technology Services at Indeed) leads the Operations, IT, and Security teams. Prior to joining Indeed in 2011, Charles served as VP Technology Services at The Knot.

Chris Graf has managed operations at Indeed since 2011. In that time, Indeed's traffic has grown by more than 300%. Prior to Indeed, Chris managed Web operations in the online gaming industry.

Published in: Technology, Business

  1. 1. Redundant Array of Inexpensive Datacenters Charles Valentine and Chris Graf June 2013
  2. 2. Overview Charles Valentine VP, Technology Services
  3. 3. I help people get jobs.
  4. 4. Indeed ● 100 million unique visitors per month ● Over 50 countries and 26 languages ● 3 Billion job searches per month
  5. 5. Indeed Ops ● Assist development in designing new products ● Engineer scalable systems to support applications ● Monitor applications ● Fix systems when they break
  6. 6. Indeed Lingo Datacenter = Point of Presence
  7. 7. Each Presence is Full Stack ● Applications ● Services ● Read/Write Data systems ● Communications ● Monitoring We need serious processing power in each datacenter!
  8. 8. Applications per Datacenter ● Over 40 Java-based web applications ● Over 90 Java-based services
  9. 9. Data Systems ● MySQL databases ● Mongo databases ● Memcached instances ● LSM Trees ● Search indexes ● Numerous other data stores
  10. 10. Goals ● Fast ● Reliable ● Inexpensive
  11. 11. Triple Constraint Fast Reliable Inexpensive
  12. 12. Traditional Method Fast Reliable Inexpensive
  13. 13. Indeed Method Fast Reliable Inexpensive
  14. 14. Fast Speed is a product feature ● Server Time ● Client Time
  15. 15. Monthly Job Searches
  16. 16. 1 ms, 3 Billion Times/Month 1 ms = 34 job seeker days per month
  17. 17. 20 ms, 3 Billion Times/Month 20 ms = 22 jobseeker months
  18. 18. 100 ms, 3 Billion Times/Month 100 ms = 9.5 jobseeker years
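
The three latency slides above follow from a single multiplication. A minimal sketch of that arithmetic, assuming every one of the 3 billion monthly searches is slowed by the same amount (the function and constant names are illustrative, not from the talk):

```python
SEARCHES_PER_MONTH = 3_000_000_000  # "3 Billion job searches per month" (from the slides)
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: leap years ignored


def total_wait_seconds(extra_latency_ms, searches=SEARCHES_PER_MONTH):
    """Aggregate time all jobseekers wait when each search is extra_latency_ms slower."""
    return extra_latency_ms * searches / 1000


def to_days(seconds):
    return seconds / SECONDS_PER_DAY


def to_years(seconds):
    return seconds / (SECONDS_PER_DAY * DAYS_PER_YEAR)


# Print the aggregate cost for the three latencies discussed on the slides.
for ms in (1, 20, 100):
    waited = total_wait_seconds(ms)
    print(f"{ms} ms -> {to_days(waited):.1f} days ({to_years(waited):.2f} years) per month")
```

One millisecond works out to a little under 35 days of cumulative waiting per month, and 100 ms to roughly 9.5 years, matching the slides.
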
  19. 19. Reliable Reliability is a product feature
  20. 20. Impact of Downtime 8,000 Disappointed Job Seekers every minute
  21. 21. People get hired on Indeed: one every 7 seconds
  22. 22. Availability ● Jobseekers can find jobs ● Less focus on mitigating failure ● More focus on recovering quickly
  23. 23. Availability is Good for Job Seekers 9's
  24. 24. Good 99.9% availability => down for 525 minutes per year At peak 4,500 jobseekers don't get a job
  25. 25. Better 99.99% availability => down for 52 minutes per year At peak 450 jobseekers don't get a job
  26. 26. Almost Best 99.999% availability => down for 5 minutes per year At peak 45 jobseekers don't get a job
  27. 27. Indeed is Always there for Job Seekers Availability > 99.999% Less than 5 minutes downtime per year
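
The "nines" slides can be reproduced with a short calculation. This sketch assumes a 365-day year and reads the "7 seconds" slide as one hire every 7 seconds at peak; that reading is an inference from the slide numbers, and the names below are illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
SECONDS_PER_HIRE = 7              # assumption: one hire every 7 seconds at peak


def downtime_minutes_per_year(availability_pct):
    """Minutes of downtime per year allowed by a given availability percentage."""
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


def missed_hires(downtime_minutes):
    """Hires that would not happen if the site were down for downtime_minutes at peak."""
    return downtime_minutes * 60 / SECONDS_PER_HIRE


for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} min/year, ~{missed_hires(minutes):.0f} missed hires")
```

The slides round the downtime to whole minutes, so the printed figures differ slightly from 4,500, 450, and 45.
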
  28. 28. How It Works Chris Graf Operations Manager
  29. 29. Maximize Availability Beyond 99.999% No downtime, scheduled or otherwise
  30. 30. Maximize Performance Optimize page load times to the millisecond
  31. 31. Minimize Cost Minimize cost while meeting performance and availability goals
  32. 32. Hosting Models ● Traditional Colocation ● The Cloud ● Managed Hosting
  33. 33. Traditional Colocation ● You buy the servers, network gear, cables... ● You send people to set it up ● You send people to fix stuff when it breaks ● You manage your own pipes (maybe)
  34. 34. Traditional Colocation Expansion 1. Acquire rack space 2. Buy the hardware 3. Wait for manufacturing 4. Wait for delivery 5. Send people to the datacenter to set it all up Expansion can take weeks
  35. 35. Traditional Colocation Good if you have ● A fairly static environment ● Really beefy hardware ● Some centralized functionality ● Time to wait ● Lots of cap-ex budget ● A willingness to sign long-term deals ● People to do stuff
  36. 36. The Cloud ● You rent access to computing power ● You pay to reserve it if you aren't using it ● Usually abstracted from hardware layer
  37. 37. Expanding Cloud-based systems 1. Order new instances 2. Wait a few minutes 3. Provision them Expansion takes minutes.
  38. 38. The Cloud is good! If you have significant, unpredictable changes in load
  39. 39. The Cloud is bad! Costs more if you need all your instances available all of the time
  40. 40. Managed Hosting ● Rent hardware from provider ● Provider buys and hosts servers, network, etc. ● Provider deals with hardware issues
  41. 41. Expanding Managed Hosting 1. Order new servers 2. Wait a few hours 3. Provision Expansion takes hours (depending on provider)
  42. 42. Indeed Uses Managed Hosting Least expensive overall Access to real bare metal hardware Agile enough
  43. 43. Steps for beyond 99.999% uptime 1. Find a provider 2. Sign contract for 100% uptime with 100% revenue protection 3. Profit Right?
  44. 44. Providers "guarantee" availability "Service Level Agreement" (SLA) guarantees some percentage of uptime
  45. 45. SLA: brief outages aren't outages Less than 30 minutes downtime not counted against "100% SLA" One 5-minute outage per month < 99.99% Two 25-minute outages per month < 99.9% The provider can call that 100% available
  46. 46. SLA: maintenance is not downtime Scheduled maintenance not counted against SLA 1 hour maintenance each month < 99.9% The provider can call that 100% available
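
The gap between what an SLA reports and what jobseekers experience can be sketched as follows. The 30-minute exclusion and the maintenance exemption come from the two SLA slides above; the 30-day month and all function names are assumptions made for illustration:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    return 100 * (1 - downtime_minutes / period_minutes)


def actual_pct(outages, maintenance=0):
    """Availability as users see it: every minute of downtime counts."""
    return availability_pct(sum(outages) + maintenance)


def sla_reported_pct(outages, min_countable=30):
    """Availability as the SLA counts it: outages shorter than min_countable
    minutes are ignored, and scheduled maintenance is never passed in at all."""
    counted = sum(d for d in outages if d >= min_countable)
    return availability_pct(counted)


scenarios = {
    "one 5-minute outage": ([5], 0),
    "two 25-minute outages": ([25, 25], 0),
    "1 hour of maintenance": ([], 60),
}
for label, (outages, maintenance) in scenarios.items():
    print(f"{label}: actual {actual_pct(outages, maintenance):.3f}%, "
          f"SLA reports {sla_reported_pct(outages):.0f}%")
```

In all three scenarios the SLA figure stays at 100% while the actual availability falls below the thresholds named on the slides.
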
  47. 47. SLA credits don't cover your business You get a refund for the services, not for lost business and lost customer confidence Providers lose your hosting fees You lose your revenue
  48. 48. 100% is not really 100% Hosting is complicated A single datacenter is rarely 100% available
  49. 49. Bug in provider hardware caused total loss of Internet access under certain load Core network problem
  50. 50. Power outage 1. Utility power was disrupted 2. Backup generator and UPS couldn't handle load 3. Core network went offline 4. Servers lost power 5. Upon power restoration, router did not recover
  51. 51. Power Outage Aftermath ● Event duration = 54 minutes ● Recovery duration = 12 hours ● 5% monthly credit for affected hardware
  52. 52. Backhoe Induced Fiber Failure (BIFF)
  53. 53. Wet servers Tornado peeled back the roof of an AT&T datacenter in 2004.
  54. 54. Other Disasters ● Hurricanes ● Floods ● Earthquakes ● Fires ● Etc.
  55. 55. Need better uptime than providers Can only get ~99.7% after asterisks We have to build something better
  56. 56. Save a document to a hard disk Hard Disk Doc
  57. 57. Saved Hard Disk Doc
  58. 58. Disk failure Hard Disk A
  59. 59. Disaster Recovery Restore from an external USB drive?
  60. 60. Redundant Storage Simple case - RAID 1 Hard Disk A Hard Disk B
  61. 61. RAID - Save it twice Hard Disk A Hard Disk B Doc
  62. 62. RAID - Two copies of everything Hard Disk A Hard Disk B Doc Doc
  63. 63. RAID Hard Disk A Hard Disk B Doc Doc
  64. 64. RAID == Redundant Array of Inexpensive Datacenters Datacenter A Datacenter B Jobseekers
  65. 65. RAID makes datacenters more reliable Datacenter A Datacenter B Jobseekers
  66. 66. Building a more reliable system Using inexpensive, less reliable components
  67. 67. 99.7% in, 99.999% out Now our system can get better availability as a whole than any single provider can give us.
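
One way to see how 99.7% components can yield 99.999% is the standard formula for redundant independent components. This is a sketch under strong assumptions that the talk does not state: failures are independent, any single surviving datacenter can carry all traffic, and failover is instantaneous.

```python
def combined_availability(per_site, sites):
    """Probability that at least one of `sites` independent sites is up."""
    return 1 - (1 - per_site) ** sites


for n in (1, 2, 3):
    print(f"{n} datacenter(s) at 99.7% each -> {combined_availability(0.997, n) * 100:.5f}%")
```

Under these assumptions two datacenters already clear 99.999%. Correlated failures, such as the provider-wide incidents described later in the deck, and DNS propagation delay both reduce the real figure.
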
  68. 68. Expect your datacenter to fail Failure is inevitable Design for it
  69. 69. Simpler datacenters with RAID Only need one of everything inside each datacenter: ● Firewalls ● Load balancers ● Servers provisioned primarily for capacity not redundancy
  70. 70. Primary and secondary datacenters (diagram)
  71. 71. Datacenter level redundancy Protects against a single datacenter failure
  72. 72. Datacenter level redundancy Protects against a single datacenter failure ... But there are problems that can affect more than one datacenter on the same provider
  73. 73. Denial of service attacks Distributed denial of service attack against another customer who had servers in the same facilities took multiple facilities offline
  74. 74. Network configuration errors Provider pushed a bad global route which took their entire global network offline
  75. 75. The biggest threat Humans
  76. 76. Protect against global provider failure Use multiple providers to get provider-level redundancy
  77. 77. Provider-level redundancy (diagram)
  78. 78. Provider-level redundancy (diagram, two sites crossed out)
  79. 79. Recovering from Failure ● Offline ● Active/Passive ● Active/Active
  80. 80. Offline ● One active datacenter handles all traffic ● Backup systems are offline and incomplete ● Restore backups to new systems ● Downtime during switchover is ~days
  81. 81. Active / Passive (Dark) ● One active datacenter handles all traffic ● A second datacenter has provisioned systems and all data ● Switch from primary to secondary ● Downtime during switchover is minutes to hours
  82. 82. Active / Active ● Every datacenter handles traffic ● Data and systems are replicated ● Failover activated automatically ● Downtime during switchover measured in seconds ● Scales beyond two facilities
  83. 83. Jobseeker Impact Offline: extended downtime for all jobseekers Active/Passive: some downtime for all jobseekers Active/Active: brief downtime for some jobseekers
  84. 84. Which jobseekers go to which datacenter? Offline: go to single datacenter Active/Passive: go to single datacenter Active/Active: go to many datacenters?
  85. 85. Send jobseekers to the best datacenter Use dynamic DNS service to send job seekers to the best, healthy data center
  86. 86. Anycast DNS Resolving same hostname to different IP addresses ● Client A: nslookup www.indeed.com Server: dns.client-a.com Address: 1.1.1.1 ● Client B: nslookup www.indeed.com Server: dns.client-b.com Address: 2.2.2.2
  87. 87. DNS Lookup Jobseeker A Jobseeker DNS Server 5.5.5.5 Indeed DNS Service www.indeed.com 1.1.1.1 www.indeed.com 1.1.1.1
  88. 88. Vary response from primary DNS Indeed DNS Service www.indeed.com 1.1.1.1 www.indeed.com 1.1.1.1 Indeed DNS Service www.indeed.com 2.2.2.2 www.indeed.com 2.2.2.2 Jobseeker DNS Server 5.5.5.5 Jobseeker DNS Server 8.8.8.8 Jobseeker A Jobseeker B
  89. 89. Similar jobseekers get similar responses Indeed DNS Service www.indeed.com 1.1.1.1 www.indeed.com 1.1.1.1 Indeed DNS Service www.indeed.com 2.2.2.2 www.indeed.com 2.2.2.2 Indeed DNS Service www.indeed.com 2.2.2.2 www.indeed.com 2.2.2.2 Jobseeker DNS Server 5.5.5.5 Jobseeker DNS Server 8.8.8.8 Jobseeker DNS Server 8.8.8.8 Jobseeker A Jobseeker B Jobseeker C
  90. 90. Remap jobseekers via DNS changes Indeed DNS Service www.indeed.com 1.1.1.1 www.indeed.com 1.1.1.1 Reconfig Indeed DNS Service www.indeed.com 2.2.2.2 www.indeed.com 2.2.2.2 Jobseeker DNS Server 5.5.5.5 Jobseeker DNS Server 5.5.5.5 Jobseeker A Jobseeker A
  91. 91. Outsource your DNS service Doing this well is an investment
  92. 92. Outsource your DNS service ● Robust ● Flexible ● Inexpensive Our core competency is jobs Their core competency is DNS
  93. 93. Global DNS Service
  94. 94. Degradation and Failure Manually switch datacenter on service degradation Automatically switch datacenter on failure
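
The two switching modes can be modeled with two independent flags per datacenter. This is a hypothetical sketch of the decision logic only; the talk does not describe how the outsourced DNS service implements it, and the class, field, and function names are invented for illustration (the IP addresses are the placeholders used on the slides):

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    ip: str
    healthy: bool = True  # flipped automatically by health checks (failure)
    enabled: bool = True  # flipped manually by an operator (degradation, maintenance)


def dns_answer(preferred_order, datacenters):
    """Return the IP of the first datacenter, in preference order, that is
    both passing health checks and not manually disabled."""
    by_name = {dc.name: dc for dc in datacenters}
    for name in preferred_order:
        dc = by_name[name]
        if dc.healthy and dc.enabled:
            return dc.ip
    raise LookupError("no datacenter available")


sites = [Datacenter("A", "1.1.1.1"), Datacenter("B", "2.2.2.2")]
print(dns_answer(["A", "B"], sites))  # A is healthy and enabled
sites[0].healthy = False              # simulated automatic failure detection
print(dns_answer(["A", "B"], sites))  # traffic is remapped to B
```

Keeping the manual flag separate from the health flag means an operator can drain a datacenter that still passes its health checks.
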
  95. 95. DNS propagation delays 1. Healthcheck cycle - up to 30 seconds 2. Healthcheck server to nearest PoP 3. Jobseeker's DNS server cache refresh 4. Jobseeker's local DNS cache refresh
  96. 96. DNS Time-to-live (TTL) TTL tells local name servers and clients how long to wait before looking up a domain name again TTL limits load, but also slows change propagation
  97. 97. Some clients and servers ignore TTL We specify a 30 second TTL, but local DNS servers and clients can ignore it
  98. 98. Impact of propagation delay 90 second traffic hole
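
The slides do not break the 90-second hole down into its parts. One plausible accounting, offered only as an assumption, adds the worst case of each stage from the propagation list: a full health-check cycle, then one full TTL at the jobseeker's DNS server, then one full TTL in the local cache. The propagation time to the nearest PoP is unknown and is set to zero here.

```python
def worst_case_hole_seconds(healthcheck_interval=30,  # "up to 30 seconds" (slides)
                            pop_propagation=0,        # assumption: unknown, taken as 0
                            resolver_ttl=30,          # "30 second TTL" (slides)
                            client_ttl=30):           # assumption: client caches one TTL
    """Upper bound on how long a well-behaved client keeps reaching a dead datacenter."""
    return healthcheck_interval + pop_propagation + resolver_ttl + client_ttl


print(worst_case_hole_seconds())                               # default settings
print(worst_case_hole_seconds(resolver_ttl=5, client_ttl=5))   # shorter TTL
```

A shorter TTL shrinks the bound only for resolvers and clients that honor it, and it increases query load on the DNS service.
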
  99. 99. 30 minute tail Well-behaved clients Ignoring our TTL
  100. 100. Big Picture 90 second hole Failing datacenter Total traffic
  101. 101. Accepting DNS limitations Complete datacenter failure is extremely rare Predictable limitation Massive costs to reduce propagation delay
  102. 102. Remapping Manually The same system allows us to reroute traffic whenever we want ● Datacenter maintenance ● Non-critical performance problems ● Non-critical feature loss ● Other degradation of jobseeker experience
  103. 103. Datacenter Redirection datacenter disabled traffic moves to others
  104. 104. Anycast DNS for performance This capability is also used to improve performance
  105. 105. Closer to the jobseekers The DNS service can give the IP address of the datacenter closest to the jobseeker.
  106. 106. Network hops Based on network hops between jobseeker DNS server and our DNS service POP
  107. 107. Network paths Estimates how many networks traffic must pass through to reach our servers
  108. 108. Count hops Picks estimated shortest path
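
The selection rule described above, picking the estimated shortest path, reduces to a minimum over hop estimates. In this sketch the hop counts are made-up inputs; how the DNS service actually estimates them is not covered in the talk, and the function name is hypothetical:

```python
def closest_datacenter(hop_estimates, available):
    """Pick the available datacenter with the fewest estimated network hops.

    hop_estimates: dict mapping datacenter name -> estimated hops from the
                   jobseeker's DNS server
    available:     set of datacenter names currently accepting traffic
    """
    candidates = {name: hops for name, hops in hop_estimates.items() if name in available}
    if not candidates:
        raise LookupError("no datacenter available")
    # sorted() makes ties resolve deterministically by name
    return min(sorted(candidates), key=candidates.get)


# Hypothetical estimates for a resolver near the US West Coast
estimates = {"east": 9, "central": 6, "west": 3}
print(closest_datacenter(estimates, {"east", "central", "west"}))
print(closest_datacenter(estimates, {"east", "central"}))  # west unavailable
```

Because the estimate is taken from the jobseeker's DNS server rather than the jobseeker's own machine, it is only as accurate as the assumption that the two are near each other on the network.
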
  109. 109. Optimize for network distance We can push our data center presences closer to the jobseekers to reduce network latency
  110. 110. Datacenters for redundancy only
  111. 111. Fast for some jobseekers
  112. 112. Datacenters close to the jobseekers
  113. 113. Fast for most jobseekers
  114. 114. Sent to the East Coast
  115. 115. Sent to Central US
  116. 116. Sent to the West Coast
  117. 117. No downtime for datacenter replacement Incrementally send traffic to new datacenters Incrementally reduce traffic to old data centers
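
Incremental migration can be pictured as a weighted choice whose weights move in small steps from the old datacenter to the new one. This is a hedged sketch of the idea, not the DNS provider's mechanism; `ramp_weights` and `pick` are hypothetical names:

```python
import random


def ramp_weights(step, total_steps):
    """Traffic shares after `step` of `total_steps` equal increments."""
    new_share = step / total_steps
    return {"old": 1 - new_share, "new": new_share}


def pick(weights, rng=random):
    """Choose a datacenter for one DNS answer according to the weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so runs are repeatable
for step in range(5):
    weights = ramp_weights(step, 4)
    sample = [pick(weights, rng) for _ in range(10_000)]
    print(step, weights, f"observed new share: {sample.count('new') / len(sample):.2f}")
```

Pausing between steps to watch monitoring lets a problem in the new datacenter surface while it carries only a fraction of the traffic.
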
  118. 118. Move West Coast hosting? ?
  119. 119. Move West Coast hosting! -20 ms
  120. 120. Move European hosting? ?
  121. 121. Don't move European hosting! +50 ms!
  122. 122. Search Engine Performance Source GrabPerf.org
  123. 123. Page Load Time 1,000ms 9,000ms
  124. 124. Summary and Results Charles Valentine
  125. 125. Traditional Scaling Model ● Higher-capacity network equipment ● Redundant firewalls ● Redundant load balancers ● Bigger Internet connections ● Redundant Internet connections This is "vertical scaling."
  126. 126. Horizontal Scaling with RAID Add capacity by adding datacenters Add redundancy by adding datacenters Rent "good" datacenters, not "best"
  127. 127. You can RAID too!
  128. 128. Avoid using proprietary features ● Load balancer ● Security devices ● Virtualization ● Servers
  129. 129. Be Hardware Agnostic
  130. 130. More potential providers
  131. 131. Use free software No licensing costs or recurring maintenance fees
  132. 132. Agile Providers ● New hardware racked and ready in a few hours ● No need to over provision
  133. 133. Automate configuration ● Cobbler ● Puppet
  134. 134. Rent instead of buying ● Obsolete hardware is not your problem ● No depreciation ● No hardware maintenance ● No need to hire people to maintain the hardware
  135. 135. Architect Applications for RAID Work with your development teams
  136. 136. Traditional Hardware Scaling ● Old hardware supports baseline traffic ● New hardware supports growth
  137. 137. Indeed Hardware Scaling Old hardware gets replaced by new, on demand
  138. 138. Moore's Law Hardware is always getting better ● Faster processors ● More memory per chassis ● Larger, faster disks
  139. 139. Higher capacity, lower cost ● Number of machines drives cost ● Power of machines drives cost ● More machines => more problems ● Compute power grows faster than compute cost
  140. 140. Replace hardware every 18 months Managed hosting + Moore's Law + RAID = new and powerful hardware every 18 months
  141. 141. Amazon EC2? ● Amazon is a single provider ● Costs more to run 24x7 ○ 2x without bandwidth cost ● Can't be as close to the jobseeker
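
The "2x" figure implies a break-even point between the two models. This sketch assumes a simplified pricing model that the talk does not specify: cloud capacity is billed purely by the hour used, rented hardware costs a flat amount regardless of use, and bandwidth is excluded as on the slide.

```python
def breakeven_utilization(cloud_cost_ratio_at_full_use):
    """Fraction of the time capacity must be in use before flat-rate rental
    becomes cheaper than pay-per-use cloud."""
    return 1 / cloud_cost_ratio_at_full_use


def cheaper_option(utilization, cloud_cost_ratio_at_full_use=2.0):
    """Compare relative costs, with flat-rate rental normalized to 1.0."""
    cloud_cost = cloud_cost_ratio_at_full_use * utilization
    return "cloud" if cloud_cost < 1.0 else "managed hosting"


print(breakeven_utilization(2.0))  # share of hours in use at which costs are equal
print(cheaper_option(0.25))        # spiky, mostly idle workload
print(cheaper_option(1.0))         # capacity needed all of the time
```

Under these assumptions a workload that needs its capacity more than half the time favors rented hardware, which is consistent with the earlier slides on when the cloud is good and bad.
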
  142. 142. What RAID gets you ● Servers closer to your customers ● Disposable datacenters ○ Datacenter-level failover ○ Get modern hardware every 18 months ● Many hosting options
  143. 143. Spend Time On... ● Automation ● Managed DNS ● Investigating Providers ● Monitoring
  144. 144. Spend Less On ● Proprietary hardware ● Network Infrastructure ● Support Contracts ● Software Licenses ● Headcount
  145. 145. Monthly Server Count vs Job Search
  146. 146. Inexpensive ● Cost as a percentage of revenue ● Cost of delivery per job search
  147. 147. Revenue vs Infrastructure Cost
  148. 148. Revenue/Search vs. Cost/Search
  149. 149. Fast ● 100 ms average client time Reliable ● > 99.999% availability in 2012 Cost Effective ● Cost of delivery < 0.5% of revenue RAIDing FTW
  150. 150. Q&A
