
AWS to Bare Metal: Motivation, Pitfalls, and Results


Like many startups, Wish grew up on AWS. As our cluster grew and the price of SSDs fell, we started exploring bare metal. Fast-forward 2 years and we have hundreds of MongoDB instances on bare metal fully integrated with our AWS infrastructure. It wasn't all smooth sailing, but the performance & cost improvements were worth it! Hear the story of how we did it and gain a framework for thinking about how to make the leap from cloud-centric architecture to a hybrid model.

Published in: Technology


  1. AWS CLOUD TO BARE METAL
  2. Wish saved 35% on MongoDB costs, improved latency by 20%, and reduced latency variance
  3. HI, I’M ADAM. (I’m a software engineer; I also run production…)
  4. I WORK AT WISH. (we’re a mobile eCommerce platform)
  5. I WORK AT WISH. (we also grow really fast…)
  6. AWS TO BARE METAL • The Why • The Scope • The Servers • The Network • The Operations • The Results
  7. THE THEME
  8. The Why
  9. In the beginning, there was spinning-disk EBS
  10. EBS LATENCY SPIKE • DB slows to a crawl • App slows down • Replica set detects failure • Election kills the app for 30s
  11. Provisioned IOPS EBS launches, Summer 2012
  12. But - super expensive!
  13. Maybe time for bare metal?
  14. So we modeled the costs…
  15. The Scope
  16. ?
  17. The Servers
  18. Server Specs?
  19. GOAL: Find the lowest cost per query for your workload
  20. THROUGHPUT & LATENCY • Typically: more throughput → more latency • Application dictates max latency (p95?) • For each hardware config… • Find highest throughput under max latency
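The search described on slide 20 can be sketched in a few lines. This is a minimal Python sketch, not the actual Wish tool: `measure_p95` is a stand-in for a real benchmark run (here a toy latency model), and the configs and numbers are purely illustrative.

```python
def measure_p95(config, qps):
    """Toy stand-in for a real benchmark: pretend p95 latency (ms)
    rises as offered load approaches the box's capacity."""
    capacity = config["max_qps"]
    if qps >= capacity:
        return float("inf")
    return config["base_p95_ms"] / (1 - qps / capacity)

def best_throughput(config, max_p95_ms, step=500):
    """Sweep offered load upward; return the highest qps whose
    measured p95 stays under the application's latency budget."""
    best = 0
    qps = step
    while measure_p95(config, qps) <= max_p95_ms:
        best = qps
        qps += step
    return best

# Compare candidate hardware configs by cost per sustainable query
# (hypothetical specs and prices, for illustration only).
configs = {
    "ssd_box":  {"max_qps": 20000, "base_p95_ms": 2.0, "monthly_cost": 3000},
    "ebs_piops": {"max_qps": 8000, "base_p95_ms": 5.0, "monthly_cost": 4000},
}
for name, cfg in configs.items():
    qps = best_throughput(cfg, max_p95_ms=12.0)
    print(name, qps, cfg["monthly_cost"] / qps)  # cost per sustainable qps
```

In a real run, `measure_p95` would drive the replay tool from the next slides against a restored snapshot and read back the observed latency distribution.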
  21. THE WORKLOAD • db.setProfilingLevel(2) • Snapshot the DB volume • Dump system.profile after 1 hour
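Once `db.setProfilingLevel(2)` has logged every operation to `system.profile`, the dump can be filtered down to ops a replay tool can reuse. A small sketch of that filtering step (the field names `op`, `ns`, and `ts` match MongoDB's profiler output; `sample` is illustrative data, not a real dump):

```python
# Operation types the profiler records that make sense to replay
REPLAYABLE = {"query", "insert", "update", "remove", "command"}

def extract_ops(profile_docs, db_name):
    """Keep only replayable ops against the target DB, skipping
    the profiler's reads of system.profile itself."""
    ops = []
    for doc in profile_docs:
        ns = doc.get("ns", "")
        if not ns.startswith(db_name + "."):
            continue
        if ns.endswith(".system.profile"):  # don't replay profiling overhead
            continue
        if doc.get("op") in REPLAYABLE:
            ops.append(doc)
    ops.sort(key=lambda d: d["ts"])  # replay in original order
    return ops

sample = [
    {"op": "query",   "ns": "shop.users",          "ts": 2},
    {"op": "update",  "ns": "shop.orders",         "ts": 1},
    {"op": "query",   "ns": "shop.system.profile", "ts": 3},  # skipped
    {"op": "getmore", "ns": "shop.users",          "ts": 4},  # not replayed
]
print([d["ns"] for d in extract_ops(sample, "shop")])  # ['shop.orders', 'shop.users']
```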
  22. OUR TOOL • Restore the snapshot • Clear filesystem caches • Replay ops at configured throughput • Report on latency / MongoDB stats
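The core of such a replay tool is a paced loop: issue ops at the configured throughput and record per-op latency. A self-contained sketch, where `execute` stands in for actually sending the op to MongoDB:

```python
import time

def replay(ops, target_qps, execute):
    """Issue ops at a fixed target throughput; return per-op latencies (s)."""
    interval = 1.0 / target_qps
    latencies = []
    next_send = time.perf_counter()
    for op in ops:
        # Pace sends so observed throughput matches the configured target
        now = time.perf_counter()
        if now < next_send:
            time.sleep(next_send - now)
        start = time.perf_counter()
        execute(op)
        latencies.append(time.perf_counter() - start)
        next_send += interval
    return latencies

def p95(latencies):
    """95th-percentile latency (nearest-rank on the sorted sample)."""
    s = sorted(latencies)
    return s[int(0.95 * (len(s) - 1))]

# Dry run with a no-op executor
lat = replay(range(50), target_qps=1000, execute=lambda op: None)
print(f"p95 latency: {p95(lat) * 1e6:.0f} µs")
```

A real tool would also clear filesystem caches between runs (as the slide notes) so each configuration is measured from a cold page cache.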
  23. LATEST SPECS • 2x Ivy Bridge 3.3 GHz (32 hyperthreads) • 256 GB RAM • 3.2 TB LSI WarpDrive PCIe • YOUR MILEAGE MAY VARY!
  24. The Network
  25. NETWORKS ARE WEIRD • Network engineering is weird for software people • Need to master a few big pieces • We wasted a lot of time improvising…
  26. PLAN TO FAIL • Every component and connection fails • Switch dies? • NIC dies? • Switch ⟷ switch connection dies? • Direct Connect dies?
  27. The Operations
  28. THE OPERATIONS • Migration / Rollback • Backups • Processes • Documentation
  29. MIGRATION (PREP) • Add new nodes to replica set • hidden: true, priority: 0 • Wait for them to sync
  30. MIGRATION (READ-ONLY) • Unhide nodes: • hidden: false, priority: 0
  31. MIGRATION (READ-WRITE) • Force primary into colo: • hidden: false, priority: 2
  32. MIGRATION (DONE) • Hide old AWS nodes: • hidden: true, priority: 0
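The four migration phases above are just edits to replica-set member settings. A sketch of the sequence using plain dicts standing in for `rs.conf()` / `rs.reconfig()` in the mongo shell (host names and the `site` tag are illustrative, not part of MongoDB's config):

```python
# hidden/priority settings per migration phase; note a hidden member
# must have priority 0, which these settings respect.
PHASES = {
    "prep":       {"colo": {"hidden": True,  "priority": 0},
                   "aws":  {"hidden": False, "priority": 1}},
    "read_only":  {"colo": {"hidden": False, "priority": 0},
                   "aws":  {"hidden": False, "priority": 1}},
    "read_write": {"colo": {"hidden": False, "priority": 2},
                   "aws":  {"hidden": False, "priority": 1}},
    "done":       {"colo": {"hidden": False, "priority": 2},
                   "aws":  {"hidden": True,  "priority": 0}},
}

def reconfig(conf, phase):
    """Return a new config with each member's hidden/priority set for the phase."""
    settings = PHASES[phase]
    new_members = []
    for m in conf["members"]:
        m = dict(m)
        m.update(settings[m["site"]])
        new_members.append(m)
    return {**conf, "version": conf["version"] + 1, "members": new_members}

conf = {"version": 1, "members": [
    {"host": "aws-1:27017",  "site": "aws"},
    {"host": "colo-1:27017", "site": "colo"},
]}
for phase in ["prep", "read_only", "read_write", "done"]:
    conf = reconfig(conf, phase)
print(conf["version"])  # bumped once per reconfig
```

Rollback is just re-applying an earlier phase's settings, which is why the slides call it "no big deal".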
  33. ROLLBACK • No big deal • Adjust hidden/priority to move traffic back
  34. BACKUPS • EBS snapshots rock! • Hidden member in EC2 for backup • Nice for DR too…
  35. PROCESSES • No RackServer() API • Ensure consistency: • Checklists • Verification tools
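A verification tool in this spirit can be very simple: with no `DescribeInstances` to ask, diff the documented inventory against what is actually observed on the rack. A hedged sketch with illustrative records (real input would come from wiremaps and collected hardware facts):

```python
def verify(documented, observed):
    """Return human-readable discrepancies between docs and reality."""
    problems = []
    for host, spec in documented.items():
        if host not in observed:
            problems.append(f"{host}: documented but not found")
            continue
        for field, expected in spec.items():
            actual = observed[host].get(field)
            if actual != expected:
                problems.append(f"{host}: {field} is {actual!r}, docs say {expected!r}")
    for host in observed:
        if host not in documented:
            problems.append(f"{host}: racked but undocumented")
    return problems

documented = {"db-01": {"serial": "SN100", "vlan": 10}}
observed   = {"db-01": {"serial": "SN100", "vlan": 20},
              "db-02": {"serial": "SN200"}}
for p in verify(documented, observed):
    print(p)
```

Running such a check regularly catches the "documentation being occasionally wrong" problem the next slides warn about.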
  36. DOCUMENTATION • No DescribeInstances either… • Consider life without the AWS Management Console • Worse: consider it being occasionally wrong
  37. DOCUMENTATION • Wiremaps • Network maps (IPs, VLANs, etc.) • Equipment specs • Serial numbers
  38. The Results
  39. Big project - took about 6 months
  40. Savings made it worthwhile
  41. Bonus: it got faster!
  42. Budget a lot of time for learning
  43. Benchmark & validate your assumptions
  44. Obsess over the details
  45. Thanks! adam@wish.com
