Scaling Hike Messenger to 15M Users

Published in: Technology, Business

  1. Scaling to Millions, FAST! Rajat Bansal, Chief Technology Officer
  2. Agenda
      - The Hike Journey
      - Why MongoDB?
      - MongoDB Use Cases: Overview and Deep Dive
      - Lessons Learned & Next Steps
  3. The fastest growing Made in India IM App!
  13. AND TODAY
  15. Why MongoDB? Storage decisions are governed by the CAP theorem
  16. Why MongoDB?
      - C: Consistency / Cost
      - A: Availability / Availability of Talent
      - P: Partition Tolerance / Production Support
  17. Cost
      - Start small, grow big, FAST, REALLY FAST!
      - Started with 1 replica set on EC2
      - Grew to multiple replica sets and sharded clusters
  18. Availability
      - Small team of 3 people writing the entire server
      - Easy ramp-up and management
      - Low setup and administration requirements
  19. Production Support
      - Cost of downtime is very high
      - A small team needs support during downtime
      - The decision to take Production Support proved life-saving in outages
  20. HIKE's MongoDB Use-cases
      - User Profile Store: 3000 reads / 500 writes per sec
      - Temporary Message Store: 1000 reads / 3900 writes per sec
      - Other miscellaneous usage: GridFS etc.
  21. HIKE MongoDB Architecture: Offline Message Store
      - App-level sharding
      - 4 Mongo instances with 32 DBs each
      - Horizontally scalable up to 128 instances when needed
      - Tested up to 30K ops in a simulated environment
      - Protected by a “Redis” layer to reduce queries
      - Latencies < 1 ms
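The slide does not say how the Redis layer is used. One common way to "reduce queries" is a read-through cache in front of the database, sketched below under that assumption. The class name, the `load_from_db` callback, and the in-memory dict standing in for Redis are all hypothetical and are not taken from the deck.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Illustrative read-through cache; a plain dict stands in for Redis."""

    def __init__(self, load_from_db: Callable[[str], Optional[dict]]):
        self._load_from_db = load_from_db   # hypothetical MongoDB lookup
        self._cache: Dict[str, dict] = {}   # stand-in for the Redis layer
        self.db_queries = 0                 # counts lookups that reach the DB

    def get(self, key: str) -> Optional[dict]:
        # Serve from the cache when possible so the database is not queried.
        if key in self._cache:
            return self._cache[key]
        self.db_queries += 1
        doc = self._load_from_db(key)
        if doc is not None:
            self._cache[key] = doc
        return doc

    def invalidate(self, key: str) -> None:
        # Call after a write so the next read fetches fresh data.
        self._cache.pop(key, None)


# Usage with a fake backing store (assumed data, for illustration only).
fake_db = {"user-1": {"pending": 3}}
cache = ReadThroughCache(fake_db.get)
first = cache.get("user-1")
second = cache.get("user-1")   # answered from the cache, no DB query
```

Only the first lookup for a key reaches the backing store; repeated reads are absorbed by the cache until the key is invalidated.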
  22. Diagram: App Layer (Shard Manager) in front of four replica sets, Mongod-1 to Mongod-4, each with one Primary and two Secondaries, 32 DBs each
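The numbers on the architecture slides (4 instances, 32 DBs each, scalable to 128 instances) suggest a fixed pool of 4 x 32 = 128 logical databases, each of which can later move to its own instance. A minimal sketch of such app-level routing follows. The hash function, the database naming scheme, and the contiguous placement of databases on instances are assumptions made for illustration, not details from the deck.

```python
import hashlib
from typing import Tuple

NUM_LOGICAL_DBS = 128  # 4 instances x 32 databases, per the slide


def logical_db(user_id: str) -> int:
    """Map a user id to one of the 128 logical databases with a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_DBS


def locate(user_id: str, num_instances: int = 4) -> Tuple[int, str]:
    """Return (instance index, database name) for a user.

    Each instance holds a contiguous block of logical databases, so growing
    the cluster moves whole databases and never changes a user's database.
    """
    if num_instances < 1 or NUM_LOGICAL_DBS % num_instances != 0:
        raise ValueError("instance count must divide 128")
    dbs_per_instance = NUM_LOGICAL_DBS // num_instances
    db = logical_db(user_id)
    # "offline_msgs_NNN" is a hypothetical naming scheme.
    return db // dbs_per_instance, f"offline_msgs_{db:03d}"


instance_now, db_now = locate("user-42")          # 4 instances today
instance_max, db_max = locate("user-42", 128)     # fully scaled out
```

Because the user-to-database mapping is independent of the instance count, scaling from 4 to 128 instances changes only which host serves a database, which is why 128 is the upper bound in this scheme.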
  23. HIKE MongoDB Architecture: User Profile Store
      - Replica set (1 Primary, 2 Secondaries)
      - Writes to Primary
      - Reads from Secondary
      - Latencies < 1 ms
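In a MongoDB replica set, writes always go to the primary, and the read side of the split above is usually selected with the `readPreference` connection-string option. The sketch below only builds such a connection string. The host names and the replica set name are placeholders, and the deck does not show how Hike actually configured its drivers.

```python
from typing import Sequence
from urllib.parse import urlencode


def replica_set_uri(hosts: Sequence[str], replica_set: str,
                    read_preference: str = "secondary") -> str:
    """Build a MongoDB URI that routes reads by the given read preference."""
    query = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,
    })
    return "mongodb://" + ",".join(hosts) + "/?" + query


# Placeholder hosts and set name, for illustration only.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    "profiles",
)
```

Replication to secondaries is asynchronous, so reads served this way can briefly lag behind the latest write. `secondaryPreferred` is the variant that falls back to the primary when no secondary is available.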
  24. MongoDB (Happy State < 1 ms): 0.65 ms, 0.80 ms
  25. MongoDB (1 Year Timeline)
  26. MongoDB (Outage 1)
  27. HIKE Learnings: Outage 1
      - Latencies went through the roof: “1 ms -> 1000 ms”
      - What went wrong: a lot of operations on “Arrays”
      - “Production Support” to the rescue
      - “Adding and modifying array entries can require a scan of much or all of each array being updated, resulting in slow operations”
  28. MongoDB (Outage 2)
  29. HIKE Learnings: Outage 2
      - Latencies increased 20-50X
      - What went wrong:
          - Disk I/O was the bottleneck
          - “ReadAhead” was high
      - “Read/Write Throughput Exceeds I/O”
  30. MongoDB (Outage 3)
  31. HIKE Learnings: Outage 3
      - MongoDB crashed
      - An ad hoc script was doing a full table scan
      - Need to protect your systems: “noTableScan”
      - “Protect your production systems. Use the mechanisms available”
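The “noTableScan” safeguard corresponds to MongoDB's `notablescan` server parameter, which makes queries that would need a full collection scan fail instead of running. As a rough sketch, the helpers below build the two usual ways of turning it on: the `setParameter` admin command and the `mongod` startup flag. The function names are hypothetical, and nothing here contacts a server.

```python
from typing import Dict, List


def notablescan_command(enabled: bool = True) -> Dict[str, object]:
    """Command document to run against the admin database.

    The command name must be the first key; Python dicts keep insertion order.
    """
    return {"setParameter": 1, "notablescan": enabled}


def mongod_startup_args(enabled: bool = True) -> List[str]:
    """Equivalent startup arguments for the mongod process."""
    return ["mongod", "--setParameter", f"notablescan={1 if enabled else 0}"]


command = notablescan_command()
args = mongod_startup_args()

# With a driver such as pymongo and a live server (not assumed here), the
# command document could be sent with: client.admin.command(command)
```

Enabling the parameter means every production query needs a usable index, so it is worth trying in a staging environment before turning it on in production.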
  32. HIKE Learnings
      - Proactive health checks
      - Production Support helps
      - Put mechanisms in place to safeguard production
  33. http://hike.in @hikeapp
