
Scaling Tappsi



This is the story of how we managed to scale and improve Tappsi’s RoR RESTful API to handle our ever-growing load - told from different perspectives: infrastructure, data storage tuning, web server tuning, RoR optimization, monitoring and architecture design.

Published in: Software


  1. 1. Scaling Tappsi’s RoR RESTful API Or the TL;DR story of how we evolved By: Óscar Andrés López Twitter: @oscar_lopez
  2. 2. The Team •  A big thank you to the whole team that made this possible! •  We keep a tight, continuous dialogue between Devs and Ops, where we discuss issues and iterate over solutions; this is one of the main reasons we have evolved to where we are now •  The system evolved organically, based on experimentation and metrics on both the programming and the system administration side
  3. 3. Table of Contents •  This is a journey, from a RoR monolith to a well-defined set of services and microservices •  A tale of the different strategies we implemented, not necessarily in chronological order •  Our greatest scalability challenge: how to inform all available drivers which passengers near them need a ride 1.  Monitoring and Alerts 2.  Phase I 3.  Phase II 4.  Phase III 5.  Microservices 6.  Results
  5. 5. Deprecated Monitoring Tools •  Riemann (briefly) •  Idera Uptime Cloud Monitor (previously CopperEgg) •  Grafana, served with Graphite/Carbon on a single instance
  6. 6. Current Monitoring Tools •  New Relic •  Grafana on steroids – now runs in a sharded and replicated cluster in each of our AZs •  Automatic alerts when a service goes down •  By collecting systems data ourselves, we are able to observe systems characteristics better
  7. 7. Grafana Architecture
  8. 8. PHASE I
  9. 9. In the beginning…
  10. 10. … There was the Monolith … •  Ruby 1.9, Rails 3.2, ran in a screen session •  Up to 8 instances (first c3.4xl, later c3.2xl), scaled by hand •  PostgreSQL 8.3 + PostGIS, XFS SoftRAID setup •  We were using a hard disk drive for the DB •  No tests, at all •  Huge technical debt
  11. 11. … And the bad practices! •  Overreliance on clocks for processing •  All processing was synchronous •  Misunderstood REST web services – all requests were POSTs, all endpoints returned 200
  12. 12. The Scalability Challenge •  The location endpoint: each client calls it every 45 seconds, it takes 100+ ms to respond, and it accounts for 65%-75% of all requests! •  Measure first, optimize later – remember the root of all evil... •  Optimize the response time of the top endpoints •  9 out of 10 times, performance problems are found in the way data storage/querying is implemented •  The endpoint was completely DB-bound and killed the DB
  13. 13. location Performance
  14. 14. PHASE II
  15. 15. Separate Systems
  16. 16. Infrastructure Improvements
  17. 17. Engineering Improvements •  RSpec (> 920) and Funkload (> 100) tests •  RabbitMQ introduced for decoupling asynchronous processes •  ELB •  ASG – up to 10 instances of c4.xl, but we rarely go above 6. And it’s cheaper, too! •  We learned: not to use multiple Redis instances to get around the fact that it’s single-threaded •  Redis Cluster was not stable back then •  We settled for a single, managed Redis instance •  Improvements on the driver and passenger apps
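The decoupling idea behind the RabbitMQ bullet can be sketched in plain Ruby. This is only an illustration: it uses the standard library's in-process `Queue` as a stand-in for a real broker, and `run_pipeline` and the `driver_id` job payload are hypothetical names, not Tappsi's code. The request path only enqueues a job and returns immediately; a separate worker does the slow part.

```ruby
# Stand-in for a message broker: an in-process, thread-safe FIFO queue.
# A real deployment would publish to RabbitMQ instead (assumption: not shown here).
def run_pipeline(driver_ids)
  jobs = Queue.new
  notified = []

  # Consumer: runs independently of the request path and drains the queue.
  worker = Thread.new do
    while (job = jobs.pop) != :stop
      notified << job[:driver_id] # the slow work would happen here
    end
  end

  # Producer: the "endpoint" only enqueues and answers right away.
  statuses = driver_ids.map do |id|
    jobs << { driver_id: id }
    :accepted
  end

  jobs << :stop # sentinel that tells the worker to finish
  worker.join
  [statuses, notified]
end

statuses, notified = run_pipeline([1, 2, 3])
puts "request statuses: #{statuses.inspect}"
puts "processed in the background: #{notified.inspect}"
```

With a single consumer the jobs are handled in the order they were enqueued; with several consumers that ordering guarantee would no longer hold.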
  18. 18. Database Optimization •  Create valuable DB indexes. Lots of time spent searching in logs and looking for missing indexes •  Avoid unnecessary trips to DB •  Use a stored procedure (in this day and age? yes!) •  Many code optimizations and bug fixes
  19. 19. Active Record Optimization •  These tips apply only to code that has to run very fast •  Always try to hit the indexes in the where clause; create them if necessary •  Don’t do select * if you can avoid it: Driver.where(id: 10) ✗ Driver.select('cedula, name').where(id: 10) ✓ •  Load info in batches (activerecord-import gem): Driver.import(drivers) ✓
  20. 20. Fake Separation of Concerns •  We learned: concerns should not be mixed •  Ideally, an endpoint should have a single responsibility •  If it’s hard to separate concerns in existing endpoints (because changing clients is not possible), at least do the other concerns less frequently:
        def retrieve_individual_messages?
          uniform_random_chosen?(1.minute)
        end
        def uniform_random_chosen?(time)
          n = time.to_i / LOCATION_FREQUENCY
          rand(1..n) == n / 2
        end
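The sampling trick on this slide can be tried outside Rails. The following is a minimal sketch under stated assumptions: `LOCATION_FREQUENCY = 8` is a hypothetical value, `60` replaces ActiveSupport's `1.minute`, and the `rng:` parameter and `simulate` helper are additions so the behaviour can be checked with a fixed seed. It also adds a guard for `n <= 1`, a case where comparing against `n / 2` would never match.

```ruby
# Hypothetical value: seconds between two location calls from the same client.
LOCATION_FREQUENCY = 8

# Returns true with probability 1/n, where n is the number of calls that fit
# into `period_seconds`. On average the extra work then runs once per period.
def uniform_random_chosen?(period_seconds, rng: Random)
  n = period_seconds.to_i / LOCATION_FREQUENCY
  return true if n <= 1 # fewer than two calls per period: always do the work
  rng.rand(1..n) == 1   # any fixed value in 1..n is hit with probability 1/n
end

def retrieve_individual_messages?(rng: Random)
  uniform_random_chosen?(60, rng: rng) # 60 s stands in for 1.minute
end

# Counts how often the secondary concern would run over `calls` requests.
def simulate(calls, seed: 42)
  rng = Random.new(seed)
  calls.times.count { retrieve_individual_messages?(rng: rng) }
end

hits = simulate(70_000)
puts "secondary concern ran in #{hits} of 70000 calls (expected about 1 in 7)"
```

Because the choice is random rather than counter-based, no per-client state is needed on the server, at the cost of an uneven gap between two consecutive runs.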
  21. 21. PHASE III
  22. 22. Eternal vigilance is the price of scalability •  Let’s not get too smug! The last “Day Without Car” hit us hard •  Over 2,300 requests per second •  Forced us to rethink and optimize well-understood processes •  PostgreSQL 9.6.2 + PostGIS •  DB and API on FreeBSD 11, ZFS for DB •  Upgrade to Ruby 2.2.7 – relatively simple, a couple of dependencies caused (minor) trouble. Big performance wins! •  Rails 4.2.8 – hard to upgrade, needs to be done step-by-step: 3.2 -> 4.0 -> 4.1 -> 4.2
  23. 23. Engineering Improvements •  Stress simulation on the Rails side, Unit (JUnit + Mockito) and Instrumentation (Espresso + UI Automator) tests on the apps side •  Delete lots of unused code, tables, (useless or redundant) indexes and columns - coverage tools like simplecov help a lot •  Eliminate sources of data inconsistencies, like badly-implemented data caches and lack of DB locks •  A new and modern driver app, written in Kotlin from scratch •  True separation of concerns - different functionality implemented in different endpoints •  Truly RESTful endpoints, with simple and well-defined contracts, correct usage of HTTP verbs and error codes
  24. 24. Architecture Improvements •  We learned: not every process should be synchronous •  Split asynchronous or real-time processes into their own Elixir microservices •  We learned: a relational DB + stored procedures does not scale well under high concurrency •  Move the most frequently queried data to its own optimized, special-purpose storage: Tile38, a geospatial DB written in Go
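To make the query behind the scalability challenge concrete (which drivers are close to a given passenger), here is a small in-memory sketch in plain Ruby. It is not Tile38's API and not Tappsi's implementation: the `Driver` struct, the `nearby_drivers` helper, the coordinates and the 500 m radius are all assumptions chosen for illustration. It does a linear scan with the haversine formula, which is exactly the kind of work a special-purpose geospatial store avoids by indexing positions.

```ruby
EARTH_RADIUS_M = 6_371_000.0 # mean Earth radius in meters

Driver = Struct.new(:id, :lat, :lon)

# Great-circle distance in meters between two (lat, lon) points in degrees.
def haversine_m(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt([a, 1.0].min)) # clamp rounding error
end

# Drivers within `radius_m` of the point, closest first (linear scan).
def nearby_drivers(drivers, lat, lon, radius_m)
  drivers
    .map { |d| [d, haversine_m(lat, lon, d.lat, d.lon)] }
    .select { |_, dist| dist <= radius_m }
    .sort_by { |_, dist| dist }
    .map(&:first)
end

# Hypothetical positions, roughly 111 m, 1.1 km and 33 m north of the passenger.
fleet = [
  Driver.new("a", 4.6510, -74.0600),
  Driver.new("b", 4.6600, -74.0600),
  Driver.new("c", 4.6503, -74.0600)
]
puts nearby_drivers(fleet, 4.6500, -74.0600, 500).map(&:id).inspect
```

A scan like this costs one distance computation per driver per query, so it only illustrates the question being asked, not a way to answer it at thousands of requests per second.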
  25. 25. Tile38 Architecture
  26. 26. Server Tuning •  The database still had performance issues. What to do? Tune up the servers! •  The API servers opened a very high number of DB connections, which overloaded the DB •  The reason? Too many workers and too many threads per worker •  Solution: lower the number of processes and threads; this avoids lock contention and makes better use of the CPU •  Now the number of threads matches the number of database connections in each Puma worker!
  27. 27. Server Tuning •  Sample configuration, works for us (YMMV):
        PUMA_WORKERS=10                  # one worker per core
        PUMA_THREADS_MIN=6               # exactly 6 threads per worker
        PUMA_THREADS_MAX=6
        DB_POOL=6                        # connection pool size
        DB_TIMEOUT=5000                  # statement execution timeout
        DB_STATEMENT_TIMEOUT=10000       # prepared statement timeout
        DB_MAX_PREPARED_STATEMENTS=200   # limit number of prepared statements
        instances * workers * threads = db connections
      •  Move from a compute-optimized DB instance (c4.8xlarge) to a memory-optimized instance (r4.4xlarge)
      •  The really slow I/O on AWS forces us to use ZFS to keep data in RAM
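The formula on this slide, instances * workers * threads = db connections, is easy to evaluate for the sample configuration. The sketch below is plain arithmetic; the worker and thread values come from the slide, while the instance counts (6 and 10) and the helper names are assumptions used only to show how quickly the total grows.

```ruby
# Upper bound on simultaneous DB connections opened by the API tier.
def db_connections(instances:, workers:, threads:)
  instances * workers * threads
end

# Each Puma worker needs at least one pooled connection per thread,
# otherwise threads block waiting for a connection.
def pool_covers_threads?(threads_max:, db_pool:)
  db_pool >= threads_max
end

workers = 10 # PUMA_WORKERS from the slide
threads = 6  # PUMA_THREADS_MAX from the slide
db_pool = 6  # DB_POOL from the slide

[6, 10].each do |instances| # hypothetical instance counts
  total = db_connections(instances: instances, workers: workers, threads: threads)
  puts "#{instances} instances -> up to #{total} DB connections"
end
puts "pool covers threads: #{pool_covers_threads?(threads_max: threads, db_pool: db_pool)}"
```

Under these assumptions the database has to accept 360 connections at 6 instances and 600 at 10, which is why trimming workers and threads per instance has such a direct effect on the DB.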
  29. 29. General Guidelines •  Implemented in Elixir •  We learned: not to rewrite all production code as microservices from scratch, no matter how ugly it is •  We learned: not every process has to be implemented as a microservice, because microservices come at a premium •  We learned: not to use the same abstractions in Elixir as we would in Ruby •  Bottom line: transactional, stateful data is really tricky to separate into a microservice; stateless, asynchronous processes are better candidates for microservices
  30. 30. Microservices Architecture
  31. 31. RESULTS
  32. 32. The Results •  Today, the location endpoint is called every 8 seconds (4 during stress tests). Remember? It used to be called every 45 seconds! •  During peak hours, we get as many as 1,800 requests per second (2,700 during stress tests) •  All this, with an average response time of 13 ms, tops •  To get an idea of what this means, take a look at these stats:
        Metric                              02/09/2014   05/09/2017
        Average Acceptance Time (seconds)   46           21
        Average Arrival Time (seconds)      219          203
        Average Distance (meters)           435          340
      •  In the end, what truly matters – happier passengers and drivers!
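To put the change in polling interval into numbers, the arithmetic can be written out as a short Ruby sketch. Only the intervals (45, 8 and 4 seconds) come from the slides; the fleet size of 9,000 clients is a made-up figure for illustration, and `requests_per_second` is a hypothetical helper.

```ruby
# Steady-state request rate when every client calls once per interval.
def requests_per_second(clients, interval_seconds)
  clients.to_f / interval_seconds
end

clients = 9_000 # hypothetical number of connected clients

[45, 8, 4].each do |interval|
  rate = requests_per_second(clients, interval)
  puts "every #{interval} s -> #{rate} location requests per second"
end

# Per-client load multiplier when going from a 45 s to an 8 s interval.
puts "load factor 45 s -> 8 s: #{45 / 8.0}"
```

Whatever the real fleet size, moving from 45 to 8 seconds multiplies the location traffic per client by 5.625, so the lower response time was achieved while the endpoint took several times more calls.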
  33. 33. Global Performance
  34. 34. Current location Performance
  35. 35. Grafana API Metrics
  36. 36. Grafana DB Metrics
  37. 37. THANK YOU!