Operations-Driven Web Services at Rent the Runway


Published on

From a meetup on 8/26, Rent the Runway's operations-driven service infrastructure including an overview of why we use Dropwizard

Published in: Technology, Education

Operations-Driven Web Services at Rent the Runway

  1. 1. Operations Driven Web Services -A Case Study of Service Evolution at Rent the Runway Camille Fournier, Head of Engineering @skamille Carlo Barbara, Senior Systems Engineer @CarloBarbara
  2. 2. In The Beginning, There Was Drupal
  3. 3. There was also all of these folks…
  4. 4. Can‟t Just Burn the World Down
  5. 5. Hollow It Out!
  6. 6. Hollow It Out!
  7. 7. Hollow It Out!
  8. 8. Hollow It Out!
  9. 9. Complexity 0 2 4 6 8 10 12 14 Dec-11 Jan-12 Feb-12 Mar-12 Apr-12 May-12 Jun-12 Jul-12 Aug-12 Sep-12 Oct-12 Nov-12 Dec-12 Jan-13 Feb-13 Mar-13 Apr-13 May-13 Jun-13 Jul-13 Number of Services in Production
  10. 10. Operations first…  Availability and performance of our services is critical to running our business  The software we develop has to make delivering on our SLAs possible  How (besides sane design):  Healthchecks + Nagios  Measurements  Historical Data with Graphs
  11. 11. Metrics  Gauges – instantaneous value  Counters – counter with +/-  Meters – rate over time (mean, 1, 5, & 15 moving avg.)  Histograms – distribution of data (mean, median, max, std. div., 75th, 90th, 95th, 98th, 99th, & 99.9th percentiles)  Timers – Meter of requests & Histogram of duration (frequency & latency)
  12. 12. Metrics - Healthchecks  Verify that your service is running correctly
  13. 13. Metrics - Reporting  HTTP  JMX  Graphite
  14. 14. Dropwizard: What is it?  Quality open source Java webservice components glued together in a modular way  Eliminates the need for picking a platform stack, it‟s all there  It‟s opinionated. If you don‟t like a Dropwizard core component, that‟s too bad, don‟t use Dropwizard  Developers focus on business logic, not framework  It‟s easy, maintainable, and it works!
  15. 15. A Few Words from Coda… “I had no one I had to toss a WAR to. I had no one to stand up a Tomcat server and fiddle with it until their eyes bled. I had no one who didn't trust me to spin up my own threads or connection pools. So I wrote something which worked as simply and in as straight- forward a manner as possible because my own ass was on the line if it didn't work.”
  16. 16. Dropwizard: The Ingredients  Jersey for REST  Jackson for JSON  Jetty for a webserver  Metrics for measuring  YAML for configuring  Dropwizard for weaving everything together
  17. 17. Dropwizard – Healthchecks  Register hooks that check the health of your app  An HTTP endpoint that iterates over all the hooks  “The meaning of healthy” is decided by you (i. e. Database Connections, Client Connections, DeadLock Count)
  18. 18. Dropwizard + Metrics  Dropwizard has lots of platform instrumentation baked in using Metrics, happens for free! (i.e. Jetty, JVM, Log Counts, etc…)  Ability to add Timers to your endpoints with @Timed  Ability to add arbitrary metrics as you see fit
  19. 19. Other Frameworks  Play 1.X  Abandonware for Play 2.X, which was still beta  Magic  Glassfish  OSGI hell  “standards”  Spring  Everything and the kitchen sink  Also I hate XML
  20. 20. What do I get out of it? Dev agenda  Story telling: causation & correlation  Integral piece of the operational excellence puzzle  State of the world – Dashboards  Developers focus on features, operations is mostly free lunch  Code review & demo Disclaimer: You need graphite to really harness the value
  21. 21. Story telling  The grid is slow why?  Is it load?  Is it dependent service latency?  How does that compare to yesterday  JVM throws out of memory, what‟s the problem?  What does the GC jigsaw look?  When did it change?  Is it correlated with increased load?  How is that new „performance‟ tweak?  If you never measured, then you didn‟t tune. True story!  What does my 5XX graph look like?
  22. 22. Operational Excellence: The ingredients  Application Instrumentation (Dropwizard)  Time Series Data & Graphing (Graphite, D3)  Centralized logging & log parsing (Rsyslog, Logstash, Nagios)  Automated alerting & escalation (Pagerduty) DW & Graphite will get you very far, but if you want total control & visibility you need the rest. This is the stack that RTR is moving towards, rather than relying on basic java logging smtp appenders
  23. 23. OMG, we are on GMA, are we OK?  10+ services  Each services runs in a cluster behind an LB  „OK‟ is somewhat service specific Basically you need a lot of info at your fingertips. Pictures are worth a thousand words. Get yourself some dashboards!
  24. 24. Graphite Dashboard
  25. 25. Tasseo dashboard (D3) • Red, Yellow, & Green Lights • Realtime • Endless cool things: graphite + D3 If we see yellow or red, start diagnosing
  26. 26. Free Lunch? Not really  DB connection pool monitoring  Http client connection pool monitoring  JVM Heap & GC info  Http Server response counts  Http Server connection info  Endpoint duration & throughput stats
  27. 27. Where do I sign up?  You install Graphite, one time hit + some TLC. Medium Difficulty  You annotate your endpoints and maybe add finer telemetry. Easy  You configure so your service is feeding into graphite. Hopefully consistently across services, via a „Bundle‟. Easy
  28. 28. Demo  Show a simple dropwizard codebase  Do some curls  Show the admin endpoints
  29. 29. References  dropwizard.codahale.com  metrics.codahale.com  graphite.wikidot.com
  30. 30. Presenters  @CarloBarbara (www.cabkata.com)  @Skamille (whilefalse.blogspot.com)  Rent The Runway is hiring! (renttherunway.com/careers)