Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Operational Software Design

5,233 views

Published on

There are two common tenets of operations: "hell is other people's software," and "better software is produced by those forced to operate it." In this session I'll take a fly-by-tour of two pieces of software that were built from the ground up for operability from the hard-earned teachings of their inoperable predecessors: a distributed datastore replacing PostgreSQL, and a message queue replacing RabbitMQ.

We'll discuss specific design aspects that increase resiliency in the event of failure and observability at all times.

Published in: Technology

Operational Software Design

  1. 1. Theo Schlossnagle, Founder @Circonus, @postwait on Twitter Operational Software Ignorance Buys Pain https://www.flickr.com/photos/hoyvinmayvin/4906678960
  2. 2. Tenet #1 Designing for the unknown should never burden today Compromise when forced. Design with tomorrow’s expected volume, but today’s functional requirements. https://www.flickr.com/photos/alexander/20090860
  3. 3. Tenet #2 Never be left curious as to what your software is doing Observability is the only fast-path to success. https://www.flickr.com/photos/jox1989/4764186425
  4. 4. Tenet #3 Understand what your software was doing This is one of the trickiest tenets. You can’t log every instruction performed, every packet sent & received, every context switch, every I/O. Log things that are infrequent, measure things that are frequent. What’s frequent? I said this was tricky. https://www.flickr.com/photos/guest_family/6175062186
  5. 5. Tenet #4 Understand internal failures The only thing worse than a malfunction in production, is one without sufficient actionable data. Core dump, stack trace analysis, etc. These can wake people up. https://www.flickr.com/photos/skipthefiller/2451481428
  6. 6. Tenet #5 Operator remediation of an external failure is a bug A system failing, a disk failing, a network partition, etc. Once the external failure is remediated, no human being should be involved in returning the (internal) system to correct operation. https://www.flickr.com/photos/timzim/2308099322
  7. 7. Tenet #6 & #7 Tight coupling
 reduces resiliency Loose coupling
 reduces debug-ability
  8. 8. Tenet #8 Avoid difficult problems
 if possible One does not “just add” replication, consensus and fault tolerance into systems. Never solve a harder problem than you are presented. The best solution to a problem is to remove the problem. https://www.flickr.com/photos/shakestercody/2124972276
  9. 9. A tail of two systems Queueing Storage https://www.flickr.com/photos/skynoir/8300610952
  10. 10. From RabbitMQ to Fq RabbitMQ: Oh, the outages we’ve had. violates #1, #2, #3, #5, #8 Fq: (#1) build only semantics we need (#2/#3) DTrace probes (#4) cores dumps and backtrace.io (#5) handled by simplicity of design and use (#6/#7) push decoupling out, gain debug-ability (#8) punt clustering downstream to clients https://www.flickr.com/photos/fotologic/2165675515
  11. 11. ns level deep routing
  12. 12. From Postgres to Snowth Postgres 8… hacked in a column store using arrays Tenet #3: build pg_statsd Tenet #5: … build Snowth Snowth: ring topology time-series store (#1/#8) only commutative operations, (#2) DTrace probes (#3) Circonus itself (statsd, histograms, etc.) (#4) backtrace.io (#7) implemented Zipkin (Dapper-like) tracing
  13. 13. quartile bands
  14. 14. “mvalue” best-guess modes

×