On-call is hell
The limits of Moore’s law were catching up with us, and we couldn’t just rely on vertical scaling to get more compute
A lot more data was also being produced that needed to be computed over; disks were still pretty slow, but we wanted that data fast
Commodity hardware and GFS: 2003, Google MapReduce: 2004, Hadoop: 2005, BigTable: 2006, Dynamo: 2007
Distributed systems were mostly storage/data-related and still written by systems engineers
SQS beta: 2004, AWS: 2006, Heroku: 2007
Developers suddenly had the keys to the kingdom of what had once been ops-only territory
DevOps
Yes, it is a lot of things, a lot of good things, but it’s hard to argue it would be as successful or popular without developers being able to do a lot more without operations specialists
A Monorail monolith is easy to get spun up and going quickly, but hard to scale to lots of developers, especially given that it’s legacy early-stage startup code
Moving out of a monolith into services is actually a great way to rewrite a system, so many of us did just that!
Our devs don’t want to talk to each other, and it’s easy for them to spin up new hardware to play with, leading to…
AAAAHHHHHHHHHH
Instead of one complex piece of software making up a relatively simple overall system, we have a million simple systems creating one hugely complex overall system
A distributed system written by application developers
All these principles that we must be following to create good microservices architectures…
Lower operations startup costs ->
Devs playing ops on unstable hardware (cloud) +
Monoliths slowing us down and leading us to services ->
Application developers writing distributed systems
Under conditions of…
I really hate that so much of the time, talks on microservices and devops make me feel like I should know what to do, and like the fact that I’m not doing it means my pain is my own doing
OK pause. What the hell do we do in this imperfect situation? How does monitoring fit into all of this?
Break down monitoring: why do I care? What do I need? What does it mean?
We knew a few things when we were doing those three-tier architectures…
Testing is a good thing! Your customers should not be treated as testers for you! Generally speaking, the fewer incidents in prod due to bugs, the better!
Testing helps you lower incidents caused, at minimum, by errors in the application logic. It also, of course, drastically lowers the risk of change and makes your code infinitely more maintainable.
Monitoring lets us take on a little bit more risk and release code faster, because the risk of change is reduced.
BOTH ARE NECESSARY.
No, I don’t mean write clear pom files
What are the downstream dependencies?
What does healthy look like? (sketch below)
Write runbooks!
Force the developers who wrote the systems to get them to a state where they can be supported by others!
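A minimal sketch of what answering “what does healthy look like?” can mean in code, assuming a Dropwizard-style Java stack (the Database interface here is made up; substitute whatever dependency actually defines healthy for your service):

    import com.codahale.metrics.health.HealthCheck;

    // "Database" is a made-up stand-in for whatever downstream dependency
    // actually defines "healthy" for this service.
    interface Database {
        boolean isConnected();
    }

    public class DatabaseHealthCheck extends HealthCheck {
        private final Database database;

        public DatabaseHealthCheck(Database database) {
            this.database = database;
        }

        @Override
        protected Result check() {
            if (database.isConnected()) {
                return Result.healthy();
            }
            return Result.unhealthy("Cannot connect to the database");
        }
    }

Register it with whatever healthcheck endpoint your framework exposes, and the first line of the runbook can be “hit the healthcheck and read what it says.”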
Feature Flags! (sketch below)
Runbooks!
Not DevOps? I don’t care! You are a small startup. People will quit without warning, and you are in an environment with less shared knowledge between teams; you HAVE to make things able to be handed off!
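What I mean by feature flags, as a minimal sketch in plain Java (the names are made up; in practice the flag values live in config or a database row you can flip without a deploy):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // A deliberately boring flag store: on-call can flip a flag at 3am
    // instead of rolling back a deploy.
    public class FeatureFlags {
        private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

        public void set(String name, boolean enabled) {
            flags.put(name, enabled);
        }

        public boolean isEnabled(String name) {
            return flags.getOrDefault(name, false); // unknown flags default to off
        }
    }

The new, scary code path gets wrapped in if (flags.isEnabled("new-checkout-path")) { ... } else { ... }, and the runbook says which flags to turn off when things go sideways.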
This is bad when done for storage layers, and bad when done with monitoring tools!
No one can learn all these tools! You don’t need them! Your Java devs refuse to learn an IDE because they started off in vi; do you expect them to learn 5 different monitoring tools for their on-call rotations?
Pay for good tools; it is probably cheaper than paying for the developer/ops time and mental overhead of running them internally
Powerful general-purpose tools can approximate a lot of the specialized things you need
Keep history. Look at it.
The goal is to know the important things, and to be able to identify when things changed
Know your trends
HTTP response codes
Performance
User Behavior
Alert intelligently on shit that is out of trend
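A sketch of what “keep history, know your trends” can look like in code, assuming a Coda Hale/Dropwizard Metrics-style setup (the metric names and handler are made up; the registry reports to whatever graphing and alerting backend you already use):

    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class CheckoutHandler {
        private final Meter serverErrors;   // rate of 5xx responses over time
        private final Timer requestLatency; // latency distribution for this endpoint

        public CheckoutHandler(MetricRegistry registry) {
            this.serverErrors = registry.meter("checkout.responses.5xx");
            this.requestLatency = registry.timer("checkout.requests");
        }

        public void handle() {
            try (Timer.Context ignored = requestLatency.time()) {
                // ... actual request handling goes here
            } catch (RuntimeException e) {
                serverErrors.mark(); // record the failure, then let it propagate
                throw e;
            }
        }
    }

The point isn’t this particular library; it’s that every service ships response codes, latency, and key user behaviors somewhere with history, so “out of trend” becomes something a machine can alert on.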
For monitoring, you invert the normal system architecture because at the end of the day YOUR CUSTOMER is the FOUNDATION
If your customers are generally continuing to operate as normal, your team might be suffering, but it is less urgent (and your team may not need to treat the problem as a 5-alarm fire)
Robust distributed systems are harder to write than robust single-machine systems
True for your stack, true for open source stacks, true for vendor stacks
Plan for failure, try to degrade elegantly, but don’t OVERPLAN for failure too early
Poor error handling is the death of distributed systems
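One concrete flavor of “plan for failure and degrade elegantly,” as a sketch using Java 11’s HttpClient (the recommendations service, URL, and parsing are made up): a non-critical downstream call gets a short timeout and a safe fallback instead of taking the whole request down with it.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;

    public class RecommendationsClient {
        private final HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))
                .build();

        public List<String> recommendationsFor(String userId) {
            // Hypothetical internal endpoint; the point is the timeout + fallback.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://recommendations.internal/users/" + userId))
                    .timeout(Duration.ofMillis(300))
                    .GET()
                    .build();
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return List.of(response.body().split(","));
                }
                return List.of(); // downstream unhappy: degrade to "no recommendations"
            } catch (Exception e) {
                // Log it, count it in your metrics, and move on; the page still renders.
                return List.of();
            }
        }
    }

The flip side of “poor error handling is the death of distributed systems”: silently swallowing errors for a truly critical dependency would be exactly the wrong move, so decide per dependency whether to fall back or fail loudly.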
Really don’t love this. You don’t have to be perfectly fault-tolerant to get value from microservices/SOA
If literally any system can fail and take everything down, fine. But especially for smaller businesses, you are probably going to have critical systems whose failure essentially means the business stops. For us, it’s reservations. That’s OK.
This is a piece of over-engineering we did at RTR. Goals: high-volume traffic, fault isolation, fault tolerance, geodiversity. The reality was long deploys, devs sitting on branches for long periods of time, and a way-too-high barrier to change for our scale. It wasn’t necessary! We moved our frontend back to a monolith (keeping the fault tolerance of the backend microservices for now)
Take a breath, it’s ok, you don’t have to be a perfect culture or perfectly enlightened being, enjoy the ride, and test your freaking code