2. Web Startup Challenges
• Low-friction development
• Hodgepodge of technologies
• Hodgepodge of infrastructures
• Legacy support
• Constant migrations and upgrades
• Bottom line:
High rate of change and no time to check!
4. A Gordian Knot
• How utilized is our Hadoop cluster?
• How utilized is our DC?
• Are all of our services running correctly?
• Is our latency OK at every layer in the stack?
• Someone changed something; were there any negative ripple effects?
• Are we hitting any scaling issues?
5. A Network Knot
• Our products live on the internet
• Our data centers are global
– Some of them are virtual
• Network effects are a fact of life
– Network partitions
– Latency makes information late
– Noise is natural and frequent
– Data just goes missing
– High availability compounds the problem
9. Solution Design
• Hypothesize the existence of system state
– a time-varying stream of state components
• Build it by measuring our systems in toto
• Stream all measurements to one place
• Gain insight by inspecting this stream, computationally and ad hoc
11. Collecting State
• Define a state event ADT (sketched below) capturing:
– Host
– Service
– State
– Timestamp
– Any additional key/value fields
• Find something to collect it
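As a minimal sketch, such a state event might look like this in Python (field names follow the list above; the example values and the `extra` bag are illustrative assumptions):

```python
import time
from dataclasses import dataclass, field

@dataclass
class StateEvent:
    """One component of global system state at a point in time."""
    host: str          # machine that produced the measurement
    service: str       # what was measured, e.g. "disk /var"
    state: str         # coarse status, e.g. "ok", "warning", "critical"
    time: float = field(default_factory=time.time)  # unix timestamp
    extra: dict = field(default_factory=dict)       # additional key/value fields

# Example: a latency measurement from a web host (values are illustrative).
event = StateEvent(host="web-03", service="http latency",
                   state="ok", extra={"metric": 0.042, "dc": "ams"})
```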
12. Riemann
• Riemann accepts state events as a stream (client sketch below)
• Riemann indexes the stream, provides stream
processing facilities and some alerting tools
• Also provides downstream pipes:
– Unix domain sockets
– Web sockets
– Graphite stream comes free
– Create your own
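To make the stream concrete, here is a hedged sketch of pushing one such event into Riemann from Python, assuming the bernhard client library (any client that speaks Riemann's protocol would do; host, port, and values are illustrative):

```python
import bernhard  # pip install bernhard -- a simple Python Riemann client

# TCP client pointed at the Riemann server (5555 is Riemann's default port).
client = bernhard.Client(host="localhost", port=5555)

client.send({
    "host": "web-03",
    "service": "http latency",
    "state": "ok",
    "metric": 0.042,      # optional numeric measurement
    "tags": ["network"],  # optional tags usable by stream processing
    "ttl": 60,            # seconds before the index expires this event
})
```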
13. Internal State Relays
• Poll third party monitors for state
• Map to Riemann events
• Send to Riemann
• Fill in holes with custom monitors
– Hadoop jobs, load balancer state, etc.
• Foundation in place to know everything about
our global DC state
14. Network Monitors
• Static monitors around the world
– Constantly check HTTP state of services
• Poll third party monitors (Pingdom, etc.)
• Deduce network state from aggregate streams
• Detect outages from user perspective
• Can extend with PhantomJS to get Gomez-like waterfalls and do whatever we want!
15. Demo Time
• Ad hoc demo
– Grep the stream
– Quickly analyze state of disk utilization
• Hadoop global state
– It just pipes Nagios data!
• Network monitoring demo
– Let’s combine Pingdom + network monitors
– And iterate! Awesome dashboard
16. Distributed Gotchas
• Riemann can scale, but some nasty surprises
– Events on a TCP connection are processed serially
– If the event rate gets too high, the stream saturates and backs up into OS network buffers, then into Netty’s unbounded buffers; this ultimately starves the heap and crashes Riemann
– The solution is large connection pools in the clients that push events (sketched below)
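A minimal sketch of that client-side pool, round-robining events over several TCP connections so no single serially-processed stream becomes the bottleneck (bernhard and the pool size are assumptions):

```python
import itertools
import bernhard

class PooledRiemannClient:
    """Spread events over N TCP connections; Riemann processes each
    connection's events serially, so more connections = more parallelism."""

    def __init__(self, host="riemann.internal", port=5555, pool_size=8):
        self.pool = [bernhard.Client(host=host, port=port)
                     for _ in range(pool_size)]
        self._rr = itertools.cycle(self.pool)  # round-robin over the pool

    def send(self, event):
        next(self._rr).send(event)

client = PooledRiemannClient(pool_size=8)
client.send({"host": "web-03", "service": "heartbeat", "state": "ok", "ttl": 10})
```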
17. Distributed Gotchas
• Network outages and partitions are difficult
– Riemann must not go down
– Riemann must deal with split-brain
• Highly available SRE solution planned
– Virtual IP, heartbeat (similar to the LB solution)
• Riemann servers in separate locations
– End up with two masters on a partition => double the alerts, but at least we get something
At no point can we sit down, sift through our architecture, and declare this situation an error and that one OK. We cannot classify things like that, because the classifications become defunct within a month, sometimes within days. Sure, we can do it for certain things, but for most application-level stuff we have no way to. We have to somehow monitor *everything* and figure out from that what went wrong. Note that this requires us to be experts at every level of the system, as Bilke covered in the last presentation.
Let’s take a look at some things we may want to know. These are gnarly but super important questions.
Our life is complicated by the distributed nature of our systems, so we need to ensure that whatever solution we have takes into account the network.
Here are some existing solutions we have tried over the years.
However, our experience is that these do not work. Each solves a different problem, sometimes very well, but they all fail to answer the knotty questions about the overall system. We have to drill down into many of these applications to get an idea of what the heck is going on. I don’t know about you, but I’m getting log-in fatigue whenever a problem happens. And the situation is getting worse with all these pay-ware hosted third-party solutions.
So is there a better way? We need to clear our minds of these approaches and look at the fundamental problem from a fresh perspective.
If we really get back to basics, we’re talking information theory, computer science, really thinking about the problem as far down as we need to go. And I’m not being academic: Hamming’s quote illustrates a highly pragmatic wisdom despite his heavily mathematical work. It’s also quite on topic: we will take a deep look at what we’re really trying to do here, and come up with a solution design that considers our desire for insight and how we can piece it together numerically from our chaotic mess of systems, people, and processes.
Each of the existing tools we just swiped off the table purported to yield insight from some data, but they somehow failed to tell us what we need to know: the state of our system. Let’s look at a solution design built around this so-called state of our system. (read slide) Much of this was motivated by a project called Riemann, which was designed by a physics nerd. In science, when you model something, you choose to represent the system as state vectors in some convenient topological space, and then you run gnarly computations to see if the model matches reality. This is a powerful approach that has consistently yielded great insights into the nature of the universe. We will repeat that process here because, hey, our computer systems are a subset of the universe.
This makes it straightforward to implement, debug, scale and maintain.
The point of all this is to be able to operate on the stream as needed. Note that you don’t need to write Clojure code to do this; you can simply open a socket and stream it into Python or whatever. Later on there will be demos that I cobbled together using JavaScript over WebSockets.
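For instance, here is roughly how the ad hoc "grep the stream" demo looks from Python, subscribing to Riemann's websocket server (5556 is its default port; the websocket-client library and the disk query are assumptions):

```python
import json
from urllib.parse import quote
import websocket  # pip install websocket-client

# Subscribe to every indexed event whose service starts with "disk",
# e.g. to eyeball disk utilization across the fleet.
query = quote('service =~ "disk%"')
url = "ws://riemann.internal:5556/index?subscribe=true&query=" + query

ws = websocket.create_connection(url)
while True:
    event = json.loads(ws.recv())  # one JSON-encoded event per message
    print(event["host"], event["service"], event["state"], event.get("metric"))
```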
What about monitoring the data center? It turns out we don’t have to reinvent the wheel. Monitoring systems like Nagios and New Relic each have an API that allows us to poll their state and map it to Riemann-friendly events. This is great because we can leverage existing expertise in those monitoring systems and get a huge return right off the bat.
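Such a relay is essentially a poll/map/send loop. A sketch under stated assumptions: the monitor URL and its JSON schema below are hypothetical, and bernhard again stands in for a Riemann client.

```python
import time
import requests   # HTTP library for polling the third-party monitor
import bernhard   # Riemann client

riemann = bernhard.Client(host="riemann.internal", port=5555)

def poll_and_relay():
    # Hypothetical third-party monitor API; the response shape is assumed.
    checks = requests.get("http://monitor.internal/api/checks").json()
    for check in checks:
        # Map the foreign schema onto a Riemann-friendly event.
        riemann.send({
            "host": check["hostname"],
            "service": check["check_name"],
            "state": "ok" if check["status"] == 0 else "critical",
            "ttl": 120,  # survive a couple of missed polls before expiring
        })

while True:
    poll_and_relay()
    time.sleep(30)   # poll interval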
Pingdom is great, but it lacks some features, such as telling us what the network state is in general. We can deduce the network state by creating our own series of monitors. This also gives us a platform to replicate the latency waterfalls for web pages, as done by Gomez and Akamai.
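Each static monitor can be little more than a timed HTTP check that emits a state event tagged with its vantage point. A sketch; the thresholds, hosts, and URLs are illustrative:

```python
import time
import requests
import bernhard

riemann = bernhard.Client(host="riemann.internal", port=5555)

def check_http(url, vantage_point):
    """Measure one URL from this monitor's location and emit a state event."""
    start = time.time()
    try:
        resp = requests.get(url, timeout=5)
        latency = time.time() - start
        state = "ok" if resp.status_code == 200 and latency < 1.0 else "warning"
    except requests.RequestException:
        latency, state = None, "critical"  # unreachable from this vantage point

    event = {"host": vantage_point,  # the monitor's location, not the target
             "service": "http check " + url,
             "state": state,
             "ttl": 60}
    if latency is not None:
        event["metric"] = latency    # aggregate streams let us deduce network state
    riemann.send(event)

check_http("https://www.example.com/health", "monitor-tokyo")
```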
Demo time
I wrote earlier that the Riemann Java client was lousy. The network monitors have to reconnect on timeout, but that wasn’t supported, so I implemented my own connection logic with a single TCP connection and got burned rather nicely by it. So now I have to either contribute to the Java client or roll my own. Exciting stuff!
It’s too soon to say, but I have been using this system during recent outages and it’s starting to look quite useful. Expect a follow-up covering the problem of insight and whether this kind of streaming state processor helps at all. There are some additional preliminary and exciting ideas that I haven’t covered here; it’s shaping up to be an interesting body of work. Finally, who would have known: monitoring seems like such a dry topic, until you realize it’s actually very deep.