Monitoring modern, real-time distributed infrastructure is complex and expensive. In this talk we explore Riemann, and specifically how Riemann's low latency helped us get real-time metrics from our distributed systems.
8. Types of Monitoring
● Customers: the good ones, who email or chat with support to say
“Dear XYZ, your application's foobar module is not working. Please check.”
9. Challenges in Distributed Systems Monitoring?
● Hundreds of machines.
● Hundreds of thousands of metrics
every second.
● Metrics to Monitor?
● Metrics to set Alert?
● Alert Frequency?
10. Challenges in Distributed Systems Monitoring?
● Storage of useful metrics
● Real Time metrics
● Monitoring Cost
● Informative Dashboards
11. What is Riemann?
In short: “Riemann is an event aggregator.”
● A monitoring tool that aggregates events from servers and applications.
● Riemann uses a powerful stream processing language, written in Clojure, to
aggregate events.
12. Why Riemann?
● Written in Clojure.
● Low-latency event processing and monitoring engine.
● Streams are Clojure functions, which makes Riemann highly adaptable.
● The Riemann configuration file is a Clojure program.
13. Why Riemann?
● Monitoring as code.
● Can monitor anything.
● Comes with its own instrumentation: it measures its own performance.
● Can send alerts via email, chat, SMS, and more.
● Can connect to back-end time series databases such as InfluxDB and Graphite
to store historical metrics.
15. What Does an Event Look Like? A Clojure map
(immutable, of course)
16. Riemann Events
● Events in Riemann are the base construct.
● Riemann receives events and processes them.
● Event fields are referred to by keywords in the config, such as :host, :service, and :tags.
● Apart from the standard fields, custom fields can also be sent in the event.
18. Riemann Streams
● Streams are Clojure functions that we can define.
● Streams are defined in the streams section of the Riemann config file.
● Streams can have a child stream.
● Events get passed to the streams for aggregation, modification and alerting.
● A Riemann config can have as many streams as needed.
19. Riemann Indexes
● Table of current state of all services tracked by Riemann.
● Each event is uniquely indexed by its host and service. The index just keeps
track of the most recent event for a given (host, service) pair.
● Indexed events can have a TTL (time to live); the index expires them once it elapses.
The event is the base construct of Riemann. Events flow into Riemann and can be processed, counted, collected, manipulated, or exported to other systems. A Riemann event is a struct that Riemann treats as an immutable map. Inside our Riemann configuration, we’ll generally refer to an event field using keywords. Remember that keywords are often used to identify the key in a key/value pair in a map, and that our event is an immutable map. Keywords are identified by their leading colon, so the host field is referenced as :host. A Riemann event can also be supplemented with optional custom fields. You can configure additional fields when you create the event, or add them as the event is being processed; for example, you could add a field containing a summary or a derived metric.
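As a rough illustration of the notes above, an event can be pictured as a plain Clojure map. The host name, service name, values, and the custom :datacenter field below are made up for the example; inside Riemann the event is a record rather than a literal map, but it is read the same way.

```clojure
;; A hypothetical event, written as a plain Clojure map.
(def event
  {:host        "web01"
   :service     "cpu"
   :state       "ok"
   :metric      0.42
   :time        1700000000
   :ttl         60
   :tags        ["production"]
   :description "CPU usage as a fraction of one core"})

;; Fields are read with keywords.
(:host event)    ; "web01"
(:service event) ; "cpu"

;; The map is immutable: assoc returns a new map carrying the extra
;; custom field and leaves the original event unchanged.
(def enriched (assoc event :datacenter "eu-west"))
(:datacenter enriched) ; "eu-west"
(:datacenter event)    ; nil
```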
Each arriving event is added to one or more streams. You define streams in the (streams ...) section of your Riemann configuration. Streams are functions you can pass events to for aggregation, modification, or escalation. Streams can also have child streams that they can pass events to. This allows for filtering or partitioning of the event stream, such as by only selecting events from specific hosts or services. You can think of streams like plumbing in the real world. Events enter the plumbing system, flow through pipes and tunnels, collect in tanks and dams, and are filtered by grates and drains.
You can have as many streams as you like, and Riemann provides a powerful stream processing language that allows you to select the events relevant to a specific stream. For example, you could select events from a specific host or service that meet some other criteria.
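A minimal sketch of streams with child streams, as it might appear in riemann.config. The tag "production" and the host prefix "web" are assumptions chosen for illustration; tagged, where, and info are available in a standard Riemann config.

```clojure
(streams
  ;; Top-level stream: every arriving event is passed here.
  (tagged "production"
    ;; Child stream: only events tagged "production" reach this point.
    (where (host #"^web")
      ;; Grandchild stream: log the events that passed both filters.
      #(info "production web event" %))))
```

Each level narrows the flow, which is the filtering and partitioning described in the plumbing analogy.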
Like your plumbing, though, streams are designed for events to flow through them and for limited or no state to be retained. For many purposes, however, we do need to retain some state. To manage this state Riemann has the index.
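The Riemann index is a sort of copy of the last event for each host and service, so it also acts as a cache. When an indexed event's TTL elapses, Riemann removes it from the index and sends an event with state "expired" back into the streams.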
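The index and its expiry behavior can be sketched as follows, closely following the stock configuration that ships with Riemann. The 5-second sweep interval and the 60-second default TTL are example values, not recommendations.

```clojure
;; Scan the index for expired events every 5 seconds.
(periodically-expire 5)

(let [index (index)]
  (streams
    ;; Give events without a TTL a default of 60 seconds, then index them.
    ;; The index keeps only the latest event per (host, service) pair.
    (default :ttl 60
      index)

    ;; Events dropped from the index come back through the streams
    ;; with :state "expired"; here we simply log them.
    (expired
      #(info "expired" %))))
```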
Where takes a predicate, which is a special expression for matching events. After the predicate, where takes any number of child streams, each of which will receive events which the predicate matched. For example, we could email only events which have state "error".
The where stream provides some syntactic sugar to allow you to access your event fields. In a where stream you can refer to "standard" fields like host, service, description, metric, and ttl by name. If you need to refer to another field you need to reference the full field name, (:field_name event).
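The two forms of predicate described above might look like this in a config. The email addresses and the custom :datacenter field are placeholders; mailer and where are standard Riemann functions.

```clojure
;; mailer returns a function that builds email streams.
;; The addresses here are placeholders.
(def email (mailer {:from "riemann@example.com"}))

(streams
  ;; Standard field: state can be referred to by name.
  ;; Only events whose state is "error" are emailed.
  (where (state "error")
    (email "ops@example.com"))

  ;; Custom field (hypothetical :datacenter): use the full
  ;; (:field_name event) form inside the predicate.
  (where (= (:datacenter event) "eu-west")
    #(info "event from eu-west" %)))
```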
Rollup will allow a few events to pass through readily. Then it starts to accumulate events, rolling them up into a list which is submitted at the end of a given time interval.
Let's define a new stream for alerting the operations team, which sends only five emails per hour (3600 seconds). We'll receive the first four events immediately, and at the end of the hour, a single email with a summary of all the rest.
Rollup is a memory hog because it keeps every rolled-up event in memory until the interval ends, so prefer throttle where a summary is not needed: it passes the first 5 events in each interval and ignores the rest.
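A sketch of both options with the numbers from the notes (5 events per 3600 seconds). The stream names tell-ops and tell-ops-throttled and the email addresses are illustrative; in practice you would pick one of the two.

```clojure
(def email (mailer {:from "riemann@example.com"}))

;; rollup: the first events in each hour pass through immediately, and the
;; rest are held in memory and sent as one summary at the end of the hour.
(def tell-ops
  (rollup 5 3600
    (email "ops@example.com")))

;; throttle: at most 5 events per hour pass through; the rest are
;; dropped, so nothing accumulates in memory.
(def tell-ops-throttled
  (throttle 5 3600
    (email "ops@example.com")))

(streams
  (where (state "error")
    tell-ops))
```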
The coalesce stream remembers the latest events from each host and service, and sends them all as a vector to its children. We can map that vector of events to a single event--the one with the largest metric--using folds/maximum. Then we just set the service and host, since this event pertains to the system as a whole.
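The coalesce pattern described here could be written roughly as below. The service names "api latency" and "max api latency" are assumptions for the example; coalesce, smap, folds/maximum, and with are part of Riemann.

```clojure
(let [index (index)]
  (streams
    (where (service "api latency")
      ;; coalesce passes its children a vector holding the latest
      ;; event from each host and service it has seen.
      (coalesce
        ;; Reduce that vector to the single event with the largest metric.
        (smap folds/maximum
          ;; The result describes the system as a whole, so clear the
          ;; host and give it its own service name before indexing.
          (with {:host nil :service "max api latency"}
            index))))))
```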
moving-time-window forwards the last n seconds of events
moving-event-window forwards the last n events
fixed-time-window forwards events from disjoint n-second windows
fixed-event-window forwards disjoint sequences of n events
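A small sketch of two of these window streams, assuming a hypothetical "api latency" service. Each window passes a vector of events to its children, which a fold such as folds/mean or folds/maximum can reduce to one event.

```clojure
(let [index (index)]
  (streams
    (where (service "api latency")
      ;; Moving window: on every event, forward the events from the
      ;; last 60 seconds and index their mean.
      (moving-time-window 60
        (smap folds/mean
          (with :service "api latency mean 60s"
            index)))

      ;; Fixed window: forward each disjoint batch of 10 events
      ;; and index the largest of them.
      (fixed-event-window 10
        (smap folds/maximum
          (with :service "api latency max per 10 events"
            index))))))
```

The moving variants emit on every incoming event, while the fixed variants emit once per window, so the fixed ones produce far fewer downstream events.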