Monitor everything from physical hardware to application functionality

Welcome to our lavish smorgasbord offering within IT Monitoring.
OP5 is the market leader of IT monitoring throughout the Nordic region and in over 50 countries around the world.

Nicolas Seyvet
Passionate software developer at OP5 AB.
Particular interests are coding, cloud, software engineering and architecture, distributed and scalable systems.

The IT Monitoring
Software Solution.
From Sweden. For a Global Market. Based on Open Source.
OP5 is a Swedish company founded in 2004. The vision was to develop an IT
monitoring software solution based on the Open Source project Nagios that
would offer an unprecedented user experience. A solution that would be
easy to implement, intuitive to work with and provide unparalleled scalability
to support clients and their ever changing business needs.
Today, OP5 has grown into an international company with a presence in over
60 countries. Thousands of IT professionals across the world rely daily on
solutions from OP5 to monitor their business-critical IT services.

The OP5 product Monitor is based on Nagios:
- Checks
- Plugins
- BUT static infrastructure

Infrastructure:
- Increased number of devices
- Virtual
Applications:
- On-demand deployments (cloud)
- Ephemeral/moving processes
- Distributed
Monitor everything in the data center?
The three Vs of Big Data:
- Volume
- Velocity
- Variety
Dynamic, complex environment
Outpacing humans
Average DC → ~20 000 servers

Monitoring
One simple dimension: Dynamicity

Time series
An event source produces a metric/event over time.
Multiple series of (timestamp, value) pairs:
<series name> (t0, v0) (t1, v1) (t2, v2) (t3, v3) …
Example series name: pod.io.read_bytes_sec
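As a minimal sketch of the model above, a time series can be held as a named list of (timestamp, value) pairs. The `TimeSeriesStore` class and its methods are hypothetical names used only for illustration; they are not part of Monasca or any OP5 product.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


class TimeSeriesStore:
    """Hypothetical in-memory store: one list of points per series name."""

    def __init__(self) -> None:
        self._series: Dict[str, List[Point]] = defaultdict(list)

    def append(self, name: str, timestamp: float, value: float) -> None:
        # An event source produces one (timestamp, value) pair at a time.
        self._series[name].append((timestamp, value))

    def points(self, name: str) -> List[Point]:
        # Return the points of one series ordered by timestamp.
        return sorted(self._series.get(name, []))


store = TimeSeriesStore()
store.append("pod.io.read_bytes_sec", 0.0, 1024.0)
store.append("pod.io.read_bytes_sec", 10.0, 2048.0)
store.append("pod.io.read_bytes_sec", 20.0, 512.0)
series = store.points("pod.io.read_bytes_sec")
```

Each distinct series name (plus, in practice, its dimensions) is one time series, which is why the number of series grows with the number of sources.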

Not all sources are created equal
Lifetime of a source over time:
- Physical Infrastructure: long lived
- Virtual Infrastructure: medium lived
- Application layer: ephemeral

An example
Let’s assume 20 000 servers with 4 micro-services per server:
→ 20 000 + 4 x 20 000 = 100 000 instances
Assume 100 metrics per instance:
→ 10 000 000 active time series
Out of which:
→ 2 000 000 are long lived
→ 8 000 000 are ephemeral
Add dynamicity and elasticity → 0.001%/s replacement rate:
→ 0.001% * 8 000 000 = 80 new time series/s
→ ~6 900 000 new time series per day
Then, add the virtual infrastructure, failures in the DC, new racks, etc.
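The arithmetic of this example can be checked with a short script. A replacement rate of 0.001 % per second (1 in 100 000) is assumed here, since that is the rate that yields 80 new series per second from 8 000 000 ephemeral series; all constant names are illustrative.

```python
SERVERS = 20_000
MICROSERVICES_PER_SERVER = 4
METRICS_PER_INSTANCE = 100
SECONDS_PER_DAY = 86_400
# Assumption: 0.001 % of ephemeral series are replaced every second,
# expressed as "1 in 100 000" to stay in integer arithmetic.
REPLACED_ONE_IN = 100_000

# Servers plus the micro-services running on them.
instances = SERVERS + MICROSERVICES_PER_SERVER * SERVERS
active_series = instances * METRICS_PER_INSTANCE

# Series coming from the servers themselves are long lived,
# the ones from micro-services are ephemeral.
long_lived = SERVERS * METRICS_PER_INSTANCE
ephemeral = active_series - long_lived

new_per_second = ephemeral // REPLACED_ONE_IN
new_per_day = new_per_second * SECONDS_PER_DAY

print(instances, active_series, long_lived, ephemeral)
print(new_per_second, new_per_day)
```

The daily figure comes out at 6 912 000, which the slide rounds to ~6 900 000.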

Monasca
Monasca (http://monasca.io/) is an open-source, multi-tenant, massively scalable,
fault-tolerant monitoring-as-a-service solution.
Main features:
- An event driven architecture.
- A set of REST APIs for high-speed event processing and querying.
- A real-time streaming engine (alarms and transformations).
- An agent (collector) with plugins.
- A push based system.
Part of, but not limited to, the OpenStack family.

OpenStack
Key Features
OpenStack began in 2010 as a joint project between NASA and Rackspace.
Open source software for creating private and public clouds (Infrastructure as a Service).
Controls large pools of compute, storage, and networking resources throughout a datacenter,
managed through a dashboard or via RESTful APIs.

OpenStack Open Source projects
Monasca: Monitoring

The clients
- Users, the Horizon Dashboard and the Grafana Dashboard use the Monasca API (GET/POST) to query and to create/define alarms and notifications.
- The Monasca Agent pushes data to the Monasca API (Push).
- Keystone (Auth.): authentication/authorization → multi-tenancy.
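A rough sketch of what a push from a client could look like: build an HTTP POST carrying one measurement and a Keystone token. The host name, port and token below are placeholders, and the `/v2.0/metrics` path, the `X-Auth-Token` header and the body fields reflect my understanding of the Monasca API rather than anything stated on the slide, so check them against the API reference. The request is only constructed, not sent.

```python
import json
import time
import urllib.request

# Hypothetical endpoint and token, for illustration only.
MONASCA_URL = "http://monasca.example.com:8070/v2.0/metrics"
KEYSTONE_TOKEN = "replace-with-a-keystone-token"


def build_metric_request(name, value, dimensions, timestamp_ms=None):
    """Build (but do not send) a POST request carrying one measurement."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    body = {
        "name": name,
        "dimensions": dimensions,
        "timestamp": timestamp_ms,
        "value": value,
    }
    return urllib.request.Request(
        MONASCA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The token obtained from Keystone identifies the tenant.
            "X-Auth-Token": KEYSTONE_TOKEN,
        },
        method="POST",
    )


request = build_metric_request(
    "pod.io.read_bytes_sec",
    1024.0,
    {"hostname": "compute_node_1"},
    timestamp_ms=1_500_000_000_000,
)
# urllib.request.urlopen(request) would send it; it is not called here.
```

Because every request carries a token, the API can attribute each measurement to a tenant, which is what the multi-tenancy on this slide relies on.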

The core
Monasca API → Publish/Subscribe → Data/Event Bus
Kafka is an open-source, massively scalable pub-sub message queue:
- horizontally scalable
- fault-tolerant
- high throughput (>100K to millions of events/s)
- at-least-once guarantee
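To illustrate what an at-least-once guarantee means for a consumer, here is a small in-memory sketch: an offset is committed only after a record has been handled, so a crash between handling and committing leads to the record being delivered again. The `Log` class and `consume` function are hypothetical stand-ins, not the Kafka client API.

```python
class Log:
    """Hypothetical append-only log standing in for one topic partition."""

    def __init__(self):
        self.records = []
        self.committed = 0  # offset of the next record to deliver

    def publish(self, record):
        self.records.append(record)

    def poll(self):
        # Return (offset, record) pairs starting at the committed offset.
        return list(enumerate(self.records))[self.committed:]

    def commit(self, offset):
        self.committed = offset + 1


def consume(log, handler, crash_after=None):
    """Handle records, committing each offset only after the handler ran."""
    for n, (offset, record) in enumerate(log.poll()):
        handler(record)
        if crash_after is not None and n == crash_after:
            return  # simulated crash: record handled but never committed
        log.commit(offset)


log = Log()
for event in ["e0", "e1", "e2"]:
    log.publish(event)

seen = []
consume(log, seen.append, crash_after=1)  # crashes right after handling e1
consume(log, seen.append)                 # restart resumes from last commit
print(seen)
```

No record is lost, but `e1` is handled twice, so consumers on such a bus are usually written to tolerate duplicates.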

The backend
Monasca API → Publish/Subscribe → Data/Event Bus
- Configuration
- Persister (Subscribe) → TSDB, Logs/Events
- Streaming Engine (Publish/Subscribe): Threshold, Transform, Anomaly
- Notification Engine (Subscribe)
Threshold engine: What to monitor in real-time (alarms).
Transform engine: From raw to smart data.

The Monasca stack
Clients: Users, Horizon Dashboard and Grafana Dashboard (GET/POST), Monasca Agent (Push)
Auth.: Keystone
Monasca API → Publish/Subscribe → Data/Event Bus
- Configuration
- Persister (Subscribe) → TSDB, Logs/Events
- Streaming Engine (Publish/Subscribe): Threshold, Transform, Anomaly
- Notification Engine (Subscribe)

Stack
Two benefits: extensibility and “what?”

Easy to extend
Event driven architecture.
Data/Event Bus → Publish/Subscribe:
- Persister
- Streaming Engine
- Notification Engine
- My Function/App
- ...
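The extensibility argument can be sketched with a tiny in-memory publish/subscribe bus: a new function is added by subscribing to a topic, without changing the components already attached. `EventBus`, the topic name and `my_function` are hypothetical and only stand in for the real bus and its consumers.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-memory publish/subscribe bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Every subscriber of the topic receives every event.
        for callback in self._subscribers[topic]:
            callback(event)


bus = EventBus()

# An existing consumer: stands in for the persister.
persisted = []
bus.subscribe("metrics", persisted.append)

# Extending the system: a new function subscribes to the same topic.
high_values = []


def my_function(event):
    if event["value"] > 1000:
        high_values.append(event["name"])


bus.subscribe("metrics", my_function)

bus.publish("metrics", {"name": "pod.io.read_bytes_sec", "value": 2048})
bus.publish("metrics", {"name": "pod.io.read_bytes_sec", "value": 10})
```

The publisher does not know how many consumers exist, which is the property that lets "My Function/App" sit next to the persister and the engines.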

Highest level: What to alarm on?
Domain Specific Language (DSL)
<expression>
::= <sub_expression> [(and | or) <sub_expression>]*
Where a sub-expression:
<sub_expression>
::= <function> '(' <metric> [',' period] ')' <operator> threshold_value ['times' periods]
function: min, max, sum, avg, count, last
Example:
avg(disk.space_used_perc{hostname=compute_node_1}) >= 99
and
count(log.error{hostname=compute_node_1,component=kafka},deterministic) >= 1
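To make the DSL more concrete, here is a deliberately simplified evaluator for a single sub-expression of the form `function(metric) operator threshold`. It is a sketch, not Monasca's parser: it ignores the optional period and `times` parts, skips anything after the first comma that follows the metric, treats the full metric string (name plus dimensions) as a lookup key, and supports only the comparison operators listed in `OPERATORS`. All names are hypothetical.

```python
import operator
import re

FUNCTIONS = {
    "min": min,
    "max": max,
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "count": len,
    "last": lambda values: values[-1],
}
OPERATORS = {"<": operator.lt, "<=": operator.le,
             ">": operator.gt, ">=": operator.ge}

# function '(' metric[{dimensions}] [',' ignored] ')' operator threshold
SUB_EXPRESSION = re.compile(
    r"^\s*(?P<function>\w+)\(\s*(?P<metric>[\w.]+(?:\{[^}]*\})?)\s*"
    r"(?:,[^)]*)?\)\s*(?P<operator><=|>=|<|>)\s*"
    r"(?P<threshold>-?\d+(?:\.\d+)?)\s*$"
)


def evaluate(sub_expression, measurements):
    """Return True if the sub-expression holds for the given measurements."""
    match = SUB_EXPRESSION.match(sub_expression)
    if match is None:
        raise ValueError(f"cannot parse: {sub_expression!r}")
    values = measurements.get(match["metric"], [])
    if not values and match["function"] != "count":
        return False  # nothing to aggregate
    result = FUNCTIONS[match["function"]](values)
    return OPERATORS[match["operator"]](result, float(match["threshold"]))


# Illustrative measurements keyed by metric name and dimensions.
measurements = {
    "disk.space_used_perc{hostname=compute_node_1}": [98.5, 99.2, 99.8],
    "log.error{hostname=compute_node_1,component=kafka}": [1],
}
disk_full = evaluate(
    "avg(disk.space_used_perc{hostname=compute_node_1}) >= 99", measurements)
kafka_errors = evaluate(
    "count(log.error{hostname=compute_node_1,component=kafka},deterministic) >= 1",
    measurements)
alarm = disk_full and kafka_errors  # the "and" of the full expression
```

Combining sub-expressions with Python's `and`/`or` mirrors the `<expression>` rule; a real implementation would parse the whole expression and evaluate it over a time window.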

OP5 Monasca
To sum up:
- Built for self-healing and elasticity (horizontal scalability)
- Can handle billions of time series at high throughput
- Multi-tenant
- Extensible
- DSL to monitor what matters
- Can combine different sources (metrics/events/logs)
Built on top of Kubernetes, runs on AWS, OpenStack and VMware.
$ # Deploy in one line
$ helm install op5_monasca

Questions?
Nicolas Seyvet
Backend Engineer
Email: nseyvet@op5.com
Twitter: @NicolasSeyvet
Blog: http://babounehacks.blogspot.se/
Github: https://github.com/nseyvet
https://github.com/baboune
OP5 HQ
Norgegatan 2
SE-164 32 Kista
Sweden
Call us: +46 (0)8 58 83 01 00
www.OP5.com
Follow us:
linkedin.com/company/OP5/
facebook.com/OP5ab
twitter.com/OP5ab
