Monitor everything from physical hardware to application functionality
Welcome to our lavish smorgasbord of IT monitoring.
OP5 is the market leader in IT monitoring throughout the Nordic region and in over 50 countries around the world.
Passionate software developer at OP5 AB.
Particular interests are coding, cloud, software engineering and architecture,
distributed and scalable systems.
Nicolas Seyvet
The IT Monitoring
Software Solution.
From Sweden. For a Global Market. Based on Open Source.
OP5 is a Swedish company founded in 2004. The vision was to develop an IT
monitoring software solution based on the Open Source project Nagios that
would offer an unprecedented user experience. A solution that would be
easy to implement, intuitive to work with and provide unparalleled scalability
to support clients and their ever-changing business needs.
Today, OP5 has grown into an international company with a presence in over
60 countries. Thousands of IT professionals across the world rely daily on
solutions from OP5 to monitor their business-critical IT services.
The OP5 product, OP5 Monitor, is built on Nagios.
Based on:
- Checks
- Plugins
- BUT it assumes a static infrastructure
Infrastructure:
- Increased number of devices
- Virtual
Applications:
- On-demand deployments (cloud)
- Ephemeral/moving processes
- Distributed
Monitor everything in the data center?
The three Vs of Big Data:
- Volume
- Velocity
- Variety
Dynamic, complex environment
Outpacing humans
Average DC -> ~ 20 000 servers
Monitoring
One simple dimension: dynamicity
Time series
An event source produces metrics/events over time.
Multiple series of (timestamp, value) pairs:
<series name> (t0, v0) (t1, v1) (t2, v2) (t3, v3) …
Example series name: pod.io.read_bytes_sec
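The series layout above can be sketched as a small in-memory structure. This is only an illustration of the data model, not how Monasca stores data; the `SeriesStore` class and its method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp, value)


class SeriesStore:
    """Hypothetical in-memory store: one list of samples per series name."""

    def __init__(self) -> None:
        self._series: Dict[str, List[Sample]] = defaultdict(list)

    def append(self, name: str, timestamp: float, value: float) -> None:
        # An event source produces one (timestamp, value) pair at a time.
        self._series[name].append((timestamp, value))

    def samples(self, name: str) -> List[Sample]:
        # Return the samples ordered by timestamp; unknown names yield [].
        return sorted(self._series.get(name, []))


store = SeriesStore()
store.append("pod.io.read_bytes_sec", 1.0, 2048.0)
store.append("pod.io.read_bytes_sec", 0.0, 1024.0)
print(store.samples("pod.io.read_bytes_sec"))
```

Each new series name creates a new list, which is why the number of distinct names matters as much as the number of samples.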
Not all sources are created equal
[Figure: lifetime of sources over time]
- Physical infrastructure: long lived
- Virtual infrastructure: medium lived
- Application layer: ephemeral
An example
Let’s assume 20 000 servers with 4 micro-services per server:
→ 20 000 + 4 x 20 000 = 100 000 instances
Assume 100 metrics per instance:
→ 10 000 000 active time series
Out of which:
→ 2 000 000 are long lived
→ 8 000 000 are ephemeral
Add dynamicity and elasticity → 0.001%/s replacement rate:
→ 0.001% * 8 000 000 = 80 new time series/s
→ ~6 900 000 new time series per day
Then, add the virtual infrastructure, failures in the DC, new racks, etc.
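The arithmetic of this example can be reproduced with a few lines of Python. Note that 80 new series/s out of 8 000 000 ephemeral series corresponds to a replacement rate of 1 in 100 000 per second (0.001%/s), which is the rate this sketch uses. The constant names are illustrative only.

```python
from fractions import Fraction

SERVERS = 20_000
MICROSERVICES_PER_SERVER = 4
METRICS_PER_INSTANCE = 100
# Assumption: 1 in 100 000 ephemeral series is replaced every second (0.001 %/s).
REPLACEMENT_RATE_PER_S = Fraction(1, 100_000)
SECONDS_PER_DAY = 86_400

# Servers plus the micro-services running on them.
instances = SERVERS + MICROSERVICES_PER_SERVER * SERVERS
active_series = instances * METRICS_PER_INSTANCE

# Series attached to servers are long lived; the rest are ephemeral.
long_lived = SERVERS * METRICS_PER_INSTANCE
ephemeral = active_series - long_lived

# Exact rational arithmetic avoids floating-point rounding noise.
new_per_second = int(ephemeral * REPLACEMENT_RATE_PER_S)
new_per_day = new_per_second * SECONDS_PER_DAY

print(instances, active_series, long_lived, ephemeral)
print(new_per_second, new_per_day)
```

Changing a single constant, for example the number of micro-services per server, shows how quickly the daily count of new series grows.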
Monitoring Monasca
Monasca (http://monasca.io/) is an open-source, multi-tenant, massively scalable,
fault-tolerant monitoring-as-a-service solution.
Main features:
- An event-driven architecture.
- A set of REST APIs for high-speed event processing and querying.
- A real-time streaming engine (alarms and transformations).
- An agent (collector) with plugins.
- A push-based system.
Part of (but not limited to) the OpenStack family.
Monasca
OpenStack began in 2010 as a joint project between NASA and Rackspace.
Open-source software for creating private and public clouds (Infrastructure as a Service).
Control large pools of compute, storage, and networking resources throughout a datacenter,
managed through a dashboard or via RESTful APIs.
OpenStack
Key Features
OpenStack Open Source projects
Monasca: Monitoring
What is Monasca?
The clients
[Diagram: Users, the Horizon dashboard and the Grafana dashboard call the Monasca API with GET/POST; the Monasca Agent pushes data to the Monasca API; Keystone provides auth.]
- Keystone: Authentication/Authorization → Multi-tenancy
- Query
- Create/define alarms and notifications
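As a rough sketch of what a client push could look like, the snippet below builds (but does not send) an HTTP request that posts one metric. The host name, the token placeholder and the helper `build_metric_request` are hypothetical; the `/v2.0/metrics` path, the `X-Auth-Token` header and the JSON field names reflect my understanding of the Monasca API and should be checked against its documentation.

```python
import json
import time
import urllib.request
from typing import Dict


def build_metric_request(api_url: str, token: str, name: str,
                         value: float, dimensions: Dict[str, str]) -> urllib.request.Request:
    """Build a POST request carrying a single metric (nothing is sent here)."""
    body = {
        "name": name,
        "dimensions": dimensions,
        "timestamp": int(time.time() * 1000),  # assumed to be milliseconds since epoch
        "value": value,
    }
    return urllib.request.Request(
        url=f"{api_url}/v2.0/metrics",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # token obtained from Keystone beforehand
        },
        method="POST",
    )


# Hypothetical endpoint and token, for illustration only.
request = build_metric_request(
    "http://monasca.example.com:8070", "<keystone-token>",
    "cpu.idle_perc", 97.5, {"hostname": "compute_node_1"},
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; the Keystone token is what ties the metric to a tenant.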
The core
[Diagram: the Monasca API publishes to, and subscribes from, the Data/Event Bus.]
Kafka is an open-source, massively scalable pub-sub message queue:
- horizontally scalable
- fault-tolerant
- high throughput (>100K to millions of events/s)
- at-least-once delivery guarantee
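An at-least-once guarantee means a consumer may see the same message more than once, so consumers are usually written to be idempotent. The sketch below shows one simple way to do that in plain Python, without any Kafka client; the `consume` function and the message layout are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Message = Tuple[str, float, float]  # (series name, timestamp, value)


def consume(messages: Iterable[Message],
            seen: Set[Tuple[str, float]],
            store: Dict[str, Dict[float, float]]) -> int:
    """Apply each message once, skipping redelivered duplicates."""
    applied = 0
    for name, timestamp, value in messages:
        key = (name, timestamp)
        if key in seen:
            continue  # duplicate delivery: already applied
        seen.add(key)
        store.setdefault(name, {})[timestamp] = value
        applied += 1
    return applied


seen: Set[Tuple[str, float]] = set()
store: Dict[str, Dict[float, float]] = {}
batch = [
    ("cpu.idle_perc", 0.0, 97.5),
    ("cpu.idle_perc", 1.0, 96.0),
    ("cpu.idle_perc", 0.0, 97.5),  # redelivered message
]
applied = consume(batch, seen, store)
print(applied, store)
```

Keying on (series name, timestamp) is one possible choice; a real deployment would pick whatever uniquely identifies a message.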
The backend
[Diagram: the Monasca API, the Persister, the Streaming Engine (Threshold, Transform, Anomaly) and the Notification Engine are connected through publish/subscribe on the Data/Event Bus. Other components shown: Configuration, TSDB, Logs/Events.]
Threshold engine: What to monitor in real-time (alarms)
Transform engine: From raw to smart data.
The Monasca stack
[Diagram: the complete stack. Clients: Users, Horizon dashboard and Grafana dashboard (GET/POST), Monasca Agent (Push), with auth. through Keystone. Core: Monasca API and Data/Event Bus (publish/subscribe). Backend: Configuration, Persister, Streaming Engine (Threshold, Transform, Anomaly), Notification Engine, TSDB, Logs/Events.]
Two benefits: extensibility and “what?”
Easy to extend
Event-driven architecture.
[Diagram: My Function/App subscribes to and publishes on the Data/Event Bus alongside the Persister, the Streaming Engine, the Notification Engine, ...]
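The extension idea can be illustrated with a tiny in-memory publish/subscribe bus: a new function subscribes to a topic next to the existing consumers, and nothing else has to change. This is a minimal sketch in plain Python, not Kafka and not Monasca code; `EventBus`, the topic name and `my_function` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class EventBus:
    """Hypothetical in-memory stand-in for the data/event bus."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Every subscriber of the topic receives every event.
        for handler in list(self._subscribers[topic]):
            handler(event)


bus = EventBus()
persisted: List[Event] = []   # plays the role of the persister
flagged: List[str] = []       # output of the custom extension


def my_function(event: Event) -> None:
    # Custom logic added without touching the other subscribers.
    if event["name"] == "disk.space_used_perc" and event["value"] >= 99:
        flagged.append(event["hostname"])


bus.subscribe("metrics", persisted.append)
bus.subscribe("metrics", my_function)
bus.publish("metrics", {"name": "disk.space_used_perc", "value": 99.5,
                        "hostname": "compute_node_1"})
bus.publish("metrics", {"name": "disk.space_used_perc", "value": 40.0,
                        "hostname": "compute_node_2"})
print(len(persisted), flagged)
```

The publisher never needs to know how many subscribers exist, which is what makes adding "My Function/App" cheap.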
Highest level: What to alarm on?
Domain Specific Language (DSL)
<expression>
::= <sub_expression> [(and | or) <sub_expression>]*
Where a sub-expression:
<sub_expression>
::= <function> '(' <metric> [',' period] ')' <operator> threshold_value ['times' periods]
function: min, max, sum, avg, count, last
Example:
avg(disk.space_used_perc{hostname=compute_node_1}) >= 99
and
count(log.error{hostname=compute_node_1,component=kafka},deterministic) >= 1
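To make the semantics of a sub-expression concrete, the following sketch evaluates `<function>(<values>) <operator> threshold` over a list of measurements already collected for one period. It is not the Monasca threshold engine: it ignores the period, 'times' and deterministic parts, the sample values are made up, and the operator set is an assumption.

```python
import operator
from statistics import mean
from typing import Callable, Dict, List

# The six functions named in the DSL, applied to one period of values.
FUNCTIONS: Dict[str, Callable[[List[float]], float]] = {
    "min": min,
    "max": max,
    "sum": sum,
    "avg": mean,
    "count": len,
    "last": lambda values: values[-1],
}

# Assumed comparison operators.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def evaluate_sub_expression(function: str, values: List[float],
                            op: str, threshold: float) -> bool:
    """Return True when function(values) compares true against the threshold."""
    return OPERATORS[op](FUNCTIONS[function](values), threshold)


# Made-up measurements for the two sub-expressions of the example.
disk_full = evaluate_sub_expression("avg", [99.2, 99.5, 99.1], ">=", 99)
kafka_errors = evaluate_sub_expression("count", [1.0], ">=", 1)
alarm = disk_full and kafka_errors  # the 'and' of the full expression
print(disk_full, kafka_errors, alarm)
```

Combining sub-expressions with `and`/`or` is what lets one alarm mix metrics and log events, as in the example above.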
In conclusion
To sum up:
- Built for self-healing and elasticity (horizontal scalability)
- Can handle billions of time series at high throughput
- Multi-tenant
- Extensible
- DSL to monitor what matters
- Can combine different sources (metrics/events/logs)
Built on top of Kubernetes; runs on AWS, OpenStack and VMware.
$ # Deploy in one line
$ helm install op5_monasca
OP5 Monasca
OP5 HQ
Norgegatan 2
SE-164 32 Kista
Sweden
www.OP5.com
Call us
+46 (0)8 58 83 01 00
Follow us
linkedin.com/company/OP5/
facebook.com/OP5ab
twitter.com/OP5ab
Nicolas Seyvet
Backend Engineer
Email: nseyvet@op5.com
Twitter: @NicolasSeyvet
Blog: http://babounehacks.blogspot.se/
Github: https://github.com/nseyvet
https://github.com/baboune
Questions?
