OSMC 2016 - Monasca - Monitoring-as-a-Service (at-Scale) by Roland Hochmuth

Monasca
Monitoring/Logging-as-a-Service (at-scale)

Speaker
Roland Hochmuth
Hewlett Packard Enterprise
Fort Collins, Colorado, USA

Agenda
• Describe how to build a highly scalable monitoring and logging as a
service platform
• Architectural and design principles
• Scale, HA
• Provide an overview of Monasca
• Features
• API
• Demo

What is Monitoring-as-a-Service?
• A Monitoring or Logging solution deployed as Software-as-a-Service
• E.g. CloudWatch, Datadog, New Relic, Librato, Loggly and many others
• First-class, preferably RESTful HTTP API
• Authentication
• Multi-tenancy
• Provides self-provisioning to users/tenants of the service
• Designed to be highly reliable and operate at scale
• Historically run by an operations team doing web services

What is OpenStack?
• OpenStack is a cloud operating system that controls large pools of
compute, storage, and networking resources
• Open-source alternative to AWS, Microsoft Azure, Google Cloud and
other cloud services
• Deployed in both public and private clouds

What is Monasca?
• Open-source Monitoring/Logging-as-a-Service platform for OpenStack
• Authentication currently via OpenStack Identity Service (Keystone)
• Microservices message-bus based architecture
• First-class RESTful API
• Push-based metrics
• Consolidates Operational Monitoring, Monitoring-as-a-Service, Metering &
Billing and more
• Designed for elastic cloud environments/deployments
• High-availability / clustering built-in
• Horizontally scalable and vertically four-tiered/layered architecture
• Capable of long-term data retention to address metering, SLA, capacity
planning, trend analysis, post-hoc RCA, and other use cases
• Extensible and Composable

The Log
• The Log: What every software engineer should know about real-time data's
unifying abstraction
• https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
• Log: An append-only, totally-ordered sequence of records ordered by time

Kafka
• A performant, distributed, durable, publish/subscribe messaging and stream
processing system
• Metrics, logs and events are published to topics in Kafka
• Microservices register in a "consumer group" as a consumer
• Microservices "subscribe" to topics and consume metrics/logs and events
• Messages are replicated per consumer group
• Messages are load-balanced across all consumers in a consumer group
• Can add/remove micro-services to handle load or mitigate problems
• As micro-services expand/contract the partitions are automatically re-balanced
• At-least-once semantic guarantees on message delivery
• Also used for domain events, notification retry events, periodic notifications,
grouping notifications and other areas
• Always accept data, never drop data, true elasticity
• Loggly: https://www.youtube.com/watch?v=LpNbjXFPyZ0
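The load-balancing and re-balancing behaviour described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Kafka's actual assignment algorithm: the function `assign_partitions`, the round-robin strategy and the consumer names are assumptions made for illustration only.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin across the consumers of one group."""
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))

# Three consumers in one group share the topic's partitions.
before = assign_partitions(partitions, ["persister-1", "persister-2", "persister-3"])

# persister-2 leaves the group; its partitions move to the remaining consumers.
after = assign_partitions(partitions, ["persister-1", "persister-3"])
```

Every partition still has exactly one owner after the change, which is what lets consumers be added or removed without losing messages.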

CQRS
• Command Query Responsibility Segregation (CQRS)
• CQRS involves splitting an application into two parts internally:
1. Command side ordering the system to update state
2. Query side that gets information without changing state
• Advantages
• Decouples the read/write load. Allows each to be scaled independently
• Read store can be optimized for the query pattern of the application
• Reference
• Event sourcing, CQRS, stream processing and Apache Kafka
• https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
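The command/query split can be sketched in a few lines. This is an illustrative toy, not Monasca code: the names `handle_command`, `project` and `query_state`, and the use of an in-memory list and dictionary in place of Kafka and a read store, are assumptions.

```python
from typing import Dict, List, Tuple

# Command side: an append-only log of state-changing events.
event_log: List[Tuple[str, str]] = []

# Query side: a read store shaped for the lookup the application needs.
alarm_state_by_id: Dict[str, str] = {}


def handle_command(alarm_id: str, new_state: str) -> None:
    """Command: record the requested state change; returns nothing."""
    event_log.append((alarm_id, new_state))


def project(events: List[Tuple[str, str]]) -> None:
    """Update the read store from the log (could run in a separate service)."""
    for alarm_id, state in events:
        alarm_state_by_id[alarm_id] = state


def query_state(alarm_id: str) -> str:
    """Query: reads only, never changes state."""
    return alarm_state_by_id.get(alarm_id, "UNDETERMINED")


handle_command("alarm-1", "OK")
handle_command("alarm-1", "ALARM")
project(event_log)
```

Because writes only append to the log and reads only touch the read store, each side can be scaled or replaced on its own.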

Microservices
• Microservices are small, autonomous, decoupled services that are
deployed independently and work together as a single application
• Communication between services occurs via a network
• Services need to be able to change independently of each other, and be
deployed by themselves without requiring consumers to change
• Benefits:
• Resilience
• Scale
• Ease of deployment
• Organizational Alignment
• Optimized for Change/Replaceability

Deployment Models (HA/Scale)
• Many ways to deploy Monasca
• Typically deployed in a clustered/HA configuration using three or
more nodes
• If any node or microservice fails, the cluster remains operational
• Partitions in Kafka are redistributed among the remaining components
• Preferably, the database is run on a separate layer from the other
components/microservices
• Note, Monasca can also be deployed in a single-node, non-clustered
configuration
• Has also been containerized and run in Kubernetes

Metrics Model
POST /v2.0/metrics
{
  "name": "http_status",
  "dimensions": {
    "url": "http://host.domain.com:1234/service",
    "cluster": "c1",
    "control_plane": "ccp",
    "service": "compute"
  },
  "timestamp": 0, /* milliseconds */
  "value": 1.0,
  "value_meta": {
    "status_code": 500,
    "msg": "Internal server error"
  }
}
• Simple, concise, multi-dimensional flexible description
• Name (string)
• Dimensions: Dictionary of user-defined (key, value)
pairs that are used to uniquely identify a metric
• Value meta: Optional dictionary of user-defined (key, value)
pairs that can be used to describe a measurement
• Normally used for errors and messages
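The model above can be exercised with a short helper that builds a metric and shows that name plus dimensions, not value_meta, identify it. This is a sketch under assumptions: `build_metric` and `metric_identity` are hypothetical helpers, and value_meta values are written as strings here.

```python
import json
import time
from typing import Dict, Optional


def build_metric(name: str, dimensions: Dict[str, str], value: float,
                 value_meta: Optional[Dict[str, str]] = None) -> dict:
    """Build one metric in the shape shown on the slide."""
    metric = {
        "name": name,
        "dimensions": dimensions,
        "timestamp": int(time.time() * 1000),  # milliseconds
        "value": value,
    }
    if value_meta:
        metric["value_meta"] = value_meta
    return metric


def metric_identity(metric: dict) -> tuple:
    """Name plus dimensions identify a metric; value_meta does not."""
    return (metric["name"], tuple(sorted(metric["dimensions"].items())))


failed = build_metric(
    "http_status",
    {"url": "http://host.domain.com:1234/service", "service": "compute"},
    1.0,
    {"status_code": "500", "msg": "Internal server error"},
)
healthy = build_metric(
    "http_status",
    {"url": "http://host.domain.com:1234/service", "service": "compute"},
    0.0,
)
body = json.dumps(failed)  # request body for POST /v2.0/metrics
```

Both measurements belong to the same metric; only the failed one carries the extra description.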

Push vs Pull
• Monitoring-as-a-Service
• Can't always pull due to firewalls and network issues
• Low-latency: sub-second latency difficult for pull model
• Doesn't require service discovery and registration
• As entities are deployed, they can start sending metrics without having to be
discovered or registered
• Events
• Temporary caching/buffering of metrics/events while service
unreachable.

Monasca API
• Primary point for pushing metrics and handling queries
• Authenticates all requests against the Keystone identity service
• Note, auth tokens are cached to reduce the load on Keystone
• Resources: Metrics, Alarm Definitions, Alarms and Notification Methods
• API Specification:
• https://github.com/openstack/monasca-api/tree/master/docs
• Horizontally scalable
• Publishes metrics to Kafka
• Queries timeseries DB for measurements and statistics
• Queries Config DB for alarms, alarm definitions and notification methods

Persister
• Consumes both metrics and alarm state transition events from Kafka
• Stores temporarily in-memory and does batch writes to the TSDB, based on
batch size or time, to optimize write performance
• At-least-once message delivery semantics:
• No metrics or alarm state transition events are lost
• The Kafka consumer offset for each batch is only updated after successfully storing
the metric or alarm state transition event
• Note, duplicates are possible
• HA/fault-tolerance:
• Multiple persisters run simultaneously and balance load
• If a persister fails, the load is automatically re-balanced across the remaining
persisters.
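The batching and offset handling described above can be sketched as follows. This is a simplified illustration, not the Monasca persister: `BatchingPersister`, `flaky_write` and the in-memory `stored` list are hypothetical, only size-based flushing is shown (the time-based trigger is omitted), and no real Kafka client is used.

```python
from typing import Callable, List, Tuple

Message = Tuple[int, dict]  # (Kafka offset, decoded message); illustrative type


class BatchingPersister:
    """Buffers messages and advances the offset only after a successful write."""

    def __init__(self, write_batch: Callable[[List[dict]], None], batch_size: int):
        self.write_batch = write_batch  # hypothetical TSDB writer
        self.batch_size = batch_size
        self.buffer: List[Message] = []
        self.committed_offset = -1      # last offset known to be stored

    def consume(self, offset: int, message: dict) -> None:
        self.buffer.append((offset, message))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # If this raises, the offset is not advanced and the batch is kept,
        # so the same messages are written again later (duplicates possible).
        self.write_batch([message for _, message in self.buffer])
        self.committed_offset = self.buffer[-1][0]
        self.buffer.clear()


stored: List[dict] = []
fail_next = [True]


def flaky_write(batch: List[dict]) -> None:
    """Stand-in for a TSDB write that fails once, then succeeds."""
    if fail_next[0]:
        fail_next[0] = False
        raise IOError("TSDB unavailable")
    stored.extend(batch)


persister = BatchingPersister(flaky_write, batch_size=2)
persister.consume(0, {"name": "cpu.user_perc", "value": 12.0})
try:
    persister.consume(1, {"name": "cpu.user_perc", "value": 15.0})
except IOError:
    pass  # first write fails; nothing is committed
offset_after_failure = persister.committed_offset
persister.flush()  # retry succeeds and the offset moves forward
```

Committing after the write is what gives at-least-once delivery: a crash between the write and the commit replays the batch instead of losing it.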

Time Series Databases
• Used for storing:
• Metrics
• Alarm state history
• Two databases supported:
1. Vertica
• Enterprise class, proprietary, closed-source, clustered, HA, analytics database
• Excels at time-series
2. InfluxDB
• Open-source single-node time-series DB
• Clustering is closed-source
• Note, can replicate to multiple instances of InfluxDB using Kafka
• Investigating support for additional databases

Config Database
• Stores all "transactional" data for Monasca such as
• Alarm Definitions
• Alarms
• Notification Methods
• MySQL and Postgres supported
• Typically deployed in a clustered or HA configuration

Threshold Engine
• Near real-time stream processing, clustered and highly available
threshold engine
• Based on Apache Storm
• Consumes metrics from Kafka
• Creates alarms based on metrics that match patterns specified in the
alarm definition
• Evaluates whether metrics exceed threshold
• Publishes alarm state transition events to Kafka
• Supports both simple and compound alarm expressions
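A simple threshold check with state transitions can be sketched like this. It is a toy illustration rather than the Storm-based engine: the functions `evaluate_avg_threshold` and `transition`, and the event field names `old_state` and `new_state`, are assumptions.

```python
from typing import List, Optional


def evaluate_avg_threshold(values: List[float], threshold: float) -> str:
    """Return the alarm state for an expression like avg(metric) > threshold."""
    if not values:
        # No measurements arrived in the evaluation window.
        return "UNDETERMINED"
    average = sum(values) / len(values)
    return "ALARM" if average > threshold else "OK"


def transition(previous: str, current: str) -> Optional[dict]:
    """Produce an event only when the state actually changes."""
    if previous == current:
        return None
    return {"old_state": previous, "new_state": current}


state = "UNDETERMINED"
events = []
for window in ([40.0, 50.0], [90.0, 95.0], [92.0, 88.0], []):
    new_state = evaluate_avg_threshold(window, threshold=85.0)
    event = transition(state, new_state)
    if event is not None:
        events.append(event)  # would be published to Kafka
    state = new_state
```

Only three of the four windows produce an event, because the third window leaves the alarm in the ALARM state.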

Notification Engine
• Consumes "alarm state transition events" from Kafka produced by the
Threshold Engine
• Evaluates whether notifications should be sent based on actions specified
in the alarm definition.
• OK, ALARM and UNDETERMINED actions
• Supports email, PagerDuty, webhooks, HipChat, Slack and JIRA
• Dynamic plugins supported
• Supports both "one-shot" and "periodic" notifications
• If sending to the notification address fails, the notification is published to
a retry topic in Kafka, and retried later
• Grouping notifications: In progress

Kafka Message Schema
• JSON messages published/consumed to/from Kafka by Monasca
micro-services
• Well-defined schema is published at:
• https://wiki.openstack.org/wiki/Monasca/Message_Schema

Metrics
Create, query and get statistics for metrics
• GET, POST /v2.0/metrics
• GET /v2.0/metrics/names
• Returns the unique metric names
• GET /v2.0/metrics/dimensions/names
• Returns the unique dimension names
• GET /v2.0/metrics/dimensions/names/values
• Returns the unique dimension values

Measurements
GET /v2.0/metrics/measurements
• Returns a list of measurements
• Query parameters
• Name and dimensions to filter by
• Start_time and end_time
• Offset and limit
• merge_metrics: allow multiple metrics to be combined into a single list
of measurements.
• group_by: list of columns to group the metrics to be returned. Allows
multiple unique metrics to be returned in a single query.
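A client might assemble a measurements query along these lines. This is a sketch under assumptions: the helper `measurements_url` is hypothetical, the endpoint host is a placeholder, and the comma-separated `key:value` encoding of dimensions should be checked against the API specification.

```python
from typing import Dict
from urllib.parse import urlencode, urlsplit, parse_qs


def measurements_url(base_url: str, name: str, dimensions: Dict[str, str],
                     start_time: str, limit: int = 100) -> str:
    """Build a GET /v2.0/metrics/measurements URL from the filters above."""
    params = {
        "name": name,
        # Assumed encoding: comma-separated key:value pairs.
        "dimensions": ",".join(f"{k}:{v}" for k, v in sorted(dimensions.items())),
        "start_time": start_time,
        "merge_metrics": "true",
        "limit": str(limit),
    }
    return f"{base_url}/v2.0/metrics/measurements?{urlencode(params)}"


url = measurements_url(
    "http://monasca.example.com:8070",  # placeholder endpoint
    "cpu.user_perc",
    {"service": "compute", "cluster": "c1"},
    "2016-11-01T00:00:00Z",
)
query = parse_qs(urlsplit(url).query)
```

The request would also need a Keystone auth token header, which is left out of this sketch.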

Statistics
GET /v2.0/metrics/statistics
• Name and dimensions to filter by
• Start_time and end_time
• Statistics: avg, min, max, sum and count
• Period: The time period to aggregate measurements by
• Offset, limit
• merge_metrics: allow multiple metrics to be combined into a single list
of statistics
• group_by: list of columns to group the metrics to be returned. Allows
multiple unique metrics to be returned in a single query.
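What the period parameter does can be shown with a small aggregation over in-memory measurements. This is an illustrative sketch, not the API's implementation: the function `aggregate`, the use of second-resolution timestamps and the sample values are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(measurements: List[Tuple[int, float]], period: int) -> Dict[int, dict]:
    """Group (timestamp_seconds, value) pairs into period-sized buckets."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in measurements:
        bucket_start = timestamp - (timestamp % period)
        buckets[bucket_start].append(value)
    return {
        start: {
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


samples = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0), (120, 5.0)]
stats = aggregate(samples, period=60)
```

With a 60 second period, the five samples collapse into three rows, one per bucket.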

Metrics Names
GET /v2.0/metrics/names
• Returns a list of the unique metric names
• Dimensions
• Offset, limit

Metric Dimension Names
GET /v2.0/metrics/dimensions/names
• List the dimension names
• Metric name
• Offset, limit

Metric Dimension Values
GET /v2.0/metrics/dimensions/names/values
• List the dimension values
• Metric name
• Dimension name
• Offset, limit

Alarm Definitions
POST, GET /v2.0/alarm-definitions
• Alarm definitions are templates that are used to automatically and
dynamically create alarms based on matching metric names and
dimensions
• One alarm definition can result in zero or more alarms.
• Simple grammar for creating compound alarm expressions:
• avg(cpu.user_perc{}) > 85 or avg(disk.read_ops{device=vda}, 120) > 1000
• Alarm states (OK, ALARM and UNDETERMINED)
• Actions associated with alarms for state transitions
• User assigned severity (LOW, MEDIUM, HIGH, CRITICAL)
• Thresholds can be dynamically adjusted via PATCH
• Minimal lifecycle management, alarm_lifecycle_state and link.

List Alarms
GET /v2.0/alarms
Query parameters:
• metric_name - Name of metric to filter by
• metric_dimensions
• State: OK, ALARM or UNDETERMINED.
• Severity: One or more severities to filter by, separated with |,
ex. severity=LOW|MEDIUM
• state_updated_start_time : The start time in ISO 8601 combined date and
time format in UTC.
• Offset, limit
• sort_by

Alarms
GET, PUT, PATCH, DELETE /v2.0/alarms/{alarm-id}
• Alarms created by the Threshold Engine based on matching alarm
definitions.
• When new nodes or components are deployed, alarms are automatically created
• Alarms are resources within Monasca. They have a resource ID and
lifecycle.
• By default, three states: OK, ALARM and UNDETERMINED
• UNDETERMINED state occurs when metrics are no longer being received
• Deterministic alarms, two states: OK and ALARM
• Used for systems where metrics are sporadic. E.g. a metric is created when an error
occurs in a log file, and no metric is sent when there are no errors.

Alarm Counts
GET /v2.0/alarms/count
• Query the total number of alarms in the OK, ALARM or
UNDETERMINED state, and their severities, grouped by
metrics dimension, such as OpenStack service, state and
severity.
• Used for summary dashboards

Alarm History
GET /v2.0/alarms/state-history
• Lists the alarm state history for alarms
• Query Parameters:
• Dimensions to filter on
• Start/end timestamp
• Offset, limit
GET /v2.0/alarms/{alarm-id}/state-history
• Lists the alarm state history for a specific alarm

Notification Methods
POST, GET, DELETE /v2.0/notification-methods
Notification methods are associated with Actions in alarm definitions.
Example:
POST /v2.0/notification-methods {
"name":"Name of notification method",
"type":"EMAIL",
"address":"john.doe@hp.com"
}

Monasca Agent
• System metrics (cpu, memory, network, filesystem, …)
• Service metrics
• MySQL, Kafka, and many others
• Application metrics
• Built-in Statsd daemon
• Python monasca-statsd library: Adds support for dimensions
• VM system metrics
• Open vSwitch metrics
• Active checks
• HTTP status checks and response times
• System up/down checks (ping and ssh)
• Runs any Nagios plugin or check_mk
• Extensible/Pluggable: Additional services can be easily added

Agent details
• The Agent Forwarder buffers metrics for a short time to increase the
size of the http request body (number of metrics) sent to the
Monasca API.
• The Agent requests an auth token from the Keystone Identity service
which is supplied on all requests.
• The Monasca Agent and API cache auth tokens in-memory to reduce
the round-trip authorization requests to Keystone
• If network connectivity between the Agent and API is lost, the Agent
will buffer metrics and send them when connectivity is restored
• Metrics are submitted using an “agent” role, which only allows metrics
to be POST’d to the metrics endpoint
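The buffering behaviour of the forwarder can be sketched as below. This is a simplified illustration, not the Monasca Agent's code: the class `Forwarder`, the `fake_post` sender and its boolean return convention are assumptions, and token handling is left out.

```python
from typing import Callable, List


class Forwarder:
    """Collects metrics and sends them in one request body per flush."""

    def __init__(self, post: Callable[[List[dict]], bool]):
        self.post = post  # hypothetical sender; returns True on success
        self.pending: List[dict] = []

    def submit(self, metric: dict) -> None:
        self.pending.append(metric)

    def flush(self) -> bool:
        if not self.pending:
            return True
        if self.post(list(self.pending)):
            self.pending.clear()
            return True
        return False  # keep the metrics and retry on the next flush


api_up = [False]
sent_batches: List[List[dict]] = []


def fake_post(batch: List[dict]) -> bool:
    """Stand-in for an HTTP POST to the metrics endpoint."""
    if not api_up[0]:
        return False
    sent_batches.append(batch)
    return True


forwarder = Forwarder(fake_post)
forwarder.submit({"name": "cpu.user_perc", "value": 12.0})
forwarder.submit({"name": "mem.free_mb", "value": 2048.0})
first_attempt = forwarder.flush()   # API unreachable, metrics are kept
api_up[0] = True
second_attempt = forwarder.flush()  # connectivity restored, one batched request
```

Both metrics arrive in a single request once the API is reachable again. A real agent would also bound the buffer size.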

Grafana/Monasca Integration
• Datasource: A datasource that can be added to the Grafana
dashboard to enable Monasca
• https://github.com/openstack/monasca-grafana-datasource
• Keystone authentication
• https://github.com/twc-openstack/grafana
• Support for Alerting will be added in Grafana 4.

Logging API
• POST /v3.0/logs
• Batch log messages in a single http request
• Global / local / mixed dimensions
• Similar to dimensions in metrics.
• JSON only
• Specification
• https://github.com/openstack/monasca-log-api/blob/master/docs/monasca-log-api-spec.md
• Queries not done via API, but via Tenantized version of Kibana
• https://github.com/FujitsuEnablingSoftwareTechnologyGmbH/fts-keystone

Log Model
{ "dimensions": {
"hostname":"devstack",
"service":"monitoring",
"component":"monasca-api" },
"logs":[
{ "message":"msg1",
"dimensions": {
"service":"compute",
"component":"nova-api",
"path":"/var/log/mysql.log" } },
{ "message":"msg2",
"dimensions": {
"path":"/var/log/monasca/monasca-api.log" } }
]
}
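How global and local dimensions might combine for each log entry can be sketched as follows. This is an assumption-laden illustration: the helper `expand_logs` is hypothetical, and the rule that local dimensions take precedence on a key conflict should be confirmed against the log API specification.

```python
from typing import Dict, List


def expand_logs(request_body: dict) -> List[dict]:
    """Attach merged dimensions to each log entry in a batched request."""
    global_dimensions: Dict[str, str] = request_body.get("dimensions", {})
    expanded = []
    for entry in request_body.get("logs", []):
        # Assumption: local dimensions win when a key appears in both places.
        merged = {**global_dimensions, **entry.get("dimensions", {})}
        expanded.append({"message": entry["message"], "dimensions": merged})
    return expanded


body = {
    "dimensions": {"hostname": "devstack", "service": "monitoring",
                   "component": "monasca-api"},
    "logs": [
        {"message": "msg1",
         "dimensions": {"service": "compute", "component": "nova-api",
                        "path": "/var/log/mysql.log"}},
        {"message": "msg2",
         "dimensions": {"path": "/var/log/monasca/monasca-api.log"}},
    ],
}
logs = expand_logs(body)
```

Under this rule, msg1 overrides service and component, while msg2 inherits them from the global block and only adds its own path.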

Log Agents
• Logstash
• https://github.com/logstash-plugins/logstash-output-monasca_log_api/pull/1
• Beaver
• https://github.com/python-beaver/python-beaver/pull/406
• Logspout: Under Investigation

Kibana Integration
• Keystone authentication support for Kibana
• Authentication plugin:
• https://github.com/FujitsuEnablingSoftwareTechnologyGmbH/fts-keystone
• Note: In the process of moving to the official OpenStack repo

Transform and Analytics Engine

Monasca Transform
• A new micro-service in Monasca that aggregates and transforms metrics.
• Currently based on Apache Spark Streaming.
• Use Cases:
• Object Storage Disk Capacity
• Object Storage Capacity
• Compute Host Capacity
• VM Capacity
• More to come
• Metrics are aggregated and published every hour.
• Currently in deployment in HPE Helion OpenStack 4.0.
• OpenStack project/repo
• https://github.com/openstack/monasca-transform

Monasca Analytics
• A framework that adds data science tools (parsers, algorithms, etc).
• Features include:
• Algorithmic flow definition, enabling sharing of complex algorithmic recipes
• Thin orchestration layer that instantiates an execution environment.
• Focused on:
• Anomaly detection
• Reducing alert fatigue via alarm clustering (unsupervised machine learning).
• Example algorithms: One Class SVM and LiNGAM.
• Status: Under Development
• OpenStack project/repo
• https://github.com/openstack/monasca-analytics

Distributions & Deployments
• Charter Communications:
• Monasca and Grafana are currently deployed in a production private cloud
• Monitoring-as-a-Service Use cases supported with Grafana as the Visualization
Dashboard
• 2 datacenters, 600-700 compute nodes, 1000 VMs, 11,000 metrics/sec
• FIWARE Lab:
• http://superuser.openstack.org/articles/monitoring-a-multi-region-cloud-based-on-openstack/
• Hewlett Packard Enterprise: Cloud System, Helion OpenStack
• Supported and tested up to 65K metrics/sec ingest rates.
• Fujitsu:
• FUJITSU Software ServerView Cloud Monitoring Manager.
• NEC:
• Planning to include Monasca in "Cloud Solution Menus" solution.
• Others

Statistics: Mitaka/Newton Release
• Organizations: 31
• Contributors: 97
• Commits: 1075
• Reviews: 4080
• Lines of code: 215,370

Ecosystem
• Hewlett Packard Enterprise
• Fujitsu
• Charter Communications
• NEC
• Cisco
• Cloudbase Solutions
• SUSE
• SolidFire
• SAP
• Cray Inc.
• FIWARE Lab
• Mirantis
• Broadcom

Containers and Kubernetes
• New Monasca Agent Plugins
• Docker plugin
• cAdvisor plugin
• Kubernetes plugin: Monitors both Kubernetes control plane and containers
• Prometheus client plugin: Scrapes apps
• Mesos plugin
• Containerization of Monasca
• Heapster Monasca data sink

Next Steps
• Containerizing Monasca
• Monitoring containers and container managers, such as Kubernetes
• Grouping notifications
