Monitoring microservices platform

Boyan Dimitrov,
Platform Engineering @ Hailo @nathariel

Outline
• Intro to the Hailo world
• Platform Overview
• Monitoring Evolution

The Platform
“Troll A platform” by Swinsto101 / CC BY-SA 3.0 / Desaturated from original

Platform specifics
• SOA based on Go (and Java…)
• 1000+ AWS instances spanning multiple regions
• 160+ services in production
• Designed specifically for the cloud – different building blocks and
components will constantly be in flux, broken or unavailable.

[Diagram: two AWS regions, eu-west-1 and us-east-1, each running a Proxy Layer, Message Bus+, Go Services, Java Services and C*]

Inside an environment
[Diagram: CI Pipeline (Janky/Jenkins), Amazon S3, Docker Registry, a Provisioning Manager and several Provisioning Service instances]

A micro-service under the hood
[Diagram: a Service composed of Handler, Logic and Storage]
• platform-layer: library for abstracting service-to-service comms
• service-layer: self-configuring external service adapters
Any service gets for free:
• Provisioning
• Discovery
• Configuration
• Authentication/Authorization
• A/B testing capabilities
• Self-configuring connectivity to
third-party services
• Monitoring
• Instrumentation

Mission:
Define high level platform and business metrics
Gather as many insights as possible
Add automatic failover and recovery capabilities
“Apollo 8 Launch Control Room” by Tfawls / Desaturated from original

Aspiration vs Reality
[Diagram: PHP and Java applications on a host instance, with StatsD, Carbon and Graphite for metrics, a Zabbix Agent reporting to Zabbix, and CloudWatch]

Challenges
• A single StatsD instance and a generic Graphite setup cannot cope with all the traffic
(surprise!)
• No easy way of generating and searching for graphs quickly
• We didn’t instrument everything
• “Traditional” monitoring systems can only give basic app insights
• Setting up app templates is a daunting manual process and does not scale
• No in-depth visibility into our main KPIs
• No way of identifying platform / release / config / cloud infrastructure changes

Instrumentation++
“Airplaine board” by Smithore
/ Desaturated from original

Iterate on what we already know
[Diagram: each host instance runs CollectD, StatsD and a Zabbix Agent; a Relay sits in front of several Graphite Cache instances, alongside Zabbix and CloudWatch]

Result
• Scaling up Graphite and moving StatsD to every box allowed us to collect millions
of metrics
• Instrumenting everything gives us a lot of insights.
• Grafana allows us to quickly build, store and search for important graphs. Widely
adopted by the whole development team!
Tip: Focus on upper 95th and 99th percentiles and work out from there.

Rethink Service Monitoring

Monitoring V2
[Diagram: Message bus, Monitoring Service, a New Service that publishes healthchecks, Host Instances, Provisioning Manager, Binding and Discovery]

healthcheck.Register(&healthcheck.HealthCheck{
    Id:             "MyHCId",
    ServiceName:    ServiceName,
    ServiceVersion: ServiceVersion,
    Hostname:       Hostname,
    InstanceId:     InstanceID,
    Interval:       time.Minute,   // how often the check runs
    Checker:        myCallbackFunc, // callback that performs the check
    Priority:       healthcheck.Warning,
})

Result
• Service health checks give us in-depth service performance details
• The monitoring service has a holistic view of our platform health and can identify
degraded availability zones
• Developers can identify what is important for their service and track & alert on it.
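
One way a monitoring service could spot a degraded availability zone is to group health check reports by zone and flag zones where too many checks fail. This is only a sketch of that idea; the `Report` type, the failure threshold and the service names are hypothetical, and the deck does not describe the actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// Report is a hypothetical health check result tagged with the availability
// zone of the instance that produced it.
type Report struct {
	Zone    string
	Service string
	Healthy bool
}

// degradedZones returns, in sorted order, every zone whose share of failing
// reports is strictly greater than threshold (a fraction between 0 and 1).
func degradedZones(reports []Report, threshold float64) []string {
	total := map[string]int{}
	failed := map[string]int{}
	for _, r := range reports {
		total[r.Zone]++
		if !r.Healthy {
			failed[r.Zone]++
		}
	}
	var zones []string
	for zone, n := range total {
		if float64(failed[zone])/float64(n) > threshold {
			zones = append(zones, zone)
		}
	}
	sort.Strings(zones)
	return zones
}

func main() {
	reports := []Report{
		{"eu-west-1a", "login", true},
		{"eu-west-1a", "payment", true},
		{"eu-west-1b", "login", false},
		{"eu-west-1b", "payment", false},
		{"eu-west-1b", "routing", true},
		{"eu-west-1c", "login", true},
	}
	fmt.Println(degradedZones(reports, 0.5))
}
```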

Trace++
Monitoring &
Instrumentation
“Abstract conception of network and communication”
by Leszekglasner / Desaturated from original

Trace Architecture
[Diagram: host instances (CollectD, StatsD, Zabbix Agent) publish traces asynchronously over UDP; components shown are Phosphor, the Trace Service with in-memory aggregates and optional persistent storage, Dashboards and Monitoring]

Result
• Trace incoming requests and pinpoint bottlenecks & SLA offenders
• Easily identify problems on the request/response path
• Quickly find out exactly which services participate in the request path

Result
• Identify business impacting issues immediately
• Highlight the service on the critical path that is most likely responsible for the
problems

Event Correlation
“Connection” by A2bb5s
/ Desaturated from the original

Platform events
[Diagram: the trace setup (host instance with CollectD, StatsD and Zabbix Agent, Phosphor, Dashboards, Monitoring, Persistent Storage) extended with SNS, Platform Events and the Whisper Service]

Result
• Answer the most important question: “Did anything change?”
• Audit trail for any platform changes
• Holistic view of our platform status

It is not over yet!
++ Machine Learning
++ Event source weighting

Thanks!
PS. We’re hiring!
@nathariel
boyan@hailocab.com London DevOps
