I will describe how we gather useful statistics on Mesos/Singularity and the tools we use (Grafana, Graphite, Elasticsearch, etc.) to provide meaningful dashboards and data to users so that they can optimise their usage and performance. I will also briefly describe the open source tools we have written for gathering statistics from Mesos and building dashboards.
9. Let’s Scale!
[Diagram. Codebase: Search, Reviews Emails, Reservations, Photo Service, Restaurant profiles, Availability Service, Menu API, White Label, External API, Person API, Feedback API, etc.]
[Diagram. Datacentre: the same components running on VMs. Emails, Search, Restaurant profiles and Menu API are each shown on their own VMs, with Restaurant profiles and Menu API on three VMs each.]
11. [Workflow diagram. Stage labels: Local Build, Provision, Metrics, Monitoring]
Write Puppet Code
Local Vagrant Build
Test and Version Control
Code
Provision VMs
Provision More VMs in different Regions/Envs
Wait for Provisioned host puppet run
Infrastructure Team pushes Puppet Code
Write Puppet Code
Infrastructure Team pushes Puppet code
Build Grafana Dashboards
Code integration with Statsd/Graphite
Runbooks and escalation policies
Write Puppet Code
Infrastructure Team pushes Puppet code
Identify Metrics or emit metrics
14. [Workflow diagram. Stage labels: Local Build, Provision, Metrics, Monitoring]
Local Docker Testing
Push to Docker Repo
Code
Deploy service to other Mesos Cluster
Deploy Service to Mesos Cluster
Write Puppet Code
Infrastructure Team pushes Puppet code
Build Grafana Dashboards
Code integration with Statsd/Graphite
Runbooks and escalation policies
Write Puppet Code
Infrastructure Team pushes Puppet code
Identify Metrics or emit metrics
15. Metrics Pipeline
[Diagram, components in order of appearance:]
Mesos Task -> Singularity API / Mesos API -> Carbon Format Publisher -> Kafka -> Carbon Format Consumer -> Carbon-c relay -> Graphite Cluster -> Grafana
https://github.com/opentable/mesos_stats
https://github.com/weaveworks/grafanalib
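As a rough illustration of the "Carbon Format Publisher" step, the sketch below turns a dictionary of per-task statistics into lines of Graphite's plaintext (Carbon) protocol, `<path> <value> <timestamp>`. It is not taken from mesos_stats: the function names, the `mesos.tasks` prefix, and the example statistic names are assumptions made for illustration, and the Kafka publishing step is left out.

```python
import time


def to_carbon_line(path, value, timestamp=None):
    """Format one metric as a Carbon plaintext line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def task_metrics_to_carbon(prefix, task_id, stats, timestamp=None):
    """Convert a dict of task statistics into Carbon lines (hypothetical helper).

    Dots in the task id are replaced because Graphite treats '.' as a
    path separator.
    """
    safe_task = task_id.replace(".", "_")
    return [
        to_carbon_line(f"{prefix}.{safe_task}.{name}", value, timestamp)
        for name, value in sorted(stats.items())
    ]


if __name__ == "__main__":
    # Example statistics; in a real pipeline these would come from the
    # Mesos or Singularity APIs and the lines would be sent on to Kafka.
    example = {"mem_rss_bytes": 1024, "cpus_user_time_secs": 2.5}
    for line in task_metrics_to_carbon("mesos.tasks", "my-service.1", example):
        print(line, end="")
```

Keeping the formatting step separate from transport means the same lines could be written to Kafka by the publisher and forwarded unchanged to the relay by the consumer.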
16. Auto-generated Grafana Dashboard
Every service running in Mesos will have an auto-generated dashboard.
Help text explaining the graphs and what they mean.
Shows cluster-wide usage and instance usage.
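To make the idea of a per-service generated dashboard concrete, here is a minimal sketch that builds a dashboard description from a service name. It deliberately uses a plain dictionary rather than grafanalib or Grafana's real JSON model: the structure, the `build_dashboard` name, and the metric paths are all hypothetical, and only the Graphite `sumSeries` function is a real Graphite feature.

```python
def build_dashboard(service, prefix="mesos.tasks"):
    """Return a simplified, hypothetical dashboard description for one service."""

    def panel(title, help_text, target):
        # Each panel carries help text explaining what the graph means.
        return {"title": title, "description": help_text, "target": target}

    base = f"{prefix}.{service}"
    return {
        "title": f"{service} (auto-generated)",
        "panels": [
            panel(
                "Cluster-wide memory usage",
                "Total memory used by all instances of this service.",
                f"sumSeries({base}.*.mem_rss_bytes)",
            ),
            panel(
                "Per-instance memory usage",
                "Memory used by each running instance of this service.",
                f"{base}.*.mem_rss_bytes",
            ),
        ],
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_dashboard("menu-api"), indent=2))
```

Because the dashboard is derived purely from the service name, every service deployed to the cluster can get the same set of graphs without anyone building them by hand.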
17. Right-sizing Resource Usage = $$$ Saved
[Figure labels: Singularity Task, Mesos Cluster, Mesos stats and Metrics]
Shows that memory is over-provisioned for this service.
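One simple way to turn such a graph into a number is to compare the memory allocated to a task with its observed peak usage. The sketch below is an assumption about how that comparison might be done, not the method used in the talk; the function name, the 20% safety margin, and the sample values are hypothetical.

```python
def memory_headroom(allocated_mb, usage_samples_mb, safety_margin=0.2):
    """Estimate how much allocated memory is unused (hypothetical helper).

    The suggested allocation is the observed peak plus a safety margin;
    anything allocated beyond that is reported as over-provisioned.
    """
    if not usage_samples_mb:
        raise ValueError("at least one usage sample is required")
    peak = max(usage_samples_mb)
    suggested = peak * (1 + safety_margin)
    return {
        "peak_mb": peak,
        "suggested_mb": suggested,
        "over_provisioned_mb": max(0.0, allocated_mb - suggested),
    }


if __name__ == "__main__":
    # Example: 2048 MB allocated, observed usage never above 500 MB.
    print(memory_headroom(2048, [300, 450, 500]))
```

With 2048 MB allocated and a 500 MB peak, the sketch suggests an allocation of roughly 600 MB, which is the kind of evidence that helps an engineer choose a smaller request.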
19. [Workflow diagram. Stage labels: Local Build, Provision, Metrics, Monitoring]
Local Docker Testing
Push to Docker Repo
Code
Deploy service to other Mesos Cluster
Deploy Service to Mesos Cluster
Runbooks and escalation policies
Write Puppet Code
Infrastructure Team pushes Puppet code
Identify Metrics or emit metrics
Only application specific metrics
Create application specific dashboards
Optional
26. Key Takeaways
Map out developer workflow and constantly look for opportunities to standardise, automate and enhance.
Make metrics and monitoring part and parcel of the Mesos service.
Engineers don’t always make the best choice when deciding resource usage; help them make an informed choice.
Have a common deployment pipeline across the organisation that facilitates production readiness*.
Having a global data model for logging allows us to make more sense of logging data across the various Mesos tasks.