Monitoring uptime on the Nectar Research Cloud
[Diagram: Nectar Research Cloud sites and Core Services: TPAC, Uni Melb, NCI, Pawsey, QCIF, eRSA, Intersect, Monash]
Nectar architecture
[Diagram: Nectar core services. Components shown: load balancers, dashboard, Tier 1 and Tier 2 service APIs and service engines, message queues, database]
Test everything
● APIs and dashboard are running
● Services are working correctly
● Existing resources are happy (e.g. instances, networks)
● New resources can be created successfully
● Across all sites
Control plane hosts
Nagios
● Ping
● SSH
● NTP
● Filesystem
● Uptime
● Puppet
Ganglia metrics
● CPU
● Memory
● Network
● Disk I/O
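The host checks above all follow the Nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). As a minimal sketch of the filesystem check, assuming illustrative thresholds of 80% and 90% (the function names and thresholds are hypothetical, not taken from the talk):

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a filesystem usage percentage to a Nagios state (thresholds are assumed)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_filesystem(path="/", warn=80.0, crit=90.0):
    """Return (state, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    percent = 100.0 * usage.used / usage.total
    state = classify_usage(percent, warn, crit)
    return state, f"DISK {LABELS[state]} - {path} is {percent:.1f}% full"
```

A plugin wrapper would print the message and pass the state to `sys.exit()` so that Nagios can read the result.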
Control plane services
● Service ports and processes
● HTTP endpoint
● API process
● Oslo middleware healthcheck
● Consistent /healthcheck URL for all services
● Called by load balancers
● More complex tests
● Request token from Keystone
● Check glance for image
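A consistent /healthcheck URL means one probe can cover every service. The sketch below is one possible shape for such a probe, assuming a plain HTTP GET that treats 200 as healthy and anything else as critical; the function name, the injectable `opener` parameter and the example URL are hypothetical.

```python
import urllib.error
import urllib.request

# Nagios plugin exit codes used by this probe.
OK, CRITICAL = 0, 2


def check_healthcheck(base_url, opener=urllib.request.urlopen, timeout=10):
    """Probe <base_url>/healthcheck and return a Nagios-style (state, message)."""
    url = base_url.rstrip("/") + "/healthcheck"
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for error responses such as 503.
        return CRITICAL, f"HEALTHCHECK CRITICAL - {url} returned HTTP {exc.code}"
    except OSError as exc:
        # Covers connection refused, DNS failures and timeouts.
        return CRITICAL, f"HEALTHCHECK CRITICAL - {url} unreachable: {exc}"
    if status == 200:
        return OK, f"HEALTHCHECK OK - {url} returned HTTP 200"
    return CRITICAL, f"HEALTHCHECK CRITICAL - {url} returned HTTP {status}"
```

For example, `check_healthcheck("http://api.example.test:8774")` would probe `http://api.example.test:8774/healthcheck`. The `opener` argument exists only so the probe can be exercised without a live endpoint.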
Environment
Canary instance in each AZ
● Ping
● DHCP
● Metadata
Not an exhaustive test, but a good indicator
Instance boot test
Exercise the whole stack with Tempest
● Fetch a token
● Create a keypair
● Create security groups and rules
● Create instance
● Ping instance
● SSH to instance
● Destroy/clean up all resources
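The important property of the sequence above is that cleanup runs even when a step fails part-way through. The following sketch shows that shape only; it is not Tempest's API. `FakeCloud` and all of its methods are hypothetical stand-ins that record calls instead of talking to a real cloud.

```python
from contextlib import ExitStack


class FakeCloud:
    """Hypothetical stand-in for a cloud client; records calls for illustration."""

    def __init__(self, fail_on=None):
        self.calls = []
        self.fail_on = fail_on

    def _step(self, name, result=None):
        self.calls.append(name)
        if name == self.fail_on:
            raise RuntimeError(f"{name} failed")
        return result

    def fetch_token(self): return self._step("fetch_token", "token")
    def create_keypair(self): return self._step("create_keypair", "keypair")
    def create_security_group(self): return self._step("create_security_group", "group")
    def create_instance(self, keypair, group): return self._step("create_instance", "server")
    def ping(self, server): return self._step("ping")
    def ssh(self, server): return self._step("ssh")
    def delete_instance(self, server): return self._step("delete_instance")
    def delete_security_group(self, group): return self._step("delete_security_group")
    def delete_keypair(self, keypair): return self._step("delete_keypair")


def run_boot_scenario(cloud):
    """Run the boot steps in order; undo created resources in reverse order."""
    with ExitStack() as cleanup:
        cloud.fetch_token()
        keypair = cloud.create_keypair()
        cleanup.callback(cloud.delete_keypair, keypair)
        group = cloud.create_security_group()
        cleanup.callback(cloud.delete_security_group, group)
        server = cloud.create_instance(keypair, group)
        cleanup.callback(cloud.delete_instance, server)
        cloud.ping(server)
        cloud.ssh(server)
    return True
```

Registering each cleanup immediately after its resource is created means a failed ping or SSH step still deletes the instance, security group and keypair before the error propagates.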
Instance boot test
Instance boot for each AZ
● Tiny CirrOS image for speed
● Helps identify site-specific issues
Instance boot for each flavour
● Enough capacity for large flavours?
● Scheduler working properly
● Can be problematic with cells v1
Tempest
● OpenStack integration testing suite
● Jobs launched by Jenkins with custom wrapper script
● Result pushed (passive) to Nagios via NRDP
● Lots more can be done here (e.g. testing more services)
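The wrapper script itself is not shown in the slides. As a rough sketch of the final step, the code below maps a test run to a Nagios state and builds the form body for a passive check submission. The field names (`token`, `cmd=submitcheck`, `XMLDATA`) and the `checkresult` layout are assumed from NRDP's submitcheck interface and should be verified against the installed NRDP version; the function names are hypothetical.

```python
import urllib.parse
from xml.sax.saxutils import escape


def tempest_state(failures, errors):
    """Map a test run to a Nagios state: 0 (OK) if clean, otherwise 2 (CRITICAL)."""
    return 0 if failures == 0 and errors == 0 else 2


def build_nrdp_request(token, hostname, servicename, state, output):
    """Build the form-encoded body for an assumed NRDP 'submitcheck' request."""
    xml = (
        "<?xml version='1.0'?>"
        "<checkresults>"
        # checktype='1' marks the result as passive (assumed NRDP convention).
        "<checkresult type='service' checktype='1'>"
        f"<hostname>{escape(hostname)}</hostname>"
        f"<servicename>{escape(servicename)}</servicename>"
        f"<state>{int(state)}</state>"
        f"<output>{escape(output)}</output>"
        "</checkresult>"
        "</checkresults>"
    )
    fields = {"token": token, "cmd": "submitcheck", "XMLDATA": xml}
    return urllib.parse.urlencode(fields).encode("utf-8")
```

The returned bytes could be sent as the body of an HTTP POST to the NRDP endpoint, for example with `urllib.request.urlopen(nrdp_url, data=body)`, where `nrdp_url` is whatever address the Nagios server exposes.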
Jenkins
Nagios
Status dashboard
Alerts
Nagios
● Configured by Puppet
● Notifications delivered by email and Slack
● Site-specific alerts sent to the site ops team
Analysing logs
ELK
● ElasticSearch, Logstash and Kibana
● Service and LB access logs sent to central syslog server
● Pretty dashboards
● Great for diagnosing issues
Kibana graphs
Tying it all together
● Define nagios_host and nagios_service resources in Puppet
● Nagios configuration built by Naginator from PuppetDB
● Deploy Ganglia
● Custom scripts to extract data from Nagios for dashboard and reports
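To illustrate what "configuration built from resources" produces, the sketch below renders a Nagios object definition from a dictionary of directives. It is not Naginator's implementation, only an approximation of the output format; the function name and the example host and service values are hypothetical.

```python
def render_nagios_object(object_type, params):
    """Render one Nagios object definition block from a dict of directives."""
    # Pad directive names so the values line up in a column.
    width = max((len(key) for key in params), default=0)
    lines = [f"define {object_type} {{"]
    for key in sorted(params):
        lines.append(f"    {key.ljust(width)}  {params[key]}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical resource, similar in spirit to a nagios_service declaration.
example = render_nagios_object(
    "service",
    {
        "host_name": "api1.example.test",
        "service_description": "healthcheck",
        "check_command": "check_http",
    },
)
```

Printing `example` gives a `define service { ... }` block with one directive per line, which is the form Nagios reads from its object configuration files.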
Thanks
Andy Botting
Nectar Core Services
andrew.botting@unimelb.edu.au

Monitoring Uptime on the NeCTAR Research Cloud - Andy Botting, University of Melbourne
