
Monitoring Uptime on the NeCTAR Research Cloud - Andy Botting, University of Melbourne

Audience Level
Intermediate

Synopsis
We will discuss how we monitor the Nectar Research Cloud using tools such as OpenStack Tempest and Nagios, and how we translate the results into a user-facing dashboard.

Speaker Bio:
Andy is a DevOps engineer working at the University of Melbourne in the Core Services team for the Nectar Research Cloud.

Published in: Technology

  1. Monitoring uptime on the Nectar Research Cloud
  2. Nectar architecture (diagram labels: TPAC, Uni Melb, NCI, Core Services, Pawsey, QCIF, eRSA, Intersect, Monash)
  3. Nectar core services (diagram labels: Load Balancer, Dashboard, Tier 1 Service API, Tier 1 Service Engine, Tier 2 Service API, Tier 2 Service Engine, Message Queue, Database)
  4. Test everything
     ● APIs and dashboard are running
     ● Services are working correctly
     ● Existing resources are happy (e.g. instances, networks)
     ● New resources can be created successfully
     ● Across all sites
  5. Control plane hosts
     Nagios
     ● Ping
     ● SSH
     ● NTP
     ● Filesystem
     ● Uptime
     ● Puppet
     Ganglia metrics
     ● CPU
     ● Memory
     ● Network
     ● Disk I/O
  6. Control plane services
     ● Service ports and processes
       ● HTTP endpoint
       ● API process
     ● Oslo middleware healthcheck
       ● Consistent /healthcheck URL for all services
       ● Called by load balancers
     ● More complex tests
       ● Request token from Keystone
       ● Check Glance for image
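A consistent /healthcheck URL makes it easy to write one small check that works for every service. The following is a minimal sketch of such a check, not the plugin used on the Nectar cloud: the function names are hypothetical, and it assumes the endpoint answers HTTP 200 when healthy and 503 when the service has been marked unavailable. The return values are the standard Nagios plugin exit codes.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_healthcheck(status_code, body=""):
    """Map an HTTP response from /healthcheck to (exit_code, message)."""
    detail = body.strip()
    if status_code == 200:
        return OK, f"OK - healthcheck returned 200 {detail}".rstrip()
    if status_code == 503:
        return CRITICAL, f"CRITICAL - service unavailable (503) {detail}".rstrip()
    return WARNING, f"WARNING - unexpected HTTP status {status_code}"


def check_healthcheck(base_url, timeout=5):
    """Fetch <base_url>/healthcheck and classify the result."""
    url = base_url.rstrip("/") + "/healthcheck"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify_healthcheck(resp.status, body)
    except urllib.error.HTTPError as exc:
        # Non-2xx responses arrive here; HTTPError must be caught before URLError.
        return classify_healthcheck(exc.code)
    except OSError as exc:
        # Covers URLError, refused connections and timeouts.
        return CRITICAL, f"CRITICAL - cannot reach {url}: {exc}"
```

A Nagios wrapper would print the message and exit with the code; a load balancer only needs the HTTP status itself.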
  7. Environment
     Canary instance in each AZ
     ● Ping
     ● DHCP
     ● Metadata
     Not an exhaustive test, but a good indicator
  8. Instance boot test
     Exercise the whole stack with Tempest
     ● Fetch a token
     ● Create a keypair
     ● Create security groups and rules
     ● Create instance
     ● Ping instance
     ● SSH to instance
     ● Destroy/clean up all resources
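The boot scenario above boils down to "create resources in order, probe the instance, and always clean up, even on failure". The sketch below illustrates that pattern only. It does not use Tempest's API: `FakeCloud` and its methods are hypothetical in-memory stand-ins so the example runs without a cloud.

```python
from contextlib import ExitStack


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud client (not a real API)."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on  # name of a step that should fail, if any
        self.resources = []     # resources that currently exist
        self.log = []           # ordered record of actions taken

    def create(self, kind):
        if kind == self.fail_on:
            raise RuntimeError(f"failed to create {kind}")
        self.resources.append(kind)
        self.log.append(f"create {kind}")

    def delete(self, kind):
        self.resources.remove(kind)
        self.log.append(f"delete {kind}")

    def probe(self, kind):
        # Stands in for "ping instance" and "SSH to instance".
        if kind == self.fail_on:
            raise RuntimeError(f"{kind} failed")
        self.log.append(f"probe {kind}")


def boot_test(cloud):
    """Run the boot scenario and return (passed, message)."""
    try:
        with ExitStack() as cleanup:
            cloud.log.append("fetch token")
            for kind in ("keypair", "security group", "instance"):
                cloud.create(kind)
                # Registered cleanups run in reverse order when the block exits,
                # whether the test passed or raised.
                cleanup.callback(cloud.delete, kind)
            cloud.probe("ping")
            cloud.probe("ssh")
        return True, "OK - instance booted and reachable"
    except RuntimeError as exc:
        return False, f"CRITICAL - {exc}"
```

Registering each cleanup right after its resource is created means a failure halfway through only removes what actually exists.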
  9. Instance boot test
     Instance boot for each AZ
     ● Tiny CirrOS image for speed
     ● Helps identify site-specific issues
     Instance boot for each flavour
     ● Enough capacity for large flavours?
     ● Scheduler working properly
     ● Can be problematic with cells v1
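One way to picture the two families of boot tests is as a generated job list: one small job per AZ, plus one job per flavour. This is only a sketch under assumptions. The zone and flavour names, the `default_zone` parameter and the job dictionary layout are all hypothetical, and the slide does not say which zone the per-flavour tests run in.

```python
def boot_test_jobs(zones, flavours, default_zone, tiny_flavour):
    """Build the list of boot-test jobs: per-AZ jobs first, then per-flavour jobs."""
    # One quick boot per availability zone, using the smallest flavour.
    jobs = [
        {"name": f"boot-az-{az}", "az": az, "flavour": tiny_flavour, "image": "cirros"}
        for az in zones
    ]
    # One boot per remaining flavour, here all run in a single assumed zone.
    jobs += [
        {"name": f"boot-flavour-{fl}", "az": default_zone, "flavour": fl, "image": "cirros"}
        for fl in flavours
        if fl != tiny_flavour
    ]
    return jobs


if __name__ == "__main__":
    for job in boot_test_jobs(["zone-a", "zone-b"], ["tiny", "large"], "zone-a", "tiny"):
        print(job["name"], job["az"], job["flavour"])
```

Keeping the per-AZ jobs on the smallest flavour keeps them fast, while the per-flavour jobs are the ones that surface capacity and scheduler problems.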
  10. Tempest
      ● OpenStack integration testing suite
      ● Jobs launched by Jenkins with custom wrapper script
      ● Result pushed (passive) to Nagios via NRDP
      ● Lots more can be done here (e.g. testing more services)
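Pushing a test result to Nagios as a passive check via NRDP amounts to an HTTP POST carrying a token and a small XML document. The sketch below shows roughly what a wrapper script might do; it is not the wrapper used on the Nectar cloud. The function names are hypothetical, and the form field names (`token`, `cmd`, `XMLDATA`) and XML layout follow the NRDP format as commonly documented, so check them against the NRDP version you run.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_nrdp_xml(hostname, servicename, state, output):
    """Build an NRDP check result document for one passive service check."""
    root = ET.Element("checkresults")
    # checktype "1" marks the result as passive.
    result = ET.SubElement(root, "checkresult", type="service", checktype="1")
    ET.SubElement(result, "hostname").text = hostname
    ET.SubElement(result, "servicename").text = servicename
    ET.SubElement(result, "state").text = str(state)  # 0 OK, 1 WARNING, 2 CRITICAL
    ET.SubElement(result, "output").text = output
    return ET.tostring(root, encoding="unicode")


def submit_nrdp(url, token, xml_payload, timeout=10):
    """POST the result to an NRDP endpoint and return the raw response body."""
    form = {"token": token, "cmd": "submitcheck", "XMLDATA": xml_payload}
    data = urllib.parse.urlencode(form).encode("ascii")
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")
```

The host and service names in the payload must match a service already defined in Nagios with passive checks enabled, otherwise the result is discarded.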
  11. Jenkins
  12. Nagios
  13. Status dashboard
  14. Alerts
      Nagios
      ● Configured by Puppet
      ● Notifications delivered by email and Slack
      ● Site specific alerts sent to site ops team
  15. Analysing logs
      ELK
      ● Elasticsearch, Logstash and Kibana
      ● Service and LB access logs sent to central syslog server
      ● Pretty dashboards
      ● Great for diagnosing issues
  16. Kibana graphs
  17. Tying it all together
      ● Define nagios_host and nagios_service resources in Puppet
      ● Nagios configuration built by Naginator from PuppetDB
      ● Deploy Ganglia
      ● Custom scripts to extract data from Nagios for dashboard and reports
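The slides do not say how the custom scripts pull data out of Nagios. One possible approach, sketched here as an assumption rather than the method actually used, is to read the `status.dat` file that Nagios Core writes and reduce it to a per-service state map for a dashboard. The function names and the sample host and service names are hypothetical.

```python
STATE_NAMES = {"0": "OK", "1": "WARNING", "2": "CRITICAL", "3": "UNKNOWN"}


def parse_status_dat(text):
    """Parse status.dat style text into a list of dicts, one per block."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if current is None and line.endswith("{"):
            # Block header such as "servicestatus {".
            current = {"_type": line[:-1].strip()}
        elif current is not None and line == "}":
            blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, value = line.split("=", 1)
            current[key] = value
    return blocks


def service_summary(blocks):
    """Map (host, service) to a state name for every servicestatus block."""
    return {
        (b["host_name"], b["service_description"]):
            STATE_NAMES.get(b.get("current_state"), "UNKNOWN")
        for b in blocks
        if b["_type"] == "servicestatus"
    }
```

A dashboard generator could call `service_summary(parse_status_dat(open(path).read()))` on a schedule and render the resulting map, where `path` is wherever your Nagios installation writes its status file.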
  18. Thanks
      Andy Botting
      Nectar Core Services
      andrew.botting@unimelb.edu.au