Monitor Nectar Research Cloud uptime

Audience level: Intermediate
Synopsis: We will discuss how we monitor the Nectar Research Cloud using tools such as OpenStack Tempest and Nagios, and how we translate the results into a user-facing dashboard.
Speaker bio: Andy is a DevOps engineer at the University of Melbourne, working in the Core Services team for the Nectar Research Cloud.

Monitoring uptime on the Nectar Research Cloud

Nectar architecture
[Diagram: Core Services and the sites TPAC, Uni Melb, NCI, Pawsey, QCIF, eRSA, Intersect and Monash]

Nectar core services
[Diagram components: load balancers, dashboard, Tier 1 and Tier 2 service APIs, Tier 1 and Tier 2 service engines, message queues, database]

Test everything
● APIs and dashboard are running
● Services are working correctly
● Existing resources are happy (e.g. instances, networks)
● New resources can be created successfully
● Across all sites

Control plane hosts
Nagios
● Ping
● SSH
● NTP
● Filesystem
● Uptime
● Puppet
Ganglia metrics
● CPU
● Memory
● Network
● Disk I/O
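
As an illustration of the host checks above, here is a minimal sketch of a filesystem check that follows the standard Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds and function names are assumptions made for the example; the talk does not say which plugins are deployed.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a filesystem usage percentage to a (status, message) pair.

    The 80/90 thresholds are illustrative defaults, not values from the talk.
    """
    if percent_used >= crit:
        return CRITICAL, f"CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"WARNING - {percent_used:.1f}% used"
    return OK, f"OK - {percent_used:.1f}% used"


def check_filesystem(path="/", warn=80.0, crit=90.0):
    """Check disk usage for `path` and return a (status, message) pair."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"UNKNOWN - {path} reports zero capacity"
    percent_used = 100.0 * usage.used / usage.total
    return classify_usage(percent_used, warn, crit)
```

A plugin script would print the message and exit with the status code, which is how Nagios decides the state of the service.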

Control plane services
● Service ports and processes
  ● HTTP endpoint
  ● API process
● Oslo middleware healthcheck
  ● Consistent /healthcheck URL for all services
  ● Called by load balancers
● More complex tests
  ● Request a token from Keystone
  ● Check Glance for an image
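
A check against the shared /healthcheck URL could look roughly like the sketch below. It treats HTTP 200 as healthy and anything else, including an unreachable endpoint, as critical. The function names and the injectable `fetch` parameter are assumptions for illustration, not the checks actually used on Nectar.

```python
import urllib.error
import urllib.request

# Nagios-style status codes used by this sketch.
OK, CRITICAL = 0, 2


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None


def check_healthcheck(base_url, fetch=fetch_status):
    """Probe <base_url>/healthcheck and return a (status, message) pair."""
    url = base_url.rstrip("/") + "/healthcheck"
    status = fetch(url)
    if status == 200:
        return OK, f"OK - {url} returned 200"
    if status is None:
        return CRITICAL, f"CRITICAL - {url} unreachable"
    return CRITICAL, f"CRITICAL - {url} returned {status}"
```

Because every service exposes the same path, one check definition can be reused across services by changing only the base URL.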

Environment
Canary instance in each AZ
● Ping
● DHCP
● Metadata
Not an exhaustive test, but a good indicator
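
One way to roll the canary checks up into a single indicator is sketched below. It assumes the ping, DHCP and metadata results for each AZ have already been collected as booleans; how they are collected is not covered in the slides, and the names used here are hypothetical.

```python
# Nagios-style status codes used by this sketch.
OK, CRITICAL = 0, 2

# The three canary checks listed on the slide.
CANARY_CHECKS = ("ping", "dhcp", "metadata")


def summarise_canaries(results):
    """Summarise canary results across availability zones.

    `results` maps an AZ name to a dict of check name -> bool (True = passed).
    A check that is missing from the dict counts as a failure.
    Returns a (status, message) pair.
    """
    failures = []
    for az in sorted(results):
        for check in CANARY_CHECKS:
            if not results[az].get(check, False):
                failures.append(f"{az}:{check}")
    if failures:
        return CRITICAL, "CRITICAL - failed: " + ", ".join(failures)
    return OK, f"OK - canaries healthy in {len(results)} AZ(s)"
```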

Instance boot test
Exercise the whole stack with Tempest
● Fetch a token
● Create a keypair
● Create security groups and rules
● Create instance
● Ping instance
● SSH to instance
● Destroy/clean up all resources
Tempest
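
The shape of this scenario (create resources in order, stop at the first failure, always clean up in reverse) can be sketched without any OpenStack dependency. This is not Tempest code; `run_boot_scenario` and the step tuples are hypothetical names used only to show the control flow.

```python
def run_boot_scenario(steps):
    """Run scenario steps in order and always clean up in reverse order.

    `steps` is a list of (name, action, cleanup) tuples, where `action` and
    `cleanup` are zero-argument callables and `cleanup` may be None.
    A cleanup is registered only after its action succeeds.
    Returns (passed, failed_step_name_or_None, log_lines).
    """
    cleanups = []
    log = []
    failed = None
    try:
        for name, action, cleanup in steps:
            log.append(f"run {name}")
            try:
                action()
            except Exception as exc:
                failed = name
                log.append(f"fail {name}: {exc}")
                break
            if cleanup is not None:
                cleanups.append((name, cleanup))
    finally:
        # Tear down in reverse creation order, e.g. instance before keypair.
        for cleanup_name, cleanup in reversed(cleanups):
            log.append(f"cleanup {cleanup_name}")
            try:
                cleanup()
            except Exception as exc:
                log.append(f"cleanup failed {cleanup_name}: {exc}")
    return failed is None, failed, log
```

Stopping at the first failed step keeps the result easy to read in an alert: the failed step name says which layer of the stack broke.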

Instance boot test
Instance boot for each AZ
● Tiny CirrOS image for speed
● Helps identify site-specific issues
Instance boot for each flavour
● Enough capacity for large flavours?
● Scheduler working properly
● Can be problematic with cells v1
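
A small sketch of how the two families of boot jobs might be enumerated is shown below. It assumes that per-AZ jobs pin the AZ and use a small default flavour, while per-flavour jobs leave placement to the scheduler. That split, along with the job names and the `cirros` image label, is an assumption for illustration and is not stated in the talk.

```python
def build_boot_jobs(azs, flavours, default_flavour, image="cirros"):
    """Return one boot-test job per AZ plus one per flavour.

    Per-AZ jobs pin the availability zone and use `default_flavour`.
    Per-flavour jobs set "az" to None so the scheduler chooses placement.
    """
    jobs = []
    for az in azs:
        jobs.append({
            "name": f"boot-az-{az}",
            "az": az,
            "flavour": default_flavour,
            "image": image,
        })
    for flavour in flavours:
        jobs.append({
            "name": f"boot-flavour-{flavour}",
            "az": None,
            "flavour": flavour,
            "image": image,
        })
    return jobs
```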

Tempest
● OpenStack integration testing suite
● Jobs launched by Jenkins with custom wrapper script
● Results pushed to Nagios as passive checks via NRDP
● Lots more can be done here (e.g. testing more services)
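
The wrapper script itself is not shown in the slides. The sketch below only illustrates how a test run's exit code might be turned into an NRDP passive check submission. The form fields (`token`, `cmd=submitcheck`, `XMLDATA`) and the XML layout follow the NRDP submission format as commonly documented; treat them as assumptions and verify them against the NRDP version in use. The function names are hypothetical.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def tempest_result_to_state(returncode):
    """Map a test-runner exit code to a Nagios state (0 = OK, 2 = CRITICAL)."""
    return 0 if returncode == 0 else 2


def build_nrdp_payload(token, hostname, servicename, state, output):
    """Build the URL-encoded form body for an NRDP passive service result."""
    root = ET.Element("checkresults")
    # checktype "1" marks the result as a passive check.
    result = ET.SubElement(
        root, "checkresult", {"type": "service", "checktype": "1"}
    )
    ET.SubElement(result, "hostname").text = hostname
    ET.SubElement(result, "servicename").text = servicename
    ET.SubElement(result, "state").text = str(state)
    ET.SubElement(result, "output").text = output
    xml_data = ET.tostring(root, encoding="unicode")
    return urllib.parse.urlencode(
        {"token": token, "cmd": "submitcheck", "XMLDATA": xml_data}
    )
```

The returned string would be sent as the body of an HTTP POST to the NRDP endpoint on the Nagios server. Passive submission suits long-running jobs, because Nagios does not have to wait on the test run itself.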
[Screenshots: Jenkins, Nagios, status dashboard]

Alerts
Nagios
● Configured by Puppet
● Notifications delivered by email and Slack
● Site-specific alerts sent to each site's ops team

Analysing logs
ELK
● Elasticsearch, Logstash and Kibana
● Service and LB access logs sent to central syslog server
● Pretty dashboards
● Great for diagnosing issues

[Screenshot: Kibana graphs]

Tying it all together
● Define nagios_host and nagios_service resources in Puppet
● Nagios configuration built by Naginator from PuppetDB
● Deploy Ganglia
● Custom scripts to extract data from Nagios for dashboard and reports
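
The custom reporting scripts are not shown in the slides. As a sketch of the kind of calculation a dashboard or report might need, the function below turns a list of up/down state changes into an availability percentage for a reporting window. It assumes the state changes have already been extracted from Nagios as (timestamp, is_up) pairs; the function name and input format are hypothetical.

```python
def availability_percent(changes, start, end, initially_up=True):
    """Return the percentage of the window [start, end) spent in the up state.

    `changes` is an iterable of (timestamp, is_up) state changes, with
    timestamps in seconds. Changes at or before `start` only set the
    state the window opens with; changes at or after `end` are ignored.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    up = initially_up
    cursor = start
    up_seconds = 0.0
    for ts, is_up in sorted(changes):
        if ts <= start:
            up = is_up
            continue
        if ts >= end:
            break
        if up:
            up_seconds += ts - cursor
        cursor = ts
        up = is_up
    if up:
        up_seconds += end - cursor
    return 100.0 * up_seconds / (end - start)
```

For example, a service that goes down 10 seconds into a 100-second window and recovers at 30 seconds was up for 80 seconds, so the function returns 80.0.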

Thanks
Andy Botting
Nectar Core Services
andrew.botting@unimelb.edu.au
