This document discusses monitoring distributed high-performance computing systems. It describes how to use Nudnik, an infrastructure monitoring tool, to collect metrics from systems, parse those metrics, and act on them. Nudnik can collect baseline metrics with small latencies, collect load-test metrics under CPU, memory, disk, and network stress, and introduce chaos by randomly setting failure percentages or response latencies. It reports metrics to databases and services such as InfluxDB, Elasticsearch, and Prometheus.
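To make the chaos idea concrete, here is a minimal Python sketch of a call wrapper that fails a configured percentage of requests and adds a random response latency, while counting what happened. It is not Nudnik's code or configuration format: `ChaosConfig`, `Metrics`, `chaotic_call`, and `run_experiment` are hypothetical names chosen for illustration, and the delay distribution (uniform up to a maximum) is an assumption.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class ChaosConfig:
    # Hypothetical knobs for illustration; not Nudnik's actual configuration keys.
    failure_percent: float = 10.0  # share of calls that should fail, 0 to 100
    max_latency_s: float = 0.05    # upper bound for the injected delay, in seconds


@dataclass
class Metrics:
    calls: int = 0
    failures: int = 0
    latencies_s: list = field(default_factory=list)


def chaotic_call(handler, config, metrics, rng, sleep=time.sleep):
    """Run handler() after a random delay, failing a configured share of calls."""
    delay = rng.uniform(0.0, config.max_latency_s)
    sleep(delay)
    metrics.calls += 1
    metrics.latencies_s.append(delay)
    # rng.random() is in [0, 1), so 0 percent never fails and 100 percent always fails.
    if rng.random() * 100.0 < config.failure_percent:
        metrics.failures += 1
        raise RuntimeError("injected failure")
    return handler()


def run_experiment(n_calls, config, seed=0):
    """Drive n_calls through chaotic_call and return the collected metrics."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    metrics = Metrics()
    for _ in range(n_calls):
        try:
            # Sleeping is stubbed out so the experiment finishes immediately.
            chaotic_call(lambda: "ok", config, metrics, rng, sleep=lambda _s: None)
        except RuntimeError:
            pass  # injected failures are already counted in metrics
    return metrics
```

Calling `run_experiment(1000, ChaosConfig(failure_percent=25.0))` returns a `Metrics` object whose `failures / calls` ratio should land near 0.25. In a real setup the counters would be shipped to a backend such as InfluxDB, Elasticsearch, or Prometheus instead of being kept in memory.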