NetApp is a global cloud-led, data-centric software company. They are an industry leader in hybrid cloud data services and data management solutions. Their platform enables their customers to store and share large quantities of digital data across physical and hybrid cloud environments. NetApp Engineering’s Site Reliability Engineering team is tasked with supporting their internal build environment, test, and automation infrastructure. After collecting their time-stamped data in InfluxDB, they are using Kapacitor to push alerts directly to Slack via webhooks. Their globally distributed SRE team are able to seamlessly collaborate and troubleshoot. Discover how NetApp uses a time series platform to detect trends in real time that can result in failures within their environments, and to provide key metrics used in SRE postmortems.
Join this webinar as Dustin Sorge will dive into:
NetApp's approach to monitoring their SRE team's metrics — including SLO's and SLI's
Their best practices and techniques for monitoring memory usage and CPU usage
How they use InfluxDB and Telegraf to detect trends and coordinate fixes faster.