Monitoring Alerts and Metrics on
Large Power Systems Clusters
Marcelo Perazolo
Cognitive Systems Architect
IBM Systems
mperazolo@us.ibm.com
Nuremberg, Nov 4-7, 2019
http://osmc.de
• Introduction
• CORAL & Summit Supercomputer case
• Power Firmware Monitoring – The CRASSD open source project
• Power-Ops open source project – an open source collaboration
• Demo
• Conclusion
Agenda
Why Power/OpenPOWER is popular for certain Workloads
• Open Hardware Architecture
• Multiple vendors
• OpenPOWER Foundation
• CORAL: Collaboration of Oak Ridge, Argonne and Lawrence Livermore
• Summit is located at the Oak Ridge Laboratory, used for civilian research
• Sister project: Sierra supercomputer at Lawrence Livermore (nuclear weapons research)
• First supercomputer to reach exaOps performance
• Interconnected by ~185 miles of fiber optic cables
• ~ 5,600 sqft of data center floor space
• ~ 340 tons of hardware and overhead infrastructure
• ~ 13MW power consumption
• 4,608 Power9 AC922 22-core systems
• 27,648 NVIDIA GPUs (6 per node)
• 250 petabytes of storage
• 200Gbps InfiniBand bandwidth between nodes
• Pumps up to 200 petaFLOPS / 3 exaOps
• Helps researchers with AI / BigData / Analytics, HPC capabilities
Case Study: The Summit Supercomputer
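As a quick sanity check on the figures above, the arithmetic can be written out. This is only a sketch: the per-node power number spreads the whole ~13 MW evenly across compute nodes, which is a simplifying assumption and not a figure from the slide.

```python
# Figures quoted on the slide.
NODES = 4_608
GPUS_PER_NODE = 6
TOTAL_POWER_MW = 13  # approximate ("~13MW")

# 4,608 nodes with 6 GPUs each should match the quoted GPU count.
total_gpus = NODES * GPUS_PER_NODE

# Rough average draw per node in kW. Assumption: all of the ~13 MW is
# attributed to the compute nodes, ignoring storage, network and overhead.
avg_kw_per_node = TOTAL_POWER_MW * 1_000 / NODES

print(total_gpus)                 # 27648
print(round(avg_kw_per_node, 2))  # 2.82
```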
Summit: The Most Energy-Efficient Supercomputer
“The world’s smartest supercomputer is sharing data with its cooling
plant, reducing energy consumption and cost”
• “Summit is also the most energy-efficient supercomputer in
its Green500 class—based on gigaflops per watt—outranking systems a 10th as
fast.”
• “We wanted to couple Summit’s mechanical cooling system with its
computational workload to optimize efficiency, which can translate to significant
cost savings for a system of this size.”
• “We’ve developed the infrastructure architecture to scale to millions of events
per second using containerized microservices and popular enterprise open-
source software.”
• “On each Summit node OpenBMC provides real-time data readings from dozens
of sensors totaling more than 460,000 metrics per second that describe power
consumption, temperature, and performance for the entire supercomputer.”
• “Facility staff can now visualize Summit behavior across all 4,608 nodes with a
temperature heat map, a power consumption map, and power and consumption
data broken down by CPUs and GPUs.”
• “Capturing all possible data in real time allows operators and researchers to
gain powerful insights into job behavior, machine performance, and cooling
response.”
*** Quoted from: https://www.hpcwire.com/off-the-wire/olcf-and-providentia-worldwide-build-intelligence-system-for-supercomputer-cooling-plant/
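To put the quoted telemetry rate in perspective, it can be broken down per node and per day. The inputs are the numbers quoted above ("more than 460,000 metrics per second", 4,608 nodes); treating the rate as constant is an assumption made only for this estimate.

```python
METRICS_PER_SECOND = 460_000  # lower bound quoted in the article
NODES = 4_608
SECONDS_PER_DAY = 86_400

# Average rate each node contributes.
per_node = METRICS_PER_SECOND / NODES

# Daily volume, assuming the rate stays constant over 24 hours.
per_day = METRICS_PER_SECOND * SECONDS_PER_DAY

print(f"{per_node:.1f} metrics/s per node")  # 99.8 metrics/s per node
print(f"{per_day:,} metrics per day")        # 39,744,000,000 metrics per day
```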
Summit: High Level Hardware/Architecture View
CRASSD
Firmware Alerts & Telemetry from Power nodes flow to CRASSD servers and then to open tools for
visualization, such as Grafana and the Elastic Stack. Data includes power consumption, frequencies, cooling, etc.
CRASSD: Open tooling for Power Firmware Monitoring
CRASSD Facts
▪ CORAL required telemetry data for all
nodes/layers in the Power Cluster
▪ Proposed RAS architecture had flaws:
▪ No method existed to route errors from the BMC
▪ Built CRASSD as an open tool:
–To collect error events and sort them using policy tables
–Extended the daemon to gather sensor readings to
fulfill ORNL telemetry requirements
–Provides an API that makes it easy to develop plug-ins
using various open source monitoring tools
▪ The results have been impressive, and many
more use cases are being developed
▪ CRASSD is currently being incorporated into
other solutions where the same requirements
exist, e.g. the Power-Ops stack.
Available at: https://github.com/open-power-ref-design-toolkit/ibm-crassd
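The idea of sorting error events with a policy table and handing them to plug-ins can be sketched as follows. This is an illustration only, not CRASSD's actual data format or API: the event IDs, the `Policy` fields, and the plug-in callback signature are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    severity: str      # hypothetical field
    serviceable: bool  # hypothetical field


# Hypothetical policy table keyed by event ID (IDs invented for this sketch).
POLICY_TABLE = {
    "FAN_FAILURE": Policy("critical", True),
    "TEMP_WARNING": Policy("warning", False),
}
DEFAULT_POLICY = Policy("info", False)
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def classify(events):
    """Attach a policy to each (node, event_id) pair, most severe first."""
    tagged = [
        (node, event_id, POLICY_TABLE.get(event_id, DEFAULT_POLICY))
        for node, event_id in events
    ]
    return sorted(tagged, key=lambda item: SEVERITY_ORDER[item[2].severity])


def dispatch(events, plugins):
    """Forward every classified event to each plug-in callback."""
    for node, event_id, policy in classify(events):
        for plugin in plugins:
            plugin(node, event_id, policy)


events = [("node2", "TEMP_WARNING"), ("node1", "FAN_FAILURE"), ("node3", "UNKNOWN")]
received = []
dispatch(events, [lambda node, event_id, policy: received.append((node, policy.severity))])
print(received)
```

Keeping the table as plain data means new event types can be handled by editing the table, without touching the dispatch code.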
Motivations
• Replace legacy tools and solutions with modern/open alternatives for Power clusters
• Monitoring for x86 is feature-rich and commoditized with extensive support
• Not so much for Power, e.g.: Elastic on Power still on v5.x; new v7.x now has binaries (x86 only)
• Power users often need to port / build / configure these tools from scratch !!
➔ May influence cost of maintenance, and thus the decision to use Power at all
• Automate a complete ecosystem of tools that fit all needs of a modern Ops stack
• types of data: logs/alerts vs. telemetry
• analysis: historical vs. real-time
• multi-layer aggregation: firmware, OS, services, etc.
• single system or cluster-wide
➔ Popular stacks use Grafana & Prometheus, ELK, Nagios / Icinga / Zabbix, Netdata, etc.
and are deployed/configured by tools such as Ansible, Terraform, Salt, Puppet, etc.
Proposal: Build & curate a key set of modern open tools for Power systems, engage Power systems
users and open source monitoring/ops community
Value 1: reduce cost of modernizing Operations for existing Power clusters (legacy → open)
Value 2: enable adding Power nodes easily into data centers that already use modern Ops tooling
Value 3: reduce the entry cost of operations for new solutions interested in Power advantages
Beyond Power Firmware Monitoring: Power-Ops project
Power-Ops: Open tooling for Power Cluster Operations
Power-Ops Facts
▪ Management stack runs on Power LE architecture
▪ Managed endpoints supported are Power Linux
(could also be easily used on x86):
▪ RedHat family of OSs
▪ Debian/Ubuntu family of OSs
▪ AIX (limited, starting to be supported as endpoints)
▪ Composed of automation components using
Ansible playbooks
▪ 3 Main goals:
▪ Bring-up and pre-configure target platforms
(Bare-Metal, Virtual Machines, Containers*)
▪ Build components not currently available on the
Power platform
▪ Deploy and Configure tooling and start-up dashboards that
work off-the-shelf with Power
▪ Growing community of interested end-users
Power-Ops: Bring-Up
The Bring-Up Process
▪ DevOps professional triggers process on
CI/CD platform
▪ CI/CD tools invoke Ansible
▪ Ansible Playbooks interact with IaaS of choice
▪ Nodes are brought up targeted for different roles:
–Builders
–Controllers
–Endpoints
▪ Bring-up includes powering-up (if needed) and
laying down pre-requisites for building or
deployment
–OS
–Packages & Libraries
–Access configuration
–Software configuration
[Diagram labels: devops, CI/CD, builders, controllers, endpoints]
The IaaS of choice could be one of several options, e.g.
- Bare-Metal
- Hypervisors on Power
- Power Hyperconverged Infrastructure
- Containers on OpenShift, etc.
(integrations are easy: just drop in a playbook)
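The three bring-up roles map naturally onto Ansible inventory groups. A minimal sketch, assuming an INI-style inventory with one group per role; the host names are placeholders and the helper `render_inventory` is hypothetical, not part of Power-Ops.

```python
ROLES = ("builders", "controllers", "endpoints")


def render_inventory(hosts_by_role):
    """Render an INI-style Ansible inventory with one group per role."""
    unknown = set(hosts_by_role) - set(ROLES)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    lines = []
    for role in ROLES:
        lines.append(f"[{role}]")
        lines.extend(hosts_by_role.get(role, []))
        lines.append("")  # blank line between groups
    return "\n".join(lines)


# Placeholder host names, for illustration only.
inventory = render_inventory({
    "builders": ["builder-1"],
    "controllers": ["controller-1"],
    "endpoints": ["endpoint-1", "endpoint-2"],
})
print(inventory)
```

The resulting text could be saved to a file and passed to `ansible-playbook` with `-i`.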
Power-Ops: Build
The Build Process
▪ Many components are already available on Power,
but there are exceptions
▪ CRASSD: source on github
▪ Build process generates packages for Debian, RedHat
▪ Go Lang
▪ Go Daemon binary must be recompiled on Power
▪ Elastic Stack
–Up to v5.x code is implemented in Java
–Newer releases include binaries (not yet supported)
–Beats must be re-packaged for Debian, RedHat
▪ All relevant packages are then stored on a
local repository
▪ Doesn’t have to run frequently
–DevOps orgs could automate upstream integration
[Diagram labels: devops, CI/CD, builders, libs, repo]
Generates binaries/packages for Power not yet widely available on public repos.
Long-term goal is to integrate Power packages into upstream repositories.
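One practical detail when packaging for both families is that little-endian 64-bit Power is labelled `ppc64el` in Debian packages and `ppc64le` in RPMs. The sketch below builds file names following the usual `.deb` and `.rpm` naming conventions; the helper name and the version and release values are hypothetical and not taken from the Power-Ops build.

```python
# Architecture label for little-endian 64-bit Power, per packaging family.
PKG_ARCH = {"debian": "ppc64el", "redhat": "ppc64le"}


def package_filename(name, version, release, family):
    """Return a package file name following each family's naming convention."""
    arch = PKG_ARCH[family]  # raises KeyError for an unsupported family
    if family == "debian":
        return f"{name}_{version}-{release}_{arch}.deb"
    return f"{name}-{version}-{release}.{arch}.rpm"


# Version and release numbers are made up for the example.
print(package_filename("ibm-crassd", "1.0", "1", "debian"))
print(package_filename("ibm-crassd", "1.0", "1", "redhat"))
```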
Power-Ops: Deploy
The Deploy Process
▪ Choose deployment topology
▪ Where each component is deployed to
▪ How they interconnect with each other
▪ Deploy tooling to nodes
▪ Elastic Stack, Netdata, CRASSD go to Controller nodes
▪ Beats (FileBeat, MetricBeat) go to Endpoint nodes
▪ Deploy configuration & visualizations/dashboards
▪ CRASSD is configured to collect firmware data:
Telemetry data goes to Netdata
Alerting data goes to Logstash
▪ FileBeat collects logs and sends to Logstash
▪ MetricBeat collects telemetry and sends to Elasticsearch
▪ Visualizations/Dashboards are deployed to Netdata and
Kibana
▪ Operators can then access User Interfaces from
Kibana and Netdata
[Diagram labels: devops, CI/CD, repo, CRASSD]
Flexible deployment to both controllers and endpoints
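The deployment topology described above can be captured as plain data and checked for consistency before anything is deployed. A minimal sketch: the routes and placements follow the bullets on this slide, but expanding "Elastic Stack" into Elasticsearch, Logstash and Kibana on the controller is an assumption, and the function names are hypothetical.

```python
# (source, kind of data) -> destination, as listed on the slide.
ROUTES = {
    ("crassd", "telemetry"): "netdata",
    ("crassd", "alerts"): "logstash",
    ("filebeat", "logs"): "logstash",
    ("metricbeat", "telemetry"): "elasticsearch",
}

# Node role -> components deployed there. Assumption: "Elastic Stack" on the
# controller means Elasticsearch, Logstash and Kibana.
PLACEMENT = {
    "controller": {"elasticsearch", "logstash", "kibana", "netdata", "crassd"},
    "endpoint": {"filebeat", "metricbeat"},
}


def unplaced_components():
    """Return components used in a route but not deployed on any node role."""
    placed = set().union(*PLACEMENT.values())
    used = {src for src, _ in ROUTES} | set(ROUTES.values())
    return sorted(used - placed)


def destinations(source):
    """Return {kind of data: destination} for one source component."""
    return {kind: dst for (src, kind), dst in ROUTES.items() if src == source}


print(unplaced_components())
print(destinations("crassd"))
```

An empty list from `unplaced_components()` means every sender and receiver in the routing table has a home in the chosen topology.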
Demo Overview
[Diagram: demo topology. Labels recoverable from the figure:]
- Controllers: wmdepos (P8 bare metal), launchgr01 (P9 bare metal)
- Endpoints: pops-ubuntu-ept (VM), pops-redhat-ept (VM), pops-aix-ept (VM), bos-1 (P9)
- crassd (shown twice)
- Marcelo’s Laptop; github repos; deployment playbooks (deploy)
- Data flows: f/w alerts, telemetry + logs
- Firmware: 192.168.10.25, IPMI, OBMC telemetry
Dashboards:
- F/W Alerts (Kibana)
- Logs/Infrastructure (Kibana)
- Cluster Metrics (Kibana)
- OS & F/W Metrics (Netdata)
(*) F/W data supported on Power9 systems
DEMO / Walk-through
Next Steps
Grow the community
1. Engage with traditional Power systems users (e.g. AIX, legacy Power) promoting modernization
2. Engage with Power Linux community, promote sharing solutions for everybody’s benefit
3. Engage with Open Source communities, promote support of Power out of the box (where it doesn’t yet exist)
4. Use as a catalyst for monitoring of new large Power clusters (taking advantage of lower cost of entry on Power)
Enhance the Operational Stack
• Add Call Home support to CRASSD
• Support more deployment use cases, such as:
• Containers (development under way)
• Broader integration targeting other IaaS/PaaS solutions (e.g. OpenShift clusters)
• Support additional tools, such as:
• Prometheus / Grafana (development planned)
• Zabbix and/or Nagios / Icinga, others… (feel free to suggest / collaborate !!!)
• Support additional hardware, such as:
• Support other/newer BMC Firmware interfaces such as Redfish
• Monitor GPUs, Networking & Storage equipment
• More Power / OpenPOWER system models
• Currency work to support and maintain newer releases of tooling, e.g.
• Migrate to Elastic Stack v7.x (needs automation)
• Add support for more Beats
• More AIX support
Q&As
Backup
Kibana: Dashboard for Power Firmware events (fed from CRASSD Alerts)
Kibana: Dashboard for Power Infrastructure logs (fed from FileBeat)
Kibana: Multiple Dashboards for Long-Term Power metrics
(fed from MetricBeat and kept on Elasticsearch)
+ more
Netdata: Dashboards for Real-Time Power Firmware metrics (fed from CRASSD)
and Power Infrastructure metrics (fed from other Netdata plugins)

Malibou Pitch Deck For Its €3M Seed Round
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Benefits of Artificial Intelligence in Healthcare!
Benefits of  Artificial Intelligence in Healthcare!Benefits of  Artificial Intelligence in Healthcare!
Benefits of Artificial Intelligence in Healthcare!
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
What is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdfWhat is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdf
 

OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by Marcelo Perazolo

  • 1. Your logo here Monitoring Alerts and Metrics on Large Power Systems Clusters Marcelo Perazolo Cognitive Systems Architect IBM Systems mperazolo@us.ibm.com Nuremberg, Nov 4-7, 2019 http://osmc.de
  • 2. • Introduction • CORAL & Summit Supercomputer case • Power Firmware Monitoring – The CRASSD open source project • Power-Ops open source project – an open source collaboration • Demo • Conclusion Agenda
  • 3. Why Power/OpenPOWER is popular for certain Workloads • Open Hardware Architecture • Multiple vendors • OpenPOWER Foundation
  • 4. • CORAL: Collaboration of Oak Ridge, Argonne and Lawrence Livermore • Summit is located at the Oak Ridge Laboratory, used for civilian research • Sister project: Sierra supercomputer at Lawrence Livermore (nuclear weapons research) • First supercomputer to reach exaOps performance • ~ interconnected by 185 miles of fiber optic cables • ~ 5,600 sqft of data center floor space • ~ 340 tons of hardware and overhead infrastructure • ~ 13MW power consumption • 4,608 Power9 AC922 22-core systems • 27,648 NVIDIA GPUs (6 per node) • 250 Peta Bytes of Storage • 200Gbps InfiniBand bandwidth between nodes • Pumps up to 200 petaFLOPS / 3 exaOps • Helps researchers with AI / BigData / Analytics, HPC capabilities Case Study: The Summit Supercomputer
  • 5. Summit: The Most Energy-Efficient Supercomputer “The world’s smartest supercomputer is sharing data with its cooling plant, reducing energy consumption and cost” • “Summit is also the most energy-efficient supercomputer in its Green500 class—based on gigaflops per watt—outranking systems a 10th as fast.” • “We wanted to couple Summit’s mechanical cooling system with its computational workload to optimize efficiency, which can translate to significant cost savings for a system of this size.” • “We’ve developed the infrastructure architecture to scale to millions of events per second using containerized microservices and popular enterprise open-source software.” • “On each Summit node OpenBMC provides real-time data readings from dozens of sensors totaling more than 460,000 metrics per second that describe power consumption, temperature, and performance for the entire supercomputer.” • “Facility staff can now visualize Summit behavior across all 4,608 nodes with a temperature heat map, a power consumption map, and power and consumption data broken down by CPUs and GPUs.” • “Capturing all possible data in real time allows operators and researchers to gain powerful insights into job behavior, machine performance, and cooling response.” *** Quoted from: https://www.hpcwire.com/off-the-wire/olcf-and-providentia-worldwide-build-intelligence-system-for-supercomputer-cooling-plant/
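As a rough sanity check on the figures quoted in slides 4 and 5, the short sketch below derives the per-node metric rate and the total GPU count from the numbers stated there. Because the quote says "more than 460,000 metrics per second", the per-node rate is only a lower bound; the variable names are illustrative.

```python
# Figures taken from slides 4 and 5 of the deck.
NODES = 4608                      # Power9 AC922 systems
GPUS_PER_NODE = 6                 # NVIDIA GPUs per node
METRICS_PER_SECOND = 460_000      # "more than" this, so a lower bound

# Lower bound on metrics each node contributes per second.
metrics_per_node = METRICS_PER_SECOND / NODES

# Total GPU count implied by nodes x GPUs per node.
total_gpus = NODES * GPUS_PER_NODE

print(f"at least {metrics_per_node:.1f} metrics/s per node")
print(f"{total_gpus} GPUs in total")
```

Roughly one hundred readings per node per second is consistent with "dozens of sensors" on each node being sampled more than once per second, and the GPU total matches the 27,648 stated on slide 4.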
  • 6. Summit: High Level Hardware/Architecture View CRASSD Firmware Alerts & Telemetry from Power nodes flow to Crassd servers and then to open tools for visualization such as Grafana, Elastic Stack. Data includes power consumption, frequencies, cooling, etc.
  • 7. CRASSD: Open tooling for Power Firmware Monitoring CRASSD Facts ▪ CORAL required telemetry data for all nodes/layers in the Power Cluster ▪ Proposed RAS architecture had flaws: ▪ No method existed to route errors from the BMC ▪ Built CRASSD as an open tool: –To collect error events and sort using policy tables –Extended the daemon to gather sensor readings to fulfill ORNL telemetry requirements –Provides an API that makes it easy to develop plug-ins using various Open Source monitoring tools ▪ The results have been impressive, and many more use cases are being developed ▪ CRASSD currently being incorporated into other Solutions where the same requirements exist, e.g. Power-Ops stack. Available at: https://github.com/open-power-ref-design-toolkit/ibm-crassd
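The idea of sorting error events with a policy table can be illustrated with a minimal sketch. This is not the actual ibm-crassd table format or API: the event IDs, the `Policy` fields, and the function names below are all hypothetical, chosen only to show a table lookup followed by a severity ordering.

```python
from dataclasses import dataclass

# Lower rank means more urgent. The severity names are an assumption.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Policy:
    severity: str   # how serious the event is
    forward: bool   # whether to pass it on to monitoring plug-ins


# Hypothetical policy table keyed by made-up event IDs.
POLICY_TABLE = {
    "EVT-PSU-FAIL": Policy("critical", True),
    "EVT-FAN-SLOW": Policy("warning", True),
    "EVT-BOOT-OK": Policy("info", False),
}

# Events missing from the table are treated as warnings and forwarded.
DEFAULT_POLICY = Policy("warning", True)


def classify(event: dict) -> dict:
    """Attach the policy fields for this event's ID to a copy of the event."""
    policy = POLICY_TABLE.get(event["id"], DEFAULT_POLICY)
    return {**event, "severity": policy.severity, "forward": policy.forward}


def sort_events(events: list[dict]) -> list[dict]:
    """Classify events, drop those not forwarded, order by urgency."""
    classified = [classify(e) for e in events]
    forwarded = [e for e in classified if e["forward"]]
    return sorted(forwarded, key=lambda e: SEVERITY_RANK[e["severity"]])


events = [
    {"id": "EVT-BOOT-OK", "node": "n1"},
    {"id": "EVT-FAN-SLOW", "node": "n2"},
    {"id": "EVT-PSU-FAIL", "node": "n3"},
    {"id": "EVT-UNKNOWN", "node": "n4"},
]
for e in sort_events(events):
    print(e["severity"], e["id"], e["node"])
```

Keeping the policy in a data table rather than in code means operators can change how an event is handled without touching the daemon, which is presumably part of the appeal of the policy-table approach.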
  • 8. Motivations • Replace legacy tools and solutions with modern/open alternatives for Power clusters • Monitoring for x86 is feature-rich and commoditized with extensive support • Not so much for Power, e.g.: Elastic on Power still on v5.x; new v7.x now has binaries (x86 only) • Power users often need to port / build / configure these tools from scratch !! ➔ May influence cost of maintenance, thus decision to use Power at all • Automate a complete ecosystem of tools that fit all needs of a modern Ops stack • types of data: logs/alerts vs. telemetry • analysis: historical vs. real-time • multi-layer aggregation: firmware, OS, services, etc. • single system or cluster-wide ➔ Popular stacks use Grafana & Prometheus, ELK, Nagios / Icinga / Zabbix, Netdata, etc. and are deployed/configured by tools such as Ansible, Terraform, Salt, Puppet, etc. Proposal: Build & curate a key set of modern open tools for Power systems, engage Power systems users and open source monitoring/ops community Value 1: reduce cost of modernizing Operations for existing Power clusters (legacy → open) Value 2: enable adding Power nodes easily into data centers that already use modern Ops tooling Value 3: reduced entry cost of Operation for new solutions interested in Power advantages Beyond Power Firmware Monitoring: Power-Ops project
  • 9. Power-Ops: Open tooling for Power Cluster Operations Power-Ops Facts ▪ Management stack runs on Power LE architecture ▪ Managed endpoints supported are Power Linux (could also be easily used on x86): ▪ RedHat family of OSs ▪ Debian/Ubuntu family of OSs ▪ AIX (limited, starting to be supported as endpoints) ▪ Composed of automation components using Ansible playbooks ▪ 3 Main goals: ▪ Bring-up and pre-configure target platforms (Bare-Metal, Virtual Machines, Containers*) ▪ Build components not currently available on the Power platform ▪ Deploy and Configure tooling and start-up dashboards that work off-the-shelf with Power ▪ Growing community of interested end-users
  • 10. Power-Ops: Bring-Up The Bring-Up Process ▪ DevOps professional triggers process on CI/CD platform ▪ CI/CD tools invoke Ansible ▪ Ansible Playbooks interact with IaaS of choice ▪ Nodes are brought up targeted for different roles: –Builders –Controllers –Endpoints ▪ Bring-up includes powering-up (if needed) and laying down pre-requisites for building or deployment –OS –Packages & Libraries –Access configuration –Software configuration devops CI/CD builders controllers endpoints This could be one of several choices, e.g. - Bare-Metal - Hypervisors or Power - Power Hyperconverged Infrastructure - Containers on OpenShift, etc. (integrations are easy, just drop playbook)
  • 11. Power-Ops: Build The Build Process ▪ Many components are already available on Power, but there are exceptions ▪ CRASSD: source on github ▪ Build process generates packages for Debian, RedHat ▪ Go Lang ▪ Go Daemon binary must be recompiled on Power ▪ Elastic Stack –Up to v5.x code is implemented in Java –Newer releases include binaries (not yet supported) –Beats must be re-packaged for Debian, RedHat ▪ All relevant packages are then stored on a local repository ▪ Doesn’t have to run frequently –DevOps orgs could automate upstream integration devops CI/CD builders repo Generates binaries/packages for Power not yet widely available on public repos Long-term goal is to integrate Power packages onto upstream repositories libs
  • 12. Power-Ops: Deploy The Deploy Process ▪ Choose deployment topology ▪ Where each component is deployed to ▪ How they interconnect with each other ▪ Deploy tooling to nodes ▪ Elastic Stack, Netdata, Crassd go to Controller nodes ▪ Beats (FileBeat, MetricBeat) go to Endpoint nodes ▪ Deploy configuration & visualizations/dashboards ▪ Crassd is configured to collect firmware data: Telemetry data goes to Netdata Alerting data goes to Logstash ▪ FileBeat collects logs and sends to Logstash ▪ MetricBeat collects telemetry and sends to Elasticsearch ▪ Visualizations/Dashboards are deployed to Netdata and Kibana ▪ Operators can then access User Interfaces from Kibana and Netdata devops CI/CD repo CRASSD Flexible deployment to both controllers and endpoints
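The placement and data-flow rules on the Deploy slide can be summarized as two lookup tables. The sketch below is only a restatement of the slide in Python, not code from the Power-Ops playbooks; the dictionary names, the lowercase keys, and the `route` helper are assumptions made for illustration.

```python
# Which node role each component is deployed to, as listed on the slide.
PLACEMENT = {
    "elastic-stack": "controller",
    "netdata": "controller",
    "crassd": "controller",
    "filebeat": "endpoint",
    "metricbeat": "endpoint",
}

# Where each collector sends each kind of data, as listed on the slide.
ROUTES = {
    ("crassd", "telemetry"): "netdata",
    ("crassd", "alerts"): "logstash",
    ("filebeat", "logs"): "logstash",
    ("metricbeat", "telemetry"): "elasticsearch",
}


def route(collector: str, kind: str) -> str:
    """Return the destination for a collector's data, or raise if undefined."""
    try:
        return ROUTES[(collector, kind)]
    except KeyError:
        raise ValueError(f"no route defined for {collector}/{kind}") from None


for (collector, kind), destination in ROUTES.items():
    role = PLACEMENT[collector]
    print(f"{collector} ({role}) sends {kind} to {destination}")
```

Writing the topology down as data makes the split visible: firmware telemetry goes to Netdata for real-time views, while alerts and logs pass through Logstash and MetricBeat data lands in Elasticsearch for the Kibana dashboards.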
  • 13. Demo Overview (controller) wmdepos P8 bare metal Marcelo’s Laptop (endpoint/VM) pops-ubuntu-ept crassd (endpoint/VM) pops-redhat-ept (endpoint / P9) bos-1 github repos deploy f/w alerts telemetry + logs (controller) launchgr01 P9 bare metal crassd Dashboards: - F/W Alerts (Kibana) - Logs/Infrastructure (Kibana) - Cluster Metrics (Kibana) - OS & F/W Metrics (Netdata) firmware 192.168.10.25 IPMI OBMCtelemetry deployment playbooks (*) (*) F/W data supported on Power9 systems (endpoint/VM) pops-aix-ept
  • 15. Next Steps Grow the community 1. Engage with traditional Power systems users (e.g. AIX, legacy Power) promoting modernization 2. Engage with Power Linux community, foster benefits of sharing solutions for everybody’s benefit 3. Engage with Open Source communities, promote support of Power out of the box (when such doesn’t yet exist) 4. Use as a catalyst for monitoring of new large Power clusters (taking advantage of lower cost of entry on Power) Enhance the Operational Stack • Add Call Home support to CRASSD • Support more deployment use cases, such as: • Containers (development under way) • Broader integration targeting other IaaS/PaaS solutions (e.g. OpenShift clusters) • Support additional tools, such as: • Prometheus / Grafana (development planned) • Zabbix and/or Nagios / Icinga, others… (feel free to suggest / collaborate !!!) • Support additional hardware, such as: • Support other/newer BMC Firmware interfaces such as Redfish • Monitor GPUs, Networking & Storage equipment • More Power / OpenPOWER system models • Currency work to support and maintain newer releases of tooling, e.g. • Migrate to Elastic Stack v7.x (needs automation) • Add support for more Beats • More AIX support
  • 16. Q&As
  • 18. Kibana: Dashboard for Power Firmware events (fed from CRASSD Alerts)
  • 19. Kibana: Dashboard for Power Infrastructure logs (fed from FileBeat)
  • 20. Kibana: Multiple Dashboards for Long-Term Power metrics (fed from MetricBeat and kept on Elasticsearch) + more
  • 21. Netdata: Dashboards for Real-Time Power Firmware metrics (fed from CRASSD) and Power Infrastructure metrics (fed from other Netdata plugins)