SlideShare a Scribd company logo
Andrea Bartolini, Prof
University of Bologna – DEI, Italy
Combining out-of-band monitoring with AI and big data
for datacenter automation in OpenPOWER
Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• Big Data/AI enabled Anomaly Detection
• Future Works
Performance
Analysis
Scalable Moniitoring
Framework
Machine
Learning
Data
Visualization
Resources
Management
Energy
efficiency
Job
Scheduling
Heterogeneous
Sensors
Common Interface
CRAC
PDU
CLUSTER
Reactive and Proactive
Feedbacks
ENV.
A New Trend: Datacentre Automation
Usage Scenarios
Fine Grain Power and
Performance Measurements:
- Verify and classify node performance
- In spec / out of spec behaviour
- Miss configuration
- Aging and wear out
- Detect security hazards
- Predictive maintenance
Coarse grain
Fine grain
CPU
CPU
ACC ACC
Node
DIMMDIMMDIMM
Performance Counters:
- Node components
- Microarchitectural events
AI
Datacenter Automation Design and Bottlenecks
Centralized
Monitoring
&
Analytics
Edge Monitoring
& Analytics
Low
Data-Rate
Huge
Data-Rate
Centralized
Monitoring
&
Analytics
Bottlenecks:
• Network BW
• Storage
• SW Overhead
Infrastructure Sensors (e.g., CRAC, PDU, Etc.)
Node-1 Node-n Node-nNode-1
Infrastructure Sensors (e.g., CRAC, PDU, Etc.)
Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• Big Data/AI enabled Anomaly Detection
• Future Works
7
D.A.V.I.D.E.
SUPERCOMPUTER
(Development of an
Added
Value
Infrastructure
Designed in
Europe)
D.A.V.I.D.E. SUPERCOMPUTER
(Development of an Added Value Infrastructure Designed in Europe)
OCP form-factor compute node
based on IBM Minsky
2xIB EDR
LIQUID COOLING
4x Tesla P100 HSMX2
University of Bologna / ETH Zurich
DiG: FINE GRAIN POWER AND
PERFORMANCE MONITORING & ANALYTICS
BusBar
2 x POWER8 with NVLink
• Power measuring block - placed between
the power sensing unit (PSU) and the DC-
DC converter provide overall node power
consumption
• Embedded system (BeagleBoneBlack)
connected through pass through
commands to node performance metrics
• Scalable interface to the data analysis point
through the MQTT protocol
• Embedded system powerful enough to
support edge analytics & inference
DiG: Out-of-band Power & Performance
Monitoring
Architecture
• Sub-Watt precision
• Power Monitoring Sampling rate
@50kS/s (T=20us)
• Performance Monitoring 242
Amester metrics every 10s
• Time synchronized (±3σ)
• NTP < 35us
• PTP < 1.3us
State-of-the art systems (HDEEM and PowerInsight)
• Max. 1 ms sampling period
• Use data only offline (no possibility for real-time computing)
Architecture
DiG: Out-of-band Power & Performance
Monitoring
Framework Fsmax [kHz]
E4 PPBB 50
HDEEM 1
PowerInsight 1
Nice, but how do I use it?
11
Application 1
Application 2
Spectral signature of an
application!
Real-time Frequency analysis on power supply and more…a live oscilloscope
• For instance, using the FFT we plot the power spectral density of the power
benchmark of two applications, and we can distinguish them by the harmonics
present in each of the signals
Low overhead, accurate monitoring
Interesting feature for node
level and system level
Intrusion Detection System!
OK, but do I automate?
Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• Big Data/AI enabled Anomaly Detection
• Future Works
Scalable Data Collection, Analytics
Sens_pub
Broker1
Sens_pub Sens_pub
Cassandra
node1
MQTT
Sens_pub
BrokerM
Sens_pub Sens_pub
Cassandra
nodeM
Grafana
Back-end
• MQTT–enabled sensor
collectors
Front-end
• MQTT Brokers
• Data Visualization
• NoSQL Storage
• Big Data Analytics
Apache
Spark
Target Facility
MQTT Brokers
Applications
NoSQL
ADMIN
MQTT2Kairos MQTT2kairos
Kairosdb
Panda Matlab
MQTT: MQ Telemetry Transport
•Lightweight message queuing and transport protocol
•Developed by IBM and Eurotech
•Well suited for low resource demanding scenarios like M2M,
WSN and IoT applications
•Basic features:
•PubSub model
•Async communication protocol (messages)
•Low overhead packet (2 bytes header)
•QoS (3 levels)
•Open source implementation:
•https://mosquitto.org/
Publisher
Topic
(Broker)
Subscriber
(mosquitto_pub) (mosquitto_sub)(mosquitto)
Cassandra Column Family
MQTT Publishers
facility/sensors/B
Sens_pub_A Sens_pub_B Sens_pub_C
Metric:
A
Tags:
facility
Sensors
Metric:
B
Tags:
Facility
Sensors
Metric:
C
Tags:
facility
sensors
facility/sensors/# MQTT2KairosdbMQTT
Broker
MQTT to NoSQL Storage: MQTT2Kairosdb
= {Value;Timestamp}
12-Nov-2018 16
DiG: Power & Perf Meas on D.A.V.I.D.E.
Broker
MQTT
DiG
MQTT
Pow_pub IPMI_pub OCC_pub
DiG SW Daemons
PSU_pub Cooling_pub
D.A.V.I.D.E. Front-End
Power:
• 20μs → on-board Analysis
• 1s,1ms → to Central Unit (45 kS/s)
IPMI: 89 metrics per node every 5sOCC: 242 metrics per node every 10s
IPMI
Liteon
Overall Rack info
(e.g., Total Power)
Asetek
Info Liquid Cooling
Antonio Libri
Slurm
Slurm_pub
Examon Analytics: Batch training + Edge inference
examon-client
(REST)
(Batch)
Pandas
dataframe
Embedded Board:
Monitoring +
Anomaly Detection
Monitoring
Infrastructure
Historical Data
collect data
1
DL Model
train
2
Computing
Node 1
Embedded
Board
Computing
Node 2
Embedded
Board
Computing
Node N
…
3
load trained
model in
boards
DL
4 Normal
Behaviour
Anomalyonline anomaly
detection on live, new
data
AI+Big Data on D.A.V.I.D.E.:
Example of Anomaly detection
1. Collect Data
2. AI Train
3. Edge
4. AI Inference
Does it work?
Anomaly Detection
X Y
AUTOENCODER
…
Z
encoder decoder
• An autoencoder tries to copy its input (X) into its output (Y)
• An autoencoder learns to represent its input in the latent space,
extracting the important characteristics of the input set X
IDEA: train an autoencoder with the normal behavior of a HPC
system and use its reconstruction error to detect anomalies
AI+Big Data on D.A.V.I.D.E.: Anomaly
detection
• Autoencoder: neural network with 3 layers
• Sparse layers (dimension = n_features x 10)
• We trained on D.A.V.I.D.E. the autoencoder using ~two months of normal data,
collected with Examon
• To test our approach we injected anomalies in a subset of nodes
• Misconfigurations → change of frequency governor policy
• Default policy conservative: cpu frequency depends on load
• Anomaly 1 policy changed to powersave: cpu freq always at min value
• Anomaly 2 policy changed to performance: cpu freq always at max value
Train on «normal» data
Validation
Fault!!
AI+Big Data on D.A.V.I.D.E.: Anomaly
detection
F-Score 99th percentile
Normal Anomaly
0.99 0.97
Inference Accuracy
Edge Inference with Tensorflow on
Embedded Computers (BBB): 11 ms
Conclusion & Future Works
• We presented an approach to conbine out-of-band monitoring and
big data and AI to enable Datacenter Automation
• We proof the effectivity of our approach toward enabling
automated anomaly detection of computing node
• Future Works:
• Extending the approach toward Security and house-keeping tasks in
Datacenters
• Leverage OpenBMC and custom firmware to deploy it as part of BMC
• Looking for partnership for bringing it to Large scale P9 systems
ACKNOWLEDGE
The Datacenter Automation TEAM
• Luca Benini, Michela Milano, Andrea
Borghesi, Antonio Libri, Francesco
Beneventi, Alessandro Petrella

More Related Content

What's hot

Monitoring of Transmission and Distribution Grids using PMUs
Monitoring of Transmission and Distribution Grids using PMUsMonitoring of Transmission and Distribution Grids using PMUs
Monitoring of Transmission and Distribution Grids using PMUs
Luigi Vanfretti
 
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
Ryousei Takano
 
Open Source Software Tools for Synchrophasor Applications
Open Source Software Tools for  Synchrophasor ApplicationsOpen Source Software Tools for  Synchrophasor Applications
Open Source Software Tools for Synchrophasor Applications
Luigi Vanfretti
 
Real-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTMReal-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTM
Numenta
 
AI Hardware for Real-Time Machine Learning
AI Hardware for Real-Time Machine LearningAI Hardware for Real-Time Machine Learning
AI Hardware for Real-Time Machine Learning
Facultad de Informática UCM
 
Detecting Anomalies in the Engine Coolant Sensor using One-Class Classifiers
Detecting Anomalies in the Engine Coolant Sensor using One-Class ClassifiersDetecting Anomalies in the Engine Coolant Sensor using One-Class Classifiers
Detecting Anomalies in the Engine Coolant Sensor using One-Class Classifiers
Eronides Da Silva Neto
 
Differential data processing for energy efficiency of wireless sensor networks
Differential data processing for energy efficiency of wireless sensor networksDifferential data processing for energy efficiency of wireless sensor networks
Differential data processing for energy efficiency of wireless sensor networks
Daniel Lim
 
Self-Tuned Remote Execution for Pervasive Computing
Self-Tuned Remote Execution for Pervasive ComputingSelf-Tuned Remote Execution for Pervasive Computing
Self-Tuned Remote Execution for Pervasive ComputingKevin McGregor MSc, IFC
 
Continuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringContinuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud Clustering
Hannaneh Najdataei
 
AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...
AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...
AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...
OW2
 
Env2Vec: Accelerating VNF Testing with Deep Learning
Env2Vec: Accelerating VNF Testing with Deep LearningEnv2Vec: Accelerating VNF Testing with Deep Learning
Env2Vec: Accelerating VNF Testing with Deep Learning
GUANGYUAN PIAO
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurement
Dr.M.Prasad Naidu
 
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Oleg Ovcharenko
 

What's hot (13)

Monitoring of Transmission and Distribution Grids using PMUs
Monitoring of Transmission and Distribution Grids using PMUsMonitoring of Transmission and Distribution Grids using PMUs
Monitoring of Transmission and Distribution Grids using PMUs
 
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
A Scalable and Distributed Electrical Power Monitoring System Utilizing Cloud...
 
Open Source Software Tools for Synchrophasor Applications
Open Source Software Tools for  Synchrophasor ApplicationsOpen Source Software Tools for  Synchrophasor Applications
Open Source Software Tools for Synchrophasor Applications
 
Real-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTMReal-Time Streaming Data Analysis with HTM
Real-Time Streaming Data Analysis with HTM
 
AI Hardware for Real-Time Machine Learning
AI Hardware for Real-Time Machine LearningAI Hardware for Real-Time Machine Learning
AI Hardware for Real-Time Machine Learning
 
Detecting Anomalies in the Engine Coolant Sensor using One-Class Classifiers
Detecting Anomalies in the Engine Coolant Sensor using One-Class ClassifiersDetecting Anomalies in the Engine Coolant Sensor using One-Class Classifiers
Detecting Anomalies in the Engine Coolant Sensor using One-Class Classifiers
 
Differential data processing for energy efficiency of wireless sensor networks
Differential data processing for energy efficiency of wireless sensor networksDifferential data processing for energy efficiency of wireless sensor networks
Differential data processing for energy efficiency of wireless sensor networks
 
Self-Tuned Remote Execution for Pervasive Computing
Self-Tuned Remote Execution for Pervasive ComputingSelf-Tuned Remote Execution for Pervasive Computing
Self-Tuned Remote Execution for Pervasive Computing
 
Continuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringContinuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud Clustering
 
AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...
AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...
AI and Deep Learning for On-Board Satellite Image Analysis, OW2con'19, June 1...
 
Env2Vec: Accelerating VNF Testing with Deep Learning
Env2Vec: Accelerating VNF Testing with Deep LearningEnv2Vec: Accelerating VNF Testing with Deep Learning
Env2Vec: Accelerating VNF Testing with Deep Learning
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurement
 
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
 

Similar to Combining out - of - band monitoring with AI and big data for datacenter automation in OpenPOWER

Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...
OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...
OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...
OPAL-RT TECHNOLOGIES
 
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
Zabbix
 
IoT Tech Expo 2023_Micha vor dem Berge presentation
IoT Tech Expo 2023_Micha vor dem Berge presentationIoT Tech Expo 2023_Micha vor dem Berge presentation
IoT Tech Expo 2023_Micha vor dem Berge presentation
VEDLIoT Project
 
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
Edge AI and Vision Alliance
 
Brain wave controlled robot
Brain wave controlled robotBrain wave controlled robot
Brain wave controlled robot
Rahul Wagh
 
Application of machine learning and cognitive computing in intrusion detectio...
Application of machine learning and cognitive computing in intrusion detectio...Application of machine learning and cognitive computing in intrusion detectio...
Application of machine learning and cognitive computing in intrusion detectio...
Mahdi Hosseini Moghaddam
 
Application of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisApplication of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data Analysis
ChrisCoughlin9
 
ME4AWSN - a Modeling Environment for Architecting WSNs
ME4AWSN - a Modeling Environment for Architecting WSNsME4AWSN - a Modeling Environment for Architecting WSNs
ME4AWSN - a Modeling Environment for Architecting WSNs
Ivano Malavolta
 
micro manit.pptx
micro manit.pptxmicro manit.pptx
micro manit.pptx
bhaveshagrawal35
 
LEGaTO: Use cases
LEGaTO: Use casesLEGaTO: Use cases
LEGaTO: Use cases
LEGATO project
 
Power optimization for Android apps
Power optimization for Android appsPower optimization for Android apps
Power optimization for Android apps
Xavier Hallade
 
DNA: an overview
DNA: an overviewDNA: an overview
DNA: an overview
Cisco DevNet
 
System-on-Chip Programmable Retina
System-on-Chip Programmable RetinaSystem-on-Chip Programmable Retina
System-on-Chip Programmable Retina
Vanya Valindria
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
智慧檢測技術與工業自動化
智慧檢測技術與工業自動化智慧檢測技術與工業自動化
智慧檢測技術與工業自動化
CHENHuiMei
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
Comsysto Reply GmbH
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
Thomas Weise
 
Alerting mechanism and algorithms introduction
Alerting mechanism and algorithms introductionAlerting mechanism and algorithms introduction
Alerting mechanism and algorithms introduction
FEG
 

Similar to Combining out - of - band monitoring with AI and big data for datacenter automation in OpenPOWER (20)

Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
 
OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...
OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...
OPAL-RT RT13 Conference: Rapid control prototyping solutions for power electr...
 
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016
 
IoT Tech Expo 2023_Micha vor dem Berge presentation
IoT Tech Expo 2023_Micha vor dem Berge presentationIoT Tech Expo 2023_Micha vor dem Berge presentation
IoT Tech Expo 2023_Micha vor dem Berge presentation
 
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
“Enabling Ultra-low Power Edge Inference and On-device Learning with Akida,” ...
 
Brain wave controlled robot
Brain wave controlled robotBrain wave controlled robot
Brain wave controlled robot
 
Application of machine learning and cognitive computing in intrusion detectio...
Application of machine learning and cognitive computing in intrusion detectio...Application of machine learning and cognitive computing in intrusion detectio...
Application of machine learning and cognitive computing in intrusion detectio...
 
Application of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisApplication of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data Analysis
 
ME4AWSN - a Modeling Environment for Architecting WSNs
ME4AWSN - a Modeling Environment for Architecting WSNsME4AWSN - a Modeling Environment for Architecting WSNs
ME4AWSN - a Modeling Environment for Architecting WSNs
 
micro manit.pptx
micro manit.pptxmicro manit.pptx
micro manit.pptx
 
LEGaTO: Use cases
LEGaTO: Use casesLEGaTO: Use cases
LEGaTO: Use cases
 
Power optimization for Android apps
Power optimization for Android appsPower optimization for Android apps
Power optimization for Android apps
 
DNA: an overview
DNA: an overviewDNA: an overview
DNA: an overview
 
System-on-Chip Programmable Retina
System-on-Chip Programmable RetinaSystem-on-Chip Programmable Retina
System-on-Chip Programmable Retina
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
智慧檢測技術與工業自動化
智慧檢測技術與工業自動化智慧檢測技術與工業自動化
智慧檢測技術與工業自動化
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Alerting mechanism and algorithms introduction
Alerting mechanism and algorithms introductionAlerting mechanism and algorithms introduction
Alerting mechanism and algorithms introduction
 
Multilin™ Intelligent Line Monitoring System
Multilin™ Intelligent Line Monitoring SystemMultilin™ Intelligent Line Monitoring System
Multilin™ Intelligent Line Monitoring System
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
Ganesan Narayanasamy
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
Ganesan Narayanasamy
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
Ganesan Narayanasamy
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
Ganesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
Ganesan Narayanasamy
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 

Recently uploaded

Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 

Recently uploaded (20)

Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 

Combining out - of - band monitoring with AI and big data for datacenter automation in OpenPOWER

  • 1. Andrea Bartolini, Prof University of Bologna – DEI, Italy Combining out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER
  • 2. Outline • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI enabled Anomaly Detection • Future Works
  • 4. Usage Scenarios Fine Grain Power and Performance Measurements: - Verify and classify node performance - In spec / out of spec behaviour - Miss configuration - Aging and wear out - Detect security hazards - Predictive maintenance Coarse grain Fine grain CPU CPU ACC ACC Node DIMMDIMMDIMM Performance Counters: - Node components - Microarchitectural events AI
  • 5. Datacenter Automation Design and Bottlenecks Centralized Monitoring & Analytics Edge Monitoring & Analytics Low Data-Rate Huge Data-Rate Centralized Monitoring & Analytics Bottlenecks: • Network BW • Storage • SW Overhead Infrastructure Sensors (e.g., CRAC, PDU, Etc.) Node-1 Node-n Node-nNode-1 Infrastructure Sensors (e.g., CRAC, PDU, Etc.)
  • 6. Outline • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI enabled Anomaly Detection • Future Works
  • 8. D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) OCP form-factor compute node based on IBM Minsky 2xIB EDR LIQUID COOLING 4x Tesla P100 HSMX2 University of Bologna / ETH Zurich DiG: FINE GRAIN POWER AND PERFORMANCE MONITORING & ANALYTICS BusBar 2 x POWER8 with NVLink
  • 9. • Power measuring block - placed between the power sensing unit (PSU) and the DC- DC converter provide overall node power consumption • Embedded system (BeagleBoneBlack) connected through pass through commands to node performance metrics • Scalable interface to the data analysis point through the MQTT protocol • Embedded system powerful enough to support edge analytics & inference DiG: Out-of-band Power & Performance Monitoring Architecture
  • 10. • Sub-Watt precision • Power Monitoring Sampling rate @50kS/s (T=20us) • Performance Monitoring 242 Amester metrics every 10s • Time synchronized (±3σ) • NTP < 35us • PTP < 1.3us State-of-the art systems (HDEEM and PowerInsight) • Max. 1 ms sampling period • Use data only offline (no possibility for real-time computing) Architecture DiG: Out-of-band Power & Performance Monitoring Framework Fsmax [kHz] E4 PPBB 50 HDEEM 1 PowerInsight 1 Nice, but how do I use it?
  • 11. 11 Application 1 Application 2 Spectral signature of an application! Real-time Frequency analysis on power supply and more…a live oscilloscope • For instance, using the FFT we plot the power spectral density of the power benchmark of two applications, and we can distinguish them by the harmonics present in each of the signals Low overhead, accurate monitoring Interesting feature for node level and system level Intrusion Detection System! OK, but do I automate?
  • 12. Outline • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI enabled Anomaly Detection • Future Works
  • 13. Scalable Data Collection, Analytics Sens_pub Broker1 Sens_pub Sens_pub Cassandra node1 MQTT Sens_pub BrokerM Sens_pub Sens_pub Cassandra nodeM Grafana Back-end • MQTT–enabled sensor collectors Front-end • MQTT Brokers • Data Visualization • NoSQL Storage • Big Data Analytics Apache Spark Target Facility MQTT Brokers Applications NoSQL ADMIN MQTT2Kairos MQTT2kairos Kairosdb Panda Matlab
  • 14. MQTT: MQ Telemetry Transport •Lightweight message queuing and transport protocol •Developed by IBM and Eurotech •Well suited for low resource demanding scenarios like M2M, WSN and IoT applications •Basic features: •PubSub model •Async communication protocol (messages) •Low overhead packet (2 bytes header) •QoS (3 levels) •Open source implementation: •https://mosquitto.org/ Publisher Topic (Broker) Subscriber (mosquitto_pub) (mosquitto_sub)(mosquitto)
  • 15. Cassandra Column Family MQTT Publishers facility/sensors/B Sens_pub_A Sens_pub_B Sens_pub_C Metric: A Tags: facility Sensors Metric: B Tags: Facility Sensors Metric: C Tags: facility sensors facility/sensors/# MQTT2KairosdbMQTT Broker MQTT to NoSQL Storage: MQTT2Kairosdb = {Value;Timestamp}
  • 16. 12-Nov-2018 16 DiG: Power & Perf Meas on D.A.V.I.D.E. Broker MQTT DiG MQTT Pow_pub IPMI_pub OCC_pub DiG SW Daemons PSU_pub Cooling_pub D.A.V.I.D.E. Front-End Power: • 20μs → on-board Analysis • 1s,1ms → to Central Unit (45 kS/s) IPMI: 89 metrics per node every 5sOCC: 242 metrics per node every 10s IPMI Liteon Overall Rack info (e.g., Total Power) Asetek Info Liquid Cooling Antonio Libri Slurm Slurm_pub
  • 17. Examon Analytics: Batch training + Edge inference examon-client (REST) (Batch) Pandas dataframe
  • 18. Embedded Board: Monitoring + Anomaly Detection Monitoring Infrastructure Historical Data collect data 1 DL Model train 2 Computing Node 1 Embedded Board Computing Node 2 Embedded Board Computing Node N … 3 load trained model in boards DL 4 Normal Behaviour Anomalyonline anomaly detection on live, new data AI+Big Data on D.A.V.I.D.E.: Example of Anomaly detection 1. Collect Data 2. AI Train 3. Edge 4. AI Inference Does it work?
  • 19. Anomaly Detection X Y AUTOENCODER … Z encoder decoder • An autoencoder tries to copy its input (X) into its output (Y) • An autoencoder learns to represent its input in the latent space, extracting the important characteristics of the input set X IDEA: train an autoencoder with the normal behavior of a HPC system and use its reconstruction error to detect anomalies
  • 20. AI+Big Data on D.A.V.I.D.E.: Anomaly detection • Autoencoder: neural network with 3 layers • Sparse layers (dimension = n_features x 10) • We trained on D.A.V.I.D.E. the autoencoder using ~two months of normal data, collected with Examon • To test our approach we injected anomalies in a subset of nodes • Misconfigurations → change of frequency governor policy • Default policy conservative: cpu frequency depends on load • Anomaly 1 policy changed to powersave: cpu freq always at min value • Anomaly 2 policy changed to performance: cpu freq always at max value Train on «normal» data Validation
  • 21. Fault!! AI+Big Data on D.A.V.I.D.E.: Anomaly detection F-Score 99th percentile Normal Anomaly 0.99 0.97 Inference Accuracy Edge Inference with Tensorflow on Embedded Computers (BBB): 11 ms
  • 22. Conclusion & Future Works • We presented an approach to conbine out-of-band monitoring and big data and AI to enable Datacenter Automation • We proof the effectivity of our approach toward enabling automated anomaly detection of computing node • Future Works: • Extending the approach toward Security and house-keeping tasks in Datacenters • Leverage OpenBMC and custom firmware to deploy it as part of BMC • Looking for partnership for bringing it to Large scale P9 systems
  • 23. ACKNOWLEDGE The Datacenter Automation TEAM • Luca Benini, Michela Milano, Andrea Borghesi, Antonio Libri, Francesco Beneventi, Alessandro Petrella