
Combining out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER



  1. Prof. Andrea Bartolini, University of Bologna – DEI, Italy. Combining out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER.
  2. Outline: • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI-enabled Anomaly Detection • Future Work
  3. A New Trend: Datacentre Automation. Heterogeneous sensors (CRAC, PDU, cluster nodes, environment) feed a scalable monitoring framework through a common interface; machine learning, performance analysis, and data visualization then drive reactive and proactive feedback for resource management, energy efficiency, and job scheduling.
  4. Usage Scenarios. Fine-grain power and performance measurements allow us to: verify and classify node performance, spot in-spec/out-of-spec behaviour, detect misconfiguration, track aging and wear-out, detect security hazards, and enable predictive maintenance. Performance counters (node components such as CPUs, accelerators, and DIMMs, plus microarchitectural events) complement coarse- and fine-grain power data as inputs to the AI.
  5. Datacenter Automation Design and Bottlenecks. Centralized monitoring & analytics pulls every infrastructure sensor (e.g., CRAC, PDU) and node stream to one point, producing a huge data rate; edge monitoring & analytics processes data at the nodes and forwards only a low data rate. The centralized approach is limited by network bandwidth, storage, and software overhead.
  6. Outline: • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI-enabled Anomaly Detection • Future Work
  7. D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)
  8. D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe). OCP form-factor compute node based on IBM Minsky: 2x POWER8 with NVLink, 4x Tesla P100 SXM2, 2x IB EDR, liquid cooling, busbar power distribution. DiG (University of Bologna / ETH Zurich): fine-grain power and performance monitoring & analytics.
  9. DiG: Out-of-band Power & Performance Monitoring Architecture. • A power-measuring block placed between the power supply unit (PSU) and the DC-DC converter provides overall node power consumption • An embedded system (BeagleBone Black) reaches node performance metrics through pass-through commands • A scalable interface to the data-analysis point via the MQTT protocol • The embedded system is powerful enough to support edge analytics & inference.
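As a rough illustration of this publishing path, below is a minimal sketch of a DiG-style daemon on the embedded board using the Eclipse paho-mqtt client; the broker address, topic layout, and the read_power_sample() helper are hypothetical placeholders, not the actual DiG firmware.

```python
# Minimal sketch of an out-of-band power publisher (broker/topic names are assumptions).
import json
import random
import time

import paho.mqtt.client as mqtt

BROKER = "frontend.example.org"      # assumption: MQTT broker running on the front-end
TOPIC = "davide/node01/power"        # assumption: per-node power topic

def read_power_sample():
    """Placeholder for one averaged node-power reading [W] from the measuring block."""
    return 300.0 + random.uniform(-5.0, 5.0)

client = mqtt.Client()               # paho-mqtt 1.x style constructor
client.connect(BROKER, 1883)
client.loop_start()

try:
    while True:
        payload = {"value": read_power_sample(), "timestamp": time.time()}
        client.publish(TOPIC, json.dumps(payload), qos=0)  # low-rate stream to the collector
        time.sleep(1.0)              # 1 s aggregates; the raw 20 µs samples stay on board
except KeyboardInterrupt:
    client.loop_stop()
    client.disconnect()
```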
  10. DiG: Out-of-band Power & Performance Monitoring Framework. • Sub-Watt precision • Power monitoring sampled at 50 kS/s (T = 20 µs) • Performance monitoring: 242 Amester metrics every 10 s • Time synchronized (±3σ): NTP < 35 µs, PTP < 1.3 µs. State-of-the-art systems (HDEEM and PowerInsight) offer at most a 1 ms sampling period and only offline use of the data (no real-time computing); maximum sampling rate: E4 PPBB 50 kHz vs. HDEEM 1 kHz and PowerInsight 1 kHz. Nice, but how do I use it?
  11. Real-time frequency analysis on the power supply and more: a live oscilloscope. For instance, using the FFT we plot the power spectral density of the power traces of two applications and can distinguish them by the harmonics present in each signal: the spectral signature of an application. Low-overhead, accurate monitoring, and an interesting feature for node-level and system-level intrusion detection systems. OK, but how do I automate?
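To make the spectral-signature idea concrete, here is a small sketch that estimates the power spectral density of a 50 kS/s power trace with Welch's method; the trace is synthetic and merely stands in for a real DiG measurement.

```python
# Sketch: spectral signature of a node power trace (synthetic stand-in for DiG data).
import numpy as np
from scipy.signal import welch

FS = 50_000                          # 50 kS/s, the DiG power sampling rate
t = np.arange(0, 2.0, 1.0 / FS)      # two seconds of samples

# Synthetic trace: DC level plus application "harmonics" at 120 Hz and 360 Hz, plus noise.
power = (300
         + 5 * np.sin(2 * np.pi * 120 * t)
         + 2 * np.sin(2 * np.pi * 360 * t)
         + np.random.normal(0, 0.5, t.size))

freqs, psd = welch(power, fs=FS, nperseg=4096)

# The dominant non-DC harmonics form the application's spectral signature.
top = freqs[np.argsort(psd[1:])[-3:] + 1]
print("strongest harmonics [Hz]:", np.sort(top))
```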
  12. Outline: • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI-enabled Anomaly Detection • Future Work
  13. Scalable Data Collection and Analytics. Back-end: MQTT-enabled sensor collectors (Sens_pub) deployed in the target facility. Front-end: MQTT brokers, NoSQL storage (KairosDB on Cassandra, fed by MQTT2Kairosdb), data visualization (Grafana), and big-data analytics (Apache Spark, pandas, MATLAB) for applications and administrators.
  14. MQTT: MQ Telemetry Transport. • Lightweight message-queuing and transport protocol • Developed by IBM and Eurotech • Well suited to low-resource scenarios such as M2M, WSN, and IoT applications • Basic features: publish/subscribe model, asynchronous message-based communication, low packet overhead (2-byte header), three QoS levels • Open-source implementation: https://mosquitto.org/ (mosquitto broker, mosquitto_pub publisher, mosquitto_sub subscriber).
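For reference, a minimal publish/subscribe pair written with paho-mqtt in the same spirit as the mosquitto_pub/mosquitto_sub example; the topic and broker are illustrative, and any reachable mosquitto broker would do.

```python
# Sketch: MQTT publish/subscribe in Python, mirroring mosquitto_pub / mosquitto_sub.
import time

import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

BROKER = "localhost"                                  # assumption: local mosquitto broker

def on_message(client, userdata, msg):
    # Called asynchronously for every message matching the subscription.
    print(f"{msg.topic}: {msg.payload.decode()}")

sub = mqtt.Client()                                   # paho-mqtt 1.x style constructor
sub.on_message = on_message
sub.connect(BROKER, 1883)
sub.subscribe("facility/sensors/#", qos=1)            # '#' wildcard: everything below the prefix
sub.loop_start()

# One-shot publish, equivalent to: mosquitto_pub -t facility/sensors/temp -m "23.5" -q 1
publish.single("facility/sensors/temp", "23.5", qos=1, hostname=BROKER)

time.sleep(1.0)                                       # give the subscriber time to print
sub.loop_stop()
```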
  15. MQTT to NoSQL storage: MQTT2Kairosdb. MQTT publishers (Sens_pub_A, Sens_pub_B, Sens_pub_C) publish {value; timestamp} payloads to topics such as facility/sensors/B. MQTT2Kairosdb subscribes to facility/sensors/# on the MQTT broker and writes each sample into a Cassandra column family through KairosDB, using the last topic level as the metric name (A, B, C) and the remaining levels (facility, sensors) as tags.
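Such a bridge could look roughly like the sketch below, which assumes KairosDB's REST ingestion endpoint and a simple topic-to-metric mapping; the host names, tag keys, and payload format are assumptions, not the actual MQTT2Kairosdb implementation.

```python
# Sketch of an MQTT-to-KairosDB bridge in the spirit of MQTT2Kairosdb (names are illustrative).
import json

import paho.mqtt.client as mqtt
import requests

KAIROSDB_URL = "http://frontend.example.org:8080/api/v1/datapoints"  # KairosDB REST ingestion

def on_message(client, userdata, msg):
    # Topic "facility/sensors/B" -> metric "B"; the remaining levels become tags.
    facility, group, metric = msg.topic.split("/")
    sample = json.loads(msg.payload)      # assumed payload: {"value": ..., "timestamp": ...}
    datapoint = [{
        "name": metric,
        "datapoints": [[int(sample["timestamp"] * 1000), sample["value"]]],  # ms since epoch
        "tags": {"facility": facility, "group": group},
    }]
    requests.post(KAIROSDB_URL, json=datapoint, timeout=5)

bridge = mqtt.Client()                    # paho-mqtt 1.x style constructor
bridge.on_message = on_message
bridge.connect("frontend.example.org", 1883)
bridge.subscribe("facility/sensors/#")
bridge.loop_forever()
```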
  16. DiG: Power & Performance Measurements on D.A.V.I.D.E. DiG software daemons (Pow_pub, IPMI_pub, OCC_pub, PSU_pub, Cooling_pub, Slurm_pub) publish to the MQTT broker on the D.A.V.I.D.E. front-end. Power: 20 µs samples analysed on board; 1 s and 1 ms averages sent to the central unit (45 kS/s). IPMI: 89 metrics per node every 5 s. OCC: 242 metrics per node every 10 s. Liteon PSUs provide overall rack information (e.g., total power), Asetek provides liquid-cooling information, and Slurm_pub publishes scheduler data.
  17. Examon Analytics: batch training + edge inference. Batch data is retrieved through the examon-client (REST) into a pandas dataframe for offline training, while inference runs at the edge.
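The batch-retrieval step might look roughly like this; the REST endpoint, query parameters, and response format are assumptions for illustration only, since the actual examon-client API is not detailed on the slide.

```python
# Sketch: pulling a batch of monitoring data into a pandas dataframe for offline training.
# Endpoint, parameters, and response layout are hypothetical, not the real examon-client API.
import pandas as pd
import requests

EXAMON_REST = "http://frontend.example.org:5000/api/v1/query"   # assumption

params = {
    "metric": "power",
    "node": "davide-node01",
    "start": "2018-03-01T00:00:00",
    "end": "2018-04-30T23:59:59",
}
resp = requests.get(EXAMON_REST, params=params, timeout=30)
resp.raise_for_status()

# Assume the service returns a list of {"timestamp": ..., "value": ...} records.
df = pd.DataFrame(resp.json())
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
df = df.set_index("timestamp").sort_index()
print(df.describe())
```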
  18. AI + Big Data on D.A.V.I.D.E.: example of anomaly detection. Workflow: (1) collect historical data through the monitoring infrastructure; (2) train a deep-learning model on it; (3) load the trained model onto the embedded boards of the computing nodes; (4) run online anomaly detection on live, new data, separating normal behaviour from anomalies. Does it work?
  19. Anomaly Detection. An autoencoder (encoder, latent space Z, decoder) tries to copy its input X to its output Y; in doing so it learns to represent the input in the latent space, extracting the important characteristics of the input set X. IDEA: train an autoencoder on the normal behaviour of an HPC system and use its reconstruction error to detect anomalies.
  20. AI + Big Data on D.A.V.I.D.E.: anomaly detection. • Autoencoder: neural network with 3 layers • Sparse hidden layer (dimension = n_features x 10) • We trained the autoencoder on D.A.V.I.D.E. using ~two months of normal data collected with Examon • To validate the approach we injected anomalies in a subset of nodes as misconfigurations, i.e., changes of the CPU frequency-governor policy: the default policy is conservative (CPU frequency follows load), anomaly 1 switches to powersave (frequency always at the minimum value), and anomaly 2 switches to performance (frequency always at the maximum value).
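A minimal sketch of an autoencoder along these lines (one sparse hidden layer of size 10 × n_features, trained only on normal data) is given below; the feature count, regularization strength, and training settings are assumptions, and the random matrix merely stands in for the two months of Examon data.

```python
# Sketch: sparse autoencoder trained on "normal" monitoring data (Keras / TensorFlow).
# The stand-in data and hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow import keras

n_features = 16                                        # illustrative number of per-node metrics
x_train = np.random.rand(50_000, n_features).astype("float32")  # stand-in, scaled to [0, 1]

# 3-layer network: input -> sparse hidden layer (10x wider, L1-regularized) -> reconstruction.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(10 * n_features, activation="relu",
                       activity_regularizer=keras.regularizers.l1(1e-5)),
    keras.layers.Dense(n_features, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mae")

# Train only on normal data, so the model learns to reconstruct healthy behaviour well.
autoencoder.fit(x_train, x_train, epochs=10, batch_size=64, validation_split=0.1)
autoencoder.save("autoencoder_normal.h5")              # later copied to the embedded boards
```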
  21. AI + Big Data on D.A.V.I.D.E.: anomaly detection results. With the detection threshold set at the 99th percentile of the reconstruction error, the F-score is 0.99 for the normal class and 0.97 for the anomaly class. Edge inference with TensorFlow on the embedded computers (BeagleBone Black) takes 11 ms.
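The decision rule can be sketched as follows: the 99th percentile of the reconstruction error measured on normal data becomes the threshold, and live samples whose error exceeds it are flagged as anomalous. The model path and data continue the illustrative sketch above.

```python
# Sketch: anomaly decision by reconstruction-error threshold (99th percentile on normal data).
import numpy as np
from tensorflow import keras

model = keras.models.load_model("autoencoder_normal.h5")     # trained model copied to the board

def reconstruction_error(x):
    """Per-sample mean absolute error between input and reconstruction."""
    recon = model.predict(x, verbose=0)
    return np.mean(np.abs(x - recon), axis=1)

# Threshold derived from the normal (training) data.
x_normal = np.random.rand(50_000, 16).astype("float32")      # stand-in for the normal data
threshold = np.percentile(reconstruction_error(x_normal), 99)

# Online inference on one new live sample.
x_live = np.random.rand(1, 16).astype("float32")             # stand-in for a live reading
print("anomaly" if reconstruction_error(x_live)[0] > threshold else "normal")
```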
  22. Conclusion & Future Work • We presented an approach that combines out-of-band monitoring with big data and AI to enable datacenter automation • We proved the effectiveness of the approach for automated anomaly detection on computing nodes • Future work: extending the approach toward security and housekeeping tasks in datacenters; leveraging OpenBMC and custom firmware to deploy it as part of the BMC; looking for partnerships to bring it to large-scale POWER9 systems.
  23. Acknowledgements. The Datacenter Automation team: Luca Benini, Michela Milano, Andrea Borghesi, Antonio Libri, Francesco Beneventi, Alessandro Petrella.
