Sunku Ranganath
https://www.linkedin.com/in/sunkuranganath/
Legal Disclaimer
© 2019 Intel Corporation. Intel, the Intel logo, Intel Inside, the Intel Inside logo, Intel Experience What’s Inside, The Intel Experience What’s Inside logo, and Xeon are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
The cost reduction scenarios described are intended to enable you to get a better understanding of how the purchase of a given Intel based product, combined with a number of situation-specific variables, might
affect future costs and savings. Circumstances will vary and there may be unaccounted-for costs related to the use and deployment of a given product. Nothing in this document should be interpreted as either a
promise of or contract for a given level of costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please
refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804.
No computer system can be absolutely secure.
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to
operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and
system configuration and you can learn more at http://www.intel.com/go/turbo.
Available on select Intel® processors. Requires an Intel® HT Technology-enabled system. Your performance varies depending on the specific hardware and software you use. Learn more by visiting
http://www.intel.com/info/hyperthreading.
© Intel Corporation
Acknowledgements
Timothy Verrall
John Browne
Damien Power
Emma Collins
Jean Christophe Bouche
Krzysztof Kepka
Agenda
Platform Observability
Service Assurance
Closed Loop Automation
Platform Observability & Service Assurance (SA)
• Observability: Ability to expose state of the platform to ensure Service Level
Objectives are met
• Observability Considerations: Logging, Metrics & Tracing
• Communications Service Provider Context:
• Care about overall Service Assurance
• Both Monitoring & Observability are important
• Service Assurance
• Application of policies to ensure services meet a pre-defined service quality level
• FCAPS (Fault, Configuration, Accounting, Performance & Security) attributes on
existing network infrastructure
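The idea of applying a policy so that services meet a pre-defined quality level can be sketched as a threshold check over the latest metric samples. This is a minimal illustration only: the metric names, limits, and the `ServiceLevelObjective` type are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceLevelObjective:
    metric: str        # hypothetical metric name
    max_value: float   # highest acceptable value for that metric


def violations(slos: List[ServiceLevelObjective],
               samples: Dict[str, float]) -> List[str]:
    """Return the metrics whose latest sample breaks its objective."""
    broken = []
    for slo in slos:
        value = samples.get(slo.metric)
        # A missing sample counts as a violation: the state is not observable.
        if value is None or value > slo.max_value:
            broken.append(slo.metric)
    return broken


slos = [
    ServiceLevelObjective("packet_loss_pct", 0.1),
    ServiceLevelObjective("latency_ms", 5.0),
]
samples = {"packet_loss_pct": 0.02, "latency_ms": 7.5}
print(violations(slos, samples))
```

Treating a missing sample as a violation is a design choice made for this sketch; it reflects the point that an objective cannot be assured if the platform state is not exposed.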
Three Key Elements of SA Platform
• Monitoring: Enabling deeper management and tracking of specific service levels
• Presentation: Reporting to enable reaction to service level changes
• Provisioning: Enable configuration of service levels based on workload or service priority
Figure: Service Assurance elements mapping to ETSI NFV Model
Collectd Monitoring Agent
Collectd: Why & What
• Statistics collection daemon
• Uses read and write plugins to collect metrics and write them to an endpoint
• Open source
• Widely adopted
• Configurable Collection Interval
Various Plugin types:
• Input/Output
• Binding Plugins
• Logging Plugins
• Notification Plugins
• Other: Network plugin with both send/receive feature
Figure: Collectd Architecture
https://github.com/collectd/collectd
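The read/write plugin model with a configurable collection interval can be illustrated with a small toy collector. This is not the collectd API: `MiniCollector`, `read_fake_nic`, and the metric names are hypothetical and only model the flow of values from read plugins to write plugins.

```python
import time
from typing import Callable, Dict, List

Metric = Dict[str, float]


class MiniCollector:
    """Toy model of a statistics daemon: read plugins produce values,
    write plugins ship them to an endpoint, once per interval."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self.readers: List[Callable[[], Metric]] = []
        self.writers: List[Callable[[Metric], None]] = []

    def register_read(self, fn: Callable[[], Metric]) -> None:
        self.readers.append(fn)

    def register_write(self, fn: Callable[[Metric], None]) -> None:
        self.writers.append(fn)

    def run(self, cycles: int,
            sleep: Callable[[float], None] = time.sleep) -> None:
        for _ in range(cycles):
            for read in self.readers:
                values = read()
                # Every write plugin receives every set of values.
                for write in self.writers:
                    write(values)
            sleep(self.interval_s)


def read_fake_nic() -> Metric:
    # Stand-in for a real counter source; the values are made up.
    return {"nic.rx_packets": 1200.0, "nic.tx_packets": 900.0}


received: List[Metric] = []
collector = MiniCollector(interval_s=0.01)
collector.register_read(read_fake_nic)
collector.register_write(received.append)  # the "endpoint" is a list here
collector.run(cycles=3)
print(len(received))
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to pass.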
Platform Telemetry Exposure & Integration
Figure (diagram labels): Platform telemetry exposure and integration
• NFVI: Compute, Network, Storage; Hypervisor [RT/SA KVM4NFV extensions]; Virtualised Compute, Virtualised Network, Virtualised Storage
• Fast Path: Triggers on events or counters; VM Stall Detection / RT Stall Detection; Local Corrective Action (e.g. Working/Protect Failover)
• Slow Path: Periodic Pull 1/15 mins
• Collectd: PMU* counters, NIC counters, vSwitch counters, RAS, Hypervisor/Container Counters
• Collectd Plugins: Gnocchi, VES Plugin, Redfish, Kafka, Prometheus
• Intel Infrastructure Management Technologies: Intel® RDT (CMT, CAT, MBM, CDP), Power, Out Of Band Telemetry, Intel® Node Manager, Intel® Management Engine, IPMI
• Intel® Run Sure Technology: MCA*, PCIe AER; Resilient System Technology; Resilient Memory Technology (SDDC, DDDC+1, Mirroring)
• Storage: RAID/NVMe, Intel® Rapid Storage Technology
• Common / Standard Open APIs: SNMP API, Perfmon MIB, Enterprise MIB, NFV Platform MIB, SYSLOG, IPFIX, sFlow
• Monitoring/Analytics Systems: Container Monitoring Solutions (Prometheus ….); Includes NetFlow Collectors; Vendor SA Middleware; Open Platform Collectors
• OpenStack VIM: Ceilometer, Aodh, Vitrage, Congress
• Legend: In progress; Done/Integrated; Intel Components; Standard Open APIs
PMU*: Performance Monitoring Unit
Multiple Closed Loops
Plan & Provision
Offline
feedback loop
Design Analyze
Use cases (Loops)
• Capacity planning
• Peering planning
• Cache placement
• …
Optimize
Monitor
Orchestrate
Near-Real-Time Feedback loop
Real-Time Feedback loop
Use cases (Loops)
• Service assurance
• Security operations
• …
Use cases (Loops)
• Traffic Engineering:
Network Optimization
• Demand placement
• Workload placement…
Telemetry
Telemetry
Real-time/Near Real-time Loops - Automated
Telemetry
Offline Processing
Online Processing
Source: https://pndablog.com/2017/06/05/feedback-loops-and-closed-loop-control/
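An automated feedback loop of the kind named above (telemetry in, analysis, action back on the system) can be sketched in a few lines. The scenario is hypothetical: the `demand` and `replicas` state, the 0.8 utilisation limit, and the `scale_out` action are assumptions chosen only to show the loop closing on itself.

```python
from typing import Callable, Dict, Optional


def closed_loop_step(read_telemetry: Callable[[], float],
                     analyze: Callable[[float], Optional[str]],
                     act: Callable[[str], None]) -> Optional[str]:
    """One pass of the loop: monitor, analyze, then act if needed."""
    sample = read_telemetry()
    decision = analyze(sample)
    if decision is not None:
        act(decision)
    return decision


# Hypothetical system under control: fixed demand spread over replicas.
state: Dict[str, float] = {"demand": 2.0, "replicas": 1}


def read_utilisation() -> float:
    return state["demand"] / state["replicas"]


def analyze(utilisation: float) -> Optional[str]:
    # Assumed policy: keep per-replica utilisation at or below 0.8.
    return "scale_out" if utilisation > 0.8 else None


def act(decision: str) -> None:
    if decision == "scale_out":
        state["replicas"] += 1


# Repeat until the analysis step no longer requests an action.
while closed_loop_step(read_utilisation, analyze, act) is not None:
    pass
print(state["replicas"])
```

The action changes what the next telemetry read returns, which is what makes the loop closed; an offline loop has the same shape but with a human or batch process in the analyze step.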
Networking Closed Loops – High Level Architecture
Platform Resources
Forwarding Plane
Interfaces
Interfaces
Traffic
Traffic
Platform
Analytics
Systems
Business Applications
Setting of Policy
SDN/NMS
Network Services
Cloud and Virtual
Management
MANO
EMS VNFM
Infrastructure
Control
Application
Independent Closed Loops: SDN, Cloud & Virtual Mgt, Platform
Local
Platform
Agent
Telemetry
distribution or
storage or
…..
Platform
Telemetry
Policy Based Provisioning
Control Loops
Closed Loops – Networking Stack
Application Layer
Network Data Analytics
Orchestration, Management, Policy
Cloud & Virtual Management
Network Control
Operating Systems
Data Path
Hardware/
Disaggregated Hardware
Services, Management & Control, Infrastructure
Micro-seconds/
Milliseconds
Mins/Hours/Days
Closed Loop
Reaction Time
Domain Knowledge
Local to
Platform
End to End
Enforce Local
Policy
Deployment
Policies
Enforce Network
Domain Policy
Map Policies
HW Enabled
Loops (eg
RAS)
Enforce DP
Loops (HA etc.)
Analyze/
Plan Policies
High Speed Control Loops are Close to the Platform
Seconds/Mins
Analytics
Closed Loops – Business Cases
Improved Customer
Experience
Cloud Optimization &
Efficiency
Edge Placement
Service Healing
Differentiated QoS
Service Optimization
Energy Optimization
Capacity Optimization
Cloud Configurations
Business
Use Cases
AI/ML/DL
Platform(s)
Feature Exposure Provisioning Telemetry
Local Policy Enforcement Agent(s)
For Local Dynamic Control
Intel Infrastructure
Management Tech
Intel RDT Power
Monitoring/Storage
NFV Orchestrator (NFVO) [eg ONAP/OSM]
Security
Threat Detection
Threat Response
Business Applications
collectd
Policy Based Provisioning
Control Loops
VNF Manager (VNFM)
OpenStack, Kubernetes
Telemetry I/F
Telemetry I/F
Actively
Contributing
Intel
RunSure
Bare Metal
Telemetry I/F
Closed Loop Resiliency Demo
Goal: Maximize Service Availability of Virtual Border Network Gateway (vBNG) in a memory error scenario
Figure 1 Source: OpenSAF and VMware from the Perspective of High Availability, Ali Nikzad, Ferhat Khendek, Maria Toeroe (Concordia University, Ericsson), SVM’2013, Zurich, October 2013
Figure 1: Service Recovery Timeline
Figure 2: Closed Loop Resiliency Demo with Kubernetes
More Details on Demo: https://networkbuilders.intel.com/social-hub/video/closed-loop-platform-automation-workload-resiliency-demo
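The slides do not describe the demo's decision logic, so the following is only a hypothetical sketch of one policy such a loop could apply: count corrected memory errors in a sliding time window and request a proactive failover once a threshold is reached. The class name, threshold, and window length are all assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class MemoryErrorPolicy:
    """Hypothetical policy: trigger failover when corrected memory errors
    within a sliding window reach a threshold."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._events: Deque[Tuple[float, int]] = deque()

    def observe(self, timestamp_s: float, corrected_errors: int) -> bool:
        """Record newly reported errors; return True if failover is due."""
        self._events.append((timestamp_s, corrected_errors))
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= timestamp_s - self.window_s:
            self._events.popleft()
        return sum(count for _, count in self._events) >= self.threshold


policy = MemoryErrorPolicy(threshold=5, window_s=60.0)
print(policy.observe(0.0, 2))
print(policy.observe(30.0, 2))
print(policy.observe(70.0, 2))
print(policy.observe(80.0, 1))
```

In a real deployment the `True` result would be handed to an orchestrator to move the workload; that step is outside this sketch.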
Closed Loop Automation (CLA) – Communities, Standards
• Open Network Automation Platform
(ONAP) – Closed Loop Automation
Management Platform (CLAMP)
• OPNFV Working Group for CLA
• ETSI Zero Touch Service
Management (ZSM)
• ETSI Experiential Networked
Intelligence (ENI)
Ex: OPNFV WG
Ex: ONAP CLAMP
Use Cases & Gaps
• 5G Network Slicing
• Demand based Energy Savings
• Workload Resiliency
• Noisy Neighbor Detection & Avoidance
• And many more….
Figure: 5G Network Slicing Architecture
Source: https://www.researchgate.net/figure/5G-network-slicing-architecture_fig1_324175599
Gaps, Ongoing Work
• Telemetry tagging
• Policy delivery & management across
VIM to NFVI
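Telemetry tagging is listed above as a gap without further detail. As one possible reading, a collector could attach service-level tags to each platform sample so that infrastructure metrics can be correlated with the services they affect. The inventory layout, tag keys, and metric name below are hypothetical.

```python
from typing import Dict


def tag_sample(sample: Dict[str, object],
               inventory: Dict[str, Dict[str, str]]) -> Dict[str, object]:
    """Return a copy of the sample with tags looked up by its host name."""
    host = str(sample["host"])
    tags = dict(inventory.get(host, {}))  # unknown hosts get no tags
    return {**sample, "tags": tags}


# Hypothetical mapping from host to the service and slice it carries.
inventory = {"node-1": {"service": "vBNG", "slice": "slice-a"}}
sample = {"host": "node-1", "metric": "llc_occupancy_bytes", "value": 1048576}

tagged = tag_sample(sample, inventory)
print(tagged["tags"]["service"])
```

The original sample is left unmodified, so the same raw reading can be tagged differently by different consumers.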
Summary
• Platform observability and monitoring play a crucial role in ensuring service assurance
• Platform telemetry, alongside application telemetry, strongly differentiates services
• Various levels of closed loops are required for autonomous networks
• Real-time and near-real-time closed loops require automation
Call To Action
• Collaborate through open source communities
• Identify use cases of interest
• Leverage relevant infrastructure telemetry
Service Assurance “Phased” Evolution for NFV/SDN
• Strategic Framework for SA “Phase” Evolution
• Phase 1 - Equivalence (Virtualized + Interworking with existing management systems)
• Phase 2 - Automated by MANO + SDN Controller
• Phase 3 - Predict failures and adapt automatically
Platform Service Assurance - Equivalence
• Platform Service Assurance supporting:
• Intel RAS Technologies
• Cache Config & Monitoring
• BIOS Config & Reporting
• Fastpath DPDK Interface Reporting
• Fastpath DPDK Keep Alive
• Virtual Switch Health
• QAT Watchdog
• Host Health
• …….
Platform Service Assurance (MANO + SDN Controller)
• VIM and above, support:
• Enable RAS Technologies
• Enable Watchdog Metrics
• Enable DPDK and Keep Alive
• Enable Host Health
• Policy Based Provisioning
• …
Predictive Platform Service Assurance
• Predict Failures and Adapt Automatically:
• Automated and adaptive to changes notified in metrics
• Closed loop and dynamic SA environment
Phase 1 Phase 2 Phase 3
Evolving from Equivalence towards NFV/SDN Automation
Never Stops Solution of the day Under Construction
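Phase 3 is about predicting failures from metrics before they happen. The slides do not name a prediction method, so the sketch below swaps in the simplest one available: an ordinary least-squares line fitted to recent samples and extrapolated to the time at which it would reach a limit. The sample values and the limit are made up.

```python
from typing import List, Optional, Tuple


def predict_crossing(samples: List[Tuple[float, float]],
                     limit: float) -> Optional[float]:
    """Fit value = slope * t + intercept by least squares and return the
    time at which the fitted line reaches `limit`, or None if the metric
    is not rising (or the fit is undefined)."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or falling: no crossing predicted
    intercept = mean_v - slope * mean_t
    return (limit - intercept) / slope


# Hypothetical metric rising by 2 units per time step, limit at 30.
history = [(0.0, 10.0), (1.0, 12.0), (2.0, 14.0), (3.0, 16.0)]
print(predict_crossing(history, limit=30.0))
```

A closed loop could use the predicted crossing time to act ahead of the failure rather than after an alarm; real deployments would likely need a more robust model than a straight line.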
Platform Plugins Contributed by Intel
Plugin Domain | Description
Intel RunSure/RAS | Mcelog, PCIe AER, logparser: Metrics & notifications pertaining to Intel RunSure technologies
Intel_RDT | Resource Director Technology related metrics
Virt | Libvirt related metrics
OVS | Ovs_stats, ovs_events: Metrics related to Open vSwitch
DPDK | Dpdk_stats, dpdk_events, hugepages: DPDK related metrics
OpenStack | Gnocchi, Aodh: Integration into OpenStack projects
Cloud | Write_Kafka, Write_Prometheus, VES: Integration into various cloud platforms
Storage | RAID, NVMe: Storage related metrics
Power/Energy | CPUFreq, Turbostat: Frequency & power related metrics
Platform | IPMI, RedFish, PMU: Out of Band metrics & platform counters
Infrastructure Metrics are as Crucial as Application Metrics
Barometer Strategy:
• Ensure platform metrics/events are
accessible through open industry standard
interfaces.
• Demonstrate IA platform technologies can
be monitored, consumed and actioned in
real time
OPNFV Barometer
One Click Install:
• Easy install/configuration for customers
• One command to install Collectd/InfluxDB/Grafana
• Three container approach for
Collectd:
• Stable Container: latest stable branch
• Master Container: up to date with
master
• Experimental Container: cherry pick
features of interest
Source: https://opnfv-barometer.readthedocs.io/en/latest/release/userguide/docker.userguide.html

Platform Observability and Infrastructure Closed Loops
