Aaron Smith, Red Hat; Pasi Vaananen, Red Hat
Carrier-Grade Cloud Infrastructure (Aaron Smith, Pasi Vaananen, Red Hat): The move from vertically integrated hardware and software to distributed execution in a cloud complicates the delivery of highly available services. In a vertically integrated system, every layer involved in supporting service availability was under the control of a single vendor. With NFV, the cloud philosophy of decoupling infrastructure from applications requires new open interfaces to carry the necessary information between layers, and a clear separation of fault and availability management responsibilities between the infrastructure and the application software subsystems. Even in a cloud environment, traditional availability concepts such as fast detection, correlation, and fault notification still apply. A fast, low-latency fault management platform will be presented that allows cloud-based services to achieve five-nines (99.999%) availability and service continuity. Performance measurements from a prototype of the system will be presented, along with a demo of a service requiring 50 ms fault remediation.
3. Agenda
• Introduction
• Problem and goals
• Fault management cycle and timeline
• Relative impact to Service Availability
• Proof of concept
• PoC results
• What's next?
4. Problem
• The move to NFV and cloud infrastructure complicates the
delivery of highly available services
No longer a vertically integrated hardware / software stack
Stack components provided by different vendors
• Same requirements apply (50ms … 1000ms, increasing by “layer”)
• For a cloud infrastructure, the network impacts availability more
than individual compute hosts, and detection / protection strategies
must adjust accordingly
5. Goals
• Produce a monitoring and event detection framework that
distributes fault information to various listeners with low latency
(<10’s of milliseconds)
• Provide a hierarchy of remediation controllers, which can react
quickly (<10’s of milliseconds) to faults.
• Provide FM mechanisms for both current virtualization environments
and future containerization environments orchestrated by
Kubernetes, etc.
7. Fault Management Cycle Phases
• Detection – Requires low-latency, low-overhead mechanisms
• Localization – Physical/Virtualized resources to resource
consumer(s) mapping within the context of fault trees
• Isolation – Remove the ability of the failed component to
affect service state
• Remediation – Service restoration through failover to
redundant resource / component, or component restart
• Recovery – Restoration of service redundancy configuration
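The five phases above form a fixed pipeline, which can be sketched as a minimal state machine (an illustration only; the phase names follow the slide, the transition helper is an assumption):

```python
from enum import Enum, auto
from typing import Optional

class Phase(Enum):
    """Fault management cycle phases, in order of execution."""
    DETECTION = auto()     # low-latency, low-overhead fault detection
    LOCALIZATION = auto()  # map faulty resources to consumers via fault trees
    ISOLATION = auto()     # stop the failed component from affecting service state
    REMEDIATION = auto()   # fail over to a redundant resource, or restart
    RECOVERY = auto()      # restore the service redundancy configuration

ORDER = [Phase.DETECTION, Phase.LOCALIZATION, Phase.ISOLATION,
         Phase.REMEDIATION, Phase.RECOVERY]

def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after RECOVERY."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None
```

Each phase hands off to the next; the cycle is complete once RECOVERY restores redundancy.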
8. FM Cycle Timeline
[Timeline figure: at the failure event the service state goes from "Up, redundant" to "Down, remediation", then "Up, recovering" and "Up, repair pending" before returning to "Up, redundant". The unavailability window TUA = TDET + TREM is the quantity to minimize.]
• 1st indication of a failure starts the FM cycle; the 1st failure means a potential outage or degradation until remediation completes: TUA = TDET + TREM
• After the service is recovered, it remains exposed to a 2nd failure until redundancy is restored
• Pooled resources: repairs are uncoupled and can be deferred; redundancy is restored in TREC, typically ~2 minutes (MTTREC)
• Non-pooled resources: repair is coupled and critical; redundancy is restored only when the repair completes, in TREP, typically 4+ hours (MTTREP)
9. Fault Management Cycle Timeline
• TDET + TNOT + TREM < 50 ms (lowest “layers”, typically the network)
• TDET – detection time
• TNOT – notification time
• TREM – remediation time; remediation is often the longest phase, so
TDET + TNOT should be made as small as possible
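The budget above is simple arithmetic: the three phase times must sum to under 50 ms. A hypothetical helper (the example durations are illustrative, not measurements):

```python
BUDGET_MS = 50.0  # end-to-end budget for the lowest (network) layer

def within_budget(t_det_ms: float, t_not_ms: float, t_rem_ms: float) -> bool:
    """True if detection + notification + remediation fit the 50 ms budget."""
    return (t_det_ms + t_not_ms + t_rem_ms) < BUDGET_MS

# 5 ms detection + 2 ms notification leaves 43 ms for remediation.
print(within_budget(5.0, 2.0, 40.0))    # 47 ms total -> True
print(within_budget(10.0, 10.0, 35.0))  # 55 ms total -> False
```

Because remediation dominates, every millisecond shaved off detection and notification widens the remediation budget.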
10. Automated Service Recovery Survey
Required recovery time | % of respondents
Within 1 second | 40%
Within 50 ms | 39%
Within 5 seconds | 20%
Automated recovery not important | 1%
Heavy Reading NFV operator survey of 128 service providers, “Telco Requirements for NFVI”, November 2016
11. Relative Impact to Service Availability
• Different infrastructure components have different impact
potential on application-level Service Availability, e.g.:
• Network switch faults have a very high impact potential on the SA
(can affect all associated nodes / services)
• Compute node faults can only affect the VMs / Containers running on
them
• Spine > Leaf > Network Nodes > Storage Nodes > Control Nodes >
Control Node (Specific Service) > Compute Nodes > Compute Nodes
(Critical Services) > Compute Node (Specific VM/Container)
12. Service Relative Criticality (cont’d)
• Focus monitoring/remediation efforts with respect to the
relative impact potential, e.g.:
Switch failure affects 10s of hosts (100s of services)
Need fast detection and remediation of switch failures
13. Proof of Concept
• Demonstrate that events can be detected < 10ms
• Node network interfaces
• Kernel fault conditions
• Complete node failure (and differentiate host vs. switch)
• Demonstrate that event messages can be delivered to
subscribed components with consistently low latency
(99.999% of the latency values < 10ms)
14. Proof of Concept (cont’d)
• Applications can be enhanced to include the subscription and
reception of events
• Ensure that the collectd framework is suitable for event
monitoring (detection latency & overhead)
• Prototype integration with OpenStack services
• Prototype a node/switch monitoring system that provides quick
detection without adding significant overhead
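One low-overhead way to meet the quick-detection goal is a heartbeat timeout. A minimal sketch under stated assumptions: the class, the 10 ms threshold, and the mechanism are illustrative, not the PoC's actual implementation (which used 802.1ag for node/switch differentiation):

```python
import time
from typing import Dict, List, Optional

class HeartbeatMonitor:
    """Declare a node failed if no heartbeat arrives within `timeout_s`."""

    def __init__(self, timeout_s: float = 0.010):  # 10 ms detection target
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` is injectable for testing."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def failed_nodes(self, now: Optional[float] = None) -> List[str]:
        """Return all nodes whose last heartbeat is older than the timeout."""
        t = time.monotonic() if now is None else now
        return [n for n, seen in self.last_seen.items()
                if t - seen > self.timeout_s]
```

A heartbeat alone cannot distinguish a dead host from a dead switch port; that is why a link-layer protocol such as 802.1ag is needed alongside it.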
15. Node Monitoring (PoC)
[Architecture diagram: a per-node Local Agent built around the collectd core. Ingress plugins monitor the kernel, network, CPU, memory, hardware (MCE), syslog, /proc, process PIDs, and interfaces, plus libvirt, cAdvisor, and the Kubelet. Egress plugins publish events and telemetry over Kafka/AMQP. A rules/action engine, driven by policy and topology configuration, can take local corrective actions. Events and telemetry feed OpenStack services (Ceilometer, Aodh, Gnocchi, Keystone), visualization, and MANO components (G-VNFM, NFVO/E2EO, RTMD).]
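A collectd configuration along these lines could wire the core to the monitoring and egress plugins shown. This is a sketch only: the plugin names are standard collectd plugins, but exact option names and availability vary by collectd version and build:

```
LoadPlugin cpu
LoadPlugin memory
LoadPlugin interface
LoadPlugin syslog
LoadPlugin virt          # libvirt guest statistics
LoadPlugin write_kafka   # egress: publish metrics/events to the Kafka bus

<Plugin write_kafka>
  Property "metadata.broker.list" "localhost:9092"
  <Topic "collectd-events">
    Format JSON
  </Topic>
</Plugin>
```

The write_kafka plugin hands serialized values to librdkafka, which is where the delivery-latency behavior measured in the PoC results comes from.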
16. Proof of Concept Results
• Demonstrate that events can be detected < 10ms
• Node network interfaces – Dependent upon driver but achievable
• Kernel fault conditions – Verified monitoring of syslog output
• Complete node failure (and differentiate host vs. switch) – Achieved using 802.1ag (Connectivity Fault Management)
17. Proof of Concept Results (cont'd)
• Demonstrate that event messages can be delivered to
subscribed services with consistently low latency. (99.999% of
the latency values < 10ms) – Mixed results with Kafka. With
simulated metrics from 700 nodes, average latency is
below 10ms. However, the cumulative latency distribution
had a long tail with values out to 200ms.
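The gap between a low average and a long tail is exactly what high-percentile analysis exposes. A sketch on synthetic latency samples (the distribution below is illustrative, not the PoC's 700-node data):

```python
import random
import statistics

random.seed(1)
# Synthetic latencies: mostly ~3 ms, plus a rare long tail out to ~200 ms.
samples = ([random.gauss(3.0, 1.0) for _ in range(99_900)] +
           [random.uniform(50.0, 200.0) for _ in range(100)])

def percentile(data, p):
    """p-th percentile by nearest-rank (simple, no interpolation)."""
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

print(f"mean    = {statistics.mean(samples):.2f} ms")     # well under 10 ms
print(f"p99     = {percentile(samples, 99):.2f} ms")
print(f"p99.999 = {percentile(samples, 99.999):.2f} ms")  # exposes the tail
```

A mean below 10 ms says nothing about the 99.999th percentile, which is why the PoC target is stated as a percentile bound rather than an average.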
• Applications can be enhanced to include the subscription and
reception of events
18. Proof of Concept Results (cont'd)
• Telco and enterprise applications can be enhanced to include
the subscription and reception of events – In progress. Low-
latency delivery of messages is achievable; however, issues
of scale and multi-tenancy/security still need to be
addressed.
19. What’s Next?
• Common Object Model for Events and Telemetry
• Inclusion of Object and Event model in TOSCA
• Event interfaces towards G-VNFM and other MANO subsystems