ETSI NFV#13 NFV resiliency presentation - ali kafel - stratus

Ali Kafel, VP of Business Development
Ensuring High Availability and Resiliency for NFV
Monday 15th February, 2016,
3.00 - 6.00pm
Croke Park, Dublin 3, Ireland
MOVING IT TO THE FIELD
(CO-LOCATED WITH ETSI NFV#13)
The details of this presentation are covered in this White Paper:
http://www.slideshare.net/akafel/nfv-resiliency-whitepaper-ali-kafel-stratus-technologies

Agenda
▪ Why We Need Resiliency vs High Availability
▪ Achieving Resiliency Management for NFV
▪ Proof point – ETSI PoC#35

Stratus Technologies
Trusted Name in Fault Tolerant Computing for 35 years
▪ Hardware Fault Tolerance (1980 - Present): Proprietary Platforms, Intel Platforms, ftServer
▪ Software Fault Tolerance (2008 - Present): everRun Enterprise, 12,000+ Installed
▪ Resilient Cloud Technologies (2015 - Present): Stratus Fault Tolerant Cloud, based on proven SW infrastructure

Why the need for Resiliency in NFV
• It is no longer about voice services: certain data and video services
need HA and Resiliency more than voice
• Even “mature” cloud technologies still lack HA and Resiliency
Downtime per year
Uptime      hours    mins     secs
99.9%       8.76     525.6    31536
99.99%      -        52.56    3154
99.999%     -        5.256    315.4
99.9999%    -        0.526    31.54
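The downtime figures can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the table refers to downtime per 365-day year (8,760 hours); the function name is illustrative and not part of the presentation.

```python
# Downtime budget per year for a given availability percentage.
# Assumption: a 365-day year (8,760 hours), which matches the table's figures.
HOURS_PER_YEAR = 365 * 24


def downtime_per_year(availability_percent):
    """Return the allowed downtime per year as (hours, minutes, seconds)."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    hours = HOURS_PER_YEAR * unavailable_fraction
    return hours, hours * 60, hours * 3600


for pct in (99.9, 99.99, 99.999, 99.9999):
    hours, mins, secs = downtime_per_year(pct)
    print(f"{pct}%: {hours:.3f} h = {mins:.3f} min = {secs:.2f} s")
```

Each extra "nine" divides the yearly downtime budget by ten, which is why five nines leaves only about five minutes per year.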

Defining Reliability, Availability and Resiliency
▪ Reliability
• How long a system performs its intended function.
• MTBF = total time in service / number of failures
▪ Availability
• Percentage of time that equipment is in an operable state, i.e. service accessibility and
service continuity
• Availability (A) = Uptime / (Uptime + Downtime)
• A = MTBF / (MTBF + MTTR)
▪ Resiliency
• The ability to recover quickly from failures and return to the original form /
state, maintaining an operable state plus QoS
• Resiliency (R) = Availability (A) + QoS
▪ What you need is R, not just A, because, for example:
• A 99.999% application that fails once a week for just 1 sec and disrupts active services is not
Resilient and not acceptable
• A 99.9999% application that causes increased latency during a fault is not acceptable
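To make the "A is not enough" point concrete, the sketch below applies A = MTBF / (MTBF + MTTR) to the once-a-week, one-second failure mentioned above. The numbers for the hypothetical application are derived only from that example; the variable names are illustrative.

```python
def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR); both arguments use the same time unit."""
    return mtbf / (mtbf + mttr)


# Hypothetical application: fails once a week and takes 1 second to recover.
SECONDS_PER_WEEK = 7 * 24 * 3600
mttr_s = 1.0
mtbf_s = SECONDS_PER_WEEK - mttr_s  # time in service between failures

a = availability(mtbf_s, mttr_s)
print(f"Availability: {a * 100:.5f}%")
print("Meets five nines:", a >= 0.99999)
```

The application clears the five-nines bar on paper, yet it disrupts active services every week. The availability figure alone does not capture that, which is the gap the R = A + QoS definition is meant to close.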

Resiliency Management cannot be done in the VNFs
... because you cannot manage what you cannot see
▪ Over 80% of system failure modes are not directly visible to the VNFs
▪ Infrastructure decoupling hides from the VNFs the information required to take
actions on faults
[Figure: what the VNFs depend on and are exposed to.
Relationship labels: Depend on; Are exposed to.
Fault labels: HW Faults, SW Faults, Config Faults, Migrations, Upgrades, Performance Faults, Resource Depletion, Fault Impacts.
Element labels: VNFs, Virtualized Resources, External Dependencies, Access Networks, Core Networks, VNFM, VIM, SDNC-OL, SDNC-UL, Shared Storage, Shared Network, NFVI Fabric, Node HW (C/N/S), Node SW (C/N/S), Virtualization SW (vC, vN, vS), Facility Infra, DCIM.]

Resiliency management can be “designed In” in multiple ways
but it’s best done in the Software Infrastructure
▪ In the Hardware
• Pros: Transparent – no code change; Fast & Simple Deployment; No special App Software
• Cons: Very expensive; Inefficient utilization; Special Hardware; Rigid
▪ In the Applications
• Pros: App specific state can be Customized
• Cons: Can’t detect & manage all infrastructure faults; Code written for resiliency increased by ~40%; Most developers don’t have Resiliency experience; More complex & Longer time to develop
▪ In the Software Infrastructure
• Pros: Broader & Faster fault detection and correlation; Faster and simpler Application development; Transparent – no code changes; Multiple levels of Resiliency
• Cons: Needs to be adaptable to a wide range of Application Architectures
▪ Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market
[Figure: layer stack for each option. Labels: Applications / VNFs, Operating Environment, Middleware, with Resilience Layer, Hardware. Axis label: Costs & Resources.]


Resiliency Management
It’s Complex, Multi-Dimensional and more than just Fault Management
▪ Covers Fault Management, Availability Management and Configuration Management
▪ Stages: Detection (Prediction) -> Localization -> Isolation -> Remediation (Service restoration) -> Recovery (Redundancy restoration)
▪ Resiliency depends on multiple factors
• Speed of Service restoration & Redundancy restoration
• State Management: Service continuity
• “Key state” versus “All state”
• Redundancy mode: Resource consumption / cost
• Application performance impact
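The distinction between service restoration (end of Remediation) and redundancy restoration (end of Recovery) can be sketched as two intervals measured from the moment of the fault. This is an illustrative model only: it assumes the five stages complete strictly in sequence, and every timestamp below is invented.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = 1
    LOCALIZATION = 2
    ISOLATION = 3
    REMEDIATION = 4  # service restoration
    RECOVERY = 5     # redundancy restoration


def restoration_times(fault_time, completed_at):
    """completed_at maps each Stage to the time (in seconds) it finished.

    Returns (service_restoration_s, redundancy_restoration_s), both measured
    from the moment the fault occurred.
    """
    # Assumption: stages finish in definition order, none before the fault.
    times = [completed_at[stage] for stage in Stage]
    if times != sorted(times) or times[0] < fault_time:
        raise ValueError("stage completion times are out of order")
    return (completed_at[Stage.REMEDIATION] - fault_time,
            completed_at[Stage.RECOVERY] - fault_time)


# Hypothetical timeline for one fault (all numbers invented for illustration).
timeline = {
    Stage.DETECTION: 0.010,
    Stage.LOCALIZATION: 0.020,
    Stage.ISOLATION: 0.030,
    Stage.REMEDIATION: 0.050,
    Stage.RECOVERY: 12.0,
}
service_s, redundancy_s = restoration_times(0.0, timeline)
print(f"service restored after {service_s:.3f} s, "
      f"redundancy restored after {redundancy_s:.1f} s")
```

Between the two instants the service is running but unprotected, so both durations matter when comparing resiliency approaches.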

Multi-dimensional aspects of Resiliency
Two Key Elements: Service Restoration and State Protection
▪ State Protection
• Remembering the preceding events in a given sequence of
interactions within the application
• All or partial?
▪ Service Restoration (or Failover)
• Ensuring that service is restored either through a fast restart or
failover to an active secondary or Hot Standby
• The speed of Service Restoration depends on the type of
application
▪ Some applications need State Protection; most
applications need fast Service Restoration

State Protection versus Service Restoration
▪ Types of State Protection
• Full state protection
• Key state protection
• No state protection (Stateless)
▪ State Management has implications on
• Transparency
• Performance
• Resources
[Figure: State Management (No State Protection to State Protection) plotted against Service Restoration Speed.
Slow (mins), “Cold Standby”: re-instantiation after failure, no standby. State handling: start from reset; key state stored on disk. Examples: “Web server”, “OSS, Billing”.
Medium (secs), “Warm Standby”: pre-instantiated before failure, failover to running standby. State handling: failover; key state stored in RAM or disk. Examples: “vCE Router Forwarder”, “email, SMS”.
Fast (msecs), “Hot Standby or Active-Active”. State handling: failover + key state reload; failover with full VM state in RAM. Examples: “vPE Router Forwarder”, “Voice control, Router Control”.
Other labels: Service Accessibility, Service Continuity.]
▪ To do Fast Remediation you need
• Pre-instantiation
• State management
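The trade-off between restoration speed and resource cost can be sketched as a small lookup. This is a hypothetical illustration, not the Stratus implementation: the numeric values are stand-ins for "mins", "secs" and "msecs", the cheapest-first ordering is an assumption, and marking Hot Standby as pre-instantiated follows the "Fast Remediation needs Pre-instantiation" point above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyMode:
    name: str
    pre_instantiated: bool      # is a standby running before the failure?
    restoration_order_s: float  # assumed order of magnitude, in seconds


# Assumed to be ordered from cheapest to most resource-hungry.
MODES = [
    StandbyMode("Cold Standby", False, 60.0),
    StandbyMode("Warm Standby", True, 1.0),
    StandbyMode("Hot Standby or Active-Active", True, 0.001),
]


def cheapest_mode(max_outage_s):
    """Pick the first (cheapest) mode whose restoration time fits the budget."""
    for mode in MODES:
        if mode.restoration_order_s <= max_outage_s:
            return mode
    raise ValueError("no standby mode restores service that quickly")


print(cheapest_mode(300).name)
print(cheapest_mode(5).name)
print(cheapest_mode(0.05).name)
```

Only budgets in the sub-second range force the fastest mode, and in this sketch both modes that meet such budgets are pre-instantiated.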

Cold Restart versus Hot Standby or Active-Active
... it’s like surviving a heart attack versus preventing one
Fault Tolerant Systems Provide Service Continuity, Even During Failures
▪ Cold Restart (Instant HA): re-instantiation after failure, no standby
• All state is Lost
• Failure, then Customer Affecting Application Outage, then App Restart, then Normal
• Heart attack analogy: Immense Pain, Loss of Consciousness, Loss of Bodily Control, Temporary Brain Loss
▪ Hot Standby or Active-Active (Fault Tolerant)
• All state is Preserved
• Fully Protected, then Failure, then Backup Activated - Unprotected, then Restored to Fully Protected Redundancy
[Figure: both timelines drawn on a time axis marked msecs, secs, mins, hours, days]

State protection
Guaranteeing Globally Consistent State
▪ Different ways to describe StatePointing
• Active-Standby synchronous VM replication
• Also known as checkpointing with I/O barrier, I/O lock-stepping or
buffering
▪ What does it guarantee?
• Application transparency
• The I/O barrier prevents all external communications from the
speculative execution prior to state replication
• Consistent VM memory replica between active and hot
standby, at the confirmed statepoint

We call it StatePointing (VM replication)
Providing Service Continuity with fast Service Restoration
▪ VM instances paired between primary and secondary hosts in the cloud infrastructure
▪ State of primary (active) captured regularly and applied to secondary (Hot Standby)
▪ StatePoint™ = VM Checkpoint + I/O StateStepping
• Provides globally consistent state
▪ Fast service restoration from the most recent StatePoint upon primary failover to secondary
▪ Automatic redundancy restoration through third host instantiation
▪ If the primary host fails, service automatically switches to the secondary host
[Figure: StatePoint (SP) timeline across three hosts.
Active Host: Guest Run Epoch N-1, SP N-1, Guest Run Epoch N, SP N.
Hot Standby Host: SP N-1, SP N, Guest Run Epoch N+1, SP N+1, Guest Run Epoch N+2, SP N+X.
Third Host (created post primary failure): Guest From Image, SP N+1, SP N+X.
Role labels: Active host, Hot Standby host.]

Active-Standby & Egress Network Traffic
▪ Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed
▪ PCR: Pause, Capture, Resume; phases of the StatePoint process when VM execution is suspended
▪ Note: For simplicity, n-2 interactions are not shown.
[Figure: sequence diagram over epochs n-1, n and n+1.
Lanes: Guest VM (Active), QEMU (Active), QEMU (Standby), QEMU Monitor, Network Egress Queue.
Labels: Enqueue, packets P1 to P5, PCR (three occurrences), [snapshots], Insert n-1 state I/O barrier, Insert n, Insert n+1 barrier, n-1 I/O barrier still on, n-1 I/O barrier removed, n I/O barrier still on, n I/O barrier removed.]
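The role of the egress barrier can be illustrated with a toy simulation. This is a deliberately simplified sketch and not the Stratus or QEMU implementation: all class and method names are invented, replication is synchronous rather than pipelined, and the "VM state" is a single counter.

```python
import copy


class Standby:
    def __init__(self):
        self.state = None  # last confirmed statepoint

    def apply(self, snapshot):
        self.state = snapshot
        return True  # acknowledgement back to the active side


class ActiveVM:
    def __init__(self, standby):
        self.state = {"counter": 0}
        self.egress_queue = []  # packets held behind the I/O barrier
        self.sent = []          # packets actually released to the network
        self.standby = standby

    def run_epoch(self, requests):
        """Speculative execution: mutate state, queue (do not send) replies."""
        for _ in range(requests):
            self.state["counter"] += 1
            self.egress_queue.append(f"reply-{self.state['counter']}")

    def statepoint(self):
        """Pause, capture, resume; release output only after the standby acks."""
        snapshot = copy.deepcopy(self.state)     # capture while paused
        if self.standby.apply(snapshot):         # replicate and wait for ack
            self.sent.extend(self.egress_queue)  # barrier removed
            self.egress_queue.clear()


standby = Standby()
vm = ActiveVM(standby)
vm.run_epoch(3)
vm.statepoint()  # epoch n-1 confirmed, its three replies are released
vm.run_epoch(2)  # epoch n runs, then assume the active host fails here

# Failover: the standby resumes from the last statepoint. The two queued
# replies were never sent, so the outside world saw nothing from epoch n.
print(standby.state, vm.sent, vm.egress_queue)
```

Because output from an epoch leaves the host only after that epoch's state is confirmed on the standby, a failover can never contradict something a peer has already received. That is the sense in which the state is "globally consistent".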

Multiple levels of resiliency
Ensures flexibility and resource optimization based on the application
▪ Deliver Availability as an infrastructure service to virtual and
cloud ecosystems
▪ While every VNF needs Fault Management, not all need state
protection
▪ Monolithic VNFs: Firewall, MME, IMS, Web Server
▪ De-composed VNFs (separate control and forwarding elements)
• Stateless Fast Path Forwarding Elements: three VNF-C Forwarding Elements
• Stateful Control Element: one VNF-C Control Element
▪ Modes of protection
• Fault Tolerant (includes State protection)
• High Availability (no State protection)
• Unprotected
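One way to picture per-component protection for a de-composed VNF is a small policy function. The policy, the component names and the `needs_fast_restoration` flag are all hypothetical; the three mode names are the ones listed above.

```python
from enum import Enum


class Protection(Enum):
    FAULT_TOLERANT = "Fault Tolerant (includes state protection)"
    HIGH_AVAILABILITY = "High Availability (no state protection)"
    UNPROTECTED = "Unprotected"


def protection_for(stateful, needs_fast_restoration):
    """Hypothetical policy: pay for state protection only where state exists."""
    if stateful:
        return Protection.FAULT_TOLERANT
    if needs_fast_restoration:
        return Protection.HIGH_AVAILABILITY
    return Protection.UNPROTECTED


# A de-composed VNF: one stateful control element, stateless forwarders.
# Each value is (stateful, needs_fast_restoration).
components = {
    "control-element": (True, True),
    "forwarding-element-1": (False, True),
    "forwarding-element-2": (False, True),
    "batch-reporting": (False, False),  # invented, to show the third mode
}
plan = {name: protection_for(*flags) for name, flags in components.items()}
for name, mode in plan.items():
    print(f"{name}: {mode.value}")
```

Keeping the costly mode for the stateful control element and a lighter mode for the stateless forwarders is the kind of resource optimization the slide describes.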

Resiliency Functionality in the NFVI nodes & managed in the MANO
Protection with Application transparency, no code changes
▪ The Stratus approach has implemented enhancements in KVM and plug-ins in OpenStack to make it seamless for the VNFs
[Figure: Stratus resiliency components in an OpenStack environment.
Stratus labels: Stratus Node Resiliency Services (NRS), Stratus Resiliency Management Services (RMS), MANO, OpenStack environment.
Linux guests: EPC, PCRF, HSS, IMS, …, Optical Transport Control Plane, L3 Routing Control Plane, L2 Switching Control Plane, Billing, Customer Care, NOC.
Group labels: NFV, Virtualized SDN, Virtualized OSS/BSS, Orchestration.
Infrastructure labels: Virtualization, Commodity Hyper Scale COTS Computing, Commodity High Volume Networking, Commodity High Volume Storage.]

Benefits of Resiliency Management
that includes Fault Management, Availability Management
and Configuration Management
▪ SW Infrastructure Resiliency Management
• Fault protection for all applications, no required code changes for most apps
• State Protection, offering globally consistent state
• Multiple levels of Resiliency – Software Defined Availability (SDA)
➢ Control vs. Forwarding element, Stateful vs. stateless, etc.
▪ Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market

The Stratus-led PoC (ETSI PoC#35)
Participants of PoC#35
Availability Management with Stateful Fault Tolerance
• Demonstrated at:
NFV World Congress, May 6-8, San Jose, CA
OpenStack Summit, May 2015, Vancouver, Canada
SDN World Congress, Oct 2015, Dusseldorf, Germany
• Completed 7/31/2015, final report submitted
http://nfvwiki.etsi.org/index.php?title=Availability_Management_with_Stateful_Fault_Tolerance

What we proved with PoC#35
▪ OpenStack based VIM mechanisms alone are insufficient for supporting
carrier grade resiliency, but Stratus Cloud Technology solves that and
provides stateful failover, enabling service continuity with acceptable QoS
• Service Restoration in milliseconds
• Redundancy Restoration in seconds
▪ Any non-resilient VNF can be made instantaneously Resilient with no code
change (as long as it is OpenStack ready and there is no standard way to
package a VNF)
▪ Multiple levels of Resiliency can be easily provided using Software Defined
Resiliency in the Infrastructure, based on the application’s requirements for State
protection and service restoration speed
