Ali Kafel, VP of Business Development
Ensuring High Availability and Resiliency for NFV
Monday 15th February, 2016,
3.00 - 6.00pm
Croke Park, Dublin 3, Ireland
MOVING IT TO THE FIELD
(CO-LOCATED WITH ETSI NFV#13)
The details of this presentation are covered in this White Paper:
http://www.slideshare.net/akafel/nfv-resiliency-whitepaper-ali-kafel-stratus-technologies
 Why We Need Resiliency vs High Availability
 Achieving Resiliency Management for NFV
 Proof point – ETSI PoC#35
Agenda
Stratus Technologies
Trusted Name in Fault Tolerant Computing for 35 years
 Hardware Fault Tolerance (1980 - Present)
• Proprietary Platforms
• Intel Platforms: ftServer
 Software Fault Tolerance (2008 - Present)
• everRun Enterprise
• 12,000+ Installed
 Resilient Cloud Technologies (2015 - Present)
• Stratus Fault Tolerant Cloud
• Based on proven SW infrastructure
Why the need for Resiliency in NFV
• It is no longer just about voice services: certain data and video services need HA and Resiliency more than voice
• Even “mature” cloud technologies still lack HA and Resiliency
Downtime per year
Uptime      Hours    Mins     Secs
99.9%       8.76     525.6    31536
99.99%               52.56    3154
99.999%              5.256    315.4
99.9999%             0.526    31.54
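The downtime figures above follow directly from the availability percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year; the function name is illustrative and not from the presentation:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


for pct in (99.9, 99.99, 99.999, 99.9999):
    secs = downtime_seconds_per_year(pct)
    # Print the same budget in seconds, minutes and hours.
    print(f"{pct}%: {secs:.2f} s = {secs / 60:.3f} min = {secs / 3600:.4f} h")
```

Each extra "nine" divides the yearly downtime budget by ten, which is why the table moves from hours to minutes to seconds.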
 Reliability
• How long a system performs its intended function.
• MTBF = total time in service / number of failures
 Availability
• % of time that equipment is in an operable state, i.e. service accessibility and service continuity
• Availability (A) = Uptime / (Uptime + Downtime);
• A = MTBF / (MTBF + MTTR)
 Resiliency
• The ability to recover quickly from failures and return to the original form / state, maintaining both an operable state and QoS
• Resiliency (R) = Availability (A) + QoS
 What you need is R, not just A, because, for example:
• A 99.999% application that fails once a week for just 1 second and disrupts active services is not Resilient and not acceptable
• A 99.9999% application that causes increased latency during a fault is not acceptable
Defining Reliability, Availability and Resiliency
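The once-a-week example can be checked with the availability formula from this slide. The sketch below is a hedged illustration: it assumes each failure lasts exactly 1 second (so MTTR = 1 s) and that MTBF is the rest of the week; the variable names are hypothetical.

```python
WEEK_SECONDS = 7 * 24 * 3600
YEAR_SECONDS = 365 * 24 * 3600


def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


mttr = 1.0                      # assumed: each failure lasts 1 second
mtbf = WEEK_SECONDS - mttr      # assumed: one failure per week
a = availability(mtbf, mttr)
failures_per_year = YEAR_SECONDS / WEEK_SECONDS

print(f"A = {a:.6%}, failures per year = {failures_per_year:.0f}")
```

Under these assumptions the application still clears the 99.999% bar on availability alone, yet it interrupts active services roughly 52 times a year, which is the gap between A and R that the slide points at.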
Resiliency Management cannot be done in the VNFs
…..Because you cannot manage what you cannot see
• Over 80% of system failure modes are not directly visible to the VNFs
• Infrastructure decoupling hides from the VNFs the information required to take action on faults
[Figure: what the VNFs are exposed to and depend on. Labels: VNFs; Virtualized Resources; Virtualization SW (vC, vN, vS); NODE SW C/N/S; NODE HW C/N/S; NFVI Fabric; Shared Storage; Shared Network; Facility Infra; DCIM; VIM; VNFM; SDNC-OL; SDNC-UL; Access Networks; Core Networks; External Dependencies. Fault labels: HW Faults; SW Faults; Config Faults; Migrations; Upgrades; Performance Faults; Resource Depletion; Fault Impacts.]
Resiliency management can be “designed In” in multiple ways
but it’s best done in the Software Infrastructure
 In the Hardware
Stack labels: Applications / VNFs; Operating Environment; Hardware
Pros:
• Transparent – no code change
• Fast & Simple Deployment
• No special App Software
Cons:
• Very expensive
• Inefficient utilization
• Special Hardware
• Rigid
 In the Applications
Stack labels: Applications / VNFs; Middleware; Operating Environment; Hardware
Pros:
• App specific state can be Customized
Cons:
• Can’t detect & manage all infrastructure faults
• Code written for resiliency increased by ~40%
• Most developers don’t have Resiliency experience
• More complex & Longer time to develop
 In the Software Infrastructure
Stack labels: Applications / VNFs; Operating Environment with Resilience Layer; Hardware
Pros:
• Broader & Faster fault detection and correlation
• Faster and simpler Application development
• Transparent – no code changes
• Multiple levels of Resiliency
Cons:
• Needs to be adaptable to a wide range of Application Architectures
Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market
Axis label on the slide: Costs & Resources
Resiliency Management
It’s Complex, Multi-Dimensional and more than just Fault Management
 Resiliency Management comprises Fault Management, Availability Management and Configuration Management
 Phases: Detection (Prediction), Localization, Isolation, Remediation (Service restoration), Recovery (Redundancy restoration)
 Resiliency depends on multiple factors
• Speed of Service restoration & Redundancy restoration
• State Management: Service continuity
• “Key state” versus “All state”
• Redundancy mode: Resource consumption / cost
• Application performance impact
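The distinction between service restoration and redundancy restoration can be made concrete with a small timing model. This is only a sketch: it assumes the phases named on this slide run strictly one after another, and the class names and durations are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    DETECTION = 1      # (prediction)
    LOCALIZATION = 2
    ISOLATION = 3
    REMEDIATION = 4    # service restoration
    RECOVERY = 5       # redundancy restoration


@dataclass
class FaultTimeline:
    """Seconds spent in each phase for one fault (illustrative values only)."""
    durations: dict

    def service_restoration_time(self) -> float:
        # Service is back once remediation completes.
        return sum(d for p, d in self.durations.items()
                   if p.value <= Phase.REMEDIATION.value)

    def redundancy_restoration_time(self) -> float:
        # Full protection is back only after recovery completes as well.
        return sum(self.durations.values())


timeline = FaultTimeline({
    Phase.DETECTION: 0.05,
    Phase.LOCALIZATION: 0.02,
    Phase.ISOLATION: 0.03,
    Phase.REMEDIATION: 0.4,
    Phase.RECOVERY: 20.0,
})
print(timeline.service_restoration_time(), timeline.redundancy_restoration_time())
```

Keeping the two totals separate matters because the service can be usable again long before the system is back to a fully protected state.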
 State Protection
• Remembering the preceding events in a given sequence of interactions within the application
• All or partial?
 Service Restoration (or Failover)
• Ensuring that service is restored either through a fast restart or failover to an active secondary or hot standby
• The speed of Service Restoration depends on the type of application
 Some applications need State Protection; most applications need fast Service Restoration
Multi-dimensional aspects of Resiliency
Two Key Elements: Service Restoration and State protection
Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
 Types of State Protection
• Full state protection
• Key state protection
• No state protection (Stateless)
 State Management has implications on
• Transparency
• Performance
• Resources
 To do Fast Remediation you need
• Pre-instantiation
• State management
[Figure, shown on the slides as a three-step build: State Management (from No State Protection to State Protection) plotted against Service Restoration Speed (Slow (mins), Medium (secs), Fast (msecs)).
• “Cold Standby”, Slow (mins): re-instantiation after failure, no standby. Start from reset, or key state stored on disk.
• “Warm Standby”, Medium (secs): pre-instantiated before failure, failover to a running standby. Failover, or failover + key state reload with key state stored in RAM or disk.
• “Hot Standby or Active-Active”, Fast (msecs): pre-instantiated before failure, failover to a running standby. Failover with full VM state in RAM.
• Regions labelled Service Accessibility and Service Continuity.
• Example applications placed on the chart: “Web server”, “OSS, Billing”, “email, SMS”, “vCE Router Forwarder”, “vPE Router Forwarder”, “Voice control, Router Control”.]
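The relationship between standby mode and service restoration speed can be sketched as a simple selection rule. The orders of magnitude (minutes, seconds, milliseconds) come from the slides; the numeric cut-offs, the function name and the idea of picking the slowest mode that still fits are assumptions made for illustration.

```python
from enum import Enum


class StandbyMode(Enum):
    COLD = "Cold Standby"                  # re-instantiation after failure, no standby
    WARM = "Warm Standby"                  # pre-instantiated, failover to running standby
    HOT = "Hot Standby or Active-Active"   # pre-instantiated, fastest restoration


# Slide orders of magnitude: slow (mins), medium (secs), fast (msecs).
# The exact thresholds below are illustrative assumptions only.
COLD_MIN_SECONDS = 60.0
WARM_MIN_SECONDS = 1.0


def pick_standby_mode(tolerable_outage_seconds: float) -> StandbyMode:
    """Pick the slowest mode whose restoration speed still fits the tolerable outage."""
    if tolerable_outage_seconds >= COLD_MIN_SECONDS:
        return StandbyMode.COLD
    if tolerable_outage_seconds >= WARM_MIN_SECONDS:
        return StandbyMode.WARM
    return StandbyMode.HOT


for budget in (300.0, 5.0, 0.05):
    print(budget, "->", pick_standby_mode(budget).value)
```

A real policy would also weigh the state protection level (full, key or none) and the resource cost of keeping a standby running, both of which the slides list as separate dimensions.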
Fault Tolerant Systems Provide Service Continuity, Even During Failures
Cold Restart versus Hot Standby or Active-Active
... it’s like surviving a heart attack versus preventing one
 Cold Restart (Instant HA)
• Re-instantiation after failure: No Standby
• All state is Lost
• Customer Affecting Application Outage, then App Restart, then Normal
• Heart attack analogy: Immense Pain, Loss of Consciousness, Loss of Bodily Control, Temporary Brain Loss
 Hot Standby or Active-Active (Fault Tolerant)
• Pre-instantiated Before failure: Failover to running Standby
• All state is Preserved
• Fully Protected, then Backup Activated - Unprotected, then Restored to Fully Protected Redundancy
[Figure: both timelines start at the Failure and are plotted on a time axis of msecs, secs, mins, hours, days.]
State protection
Guaranteeing Globally Consistent State
 Different ways to describe StatePointing
• Active-Standby synchronous VM replication
• Also known as checkpointing with I/O barrier, I/O lock-stepping or buffering
 What does it guarantee
• Application transparency
• The I/O barrier prevents all external communications from the speculative execution prior to state replication
• Consistent VM memory replica between the active and the hot standby, at the confirmed statepoint
We call it StatePointing (VM replication)
Providing Service Continuity with fast Service Restoration
 VM instances paired between primary and secondary hosts in the cloud infrastructure
 State of primary (active) captured regularly and applied to secondary (HotStandby)
 StatePoint™ = VM Checkpoint + I/O StateStepping
• Provides globally consistent state
 Fast service restoration from the most recent StatePoint upon primary failover to secondary
 Automatic redundancy restoration through third host instantiation
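The interaction between checkpointing and the I/O barrier can be illustrated with a toy model. This is a simplified sketch, not the Stratus implementation: it assumes replication is acknowledged synchronously at each StatePoint, models guest state as a single counter, and uses hypothetical class and method names.

```python
from dataclasses import dataclass, field


@dataclass
class HotStandby:
    confirmed_epoch: int = -1
    state: dict = field(default_factory=dict)

    def apply(self, epoch: int, snapshot: dict) -> int:
        """Store the replicated state and acknowledge the epoch."""
        self.state = dict(snapshot)
        self.confirmed_epoch = epoch
        return epoch


@dataclass
class ActiveHost:
    standby: HotStandby
    state: dict = field(default_factory=lambda: {"counter": 0})
    epoch: int = 0
    held: list = field(default_factory=list)      # egress packets behind the barrier
    released: list = field(default_factory=list)  # packets visible to the outside world

    def run_epoch(self, requests: int) -> None:
        """Speculative execution: replies are produced but held back."""
        for _ in range(requests):
            self.state["counter"] += 1
            self.held.append(f"reply-{self.state['counter']}")

    def statepoint(self) -> None:
        """Capture state, replicate it, then lift the barrier for this epoch."""
        snapshot = dict(self.state)
        ack = self.standby.apply(self.epoch, snapshot)
        if ack == self.epoch:
            self.released.extend(self.held)
            self.held.clear()
        self.epoch += 1


standby = HotStandby()
active = ActiveHost(standby)
active.run_epoch(3)
active.statepoint()   # replies 1-3 become externally visible
active.run_epoch(2)   # primary fails here, before the next StatePoint
print(active.released, standby.state)
```

In this model the outside world has only seen replies that the standby's state already accounts for, so a failover to the standby resumes from a globally consistent point; the two held replies are never transmitted.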
[Figure: StatePoint timeline. The Active Host runs guest epochs (N-1, N, N+1) and sends a StatePoint (SP N-1, SP N) to the Hot Standby Host after each completed epoch. If the primary host fails, it automatically switches to the secondary host, which resumes guest execution (Epoch N+1, N+2). A Third Host (created post primary failure) starts the guest from image and receives subsequent StatePoints (SP N+1 ... SP N+X).]
[Figure: StatePoint sequence between the Guest VM (Active), QEMU (Active), the Network Egress Queue and QEMU (Standby) on the active and hot standby hosts, for statepoints n-1, n and n+1. Egress packets P1 to P5 are enqueued; an I/O barrier is inserted at each statepoint and removed later (for example, the n-1 barrier is removed while the n barrier is still on). Lanes show QEMU Monitor activity, snapshots, and active-standby and egress network traffic.]
 Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed
 Pause, Capture, Resume (PCR): phases of the StatePoint process when VM execution is suspended
Note: For simplicity, n-2 interactions are not shown.
Multiple levels of resiliency
Ensures flexibility and resource optimization based on the application
 Deliver Availability as an infrastructure service to virtual and cloud ecosystems
 While every VNF needs Fault Management, not all need state protection
 Modes of protection
• Fault Tolerant (includes State protection)
• High Availability (no State protection)
• Unprotected
 Monolithic VNFs: Firewall, MME, IMS, Web Server
 De-composed VNFs (separate control and forwarding elements)
• VNF-C Forwarding Elements: Stateless Fast Path Forwarding Elements
• VNF-C Control Element: Stateful Control Element
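Assigning a protection mode per VNF component can be sketched as follows. The three modes are those named on the slide; the component model, the `protected` opt-out flag and the rule "stateful gets Fault Tolerant, stateless gets High Availability" are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ProtectionMode(Enum):
    FAULT_TOLERANT = "Fault Tolerant (includes state protection)"
    HIGH_AVAILABILITY = "High Availability (no state protection)"
    UNPROTECTED = "Unprotected"


@dataclass(frozen=True)
class VnfComponent:
    name: str
    stateful: bool
    protected: bool = True  # hypothetical flag: False means the operator opts out


def protection_mode(component: VnfComponent) -> ProtectionMode:
    """Map a component to a protection mode (illustrative rule only)."""
    if not component.protected:
        return ProtectionMode.UNPROTECTED
    if component.stateful:
        return ProtectionMode.FAULT_TOLERANT
    return ProtectionMode.HIGH_AVAILABILITY


# A de-composed VNF: one stateful control element, three stateless forwarders.
decomposed_vnf = [
    VnfComponent("control-element", stateful=True),
    VnfComponent("forwarding-element-1", stateful=False),
    VnfComponent("forwarding-element-2", stateful=False),
    VnfComponent("forwarding-element-3", stateful=False),
]
plan = {c.name: protection_mode(c) for c in decomposed_vnf}
for name, mode in plan.items():
    print(f"{name}: {mode.value}")
```

Only the control element pays the cost of state protection here, which is the resource optimization the slide argues for.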
Resiliency Functionality in the NFVI nodes & managed in the MANO
 Stratus Node Resiliency Services (NRS): Protection with Application transparency, no code changes
 Stratus Resiliency Management Services (RMS), shown with the MANO and the OpenStack environment
 The Stratus approach has implemented enhancements in KVM and plug-ins in OpenStack to make it seamless for the VNFs
[Figure: Linux-based virtualized functions (EPC, PCRF, HSS, IMS, ...; Optical Transport, L3 Routing and L2 Switching control planes; Billing, Customer Care, NOC) grouped under NFV, Virtualized SDN and Virtualized OSS/BSS, with Orchestration, running on Virtualization over Commodity Hyper Scale COTS Computing, Commodity High Volume Networking and Commodity High Volume Storage.]
 SW Infrastructure Resiliency Management
• Fault protection for all applications, no required code changes for most apps
• State Protection, offering globally consistent state
• Multiple levels of Resiliency – Software Defined Availability (SDA)
 Control vs. Forwarding element, Stateful vs. stateless, etc
 Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market
Benefits of Resiliency Management
that includes Fault Management, Availability Management
and Configuration Management
The Stratus-led PoC (ETSI PoC#35)
Availability Management with Stateful Fault Tolerance
Participants of PoC#35
• Demonstrated at:
  NFV World Congress, May 6-8, San Jose, CA
  OpenStack Summit, May 2015, Vancouver, Canada
  SDN World Congress, Oct 2015, Dusseldorf, Germany
• Completed 7/31/2015, final report submitted
http://nfvwiki.etsi.org/index.php?title=Availability_Management_with_Stateful_Fault_Tolerance
 OpenStack based VIM mechanisms alone are insufficient for supporting carrier grade resiliency; Stratus Cloud Technology addresses that gap and provided stateful failover, enabling service continuity with acceptable QoS
• Service Restoration in milliseconds
• Redundancy Restoration in seconds
 Any non-resilient VNF can be made Resilient instantly with no code change (as long as it is OpenStack ready; there is no standard way to package a VNF)
 Multiple levels of Resiliency can be easily provided using Software Defined Resiliency in the Infrastructure, based on the application's requirements for State and service restoration speed
What we proved with PoC#35
Thank You!

ETSI NFV#13 NFV resiliency presentation - ali kafel - stratus

  • 1. Ali Kafel, VP of Business Development Ensuring High Availability and Resiliency for NFV Monday 15th February, 2016, 3.00 - 6.00pm Croke Park, Dublin 3, Ireland 1 MOVING IT TO THE FIELD (CO-LOCATED WITH ETSI NFV#13) The details of this presentation are covered in this White Paper: http://www.slideshare.net/akafel/nfv-resiliency-whitepaper-ali-kafel-stratus-technologies
  • 2.  Why We Need Resiliency vs High Availability  Achieving Resiliency Management for NFV  Proof point – ETSI PoC#35 Agenda 2
  • 3. 3 Stratus Technologies Intel Platforms ftServer Hardware Fault Tolerance Proprietary Platforms 1980 - Present Software Fault Tolerance everRun Enterprise 12,000+ Installed 2008 - Present Trusted Name in Fault Tolerant Computing for 35 years Stratus Fault Tolerant Cloud Resilient Cloud Technologies Based of proven SW infrastructure 2015-present
  • 4.  Why We Need Resiliency vs High Availability  Achieving Resiliency Management for NFV  Proof point – ETSI PoC#35 4
  • 5. 5 Why the need for Resiliency in NFV • It is no longer about voice services ….. Certain data and video services need HA and Resiliency more that voice • Even “mature” cloud technologies still lack HA and Resiliency uptime hours mins secs 99.9% 8.76 525.6 31536 99.99% 52.56 3154 99.999% 5.256 315.4 99.9999% 0.526 31.54 Down time
  • 6.  Reliability • How long a system performs its intended function. • MTBF = total time in service / number of failures  Availability • % of time an equipment is in an operable state ie. Service accessible and service continuity • Availability (A) = Uptime / (Uptime + Downtime); • A = MTBF / (MTBF + MTTR)  Resiliency • The ability to recover quickly from failures, to return to its original form / state to maintain operable state + QoS • Resiliency (R) = Availability (A) + QoS  What you need is R, not just A… because, for example:…  A 99.999% application that fails once a week for just 1 secs and disrupts active services is not Resilient and not acceptable  A 99.9999% application that causes increases latency during a fault is not acceptable Defining Reliability, Availability and Resiliency Stratus Technologies Page 2
  • 7. Resiliency Management cannot be done in the VNFs …..Because you cannot manage what you cannot see VNFs Virtualized Resources Performance Faults Resource Depletion Fault Impacts External Dependencies AccessNetworks Are exposed to Depend 0n VNFM SDNC-OL SDNC-UL Shared Storage Shared Network NFVI Fabric NODE HW C/N/S NODE SW C/N/S Virtualization SW vC, vN, Vs Facility Infra DCIM CoreNetworks Over 80% of system failure modes are not directly visible by the VNFs Infrastructure decoupling hides the information required to take actions on faults from VNFs VIM HW Faults SW Faults Config Faults Migrations Upgrades 7Stratus Technologies
  • 8. Resiliency management can be “designed In” in multiple ways but it’s best done in the Software Infrastructure Applications / VNFs Operating Environment Hardware • Transparent – no code change • Fast & Simple Deployment • No special App Software • Very expensive • Inefficient utilization • Special Hardware • Rigid Costs&Resources Pros Cons In the Hardware In the Applications In the Software Infrastructure Applications / VNFs . . . . Operating Environment Hardware • App specific state can be Customized • Can’t detect & manage all infrastructure faults • Code written for resiliency increased by ~40% • Most developers don’t have Resiliency experience • More complex & Longer time to develop Middleware Applications / VNFs Operating Environment with Resilience Layer Hardware • Needs to be adaptable to a wide range of Application Architectures • Broader & Faster fault detection and correlation • Faster and simpler Application development • Transparent – no code changes • Multiple levels of Resiliency Benefits: • Reduces Development & Verification time • Lower Risks • Faster time to market 8Stratus Technologies
  • 9.  Why We Need Resiliency vs High Availability  Achieving Resiliency Management for NFV  Proof point – ETSI PoC#35 9
  • 10. Resiliency Management It’s Complexity, Multi-Dimensional and more than just Fault Management Detection (Prediction) Localization Isolation Remediation (Service restoration) Recovery (Redundancy restoration) Resiliency on multiple factors • Speed of Service restoration & Redund. restoration • State Management: Service continuity • “Key state” versus “All state” • Redundancy mode: Resource consumption / cost • Application performance impact 10Stratus Technologies Availability Management Configuration Management Fault management
  • 11.  State Protection  Remembering the preceding events in a given sequence of interactions within the application  All or partial?  Service Restoration (or Failover)  Insuring that service is restored either through a fast restart or failover to an active secondary or hotStandy  The speed of Service Restoration depends on the type of application  Some applications need State Protection, most applications need fast Service Restoration Multi-dimensional aspects of Resiliency Two Key Elements: Service Restoration and State protection 11Stratus Technologies
  • 12. StateProtectionNoStateProtection StateManagement Slow (mins) Start from reset Key state stored on disk Re-instantiation after failure: No Standby “OSS, Billing” “Web server” Multi-dimensional aspects of Resiliency State Protection versus Service Restoration  Types of State Protection  Full state protection  Key state protection  No state protection (Stateless)  State Management has implications on  Transparency  Performance  Resources “Cold Standby” Service Restoration Speed 12Stratus Technologies
  • 13. Multi-dimensional aspects of Resiliency: State Protection versus Service Restoration. Types of state protection: • Full state protection • Key state protection • No state protection (stateless). State management has implications on: • Transparency • Performance • Resources. [Figure: state management (State Protection / No State Protection) plotted against service restoration speed. Slow (mins), “Cold Standby”, re-instantiation after failure with no standby: start from reset; key state stored on disk. Medium (secs), “Warm Standby”, pre-instantiated before failure with failover to a running standby: failover; key state stored in RAM or disk. Example applications: “OSS, Billing”, “email, SMS”, “Web server”, “vCE Router Forwarder”.]
  • 14. Multi-dimensional aspects of Resiliency: State Protection versus Service Restoration. Types of state protection: • Full state protection • Key state protection • No state protection (stateless). State management has implications on: • Transparency • Performance • Resources. To do fast remediation you need: • Pre-instantiation • State management. [Figure: state management (State Protection / No State Protection) plotted against service restoration speed, with regions labelled Service Accessibility and Service Continuity. Slow (mins), “Cold Standby”, re-instantiation after failure with no standby: start from reset; key state stored on disk. Medium (secs), “Warm Standby”: failover; failover + key state reload; key state stored in RAM or disk. Fast (msecs), “Hot Standby or Active-Active”: failover; full VM state in RAM. Warm and hot standby are pre-instantiated before failure, with failover to a running standby. Example applications: “OSS, Billing”, “email, SMS”, “Voice control, Router Control”, “Web server”, “vCE Router Forwarder”, “vPE Router Forwarder”.]
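The relationship between restoration speed, standby type and state management on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Standby`, `StateProtection`, `Requirement`, `choose_standby` and `needs_preinstantiation` are hypothetical, and the 1-second and 60-second thresholds are placeholders standing in for the slide's "msecs", "secs" and "mins" tiers, not values from the presentation.

```python
from dataclasses import dataclass
from enum import Enum


class Standby(Enum):
    COLD = "cold standby"                  # re-instantiated after failure (mins)
    WARM = "warm standby"                  # running standby, failover (secs)
    HOT = "hot standby or active-active"   # running standby, failover (msecs)


class StateProtection(Enum):
    NONE = "no state protection (stateless)"
    KEY = "key state protection"
    FULL = "full state protection"


@dataclass(frozen=True)
class Requirement:
    max_restoration_seconds: float  # tolerated service restoration time
    state: StateProtection          # how much state must survive a failure


def choose_standby(req: Requirement) -> Standby:
    """Pick the slowest (cheapest) tier that still meets the restoration target."""
    # Thresholds are assumptions, chosen only to separate msecs / secs / mins.
    if req.max_restoration_seconds < 1.0:
        return Standby.HOT
    if req.max_restoration_seconds < 60.0:
        return Standby.WARM
    return Standby.COLD


def needs_preinstantiation(mode: Standby) -> bool:
    """Warm and hot standby fail over to an instance that is already running."""
    return mode is not Standby.COLD


def needs_state_management(req: Requirement) -> bool:
    """Any application that is not stateless needs some state management."""
    return req.state is not StateProtection.NONE


if __name__ == "__main__":
    voice_control = Requirement(0.05, StateProtection.FULL)
    mode = choose_standby(voice_control)
    print(mode.value, needs_preinstantiation(mode), needs_state_management(voice_control))
```

The sketch deliberately keeps restoration speed and state protection as separate inputs, since the slide treats them as two independent dimensions.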
  • 15. Fault Tolerant Systems Provide Service Continuity, Even During Failures. Cold Restart versus Hot Standby or Active-Active: it’s like surviving a heart attack versus preventing one. [Figure: timeline after a failure on a scale of msecs, secs, mins, hours, days. Cold Restart (Instant HA): re-instantiation after failure, no standby; all state is lost; customer-affecting application outage until app restart, annotated “Immense Pain”, “Loss of Consciousness”, “Loss of Bodily Control”, “Temporary Brain Loss”. Hot Standby or Active-Active (Fault Tolerant): pre-instantiated before failure, failover to running standby; all state is preserved; phases: Fully Protected, Backup Activated (Unprotected), Restored to Fully Protected Redundancy.]
  • 16. State protection: Guaranteeing Globally Consistent State. Different ways to describe StatePointing: • Active-standby synchronous VM replication • Also known as checkpointing with I/O barrier, I/O lock-stepping or buffering. What does it guarantee? • Application transparency • The I/O barrier prevents all external communications from the speculative execution prior to state replication • Consistent VM memory replica between active and hot standby, at the confirmed statepoint
  • 17. We call it StatePointing (VM replication): Providing Service Continuity with fast Service Restoration. • VM instances paired between primary and secondary hosts in the cloud infrastructure • State of primary (active) captured regularly and applied to secondary (hot standby) • StatePoint™ = VM Checkpoint + I/O StateStepping; provides globally consistent state • Fast service restoration from the most recent StatePoint upon primary failover to secondary • Automatic redundancy restoration through third host instantiation. [Figure: timeline of the active host running guest epochs N-1, N, N+1, N+2, with StatePoints SP N-1, SP N, SP N+1 applied to the hot standby host. If the primary host fails, service automatically switches to the secondary host, which becomes the active host; a third host, created after the primary failure, starts the guest from image, receives StatePoints up to SP N+X and becomes the new hot standby host.]
  • 18. Active-Standby & Egress Network Traffic. [Figure: sequence diagram across Guest VM (Active), QEMU (Active), Network Egress Queue [snapshots] and QEMU (Standby) for statepoints n-1, n and n+1. Egress packets P1 to P5 are enqueued; an I/O barrier is inserted at each statepoint (insert n-1, insert n, insert n+1) and the state is sent to the standby; queue snapshots show the n-1 I/O barrier still on, the n-1 I/O barrier removed, the n I/O barrier still on, and the n I/O barrier removed.] Legend: Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed. PCR: Pause, Capture, Resume; phases of the StatePoint process when VM execution is suspended. Note: for simplicity, n-2 interactions are not shown.
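The egress barrier described on this slide can be illustrated with a small, self-contained Python model. It is a sketch under assumptions, not the Stratus or QEMU implementation: the class `EgressBarrierQueue` and its methods are hypothetical names, and "confirmation" here simply stands for the standby acknowledging that it holds the statepoint for a given epoch.

```python
from collections import deque


class EgressBarrierQueue:
    """Holds outbound packets until the statepoint that covers them is confirmed."""

    def __init__(self):
        # Entries are ("pkt", payload) or ("barrier", epoch), in arrival order.
        self._queue = deque()
        # Packets actually released to the network, in order.
        self.sent = []

    def enqueue(self, packet):
        """Guest produced an outbound packet during the current run epoch."""
        self._queue.append(("pkt", packet))

    def insert_barrier(self, epoch):
        """Mark the end of a run epoch, when the VM is paused for capture."""
        self._queue.append(("barrier", epoch))

    def confirm(self, epoch):
        """Standby holds statepoint `epoch`: release packets ahead of its barrier."""
        if ("barrier", epoch) not in self._queue:
            raise ValueError(f"no pending barrier for epoch {epoch}")
        while True:
            kind, value = self._queue.popleft()
            if kind == "pkt":
                self.sent.append(value)
            elif value == epoch:
                return  # barrier removed; later packets stay queued
            # An older, unconfirmed barrier is implied by this one: drop it.

    def pending(self):
        """Packets still held back behind a barrier or not yet covered by one."""
        return [value for kind, value in self._queue if kind == "pkt"]


if __name__ == "__main__":
    q = EgressBarrierQueue()
    q.enqueue("P1")
    q.enqueue("P2")
    q.insert_barrier(1)   # end of epoch 1, state capture starts
    q.enqueue("P3")       # produced in epoch 2 while epoch 1 is replicating
    q.confirm(1)          # standby acknowledged statepoint 1
    print(q.sent, q.pending())
```

In this model nothing produced during an epoch leaves the host before the matching statepoint is confirmed, so a failover to the standby never contradicts output that the outside world has already seen.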
  • 19. Multiple levels of resiliency: ensures flexibility and resource optimization based on application needs. Deliver availability as an infrastructure service to virtual and cloud ecosystems. While every VNF needs fault management, not all need state protection. Modes of protection: • Fault Tolerant (includes state protection) • High Availability (no state protection) • Unprotected. [Figure: monolithic VNFs (Firewall, MME, IMS, Web Server) next to de-composed VNFs with separate control and forwarding elements: several stateless fast-path VNF-C Forwarding Elements and one stateful VNF-C Control Element, each assigned a mode of protection.]
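As a rough illustration of assigning a mode of protection per VNF component, the following Python sketch maps stateful components to Fault Tolerant and stateless ones to High Availability. The function `protection_for`, the `protected` flag and the component names are hypothetical and chosen only for this example; the slide itself does not prescribe a selection rule.

```python
from enum import Enum


class Protection(Enum):
    FAULT_TOLERANT = "fault tolerant (includes state protection)"
    HIGH_AVAILABILITY = "high availability (no state protection)"
    UNPROTECTED = "unprotected"


def protection_for(stateful: bool, protected: bool = True) -> Protection:
    """Assumed rule: state protection only where the component holds state."""
    if not protected:
        return Protection.UNPROTECTED
    return Protection.FAULT_TOLERANT if stateful else Protection.HIGH_AVAILABILITY


# Hypothetical de-composed VNF: one stateful control element and
# three stateless fast-path forwarding elements (True means stateful).
components = {
    "control-element": True,
    "forwarding-element-1": False,
    "forwarding-element-2": False,
    "forwarding-element-3": False,
}

plan = {name: protection_for(stateful) for name, stateful in components.items()}

if __name__ == "__main__":
    for name, mode in plan.items():
        print(f"{name}: {mode.value}")
```

Keeping the decision per component, not per VNF, is what lets the more expensive state protection be reserved for the control element.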
  • 20. Resiliency functionality in the NFVI nodes, managed in the MANO. Protection with application transparency, no code changes. The Stratus approach implements enhancements in KVM and plug-ins in OpenStack to make it seamless for the VNFs. [Figure: NFV stack. Commodity hyper-scale COTS computing, commodity high-volume networking and commodity high-volume storage sit under a virtualization layer with Stratus Node Resiliency Services (NRS). Linux guests run EPC, PCRF, HSS, IMS, … (NFV); optical transport, L3 routing and L2 switching control planes (virtualized SDN); billing, customer care and NOC (virtualized OSS/BSS). Orchestration: MANO in an OpenStack environment with Stratus Resiliency Management Services (RMS).]
  • 21. Benefits of Resiliency Management that includes Fault Management, Availability Management and Configuration Management. SW infrastructure resiliency management: • Fault protection for all applications, no required code changes for most apps • State protection, offering globally consistent state • Multiple levels of resiliency: Software Defined Availability (SDA); control vs. forwarding element, stateful vs. stateless, etc. Benefits: • Reduces development & verification time • Lower risks • Faster time to market
  • 22. Agenda: • Why We Need Resiliency vs High Availability • Achieving Resiliency Management for NFV • Proof point: ETSI PoC#35
  • 23. The Stratus-led PoC (ETSI PoC#35): Availability Management with Stateful Fault Tolerance. • Demonstrated at NFV World Congress, May 6-8, San Jose, CA; OpenStack Summit, May 2015, Vancouver, Canada; SDN World Congress, Oct 2015, Dusseldorf, Germany • Completed 7/31/2015, final report submitted: http://nfvwiki.etsi.org/index.php?title=Availability_Management_with_Stateful_Fault_Tolerance [Figure: participants of PoC#35.]
  • 24. What we proved with PoC#35: • OpenStack-based VIM mechanisms alone are insufficient for supporting carrier-grade resiliency; Stratus Cloud Technology addresses that gap and provided stateful failover, enabling service continuity with acceptable QoS: service restoration in milliseconds, redundancy restoration in seconds • Any non-resilient VNF can be made resilient instantaneously with no code change (as long as it is OpenStack ready; there is no standard way to package a VNF) • Multiple levels of resiliency can be easily provided using Software Defined Resiliency in the infrastructure, based on the application’s requirements for state and service restoration speed