Based on the lifecycle of the OPNFV Doctor project, this case study shows how operator requirements “on paper” have successfully been realized step-by-step and in close cooperation with upstream community projects into a mature fault management framework. A demo of the solution had been presented in a keynote at the last OpenStack Summit. The talk will describe how we have worked in the OPNFV Doctor project and will provide some lessons learned on this journey. With significant experience now of working OPNFV requirements upstream to OpenStack, we’ll share best practices for submitting contributions upstream, how to best communicate, and how to overcome the primary challenges.
Swimming upstream: OPNFV Doctor project case study
1. Swimming upstream -
Doctor project case study
Gerald Kunzmann, NTT DOCOMO
Carlos Goncalves, NEC
Tomi Juvonen, Nokia
2. AGENDA
– Introduction to OPNFV Doctor project
– Detailed work flow
– Pain points
– Best practises for upstream contributions
– Summary
3. OPNFV Doctor project - Introduction
• Goal:
– Develop and build fault management and maintenance framework for high
availability of Network Services running on top of virtualized infrastructure.
Proposed with a very clear target / key feature:
– Immediate notification of unavailability of virtualized resources
from VIM to Consumer
• Members:
– NEC (PTL: Ryota Mibu), AT&T, Cisco, Cloudbase Solutions, Corenova, Ericsson,
Hephaex, Huawei, Intel, KDDI, KT, Nokia, NTT DOCOMO, Spirent, Sprint,
Telecom Italia, Vmsec, ZTE
• https://wiki.opnfv.org/display/doctor/
4. OPNFV Doctor project – Timeline
Q3 Q4
Q1
2015
Q2 Q3 Q4
Q1
2016
Q2 Q3 Q4
Q1
2017
Q2
OPNFV launch
30 Sep, 2014
Doctor creation
2 Dec, 2014
Arno release
4 May, 2015
OPNFV Summit 2015
9 Nov, 2015
Bramahputra release
1 Mar, 2016
OPNFV Summit 2016
12 Jun, 2016
Colorado release
16 Sep, 2016
OpenStack Summit
Barcelona 2017
25 Oct, 2016 Danube release
4 Apr, 2017
ARNO
- Requirement document
BRAHMAPUTRA
- Ceilometer „Immediate
Notification“
- Nova „Mark Host Down“
- Functional test cases
- PoC demo at OPNFV Summit
- Documentation updates
COLORADO
- Nova:
„Get valid server state“
- Integration of Congress
as Doctor Inspector
- Extended functional tests
- PoC demo at OPNFV Summit and
OpenStack Summit Barcelona
- Documentation updates
DANUBE
- Neutron „Port Status update“
- Inspector design guidelines
- Performance profiler
- Documentation updates
6. 1. Doctor requirements document
• Use cases and scenarios
– Active-Standby configuration (1+1 redundancy):
• Consumer of infrastructure has configured ACT-STBY
• Fault in virtualized infrastructure (NFVI) inform the Consumer to switch to STBY instance
– Prevention actions based on fault prediction: Switch to STBY in case of a predicted fault
– NFVI maintenance: inform Consumer(s) of affected hardware about planned maintenance
• Requirements
1. Monitor physical and virtual resources and detect problems
2. Correlate faults and identify affected virtual resources
3. Notification of Consumer(s) of affected virtual resources
4. Execute steps 1-3 in less than e.g. 1 second to avoid service disruption
Consumer C1 Consumer C2 Consumer C3
Virtualized Infrastructure Manager
(VIM), e.g. OpenStack
Resource
Map
Server – VM mapping
Server S1 VM-1, VM-2
Server S2 VM-7
Server S3 VM-4
Ownership information
VM-1, VM-7 Consumer C1
VM-2 Consumer C2
VM-4 Consumer C3
Resource Pool
Hypervisor
Hardware
Server S1
VM-1
Hypervisor
Hardware
Server S2
Hypervisor
Hardware
Server S3
VM-2 VM-7 VM-4
X 1. Fault Monitoring
- Hardware fault
- Hypervisor fault
- Host OS fault
6. Execute Instruction
- e.g. migrate VM
2. Inform the Consumer?
If YES, find owner of
affected VMs from database
OpenStack Northbound Interface
3. FaultNotification
(VM ID, Fault ID)
5. Instruction
(VM ID)
4. Switch to SBY configuration
7. Doctor architecture and OpenStack mapping
Monitor
Notifier
Manager /
Consumer
Virtualized Infrastructure
(Resource Pool)
Alarm
Conf.
3. Update State
2. Find Affected
Application
Controller
Controller
Controller
Resource
Map
1. Raw Failure
Inspector
4. Notify all
5. Notify Error
0. Set Alarm
6. Action
Failure
Policy
Monitor
Monitor
Aodh
Vitrage
e.g. Zabbix
Cinder
Neutron
Nova
Congress
8. Resource state – missing feature
To be:
• Nova API shall support to change nova-compute state
• User shall be able to read OpenStack states and trust they are correct
As is (Kilo release):
• When a VM goes down due to a host HW, host OS or hypervisor failure,
nothing happens in OpenStack. The VMs of a crashed host/hypervisor
are reported to be live and OK through the OpenStack API.
• nova-compute state might change too slowly or the state is not reliable
if expecting also VMs to be down.
3. Gap analysis and solution brainstorming
Solution brainstorming:
• Discussion with experts on best way to address the gap
• BP spec for „mark nova-compute down“
Immediate notification – deficiency in operation
To be:
• VIM to immediately notify unavailability of virtual resource to VIM user.
• User shall only receive fault notifications related to owned resource(s).
As is (Kilo release):
• OpenStack Metering ‘Ceilometer’ can notify unavailability of resource.
• Due to innerworking of Ceilometer (polling of events), notification of
faults takes seconds to few minutes (depending on polling interval).
• Performance issue for Ceilometer in medium to large scale deployments.
10. 5. Test cases and user manual
• End to end test cases
– Upstream: unit tests and scope-restricted functional tests upstream
– Downstream: E2E functional tests will validate full systems
integration
• Manuals
– How to use implemented blueprint
– How it plays within the framework
– How to run the tests and interpret results
– Greatly promotes technology adoption!
11. 6. PoCs, demos and hackfests
Make people
aware of the
work
Collect
feedback
Find
supporters
Work
upstream
Integrate and
validate
12. 7. Upstream achievements
Project Blueprint Spec Drafter Lead Developer Status
Aodh Event Alarm Evaluator Ryota Mibu (NEC) Ryota Mibu (NEC) Completed (Liberty)
Nova New nova API call to mark nova-compute down Tomi Juvonen (Nokia) Roman Dobosz (Intel) Completed (Liberty)
Support forcing service down Tomi Juvonen (Nokia) Carlos Goncalves (NEC) Completed (Liberty)
Get valid server state Tomi Juvonen (Nokia) Tomi Juvonen (Nokia) Completed (Mitaka)
Add notification for service status change Balazs Gibizer (Ericsson) Balazs Gibizer (Ericsson) Completed (Mitaka)
Congress Push Type Datasource Driver Masahito Muroi (NTT) Masahito Muroi (NTT) Completed (Mitaka)
Adds Doctor Driver Masahito Muroi (NTT) Masahito Muroi (NTT) Completed (Mitaka)
Neutron Port data plane status Carlos Goncalves (NEC) Carlos Goncalves (NEC) Completed (Pike)
13. 8. Summary: typical work flow
Requirement
Architecture
& Gaps
Solution &
review
internally
Reach
upstream
Integrate,
test and
document
Present
demos,
collect
feedback
14. Pain points
•North America, Europe, Asia
Time zone
•OPNFV has handful of installers
•Each use different configuration management tools (e.g. Puppet, Ansible, Charm)
•Different default configurations (e.g. Ceilometer/Aodh, Nova)
Installers
•Internal and cross-project meetings
•Release planning and deliverables
•Alignment and collaboration with relevant SDOs (e.g. ETSI NFV)
Overhead
•OPNFV ships ~6-month old OpenStack release
•Need to backport bugs and features
•Address each installer individually
Backports
•Making sure system integration’s in place
•End to end scenarios require great knowledge across several upstream projectsTesting
•Sometimes forsaken
Documentation
16. Contribute to a Project
• If you want to have something done, you should
do it yourself
• Contribute to OpenStack project you are planning
to have your change
– You will gain respect in the community and people
get to know you
– You get to understand how everything works
• Project on-boarding sessions
5/19/2017 16
17. From OPNFV Requirements to a Spec
• Explain your problem well and have a good material
linked to tell things in more detail
• Make use cases carefully
• Try to find the benefit for a wider audience
• Use common-ground terminology rather than NFV-specific
• Make proposal for the implementation. Review will shape it right
5/19/2017 17
18. Discuss Your Change
• Join openstack-dev mailing list, Project IRC and weekly
meetings
• OpenStack has PTG just before the start of the next release
cycle. There each project team meets face-to-face to actually
work on the new features, not just talk. There are also cross-
project sessions
• OpenStack summit is arranged 2 months after release cycle
starts. There is now a forum with the opportunity to discuss
requirements with the project and cross-project development
teams for the current +1 release (now Queens' Forum). For the
first time, everyone will have the opportunity to discuss
requirements face to face, in time to get them into the current
+1 release
5/19/2017 18
19. Need a Wider View
• Email to OpenStack ops and dev mailing list
• Telco reviewers
• Talk with operators
– IRC meetings
– OpenStack Operators Telco and NFV Working
Group
– OpenStack Operators Mid-Cycle
– OpenStack Operators sessions in Forum
5/19/2017 19
20. Pike Release Timeline - Steps With Your Change
→ Just before the next release cycle starts, have your idea ready in your
head and start to draft the spec. Join the PTG if necessary to discuss
face-to-face with the project. Start making the code if you didn’t already
→ Pike-1: You need to have the spec merged. Next you should have the
first version of your code ready for review. Note! Project cores do not
have much time, make all the test cases and code as good as you can
to have it trough reviews in time. There is usually earlier similar patches
as an example
→ Pike-2: Earlier non-priority code freeze, but more relaxed in current
release
→ Pike-3: Feature freeze. Everything have to be merged before this date
→ Pike release: Might drop some follow up patches
5/19/2017 20
(Feb 20)
(Apr 10)
(Jun 05)
(Jul 24)
(Aug 28)
21. Lessons Learned
• Be prepared
• POC code
• Scope it right
• Discuss and communicate; Avoid wasted effort
5/19/2017 21
NOVA MARK HOST DOWN BP (Tomi, Roman, Carlos)
Have your proposals reviewed internally first
Listen to other point of views and converge
Smooths out lot of technical and editorial issues
Be more inclusive
Listen to other point of views and converge
Be willing to make compromises
Use common-ground terminology rather than NFV-specific
Which benefits does your proposal bring to non-NFV users?
Formulate solid use cases covering heterogeneous environments
Higher chances of acceptance and engineering resources