Openstack upgrade without_down_time_20141103r1


  1. 1. Openstack Upgrade Without Down Time November 5, 2014 Takashi Natsume, Software Innovation Center, NTT natsume.takashi@lab.ntt.co.jp Yankai Liu, Canonical yankai.liu@canonical.com
  2. 2. Agenda ● Introduction ● Live Upgrade Test Strategy and Plan ○ Pre-upgrade Investigation ○ Considerations in Creating Upgrade Procedure ○ Concrete Upgrade Procedure ○ Testing ○ Upgrade Test Results and Issues ● Summary
  3. 3. Introduction
  4. 4. Introduction: Who We Are. Takashi Natsume: Takashi Natsume has worked for NTT Corporation since April 2013. He is engaged in the system design of OpenStack-based public cloud systems and in the functional verification of OpenStack. Previously he worked on performance analysis and performance troubleshooting of systems. Yankai Liu: Yankai Liu is a Cloud Architect at Canonical, responsible for cloud architecture design and delivery. He worked with the NTT team as a consultant on the upgrade test project.
  5. 5. OpenStack Upgrade Overview: With OpenStack releases arriving quickly, upgrading has become a key operational concern for deployments. An upgrade can be performed offline or as a live upgrade. For production deployments, a live upgrade is desirable in order to: ● Achieve minimal or no downtime ● Keep up with the short OpenStack release cycle [1] ● Stay within maintenance support (the maintenance period is short [2]) ● Reduce cost compared with an offline upgrade. In this session, we describe how NTT designed and tested a live upgrade from Havana to Icehouse, service by service.
  6. 6. The Goal of NTT Cloud Live Upgrade No impact on users’ resource usage ● Users can keep using resources (VMs, virtual volumes, virtual networks) that have already been created or are running, without interruptions such as VM stops or network communication outages during the live upgrade. ● No performance problems that significantly affect users’ resource utilization. No impact on users’ API calls ● During the live upgrade, users can use the OpenStack API services as usual, with: ○ No errors or failures ○ No incorrect results ○ No performance problems that significantly affect users’ operations.
  7. 7. Upgrade Environment and Components • System environment • Built a test environment based on the NTT production public cloud system architecture (see the figure on the next page). • Upgrade components • OpenStack components: Nova, Cinder, Glance, Neutron, Keystone, Heat • Non-OpenStack components such as MySQL, RabbitMQ, the load balancer (ldirectord) and the OS were NOT included. • Upgrade version • stable/havana (2013.2.2) to icehouse-1 (nova: icehouse-3)
  8. 8. System Architecture Built for Upgrade Testing • Active/Active: processes that do not retain their state • Active/Standby: processes that retain their state • No HA (single): hypervisor hosts • Requests to processes that serve the REST API can be blocked at the load balancers deployed in front of them. • OS: Ubuntu Server 12.04 LTS
  9. 9. Live Upgrade Test Strategy and Plan
  10. 10. NTT Cloud Live Upgrade Test Strategy and Plan Overall Strategy ● A step-by-step (rolling) upgrade is needed for a live upgrade ● OpenStack components of different versions co-exist during the upgrade Live Upgrade Test Plan 1. Pre-upgrade investigation: items that should be considered in advance 2. Considerations in creating the detailed procedure 3. Concrete upgrade procedure 4. Testing 5. Upgrade test results and issues
  11. 11. Live Upgrade Test Strategy and Plan - Pre-upgrade investigation -
  12. 12. Pre-Investigation for Live Upgrade A) Database schema • In some cases the OpenStack database schemas differ between the old and new versions. • Investigate the DB schema changes before creating the upgrade plan. B) Consistency of APIs between components C) Consistency of APIs within each component • REST API • RPC API
  13. 13. Live Upgrade Test Strategy and Plan - Considerations in Creating Upgrade Procedure -
  14. 14. Considerations in Creating Upgrade Procedure • User resources • User resources on a host that is about to be upgraded need to be migrated to another host.
  15. 15. The Order of Upgrade: Decide the upgrade order based on RPC API version compatibility within the component. [Figure: Process C, Process B and Process A (server processes) connected by RPC calls.] A caller is upgraded after its callee. In this example, the upgrade is performed in the order process A, process B, process C.
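
The "callee before caller" rule amounts to a topological ordering of the RPC call graph. The following is a minimal sketch, not the presenters' tooling: the call direction (C calls B, B calls A) is an assumption chosen to reproduce the order stated on the slide, and `upgrade_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Hypothetical RPC call graph: each key is a caller, each value is the set of
# processes it calls. The direction C -> B -> A is an assumption that matches
# the upgrade order given on the slide.
rpc_calls = {
    "process C": {"process B"},
    "process B": {"process A"},
    "process A": set(),
}


def upgrade_order(calls):
    """Return the processes ordered so that every callee precedes its callers."""
    # TopologicalSorter treats the mapped values as predecessors,
    # so callees are emitted before the processes that call them.
    return list(TopologicalSorter(calls).static_order())


print(upgrade_order(rpc_calls))
```

A cycle in the call graph makes `static_order()` raise `graphlib.CycleError`, which is a useful signal that no safe one-at-a-time order exists for those processes.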
  16. 16. Operations Required for Step-by-Step Upgrade • Blockade (blocking requests) • Load balancer (ldirectord (LVS)) • Disable the service (nova-compute, cinder-volume) • Check for processing still in progress • Check connections at the load balancer • e.g. glance-api • Check child processes • e.g. nova-novncproxy • If a graceful shutdown function is available, it should be used. • Nova: icehouse-1 or later • Cinder: icehouse-1 or later • Neutron: icehouse-2 or later • Heat: havana-3 or later (we fixed a bug in juno-1) • Glance: not needed in our environment • Keystone: not needed in our environment
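
The block, check, then stop sequence can be sketched as a small drain loop. This is an illustration only: `block`, `active_connections` and `stop` are hypothetical callables that an operator would back with the load balancer, service-disable and process-management commands of their own environment; none of them is an OpenStack or ldirectord API.

```python
import time


def drain_and_stop(block, active_connections, stop,
                   timeout=300.0, interval=1.0, sleep=time.sleep):
    """Block new requests, wait for in-flight work to finish, then stop.

    Returns True if the service was stopped, or False if work was still in
    progress when the timeout expired (the service is then left running).
    """
    block()  # e.g. take the node out of rotation at the load balancer
    waited = 0.0
    while active_connections() > 0:  # e.g. connections or child processes
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    stop()
    return True


# Usage with stand-in callables: two checks report in-flight work, the third is idle.
remaining = iter([2, 1, 0])
events = []
stopped = drain_and_stop(
    block=lambda: events.append("block"),
    active_connections=lambda: next(remaining),
    stop=lambda: events.append("stop"),
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print(stopped, events)
```

Where a component offers graceful shutdown, the polling loop becomes a fallback rather than the main mechanism.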
  17. 17. Database Schema • Change the database schema at the beginning and at the end of the procedure • At the beginning • Add tables, add columns and add indexes • At the end • Drop tables, delete columns and delete indexes • In the current (community) nova live upgrade procedure, all nova-conductors are upgraded at the same time. (New-version and old-version nova-conductors do not run at the same time.) • Conversion of the data format should be considered • In our trial no data format conversion was needed, so this caused no problems. • Carefully review the code that defines the database schema. • For example, in nova • nova/db/sqlalchemy/migrate_repo/versions/* • Data conversion may be needed in some cases. • Adding 'triggers' to database tables?
  18. 18. Database Schema (cont’d) • Avoid holding database locks for a long time • Tools that can help: • pt-online-schema-change [3] • oak-online-alter-table [4]
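
The split between additive changes at the beginning and destructive changes at the end can be sketched as a simple classifier. This is a hedged illustration, not part of nova's migration code: the operation names and the example table names are hypothetical, and anything that is neither purely additive nor purely destructive (for instance a data format conversion) is deliberately rejected for manual review.

```python
# Hypothetical operation kinds; a real migration would be inspected by hand.
ADDITIVE = {"add_table", "add_column", "add_index"}
DESTRUCTIVE = {"drop_table", "drop_column", "drop_index"}


def split_schema_changes(operations):
    """Split (kind, target) pairs into a pre-upgrade and a post-upgrade phase."""
    before, after = [], []
    for kind, target in operations:
        if kind in ADDITIVE:
            before.append((kind, target))  # safe while old code still runs
        elif kind in DESTRUCTIVE:
            after.append((kind, target))   # only once no old code remains
        else:
            raise ValueError(f"needs manual review: {kind} on {target}")
    return before, after


# Example with made-up targets.
changes = [
    ("add_column", "example_table.new_col"),
    ("drop_column", "example_table.old_col"),
    ("add_index", "example_table.new_col_idx"),
]
pre_upgrade, post_upgrade = split_schema_changes(changes)
print(pre_upgrade)
print(post_upgrade)
```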
  19. 19. HA Configuration • From the point of view of live upgrade, an Active/Active configuration is better. • In some cases Active/Active cannot be configured, so Active/Standby is the only option. • cinder-volume (depends on the backend) • Active/Active can be configured by using Ceph (see the discussion at https://bugs.launchpad.net/cinder/+bug/1280367) • However, not all drivers support an Active/Active setup: https://bugs.launchpad.net/cinder/+bug/1322190 • neutron-server (depends on the plugin) • neutron-l3-agent/neutron-dhcp-agent • nova-consoleauth • heat-engine (but the multiple-engine function was implemented in icehouse-2)
  20. 20. HA Configuration (cont’d) • In the Active/Active case (controller) • At the load balancer, block the node that is being upgraded • In the Active/Standby case • When Active/Standby is switched, the component incurs service downtime, as expected.
  21. 21. Upgrade Procedure by HA Configuration • Active/Active configuration: (1) Block requests/connections to the target host (2) Migrate users’ resources (3) Upgrade the host (4) Unblock; repeat on each target host • No HA (single): (1) Block requests to the target host (2) Migrate users’ resources (3) Upgrade the host (4) Unblock; repeat on each target host • Active/Standby configuration: (1) Upgrade the ‘Standby’ host (2) Block requests to the ‘Active’ host (if possible) (3) Switch Active/Standby (4) Unblock; repeat on each target host
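
The three procedures differ only in their step lists, which can be written down as data. The sketch below simply encodes the slide's steps for one round; the mode names and the `upgrade_steps` helper are hypothetical, and how the round is repeated across hosts is left to the caller.

```python
def upgrade_steps(ha_mode, host, standby=None):
    """Return the ordered steps for one upgrade round on the given host(s)."""
    if ha_mode in ("active_active", "single"):
        return [
            f"block requests to {host}",
            f"migrate users' resources off {host}",
            f"upgrade {host}",
            f"unblock {host}",
        ]
    if ha_mode == "active_standby":
        if standby is None:
            raise ValueError("active_standby needs a standby host")
        return [
            f"upgrade {standby}",
            f"block requests to {host} (if possible)",
            f"switch Active/Standby from {host} to {standby}",
            "unblock",
        ]
    raise ValueError(f"unknown HA mode: {ha_mode}")


for step in upgrade_steps("active_standby", "ctl-1", standby="ctl-2"):
    print(step)
```

Note the structural difference: in Active/Active and single-host mode the host being upgraded is blocked first, whereas in Active/Standby the standby is upgraded before any request is blocked.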
  22. 22. Live Upgrade Test Strategy and Plan - Concrete upgrade procedure -
  23. 23. System Architecture Built for Upgrade Testing • Active/Active: processes that do not retain their state • Active/Standby: processes that retain their state • No HA (single): hypervisor hosts • Requests to processes that serve the REST API can be blocked at the load balancers deployed in front of them. • OS: Ubuntu Server 12.04 LTS
  24. 24. Overall Upgrade Procedure
  25. 25. Live Upgrade Test Strategy and Plan - Testing -
  26. 26. Create Test Plans, Test Tools and Test Data • Background workload during the upgrade test • The background workload (API requests) covered the patterns of calls between components, and between processes within components, that occur in our use case. • Network communication (ping) • North-South • East-West • Keep a VNC console connected during the upgrade test
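
A continuous ping during the upgrade is only useful if the result is reduced to a number that can be compared against the "no interruption" goal. A minimal sketch, assuming the ping results have already been collected as one boolean per probe (the collection itself, and the `longest_outage` name, are not from the slides):

```python
def longest_outage(replies, interval_s=1.0):
    """Return the longest run of consecutive lost pings, in seconds.

    `replies` holds one boolean per probe (True = reply received) and
    `interval_s` is the time between probes.
    """
    longest = current = 0
    for ok in replies:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * interval_s


# Example: one probe per second, with two separate gaps in the replies.
samples = [True, True, False, False, False, True, False, True]
print(longest_outage(samples))
```

Running the same reduction separately for North-South and East-West probes shows which path an interruption affected.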
  27. 27. Build a Test Environment • Same configuration as the production environment • HA configuration (Active/Active, Active/Standby) is required. • To repeat the upgrade test, we built the environment with Chef so that it can easily be restored to its initial state.
  28. 28. Execute (Test) the Procedure • Evaluation criteria • No impact on users’ resources • Users can keep using resources (VMs, virtual volumes, virtual networks) that have already been created or are running, without any interruption. • No performance problems that significantly affect users’ resource utilization. • No impact on users’ API calls • No errors • No ‘wrong’ results • No performance problems that significantly affect users’ operations • No operation step takes a long time • The records that OpenStack manages are consistent with the actual resources.
  29. 29. Live Upgrade Test Strategy and Plan - Upgrade Test results and issues -
  30. 30. Identify Issues • Solved issues • Heat graceful shutdown issue • The NTT team fixed it in juno-1 • https://bugs.launchpad.net/heat/+bug/1304244 • Remaining issues • Errors due to Active/Standby switchover • Volume resource creation failure (ERROR state) • Errors due to mismatched RPC API major versions • From nova-compute to nova-consoleauth • From nova-novncproxy to nova-consoleauth • Communication interruption (expected to be resolved in Juno) • neutron-l3-agent • Setting ‘admin_state_up’ of neutron-l3-agent to False solves the ‘scheduling’ issue, but a communication interruption occurred. • Interruption of the console connection • VM live migration/nova-novncproxy upgrade • Falling back is impossible after the DB schema is changed at the beginning
  31. 31. Lessons Learned • Clean install • Some source code directories/files should be removed during the upgrade and the fallback; otherwise they cause errors. • When the OpenStack components’ files were simply overwritten, errors occurred. • AttributeError: type object 'foo' has no attribute 'bar'
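
One way to follow the clean-install lesson is to delete the installed directory before copying in the new version, so that no file from the old version survives. The sketch below demonstrates this on a temporary directory; the package name `foo` and the file names are hypothetical, and the assumption that leftover old-version files caused the errors is an interpretation of the slide, not something it states.

```python
import shutil
import tempfile
from pathlib import Path


def clean_install(new_src, install_dir):
    """Replace install_dir with new_src, leaving no files from the old version."""
    if install_dir.exists():
        shutil.rmtree(install_dir)  # remove everything, including stale modules
    shutil.copytree(new_src, install_dir)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    installed = root / "installed" / "foo"
    installed.mkdir(parents=True)
    (installed / "removed_in_new_version.py").write_text("bar = 1\n")

    new_version = root / "new" / "foo"
    new_version.mkdir(parents=True)
    (new_version / "__init__.py").write_text("")

    clean_install(new_version, installed)
    remaining = sorted(p.name for p in installed.iterdir())

print(remaining)
```

Copying over the top of the old directory would instead leave `removed_in_new_version.py` in place, where it could still be imported. The same applies in reverse when falling back to the old version.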
  32. 32. Summary
  33. 33. Summary ● The goal of the upgrade test was to upgrade OpenStack without downtime, but some issues prevented us from achieving it. ● During our upgrade test, the following service downtime occurred: ○ Network downtime ■ neutron-l3-agent (expected to be fixed in Juno) ● Trade-off between new vRouter creation failures and VM communication: either a few minutes during which new vRouters cannot be scheduled, or a few minutes of communication interruption for some VMs ○ Downtime for some API requests during the Active/Standby switchover ● Neutron server ● Heat engine ● Cinder volume ○ Nova instance console connection interruption ■ Users need to reconnect or get the console URL again.
  34. 34. Suggestions for Communities • Active/Active HA support in cinder-volume drivers • At present, some drivers for commercial products prevent Active/Active from being configured • Consistency of RPC API major versions • Rolling upgrade across one version has (limited) support in Nova. • It should be considered in all core projects. • If OpenStack components use oslo.messaging, errors caused by RPC API major version differences might occur during a live upgrade. • Seamless console connection • Seamless console migration was discussed at the Juno summit [5] • Consider live upgrade when deprecating REST API versions • Active/Active HA support should be considered when integrating an SDN controller into Neutron as a plugin • Although Ceilometer was not in the test scope, there are still gaps in its Active/Active HA support • Graceful shutdown of all services
  35. 35. References • [1] Release Cycle: https://wiki.openstack.org/wiki/Release_Cycle • [2] Releases: https://wiki.openstack.org/wiki/Releases • [3] Percona Toolkit: http://www.percona.com/software/percona-toolkit • [4] openark kit: http://code.openark.org/forge/openark-kit • [5] Improve performance of live migration on KVM: https://etherpad.openstack.org/p/juno-nova-kvm-live-migration
