Today we’re here to talk about upgrading OpenStack
Ideally we don’t want to break everything
And the session description promised you we wouldn’t even break Neutron, but we’ll see how that worked out.
Both principal engineers at TWC on the OpenStack team
Clayton: focus on automation, CI/CD, deployments, etc.
Sean: focus on networking, compute
● Our OpenStack team started with four people about two years ago
● We did our proof of concept implementation on Havana and then after the Atlanta summit decided to switch everything to Icehouse and VXLAN-based networking before going to production in the summer
● Since then we’ve done an upgrade to Juno and Kilo
● These are the versions of the services we’re currently running
○ This talk will focus on our last round of control node upgrades, which included Nova, Neutron, Glance, Cinder and Heat
○ Since our Kilo upgrade, we’ve moved Heat into a Docker container and upgraded it to Liberty
○ Horizon and Keystone aren’t included because those were already on Kilo.
● There are a few core tenets that we feel are important and that we try to follow regarding OpenStack upgrades.
● The first one is: You really don’t want to fall behind.
● We plan on upgrading every 6 months
We think you should also, even if you want to wait for bug fixes on the stable branch
The primary reason is that this is the only tested path for upgrades
And with rolling upgrades and lazy DB migrations, there are now intermediate steps that have to be done between releases
For example, in Kilo, the Nova flavor migration must be run before upgrading to Liberty
Automate everything
If you don’t automate everything then when you start your testing...
You’re going to feel like this guy
Test it over and over
Get your process down
Upgrades might impact customers, so try to find out what that impact is
● Our team gave an upgrades talk in Vancouver; some of you may have been to that talk also
○ We appreciate anyone that felt like they wanted to hear us talk about OpenStack upgrades twice in one year.
● We’re going to try not to cover too much of the same ground; the Juno talk is on YouTube and it covers our overall approach
○ We’re going to talk more about updates to that approach and issues we ran into while upgrading to Kilo
● So when deciding timing for our Kilo upgrade, there was one major feature we were looking forward to:
● Like most people using OpenStack, we use RabbitMQ as the message broker for all intra-service communications
● Like most people using OpenStack, we’ve had tons of problems with this, although it’s gotten better
● The biggest remaining problem we’ve seen with Juno was that if anything went wrong, OpenStack services would not realize they were disconnected from Rabbit
○ Nova-compute was particularly bad about this.
● AMQP heartbeats are a protocol-level feature that lets the RabbitMQ server and clients check in on each other regularly
○ If one of them goes missing, everything gets cleaned up and clients can reconnect in a timely fashion
○ This was added as an experimental feature in Kilo and we’d heard good things.
Before you start down the path of upgrading, you have to know your requirements for acceptable downtime and outage.
This also requires balancing technical capabilities and desires with customer needs.
For instance...
If you can just forklift upgrade to a new environment or even reinstall the same servers, that’s the easiest approach.
We, as operators, love this. It makes our life operationally easy.
Another option we like is to...
● ...think of the upgrade process as a pit stop...,
● pulling the entire cloud out of the race and swapping workloads over a short period of time.
● It’s a short outage, but a total one.
● The problem is, our customers don’t want _any_ outage
● This is what our customers want. Zero downtime! That’s what we need.
● These guys change two tires on the car in about 5 minutes, while the car is driving down the road the whole time.
● And, unfortunately, we don’t get to change the tires on just one side of the car.
● In the end, our requirements ended up being...
● Our customers are OK with an API outage of, say, 10 or 15 minutes.
● They’re not OK with any other sort of outage
● This is basically what our requirements were for both our Juno and Kilo upgrades
● For Juno our upgrade weakness was networking.
● Let’s talk about our improvement goals for our Kilo upgrade
For the Kilo upgrade we also integrated lessons learned from our Juno upgrade.
This meant...
We did our Juno upgrade in the early evening and the feedback from our customers was that this was their peak time. For Kilo we changed our upgrade time to 2am local time. (ugh)
We also realized that we need to test major upgrades using production data from both regions. We did this and thankfully didn’t have any issues there.
The major problem with our Juno upgrade was that we had unexpected network outages when upgrading in production:
The primary reason for this was that we had dramatically more routers in our production environment than we did in dev or staging.
In dev and staging the outage was just not long enough for us to notice it, and we weren’t doing good monitoring
To address this:
We put tooling in place to spin up around 100 virtual networks and routers, with an instance behind each one, in order to give us a more realistic test environment
We also put in place high-granularity ping monitoring of those instances so we could get good metrics about what was going on during our upgrade testing.
This was really effective in letting us understand what was happening during the testing
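A minimal sketch of how high-granularity ping samples could be turned into outage metrics. This is only an illustration, not the tooling from the talk: it assumes each probe is recorded as a (timestamp, success) pair, and the function names are hypothetical.

```python
from typing import List, Tuple

# One probe result: (timestamp in seconds, did the ping succeed?)
Sample = Tuple[float, bool]


def outage_windows(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of consecutive failed pings.

    start is the timestamp of the first failed probe in a run; end is the
    timestamp of the first successful probe after it, or of the last probe
    seen if the instance never recovers.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in sorted(samples):
        if not ok and start is None:
            start = ts  # an outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # the outage ends
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


def longest_outage(samples: List[Sample]) -> float:
    """Length in seconds of the longest outage window, or 0.0 if none."""
    return max((end - begin for begin, end in outage_windows(samples)), default=0.0)
```

With probes every half second, an outage that would go unnoticed with coarse monitoring shows up as a window with a measurable length, which can then be compared between test runs.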
● We talked about how important upgrade automation is before; I just want to touch on that briefly and cover how we handle that
● All of our upgrade automation is done using Ansible to drive changes via Puppet
○ Puppet is responsible for all package management, config changes, service restarts, etc.
○ Ansible does everything else and handles all orchestration and ordering
● This is something we covered in a fair amount of depth in our Vancouver talk if you’re interested in more detail
● When doing our Kilo upgrade, we started with the Juno upgrade automation and we were able to reuse nearly all of it
● So let’s look at what our actual upgrade process looks like
● This is what our starting point looks like for our control cluster.
● We have 3 control nodes.
○ Each node hosts the services we’re going to be upgrading, plus a bunch of virtual routers.
○ They are also all part of a shared MySQL cluster and RabbitMQ cluster.
○ External users talk to these nodes via a hardware load balancer.
○ What’s not shown here is that internal traffic goes through HAProxy
● So let’s walk through the process of the actual upgrade.
○ Keep in mind that all the steps you are seeing were automated with Ansible playbooks.
● The goal here was to take two of the control nodes out of service and then upgrade the first node. Here’s how we got there.
● The first thing we do is shut down and back up the database on two of the nodes
● Next we use L3 agent failover to move all the routers from the first control node to the other two.
○ The issue we’re trying to avoid here is that when the OVS agent is started during the upgrade
■ It will drop all network flows, leading to a loss of network connectivity.
■ We’re going to talk about that more later on
○ To avoid that, we shut down the L3 agent on the first control node
■ After the L3 agents on nodes 2 and 3 detect the “failure” of the L3 agent on the first control node, they’ll start taking over those routers
○ Once all routers are moved, we disable the L3 agent on node 1 via the Neutron API so that when it comes back up during the upgrade, routers don’t move back automatically.
● This leaves us functional, not in an outage, but with a cluster of only one.
● The last thing we do before starting the API outage is get a list of all instances with floating IPs
○ We set up a small script to ping all the floating IPs and report on their status while we proceed with the upgrade
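A small status script of this kind might look roughly like the sketch below. It is an assumption-laden illustration rather than the script used in the talk: the function names are hypothetical, the list of floating IPs is assumed to have been collected already, and `ping_once` relies on the Linux `ping` flags `-c` (count) and `-W` (reply timeout in seconds).

```python
import subprocess
from typing import Callable, Dict, Iterable


def ping_once(ip: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo using the system ping (Linux flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_floating_ips(
    ips: Iterable[str], probe: Callable[[str], bool] = ping_once
) -> Dict[str, bool]:
    """Probe every floating IP once and return {ip: reachable}."""
    return {ip: probe(ip) for ip in ips}


def report(status: Dict[str, bool]) -> str:
    """Summarize one polling round as a single line of text."""
    down = sorted(ip for ip, ok in status.items() if not ok)
    up_count = len(status) - len(down)
    return f"{up_count}/{len(status)} reachable; down: {', '.join(down) or 'none'}"
```

Calling `check_floating_ips` in a loop and printing `report(...)` each round gives a running view of customer reachability during the upgrade. The `probe` parameter exists so the logic can be exercised without sending real pings.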
● Start the API outage by turning off the external load balancer
○ We ran into some issues here, but we’re going to cover that later
● Then we shut down all OpenStack services on all 3 control nodes.
○ The goal is to not have Juno services trying to make changes against a Kilo database
○ The routers continue to function because that occurs in the kernel
● Run Puppet on the first control node. It upgrades all the packages, updates config file settings and finally restarts all the services
○ We set OS_ENDPOINT_TYPE to internalURL when running Puppet so that it can talk via the internal HAProxy load balancer instead of the external endpoints that we’ve disabled
○ This also sets the Nova API compat flag so that Juno compute nodes can still talk to the Kilo control services.
● When this is complete, we run a simple smoke test via the CLI clients to verify the services have basic functionality before continuing on
● Once we’ve completed our smoke tests, we want to start getting things back to normal
● We enable the L3 agent on the Kilo control node; it will detect that the L3 agents on the other two nodes are dead.
● Once it’s given up on them, it will start plumbing out everything needed for the routers on the first control node and they’ll be moved automatically.
○ A little later we’ll talk about the gross workarounds that were needed to make this work well.
● We re-enable the load balancer. We’re out of the outage and back to a one-node cluster.
○ Length of the API outage is basically the time to move routers, install new packages and run DB migrations
● We can now relax a bit; the worst is mostly over, but we have two more control nodes to upgrade
● The next step is to get the MySQL Galera cluster back up and running.
● When we start the database on the other nodes, Galera replication will ensure the databases on those nodes are up to date.
○ No more database migrations are needed.
Then we let Puppet run through the other two nodes one by one, upgrading packages to Kilo and restarting services.
Once all nodes are upgraded we’re nearly done, except one node is hosting all the routers. We have a script that will rebalance the routers evenly across the nodes, while avoiding moving any high-profile tenants
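The rebalancing idea can be sketched as a simple greedy planner. This is not the script from the talk, only one plausible shape for it: it assumes we already know which routers each node hosts and which routers belong to tenants we never want to move, and it only plans the moves instead of calling Neutron. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_rebalance(
    hosting: Dict[str, List[str]], pinned: Set[str]
) -> List[Tuple[str, str, str]]:
    """Plan (router, source, destination) moves that even out router counts.

    hosting maps node name -> router IDs currently hosted on that node.
    Routers listed in pinned (e.g. high-profile tenants) are never moved.
    """
    counts = {node: len(routers) for node, routers in hosting.items()}
    movable = {
        node: [r for r in routers if r not in pinned]
        for node, routers in hosting.items()
    }
    moves = []
    while True:
        # Only nodes that still have a movable router can act as a source.
        sources = [node for node in counts if movable[node]]
        if not sources:
            break
        src = max(sources, key=lambda node: counts[node])
        dst = min(counts, key=lambda node: counts[node])
        # Stop once the busiest usable source is within one router of the
        # least loaded node; another move would not improve the balance.
        if counts[src] - counts[dst] <= 1:
            break
        router = movable[src].pop()
        moves.append((router, src, dst))
        counts[src] -= 1
        counts[dst] += 1
    return moves
```

A router that has been moved is not considered again, so each router moves at most once, which keeps the number of customer-visible failovers as low as the balance target allows.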
● And now we’re done with control nodes. We do a bunch more testing here, including
○ Live-migrating canary instances on compute nodes
○ Running our regression test suite
○ Checking logs, etc.
● To finish the upgrade, we need to get the compute nodes upgraded
● We live migrate all instances off of a few compute nodes and put canary instances on them
● Upgrade those nodes and do extensive testing on them
○ Live migration, volume attach/detach, etc.
● Proceed with a normal deploy
○ This causes a short outage because the OVS agent drops all flows when it’s restarted.
○ Unfortunately we can’t avoid this for Kilo
● Control and compute upgrades took less than 3 hours per region, and we did the two regions on separate nights.
● The last thing we did was merge a change to remove the API compat flag on the control nodes and deploy that as part of the next normal deploy
Overview
● As we mentioned before, a big problem in our Juno upgrade was loss of customer network connectivity during the upgrade
● We tracked this down to several causes:
○ Tunnel MAC learning flows have a default timeout of 5 minutes and require the L2 agent to be running to refresh. If your upgrade takes more than 5 minutes, they’re going to expire and you’re going to drop customer traffic.
○ On startup the OVS L2 agent flushes all flows.
■ Dropping all the flows wouldn’t be too bad, except that rebuilding them on a busy control node is *really* slow
■ Over 10-15 minutes for a complete rebuild of 2500 flows for 50-60 routers.
○ The other issue we ran into was caused by our abuse of Router HA Agent Failover beyond its design.
■ The router on the old control node would continue ARPing for the gateway, and blackholing the traffic
● Here’s how we addressed these...
Detail
● Early in the upgrade we change the OVS MAC learning flow timeouts on all compute and control nodes from the default of 5 minutes to 30 minutes.
○ The reason we do this is that we know we’re going to have Neutron down long enough during the upgrade that the 5 minute timers will expire and we’ll start dropping traffic
○ There is still the remaining issue that any *new* flows may expire before the upgrade is complete
■ We didn’t observe this being an issue in practice.
Detail
● The first workaround is to avoid ever restarting the OVS agent on a node that is actively passing traffic.
○ On control nodes you just move the routers to a box that’s not actively being upgraded
○ On compute nodes you could do live migration, but we decided not to, since rebuilding flows there is much faster due to lower density.
● We use L3 agent failover to pre-build flows when we move routers.
○ This means that the time to build those flows occurs before we have an outage, instead of during.
● Lastly, the long-term fix for this is in Liberty.
○ In Liberty, the OVS agent will tag flows with a cookie so that it can properly identify the flows in the future
○ On restart, it will synchronize the OVS flow state with what Neutron wants it to be, instead of taking the brute-force approach of rebuilding everything
Detail
● Lastly, we had to work around this issue with routers sometimes not moving properly
● After moving the routers to the new control node, we cleaned them up on the old hosting control node with the following steps:
○ Delete flows in the integration and tunnel bridges
○ Delete all the router ports
○ Delete the router namespaces
● This is absolutely a brute-force approach, but it was very effective in avoiding the ARP issue and we had very few tenants losing networking with this approach.
● So how did our testing and upgrade go?
Let’s use real-world tropical storm Kilo as a metaphor for our Kilo upgrade
It slowly meandered all over the place and it eventually died out after about 3 weeks.
The tropical storm was the 3rd longest-lasting tropical storm in recorded history
We ran into a wide variety of minor and major problems and we wish our Kilo upgrade had only lasted 3 weeks like the storm did
Even with lessons learned from Juno
Partially this was because we put more network testing in place and had to improve our tooling, and that’s a worthwhile investment
But we also ran into a lot more problems with the Kilo upgrade.
Some of that was our own fault, and some of it... was other people’s fault.
● After our upgrade in our second region we realized that cinder-volume was completely broken
○ It was really odd, because we’d done exactly the same thing in the other region and it worked with no issues
● Eventually we tracked it down to this
○ The os_region_name variable is what Nova uses to determine which region’s Cinder endpoint it should talk to.
○ If you only have one region, this doesn’t matter at all; there is only one Cinder endpoint
■ If you have multiple regions and no region is configured, the libraries pick the endpoint with the lowest UUID
■ So when Nova tried to attach a volume, it was talking to Cinder in the wrong data center!
■ So it was dumb luck we ran into this at the second region, instead of the first.
○ The problem is that os_region_name used to be in the DEFAULT section.
○ In Kilo it moved to the [cinder] section, but we didn’t catch that
● DEFAULT/os_region_name was deprecated in Juno, but we apparently ignored that when we did our upgrade
○ There was no mention of the removal of the backwards compatibility in the Kilo release notes
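A pre-upgrade check for this class of problem could be as simple as the sketch below: it flags an option that is still set in its old section but missing from the new one. This is a hypothetical helper, not something from the talk, and it assumes the config file is plain INI that Python's `configparser` can read. Note that `configparser` normally treats `[DEFAULT]` as a special section inherited by every other section, so the sketch renames the parser's default section to keep `[DEFAULT]` an ordinary one.

```python
import configparser


def option_moved_but_not_updated(
    conf_text: str, option: str, old_section: str, new_section: str
) -> bool:
    """True if option is still set in old_section and absent from new_section."""
    parser = configparser.ConfigParser(
        default_section="__unused__",  # keep [DEFAULT] from leaking into other sections
        interpolation=None,            # treat '%' in values literally
        strict=False,                  # tolerate repeated options
    )
    parser.read_string(conf_text)
    in_old = parser.has_option(old_section, option)
    in_new = parser.has_option(new_section, option)
    return in_old and not in_new


# Hypothetical nova.conf fragments; the region name is made up.
JUNO_STYLE = "[DEFAULT]\nos_region_name = region-two\n"
KILO_STYLE = "[DEFAULT]\n\n[cinder]\nos_region_name = region-two\n"
```

Here `option_moved_but_not_updated(JUNO_STYLE, "os_region_name", "DEFAULT", "cinder")` reports the stale setting, while the Kilo-style fragment passes. Running a check like this against every option listed as deprecated in the previous release's notes would catch removals that the current release notes leave out.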
● If you have more than 100-200 routers with python-neutronclient 2.3.x, you can run into this issue
○ Returns “Request URI too long”
● This is a bug that had already been fixed upstream, but Canonical packaged the version that was in the global requirements list
● The global requirements list had the Juno version of the Neutron client until August
● Attempting to downgrade the Neutron client packages to work around this is how we ended up accidentally uninstalling Nova.
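To illustrate why the number of routers matters: a client that filters a list request by many IDs puts every ID into the query string, so the URI grows with the router count until the server rejects it. One general way to stay under the limit is to split the IDs into batches, as in the sketch below. This is only an illustration of the idea and makes no claim about how the upstream fix works; the URL, parameter name and length limit are all assumptions.

```python
from typing import Iterable, Iterator, List
from urllib.parse import urlencode


def build_uri(base_url: str, param: str, ids: List[str]) -> str:
    """Build a list request URI that filters on every ID in ids."""
    return base_url + "?" + urlencode([(param, item) for item in ids])


def chunk_ids_for_uri(
    base_url: str, param: str, ids: Iterable[str], max_uri_len: int = 8000
) -> Iterator[List[str]]:
    """Yield batches of IDs whose request URI stays within max_uri_len.

    A single ID that is too long on its own is still yielded by itself.
    """
    batch: List[str] = []
    for item in ids:
        candidate = batch + [item]
        if batch and len(build_uri(base_url, param, candidate)) > max_uri_len:
            yield batch
            batch = [item]
        else:
            batch = candidate
    if batch:
        yield batch
```

Each batch then becomes one request, and the results are concatenated on the client side.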
● So with the Kilo upgrade, you need to migrate flavor data after the upgrade to get things into the new way of storing that data.
● Once Nova is brought up, it starts lazily migrating this data as flavors are accessed
● Shortly after the upgrade in a shared dev environment we *accidentally* uninstalled Nova on all nodes
● We ended up with flavor data that was partially migrated because of this, and that caused Nova to crash on startup.
● We spent hours tracking this down and eventually had to fix it by hand by editing the database entries.
● After this we changed our automation to migrate the flavor data immediately after doing the upgrade, and before we brought API services back online
● In Kilo, Neutron added a new option ‘allow_automatic_dhcp_failover’
○ This provides the ability to have DHCP server health checked regularly, and if one fails, it is automatically spun up on another DHCP agent.
● Unfortunately, it detects spurious failures pretty regularly; for us, multiple times a day
● Unfortunately, when it does fail over, it hits another bug a good percentage of the time that causes the DHCP Neutron ports to get stuck in creating status
○ So in effect this was killing good DHCP servers instead of recovering bad ones
● We don’t even need this feature; we run three control nodes, and two DHCP agents per network
● However, it defaults to on, so for about a week after our upgrade we’d have tenants dropping offline because their DHCP server hit this combination of bugs and was dead until we manually cleaned things up
● There was no mention of this feature in the release notes.
● Part of how we discovered that this feature existed and was buggy was by looking at the DHCP code changes on the master branch for Neutron and comparing them to the Kilo branch
○ We realized this feature had a lot of bugs when we found lots of fixes for it on the master branch.
○ Of the half dozen fixes, only one or two of them were backported.
● We ended up just turning this off
● As I implied before, we ran into issues with validating services while the external endpoints were offline
● Normally the CLI clients get a list of service endpoints from Keystone and default to the public one
○ By setting the OS_ENDPOINT_TYPE environment variable or passing the same thing in via a command-line option, you can override this and tell them to use the internalURL, which for us is separate and based on HAProxy
● The issue is that some of the CLI clients, including Neutron and Cinder, were broken and would ignore both of these.
● This broke our Puppet runs during the upgrade and it broke our smoke test scripts
● Unfortunately, because we found this issue very late in the process, we ended up deciding to just leave the external LB up for our production upgrades.
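The endpoint selection that the clients are supposed to perform can be sketched as below. This is a simplified model rather than real client code: it assumes a Keystone v2-style service catalog, where each endpoint entry carries `publicURL` and `internalURL` keys, and the function name and URLs are hypothetical.

```python
import os
from typing import Dict, List, Mapping, Optional


def pick_endpoint(
    catalog: List[Dict],
    service_type: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the endpoint URL for service_type, honoring OS_ENDPOINT_TYPE.

    Falls back to the public endpoint when the variable is not set.
    """
    env = os.environ if env is None else env
    wanted = env.get("OS_ENDPOINT_TYPE", "publicURL")
    for service in catalog:
        if service["type"] != service_type:
            continue
        for endpoint in service["endpoints"]:
            if wanted in endpoint:
                return endpoint[wanted]
    raise LookupError(f"no {wanted} endpoint for service type {service_type!r}")


# Hypothetical catalog with separate public and internal endpoints.
CATALOG = [
    {
        "type": "network",
        "endpoints": [
            {
                "publicURL": "https://api.example.com:9696",
                "internalURL": "http://10.0.0.10:9696",
            }
        ],
    }
]
```

A client that ignores the override behaves as if `wanted` were always `publicURL`, which is why tooling that depends on it fails as soon as the public endpoints are taken offline.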
● We also ran into schema problems with Glance.
● In Kilo, Nova started using the V2 Glance API
● The V2 API does schema validation, but the V1 API doesn’t really
○ So it was possible to create images via the V1 API with attributes that the V2 API considered invalid.
○ Like description being NULL instead of an empty string
○ When that happened, Nova couldn’t do anything with the image, because it would fail schema validation via the V2 API
● There was no way to tell Nova to use the V1 API instead
● Flavio from the Glance team helped us get this fixed very quickly
● Canonical backported it quickly
● We ran into a similar issue with Glance but in the schema file instead of in Glance code
● The attributes this time were kernel_id and ramdisk_id
● We changed the schema file to allow these fields to be nullable
● This has been fixed upstream in the same way.
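The difference between a strict and a nullable schema can be shown with a tiny stand-in validator. This is not Glance's validation code; it only mimics the JSON Schema convention that a property's `type` may be a single name or a list of allowed names, and the schemas and image record below are made up for illustration.

```python
from typing import Dict, List

# Checks for the two JSON Schema type names this sketch understands.
TYPE_CHECKS = {
    "string": lambda value: isinstance(value, str),
    "null": lambda value: value is None,
}


def invalid_properties(image: Dict, schema: Dict) -> List[str]:
    """Return the names of image properties that violate the schema types."""
    bad = []
    for name, rule in schema["properties"].items():
        if name not in image:
            continue  # absent properties are not checked in this sketch
        allowed = rule["type"]
        if isinstance(allowed, str):
            allowed = [allowed]
        if not any(TYPE_CHECKS[type_name](image[name]) for type_name in allowed):
            bad.append(name)
    return bad


STRICT_SCHEMA = {
    "properties": {
        "kernel_id": {"type": "string"},
        "ramdisk_id": {"type": "string"},
    }
}
NULLABLE_SCHEMA = {
    "properties": {
        "kernel_id": {"type": ["null", "string"]},
        "ramdisk_id": {"type": ["null", "string"]},
    }
}
# An image as it might have been stored without validation.
LEGACY_IMAGE = {"name": "legacy-image", "kernel_id": None, "ramdisk_id": None}
```

Against `STRICT_SCHEMA` the legacy image fails on both fields; against `NULLABLE_SCHEMA` it passes, which is the effect of allowing the fields to be nullable.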
● When doing the first upgrade in our shared dev environment, we ran into a problem with Nova migrations
● MySQL was failing to run a migration to convert a column from NULL to NOT NULL
● It was failing because MySQL 5.6 has a bug that prevents converting a column to NOT NULL if it has a foreign key constraint
● This didn’t happen in all of our environments, and if we did a mysqldump and restore, the problem went away
● We opened a support case with Percona, waited for them to track it down and got a new build from them that resolved the issue.
● If you see a problem like this when running DB migrations, your problem is probably due to existing database tables not matching the default database sort order, or collation.
What happened for us is that we had some databases using utf8_unicode_ci and the upstream Puppet modules changed the default database collation to utf8_general_ci
That means newly created tables had a different sort order than the existing ones and when adding foreign keys between an old and new table, MySQL would refuse to add them
This could happen for any database in theory, for any migration that changes foreign keys.
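A quick way to spot this before a migration is to list each table's collation and look for a mix. The sketch below assumes the rows come from MySQL's `information_schema.TABLES` view (the query text is included for reference, but running it is left to whatever database client you use); the helper names and the sample table names are hypothetical. Individual columns can also carry their own collation, which this table-level check does not cover.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Parameterized query; the single parameter is the database (schema) name.
TABLE_COLLATION_QUERY = (
    "SELECT TABLE_NAME, TABLE_COLLATION "
    "FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s"
)


def tables_by_collation(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group table names by collation from (table_name, collation) rows."""
    grouped = defaultdict(list)
    for table, collation in rows:
        grouped[collation].append(table)
    return {collation: sorted(tables) for collation, tables in grouped.items()}


def has_mixed_collations(rows: Iterable[Tuple[str, str]]) -> bool:
    """True if the database contains tables with more than one collation."""
    return len(tables_by_collation(rows)) > 1
```

If `has_mixed_collations` returns true for a service database, a migration that adds a foreign key between an older table and a newly created one is a likely place for the failure described above.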
● The Keystone middleware that all projects use for token validation was moved into a separate package in Juno, but Juno still supported the old library names.
In Kilo the old names were removed, but this wasn’t mentioned in the Kilo release notes.
The control nodes we had that were upgraded from Icehouse still had the old value
This was an easy fix once we found it.
Issues like this are particularly hard to find, since oslo.config’s normal deprecation mechanisms can’t cover this scenario
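Because no deprecation warning fires for a stale module path, a plain text scan of the config files is one way to catch it. The sketch below assumes the old value is the `keystoneclient.middleware.auth_token` module path and the new one is `keystonemiddleware.auth_token`, which is the usual form of this rename; the talk does not spell out the exact strings, so treat both constants and the sample paste config as assumptions.

```python
from typing import List, Tuple

OLD_MODULE = "keystoneclient.middleware.auth_token"  # assumed pre-Juno path
NEW_MODULE = "keystonemiddleware.auth_token"         # assumed current path


def find_old_middleware(config_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for uncommented references to OLD_MODULE."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if OLD_MODULE in line and not line.lstrip().startswith("#"):
            hits.append((number, line.strip()))
    return hits


def suggest_fix(config_text: str) -> str:
    """Return the config text with the old module path replaced."""
    return config_text.replace(OLD_MODULE, NEW_MODULE)


# Hypothetical paste config fragment carried over from an older release.
SAMPLE_PASTE = (
    "[filter:authtoken]\n"
    "paste.filter_factory = keystoneclient.middleware.auth_token:filter_factory\n"
)
```

Running `find_old_middleware` over every service's paste and config files before the upgrade gives a short list of lines to fix, instead of a service that fails to start afterwards.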
● Last but not least, we found this problem after completing our first prod upgrade and turning API services back on
New feature in Nova scheduler called “scheduler_tracks_instance_changes”.
This can track instance state to allow scheduler filters to make more informed decisions.
This is the commit message for the new feature
On startup the scheduler polls all compute nodes for instance state in batches of 10 at a time
Our experience was that this meant that nova-scheduler was chewing up 100% of a core until this was done and it took forever to finish
RabbitMQ would get disconnected, we believe because heartbeats were failing due to the thread not being scheduled
Even after turning off heartbeats, we still saw instances not being scheduled while this was enabled
We don’t use any scheduler filters, so we didn’t need it and turned it off
Only vague notions of this in the release notes, and we didn’t understand what was going on until we found this commit message.
The DocImpact tag definitely didn’t translate to release note updates in this case.
● After all those issues, this is about how we felt by the time we were done with our prod Kilo upgrades
● If you haven’t seen Groundhog Day, you should, it’s literally a classic.
● So a number of these problems we ran into are because we didn’t pay attention to deprecations in Juno, and when those features were removed in Kilo, we didn’t know because we only read the Kilo release notes for our Kilo upgrade, not the Juno release notes.
● MySQL has bugs, and we’re good at finding them with OpenStack upgrades. Yay?
● Part of the reason we upgrade is that we want new features (and bug fixes), but at least two of the problems we had were because new features were on by default, and they were buggy.
● Buggy services are one thing, but in both cases, there was no real documentation around these features.
○ One of them wasn’t mentioned in the release notes at all, and the other had no detail about what it did
● And to give credit where credit is due, some projects are really good at release notes.
○ The Cinder Kilo release notes were widely credited as being good at the Operators Mid-Cycle meetup
○ Looking through the Liberty release notes, the Nova section is really, really good. It would be nice if everyone followed their example.
● So with that litany of horrible issues, you may be wondering if we thought upgrading was worthwhile:
● After resolving these issues, overall stability has improved
● AMQP heartbeats have increased stability dramatically for us.
○ This has cleared up a lot of intermittent issues for us, and also allowed us to put RabbitMQ behind a load balancer.
○ We wanted to put Rabbit behind a load balancer because we’re in the process of moving our OpenStack environments to a new network architecture, and this helps us quiesce RabbitMQ before taking it offline.
● To wrap up, let’s talk about our next upgrade
● We’ve started some work on moving to Liberty already
○ We’re on master for all of the Puppet modules now (except Keystone)
○ We don’t know what the timing for our Liberty upgrade will be yet, but I’ll be surprised if it’s not before Austin
● We’ve learned that no matter what, we’re going to run into weird problems.
○ For example, we ran into MySQL bugs in both Juno and Kilo upgrades, so apparently we should just assume that will happen and add another two weeks to get that fixed...
● We’re going to continue moving services into containers. We’ve got Heat and Designate in containers now, and it’s allowed us to upgrade them (or not) independently of other services.
○ This will allow us to avoid having to deal with conflicting dependencies between services
○ It also allows us to stage the new version of a service before the upgrade. Right now a lot of our upgrade time is actually installing packages.
● As we’ve mentioned before, a lot of the complexity in our upgrades has to do with the fact that upgrading the OVS agent causes it to drop all active flows.
○ We’re really looking forward to deleting a bunch of code, assuming this works in Liberty (it’s on by default)
● Lastly, we’re hoping to move to using HA routers once we’re on Liberty, and with that in place we hope to avoid moving any routers around during the upgrades
○ Hopefully that will help with our Mitaka upgrade
● That’s all we’ve got; we appreciate everyone coming
● Hopefully we have some time for questions

More Related Content

Viewers also liked

Decoding Japanese part 1 draft ver1.
Decoding Japanese part 1 draft ver1.Decoding Japanese part 1 draft ver1.
Decoding Japanese part 1 draft ver1.Robert Kasza
 
Gestor de proyectos docent tic sicpe g8 d
Gestor de proyectos docent tic sicpe g8 dGestor de proyectos docent tic sicpe g8 d
Gestor de proyectos docent tic sicpe g8 dgustavo aldana
 
deel 1: Hoe word je een goede timemanager met outlook
deel 1: Hoe word je een goede timemanager met outlookdeel 1: Hoe word je een goede timemanager met outlook
deel 1: Hoe word je een goede timemanager met outlookAnn Deraedt
 
Music & Spirituality in Bali
Music & Spirituality in BaliMusic & Spirituality in Bali
Music & Spirituality in BaliRebekah Moore
 
Noroff MCSE 2003 Vitnemål
Noroff MCSE 2003 VitnemålNoroff MCSE 2003 Vitnemål
Noroff MCSE 2003 VitnemålRoger Fredheim
 
Naturoromandie.ch - episode #1 - La cure d'automne par Arkopharma
Naturoromandie.ch - episode #1 - La cure d'automne par ArkopharmaNaturoromandie.ch - episode #1 - La cure d'automne par Arkopharma
Naturoromandie.ch - episode #1 - La cure d'automne par ArkopharmaJulien Henzelin
 

Viewers also liked (8)

Decoding Japanese part 1 draft ver1.
Decoding Japanese part 1 draft ver1.Decoding Japanese part 1 draft ver1.
Decoding Japanese part 1 draft ver1.
 
Gestor de proyectos docent tic sicpe g8 d
Gestor de proyectos docent tic sicpe g8 dGestor de proyectos docent tic sicpe g8 d
Gestor de proyectos docent tic sicpe g8 d
 
National Workers Union Press Release - september 16, 2013, Job Josses
National Workers Union Press Release - september 16, 2013, Job JossesNational Workers Union Press Release - september 16, 2013, Job Josses
National Workers Union Press Release - september 16, 2013, Job Josses
 
deel 1: Hoe word je een goede timemanager met outlook
deel 1: Hoe word je een goede timemanager met outlookdeel 1: Hoe word je een goede timemanager met outlook
deel 1: Hoe word je een goede timemanager met outlook
 
Music & Spirituality in Bali
Music & Spirituality in BaliMusic & Spirituality in Bali
Music & Spirituality in Bali
 
Noroff MCSE 2003 Vitnemål
Noroff MCSE 2003 VitnemålNoroff MCSE 2003 Vitnemål
Noroff MCSE 2003 Vitnemål
 
2016 CIO Outlook
2016 CIO Outlook2016 CIO Outlook
2016 CIO Outlook
 
Naturoromandie.ch - episode #1 - La cure d'automne par Arkopharma
Naturoromandie.ch - episode #1 - La cure d'automne par ArkopharmaNaturoromandie.ch - episode #1 - La cure d'automne par Arkopharma
Naturoromandie.ch - episode #1 - La cure d'automne par Arkopharma
 

Similar to Upgrade OpenStack Services with Zero Downtime

201210611 danish delegation
201210611 danish delegation201210611 danish delegation
201210611 danish delegationMartijn Kriens
 
Eastside incubator - Startup in Seattle
Eastside incubator - Startup in SeattleEastside incubator - Startup in Seattle
Eastside incubator - Startup in SeattleBryan Starbuck
 
Using iMac Built-in Screen Sharing
Using iMac Built-in Screen SharingUsing iMac Built-in Screen Sharing
Using iMac Built-in Screen SharingHock Leng PUAH
 
Top 10 Agile Gotchas, Problems and Challenges + What you can do about them
Top 10 Agile Gotchas, Problems and Challenges + What you can do about themTop 10 Agile Gotchas, Problems and Challenges + What you can do about them
Top 10 Agile Gotchas, Problems and Challenges + What you can do about themMichael Sahota
 
MariaDB Galera Cluster presentation
MariaDB Galera Cluster presentationMariaDB Galera Cluster presentation
MariaDB Galera Cluster presentationFrancisco Gonçalves
 
HA Solutions for MySQL and MariaDB
HA Solutions for MySQL and MariaDBHA Solutions for MySQL and MariaDB
HA Solutions for MySQL and MariaDBjoffrey92
 
[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)
[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)
[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)opencontentslab
 

Similar to Upgrade OpenStack Services with Zero Downtime (7)

201210611 danish delegation
201210611 danish delegation201210611 danish delegation
201210611 danish delegation
 
Eastside incubator - Startup in Seattle
Eastside incubator - Startup in SeattleEastside incubator - Startup in Seattle
Eastside incubator - Startup in Seattle
 
Using iMac Built-in Screen Sharing
Using iMac Built-in Screen SharingUsing iMac Built-in Screen Sharing
Using iMac Built-in Screen Sharing
 
Top 10 Agile Gotchas, Problems and Challenges + What you can do about them
Top 10 Agile Gotchas, Problems and Challenges + What you can do about themTop 10 Agile Gotchas, Problems and Challenges + What you can do about them
Top 10 Agile Gotchas, Problems and Challenges + What you can do about them
 
MariaDB Galera Cluster presentation
MariaDB Galera Cluster presentationMariaDB Galera Cluster presentation
MariaDB Galera Cluster presentation
 
HA Solutions for MySQL and MariaDB
HA Solutions for MySQL and MariaDBHA Solutions for MySQL and MariaDB
HA Solutions for MySQL and MariaDB
 
[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)
[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)
[오픈콘텐츠랩/김용석님] 프레젠테이션을 준비하는 스타트업들에게 (12/23 공개강의자료)
 

Upgrade OpenStack Services with Zero Downtime