Dave + Manasi
1 minute
My name is Manasi Prabhavalkar. Fresh out of college after completing my Master's degree at NC State, I was fortunate enough to land the most interesting job. For the past 2 years I have been a Systems Architect for OpenStack in the Engineering Shared Infrastructure Services organization at NetApp.
Today we are here to share our experience as an Engineering org moving into the rapidly evolving world of OpenStack.
Dave
1 minute
For each hexagon, a few bullet points should guide the conversation:
Overall – this presentation represents a 1.5-year journey
Intro – introduce our mission statement, what we do, how we started with OpenStack
Automation – absolutely needed, how we were able to leverage Puppet for deployments
Upgrades – how we were able to take said automation and do upgrades to Kilo locally
Globalizing – Let's adapt to deploying OpenStack globally, since we're a global team and company
Global Upgrades – best of both worlds; let's apply what we've learned locally across our entire team and seamlessly upgrade to OpenStack Liberty
Next steps – where we’re going next, what keeps us excited throughout the day
Today we are going to share the NetApp Engineering success story and our learnings.
In this session we are going to take you through our journey of implementing OpenStack in NetApp Engineering over the past year and a half, and some of the milestones we achieved along the way.
It is a story of our engineering org adopting OpenStack as a small part of our internal private cloud, and making it a huge success.
We are going to talk about who we are and what we do as an organization, and how we decided to embrace OpenStack back in Sept 2014.
And how our journey just got interesting after that.
How we made our way through automating deployments and automating upgrades with Puppet, and went on to globalize OpenStack at our 3 major sites. The high point of our journey was implementing 2 live upgrades in a production environment in a span of just 5 months.
So stay tuned
Dave
3 minutes
6GB Mem
IaaS
Puppet comfort
Today our Global Engineering Cloud, or GEC as we call it, is a self-service cloud portal with 3 different hypervisors under its belt: VMware, Hyper-V, and KVM on OpenStack.
Why OpenStack?
NetApp made a strategic decision to embrace OpenStack, we are Customer Zero
NetApp has been involved with OpenStack since 2011, both from a development perspective (Folsom release) and from an internal deployment perspective
Needed to reduce Hypervisor licensing costs
Increase breadth of NetApp QA testing
Match customer expectations and deployments
Scalable Multi-Region Design
15 compute nodes in each region (1000 VM per region)
Ceilometer in each region for performance
Secure Multi-Tenancy (71 SVMs, GEC service-based tenancy model, build-environment-as-a-service)
Modular Scale as you Grow Architecture
So now let's talk about how we got there.
Manasi
3 minutes
Explain a region
Region arch
Scale model
HA features
Stats
We talked about the highly available Keystone service and Horizon, and we also had a highly available GaleraDB cluster hosting the shared databases, which mainly included Keystone, Cinder, and Glance. That left the single controller node, and to address that concern we decided to go with a region architecture. So we stamped out a region with one controller, its own native DB and MongoDB, and 15 compute nodes.
This allowed us to scale horizontally by adding new regions, all sharing the same Keystone and Horizon services in what we called Region Zero.
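To make the shared-Keystone design concrete, here is a minimal client-side sketch using openstacksdk; the auth URL, credentials, and region names are hypothetical placeholders, not our actual endpoints. Every region authenticates against the one Keystone in Region Zero, and a client simply picks its region:

```python
import openstack

# Hypothetical credentials and endpoint; the single Keystone in Region Zero
# authenticates requests for every region in the deployment.
conn = openstack.connect(
    auth_url="https://keystone.gec.example.com:5000/v3",  # shared Keystone
    project_name="demo-project",
    username="demo-user",
    password="...",  # elided
    user_domain_name="Default",
    project_domain_name="Default",
    region_name="region-1",  # switch regions by changing only this value
)

# The same token then works against that region's Nova/Neutron endpoints.
for server in conn.compute.servers():
    print(server.name)
```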
Each region serves as its own OpenStack deployment with its own Nova and Neutron services. The image store and the Cinder share are backed by a NetApp NFS backend and are shared across all the regions. So each region hosts all of the OpenStack services; however, Glance and Cinder in each region talk to the same shared DB in Region Zero.
The motivation behind this architecture was scalability.
Starting off, each region had a /22 CIDR and so a VM capacity of roughly 1000 (a quick sketch of that arithmetic follows below).
This gave us a scaling model, helping us grow by 1000 VMs every time we add a new region.
We also expected our Region Zero to handle up to 10 regions, after which we would consider adding a new node to the shared region.
Each region has its own Neutron and Nova DB and is backed by a NetApp NFS store for instances.
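As a quick sanity check on the 1000-VM figure, here is a minimal sketch of the /22 arithmetic, assuming a hypothetical region CIDR (not our actual addressing):

```python
import ipaddress

# Hypothetical region CIDR: every region is allocated a /22.
region = ipaddress.ip_network("10.10.0.0/22")

# A /22 holds 1024 addresses; subtracting network, broadcast, and a handful
# of infrastructure reservations leaves roughly 1000 usable for VMs.
usable = region.num_addresses - 2
print(region.num_addresses, usable)  # 1024 1022
```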
This region strategy helped us keep the OpenStack architecture as close to our VMware and Hyper-V architectures as possible.
One OpenStack region was analogous to a VMware/Hyper-V cluster of 15 compute nodes. We wanted to keep it as familiar as possible, so we did not change too many things at the same time. This helped the operations team be more comfortable adopting KVM on OpenStack as a new addition to GEC.
Now even if a region fails, OpenStack requests coming into GEC can be routed to any of the other regions, keeping the service highly available for our customers.
The architecture phase was the most important milestone of our journey. We came up with a highly available, modular, and easily scalable architecture that also set the stage for defining our live-upgrade strategy.
Today we have OpenStack globally at 4 sites, with 10 regions and 160 compute nodes, giving us a VM capacity of ~7500, which is about 10% of the total GEC capacity.
4 sites
10 regions
160 Compute
~7500 6GB VM capacity
GEC Total Capacity: 70K
Manasi
2 minutes
Puppet + FlexPod
Puppet automation takes over
Puppet roles
Big picture
Deployed Juno in 90 minutes
Once the storage is ready and the node is prepped, we feed it to our Puppet master. The Puppet master holds all the code necessary to spin up a new production-ready OpenStack environment.
The Puppet master assigns the node a suitable role in our architecture and then configures it for us.
Our architecture allows for 8 different roles: web (Horizon), database (regional DB), MongoDB (for Ceilometer), Keystone, GaleraDB (for the shared DB), compute, controller, and LB. (A hypothetical sketch of role assignment follows.)
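As an illustration of how nodes can be mapped to these roles, here is a minimal sketch of a Puppet external node classifier (ENC) written in Python. The hostname tokens and role class names are hypothetical stand-ins, not our actual manifests:

```python
#!/usr/bin/env python3
"""Hypothetical Puppet ENC: maps a node's hostname to one of the 8 roles."""
import sys
import yaml  # PyYAML; Puppet expects the ENC to emit YAML on stdout

# Illustrative hostname tokens -> role classes (not our real naming scheme).
# "mongo" and "galera" are checked before "db" so they match first.
ROLE_PATTERNS = {
    "web": "role::web",            # Horizon
    "mongo": "role::mongodb",      # Ceilometer backend
    "galera": "role::galeradb",    # shared DB cluster
    "db": "role::database",        # regional DB
    "keystone": "role::keystone",
    "compute": "role::compute",
    "ctrl": "role::controller",
    "lb": "role::loadbalancer",
}

def classify(hostname: str) -> dict:
    for token, role_class in ROLE_PATTERNS.items():
        if token in hostname:
            return {"classes": [role_class]}
    return {"classes": []}  # unknown nodes get no role

if __name__ == "__main__":
    # Puppet invokes the ENC with the node name as the first argument.
    print(yaml.safe_dump(classify(sys.argv[1])))
```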
By this time we were on the Juno release of OpenStack in production, with a Region Zero and 3 deployment regions. It took Puppet just 90 minutes to spin up the entire environment, ready to deliver a VM capacity of 3000 instances.
Basically, our automation strategy is: configure the storage and let Puppet handle the rest.
Who thought automating OpenStack deployments could be this easy?
Manasi
4 minutes
Make sure hardware upgraded
Juno to Kilo upgrade
Define strategy
Segmented upgrades
Explain each segment
Time to upgrade
No end user disruptions
After successfully automating OpenStack deployment with Puppet, we decided to automate live upgrades too.
Now it was time to upgrade our environment from Juno to Kilo.
We wanted an upgrade strategy that was repeatable and automated. We also wanted it to be non-disruptive, so existing production VMs would keep working during the upgrade. Our modular architecture made this easier to accomplish.
Now let me take you guys through our live-upgrade strategy.
We started off by upgrading the Keystone nodes first, as Keystone is shared across all of the regions in our environment.
We did the upgrade serially to maintain service continuity. All the regions continued to work with the upgraded Keystone because of backwards compatibility.
Then we moved on to the web nodes and upgraded them serially too, to maintain service continuity. Users in our environment use the GEC portal as the user interface, so the web upgrade was non-disruptive.
Next came the controllers. We upgraded each controller serially across regions. Existing VMs continued to work during the upgrade; however, a region could not service new requests during the 5-minute Puppet run. Our strategy was to toggle a region off in the GEC portal to stop any new deployments to it, upgrade the region controller, and once the upgrade succeeded, toggle the region back on in the portal.
After the controllers we moved on to the compute nodes. We took the first compute node in each region, live-migrated all of its VMs to other nodes in that region, and then upgraded the now-empty node. We upgraded all compute nodes serially within a region but in parallel across regions (see the sketch below).
Our Puppet process took approximately 5 minutes to upgrade each node.
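To make the serial-within-a-region, parallel-across-regions pattern concrete, here is a minimal orchestration sketch. The inventory and the evacuate/puppet_upgrade helpers are hypothetical placeholders for the live-migration step and the ~5 minute Puppet run:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inventory: region name -> ordered list of compute nodes.
REGIONS = {
    "region-1": ["r1-compute-01", "r1-compute-02", "r1-compute-03"],
    "region-2": ["r2-compute-01", "r2-compute-02", "r2-compute-03"],
}

def evacuate(node: str) -> None:
    # Placeholder: live-migrate every VM off `node` to peers in its region.
    print(f"evacuating {node}")

def puppet_upgrade(node: str) -> None:
    # Placeholder: the ~5 minute Puppet run that upgrades the empty node.
    print(f"upgrading {node}")

def upgrade_region(region: str, nodes: list[str]) -> None:
    # Serial within a region: only one node is empty and upgrading at a time.
    for node in nodes:
        evacuate(node)
        puppet_upgrade(node)
    print(f"{region} upgraded")

# Parallel across regions: each region walks its own node list concurrently.
with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
    futures = [pool.submit(upgrade_region, r, n) for r, n in REGIONS.items()]
    for f in futures:
        f.result()  # propagate any failures
```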
Manasi
2 minutes
Global upgrade
Time for Kilo to Liberty
Prev experience/lessons learnt
4 env to upgrade, firmware first
Smallest -> largest stats
Span of a week
Upgrade roadmap
More capacity planned
When we rolled out OpenStack globally we were on the Kilo release. At the start of 2016, it was time for the next live upgrade, to Liberty.
This time we had a successfully tested upgrade strategy, but 4 different sites to upgrade compared to the single site we had before.
This is where the previous upgrade experience, the Puppet automation, and the training sessions for operations paid off. This upgrade was much smoother thanks to the lessons learned, as well as the maturity of OpenStack.
This time the local operations teams ran the show, with us acting as mere advisors. We were more confident and prepared, making this the most successful global upgrade to date.
In the span of a week, we had upgraded all of our OpenStack environments globally to Liberty.
The largest upgrade was at RTP, with 900 active VMs and 86 nodes; it took well under 4 hours to accomplish, thanks to our operations team.
Upgrade roadmap –
As soon as a new release candidate is launched, we bring it into our dev environment and update our Puppet automation for the next upgrade.
When the release goes GA, we start testing live upgrades in dev and also test it against our GEC portal for stability.
After a week of rigorous testing in dev following GA, we move it to the staging environment for 2 weeks.
When we are satisfied with the results, we schedule a global upgrade in production, about 6 weeks after the GA release.
Site 1: 5 active VMs
Site 2: 30 active VMs
Site 3: 50 active VMs
Site 4: 900 active VMs
Dave (Lessons Learned)
Manasi (Advice for you)
4 minutes
LM Ours = 5
Manasi
Other projects and plans
Trove = Oracle and MongoDB primarily
1 minute
Dave
Manila
Rest
1 minute