Learn how Walmart uses node groups and facts to partition our fleet, which currently includes over 87,000 nodes running Linux and Windows, across various business segments. We will also discuss our strategies for performing rollouts at varying speeds, which includes how we use the classifier, how we use environments, and how we use the feature flag design pattern.
2. Legal Disclaimer
Any reference in this presentation to any specific commercial product, process, or service, or the use of any trade, firm or corporation name is for informational purposes only and does not constitute an endorsement or recommendation by Wal-Mart Stores, Inc.
8. Flexible classification + differentiation
The art of balancing new changes and keeping systems stable
Determine how you differentiate nodes
Define your own node hierarchy
Determine how to use environments
Keep your classification DRY
10. Flexible classification + differentiation
The art of balancing new changes and keeping systems stable
[Diagram: the three classification trees]
Environments: Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Puppet infrastructure: Puppet Enterprise (PE agent, PE master, PE MCollective, PE PuppetDB, PE console)
Business classifications: Lservers and Wservers; under Lservers: Lservers-store, Lservers-DC, Lservers-HO; under Lservers-store: Lservers-store-(ISP, WSP, NAS)
Temp node group trees: Lservers-store-ISP-Pilot, Lservers-store-ISP-Cert, Lservers-store-ISP-Dev
12. Rollouts
Controlling the pace of change
[Diagram: the same hierarchy, viewed as rollout levers]
Environments: Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Business classifications: Lservers and Wservers; under Lservers: Lservers-store, Lservers-DC, Lservers-HO; under Lservers-store: Lservers-store-(ISP, WSP, NAS)
Temp node group trees: Lservers-store-ISP-Pilot, Lservers-store-ISP-Cert, Lservers-store-ISP-Dev
13. Classification bonus side-effects
We can use the Classifier to predict changes.
Our classification hierarchy is living documentation for how we see our infrastructure.
The Classifier helps us to think about changes abstractly (the config I want exists, how do I get a box to get it?)
15. Environmental promotion
The road to production
Environments (promotion path): Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Ideally: full unit test coverage and full acceptance test coverage.
Generally: make Test as close as possible to production, and have a few steps to control populations for risky stuff.
16. Temporary node groups
● The new class or feature uses the “scaffolding”
● Being successful in successively larger populations builds confidence
● Failing in smaller populations contains damage
● In the most extreme cases, we can unclassify a class completely, and basically re-deploy it from scratch
“It’s ready for production…just not ALL of production.”
17. Feature flags workflow
“Do I do the risky thing or not?”
# Additional configuration for PE PuppetDB load-balanced nodes
class profile::pe::puppetdb (
  $shared_cert_name = 'puppetdb-shared',
  $command_threads  = 4,
  $store_usage      = undef,
  $temp_usage       = undef,
  $memory_usage     = undef,
  Enum['absent', 'present'] $puppetdb_gc_cron_ensure = 'absent',
  Boolean $use_lb_cert = true,
) {
  # Note: setting this to false also requires setting hiera keys to nodename
  # for puppet_enterprise::profile::puppetdb::certname, as these default to
  # the LB cert in scale test and production
  if $use_lb_cert {
    class { '::puppetdb_shared_cert::puppetdb':
      certname => $shared_cert_name,
      before   => Puppet_enterprise::Certs['pe-puppetdb'],
    }
  }
}
19. Key Takeaways
Take the first step, then learn and grow.
Controlling the pace of change can help build credibility.
Making something easier to build makes it easier to remove.
The rate of change seems to only be increasing.
The wayback machine…
I still own this shirt, and the PuppetConf 2014 t-shirt I was wearing beneath it.
We’ve learned a lot since then, though…
These are summaries, but they give an idea of what kind of scale we’re talking about.
These are not just data center nodes – these nodes are distributed in our stores and distribution centers all over the world.
One change could potentially affect nearly all of them.
When we first started doing Puppet, we had tons of divergence.
Our goal was to reduce the divergence as much as possible – duplication results in extra costs, because more choices means more troubleshooting paths.
Our practices had diverged between the different lines of business we served, between the platforms we serviced, and our preferred approaches to problem solving. We were looking for a way to bring a lot of those practices together, to consolidate as much as we could.
Are we all identical everywhere? No, not by a long shot, but we’re a lot more similar than we were when we started.
Some of our challenges are obvious – some are less obvious.
Managing a distributed fleet this size comes with many problems – we’re always refreshing something, somewhere. We’re always introducing something new, somewhere.
When someone decides we need a new server image, we need to deploy them quickly.
Against this backdrop, no one else is standing still, either. Applications are being updated constantly, and we have ongoing compliance and security posture requirements.
On top of all of this, a lot of our deployed footprint is not in data centers, so it is subject to bad power, city power problems, and all kinds of natural disasters.
All that diversity increases costs, because there’s more to troubleshoot; more possibilities take more time. But our infrastructure doesn’t get to write a blank check.
Ongoing state management is critical.
This is a picture from the aftermath of Hurricane Sandy.
Most of our compute footprint runs in the distributed sites that it services.
Those sites can be in the path of natural disasters.
Natural disasters mean unpredictable outages.
It really helps to be able to express changes abstractly, and not to have to worry about when sites come up or whether they miss a check-in or two.
Our customers depend on us to be up and running as soon as possible after all kinds of situations, of which recent natural disasters are only one example.
Prep for blizzards etc. are also a thing.
Determine how you differentiate nodes
Define your own node hierarchy
Determine how to use environments
Keep your classification DRY
========
Use facts that are mutually exclusive at the top level, for the groups that matter to you.
Feel free to invent your own facts, or synthesize other facts to make these decisions.
Maybe you see Linux and Windows servers in the same tree? Maybe you don’t? The decision is entirely up to you.
Determine how you’re going to use environments. Are you going to have standing non-prod environments? Are they going to be dynamic?
DRY is Don’t Repeat Yourself. It’s easy to create a structure where you have to repeat a lot of things; this will create a lot more work for you. Try to factor out things that are common and move those up the hierarchy. As we will see, this will give us opportunities to use that hierarchy to make changes.
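To make the first point concrete, here is a minimal sketch of fact-based differentiation, with an invented “business_segment” fact and hypothetical profile names (the actual facts and profiles Walmart uses are not shown in this talk):

# Hypothetical sketch: 'business_segment' stands in for whatever
# mutually exclusive custom fact you synthesize for your fleet.
class profile::base {
  case $facts['business_segment'] {
    'store':               { include profile::base::store }
    'distribution_center': { include profile::base::dc }
    'home_office':         { include profile::base::ho }
    default:               { fail("Unknown segment: ${facts['business_segment']}") }
  }
}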
========
We have two major, and diametrically opposed, configuration problems:
How to maintain state and substantial similarity for about 80% of our nodes
How to provide rational choices and a straightforward path of customization for the rest of our node population
Just to keep it interesting, we have some major cross-cutting concerns (like security policies) that apply equally to both our distributed and centralized workloads
Dynamic classification has helped us do both of these reasonably well,
And that’s important, because a lot of people depend on us…
========
Our configuration hierarchy has three primary branches:
Environment Groups (which don’t assign classes)
PE Infrastructure Groups (for PE and Agent Bits)
“Business” Classifications
We currently have 6 of these
We try to ensure each node lands in one leaf node group from each tree, but occasionally make exceptions
========
Once you have a substantial number of nodes under management, one of the first questions that you’ll probably be asked is how you control the pace of change.
You want the “big red button”, but not all changes are created equal.
How do you make it easy to deploy change globally, but provide hooks and stopping points to control the pace of change for changes that might be more risky?
We have four major strategies, which we mix and match when we need to.
The primary axes of change:
Hieradata (our own, defined hierarchy)
Environmental promotion (dev, …, production)
Temporary node groups (scaffolding, for new classes)
Feature Flag Workflow (for existing classes)
================
There are several really cool side effects of doing classification based on what a node might be.
For one thing, since node groups don’t have to assign classes or override configuration in any way, we can use them to provide visibility into what would change if we applied a new class or parameter to a given group.
Our classification hierarchy is a living visual representation of how we see our node population.
And very subtly, but I think very importantly, it allows us to think about our nodes abstractly. What if we did X? What if we did Y? It helps us see our node population as a living, changing, dynamic thing.
Do you promote your nodes from Dev to Production?
Or Do you have different node sets for test and production?
Practices differ, and no solution is ideal.
We are a “Code to Nodes” shop: we have standing sets of canary nodes that we promote code through on its way to production.
Ideally: full unit test coverage and full acceptance test coverage.
Generally: make Test as close as possible to production, and have a few steps to control populations for risky stuff.
We have 4 pre-production environments and 2 production ones
======
Temporary node groups make use of the hierarchy to filter change to a subset of nodes.
New classes can be added at a lower level of the hierarchy and “walked up” the tree
Each step in the deployment is an opportunity to validate correctness, and if not correct, to control damage
In the scariest of theoretical cases, we would completely withdraw a class (remove it from all node groups) and re-add a new version, which would basically be a brand-new rollout.
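As a sketch of what that scaffolding can look like when node groups are managed as code (assuming the community node_manager module, which this talk does not mention, and a hypothetical fact and class name):

# Hypothetical sketch using the node_manager module's node_group type:
# a temporary group that scopes a new class to a small pilot population.
node_group { 'Lservers-store-ISP-Pilot':
  ensure  => present,
  parent  => 'Lservers-store-ISP',
  rule    => ['and', ['=', ['fact', 'pilot_candidate'], 'true']],
  classes => { 'profile::new_feature' => {} },
}

Walking the class up the tree is then a matter of assigning it to a higher group and deleting the temporary one.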
A Feature Flag is a new parameter to a class that controls the new behavior
As a parameter, you can control it with Hiera or with the Console
Ideal for complex new deployments or parts of classes that cannot be declassified
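A minimal sketch of the pattern (hypothetical class and parameter names; the PuppetDB profile shown earlier is the real example):

class profile::myapp (
  Boolean $use_new_config = false,  # the feature flag: risky behavior off by default
) {
  if $use_new_config {
    include profile::myapp::new_config     # the new, risky behavior
  } else {
    include profile::myapp::legacy_config  # the existing, known-good behavior
  }
}
# Flipped per population via automatic parameter lookup, e.g. a Hiera key:
#   profile::myapp::use_new_config: true

Because the flag is just a class parameter, the same lever works from Hiera data or a console parameter override.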
Visibility and trackability mean other people can see what you see. If others can see what you see, maybe that can stave off a few after-hours calls? It also starts to move from being “someone’s” system to being “the” system – objectivity is important.
Objectivity, in turn, seems to help with some of the defensiveness of people who are skeptical of or resistant to change.
The more people buy into a system, the more it gets used – the more something gets used, the more comfortable people are with it.
The journey of a thousand miles begins with the first step.
The two best times to plant a tree are 30 years ago and today.
No one starts something and masters it – good grief, we have tons to learn still, and we’ve been doing this for years now. But we’ve made very significant progress, and we will make more in the future – and we wouldn’t be there if we hadn’t taken the first step. The first step leads to many more.
When we first started, the questions we asked were all about how we could keep things in sync and up to standard. As we’ve grown, the questions have become, much more, “How can I get to the next thing? How do I make it easier to migrate entire platforms?” It’s hard to imagine, knowing where we were when we started, that we would be asking these kinds of questions. Hopefully, we will have some great answers to them and discover some even more interesting questions.
Because it seems that the rate of change is only increasing. We have to drive down cycle times, because our customers and our users are depending on us.