Watching the Watchers
Capacity management and classification strategies for large Puppet node populations
Martin Jackson, Walmart
@mjolnir40k
Legal Disclaimer
Any reference in this presentation to any specific commercial product, process, or service, or the use of any trade, firm or corporation name is for information purposes only and does not constitute an endorsement or recommendation by Wal-Mart Stores, Inc.
Two years ago…
Where we are today
Resources managed: 67,400,000+
Production changes / day: 8.33
Nodes managed: 87,000+
Our challenge
Consolidating systems management practices across:
Locations: Data Centers, Distribution Centers, Stores
Platforms: Linux, Windows
Processes: Internal Build Practices
Other challenges
Managing a dynamic system
Flexible classification + differentiation
The art of balancing new changes and keeping systems stable
Determine how you differentiate nodes
Define your own node hierarchy
Determine how to use environments
Keep your classification DRY
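One way to read "determine how you differentiate nodes" is as a mapping from facts to a single top-level group. The sketch below is an illustration, not the configuration used in the talk: it assumes Lservers and Wservers correspond to Linux and Windows servers, and the variable name and the 'unclassified' fallback are hypothetical.

```puppet
# Sketch only: group names mirror the slides; the mapping logic is an assumption.
# A selector returns exactly one value, so each node gets one top-level group.
$top_level_group = $facts['kernel'] ? {
  'Linux'   => 'Lservers',
  'windows' => 'Wservers',
  default   => 'unclassified',  # hypothetical fallback for other platforms
}

# Surface the decision in the agent report so it is easy to inspect.
notify { "Top-level group: ${top_level_group}": }
```

Because the selector can only pick one branch, the top of the hierarchy stays mutually exclusive, and anything common to both branches can be factored out above it.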
Our solution
Dynamic classification + capacity management
380 node groups
20 node groups
20% / 80%
Flexible classification + differentiation
The art of balancing new changes and keeping systems stable
Environments: Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Puppet infrastructure: Puppet Enterprise (PE agent, PE master, PE MCollective, PE PuppetDB, PE console)
Business classifications: Lservers, Wservers
Lservers: Lservers-store, Lservers-DC, Lservers-HO
Lservers-store: Lservers-store-(ISP, WSP, NAS)
Temp node group trees: Lservers-store-ISP-Pilot, Lservers-store-ISP-Cert, Lservers-store-ISP-Dev
Are You Feeling Lucky?
Rollouts
Controlling the pace of change
Environment: Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Business classifications: Lservers, Wservers
Lservers: Lservers-store, Lservers-DC, Lservers-HO
Lservers-store: Lservers-store-(ISP, WSP, NAS)
Temp node group trees: Lservers-store-ISP-Pilot, Lservers-store-ISP-Cert, Lservers-store-ISP-Dev
Classification bonus side-effects
We can use Classifier to predict changes
Our Classification hierarchy is living documentation for how we see our infrastructure
Classifier helps us to think about changes abstractly (the config I want exists, how do I get a box to get it?)
Code to nodes / Nodes to code
Environmental promotion
The road to production
Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Ideally:
Full Unit Test Coverage
Full Acceptance Test Coverage
Generally:
Make Test as close as possible to production
Have a few steps to control populations for risky stuff
Temporary node groups
● The new class or feature uses the “scaffolding”
● Being successful in successively larger populations builds confidence
● Failing in smaller populations contains damage
● In the most extreme cases, we can unclassify a class completely, and basically re-deploy it from scratch
“It’s ready for production…just not ALL of production.”
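A temporary node group can also be declared as code rather than clicked together in the console. The sketch below assumes the community node_manager module, which provides a node_group resource type, is installed on the master. The parent group name follows the slides, while the class profile::new_feature and the fact pilot_store are hypothetical.

```puppet
# Assumes the node_manager module is available; this is a sketch, not the
# configuration from the talk.
node_group { 'Lservers-store-ISP-Pilot':
  ensure      => present,
  parent      => 'Lservers-store-ISP',   # sits low in the hierarchy
  environment => 'production',
  # Hypothetical custom fact marking the small pilot population.
  rule        => ['and', ['=', ['fact', 'pilot_store'], 'true']],
  classes     => {
    'profile::new_feature' => {},        # hypothetical new class under rollout
  },
}
```

Walking the class up the tree would then mean moving the classes entry to the parent group once the pilot population looks healthy, and removing the temporary group.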
Feature flags workflow
“Do I do the risky thing or not?”
# Additional configuration for PE PuppetDB load-balanced nodes
class profile::pe::puppetdb (
  $shared_cert_name = 'puppetdb-shared',
  $command_threads  = 4,
  $store_usage      = undef,
  $temp_usage       = undef,
  $memory_usage     = undef,
  Enum['absent', 'present'] $puppetdb_gc_cron_ensure = 'absent',
  Boolean $use_lb_cert = true,
) {
  # Note: setting this to false also requires setting hiera keys to nodename
  # for puppet_enterprise::profile::puppetdb::certname, as these default to
  # the LB cert in scale test and production
  if $use_lb_cert {
    class { '::puppetdb_shared_cert::puppetdb':
      certname => $shared_cert_name,
      before   => Puppet_enterprise::Certs['pe-puppetdb'],
    }
  }
}
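Reduced to its essentials, the feature flag pattern is a Boolean class parameter that gates the new behavior while the existing behavior stays the default. The following is a generic sketch under stated assumptions, not code from the talk: the class name, parameter name, and file paths are all hypothetical.

```puppet
# Hypothetical profile: names and paths are illustrative only.
class profile::example_app (
  Boolean $use_new_config = false,  # feature flag, off unless explicitly enabled
) {
  file { '/etc/example_app':
    ensure => directory,
  }

  if $use_new_config {
    # New, riskier behavior: only nodes with the flag set to true get this.
    file { '/etc/example_app/app.conf':
      ensure  => file,
      content => "mode=new\n",
    }
  } else {
    # Existing behavior remains the default for everyone else.
    file { '/etc/example_app/app.conf':
      ensure  => file,
      content => "mode=legacy\n",
    }
  }
}
```

Because the flag is an ordinary class parameter, automatic parameter lookup lets a Hiera key such as profile::example_app::use_new_config turn it on for a chosen level of the hierarchy, and removing that key turns it back off.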
Results
Visibility and tracking of changes
Buy-in from other teams
Greatly increased comfort level with change
Key Takeaways
Take the first step, then learn and grow.
Controlling the pace of change can help build credibility.
Making something easier to build makes it easier to remove.
The rate of change seems to only be increasing.
Questions?
Thank you!

PuppetConf 2017: Watching the Watchers- Martin Jackson, Walmart Stores


Editor's Notes

  1. The information appearing in this presentation is for general informational purposes only and is not intended to provide technical or any other advice to any individual or entity. Reference in this presentation to any specific commercial product, process, or service, or the use of any trade, firm or corporation name is for the information and convenience only, and does not constitute endorsement, recommendation, or favoring by Wal-Mart Stores, Inc.
  2. The wayback machine… I still own this shirt, and the PuppetConf 2014 t-shirt I was wearing beneath it. We’ve learned a lot since then, though…
  3. These are summaries, but they give an idea of what kind of scale we’re talking about. These are not just data center nodes – these nodes are distributed in our stores and distribution centers all over the world. One change could potentially affect nearly all of them.
  4. When we first started doing Puppet, we had tons of divergence. Our goal was to reduce the divergence as much as possible – duplication results in extra costs, because more choices means more troubleshooting paths. Our practices had diverged between the different lines of business we served, between the platforms we serviced, and our preferred approaches to problem solving. We were looking for a way to bring a lot of those practices together, to consolidate as much as we could. Are we all identical everywhere? No, not by a long shot, but we’re a lot more similar than we were when we started.
  5. Some of our challenges are obvious – some are less obvious. Managing a distributed fleet this size comes with many problems – we’re always refreshing something, somewhere. We’re always introducing something new, somewhere. When someone decides we need a new server image, we need to deploy it quickly. Against this backdrop, no one else is standing still, either. Applications are being updated constantly, and we have ongoing compliance and security posture requirements. On top of all of this, a lot of our deployed footprint is not in data centers, so it is subject to bad power, city power problems, and all kinds of natural disasters. All that diversity increases costs, because there’s more to troubleshoot; more possibilities take more time. But our infrastructure doesn’t get to write a blank check.
  6. Ongoing state management is critical. This is a picture from the aftermath of Hurricane Sandy. Most of our compute footprint runs in the distributed sites that it services. Those sites can be in the path of natural disasters, and natural disasters mean unpredictable outages. It really helps to be able to express changes abstractly and not to have to worry about when sites come up or if they miss a check-in or two. Our customers depend on us to be up and running as soon as possible after all kinds of situations, of which recent natural disasters are only one example. Preparing for blizzards and similar events is also a factor.
  7. Determine how you differentiate nodes. Define your own node hierarchy. Determine how to use environments. Keep your classification DRY. Use facts that are mutually exclusive at the top level, for the groups that matter to you. Feel free to invent your own facts, or synthesize other facts to make these decisions. Maybe you see Linux and Windows servers in the same tree? Maybe you don’t? The decision is entirely up to you. Determine how you’re going to use environments. Are you going to have standing non-prod environments? Are they going to be dynamic? DRY is Don’t Repeat Yourself. It’s easy to create a structure where you have to repeat a lot of things; this will create a lot more work for you. Try to factor out things that are common and move those up the hierarchy. As we will see, this will give us opportunities to use that hierarchy to make changes.
  8. We have two major, and diametrically opposed, configuration problems: how to maintain state and substantial similarity for about 80% of our nodes, and how to provide rational choices and a straightforward path of customization for the rest of our node population. Just to keep it interesting, we have some major cross-cutting concerns (like security policies) that apply equally to both our distributed and centralized workloads. Dynamic classification has helped us do both of these reasonably well, and that’s important, because a lot of people depend on us…
  9. Our configuration hierarchy has three primary branches: Environment Groups (which don’t assign classes); PE Infrastructure Groups (for PE and agent bits); and “Business” Classifications. We currently have 6 of these. We try to ensure each node lands in one leaf node group from each tree, but occasionally make exceptions. Once you have a substantial number of nodes under management, one of the first questions that you’ll probably be asked is how you control the pace of change.
  10. You want the “big red button”, but not all changes are created equal. How do you make it easy to deploy change globally, but provide hooks and stopping points to control the pace of change for changes that might be more risky? We have three major strategies, which we mix and match when we need to.
  11. The primary axes of change: Hieradata (our own, defined hierarchy); environmental promotion (dev, …, production); temporary node groups (scaffolding, for new classes); and the feature flag workflow (for existing classes).
  12. There are several really cool side effects of doing classification based on what a node might be. For one thing, since node groups don’t have to assign classes or override configuration in any way, we can use them to provide visibility into what would change if we applied a new class or parameter to a given group. Our classification hierarchy is a living visual representation of how we see our node population. And very subtly, but I think very importantly, it allows us to think about our nodes abstractly. What if we did X? What if we did Y? It helps us see our node population as a living, changing, dynamic thing.
  13. Do you promote your nodes from Dev to Production? Or do you have different node sets for test and production? Practices differ, and no solution is ideal. We are a “Code to Nodes” shop: we have standing sets of canary nodes that we promote code through on its way to production.
  14. Ideally: full unit test coverage and full acceptance test coverage. Generally: make Test as close as possible to production, and have a few steps to control populations for risky stuff. We have 4 pre-production environments and 2 production ones.
  15. Temporary node groups make use of the hierarchy to filter change to a subset of nodes. New classes can be added at a lower level of the hierarchy and “walked up” the tree. Each step in the deployment is an opportunity to validate correctness and, if something is not correct, to control damage. In the scariest of theoretical cases, we would completely withdraw a class (remove it from all node groups) and re-add a new version, which would basically be a brand-new rollout.
  16. A Feature Flag is a new parameter to a class that controls the new behavior. As a parameter, you can control it with Hiera or with the Console. It is ideal for complex new deployments or parts of classes that cannot be declassified.
  17. Visibility and trackability means other people can see what you see. If others can see what you see, maybe that can stave off a few after-hours calls? It also starts to move from being “someone’s” system to being “the” system – objectivity is important. Objectivity, in turn, seems to help with some of the defensiveness of people who are skeptical of or resistant to change. The more people buy into a system, the more it gets used – the more something gets used, the more comfortable people are with it.
  18. The journey of a thousand miles begins with the first step. The two best times to plant a tree are 30 years ago and today. No one starts something and masters it – good grief, we have tons to learn still, and we’ve been doing this for years now. But we’ve made very significant progress, and we will make more in the future – and we wouldn’t be there if we hadn’t taken the first step. The first step leads to many more. When we first started, the questions we asked were all about how we could keep things in sync and up to standard. As we grew, the questions are becoming, much more, “How can I get to the next thing? How do I make it easier to migrate entire platforms?” It’s hard to imagine, knowing where we were when we started, that we would be asking these kinds of questions. Hopefully, we will have some great answers to these questions and discover some even more interesting questions. Because it seems that the rate of change is only increasing. We have to drive down cycle times, because our customers and our users are depending on us to.