Learn how Walmart uses node groups and facts to partition our fleet, which currently includes over 87,000 nodes running Linux and Windows, across various business segments. We will also discuss our strategies for performing rollouts at varying speeds, which includes how we use the classifier, how we use environments, and how we use the feature flag design pattern.
2. Legal Disclaimer
Any reference in this presentation to any specific commercial product, process, or service, or the use of any trade, firm or corporation name is for informational purposes only and does not constitute an endorsement or recommendation by Wal-Mart Stores, Inc.
8. Flexible classification + differentiation
The art of balancing new changes and keeping systems stable
Determine how you differentiate nodes
Define your own node hierarchy
Determine how to use environments
Keep your classification DRY
10. Flexible classification + differentiation
The art of balancing new changes and keeping systems stable
[Diagram: the three classification trees]
Environments: Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Puppet infrastructure: Puppet Enterprise (PE agent, PE master, PE MCollective, PE PuppetDB, PE console)
Business classifications: Lservers and Wservers; under Lservers: Lservers-store, Lservers-DC, Lservers-HO; under Lservers-store: Lservers-store-(ISP, WSP, NAS)
Temp node group trees: Lservers-store-ISP-Pilot, Lservers-store-ISP-Cert, Lservers-store-ISP-Dev
12. Rollouts
Controlling the pace of change
[Diagram: the same hierarchy, viewed as rollout levers]
Environments: Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Business classifications: Lservers and Wservers; under Lservers: Lservers-store, Lservers-DC, Lservers-HO; under Lservers-store: Lservers-store-(ISP, WSP, NAS)
Temp node group trees: Lservers-store-ISP-Pilot, Lservers-store-ISP-Cert, Lservers-store-ISP-Dev
13. Classification bonus side-effects
We can use the Classifier to predict changes.
Our classification hierarchy is living documentation for how we see our infrastructure.
The Classifier helps us to think about changes abstractly (the config I want exists, how do I get a box to get it?)
15. Environmental promotion
The road to production
Environments (promotion path): Production, Pilot production, Certification, Test, Development, Puppet Dev, Agent-specific
Ideally: full unit test coverage and full acceptance test coverage.
Generally: make Test as close as possible to production, and have a few steps to control populations for risky stuff.
16. Temporary node groups
● The new class or feature uses the “scaffolding”
● Being successful in successively larger populations builds confidence
● Failing in smaller populations contains damage
● In the most extreme cases, we can unclassify a class completely, and basically re-deploy it from scratch
“It’s ready for production…just not ALL of production.”
17. Feature flags workflow
“Do I do the risky thing or not?”
# Additional configuration for PE PuppetDB load-balanced nodes
class profile::pe::puppetdb (
  $shared_cert_name = 'puppetdb-shared',
  $command_threads  = 4,
  $store_usage      = undef,
  $temp_usage       = undef,
  $memory_usage     = undef,
  Enum['absent', 'present'] $puppetdb_gc_cron_ensure = 'absent',
  Boolean $use_lb_cert = true,
) {
  # Note: setting this to false also requires setting hiera keys to nodename
  # for puppet_enterprise::profile::puppetdb::certname, as these default to
  # the LB cert in scale test and production
  if $use_lb_cert {
    class { '::puppetdb_shared_cert::puppetdb':
      certname => $shared_cert_name,
      before   => Puppet_enterprise::Certs['pe-puppetdb'],
    }
  }
}
19. Key Takeaways
Take the first step, then learn and grow.
Controlling the pace of change can help build credibility.
Making something easier to build makes it easier to remove.
The rate of change seems to only be increasing.
The wayback machine…
I still own this shirt, and the PuppetConf 2014 t-shirt I was wearing beneath it.
We’ve learned a lot since then, though…
These are summaries, but they give an idea of what kind of scale we’re talking about.
These are not just data center nodes – these nodes are distributed in our stores and distribution centers all over the world.
One change could potentially affect nearly all of them.
When we first started doing Puppet, we had tons of divergence.
Our goal was to reduce the divergence as much as possible – duplication results in extra costs, because more choices means more troubleshooting paths.
Our practices had diverged between the different lines of business we served, between the platforms we serviced, and our preferred approaches to problem solving. We were looking for a way to bring a lot of those practices together, to consolidate as much as we could.
Are we all identical everywhere? No, not by a long shot, but we’re a lot more similar than we were when we started.
Some of our challenges are obvious – some are less obvious.
Managing a distributed fleet this size comes with many problems – we’re always refreshing something, somewhere. We’re always introducing something new, somewhere.
When someone decides we need a new server image, we need to deploy them quickly.
Against this backdrop, no one else is standing still, either. Applications are being updated constantly, and we have ongoing compliance and security posture requirements.
On top of all of this, a lot of our deployed footprint is not in data centers, so it is subject to bad power, city power problems, and all kinds of natural disasters.
All that diversity increases costs, because there’s more to troubleshoot; more possibilities take more time. But our infrastructure doesn’t get to write a blank check.
Ongoing state management is critical.
This is a picture from the aftermath of Hurricane Sandy.
Most of our compute footprint runs in the distributed sites that it services.
Those sites can be in the path of natural disasters.
Natural disasters mean unpredictable outages.
It really helps to be able to express changes abstractly, and not to have to worry about when sites come up or whether they miss a check-in or two.
Our customers depend on us to be up and running as soon as possible after all kinds of situations, of which recent natural disasters are only one example.
Prep for blizzards etc. are also a thing.
Determine how you differentiate nodes
Define your own node hierarchy
Determine how to use environments
Keep your classification DRY
========
Use facts that are mutually exclusive at the top level, for the groups that matter to you.
Feel free to invent your own facts, or synthesize other facts to make these decisions.
Maybe you see Linux and Windows servers in the same tree? Maybe you don’t? The decision is entirely up to you.
Determine how you’re going to use environments. Are you going to have standing non-prod environments? Are they going to be dynamic?
DRY is Don’t Repeat Yourself. It’s easy to create a structure where you have to repeat a lot of things; this will create a lot more work for you. Try to factor out things that are common and move those up the hierarchy. As we will see, this will give us opportunities to use that hierarchy to make changes.
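To make the first point concrete, here is a minimal sketch of fact-based differentiation, with an invented “business_segment” fact and hypothetical profile names (the actual facts and profiles Walmart uses are not shown in this talk):

# Hypothetical sketch: 'business_segment' stands in for whatever
# mutually exclusive custom fact you synthesize for your fleet.
class profile::base {
  case $facts['business_segment'] {
    'store':               { include profile::base::store }
    'distribution_center': { include profile::base::dc }
    'home_office':         { include profile::base::ho }
    default:               { fail("Unknown segment: ${facts['business_segment']}") }
  }
}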
========
We have two major, and diametrically opposed, configuration problems:
How to maintain state and substantial similarity for about 80% of our nodes
How to provide rational choices and a straightforward path of customization for the rest of our node population
Just to keep it interesting, we have some major cross-cutting concerns (like security policies) that apply equally to both our distributed and centralized workloads
Dynamic classification has helped us do both of these reasonably well,
And that’s important, because a lot of people depend on us…
========
Our configuration hierarchy has three primary branches:
Environment Groups (which don’t assign classes)
PE Infrastructure Groups (for PE and Agent Bits)
“Business” Classifications
We currently have 6 of these
We try to ensure each node lands in one leaf node group from each tree, but occasionally make exceptions
========
Once you have a substantial number of nodes under management, one of the first questions that you’ll probably be asked is how you control the pace of change.
You want the “big red button”, but not all changes are created equal.
How do you make it easy to deploy change globally, but provide hooks and stopping points to control the pace of change for changes that might be more risky?
We have four major strategies, which we mix and match when we need to.
The primary axes of change:
Hieradata (our own, defined hierarchy)
Environmental promotion (dev, …, production)
Temporary node groups (scaffolding, for new classes)
Feature Flag Workflow (for existing classes)
================
There are several really cool side effects of doing classification based on what a node might be.
For one thing, since node groups don’t have to assign classes or override configuration in any way, we can use them to provide visibility into what would change if we applied a new class or parameter to a given group.
Our classification hierarchy is a living visual representation of how we see our node population.
And very subtly, but I think very importantly, it allows us to think about our nodes abstractly. What if we did X? What if we did Y? It helps us see our node population as a living, changing, dynamic thing.
Do you promote your nodes from Dev to Production?
Or Do you have different node sets for test and production?
Practices differ, and no solution is ideal.
We are a “Code to Nodes” shop: we have standing sets of canary nodes that we promote code through on its way to production.
Ideally: full unit test coverage and full acceptance test coverage.
Generally: make Test as close as possible to production, and have a few steps to control populations for risky stuff.
We have 4 pre-production environments and 2 production ones
======
Temporary node groups make use of the hierarchy to filter change to a subset of nodes.
New classes can be added at a lower level of the hierarchy and “walked up” the tree
Each step in the deployment is an opportunity to validate correctness, and if not correct, to control damage
In the scariest of theoretical cases, we would completely withdraw a class (remove it from all node groups) and re-add a new version, which would basically be a brand-new rollout.
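As a sketch of what that scaffolding can look like when node groups are managed as code (assuming the community node_manager module, which this talk does not mention, and a hypothetical fact and class name):

# Hypothetical sketch using the node_manager module's node_group type:
# a temporary group that scopes a new class to a small pilot population.
node_group { 'Lservers-store-ISP-Pilot':
  ensure  => present,
  parent  => 'Lservers-store-ISP',
  rule    => ['and', ['=', ['fact', 'pilot_candidate'], 'true']],
  classes => { 'profile::new_feature' => {} },
}

Walking the class up the tree is then a matter of assigning it to a higher group and deleting the temporary one.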
A Feature Flag is a new parameter to a class that controls the new behavior
As a parameter, you can control it with Hiera or with the Console
Ideal for complex new deployments or parts of classes that cannot be declassified
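A minimal sketch of the pattern (hypothetical class and parameter names; the PuppetDB profile shown earlier is the real example):

class profile::myapp (
  Boolean $use_new_config = false,  # the feature flag: risky behavior off by default
) {
  if $use_new_config {
    include profile::myapp::new_config     # the new, risky behavior
  } else {
    include profile::myapp::legacy_config  # the existing, known-good behavior
  }
}
# Flipped per population via automatic parameter lookup, e.g. a Hiera key:
#   profile::myapp::use_new_config: true

Because the flag is just a class parameter, the same lever works from Hiera data or a console parameter override.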
Visibility and trackability mean other people can see what you see. If others can see what you see, maybe that can stave off a few after-hours calls? It also starts to move from being “someone’s” system to being “the” system – objectivity is important.
Objectivity, in turn, seems to help with some of the defensiveness of people who are skeptical of or resistant to change.
The more people buy into a system, the more it gets used – the more something gets used, the more comfortable people are with it.
The journey of a thousand miles begins with the first step.
The two best times to plant a tree are 30 years ago and today.
No one starts something and masters it – good grief, we have tons to learn still, and we’ve been doing this for years now. But we’ve made very significant progress, and we will make more in the future – and we wouldn’t be there if we hadn’t taken the first step. The first step leads to many more.
When we first started, the questions we asked were all about how we could keep things in sync and up to standard. As we’ve grown, the questions have become, much more, “How can I get to the next thing? How do I make it easier to migrate entire platforms?” It’s hard to imagine, knowing where we were when we started, that we would be asking these kinds of questions. Hopefully, we will have some great answers to them and discover some even more interesting questions.
Because it seems that the rate of change is only increasing. We have to drive down cycle times, because our customers and our users are depending on us.