Adding Windows servers to a Puppet instance can feel daunting, even more so when you already have a large number of Linux servers in Puppet. Learn how Walmart integrated its Windows servers into Puppet Enterprise. We’ll discuss not only why we chose Puppet over other tools, but also why and how we still use tools like DSC, SCCM, and GPOs. We’ll also go over the successes and pitfalls we had along the way in using Puppet on Windows, onboarding other teams, and evangelizing our team’s vision to others.
2. Any reference in this presentation to any specific commercial product, process, or service, or
the use of any trade, firm or corporation name is for information purposes only and
does not constitute an endorsement or recommendation by Wal-Mart Stores, Inc.
8. … until it didn’t
13 major bugs in first 6 months
Only 4 people knew how it worked
Very limited documentation
PowerShell v5 would require rewriting scripts
We had to constantly tweak WMI settings after 1000 nodes
10. Then the world changed
47,000 Linux servers were already managed by Puppet
Linux teammates were planning the upgrade to Puppet Enterprise
Puppet 3.7 introduced 64-bit support for Windows
PowerShell v5 introduced better 3rd party integration
Puppet Labs DSC module was announced
11. Integrating Tools
SCCM
Image management
Software delivery
Patch management
Active Directory
User policy settings
Server-side security policy settings
Puppet
Native resource types
DSC
Additional plugins and providers
Choosing the best tool for the job
12. How Windows got invited to the party
Start small and build on success
• Started with new OS version (no “legacy” to worry about)
• All new builds are now done using Puppet
• > 5,000 servers have been built using Puppet (> 7,000 including cloud servers)
• > 30,000 managed Windows nodes
14. Gotchas
Workarounds for Windows
Issue | Workaround
Plugin sync issues | TBD
Standard DSC resources had more stuff than we needed | We removed DSC resources we don't use
External facts and PowerShell | Setting up test environments is important
15. Four pillars of success
Documentation, Automation, Collaboration, Measurement
Show of hands..
How many of you are Windows admins – you only do Windows?
How many of you are Windows admins but you also manage Linux?
How many of you are Linux admins?
I hate to say it, but you may be in the wrong place..
Intros here…
Chris:
Derek:
[Chris]
So, yeah: 11,500 stores under 72 banners in 28 countries, e-commerce sites in 11 countries, and growing.
We have a lot of variety not only in our stores and on our websites, but there’s also tons of variety within our infrastructure.
For Windows, it’s 40,000 servers – in stores, in distribution centers, in our data centers, and in the cloud
Some of our main challenges are really around not knowing what really matters from a configuration standpoint on a server or group of servers
There are snowflakes. A lot of snowflakes. Snowflakes that were set up in a particular way for a particular reason
It’s the kind of stuff that gets passed on with tribal knowledge – but eventually it gets lost
[Derek]
Outdated toolchain (SCCM, domain & local policy, scripts)
[Derek] Point and click culture…
[Chris] “Hi my name is Chris and I’m a recovering GUI user..”
[Derek] more stuff
[Chris]
So with all of this in mind, with all of these challenges and restrictions…
We knew that in order to make a difference, we had to re-examine our tools, processes, and culture
We wanted:
- To be able to manage configuration at scale
- Better visibility into our environment and better tracking of changes
- To be able to shift gears and go faster with automation and change
But where do you start?
[Derek]
For us, we looked at a few different things:
- Greenfield – new OS, new environments (cloud)
- Brownfield - Look for opportunities for the quick wins – e.g. local user management in stores, SQL Server DBAs, security teams
Puppet was actually NOT in our first iteration
[Derek]
Wait for it… we called it… Walmart DSC
We had to constantly tweak WMI settings after 1000 nodes
DSC doesn’t seem to have been scale tested. DSC lives in WMI which is memory constrained. When we tried to compile large lists of servers, it would hang and break.
[Derek]
Start this off – about 64-bit support, DSC resources, etc..
[Chris]
You know, we saw that the Linux guys were doing some pretty cool stuff and we wanted to be a part of that.
Teaming up with our Linux counterparts had several advantages:
Our Windows guys could follow the same workflow for deploying configuration changes regardless of platform: repos, PRs, the same classification system, node groups, etc. (plug Marty)
Single source for change tracking and reporting
More buy-in across teams, which promotes more usage and internal community growth
How many of you use all of these tools in your environment?
Ok. Due to scale and complexity, our strategy is really about leveraging different tools for different strengths and use cases.
Take SCCM: it’s a virtual Swiss Army knife, striving to be that all-in-one tool that can help you manage everything on servers, desktops, and even mobile devices. Sure, it has a ton of features, but sometimes those little scissors just won’t cut it.
So for us, the sweet spot is in:
- Image management, operating systems deployment, patch management
And using content providers for software delivery really helps us protect the WAN and sites with very limited bandwidth (like in a jungle somewhere)
Active Directory is like an industrial power washer: it’s powerful and there’s a bunch of settings and a variety of nozzles. You can blast some settings out, but there’s really no guarantee that they will apply properly.
Servers don’t report back and say “yep, I got it. I’m good” so there’s a loss of visibility
It’s kind of a spray and pray tool. One time, I asked my 14-year-old to power wash the deck. I came home to find it very clean, but also very splintered!
Not exactly the settings I would have used or the outcome I was looking for.
If you have multiple domains, you have to maintain those GPO settings in those other domains as well… this can lead to some serious maintenance issues.
With all that said, we still use it for some things. Our use cases:
User settings on RDS servers: look and feel of the start screen with specific shortcuts, browser, etc.
Some policies in local security stores
Puppet is the scalpel – it has a specific purpose in managing config and it does it well..
[Derek]
So we had decided to use Puppet – Now we had to decide how to actually get started
Leveraging same workflow, same methods for managing infrastructure, same business units, just another platform.
One of the key takeaways we learned from the Linux team was to start small: get one agent, configure one resource, and then build upon that.
Empowering technology groups to manage their specific settings
e.g. Windows baseline profile – then layer additional configurations on top…
> 5,000 new servers
> 2,000 in cloud
> 30,000 Windows servers managed today across data centers, stores, and distribution centers, as well as the cloud
2 years in and a lot of teams don’t even realize Puppet is running on their servers. This is a blessing and a curse. We haven’t impacted those users negatively, but it means we still have more outreach to do.
[Derek]
Adopted roles and profiles before Linux team. Shared our learnings with them.
[Chris]
Compliance
Puppet is declarative, so it’s self-documenting; it’s easy to show auditors a manifest or report instead of manually updating a 30-page doc
Natural disasters do happen, and some of our stores may be in a direct path and lose power for some period of time. But it’s good to know that when they come back online and check in with the master, they will eventually become compliant and get any other changes as well
[Derek]
Increased visibility into our infrastructure. E.g. proxy settings changing on several servers due to “another application”
[Chris]
Speed -
At our scale, changing 30,000 servers at once can be scary.
Teams that were clamoring for us to move faster are now asking us to slow down.
But we are able to implement and track change across different environments at scale. If you want to know more, our colleague Marty Jackson will go in-depth on how we classify our nodes and manage change across environments.
So be sure to check out his talk tomorrow at 10:30
[Derek]
Partnering to solve problems
[Derek]
It wasn’t all smooth sailing… we did hit a few bumps along the way.
What about mentioning the gotcha of having more than one tool managing the same setting… because that NEVER happens! (GPO, Puppet, SCCM, etc.)
Mention Glenn Sarti’s blog post on Puppet facts on Windows.
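The gotcha of more than one tool managing the same setting can be checked for mechanically. A minimal sketch in Python, assuming each tool's managed settings can be exported as a set of names; the export format, tool inventory, and setting names below are hypothetical placeholders for illustration only:

```python
from collections import defaultdict


def find_overlaps(managed):
    """Return {setting: sorted tool names} for settings claimed by more than one tool.

    `managed` maps a tool name to the set of settings it manages
    (a hypothetical export format, assumed for illustration).
    """
    owners = defaultdict(set)
    for tool, settings in managed.items():
        for setting in settings:
            owners[setting].add(tool)
    # Keep only the settings that have more than one owner.
    return {s: sorted(t) for s, t in owners.items() if len(t) > 1}


# Hypothetical inventory: every name here is a placeholder.
inventory = {
    "GPO": {"proxy_settings", "password_policy"},
    "Puppet": {"proxy_settings", "local_admins"},
    "SCCM": {"patch_window"},
}
overlaps = find_overlaps(inventory)
print(overlaps)
```

Any setting that shows up in the result has two owners and is a candidate for the kind of tug-of-war described above; deciding which tool should win is still a conversation between teams.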
[Chris]
Looking back on our accomplishments, we categorized the key things that really helped us
Documentation –
What really helped with onboarding was having robust documentation shared with the Linux team. Our team helped improve the documentation so that it would work for Windows folks as well. Puppet is easy, Git is hard. The command line doesn’t come naturally to most Windows admins.
Automation –
When possible, automate the bottlenecks, and automate the things that don’t provide value and may waste time. Encrypting with Hiera eyaml was a bottleneck for us (setting up the encryption environment). We automated the generation of Hiera eyaml code by creating a website that does it, to make it easier for end users.
[Derek]
Collaboration –
As a result of all this, we are better business partners
Not only do we collaborate with the Windows engineers and our Linux counterparts, but we also collaborate with other groups: application owners, SQL DBAs.
There’s lots that we can learn and share. #help_puppet
Measurement – Managers and upper management love numbers and metrics. Going from a gut feeling of “I think we have this many servers” or “We fixed about this many servers” to “well, here’s the report showing that the issue was automatically remediated on 200 servers in the last 24 hours” is huge.
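A number like "remediated on 200 servers in the last 24 hours" comes from counting report data rather than guessing. A minimal sketch in Python, assuming reports are available as simple records; the record keys ("node", "time", "corrective_change") and the sample data are hypothetical and only illustrate the counting logic:

```python
from datetime import datetime, timedelta, timezone


def remediated_nodes(reports, now, window=timedelta(hours=24)):
    """Return the set of node names with at least one corrective change in the window.

    Each report is a dict with hypothetical keys: "node" (str),
    "time" (timezone-aware datetime) and "corrective_change" (bool).
    """
    cutoff = now - window
    return {
        r["node"]
        for r in reports
        if r["corrective_change"] and cutoff <= r["time"] <= now
    }


# Sample data for illustration only; node names and timestamps are made up.
now = datetime(2016, 10, 20, 12, 0, tzinfo=timezone.utc)
sample_reports = [
    {"node": "store-001", "time": now - timedelta(hours=2), "corrective_change": True},
    {"node": "store-001", "time": now - timedelta(hours=1), "corrective_change": True},
    {"node": "store-002", "time": now - timedelta(hours=30), "corrective_change": True},
    {"node": "dc-010", "time": now - timedelta(hours=3), "corrective_change": False},
]
count = len(remediated_nodes(sample_reports, now))
print(f"Automatically remediated on {count} servers in the last 24 hours")
```

Counting distinct nodes rather than reports keeps a server that was corrected twice from being counted twice.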
[Chris]
Our journey isn’t over.
We need to get all Windows servers managed with Puppet. This includes what’s left in the brownfield.
We will continue to manage more and more on these as we grow.
We will partner and collaborate with others to develop content and help them deploy and manage their own modules
We will build onto our baseline configuration and extend it for Server 2016 as necessary
We will make it easier for our users to deploy custom DSC resources
[Derek]
PQL
Infrastructure apps:
- SQL Server
- IIS
Infrastructure app integration
We’re adding more OS versions and environments
2008, 2008 R2, and 2016 servers
[Derek]
So if you’re thinking about taking this journey… here are some things to help you on your path:
Collaboration is key
Not a solo journey
Ok to start small..
Work with teams/individuals that are apt to be early adopters and embrace change
Provide the quick wins to gain trust and cooperation -> then expand
Punch this up…
Collaboration