First, the bad news! (*)
(*) We’re on 0.25.4. Things may be different now.
Problems
Sort of idempotent
Ruby file transfer is inefficient
Minimize use of recursion (home dirs, etc.)
Single run is non-deterministic
Order matters if order is specified
Not specifying dependency order creates out-of-order delivery
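A minimal sketch of pinning order with explicit relationships (resource names and paths here are illustrative, not from the original config):

```puppet
# Without the relationships, 0.25 may apply these resources in a
# different order on each run; require/subscribe make it deterministic.
package { 'ntp':
  ensure => installed,
}

file { '/etc/ntp.conf':
  ensure  => present,
  source  => 'puppet:///modules/ntp/ntp.conf',
  require => Package['ntp'],            # apply only after the package
}

service { 'ntpd':
  ensure    => running,
  subscribe => File['/etc/ntp.conf'],   # restart when the config changes
}
```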
Working Together
Puppet is great with a small team
Management is hard with a large number of admins
Unforeseen interactions between changes
No simple means of review
Security
Anyone who can check into the tree can kill production with simple mistakes
SVN access is effectively root equivalent
Divergence from desired configuration through use of chattr +i on a puppet-managed file
You can’t chattr +i with broken fingers.
Puppet DSL
Puppet DSL not Ruby enough
Stated as a plus, but really a minus when most engineers expect Ruby
Incomplete conditionals in the DSL
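For illustration: the 0.25-era DSL has no elsif, so multi-way branches typically fall back to case statements or selectors. A minimal sketch (the fact is standard, but the service names are assumptions):

```puppet
# Workaround for the missing elsif: a case statement on a facter fact
case $operatingsystem {
  'debian', 'ubuntu': { $ssh_service = 'ssh' }
  default:            { $ssh_service = 'sshd' }
}

service { $ssh_service:
  ensure => running,
}
```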
Cron
Removing configuration for a cron job leaves the cron job behind
Need to specify ensure => absent
If you forget the command with absent, duplicate cron job entries can occur
The vestigial tail of these “ensure absent” lines ends up living in the config long after it is needed
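A minimal sketch of retiring a cron job this way (job name and command are hypothetical); per the 0.25 behavior described above, the command is kept on the absent resource so the existing crontab entry is matched rather than duplicated:

```puppet
# Original job:
#   cron { 'rotate_logs':
#     command => '/usr/local/bin/rotate_logs',
#     user    => 'root',
#     minute  => 0,
#   }

# To retire it, flip to ensure => absent -- simply deleting the
# resource leaves the crontab entry behind.
cron { 'rotate_logs':
  ensure  => absent,
  command => '/usr/local/bin/rotate_logs',
  user    => 'root',
}
```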
Cron + NTP
NTP synchronizes the system time across machines
Cron granularity is one minute; jobs fire at second zero
Performance regression if puppet installs many jobs, across different modules, that all fire on the same zero second
Introduce random delays before jobs:
sleep $(($RANDOM % 60)); do_something ...
Test and “Canary”
No facility in puppet for testing
Monolithic design
Controlled Deploys are preferable to “full change”
Use representative machines first
Push to cluster when everything works.
Machine Database
Node membership in classes, and the nodes themselves, are not well exposed in a puppet configuration.
Once entered, parsing the SVN tree is the only way to retrieve the machine list and its associated “roles”.
ldapnodes is a possible solution here.
Node Class Changes
Still an unsolved problem
Removing a class definition from a node leaves all of the configuration from the class behind
Have to re-kickstart the host to get to a base state
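A minimal illustration of the problem (class, node, and paths are hypothetical): deleting the include below orphans the resources it declared — puppet stops managing them, but does not remove them.

```puppet
class webserver {
  package { 'httpd':
    ensure => installed,
  }
  file { '/etc/httpd/conf.d/site.conf':
    ensure => present,
    source => 'puppet:///modules/webserver/site.conf',
  }
}

node 'web01' {
  include webserver  # deleting this line leaves httpd and its config
                     # on the host; only ensure => absent stanzas or a
                     # re-kickstart actually clean them up
}
```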
Change Process (test integration)
HEAD → TEST → integrate to ~10% of hosts
Watch for failures!
Change Process (final review)
HEAD → TEST → PRODUCTION → integrate to 100% of hosts
Change Process (bypass)
HEAD → cherry-pick → PRODUCTION, bypassing TEST
No review.
Testing / Staging
A test infrastructure is needed to ensure that updates don’t kill production
People make mistakes
Treat the puppet config as if it were code
Security
Restrict access to SVN tree itself (through ACLs)
Create a concept of an OWNER for each module and manifest subdir; restrict access.
Enforce ownership during SVN checkin
Enforce a proper review process
SVN can be smarter
Post-Commit checks
BIND (Verify zones, DNS, SOA++)
A mistake here is a full site outage
Verify puppet config
Create Reviewboard Entries
puppet-util
A script on each box to select the current branch
Set the branch (by modifying facter fact + config)
Show current branch
Enable or disable puppetd in emergencies or for ad-hoc testing
User Security
Distrust puppet for creating user accounts
Build them from an LDAP infrastructure
Base package connects to LDAP and creates users based on group and machine role
You still have to deal with RPMs creating system users
Machine Database
No machine database in puppet
We used Django, MySQL, but you could use LDAP
Role membership imported to the DB by parsing existing puppet definitions and special variables in the node stanza
Ad hoc scripting
No facility in puppet for immediate execution of a command on many hosts
SSH in a loop is not a solution at scale
Threaded SSH system through our own tool
Uses the open source Paramiko library (Python)
see also: func
Multiple Instances
Three complete puppetmasterd instances on each puppet master machine, on different ports, pointed to different SVN branches
HEAD
TEST
PRODUCTION
Handling many clients
Distribute:
the SVN tree (eliminate the SPOF)
Use more puppet servers
Rsync manifests, then run puppet
Selectively update hosts (func)
Puppet Web Server
Don’t run WEBrick (script/server) - too slow
Unicorn (best choice)
Passenger (mod_rails)
mongrel?
Distributed Puppet
Too many clients eventually overwhelm the Master
You must deploy more hosts
Distribute cron jobs
Randomize start times
Distribute the master itself