John Adams, Puppet Camp 2010: Presentation Transcript
Puppet at High Scale
John Adams, Twitter
First, the bad news! (*)
(*) We’re on 0.25.4. Things may be different now.
Sort of idempotent
Ruby file transfer is inefficient
Minimize use of recursion (home dirs, etc.)
Single run is non-deterministic
Order matters if order is specified
Not specifying dependency order creates out-of-order execution
Puppet is great with a small team
Management is hard with a large number of admins
Unforeseen interactions between changes
No simple means of review
Anyone who can check into the tree can kill production
with simple mistakes
SVN access is effectively root equivalent
Divergence from desired configuration through use of
chattr +i puppetmanagedfile
You can’t chattr +i with broken fingers.
Puppet DSL not Ruby enough
Stated as a plus, but really a minus when most
engineers expect Ruby
Incomplete conditionals in the DSL
Removing configuration for a cron job leaves the job installed
Need to specify ensure => absent
If you forget the command with absent,
duplicate cronjob entries can occur
The vestigial tail of these “ensure absent” lines ends
up living in the config long after they are needed
Cron + NTP
NTP synchronizes the system time
Cron schedules at one-minute granularity, so synchronized hosts start jobs at the same instant
Performance regression if you make puppet install
many jobs across different modules, on the same
Introduce random delays before jobs
sleep $(($RANDOM % 60));
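The shell one-liner above picks a fresh random delay on every run. A variation, sketched here in Python, derives the delay from the hostname so that each host keeps a stable offset inside the window. The function name `splay_seconds` and the hashing scheme are illustrative assumptions, not something from the talk.

```python
import hashlib


def splay_seconds(hostname: str, window: int = 60) -> int:
    """Map a hostname to a stable delay in the range [0, window) seconds."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Hash the hostname so the offset is spread evenly but repeatable per host.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window
```

A cron wrapper could sleep for `splay_seconds(socket.gethostname())` before starting the real job, which trades the randomness of `$RANDOM` for a predictable per-host schedule.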
Test and “Canary”
No facility in puppet for testing
Controlled Deploys are preferable to “full change”
Use representative machines first
Push to cluster when everything works.
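One way to choose "representative machines" is to take a small slice of every role, so each kind of host sees the change before the full cluster does. The sketch below assumes a hypothetical `hosts_by_role` mapping and a default of 10 percent; neither the data structure nor the selection rule comes from the talk.

```python
def pick_canaries(hosts_by_role, percent=10):
    """Return canary hosts: about `percent` of each role, and at least one per role."""
    canaries = []
    for role in sorted(hosts_by_role):
        hosts = sorted(hosts_by_role[role])
        if not hosts:
            continue
        # Integer ceiling of len(hosts) * percent / 100, never less than one host.
        count = max(1, (len(hosts) * percent + 99) // 100)
        canaries.extend(hosts[:count])
    return canaries
```

Sorting keeps the selection deterministic, so the same hosts act as canaries on every deploy; shuffling first would rotate the risk across the fleet instead.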
Node membership to classes, and the nodes
themselves in a puppet configuration are not well
Once entered, parsing is the only option to retrieve the
machine list and associated “roles” from the SVN tree.
ldapnodes is a possible solution here.
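To illustrate what "parsing is the only option" means in practice, here is a deliberately naive Python sketch that pulls node names and their `include` lines out of manifest text. It assumes simple node blocks with no nested braces; nodes that declare resources inline are skipped by this pattern. The function name `parse_nodes` is hypothetical.

```python
import re

# Matches: node 'a', 'b' [inherits x] { ...body without nested braces... }
NODE_RE = re.compile(
    r"\bnode\s+(?P<names>[^{]+?)\s*(?:inherits\s+\S+\s*)?\{(?P<body>[^{}]*)\}"
)
INCLUDE_RE = re.compile(r"^\s*include\s+(.+)$", re.MULTILINE)


def parse_nodes(manifest_text):
    """Return {node_name: [class, ...]} for simple node definitions."""
    roles = {}
    for match in NODE_RE.finditer(manifest_text):
        names = [n.strip().strip("'\"") for n in match.group("names").split(",")]
        classes = []
        for line in INCLUDE_RE.findall(match.group("body")):
            # "include a, b" lists several classes on one line.
            classes.extend(c.strip() for c in line.split(",") if c.strip())
        for name in names:
            roles.setdefault(name, []).extend(classes)
    return roles
```

A regex like this breaks as soon as manifests use nested resources, comments, or variables, which is part of why scraping the tree is a poor substitute for a real node database.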
Node Class Changes
Still an unsolved problem
Removing a class definition from a node leaves all of the
configuration from the class behind
Have to re-kickstart the host to get to a base state
(*) the good news!
Our world is changing.
The end of the “Systems Administrator”
The beginning of “DevOps”
Stop Wasting Time
Start Delivering Great Ops Software
Stop administering individual machines.
Puppet definitions are code
Incorporate Cross-functional skills.
Build a bridge between your developers and the ops
Let’s fix this.
[Diagram: change workflow. Labels: initial commit; Generate Review; Ad-hoc tests; TEST; Test Integration; ~10% of hosts; Watch for failures!; Production, Final Review; Production, No Review.]
Testing / Staging
A test infrastructure is needed to ensure that updates
don’t kill production
People make mistakes
Treat the puppet config as if it were code
Restrict access to SVN tree itself (through ACLs)
Create a concept of an OWNER for each module and
manifest subdir; restrict access.
Enforce ownership during SVN checkin
Enforce a proper review process
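The ownership idea above could be enforced by a check along these lines. This is a sketch under assumptions: the `OWNERS` table, its path prefixes, and the user names are all hypothetical. In a real Subversion pre-commit hook, the committer and the changed paths would come from `svnlook author` and `svnlook changed`.

```python
# Hypothetical ownership table: path prefix -> users allowed to commit there.
OWNERS = {
    "modules/bind/": {"alice", "bob"},
    "modules/apache/": {"carol"},
}


def unauthorized_paths(committer, changed_paths, owners):
    """Return changed paths under an owned prefix that `committer` does not own."""
    denied = []
    for path in changed_paths:
        matches = [prefix for prefix in owners if path.startswith(prefix)]
        if not matches:
            continue  # unowned paths are allowed in this sketch
        # Longest prefix wins, so a subdirectory can override its parent.
        prefix = max(matches, key=len)
        if committer not in owners[prefix]:
            denied.append(path)
    return denied
```

A hook would reject the commit whenever the returned list is non-empty and print the offending paths, so the author knows whose review to request.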
SVN can be smarter
BIND (Verify zones, DNS, SOA++)
A mistake here is a full site outage
Verify puppet config
Create Reviewboard Entries
A script on each box to select the current branch
Set the branch (by modifying facter fact + config)
Show current branch
Enable or Disable puppetd in emergencies or ad-hoc
Visualize and centralize change
Keep teams informed
Prevent Unknown Interactions
Distrust puppet for creating user accounts
Build them from an LDAP infrastructure
Base package connects to LDAP and creates users
based on group and machine role
You still have to deal with RPMs creating system accounts
No machine database in puppet
We used Django, MySQL, but you could use LDAP
Role membership imported to DB by parsing existing
puppet definitions and special variables in the node
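A minimal sketch of such a machine database follows. The talk names Django and MySQL; this example swaps in Python's built-in `sqlite3` so it stays self-contained, and the table and function names are hypothetical. `roles_by_host` stands in for the output of whatever parser reads the manifests.

```python
import sqlite3


def load_roles(conn, roles_by_host):
    """Store host-to-role membership; reloading the same data is harmless."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS host_role ("
        "host TEXT NOT NULL, role TEXT NOT NULL, PRIMARY KEY (host, role))"
    )
    rows = [(host, role) for host, roles in roles_by_host.items() for role in roles]
    conn.executemany(
        "INSERT OR REPLACE INTO host_role (host, role) VALUES (?, ?)", rows
    )
    conn.commit()


def hosts_with_role(conn, role):
    """Return the sorted list of hosts that carry `role`."""
    cur = conn.execute(
        "SELECT host FROM host_role WHERE role = ? ORDER BY host", (role,)
    )
    return [row[0] for row in cur]
```

Once membership lives in a table, questions such as "which hosts are web servers?" become a query rather than a grep through the SVN tree.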
Ad hoc scripting
No facility in puppet for immediate execution of
command on many hosts
SSH in a loop is not a solution at scale
Threaded SSH system through our own tool
Uses Paramiko open source (Python)
see also: func
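The threaded approach can be sketched with the standard library's thread pool. This is not the tool described in the talk: `run_on_hosts` and the `runner` callable are hypothetical, and `runner` is where a real implementation would open the SSH session (for example with Paramiko's `SSHClient`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_hosts(hosts, command, runner, max_workers=32):
    """Run `command` on every host concurrently.

    `runner(host, command)` performs the remote call. The result maps each
    host to the runner's return value, or to the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, host, command): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:
                # One unreachable host must not abort the whole run.
                results[host] = exc
    return results
```

Capping `max_workers` keeps the control machine from opening thousands of connections at once, which is the practical difference from a plain SSH loop run in the background.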
Three complete puppetmasterd instances on each
puppet master machine, on different ports, pointed to
different SVN branches
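With one puppetmasterd per branch, selecting a branch on a client amounts to selecting a port. The sketch below assumes hypothetical branch names and port numbers (8140 is Puppet's default master port; the others are made up) and builds a `puppetd` command line using the `--server` and `--masterport` settings.

```python
# Hypothetical branch-to-port layout; the talk does not give the real values.
BRANCH_PORTS = {"production": 8140, "testing": 8141, "development": 8142}


def puppetd_args(server, branch, branch_ports=BRANCH_PORTS):
    """Build the puppetd command line for a one-off run against `branch`."""
    if branch not in branch_ports:
        raise ValueError("unknown branch: %s" % branch)
    port = branch_ports[branch]
    return ["puppetd", "--test", "--server", server, "--masterport", str(port)]
```

The returned list can be handed to `subprocess.call` as is; keeping it as a list avoids shell quoting problems.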
Handling many clients
the SVN tree (eliminate the SPOF)
Use more puppet servers
Rsync manifests, then run puppet
Selectively update hosts (func)
Puppet Web Server
Don’t run WEBrick (script/server) - too slow
Unicorn (best choice)