Puppet at High Scale
John Adams, Twitter
First, the bad news! (*)



(*) We’re on 0.25.4. Things may be different now.
Problems

 Sort of idempotent
 Ruby file transfer is inefficient
   Minimize use of recursion (home dirs, etc.)
 Single run is non-deterministic
   Order matters if order is specified
   Not specifying dependency order creates out of
   order delivery
Working Together

Puppet is great with a small team
Management is hard with a large number of admins
  Unforeseen interactions between changes
  No simple means of review
Security

 Anyone who can check into the tree can kill production
 with simple mistakes
 SVN access is effectively root equivalent
 Divergence from desired configuration through use of
 chattr +i puppetmanagedfile
   You can’t chattr +i with broken fingers.
Puppet DSL


Puppet DSL not Ruby enough
 Stated as a plus, but really a minus when most
 engineers expect Ruby
 Incomplete conditionals in the DSL
Cron
 Removing configuration for a cron job leaves the
 cronjob behind
 Need to specify ensure => absent
   If you forget the command with absent,
   duplicate cronjob entries can occur
 The vestigial tail of these “ensure absent” lines ends
 up living in the config long after they are needed
Cron + NTP
NTP synchronizes the system time across hosts
Cron granularity is one minute; jobs start at second zero
  Performance regression if puppet installs many jobs,
  across different modules, that all fire on the same
  zero second
Introduce random delays before jobs
  sleep $(($RANDOM % 60));
  do_something...
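
A variant of the random delay above, as a minimal sketch: instead of $RANDOM, derive a stable per-host offset from the hostname, so each host always starts at the same second while the fleet as a whole is spread across the window. The function name splay_seconds and the use of cksum are illustrative assumptions, not something the talk describes.

```shell
# splay_seconds HOST [WINDOW]: print a stable delay in [0, WINDOW) for HOST.
# Hypothetical helper; the slide itself uses $RANDOM.
splay_seconds() {
  local host="$1" window="${2:-60}"
  local sum
  # cksum prints "<checksum> <byte count>"; keep only the checksum.
  sum=$(printf '%s' "$host" | cksum | cut -d ' ' -f 1)
  echo $(( sum % window ))
}

# Usage in a cron command line (do_something is the slide's placeholder):
#   sleep "$(splay_seconds "$(hostname)" 60)"; do_something
```

A hashed offset keeps run times reproducible per host, which can make failures easier to correlate; $RANDOM spreads load just as well when that does not matter.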
Test and “Canary”

No facility in puppet for testing
  Monolithic design
Controlled Deploys are preferable to “full change”
  Use representative machines first
  Push to cluster when everything works.
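
One way to pick the first machines, sketched under assumptions: take every Nth host from a list. The helper name pick_canaries is hypothetical, and a plain every-Nth sample is only representative if the list is fed in per role (all web hosts, all database hosts, and so on).

```shell
# pick_canaries [N]: read hostnames on stdin, print the 1st, (N+1)th, (2N+1)th, ...
# With N=10 this selects roughly 10% of the list.
pick_canaries() {
  local every="${1:-10}"
  awk -v n="$every" '(NR - 1) % n == 0'
}

# Usage (webhosts.txt is a hypothetical one-host-per-line file):
#   pick_canaries 10 < webhosts.txt
```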
Machine Database

Node membership in classes, and the list of nodes
itself, are not well exposed in a puppet configuration.
Once entered, parsing the SVN tree is the only way to
retrieve the machine list and associated “roles”.
ldapnodes is a possible solution here.
Node Class Changes


Still an unsolved problem
Removing class definition from a node leaves all of the
configuration from the class behind
  Have to re-kickstart the host to get to a base state
Why Puppet?(*)



(*) the good news!
Configuration Management


Our world is changing.
The end of the “Systems Administrator”
The beginning of “DevOps”
Configuration Management

Consistent edits
Trackable Changes
Consistent ability to Rebuild
Find Variance
DevOps


Stop Wasting Time
Start Delivering Great Ops Software
Stop administering individual machines.
DevOps


Puppet definitions are code
Incorporate Cross-functional skills.
Build a bridge between your developers and the ops
team.
Let’s fix this.
Change Process
 Initial commit goes to HEAD
   Generate review
   Ad-hoc tests
Change Process
 Test integrate: HEAD to TEST
   ~10% of hosts
   Watch for failures!
   Test integration
Change Process
 Production integrate: TEST to PRODUCTION
   100% of hosts
   Final review
Change Process
 Cherry pick: HEAD to PRODUCTION (bypasses TEST)
   No review
Testing / Staging


 A test infrastructure is needed to ensure that updates
 don’t kill production
 People make mistakes
 Treat the puppet config as if it were code
Security

 Restrict access to SVN tree itself (through ACLs)
 Create a concept of an OWNER for each module and
 manifest subdir; restrict access.
 Enforce ownership during SVN checkin
 Enforce a proper review process
SVN can be smarter

Post-Commit checks
 BIND (Verify zones, DNS, SOA++)
   A mistake here is a full site outage
 Verify puppet config
 Create Reviewboard Entries
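
A minimal sketch of how a hook might decide which check to run for each changed path. The helper names and the *.zone naming convention are assumptions for illustration; the talk does not show the actual hook. The column layout assumed for `svnlook changed` output (status flags, then the path starting at column 5) should be verified against your Subversion version.

```shell
# check_kind PATH: print which check a changed path needs.
# The *.zone suffix is a hypothetical convention for BIND zone files.
check_kind() {
  case "$1" in
    *.pp)   echo manifest ;;
    *.zone) echo zone ;;
    *)      echo none ;;
  esac
}

# changed_paths: strip the leading status columns from
# `svnlook changed` output read on stdin.
changed_paths() {
  cut -c5-
}

# Usage inside a hook (REPOS and REV are the hook's arguments):
#   svnlook changed -r "$REV" "$REPOS" | changed_paths |
#   while read -r path; do
#     case "$(check_kind "$path")" in
#       manifest) : run a puppet syntax check on the file ;;
#       zone)     : run named-checkzone on the file ;;
#     esac
#   done
```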
puppet-util

 A script on each box to select the current branch
   Set the branch (by modifying facter fact + config)
   Show current branch
   Enable or Disable puppetd in emergencies or ad-hoc
   testing
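
The branch-selection part could look roughly like this. It is a sketch only: the state file path and function names are hypothetical, and the real tool modified a facter fact and the puppet config rather than a standalone file.

```shell
# Hypothetical state file holding the branch this host follows.
BRANCH_FILE="${BRANCH_FILE:-/etc/puppet/branch}"

# set_branch NAME: record the branch, accepting only the three known names.
set_branch() {
  case "$1" in
    HEAD|TEST|PRODUCTION) printf '%s\n' "$1" > "$BRANCH_FILE" ;;
    *) echo "unknown branch: $1" >&2; return 1 ;;
  esac
}

# show_branch: print the recorded branch, defaulting to PRODUCTION.
show_branch() {
  if [ -r "$BRANCH_FILE" ]; then
    cat "$BRANCH_FILE"
  else
    echo PRODUCTION
  fi
}
```

Rejecting unknown names keeps a typo from silently pointing a host at a branch that does not exist.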
Reviewboard

www.reviewboard.org


Visualize and centralize change
Keep teams informed
Prevent Unknown Interactions
User Security

 Distrust puppet for creating user accounts
 Build them from an LDAP infrastructure
 Base package connects to LDAP and creates users
 based on group and machine role
   You still have to deal with RPMs creating system
   users
Machine Database

No machine database in puppet
  We used Django, MySQL, but you could use LDAP
Role membership imported to DB by parsing existing
puppet definitions and special variables in the node
stanza
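
A rough sketch of the parsing step, assuming simple node stanzas with one `include` per line. The function name list_node_roles is hypothetical, and this ignores inheritance, comma-separated includes, and the special variables the slide mentions; a real importer would need more than this.

```shell
# list_node_roles [FILE...]: print "node class" pairs from puppet node stanzas.
# Reads stdin when no files are given.
list_node_roles() {
  cat "$@" | tr -d "\"'" | awk '
    $1 == "node"                  { node = $2; sub(/[{]$/, "", node); next }
    $1 == "}"                     { node = "" }
    node != "" && $1 == "include" { print node, $2 }
  '
}

# Usage (nodes.pp is a hypothetical manifest file):
#   list_node_roles nodes.pp
```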
Ad hoc scripting

 No facility in puppet for immediate execution of
 command on many hosts
 SSH in a loop is not a solution at scale
 Threaded SSH system through our own tool
   Uses Paramiko open source (Python)
 see also: func
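
For comparison, bounded parallelism can be sketched in shell with `xargs -P`. This is not the threaded Paramiko tool from the talk, just a smaller stand-in for the same idea; the function name run_parallel is hypothetical, and `-P` is an extension found in GNU and BSD xargs rather than POSIX.

```shell
# run_parallel JOBS CMD...: read one host per line on stdin and run CMD
# for each, at most JOBS at a time. '{}' in CMD is replaced by the host.
run_parallel() {
  local jobs="$1"
  shift
  xargs -P "$jobs" -I '{}' "$@"
}

# Usage (hosts.txt is a hypothetical one-host-per-line file):
#   run_parallel 20 ssh -n -o BatchMode=yes '{}' uptime < hosts.txt
```

Output from different hosts arrives in completion order, so a real tool would tag each line with its host and collect exit codes.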
Multiple Instances

 Three complete puppetmasterd instances on each
 puppet master machine, on different ports, pointed to
 different SVN branches
   HEAD
   TEST
   PRODUCTION
Handling many clients

Distribute:
  the SVN tree (eliminate the SPOF)
  Use more puppet servers
Rsync manifests, then run puppet
  Selectively update hosts (func)
Puppet Web Server

Don’t run WEBrick (script/server) - too slow
  Unicorn (best choice)
  Passenger (mod_rails)
  mongrel?
Distributed Puppet

[Diagram: one SVN repository feeding three puppet masters (PM), each serving its own group of hosts]
Distributed Puppet

Too many clients eventually overwhelm the Master
You must deploy more hosts
Distribute cron jobs
  Randomize start times
Distribute the master itself
Questions?

John Adams Puppet Camp 2010
