puppet @ 100,000+ agents 
John Jawed (“JJ”) 
eBay/PayPal
but I don’t have 100,000 agents 
the issues ahead are encountered at <1000 agents
me 
responsible for Puppet/Foreman @ eBay 
how I got here: 
engineer -> engineer with root access -> system/infrastructure 
engineer
free time: PuppyConf
puppet @ eBay, quick facts 
-> perhaps the largest Puppet deployment 
-> more definitively the most diverse 
-> manages core security 
-> trying to solve the “p100k” problems
#’s 
• 100K+ agents 
– Solaris, Linux, and Windows 
– Production & QA 
– Cloud (OpenStack & VMware) + bare metal 
• 32 different OS versions, 43 hardware configurations 
– Over 300 permutations in production 
• Countless apps from C/C++ to Hadoop 
– Some applications are 15+ years old
currently 
• 3-4 puppet masters per data center 
• foreman for ENC, statistics, and fact collection 
• 150+ puppet runs per second 
• separate git repos per environment, common core 
modules 
– caching git daemon used by ppm’s
nodes growing, sometimes violently 
linear growth trendline
setup puppetmasters 
setup puppet master, it’s the CA too 
sign and run 400 agents concurrently, that’s less than 
half a percent of all the nodes you need to get 
through.
not exactly puppet issues 
entropy unavailable 
crypto is CPU heavy (heavier than you would ever 
believe) 
passenger children are all busy
OK, let’s setup separate hosts which only function as a 
CA
multiple dedicated CA’s 
much better, this distributed the CPU and I/O load and 
helped the entropy problem. 
the PPM’s can handle actual puppet agent runs 
because they aren’t tied up signing. Great!
wait, how do the CA’s know about each others certs? 
some sort of network file system (NFS sounds okay).
shared storage for CA cluster 
-> Get a list of pending signing requests (should be small!) 
# puppet cert list 
… 
wait 
… 
wait 
…
optimize CA’s for large # of certs 
Traversing a large # of certs is too slow over NFS. 
-> Profile 
-> Implement optimization 
-> Get patch accepted (PUP-1665, 8x improvement)
<3 puppetlabs team
optimizing foreman 
- read heavy is fine, DB’s do it well. 
- read heavy in a write heavy environment is more challenging. 
- foreman writes a lot of log, fact, and report data post puppet run. 
- majority of requests are to get ENC data 
- use makara with PG read slaves 
(https://github.com/taskrabbit/makara) to scale ENC requests 
- Needs updates to foreigner (gem) 
- If ENC requests are slow, puppetmasters fall over.
optimizing foreman 
ENC requests load balanced to read slaves 
fact/report/host info write requests sent to master 
makara knows how to arbitrate the connection (great 
job TaskRabbit team!)
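
The read/write split described above can be pictured with a small routing sketch. This is not makara's API (makara is a Ruby gem); the class name, host names, and the "read"/"write" labels are all hypothetical, and the sketch only illustrates the idea of sending ENC reads to replicas and everything else to the master.

```python
import itertools


class ReadWriteRouter:
    """Route reads round-robin across replicas and all writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, kind):
        """Return the database that should serve a request of the given kind."""
        if kind == "read":
            return next(self._replicas)
        return self.primary


# Hypothetical host names: ENC lookups are reads, reports and facts are writes.
router = ReadWriteRouter("pg-master", ["pg-read-1", "pg-read-2"])
enc_targets = [router.route("read") for _ in range(4)]
report_target = router.route("write")
```

In this sketch the ENC lookups alternate between the two replicas while the report write always lands on the master, which is the behaviour the slide attributes to makara.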
more optimizations 
make sure RoR cache is set to use dalli 
(config.cache_store = :dalli_store), see foreman wiki 
fact collection optimization (already in upstream); 
without this, reporting facts back to foreman can kill a 
busy puppetmaster! (if you care: 
https://github.com/theforeman/puppet-foreman/pull/145)
<3 the foreman team
let’s add more nodes 
Adding another 30,000 nodes (that’s 30% coverage). 
Agent setup: pretty standard stuff, puppet agent as a 
service.
results 
average puppet run: 29 seconds. 
not horrible. but average latency is a lie because that 
usually represents the mean average (sum of N / N). 
the actual puppet run graph looks more like…
curve impossible 
No one in operations or infrastructure ever wants a service runtime graph like this. 
[graph: distribution of puppet run times, with the mean average marked]
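
A small sketch of why the mean hides the shape of the curve. The run times below are made up for illustration (they are not eBay measurements) and are chosen so the mean matches the 29 seconds quoted earlier; `statistics.quantiles` needs Python 3.8 or newer.

```python
import statistics

# Hypothetical run times in seconds: most runs are fast, a few are very slow.
run_times = [8] * 90 + [218] * 10

mean = statistics.mean(run_times)
median = statistics.median(run_times)
# quantiles(n=100) returns the 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(run_times, n=100)[94]

print("mean=%s median=%s p95=%s" % (mean, median, p95))
```

With this sample the mean is 29 seconds, yet the median run takes 8 seconds and the 95th percentile takes 218, so the "average" describes almost no actual run.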
PPM running @ medium load 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
16765 puppet 20 0 341m 76m 3828 S 53.0 0.1 67:14.92 ruby 
17197 puppet 20 0 343m 75m 3828 S 40.7 0.1 62:50.01 ruby 
17174 puppet 20 0 353m 78m 3996 S 38.7 0.1 70:07.44 ruby 
16330 puppet 20 0 338m 74m 3828 S 33.8 0.1 66:08.81 ruby 
17231 puppet 20 0 344m 75m 3820 S 29.8 0.1 70:00.47 ruby 
17238 puppet 20 0 353m 76m 3996 S 29.8 0.1 69:11.94 ruby 
17187 puppet 20 0 343m 76m 3820 S 26.2 0.1 70:48.66 ruby 
17156 puppet 20 0 353m 75m 3984 S 25.8 0.1 64:44.62 ruby 
… system processes
60 seconds later…idle 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
17343 puppet 20 0 344m 77m 3828 S 11.6 0.1 74:47.23 ruby 
31152 puppet 20 0 203m 9048 2568 S 11.3 0.0 0:03.67 httpd 
29435 puppet 20 0 203m 9208 2668 S 10.9 0.0 0:05.46 httpd 
16220 puppet 20 0 337m 74m 3828 S 10.3 0.1 70:07.42 ruby 
16354 puppet 20 0 339m 75m 3816 S 10.3 0.1 62:11.71 ruby 
… system processes
120 seconds later…thrashing 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
16765 puppet 20 0 341m 76m 3828 S 94.0 0.1 67:14.92 ruby 
17197 puppet 20 0 343m 75m 3828 S 93.7 0.1 62:50.01 ruby 
17174 puppet 20 0 353m 78m 3996 S 92.7 0.1 70:07.44 ruby 
16330 puppet 20 0 338m 74m 3828 S 90.8 0.1 66:08.81 ruby 
17231 puppet 20 0 344m 75m 3820 S 89.8 0.1 70:00.47 ruby 
17238 puppet 20 0 353m 76m 3996 S 89.8 0.1 69:11.94 ruby 
17187 puppet 20 0 343m 76m 3820 S 88.2 0.1 70:48.66 ruby 
17156 puppet 20 0 353m 75m 3984 S 87.8 0.1 64:44.62 ruby 
17152 puppet 20 0 353m 75m 3984 S 86.3 0.1 64:44.62 ruby 
17153 puppet 20 0 353m 75m 3984 S 85.3 0.1 64:44.62 ruby 
17151 puppet 20 0 353m 75m 3984 S 82.9 0.1 64:44.62 ruby 
… more ruby processes
what we really want 
A flat consistent runtime curve, this is important for any production service. 
Without predictability there is no reliability!
consistency @ medium load 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
16765 puppet 20 0 341m 76m 3828 S 53.0 0.1 67:14.92 ruby 
17197 puppet 20 0 343m 75m 3828 S 40.7 0.1 62:50.01 ruby 
17174 puppet 20 0 353m 78m 3996 S 38.7 0.1 70:07.44 ruby 
16330 puppet 20 0 338m 74m 3828 S 33.8 0.1 66:08.81 ruby 
17231 puppet 20 0 344m 75m 3820 S 29.8 0.1 70:00.47 ruby 
17238 puppet 20 0 353m 76m 3996 S 29.8 0.1 69:11.94 ruby 
17187 puppet 20 0 343m 76m 3820 S 26.2 0.1 70:48.66 ruby 
17156 puppet 20 0 353m 75m 3984 S 25.8 0.1 64:44.62 ruby 
… system processes
hurdle: runinterval 
near impossible to get a flat curve because of uneven 
and chaotic agent run distribution. 
runinterval is non-deterministic … even if you manage 
to sync up service times, eventually it gets nebulous again.
the puppet agent daemon approach is not going to 
work.
plan A: puppet via cron 
generate run time based on some deterministic agent data 
point (IP, MAC address, hostname, etc.). 
For example, if you wanted a puppet run every 30 minutes, your 
crontab may look like: 
08 * * * * puppet agent -t 
38 * * * * puppet agent -t
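
One way to derive such minutes, sketched under assumptions: hashing the hostname with SHA-256 is my choice for illustration, not necessarily what was used at eBay, and the function names are hypothetical.

```python
import hashlib


def cron_minutes(hostname, interval=30):
    """Derive stable cron minute offsets for a host from a hash of its name."""
    if 60 % interval != 0:
        raise ValueError("interval must divide 60 evenly")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The same hostname always maps to the same offset within the interval.
    offset = int.from_bytes(digest[:4], "big") % interval
    return [offset + step for step in range(0, 60, interval)]


def crontab_lines(hostname, interval=30):
    """Render one crontab entry per run within the hour."""
    return ["%02d * * * * puppet agent -t" % minute
            for minute in cron_minutes(hostname, interval)]


for line in crontab_lines("jj.e.com"):
    print(line)
```

The offset is stable per host, so run times become deterministic, but cron resolution is one minute, so every host sharing an offset still hits the masters together.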
plan A yields 
Fewer and predictable spikes
Improved. 
But it does not scale: cronjobs make run times 
deterministic but do not distribute them evenly.
eliminate all masters? masterless puppet 
kicking the can down the road; somewhere, 
infrastructure still has to serve the files and catalog to 
agents. 
masterless puppet creates a whole host of other 
issues (file transfer channels, catalog compiler host).
eliminate all masters? masterless puppet 
…and the same issues exist, albeit in different 
forms. 
shifts problems to “compile interval” and 
“manifest/module push interval”.
plan Z: increase your runinterval 
Z, the zombie apocalypse plan (do not do this!). 
delaying failure till you are no longer responsible for it 
(hopefully).
alternate setups 
SSL termination on load balancer – expensive 
- LB’s are difficult to deploy, cost more (you still 
need fail over otherwise it’s a SPoF!) 
caching – cache is meant to make things faster, not 
required to work. If cache is required to make services 
functional, you are solving the wrong problem.
zen moment 
maybe the issue isn’t about timing the agent from 
the host. 
maybe the issue is that the agent doesn’t know when 
there’s enough capacity to reliably and predictably run 
puppet.
enforcing states is delayed 
runinterval/cronjobs/masterless setups still render 
puppet as a suboptimal solution in a state sensitive 
environment (customer and financial data). 
the problem is not unique to puppet. salt, coreOS, et 
al. are susceptible.
security trivia 
web service REST3DotOh just got compromised and 
allows a sensitive file managed by puppet to be 
manipulated. 
Q: how/when does puppet set the proper state?
the how; sounds awesome 
A: every puppet run ensures that a file is in its 
intended state and records the previous state if it was 
not.
the when; sounds far from awesome 
A: whenever puppet is scheduled to run next. up to 
runinterval minutes from the compromise, masterless 
push, or cronjob execution.
smaller intervals help but… 
all the strategies have one common issue: 
puppet masters do not scale with smaller intervals, 
which exacerbate spikes in the runtime curve.
this needs to change
pvc 
“pvc” – open source & lightweight process for a 
deterministic and evenly distributed puppet service 
curve… 
…and reactive state enforcement puppet runs.
pvc 
a different approach that executes puppet runs based on 
available capacity and local state changes. 
pings from an agent to check if it’s time to run puppet. 
file monitoring to force puppet runs when important files 
change outside of puppet (think /etc/shadow, 
/etc/sudoers).
pvc 
basic concepts: 
- Frequent pings to determine when to run puppet 
- Tied in to backend PPM health/capacity 
- Frequent fact collection without needing to run puppet 
- Sensitive files should be subject to monitoring 
- on change or updates outside of puppet, immediately run 
puppet! 
- efficiency an important factor.
pvc advantages 
-> variable puppet agent run timing 
- allows the flat and predictable service curve (what we 
want). 
- more frequent puppet runs when capacity is available, 
less frequent puppet runs when less capacity is available.
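
A minimal sketch of the capacity idea, assuming the backend knows how many runs are in flight. The function, its parameters, and the thresholds are hypothetical; the slides do not show the real backend logic, which is tied to internal PPM health.

```python
def should_run(active_runs, max_concurrent_runs, seconds_since_last_run,
               min_interval=300):
    """Decide whether a pinging agent should run puppet now.

    All names and thresholds here are illustrative assumptions.
    """
    if seconds_since_last_run < min_interval:
        # The host ran very recently; another run gains nothing.
        return False
    # Grant a run only while the masters have spare capacity.
    return active_runs < max_concurrent_runs


print(should_run(active_runs=10, max_concurrent_runs=50, seconds_since_last_run=600))
print(should_run(active_runs=50, max_concurrent_runs=50, seconds_since_last_run=600))
```

Because agents ping frequently, a host refused now simply asks again on its next ping, so runs happen more often when the masters are idle and less often when they are busy.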
pvc advantages 
-> improves security (kind of a big deal these days) 
- puppet runs when state changes rather than waiting to 
run. 
- efficient, uses inotify to monitor files. 
- if a file being monitored is changed, a puppet run is 
forced.
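
pvc uses inotify; the sketch below swaps in a simpler technique, comparing content hashes against a baseline, because it needs only the Python standard library. The function names are hypothetical, and a temporary file stands in for a monitored file such as /etc/sudoers.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Return a SHA-256 hex digest of the file contents, or None if missing."""
    try:
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()
    except FileNotFoundError:
        return None


def changed_files(baseline):
    """Compare current fingerprints with a baseline dict; list paths that differ."""
    return [path for path, digest in baseline.items()
            if fingerprint(path) != digest]


# Demonstration on a temporary file standing in for a monitored file.
with tempfile.TemporaryDirectory() as tmp:
    watched = os.path.join(tmp, "sudoers")
    with open(watched, "w") as handle:
        handle.write("root ALL=(ALL) ALL\n")
    baseline = {watched: fingerprint(watched)}
    before = changed_files(baseline)
    # Simulate a change made outside of puppet.
    with open(watched, "a") as handle:
        handle.write("intruder ALL=(ALL) NOPASSWD: ALL\n")
    after = changed_files(baseline)
    force_puppet_run = bool(after)
```

Polling like this costs a read per check, which is why an event-driven mechanism such as inotify is the more efficient choice on Linux.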
pvc advantages 
- orchestration between foreman & puppet 
- controlled rollout of changes 
- upload facts between puppet runs into foreman
pvc – backend 
3 endpoints – all get the ?fqdn=<certname> parameter 
GET /host – should pvc run puppet or facter? 
POST /report – raw puppet run output, and which monitored 
files were changed 
POST /facts – facter output (puppet facts in JSON)
pvc – /host 
> curl http://hi.com./host?fqdn=jj.e.com 
< PVC_RETURN=0 
< PVC_RUN=1 
< PVC_PUPPET_MASTER=puppet.vip.e.com 
< PVC_FACT_RUN=0 
< PVC_CHECK_INTERVAL=60 
< PVC_FILES_MONITORED="/etc/security/access.conf /etc/passwd"
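
A sketch of how a client might read that response, assuming it is plain KEY=VALUE lines with optional shell-style quoting. Treating PVC_RUN=1 as "run puppet now" is my reading of the example, and the function name is hypothetical.

```python
import shlex


def parse_host_response(body):
    """Parse KEY=VALUE lines (values may be double-quoted) into a dict."""
    settings = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex.split removes shell-style quoting around the value.
        parts = shlex.split(raw)
        settings[key] = parts[0] if parts else ""
    return settings


sample = """PVC_RETURN=0
PVC_RUN=1
PVC_PUPPET_MASTER=puppet.vip.e.com
PVC_FACT_RUN=0
PVC_CHECK_INTERVAL=60
PVC_FILES_MONITORED="/etc/security/access.conf /etc/passwd"
"""

settings = parse_host_response(sample)
run_puppet = settings["PVC_RUN"] == "1"  # assumption: 1 means run now
monitored = settings["PVC_FILES_MONITORED"].split()
```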
pvc – /facts 
allows collecting of facts outside of the normal puppet 
run, useful for monitoring. 
set PVC_FACT_RUN to report facts back to the pvc 
backend.
pvc – git for auditing 
push actual changes between runs into git 
- branch per host, parentless branches & commits 
are cheap. 
- easy to audit fact changes (fact blacklist to 
prevent spam) and changes between puppet runs. 
- keeping puppet reports between runs is not 
helpful.
pvc – incremental rollouts 
select candidate hosts based on your criteria and set an environment variable 
via the /host endpoint output: 
FACTER_UPDATE_FLAG=true 
in your manifest, check: 
if $::update_flag { 
… 
}
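
One possible way to select candidate hosts, sketched under assumptions: bucketing hosts by a hash of their FQDN is my illustration, not the criteria used in the talk, and the function names and salt are hypothetical.

```python
import hashlib


def in_rollout(fqdn, percent, salt="update-flag"):
    """Return True if the host falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256((salt + ":" + fqdn).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def host_env(fqdn, percent):
    """Extra KEY=VALUE lines a /host response might carry for this host."""
    return ["FACTER_UPDATE_FLAG=true"] if in_rollout(fqdn, percent) else []


# Raising the percentage over time widens the rollout without reshuffling hosts.
print(host_env("jj.e.com", 10))
print(host_env("jj.e.com", 100))
```

A host's bucket never changes, so any host selected at 10% stays selected at 50%, which keeps an incremental rollout monotonic.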
example pvc.conf 
host_endpoint=http://jj.e.com./host 
report_endpoint=http://jj.e.com./report 
facts_endpoint=http://jj.e.com./facts 
info=1 
warnings=1
pvc – available on github 
$ git clone https://github.com/johnj/pvc 
make someone happy, achieve:
wishlist 
stuff pvc should probably have: 
• authentication of some sort 
• a more general backend, currently tightly integrated 
into internal PPM infrastructure health 
• whatever other users wish it had
misc. lessons learned 
your ENC has to be fast, or your puppetmasters fail 
without ever doing anything. 
upgrade ruby to 2.x for the performance improvements. 
serve static module files with a caching http server 
(nginx).
contact 
@johnjawed 
https://github.com/johnj 
jj@x.com

 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Puppet Availability and Performance at 100K Nodes - PuppetConf 2014

  • 1. puppet @ 100,000+ agents John Jawed (“JJ”) eBay/PayPal
  • 2. but I don’t have 100,000 agents issues ahead encountered at <1000 agents
  • 3. me responsible for Puppet/Foreman @ eBay how I got here: engineer -> engineer with root access -> system/infrastructure engineer
  • 5. puppet @ eBay, quick facts -> perhaps the largest Puppet deployment -> more definitively the most diverse -> manages core security -> trying to solve the “p100k” problems
  • 6. #’s • 100K+ agents – Solaris, Linux, and Windows – Production & QA – Cloud (openstack & VMware) + bare metal • 32 different OS versions, 43 hardware configurations – Over 300 permutations in production • Countless apps from C/C++ to Hadoop – Some applications over 15+ years old
  • 7. currently • 3-4 puppet masters per data center • foreman for ENC, statistics, and fact collection • 150+ puppet runs per second • separate git repos per environment, common core modules – caching git daemon used by ppm’s
  • 9. nodes growing, sometimes violently (graph: linear growth trendline)
  • 11. setup puppetmasters setup puppet master, it’s the CA too sign and run 400 agents concurrently, that’s less than half a percent of all the nodes you need to get through.
  • 13. not exactly puppet issues entropy unavailable crypto is CPU heavy (heavier than you ever have and still believe) passenger children are all busy
  • 14. OK, let’s setup separate hosts which only function as a CA
  • 15. multiple dedicated CA’s much better, distributed the CPU I/O and helped the entropy problem. the PPM’s can handle actual puppet agent runs because they aren’t tied up signing. Great!
  • 16. wait, how do the CA’s know about each others certs? some sort of network file system (NFS sounds okay).
  • 17. shared storage for CA cluster -> Get a list of pending signing requests (should be small!) # puppet cert list … wait … wait …
  • 19. optimize CA’s for large # of certs Traversing a large # of certs is too slow over NFS. -> Profile -> Implement optimization -> Get patch accepted (PUP-1665, 8x improvement)
  • 21. optimizing foreman - read heavy is fine, DB’s do it well. - read heavy in a write heavy environment is more challenging. - foreman writes a lot of log, fact, and report data post puppet run. - majority of requests are to get ENC data - use makara with PG read slaves (https://github.com/taskrabbit/makara) to scale ENC requests - Needs updates to foreigner (gem) - If ENC requests are slow, puppetmasters fall over.
  • 22. optimizing foreman ENC requests load balanced to read slaves fact/report/host info write requests sent to master makara knows how to arbitrate the connection (great job TaskRabbit team!)
  • 23. more optimizations make sure RoR cache is set to use dalli (config.cache_store = :dalli_store), see foreman wiki. fact collection optimization (already in upstream): without this, reporting facts back to foreman can kill a busy puppetmaster! (if you care: https://github.com/theforeman/puppet-foreman/pull/145)
  • 25. let’s add more nodes Adding another 30,000 nodes (that’s 30% coverage). Agent setup: pretty standard stuff, puppet agent as a service.
  • 26. results average puppet run: 29 seconds. not horrible. but average latency is a lie because it usually represents the arithmetic mean (sum of N run times / N). the actual puppet run graph looks more like…
  • 27. curve impossible No one in operations or infrastructure ever wants a service runtime graph like this. (graph label: mean average)
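To make the point of slides 26 and 27 concrete, here is a minimal sketch of how a mean can hide a spiky run-time curve. The sample data is hypothetical (it is not eBay's measurements) and was picked only so that the mean lands on the 29 seconds quoted on slide 26; `summarize` is an illustrative helper name.

```python
import math
import statistics


def summarize(run_times):
    """Return the mean, median, and nearest-rank 95th percentile (seconds)."""
    ordered = sorted(run_times)
    # Nearest-rank percentile: the smallest sample with at least 95% of
    # all samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Hypothetical run times: 18 quick runs and 2 runs caught in a spike.
samples = [12] * 18 + [182] * 2
summary = summarize(samples)
print(summary)
```

With this made-up data the mean is 29 seconds, while the median is 12 and the 95th percentile is 182, so the "not horrible" average says little about what the slowest agents experience.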
  • 28. PPM running @ medium load
      PID    USER    PR  NI  VIRT  RES   SHR   S  %CPU  %MEM  TIME+     COMMAND
      16765  puppet  20  0   341m  76m   3828  S  53.0  0.1   67:14.92  ruby
      17197  puppet  20  0   343m  75m   3828  S  40.7  0.1   62:50.01  ruby
      17174  puppet  20  0   353m  78m   3996  S  38.7  0.1   70:07.44  ruby
      16330  puppet  20  0   338m  74m   3828  S  33.8  0.1   66:08.81  ruby
      17231  puppet  20  0   344m  75m   3820  S  29.8  0.1   70:00.47  ruby
      17238  puppet  20  0   353m  76m   3996  S  29.8  0.1   69:11.94  ruby
      17187  puppet  20  0   343m  76m   3820  S  26.2  0.1   70:48.66  ruby
      17156  puppet  20  0   353m  75m   3984  S  25.8  0.1   64:44.62  ruby
      … system processes
  • 29. 60 seconds later…idle
      PID    USER    PR  NI  VIRT  RES   SHR   S  %CPU  %MEM  TIME+     COMMAND
      17343  puppet  20  0   344m  77m   3828  S  11.6  0.1   74:47.23  ruby
      31152  puppet  20  0   203m  9048  2568  S  11.3  0.0   0:03.67   httpd
      29435  puppet  20  0   203m  9208  2668  S  10.9  0.0   0:05.46   httpd
      16220  puppet  20  0   337m  74m   3828  S  10.3  0.1   70:07.42  ruby
      16354  puppet  20  0   339m  75m   3816  S  10.3  0.1   62:11.71  ruby
      … system processes
  • 30. 120 seconds later…thrashing
      PID    USER    PR  NI  VIRT  RES   SHR   S  %CPU  %MEM  TIME+     COMMAND
      16765  puppet  20  0   341m  76m   3828  S  94.0  0.1   67:14.92  ruby
      17197  puppet  20  0   343m  75m   3828  S  93.7  0.1   62:50.01  ruby
      17174  puppet  20  0   353m  78m   3996  S  92.7  0.1   70:07.44  ruby
      16330  puppet  20  0   338m  74m   3828  S  90.8  0.1   66:08.81  ruby
      17231  puppet  20  0   344m  75m   3820  S  89.8  0.1   70:00.47  ruby
      17238  puppet  20  0   353m  76m   3996  S  89.8  0.1   69:11.94  ruby
      17187  puppet  20  0   343m  76m   3820  S  88.2  0.1   70:48.66  ruby
      17156  puppet  20  0   353m  75m   3984  S  87.8  0.1   64:44.62  ruby
      17152  puppet  20  0   353m  75m   3984  S  86.3  0.1   64:44.62  ruby
      17153  puppet  20  0   353m  75m   3984  S  85.3  0.1   64:44.62  ruby
      17151  puppet  20  0   353m  75m   3984  S  82.9  0.1   64:44.62  ruby
      … more ruby processes
  • 32. what we really want A flat consistent runtime curve, this is important for any production service. Without predictability there is no reliability!
  • 33. consistency @ medium load
      PID    USER    PR  NI  VIRT  RES   SHR   S  %CPU  %MEM  TIME+     COMMAND
      16765  puppet  20  0   341m  76m   3828  S  53.0  0.1   67:14.92  ruby
      17197  puppet  20  0   343m  75m   3828  S  40.7  0.1   62:50.01  ruby
      17174  puppet  20  0   353m  78m   3996  S  38.7  0.1   70:07.44  ruby
      16330  puppet  20  0   338m  74m   3828  S  33.8  0.1   66:08.81  ruby
      17231  puppet  20  0   344m  75m   3820  S  29.8  0.1   70:00.47  ruby
      17238  puppet  20  0   353m  76m   3996  S  29.8  0.1   69:11.94  ruby
      17187  puppet  20  0   343m  76m   3820  S  26.2  0.1   70:48.66  ruby
      17156  puppet  20  0   353m  75m   3984  S  25.8  0.1   64:44.62  ruby
      … system processes
  • 34. hurdle: runinterval near impossible to get a flat curve because of uneven and chaotic agent run distribution. runinterval is non-deterministic … even if you manage to sync up service times eventually it’s nebulous.
  • 35. the puppet agent daemon approach is not going to work.
  • 36. plan A: puppet via cron generate the run time from some deterministic agent data point (IP, MAC address, hostname, etc.). e.g., if you wanted a puppet run every 30 minutes, your crontab may look like:
      08 * * * * puppet agent -t
      38 * * * * puppet agent -t
  • 37. plan A yields Fewer and predictable spikes
  • 38. Improved. But does not scale because cronjobs help run times become deterministic but lack even distribution.
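Plan A (slide 36) derives the cron minutes from a deterministic agent data point. A minimal sketch of one way to do that, assuming the hostname is the data point and a SHA-256 hash picks the offset; the function names and the choice of hash are illustrative assumptions, not something the slides specify.

```python
import hashlib


def cron_minutes(hostname, interval_minutes=30):
    """Derive stable minutes-of-the-hour for a host from its hostname."""
    if interval_minutes <= 0 or 60 % interval_minutes != 0:
        raise ValueError("interval_minutes must divide 60 evenly")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The same hostname always hashes to the same offset in the interval.
    offset = int.from_bytes(digest[:4], "big") % interval_minutes
    return list(range(offset, 60, interval_minutes))


def crontab_lines(hostname, interval_minutes=30, command="puppet agent -t"):
    """Render one crontab entry per derived minute."""
    minutes = cron_minutes(hostname, interval_minutes)
    return [f"{minute:02d} * * * * {command}" for minute in minutes]


for line in crontab_lines("jj.e.com"):
    print(line)
```

Each host keeps the same two slots per hour across reboots and reinstalls. As slide 38 notes, this makes run times deterministic but does nothing to react to how busy the puppet masters actually are.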
  • 39. eliminate all masters? masterless puppet kicking the can down the road, somewhere infrastructure still has to serve the files and catalog to agents. masterless puppet creates a whole host of other issues (file transfer channels, catalog compiler host).
  • 40. eliminate all masters? masterless puppet …and the same issues exist, albeit in different forms. shifts problems to “compile interval” and “manifest/module push interval”.
  • 41. plan Z: increase your runinterval Z, the zombie apocalypse plan (do not do this!). delaying failure till you are no longer responsible for it (hopefully).
  • 42. alternate setups SSL termination on load balancer – expensive - LB’s are difficult to deploy, cost more (you still need fail over otherwise it’s a SPoF!) caching – cache is meant to make things faster, not required to work. If cache is required to make services functional, solving the wrong problem.
  • 43. zen moment maybe the issue isn’t about timing the agent from the host. maybe the issue is that the agent doesn’t know when there’s enough capacity to reliably and predictably run puppet.
  • 44. enforcing states is delayed runinterval/cronjobs/masterless setups still render puppet as a suboptimal solution in a state sensitive environment (customer and financial data). the problem is not unique to puppet. salt, coreOS, et al. are susceptible.
  • 45. security trivia web service REST3DotOh just got compromised and allows a sensitive file managed by puppet to be manipulated. Q: how/when does puppet set the proper state?
  • 46. the how; sounds awesome A: every puppet run ensures that a file is in its intended state and records the previous state if it was not.
  • 47. the when; sounds far from awesome A: whenever puppet is scheduled to run next. up to runinterval minutes from the compromise, masterless push, or cronjob execution.
  • 48. smaller intervals help but… all the strategies have one common issue: puppet masters do not scale with smaller intervals, which exacerbate spikes in the runtime curve.
  • 49. this needs to change
  • 50. pvc “pvc” – open source & lightweight process for a deterministic and evenly distributed puppet service curve… …and reactive state enforcement puppet runs.
  • 51. pvc a different approach that executes puppet runs based on available capacity and local state changes. pings from an agent to check if it’s time to run puppet. file monitoring to force puppet runs when important files change outside of puppet (think /etc/shadow, /etc/sudoers).
  • 52. pvc basic concepts: - Frequent pings to determine when to run puppet - Tied in to backend PPM health/capacity - Frequent fact collection without needing to run puppet - Sensitive files should be subject to monitoring - on change or updates outside of puppet, immediately run puppet! - efficiency an important factor.
  • 53. pvc advantages -> variable puppet agent run timing - allows the flat and predictable service curve (what we want). - more frequent puppet runs when capacity is available, less frequent puppet runs when less capacity is available.
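The capacity-driven timing on slide 53 could be sketched as a backend decision made on each agent ping. Everything below is a hypothetical policy for illustration: the function name, the worker-utilization metric, and all thresholds are assumptions, and the slides do not describe how pvc's backend actually decides.

```python
def should_run_puppet(seconds_since_last_run, busy_workers, total_workers,
                      min_interval=300, max_interval=3600, busy_threshold=0.8):
    """Decide whether a pinging agent should run puppet now (hypothetical)."""
    if total_workers <= 0:
        raise ValueError("total_workers must be positive")
    # Assumed ceiling: a host is never left unmanaged beyond max_interval.
    if seconds_since_last_run >= max_interval:
        return True
    # Assumed floor: a host is never asked to run more often than this.
    if seconds_since_last_run < min_interval:
        return False
    # In between, run only while the masters have spare capacity.
    utilization = busy_workers / total_workers
    return utilization < busy_threshold


print(should_run_puppet(600, busy_workers=2, total_workers=10))
print(should_run_puppet(600, busy_workers=9, total_workers=10))
```

Under a policy like this, hosts run more often while masters are idle and back off toward the ceiling as load rises, which is the behaviour the slide describes.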
  • 54. pvc advantages -> improves security (kind of a big deal these days) - puppet runs when state changes rather than waiting to run. - efficient, uses inotify to monitor files. - if a file being monitored is changed, a puppet run is forced.
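Slide 54 says pvc uses inotify to notice changes to monitored files. The Python standard library has no inotify binding, so the sketch below swaps in a simpler technique, polling content hashes, purely to illustrate the "detect a change, then force a puppet run" idea. The helper names are hypothetical, and polling is less efficient than the inotify approach the slide describes.

```python
import hashlib


def fingerprint(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    try:
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()
    except FileNotFoundError:
        return None


def snapshot(paths):
    """Record the current fingerprint of every monitored path."""
    return {path: fingerprint(path) for path in paths}


def changed_files(previous, paths):
    """Return the monitored paths whose content differs from `previous`."""
    current = snapshot(paths)
    return sorted(p for p in paths if previous.get(p) != current[p])
```

A caller would take a snapshot after each puppet run, call `changed_files` on every check interval, and trigger a puppet run whenever the returned list is non-empty.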
  • 55. pvc advantages - orchestration between foreman & puppet - controlled rollout of changes - upload facts between puppet runs into foreman
  • 56. pvc – backend 3 endpoints – all get the ?fqdn=<certname> parameter
      GET /host – should pvc run puppet or facter?
      POST /report – raw puppet run output, files monitored were changed
      POST /facts – facter output (puppet facts in JSON)
  • 57. pvc – /host
      > curl http://hi.com./host?fqdn=jj.e.com
      < PVC_RETURN=0
      < PVC_RUN=1
      < PVC_PUPPET_MASTER=puppet.vip.e.com
      < PVC_FACT_RUN=0
      < PVC_CHECK_INTERVAL=60
      < PVC_FILES_MONITORED="/etc/security/access.conf /etc/passwd"
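The /host response on slide 57 looks like shell-style KEY=VALUE assignments. A minimal sketch of how a client might parse it, assuming that format holds (one assignment per line, optional double quotes, with the leading "<" being only the slide's marker for response lines); `parse_host_response` is a hypothetical helper, not part of pvc.

```python
import shlex


def parse_host_response(body):
    """Parse shell-style KEY=VALUE assignments into a dict of strings."""
    settings = {}
    # shlex honours the double quotes, so a quoted value that contains
    # spaces stays together as a single token.
    for token in shlex.split(body):
        key, separator, value = token.partition("=")
        if separator and key:
            settings[key] = value
    return settings


response = (
    "PVC_RETURN=0\n"
    "PVC_RUN=1\n"
    "PVC_PUPPET_MASTER=puppet.vip.e.com\n"
    "PVC_FACT_RUN=0\n"
    "PVC_CHECK_INTERVAL=60\n"
    'PVC_FILES_MONITORED="/etc/security/access.conf /etc/passwd"\n'
)
settings = parse_host_response(response)
monitored = settings["PVC_FILES_MONITORED"].split()
print(settings["PVC_RUN"], monitored)
```

All values come back as strings, so a caller would convert fields such as `PVC_CHECK_INTERVAL` to integers itself.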
  • 58. pvc – /facts allows collecting of facts outside of the normal puppet run, useful for monitoring. set PVC_FACT_RUN to report facts back to the pvc backend.
  • 59. pvc – git for auditing push actual changes between runs into git - branch per host, parentless branches & commits are cheap. - easy to audit fact changes (fact blacklist to prevent spam) and changes between puppet runs. - keeping puppet reports between runs is not helpful.
  • 60. pvc – incremental rollouts select candidate hosts based on your criteria and set an environment variable via the /host endpoint output:
      FACTER_UPDATE_FLAG=true
    in your manifest, check (Facter lowercases fact names taken from the environment):
      if $::update_flag { … }
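Slide 60 leaves the candidate-selection criteria up to you. One possible criterion, sketched here as an assumption, is a stable percentage bucket derived from a hash of the FQDN, so a rollout can be widened from 1% to 10% to 100% without hosts dropping back out. The function names and the `salt` parameter are hypothetical; only the `FACTER_UPDATE_FLAG=true` line comes from the slide.

```python
import hashlib


def in_rollout(fqdn, percent, salt="rollout-1"):
    """Return True if this host falls inside the first `percent` buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{fqdn}".encode("utf-8")).digest()
    # Map every host to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def extra_host_lines(fqdn, percent):
    """Lines a /host backend could append for hosts selected for the rollout."""
    return ["FACTER_UPDATE_FLAG=true"] if in_rollout(fqdn, percent) else []


print(extra_host_lines("jj.e.com", 100))
```

Changing the salt reshuffles the buckets, so a different set of hosts goes first on the next rollout.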
  • 61. example pvc.conf
      host_endpoint=http://jj.e.com./host
      report_endpoint=http://jj.e.com./report
      facts_endpoint=http://jj.e.com./facts
      info=1
      warnings=1
  • 62. pvc – available on github $ git clone https://github.com/johnj/pvc make someone happy, achieve:
  • 63. wishlist stuff pvc should probably have: • authentication of some sort • a more general backend, currently tightly integrated into internal PPM infrastructure health • whatever other users wish it had
  • 64. misc. lessons learned your ENC has to be fast, or your puppetmasters fail without ever doing anything. upgrade ruby to 2.x for the performance improvements. serve static module files with a caching http server (nginx).

Editor's Notes

  1. Greg, dominic, ohad