The document discusses issues with the current state of infrastructure development and proposes solutions. It notes that developing directly in production environments leads to chaos and failures. It suggests using branches to represent environments, committing code to branches, and then deploying branches to test environments before production. It also acknowledges that infrastructure is inherently more complex than application development but argues that infrastructure development still needs to follow software engineering practices like unit and integration testing within a build pipeline. The overall message is that infrastructure development needs to modernize and apply practices from application development to become more reliable and efficient.
Test Driven Infrastructure Development
1. Test driven infrastructure development
Tomas (t0m) Doran
<tomas.doran@timgroup.com>
@bobtfish
https://github.com/bobtfish
https://github.com/youdevise
Thursday, 14 March 13
6. ‘Real men’ develop in production!
• Edit / Commit / Push
• Update puppetmaster
• puppet agent -t
• Repeat
Repeat again and again. Development cycle SLOOOW.
10. This is insane!
• Try it on an 8 person team.
• ‘LOL - I broke puppet’
CHAOS and FAIL result when you break each other. Or, MORE likely (this happens twice a
day!)
22. We can do better
• Branch == environment
• Branch / Commit / Push
• mco puppetupdate
• puppet agent -t --environment xxx
This at least lets you develop things independently. Everyone can do dev in their own branch
and merge once they have something that doesn’t break _everything_. You can also rebase -i
(squash) all the ARGH PUPPET SYNTAX commits.
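A minimal sketch of the "branch == environment" idea, to make the mapping concrete. The helper names are hypothetical and this is not the puppetupdate code; it assumes environment names should stick to lowercase letters, digits and underscores, so anything else in a branch name is folded to an underscore.

```ruby
# Hypothetical helper: derive a Puppet environment name from a git branch name.
# Assumption: environment names are limited to lowercase letters, digits and
# underscores, so every other character becomes "_".
def environment_for(branch)
  branch.downcase.gsub(/[^a-z0-9_]/, "_")
end

# Build the argument list for a one-off agent run against that branch's
# environment. The command is only assembled here, never executed.
def agent_command(branch)
  ["puppet", "agent", "-t", "--environment", environment_for(branch)]
end

puts agent_command("feature/lb-refactor").join(" ")
```

Each developer then gets an isolated environment per branch and can point a single test node at it without touching anyone else's runs.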
23. Sounds good?
• Then you’ll be wanting:
• https://github.com/youdevise/puppetupdate
It’s a bit basic, but then I ripped it out of work internal code at 8am ;)
28. Refactoring
• We change things to be consistent across codebase:
• Why did puppet just delete all the firewall rules on the production database?
• We don’t refactor:
• Add bugs all the time due to inconsistency
Sorry Chris, but when you say ‘refactoring’ - it’s not refactoring unless you have tests.
The problem is that you can’t always remember to run the right branch on all the right nodes.
Or rather, how do you even know what all the right nodes are? And if you’re hacking on
custom functions, or anything using exported resources - WOE
33. Unfortunate reality:
• Hard coded IPs in 10 places
• role::oy_lb
• hiera data split by domain (colo)
• mco puppet
• 4 weeks per app per environment
So, despite our best efforts, our puppet code was SHIIIIT.
Exported resources ARE NOT a good fit for non-trivial things (like generating load balancer
configs). Ergo lots of hard coded IPs in multiple places. Ergo puppet code per site.
39. The state of the art
• It’s certainly in a state
• Automatic runs dangerous
• cron --noop runs
• puppet becomes an auditing system
• This isn’t what I signed up for!
Nobody does automatic runs.
Puppet becomes an auditing tool (automatic noop runs + reports).
42. Business says no!
• Launching new products has a long lead time
• This is unhelpful if your company is trying to branch out into new markets
• CI / stage environments unlike prod
• Issues when new functionality goes live
• Developers think you’re incompetent
47. What is wrong with this picture?
• Did you run it everywhere?
• Does it affect anything you’re not expecting?
• Can you rebuild cleanly?
• Does the code even make things reflect current state?
You just don’t know the answer to any of these questions in any reliable way...
But, generally, the answers are NO, YES, NO, NO
51. ‘We use puppet’
• Means nothing
• State of your system is the sum of all changes
• How do you know your code can rebuild things?
Hint - you don’t!
55. It’s all mierda
• Development communities are 10 years ahead
• We don’t integration test
• (repeatably)
• We can’t build / rebuild
• (reliably)
We need to grow up, and raise the level of the conversation.
61. Infra is hard
• Infrastructure is inherently more complex
• Less control
• More moving parts
• ‘End to end’ testing
• Persistent data
Sure - it’s much much harder to get a standalone testable system in infra than it is in
development.
62. No excuses: Scientific method
I do not consider this an excuse to abandon sanity.
66. The solution?
• Re-provision everything in tests
• N.B. Not perfect (but better!)
• Proper software engineering
• Unit and integration tests
• Build pipeline + promotion
71. Openstack
• Our tests spinning up 12 machines => VMs
• Openstack is going to be awesome, but right now:
• Networking sucks
• Load balancing is a shambles
• lvs / vlans / metal / bonding - nope
So, we should use openstack, right? As of December, when we looked - 2 networks max,
inflexible. lvs not possible.
79. My desires:
• Reuse as much code as possible! (e.g. load balancers)
• No per colo/environment puppet code
• No IPs anywhere
• ‘DRY’
• CI pipeline to promote to production
• 1 puppet run from provisioned to working
• Repeatable and testable!
88. Puppetroll
• Rolls out a consistent sha1 from the puppetmaster to an entire environment
• Fails if any puppet run fails
• https://github.com/youdevise/puppetroll
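The "one sha1 everywhere, fail if anything fails" rule can be sketched roughly like this. It is not the puppetroll implementation: `roll`, `RolloutFailed` and the `run_puppet` callable are hypothetical names, and how the real tool triggers runs on nodes is not shown in the slides, so that part is stubbed.

```ruby
# Sketch only. `run_puppet` is a hypothetical callable taking (node, sha1) and
# returning true when the puppet run on that node succeeded.
class RolloutFailed < StandardError; end

# Apply the same sha1 to every node; raise if any single run failed.
def roll(nodes, sha1, run_puppet)
  failed = nodes.reject { |node| run_puppet.call(node, sha1) }
  raise RolloutFailed, "puppet run failed on: #{failed.join(', ')}" unless failed.empty?
  { sha1: sha1, nodes: nodes.size }
end

# Stub runner for illustration: every node except "db-002" converges.
stub = ->(node, _sha1) { node != "db-002" }

roll(%w[app-001 app-002], "3f2a9c1", stub)
begin
  roll(%w[app-001 db-002], "3f2a9c1", stub)
rescue RolloutFailed => e
  puts e.message
end
```

Raising on any failure is what makes the rollout usable as a build step: a CI job wrapping it goes red as soon as one node does not converge.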
93. Provisioning tools
• debootstrap custom gold images
• mcollective ‘computenode’ agent for kvm
• ‘provision me a machine called X, on networks Y and Z’
• Dynamic IP allocation (dnsmasq locally, DDNS for real)
98. stacks
• Model driven deployment
• DSL for describing groups of systems + dependencies
• rake tasks to provision / test / clean up stack + deps
• Can provision a full environment, run E2E tests, tear it down - in CI.
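To give a feel for what a DSL for "groups of systems + dependencies" might look like, here is a small self-contained sketch. Every name in it (`Model`, `stack`, `depends_on`, `launch_order`) is invented for illustration and is not taken from the real stacks code; it only shows the idea of declaring groups and deriving a dependency-respecting launch order from the model.

```ruby
# Minimal sketch of a model-building DSL. All names are hypothetical.
class Model
  Stack = Struct.new(:name, :instances, :deps)

  def initialize
    @stacks = {}
  end

  # DSL entry point: declare a group of identical machines and what it needs.
  def stack(name, instances: 1, depends_on: [])
    @stacks[name] = Stack.new(name, instances, Array(depends_on))
  end

  # Return stack names ordered so each stack comes after its dependencies
  # (a depth-first topological sort with cycle detection).
  def launch_order
    order = []
    state = {}
    visit = lambda do |name|
      return if state[name] == :done
      raise ArgumentError, "dependency cycle at #{name}" if state[name] == :visiting
      stack = @stacks.fetch(name) { raise ArgumentError, "unknown stack #{name}" }
      state[name] = :visiting
      stack.deps.each { |dep| visit.call(dep) }
      state[name] = :done
      order << name
    end
    @stacks.each_key { |name| visit.call(name) }
    order
  end

  def self.define(&block)
    new.tap { |model| model.instance_eval(&block) }
  end
end

model = Model.define do
  stack :app, instances: 2, depends_on: [:db]
  stack :lb, depends_on: [:app]
  stack :db
end

puts model.launch_order.inspect # prints [:db, :app, :lb]
```

Tear-down can walk the same list in reverse, which is one reason to keep the dependencies in the model rather than in ad hoc scripts.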
108. How it works?
• DSL creates model of systems
• rake task ‘launch’:
• mco provisions boxes on compute nodes
• each box runs puppet --waitforcert
• mco signs cert
• puppet runs for each box
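The launch steps above can be sketched as a small driver. This is an illustration under assumptions, not the real rake task: `RecordingBackend` is a hypothetical stand-in for the mcollective calls and only records what would happen, and the slides do not say whether the real task works phase by phase or host by host (this sketch goes phase by phase).

```ruby
# Hypothetical backend standing in for the mcollective calls; it only logs.
class RecordingBackend
  attr_reader :log

  def initialize
    @log = []
  end

  def provision(host, networks)
    @log << "provision #{host} on #{networks.join(', ')}"
  end

  def sign_cert(host)
    @log << "sign cert for #{host}"
  end

  def puppet_run(host)
    @log << "puppet run on #{host}"
  end
end

# machines: hostname => list of network names (all values here are made up).
def launch(machines, backend)
  # Provisioned boxes boot and wait for their certificate to be signed.
  machines.each { |name, networks| backend.provision(name, networks) }
  machines.each_key { |name| backend.sign_cert(name) }
  machines.each_key { |name| backend.puppet_run(name) }
  backend.log
end

machines = { "app-001" => %w[mgmt prod], "db-001" => %w[mgmt] }
puts launch(machines, RecordingBackend.new)
```

Keeping the backend injectable means the sequencing logic can be unit tested without touching any compute node.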
115. Puppetmaster
• Uses the same model
• Generates an ENC for each node
• Puppet code:
• Just installs things / starts services
• i.e. what it’s good at!
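A rough sketch of generating ENC output from a model. An external node classifier hands Puppet a YAML document with keys such as `classes` and `environment`; everything else here (the `MODEL` hash, the hostnames, the `role::appserver` class and its `port` parameter) is made up for illustration and is not the real stacks model.

```ruby
require "yaml"

# Hypothetical model: hostname => role and settings.
MODEL = {
  "app-001.example.com" => { role: "appserver", environment: "production", port: 8000 },
  "lb-001.example.com"  => { role: "loadbalancer", environment: "production", port: 80 },
}.freeze

# Build the hash that an external node classifier would print as YAML.
# Raises KeyError for hosts that are not in the model.
def enc_for(hostname, model = MODEL)
  node = model.fetch(hostname)
  {
    "classes"     => { "role::#{node[:role]}" => { "port" => node[:port] } },
    "environment" => node[:environment],
  }
end

puts enc_for("app-001.example.com").to_yaml
```

Because the classification comes from the same model that drove provisioning, the Puppet code itself needs no per-site data and can stay limited to installing things and starting services.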
120. Putting it together
• Still ongoing - live production apps ETA two weeks.
• Still haven’t solved re-provisioning problem for live environments!
• Do have repeatable and testable / tested infrastructure building in CI!
So, what do we have? Well - everything I showed you already...
Building proxy server layer (by refactoring puppet code) right now. Databases to follow!
122. [Slide image: test overview tables]
The top table is our test overview - we have two types of tests, those which are for a specific
machine (i.e. a VM) and those which are for a virtual service (backed by multiple machines).
‘behaves like’ is an rspec thing we haven’t overridden.
For each machine, we test that it’s pingable, then run every nrpe (nagios) agent and check
127. In the (near) future?
• Live application stack in production
• Automated ‘promotion’ of good changes to production
• Integrated environment support for dev stacks on dev branches/environments
• Open source all the things!
130. Thanks!
• puppet is an awesome tool.
• It doesn’t solve higher level system modeling problems
• It shouldn’t try to!
• sysadmins need to level up
• It’s not done till you can test it still works