2. #b20Con
THE PROJECT
• COTS product integration for DoD
• custom Python glue
• and Java, PHP, Perl
• Releases every 6 months or so
• Freeze 2-4 weeks in advance
• Deploy Friday evening to Sunday
afternoon
• Repair broken functionality Monday
and Tuesday (and on)
• Barely starting Agile
• Daily Stand-ups
• (really daily status calls)
• 2-week Sprints
• Good, pruned backlog
• No automated testing
• No unit tests
• No continuous integration
3. #b20Con
THE DELIVERY TEAM
• Development – Local
• 2 Developers
• 1 Business Analyst
• 1 Project Manager
• DISA PMO
• 1 Program Manager
• 1 Chief Engineer
• 1 Program Director
• 1 Systems Engineer
• Test and Integration – Remote
• 4-6 Testers
• 4-6 Integrators
• including security experts
• 1 Information Assurance
• Off-team
• Systems Administrators
• hardware and software
5. #b20Con
DEVOPS IS…
“How long would it take your organization to deploy a change that involves
just one single line of code?
Do you do this on a repeatable, reliable basis?”
- Mary and Tom Poppendieck
Implementing Lean Software Development:
From Concept to Cash
6. #b20Con
DEVOPS IS…
“The goal of DevOps is not just to increase the rate of change, but to
successfully deploy features into production without causing chaos and
disrupting other services, while quickly detecting and correcting incidents
when they occur.”
- Gene Kim
Top 11 Things You Need to Know About DevOps
7. #b20Con
CONTINUOUS DELIVERY
• Make releasing a business decision, not a technical decision
• High-confidence releases
• Small releases
• Fully tested
• No expectation of problems
• Hotfix releases
• Possible, no more than moderate risk and moderate coordination
9. #b20Con
THE APPROACH
• Started with things that were in
our control
• Dev and Test environments
• Development process
• Make changes behind the scenes
• Free/open source tools
• Easy to integrate into our CI system
• Small changes
• Disclose the changes when there
was a win
• Highlight ease of use
• Use as justification for higher
environments
10. #b20Con
THE JOURNEY
1. Continuous Integration
2. Functional Testing
3. Automated Deploys
4. Security Testing
5. Performance
6. Culture Clash
4½ Years
September 2009 – March 2014
11. #b20Con
1. CONTINUOUS INTEGRATION
• Trouble explaining “integration”
• between two or more developers
• not between systems
• Set up SecureCI one afternoon
• Explained the advantages later
• Wired to the ALM tool we had
• Jenkins (Hudson at the time)
• Nexus
• SonarQube (Sonar at the time)
• Automated builds
• Ant, Maven
• PMD, FindBugs, Checkstyle
• Cobertura
• Later added Python tools
12. #b20Con
2. FUNCTIONAL TESTING
• Functional testing was done manually
• from a script written in Microsoft Word
• We waited a year before staging a coup
• we didn’t want to encroach on their domain
• Demo of Selenium
• demonstrated record-and-playback through the Selenium IDE
• we recorded the first set of tests
• then turned it back over to the test team
Sound from soundbible.com, CC BY 3.0 US
13. #b20Con
2. FUNCTIONAL TESTING
• They argued later that automated testing was ineffective
• the automated script (singular) only worked one time, then needed to be re-
recorded when any changes got made to the app
Lesson Learned:
Automated testing isn’t just about replacing
manual tests with a test framework.
It requires a different way of thinking.
14. #b20Con
2. FUNCTIONAL TESTING
• We took it back
• Rewrote existing tests in Java
• Showed our business analyst how to clone-and-mutate the Java tests
• Started with JUnit, but went to TestNG
• better tagging and parameterization
• pre-test run initialization
15. #b20Con
2. FUNCTIONAL TESTING
• Development team had more confidence in releases
• Also began testing user roles
• Security testing = what can this type of user NOT do
Lesson Learned:
Should have focused on demonstrating that
there were fewer escaped defects.
It was hard to point to a clear benefit.
16. #b20Con
CONTINUOUS DELIVERY
• Project Manager came across the book in a
book store
• Everything made so much sense
• Logical extension of what we were trying to
do
• Addressed a lot of the issues we were running
into
• No money or time for an effort, so we
adopted it as our long-term goal
17. #b20Con
3. AUTOMATED DEPLOYS
• Started with automating a Drupal web server
install
• new system, not yet in production
• database server was easy, so we skipped it for now
• Then automated the manual COTS install
• Then started reverse engineering the broken
COTS installer
18. #b20Con
3. AUTOMATED DEPLOYS
• Down the road, realized we could automate everything
• Doesn’t just reduce risk, also speeds up the process
Lesson Learned:
Automate everything- even the easy stuff.
When it is easy to install, you’ll stumble
across more reasons to install it.
Go from Why? to Why not?
19. #b20Con
3. AUTOMATED DEPLOYS
• No Puppet Enterprise Server
• just manually ran puppet apply from
the command line
• every system (DB, Web server, SVN
server, ALM tool) used the same
puppet apply command
• Vagrant would have been helpful
for local deploys
• Just hadn’t heard of it
20. #b20Con
4. SECURITY TESTING
• Noticed extra processes running
• Dev system in cloud with default
password
• Tested Security Blanket
• just purchased by Raytheon
• couldn’t get it purchased
21. #b20Con
4. SECURITY TESTING
• Decided we needed at least some security in dev
• System hardening
• Web application scanning
• We knew it couldn’t replace the “official” testing
• plus, we didn’t want to encroach on their domain
22. #b20Con
4. SECURITY TESTING
• Knew we had some good base for security
• CI, static analysis, user role testing
• Wanted a security scanner
• at the time, none worked with client certificates out of the box
• Found w3af
• Python
• customizable
• client certificate support was there, but not exposed
• handed it over to the “security experts” on the integration team
24. #b20Con
4. SECURITY TESTING
• Never got past the login screen
• Never read the output or log
• So we took it back
• Eventually had problems getting customized w3af to work
properly
• Switched to OWASP ZAP, run manually
• Security team focused on STIG and SELinux
• that was their expertise anyway
25. #b20Con
4. SECURITY TESTING
• Lost a lot of faith in us
• Information Assurance isn’t the same as Security
Lesson Learned:
Protect every system, everywhere.
Many hacks are just there for the system,
not the data.
26. #b20Con
4+. SECURITY TESTING
• Over a few days, implemented
OpenSCAP in Jenkins for STIG
• immediately found issues
• started adding Puppet manifests for
remediation
• Started using Nikto2 for web
server scanning
• immediately found issues
• Started running weekly scans of
dev and test using OpenVAS
• no immediate issues, but started
seeing package security updates
before they became IAVMs
• Discovered SELinux was in
permissive mode
• had never been in enforcing
27. #b20Con
4+. SECURITY TESTING
• Easier audits
• Proactive security upgrades
• Much better relationship with the data center
Lesson Learned:
Benefits of security testing go
beyond increased security.
28. #b20Con
5. PERFORMANCE
• Applying STIG to database server
• seemed like it was getting slower
• Used JMeter to get baseline
• Took rough breakdown of most
common queries
• Repeated as a 15-minute test
• Monitored trend
• Added similar testing to functional
tests, another 15 mins
• Also, number of functional tests
was growing slowly
• Watched functional test elapsed
time as rough guide
29. #b20Con
5. PERFORMANCE
• Watching trends can be very worthwhile
• Some testing can be almost as valuable as full testing
Lesson Learned:
A baseline can be a great safety net.
30. #b20Con
6. CULTURE CLASH
• Continuous Delivery was being
openly discussed
• PMO had just started thinking of it as
a clear plan
• Kept asking when “continuous
delivery” would be delivered, and
how it would be packaged
• Test and Integration started
complaining
• 3½ of us were pushing the 12+ of
them too hard
• moving too fast
• not a risk or control complaint,
merely effort
• People on test and integration
team started leaving
• including “Burt”
31. #b20Con
6. CULTURE CLASH
• Benefits were growing clear
• Effort was minimal
• No active resistance
Lesson Learned:
Do not underestimate cultural inertia.
Some will not or cannot ever make the mental shift.
32. #b20Con
THE AFTERMATH
• Test and Integration decided not to
renew their contract
• all remaining personnel ended
project with a month
• Security issue found the following
week
• deployed 3 days later
• Went back to 2-week deploy
cycles, sometimes faster
• Left 3 people on development
team
• One went back to take over for the
test and integration team as hands-
on-keyboard
• BA left project and another came in
½ time for testing
• Dropped into maintenance mode
33. #b20Con
THE DELIVERY TEAM
• Development – Local
• 1 Developer
• 1 Release Manager
• ½ Tester
• DISA PMO
• 1 Program Manager
• 1 Chief Engineer
• 1 Program Director
• 1 Systems Engineer
• Test and Integration – Remote
• 1 Information Assurance
• Off-team
• Systems Administrators
• hardware and software
34. #b20Con
THE PROJECT
• Barely Agile
• Maintenance only
• Kanban-ish
• tracking work in progress
• Daily Stand-ups
• (really daily status calls)
• 2-week Sprints
• Releases prepared every 2 weeks
• Soft freeze Thursday for Friday
release
• Deploy Friday evening
• 100% working functionality Friday
evening
• Non-event
35. #b20Con
THE PROJECT
• Puppet took the configuration parameters
• from 200+ untracked values
• to ~30 Hiera-controlled values
• Biggest coordination issue: 72 hours for user messaging
• Biggest time consumer: 3-6 hours for VM clones
36. #b20Con
MY ADVICE
Lessons Learned:
DevOps and Continuous Delivery are not a goal.
Do not set out to do DevOps or CD.
Remove road blocks and bottlenecks.
Fix quality issues.
Be more responsive to change.
39. #b20Con
MISSED OPPORTUNITIES
• Automated deploys
• more valuable than just reducing risk
• Vagrant
• Some security scanning earlier
• do not just assume someone else is
doing it
• Some performance testing earlier
• some is a lot better than none
• maybe almost as good as a lot
• We relied on client-side certificates
for authentication
• EJBCA should have been set up
immediately
• Upgrades are a huge time sink
• components, libraries,
applications, system software
• add tools to track it as early as
possible
40. #b20Con
THE TOOL CHAIN
• Jenkins
• Puppet (no Puppet Enterprise)
• 2 puppet apply commands per server
• one --noop for system audit
• one for deploy
• Security
• OpenSCAP (every deploy, minutes)
• OpenVAS (every weekend, hours)
• included Nikto2
• used Kali Linux
• OWASP Dependency Check (on-demand,
many minutes)
• OWASP Zed Attack Proxy (on-demand, few
days)
• Full role-based Selenium test coverage
(every deploy, overnight)
• 10k+ Selenium tests via TestNG
41. #b20Con
THE TOOL CHAIN
• Testing
• TestNG for Java unit tests
• Nose for Python unit tests
• Mockito/Mockito for Python
• JMeter
• for some representative
performance tests
• Static Analysis - Java
• PMD
• FindBugs
• Checkstyle
• Cobertura
• SonarQube
• Static Analysis - Python
• Pylint
• coverage.py
-- Every developer, everywhere, at some point
In this case test and integration team.
Ran on servers that were configured differently, with different security restrictions because it was in a different data center.
High drama, high risk, lots of deliberation
We didn’t know it at the time, but in retrospect we had every anti-DevOps stereotype: painful releases, so do fewer, with more changes
Also, deliberate throwing over the wall between integrator testing/documenting the deploy and integrator doing the deploy. That is how the team was making sure the deploy notes were complete, if someone could install the software sight unseen with the documentation they had prepared. They saw that as part of separation of duties.
Everyone has their own definition, so we were thinking of it as
So when I say we got to a Continuous Delivery process, this was what we achieved
This is the approach that worked for us.
THIS IS NOT PRESCRIPTIVE! Your situation is different, so your approach may be different.
Also, while it is nothing we did, we had air cover from a really, really strong advocate and a directive to be an “exemplar” project, and show other DoD projects that they could be Agile and show how to do it. We never would have succeeded without a champion to buy us the time and flexibility to undertake the changes to the process we made.
In all cases, we started with the pieces that were within our control (dev/test) and showed the value there as justification to push out further to “higher” environments and outside our immediate team.
A lot more overlap than I’m going to describe, but it did happen roughly in these waves over 4 years.
When we started, we were not driving to a CD process. We didn’t even know what CD was.
Continuous integration to them was a two week test cycle that could be kicked off with roughly 6 weeks of lead time run once or twice a year. We explained that was neither continuous nor developer integration.
They didn’t see the value in CI, but didn’t see any harm since it was just an afternoon to get Jenkins set up anyway. We stuck with primarily open source software, because this wasn’t an explicitly funded effort. Plus it made it instant (no lead time for procurement).
This gave us a strong basis for CD later, although we didn’t know it at the time.
Lesson learned: CI is valuable, but outside the dev team it isn’t obvious. Also, the biggest advantage to open-source tools is often acquisition time, not acquisition cost.
After a few months of CI, functional testing became the biggest bottleneck and showed the least value.
Lesson learned: Automated testing isn’t just replacing manual with a test framework. There is a different way of thinking.
The test team was happy not to be burdened with testing.
Since it was COTS, focused on testing system interfaces, not application functionality
Positive and negative user role testing was a great idea in retrospect. Strong basis for security and highest risk point when adding new functionality.
Lesson learned: Should have focused on demonstrating that there were fewer escaped defects. It was hard to point to a clear win.
Project manager dropped to ½ time to free up budget, hired release engineer as Puppet “expert” even though he didn’t know Puppet at the time.
Chose Puppet almost at random. Chef or Ansible would have worked just as well. None of them is a wrong choice, and anything is better than nothing.
Lucky timing: Integration team was focusing or had just finished a 6-month, full team effort to update OS from 32-bit to 64-bit.
COTS install was a crap shoot, didn’t always work, took several days, never seemed to end up installed the same way. When we called the vendor, they confirmed their professional services had a 50% success rate. Their solution was just to try again. It almost always works on the second try.
Used Drupal success as a clear win for automating COTS install.
Integration team was happy not to be burdened with integration or documenting the install process.
COTS vendor eventually called us and asked for our Puppet code. We said no. Largely out of spite.
Lesson learned: automate everything, even the easy stuff. It isn’t just to reduce risk, but also to speed things up. If you can install it easily, you will stumble across reasons to install it more often. We went from Why? to Why not?
Couldn’t easily get funds for license, extra server, nor accreditation for Puppet Enterprise.
Also, Vagrant would have been useful here had we known about it. With so few developers, coordination was easy and we almost never had conflicts. But Vagrant would have been easier if even one more dev added, especially if not colocated.
Had a hacker get into a dev system with a default password. Just an email bot, not a directed attack.
But we realized how lax we were with security, even in the safety of a dev env.
Again, open source acquisition was faster. Raytheon had just purchased Security Blanket and no one knew how to sell it to us.
3 months of effort, finally got results
Announced on daily call
Our non-technical BA was able to see the issue right away.
Never got past the login screen
But didn’t start at the beginning, so they even missed a XSS bug on the home page
The security guys were happy not to be burdened with security testing. They were IA, really checklist guys anyway.
Security Technical Implementation Guide
Lesson learned: IA not the same as security. And protect every system everywhere, it doesn’t matter if it has “production” data, many hackers just want the system.
Pattern recognition began to set in.
In their defense, this was automated vs. manual checks, not really incompetence
Found SELinux when Puppet code ran and assumed SELinux was enforcing, COTS product could not work in enforcing
Lesson learned: The benefits aside from increased security are significant: easier audits, proactive security upgrades on our schedule, and in the long term a much better relationship with the data center Ops guys.
We were applying STIG to database settings and noticed the database got slower
We never got around to adding true L&P or stress, or even response times.
Lesson learned: Even without formal guidance, a baseline is a great safety net. A relatively short test to show the trend over time is very worthwhile.
Remember, we were doing the development, testing, writing functional tests, security scans, and automating the deploys
“Burt”- no last name because I never met him in 2-2½ years. Nor heard of him, before or after the status call when they announced he was leaving. He wasn’t on the call that day. Never spoke up on the daily status call. Never showed up to a release planning meeting that happened every 6 months, sometimes in their offices. No tasks assigned nor delivered. Never supported a deploy. They were sad to see him go.
No back filling positions, just let the test and integration team atrophy.
Lesson learned: No matter how many advantages and benefits you show, even if there is little effort to be expended, some people/teams/orgs will never make the mental shift. It wasn’t active resistance, they just couldn’t/didn’t make the mental shift.
These results have been reinforced by our experiences at other gov’t agencies and commercial clients
5 years in, 4 years since CI introduced. May have been a little more notice, but I don’t believe so.
Used to have 2 deploys a month split between two sets of properties, dwindled to no more than once a month due to “moving too fast”
Smaller local team, only 1 on remote
to ~30 Hiera-controlled through standardization and composition
Jeff Payne- “Don’t do Agile, you have to be Agile”
Not focusing on DevOps/CD as a goal will help you prioritize what needs to be improved and will show you benefits sooner
Incremental adoption
less culture shock
more visible and concrete benefits sooner
Monty Hall Let's Make a Deal (1963–1977)
said “Actually, I'm an overnight success, but it took twenty years.”http://www.brainyquote.com/quotes/quotes/m/montyhall192769.html#QTimGb64Le97MELL.99
5 envs: dev, test, uat, staging/RACE, production
OWASP Dep Check every new library and periodically
OWASP ZAP any change in security posture
Enough functional testing, regression testing, performance testing, and security testing to give us confidence