2012 - A Release Odyssey


Lightning talk for DevOpsDays Austin 2013 on taking releases from a 10 week to 1 week cadence. Sorry about the format, had to go from Keynote to PDF and since it was a lightning talk all the actual content's in the notes.


  1. DevOpsDays Austin 2013 | @ernestmueller | @bazaarvoice | 2012: A Release Odyssey. Hi, I’m Ernest Mueller from Bazaarvoice here in Austin. We’re the biggest SaaS company you’ve never heard of; our primary application is for the collection and display of user-generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution on their sites for that purpose. We pushed out more than 1 billion reviews last Cyber Monday. I’m going to tell you how we went from releasing our code once every ten weeks to once a week in a pretty short time.
  2. The Monolith! Bazaarvoice Conversations, aka PRR, has 15,000 files and 4.9M lines of code, the oldest from Feb 2006, and that’s not counting UI versions, customer config, or operations code repos (all of which get released along with it). Written by generations of coders, including outsourcing partners. It runs across 1,200 hosts in 4 datacenters: Rackspace, and AWS East, West, and Ireland. So by any measure this was a large legacy system.
  3. BV had gone agile and said “Let’s release more quickly too! All the cool kids are doing it! We’re doing two-week sprints, so let’s release biweekly - go!” They tried it two weeks after a big ten-week release, and PRR v5.1 launched on January 19th, 2012. Whoops, it’s not that easy - 44 client tickets logged, mass hysteria. “Let’s not do that again!”
  4. Enter yours truly on January 30th. “You’re hired! We want biweekly releases in a month. With zero user-facing downtime. Failure is not an option! Go!” It wasn’t just an irrational need for speed: the product organization wanted faster A/B testing, more piloting, and so on, and the engineering team wanted the benefits of a more continuous flow as well.
  5. Careful analysis of the situation was warranted. Luckily a SWAT team had been analyzing the problem already. The two major impediments, both frequently encountered in legacy implementations:
     • Lack of test automation - testing was a huge burden and couldn’t be done sufficiently in the time allotted
     • Poor SCM discipline - checkins continued right up to the release
  6. Path One - Testing! We hired up QA automation people and set them to work. We set the expectation, backed up strongly by the product team, that the development teams had to stop and do three testing sprints. We have a standard four-environment setup: dev, QA, staging, production.
  7. JUnit testing and CIT testing in TeamCity were ramped up. A Selenium-based “Testmaster” system was used to raise regression automation to safe levels. Perhaps more importantly, we adopted a new discipline of not running all the tests all the time: feature/story tests in dev, regression in QA, smoke testing in staging and production.
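The per-environment test-tiering discipline can be sketched as a simple lookup. The tier names and test tags below are hypothetical stand-ins; the talk's actual system was JUnit/Selenium under TeamCity.

```python
# Sketch: pick which test suites run in each environment.
# Tier names and tags are illustrative, not Bazaarvoice's real taxonomy.
TEST_TIERS = {
    "dev": ["feature", "story"],
    "qa": ["feature", "story", "regression"],
    "staging": ["smoke"],
    "production": ["smoke"],
}

def select_tests(tests, environment):
    """Return only the tests whose tag is scheduled for this environment."""
    wanted = set(TEST_TIERS[environment])
    return [name for name, tag in tests if tag in wanted]

tests = [("login_works", "smoke"),
         ("review_submit", "feature"),
         ("full_catalog_crawl", "regression")]

print(select_tests(tests, "staging"))  # only the smoke test runs past QA
```

The point of the tiering is that the expensive regression suite stops at QA, while staging and production get a fast smoke pass only.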
  8. Branching: we changed over to a trunk/release-branch model. A release branch splits off every two weeks, and there are no commits to the branch without going through a code-freeze break process. Process enforcement via wiki! Trunk goes to dev twice daily, the branch goes to QA, and once labeled “verified” it goes to staging and then to production.
  9. We also had a team write a feature flagging system, like the cool kids use, so we could launch features dark and then enable them later. We made the rule that all new features must be launched dark.
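A dark-launch flag system of this kind can be sketched in a few lines. This is an illustrative minimal version, not Bazaarvoice's actual implementation, and the flag name is made up.

```python
# Minimal feature-flag sketch: features register disabled ("dark") and are
# flipped on later at runtime, with no redeploy.
class FeatureFlags:
    def __init__(self):
        self._flags = {}

    def register(self, name, enabled=False):
        # Rule from the talk: all new features must launch dark.
        self._flags[name] = enabled

    def enable(self, name):
        self._flags[name] = True

    def is_enabled(self, name):
        # Unknown flags are treated as off, which fails safe.
        return self._flags.get(name, False)

flags = FeatureFlags()
flags.register("new_review_widget")        # shipped in a release, but dark
assert not flags.is_enabled("new_review_widget")
flags.enable("new_review_widget")          # turned on later, independently
assert flags.is_enabled("new_review_widget")
```

The key property is decoupling deploy from launch: code ships in the regular release train, and product can enable it whenever they are ready.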
  10. We couldn’t fix a couple of things in time. Our Solr indexes are 20 GB, and reindexing and distributing them while doing a zero-downtime deployment and keeping replication lag down needed more engineering. And our build and deploy system was pretty bad. It’s buzzword compliant - svn, TeamCity, Maven, yum, Puppet, Rundeck, Noah - but it’s actually a spaghetti mess in a big crufty bash framework; builds take more than an hour and deploys take 3+ hours.
  11. We got a delay of game due to our IPO and then were “no go” March 1. We were under a lot of management pressure to ship, but tests weren’t passing, and at the new go/no-go meeting the dev managers sucked it up and declared “no go.”
  12. First biweekly release: PRR 5.2 went out on March 6, 5 days late, and 5 issues were reported by customers. 5.3 went out March 22, with 1 issue reported. 5.4 went out April 5, with zero issues reported. I kept in-depth release metrics - number of checkins, number of process faults, number of support tickets - and they showed consistent improvement.
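Metrics like these are cheap to compute; a rolling average over per-release issue counts is enough to show the trend. The sketch below uses the issue numbers from the talk; the window size is my own choice.

```python
# Sketch: rolling average of customer-reported issues per release.
def rolling_average(values, window=3):
    """Average of the most recent `window` releases' issue counts."""
    recent = values[-window:]
    return sum(recent) / len(recent)

# Issue counts per the talk: v5.1 (44), 5.2 (5), 5.3 (1), 5.4 (0).
issues_per_release = [44, 5, 1, 0]
print(rolling_average(issues_per_release))  # (5 + 1 + 0) / 3 = 2.0
```

Watching the windowed average rather than single releases smooths out one-off spikes and makes the improvement (or a regression) obvious at a glance.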
  13. It took a lot of collaboration and good old-fashioned project management. Product, QA, DevOps, various engineering teams, Support, and other stakeholders all had to get on the same page. We didn’t really change tooling besides adding the feature flagging - still Confluence, JIRA, and all our other tools - just used them more effectively. http://www.flickr.com/photos/senorwences/2366892425/
  14. And the release train kept spinning. We had one major disaster on May 17, when a major architectural change to our product feeds went out in a release and generated 28 client-reported issues (up from a nice rolling average of 0.5). We enhanced our process to link each svn checkin to a ticket, put together a page requiring per-ticket signoff for the release, and started tracking more quality metrics. This got us consistently smooth releases through the summer of 2012.
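The checkin-to-ticket link can be enforced with a simple commit-message check. This sketch assumes JIRA-style ticket IDs (e.g. PRR-1234); the talk doesn't specify the actual ID format or enforcement mechanism, so treat both as assumptions.

```python
import re

# Assumption: JIRA-style ticket IDs like PRR-1234 (project key, dash, number).
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def commit_links_ticket(message):
    """True if the commit message references at least one ticket ID.

    A check like this can run as a pre-commit hook so every svn checkin
    on the release branch traces back to a ticket for signoff.
    """
    return bool(TICKET_RE.search(message))

assert commit_links_ticket("PRR-1234: fix feed architecture regression")
assert not commit_links_ticket("quick fix, trust me")
```

With every checkin tied to a ticket, a per-ticket signoff page becomes mechanical to build: list the tickets on the branch, and block the release until each has an approval.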
  15. But we weren’t done there. We wanted to totally pwn the old way, and the next step was weekly releases. There were still some parts of the process that were manual and painful, and we were still having some “misses” causing production issues. “If it’s painful, do it more often” is a message some folks still balk at, but it is absolutely true.
  16. This was a lot easier: the QA team worked in the background to get the test coverage numbers up, and then we said to the teams, “We’re going weekly in two weeks... same process otherwise.” Version 6.7 launched on September 27, a week after 6.6. Client-reported issues stemming from a code release have averaged around zero since that time. Solr index distribution was automated: the indexes get regenerated beforehand, shipped out to the data centers, brought up to date, and then swapped in during releases. Solr reindexing automation went live October 18, 2012. Then we trained the developers to take over the release process. We skipped some releases during Black Friday, but are shipping PRR 9.0 this week (mostly in our absence!).
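The "swap in during releases" step can follow a common zero-downtime pattern: build the new index beside the old one, then atomically repoint a symlink so readers never see a half-written index. This is a generic sketch of that pattern, not Bazaarvoice's actual tooling, and the directory names are invented.

```python
import os
import tempfile

def swap_index(live_link, new_index_dir):
    """Atomically repoint the 'live' symlink at a freshly built index.

    rename() over an existing path is atomic on POSIX filesystems, so a
    reader following the symlink sees either the old index or the new one,
    never a partial state.
    """
    tmp_link = live_link + ".tmp"
    os.symlink(new_index_dir, tmp_link)
    os.replace(tmp_link, live_link)

# Demo in a scratch directory: two index builds and one live pointer.
base = tempfile.mkdtemp()
old = os.path.join(base, "index-v1")
new = os.path.join(base, "index-v2")
os.makedirs(old)
os.makedirs(new)

live = os.path.join(base, "live")
os.symlink(old, live)   # serving the old index
swap_index(live, new)   # release: flip to the new index atomically
print(os.readlink(live).endswith("index-v2"))  # True
```

Regenerating and shipping the 20 GB index ahead of time, then paying only this cheap pointer flip at release time, is what keeps the deployment itself zero-downtime.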
  17. As I mentioned, our build and deployment is already automated (somewhat sketchily) with TeamCity, Puppet, Rundeck, and Noah. Our next step in killing off the old way is in progress: renovating our build system - moving to git with Gerrit for code review, and upgrading our TeamCity installation so it can be controlled via API - and fixing the crappy CIT tests that have been languishing there. We currently have trouble with failing CIT runs because we don’t block people on them, since the failures are intermittent. We’ll get build and CIT running fast (currently a 1-hour build and a 40-minute CIT run).
  18. After that we will get rid of the bash-spaghetti deployment system we have and make deploys faster and better (currently 3 hours). We’re removing the separate staging roll (staging = production, because it’s client facing) and going to continuous deployment off trunk to our QA system. Some of this is technology-faster and some is process-faster: having to promote up four environments, when it takes 4 hours per environment and when staging and production have to happen in maintenance windows, is slow.
  19. And eventually... continuous deployment. The cloud kids get to start there, but it takes some heavy lifting to get a large, established system there. But that’s the sequel, 2013: A Release Odyssey.
  20. And that’s my story! Hit me up at theagileadmin.com. And thanks to 2001: A Space Odyssey for all the screen caps I used as part of this presentation.
