AppJam 2012: Accelerating DevOps Culture at Edmunds.com

John Martin, Senior Director, Production Engineering, Edmunds.com
Collaboration between Dev and Ops is a critical part of how Edmunds.com manages its 30 critical web applications. Development requires hard data from Operations so they can fully understand how to improve application performance. In this session John will discuss the key drivers for DevOps at Edmunds.com, how Dev and Ops collaborate using AppDynamics to solve production incidents, and his top 5 tips for implementing a DevOps culture in your organization.


Speaker notes
  • Much like "The Cloud" when that word became buzz-worthy, DevOps carries similar misconceptions about its meaning, so give me a minute to share what DevOps means to us at Edmunds. When we use "DevOps" at Edmunds, what we're really zeroing in on is the collaboration aspect. As I hope you'll see in a few minutes, the most challenging issues we faced were ultimately breakdowns in communication and collaboration. My personal feeling is that the technology issues have actually been the easiest part of the puzzle to solve; it's the soft-skill challenges that have been the most difficult to overcome. Damon Edwards of DTO nailed it by asserting that you can't change culture, but you can change behavior, and that has certainly rung true with me. Edmunds has had its fair share of technology challenges, but ultimately it was cultural - read: "behavioral" - challenges that pushed us to seek a better way of working. I'd also like to make clear that while we've seen some radical changes as a result of embracing these ideas at Edmunds, there is still much for us to do. I'd consider myself a charlatan if I were to stand up here and tell you that we've got it all figured out and there isn't more to do. That's simply untrue. But I am speaking the truth when I tell you we've come a long way by embracing the principles of what DevOps has to offer. So with that… what got us thinking differently?
  • So we didn't just wake up one day and say, "Let's try DevOps." In fact, when I look back, it's clear that like many others we were already working in that direction when the term came about. There were a few key drivers that pushed the change in our thinking: infrastructure growth, communication failures, and the simple need to build and deploy faster.
  • When I first started at Edmunds in 2005, we had approximately 30 servers that made up both the website and internal services. In 2006, a major initiative to overhaul how we developed, tested, and served our website saw a large burst in infrastructure as the number of servers jumped to 200. We provided multiple pre-production environments to support several release tracks at one time. In hindsight, this created more limitations than the opportunities we had hoped it would provide. For starters, we hadn't yet fully automated how we built and configured hosts. Sure, we had Kickstart profiles, but what about once people started putting their hands to the keyboard in a shell? Configuration drift plagued us, and admins spent a ridiculous amount of time chasing differences in environment configurations.
  • On top of that, our applications were still deployed manually by following an Excel spreadsheet with a list of instructions we called "battle plans". Every month, a member of the Application Operations team - which I was a part of - would touch each server to deploy a new .EAR, check content out of source control, or manually update our Apache configurations. We'd manually swap hosts in and out of the load balancer along the way. Eight hours later, the production release could then be qualified.
  • The number of applications and servers just continued to grow. By the end of this year, we'll be just shy of 2,500 servers, and by the end of 2013 that number is projected to nearly hit 3,000. Even before the word "DevOps" hit the industry, we knew we had to automate our infrastructure management and code deployment. We did a "Phase 1" with BladeLogic and AnthillPro, but as I'll explain in a few minutes, that didn't work out for us as well as we'd hoped.
  • Communication failures nearly tanked major initiatives. Back in December 2007, we launched a project called "Edmunds 2.0". This project had gone through a pretty lengthy development cycle and was meant to introduce a large number of new features for our business. Namely, it was our first implementation of a CMS, so there was a lot riding on this project. Our internal customers were anxious to be able to update content on our website much faster than our monthly release cycle was allowing them. But there was a snag, and a big one at that... Weeks before the project was set to go live, performance testing began to uncover a major issue. The CMS framework that had been chosen - a solution called Day Communique, or Day CQ - wasn't scaling to suit our expected traffic numbers. Up until this point, our validation had been entirely around functionality. At this late stage in the project's life cycle, we were only just beginning to understand Day's performance limitations, and we were in big trouble. So we did the only thing that made sense to us: we doubled our application infrastructure to support the app's limitations and invested in a hardware caching appliance to take the load off the poorly performing application. And that really wasn't what we as a technology team felt was a successful launch. We'd released on time; we'd delivered for the business; but we suffered as a technology organization due to our failure to catch such a massive shortcoming in our core technology stack until it was too late. So the Technology team cracked open the hood and started taking a good look at the inner workings of how we operated. We needed a clear picture of what we got wrong in Edmunds 2.0 and how we were going to prevent it in the future. One of the first things that became apparent was the need to improve the relationship and processes between the Operations and Development groups. I think we have all experienced the walls that often divide these two parts of any technology organization, and at the time, we were no different. Product was typically developed in a vacuum and then tossed over the wall for Operations staff to support. To be certain we were architecting infrastructure to suit the application as it was developed, a tighter relationship had to exist. I think it was Scott McNealy, co-founder of Sun Microsystems, who once said that without software, the hardware his company manufactured was useless; and without the hardware, the software would have no place to run. To us, this relationship had not been fully realized, and that was the largest contributor to the late discovery that our new product wasn't built for scalability.
  • In 2010 we embarked on our next big technology overhaul. The Edmunds Redesign project was a nearly 100% rewrite of our application. We proposed going from one ginormous, monolithic .EAR to 20 or so separate applications. This would mean another explosion in the size of our infrastructure, but perhaps more challenging than previous bursts, there would be far more applications with different configuration and deployment requirements. Having learned hard lessons from our communication failures a few years back, we understood clearly that an overhaul of this type would require tight collaboration between all of our technical teams. Armed with BladeLogic, AnthillPro, and some additional homegrown tooling, we believed we were well equipped to meet the challenges this redesign effort required. We were partially right. From April to November of 2010, members of both our Dev and Ops teams worked through implementing the pre-production and "beta" production environments for the new site. We really nailed the collaboration piece as integrated teams of sysadmins and developers worked through technical challenges building new services, tuning and tweaking at each step. But we had another major failure ahead of us that we didn't quite see. Because our pre-production and "beta" environments had been constructed piece by piece, when it came time to build the official production runway, it didn't work exactly as we expected it to. We hadn't tied BladeLogic to AnthillPro effectively, once again highlighting a communication disconnect. We'd also failed to integrate changes we'd made after the infrastructure was provisioned back into our BladeLogic processes. So our "beta" environment actually served production traffic for almost 90 days while we scrambled to build another functioning environment. It was in those three months that several of us sat down to really begin planning the technical and cultural shifts needed to deal with these types of challenges for the long haul. With yet one more failure scenario to learn from, 2011 is when we started to get really serious about DevOps.
  • Who made DevOps happen? Technical leads who spent too much time in war rooms. The spark of DevOps happened at the technical lead level: several of us who'd worked through these issues, spending too much time in war rooms, knew there was a better way. We started by identifying the gaps between infrastructure management and application deployment. Remember Scott McNealy's assertion about his hardware and software dependencies? Well, we finally understood it better than ever. How did they make it happen? Two teams, Production Engineering and Automation Engineering, set about providing tools which bridged the divide. (ProdEng = Ops) + (AutoEng = Dev) == how we really started gaining inroads. Members of both teams shed traditional views on what they were supposed to do and just did it. The result was improved relationships, better tooling, and a clearer perspective on how future projects could work.
  • Email, Jabber, Google Hangouts - nothing beats a whiteboard and good documentation! It's important to point out that tools don't make a DevOps environment; people do. But people rely on tools to be more efficient and effective, so let's talk about our tooling. The usual suspects are of course at play: email, Jabber for chat, and we've started leveraging Google Hangouts quite a bit as we spend more time working remotely.
  • We use Splunk heavily at Edmunds. It ingests logs from web servers, app servers, syslog, network devices, Chef, and JMeter. We've been using Splunk for about four years now, and while there have been some challenges along the way to keep it scaling with our data, I still see it as the best log and data aggregator around. What you're looking at here is a set of dashboards our QA teams have put together charting performance results from JMeter. These charts update in real time as performance tests are being executed. It's pretty slick; if you're interested in how this works, let me know and I'll be more than happy to put you in touch with the folks who built it. (A sketch of the kind of log line that feeds dashboards like these follows below.) I have to tell you, Splunk has always been one of my favorite tools in our arsenal. As I was talking with several of the folks at AppDynamics about the tool, they took some of my feedback on how the two products could be integrated, and in less than a month I've seen demos of how that integration will work. "Slick" was my first response, and hopefully there are no product folks in the room freaking out that I'm talking to you about this, but trust me… it's going to blow your mind when you see it. It's the integration of two of the best tools out there.
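The deck doesn't show how those JMeter results reach Splunk, so the following is only a generic illustration: emitting key=value pairs that Splunk's automatic field extraction can pick up and chart. The class name, field names, and the `perf.log` destination are assumptions for the sketch, not Edmunds' actual format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: one key=value line per sampled request, so a Splunk
// search such as `source=*perf.log app=inventory | timechart avg(response_ms)`
// can chart response times over a test run. Field names are illustrative only.
public class PerfLogger {
    private static final Path LOG = Paths.get("perf.log");

    public static void record(String app, String transaction, long responseMs, int httpStatus)
            throws IOException {
        String line = String.format("time=%d app=%s txn=%s response_ms=%d status=%d%n",
                System.currentTimeMillis(), app, transaction, responseMs, httpStatus);
        Files.write(LOG, line.getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        record("inventory", "search-by-radius", 842, 200); // a single 842 ms sample
    }
}
```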
  • Instead of traditional SQL databases backing our website, we rely on Oracle Coherence to provide data to our web applications. Coherence is essentially an in-memory data store similar to Voldemort or memcached. Since every web application interacts with this cluster, understanding how each node within the cluster is performing is an absolute requirement (the sketch below shows what that interaction looks like from the application side). We've been on RTView since 2010 and it has served us very well through a number of incidents. Right now, we map exit points from the web apps in AppDynamics to our Coherence caches, but once that endpoint is reached, the data stops. If we want deeper insight into what's happening within the Coherence cluster, we switch to RTView to dig deeper. That's about to change, though. We're working with several of the folks at AppDynamics to bring the information we currently rely on RTView to provide into AppDynamics, for a more holistic view of the interactions between the web apps and Coherence. Be on the lookout for info from us on that in the near future.
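For readers unfamiliar with Coherence, a web application talks to the cluster through a named cache that behaves like a distributed Map, via the standard CacheFactory/NamedCache API. A minimal sketch follows; the cache name and keys are illustrative assumptions, not Edmunds' actual configuration.

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

// Minimal Coherence client sketch: the NamedCache behaves like a java.util.Map
// backed by the cluster, so every get/put is a network hop to one or more
// cache nodes -- which is why per-node visibility matters so much.
public class InventoryCacheExample {
    public static void main(String[] args) {
        NamedCache inventory = CacheFactory.getCache("inventory"); // joins the cluster lazily

        inventory.put("dealer:4711:vehicles", "[...serialized inventory payload...]");
        Object payload = inventory.get("dealer:4711:vehicles");
        System.out.println("cached payload: " + payload);

        CacheFactory.shutdown(); // leave the cluster cleanly
    }
}
```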
  • As part of that 2010 redesign, the breakup of our applications, along with the improvements we'd made to our provisioning and deployment processes, resulted in an even larger burst of growth. Ops teams were drowning trying to keep up with and understand the rapid changes they were being asked to support, and Devs didn't have enough information to properly diagnose failures due to a lack of visibility beyond just the reported hotspot. YourKit, the console, VisualVM… all were part of the arsenal, but they require hundreds of connections and require you to be connected when issues occur. And let's be real… logs don't always have the information you'd like them to.
  • It's no exaggeration to say that AppDynamics has saved us hundreds of hours resolving issues in pre-production. It was clear from the start that this product had real value in identifying issues in production, but the real benefits began presenting themselves when we started working it into our pre-release qualification processes. I already mentioned that we use JMeter and Splunk as part of our performance testing workflow. Well, during the qualification of a recent release, our JMeter tests told us that one of our web apps, the one responsible for inventory display, had a response time increase of 111% over the last build. No joke: 111%. So JMeter tells us one thing… what would AppDynamics tell us?
  • So here's our Inventory application, and as you can see it's dependent on our Inventory Solr cluster for data. Those response times look lousy, don't they? Right off the bat, we were scratching our heads over this one. We've known for a while that our Solr services needed some work, so it wouldn't have surprised us to see slow response times there. But seeing that 74% of the requests were stalling out in the web app threw us for a bit of a loop. Okay… so what does drilling down tell us? CLICK – Well, right off the bat we saw that the three most expensive methods were around those Solr calls, but what stood out the most was item number two on that list: filtering the response to our Solr query. The team responsible for this web app was certain they hadn't changed anything in the filtering functions, so what else? Maybe the results being returned from Solr weren't what was expected? How to confirm that? CLICK – It's an HTTP request, which AppDynamics is capturing perfectly. That's it right there, in fact. Just looking at the URL isn't helpful, but when we popped it into a browser and saw the results, the pieces started coming together. The results in a browser showed us inventory for nearly half of the United States. Now, most of the visitors to our site aren't interested in traveling more than a couple hundred miles at most for a car, so the payload we were seeing here seemed a bit off. It's buried there in the URL, but our culprit had been found. CLICK – A search radius of 8,000 miles. Anyone here willing to go 8,000 miles for a new car? As it turns out, the issue was ultimately an error in our test. The test scripts attempt to vary the radius of each search so the results returned vary as well. The scripts had recently been modified, and the logic which varies the radius had a couple more zeroes in it than it needed (a hypothetical reconstruction of that bug is sketched below). Though this application would have performed perfectly in production, I've got to tell you… when you've got tools that you trust telling you something is wrong, it's important to take the time to understand what's really going on. AppDynamics is one of those tools that we trust.
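The test scripts themselves aren't shown in the deck, so the snippet below is only a hypothetical reconstruction of the class of bug described: a radius-variation helper where a leftover scale factor turns an intended double-digit radius into thousands of miles. Every name and number here is made up for illustration.

```java
import java.util.Random;

// Hypothetical reconstruction of the test-script bug described above:
// the intent is a randomized search radius of roughly 10-100 miles,
// but a stray scale factor multiplies it by 100.
public class SearchRadiusPicker {
    private static final Random RANDOM = new Random();

    static int buggyRadiusMiles() {
        int base = 10 + RANDOM.nextInt(91); // intended radius: 10..100 miles
        int scale = 100;                    // the "couple extra zeroes" -- should be 1
        return base * scale;                // yields 1,000..10,000 miles, e.g. 8,000
    }

    static int fixedRadiusMiles() {
        return 10 + RANDOM.nextInt(91);     // what the script should have produced
    }

    public static void main(String[] args) {
        System.out.println("buggy radius: " + buggyRadiusMiles() + " miles");
        System.out.println("fixed radius: " + fixedRadiusMiles() + " miles");
    }
}
```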
  • Continuous Delivery. A quick word about our deployment model, because it's important to talk about where this work is leading us next. Like many companies, we're moving towards a Continuous Delivery model. This has been challenging on many fronts, but we've made significant progress in 2012, so let me bring you up to speed on where we're at. Currently, our CD model stops before hitting a production environment. We're still working through the validation points of our pipeline, so until that's solidified, we'll continue to have gates which prevent full deployment to production. A couple of things I'd like to highlight here. We've built a couple of really great tools to help us out. One is called Kingpin - which you don't see labeled here - and it relies on our Chef services to know which hosts artifacts are meant to be deployed to. Kingpin is aware of all roles and environments at Edmunds and relates those roles to build artifacts. Another tool, Ronin, is constantly sweeping our artifact repository for new builds, and when it finds them, it sends them off to Kingpin for deployment. Once it has confirmed that Kingpin has completed artifact deployment, Ronin initiates the necessary functional and performance tests for that artifact. After those tests are completed, and provided they've passed, Ronin pushes the artifact on to the next runway in the pipeline (a rough sketch of this flow follows below). The reason to highlight both Ronin and Kingpin is this: they are an excellent example of the communication between Dev and Ops needed to deliver a solid service for the business. Kingpin wouldn't work without Chef. Neither would Ronin. And we wouldn't have structured our Chef services the way we have if there wasn't a clear picture of how they'd be used by Kingpin and Ronin. The symbiotic nature of these services required 100% clarity in how they'd be used together. It required symmetry between Dev and Ops to implement properly. Where does AppDynamics fit in? You're probably asking where AppDynamics fits into the pipeline. We currently leverage AppDynamics in our QA runway and all production runways. There are already discussions about more widespread use in our DEV runway to catch things even earlier, but that may be a while off. Right now, it fits nicely into the flow and, as I mentioned before, is trapping things nicely in QA before they hit production.
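Ronin and Kingpin are internal Edmunds tools and their real interfaces aren't shown here, so the sketch below only restates the described flow (sweep the artifact repository, hand new builds to Kingpin for deployment, run functional and performance tests, promote on success) using hypothetical Java types and method names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the Ronin promotion loop described above.
// These interfaces stand in for internal Edmunds services and are not real APIs.
interface ArtifactRepository {
    List<String> newBuilds();                 // artifact IDs not yet processed
}

interface KingpinClient {                     // assumed to resolve target hosts via Chef roles/environments
    void deploy(String artifact, String runway);
}

interface TestRunner {
    boolean functionalAndPerfTestsPass(String artifact, String runway);
}

public class RoninSweep {
    private final ArtifactRepository repo;
    private final KingpinClient kingpin;
    private final TestRunner tests;

    public RoninSweep(ArtifactRepository repo, KingpinClient kingpin, TestRunner tests) {
        this.repo = repo;
        this.kingpin = kingpin;
        this.tests = tests;
    }

    /** One sweep: deploy each new build to QA, test it, and promote it if the tests pass. */
    public void sweepOnce() {
        for (String artifact : repo.newBuilds()) {
            kingpin.deploy(artifact, "qa");
            if (tests.functionalAndPerfTestsPass(artifact, "qa")) {
                kingpin.deploy(artifact, "prod-candidate"); // next runway; a gate still guards full production
            }
        }
    }

    public static void main(String[] args) {
        RoninSweep ronin = new RoninSweep(
                () -> Arrays.asList("inventory-web-1.42.0"),                        // pretend one new build was found
                (artifact, runway) -> System.out.println("deploy " + artifact + " to " + runway),
                (artifact, runway) -> true);                                        // pretend all tests passed
        ronin.sweepOnce();
    }
}
```

In the pipeline the talk describes, the deploy step is where the Dev/Ops symbiosis shows up: Kingpin reads the same Chef role and environment data that provisions the hosts, which is why neither tool would work without the other.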
  • Developers building and supporting their applications from concept to production. Here's our reality at Edmunds: though developers are doing more to involve themselves in production support, the truth is they just haven't been exposed to it enough to take it on themselves 100% of the time. I'm not knocking all Devs, but it has to be acknowledged that ours are simply not yet able to take an on-call rotation and deal with failures in production the same way our Ops teams can. This will change. I believe very strongly in a build it / deploy it / support it model. The person best equipped to deal with something that's broken is the person that built it. And let's be real… nothing gets something fixed faster than making the person that built it deal with it when it's broken. I see us moving towards a model where Devs build and deploy their app to production, support it in production for six months or however long it takes to get the kinks worked out, and then hand off a well-running service to a Tier 1 support team. I intend to pilot that with a few projects over the next year, so keep in touch and I'll let you know how it goes. Ops as platform providers and maintainers, rather than application first responders: getting Ops to think differently is a continuing effort. We've hired really good talent in the last 12 months to help transform Ops from being "end of the line".
  • 1. Absence of trust. 2. Rigid view of roles. 3. Incomplete tooling. 4. Indecisive leadership.
  • Transcript

    • 1. Accelerating DevOps Culture @ Edmunds.com. John Martin (@tekbuddha), Sr. Director, Production Engineering
    • 2. Introduction
      – John Martin: Senior Director, Production Engineering; 10+ years supporting Java architectures; craves metrics, logs & whiteboards
      – My Crew: Paul MacDougall, Paddy Hannon, Ajit Zadgoankar
      – Edmunds.com: founded in 1966 (print magazine); first website in 1995; 475 employees; 650,000 Daily Uniques
    • 3. Our Architecture
    • 4. A moment about DevOps ®
    • 5. The Drive to DevOps INFRASTRUCTURE GROWTH COMMUNICATION FAILURE GO FASTER, BE EFFICIENT
    • 6. Infrastructure Growth
    • 7. Manual Deployments
    • 8. Self-inflicted pain through growth
    • 9. Communication Failure
    • 10. The Turning Point
      – Edmunds.com 2010 Redesign: complete rewrite, more modular architecture
      – QA + BETA sites worked great! Built brick-by-brick
      – Difficulties repeating build-out: BETA site became Production
      – 3 months to iron out build-out processes
      – Lack of collaboration between Dev & Ops
    • 11. Who made DevOps happen at Edmunds?
      – Who made it happen? Tech Leads who spent too much time in war rooms!
      – How did they make it happen? Reorganized specialized teams within Dev + Ops to work through the initial pains; shed preconceptions on who should do what.
    • 12. Our DevOps Collaboration Tools
      – The usual suspects: Email, Jabber, Google Hangouts
      – Nothing beats a whiteboard! Okay… beer > whiteboard.
    • 13. Our DevOps Collaboration Tools: Splunk
    • 14. Our DevOps Collaboration Tools: RTView
    • 15. What value does AppDynamics give DevOps?
      – 2005-2009: 2 web apps, 50 hosts per runway
      – 2010 Platform Redesign: 30 web apps, 200 hosts per runway, lots of touchpoints to keep an eye on
      – AD provides a way to speak the same language: #{data} > #{feelings}
    • 16. Pre-Release Qualifications
      – Hundreds of hours saved in pre-release tests
      – Early discovery of new hotspots
      – Most recent example: Inventory response time increased 111%. 111%? That's more than just a hotspot!
    • 17. Pre-Release Qualifications
    • 18. Business Impact & ROI of AppDynamics (Use Case: Before / After / $ Savings)
      – Managing Availability: 99.91% / 99.95% / $167,475 revenue protection
      – Average MTTR of Production Sev-1 Incidents: 5 man-days / reduced by 45% (cons) / $307,521 productivity
      – Average time to identify Pre-Production defects: 2.5 man-days / reduced by 35% (cons) / $320,170 productivity
      – Total first year savings: $795,166
      – Expected two year ROI: $1,217,334
    • 19. Being Agile in 2012
    • 20. DevOps at Edmunds.com Tomorrow
      – Devs building and supporting applications: from concept to production; Devs on-call
      – Ops as platform providers: API providers for all levels of infrastructure; get out of the application support business
    • 21. Reasons DevOps will Fail
      – Absence of trust
      – Rigid view of roles
      – Incomplete tooling
      – Indecisiveness in leadership
    • 22. Top 5 Reasons DevOps will Succeed
      – Be Honest
      – Communicate Early and Often
      – Educate
      – Criticize Constructively
      – Create Champions
    • 23. Questions ?