Building DevOps with Beer & Whiteboards

Velocity 2013 - How Edmunds learned from failure, began opening communications between silos, and built a DevOps culture over beer and whiteboards.

(HINT: Download to see the presenter's notes for what may not make sense without a speaker!)

  • - The automotive resource of the Internet - Originally in print, then Gopher in 1994, Web in 1996
  • - Our environment is highly distributed. When you visit Edmunds.com you’re interacting with one or more of our 30 web apps spread out across a couple hundred hosts. - The website itself is built on Apache Tomcat, Solr, MongoDB, and Oracle Coherence. - Internally, you’ll also find ActiveMQ, Oracle, and some lingering WebLogic services we’ll soon be doing away with. - We rely heavily on a mix of different tools to build and support all this: Chef, Jenkins, CloudStack, AppDynamics, Splunk, to name a few. - But I’m getting ahead of myself, because how we got to this architecture is part of the tale of how Edmunds came to embrace a DevOps mindset.
  • - So then where does our story start? - Let me be up front: WE STUMBLED. WE PERFECTED THE FACEPALM. - The specifics of our situations when the shit hit the fan may have felt unique, but they’re not. - We learned from our mistakes with the intent of getting better. - Let’s talk facepalms...
  • - This may be familiar... - In 2005, we had 30 servers. In 2006, we burst up to 300 and held steady for a few years with slow growth. - In 2009, we saw a radical jump in server deployment. - We grew in number of servers, but not in the number of admins. - We had Kickstart, but that’s only good at bootstrap time. - BladeLogic + AnthillPro seemed like a good solution, but there were major issues. - Growth is painful.
  • - One very specific breakdown in our history stands out to me. - 2007 - Edmunds 2.0: Introducing CMS for the business - All content was locked to a monthly release cycle. - Six months of functional testing, without any performance validation. - Two months before launch, performance testing uncovered scalability issues. - Ops response: double the application infrastructure and throw in a hardware cache appliance. - The breakdown in relationships between Dev/Ops led to major business costs. - Fast forward to 2009; remember that big jump in the number of servers we were deploying?
  • - 2010 Edmunds Redesign: Complete rewrite of all website code + modular breakout of applications. - Good collaboration between Dev/Ops to understand requirements on all sides. - But QA + BETA were built brick-by-brick, and not easily reproducible. - Armed with BladeLogic + AnthillPro, build and deploy were more automated but weren’t coupled together! - The production environment took 3 months to build while BETA served the new website. - We started to realize that the real challenge wasn’t technology but culture.
  • We wanted to stop working like this...
  • and start building like this.
  • We really wanted to get out of here.
  • - And go here - This is the Daily Pint! Let me buy you a beer! - This is where the wildest of ideas are born - Disagreements are worked through with positive jest and jeers - It is where we talked it over
  • - Then we’d take it here! - THE MOST UNDERRATED TOOL YOU ALREADY HAVE. - Floor-to-ceiling whiteboards where we worked out our ideas. - We talked gaps in handoffs, failure rates due to manual builds, linking tools together - “self-service”, automated testing, and much much more. - What happened there was no “ops”, no “dev”. We were technologists working to solve problems with no boundaries of roles in the way. - Our proposal: tear down silos. - We did just that!
  • - So who made this happen, and how? - TechLeads who spent too much time in war rooms started chewing on the problem together. - Identified gaps in provisioning/config management and app deployment tools. - Scott McNealy was right about hardware/software dependencies. - Two teams, Production Engineering & Automation Engineering, set about to provide tools which bridged the divide. - (ProdEng = Ops) + (AutoEng = Dev) == How we really started making inroads. (NOT IDEAL!) - Members of both these teams shed traditional views on what they were supposed to do and just did it. - The results were improved relationships, better tooling, and a clearer perspective on how future projects could work.
  • - So we started linking all our tools together! - “Your tools don’t make your culture, but they do have an impact on the people who do.”
  • - We now talk about the data that our tools provide us. - You can talk from your gut, but you better back it up with data. - We pushed ownership and accountability by leveraging what we found with data. - The metrics were clearly pointing out our failures, allowing us to learn how to prevent them in the future.
  • - Armed with a tighter toolchain and a new way of working together, we were once again about to be put to the test. - Edmunds began investing resources into “the cloud”. - Heavily virtualized since 2010, but no clear “cloud” offerings - Two teams, one objective: make edmunds.com work on $x cloud platform - Why two? DIVERSIFY.
  • - This was our first shot at a “new” project armed with our new practices + tooling. - These were uncharted waters, even though we’d been virtualized for a few years. “Cloud” is a different beast. - But with familiar tooling + improved communications, these teams produced successful results that were easily measured. - Environment build time down to less than a week. - Done with 95% of the same toolset for both cloud platforms.
  • - We’ve all spent our careers as firefighters. - Street cred with co-workers, bosses, executives as cool-headed during a mess. - So what about when there are fewer - or different - kinds of fires? - More accountable individuals, more “self-service”, fewer fires == increased capacity for business acumen. - This is the business value that what we call DevOps is leading us to.
  • - To go from this to this... - By investing in addressing systemic issues around communication + partnerships, we increase our capacity to take on other challenges. - No big secret, it’s been talked about by Damon Edwards, John Willis. - Covered beautifully in “The Phoenix Project”. - Technologists in the age of the Internet are no longer back-office workers keeping the lights on. - We help shape the direction of our companies; we have a direct impact on revenue, in a field that now sees change yearly. - We needed to change the way we work together to free ourselves for “bigger things”. - An exciting time to be working in our field!
  • - Okay, back to our cloud initiatives... - With this additional capacity, here are a few things we learned to give value to our company. - Cloud isn’t free; server sprawl can be expensive and lack of education with “self-service” becomes a major issue. - How much does it cost to operate your environment? It’s tough to calculate! - Licensing by host or CPUs is costly at scale, so look for alternatives to those things you pay a premium for. - Managing operating costs starts with understanding where the money is going!
  • - A great growing experience the last few years @ Edmunds. - No rose-tinted glasses to suggest we’ve solved all our problems! BUT WE GOT SOME BIG ONES! - And today we work a helluva lot more like this! - So, let’s take on the challenge of showing some metrics of success by adopting a DevOps culture...
  • - Application Availability has increased. Not the holy metric of “four 9’s”, but a bump all the same! - The number of high-severity INCs has dropped 50% year-over-year. - The number of TKTs filed has dropped 50% year-over-year --- Self-service is slick! - The MTTR of pre-production issues has dropped from 5 days to 2 days, and is even faster than that in most situations. - The time it takes us to build runways has gone down from 3 months to 1 week! - With deeper inspection of our cost-per-host, we’re expecting to begin shaving off overall operating costs drastically for next year’s budget. - Team morale? Well...
  • We got out of here.
  • And into here, so it’s pretty good.
  • - Always more to be done! You’re never “finished” growing. - Devs on-call! (You build it, you run it!) - Reducing infrastructure footprint == reducing operating costs - More RESTful applications - Other cloud offerings?
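
The before-and-after figures on slide 33 (availability of 99.91% versus 99.95%, high-severity incidents from 21 to 10, help desk tickets from 196 to 99) can be turned into more tangible numbers. What follows is a minimal sketch and not something from the talk: the function names and the 365-day, 24x7 year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, service expected up 24x7


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service is down at the given availability percentage."""
    return HOURS_PER_YEAR * (100.0 - availability_pct) / 100.0


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, expressed as a percentage."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Availability figures as listed on slide 33.
    for pct in (99.91, 99.95):
        print(f"{pct}% available -> {annual_downtime_hours(pct):.2f} h/year down")
    # Incident and ticket counts as listed on slide 33.
    print(f"incidents: {percent_reduction(21, 10):.1f}% fewer")
    print(f"tickets:   {percent_reduction(196, 99):.1f}% fewer")
```

Under these assumptions, the 0.04-point availability gain corresponds to roughly 3.5 fewer hours of downtime per year (about 7.9 hours versus 4.4 hours), and both count reductions land close to the 50% quoted in the notes.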

Transcript

  • 1. BUILDING DEVOPS WITH BEER AND WHITEBOARDS - JOHN MARTIN @tekBuddha - STEVE BURTON @BurtonSays
  • 2. CALL OF DUTY: DEV OPS
  • 3. the challenge - GAME SELECT: DEVELOPER, OPERATIONS, DEVOPS, NOOPS - MISSION PARAMETERS: - MISSION OBJECTIVES: KILL YOUR COMPETITORS - DEVELOP, TEST, DEPLOY, OPERATE - AUTOMATION & BUSINESS AGILITY - RECOMMENDED ESSENTIALS: BEER, WHITEBOARDS, COMMUNICATION
  • 4. but what is success?
  • 5. “success is going from failure to failure without losing enthusiasm” - Winston Churchill
  • 6. failure
  • 7. mean time to innocence (MTTI)
  • 8. mean time to resolution (MTTR) - Weeks, Days, Hours or Minutes?
  • 9. mean time between failure (MTBF) - Weeks, Days, Hours or Minutes?
  • 10. availability? 99.9% - The most meaningless metric in IT today.
  • 11. business metrics > revenue > throughput > performance > productivity
  • 12. Edmunds.com - EXPERT CAR ADVICE - FOUNDED IN 1966 - 550 EMPLOYEES - 650K DAILY UNIQUES
  • 13. whoami - SR DIRECTOR PRODUCTION ENGINEERING - A DECADE SUPPORTING JAVA ARCHITECTURES - FUELED BY METRICS, WHITEBOARDS, LOGS, AND BEER
  • 14. Our environment.
  • 15. Compelling Events - Source: http://is.gd/iJU4et
  • 16. Growing Pains
  • 17. Communication
  • 18. 2010 Redesign - Source: http://is.gd/L77vl1 - COMPLETE REWRITE OF PLATFORM - QA & BETA WORKED GREAT! - BETA BECOMES PROD - 3 MONTHS IN A WAR ROOM
  • 19. NOT LIKE THIS - Source: http://is.gd/PFLRmW
  • 20. LIKE THIS - Source: http://is.gd/iJU4et
  • 21. OUT OF HERE - Source: http://is.gd/oFCXNH
  • 22. IN TO HERE - Source: http://is.gd/iJU4et
  • 23. ONE OF THE MOST UNDERRATED TOOLS YOU ALREADY HAVE. THE WHITEBOARD
  • 24. TEARING IT DOWN - Source: http://is.gd/Vrnwu4
  • 25. The Toolshed
  • 26. Communicating with Metrics - Source: http://is.gd/L77vl1 - DATA DRIVEN CULTURE - CHECK THE GUT - DRIVE ACCOUNTABILITY - LEARN FROM FAILURE
  • 27. CLOUDY SKIES - Source: http://is.gd/arBZ4M
  • 28. Putting It All Together - Source: http://is.gd/L77vl1 - UNCHARTED WATERS - FAMILIAR TOOLING - IMPROVED COMMUNICATIONS - MEASURABLE SUCCESS STORIES
  • 29. A Personal Note - Source: http://is.gd/L77vl1
  • 30. A Personal Note - Source: http://is.gd/L77vl1
  • 31. The Business Proposition - Source: http://is.gd/L77vl1 - THE CLOUD ISN’T FREE - COST PER HOST CAN GET SCARY - LOOK FOR THE FREEBIES
  • 32. AWESOMENESS - Source: http://is.gd/iJU4et
  • 33. Measuring Success - Source: http://is.gd/L77vl1
      Metric                        | Before   | After    | Benefit     | $ Savings
      Application Availability %    | 99.91%   | 99.95%   | > 0.04%     | $167k revenue protection
      # of High Severity Incidents  | 21       | 10       | < 50%       | $307k productivity
      # of Help desk Tickets        | 196      | 99       | < 50%       |
      MTTR in Pre-Production        | 5 Days   | 2 Days   | < 45%       | $320k productivity
      Time To Build Runways         | 3 Months | < 1 Week | Seriously?! |
      Operating Costs               | $$$$     | TBD      |             |
      Team Morale                   | Bummered | Beer     |             |
  • 34. OUT OF HERE - Source: http://is.gd/oFCXNH
  • 35. IN TO HERE - Source: http://is.gd/iJU4et
  • 36. WHERE NEXT? - Source: http://is.gd/xKdI6E
  • 37. JOHN MARTIN - jmartin@edmunds.com - @tekBuddha - STEVE BURTON - sburton@appdynamics.com - @BurtonSays - QUESTIONS? - We’re hiring! Stop by our booth!
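
Slides 8 through 10 name MTTR, MTBF, and availability without relating them. One common way to connect the three is the steady-state relation availability = MTBF / (MTBF + MTTR). The sketch below uses that textbook relation, which the slides do not state, and the two sample systems are hypothetical numbers chosen only to show how a single 99.9% figure can describe very different failure patterns.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but each outage lasts an hour.
rare_long = availability(mtbf_hours=999.0, mttr_hours=1.0)

# Hypothetical system B: fails about every ten hours, recovers in 36 seconds.
frequent_short = availability(mtbf_hours=9.99, mttr_hours=0.01)

if __name__ == "__main__":
    print(f"rare, long outages:      {rare_long:.3%}")
    print(f"frequent, short outages: {frequent_short:.3%}")
```

Both hypothetical systems report the same availability, so the percentage alone says nothing about how often users hit a failure or how long each one lasts. That is one possible reading of why slide 10 is skeptical of the metric and slide 11 turns to business metrics instead.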