B. Durrett The Challenges of Continuous Deployment Social Developer Summit
  • IMVU is avatar-based chat with a hint of social networking
    People decorate their avatar, room and home pages with UGC
    All content is built by customers, for customers – over 4MM items
    We have a robust economy: micro transactions, virtual currency, creators with real businesses
    We offer Windows download for chat and web for social-network type activities
  • How many familiar with continuous deployment?
    How many practicing CD now?
    Release frequency? Weekly, several times a week, daily, several times a day?
  • Like continuous integration
    Where CI reduces the pain of integration, CD reduces the pain of deployment
  • Early last year Timothy Fitz (TF), an engineer at IMVU, blogged about this
    Responses reflected a lot of skepticism
    At that point: millions of customers, $12MM run rate
    Uptime > 99%
  • Third point is critical
    Talk about other place
    A good system builds confidence
    Engineers have code running in production on their first day (current record is 12:30 in the afternoon)
  • At another place I worked, each release represented 6 weeks of work
    Thousands of commits
    Release day was everybody trying to react to issues
  • Simple overview of the process
    Tests not passing: intermittent or missed
    Rollbacks: real issues, occasional spikes
    Total time capturing metrics 2-3 minutes… lots of customers
  • Here is a more detailed eye-chart
    Mostly here to show that it’s a little more complex
    Make slides available / provide details for those interested
  • At IMVU, went from 3->1
    No tests; monitoring / alerting was early customer e-mail
    CI started off as local tests
    Deploy was rsync, rollback was SVN revert & rsync again
    Monitoring – predictive desired, can’t make it work
  • Need tests? If it is important that it keeps working, it needs test coverage
    If not, why is it in the code base in the first place?
    If you already have code not under test: old code, less coverage
    Keep tests fast… too many functional tests
    Dependency injection
  • When I talk to people trying this…
    Cultural:
    Battle of testing on BB – 1 person efficient, many not
    20% overhead
    Morale: builds over 20 minutes == unhappy
  • When describing this system, I frequently get asked about databases
    Of course, NoSQL may fix this
  • Another common question
  • May seem obvious that fast tests are good
    How slow tests impact you may not be obvious; poor tagging leads to test in build
  • Setup dependency – scorched earth
    External services
  • We solved with some custom work
    Isolate bad queries
    Query killer
    Outsourcing
  • We hope so!
    Since we first blogged and were questioned:
    More than doubled business
    Almost doubled staff
  • Square wave pain
    Check-in pile-up
    In 2009 the company kept head count flat
    Working effectively, 2x revenue, few build problems
    This year having to invest more; engineers to a build system are like cars to a freeway
  • This may seem obvious, but these are critical
    Easy to lose sight of how much productivity is being lost
  • Tech Ops background, can’t believe I missed this
    Same diligence you would have for your live site
    Monitoring, trending, alerting
  • Here are sample build and push times
  • More engineers = more pushes, more opportunity for red
    With red, more commits get included in next push, more chance of failure
    More tests = more chance of intermittent failures
    Impact of a red test on rest of organization
    Requires process to keep working well
  • client build machine count: 24
    web build machine count: 72 total
    linux web machines: 51
    windows web machines: 21
    Total test count: 1392, including:
    selenium tests: 134
    windows only tests: 183
    linux only tests: 10
    other / general tests: 1065
  • Important because of test-tagging, usability.
  • We have some time, I am happy to take questions
  • Thank you for attending my presentation

Transcript

  • 1. Scaling with Continuous Deployment
    Social Developer Summit
    San Francisco, CA, June 29, 2010
    Brett G. Durrett (@bdurrett)
    Vice President Engineering & Operations, IMVU, Inc.
  • 2.
  • 3. Survey Says
    Continuous Deployment... who is with me?
  • 4. In a Nutshell
    What is Continuous Deployment?
    Engineer commits code
    20 minutes later it is live in production
    Repeat about 50 times per day
  • 5. Does This Really Work?
    “Maybe this is just viable for a single developer … your site will be down. A lot.”
    “It seems like the author either has no customers or very understanding customers”
    Responses to February 2009 posting by Timothy Fitz about Continuous Deployment at IMVU
    (at the time IMVU had a $12 million run rate)
  • 6. Benefits
    Regressions easy to find, correct
    Releases have zero overhead
    Rapid iteration using real customer metrics
  • 7. Finding and Fixing Problems
    Each release has few changes, 1-3 commits
    Production issues correlate with check-in timestamp
    No overhead to producing a new release to correct issue
    Identifying cause takes minutes
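A minimal sketch of how a production issue can be matched to a check-in by timestamp, as the slide describes. The function name, the revision labels and the sample times are hypothetical; the only assumption about the data is that deploys are listed in time order.

```python
from bisect import bisect_right
from datetime import datetime


def suspect_deploy(deploys, issue_time):
    """Return the latest (time, revision) deployed at or before issue_time.

    deploys must be sorted by time. Returns None when the issue predates
    every known deploy.
    """
    times = [t for t, _ in deploys]
    i = bisect_right(times, issue_time)
    return deploys[i - 1] if i else None


# Hypothetical deploy log: one small release every 20 minutes.
deploys = [
    (datetime(2010, 6, 29, 10, 0), "r1001"),
    (datetime(2010, 6, 29, 10, 20), "r1002"),
    (datetime(2010, 6, 29, 10, 40), "r1003"),
]

# A metric spike seen at 10:27 points at the 10:20 release.
culprit = suspect_deploy(deploys, datetime(2010, 6, 29, 10, 27))
```

Because each release carries only a few commits, the revision returned here narrows the search to a handful of changes.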
  • 8. CD at IMVU: Simple Overview
    Local tests pass, engineer commits code
    Lots and lots of tests run
    All tests pass?
      No: Revert commit (Blocks)
      Yes: Code deployed to % of servers
    Metrics good?
      No: Rollback (Blocks)
      Yes: Code deployed to all servers
    Metrics still good?
      No: Rollback (Blocks)
      Yes: Win!
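The flow on this slide can be sketched as a single function. This is one reading of the flowchart (tests, then a partial deploy with a metrics check, then a full deploy with a second check), not IMVU's actual tooling. Every name here is hypothetical, and the 5% canary fraction is an assumption, since the slide only says "% of servers".

```python
CANARY_FRACTION = 0.05  # assumed value; the slide does not give a number


def continuous_deploy(commit, run_tests, deploy, metrics_ok, revert, rollback):
    """Push one commit through the test, partial-deploy and full-deploy gates."""
    if not run_tests(commit):
        revert(commit)            # failing tests: revert the commit
        return "reverted"
    deploy(commit, fraction=CANARY_FRACTION)
    if not metrics_ok():
        rollback(commit)          # bad metrics on the partial deploy
        return "rolled back"
    deploy(commit, fraction=1.0)
    if not metrics_ok():
        rollback(commit)          # bad metrics after the full deploy
        return "rolled back"
    return "win"


# Usage with stub callables standing in for real build and ops systems.
log = []
result = continuous_deploy(
    "r1234",
    run_tests=lambda c: True,
    deploy=lambda c, fraction: log.append(("deploy", fraction)),
    metrics_ok=lambda: True,
    revert=lambda c: log.append(("revert", c)),
    rollback=lambda c: log.append(("rollback", c)),
)
```

Passing the steps in as callables keeps the control flow visible and lets each gate be swapped for a real system later.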
  • 9. CD at IMVU: Detailed Overview
  • 10. Getting Started – Extreme Basics
    Continuous integration system
    Production monitoring and alerting
    System performance
    Business metrics
    Trending is nice too
    Simple deploy / roll-back system
  • 11. Commit to Making Forward Progress
    Require coverage for all new code
    Add coverage for bugs / regressions
    Understand and fix root cause of failures
  • 12. Expect Some Hurdles
    Production outages
    New overhead
    Tests
    Build systems
    Production outages
    Frustration
    Production outages
    (but well worth it)
  • 13. Dealing with SQL
    Problems
    Difficult to roll-back schema
    Alter statements lock / impact customers
    Solutions
    New schema has formal review process
    No alter on large tables, create new table
    Copy on read
    Complete migration with background job
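A minimal sketch of the "new table, copy on read, background job" approach, using an in-memory SQLite database so it runs anywhere. The table names, columns and batch size are hypothetical, and this is not IMVU's implementation. It covers reads only; write paths would also need to target the new table.

```python
import sqlite3


def read_profile(db, user_id):
    """Read from the new table; on a miss, copy the row over from the old one."""
    row = db.execute(
        "SELECT id, name, bio FROM profile_v2 WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        old = db.execute(
            "SELECT id, name FROM profile_v1 WHERE id = ?", (user_id,)
        ).fetchone()
        if old is None:
            return None
        db.execute("INSERT INTO profile_v2 (id, name, bio) VALUES (?, ?, '')", old)
        row = (old[0], old[1], "")
    return row


def migrate_batch(db, batch_size=100):
    """Background job step: copy up to batch_size unmigrated rows, return the count."""
    cur = db.execute(
        "INSERT INTO profile_v2 (id, name, bio) "
        "SELECT id, name, '' FROM profile_v1 "
        "WHERE id NOT IN (SELECT id FROM profile_v2) LIMIT ?",
        (batch_size,),
    )
    return cur.rowcount


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE profile_v1 (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE profile_v2 (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
db.executemany("INSERT INTO profile_v1 VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])

first = read_profile(db, 2)            # copied on first read
while migrate_batch(db, batch_size=1):  # background job drains the rest
    pass
```

The old table is never altered, so the change can be abandoned by simply not reading from the new table.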
  • 14. Big Features
    Developed on trunk, not branch
    “hidden” from customers by A/B experiment
    100% control, add QA to experiment
    Deployed daily during development
    Slow roll-out by increasing experiment %
    Experiment closed = fully launched
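One common way to gate a feature behind an experiment percentage is to hash each user into a stable bucket from 0 to 99. The sketch below assumes that approach; the slides do not say how IMVU assigns users, and the function and experiment names are hypothetical.

```python
import hashlib


def bucket(experiment, user_id):
    """Map a user to a stable bucket in 0..99 for the given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def sees_feature(experiment, user_id, percent, qa_users=frozenset()):
    """True if the user is in the experiment branch at this roll-out percent."""
    if user_id in qa_users:
        return True  # QA is added to the experiment even at 0%
    return bucket(experiment, user_id) < percent


# During development: everyone is in control, only QA sees the feature.
qa_only = sees_feature("new_shop", "qa_alice", 0, qa_users={"qa_alice"})
customer = sees_feature("new_shop", "customer_42", 0)
```

Because buckets are stable, raising the percentage only adds users; nobody who already has the feature loses it during the roll-out.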
  • 15. Test Speed
    Slow tests burden to scaling
    Can’t run all tests in sandbox
    Faster to debug on build cluster
    If possible…
    Keep tests fast
    Keep tests specific
  • 16. The cost of failing tests
    As the team grows…
    More likely to have test failures
    More people blocked as a result
    Intermittent failures very bad
    Eliminate the root cause
  • 17. Other Issues
    Won’t catch issues that fail slowly
    SELECT * FROM growing_table WHERE 1
    Some critical areas cause hard lock-ups
    MySQL
    Memcached
    Lack of test coverage of older code
    Not an issue if you start with test coverage
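Queries that degrade slowly, like the growing-table SELECT above, are usually handled at run time rather than by tests. The sketch below shows one possible shape for that: it scans rows shaped like MySQL's SHOW FULL PROCESSLIST output and produces KILL QUERY statements for long-running SELECTs. The 30-second threshold, the function name and the sample rows are assumptions; the deck does not describe how this was done at IMVU.

```python
def kill_statements(processlist, max_seconds=30):
    """Return KILL QUERY statements for SELECTs running longer than max_seconds.

    processlist: dicts with the Id, Command, Time and Info columns,
    as reported by SHOW FULL PROCESSLIST.
    """
    statements = []
    for row in processlist:
        sql = (row["Info"] or "").lstrip().upper()
        is_select = row["Command"] == "Query" and sql.startswith("SELECT")
        if is_select and row["Time"] > max_seconds:
            statements.append(f"KILL QUERY {row['Id']}")
    return statements


# Hypothetical snapshot of running connections.
sample = [
    {"Id": 11, "Command": "Query", "Time": 95, "Info": "SELECT * FROM growing_table WHERE 1"},
    {"Id": 12, "Command": "Query", "Time": 2, "Info": "SELECT 1"},
    {"Id": 13, "Command": "Sleep", "Time": 400, "Info": None},
    {"Id": 14, "Command": "Query", "Time": 120, "Info": "UPDATE t SET x = 1"},
]
to_kill = kill_statements(sample)
```

Only reads are targeted here; killing a long write can leave more work to undo than it saves.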
  • 18. Does Continuous Deployment Scale?
    Technical staff ~50 people
    10 million monthly unique visitors
    Peak ~115K concurrent IM client logins
    It’s a real business!
    $40 million run rate
    Profitable and doubled revenue in 2009
  • 19. Newer Scaling Challenges
    Biggest challenges come with growth of the engineering organization
  • 20. SLA for Build Systems
    Build systems are a critical service
  • 21. SLA for Build Systems
    Build systems are a critical service
    Run them that way
  • 22. Build and Push Times
  • 23. Overall Availability
  • 24. Build Throughput
    Initial implementation sequential builds
    Scaled okay to ~20 engineers
    Like trains running every 20 minutes
    One “red” blocks all following builds
    Solution: build isolation
    Enable testing single build without deploy
    “Red” build pulled, allow other builds to pass
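The difference between the two models can be shown with a toy simulation. Each build is a (name, passes) pair; the names and outcomes are made up, and real isolation also involves testing each build without deploying it, which this sketch does not model.

```python
def sequential(builds):
    """Train model: the first red build blocks itself and everything behind it."""
    shipped = []
    for name, passes in builds:
        if not passes:
            break
        shipped.append(name)
    return shipped


def isolated(builds):
    """Isolation model: each build is judged alone and red builds are pulled."""
    return [name for name, passes in builds if passes]


# Hypothetical queue where the second build is red.
queue = [("b1", True), ("b2", False), ("b3", True), ("b4", True)]
blocked = sequential(queue)
unblocked = isolated(queue)
```

With sequential builds one red build costs every engineer queued behind it; with isolation it costs only its author.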
  • 25. Current Systems
    > 15,000 tests
    72 web build servers
    51 Linux, 21 Windows
    > 6 hours of tests on average hardware
    Deploy to cluster of ~700 servers
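As a rough check on these numbers: 6 hours of tests spread perfectly over 72 build servers is 6 × 60 / 72 = 5 minutes of wall time, which is a floor rather than an expectation. The sketch below shows one generic way to approach that floor, a longest-first greedy assignment of tests to servers. It is an illustration with made-up durations, not a description of IMVU's scheduler.

```python
import heapq


def shard_tests(durations, servers):
    """Longest-first greedy: give each test to the least-loaded server.

    durations: mapping of test name to run time in seconds.
    Returns (shards, wall_time) where wall_time is the busiest server's load.
    """
    heap = [(0.0, i) for i in range(servers)]
    shards = [[] for _ in range(servers)]
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        shards[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    return shards, max(load for load, _ in heap)


ideal_minutes = 6 * 60 / 72  # perfect balance across 72 servers

# Hypothetical durations for a tiny suite on two servers.
shards, wall_time = shard_tests({"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}, servers=2)
```

Slow individual tests raise the floor: no assignment can finish faster than the single longest test.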
  • 26. Web Build Software
    Custom test-file runner with JS GUI
    PHP SimpleTest
    Python's built-in unittest
    Selenium Core with in-house API wrapper
    YUITest for browser JS unit tests
    Erlang EUnit
  • 27. Conclusion
    Continuous Deployment is good
    Try it – starting earlier is easier
    It’s a key part of a nutritious development process
  • 28. Questions?
  • 29. More on Continuous Deployment
    SD Times Leaders of Agile: Kent Beck's Principles of Agility: http://bit.ly/9wsAYv (this webinar tomorrow, June 30)
    Eric Ries (Startup Lessons Learned) on Continuous Deployment: http://bit.ly/5l6X1
    Timothy Fitz (IMVU) Doing the impossible 50 times a day: http://bit.ly/OxJv
  • 30. Thank You!
    Brett G. Durrett
    bdurrett@imvu.com
    Twitter: @bdurrett
    IMVU was recognized as one of
    the “Best Places to Work” (and we’re hiring)
    http://www.imvu.com/jobs/