Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Moving Fast At Scale


Published on

Faster is Better. High-performing organizations deploy both substantially faster and substantially more reliably, and thus are 2.5x more likely to achieve business goals. This keynote covers how to move fast at large scale:
* Organizing for Speed
* What to Build and What NOT to Build
* When to Build
* How to Build
* Delivering and Operating
Keynote at Reversim Summit 2017 in Tel Aviv, Israel.

Published in: Software

Moving Fast At Scale

  1. 1. Moving Fast at Scale Randy Shoup @randyshoup
  2. 2. Background • VP Engineering at Stitch Fix o Using technology and data science to revolutionize clothing retail • Consulting “CTO as a service” o Helping companies move fast at scale  • Director of Engineering for Google App Engine o World’s largest Platform-as-a-Service • Chief Engineer at eBay o Evolving multiple generations of eBay’s infrastructure
  3. 3. Stitch Fix @randyshoup
  4. 4. Stitch Fix @randyshoup
  5. 5. Stitch Fix @randyshoup
  6. 6. Stitch Fix @randyshoup
  7. 7. Personalized Recommendations Inventory Algorithmic recommendations Machine learning @randyshoup
  8. 8. Expert Human Curation Human curation Algorithmic recommendations @randyshoup
  9. 9. Humans + Data Science • 1:1 Ratio of Data Science to Engineering o More than 100 software engineers o ~80 data scientists and algorithm developers o Unique ratio in our industry • Apply intelligence to *every* part of the business o Buying o Inventory management o Logistics optimization o Styling recommendations o Demand prediction • Humans and machines augmenting each other @randyshoup
  10. 10. Faster is Better
  11. 11. Willingness to go fast Ability to go fast +
  12. 12. Lack of Fear Capability +
  13. 13. High-Performing Organizations • Multiple deploys per day vs. one per month • Commit to deploy in less than 1 hour vs. one week • Recover from failure in less than 1 hour vs. one day • Change failure rate of 0-15% vs. 31-45% @randyshoup
  14. 14. High-Performing Organizations 2.5x more likely to exceed business goals o Profitability o Market share o Productivity @randyshoup
  15. 15. Speed vs. Stability?
  16. 16. Faster is Better
  17. 17. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  18. 18. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  19. 19. Conway’s Law • Organization determines architecture o Design of a system will be a reflection of the communication paths within the organization • Modular system requires modular organization o Small, independent teams lead to more flexible, composable systems o Larger, interdependent teams lead to larger systems • We can engineer the system we want by engineering the organization @randyshoup
  20. 20. Small “Service” Teams • Full-Stack, “2 Pizza” Teams o No team should be larger than can be fed by 2 large pizzas o Typically 4-6 people o All disciplines required for the team to function • Aligned to Business Domains o Clear, well-defined area of responsibility o Single service or set of related services o Deep understanding of business problems • Growth through “cellular mitosis” @randyshoup
  21. 21. Ideally, 80% of project work should be within a team boundary.
  22. 22. eBay “Train Seats” • eBay’s development process (circa 2006) o Design and estimate project (“Train Seat” == 2 engineer-weeks) o Assign engineers from common pool to implement tasks o Designer does not implement; implementers do not design (or test, or operate) •  Dysfunctional engineering culture o (-) Engineers treated as interchangeable “cogs” o (-) No regard for skill, interest, experience o (-) No pride of ownership in task implementation o (-) No long-term ownership of codebase
  23. 23. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  24. 24. “Building the wrong thing is the biggest waste in software development.” -- Mary and Tom Poppendieck, Lean Software Development
  25. 25. What problem are you trying to solve?
  26. 26. “A problem well-stated is a problem half-solved.” -- Charles Kettering, former head of research for General Motors
  27. 27. What Problem Are You Trying to Solve? • Focus on what is important for your business • Problem might be solved without any technology at all o Redefine the problem o Change the business process o Implement manually for a while before automating in an application @randyshoup
  28. 28. Buy, Not Build • Use Cloud Infrastructure o Faster, cheaper, better than we can do ourselves o Stitch Fix has no owned physical infrastructure anywhere in the world • Prefer Open Source o Kubernetes, Docker, Istio o MySQL, Postgres, Redis, Elastic Search o Machine learning models o Etc. o Usually better than the commercial alternatives (!) @randyshoup
  29. 29. Buy, Not Build • Third-Party Services o Stitch Fix uses >50 third party services o Logging, monitoring, alerting o Project management, bug tracking o Billing, fraud detection o Etc. • Focus on our core competency o Use services for everything else (!) @randyshoup
  30. 30. Soon it will be just as common to run your own data center as it is to run your own electrical power generation.
  31. 31. Experimental Discipline • State your hypothesis o What metrics do you expect to move and why o Understand your baseline • Run a real A | B test o Sample size o Isolated treatment and control groups o No peeking or quitting early! • Obsessively log and measure o Understand customer and system behavior o Understand why this experiment worked or did not
  32. 32. Experimental Discipline • Listen to the data o Data trumps hope and intuition o Develop insights for next experiment • Thinking of the experiment is art; evaluating it is science • Rinse and Repeat o This is a journey, not a single step
  33. 33. eBay Machine-Learned Ranking • Ranking function for search results o Which item should appear 1st, 10th, 100th, 1000th o Before: Small number of hand-tuned factors o Goal: Thousands of factors • Incremental Experimentation o Predictive models: query->view, view->purchase, etc. o Hundreds of parallel A | B tests o Full year of steady, incremental improvements  2% increase in eBay revenue (~$120M / year)
  34. 34. eBay Site Speed • Reduce user-experienced latency for search results • Iterative Process o Implement a potential improvement o Release to the site in an A | B test o Monitor metrics –time to first byte, time to click, click rate, purchase rate  2% increase in eBay revenue (~$120M / year)
  35. 35. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  36. 36. Prioritization • Scarce resources require prioritization o We always have more to do than resources to do it o Opportunity cost -- deciding to do X means deciding not to do Y o Every decision is a tradeoff • Priority ← Return on Investment o Impact / Effort • Prioritization is a business decision, not a technical decision @randyshoup
  37. 37. Fewer Things, More Done
  38. 38. Fewer Things, More Done • Maximize resources applied to o Priority 1, then o Priority 2 o etc. • Incremental Delivery o Deliver increments along the way instead of everything at the end • Deliver Value Faster o Time Value of Money o Benefit now is worth more than benefit in the future @randyshoup
  39. 39. “When you solve problem one, problem two gets a promotion.”
  40. 40. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  41. 41. Microservices • Single-purpose • Simple, well-defined interface • Modular and composable • Independently deployable A C D E B
  42. 42. Evolution to Microservices • eBay • 5th generation today • Monolithic Perl  Monolithic C++  Java  microservices • Twitter • 3rd generation today • Monolithic Rails  JS / Rails / Scala  microservices • Amazon • Nth generation today • Monolithic Perl / C++  Java / Scala  microservices
  43. 43. No one starts with microservices … Past a certain scale, everyone ends up with microservices
  44. 44. If you don’t end up regretting your early technology decisions, you probably over- engineered.
  45. 45. Quality Discipline • Quality and Reliability are “Priority-0 features” o Equally important to users as product features and engaging user experience • Developers responsible for o Features o Quality o Performance o Reliability o Manageability
  46. 46. Test-Driven Development • Tests help you go faster o Tests “have your back” o Development velocity • Tests make better code o Confidence to break things o Courage to refactor mercilessly • Tests make better systems o Catch bugs earlier, fail faster @randyshoup
  47. 47. Optimizing Developer Effort @randyshoup • 75% reading existing code • 20% modifying existing code • 5% writing new code
  48. 48. Optimizing Developer Effort @randyshoup • 75% reading existing code • 20% modifying existing code • 5% writing new code
  49. 49. “Do you have time to do it twice?” “We don’t have time to do it right!”
  50. 50. The more constrained you are on time or resources, the more important it is to build it right the first time.
  51. 51. Build It Right (Enough) The First Time • Build one great thing instead of two half-finished things • Right ≠ Perfect (80 / 20 Rule) •  Basically no bug tracking system (!) o Bugs are fixed as they come up o Backlog contains features we want to build o Backlog contains technical debt we want to repay @randyshoup
  52. 52. Continuous Integration • Small incremental changes o Faster feedback loop o Easy to understand, code-review, test, roll back • Large changes are risky o Risk of code change is nonlinear in the size of the change o Ex. Initial memcache service submission • Main code branch always shippable o Increased rate of change while reducing risk of change
  53. 53. Vicious Cycle of Technical Debt Technical Debt “No time to do it right” Quick- and-dirty
  54. 54. Virtuous Cycle of Investment Solid Foundation Confidence Faster and Better Testing
  55. 55. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  56. 56. You Build It, You Run It. -- Werner Vogels
  57. 57. Continuous Delivery • Repeatable Deployment Pipeline o Low-risk, push-button deployment o Rapid release cadence o Rapid rollback and recovery • Most applications deployed multiple times per day • More solid systems o Release smaller units of work o Smaller changes to roll back or roll forward o Faster to repair, easier to understand, simpler to diagnose @randyshoup
  58. 58. Observability • Strong practice of detailed, end-to-end monitoring of production systems • Ability to detect and alert on issues anywhere in the system • Sufficient monitoring to be able to do remote runtime diagnosis
  59. 59. If you ever have to ssh into a machine, your monitoring has failed.
  60. 60. Feature Flags • Configuration “flag” to enable / disable a feature for a particular set of users o Independently discovered at eBay, Yahoo, Facebook, Google, etc. • More solid systems o Decouple feature delivery from code delivery o Rapid on and off o Develop / test / verify in production o Dark launches • Enables experimentation
  61. 61. Blameless Post-Mortems • Post-mortem After Every Incident o Document exactly what happened o What went right o What went wrong • Open and Honest Discussion o What contributed to the incident? o What could we have done better? Engineers compete to take personal responsibility (!) @randyshoup
  62. 62. “Finally we can prioritize fixing that broken system!”
  63. 63. Blameless Post-Mortems • Action Items o How will we change process, technology, documentation, etc. o How could we have automated the problems away? o How could we have diagnosed more quickly? o How could we have restored service more quickly? • Follow up (!) @randyshoup
  64. 64. Failure is not falling down, but refusing to get back up. -- Theodore Roosevelt
  65. 65. Moving Fast at Scale •Organizing for Speed •What to Build / What NOT to Build •When to Build •How to Build •Delivering and Operating
  66. 66. High-Performing Organizations 2.5x more likely to exceed business goals o Profitability o Market share o Productivity @randyshoup
  67. 67. Faster is Better
  68. 68. Thank You! • @randyshoup •