Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Service Architectures at Scale: Lessons from Google and eBay


Published on

Video and slides synchronized, mp3 and slide download available at URL

Randy Shoup discusses modern service architectures at scale, using specific examples from both Google and eBay. He covers some interesting lessons learned in building and operating these sites. He concludes with a number of experience-based recommendations for other smaller organizations evolving to -- and sustaining -- an effective service ecosystem. Filmed at

Randy Shoup is Consulting CTO. Randy has worked as a senior technology leader and executive in Silicon Valley at companies ranging from small startups, to mid-sized places, to eBay and Google.

Published in: Technology

Service Architectures at Scale: Lessons from Google and eBay

  1. 1. Service  Architectures  at  Scale       Lessons  from  Google  and  eBay Randy Shoup @randyshoup
  2. 2. News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on! /service-arch-scale-google-ebay
  3. 3. Presented at QCon London Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Architecture   Evolution • eBay •  5th generation today •  Monolithic Perl à Monolithic C++ à Java à microservices • Twitter •  3rd generation today •  Monolithic Rails à JS / Rails / Scala à microservices • Amazon •  Nth generation today •  Monolithic C++ à Java / Scala à microservices
  5. 5. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  6. 6. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  7. 7. Ecosystem     of  Services •  Hundreds to thousands of independent services •  Many layers of dependencies, no strict tiers •  Graph of relationships, not a hierarchy C B A E F G D
  8. 8. Evolution,     not  Intelligent  Design •  No centralized, top-down design of the system •  Variation and Natural selection o  Create / extract new services when needed to solve a problem o  Deprecate services when no longer used o  Services justify their existence through usage •  Appearance of clean layering is an emergent property
  9. 9. Google     Service  Layering •  Cloud Datastore: NoSQL service o  Highly scalable and resilient o  Strong transactional consistency o  SQL-like rich query capabilities •  Megastore: geo-scale structured database o  Multi-row transactions o  Synchronous cross-datacenter replication •  Bigtable: cluster-level structured storage o  (row, column, timestamp) -> cell contents •  Colossus: next-generation clustered file system o  Block distribution and replication •  Borg: cluster management infrastructure o  Task scheduling, machine assignment Cloud   Datastore Megastore Bigtable Colossus Borg
  10. 10. Architecture  without  an   Architect? •  No “Architect” title / role •  (+) No central approval for technology decisions o  Most technology decisions made locally instead of globally o  Better decisions in the field •  (-) eBay Architecture Review Board o  Central approval body for large-scale projects o  Usually far too late in the process to be valuable o  Experienced engineers saying “no” after the fact vs. encoding knowledge in a reusable library, tool, or service
  11. 11. Standardization •  Standardized communication o  Network protocols o  Data formats o  Interface schema / specification •  Standardized infrastructure o  Source control o  Configuration management o  Cluster management o  Monitoring, alerting, diagnosing, etc.
  12. 12. Standards become standards by being better than the alternatives!
  13. 13. “Enforcing”   Standardization •  Encouraged via o  Libraries o  Support in underlying services o  Code reviews o  Searchable code
  14. 14. The easiest way to encourage best practices is with *code*!
  15. 15. Make it really easy to do the right thing, and harder to do the wrong thing!
  16. 16. Service   Independence •  No standardization of service internals o  Programming languages o  Frameworks o  Persistence mechanisms
  17. 17. In a mature ecosystem of services, we standardize the arcs of the graph, not the nodes!
  18. 18. Creating     New  Services •  Spinning out a new service o  Almost always built for particular use-case first o  If successful and appropriate, form a team and generalize for multiple use-cases •  Pragmatism wins •  Examples o  Google File System o  Bigtable o  Megastore o  Google App Engine o  Gmail
  19. 19. Deprecating     Old  Services •  What if a service is a failure? o  Repurpose technologies for other uses o  Redeploy people to other teams •  Examples o  Google Wave -> Google Apps o  Multiple generations of core services
  20. 20. “Every service at Google is either deprecated or not ready yet.” -- Google engineering proverb
  21. 21. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  22. 22. Characteristics  of  an   Effective  Service •  Single-purpose •  Simple, well-defined interface •  Modular and independent •  Isolated persistence (!) A C D E B
  23. 23. Goals  of  a     Service  Owner •  Meet the needs of my clients … •  Functionality •  Quality •  Performance •  Stability and reliability •  Constant improvement over time •  … at minimum cost and effort •  Leverage common tools and infrastructure •  Leverage other services •  Automate building, deploying, and operating my service •  Optimize for efficient use of resources
  24. 24. Responsibilities  of  a   Service  Owner •  End-to-end Ownership o  Team owns service from design to deployment to retirement o  No separate maintenance or sustaining engineering team o  DevOps philosophy of “You build it, you run it” •  Autonomy and Accountability o  Freedom to choose technology, methodology, working environment o  Responsibility for the results of those choices
  25. 25. Service  as     Bounded  Context •  Primary focus on my service o  Clients which depend on my service o  Services which my service depends on o  Cognitive load is very bounded •  Very little worry about o  The complete ecosystem o  The underlying infrastructure •  è Small, nimble service teams Service Client   A Client   B Client   C
  26. 26. Service-­‐‑Service     Relationships •  Vendor – Customer Relationship o  Friendly and cooperative, but structured o  Clear ownership and division of responsibility o  Customer can choose to use service or not (!) •  Service-Level Agreement (SLA) o  Promise of service levels by the provider o  Customer needs to be able to rely on the service, like a utility
  27. 27. Service-­‐‑Service     Relationships •  Charging and Cost Allocation o  Charge customers for *usage* of the service o  Aligns economic incentives of customer and provider o  Motivates both sides to optimize for efficiency o  (+) Pre- / post-allocation at Google
  28. 28. Maintaining   Service  Quality •  Small incremental changes o  Easy to reason about and understand o  Risk of code change is nonlinear in the size of the change o  (-) Initial memcache service submission •  Solid Development Practices o  Code reviews before submission o  Automated tests for everything •  Google build and test system o  Uses production cluster manager o  Runs millions of tests per day in parallel o  All acceptance tests run before code is accepted into source control
  29. 29. Maintaining     Interface  Stability •  Backward / forward compatibility of interfaces o  Can *never* break your clients’ code o  Often multiple interface versions o  Sometimes multiple deployments o  Majority of changes don’t impact the interface in any way •  Explicit deprecation policy o  Strong incentive to wean customers off old versions (!)
  30. 30. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  31. 31. Predictable   Performance •  Services at scale highly exposed to performance variability •  Imagine an operation … o  1ms median latency, but 1 second latency at 99.99%ile (1 in 10,000) o  Service using one machine à 0.01% slow o  Service using 5,000 machines à 50% slow •  Predictability trumps average performance o  Low latency + inconsistent performance != low latency o  Far easier to program to consistent performance o  Tail latencies are *much* more important than average latencies
  32. 32. Google  App  Engine     Memcache  Service •  Periodic “hiccups” in latency at 99.99%ile and beyond •  Very difficult to detect and diagnose •  è Slab memory allocation
  33. 33. Service   Reliability •  Systems at scale highly exposed to failure o  Software, hardware, service failures o  Sharks and backhoes o  Operator “oops” •  Resilience in depth o  Redundancy for machine / cluster / data center failures o  Load-balancing and flow control for service invocations o  Rapid rollback for “oops”
  34. 34. Service  Reliability:   Deployment •  Incremental Deployment o  Canary systems o  Staged rollouts o  Rapid rollback •  eBay “Feature Flags” o  Decouple code deployment from feature deployment o  Rapidly turn on / off features without redeploying code o  Typically deploy with feature turned off, then turn on as a separate step
  35. 35. Service  Reliability:   Monitoring •  Instrumentation o  Common monitoring service o  Machine / instance statistics: CPU, memory, I/O o  Request statistics: request rate, error rate, latency distribution o  Application / service statistics o  Downstream service invocations •  Diagnosability o  In-process web server with current statistics o  Distributed tracing of requests through multiple service invocations
  36. 36. You can have too much alerting, but you can never have too much monitoring!
  37. 37. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  38. 38. Service   Anti-­‐‑PaQerns •  The “Mega-Service” o  Overbroad area of responsibility is difficult to reason about, change o  Leads to more upstream / downstream dependencies •  Shared persistence o  Breaks encapsulation, encourages “backdoor” interface violations o  Unhealthy and near-invisible coupling of services o  (-) Initial eBay SOA efforts
  39. 39. Thank  You! •  @randyshoup • •  Slides will be at
  40. 40. Watch the video with slide synchronization on! arch-scale-google-ebay