Best Practices for Large-Scale Web Sites

This is a lightning presentation given by Brian Ko summarizing a session he attended at JavaOne 2009 on how to build very large scale websites.

Transcript

  • 1. Best Practices for Large-Scale Web Sites: Lessons from eBay (Brian Ko)
  • 2. eBay
    • 276,000,000 registered users
    • Stores over 2 petabytes of data
    • Over 1 billion page views per day
    • 113 million items for sale in over 50,000 categories
    • 2 billion photos
  • 3. eBay
    • 300+ features per quarter
    • Rolls out 100,000+ lines of code every two weeks
    • In 39 countries, in 7 languages, 24x7x365
    • 48 billion SQL executions per day
    • Figures as of 2008
  • 4. Design Goals
    • Scalability
      • Resource usage should increase linearly (or better!) with load
      • Design for 10x growth in data, traffic, users, etc.
    • Availability
      • Resilience to failure
      • Graceful degradation
      • Recoverability from failure
  • 5. Design Goals
    • Latency
      • User experience, data latency
    • Manageability
      • Simplicity, maintainability
      • Provide diagnostics
    • Cost
      • Development effort and complexity
      • Operational cost (TCO)
  • 6. Architectural Considerations
    • Partition everything
      • “You eat an elephant only one bite at a time”
    • Asynchrony everywhere
      • “Good things come to those who wait”
    • Automate everything
      • “Automation will save time and eliminate human errors…”
    • Assume everything fails
      • “Be Prepared”
  • 7. Partition Everything: Split
    • Split every problem into manageable chunks
      • “If you can’t split it, you can’t scale it”
      • Split by data, load, and/or usage pattern
      • For example, there are thousands of databases
  • 8. Partition Everything: Motivation
    • Scalability: can scale horizontally and independently
    • Availability: can isolate failures
    • Manageability: can decouple different segments and functional areas
    • Cost: can use less expensive hardware
  • 9. Partition Everything: Databases
    • Functional Segmentation
      • Segment databases into functional areas: user, item, transaction, product, account, feedback
      • Over 1000 logical databases on over 400 physical hosts
    • Horizontal Split
      • Split (or “shard”) databases horizontally along the primary access path
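
The two levels of partitioning on this slide can be sketched as a routing function: pick the functional area first, then pick a shard inside it from the access key. This is only an illustration. The host names, the shard counts, and the choice of hash-modulo routing are assumptions, not details given in the talk.

```python
import hashlib

# Hypothetical layout: each functional area owns its own list of shards.
SHARDS = {
    "user": ["user-db-0", "user-db-1", "user-db-2", "user-db-3"],
    "item": ["item-db-0", "item-db-1"],
}


def shard_for(area: str, key: str) -> str:
    """Route a key to a database: functional area first, then a hash of the key."""
    hosts = SHARDS[area]
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]
```

Plain modulo routing moves most keys when the number of shards changes, so a real deployment would likely need a lookup table or a consistent-hashing scheme instead.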
  • 10. Partition Everything: Databases
    • No Database Transactions
    • eBay’s transaction policy
      • Absolutely no client-side transactions, two-phase commit, etc.
      • Auto-commit for the vast majority of DB writes
    • Consistency is not always required or possible
      • To guarantee availability and partition-tolerance, we are forced to trade off consistency (Brewer’s CAP Theorem)
  • 11. Partition Everything: Databases
    • Consistency without transactions
      • Careful ordering of DB operations
      • Eventual consistency through asynchronous events or a reconciliation batch
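
A minimal sketch of both ideas, using two in-memory dictionaries as stand-ins for separate databases. The bid and summary tables are hypothetical examples, not eBay's schema. The source-of-truth write happens first, the derived write second, and a batch job repairs the derived data if a failure lands between them.

```python
# In-memory stand-ins for two separate, auto-committing databases (hypothetical).
bids_db = {}          # bid_id -> (item_id, amount); the source of truth
item_summary_db = {}  # item_id -> highest bid amount; derived data


def place_bid(bid_id, item_id, amount, fail_before_summary=False):
    # Step 1: write the source of truth first, so no bid is ever lost.
    bids_db[bid_id] = (item_id, amount)
    if fail_before_summary:
        raise RuntimeError("simulated crash between the two writes")
    # Step 2: update the derived data second; it may lag but can be rebuilt.
    item_summary_db[item_id] = max(item_summary_db.get(item_id, 0), amount)


def reconcile():
    """Batch job: rebuild the derived data from the source of truth."""
    highest = {}
    for item_id, amount in bids_db.values():
        highest[item_id] = max(highest.get(item_id, 0), amount)
    item_summary_db.clear()
    item_summary_db.update(highest)
```

If the process dies after step 1, the summary is stale until `reconcile()` runs, which is the eventual-consistency trade-off the slide describes.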
  • 12. Partition Everything: Application Tier
    • Over 17,000 application servers in 220 pools
    • Functional Segmentation
      • Segment functions into separate application pools
      • Allows for parallel development, deployment, and monitoring
      • Minimizes DB / resource dependencies
    • Horizontal Split
      • Within a pool, all application servers are created equal
  • 13. Partition Everything: Application Tier
    • User session flow moves through multiple application pools
    • Absolutely no session state on the application servers
    • Transient state is maintained in
      • The URL, a cookie, or a scratch database
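
One way to keep transient state out of the application servers is to hand it to the client in a signed token that any server in any pool can verify. The sketch below is an assumption about how that could look, not eBay's implementation. The key and the state fields are hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-me"  # hypothetical; shared by every server in every pool


def encode_state(state: dict) -> str:
    """Serialize state and append an HMAC so the client cannot alter it unnoticed."""
    payload = base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode("utf-8")
    )
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_state(token: str) -> dict:
    """Verify the signature, then restore the state on whichever server gets the request."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("state was tampered with")
    return json.loads(base64.urlsafe_b64decode(payload))
```

The token can travel in a cookie or a URL parameter. The signature protects integrity only, so anything secret would still belong in the scratch database.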
  • 14. Async Everywhere
    • Prefer Asynchronous Processing
      • Where possible, integrate disparate components asynchronously
    • Motivations
      • Scalability: can scale components independently
      • Availability
        • Can decouple availability state
        • Can retry operations
      • Latency
        • Can significantly improve user-experience latency at the cost of data/execution latency
        • Can allocate more time to processing than the user would tolerate
      • Cost: can spread peak load over time
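
The latency trade-off above can be illustrated with a queue between the request path and the slow work. This is a single-process sketch using the standard library. The event shape and function names are hypothetical, and a production system would use a durable message queue rather than an in-memory one.

```python
import queue
import threading

events = queue.Queue()
processed = []


def handle_request(item_id):
    """Fast path: record the event and return to the user immediately."""
    events.put({"type": "item_listed", "item_id": item_id})
    return "accepted"


def worker():
    """Slow path: consume events at its own pace, independent of the request."""
    while True:
        event = events.get()
        if event is None:  # sentinel: stop the worker
            events.task_done()
            break
        processed.append(event["item_id"])  # stand-in for the slow work
        events.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
```

The caller sees "accepted" as soon as the event is queued, and the consumer can be scaled, paused, or retried without the request path noticing.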
  • 15. Async Everywhere: Batch
    • Scheduled offline batch processes are appropriate for
      • Infrequent, periodic, or scheduled processing
      • Non-incremental computation (a.k.a. “Full Table Scan”)
    • Examples
      • Import data (catalogs, currency, etc.)
      • Generate recommendations (items, products, searches, etc.)
      • Process items at end of auction
  • 16. Automate Everything: Motivation
    • Scalability
      • Can scale with machines, not humans
    • Availability / Latency
      • Can adapt to changing environment more rapidly
    • Cost
      • Machines are far less expensive than humans
      • Can learn / improve / adjust over time without manual effort
  • 17. Automate Everything: Deployment
    • Challenge
      • Need to deploy the application to over 17,000 application servers at the same time
    • Solution
      • Deploy the application in advance with the new feature's switch turned off
      • Turn the switch on through an automated process on the target date
      • This also makes rollback easier
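
A feature switch of this kind can be sketched as a lookup that code consults at run time. The switch table, the feature name, and the activation date below are hypothetical. In practice the table would live in shared configuration so that every server flips at the same moment.

```python
from datetime import datetime, timezone

# Hypothetical switch table: feature name -> activation time (None means off).
SWITCHES = {"new_checkout": datetime(2009, 7, 1, tzinfo=timezone.utc)}


def is_enabled(feature, now=None):
    """The code ships early; the feature stays dark until its activation time."""
    now = now or datetime.now(timezone.utc)
    activate_at = SWITCHES.get(feature)
    return activate_at is not None and now >= activate_at


def roll_back(feature):
    """Rolling back is a switch change, not a redeployment."""
    SWITCHES[feature] = None
```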
  • 18. Assume Everything Fails
    • Build all systems to be tolerant of failure
      • Assume every operation will fail and every resource will be unavailable
      • Rapid failure detection and recovery
      • Do as much as possible during failure
    • Motivation
      • Availability
  • 19. Assume Everything Fails
    • Rollback
      • Absolutely no changes to the site which cannot be undone (!)
    • Failure Detection
      • Real-time application state monitoring: exceptions and operational alerts
      • “Resource slow” is often far more challenging than “resource down”
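
A slow resource never raises an error on its own, so the caller has to impose a deadline to notice it. The sketch below shows one possible way to tell "slow" apart from "down" with the standard library. The function names, the pool size, and the timings are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_s, *args):
    """Classify a call as ok, slow (deadline exceeded), or down (raised an error)."""
    future = _pool.submit(fn, *args)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except TimeoutError:
        return ("slow", None)
    except Exception as exc:
        return ("down", exc)


def slow_resource():
    """Hypothetical dependency that responds, but far too late."""
    time.sleep(0.3)
    return "late answer"
```

Note that the timed-out call keeps occupying a worker thread until it finishes, which hints at why slow resources are harder to handle than dead ones.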
  • 20. Assume Everything Fails: Graceful Degradation
    • Application “marks down” the resource
      • Stops making calls to it and sends alert
    • Non-critical functionality is removed or ignored
    • Critical functionality is retried or deferred
      • Failover to alternate resource
      • Defer processing to async event
    • Explicit “markup”
      • Allows resource to be restored and brought online in a controlled way
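
The mark-down and mark-up cycle resembles what is often called a circuit breaker. Here is a minimal sketch of it. The class name, the failure threshold, and the alert list are hypothetical, and unlike many circuit breakers this one never reopens on its own, matching the slide's explicit mark-up step.

```python
class Resource:
    """Wraps calls to a dependency and tracks whether it is marked down."""

    def __init__(self, name, call, max_failures=3):
        self.name = name
        self.call = call
        self.max_failures = max_failures
        self.failures = 0
        self.marked_down = False
        self.alerts = []

    def invoke(self, *args, fallback=None):
        if self.marked_down:
            return fallback  # stop calling the resource entirely
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.marked_down = True
                self.alerts.append(f"{self.name} marked down")
            return fallback
        self.failures = 0  # a success resets the consecutive-failure count
        return result

    def mark_up(self):
        """Explicit, operator-controlled restoration of the resource."""
        self.failures = 0
        self.marked_down = False
```

For non-critical features the fallback can simply be an empty result. For critical ones the caller would instead fail over to an alternate resource or queue the work as an async event.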
  • 21. Summary
    • Partition everything
    • Asynchrony everywhere
    • Automate everything
    • Assume everything fails
  • 22. The End: 5 minutes of question time starts now!
  • 23. Questions: 4 minutes left!
  • 24. Questions: 3 minutes left!
  • 25. Questions: 2 minutes left!
  • 26. Questions: 1 minute left!
  • 27. Questions: 30 seconds left!
  • 28. Questions: TIME IS UP!