Best Practices for Large-Scale Web Sites



This is a lightning presentation by Brian Ko summarizing a session he attended at JavaOne 2009 on how to build very large-scale web sites.

Published in: Technology, Business


  1. Best Practices for Large-Scale Web Sites: Lessons from eBay, by Brian Ko
  2. eBay
       • 276,000,000 registered users
       • Over 2 petabytes of data stored
       • Over 1 billion page views per day
       • 113 million items for sale in over 50,000 categories
       • 2 billion photos
  3. eBay
       • 300+ features per quarter
       • Rolls 100,000+ lines of code every two weeks
       • In 39 countries, in 7 languages, 24x7x365
       • 48 billion SQL executions per day
       • Figures as of 2008
  4. Design Goals
       • Scalability
           • Resource usage should increase linearly (or better!) with load
           • Design for 10x growth in data, traffic, users, etc.
       • Availability
           • Resilience to failure
           • Graceful degradation
           • Recoverability from failure
  5. Design Goals (continued)
       • Latency
           • User experience, data latency
       • Manageability
           • Simplicity, maintainability
           • Provide diagnostics
       • Cost
           • Development effort and complexity
           • Operational cost (TCO)
  6. Architecture Considerations
       • Partition everything
           • “You eat an elephant only one bite at a time”
       • Asynchrony everywhere
           • “Good things come to those who wait”
       • Automate everything
           • “Automation will save time and eliminate human errors…”
       • Assume everything fails
           • “Be prepared”
  7. Partition Everything: Split
       • Split every problem into manageable chunks
           • “If you can’t split it, you can’t scale it”
           • Split by data, load, and/or usage pattern
           • For example, there are thousands of databases
  8. Partition Everything: Motivation
       • Scalability: can scale horizontally and independently
       • Availability: can isolate failures
       • Manageability: can decouple different segments and functional areas
       • Cost: can use less expensive hardware
  9. Partition Everything: Databases
       • Functional segmentation
           • Segment databases into functional areas: user, item, transaction, product, account, feedback
           • Over 1000 logical databases on over 400 physical hosts
       • Horizontal split
           • Split (or “shard”) databases horizontally along the primary access path
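To make the horizontal split concrete, here is a minimal sketch of routing a record to a shard by its primary access key. The shard names, the shard count, and the choice of a hash-modulo scheme are assumptions for illustration; the slides do not say how eBay maps keys to shards.

```python
import hashlib
from typing import Sequence

# Hypothetical shard names; not taken from the slides.
USER_SHARDS = ("user_db_0", "user_db_1", "user_db_2", "user_db_3")


def shard_for(key: str, shards: Sequence[str] = USER_SHARDS) -> str:
    """Map a primary access key (for example a user id) to one shard."""
    # Use a stable hash: Python's built-in hash() of a str is salted per
    # process, so it would route the same key differently after a restart.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Every caller that knows the key can compute the shard locally, so no central lookup is needed on the access path. One caveat of plain modulo routing is that changing the number of shards remaps most keys; a lookup table or consistent hashing are common alternatives.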
  10. Partition Everything: Databases (no database transactions)
       • eBay’s transaction policy
           • Absolutely no client-side transactions, two-phase commit, etc.
           • Auto-commit for the vast majority of DB writes
       • Consistency is not always required or possible
           • To guarantee availability and partition tolerance, we are forced to trade off consistency (Brewer’s CAP theorem)
  11. Partition Everything: Databases (consistency without transactions)
       • Careful ordering of DB operations
       • Eventual consistency through an asynchronous event or a reconciliation batch
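The two techniques on this slide can be sketched together. In the example below, two dictionaries stand in for databases on separate hosts; the bid and summary tables, the function names, and the simulated failure flag are all hypothetical. The authoritative record is written first, the derived record second, and a batch job repairs any derived record that was missed.

```python
# Two independent stores stand in for databases on separate hosts (hypothetical).
bid_db = {}           # bid_id -> {"item_id": ..., "amount": ...}
item_summary_db = {}  # item_id -> highest bid amount seen so far


def place_bid(bid_id, item_id, amount, fail_second_write=False):
    # Write 1 (auto-commit): the authoritative record goes first, so a failure
    # after this point loses only derived data, never the bid itself.
    bid_db[bid_id] = {"item_id": item_id, "amount": amount}
    if fail_second_write:
        # Simulates the second database being unreachable.
        raise ConnectionError("summary database unavailable")
    # Write 2 (auto-commit): derived data that can be rebuilt from write 1.
    item_summary_db[item_id] = max(item_summary_db.get(item_id, 0), amount)


def reconcile():
    """Batch job: rebuild summaries from the bids and return the repair count."""
    highest = {}
    for bid in bid_db.values():
        item_id = bid["item_id"]
        highest[item_id] = max(highest.get(item_id, 0), bid["amount"])
    repaired = 0
    for item_id, amount in highest.items():
        if item_summary_db.get(item_id) != amount:
            item_summary_db[item_id] = amount
            repaired += 1
    return repaired
```

Between a failed second write and the next reconciliation run, readers of the summary see stale data. That window is the consistency being traded for availability.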
  12. Partition Everything: Application Tier
       • Over 17,000 application servers in 220 pools
       • Functional segmentation
           • Segment functions into separate application pools
           • Allows for parallel development, deployment, and monitoring
           • Minimizes DB / resource dependencies
       • Horizontal split
           • Within a pool, all application servers are created equal
  13. Partition Everything: Application Tier (continued)
       • User session flow moves through multiple application pools
       • Absolutely no session state
       • Transient state maintained by
           • URL, cookie, scratch database
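One way to keep servers free of session state is to carry the transient state in a signed cookie that any server in any pool can verify. The sketch below uses only the Python standard library. The secret key, the function names, and the cookie layout are assumptions for illustration, not a description of eBay's implementation.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical secret shared by all application servers.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_state(state: dict) -> str:
    """Serialize transient state into a tamper-evident cookie value."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_state(cookie: str) -> dict:
    """Verify the signature and return the state, or raise ValueError."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("cookie signature mismatch")
    return json.loads(base64.urlsafe_b64decode(payload))
```

The cookie is signed but not encrypted, so the client can read its contents. State that must stay private would belong in the scratch database instead.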
  14. Async Everywhere
       • Prefer asynchronous processing
           • Where possible, integrate disparate components asynchronously
       • Motivations
           • Scalability: can scale components independently
           • Availability
               • Can decouple availability state
               • Can retry operations
           • Latency
               • Can significantly improve user-experience latency at the cost of data/execution latency
               • Can allocate more time to processing than a user would tolerate
           • Cost: can spread peak load over time
  15. Async Everywhere: Batch
       • A scheduled offline batch process is appropriate for
           • Infrequent, periodic, or scheduled processing
           • Non-incremental computation (a.k.a. “full table scan”)
       • Examples
           • Import data (catalogs, currency, etc.)
           • Generate recommendations (items, products, searches, etc.)
           • Process items at end of auction
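As a rough illustration of asynchronous integration with retries, the sketch below decouples a producer from a consumer with an in-process queue. A real deployment would more likely use a message broker; the function names, the `None` shutdown sentinel, and the retry limit are assumptions made for this example.

```python
import queue
import threading


def submit(events, payload):
    """Producer side: enqueue the event and return immediately."""
    events.put((payload, 1))  # (payload, attempt number)


def run_worker(events, handler, results, max_attempts=3):
    """Consumer side: process events, re-queueing failures up to max_attempts."""
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel (a convention of this sketch)
            events.task_done()
            break
        payload, attempt = event
        try:
            results.append(handler(payload))
        except Exception:
            if attempt < max_attempts:
                # Re-queue before task_done() so join() keeps waiting for the retry.
                events.put((payload, attempt + 1))
        finally:
            events.task_done()
```

A caller starts `run_worker` in a `threading.Thread`, calls `submit(...)` from the request path, and later uses `events.join()` to wait for outstanding work. The user-facing call returns as soon as the event is queued, which is the latency trade-off the slide describes.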
  16. Automate Everything: Motivation
       • Scalability
           • Can scale with machines, not humans
       • Availability / latency
           • Can adapt to a changing environment more rapidly
       • Cost
           • Machines are far less expensive than humans
           • Can learn / improve / adjust over time without manual effort
  17. Automate Everything: Deployment
       • Challenge
           • Need to deploy the application to over 17,000 application servers at the same time
       • Solution
           • Deploy the application in advance with the new feature's switch turned off
           • Turn the switch on through an automated process on the target date
           • This also makes rollback easier: turn the switch off again
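The feature-switch approach might look like the following sketch: code ships dark, a scheduled time turns it on, and turning it off again is the rollback. The class name, its methods, and the in-memory storage are hypothetical; a real system would read switches from shared configuration so that all servers flip together.

```python
from datetime import datetime, timezone


class FeatureSwitches:
    """In-memory feature switches with a scheduled activation time."""

    def __init__(self):
        # feature name -> UTC activation time, or None when switched off
        self._enable_at = {}

    def schedule(self, name, enable_at):
        """Arrange for the feature to turn on at the given UTC time."""
        self._enable_at[name] = enable_at

    def turn_off(self, name):
        """Roll back: the deployed code stays in place but goes dark again."""
        self._enable_at[name] = None

    def is_on(self, name, now=None):
        """Unknown features are off, so code can be deployed before its switch exists."""
        if now is None:
            now = datetime.now(timezone.utc)
        enable_at = self._enable_at.get(name)
        return enable_at is not None and now >= enable_at
```

Application code then guards the new path with a check such as `if switches.is_on("new_checkout"): ...`, where `new_checkout` is a made-up feature name.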
  18. Assume Everything Fails
       • Build all systems to be tolerant of failure
           • Assume every operation will fail and every resource will be unavailable
           • Rapid failure detection and recovery
           • Do as much as possible during failure
       • Motivation
           • Availability
  19. Assume Everything Fails (continued)
       • Rollback
           • Absolutely no changes to the site that cannot be undone (!)
       • Failure detection
           • Real-time application state monitoring: exceptions and operational alerts
           • “Resource slow” is often far more challenging than “resource down”
  20. Assume Everything Fails: Graceful Degradation
       • Application “marks down” the resource
           • Stops making calls to it and sends an alert
       • Non-critical functionality is removed or ignored
       • Critical functionality is retried or deferred
           • Fail over to an alternate resource
           • Defer processing to an async event
       • Explicit “mark up”
           • Allows the resource to be restored and brought online in a controlled way
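The mark-down and mark-up cycle resembles what is often called a circuit breaker. The sketch below is one possible shape for it, under assumptions the slides do not state: the class name, a threshold of three consecutive failures, and an alert callback are all illustrative choices.

```python
class ResourceGuard:
    """Wraps calls to one resource and marks it down after repeated failures."""

    def __init__(self, name, failure_threshold=3, alert=print):
        self.name = name
        self.failure_threshold = failure_threshold
        self.alert = alert  # callable that receives an alert message
        self.failures = 0
        self.marked_down = False

    def call(self, operation, fallback):
        """Run operation(), or fallback() if the resource is down or fails."""
        if self.marked_down:
            # Stop calling the resource entirely while it is marked down.
            return fallback()
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.marked_down = True
                self.alert(self.name + " marked down")
            return fallback()
        self.failures = 0  # only consecutive failures count
        return result

    def mark_up(self):
        """Explicit, operator-driven restore; nothing comes back automatically."""
        self.marked_down = False
        self.failures = 0
```

For non-critical functionality the fallback can simply return an empty result. For critical functionality it can call an alternate resource or enqueue the work for later, matching the retry-or-defer options on the slide.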
  21. Summary
       • Partition everything
       • Asynchrony everywhere
       • Automate everything
       • Assume everything fails
  22. The End: 5 minutes of question time starts now!