Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC


Published on

Published in: Technology, Sports
  • Gov't Seized Cars-95% Off US Largest & Most Trusted Source to Gov't Seized Car Auctions, See Why! ✤✤✤
    Are you sure you want to  Yes  No
    Your message goes here
  • How long does it take for VigRX Plus to start working? ♣♣♣
    Are you sure you want to  Yes  No
    Your message goes here
  • Nice !! Download 100 % Free Ebooks, PPts, Study Notes, Novels, etc @
    Are you sure you want to  Yes  No
    Your message goes here
  • Njce! Thanks for sharing.
    Are you sure you want to  Yes  No
    Your message goes here
  • My dear, How are you today? i will like to be your friend My name is Sheikha Ghunaim , am a 43 years old divorcee. Please write to me in my email ( ). im honest and open mind single woman. im going to tell more when i see your response. Regards Sheikha.
    Are you sure you want to  Yes  No
    Your message goes here

Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC

  1. 1. Scalable Web Architectures Common Patterns & Approaches Cal Henderson
  2. 2. <ul><li>Hello </li></ul>
  3. 3. Flickr <ul><li>Large scale kitten-sharing </li></ul><ul><li>Almost 5 years old </li></ul><ul><ul><li>4.5 since public launch </li></ul></ul><ul><li>Open source stack </li></ul><ul><ul><li>An open attitude towards architecture </li></ul></ul>
  4. 4. Flickr <ul><li>Large scale </li></ul><ul><li>In a single second: </li></ul><ul><ul><li>Serve 40,000 photos </li></ul></ul><ul><ul><li>Handle 100,000 cache operations </li></ul></ul><ul><ul><li>Process 130,000 database queries </li></ul></ul>
  5. 5. Scalable Web Architectures? <ul><li>What does scalable mean? </li></ul><ul><li>What’s an architecture? </li></ul><ul><li>An answer in 12 parts </li></ul>
  6. 6. <ul><li>1. </li></ul><ul><li>Scaling </li></ul>
  7. 7. Scalability – myths and lies <ul><li>What is scalability? </li></ul>
  8. 8. Scalability – myths and lies <ul><li>What is scalability not ? </li></ul>
  9. 9. Scalability – myths and lies <ul><li>What is scalability not ? </li></ul><ul><ul><li>Raw Speed / Performance </li></ul></ul><ul><ul><li>HA / BCP </li></ul></ul><ul><ul><li>Technology X </li></ul></ul><ul><ul><li>Protocol Y </li></ul></ul>
  10. 10. Scalability – myths and lies <ul><li>So what is scalability? </li></ul>
  11. 11. Scalability – myths and lies <ul><li>So what is scalability? </li></ul><ul><ul><li>Traffic growth </li></ul></ul><ul><ul><li>Dataset growth </li></ul></ul><ul><ul><li>Maintainability </li></ul></ul>
  12. 12. Today <ul><li>Two goals of application architecture: </li></ul><ul><li>Scale </li></ul><ul><li>HA </li></ul>
  13. 13. Today <ul><li>Three goals of application architecture: </li></ul><ul><li>Scale </li></ul><ul><li>HA </li></ul><ul><li>Performance </li></ul>
  14. 14. Scalability <ul><li>Two kinds: </li></ul><ul><ul><li>Vertical (get bigger) </li></ul></ul><ul><ul><li>Horizontal (get more) </li></ul></ul>
  15. 15. Big Irons Sunfire E20k $450,000 - $2,500,000 36x 1.8GHz processors PowerEdge SC1435 Dualcore 1.8 GHz processor Around $1,500
  16. 16. Cost vs Cost
  17. 17. That’s OK <ul><li>Sometimes vertical scaling is right </li></ul><ul><li>Buying a bigger box is quick (ish) </li></ul><ul><li>Redesigning software is not </li></ul><ul><li>Running out of MySQL performance? </li></ul><ul><ul><li>Spend months on data federation </li></ul></ul><ul><ul><li>Or, Just buy a ton more RAM </li></ul></ul>
  18. 18. The H & the V <ul><li>But we’ll mostly talk horizontal </li></ul><ul><ul><li>Else this is going to be boring </li></ul></ul>
  19. 19. <ul><li>2. </li></ul><ul><li>Architecture </li></ul>
  20. 20. Architectures then? <ul><li>The way the bits fit together </li></ul><ul><li>What grows where </li></ul><ul><li>The trade-offs between good/fast/cheap </li></ul>
  21. 21. LAMP <ul><li>We’re mostly talking about LAMP </li></ul><ul><ul><li>Linux </li></ul></ul><ul><ul><li>Apache (or LightHTTPd) </li></ul></ul><ul><ul><li>MySQL (or Postgres) </li></ul></ul><ul><ul><li>PHP (or Perl, Python, Ruby) </li></ul></ul><ul><li>All open source </li></ul><ul><li>All well supported </li></ul><ul><li>All used in large operations </li></ul><ul><li>(Same rules apply elsewhere) </li></ul>
  22. 22. Simple web apps <ul><li>A Web Application </li></ul><ul><ul><ul><ul><li>Or “Web Site” in Web 1.0 terminology </li></ul></ul></ul></ul>Interwebnet App server Database
  23. 23. Simple web apps <ul><li>A Web Application </li></ul><ul><ul><ul><ul><li>Or “Web Site” in Web 1.0 terminology </li></ul></ul></ul></ul>Interwobnet App server Database Cache Storage array AJAX!!!1
  24. 24. App servers <ul><li>App servers scale in two ways: </li></ul>
  25. 25. App servers <ul><li>App servers scale in two ways: </li></ul><ul><ul><li>Really well </li></ul></ul>
  26. 26. App servers <ul><li>App servers scale in two ways: </li></ul><ul><ul><li>Really well </li></ul></ul><ul><ul><li>Quite badly </li></ul></ul>
  27. 27. App servers <ul><li>Sessions! </li></ul><ul><ul><li>(State) </li></ul></ul><ul><ul><li>Local sessions == bad </li></ul></ul><ul><ul><ul><li>When they move == quite bad </li></ul></ul></ul><ul><ul><li>Centralized sessions == good </li></ul></ul><ul><ul><li>No sessions at all == awesome! </li></ul></ul>
  28. 28. Local sessions <ul><li>Stored on disk </li></ul><ul><ul><li>PHP sessions </li></ul></ul><ul><li>Stored in memory </li></ul><ul><ul><li>Shared memory block (APC) </li></ul></ul><ul><li>Bad! </li></ul><ul><ul><li>Can’t move users </li></ul></ul><ul><ul><li>Can’t avoid hotspots </li></ul></ul><ul><ul><li>Not fault tolerant </li></ul></ul>
  29. 29. Mobile local sessions <ul><li>Custom built </li></ul><ul><ul><li>Store last session location in cookie </li></ul></ul><ul><ul><li>If we hit a different server, pull our session information across </li></ul></ul><ul><li>If your load balancer has sticky sessions, you can still get hotspots </li></ul><ul><ul><li>Depends on volume – fewer heavier users hurt more </li></ul></ul>
  30. 30. Remote centralized sessions <ul><li>Store in a central database </li></ul><ul><ul><li>Or an in-memory cache </li></ul></ul><ul><li>No porting around of session data </li></ul><ul><li>No need for sticky sessions </li></ul><ul><li>No hot spots </li></ul><ul><li>Need to be able to scale the data store </li></ul><ul><ul><li>But we’ve pushed the issue down the stack </li></ul></ul>
  31. 31. No sessions <ul><li>Stash it all in a cookie! </li></ul><ul><li>Sign it for safety </li></ul><ul><ul><li>$data = $user_id . ‘-’ . $user_name; </li></ul></ul><ul><ul><li>$time = time(); </li></ul></ul><ul><ul><li>$sig = sha1($secret . $time . $data); </li></ul></ul><ul><ul><li>$cookie = base64(“$sig-$time-$data”); </li></ul></ul><ul><ul><li>Timestamp means it’s simple to expire it </li></ul></ul>
  32. 32. Super slim sessions <ul><li>If you need more than the cookie (login status, user id, username), then pull their account row from the DB </li></ul><ul><ul><li>Or from the account cache </li></ul></ul><ul><li>None of the drawbacks of sessions </li></ul><ul><li>Avoids the overhead of a query per page </li></ul><ul><ul><li>Great for high-volume pages which need little personalization </li></ul></ul><ul><ul><li>Turns out you can stick quite a lot in a cookie too </li></ul></ul><ul><ul><li>Pack with base64 and it’s easy to delimit fields </li></ul></ul>
  33. 33. App servers <ul><li>The Rasmus way </li></ul><ul><ul><li>App server has ‘shared nothing’ </li></ul></ul><ul><ul><li>Responsibility pushed down the stack </li></ul></ul><ul><ul><ul><li>Ooh, the stack </li></ul></ul></ul>
  34. 34. Trifle
  35. 35. Trifle Sponge / Database Jelly / Business Logic Custard / Page Logic Cream / Markup Fruit / Presentation
  36. 36. Trifle Sponge / Database Jelly / Business Logic Custard / Page Logic Cream / Markup Fruit / Presentation
  37. 37. App servers
  38. 38. App servers
  39. 39. App servers
  40. 40. Well, that was easy <ul><li>Scaling the web app server part is easy </li></ul><ul><li>The rest is the trickier part </li></ul><ul><ul><li>Database </li></ul></ul><ul><ul><li>Serving static content </li></ul></ul><ul><ul><li>Storing static content </li></ul></ul>
  41. 41. The others <ul><li>Other services scale similarly to web apps </li></ul><ul><ul><li>That is, horizontally </li></ul></ul><ul><li>The canonical examples: </li></ul><ul><ul><li>Image conversion </li></ul></ul><ul><ul><li>Audio transcoding </li></ul></ul><ul><ul><li>Video transcoding </li></ul></ul><ul><ul><li>Web crawling </li></ul></ul><ul><ul><li>Compute! </li></ul></ul>
  42. 42. Amazon <ul><li>Let’s talk about Amazon </li></ul><ul><ul><li>S3 - Storage </li></ul></ul><ul><ul><li>EC2 – Compute! (XEN based) </li></ul></ul><ul><ul><li>SQS – Queueing </li></ul></ul><ul><li>All horizontal </li></ul><ul><li>Cheap when small </li></ul><ul><ul><li>Not cheap at scale </li></ul></ul>
  43. 43. <ul><li>3. </li></ul><ul><li>Load Balancing </li></ul>
  44. 44. Load balancing <ul><li>If we have multiple nodes in a class, we need to balance between them </li></ul><ul><li>Hardware or software </li></ul><ul><li>Layer 4 or 7 </li></ul>
  45. 45. Hardware LB <ul><li>A hardware appliance </li></ul><ul><ul><li>Often a pair with heartbeats for HA </li></ul></ul><ul><li>Expensive! </li></ul><ul><ul><li>But offers high performance </li></ul></ul><ul><ul><li>Easy to do > 1Gbps </li></ul></ul><ul><li>Many brands </li></ul><ul><ul><li>Alteon, Cisco, Netscalar, Foundry, etc </li></ul></ul><ul><ul><li>L7 - web switches, content switches, etc </li></ul></ul>
  46. 46. Software LB <ul><li>Just some software </li></ul><ul><ul><li>Still needs hardware to run on </li></ul></ul><ul><ul><li>But can run on existing servers </li></ul></ul><ul><li>Harder to have HA </li></ul><ul><ul><li>Often people stick hardware LB’s in front </li></ul></ul><ul><ul><li>But Wackamole helps here </li></ul></ul>
  47. 47. Software LB <ul><li>Lots of options </li></ul><ul><ul><li>Pound </li></ul></ul><ul><ul><li>Perlbal </li></ul></ul><ul><ul><li>Apache with mod_proxy </li></ul></ul><ul><li>Wackamole with mod_backhand </li></ul><ul><ul><li> </li></ul></ul><ul><ul><li>http :// </li></ul></ul>
  48. 48. Wackamole
  49. 49. Wackamole
  50. 50. The layers <ul><li>Layer 4 </li></ul><ul><ul><li>A ‘dumb’ balance </li></ul></ul><ul><li>Layer 7 </li></ul><ul><ul><li>A ‘smart’ balance </li></ul></ul><ul><li>OSI stack, routers, etc </li></ul>
  51. 51. <ul><li>4. </li></ul><ul><li>Queuing </li></ul>
  52. 52. Parallelizable == easy! <ul><li>If we can transcode/crawl in parallel, it’s easy </li></ul><ul><ul><li>But think about queuing </li></ul></ul><ul><ul><li>And asynchronous systems </li></ul></ul><ul><ul><li>The web ain’t built for slow things </li></ul></ul><ul><ul><li>But still, a simple problem </li></ul></ul>
  53. 53. Synchronous systems
  54. 54. Asynchronous systems
  55. 55. Helps with peak periods
  56. 56. Synchronous systems
  57. 57. Asynchronous systems
  58. 58. Asynchronous systems
  59. 59. <ul><li>5. </li></ul><ul><li>Relational Data </li></ul>
  60. 60. Databases <ul><li>Unless we’re doing a lot of file serving, the database is the toughest part to scale </li></ul><ul><li>If we can, best to avoid the issue altogether and just buy bigger hardware </li></ul><ul><li>Dual-Quad Opteron/Intel64 systems with 16+GB of RAM can get you a long way </li></ul>
  61. 61. More read power <ul><li>Web apps typically have a read/write ratio of somewhere between 80/20 and 90/10 </li></ul><ul><li>If we can scale read capacity, we can solve a lot of situations </li></ul><ul><li>MySQL replication! </li></ul>
  62. 62. Master-Slave Replication
  63. 63. Master-Slave Replication Reads and Writes Reads
  64. 64. Master-Slave Replication
  65. 65. Master-Slave Replication
  66. 66. Master-Slave Replication
  67. 67. Master-Slave Replication
  68. 68. Master-Slave Replication
  69. 69. Master-Slave Replication
  70. 70. Master-Slave Replication
  71. 71. Master-Slave Replication
  72. 72. <ul><li>6. </li></ul><ul><li>Caching </li></ul>
  73. 73. Caching <ul><li>Caching avoids needing to scale! </li></ul><ul><ul><li>Or makes it cheaper </li></ul></ul><ul><li>Simple stuff </li></ul><ul><ul><li>mod_perl / shared memory </li></ul></ul><ul><ul><ul><li>Invalidation is hard </li></ul></ul></ul><ul><ul><li>MySQL query cache </li></ul></ul><ul><ul><ul><li>Bad performance (in most cases) </li></ul></ul></ul>
  74. 74. Caching <ul><li>Getting more complicated… </li></ul><ul><ul><li>Write-through cache </li></ul></ul><ul><ul><li>Write-back cache </li></ul></ul><ul><ul><li>Sideline cache </li></ul></ul>
  75. 75. Write-through cache
  76. 76. Write-back cache
  77. 77. Sideline cache
  78. 78. Sideline cache <ul><li>Easy to implement </li></ul><ul><ul><li>Just add app logic </li></ul></ul><ul><li>Need to manually invalidate cache </li></ul><ul><ul><li>Well designed code makes it easy </li></ul></ul><ul><li>Memcached </li></ul><ul><ul><li>From Danga (LiveJournal) </li></ul></ul><ul><ul><li> </li></ul></ul>
  79. 79. Memcache schemes <ul><li>Layer 4 </li></ul><ul><ul><li>Good: Cache can be local on a machine </li></ul></ul><ul><ul><li>Bad: Invalidation gets more expensive with node count </li></ul></ul><ul><ul><li>Bad: Cache space wasted by duplicate objects </li></ul></ul>
  80. 80. Memcache schemes <ul><li>Layer 7 </li></ul><ul><ul><li>Good: No wasted space </li></ul></ul><ul><ul><li>Good: linearly scaling invalidation </li></ul></ul><ul><ul><li>Bad: Multiple, remote connections </li></ul></ul><ul><ul><ul><li>Can be avoided with a proxy layer </li></ul></ul></ul><ul><ul><ul><ul><li>Gets more complicated </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Last indentation level! </li></ul></ul></ul></ul></ul>
  81. 81. <ul><li>7. </li></ul><ul><li>HA Data </li></ul>
  82. 82. But what about HA?
  83. 83. But what about HA?
  84. 84. SPOF! <ul><li>The key to HA is avoiding SPOFs </li></ul><ul><ul><li>Identify </li></ul></ul><ul><ul><li>Eliminate </li></ul></ul><ul><li>Some stuff is hard to solve </li></ul><ul><ul><li>Fix it further up the tree </li></ul></ul><ul><ul><ul><li>Dual DCs solves Router/Switch SPOF </li></ul></ul></ul>
  85. 85. Master-Master
  86. 86. Master-Master <ul><li>Either hot/warm or hot/hot </li></ul><ul><li>Writes can go to either </li></ul><ul><ul><li>But avoid collisions </li></ul></ul><ul><ul><li>No auto-inc columns for hot/hot </li></ul></ul><ul><ul><ul><li>Bad for hot/warm too </li></ul></ul></ul><ul><ul><ul><li>Unless you have MySQL 5 </li></ul></ul></ul><ul><ul><ul><ul><li>But you can’t rely on the ordering! </li></ul></ul></ul></ul><ul><ul><li>Design schema/access to avoid collisions </li></ul></ul><ul><ul><ul><li>Hashing users to servers </li></ul></ul></ul>
  87. 87. Rings <ul><li>Master-master is just a small ring </li></ul><ul><ul><li>With 2 nodes </li></ul></ul><ul><li>Bigger rings are possible </li></ul><ul><ul><li>But not a mesh! </li></ul></ul><ul><ul><li>Each slave may only have a single master </li></ul></ul><ul><ul><li>Unless you build some kind of manual replication </li></ul></ul>
  88. 88. Rings
  89. 89. Rings
  90. 90. Dual trees <ul><li>Master-master is good for HA </li></ul><ul><ul><li>But we can’t scale out the reads (or writes!) </li></ul></ul><ul><li>We often need to combine the read scaling with HA </li></ul><ul><li>We can simply combine the two models </li></ul>
  91. 91. Dual trees
  92. 92. Cost models <ul><li>There’s a problem here </li></ul><ul><ul><li>We need to always have 200% capacity to avoid a SPOF </li></ul></ul><ul><ul><ul><li>400% for dual sites! </li></ul></ul></ul><ul><ul><li>This costs too much </li></ul></ul><ul><li>Solution is straight forward </li></ul><ul><ul><li>Make sure clusters are bigger than 2 </li></ul></ul>
  93. 93. N+M <ul><li>N+M </li></ul><ul><ul><li>N = nodes needed to run the system </li></ul></ul><ul><ul><li>M = nodes we can afford to lose </li></ul></ul><ul><li>Having M as big as N starts to suck </li></ul><ul><ul><li>If we could make each node smaller, we can increase N while M stays constant </li></ul></ul><ul><ul><li>(We assume smaller nodes are cheaper) </li></ul></ul>
  94. 94. 1+1 = 200% hardware
  95. 95. 3+1 = 133% hardware
  96. 96. Meshed masters <ul><li>Not possible with regular MySQL out-of-the-box today </li></ul><ul><li>But there is hope! </li></ul><ul><ul><li>NBD (MySQL Cluster) allows a mesh </li></ul></ul><ul><ul><li>Support for replication out to slaves in a coming version </li></ul></ul><ul><ul><ul><li>RSN! </li></ul></ul></ul>
  97. 97. <ul><li>8. </li></ul><ul><li>Federation </li></ul>
  98. 98. Data federation <ul><li>At some point, you need more writes </li></ul><ul><ul><li>This is tough </li></ul></ul><ul><ul><li>Each cluster of servers has limited write capacity </li></ul></ul><ul><li>Just add more clusters! </li></ul>
  99. 99. Simple things first <ul><li>Vertical partitioning </li></ul><ul><ul><li>Divide tables into sets that never get joined </li></ul></ul><ul><ul><li>Split these sets onto different server clusters </li></ul></ul><ul><ul><li>Voila! </li></ul></ul><ul><li>Logical limits </li></ul><ul><ul><li>When you run out of non-joining groups </li></ul></ul><ul><ul><li>When a single table grows too large </li></ul></ul>
  100. 100. Data federation <ul><li>Split up large tables, organized by some primary object </li></ul><ul><ul><li>Usually users </li></ul></ul><ul><li>Put all of a user’s data on one ‘cluster’ </li></ul><ul><ul><li>Or shard, or cell </li></ul></ul><ul><li>Have one central cluster for lookups </li></ul>
  101. 101. Data federation
  102. 102. Data federation <ul><li>Need more capacity? </li></ul><ul><ul><li>Just add shards! </li></ul></ul><ul><ul><li>Don’t assign to shards based on user_id! </li></ul></ul><ul><li>For resource leveling as time goes on, we want to be able to move objects between shards </li></ul><ul><ul><li>Maybe – not everyone does this </li></ul></ul><ul><ul><li>‘ Lockable’ objects </li></ul></ul>
  103. 103. The approach <ul><li>Hash users into one of n buckets </li></ul><ul><ul><li>Where n is a power of 2 </li></ul></ul><ul><li>Put all the buckets on one server </li></ul><ul><li>When you run out of capacity, split the buckets across two servers </li></ul><ul><li>Then you run out of capacity, split the buckets across four servers </li></ul><ul><li>Etc </li></ul>
  104. 104. Data federation <ul><li>Heterogeneous hardware is fine </li></ul><ul><ul><li>Just give a larger/smaller proportion of objects depending on hardware </li></ul></ul><ul><li>Bigger/faster hardware for paying users </li></ul><ul><ul><li>A common approach </li></ul></ul><ul><ul><li>Can also allocate faster app servers via magic cookies at the LB </li></ul></ul>
  105. 105. Downsides <ul><li>Need to keep stuff in the right place </li></ul><ul><li>App logic gets more complicated </li></ul><ul><li>More clusters to manage </li></ul><ul><ul><li>Backups, etc </li></ul></ul><ul><li>More database connections needed per page </li></ul><ul><ul><li>Proxy can solve this, but complicated </li></ul></ul><ul><li>The dual table issue </li></ul><ul><ul><li>Avoid walking the shards! </li></ul></ul>
  106. 106. Bottom line <ul><li>Data federation is how large applications are scaled </li></ul>
  107. 107. Bottom line <ul><li>It’s hard, but not impossible </li></ul><ul><li>Good software design makes it easier </li></ul><ul><ul><li>Abstraction! </li></ul></ul><ul><li>Master-master pairs for shards give us HA </li></ul><ul><li>Master-master trees work for central cluster (many reads, few writes) </li></ul>
  108. 108. <ul><li>9. </li></ul><ul><li>Multi-site HA </li></ul>
  109. 109. Multiple Datacenters <ul><li>Having multiple datacenters is hard </li></ul><ul><ul><li>Not just with MySQL </li></ul></ul><ul><li>Hot/warm with MySQL slaved setup </li></ul><ul><ul><li>But manual (reconfig on failure) </li></ul></ul><ul><li>Hot/hot with master-master </li></ul><ul><ul><li>But dangerous (each site has a SPOF) </li></ul></ul><ul><li>Hot/hot with sync/async manual replication </li></ul><ul><ul><li>But tough (big engineering task) </li></ul></ul>
  110. 110. Multiple Datacenters
  111. 111. GSLB <ul><li>Multiple sites need to be balanced </li></ul><ul><ul><li>Global Server Load Balancing </li></ul></ul><ul><li>Easiest are AkaDNS-like services </li></ul><ul><ul><li>Performance rotations </li></ul></ul><ul><ul><li>Balance rotations </li></ul></ul>
  112. 112. <ul><li>10. </li></ul><ul><li>Serving Files </li></ul>
  113. 113. Serving lots of files <ul><li>Serving lots of files is not too tough </li></ul><ul><ul><li>Just buy lots of machines and load balance! </li></ul></ul><ul><li>We’re IO bound – need more spindles! </li></ul><ul><ul><li>But keeping many copies of data in sync is hard </li></ul></ul><ul><ul><li>And sometimes we have other per-request overhead (like auth) </li></ul></ul>
  114. 114. Reverse proxy
  115. 115. Reverse proxy <ul><li>Serving out of memory is fast! </li></ul><ul><ul><li>And our caching proxies can have disks too </li></ul></ul><ul><ul><li>Fast or otherwise </li></ul></ul><ul><ul><ul><li>More spindles is better </li></ul></ul></ul><ul><li>We can parallelize it! </li></ul><ul><ul><li>50 cache servers gives us 50 times the serving rate of the origin server </li></ul></ul><ul><ul><li>Assuming the working set is small enough to fit in memory in the cache cluster </li></ul></ul>
  116. 116. Invalidation <ul><li>Dealing with invalidation is tricky </li></ul><ul><li>We can prod the cache servers directly to clear stuff out </li></ul><ul><ul><li>Scales badly – need to clear asset from every server – doesn’t work well for 100 caches </li></ul></ul>
  117. 117. Invalidation <ul><li>We can change the URLs of modified resources </li></ul><ul><ul><li>And let the old ones drop out cache naturally </li></ul></ul><ul><ul><li>Or poke them out, for sensitive data </li></ul></ul><ul><li>Good approach! </li></ul><ul><ul><li>Avoids browser cache staleness </li></ul></ul><ul><ul><li>Hello Akamai (and other CDNs) </li></ul></ul><ul><ul><li>Read more: </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul>
  118. 118. CDN – Content Delivery Network <ul><li>Akamai, Savvis, Mirror Image Internet, etc </li></ul><ul><li>Caches operated by other people </li></ul><ul><ul><li>Already in-place </li></ul></ul><ul><ul><li>In lots of places </li></ul></ul><ul><li>GSLB/DNS balancing </li></ul>
  119. 119. Edge networks Origin
  120. 120. Edge networks Origin Cache Cache Cache Cache Cache Cache Cache Cache
  121. 121. CDN Models <ul><li>Simple model </li></ul><ul><ul><li>You push content to them, they serve it </li></ul></ul><ul><li>Reverse proxy model </li></ul><ul><ul><li>You publish content on an origin, they proxy and cache it </li></ul></ul>
  122. 122. CDN Invalidation <ul><li>You don’t control the caches </li></ul><ul><ul><li>Just like those awful ISP ones </li></ul></ul><ul><li>Once something is cached by a CDN, assume it can never change </li></ul><ul><ul><li>Nothing can be deleted </li></ul></ul><ul><ul><li>Nothing can be modified </li></ul></ul>
  123. 123. Versioning <ul><li>When you start to cache things, you need to care about versioning </li></ul><ul><ul><li>Invalidation & Expiry </li></ul></ul><ul><ul><li>Naming & Sync </li></ul></ul>
  124. 124. Cache Invalidation <ul><li>If you control the caches, invalidation is possible </li></ul><ul><li>But remember ISP and client caches </li></ul><ul><li>Remove deleted content explicitly </li></ul><ul><ul><li>Avoid users finding old content </li></ul></ul><ul><ul><li>Save cache space </li></ul></ul>
  125. 125. Cache versioning <ul><li>Simple rule of thumb: </li></ul><ul><ul><li>If an item is modified, change its name (URL) </li></ul></ul><ul><li>This can be independent of the file system! </li></ul>
  126. 126. Virtual versioning <ul><li>Database indicates version 3 of file </li></ul><ul><li>Web app writes version number into URL </li></ul><ul><li>Request comes through cache and is cached with the versioned URL </li></ul><ul><li>mod_rewrite converts versioned URL to path </li></ul>Version 3 Cached: foo_3.jpg foo_3.jpg -> foo.jpg
  127. 127. Reverse proxy <ul><li>Choices </li></ul><ul><ul><li>L7 load balancer & Squid </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>mod_proxy & mod_cache </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>Perlbal and Memcache? </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul>
  128. 128. More reverse proxy <ul><li>HA proxy </li></ul><ul><li>Squid & CARP </li></ul><ul><li>Varnish </li></ul>
  129. 129. High overhead serving <ul><li>What if you need to authenticate your asset serving? </li></ul><ul><ul><li>Private photos </li></ul></ul><ul><ul><li>Private data </li></ul></ul><ul><ul><li>Subscriber-only files </li></ul></ul><ul><li>Two main approaches </li></ul><ul><ul><li>Proxies w/ tokens </li></ul></ul><ul><ul><li>Path translation </li></ul></ul>
  130. 130. Perlbal Re-proxying <ul><li>Perlbal can do redirection magic </li></ul><ul><ul><li>Client sends request to Perlbal </li></ul></ul><ul><ul><li>Perlbal plugin verifies user credentials </li></ul></ul><ul><ul><ul><li>token, cookies, whatever </li></ul></ul></ul><ul><ul><ul><li>tokens avoid data-store access </li></ul></ul></ul><ul><ul><li>Perlbal goes to pick up the file from elsewhere </li></ul></ul><ul><ul><li>Transparent to user </li></ul></ul>
  131. 131. Perlbal Re-proxying
  132. 132. Perlbal Re-proxying <ul><li>Doesn’t keep database around while serving </li></ul><ul><li>Doesn’t keep app server around while serving </li></ul><ul><li>User doesn’t find out how to access asset directly </li></ul>
  133. 133. Permission URLs <ul><li>But why bother!? </li></ul><ul><li>If we bake the auth into the URL then it saves the auth step </li></ul><ul><li>We can do the auth on the web app servers when creating HTML </li></ul><ul><li>Just need some magic to translate to paths </li></ul><ul><li>We don’t want paths to be guessable </li></ul>
  134. 134. Permission URLs
  135. 135. Permission URLs (or mod_perl)
  136. 136. Permission URLs <ul><li>Downsides </li></ul><ul><ul><li>URL gives permission for life </li></ul></ul><ul><ul><li>Unless you bake in tokens </li></ul></ul><ul><ul><ul><li>Tokens tend to be non-expirable </li></ul></ul></ul><ul><ul><ul><ul><li>We don’t want to track every token </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Too much overhead </li></ul></ul></ul></ul></ul><ul><ul><ul><li>But can still expire </li></ul></ul></ul><ul><ul><ul><ul><li>But you choose at time of issue, not later </li></ul></ul></ul></ul><ul><li>Upsides </li></ul><ul><ul><li>It works </li></ul></ul><ul><ul><li>Scales very nicely </li></ul></ul>
  137. 137. <ul><li>11. </li></ul><ul><li>Storing Files </li></ul>
  138. 138. Storing lots of files <ul><li>Storing files is easy! </li></ul><ul><ul><li>Get a big disk </li></ul></ul><ul><ul><li>Get a bigger disk </li></ul></ul><ul><ul><li>Uh oh! </li></ul></ul><ul><li>Horizontal scaling is the key </li></ul><ul><ul><li>Again </li></ul></ul>
  139. 139. Connecting to storage <ul><li>NFS </li></ul><ul><ul><li>Stateful == Sucks </li></ul></ul><ul><ul><li>Hard mounts vs Soft mounts, INTR </li></ul></ul><ul><li>SMB / CIFS / Samba </li></ul><ul><ul><li>Turn off MSRPC & WINS (NetBOIS NS) </li></ul></ul><ul><ul><li>Stateful but degrades gracefully </li></ul></ul><ul><li>HTTP </li></ul><ul><ul><li>Stateless == Yay! </li></ul></ul><ul><ul><li>Just use Apache </li></ul></ul>
  140. 140. Multiple volumes <ul><li>Volumes are limited in total size </li></ul><ul><ul><li>Except (in theory) under ZFS & others </li></ul></ul><ul><li>Sometimes we need multiple volumes for performance reasons </li></ul><ul><ul><li>When using RAID with single/dual parity </li></ul></ul><ul><li>At some point, we need multiple volumes </li></ul>
  141. 141. Multiple volumes
  142. 142. Multiple hosts <ul><li>Further down the road, a single host will be too small </li></ul><ul><li>Total throughput of machine becomes an issue </li></ul><ul><li>Even physical space can start to matter </li></ul><ul><li>So we need to be able to use multiple hosts </li></ul>
  143. 143. Multiple hosts
  144. 144. HA Storage <ul><li>HA is important for file assets too </li></ul><ul><ul><li>We can back stuff up </li></ul></ul><ul><ul><li>But we tend to want hot redundancy </li></ul></ul><ul><li>RAID is good </li></ul><ul><ul><li>RAID 5 is cheap, RAID 10 is fast </li></ul></ul>
  145. 145. HA Storage <ul><li>But whole machines can fail </li></ul><ul><li>So we stick assets on multiple machines </li></ul><ul><li>In this case, we can ignore RAID </li></ul><ul><ul><li>In failure case, we serve from alternative source </li></ul></ul><ul><ul><li>But need to weigh up the rebuild time and effort against the risk </li></ul></ul><ul><ul><li>Store more than 2 copies? </li></ul></ul>
  146. 146. HA Storage
  147. 147. HA Storage <ul><li>But whole colos can fail </li></ul><ul><li>So we stick assets in multiple colos </li></ul><ul><li>Again, we can ignore RAID </li></ul><ul><ul><li>Or even multi-machine per colo </li></ul></ul><ul><ul><li>But do we want to lose a whole colo because of single disk failure? </li></ul></ul><ul><ul><li>Redundancy is always a balance </li></ul></ul>
  148. 148. Self repairing systems <ul><li>When something fails, repairing can be a pain </li></ul><ul><ul><li>RAID rebuilds by itself, but machine replication doesn’t </li></ul></ul><ul><li>The big appliances self heal </li></ul><ul><ul><li>NetApp, StorEdge, etc </li></ul></ul><ul><li>So does MogileFS (reaper) </li></ul>
  149. 149. Real Life Case Studies
  150. 150. GFS – Google File System <ul><li>Developed by … Google </li></ul><ul><li>Proprietary </li></ul><ul><li>Everything we know about it is based on talks they’ve given </li></ul><ul><li>Designed to store huge files for fast access </li></ul>
  151. 151. GFS – Google File System <ul><li>Single ‘Master’ node holds metadata </li></ul><ul><ul><li>SPF – Shadow master allows warm swap </li></ul></ul><ul><li>Grid of ‘chunkservers’ </li></ul><ul><ul><li>64bit filenames </li></ul></ul><ul><ul><li>64 MB file chunks </li></ul></ul>
  152. 152. GFS – Google File System 1(a) 2(a) 1(b) Master
  153. 153. GFS – Google File System <ul><li>Client reads metadata from master then file parts from multiple chunkservers </li></ul><ul><li>Designed for big files (>100MB) </li></ul><ul><li>Master server allocates access leases </li></ul><ul><li>Replication is automatic and self repairing </li></ul><ul><ul><li>Synchronously for atomicity </li></ul></ul>
  154. 154. GFS – Google File System <ul><li>Reading is fast (parallelizable) </li></ul><ul><ul><li>But requires a lease </li></ul></ul><ul><li>Master server is required for all reads and writes </li></ul>
  155. 155. MogileFS – OMG Files <ul><li>Developed by Danga / SixApart </li></ul><ul><li>Open source </li></ul><ul><li>Designed for scalable web app storage </li></ul>
  156. 156. MogileFS – OMG Files <ul><li>Single metadata store (MySQL) </li></ul><ul><ul><li>MySQL Cluster avoids SPF </li></ul></ul><ul><li>Multiple ‘tracker’ nodes locate files </li></ul><ul><li>Multiple ‘storage’ nodes store files </li></ul>
  157. 157. MogileFS – OMG Files Tracker Tracker MySQL
  158. 158. MogileFS – OMG Files <ul><li>Replication of file ‘classes’ happens transparently </li></ul><ul><li>Storage nodes are not mirrored – replication is piecemeal </li></ul><ul><li>Reading and writing go through trackers, but are performed directly upon storage nodes </li></ul>
  159. 159. Flickr File System <ul><li>Developed by Flickr </li></ul><ul><li>Proprietary </li></ul><ul><li>Designed for very large scalable web app storage </li></ul>
  160. 160. Flickr File System <ul><li>No metadata store </li></ul><ul><ul><li>Deal with it yourself </li></ul></ul><ul><li>Multiple ‘StorageMaster’ nodes </li></ul><ul><li>Multiple storage nodes with virtual volumes </li></ul>
  161. 161. Flickr File System SM SM SM
  162. 162. Flickr File System <ul><li>Metadata stored by app </li></ul><ul><ul><li>Just a virtual volume number </li></ul></ul><ul><ul><li>App chooses a path </li></ul></ul><ul><li>Virtual nodes are mirrored </li></ul><ul><ul><li>Locally and remotely </li></ul></ul><ul><li>Reading is done directly from nodes </li></ul>
  163. 163. Flickr File System <ul><li>StorageMaster nodes only used for write operations </li></ul><ul><li>Reading and writing can scale separately </li></ul>
  164. 164. Amazon S3 <ul><li>A big disk in the sky </li></ul><ul><li>Multiple ‘buckets’ </li></ul><ul><li>Files have user-defined keys </li></ul><ul><li>Data + metadata </li></ul>
  165. 165. Amazon S3 Servers Amazon
  166. 166. Amazon S3 Servers Amazon Users
  167. 167. The cost <ul><li>Fixed price, by the GB </li></ul><ul><li>Store: $0.15 per GB per month </li></ul><ul><li>Serve: $0.20 per GB </li></ul>
  168. 168. The cost S3
  169. 169. The cost S3 Regular Bandwidth
  170. 170. End costs <ul><li>~$2k to store 1TB for a year </li></ul><ul><li>~$63 a month for 1Mb </li></ul><ul><li>~$65k a month for 1Gb </li></ul>
  171. 171. <ul><li>12. </li></ul><ul><li>Field Work </li></ul>
  172. 172. Real world examples <ul><li>Flickr </li></ul><ul><ul><li>Because I know it </li></ul></ul><ul><li>LiveJournal </li></ul><ul><ul><li>Because everyone copies it </li></ul></ul>
  173. 173. Flickr Architecture
  174. 174. Flickr Architecture
  175. 175. LiveJournal Architecture
  176. 176. LiveJournal Architecture
  177. 177. Buy my book!
  178. 178. Or buy Theo’s
  179. 179. The end!
  180. 180. Awesome! <ul><li>These slides are available online: </li></ul><ul><li> </li></ul>