Beyond the File System: Designing Large-Scale File Storage and Serving

Cal Henderson at Web 2.0 expo

Transcript of "Beyond the File System: Designing Large-Scale File Storage and Serving"

  1. Beyond the File System: Designing Large-Scale File Storage and Serving. Cal Henderson
  2. Hello!
  3. Big file systems? <ul><li>Too vague! </li></ul><ul><li>What is a file system? </li></ul><ul><li>What constitutes big? </li></ul><ul><li>Some requirements would be nice </li></ul>
  4. <ul><li>Scalable </li></ul><ul><li>Looking at storage and serving infrastructures </li></ul>1
  5. <ul><li>Reliable </li></ul><ul><li>Looking at redundancy, failure rates, on-the-fly changes </li></ul>2
  6. <ul><li>Cheap </li></ul><ul><li>Looking at upfront costs, TCO and lifetimes </li></ul>3
  7. Four buckets: Storage, Serving, BCP, Cost
  8. Storage
  9. The storage stack: File protocol (NFS, CIFS, SMB), File system (ext, reiserFS, NTFS), Block protocol (SCSI, SATA, FC), RAID (Mirrors, Stripes), Hardware (Disks and stuff)
  10. Hardware overview <ul><li>The storage scale: Internal (lower) → DAS → SAN → NAS (higher) </li></ul>
  11. Internal storage <ul><li>A disk in a computer </li></ul><ul><ul><li>SCSI, IDE, SATA </li></ul></ul><ul><li>4 disks in 1U is common </li></ul><ul><li>8 for half-depth boxes </li></ul>
  12. DAS: Direct attached storage. Disk shelf, connected by SCSI/SATA. HP MSA30 – 14 disks in 3U
  13. SAN <ul><li>Storage Area Network </li></ul><ul><li>Dumb disk shelves </li></ul><ul><li>Clients connect via a ‘fabric’ </li></ul><ul><li>Fibre Channel, iSCSI, InfiniBand </li></ul><ul><ul><li>Low-level protocols </li></ul></ul>
  14. NAS <ul><li>Network Attached Storage </li></ul><ul><li>Intelligent disk shelf </li></ul><ul><li>Clients connect via a network </li></ul><ul><li>NFS, SMB, CIFS </li></ul><ul><ul><li>High-level protocols </li></ul></ul>
  15. <ul><li>Of course, it’s more confusing than that </li></ul>
  16. Meet the LUN <ul><li>Logical Unit Number </li></ul><ul><li>A slice of storage space </li></ul><ul><li>Originally for addressing a single drive: </li></ul><ul><ul><li>c1t2d3 </li></ul></ul><ul><ul><li>Controller, Target, Disk (Slice) </li></ul></ul><ul><li>Now means a virtual partition/volume </li></ul><ul><ul><li>LVM, Logical Volume Management </li></ul></ul>
  17. NAS vs SAN <ul><li>With a SAN, a single host (initiator) owns a single LUN/volume </li></ul><ul><li>With NAS, multiple hosts share a single LUN/volume </li></ul><ul><li>NAS head – NAS access to a SAN </li></ul>
  18. SAN Advantages <ul><li>Virtualization within a SAN offers some nice features: </li></ul><ul><li>Real-time LUN replication </li></ul><ul><li>Transparent backup </li></ul><ul><li>SAN booting for host replacement </li></ul>
  19. Some Practical Examples <ul><li>There are a lot of vendors </li></ul><ul><li>Configurations vary </li></ul><ul><li>Prices vary wildly </li></ul><ul><li>Let’s look at a couple </li></ul><ul><ul><li>Ones I happen to have experience with </li></ul></ul><ul><ul><li>Not an endorsement ;) </li></ul></ul>
  20. NetApp Filers: Heads and shelves, up to 500TB in 6 cabs. FC SAN with 1 or 2 NAS heads
  21. Isilon IQ <ul><li>2U nodes, 3-96 nodes/cluster, 6-600 TB </li></ul><ul><li>FC/InfiniBand SAN with a NAS head on each node </li></ul>
  22. Scaling <ul><li>Vertical vs Horizontal </li></ul>
  23. Vertical scaling <ul><li>Get a bigger box </li></ul><ul><li>Bigger disk(s) </li></ul><ul><li>More disks </li></ul><ul><li>Limited by current tech – the size of each disk and the total number in an appliance </li></ul>
  24. Horizontal scaling <ul><li>Buy more boxes </li></ul><ul><li>Add more servers/appliances </li></ul><ul><li>Scales forever* </li></ul><ul><li>*sort of </li></ul>
  25. Storage scaling approaches <ul><li>Four common models: </li></ul><ul><li>Huge FS </li></ul><ul><li>Physical nodes </li></ul><ul><li>Virtual nodes </li></ul><ul><li>Chunked space </li></ul>
  26. Huge FS <ul><li>Create one giant volume with growing space </li></ul><ul><ul><li>Sun’s ZFS </li></ul></ul><ul><ul><li>Isilon IQ </li></ul></ul><ul><li>Expandable on the fly? </li></ul><ul><li>Upper limits </li></ul><ul><ul><li>Always limited somewhere </li></ul></ul>
  27. Huge FS <ul><li>Pluses </li></ul><ul><ul><li>Simple from the application side </li></ul></ul><ul><ul><li>Logically simple </li></ul></ul><ul><ul><li>Low administrative overhead </li></ul></ul><ul><li>Minuses </li></ul><ul><ul><li>All your eggs in one basket </li></ul></ul><ul><ul><li>Hard to expand </li></ul></ul><ul><ul><li>Has an upper limit </li></ul></ul>
  28. Physical nodes <ul><li>Application handles distribution to multiple physical nodes </li></ul><ul><ul><li>Disks, boxes, appliances, whatever </li></ul></ul><ul><li>One ‘volume’ per node </li></ul><ul><li>Each node acts by itself </li></ul><ul><li>Expandable on the fly – add more nodes </li></ul><ul><li>Scales forever </li></ul>
  29. Physical nodes <ul><li>Pluses </li></ul><ul><ul><li>Limitless expansion </li></ul></ul><ul><ul><li>Easy to expand </li></ul></ul><ul><ul><li>Unlikely to all fail at once </li></ul></ul><ul><li>Minuses </li></ul><ul><ul><li>Many ‘mounts’ to manage </li></ul></ul><ul><ul><li>More administration </li></ul></ul>
  30. Virtual nodes <ul><li>Application handles distribution to multiple virtual volumes, contained on multiple physical nodes </li></ul><ul><li>Multiple volumes per node </li></ul><ul><li>Flexible </li></ul><ul><li>Expandable on the fly – add more nodes </li></ul><ul><li>Scales forever </li></ul>
  31. Virtual nodes <ul><li>Pluses </li></ul><ul><ul><li>Limitless expansion </li></ul></ul><ul><ul><li>Easy to expand </li></ul></ul><ul><ul><li>Unlikely to all fail at once </li></ul></ul><ul><ul><li>Addressing is logical, not physical </li></ul></ul><ul><ul><li>Flexible volume sizing, consolidation </li></ul></ul><ul><li>Minuses </li></ul><ul><ul><li>Many ‘mounts’ to manage </li></ul></ul><ul><ul><li>More administration </li></ul></ul>
  32. Chunked space <ul><li>Storage layer writes parts of files to different physical nodes </li></ul><ul><li>A higher-level RAID striping </li></ul><ul><li>High performance for large files </li></ul><ul><ul><li>read multiple parts simultaneously </li></ul></ul>
  33. Chunked space <ul><li>Pluses </li></ul><ul><ul><li>High performance </li></ul></ul><ul><ul><li>Limitless size </li></ul></ul><ul><li>Minuses </li></ul><ul><ul><li>Conceptually complex </li></ul></ul><ul><ul><li>Can be hard to expand on the fly </li></ul></ul><ul><ul><li>Can’t manually poke it </li></ul></ul>
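The virtual-node model above can be sketched in a few lines. This is a hedged illustration, not anything from the talk: the volume count, node names, and hashing scheme are all invented. The key idea is that files hash to a fixed set of virtual volumes, and a small table maps each volume to whichever physical node currently holds it, so rebalancing never changes a file's logical address.

```python
# Sketch of virtual-node addressing (all names illustrative).
# Files hash to one of many virtual volumes; a mapping table says which
# physical node currently hosts each volume. Moving a volume between
# nodes only updates the table -- file addresses never change.

import hashlib

N_VIRTUAL_VOLUMES = 1024

# volume -> physical node; in production this would live in a database
volume_to_node = {v: f"node{v % 4}" for v in range(N_VIRTUAL_VOLUMES)}

def volume_for(file_key: str) -> int:
    digest = hashlib.md5(file_key.encode()).hexdigest()
    return int(digest, 16) % N_VIRTUAL_VOLUMES

def node_for(file_key: str) -> str:
    return volume_to_node[volume_for(file_key)]

# Rebalancing: reassign one volume to a new node; only that volume's
# files move, and the application-visible volume id stays stable.
volume_to_node[7] = "node9"
```

Because addressing is logical, consolidating two half-empty nodes is just a series of table updates plus background copies, which is exactly the flexibility the slide claims over raw physical nodes.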
  34. Real Life Case Studies
  35. GFS – Google File System <ul><li>Developed by … Google </li></ul><ul><li>Proprietary </li></ul><ul><li>Everything we know about it is based on talks they’ve given </li></ul><ul><li>Designed to store huge files for fast access </li></ul>
  36. GFS – Google File System <ul><li>A single ‘Master’ node holds metadata </li></ul><ul><ul><li>SPF – a shadow master allows warm swap </li></ul></ul><ul><li>Grid of ‘chunkservers’ </li></ul><ul><ul><li>64-bit chunk handles </li></ul></ul><ul><ul><li>64 MB file chunks </li></ul></ul>
  37. GFS – Google File System [diagram: file chunks 1(a), 2(a), 1(b) spread across chunkservers, coordinated by the Master]
  38. GFS – Google File System <ul><li>Client reads metadata from the master, then file parts from multiple chunkservers </li></ul><ul><li>Designed for big files (>100 MB) </li></ul><ul><li>Master server allocates access leases </li></ul><ul><li>Replication is automatic and self-repairing </li></ul><ul><ul><li>Synchronous, for atomicity </li></ul></ul>
  39. GFS – Google File System <ul><li>Reading is fast (parallelizable) </li></ul><ul><ul><li>But requires a lease </li></ul></ul><ul><li>The master server is required for all reads and writes </li></ul>
  40. MogileFS – OMG Files <ul><li>Developed by Danga / SixApart </li></ul><ul><li>Open source </li></ul><ul><li>Designed for scalable web app storage </li></ul>
  41. MogileFS – OMG Files <ul><li>Single metadata store (MySQL) </li></ul><ul><ul><li>MySQL Cluster avoids SPF </li></ul></ul><ul><li>Multiple ‘tracker’ nodes locate files </li></ul><ul><li>Multiple ‘storage’ nodes store files </li></ul>
  42. MogileFS – OMG Files [diagram: tracker nodes backed by a MySQL metadata store]
  43. MogileFS – OMG Files <ul><li>Replication of file ‘classes’ happens transparently </li></ul><ul><li>Storage nodes are not mirrored – replication is piecemeal </li></ul><ul><li>Reading and writing go through trackers, but are performed directly upon storage nodes </li></ul>
  44. Flickr File System <ul><li>Developed by Flickr </li></ul><ul><li>Proprietary </li></ul><ul><li>Designed for very large scalable web app storage </li></ul>
  45. Flickr File System <ul><li>No metadata store </li></ul><ul><ul><li>Deal with it yourself </li></ul></ul><ul><li>Multiple ‘StorageMaster’ nodes </li></ul><ul><li>Multiple storage nodes with virtual volumes </li></ul>
  46. Flickr File System [diagram: multiple StorageMaster (SM) nodes]
  47. Flickr File System <ul><li>Metadata stored by the app </li></ul><ul><ul><li>Just a virtual volume number </li></ul></ul><ul><ul><li>App chooses a path </li></ul></ul><ul><li>Virtual nodes are mirrored </li></ul><ul><ul><li>Locally and remotely </li></ul></ul><ul><li>Reading is done directly from nodes </li></ul>
  48. Flickr File System <ul><li>StorageMaster nodes are only used for write operations </li></ul><ul><li>Reading and writing can scale separately </li></ul>
  49. Amazon S3 <ul><li>A big disk in the sky </li></ul><ul><li>Multiple ‘buckets’ </li></ul><ul><li>Files have user-defined keys </li></ul><ul><li>Data + metadata </li></ul>
  50. Amazon S3 [diagram: your servers storing to and serving from Amazon]
  51. Amazon S3 [diagram: your servers storing to Amazon, with users fetching directly from Amazon]
  52. The cost <ul><li>Fixed price, by the GB </li></ul><ul><li>Store: $0.15 per GB per month </li></ul><ul><li>Serve: $0.20 per GB </li></ul>
  53. The cost [diagram: serving straight out of S3]
  54. The cost [diagram: storing in S3, serving over regular bandwidth]
  55. End costs <ul><li>~$2k to store 1 TB for a year </li></ul><ul><li>~$63 a month for 1 Mbps </li></ul><ul><li>~$65k a month for 1 Gbps </li></ul>
  56. Serving
  57. Serving files <ul><li>Serving files is easy! </li></ul>[diagram: Apache reading from a disk]
  58. Serving files <ul><li>Scaling is harder </li></ul>[diagram: many Apache + disk pairs]
  59. Serving files <ul><li>This doesn’t scale well </li></ul><ul><li>Primary storage is expensive </li></ul><ul><ul><li>And takes a lot of space </li></ul></ul><ul><li>In many systems, we only access a small number of files most of the time </li></ul>
  60. Caching <ul><li>Insert caches between the storage and serving nodes </li></ul><ul><li>Cache frequently accessed content to reduce reads on the storage nodes </li></ul><ul><li>Software (Squid, mod_cache) </li></ul><ul><li>Hardware (NetCache, CacheFlow) </li></ul>
  61. Why it works <ul><li>Keep a smaller working set </li></ul><ul><li>Use faster hardware </li></ul><ul><ul><li>Lots of RAM </li></ul></ul><ul><ul><li>SCSI </li></ul></ul><ul><ul><li>Outer edge of disks (ZCAV) </li></ul></ul><ul><li>Use more duplicates </li></ul><ul><ul><li>Cheaper, since they’re smaller </li></ul></ul>
  62. Two models <ul><li>Layer 4 </li></ul><ul><ul><li>‘Simple’ balanced cache </li></ul></ul><ul><ul><li>Objects in multiple caches </li></ul></ul><ul><ul><li>Good for a few objects requested many times </li></ul></ul><ul><li>Layer 7 </li></ul><ul><ul><li>URL-balanced cache </li></ul></ul><ul><ul><li>Objects in a single cache </li></ul></ul><ul><ul><li>Good for many objects requested a few times </li></ul></ul>
  63. Replacement policies <ul><li>LRU – Least recently used </li></ul><ul><li>GDSF – Greedy dual size frequency </li></ul><ul><li>LFUDA – Least frequently used with dynamic aging </li></ul><ul><li>All have advantages and disadvantages </li></ul><ul><li>Performance varies greatly with each </li></ul>
  64. Cache churn <ul><li>How long do objects typically stay in cache? </li></ul><ul><li>If it gets too short, we’re doing badly </li></ul><ul><ul><li>But it depends on your traffic profile </li></ul></ul><ul><li>Make the cached object store larger </li></ul>
  65. Problems <ul><li>Caching has some problems: </li></ul><ul><ul><li>Invalidation is hard </li></ul></ul><ul><ul><li>Replacement is dumb (even LFUDA) </li></ul></ul><ul><li>Avoiding caching makes your life (somewhat) easier </li></ul>
  66. CDN – Content Delivery Network <ul><li>Akamai, Savvis, Mirror Image Internet, etc. </li></ul><ul><li>Caches operated by other people </li></ul><ul><ul><li>Already in place </li></ul></ul><ul><ul><li>In lots of places </li></ul></ul><ul><li>GSLB/DNS balancing </li></ul>
  67. Edge networks [diagram: a lone origin server]
  68. Edge networks [diagram: the origin surrounded by edge caches]
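The layer-7 model can be reduced to a one-line idea: hash the URL to pick exactly one cache. A minimal sketch, with invented cache names:

```python
# Layer-7 cache selection sketch: each URL deterministically maps to one
# cache, so every object is stored exactly once across the cache tier.

import hashlib

CACHES = ["cache0", "cache1", "cache2", "cache3"]

def cache_for(url: str) -> str:
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return CACHES[h % len(CACHES)]
```

A layer-4 balancer would instead send each request to any cache, duplicating hot objects in several places; that trades total cache capacity for resilience on a few very hot URLs.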
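As a concrete reference point for the simplest of those policies, LRU fits in a few lines with an ordered dictionary. This is a sketch of the algorithm itself, not of how Squid or any appliance implements it:

```python
# Minimal LRU cache: on overflow, evict the least recently used entry.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
```

GDSF and LFUDA refine this by also weighing object size and access frequency, which matters when one huge object can displace thousands of small hot ones.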
  69. CDN Models <ul><li>Simple model </li></ul><ul><ul><li>You push content to them, they serve it </li></ul></ul><ul><li>Reverse proxy model </li></ul><ul><ul><li>You publish content on an origin, they proxy and cache it </li></ul></ul>
  70. CDN Invalidation <ul><li>You don’t control the caches </li></ul><ul><ul><li>Just like those awful ISP ones </li></ul></ul><ul><li>Once something is cached by a CDN, assume it can never change </li></ul><ul><ul><li>Nothing can be deleted </li></ul></ul><ul><ul><li>Nothing can be modified </li></ul></ul>
  71. Versioning <ul><li>When you start to cache things, you need to care about versioning </li></ul><ul><ul><li>Invalidation & Expiry </li></ul></ul><ul><ul><li>Naming & Sync </li></ul></ul>
  72. Cache Invalidation <ul><li>If you control the caches, invalidation is possible </li></ul><ul><li>But remember ISP and client caches </li></ul><ul><li>Remove deleted content explicitly </li></ul><ul><ul><li>Avoid users finding old content </li></ul></ul><ul><ul><li>Save cache space </li></ul></ul>
  73. Cache versioning <ul><li>Simple rule of thumb: </li></ul><ul><ul><li>If an item is modified, change its name (URL) </li></ul></ul><ul><li>This can be independent of the file system! </li></ul>
  74. Virtual versioning <ul><li>Database indicates version 3 of the file </li></ul><ul><li>Web app writes the version number into the URL </li></ul><ul><li>Request comes through the cache and is cached with the versioned URL </li></ul><ul><li>mod_rewrite converts the versioned URL back to a path </li></ul>(Version 3: example.com/foo_3.jpg is cached as foo_3.jpg; on disk, foo_3.jpg -> foo.jpg)
  75. Authentication <ul><li>Authentication inline layer </li></ul><ul><ul><li>Apache / perlbal </li></ul></ul><ul><li>Authentication sideline </li></ul><ul><ul><li>ICP (CARP/HTCP) </li></ul></ul><ul><li>Authentication by URL </li></ul><ul><ul><li>FlickrFS </li></ul></ul>
  76. Auth layer <ul><li>Authenticator sits between the client and storage </li></ul><ul><li>Typically built into the cache software </li></ul>[diagram: Cache, Authenticator, Origin]
  77. Auth sideline <ul><li>Authenticator sits beside the cache </li></ul><ul><li>A lightweight protocol is used to talk to the authenticator </li></ul>[diagram: Cache, Authenticator, Origin]
  78. Auth by URL <ul><li>Someone else performs authentication and gives URLs to the client (typically the web app) </li></ul><ul><li>URLs hold the ‘keys’ for accessing files </li></ul>[diagram: Web Server, Cache, Origin]
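The virtual-versioning flow above can be sketched as two tiny functions: the web app embeds the current version into the filename (so a modified file gets a fresh URL and stale cache entries are simply never requested again), and the origin strips the version suffix back off, which is the job mod_rewrite does in the slide. Function names are illustrative:

```python
# Sketch of the versioned-URL scheme: the app writes the version into
# the URL; the origin strips it before touching the file system.

import re

def versioned_url(base: str, name: str, version: int) -> str:
    stem, dot, ext = name.rpartition(".")
    return f"{base}/{stem}_{version}{dot}{ext}"

def strip_version(path: str) -> str:
    """foo_3.jpg -> foo.jpg, as mod_rewrite would do on the origin."""
    return re.sub(r"_\d+(\.\w+)$", r"\1", path)
```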
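One common way to make URLs hold the ‘keys’ is to sign them. This is an assumed scheme for illustration, not Flickr's actual one: the web app signs the path and an expiry time with a shared secret, and the serving layer recomputes the HMAC so it can reject tampered or expired URLs without its own auth database.

```python
# Sketch of auth-by-URL via HMAC-signed, expiring links (scheme assumed).

import hashlib
import hmac

SECRET = b"shared-secret"  # illustrative; shared by web app and servers

def sign(path: str, expires: int) -> str:
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def verify(path: str, expires: int, sig: str, now: int) -> bool:
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return now < expires and hmac.compare_digest(sig, expected)
```

The nice property is that the caches and origin stay completely dumb: they only need the secret and a clock, never a session store.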
  79. BCP
  80. Business Continuity Planning <ul><li>How can I deal with the unexpected? </li></ul><ul><ul><li>The core of BCP </li></ul></ul><ul><li>Redundancy </li></ul><ul><li>Replication </li></ul>
  81. Reality <ul><li>On a long enough timescale, anything that can fail, will fail </li></ul><ul><li>Of course, everything can fail </li></ul><ul><li>True reliability comes only through redundancy </li></ul>
  82. Reality <ul><li>Define your own SLAs </li></ul><ul><li>How long can you afford to be down? </li></ul><ul><li>How manual is the recovery process? </li></ul><ul><li>How far can you roll back? </li></ul><ul><li>How many nodes can fail at once? </li></ul>
  83. Failure scenarios <ul><li>Disk failure </li></ul><ul><li>Storage array failure </li></ul><ul><li>Storage head failure </li></ul><ul><li>Fabric failure </li></ul><ul><li>Metadata node failure </li></ul><ul><li>Power outage </li></ul><ul><li>Routing outage </li></ul>
  84. Reliable by design <ul><li>RAID avoids disk failures, but not head or fabric failures </li></ul><ul><li>Duplicated nodes avoid host and fabric failures, but not routing or power failures </li></ul><ul><li>Dual-colo avoids routing and power failures, but may need duplication too </li></ul>
  85. Tend to all points in the stack <ul><li>Going dual-colo: great </li></ul><ul><li>Taking a whole colo offline because of a single failed disk: bad </li></ul><ul><li>We need a combination of these </li></ul>
  86. Recovery times <ul><li>BCP is not just about continuing when things fail </li></ul><ul><li>How can we restore after they come back? </li></ul><ul><li>Host- and colo-level syncing </li></ul><ul><ul><li>replication queuing </li></ul></ul><ul><li>Host- and colo-level rebuilding </li></ul>
  87. Reliable Reads & Writes <ul><li>Reliable reads are easy </li></ul><ul><ul><li>2 or more copies of files </li></ul></ul><ul><li>Reliable writes are harder </li></ul><ul><ul><li>Write 2 copies at once </li></ul></ul><ul><ul><li>But what do we do when we can’t write to one? </li></ul></ul>
  88. Dual writes <ul><li>Queue up data to be written </li></ul><ul><ul><li>Where? </li></ul></ul><ul><ul><li>Needs itself to be reliable </li></ul></ul><ul><li>Queue up a journal of changes </li></ul><ul><ul><li>And then read data from the disk whose write succeeded </li></ul></ul><ul><li>Duplicate the whole volume after failure </li></ul><ul><ul><li>Slow! </li></ul></ul>
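The journal variant of dual writes can be sketched as follows. This is a toy model under my own assumptions (two in-memory replicas, a plain list as the journal); a real system would need the journal itself to be durable, as the slide warns:

```python
# Sketch of dual writes with a change journal: try both copies; if one
# side is down, record the miss so a repair pass can catch it up later.

nodes = {"a": {}, "b": {}}
down = set()   # nodes currently unreachable
journal = []   # (node, key) pairs still needing a copy

def write(key, value):
    ok = 0
    for name, store in nodes.items():
        if name in down:
            journal.append((name, key))  # remember the failed copy
        else:
            store[key] = value
            ok += 1
    if ok == 0:
        raise IOError("no replica accepted the write")

def repair():
    while journal:
        name, key = journal.pop()
        if name in down:
            journal.insert(0, (name, key))  # still down; retry later
            break
        other = "b" if name == "a" else "a"
        nodes[name][key] = nodes[other][key]  # copy from the survivor
```

Replaying the journal only copies the keys that were missed, which is why it beats re-duplicating the whole volume after a failure.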
  89. Cost
  90. Judging cost <ul><li>Per GB? </li></ul><ul><li>Per GB, upfront and per year </li></ul><ul><li>Not as simple as you’d hope </li></ul><ul><ul><li>How about an example? </li></ul></ul>
  91. Hardware costs: cost of hardware ÷ usable GB (single cost)
  92. Power costs: cost of power per year ÷ usable GB (recurring cost)
  93. Power costs: power installation cost ÷ usable GB (single cost)
  94. Space costs: (cost per U × number of U’s needed, inc. network) ÷ usable GB (recurring cost)
  95. Network costs: cost of network gear ÷ usable GB (single cost)
  96. Misc costs: (support contracts + spare disks + bus adaptors + cables) ÷ usable GB (single & recurring costs)
  97. Human costs: (admin cost per node × node count) ÷ usable GB (recurring cost)
  98. TCO <ul><li>Total cost of ownership in two parts </li></ul><ul><ul><li>Upfront </li></ul></ul><ul><ul><li>Ongoing </li></ul></ul><ul><li>Architecture plays a huge part in costing </li></ul><ul><ul><li>Don’t get tied to hardware </li></ul></ul><ul><ul><li>Allow heterogeneity </li></ul></ul><ul><ul><li>Move with the market </li></ul></ul>
  99. (fin)
  100. Photo credits <ul><li>flickr.com/photos/ebright/260823954/ </li></ul><ul><li>flickr.com/photos/thomashawk/243477905/ </li></ul><ul><li>flickr.com/photos/tom-carden/116315962/ </li></ul><ul><li>flickr.com/photos/sillydog/287354869/ </li></ul><ul><li>flickr.com/photos/foreversouls/131972916/ </li></ul><ul><li>flickr.com/photos/julianb/324897/ </li></ul><ul><li>flickr.com/photos/primejunta/140957047/ </li></ul><ul><li>flickr.com/photos/whatknot/28973703/ </li></ul><ul><li>flickr.com/photos/dcjohn/85504455/ </li></ul>
  101. <ul><li>You can find these slides online: </li></ul><ul><li>iamcal.com/talks/ </li></ul>
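The per-GB formulas above combine into a simple two-part TCO calculation. All figures below are invented for illustration: upfront costs are paid once, recurring costs once per year, and everything is normalized to usable GB.

```python
# Toy TCO-per-GB calculator (all numbers illustrative, not real quotes).

def tco_per_gb(usable_gb, upfront, recurring_per_year, years):
    """Total cost per usable GB over the hardware's lifetime."""
    total = sum(upfront.values()) + years * sum(recurring_per_year.values())
    return total / usable_gb

cost = tco_per_gb(
    usable_gb=10_000,
    upfront={"hardware": 15_000, "power_install": 1_000, "network": 2_000},
    recurring_per_year={"power": 1_200, "space": 2_400, "admin": 3_000},
    years=3,
)
```

Changing the assumed lifetime or the admin cost per node swings the result substantially, which is the slide's point: the architecture you choose drives the ongoing half of the bill.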