Beyond the File System: Designing Large-Scale File Storage and Serving

Cal Henderson at Web 2.0 expo


Transcript

  • 1. Beyond the File System: Designing Large-Scale File Storage and Serving (Cal Henderson)
  • 2. Hello!
  • 3. Big file systems?
    • Too vague!
    • What is a file system?
    • What constitutes big?
    • Some requirements would be nice
  • 4. Requirement 1: Scalable
    • Looking at storage and serving infrastructures
  • 5. Requirement 2: Reliable
    • Looking at redundancy, failure rates, on-the-fly changes
  • 6. Requirement 3: Cheap
    • Looking at upfront costs, TCO and lifetimes
  • 7. Four buckets: Storage, Serving, BCP, Cost
  • 8. Storage
  • 9. The storage stack
    • File protocol: NFS, CIFS, SMB
    • File system: ext, ReiserFS, NTFS
    • Block protocol: SCSI, SATA, FC
    • RAID: mirrors, stripes
    • Hardware: disks and stuff
  • 10. Hardware overview
    • The storage scale
    Internal -> DAS -> SAN -> NAS (lower to higher)
  • 11. Internal storage
    • A disk in a computer
      • SCSI, IDE, SATA
    • 4 disks in 1U is common
    • 8 for half depth boxes
  • 12. DAS
    • Direct Attached Storage
    • Disk shelf, connected by SCSI/SATA
    • HP MSA30: 14 disks in 3U
  • 13. SAN
    • Storage Area Network
    • Dumb disk shelves
    • Clients connect via a ‘fabric’
    • Fibre Channel, iSCSI, Infiniband
      • Low level protocols
  • 14. NAS
    • Network Attached Storage
    • Intelligent disk shelf
    • Clients connect via a network
    • NFS, SMB, CIFS
      • High level protocols
  • 15.
    • Of course, it’s more confusing than that
  • 16. Meet the LUN
    • Logical Unit Number
    • A slice of storage space
    • Originally for addressing a single drive:
      • c1t2d3
      • Controller, Target, Disk (Slice)
    • Now means a virtual partition/volume
      • LVM, Logical Volume Management
  • 17. NAS vs SAN
    • With a SAN, a single host (initiator) owns a single LUN/volume
    • With NAS, multiple hosts own a single LUN/volume
    • NAS head – NAS access to a SAN
  • 18. SAN Advantages
    • Virtualization within a SAN offers some nice features:
    • Real-time LUN replication
    • Transparent backup
    • SAN booting for host replacement
  • 19. Some Practical Examples
    • There are a lot of vendors
    • Configurations vary
    • Prices vary wildly
    • Let’s look at a couple
      • Ones I happen to have experience with
      • Not an endorsement ;)
  • 20. NetApp Filers
    • Heads and shelves, up to 500 TB in 6 cabinets
    • FC SAN with 1 or 2 NAS heads
  • 21. Isilon IQ
    • 2U Nodes, 3-96 nodes/cluster, 6-600 TB
    • FC/InfiniBand SAN with NAS head on each node
  • 22. Scaling
    • Vertical vs Horizontal
  • 23. Vertical scaling
    • Get a bigger box
    • Bigger disk(s)
    • More disks
    • Limited by current tech – size of each disk and total number in appliance
  • 24. Horizontal scaling
    • Buy more boxes
    • Add more servers/appliances
    • Scales forever*
    • *sort of
  • 25. Storage scaling approaches
    • Four common models:
    • Huge FS
    • Physical nodes
    • Virtual nodes
    • Chunked space
  • 26. Huge FS
    • Create one giant volume with growing space
      • Sun’s ZFS
      • Isilon IQ
    • Expandable on-the-fly?
    • Upper limits
      • Always limited somewhere
  • 27. Huge FS
    • Pluses
      • Simple from the application side
      • Logically simple
      • Low administrative overhead
    • Minuses
      • All your eggs in one basket
      • Hard to expand
      • Has an upper limit
  • 28. Physical nodes
    • Application handles distribution to multiple physical nodes
      • Disks, Boxes, Appliances, whatever
    • One ‘volume’ per node
    • Each node acts by itself
    • Expandable on-the-fly – add more nodes
    • Scales forever
  • 29. Physical Nodes
    • Pluses
      • Limitless expansion
      • Easy to expand
      • Unlikely to all fail at once
    • Minuses
      • Many ‘mounts’ to manage
      • More administration
  • 30. Virtual nodes
    • Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
    • Multiple volumes per node
    • Flexible
    • Expandable on-the-fly – add more nodes
    • Scales forever
  • 31. Virtual Nodes
    • Pluses
      • Limitless expansion
      • Easy to expand
      • Unlikely to all fail at once
      • Addressing is logical, not physical
      • Flexible volume sizing, consolidation
    • Minuses
      • Many ‘mounts’ to manage
      • More administration
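    A minimal sketch of the virtual-node idea above (my own illustration, with made-up host names): files are addressed by a virtual volume number, and a small mapping table records which physical node currently holds each volume, so volumes can be moved or consolidated without rewriting per-file records.

      # Hypothetical mapping: many virtual volumes spread over a few physical nodes.
      # Moving a volume to another box only changes this table, not per-file records.
      VOLUME_TO_NODE = {
          1: "node-a.example.com", 2: "node-a.example.com",
          3: "node-b.example.com", 4: "node-b.example.com",
          5: "node-c.example.com", 6: "node-c.example.com",
      }

      def place_file(file_id: int) -> int:
          """Pick a virtual volume for a new file (naive spread; real systems track free space)."""
          return (file_id % len(VOLUME_TO_NODE)) + 1

      def file_url(file_id: int, volume: int) -> str:
          """Resolve the physical node currently hosting the file's virtual volume."""
          return f"http://{VOLUME_TO_NODE[volume]}/vol{volume}/{file_id}"

      print(file_url(12345, place_file(12345)))
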
  • 32. Chunked space
    • Storage layer writes parts of files to different physical nodes
    • Like RAID striping, but at a higher level
    • High performance for large files
      • read multiple parts simultaneously
  • 33. Chunked space
    • Pluses
      • High performance
      • Limitless size
    • Minuses
      • Conceptually complex
      • Can be hard to expand on the fly
      • Can’t manually poke it
  • 34. Real Life Case Studies
  • 35. GFS – Google File System
    • Developed by … Google
    • Proprietary
    • Everything we know about it is based on talks they’ve given
    • Designed to store huge files for fast access
  • 36. GFS – Google File System
    • Single ‘Master’ node holds metadata
      • A single point of failure; a shadow master allows warm swap
    • Grid of ‘chunkservers’
      • 64-bit filenames
      • 64 MB file chunks
  • 37. GFS – Google File System (diagram: the Master coordinating chunkservers that hold file chunks 1(a), 1(b), 2(a))
  • 38. GFS – Google File System
    • Client reads metadata from master then file parts from multiple chunkservers
    • Designed for big files (>100MB)
    • Master server allocates access leases
    • Replication is automatic and self repairing
      • Synchronously for atomicity
  • 39. GFS – Google File System
    • Reading is fast (parallelizable)
      • But requires a lease
    • Master server is required for all reads and writes
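    A rough sketch of that read path (function and RPC names are invented, not GFS's real interfaces): the client works out which 64 MB chunk it needs, asks the master where that chunk lives, then reads the bytes directly from a chunkserver.

      CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunks are a fixed 64 MB

      def chunk_index(offset: int) -> int:
          """Which chunk of the file a given byte offset falls into."""
          return offset // CHUNK_SIZE

      def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
          """Client read: one metadata lookup at the master, data straight from a chunkserver."""
          idx = chunk_index(offset)
          handle, replicas = master.get_chunk_locations(filename, idx)  # invented RPC name
          return replicas[0].read(handle, offset % CHUNK_SIZE, length)  # master is off the data path
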
  • 40. MogileFS – OMG Files
    • Developed by Danga / SixApart
    • Open source
    • Designed for scalable web app storage
  • 41. MogileFS – OMG Files
    • Single metadata store (MySQL)
      • MySQL Cluster avoids SPF
    • Multiple ‘tracker’ nodes locate files
    • Multiple ‘storage’ nodes store files
  • 42. MogileFS – OMG Files (diagram: tracker nodes backed by the MySQL metadata store)
  • 43. MogileFS – OMG Files
    • Replication of file ‘classes’ happens transparently
    • Storage nodes are not mirrored – replication is piecemeal
    • Reading and writing go through trackers, but are performed directly upon storage nodes
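    A hedged sketch of that flow (this is not the real MogileFS client API; the tracker lookup is stubbed out with invented hosts and paths): the app asks a tracker where a key lives, then fetches the bytes over plain HTTP from a storage node, falling back to another replica if one is down.

      import urllib.request

      TRACKERS = ["tracker1.internal:7001", "tracker2.internal:7001"]  # hypothetical hosts

      def get_paths(key: str) -> list[str]:
          """Stub for the tracker lookup: which storage-node URLs hold this key?
          A real client speaks the tracker protocol to one of TRACKERS, which consults MySQL."""
          return [f"http://storage{n}.internal:7500/dev{n}/0/000/123.fid" for n in (1, 2)]

      def fetch(key: str) -> bytes:
          """Trackers answer 'where'; the bytes come straight from a storage node."""
          for url in get_paths(key):
              try:
                  return urllib.request.urlopen(url, timeout=5).read()
              except OSError:
                  continue  # dead node or stale path: try the next replica
          raise OSError(f"no reachable replica for {key!r}")
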
  • 44. Flickr File System
    • Developed by Flickr
    • Proprietary
    • Designed for very large scalable web app storage
  • 45. Flickr File System
    • No metadata store
      • Deal with it yourself
    • Multiple ‘StorageMaster’ nodes
    • Multiple storage nodes with virtual volumes
  • 46. Flickr File System (diagram: multiple StorageMaster (SM) nodes)
  • 47. Flickr File System
    • Metadata stored by app
      • Just a virtual volume number
      • App chooses a path
    • Virtual nodes are mirrored
      • Locally and remotely
    • Reading is done directly from nodes
  • 48. Flickr File System
    • StorageMaster nodes only used for write operations
    • Reading and writing can scale separately
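    A loose sketch of the pattern on slides 45-48, with invented table and host names: the app's own database keeps only a virtual volume number and an app-chosen path per photo, reads go straight to the node holding that volume, and only writes touch a StorageMaster.

      # Hypothetical app-side metadata: just a virtual volume number and an app-chosen path.
      photos = {4521: {"volume": 17, "path": "/17/45/4521_a1b2c3.jpg"}}

      VOLUME_HOSTS = {17: "storage12.example.com"}              # volume -> current physical node
      STORAGE_MASTERS = ["sm1.example.com", "sm2.example.com"]  # used for writes only

      def read_url(photo_id: int) -> str:
          """Reads bypass the StorageMasters and hit the node for that volume directly."""
          meta = photos[photo_id]
          return f"http://{VOLUME_HOSTS[meta['volume']]}{meta['path']}"

      def write_target(photo_id: int) -> str:
          """Writes go to a StorageMaster, which picks a volume and mirrors the data."""
          return STORAGE_MASTERS[photo_id % len(STORAGE_MASTERS)]

      print(read_url(4521), write_target(4521))
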
  • 49. Amazon S3
    • A big disk in the sky
    • Multiple ‘buckets’
    • Files have user-defined keys
    • Data + metadata
  • 50. Amazon S3 (diagram: your servers talking to Amazon S3)
  • 51. Amazon S3 (diagram: your servers and end users both talking to Amazon S3)
  • 52. The cost
    • Fixed price, by the GB
    • Store: $0.15 per GB per month
    • Serve: $0.20 per GB
  • 53. The cost (chart: S3 pricing)
  • 54. The cost (chart: S3 vs regular bandwidth)
  • 55. End costs
    • ~$2k to store 1TB for a year
    • ~$63 a month to serve a sustained 1 Mbps
    • ~$65k a month to serve a sustained 1 Gbps
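    The arithmetic behind those figures, at the prices quoted above ($0.15 per GB-month stored, $0.20 per GB served):

      STORE_PER_GB_MONTH = 0.15  # dollars per GB stored, per month
      SERVE_PER_GB = 0.20        # dollars per GB transferred out

      print(1024 * STORE_PER_GB_MONTH * 12)                    # ~1843: storing 1 TB for a year is ~$2k

      seconds_per_month = 30 * 24 * 3600
      gb_served = (1_000_000 / 8) * seconds_per_month / 1e9    # sustained 1 Mbps, in decimal GB
      print(gb_served * SERVE_PER_GB)                          # ~65: roughly the ~$63/month on the slide
      print(gb_served * 1000 * SERVE_PER_GB)                   # ~65,000: a sustained 1 Gbps
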
  • 56. Serving
  • 57. Serving files
    • Serving files is easy!
    (diagram: one Apache server reading from one disk)
  • 58. Serving files
    • Scaling is harder
    (diagram: several Apache servers, each with its own disk)
  • 59. Serving files
    • This doesn’t scale well
    • Primary storage is expensive
      • And takes a lot of space
    • In many systems, we only access a small number of files most of the time
  • 60. Caching
    • Insert caches between the storage and serving nodes
    • Cache frequently accessed content to reduce reads on the storage nodes
    • Software (Squid, mod_cache)
    • Hardware (Netcache, Cacheflow)
  • 61. Why it works
    • Keep a smaller working set
    • Use faster hardware
      • Lots of RAM
      • SCSI
      • Outer edge of disks (ZCAV)
    • Use more duplicates
      • Cheaper, since they’re smaller
  • 62. Two models
    • Layer 4
      • ‘Simple’ balanced cache
      • Objects in multiple caches
      • Good for few objects requested many times
    • Layer 7
      • URL-balanced cache
      • Objects in a single cache
      • Good for many objects requested a few times
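    A toy illustration of the difference (not from the talk): Layer 4 balancing spreads requests without looking at them, so a hot object ends up copied into every cache, while Layer 7 hashes the URL so each object lives in exactly one cache.

      import hashlib
      from itertools import count

      CACHES = ["cache1", "cache2", "cache3", "cache4"]
      _rr = count()

      def pick_cache_l4(url: str) -> str:
          """Layer 4 style: round-robin, blind to the URL; popular objects get cached everywhere."""
          return CACHES[next(_rr) % len(CACHES)]

      def pick_cache_l7(url: str) -> str:
          """Layer 7 style: hash the URL so exactly one cache 'owns' each object."""
          h = int(hashlib.md5(url.encode()).hexdigest(), 16)
          return CACHES[h % len(CACHES)]

      for _ in range(3):
          print(pick_cache_l4("/photos/123.jpg"), pick_cache_l7("/photos/123.jpg"))
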
  • 63. Replacement policies
    • LRU – Least recently used
    • GDSF – Greedy dual size frequency
    • LFUDA – Least frequently used with dynamic aging
    • All have advantages and disadvantages
    • Performance varies greatly with each
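    For concreteness, here is the simplest of the three, LRU, as a tiny sketch; GDSF and LFUDA refine the same idea by also weighting object size and access frequency.

      from collections import OrderedDict

      class LRUCache:
          """Evicts the least recently used object once the cache is full."""

          def __init__(self, capacity: int):
              self.capacity = capacity
              self.items = OrderedDict()

          def get(self, key):
              if key not in self.items:
                  return None
              self.items.move_to_end(key)           # mark as most recently used
              return self.items[key]

          def put(self, key, value):
              self.items[key] = value
              self.items.move_to_end(key)
              if len(self.items) > self.capacity:
                  self.items.popitem(last=False)    # drop the least recently used
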
  • 64. Cache Churn
    • How long do objects typically stay in cache?
    • If it gets too short, we’re doing badly
      • But it depends on your traffic profile
    • Make the cached object store larger
  • 65. Problems
    • Caching has some problems:
      • Invalidation is hard
      • Replacement is dumb (even LFUDA)
    • Avoiding caching makes your life (somewhat) easier
  • 66. CDN – Content Delivery Network
    • Akamai, Savvis, Mirror Image Internet, etc
    • Caches operated by other people
      • Already in-place
      • In lots of places
    • GSLB/DNS balancing
  • 67. Edge networks (diagram: a single origin)
  • 68. Edge networks (diagram: the origin surrounded by edge caches)
  • 69. CDN Models
    • Simple model
      • You push content to them, they serve it
    • Reverse proxy model
      • You publish content on an origin, they proxy and cache it
  • 70. CDN Invalidation
    • You don’t control the caches
      • Just like those awful ISP ones
    • Once something is cached by a CDN, assume it can never change
      • Nothing can be deleted
      • Nothing can be modified
  • 71. Versioning
    • When you start to cache things, you need to care about versioning
      • Invalidation & Expiry
      • Naming & Sync
  • 72. Cache Invalidation
    • If you control the caches, invalidation is possible
    • But remember ISP and client caches
    • Remove deleted content explicitly
      • Avoid users finding old content
      • Save cache space
  • 73. Cache versioning
    • Simple rule of thumb:
      • If an item is modified, change its name (URL)
    • This can be independent of the file system!
  • 74. Virtual versioning
    • Database indicates version 3 of file
    • Web app writes version number into URL
    • Request comes through cache and is cached with the versioned URL
    • mod_rewrite converts versioned URL to path
    (example: the database says version 3 -> the app emits example.com/foo_3.jpg -> the cache stores it as foo_3.jpg -> mod_rewrite maps foo_3.jpg to foo.jpg on disk)
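    The same flow as a small sketch (names are illustrative; the rewrite function stands in for the mod_rewrite rule): the version is baked into the public URL, so a modified file gets a brand-new cache key while keeping a single name on disk.

      import re

      def public_url(name: str, version: int) -> str:
          """What the web app emits: the current version number is part of the URL."""
          return f"http://example.com/{name}_{version}.jpg"

      def rewrite_to_path(url_path: str) -> str:
          """What the origin's rewrite rule does: strip the version to find the real file."""
          return re.sub(r"_\d+(\.[a-z]+)$", r"\1", url_path)

      print(public_url("foo", 3))           # http://example.com/foo_3.jpg (the cache key)
      print(rewrite_to_path("/foo_3.jpg"))  # /foo.jpg (the one file on disk)
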
  • 75. Authentication
    • Authentication inline layer
      • Apache / perlbal
    • Authentication sideline
      • ICP (CARP/HTCP)
    • Authentication by URL
      • FlickrFS
  • 76. Auth layer
    • Authenticator sits between client and storage
    • Typically built into the cache software
    (diagram: client -> cache with built-in authenticator -> origin)
  • 77. Auth sideline
    • Authenticator sits beside the cache
    • Lightweight protocol used for authenticator
    (diagram: cache consulting an authenticator beside it, origin behind)
  • 78. Auth by URL
    • Someone else (typically the web app) performs authentication and gives URLs to the client
    • URLs hold the ‘keys’ for accessing files
    (diagram: the web server hands the client a URL; the file is fetched via the cache from the origin)
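    One common way to put the 'keys' into the URL (an illustration, not necessarily what FlickrFS does): the web app signs the path and an expiry time with a secret shared with the serving tier, which can then verify requests without calling back to the application.

      import hashlib
      import hmac
      import time

      SECRET = b"shared-with-the-serving-tier"   # hypothetical shared secret

      def signed_url(path: str, ttl: int = 3600) -> str:
          """Web app: hand out a URL that proves the client was authorized."""
          expires = int(time.time()) + ttl
          sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
          return f"http://origin.example.com{path}?expires={expires}&sig={sig}"

      def is_valid(path: str, expires: int, sig: str) -> bool:
          """Cache/origin side: recompute the signature; no auth service on the read path."""
          if expires < time.time():
              return False
          expected = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
          return hmac.compare_digest(expected, sig)
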
  • 79. BCP
  • 80. Business Continuity Planning
    • How can I deal with the unexpected?
      • The core of BCP
    • Redundancy
    • Replication
  • 81. Reality
    • On a long enough timescale, anything that can fail, will fail
    • Of course, everything can fail
    • True reliability comes only through redundancy
  • 82. Reality
    • Define your own SLAs
    • How long can you afford to be down?
    • How manual is the recovery process?
    • How far can you roll back?
    • How many boxes or nodes can fail at once?
  • 83. Failure scenarios
    • Disk failure
    • Storage array failure
    • Storage head failure
    • Fabric failure
    • Metadata node failure
    • Power outage
    • Routing outage
  • 84. Reliable by design
    • RAID avoids disk failures, but not head or fabric failures
    • Duplicated nodes avoid host and fabric failures, but not routing or power failures
    • Dual-colo avoids routing and power failures, but may need duplication too
  • 85. Tend to all points in the stack
    • Going dual-colo: great
    • Taking a whole colo offline because of a single failed disk: bad
    • We need a combination of these
  • 86. Recovery times
    • BCP is not just about continuing when things fail
    • How can we restore after they come back?
    • Host and colo level syncing
      • replication queuing
    • Host and colo level rebuilding
  • 87. Reliable Reads & Writes
    • Reliable reads are easy
      • 2 or more copies of files
    • Reliable writes are harder
      • Write 2 copies at once
      • But what do we do when we can’t write to one?
  • 88. Dual writes
    • Queue up data to be written
      • Where?
      • Needs itself to be reliable
    • Queue up journal of changes
      • And then read data from the disk whose write succeeded
    • Duplicate whole volume after failure
      • Slow!
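    A bare-bones sketch of the 'queue up a journal of changes' option (all names invented): write both copies, and if one side fails, append the miss to a durable journal so the missing copy can be recreated when the node comes back.

      import json
      import time

      JOURNAL = "/var/spool/replication.journal"    # hypothetical path; must itself be reliable

      def dual_write(nodes, path: str, data: bytes) -> None:
          """Try to write every copy; journal whichever copies could not be made."""
          failed = []
          for node in nodes:
              try:
                  node.put(path, data)              # invented storage-node API
              except OSError:
                  failed.append(node.name)
          if len(failed) == len(nodes):
              raise OSError(f"no copy of {path} was written")
          if failed:
              with open(JOURNAL, "a") as journal:   # replayed later to restore redundancy
                  for name in failed:
                      journal.write(json.dumps({"node": name, "path": path, "ts": time.time()}) + "\n")
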
  • 89. Cost
  • 90. Judging cost
    • Per GB?
    • Per GB upfront and per year
    • Not as simple as you’d hope
      • How about an example
  • 91. Hardware costs: (cost of hardware) / (usable GB) [single cost]
  • 92. Power costs: (cost of power per year) / (usable GB) [recurring cost]
  • 93. Power costs: (power installation cost) / (usable GB) [single cost]
  • 94. Space costs: (cost per U x U's needed, incl. network) / (usable GB) [recurring cost]
  • 95. Network costs: (cost of network gear) / (usable GB) [single cost]
  • 96. Misc costs: (support contracts + spare disks + bus adaptors + cables) / (usable GB) [single & recurring costs]
  • 97. Human costs: (admin cost per node x node count) / (usable GB) [recurring cost]
  • 98. TCO
    • Total cost of ownership in two parts
      • Upfront
      • Ongoing
    • Architecture plays a huge part in costing
      • Don’t get tied to hardware
      • Allow heterogeneity
      • Move with the market
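    Pulling the per-GB formulas from slides 91-97 together into a toy calculation (every dollar figure here is made up, purely to show the shape of the sum):

      USABLE_GB = 14 * 500 * 0.75          # e.g. 14 x 500 GB disks, ~75% usable after RAID

      upfront = {                          # single costs, in dollars
          "hardware": 12_000,
          "power_install": 800,
          "network_gear": 1_500,
          "spares_cables_adapters": 1_200,
      }
      per_year = {                         # recurring costs, in dollars
          "power": 1_100,
          "space": 3 * 60 * 12,            # U's needed x cost per U per month x 12
          "support_contracts": 900,
          "admin": 2_000,                  # admin cost per node x node count
      }

      print("upfront  $/usable GB:", round(sum(upfront.values()) / USABLE_GB, 2))
      print("per-year $/usable GB:", round(sum(per_year.values()) / USABLE_GB, 2))
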
  • 99. (fin)
  • 100. Photo credits
    • flickr.com/photos/ebright/260823954/
    • flickr.com/photos/thomashawk/243477905/
    • flickr.com/photos/tom-carden/116315962/
    • flickr.com/photos/sillydog/287354869/
    • flickr.com/photos/foreversouls/131972916/
    • flickr.com/photos/julianb/324897/
    • flickr.com/photos/primejunta/140957047/
    • flickr.com/photos/whatknot/28973703/
    • flickr.com/photos/dcjohn/85504455/
  • 101.
    • You can find these slides online:
    • iamcal.com/talks/