Beyond the File System: Designing Large-Scale File Storage and Serving
Cal Henderson at Web 2.0 Expo
    Presentation Transcript

    • Beyond the File System: Designing Large-Scale File Storage and Serving – Cal Henderson
    • Hello!
    • Big file systems?
      • Too vague!
        • What is a file system?
        • What constitutes big?
      • Some requirements would be nice:
      • 1. Scalable
        • Looking at storage and serving infrastructures
      • 2. Reliable
        • Looking at redundancy, failure rates, on-the-fly changes
      • 3. Cheap
        • Looking at upfront costs, TCO and lifetimes
    • Four buckets: Storage, Serving, BCP, Cost
    • Storage
    • The storage stack
      • File protocol: NFS, CIFS, SMB
      • File system: ext, ReiserFS, NTFS
      • Block protocol: SCSI, SATA, FC
      • RAID: mirrors, stripes
      • Hardware: disks and stuff
    • Hardware overview
      • The storage scale, from lower to higher: Internal → DAS → SAN → NAS
    • Internal storage
      • A disk in a computer
        • SCSI, IDE, SATA
      • 4 disks in 1U is common
      • 8 for half depth boxes
    • DAS
      • Direct Attached Storage
      • Disk shelf, connected by SCSI/SATA
      • e.g. HP MSA30 – 14 disks in 3U
    • SAN
      • Storage Area Network
      • Dumb disk shelves
      • Clients connect via a ‘fabric’
      • Fibre Channel, iSCSI, Infiniband
        • Low level protocols
    • NAS
      • Network Attached Storage
      • Intelligent disk shelf
      • Clients connect via a network
      • NFS, SMB, CIFS
        • High level protocols
      • Of course, it’s more confusing than that
    • Meet the LUN
      • Logical Unit Number
      • A slice of storage space
      • Originally for addressing a single drive:
        • c1t2d3
        • Controller, Target, Disk (Slice)
      • Now means a virtual partition/volume
        • LVM, Logical Volume Management
    • NAS vs SAN
      • With a SAN, a single host (initiator) owns a single LUN/volume
      • With NAS, multiple hosts own a single LUN/volume
      • NAS head – NAS access to a SAN
    • SAN Advantages
      • Virtualization within a SAN offers some nice features:
      • Real-time LUN replication
      • Transparent backup
      • SAN booting for host replacement
    • Some Practical Examples
      • There are a lot of vendors
      • Configurations vary
      • Prices vary wildly
      • Let’s look at a couple
        • Ones I happen to have experience with
        • Not an endorsement ;)
    • NetApp Filers
      • Heads and shelves, up to 500TB in 6 cabs
      • FC SAN with 1 or 2 NAS heads
    • Isilon IQ
      • 2U Nodes, 3-96 nodes/cluster, 6-600 TB
      • FC/InfiniBand SAN with NAS head on each node
    • Scaling
      • Vertical vs Horizontal
    • Vertical scaling
      • Get a bigger box
      • Bigger disk(s)
      • More disks
      • Limited by current tech – size of each disk and total number in appliance
    • Horizontal scaling
      • Buy more boxes
      • Add more servers/appliances
      • Scales forever*
      • *sort of
    • Storage scaling approaches
      • Four common models:
      • Huge FS
      • Physical nodes
      • Virtual nodes
      • Chunked space
    • Huge FS
      • Create one giant volume with growing space
        • Sun’s ZFS
        • Isilon IQ
      • Expandable on-the-fly?
      • Upper limits
        • Always limited somewhere
    • Huge FS
      • Pluses
        • Simple from the application side
        • Logically simple
        • Low administrative overhead
      • Minuses
        • All your eggs in one basket
        • Hard to expand
        • Has an upper limit
    • Physical nodes
      • Application handles distribution to multiple physical nodes
        • Disks, Boxes, Appliances, whatever
      • One ‘volume’ per node
      • Each node acts by itself
      • Expandable on-the-fly – add more nodes
      • Scales forever
    • Physical Nodes
      • Pluses
        • Limitless expansion
        • Easy to expand
        • Unlikely to all fail at once
      • Minuses
        • Many ‘mounts’ to manage
        • More administration
    • Virtual nodes
      • Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
      • Multiple volumes per node
      • Flexible
      • Expandable on-the-fly – add more nodes
      • Scales forever
    • Virtual Nodes
      • Pluses
        • Limitless expansion
        • Easy to expand
        • Unlikely to all fail at once
        • Addressing is logical, not physical
        • Flexible volume sizing, consolidation
      • Minuses
        • Many ‘mounts’ to manage
        • More administration
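      A minimal sketch (assumed, not from the talk) of virtual-node addressing: the application hashes a file key to a virtual volume, then resolves that volume to whichever physical node currently hosts it. Hostnames and the table layout are hypothetical.

          import hashlib

          # Hypothetical volume-to-node map; in a real system this lives in a
          # database and is updated when volumes move between physical nodes.
          VOLUME_MAP = {
              0: "storage1.example.com",
              1: "storage1.example.com",
              2: "storage2.example.com",
              3: "storage2.example.com",
          }

          def locate(file_key: str):
              """Hash a file key to a virtual volume, then resolve the volume
              to whichever physical node currently hosts it."""
              digest = hashlib.md5(file_key.encode()).hexdigest()
              volume = int(digest, 16) % len(VOLUME_MAP)
              return volume, VOLUME_MAP[volume]

          print(locate("photos/1234.jpg"))  # e.g. (2, 'storage2.example.com')

      Because the volume-to-node table is a level of indirection, volumes can be moved or consolidated onto different hardware without changing the application's file keys.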
    • Chunked space
      • Storage layer writes parts of files to different physical nodes
      • Like RAID striping, but at a higher level
      • High performance for large files
        • read multiple parts simultaneously
    • Chunked space
      • Pluses
        • High performance
        • Limitless size
      • Minuses
        • Conceptually complex
        • Can be hard to expand on the fly
        • Can’t manually poke it
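      A rough illustration of chunked space, under the simplifying assumptions of fixed-size chunks and round-robin placement (real systems allocate more cleverly):

          CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, the size GFS uses

          def chunk_file(path, nodes):
              """Split a file into fixed-size chunks and assign each chunk to
              a storage node round-robin; returns the chunk map the metadata
              layer must keep in order to reassemble the file later."""
              chunk_map = []
              with open(path, "rb") as f:
                  index = 0
                  while chunk := f.read(CHUNK_SIZE):
                      # A real system would now ship `chunk` to its node; elided.
                      chunk_map.append((index, nodes[index % len(nodes)], len(chunk)))
                      index += 1
              return chunk_map

      Reads can then fetch the chunks from their nodes in parallel, which is where the large-file performance comes from.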
    • Real Life Case Studies
    • GFS – Google File System
      • Developed by … Google
      • Proprietary
      • Everything we know about it is based on talks they’ve given
      • Designed to store huge files for fast access
    • GFS – Google File System
      • Single ‘Master’ node holds metadata
        • SPF (single point of failure) – a shadow master allows warm swap
      • Grid of ‘chunkservers’
        • 64-bit filenames
        • 64MB file chunks
    • GFS – Google File System
      [Diagram: a master node directing clients to chunkservers holding replicated chunks 1(a), 1(b), 2(a)]
    • GFS – Google File System
      • Client reads metadata from master then file parts from multiple chunkservers
      • Designed for big files (>100MB)
      • Master server allocates access leases
      • Replication is automatic and self repairing
        • Synchronously for atomicity
    • GFS – Google File System
      • Reading is fast (parallelizable)
        • But requires a lease
      • Master server is required for all reads and writes
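      To make the read path concrete, here is a toy in-memory model (not Google's code) of the two-step flow: one metadata lookup against the master, then data reads straight from the chunkservers.

          # Hypothetical in-memory stand-ins for the master's metadata and the
          # chunkservers' stored chunks, just to show the two-step read flow.
          # Chunks are a few bytes here for readability; GFS uses 64MB.
          master_metadata = {  # (filename, chunk index) -> replica locations
              ("bigfile", 0): ["cs1"],
              ("bigfile", 1): ["cs2", "cs1"],
          }
          chunkservers = {
              "cs1": {("bigfile", 0): b"abcd", ("bigfile", 1): b"efgh"},
              "cs2": {("bigfile", 1): b"efgh"},
          }

          def read(filename: str) -> bytes:
              data, index = b"", 0
              while (filename, index) in master_metadata:
                  replicas = master_metadata[(filename, index)]         # ask the master
                  data += chunkservers[replicas[0]][(filename, index)]  # read a replica
                  index += 1
              return data

          assert read("bigfile") == b"abcdefgh"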
    • MogileFS – OMG Files
      • Developed by Danga / SixApart
      • Open source
      • Designed for scalable web app storage
    • MogileFS – OMG Files
      • Single metadata store (MySQL)
        • MySQL Cluster avoids SPF
      • Multiple ‘tracker’ nodes locate files
      • Multiple ‘storage’ nodes store files
    • MogileFS – OMG Files
      [Diagram: clients querying tracker nodes backed by MySQL]
    • MogileFS – OMG Files
      • Replication of file ‘classes’ happens transparently
      • Storage nodes are not mirrored – replication is piecemeal
      • Reading and writing go through trackers, but are performed directly upon storage nodes
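      A sketch of the tracker-mediated read described above, with `tracker_get_paths` as a hypothetical stand-in for the tracker RPC: trackers return HTTP paths on storage nodes, and the client fetches directly from one of them.

          import urllib.request

          def mogile_fetch(tracker_get_paths, domain: str, key: str) -> bytes:
              """Ask a tracker where a key's replicas live, then read the
              file straight from a storage node, falling back to other
              replicas if one node is down."""
              paths = tracker_get_paths(domain, key)  # hypothetical tracker call
              for url in paths:
                  try:
                      with urllib.request.urlopen(url) as resp:
                          return resp.read()
                  except OSError:
                      continue  # storage node unreachable; try the next replica
              raise IOError(f"no readable replica for {key}")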
    • Flickr File System
      • Developed by Flickr
      • Proprietary
      • Designed for very large scalable web app storage
    • Flickr File System
      • No metadata store
        • Deal with it yourself
      • Multiple ‘StorageMaster’ nodes
      • Multiple storage nodes with virtual volumes
    • Flickr File System
      [Diagram: multiple StorageMaster (SM) nodes in front of the storage nodes]
    • Flickr File System
      • Metadata stored by app
        • Just a virtual volume number
        • App chooses a path
      • Virtual nodes are mirrored
        • Locally and remotely
      • Reading is done directly from nodes
    • Flickr File System
      • StorageMaster nodes only used for write operations
      • Reading and writing can scale separately
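      A sketch, with hypothetical names, of what app-side metadata buys: the database row holds just a virtual volume number and a path, and reads are plain HTTP straight to a storage node, bypassing the StorageMasters entirely.

          # Hypothetical app-side metadata: the application records only a
          # virtual volume number and a path for each file.
          photo_rows = {1234: {"volume": 7, "path": "/12/34/1234.jpg"}}

          # Volume-to-host map; mirrored pairs elided for brevity.
          volume_hosts = {7: "storage7.example.com"}

          def read_url(photo_id: int) -> str:
              """Reads go straight to a storage node, not via a StorageMaster."""
              row = photo_rows[photo_id]
              return f"http://{volume_hosts[row['volume']]}{row['path']}"

          print(read_url(1234))  # http://storage7.example.com/12/34/1234.jpg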
    • Amazon S3
      • A big disk in the sky
      • Multiple ‘buckets’
      • Files have user-defined keys
      • Data + metadata
    • Amazon S3
      [Diagrams: application servers storing into Amazon-operated S3; end users can then be served directly from S3]
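      For flavor, here is the bucket/key/metadata model through boto3, the current AWS SDK for Python (which postdates this talk); the bucket and key names are made up.

          import boto3  # AWS SDK for Python

          s3 = boto3.client("s3")

          # Files are addressed by a user-defined key within a bucket, and
          # carry user-defined metadata alongside the data itself.
          s3.put_object(
              Bucket="my-photo-bucket",    # hypothetical bucket name
              Key="photos/1234.jpg",       # user-defined key
              Body=b"...jpeg bytes...",
              Metadata={"owner": "cal"},   # stored and returned with the object
          )

          obj = s3.get_object(Bucket="my-photo-bucket", Key="photos/1234.jpg")
          print(obj["Metadata"]["owner"], len(obj["Body"].read()))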
    • The cost
      • Fixed price, by the GB
      • Store: $0.15 per GB per month
      • Serve: $0.20 per GB
    • The cost
      [Charts: S3 cost compared with regular bandwidth cost]
    • End costs
      • ~$2k to store 1TB for a year
      • ~$63 a month to serve 1Mbps
      • ~$65k a month to serve 1Gbps
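      These end costs follow directly from the per-GB prices; a quick check of the arithmetic:

          store_per_gb_month = 0.15   # $ per GB stored per month (2007 pricing)
          serve_per_gb = 0.20         # $ per GB transferred out

          # Storing 1TB (1024GB) for a year:
          print(1024 * store_per_gb_month * 12)   # ≈ $1,843, i.e. ~$2k

          # Serving a sustained 1Mbps for a 30-day month:
          gb = 1e6 / 8 * 86400 * 30 / 2**30       # ≈ 302 GB transferred
          print(gb * serve_per_gb)                # ≈ $60; ~1000x that for 1Gbps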
    • Serving
    • Serving files
      • Serving files is easy!
      [Diagram: a single Apache server reading from disk]
    • Serving files
      • Scaling is harder
      [Diagram: several Apache servers, each with its own disk]
    • Serving files
      • This doesn’t scale well
      • Primary storage is expensive
        • And takes a lot of space
      • In many systems, we only access a small number of files most of the time
    • Caching
      • Insert caches between the storage and serving nodes
      • Cache frequently accessed content to reduce reads on the storage nodes
      • Software (Squid, mod_cache)
      • Hardware (NetCache, CacheFlow)
    • Why it works
      • Keep a smaller working set
      • Use faster hardware
        • Lots of RAM
        • SCSI
        • Outer edge of disks (ZCAV)
      • Use more duplicates
        • Cheaper, since they’re smaller
    • Two models
      • Layer 4
        • ‘Simple’ balanced cache
        • Objects in multiple caches
        • Good for few objects requested many times
      • Layer 7
        • URL-balanced cache
        • Objects in a single cache
        • Good for many objects requested a few times
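      A sketch of the layer-7 model: hash the URL to pick a cache, so each object lives in exactly one cache and the combined working set is larger. Plain modulo hashing (used here for brevity) reshuffles most objects when the cache count changes; real deployments use CARP-style or consistent hashing. Names are hypothetical.

          import hashlib

          caches = ["cache1", "cache2", "cache3"]

          def layer7_pick(url: str) -> str:
              """Layer-7 balancing: hash the URL so each object lives in
              exactly one cache, trading duplicate copies of hot objects
              for a larger combined working set."""
              h = int(hashlib.md5(url.encode()).hexdigest(), 16)
              return caches[h % len(caches)]

          # The same URL always lands on the same cache node:
          print(layer7_pick("/photos/1234.jpg"))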
    • Replacement policies
      • LRU – Least recently used
      • GDSF – Greedy dual size frequency
      • LFUDA – Least frequently used with dynamic aging
      • All have advantages and disadvantages
      • Performance varies greatly with each
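      As a reference point, LRU is simple enough to sketch in a few lines (GDSF and LFUDA additionally weigh object size and access frequency):

          from collections import OrderedDict

          class LRUCache:
              """Minimal LRU replacement: on overflow, evict the entry that
              was used least recently."""
              def __init__(self, capacity: int):
                  self.capacity, self.items = capacity, OrderedDict()

              def get(self, key):
                  if key in self.items:
                      self.items.move_to_end(key)   # mark most recently used
                      return self.items[key]
                  return None                       # cache miss

              def put(self, key, value):
                  self.items[key] = value
                  self.items.move_to_end(key)
                  if len(self.items) > self.capacity:
                      self.items.popitem(last=False)  # evict least recently used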
    • Cache Churn
      • How long do objects typically stay in cache?
      • If it gets too short, we’re doing badly
        • But it depends on your traffic profile
      • Make the cached object store larger
    • Problems
      • Caching has some problems:
        • Invalidation is hard
        • Replacement is dumb (even LFUDA)
      • Avoiding caching makes your life (somewhat) easier
    • CDN – Content Delivery Network
      • Akamai, Savvis, Mirror Image Internet, etc
      • Caches operated by other people
        • Already in-place
        • In lots of places
      • GSLB/DNS balancing
    • Edge networks
      [Diagrams: a single origin server, then the same origin fronted by many edge caches]
    • CDN Models
      • Simple model
        • You push content to them, they serve it
      • Reverse proxy model
        • You publish content on an origin, they proxy and cache it
    • CDN Invalidation
      • You don’t control the caches
        • Just like those awful ISP ones
      • Once something is cached by a CDN, assume it can never change
        • Nothing can be deleted
        • Nothing can be modified
    • Versioning
      • When you start to cache things, you need to care about versioning
        • Invalidation & Expiry
        • Naming & Sync
    • Cache Invalidation
      • If you control the caches, invalidation is possible
      • But remember ISP and client caches
      • Remove deleted content explicitly
        • Avoid users finding old content
        • Save cache space
    • Cache versioning
      • Simple rule of thumb:
        • If an item is modified, change its name (URL)
      • This can be independent of the file system!
    • Virtual versioning
      • Database indicates version 3 of file
      • Web app writes version number into URL
      • Request comes through cache and is cached with the versioned URL
      • mod_rewrite converts versioned URL to path
      Example flow (see the sketch below): the database says version 3 → the app emits example.com/foo_3.jpg → the cache stores foo_3.jpg → mod_rewrite maps foo_3.jpg -> foo.jpg on disk
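      The same flow as a sketch (names hypothetical); the rewrite function plays the role of the mod_rewrite rule:

          import re

          def versioned_url(name: str, version: int) -> str:
              """The web app embeds the current version from the database in
              the URL, so every modification produces a new cache key."""
              return f"http://example.com/{name}_{version}.jpg"

          def rewrite(path: str) -> str:
              """Server-side equivalent of the mod_rewrite rule: strip the
              version so all versions map to the same file on disk."""
              return re.sub(r"_(\d+)\.jpg$", ".jpg", path)

          print(versioned_url("foo", 3))   # http://example.com/foo_3.jpg
          print(rewrite("/foo_3.jpg"))     # /foo.jpg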
    • Authentication
      • Authentication inline layer
        • Apache / perlbal
      • Authentication sideline
        • ICP (CARP/HTCP)
      • Authentication by URL
        • FlickrFS
    • Auth layer
      • Authenticator sits between client and storage
      • Typically built into the cache software
      [Diagram: requests pass through the cache's built-in authenticator on the way to the origin]
    • Auth sideline
      • Authenticator sits beside the cache
      • Lightweight protocol used for authenticator
      [Diagram: the cache consults a sideline authenticator before fetching from the origin]
    • Auth by URL
      • Someone else performs authentication and gives URLs to client (typically the web app)
      • URLs hold the ‘keys’ for accessing files
      [Diagram: the web server hands the client a keyed URL; the cache serves it from the origin without further auth]
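      A sketch of one common way (assumed here, not necessarily FlickrFS's exact scheme) to put the 'keys' in the URL: an expiry time plus an HMAC that the file server can verify without any user database.

          import hashlib, hmac, time

          SECRET = b"shared-secret"  # known to the web app and the file servers

          def signed_url(path: str, ttl: int = 300) -> str:
              """The web app authenticates the user, then hands out a URL
              carrying an expiry and an HMAC over path+expiry."""
              expires = int(time.time()) + ttl
              sig = hmac.new(SECRET, f"{path}{expires}".encode(),
                             hashlib.sha256).hexdigest()
              return f"http://files.example.com{path}?expires={expires}&sig={sig}"

          def verify(path: str, expires: int, sig: str) -> bool:
              """The file server checks the signature statelessly."""
              expected = hmac.new(SECRET, f"{path}{expires}".encode(),
                                  hashlib.sha256).hexdigest()
              return hmac.compare_digest(sig, expected) and time.time() < expires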
    • BCP
    • Business Continuity Planning
      • How can I deal with the unexpected?
        • The core of BCP
      • Redundancy
      • Replication
    • Reality
      • On a long enough timescale, anything that can fail will fail
      • Of course, everything can fail
      • True reliability comes only through redundancy
    • Reality
      • Define your own SLAs
      • How long can you afford to be down?
      • How manual is the recovery process?
      • How far can you roll back?
      • How many boxes can fail at once?
    • Failure scenarios
      • Disk failure
      • Storage array failure
      • Storage head failure
      • Fabric failure
      • Metadata node failure
      • Power outage
      • Routing outage
    • Reliable by design
      • RAID avoids disk failures, but not head or fabric failures
      • Duplicated nodes avoid host and fabric failures, but not routing or power failures
      • Dual-colo avoids routing and power failures, but may need duplication too
    • Tend to all points in the stack
      • Going dual-colo: great
      • Taking a whole colo offline because of a single failed disk: bad
      • We need a combination of these
    • Recovery times
      • BCP is not just about continuing when things fail
      • How can we restore after they come back?
      • Host and colo level syncing
        • replication queuing
      • Host and colo level rebuilding
    • Reliable Reads & Writes
      • Reliable reads are easy
        • 2 or more copies of files
      • Reliable writes are harder
        • Write 2 copies at once
        • But what do we do when we can’t write to one?
    • Dual writes
      • Queue up data to be written
        • Where?
        • The queue itself needs to be reliable
      • Queue up journal of changes
        • And then read data from the disk whose write succeeded
      • Duplicate whole volume after failure
        • Slow!
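      A sketch of the journaled dual-write described above; `nodes` is a list of hypothetical objects exposing write(), and the local journal file stands in for something genuinely reliable.

          import json, time

          def dual_write(nodes, key, data, journal_path="journal.log"):
              """Try to write every copy; journal any failed replica so it
              can be rebuilt later from a copy that succeeded."""
              failed = []
              for node in nodes:
                  try:
                      node.write(key, data)   # hypothetical storage-node client
                  except OSError:
                      failed.append(node.name)
              if len(failed) == len(nodes):
                  raise IOError("no copy written; fail the whole write")
              for name in failed:
                  # The journal itself must be reliable, which is the hard part.
                  with open(journal_path, "a") as j:
                      j.write(json.dumps({"ts": time.time(),
                                          "node": name, "key": key}) + "\n")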
    • Cost
    • Judging cost
      • Per GB?
      • Per GB upfront and per year
      • Not as simple as you’d hope
        • How about an example?
    • Hardware costs: cost of hardware ÷ usable GB (single cost)
    • Power costs: cost of power per year ÷ usable GB (recurring cost)
    • Power costs: power installation cost ÷ usable GB (single cost)
    • Space costs: (cost per U × U's needed, inc. network) ÷ usable GB (recurring cost)
    • Network costs: cost of network gear ÷ usable GB (single cost)
    • Misc costs: (support contracts + spare disks + bus adaptors + cables) ÷ usable GB (single & recurring costs)
    • Human costs: (admin cost per node × node count) ÷ usable GB (recurring cost)
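      Rolled together, the formulas above amount to a single per-usable-GB figure; a sketch with made-up numbers, assuming a fixed system lifetime:

          def cost_per_usable_gb(usable_gb,
                                 hardware, power_install, network, misc_single,
                                 power_per_year, space_per_year, misc_per_year,
                                 admin_per_node_year, node_count,
                                 years=3.0):
              """Single (upfront) costs plus recurring costs over the
              system's lifetime, divided by usable capacity."""
              single = hardware + power_install + network + misc_single
              recurring = (power_per_year + space_per_year + misc_per_year
                           + admin_per_node_year * node_count) * years
              return (single + recurring) / usable_gb

          # Hypothetical numbers, purely to show the shape of the calculation:
          print(cost_per_usable_gb(10_000,
                                   hardware=50_000, power_install=2_000,
                                   network=5_000, misc_single=4_000,
                                   power_per_year=3_000, space_per_year=6_000,
                                   misc_per_year=2_000,
                                   admin_per_node_year=1_000, node_count=4))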
    • TCO
      • Total cost of ownership in two parts
        • Upfront
        • Ongoing
      • Architecture plays a huge part in costing
        • Don’t get tied to hardware
        • Allow heterogeneity
        • Move with the market
    • (fin)
    • Photo credits
      • flickr.com/photos/ebright/260823954/
      • flickr.com/photos/thomashawk/243477905/
      • flickr.com/photos/tom-carden/116315962/
      • flickr.com/photos/sillydog/287354869/
      • flickr.com/photos/foreversouls/131972916/
      • flickr.com/photos/julianb/324897/
      • flickr.com/photos/primejunta/140957047/
      • flickr.com/photos/whatknot/28973703/
      • flickr.com/photos/dcjohn/85504455/
      • You can find these slides online:
      • iamcal.com/talks/