Web20expo Filesystems
Cal Henderson's talk. Awesome
http://www.iamcal.com/talks/

Presentation Transcript

  • Beyond the File System: Designing Large Scale File Storage and Serving (Cal Henderson)
  • Hello!
  • Big file systems?
    • Too vague!
    • What is a file system?
    • What constitutes big?
    • Some requirements would be nice
    • 1. Scalable
      • Looking at storage and serving infrastructures
    • 2. Reliable
      • Looking at redundancy, failure rates, on-the-fly changes
    • 3. Cheap
      • Looking at upfront costs, TCO and lifetimes
  • Four buckets: Storage, Serving, BCP, Cost
  • Storage
  • The storage stack
    • File protocol: NFS, CIFS, SMB
    • File system: ext, reiserFS, NTFS
    • Block protocol: SCSI, SATA, FC
    • RAID: mirrors, stripes
    • Hardware: disks and stuff
  • Hardware overview
    • The storage scale
      • Internal → DAS → SAN → NAS (lower to higher)
  • Internal storage
    • A disk in a computer
      • SCSI, IDE, SATA
    • 4 disks in 1U is common
    • 8 for half depth boxes
  • DAS
    • Direct Attached Storage
    • Disk shelf, connected by SCSI/SATA
    • e.g. HP MSA30: 14 disks in 3U
  • SAN
    • Storage Area Network
    • Dumb disk shelves
    • Clients connect via a ‘fabric’
    • Fibre Channel, iSCSI, Infiniband
      • Low level protocols
  • NAS
    • Network Attached Storage
    • Intelligent disk shelf
    • Clients connect via a network
    • NFS, SMB, CIFS
      • High level protocols
    • Of course, it’s more confusing than that
  • Meet the LUN
    • Logical Unit Number
    • A slice of storage space
    • Originally for addressing a single drive:
      • c1t2d3
      • Controller, Target, Disk (Slice)
    • Now means a virtual partition/volume
      • LVM, Logical Volume Management
  • NAS vs SAN
    • With a SAN, a single host (initiator) owns a single LUN/volume
    • With NAS, multiple hosts own a single LUN/volume
    • NAS head – NAS access to a SAN
  • SAN Advantages
    • Virtualization within a SAN offers some nice features:
    • Real-time LUN replication
    • Transparent backup
    • SAN booting for host replacement
  • Some Practical Examples
    • There are a lot of vendors
    • Configurations vary
    • Prices vary wildly
    • Let’s look at a couple
      • Ones I happen to have experience with
      • Not an endorsement ;)
  • NetApp Filers
    • Heads and shelves, up to 500TB in 6 cabs
    • FC SAN with 1 or 2 NAS heads
  • Isilon IQ
    • 2U Nodes, 3-96 nodes/cluster, 6-600 TB
    • FC/InfiniBand SAN with NAS head on each node
  • Scaling
    • Vertical vs Horizontal
  • Vertical scaling
    • Get a bigger box
    • Bigger disk(s)
    • More disks
    • Limited by current tech – size of each disk and total number in appliance
  • Horizontal scaling
    • Buy more boxes
    • Add more servers/appliances
    • Scales forever*
    • *sort of
  • Storage scaling approaches
    • Four common models:
    • Huge FS
    • Physical nodes
    • Virtual nodes
    • Chunked space
  • Huge FS
    • Create one giant volume with growing space
      • Sun’s ZFS
      • Isilon IQ
    • Expandable on-the-fly?
    • Upper limits
      • Always limited somewhere
  • Huge FS
    • Pluses
      • Simple from the application side
      • Logically simple
      • Low administrative overhead
    • Minuses
      • All your eggs in one basket
      • Hard to expand
      • Has an upper limit
  • Physical nodes
    • Application handles distribution to multiple physical nodes
      • Disks, Boxes, Appliances, whatever
    • One ‘volume’ per node
    • Each node acts by itself
    • Expandable on-the-fly – add more nodes
    • Scales forever
  • Physical Nodes
    • Pluses
      • Limitless expansion
      • Easy to expand
      • Unlikely to all fail at once
    • Minuses
      • Many ‘mounts’ to manage
      • More administration
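
A minimal sketch of the physical-node approach described above, assuming a hypothetical list of mount points: the application hashes a file ID straight to one node, which is simple but means existing files move (or must be pinned) whenever the node list grows.

```python
import hashlib

# Hypothetical mount points, one per physical node/appliance.
NODES = ["/mnt/store01", "/mnt/store02", "/mnt/store03"]

def node_for(file_id: str) -> str:
    """Deterministically map a file ID to one physical node."""
    digest = hashlib.md5(file_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("photo_123456.jpg"))  # e.g. /mnt/store02
```
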
  • Virtual nodes
    • Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
    • Multiple volumes per node
    • Flexible
    • Expandable on-the-fly – add more nodes
    • Scales forever
  • Virtual Nodes
    • Pluses
      • Limitless expansion
      • Easy to expand
      • Unlikely to all fail at once
      • Addressing is logical, not physical
      • Flexible volume sizing, consolidation
    • Minuses
      • Many ‘mounts’ to manage
      • More administration
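
A sketch of the virtual-node variant, with a hypothetical volume table: files hash to one of many virtual volumes, and a separate, movable mapping records which physical node currently holds each volume, so volumes can be shuffled or consolidated without changing any file's address.

```python
import hashlib

VIRTUAL_VOLUMES = 1024  # many more volumes than physical nodes

# Hypothetical volume -> physical node assignment; in practice this
# lives in a database and is updated when nodes are added or rebalanced.
volume_to_node = {v: f"storage{(v % 4) + 1:02d}" for v in range(VIRTUAL_VOLUMES)}

def locate(file_id: str) -> tuple[int, str]:
    """Return (virtual volume, current physical node) for a file."""
    vol = int(hashlib.md5(file_id.encode()).hexdigest(), 16) % VIRTUAL_VOLUMES
    return vol, volume_to_node[vol]

print(locate("photo_123456.jpg"))  # e.g. (517, 'storage02')
```
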
  • Chunked space
    • Storage layer writes parts of files to different physical nodes
    • A higher-level RAID striping
    • High performance for large files
      • read multiple parts simultaneously
  • Chunked space
    • Pluses
      • High performance
      • Limitless size
    • Minuses
      • Conceptually complex
      • Can be hard to expand on the fly
      • Can’t manually poke it
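
A conceptual sketch of chunked storage, assuming hypothetical chunk servers and a 64 MB chunk size (the figure GFS uses later in the talk): the storage layer splits a file into chunks and spreads them across nodes, so a reader can fetch several chunks in parallel.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS
CHUNK_SERVERS = ["cs01", "cs02", "cs03", "cs04"]  # hypothetical

def plan_chunks(file_size: int) -> list[tuple[int, str]]:
    """Assign each chunk index to a chunk server, round-robin."""
    n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    return [(i, CHUNK_SERVERS[i % len(CHUNK_SERVERS)]) for i in range(n_chunks)]

# A 300 MB file becomes 5 chunks spread over 4 servers,
# so a reader can pull several chunks at once.
print(plan_chunks(300 * 1024 * 1024))
```
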
  • Real Life Case Studies
  • GFS – Google File System
    • Developed by … Google
    • Proprietary
    • Everything we know about it is based on talks they’ve given
    • Designed to store huge files for fast access
  • GFS – Google File System
    • Single ‘Master’ node holds metadata
      • SPF – Shadow master allows warm swap
    • Grid of ‘chunkservers’
      • 64bit filenames
      • 64 MB file chunks
  • GFS – Google File System (diagram: master node plus chunkservers holding replicated chunks)
  • GFS – Google File System
    • Client reads metadata from master then file parts from multiple chunkservers
    • Designed for big files (>100MB)
    • Master server allocates access leases
    • Replication is automatic and self repairing
      • Synchronously for atomicity
  • GFS – Google File System
    • Reading is fast (parallelizable)
      • But requires a lease
    • Master server is required for all reads and writes
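
A rough sketch of the read path the slides describe, not Google's actual (proprietary) API: the client asks a stand-in master for chunk locations and a lease, then reads the chunk data directly from the chunkservers.

```python
# Conceptual only: these classes stand in for GFS components.
class Master:
    def __init__(self, chunk_map):
        self.chunk_map = chunk_map  # filename -> [(chunk_handle, [chunkservers])]

    def lookup(self, filename):
        """Return chunk locations plus a read lease (metadata only)."""
        return self.chunk_map[filename], {"lease": "granted"}

def read_file(master, chunkservers, filename):
    chunks, _lease = master.lookup(filename)        # metadata from the master
    parts = []
    for handle, replicas in chunks:                 # data from chunkservers
        parts.append(chunkservers[replicas[0]][handle])
    return b"".join(parts)

master = Master({"bigfile": [("c1", ["cs01", "cs02"]), ("c2", ["cs02"])]})
chunkservers = {"cs01": {"c1": b"part1-"}, "cs02": {"c1": b"part1-", "c2": b"part2"}}
print(read_file(master, chunkservers, "bigfile"))   # b"part1-part2"
```
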
  • MogileFS – OMG Files
    • Developed by Danga / SixApart
    • Open source
    • Designed for scalable web app storage
  • MogileFS – OMG Files
    • Single metadata store (MySQL)
      • MySQL Cluster avoids SPF
    • Multiple ‘tracker’ nodes locate files
    • Multiple ‘storage’ nodes store files
  • MogileFS – OMG Files (diagram: tracker nodes backed by a MySQL metadata store)
  • MogileFS – OMG Files
    • Replication of file ‘classes’ happens transparently
    • Storage nodes are not mirrored – replication is piecemeal
    • Reading and writing go through trackers, but are performed directly upon storage nodes
  • Flickr File System
    • Developed by Flickr
    • Proprietary
    • Designed for very large scalable web app storage
  • Flickr File System
    • No metadata store
      • Deal with it yourself
    • Multiple ‘StorageMaster’ nodes
    • Multiple storage nodes with virtual volumes
  • Flickr File System (diagram: multiple StorageMaster nodes)
  • Flickr File System
    • Metadata stored by app
      • Just a virtual volume number
      • App chooses a path
    • Virtual nodes are mirrored
      • Locally and remotely
    • Reading is done directly from nodes
  • Flickr File System
    • StorageMaster nodes only used for write operations
    • Reading and writing can scale separately
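
A sketch of the "metadata stored by app" idea, with hypothetical naming conventions: the application keeps only a virtual volume number alongside each photo record and derives the storage path and serving URL itself, so reads never touch the StorageMasters.

```python
def storage_path(volume: int, photo_id: int, size: str = "o") -> str:
    """App-chosen path on whichever node hosts this virtual volume."""
    return f"/vol{volume:04d}/{photo_id % 1000:03d}/{photo_id}_{size}.jpg"

def serving_url(volume: int, photo_id: int, size: str = "o") -> str:
    # Hypothetical hostname scheme; reads go straight to the storage node.
    return f"http://farm{volume % 8}.example.com{storage_path(volume, photo_id, size)}"

print(serving_url(1234, 98765))
```
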
  • Amazon S3
    • A big disk in the sky
    • Multiple ‘buckets’
    • Files have user-defined keys
    • Data + metadata
  • Amazon S3 (diagrams: application servers writing into Amazon; end users reading back out)
  • The cost
    • Fixed price, by the GB
    • Store: $0.15 per GB per month
    • Serve: $0.20 per GB
  • The cost (charts: S3 bandwidth pricing vs regular bandwidth)
  • End costs
    • ~$2k to store 1TB for a year
    • ~$63 a month for 1Mbps
    • ~$65k a month for 1Gbps
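
A quick back-of-the-envelope check of those figures, using the per-GB prices from the previous slide ($0.15/GB-month stored, $0.20/GB served):

```python
STORE_PER_GB_MONTH = 0.15   # $ per GB per month
SERVE_PER_GB = 0.20         # $ per GB transferred

# Storing 1 TB for a year
print(1000 * STORE_PER_GB_MONTH * 12)          # 1800.0 -> roughly $2k/year

# Serving a sustained 1 Mbps: ~0.125 MB/s over a ~2.6M-second month
gb_per_month_at_1mbps = 0.125 * 2_592_000 / 1000
print(gb_per_month_at_1mbps * SERVE_PER_GB)    # ~64.8 -> the slide's ~$63/month
# 1 Gbps is 1000x that                          -> ~$65k/month
```
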
  • Serving
  • Serving files
    • Serving files is easy!
    (diagram: Apache serving from a single disk)
  • Serving files
    • Scaling is harder
    (diagram: several Apache + disk pairs side by side)
  • Serving files
    • This doesn’t scale well
    • Primary storage is expensive
      • And takes a lot of space
    • In many systems, we only access a small number of files most of the time
  • Caching
    • Insert caches between the storage and serving nodes
    • Cache frequently accessed content to reduce reads on the storage nodes
    • Software (Squid, mod_cache)
    • Hardware (Netcache, Cacheflow)
  • Why it works
    • Keep a smaller working set
    • Use faster hardware
      • Lots of RAM
      • SCSI
      • Outer edge of disks (ZCAV)
    • Use more duplicates
      • Cheaper, since they’re smaller
  • Two models
    • Layer 4
      • ‘Simple’ balanced cache
      • Objects in multiple caches
      • Good for few objects requested many times
    • Layer 7
      • URL-balanced cache
      • Objects in a single cache
      • Good for many objects requested a few times
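
A sketch of the difference, assuming a hypothetical pool of cache hosts: a layer-4 balancer can send a request to any cache (so hot objects end up duplicated in all of them), while a layer-7 balancer hashes the URL so each object lives in exactly one cache.

```python
import hashlib
import random

CACHES = ["cache01", "cache02", "cache03", "cache04"]  # hypothetical pool

def pick_cache_l4(url: str) -> str:
    """Layer 4: any cache will do; the object may be duplicated everywhere."""
    return random.choice(CACHES)

def pick_cache_l7(url: str) -> str:
    """Layer 7: the URL determines the cache, so each object is cached once."""
    h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    return CACHES[h % len(CACHES)]
```
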
  • Replacement policies
    • LRU – Least recently used
    • GDSF – Greedy dual size frequency
    • LFUDA – Least frequently used with dynamic aging
    • All have advantages and disadvantages
    • Performance varies greatly with each
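
As a concrete reference for the simplest of those policies, here is a minimal LRU sketch (just the idea, not what Squid actually ships): the least recently used object is evicted whenever the store goes over capacity.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used
```
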
  • Cache Churn
    • How long do objects typically stay in cache?
    • If it gets too short, we’re doing badly
      • But it depends on your traffic profile
    • Make the cached object store larger
  • Problems
    • Caching has some problems:
      • Invalidation is hard
      • Replacement is dumb (even LFUDA)
    • Avoiding caching makes your life (somewhat) easier
  • CDN – Content Delivery Network
    • Akamai, Savvis, Mirror Image Internet, etc
    • Caches operated by other people
      • Already in-place
      • In lots of places
    • GSLB/DNS balancing
  • Edge networks (diagrams: an origin alone, then the origin fronted by many edge caches)
  • CDN Models
    • Simple model
      • You push content to them, they serve it
    • Reverse proxy model
      • You publish content on an origin, they proxy and cache it
  • CDN Invalidation
    • You don’t control the caches
      • Just like those awful ISP ones
    • Once something is cached by a CDN, assume it can never change
      • Nothing can be deleted
      • Nothing can be modified
  • Versioning
    • When you start to cache things, you need to care about versioning
      • Invalidation & Expiry
      • Naming & Sync
  • Cache Invalidation
    • If you control the caches, invalidation is possible
    • But remember ISP and client caches
    • Remove deleted content explicitly
      • Avoid users finding old content
      • Save cache space
  • Cache versioning
    • Simple rule of thumb:
      • If an item is modified, change its name (URL)
    • This can be independent of the file system!
  • Virtual versioning
    • Database indicates version 3 of file
    • Web app writes version number into URL
    • Request comes through cache and is cached with the versioned URL
    • mod_rewrite converts versioned URL to path
    • Example: the database says version 3, the URL becomes example.com/foo_3.jpg, the cache stores foo_3.jpg, and mod_rewrite maps foo_3.jpg -> foo.jpg
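
A sketch of that flow with hypothetical names: the app writes the current version number into the URL, and a rewrite step on the origin strips it back off, so the cache key changes whenever the file does.

```python
def public_url(filename: str, version: int) -> str:
    """URL the web app emits; caches key on the versioned name."""
    stem, ext = filename.rsplit(".", 1)
    return f"http://example.com/{stem}_{version}.{ext}"

def rewrite_to_path(request_path: str) -> str:
    """What mod_rewrite does on the origin: foo_3.jpg -> foo.jpg on disk."""
    stem, ext = request_path.rsplit(".", 1)
    base, _, _version = stem.rpartition("_")
    return f"{base}.{ext}"

print(public_url("foo.jpg", 3))          # http://example.com/foo_3.jpg
print(rewrite_to_path("/foo_3.jpg"))     # /foo.jpg
```
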
  • Authentication
    • Authentication inline layer
      • Apache / perlbal
    • Authentication sideline
      • ICP (CARP/HTCP)
    • Authentication by URL
      • FlickrFS
  • Auth layer
    • Authenticator sits between client and storage
    • Typically built into the cache software
    (diagram: cache with an inline authenticator in front of the origin)
  • Auth sideline
    • Authenticator sits beside the cache
    • Lightweight protocol used for authenticator
    (diagram: cache consulting a sideline authenticator, then fetching from the origin)
  • Auth by URL
    • Someone else performs authentication and gives URLs to client (typically the web app)
    • URLs hold the ‘keys’ for accessing files
    (diagram: web server hands out URLs; the client fetches through the cache from the origin)
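
One common way to put the "keys" into the URL (a hedged sketch, not necessarily how FlickrFS does it) is to sign the path plus an expiry time, so the cache and origin can verify a request without calling back to the web app.

```python
import hashlib
import hmac
import time

SECRET = b"shared-secret-between-app-and-origin"  # hypothetical

def signed_url(path: str, ttl: int = 3600) -> str:
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return f"http://cache.example.com{path}?expires={expires}&sig={sig}"

def is_valid(path: str, expires: int, sig: str) -> bool:
    if expires < time.time():
        return False
    expected = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

print(signed_url("/vol0042/123/456_o.jpg"))
```
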
  • BCP
  • Business Continuity Planning
    • How can I deal with the unexpected?
      • The core of BCP
    • Redundancy
    • Replication
  • Reality
    • On a long enough timescale, anything that can fail, will fail
    • Of course, everything can fail
    • True reliability comes only through redundancy
  • Reality
    • Define your own SLAs
    • How long can you afford to be down?
    • How manual is the recovery process?
    • How far can you roll back?
    • How many nodes can fail at once?
  • Failure scenarios
    • Disk failure
    • Storage array failure
    • Storage head failure
    • Fabric failure
    • Metadata node failure
    • Power outage
    • Routing outage
  • Reliable by design
    • RAID avoids disk failures, but not head or fabric failures
    • Duplicated nodes avoid host and fabric failures, but not routing or power failures
    • Dual-colo avoids routing and power failures, but may need duplication too
  • Tend to all points in the stack
    • Going dual-colo: great
    • Taking a whole colo offline because of a single failed disk: bad
    • We need a combination of these
  • Recovery times
    • BCP is not just about continuing when things fail
    • How can we restore after they come back?
    • Host and colo level syncing
      • replication queuing
    • Host and colo level rebuilding
  • Reliable Reads & Writes
    • Reliable reads are easy
      • 2 or more copies of files
    • Reliable writes are harder
      • Write 2 copies at once
      • But what do we do when we can’t write to one?
  • Dual writes
    • Queue up data to be written
      • Where?
      • Needs itself to be reliable
    • Queue up journal of changes
      • And then read data from the disk whose write succeeded
    • Duplicate whole volume after failure
      • Slow!
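
A sketch of the dual-write-plus-journal idea, using hypothetical node and queue interfaces: write both copies, and if one write fails, record the gap so it can be repaired later from the copy that succeeded.

```python
def dual_write(nodes, key, data, repair_queue):
    """Write to both nodes; journal any failure for later repair."""
    results = {}
    for node in nodes:
        try:
            node.write(key, data)          # hypothetical storage client
            results[node.name] = True
        except IOError:
            results[node.name] = False
            # The journal itself must live somewhere reliable.
            repair_queue.append({"key": key, "rebuild_on": node.name})
    if not any(results.values()):
        raise IOError("both writes failed for %s" % key)
    return results
```
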
  • Cost
  • Judging cost
    • Per GB?
    • Per GB upfront and per year
    • Not as simple as you’d hope
      • How about an example
  • Hardware costs: cost of hardware / usable GB (single cost)
  • Power costs: cost of power per year / usable GB (recurring cost)
  • Power costs: power installation cost / usable GB (single cost)
  • Space costs: (cost per U x U’s needed, inc. network) / usable GB (recurring cost)
  • Network costs: cost of network gear / usable GB (single cost)
  • Misc costs: (support contracts + spare disks + bus adaptors + cables) / usable GB (single & recurring costs)
  • Human costs: (admin cost per node x node count) / usable GB (recurring cost)
  • TCO
    • Total cost of ownership in two parts
      • Upfront
      • Ongoing
    • Architecture plays a huge part in costing
      • Don’t get tied to hardware
      • Allow heterogeneity
      • Move with the market
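
Putting the preceding cost slides together, a toy calculation of cost per usable GB, split into the upfront and ongoing parts (all figures below are made-up placeholders):

```python
USABLE_GB = 20_000  # usable capacity per node, made-up

upfront = {          # single costs, $
    "hardware": 15_000,
    "power_install": 1_000,
    "network_gear": 2_000,
    "spares_and_cables": 1_500,
}
ongoing = {          # recurring costs, $/year
    "power": 1_200,
    "rack_space": 2_400,
    "support": 1_800,
    "admin": 3_000,
}

print("upfront $/GB:   ", sum(upfront.values()) / USABLE_GB)
print("ongoing $/GB/yr:", sum(ongoing.values()) / USABLE_GB)
```
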
  • (fin)
  • Photo credits
    • flickr.com/photos/ebright/260823954/
    • flickr.com/photos/thomashawk/243477905/
    • flickr.com/photos/tom-carden/116315962/
    • flickr.com/photos/sillydog/287354869/
    • flickr.com/photos/foreversouls/131972916/
    • flickr.com/photos/julianb/324897/
    • flickr.com/photos/primejunta/140957047/
    • flickr.com/photos/whatknot/28973703/
    • flickr.com/photos/dcjohn/85504455/
    • You can find these slides online:
    • iamcal.com/talks/