Web20expo Filesystems

Beyond the File System: Designing Large Scale File Storage and Serving. Cal Henderson

Big file systems? Too vague! What is a file system? What constitutes big? Some requirements would be nice

1. Scalable: Looking at storage and serving infrastructures

2. Reliable: Looking at redundancy, failure rates, on the fly changes

3. Cheap: Looking at upfront costs, TCO and lifetimes

Four buckets Storage Serving BCP Cost

The storage stack
File protocol: NFS, CIFS, SMB
File system: ext, reiserFS, NTFS
Block protocol: SCSI, SATA, FC
RAID: Mirrors, Stripes
Hardware: Disks and stuff

Hardware overview The storage scale, from lower to higher: Internal, DAS, SAN, NAS

Internal storage A disk in a computer SCSI, IDE, SATA 4 disks in 1U is common 8 for half depth boxes

DAS Direct attached storage Disk shelf, connected by SCSI/SATA HP MSA30 – 14 disks in 3U

SAN Storage Area Network Dumb disk shelves Clients connect via a ‘fabric’ Fibre Channel, iSCSI, Infiniband Low level protocols

NAS Network Attached Storage Intelligent disk shelf Clients connect via a network NFS, SMB, CIFS High level protocols

Of course, it’s more confusing than that

Meet the LUN Logical Unit Number A slice of storage space Originally for addressing a single drive: c1t2d3 Controller, Target, Disk (Slice) Now means a virtual partition/volume LVM, Logical Volume Management

NAS vs SAN With a SAN, a single host (initiator) owns a single LUN/volume With NAS, multiple hosts own a single LUN/volume NAS head – NAS access to a SAN

SAN Advantages Virtualization within a SAN offers some nice features: Real-time LUN replication Transparent backup SAN booting for host replacement

Some Practical Examples There are a lot of vendors Configurations vary Prices vary wildly Let’s look at a couple Ones I happen to have experience with Not an endorsement ;)

NetApp Filers Heads and shelves, up to 500TB in 6 Cabs FC SAN with 1 or 2 NAS heads

Isilon IQ 2U Nodes, 3-96 nodes/cluster, 6-600 TB FC/InfiniBand SAN with NAS head on each node

Scaling Vertical vs Horizontal

Vertical scaling Get a bigger box Bigger disk(s) More disks Limited by current tech – size of each disk and total number in appliance

Horizontal scaling Buy more boxes Add more servers/appliances Scales forever* *sort of

Storage scaling approaches Four common models: Huge FS Physical nodes Virtual nodes Chunked space

Huge FS Create one giant volume with growing space Sun’s ZFS Isilon IQ Expandable on-the-fly? Upper limits Always limited somewhere

Huge FS Pluses Simple from the application side Logically simple Low administrative overhead Minuses All your eggs in one basket Hard to expand Has an upper limit

Physical nodes Application handles distribution to multiple physical nodes Disks, Boxes, Appliances, whatever One ‘volume’ per node Each node acts by itself Expandable on-the-fly – add more nodes Scales forever

Physical Nodes Pluses Limitless expansion Easy to expand Unlikely to all fail at once Minuses Many ‘mounts’ to manage More administration

Virtual nodes Application handles distribution to multiple virtual volumes, contained on multiple physical nodes Multiple volumes per node Flexible Expandable on-the-fly – add more nodes Scales forever

Virtual Nodes Pluses Limitless expansion Easy to expand Unlikely to all fail at once Addressing is logical, not physical Flexible volume sizing, consolidation Minuses Many ‘mounts’ to manage More administration
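As a rough illustration of "addressing is logical, not physical", the sketch below keeps a table from virtual volume number to the physical hosts that currently hold it. The host names, the `VOLUME_MAP` table and both helper functions are hypothetical and are not taken from any system described in these slides.

```python
from typing import Dict, List

# Hypothetical table: virtual volume number -> physical hosts holding a copy.
VOLUME_MAP: Dict[int, List[str]] = {
    1: ["storage1.example.com", "storage2.example.com"],
    2: ["storage2.example.com", "storage3.example.com"],
    3: ["storage3.example.com", "storage1.example.com"],
}


def hosts_for_volume(volume: int) -> List[str]:
    """Resolve a virtual volume to the physical hosts that hold it."""
    return list(VOLUME_MAP[volume])


def move_volume(volume: int, old_host: str, new_host: str) -> None:
    """Re-point one copy of a volume at a new host.

    Only the table changes; anything that stored the volume number
    keeps working without being rewritten.
    """
    hosts = VOLUME_MAP[volume]
    hosts[hosts.index(old_host)] = new_host
```

Because the application stores only the volume number, consolidating or resizing volumes becomes a change to the table rather than to every stored file reference.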

Chunked space Storage layer writes parts of files to different physical nodes A higher-level RAID striping High performance for large files read multiple parts simultaneously

Chunked space Pluses High performance Limitless size Minuses Conceptually complex Can be hard to expand on the fly Can’t manually poke it
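A minimal sketch of the chunked-space idea: a file is cut into fixed-size pieces and each piece is assigned to a node. Round-robin assignment and the function name `place_chunks` are assumptions made for illustration; real systems choose placement in their own ways.

```python
from typing import List, Tuple


def place_chunks(file_size: int, chunk_size: int,
                 nodes: List[str]) -> List[Tuple[int, int, str]]:
    """Return (offset, length, node) for each chunk of a file.

    Chunks are handed out to nodes in round-robin order, so a large
    file ends up spread across several nodes and its parts can be
    read in parallel.
    """
    placements = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        placements.append((offset, length, nodes[index % len(nodes)]))
        offset += length
        index += 1
    return placements
```

For example, `place_chunks(10, 4, ["a", "b"])` yields three chunks, the last one shorter than the others.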

GFS – Google File System Developed by … Google Proprietary Everything we know about it is based on talks they’ve given Designed to store huge files for fast access

GFS – Google File System Single ‘Master’ node holds metadata SPOF – Shadow master allows warm swap Grid of ‘chunkservers’ 64-bit chunk handles 64 MB file chunks

GFS – Google File System [Diagram: Master, with request steps labeled 1(a), 1(b), 2(a)]

GFS – Google File System Client reads metadata from master then file parts from multiple chunkservers Designed for big files (>100MB) Master server allocates access leases Replication is automatic and self repairing Synchronously for atomicity

GFS – Google File System Reading is fast (parallelizable) But requires a lease Master server is required for all reads and writes
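To make the fixed-size chunking concrete, the sketch below works out which chunks a byte range touches, assuming the 64 MB chunk size given in the slides. The function name is hypothetical and this is not Google's code; it only shows the arithmetic a client would need before asking for chunk locations.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in the slides


def chunks_for_range(offset: int, length: int) -> range:
    """Return the indexes of the chunks covering bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)
```

A read that spans several chunks can then fetch each one from a different chunkserver at the same time, which is where the parallel read speed comes from.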

MogileFS – OMG Files Developed by Danga / SixApart Open source Designed for scalable web app storage

MogileFS – OMG Files Single metadata store (MySQL) MySQL Cluster avoids SPOF Multiple ‘tracker’ nodes locate files Multiple ‘storage’ nodes store files

MogileFS – OMG Files [Diagram: two Trackers and MySQL]

MogileFS – OMG Files Replication of file ‘classes’ happens transparently Storage nodes are not mirrored – replication is piecemeal Reading and writing go through trackers, but are performed directly upon storage nodes

Flickr File System Developed by Flickr Proprietary Designed for very large scalable web app storage

Flickr File System No metadata store Deal with it yourself Multiple ‘StorageMaster’ nodes Multiple storage nodes with virtual volumes

Flickr File System Metadata stored by app Just a virtual volume number App chooses a path Virtual nodes are mirrored Locally and remotely Reading is done directly from nodes

Flickr File System StorageMaster nodes only used for write operations Reading and writing can scale separately

Amazon S3 A big disk in the sky Multiple ‘buckets’ Files have user-defined keys Data + metadata

Amazon S3 [Diagram: Servers, Amazon, Users]

The cost Fixed price, by the GB Store: $0.15 per GB per month Serve: $0.20 per GB

End costs ~$2k to store 1TB for a year ~$63 a month to serve a sustained 1Mb/s ~$65k a month to serve a sustained 1Gb/s
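The end costs follow from the per-GB prices on the previous slide. The sketch below reproduces them under three assumptions that the slides do not state: a 30-day month, binary units (1 GB = 1024 MB, 1 TB = 1024 GB), and bandwidth that is sustained for the whole month.

```python
STORE_PER_GB_MONTH = 0.15  # dollars, from the slide
SERVE_PER_GB = 0.20        # dollars, from the slide


def yearly_storage_cost(gigabytes: float) -> float:
    """Cost of keeping this much data stored for twelve months."""
    return gigabytes * STORE_PER_GB_MONTH * 12


def monthly_serving_cost(megabits_per_second: float, days: int = 30) -> float:
    """Cost of serving at a sustained rate for one month (assumed 30 days)."""
    seconds = days * 24 * 60 * 60
    megabytes = megabits_per_second * seconds / 8   # bits -> bytes
    gigabytes = megabytes / 1024                    # assumed binary units
    return gigabytes * SERVE_PER_GB


print(round(yearly_storage_cost(1024)))    # 1 TB stored for a year
print(round(monthly_serving_cost(1)))      # 1 Mb/s for a month
print(round(monthly_serving_cost(1024)))   # 1 Gb/s for a month
```

Under those assumptions the three figures come out at roughly $1.8k, $63 and $65k, in line with the rounded numbers on the slide.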

Serving files Serving files is easy! [Diagram: one Apache server in front of one Disk]

Serving files Scaling is harder [Diagram: three Apache and Disk pairs]

Serving files This doesn’t scale well Primary storage is expensive And takes a lot of space In many systems, we only access a small number of files most of the time

Caching Insert caches between the storage and serving nodes Cache frequently accessed content to reduce reads on the storage nodes Software (Squid, mod_cache) Hardware (Netcache, Cacheflow)

Why it works Keep a smaller working set Use faster hardware Lots of RAM SCSI Outer edge of disks (ZCAV) Use more duplicates Cheaper, since they’re smaller

Two models Layer 4 ‘Simple’ balanced cache Objects in multiple caches Good for few objects requested many times Layer 7 URL balanced cache Objects in a single cache Good for many objects requested a few times
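A minimal sketch of the layer 7 model: hashing the URL picks the cache, so each object lives in exactly one cache. The function name and the choice of MD5 with modulo are assumptions for illustration, not what any particular load balancer does.

```python
import hashlib
from typing import List


def cache_for_url(url: str, caches: List[str]) -> str:
    """Pick the single cache responsible for a URL.

    The same URL always maps to the same cache, so the total cache
    space holds more distinct objects than if every cache held
    its own copy of everything.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(caches)
    return caches[bucket]
```

One limitation of plain modulo hashing is that adding or removing a cache remaps most URLs, which empties much of the cached working set at once; consistent hashing schemes are a common way to reduce that.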

Replacement policies LRU – Least recently used GDSF – Greedy dual size frequency LFUDA – Least frequently used with dynamic aging All have advantages and disadvantages Performance varies greatly with each
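Of the three policies, LRU is the simplest to show. The class below is a small sketch that counts capacity in objects rather than bytes, which is a simplification; the class and method names are made up for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Evict the least recently used object once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a hit makes the object "recent" again
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry
```

GDSF and LFUDA differ in that they also weigh object size or request frequency when choosing what to evict, which is why their behavior depends more heavily on the traffic profile.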

Cache Churn How long do objects typically stay in cache? If it gets too short, we’re doing badly But it depends on your traffic profile Make the cached object store larger

Problems Caching has some problems: Invalidation is hard Replacement is dumb (even LFUDA) Avoiding caching makes your life (somewhat) easier

CDN – Content Delivery Network Akamai, Savvis, Mirror Image Internet, etc Caches operated by other people Already in-place In lots of places GSLB/DNS balancing

Edge networks [Diagram: one Origin surrounded by eight Caches]

CDN Models Simple model You push content to them, they serve it Reverse proxy model You publish content on an origin, they proxy and cache it

CDN Invalidation You don’t control the caches Just like those awful ISP ones Once something is cached by a CDN, assume it can never change Nothing can be deleted Nothing can be modified

Versioning When you start to cache things, you need to care about versioning Invalidation & Expiry Naming & Sync

Cache Invalidation If you control the caches, invalidation is possible But remember ISP and client caches Remove deleted content explicitly Avoid users finding old content Save cache space

Cache versioning Simple rule of thumb: If an item is modified, change its name (URL) This can be independent of the file system!

Virtual versioning Database indicates version 3 of file Web app writes version number into URL Request comes through cache and is cached with the versioned URL mod_rewrite converts versioned URL to path [Diagram: Version 3 -> example.com/foo_3.jpg -> Cached: foo_3.jpg -> rewrite foo_3.jpg to foo.jpg]
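The slides do the rewrite with mod_rewrite; the sketch below shows the same two steps in Python so the mapping is easy to follow. The function names and the `name_version.ext` pattern are assumptions based on the `foo_3.jpg` example.

```python
import re


def versioned_url(base: str, name: str, ext: str, version: int) -> str:
    """What the web app writes into the page: the version is part of the URL."""
    return f"{base}/{name}_{version}.{ext}"


# Matches "<name>_<digits>.<ext>" at the end of a path.
_VERSIONED = re.compile(r"^(?P<name>.+)_(?P<version>\d+)\.(?P<ext>[^./]+)$")


def storage_path(request_path: str) -> str:
    """What the origin does: strip the version to find the real file."""
    match = _VERSIONED.match(request_path)
    if not match:
        return request_path
    return f"{match['name']}.{match['ext']}"
```

Caches only ever see the versioned name, so a modified file shows up under a new URL and the stale copy is simply never requested again.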

Authentication Authentication inline layer Apache / perlbal Authentication sideline ICP (CARP/HTCP) Authentication by URL FlickrFS

Auth layer Authenticator sits between client and storage Typically built into the cache software [Diagram: Cache, Authenticator, Origin]

Auth sideline Authenticator sits beside the cache Lightweight protocol used for authenticator [Diagram: Cache, Authenticator, Origin]

Auth by URL Someone else performs authentication and gives URLs to client (typically the web app) URLs hold the ‘keys’ for accessing files [Diagram: Web Server, Cache, Origin]
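One common way to make a URL hold its own key is to sign it with a secret shared between the web app and the origin. The sketch below assumes that approach; the slides do not say how FlickrFS builds its URLs, and the secret, parameter names and function names here are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical secret shared by web app and origin


def sign_path(path: str, expires: int) -> str:
    """Signature over the path and its expiry time (Unix seconds)."""
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def signed_url(path: str, expires: int) -> str:
    """Built by the web app after it has authenticated the user."""
    return f"{path}?expires={expires}&sig={sign_path(path, expires)}"


def is_valid(path: str, expires: int, sig: str, now: int) -> bool:
    """Checked by the origin; no session or user lookup is needed."""
    if now > expires:
        return False
    return hmac.compare_digest(sig, sign_path(path, expires))
```

Note that a cache keyed on the full URL will store one copy per distinct signature, so the expiry granularity affects the cache hit rate.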

Business Continuity Planning How can I deal with the unexpected? The core of BCP Redundancy Replication

Reality On a long enough timescale, anything that can fail, will fail Of course, everything can fail True reliability comes only through redundancy

Reality Define your own SLAs How long can you afford to be down? How manual is the recovery process? How far can you roll back? How many $node boxes can fail at once?

Failure scenarios Disk failure Storage array failure Storage head failure Fabric failure Metadata node failure Power outage Routing outage

Reliable by design RAID avoids disk failures, but not head or fabric failures Duplicated nodes avoid host and fabric failures, but not routing or power failures Dual-colo avoids routing and power failures, but may need duplication too

Tend to all points in the stack Going dual-colo: great Taking a whole colo offline because of a single failed disk: bad We need a combination of these

Recovery times BCP is not just about continuing when things fail How can we restore after they come back? Host and colo level syncing: replication queuing Host and colo level rebuilding

Reliable Reads & Writes Reliable reads are easy 2 or more copies of files Reliable writes are harder Write 2 copies at once But what do we do when we can’t write to one?

Dual writes Queue up data to be written Where? Needs itself to be reliable Queue up journal of changes And then read data from the disk whose write succeeded Duplicate whole volume after failure Slow!
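The journal option can be sketched as follows: write to every store, remember which store missed which object, and copy the object across from a good store once the failed one is back. The class and its in-memory journal are hypothetical; as the slide points out, a real journal has to be stored reliably itself.

```python
from typing import Callable, Dict, List, Tuple

Put = Callable[[str, bytes], None]


class DualWriter:
    """Write each object to every store; journal the stores that missed it."""

    def __init__(self, stores: Dict[str, Put]) -> None:
        self.stores = stores
        # In-memory for brevity only; a real journal must survive failures.
        self.journal: List[Tuple[str, str]] = []

    def write(self, key: str, data: bytes) -> List[str]:
        succeeded, failed = [], []
        for name, put in self.stores.items():
            try:
                put(key, data)
                succeeded.append(name)
            except OSError:
                failed.append(name)
        if not succeeded:
            raise OSError(f"no store accepted {key!r}")
        self.journal.extend((name, key) for name in failed)
        return succeeded

    def replay(self, read: Callable[[str], bytes]) -> None:
        """Re-send journaled objects, reading each from a store that has it."""
        remaining = []
        for name, key in self.journal:
            try:
                self.stores[name](key, read(key))
            except OSError:
                remaining.append((name, key))
        self.journal = remaining
```

Replaying a journal only touches the objects written during the outage, which is the advantage over duplicating the whole volume afterwards.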

Judging cost Per GB? Per GB upfront and per year Not as simple as you’d hope How about an example

Hardware costs Cost of hardware / Usable GB (Single Cost)

Power costs Cost of power per year / Usable GB (Recurring Cost)

Power costs Power installation cost / Usable GB (Single Cost)

Space costs [Cost per U x U’s needed (inc network)] / Usable GB (Recurring Cost)

Network costs Cost of network gear / Usable GB (Single Cost)

Misc costs [Support contracts + spare disks + bus adaptors + cables] / Usable GB (Single & Recurring Costs)

Human costs [Admin cost per node x Node count] / Usable GB (Recurring Cost)

TCO Total cost of ownership in two parts Upfront Ongoing Architecture plays a huge part in costing Don’t get tied to hardware Allow heterogeneity Move with the market
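The per-GB components from the cost slides can be added up into the two TCO parts. The sketch below is one way to do that; the field names are invented, and the split of the misc costs into a single part (bus adaptors, cables, spare disks) and a recurring part (support contracts) is an assumption, since the slide only says "single & recurring".

```python
from dataclasses import dataclass


@dataclass
class StorageCosts:
    usable_gb: float
    # Single (upfront) costs
    hardware: float
    power_install: float
    network_gear: float
    misc_single: float              # assumed: bus adaptors, cables, spare disks
    # Recurring (per year) costs
    power_per_year: float
    cost_per_u_per_year: float
    rack_units: int                 # U's needed, including network gear
    support_per_year: float         # assumed: support contracts
    admin_per_node_per_year: float
    node_count: int

    def upfront_per_gb(self) -> float:
        single = (self.hardware + self.power_install
                  + self.network_gear + self.misc_single)
        return single / self.usable_gb

    def ongoing_per_gb_per_year(self) -> float:
        recurring = (self.power_per_year
                     + self.cost_per_u_per_year * self.rack_units
                     + self.support_per_year
                     + self.admin_per_node_per_year * self.node_count)
        return recurring / self.usable_gb
```

Dividing everything by usable GB rather than raw GB matters here, because RAID and replication overhead differ a lot between architectures.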

Photo credits flickr.com/photos/ebright/260823954/ flickr.com/photos/thomashawk/243477905/ flickr.com/photos/tom-carden/116315962/ flickr.com/photos/sillydog/287354869/ flickr.com/photos/foreversouls/131972916/ flickr.com/photos/julianb/324897/ flickr.com/photos/primejunta/140957047/ flickr.com/photos/whatknot/28973703/ flickr.com/photos/dcjohn/85504455/

You can find these slides online: iamcal.com/talks/
