Kosmos Filesystem
Sriram Rao
July 22, 2008
Background
- KFS was initially designed and implemented at Kosmix in 2006
  - Two developers: myself and Blake Lewis
- KFS released as an open-source project in Sep. 2007
  - One release-meister/developer/…: myself
- Lots of interest in the project (1000+ downloads of the code)
- Quantcast is now the primary sponsor of the project
Talk Outline
- Introduction
- System Architecture
- File I/O (reads/writes)
- Handling failures
- Software availability
- KFS+Hadoop
- Summary
Introduction
- Growing class of applications that process large volumes of data
  - Web search, web log analysis, Web 2.0 apps, grid computing, …
- Key requirement: cost-efficient, scalable compute/storage infrastructure
- Our work is focused on building scalable storage infrastructure
Workload
- A few million large files
- Files are typically tens of MB to a few GB in size
- Data is written once; read many, many times
- Files are accessed (mostly) sequentially
Approach
- Build using off-the-shelf commodity PCs
  - ~$30K for a 16-node cluster …
- Virtualize storage
  - As storage needs grow, scale the system by adding more storage nodes
  - System adapts to the increased storage automagically
- Design for crash recovery from the ground up
Storage Virtualization
- Construct a global namespace by decoupling storage from the filesystem namespace
  - Build a "disk" by aggregating the storage from individual nodes in the cluster
- To improve performance, stripe a file across multiple nodes in the cluster
- Use replication to tolerate failures
- Simplify storage management
  - System automagically balances storage utilization across all nodes
  - Any file can be accessed from any machine in the network
Terminology
- A file consists of a set of chunks
- Applications are oblivious of chunks
  - I/O is done on files
  - Translation from file offset to chunk/offset is transparent to the application
- Each chunk is fixed in size
  - Chunk size is 64MB
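Because the chunk size is fixed, the file-offset-to-chunk translation the client library performs is simple arithmetic. A minimal sketch of that mapping (type and function names are illustrative, not the actual KFS client code):

```cpp
// Illustrative only: map a file offset to (chunk index, offset within chunk)
// given the fixed 64MB chunk size described above.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kChunkSize = 64ULL << 20;  // 64MB

struct ChunkLocation {
    uint64_t chunkIndex;     // which chunk of the file holds the byte
    uint64_t offsetInChunk;  // byte offset inside that chunk
};

ChunkLocation TranslateOffset(uint64_t fileOffset) {
    return { fileOffset / kChunkSize, fileOffset % kChunkSize };
}

int main() {
    // Byte 200MB of a file lands in chunk 3 at offset 8MB.
    ChunkLocation loc = TranslateOffset(200ULL << 20);
    std::printf("chunk %llu, offset %llu\n",
                (unsigned long long)loc.chunkIndex,
                (unsigned long long)loc.offsetInChunk);
    return 0;
}
```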
System Architecture
- A single meta-data server (metaserver) that maintains the global namespace
- Multiple chunkservers that enable access to data
- A client library linked with applications for accessing files in KFS
- System is implemented in C++
Design Choices
- Inter-process communication is via non-blocking TCP sockets
- Communication protocol is text-based
  - Patterned after HTTP
- Connections between the metaserver and chunkservers are persistent
- Simple failure model: a connection break implies failure
  - Works for LAN settings
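To give a feel for an HTTP-like, text-based request between client and metaserver, here is a sketch; the "ALLOCATE" verb and the header names are invented for illustration and are not the actual KFS wire protocol:

```cpp
// Sketch of an HTTP-style, line-oriented request.  Only the general shape
// (text headers terminated by a blank line) reflects the design choice above;
// the verb and header names are hypothetical.
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

std::string BuildAllocateRequest(int64_t fileId, int64_t fileOffset, int seq) {
    std::ostringstream os;
    os << "ALLOCATE\r\n"                          // request verb
       << "Cseq: " << seq << "\r\n"               // client sequence number
       << "File-handle: " << fileId << "\r\n"
       << "Chunk-offset: " << fileOffset << "\r\n"
       << "\r\n";                                 // blank line ends the request
    return os.str();
}

int main() {
    std::cout << BuildAllocateRequest(42, 0, 1);
    return 0;
}
```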
Meta-data Server
- Maintains the directory tree in memory using a B+ tree
  - The tree records the chunks that belong to a file and the file attributes
  - For each chunk, records the file offset and chunk version
- Metaserver logs mutations to the tree to a log
  - The log is rolled over periodically (once every 10 mins)
  - An offline process compacts logs to produce a checkpoint file
- Chunk locations are tracked by the metaserver in an in-core table
- Chunks are versioned to handle chunkserver failures
- Periodically heartbeats the chunkservers to determine load information as well as responsiveness
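A sketch of the kind of per-file state described above: the tree maps each file to its chunks, recording the file offset and chunk version, while a separate in-core table tracks which chunkservers host each chunk. Structure names are illustrative, not the real KFS types:

```cpp
// Illustrative data structures for the metaserver state described above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ChunkInfo {
    int64_t chunkId;
    int64_t chunkVersion;    // bumped on re-allocation after a failed write
};

struct FileAttr {
    int64_t fileId;
    int16_t numReplicas;                   // per-file replication degree (default 3)
    std::map<int64_t, ChunkInfo> chunks;   // file offset -> chunk at that offset
};

// In-core location table: chunkId -> chunkservers hosting a replica.
// Rebuilt from chunkserver reports after a metaserver restart.
std::map<int64_t, std::vector<std::string>> gChunkLocations;

int main() {
    FileAttr f{/*fileId=*/1, /*numReplicas=*/3, {}};
    f.chunks[0] = ChunkInfo{100, 1};               // first 64MB of the file
    gChunkLocations[100] = {"cs1:20000", "cs2:20000", "cs3:20000"};
    return 0;
}
```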
Crash Recovery
- Following a crash, restart the metaserver
  - Rebuild the tree using the last checkpoint + log files
  - Chunkservers connect back to the metaserver and send chunk information
  - Metaserver rebuilds the chunk location information
  - Metaserver identifies stale chunks and notifies the appropriate chunkservers
- Metaserver is a single point of failure
  - To protect the filesystem, back up logs/checkpoint files to remote nodes
  - Will be addressed in a future release
Chunk Server
- Stores chunks as files in the underlying filesystem (such as XFS/ZFS)
- Chunk size is fixed at 64MB
- To handle disk corruption, an Adler-32 checksum is computed on 64K blocks
  - Checksums are validated on each read
- Chunk file has a fixed-length header (~5K) for storing checksums and other meta info
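A minimal sketch of the per-block checksumming described above, using zlib's adler32() over 64K blocks; the surrounding structure is illustrative and not the actual chunkserver code:

```cpp
// Adler-32 over 64K blocks, as described above.  Compile with -lz.
#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t kBlockSize = 64 * 1024;   // checksum granularity

// One checksum per 64K block; these would live in the chunk file's header.
std::vector<uint32_t> ChecksumBlocks(const unsigned char *data, size_t len) {
    std::vector<uint32_t> sums;
    for (size_t off = 0; off < len; off += kBlockSize) {
        size_t n = std::min(kBlockSize, len - off);
        sums.push_back(adler32(adler32(0L, Z_NULL, 0), data + off, (uInt)n));
    }
    return sums;
}

// On read, recompute the block's checksum and compare against the stored value.
bool VerifyBlock(const unsigned char *block, size_t n, uint32_t stored) {
    return adler32(adler32(0L, Z_NULL, 0), block, (uInt)n) == stored;
}

int main() {
    std::vector<unsigned char> chunk(3 * kBlockSize + 123, 0xab);
    std::vector<uint32_t> sums = ChecksumBlocks(chunk.data(), chunk.size());
    std::printf("blocks: %zu, first block ok: %d\n",
                sums.size(), VerifyBlock(chunk.data(), kBlockSize, sums[0]));
    return 0;
}
```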
Crash Recovery
- Following a crash, restart the chunkserver
- Chunkserver scans the chunk directory to determine the chunks it has
  - The chunk filename identifies the owning file-id, chunk-id, and chunk version
- Chunkserver connects to the metaserver and tells it the chunks/versions it has
- Metaserver responds with stale chunk IDs (if any)
  - Stale chunks are moved to lost+found
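Encoding the file-id, chunk-id, and version in the chunk filename is what lets a restarted chunkserver report its inventory from a directory scan alone. The dot-separated layout in the sketch below is an assumption for illustration; the real on-disk naming scheme may differ:

```cpp
// ASSUMED "<fileid>.<chunkid>.<version>" filename layout, for illustration only.
#include <cstdio>
#include <string>

bool ParseChunkFilename(const std::string &name,
                        long long &fileId, long long &chunkId, long long &version) {
    return std::sscanf(name.c_str(), "%lld.%lld.%lld",
                       &fileId, &chunkId, &version) == 3;
}

int main() {
    long long f = 0, c = 0, v = 0;
    if (ParseChunkFilename("12.345.7", f, c, v))
        std::printf("file %lld, chunk %lld, version %lld\n", f, c, v);
    return 0;
}
```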
Data Scrubbing
- The package contains a tool, chunkscrubber, that can be used to scrub chunks
  - The scrubber verifies checksums and identifies corrupted blocks
- Support for periodic scrubbing will be added in a future release
  - The scrubber will identify corrupted blocks, which will be moved to lost+found
  - The metaserver will use re-replication to proactively recover lost chunks
Client Library
- Client library interfaces with the metaserver and chunkservers
- Provides a POSIX-like API for the usual file/directory operations:
  - Create, read, write, mkdir, rmdir, etc.
- For reads/writes, the client library translates from file offset to chunk/offset
- Applications can specify the degree of replication on a per-file basis (default = 3)
- Java/Python glue code to get at the C++ library
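To give a feel for the POSIX-like API, here is a hypothetical usage sketch. The class and method names are placeholders (stubbed so the sketch compiles); the real client library's headers define the actual interface:

```cpp
// Hypothetical client-library usage; names and signatures are illustrative only.
#include <cstdio>
#include <string>

class KfsClientSketch {                        // stand-in for the real client object
public:
    int Mkdir(const std::string &path)                    { return 0; }       // stub
    int Create(const std::string &path, int numReplicas)  { return 3; }       // stub fd
    int Write(int fd, const char *buf, size_t n)          { return (int)n; }  // stub
    int Sync(int fd)                                      { return 0; }       // flush to chunkservers
    int Close(int fd)                                     { return 0; }       // stub
};

int main() {
    KfsClientSketch client;
    client.Mkdir("/logs");
    int fd = client.Create("/logs/day1", /*numReplicas=*/3);  // per-file replication
    const std::string rec = "hello, kfs\n";
    client.Write(fd, rec.data(), rec.size());
    client.Sync(fd);     // make buffered data visible to other readers
    client.Close(fd);
    std::printf("wrote %zu bytes\n", rec.size());
    return 0;
}
```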
Client Library (contd.)
- Client library keeps a cache of chunk data
  - Reads: download a chunk from the server and serve requests from the cache
  - Writes: buffer the writes and push data to the chunkservers
- Chunk buffer is 64MB
Features
- WORM support for archiving data
- Incremental scalability
- Load balancing (static) in terms of disk usage of chunkservers
- "Retire" a chunkserver for scheduled downtime
- Client-side caching for performance
  - Leases for cache consistency
- Re-replication for availability
- Lots of support for handling failures
- Tools for accessing the KFS tree
  - FUSE support to allow existing fs utils to manipulate the KFS tree
WORM Support
- A configuration variable on the metaserver enables "WORM" mode
- Remove/Rename/Rmdir are disallowed on all files except those with a ".tmp" extension
- Enables KFS to be used as an archival system
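A minimal sketch of that rule: when the metaserver is configured for WORM mode, remove/rename/rmdir are refused unless the target has a ".tmp" extension. The variable and function names are assumptions for illustration:

```cpp
// Illustrative WORM check; not the metaserver's actual code.
#include <cstdio>
#include <string>

bool gWormMode = true;   // assumed name for the metaserver's WORM configuration flag

// Returns true if a remove/rename/rmdir on `path` should be allowed.
bool IsMutationAllowed(const std::string &path) {
    if (!gWormMode)
        return true;                       // normal mode: everything is mutable
    const std::string suffix = ".tmp";
    return path.size() >= suffix.size() &&
           path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
}

int main() {
    std::printf("%d %d\n",
                IsMutationAllowed("/archive/job.out"),       // 0: blocked in WORM mode
                IsMutationAllowed("/archive/job.out.tmp"));  // 1: allowed
    return 0;
}
```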
Handling Writes
- Follows the well-understood model
- Application creates a file
  - File is visible in the directory tree
- Application writes data to the file
  - Client library caches the write
  - When the cache is full, the write is flushed to the chunkservers
- Application can force data to be pushed to the chunkservers by doing a Sync()
- Data written to a chunkserver becomes visible to another application
Leases
- Client library gets a read lease on a chunk prior to reading
  - While the client has a read lease, the server promises that the content will not change
- For writes, a "master" chunkserver is assigned a write lease
  - Master serializes concurrent writes to a chunk
Writing Data
- Client requests an allocation from the metaserver
  - Metaserver allocates a chunk on 3 servers and anoints one server as the "master"
  - Master is given the "write lease" and is responsible for serializing writes to the chunk
- After the allocation, the client writes to the chunk in two steps (see the sketch below):
  - Client pushes data out to the master
    - Data is forwarded amongst the chunkservers in a daisy-chain
  - At the end of the push, the client sends a SYNC message
    - Master responds with a status for the SYNC message
- If the write fails, the client retries
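A sketch of that client-side write path: allocate, push to the master (which daisy-chains the data to the other replicas), then SYNC and retry on failure. The RPC helpers are stubbed placeholders, not the real client library:

```cpp
// Illustrative write path; AllocateChunk/PushData/SendSync are stand-ins for the
// client library's RPCs to the metaserver and chunkservers.
#include <cstdint>
#include <string>
#include <vector>

struct Allocation {
    std::string master;                 // chunkserver holding the write lease
    std::vector<std::string> replicas;  // all servers hosting the chunk
    int64_t chunkId;
    int64_t chunkVersion;
};

// Stubs so the sketch compiles; the real calls go over TCP.
Allocation AllocateChunk(int64_t fileId, int64_t fileOffset) {
    return {"cs1:20000", {"cs1:20000", "cs2:20000", "cs3:20000"}, 100, 1};
}
bool PushData(const Allocation &, const char *, size_t) { return true; }
int  SendSync(const Allocation &)                       { return 0; }

bool WriteChunkData(int64_t fileId, int64_t fileOffset, const char *buf, size_t n) {
    for (int attempt = 0; attempt < 3; ++attempt) {          // retry on failure
        Allocation a = AllocateChunk(fileId, fileOffset);    // may trigger re-allocation
        if (PushData(a, buf, n) && SendSync(a) == 0)         // push, then SYNC
            return true;                                     // master acked the SYNC
    }
    return false;
}

int main() {
    const char data[] = "record";
    return WriteChunkData(1, 0, data, sizeof(data) - 1) ? 0 : 1;
}
```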
Write Failures
- When a write fails, the client may force re-allocation
  - Metaserver changes the version number for the chunk on the servers hosting the chunk
- Picking a new location can cause chunks to diverge
  - Re-replication is used to get the chunk's replication level back up
Reads
- Prior to reading data, the client library downloads the locations of a chunk
- Client library picks a "nearby" server to read data from:
  - Localhost
  - Chunkserver on the same subnet
  - Random, to balance out server load
- Future: use server load to determine a good location
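A sketch of that "nearby server" preference: localhost first, then a chunkserver on the same subnet, otherwise a random replica. The helper and the crude subnet test are simplifications for illustration:

```cpp
// Illustrative replica selection following the preference order described above.
#include <cstdlib>
#include <string>
#include <vector>

static bool SameSubnet(const std::string &a, const std::string &b) {
    // crude comparison for illustration: everything up to the last '.' must match
    return a.substr(0, a.rfind('.')) == b.substr(0, b.rfind('.'));
}

std::string PickReplica(const std::vector<std::string> &servers,
                        const std::string &myIp) {
    for (const std::string &s : servers)
        if (s == myIp) return s;                         // 1. localhost
    for (const std::string &s : servers)
        if (SameSubnet(s, myIp)) return s;               // 2. same subnet
    return servers[std::rand() % servers.size()];        // 3. random, spreads load
}

int main() {
    std::vector<std::string> servers = {"10.0.1.5", "10.0.2.7", "10.0.2.9"};
    return PickReplica(servers, "10.0.2.1") == "10.0.2.7" ? 0 : 1;
}
```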
Heartbeats
- Once a minute, the metaserver sends a heartbeat message to each chunkserver
  - Chunkserver responds with an ack
  - No ack => non-responsive server
- No work (read/write) is sent to non-responsive servers
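A sketch of the bookkeeping this implies on the metaserver: record the time of each ack and treat a server as non-responsive once acks stop arriving. The two-minute grace period below is an assumption, not a documented KFS setting:

```cpp
// Illustrative heartbeat tracking; names and the grace period are assumptions.
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

std::map<std::string, Clock::time_point> gLastAck;   // chunkserver -> last ack time

void OnHeartbeatAck(const std::string &server) {
    gLastAck[server] = Clock::now();
}

// Servers that have not acked recently get no read/write work.
bool IsResponsive(const std::string &server) {
    auto it = gLastAck.find(server);
    if (it == gLastAck.end())
        return false;
    return Clock::now() - it->second < std::chrono::minutes(2);  // assumed grace period
}

int main() {
    OnHeartbeatAck("cs1:20000");
    return IsResponsive("cs1:20000") ? 0 : 1;
}
```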
Placement
- Placement algorithm: spread chunks across servers while balancing server write load
- Placement is rack-aware: the 3 copies of a chunk are stored on 3 different racks
- Localhost optimization: one copy of the data is placed on the chunkserver on the same host as the client doing the write
  - Helps reduce network traffic
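A sketch of a rack-aware placement pass reflecting the rules above: put one copy on the writer's own chunkserver when possible and the rest on servers in distinct racks. The real allocator also weighs write load, which this illustration omits:

```cpp
// Illustrative rack-aware placement; types and selection are simplified.
#include <set>
#include <string>
#include <vector>

struct Server { std::string host; std::string rack; };

std::vector<Server> PlaceReplicas(const std::vector<Server> &candidates,
                                  const std::string &clientHost) {
    std::vector<Server> chosen;
    std::set<std::string> usedRacks;
    // Localhost optimization: first copy on the chunkserver co-located with the writer.
    for (const Server &s : candidates)
        if (s.host == clientHost) {
            chosen.push_back(s);
            usedRacks.insert(s.rack);
            break;
        }
    // Remaining copies go to servers on racks not used yet.
    for (const Server &s : candidates) {
        if (chosen.size() == 3) break;
        if (usedRacks.insert(s.rack).second)        // true only for a new rack
            chosen.push_back(s);
    }
    return chosen;
}

int main() {
    std::vector<Server> cs = {{"a", "r1"}, {"b", "r1"}, {"c", "r2"}, {"d", "r3"}};
    return PlaceReplicas(cs, "a").size() == 3 ? 0 : 1;
}
```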
Incr. Scalability
- Start up a new chunkserver
- Chunkserver connects to the metaserver
  - Sends a HELLO message
  - Determines which chunks are good
  - Nukes out stale chunks
- Chunkserver is now part of the system
Rebalancing
- As new nodes are added to the system, disk utilization becomes imbalanced
- Use re-replication to migrate blocks from over-utilized nodes (> 80% full) to under-utilized nodes (< 20% full)
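A sketch of that rebalancing rule: nodes above ~80% disk utilization shed chunks (via re-replication) to nodes below ~20%. The thresholds come from the slide; the pairing logic is a simplified illustration:

```cpp
// Illustrative rebalancing plan using the 80% / 20% thresholds described above.
#include <string>
#include <utility>
#include <vector>

struct Node { std::string host; double utilization; };   // fraction of disk used, 0..1

// Return (over-utilized source, under-utilized destination) pairs to migrate between.
std::vector<std::pair<std::string, std::string>>
PlanRebalance(const std::vector<Node> &nodes) {
    std::vector<std::string> over, under;
    for (const Node &n : nodes) {
        if (n.utilization > 0.80)      over.push_back(n.host);
        else if (n.utilization < 0.20) under.push_back(n.host);
    }
    std::vector<std::pair<std::string, std::string>> plan;
    for (size_t i = 0; i < over.size() && i < under.size(); ++i)
        plan.emplace_back(over[i], under[i]);
    return plan;
}

int main() {
    std::vector<Node> nodes = {{"old1", 0.91}, {"old2", 0.85}, {"new1", 0.02}};
    return PlanRebalance(nodes).size() == 1 ? 0 : 1;
}
```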
Retiring A Node
- Want to schedule downtime on chunkserver nodes for maintenance
- Retiring a chunkserver proactively replicates data off of it
- There will be sufficient copies of the data when the chunkserver is taken offline
KFS Software
- Code is released under the Apache 2.0 license
- Software is hosted on SourceForge: http://kosmosfs.sourceforge.net
- Current release is alpha version 0.2.1
- Contains scripts to simplify deployment:
  - Specify machine configuration
  - Install the software on the machines
  - Script to start/stop servers on remote machines
  - Script to back up metaserver logs/checkpoints to a remote machine
Hadoop+KFS
- KFS is integrated with Hadoop using the FileSystem API
- Allows existing Hadoop apps to use KFS seamlessly
KFS @ Quantcast
- Two deployments
- 130-node cluster hosting log data
  - ~2M files; 70TB of data; WORM system
  - Metaserver uses ~2GB RAM
  - ~1TB of data copied in during a week
  - Used for daily jobs in read mode
- Plan is to use both KFS and HDFS
  - For job output, back up from KFS to HDFS using Hadoop's distcp
Summary
- KFS is implemented with a set of features:
  - Replication, leases, fault-tolerance, data integrity
  - Basic tools for accessing/monitoring KFS
- A fair bit of work has gone into making the system resilient to failures
- System is incrementally scalable
- Integrated with Hadoop to enable usage
References
- The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
