Kosmos Filesystem
Sriram Rao
July 22, 2008
Background
- KFS was initially designed and implemented at Kosmix in 2006
  - Two developers: myself and Blake Lewis
- KFS released as an open-source project in Sep. 2007
  - One release-meister/developer/…: myself
- Lots of interest in the project (1000+ downloads of the code)
- Quantcast is now the primary sponsor of the project
Talk Outline
- Introduction
- System Architecture
- File I/O (reads/writes)
- Handling failures
- Software availability
- KFS+Hadoop
- Summary
Introduction
- Growing class of applications that process large volumes of data
  - Web search, web log analysis, Web 2.0 apps, grid computing, …
- Key requirement: cost-efficient, scalable compute/storage infrastructure
- Our work is focused on building scalable storage infrastructure
Workload
- A few million large files
- Files are typically tens of MB to a few GB in size
- Data is written once; read many, many times
- Files are accessed (mostly) sequentially
Approach
- Build using off-the-shelf commodity PCs
  - ~$30K for a 16-node cluster …
- Virtualize storage
  - As storage needs grow, scale the system by adding more storage nodes
  - System adapts to the increased storage automagically
- Design for crash recovery from the ground up
Storage Virtualization
- Construct a global namespace by decoupling storage from the filesystem namespace
  - Build a "disk" by aggregating the storage from individual nodes in the cluster
- To improve performance, stripe a file across multiple nodes in the cluster
- Use replication to tolerate failures
- Simplify storage management
  - System automagically balances storage utilization across all nodes
  - Any file can be accessed from any machine in the network
Terminology
- A file consists of a set of chunks
- Applications are oblivious of chunks
  - I/O is done on files
  - Translation from file offset to chunk/offset is transparent to the application
- Each chunk is fixed in size
  - Chunk size is 64MB
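Because the chunk size is fixed, the file-offset-to-chunk translation the client library performs is simple arithmetic. A minimal sketch of that mapping (type and function names are illustrative, not the actual KFS client code):

```cpp
// Illustrative only: map a file offset to (chunk index, offset within chunk)
// given the fixed 64MB chunk size described above.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kChunkSize = 64ULL << 20;  // 64MB

struct ChunkLocation {
    uint64_t chunkIndex;     // which chunk of the file holds the byte
    uint64_t offsetInChunk;  // byte offset inside that chunk
};

ChunkLocation TranslateOffset(uint64_t fileOffset) {
    return { fileOffset / kChunkSize, fileOffset % kChunkSize };
}

int main() {
    // Byte 200MB of a file lands in chunk 3 at offset 8MB.
    ChunkLocation loc = TranslateOffset(200ULL << 20);
    std::printf("chunk %llu, offset %llu\n",
                (unsigned long long)loc.chunkIndex,
                (unsigned long long)loc.offsetInChunk);
    return 0;
}
```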
System Architecture
- A single meta-data server (metaserver) that maintains the global namespace
- Multiple chunkservers that enable access to data
- A client library linked with applications for accessing files in KFS
- System is implemented in C++
Design Choices
- Inter-process communication is via non-blocking TCP sockets
- Communication protocol is text-based
  - Patterned after HTTP
- Connections between the metaserver and chunkservers are persistent
- Simple failure model: a connection break implies failure
  - Works for LAN settings
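To give a feel for an HTTP-like, text-based request between client and metaserver, here is a sketch; the "ALLOCATE" verb and the header names are invented for illustration and are not the actual KFS wire protocol:

```cpp
// Sketch of an HTTP-style, line-oriented request.  Only the general shape
// (text headers terminated by a blank line) reflects the design choice above;
// the verb and header names are hypothetical.
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

std::string BuildAllocateRequest(int64_t fileId, int64_t fileOffset, int seq) {
    std::ostringstream os;
    os << "ALLOCATE\r\n"                          // request verb
       << "Cseq: " << seq << "\r\n"               // client sequence number
       << "File-handle: " << fileId << "\r\n"
       << "Chunk-offset: " << fileOffset << "\r\n"
       << "\r\n";                                 // blank line ends the request
    return os.str();
}

int main() {
    std::cout << BuildAllocateRequest(42, 0, 1);
    return 0;
}
```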
Meta-data Server
- Maintains the directory tree in memory using a B+ tree
  - The tree records the chunks that belong to a file and the file attributes
  - For each chunk, records the file offset and chunk version
- Metaserver logs mutations to the tree to a log
  - The log is rolled over periodically (once every 10 mins)
  - An offline process compacts logs to produce a checkpoint file
- Chunk locations are tracked by the metaserver in an in-core table
- Chunks are versioned to handle chunkserver failures
- Periodically heartbeats the chunkservers to determine load information as well as responsiveness
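A sketch of the kind of per-file state described above: the tree maps each file to its chunks, recording the file offset and chunk version, while a separate in-core table tracks which chunkservers host each chunk. Structure names are illustrative, not the real KFS types:

```cpp
// Illustrative data structures for the metaserver state described above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ChunkInfo {
    int64_t chunkId;
    int64_t chunkVersion;    // bumped on re-allocation after a failed write
};

struct FileAttr {
    int64_t fileId;
    int16_t numReplicas;                   // per-file replication degree (default 3)
    std::map<int64_t, ChunkInfo> chunks;   // file offset -> chunk at that offset
};

// In-core location table: chunkId -> chunkservers hosting a replica.
// Rebuilt from chunkserver reports after a metaserver restart.
std::map<int64_t, std::vector<std::string>> gChunkLocations;

int main() {
    FileAttr f{/*fileId=*/1, /*numReplicas=*/3, {}};
    f.chunks[0] = ChunkInfo{100, 1};               // first 64MB of the file
    gChunkLocations[100] = {"cs1:20000", "cs2:20000", "cs3:20000"};
    return 0;
}
```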
Crash Recovery
- Following a crash, restart the metaserver
  - Rebuild the tree using the last checkpoint + log files
  - Chunkservers connect back to the metaserver and send chunk information
  - Metaserver rebuilds the chunk location information
  - Metaserver identifies stale chunks and notifies the appropriate chunkservers
- Metaserver is a single point of failure
  - To protect the filesystem, back up logs/checkpoint files to remote nodes
  - Will be addressed in a future release
Chunk Server
- Stores chunks as files in the underlying filesystem (such as XFS/ZFS)
- Chunk size is fixed at 64MB
- To handle disk corruption, an Adler-32 checksum is computed on 64K blocks
  - Checksums are validated on each read
- Chunk file has a fixed-length header (~5K) for storing checksums and other meta info
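A minimal sketch of the per-block checksumming described above, using zlib's adler32() over 64K blocks; the surrounding structure is illustrative and not the actual chunkserver code:

```cpp
// Adler-32 over 64K blocks, as described above.  Compile with -lz.
#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t kBlockSize = 64 * 1024;   // checksum granularity

// One checksum per 64K block; these would live in the chunk file's header.
std::vector<uint32_t> ChecksumBlocks(const unsigned char *data, size_t len) {
    std::vector<uint32_t> sums;
    for (size_t off = 0; off < len; off += kBlockSize) {
        size_t n = std::min(kBlockSize, len - off);
        sums.push_back(adler32(adler32(0L, Z_NULL, 0), data + off, (uInt)n));
    }
    return sums;
}

// On read, recompute the block's checksum and compare against the stored value.
bool VerifyBlock(const unsigned char *block, size_t n, uint32_t stored) {
    return adler32(adler32(0L, Z_NULL, 0), block, (uInt)n) == stored;
}

int main() {
    std::vector<unsigned char> chunk(3 * kBlockSize + 123, 0xab);
    std::vector<uint32_t> sums = ChecksumBlocks(chunk.data(), chunk.size());
    std::printf("blocks: %zu, first block ok: %d\n",
                sums.size(), VerifyBlock(chunk.data(), kBlockSize, sums[0]));
    return 0;
}
```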
Crash Recovery
- Following a crash, restart the chunkserver
- Chunkserver scans the chunk directory to determine the chunks it has
  - The chunk filename identifies the owning file-id, chunk-id, and chunk version
- Chunkserver connects to the metaserver and tells it the chunks/versions it has
- Metaserver responds with stale chunk IDs (if any)
  - Stale chunks are moved to lost+found
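Encoding the file-id, chunk-id, and version in the chunk filename is what lets a restarted chunkserver report its inventory from a directory scan alone. The dot-separated layout in the sketch below is an assumption for illustration; the real on-disk naming scheme may differ:

```cpp
// ASSUMED "<fileid>.<chunkid>.<version>" filename layout, for illustration only.
#include <cstdio>
#include <string>

bool ParseChunkFilename(const std::string &name,
                        long long &fileId, long long &chunkId, long long &version) {
    return std::sscanf(name.c_str(), "%lld.%lld.%lld",
                       &fileId, &chunkId, &version) == 3;
}

int main() {
    long long f = 0, c = 0, v = 0;
    if (ParseChunkFilename("12.345.7", f, c, v))
        std::printf("file %lld, chunk %lld, version %lld\n", f, c, v);
    return 0;
}
```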
Data Scrubbing
- The package contains a tool, chunkscrubber, that can be used to scrub chunks
  - The scrubber verifies checksums and identifies corrupted blocks
- Support for periodic scrubbing will be added in a future release
  - The scrubber will identify corrupted blocks, which will be moved to lost+found
  - The metaserver will use re-replication to proactively recover lost chunks
Client Library
- Client library interfaces with the metaserver and chunkservers
- Provides a POSIX-like API for the usual file/directory operations:
  - Create, read, write, mkdir, rmdir, etc.
- For reads/writes, the client library translates from file offset to chunk/offset
- Applications can specify the degree of replication on a per-file basis (default = 3)
- Java/Python glue code to get at the C++ library
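To give a feel for the POSIX-like API, here is a hypothetical usage sketch. The class and method names are placeholders (stubbed so the sketch compiles); the real client library's headers define the actual interface:

```cpp
// Hypothetical client-library usage; names and signatures are illustrative only.
#include <cstdio>
#include <string>

class KfsClientSketch {                        // stand-in for the real client object
public:
    int Mkdir(const std::string &path)                    { return 0; }       // stub
    int Create(const std::string &path, int numReplicas)  { return 3; }       // stub fd
    int Write(int fd, const char *buf, size_t n)          { return (int)n; }  // stub
    int Sync(int fd)                                      { return 0; }       // flush to chunkservers
    int Close(int fd)                                     { return 0; }       // stub
};

int main() {
    KfsClientSketch client;
    client.Mkdir("/logs");
    int fd = client.Create("/logs/day1", /*numReplicas=*/3);  // per-file replication
    const std::string rec = "hello, kfs\n";
    client.Write(fd, rec.data(), rec.size());
    client.Sync(fd);     // make buffered data visible to other readers
    client.Close(fd);
    std::printf("wrote %zu bytes\n", rec.size());
    return 0;
}
```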
Client Library (contd.)
- Client library keeps a cache of chunk data
  - Reads: download a chunk from the server and serve requests from the cache
  - Writes: buffer the writes and push data to the chunkservers
- Chunk buffer is 64MB
Features
- WORM support for archiving data
- Incremental scalability
- Load balancing (static) in terms of disk usage of chunkservers
- "Retire" a chunkserver for scheduled downtime
- Client-side caching for performance
  - Leases for cache consistency
- Re-replication for availability
- Lots of support for handling failures
- Tools for accessing the KFS tree
  - FUSE support to allow existing fs utils to manipulate the KFS tree
WORM Support
- A configuration variable on the metaserver enables "WORM" mode
- Remove/Rename/Rmdir are disallowed on all files except those with a ".tmp" extension
- Enables KFS to be used as an archival system
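A minimal sketch of that rule: when the metaserver is configured for WORM mode, remove/rename/rmdir are refused unless the target has a ".tmp" extension. The variable and function names are assumptions for illustration:

```cpp
// Illustrative WORM check; not the metaserver's actual code.
#include <cstdio>
#include <string>

bool gWormMode = true;   // assumed name for the metaserver's WORM configuration flag

// Returns true if a remove/rename/rmdir on `path` should be allowed.
bool IsMutationAllowed(const std::string &path) {
    if (!gWormMode)
        return true;                       // normal mode: everything is mutable
    const std::string suffix = ".tmp";
    return path.size() >= suffix.size() &&
           path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
}

int main() {
    std::printf("%d %d\n",
                IsMutationAllowed("/archive/job.out"),       // 0: blocked in WORM mode
                IsMutationAllowed("/archive/job.out.tmp"));  // 1: allowed
    return 0;
}
```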
Handling Writes
- Follows the well-understood model
- Application creates a file
  - File is visible in the directory tree
- Application writes data to the file
  - Client library caches the write
  - When the cache is full, the write is flushed to the chunkservers
- Application can force data to be pushed to the chunkservers by doing a Sync()
- Data written to a chunkserver becomes visible to another application
Leases
- Client library gets a read lease on a chunk prior to reading
  - While the client has a read lease, the server promises that the content will not change
- For writes, a "master" chunkserver is assigned a write lease
  - Master serializes concurrent writes to a chunk
Writing Data
- Client requests an allocation from the metaserver
  - Metaserver allocates a chunk on 3 servers and anoints one server as the "master"
  - Master is given the "write lease" and is responsible for serializing writes to the chunk
- After the allocation, the client writes to the chunk in two steps (see the sketch below):
  - Client pushes data out to the master
    - Data is forwarded amongst the chunkservers in a daisy-chain
  - At the end of the push, the client sends a SYNC message
    - Master responds with a status for the SYNC message
- If the write fails, the client retries
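A sketch of that client-side write path: allocate, push to the master (which daisy-chains the data to the other replicas), then SYNC and retry on failure. The RPC helpers are stubbed placeholders, not the real client library:

```cpp
// Illustrative write path; AllocateChunk/PushData/SendSync are stand-ins for the
// client library's RPCs to the metaserver and chunkservers.
#include <cstdint>
#include <string>
#include <vector>

struct Allocation {
    std::string master;                 // chunkserver holding the write lease
    std::vector<std::string> replicas;  // all servers hosting the chunk
    int64_t chunkId;
    int64_t chunkVersion;
};

// Stubs so the sketch compiles; the real calls go over TCP.
Allocation AllocateChunk(int64_t fileId, int64_t fileOffset) {
    return {"cs1:20000", {"cs1:20000", "cs2:20000", "cs3:20000"}, 100, 1};
}
bool PushData(const Allocation &, const char *, size_t) { return true; }
int  SendSync(const Allocation &)                       { return 0; }

bool WriteChunkData(int64_t fileId, int64_t fileOffset, const char *buf, size_t n) {
    for (int attempt = 0; attempt < 3; ++attempt) {          // retry on failure
        Allocation a = AllocateChunk(fileId, fileOffset);    // may trigger re-allocation
        if (PushData(a, buf, n) && SendSync(a) == 0)         // push, then SYNC
            return true;                                     // master acked the SYNC
    }
    return false;
}

int main() {
    const char data[] = "record";
    return WriteChunkData(1, 0, data, sizeof(data) - 1) ? 0 : 1;
}
```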
Write Failures
- When a write fails, the client may force re-allocation
  - Metaserver changes the version number for the chunk on the servers hosting the chunk
- Picking a new location can cause chunks to diverge
  - Re-replication is used to get the chunk's replication level back up
Reads
- Prior to reading data, the client library downloads the locations of a chunk
- Client library picks a "nearby" server to read data from:
  - Localhost
  - Chunkserver on the same subnet
  - Random, to balance out server load
- Future: use server load to determine a good location
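A sketch of that "nearby server" preference: localhost first, then a chunkserver on the same subnet, otherwise a random replica. The helper and the crude subnet test are simplifications for illustration:

```cpp
// Illustrative replica selection following the preference order described above.
#include <cstdlib>
#include <string>
#include <vector>

static bool SameSubnet(const std::string &a, const std::string &b) {
    // crude comparison for illustration: everything up to the last '.' must match
    return a.substr(0, a.rfind('.')) == b.substr(0, b.rfind('.'));
}

std::string PickReplica(const std::vector<std::string> &servers,
                        const std::string &myIp) {
    for (const std::string &s : servers)
        if (s == myIp) return s;                         // 1. localhost
    for (const std::string &s : servers)
        if (SameSubnet(s, myIp)) return s;               // 2. same subnet
    return servers[std::rand() % servers.size()];        // 3. random, spreads load
}

int main() {
    std::vector<std::string> servers = {"10.0.1.5", "10.0.2.7", "10.0.2.9"};
    return PickReplica(servers, "10.0.2.1") == "10.0.2.7" ? 0 : 1;
}
```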
Heartbeats
- Once a minute, the metaserver sends a heartbeat message to each chunkserver
  - Chunkserver responds with an ack
  - No ack => non-responsive server
- No work (read/write) is sent to non-responsive servers
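A sketch of the bookkeeping this implies on the metaserver: record the time of each ack and treat a server as non-responsive once acks stop arriving. The two-minute grace period below is an assumption, not a documented KFS setting:

```cpp
// Illustrative heartbeat tracking; names and the grace period are assumptions.
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

std::map<std::string, Clock::time_point> gLastAck;   // chunkserver -> last ack time

void OnHeartbeatAck(const std::string &server) {
    gLastAck[server] = Clock::now();
}

// Servers that have not acked recently get no read/write work.
bool IsResponsive(const std::string &server) {
    auto it = gLastAck.find(server);
    if (it == gLastAck.end())
        return false;
    return Clock::now() - it->second < std::chrono::minutes(2);  // assumed grace period
}

int main() {
    OnHeartbeatAck("cs1:20000");
    return IsResponsive("cs1:20000") ? 0 : 1;
}
```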
Placement
- Placement algorithm: spread chunks across servers while balancing server write load
- Placement is rack-aware: the 3 copies of a chunk are stored on 3 different racks
- Localhost optimization: one copy of the data is placed on the chunkserver on the same host as the client doing the write
  - Helps reduce network traffic
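A sketch of a rack-aware placement pass reflecting the rules above: put one copy on the writer's own chunkserver when possible and the rest on servers in distinct racks. The real allocator also weighs write load, which this illustration omits:

```cpp
// Illustrative rack-aware placement; types and selection are simplified.
#include <set>
#include <string>
#include <vector>

struct Server { std::string host; std::string rack; };

std::vector<Server> PlaceReplicas(const std::vector<Server> &candidates,
                                  const std::string &clientHost) {
    std::vector<Server> chosen;
    std::set<std::string> usedRacks;
    // Localhost optimization: first copy on the chunkserver co-located with the writer.
    for (const Server &s : candidates)
        if (s.host == clientHost) {
            chosen.push_back(s);
            usedRacks.insert(s.rack);
            break;
        }
    // Remaining copies go to servers on racks not used yet.
    for (const Server &s : candidates) {
        if (chosen.size() == 3) break;
        if (usedRacks.insert(s.rack).second)        // true only for a new rack
            chosen.push_back(s);
    }
    return chosen;
}

int main() {
    std::vector<Server> cs = {{"a", "r1"}, {"b", "r1"}, {"c", "r2"}, {"d", "r3"}};
    return PlaceReplicas(cs, "a").size() == 3 ? 0 : 1;
}
```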
Incr. Scalability
- Start up a new chunkserver
- Chunkserver connects to the metaserver
  - Sends a HELLO message
  - Determines which chunks are good
  - Nukes out stale chunks
- Chunkserver is now part of the system
Rebalancing
- As new nodes are added to the system, disk utilization becomes imbalanced
- Use re-replication to migrate blocks from over-utilized nodes (> 80% full) to under-utilized nodes (< 20% full)
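A sketch of that rebalancing rule: nodes above ~80% disk utilization shed chunks (via re-replication) to nodes below ~20%. The thresholds come from the slide; the pairing logic is a simplified illustration:

```cpp
// Illustrative rebalancing plan using the 80% / 20% thresholds described above.
#include <string>
#include <utility>
#include <vector>

struct Node { std::string host; double utilization; };   // fraction of disk used, 0..1

// Return (over-utilized source, under-utilized destination) pairs to migrate between.
std::vector<std::pair<std::string, std::string>>
PlanRebalance(const std::vector<Node> &nodes) {
    std::vector<std::string> over, under;
    for (const Node &n : nodes) {
        if (n.utilization > 0.80)      over.push_back(n.host);
        else if (n.utilization < 0.20) under.push_back(n.host);
    }
    std::vector<std::pair<std::string, std::string>> plan;
    for (size_t i = 0; i < over.size() && i < under.size(); ++i)
        plan.emplace_back(over[i], under[i]);
    return plan;
}

int main() {
    std::vector<Node> nodes = {{"old1", 0.91}, {"old2", 0.85}, {"new1", 0.02}};
    return PlanRebalance(nodes).size() == 1 ? 0 : 1;
}
```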
Retiring A Node
- Want to schedule downtime on chunkserver nodes for maintenance
- Retiring a chunkserver proactively replicates data off of it
- There will be sufficient copies of the data when the chunkserver is taken offline
KFS Software
- Code is released under the Apache 2.0 license
- Software is hosted on SourceForge: http://kosmosfs.sourceforge.net
- Current release is alpha version 0.2.1
- Contains scripts to simplify deployment:
  - Specify machine configuration
  - Install the software on the machines
  - Script to start/stop servers on remote machines
  - Script to back up metaserver logs/checkpoints to a remote machine
Hadoop+KFS
- KFS is integrated with Hadoop using the FileSystem API
- Allows existing Hadoop apps to use KFS seamlessly
KFS @ Quantcast
- Two deployments
- 130-node cluster hosting log data
  - ~2M files; 70TB of data; WORM system
  - Metaserver uses ~2GB RAM
  - ~1TB of data copied in during a week
  - Used for daily jobs in read mode
- Plan is to use both KFS and HDFS
  - For job output, back up from KFS to HDFS using Hadoop's distcp
Summary
- KFS is implemented with a set of features:
  - Replication, leases, fault-tolerance, data integrity
  - Basic tools for accessing/monitoring KFS
- A fair bit of work has gone into making the system resilient to failures
- System is incrementally scalable
- Integrated with Hadoop to enable usage
References
- The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
