1. Kosmos Filesystem
   Sriram Rao, July 22, 2008
2. Background
   - KFS was initially designed and implemented at Kosmix in 2006
     - Two developers: myself and Blake Lewis
   - KFS released as an open-source project in Sep. 2007
     - One release-meister/developer/…: myself
     - Lots of interest in the project (1000+ downloads of the code)
   - Quantcast is now the primary sponsor of the project
3. Talk Outline
   - Introduction
   - System Architecture
   - File I/O (reads/writes)
   - Handling failures
   - Software availability
   - KFS+Hadoop
   - Summary
4. Introduction
   - Growing class of applications that process large volumes of data
     - Web search, web log analysis, Web 2.0 apps, grid computing, …
   - Key requirement: cost-efficient, scalable compute/storage infrastructure
   - Our work is focused on building scalable storage infrastructure
5. Workload
   - A few million large files
     - Files are typically tens of MB to a few GB in size
   - Data is written once and read many, many times
   - Files are accessed (mostly) sequentially
6. Approach
   - Build using off-the-shelf commodity PCs
     - ~$30K for a 16-node cluster …
   - Virtualize storage
   - As storage needs grow, scale the system by adding more storage nodes
     - The system adapts to the increased storage automagically
   - Design for crash recovery from the ground up
7. Storage Virtualization
   - Construct a global namespace by decoupling storage from the filesystem namespace
     - Build a "disk" by aggregating the storage from individual nodes in the cluster
     - To improve performance, stripe a file across multiple nodes in the cluster
     - Use replication for tolerating failures
   - Simplify storage management
     - The system automagically balances storage utilization across all nodes
     - Any file can be accessed from any machine in the network
8. Terminology
   - A file consists of a set of chunks
     - Applications are oblivious of chunks
     - I/O is done on files
       - Translation from file offset to chunk/offset is transparent to the application
   - Each chunk is fixed in size
   - Chunk size is 64MB
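A minimal sketch of the offset translation described on this slide: with fixed 64MB chunks, mapping a file offset to a (chunk index, offset within chunk) pair is simple integer arithmetic. The function and constant names are illustrative, not taken from the KFS sources.

```cpp
#include <cstdint>
#include <utility>

// Fixed chunk size, as stated on the slide.
constexpr int64_t kChunkSize = 64LL << 20;   // 64 MB

// Map an absolute file offset to (chunk index within the file, offset within that chunk).
// Illustrative only; the real client library keeps its own internal representation.
std::pair<int64_t, int64_t> ToChunkOffset(int64_t fileOffset) {
    return { fileOffset / kChunkSize, fileOffset % kChunkSize };
}
```

For example, a read at file offset 200MB falls in the file's fourth chunk (index 3), at offset 8MB within that chunk.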
9. System Architecture
   - Single meta-data server that maintains the global namespace
   - Multiple chunkservers that enable access to data
   - Client library linked with applications for accessing files in KFS
   - System implemented in C++
10. Design Choices
   - Inter-process communication is via non-blocking TCP sockets
   - Communication protocol is text-based
     - Patterned after HTTP
   - Connections between meta-server and chunkservers are persistent
     - Simple failure model: connection break implies failure
       - Works for LAN settings
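To make the HTTP-like framing concrete, here is a purely hypothetical request builder in that style; the operation name and header fields below are invented for illustration and are not KFS's actual wire format.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Build an HTTP-like, CRLF-terminated request in the spirit of the slide.
// The verb and header names are invented for illustration only.
std::string MakeAllocateRequest(int64_t seq, const std::string &path, int64_t offset) {
    std::ostringstream os;
    os << "ALLOCATE\r\n"
       << "Cseq: " << seq << "\r\n"
       << "Pathname: " << path << "\r\n"
       << "Chunk-offset: " << offset << "\r\n"
       << "\r\n";               // blank line terminates the request, HTTP-style
    return os.str();
}
```

A text protocol of this shape keeps the servers easy to debug with plain network tools, which fits the "simple failure model" above.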
11. Meta-data Server
   - Maintains the directory tree in memory using a B+ tree
     - The tree records the chunks that belong to a file and the file attributes
     - For each chunk
       - Records the file offset/chunk version
     - The meta-server logs mutations to the tree to a log
     - The log is periodically rolled over (once every 10 mins)
     - An offline process compacts logs to produce a checkpoint file
   - Chunk locations are tracked by the metaserver in an in-core table
   - Chunks are versioned to handle chunkserver failures
   - Periodically sends heartbeats to chunkservers to determine load information as well as responsiveness
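A rough sketch of the bookkeeping this slide describes, with std::map standing in for the B+ tree; the type and field names are chosen for illustration and are not the metaserver's actual types. The tree maps (file-id, chunk start offset) to a chunk-id and version, and a separate in-core table maps chunk-id to the servers currently holding a replica.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins for the metaserver's bookkeeping; names are not from the KFS sources.
struct ChunkInfo {
    int64_t chunkId;
    int64_t version;       // bumped to fence off stale replicas after failures
};

struct FileAttr {
    int64_t fileId;
    int16_t numReplicas;   // per-file replication degree (default 3)
    int64_t fileSize;
};

// "Directory tree": a B+ tree in the real metaserver; a std::map here for brevity.
// Keyed by (file-id, chunk start offset within the file); mutations to it are logged.
using ChunkKey = std::pair<int64_t, int64_t>;
std::map<ChunkKey, ChunkInfo> chunkTree;

// In-core table: chunk-id -> chunkservers currently hosting a replica.
// This table is not checkpointed; it is rebuilt from chunkserver reports after a restart.
std::map<int64_t, std::vector<std::string>> chunkLocations;
```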
12. Crash Recovery
   - Following a crash, restart the metaserver
   - Rebuild the tree using the last checkpoint + log files
   - Chunkservers connect back to the metaserver and send chunk information
     - The metaserver rebuilds the chunk location information
     - The metaserver identifies stale chunks and notifies the appropriate chunkservers
   - The meta-server is a single point of failure
     - To protect the filesystem, back up logs/checkpoint files to remote nodes
     - Will be addressed in a future release
13. Chunk Server
   - Stores chunks as files in the underlying filesystem (such as XFS or ZFS)
     - Chunk size is fixed at 64MB
   - To handle disk corruptions:
     - An Adler-32 checksum is computed on each 64KB block
     - Checksums are validated on each read
   - Each chunk file has a fixed-length header (~5K) for storing checksums and other meta info
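A minimal sketch of the per-block checksumming scheme, using zlib's adler32(); the 64KB block size matches the slide, but the helper names and the idea of returning the checksums in a vector (rather than writing them into the chunk file header) are assumptions, not KFS's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <zlib.h>   // adler32()

// Checksum block size, as stated on the slide.
constexpr size_t kChecksumBlockSize = 64 * 1024;   // 64KB

// Compute one Adler-32 checksum per 64KB block of a chunk's data.
// In KFS these values live in the fixed-length header at the front of the chunk file;
// here they are simply returned to the caller.
std::vector<uint32_t> ComputeBlockChecksums(const std::vector<uint8_t> &chunkData) {
    std::vector<uint32_t> sums;
    for (size_t off = 0; off < chunkData.size(); off += kChecksumBlockSize) {
        const size_t len = std::min(kChecksumBlockSize, chunkData.size() - off);
        sums.push_back(static_cast<uint32_t>(
            adler32(adler32(0L, Z_NULL, 0), chunkData.data() + off, static_cast<uInt>(len))));
    }
    return sums;
}

// On a read, recompute the checksum of each block touched and compare with the stored value.
bool VerifyBlock(const uint8_t *block, size_t len, uint32_t storedSum) {
    return adler32(adler32(0L, Z_NULL, 0), block, static_cast<uInt>(len)) == storedSum;
}
```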
14. Crash Recovery
   - Following a crash, restart the chunkserver
   - The chunkserver scans its chunk directory to determine the chunks it has
     - The chunk filename identifies the owning file-id, chunk-id, and chunk version
   - The chunkserver connects to the metaserver and tells it the chunks/versions it has
   - The metaserver responds with stale chunk IDs (if any)
   - Stale chunks are moved to lost+found
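Since the slide says the on-disk filename encodes the file-id, chunk-id, and version, the startup scan can recover everything it needs from names alone. The dot-separated layout assumed below is illustrative; the actual naming convention may differ.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>

struct ChunkFileName {
    int64_t fileId;
    int64_t chunkId;
    int64_t version;
};

// Parse a chunk filename of the assumed form "<fileid>.<chunkid>.<version>".
// The separator and field order are assumptions for illustration only.
std::optional<ChunkFileName> ParseChunkFileName(const std::string &name) {
    long long fid = 0, cid = 0, ver = 0;
    if (std::sscanf(name.c_str(), "%lld.%lld.%lld", &fid, &cid, &ver) != 3)
        return std::nullopt;
    ChunkFileName c;
    c.fileId = fid;
    c.chunkId = cid;
    c.version = ver;
    return c;
}
```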
15. Data Scrubbing
   - The package contains a tool, chunkscrubber, that can be used to scrub chunks
     - The scrubber verifies checksums and identifies corrupted blocks
   - Support for periodic scrubbing will be added in a future release
     - The scrubber will identify corrupted blocks and they will be moved to lost+found
     - The metaserver will use re-replication to proactively recover lost chunks
16. Client Library
   - The client library interfaces with the metaserver and chunkservers
   - The client library provides a POSIX-like API for the usual file/directory operations:
     - Create, read, write, mkdir, rmdir, etc.
     - For reads/writes, the client library does the translation from file offset to chunk/offset
   - Applications can specify the degree of replication on a per-file basis (default = 3)
   - Java/Python glue code to get at the C++ library
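To give a feel for the POSIX-like API, here is roughly what application code against the client library might look like. The handle type and method names (KfsClient, Connect, Create, Write, Close) are assumptions for illustration with stubbed bodies; they may not match the identifiers in the actual KFS headers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the KFS client handle; the class and method names are
// assumptions, not the real client API.
class KfsClient {
public:
    bool Connect(const std::string &metaHost, int metaPort) { return true; }           // stub
    int  Create(const std::string &path, int numReplicas)   { return 3; }              // stub fd
    long Write(int fd, const char *buf, size_t numBytes)    { return (long)numBytes; } // stub
    void Close(int fd) {}                                                              // stub
};

int main() {
    KfsClient client;
    // Connect to the metaserver, then use descriptor-based calls much like POSIX I/O.
    client.Connect("meta.example.com", 20000);
    const std::string data = "hello, kfs";
    int fd = client.Create("/logs/2008-07-22/part-0", /*numReplicas=*/3);  // per-file replication
    client.Write(fd, data.data(), data.size());
    client.Close(fd);
    return 0;
}
```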
17. Client Library
   - The client library keeps a cache of chunk data
     - Reads: download a chunk from the server and serve requests
     - Writes: buffer the writes and push the data out
     - The chunk buffer is 64MB
18. Features
   - WORM support for archiving data
   - Incremental scalability
   - Load balancing (static) in terms of disk usage of chunkservers
   - "Retire" a chunkserver for scheduled downtime
   - Client-side caching for performance
   - Leases for cache consistency
   - Re-replication for availability
   - Lots of support for handling failures
   - Tools for accessing the KFS tree
   - FUSE support to allow existing fs utils to manipulate the KFS tree
19. WORM Support
   - A configuration variable on the metaserver makes it operate in "WORM" mode
   - Remove/Rename/Rmdir are disallowed on all files except those with a ".tmp" extension
   - Enables KFS to be used as an archival system
20. Handling Writes
   - Follows the well-understood model
     - Application creates a file
       - The file is visible in the directory tree
     - Application writes data to the file
       - The client library caches the write
       - When the cache is full, the write is flushed to chunkservers
     - The application can force data to be pushed to chunkservers by doing a Sync()
     - Data written to a chunkserver becomes visible to other applications
21. Leases
   - The client library gets a read lease on a chunk prior to reading
     - While the client has a read lease, the server promises that the content will not change
   - For writes, a "master" chunkserver is assigned a write lease
     - The master serializes concurrent writes to a chunk
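A small sketch of lease bookkeeping in the spirit of this slide. The lease duration is an arbitrary placeholder (the slides do not state one), and the types are illustrative rather than taken from the KFS sources.

```cpp
#include <chrono>
#include <cstdint>
#include <map>

enum class LeaseType { kRead, kWrite };

struct Lease {
    LeaseType type;
    std::chrono::steady_clock::time_point expires;
};

// Placeholder duration; the actual KFS lease interval is not stated on the slide.
constexpr auto kLeaseDuration = std::chrono::minutes(5);

// Per-chunk lease table kept by the granting side.
std::map<int64_t, Lease> leases;

// Grant (or refresh) a lease on a chunk; a write lease is what makes a chunkserver the "master".
void GrantLease(int64_t chunkId, LeaseType type) {
    leases[chunkId] = Lease{type, std::chrono::steady_clock::now() + kLeaseDuration};
}

bool LeaseValid(int64_t chunkId) {
    auto it = leases.find(chunkId);
    return it != leases.end() && it->second.expires > std::chrono::steady_clock::now();
}
```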
22. Writing Data
   - The client requests a chunk allocation from the meta-server
   - The meta-server allocates a chunk on 3 servers and anoints one server as the "master"
     - The master is given the "write lease" and is responsible for serializing writes to the chunk
   - After the allocation, the client writes to a chunk in two steps:
     - The client pushes data out to the master
     - The data is forwarded amongst the chunkservers in a daisy-chain
     - At the end of the push, the client sends a SYNC message
     - The master responds with a status for the SYNC message
     - If the write fails, the client retries
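A toy, in-memory model of the two-step write: data flows client -> master -> next replica -> next replica in a daisy chain, and the SYNC only succeeds once every replica has applied the data. All names here are illustrative; in the real system each hop runs over the non-blocking TCP connections described earlier.

```cpp
#include <string>
#include <vector>

// Toy in-memory replica; in the real system each of these is a chunkserver reached over TCP.
struct Replica {
    std::string data;                       // contents of the chunk (simplified)
    bool Append(const std::string &payload) {
        data += payload;
        return true;
    }
};

// Each server appends the payload locally and then forwards it to the next server in the chain.
bool ForwardWrite(std::vector<Replica> &chain, size_t idx, const std::string &payload) {
    if (idx >= chain.size())
        return true;                        // past the tail: every replica has the data
    if (!chain[idx].Append(payload))
        return false;                       // a failed hop fails the whole write
    return ForwardWrite(chain, idx + 1, payload);   // daisy-chain hop to the next replica
}

// The client only talks to the master (chain[0]); success here models the master's
// OK status for the client's trailing SYNC.
bool DaisyChainWrite(std::vector<Replica> &chain, const std::string &payload) {
    return ForwardWrite(chain, 0, payload);
}
```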
23. Write Failures
   - When a write fails, the client may force a re-allocation
   - The metaserver changes the version number for the chunk on the servers hosting the chunk
     - Picking a new location can cause chunk replicas to diverge
   - Re-replication is used to bring the chunk's replication level back up
24. Reads
   - Prior to reading data, the client library downloads the locations of a chunk
   - The client library picks a "nearby" server to read data from:
     - Localhost
     - A chunkserver on the same subnet
     - Random, to balance out server load
     - Future: use server load to determine a good location
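A sketch of the locality preference listed above: localhost first, then same subnet, then a random replica. The subnet test is a simplistic string-prefix comparison and the function names are illustrative only.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Crude "same /24 subnet" test: compare everything up to the last dot of a dotted-quad address.
// Illustrative only; a real implementation would compare masked addresses.
static bool SameSubnet(const std::string &a, const std::string &b) {
    return a.substr(0, a.rfind('.')) == b.substr(0, b.rfind('.'));
}

// Pick a replica to read from: prefer localhost, then a server on the same subnet,
// otherwise a random server to spread read load. Assumes the replica list is non-empty.
std::string PickReadServer(const std::vector<std::string> &replicas, const std::string &myIp) {
    if (replicas.empty())
        return "";
    for (const std::string &r : replicas)
        if (r == myIp)
            return r;                               // localhost optimization
    for (const std::string &r : replicas)
        if (SameSubnet(r, myIp))
            return r;                               // same-subnet chunkserver
    return replicas[std::rand() % replicas.size()]; // random fallback
}
```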
25. Heartbeats
   - Once a minute, the metaserver sends a heartbeat message to each chunkserver
   - The chunkserver responds with an ack
   - No ack => non-responsive server
   - No work (reads/writes) is sent to non-responsive servers
26. Placement
   - Placement algorithm: spread chunks across servers while balancing server write load
     - Placement is rack-aware: the 3 copies of a chunk are stored on 3 different racks
     - Localhost optimization: one copy of the data is placed on the chunkserver on the same host as the client doing the write
       - Helps reduce network traffic
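A simplified sketch of rack-aware placement with the localhost optimization: pick the chunkserver co-located with the writer first (if any), then fill the remaining replicas from distinct racks, preferring the least write-loaded candidates. The data structures and selection policy are illustrative, not the metaserver's actual algorithm.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Illustrative view of a chunkserver for placement decisions.
struct ServerInfo {
    std::string host;
    std::string rack;
    int pendingWrites;   // stand-in for "write load"
};

// Choose up to 3 servers on 3 distinct racks, preferring the co-located server first
// and then the least write-loaded server on each remaining rack.
std::vector<ServerInfo> PlaceChunk(std::vector<ServerInfo> servers, const std::string &clientHost) {
    // Sort by write load so the least loaded candidates come first.
    std::sort(servers.begin(), servers.end(),
              [](const ServerInfo &a, const ServerInfo &b) { return a.pendingWrites < b.pendingWrites; });

    std::vector<ServerInfo> chosen;
    std::set<std::string> usedRacks;

    // Localhost optimization: put one copy on the chunkserver running on the client's host.
    for (const ServerInfo &s : servers) {
        if (s.host == clientHost) {
            chosen.push_back(s);
            usedRacks.insert(s.rack);
            break;
        }
    }
    // Fill the remaining replicas from racks not yet used.
    for (const ServerInfo &s : servers) {
        if (chosen.size() == 3) break;
        if (usedRacks.count(s.rack)) continue;
        chosen.push_back(s);
        usedRacks.insert(s.rack);
    }
    return chosen;
}
```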
27. Incr. Scalability
   - Start up a new chunkserver
   - The chunkserver connects to the meta-server
     - Sends a HELLO message
     - Determines which chunks are good
     - Nukes out stale chunks
   - The chunkserver is now part of the system
28. Rebalancing
   - As new nodes are added to the system, disk utilization becomes imbalanced
   - Use re-replication to migrate blocks from over-utilized nodes (> 80% full) to under-utilized nodes (< 20% full)
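A sketch of how the thresholds on this slide might drive rebalancing: partition servers by utilization and pair donors above 80% full with recipients below 20% full. The structures and the simple pairing policy are illustrative only; the real system hands the work to its re-replication machinery.

```cpp
#include <string>
#include <utility>
#include <vector>

struct ServerUtil {
    std::string host;
    double used;       // bytes used
    double capacity;   // total bytes
    double Utilization() const { return used / capacity; }
};

// Pair each over-utilized donor (> 80% full) with an under-utilized recipient (< 20% full).
// Each pair would then drive chunk migration via re-replication.
std::vector<std::pair<std::string, std::string>> PlanRebalance(const std::vector<ServerUtil> &servers) {
    std::vector<std::string> donors, recipients;
    for (const ServerUtil &s : servers) {
        if (s.Utilization() > 0.80) donors.push_back(s.host);
        else if (s.Utilization() < 0.20) recipients.push_back(s.host);
    }
    std::vector<std::pair<std::string, std::string>> moves;
    for (size_t i = 0; i < donors.size() && i < recipients.size(); ++i)
        moves.emplace_back(donors[i], recipients[i]);   // donor -> recipient
    return moves;
}
```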
29. Retiring A Node
   - We want to schedule downtime on chunkserver nodes for maintenance
   - Retiring a chunkserver proactively replicates the data off it
   - There will be sufficient copies of the data when the chunkserver is taken offline
30. KFS Software
   - Code is released under the Apache 2.0 license
   - Software is hosted on SourceForge
     - http://kosmosfs.sourceforge.net
   - Current release is alpha version 0.2.1
   - Contains scripts to simplify deployment
     - Specify the machine configuration
     - Install the software on the machines
     - Script to start/stop servers on remote machines
     - Script to back up metaserver logs/checkpoints to a remote machine
31. Hadoop+KFS
   - KFS is integrated with Hadoop using the FileSystem APIs
   - Allows existing Hadoop apps to use KFS seamlessly
32. KFS @ Quantcast
   - Two deployments:
     - 130-node cluster hosting log data
       - ~2M files; 70TB of data; WORM system
       - The metaserver uses ~2GB RAM
       - ~1TB of data copied in during a week
       - Used for daily jobs in read mode
   - Plan is to use both KFS and HDFS
     - For job output, back up from KFS to HDFS using Hadoop's distcp
33. Summary
   - KFS is implemented with a set of features:
     - Replication, leases, fault-tolerance, data integrity
   - Basic tools for accessing/monitoring KFS
   - A fair bit of work has been done to make the system resilient to failures
   - The system is incrementally scalable
   - Integrated with Hadoop to enable usage
34. References
   - The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
