
Distributed Filesystems Review




  1. Distributed File System Review. Schubert Zhang, May 2008
  2. File Systems <ul><li>Google File System (GFS) </li></ul><ul><li>Kosmos File System (KFS) </li></ul><ul><li>Hadoop Distributed File System (HDFS) </li></ul><ul><li>GlusterFS </li></ul><ul><li>Red Hat Global File System </li></ul><ul><li>Lustre </li></ul><ul><li>Summary </li></ul>
  3. Google File System (GFS)
  4. Google File System (GFS) <ul><li>A file system oriented toward specific applications. </li></ul><ul><ul><li>Search engines. </li></ul></ul><ul><ul><li>Grid computing applications. </li></ul></ul><ul><ul><li>Data mining applications. </li></ul></ul><ul><ul><li>Other applications that generate and process data. </li></ul></ul><ul><li>Workload Characteristics </li></ul><ul><ul><li>Performance, scalability, reliability, and availability requirements. </li></ul></ul><ul><ul><li>Large distributed data-intensive applications. </li></ul></ul><ul><ul><li>Large/huge files (tens of MB to tens of GB in size). </li></ul></ul><ul><ul><li>Primarily write-once/read-many. </li></ul></ul><ul><ul><li>Appending rather than overwriting. </li></ul></ul><ul><ul><li>Mostly sequential access. </li></ul></ul><ul><ul><li>The emphasis is on high sustained throughput of data access rather than low latency. </li></ul></ul><ul><li>System Requirements </li></ul><ul><ul><li>Inexpensive commodity hardware that may often fail. </li></ul></ul><ul><ul><li>Adequate memory for the master server. </li></ul></ul><ul><ul><li>Gigabit Ethernet network interfaces. </li></ul></ul><ul><li>Architecture </li></ul><ul><ul><li>Usually a client and a chunkserver run on the same machine. </li></ul></ul><ul><ul><li>Fixed-size chunks (usually 64MB), which keeps the master's in-memory metadata small. </li></ul></ul><ul><ul><li>Files replicated at chunk granularity (usually 3 replicas). </li></ul></ul><ul><ul><li>A single master and multiple chunkservers, accessed by multiple clients. </li></ul></ul>
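Since chunks are fixed-size, a client can translate any file offset into a chunk index purely by arithmetic, then ask the master only for that chunk's handle and replica locations. A minimal sketch (the 64 MB chunk size matches the slide; the function name is illustrative, not from GFS):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks, as in GFS

def translate(offset: int) -> tuple[int, int]:
    """Translate a byte offset in a file into (chunk index, offset within chunk).

    The client sends the file name and chunk index to the master, which
    replies with the chunk handle and the chunkservers holding replicas.
    """
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at byte 200,000,000 falls in the third chunk (index 2):
index, within = translate(200_000_000)
print(index, within)  # 2 65782272
```

Because the translation is stateless, clients can cache the returned locations and read directly from chunkservers without further master involvement.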
  5. Google File System (GFS) <ul><li>Single master server – the metadata server </li></ul><ul><ul><li>Namespaces (files and chunks) </li></ul></ul><ul><ul><li>File access control info </li></ul></ul><ul><ul><li>Mapping from files to chunks </li></ul></ul><ul><ul><li>Locations of chunk replicas </li></ul></ul><ul><ul><li>Metadata held in memory </li></ul></ul><ul><ul><li>Namespaces and mappings persisted to disk via checkpoints and an operation log. </li></ul></ul><ul><ul><li>Namespace management and locking </li></ul></ul><ul><ul><li>Metadata HA and fault tolerance </li></ul></ul><ul><ul><li>Replica placement: rack-aware replica placement policy </li></ul></ul><ul><ul><li>Chunk creation, re-replication, rebalancing </li></ul></ul><ul><ul><li>Chunkserver management (heartbeats and control) </li></ul></ul><ul><ul><li>Chunk lease management </li></ul></ul><ul><ul><li>Garbage collection </li></ul></ul><ul><ul><li>Minimize the master’s involvement in all operations. </li></ul></ul>
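The rack-aware placement policy above aims to survive the loss of a whole rack. A toy sketch of the idea, assuming only that replicas should land on distinct racks when possible (the real master also weighs disk utilization and recent creation counts; the function and data shapes here are invented for illustration):

```python
import random

def place_replicas(servers: dict[str, str], n: int = 3) -> list[str]:
    """Pick n chunkservers for a new chunk, spreading replicas across racks.

    servers maps chunkserver name -> rack name. This sketch only enforces
    rack diversity; it is not the actual GFS algorithm.
    """
    chosen, used_racks = [], set()
    candidates = list(servers)
    random.shuffle(candidates)
    # First pass: prefer servers on racks not yet holding a replica.
    for s in candidates:
        if len(chosen) == n:
            break
        if servers[s] not in used_racks:
            chosen.append(s)
            used_racks.add(servers[s])
    # Second pass: fill up if there are fewer racks than replicas.
    for s in candidates:
        if len(chosen) == n:
            break
        if s not in chosen:
            chosen.append(s)
    return chosen

servers = {"cs1": "rackA", "cs2": "rackA", "cs3": "rackB", "cs4": "rackC"}
print(place_replicas(servers))  # three servers, on three distinct racks here
```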
  6. Google File System (GFS) <ul><li>Large number of chunkservers </li></ul><ul><ul><li>No cache for file data </li></ul></ul><ul><ul><li>Lazy chunk allocation </li></ul></ul><ul><ul><li>Leases and a data replication chain </li></ul></ul><ul><ul><li>Block checksums </li></ul></ul><ul><ul><li>Chunk state reports </li></ul></ul><ul><ul><li>P2P replication: replication pipelining and cloning </li></ul></ul><ul><li>Large number of clients </li></ul><ul><ul><li>Linked into each application. </li></ul></ul><ul><ul><li>Interact with the master for metadata operations </li></ul></ul><ul><ul><li>Data-bearing communication goes directly to the chunkservers </li></ul></ul><ul><ul><li>No cache for file data, but metadata is cached. </li></ul></ul><ul><ul><li>Translate an operation's offset into a chunk index. </li></ul></ul><ul><ul><li>Applications/clients work around the limitations of the GFS implementation. </li></ul></ul>
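Replication pipelining means data is pushed through a chain of chunkservers piece by piece: each server stores a piece locally and immediately forwards it to the next hop, so all replicas fill up nearly in parallel instead of one after another. A toy simulation (lists stand in for chunkservers; all names are invented):

```python
def push_through_chain(data: bytes, chain: list[list[bytes]], piece: int = 4) -> None:
    """Push data through a replication chain, piece by piece.

    chain is a list of per-chunkserver buffers. Each piece traverses the
    whole chain before the next piece is sent, modelling the pipelined
    transfer: a server forwards while it is still receiving later pieces.
    """
    for start in range(0, len(data), piece):
        part = data[start:start + piece]
        for buf in chain:      # in reality each hop forwards over the network
            buf.append(part)   # store locally, then pass the piece onward

replicas = [[], [], []]        # primary plus two secondaries
push_through_chain(b"hello world!", replicas)
assert all(b"".join(buf) == b"hello world!" for buf in replicas)
```

With B bytes, R replicas, and link bandwidth T, pipelining brings total transfer time close to B/T rather than the R*B/T of sequential replication.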
  7. Google File System (GFS) <ul><li>Cluster scale and performance </li></ul><ul><ul><li>Thousands of disks on over a thousand machines </li></ul></ul><ul><ul><li>Hundreds of TB or several PB of storage </li></ul></ul><ul><ul><li>Hundreds or thousands of clients </li></ul></ul><ul><li>Limitations </li></ul><ul><ul><li>No standard API such as POSIX. </li></ul></ul><ul><ul><li>File system operations are not fully integrated. </li></ul></ul><ul><ul><li>Some performance issues depend on the application and client implementation. </li></ul></ul><ul><ul><li>GFS does not guarantee that all replicas are byte-wise identical; it only guarantees that data is written at least once as an atomic unit. Record append is atomic at least once, and GFS may insert padding or duplicate records in between. </li></ul></ul><ul><ul><li>Applications/clients may read from a stale chunk replica (readers must deal with this). </li></ul></ul><ul><ul><li>If an application write is large or straddles a chunk boundary, it may become interleaved with fragments from other clients. </li></ul></ul><ul><ul><li>Needs tight cooperation from applications. </li></ul></ul><ul><ul><li>Does not support hard links or soft links. </li></ul></ul>
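The at-least-once append semantics push work onto readers: they must skip the padding GFS inserts after failed appends and discard duplicate records from retried appends, typically using per-record checksums and unique record IDs. A minimal sketch of such a reader (the tuple encoding and "PAD" marker are invented stand-ins for real record framing):

```python
def read_records(raw: list[tuple[str, str]]) -> list[str]:
    """Filter a chunk's records the way a GFS reader must.

    raw is a list of (record_id, payload); the id "PAD" marks padding.
    At-least-once append means the same record can appear twice and
    padding can sit between records; the reader skips padding and drops
    duplicates by unique record id.
    """
    seen, out = set(), []
    for rid, payload in raw:
        if rid == "PAD":          # padding left by a failed append attempt
            continue
        if rid in seen:           # duplicate from a retried append
            continue
        seen.add(rid)
        out.append(payload)
    return out

raw = [("r1", "a"), ("PAD", ""), ("r1", "a"), ("r2", "b")]
print(read_records(raw))  # ['a', 'b']
```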
  8. Google File System (GFS) <ul><li>Needs further components to achieve completeness </li></ul><ul><ul><li>Chubby (distributed lock service and consistency) </li></ul></ul><ul><ul><li>BigTable (a distributed storage system for structured data) </li></ul></ul><ul><ul><li>etc. </li></ul></ul>
  9. Kosmos File System (KFS) <ul><li>An open-source implementation of the Google File System </li></ul>
  10. Kosmos File System (KFS) <ul><li>Architecture </li></ul><ul><ul><li>Meta-data server = Google FS master </li></ul></ul><ul><ul><li>Block server = Google FS chunkserver </li></ul></ul><ul><ul><li>Client library = Google FS client </li></ul></ul><ul><li>Workload characteristics </li></ul><ul><ul><li>Primarily write-once/read-many workloads </li></ul></ul><ul><ul><li>A few million large files, each on the order of a few tens of MB to a few tens of GB in size </li></ul></ul><ul><ul><li>Mostly sequential access </li></ul></ul><ul><li>Implemented in C++ </li></ul><ul><ul><li>Client APIs for C++, Java, and Python </li></ul></ul>
  11. Kosmos File System (KFS) <ul><li>Valued Stuff </li></ul><ul><ul><li>Client write cache (Google deemed it unnecessary) </li></ul></ul><ul><ul><li>FUSE support: KFS exports a POSIX file interface; Hadoop does not (nor does GFS) </li></ul></ul><ul><ul><li>Monitoring tools and a shell </li></ul></ul><ul><ul><li>Deployment scripts </li></ul></ul><ul><ul><li>Job placement and local-read optimization </li></ul></ul><ul><ul><li>Can be integrated with Hadoop: replaces HDFS while using Hadoop's map-reduce (patch in Hadoop JIRA HADOOP-1963) </li></ul></ul><ul><ul><li>KFS supports atomic append; HDFS does not </li></ul></ul><ul><ul><li>KFS supports rebalancing; HDFS does not </li></ul></ul><ul><li>Status and Limitations </li></ul><ul><ul><li>Not well implemented yet. </li></ul></ul><ul><ul><li>No real users </li></ul></ul><ul><ul><li>We failed to build a usable program from it. </li></ul></ul><ul><ul><li>Similar limitations to Google FS. </li></ul></ul>
  12. Kosmos File System (KFS) <ul><li>The client supports FUSE </li></ul>
  13. Hadoop Distributed File System (HDFS) <ul><li>An open-source implementation of the Google File System </li></ul><ul><li>HDFS relaxes a few POSIX requirements to enable streaming access to file system data. </li></ul><ul><li>Originated as infrastructure for Apache Nutch. </li></ul><ul><li>“Moving Computation is Cheaper than Moving Data” </li></ul><ul><li>Portable across heterogeneous hardware and software platforms; implemented in Java. </li></ul><ul><ul><li>Java client API </li></ul></ul><ul><ul><li>C language wrapper for the Java API </li></ul></ul><ul><ul><li>HTTP browser interface </li></ul></ul><ul><li>Architecture (master/slave) </li></ul><ul><ul><li>Namenode = Google FS master server </li></ul></ul><ul><ul><li>Datanodes = Google FS chunkservers </li></ul></ul><ul><ul><li>Clients = Google FS clients </li></ul></ul><ul><ul><li>Blocks = Google FS chunks </li></ul></ul><ul><li>Namenode safe mode </li></ul><ul><li>Persists file system metadata like Google FS </li></ul><ul><ul><li>Does not yet support periodic checkpoints. </li></ul></ul><ul><li>Communication protocols </li></ul><ul><ul><li>RPCs </li></ul></ul><ul><li>Staging: client-side data buffering (akin to a POSIX implementation) </li></ul>
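The staging point deserves a sketch: the HDFS client buffers writes locally and only contacts the namenode and ships data once a full block (or close()) accumulates, keeping namenode traffic and network round-trips low. The class below is a hypothetical illustration of that flow, not the real Hadoop API:

```python
class StagingWriter:
    """Sketch of HDFS-style client staging (invented class, not Hadoop's API).

    Writes accumulate in a local buffer; a block is shipped only when a
    full block's worth of data has arrived, or on close().
    """
    def __init__(self, block_size: int = 64 * 1024 * 1024):
        self.block_size = block_size
        self.buffer = bytearray()
        self.shipped_blocks: list[bytes] = []  # stand-in for datanode writes

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)
        while len(self.buffer) >= self.block_size:
            self._ship(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]

    def close(self) -> None:
        if self.buffer:          # flush the final partial block
            self._ship(bytes(self.buffer))
            self.buffer.clear()

    def _ship(self, block: bytes) -> None:
        # Real client: ask the namenode for a target datanode, then stream.
        self.shipped_blocks.append(block)

w = StagingWriter(block_size=8)
w.write(b"0123456789")   # first 8 bytes ship as a full block
w.close()                # remaining 2 bytes ship on close
print([len(b) for b in w.shipped_blocks])  # [8, 2]
```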
  14. Hadoop Distributed File System (HDFS)
  15. Hadoop Distributed File System (HDFS) <ul><li>Status and Limitations </li></ul><ul><ul><li>Similar limitations to Google FS. </li></ul></ul><ul><ul><li>Does not yet support appending writes to files. </li></ul></ul><ul><ul><li>Does not yet implement user quotas or access permissions. </li></ul></ul><ul><ul><li>Replica placement policy is not complete. </li></ul></ul><ul><ul><li>Does not yet support periodic checkpoints of metadata. </li></ul></ul><ul><ul><li>Does not yet support rebalancing. </li></ul></ul><ul><ul><li>Does not yet support snapshots. </li></ul></ul><ul><li>Who’s using HDFS </li></ul><ul><ul><li>Facebook (implements a read-only FUSE layer over HDFS; 300 nodes) </li></ul></ul><ul><ul><li>Yahoo! (1000 nodes) </li></ul></ul><ul><ul><li>Some non-commercial usage (log analysis, search, etc.) </li></ul></ul>
  16. GlusterFS <ul><li>Gluster targets specific tasks such as HPC clustering, storage clustering, enterprise provisioning, database clustering, etc. </li></ul><ul><ul><li>GlusterFS </li></ul></ul><ul><ul><li>GlusterHPC </li></ul></ul>
  17. GlusterFS
  18. GlusterFS
  19. GlusterFS
  20. GlusterFS <ul><li>Architecture </li></ul><ul><ul><li>Different from the GoogleFS family. </li></ul></ul><ul><ul><li>No metadata, no master server. </li></ul></ul><ul><ul><li>A user-space logical volume management scheme. </li></ul></ul><ul><ul><li>Server machines export disk storage as bricks; the brick nodes store distributed files in an underlying Linux file system. </li></ul></ul><ul><ul><li>File namespaces are also stored on storage bricks, just like the file-data bricks, except that those files are zero-sized. </li></ul></ul><ul><ul><li>Bricks (file data or namespaces) support replication. </li></ul></ul><ul><ul><li>NFS-like disk layout </li></ul></ul><ul><li>Interconnect </li></ul><ul><ul><li>InfiniBand RDMA (high throughput) </li></ul></ul><ul><ul><li>TCP/IP </li></ul></ul><ul><li>Features </li></ul><ul><ul><li>Supports FUSE; complete POSIX interface. </li></ul></ul><ul><ul><li>AFR (mirroring) </li></ul></ul><ul><ul><li>Self-heal </li></ul></ul><ul><ul><li>Striping (note: not well implemented) </li></ul></ul>
  21. GlusterFS <ul><li>Valued Stuff </li></ul><ul><ul><li>Easy to set up for a moderate cluster. </li></ul></ul><ul><ul><li>FUSE and POSIX </li></ul></ul><ul><ul><li>Scheduler modules for balancing </li></ul></ul><ul><ul><li>Flexible performance tuning </li></ul></ul><ul><ul><li>Design: </li></ul></ul><ul><ul><ul><li>Stackable modules (translators), implemented as run-time .so plugins. </li></ul></ul></ul><ul><ul><ul><li>Not tied to I/O profiles, hardware, or OS </li></ul></ul></ul><ul><ul><li>Well tested, with several representative benchmarks. </li></ul></ul><ul><ul><li>Performance and simplicity are better than Lustre’s. </li></ul></ul><ul><li>Limitations </li></ul><ul><ul><li>Lacks a global management function; there is no master. </li></ul></ul><ul><ul><li>The AFR function depends on configuration; it lacks automation and flexibility. </li></ul></ul><ul><ul><li>Currently cannot add new bricks automatically. </li></ul></ul><ul><ul><li>If a master component were added, it would be a better cluster FS. </li></ul></ul><ul><li>Who’s using GlusterFS </li></ul><ul><ul><li>Indian Institute of Technology Kanpur: 24-brick GlusterFS storage on InfiniBand. </li></ul></ul><ul><ul><li>Other small cluster projects. </li></ul></ul>
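With no master, file placement must be computable by every client independently; GlusterFS handles this with pluggable scheduler modules (e.g. ALU, round-robin, random) in its unify translator. The sketch below illustrates the simplest deterministic variant, hashing the path to a brick; it is an assumption-laden toy, not any actual GlusterFS scheduler:

```python
import hashlib

def pick_brick(path: str, bricks: list[str]) -> str:
    """Hash-based brick selection: with no metadata server, every client
    computes a file's brick from its path alone, so all clients agree
    without any coordination.
    """
    digest = hashlib.md5(path.encode()).digest()
    return bricks[int.from_bytes(digest[:4], "big") % len(bricks)]

bricks = ["brick1:/data", "brick2:/data", "brick3:/data"]
# Any client computes the same brick for the same path:
assert pick_brick("/home/a.txt", bricks) == pick_brick("/home/a.txt", bricks)
print(pick_brick("/home/a.txt", bricks))
```

The downside, noted in the Limitations above, is that purely client-computed placement makes adding bricks hard: changing the brick list changes where existing paths hash to.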
  22. Red Hat Global File System <ul><li>Part of the Red Hat Cluster Suite </li></ul><ul><li>A shared-storage solution, i.e. a traditional approach. </li></ul><ul><li>Depends on </li></ul><ul><ul><li>Red Hat Cluster Suite components </li></ul></ul><ul><ul><li>Configuration and management functions </li></ul></ul><ul><ul><ul><li>Conga (luci and ricci) </li></ul></ul></ul><ul><ul><li>CLVM (cluster logical volume manager) </li></ul></ul><ul><ul><li>DLM (distributed lock manager) </li></ul></ul><ul><ul><li>GNBD (global network block device) </li></ul></ul><ul><ul><li>SAN/NAS/DAS </li></ul></ul>
  23. Red Hat Global File System <ul><li>Deployment options </li></ul><ul><ul><li>GFS with a SAN (superior performance and scalability) </li></ul></ul><ul><ul><li>GFS and GNBD with a SAN (performance, scalability, moderate price) </li></ul></ul><ul><ul><li>GFS and GNBD with directly connected storage (economy and performance) </li></ul></ul>
  24. Red Hat Global File System <ul><li>GFS Functions </li></ul><ul><ul><li>Making a File System </li></ul></ul><ul><ul><li>Mounting a File System </li></ul></ul><ul><ul><li>Unmounting a File System </li></ul></ul><ul><ul><li>GFS Quota Management </li></ul></ul><ul><ul><li>Growing a File System </li></ul></ul><ul><ul><li>Adding Journals to a File System </li></ul></ul><ul><ul><li>Direct I/O </li></ul></ul><ul><ul><li>Data Journaling </li></ul></ul><ul><ul><li>Configuring atime Updates </li></ul></ul><ul><ul><li>Suspending Activity on a File System </li></ul></ul><ul><ul><li>Displaying Extended GFS Information and Statistics </li></ul></ul><ul><ul><li>Repairing a File System </li></ul></ul><ul><ul><li>Context-Dependent Path Names (CDPN) </li></ul></ul><ul><li>Cluster Volume Management </li></ul><ul><ul><li>Aggregates multiple physical volumes into a single logical device across all nodes in a cluster. </li></ul></ul><ul><ul><li>Provides a logical view of the storage to GFS. </li></ul></ul><ul><li>Lock Management </li></ul><ul><li>Cluster Management, Fencing, and Recovery </li></ul><ul><li>Cluster Configuration Management </li></ul>
  25. Red Hat Global File System <ul><li>Status </li></ul><ul><ul><li>It is a shared-storage solution. </li></ul></ul><ul><ul><li>The solution is far from our target. </li></ul></ul><ul><ul><li>A little too complicated and not easy to manage. </li></ul></ul><ul><ul><li>High performance and scalability require high-end storage hardware and networking (e.g. a SAN). </li></ul></ul><ul><ul><li>The implementation is not simple. </li></ul></ul>
  26. Lustre <ul><li>Sun Microsystems </li></ul><ul><li>Targets 10,000s of nodes, PBs of storage, and 100 GB/s throughput. </li></ul><ul><li>Lustre is kernel software that interacts with storage devices. A Lustre deployment must be correctly installed, configured, and administered to reduce the risk of security issues or data loss. </li></ul><ul><li>It uses Object-Based Storage Devices (OSDs) to manage entire file objects (inodes) instead of blocks. </li></ul><ul><li>Components </li></ul><ul><ul><li>Meta Data Servers (MDSs) </li></ul></ul><ul><ul><li>Object Storage Targets (OSTs) </li></ul></ul><ul><ul><li>Lustre clients </li></ul></ul><ul><li>Lustre is a little too complex to use. </li></ul><ul><li>But it appears to be a proven and reliable file system. </li></ul>
  27. Lustre OSD Architecture
  28. Summary <ul><li>Shared </li></ul><ul><li>Cluster </li></ul><ul><li>Parallel </li></ul><ul><li>Cloud </li></ul>
  29. Summary <ul><li>Cluster Volume Managers </li></ul><ul><li>SAN File Systems </li></ul><ul><li>Cluster File Systems </li></ul><ul><li>Parallel NFS (pNFS) </li></ul><ul><li>Object-based Storage Devices (OSD) </li></ul><ul><li>Global/Parallel File Systems </li></ul><ul><li>Distributed/Cluster/Parallel levels </li></ul><ul><ul><li>Volume level (block based) </li></ul></ul><ul><ul><li>File or file system level (file, block, or object (for OSD) based) </li></ul></ul><ul><ul><li>Database or application level </li></ul></ul><ul><li>Directly at the storage or in the network </li></ul>
  30. Summary <ul><li>Traditional/Historical </li></ul><ul><ul><li>Block level: Volume Management </li></ul></ul><ul><ul><ul><li>EMC PowerPath (PPVM) </li></ul></ul></ul><ul><ul><ul><li>HP Shared LVM </li></ul></ul></ul><ul><ul><ul><li>IBM LVM </li></ul></ul></ul><ul><ul><ul><li>MACROIMPACT SAN CVM </li></ul></ul></ul><ul><ul><ul><li>REDHAT LVM </li></ul></ul></ul><ul><ul><ul><li>SANBOLIC LaScala </li></ul></ul></ul><ul><ul><ul><li>VERITAS </li></ul></ul></ul><ul><ul><li>File/File System level: </li></ul></ul><ul><ul><ul><li>Local disk FS </li></ul></ul></ul><ul><ul><ul><li>Distributed: NAS, Samba, AFP, DFS, AFS, RFS, Coda… </li></ul></ul></ul><ul><ul><ul><li>SAN FS </li></ul></ul></ul><ul><ul><li>App/DB level: RDBMS, email systems </li></ul></ul><ul><li>Advanced/Recent: File/FS level </li></ul><ul><ul><li>Distributed: WAFS (NAS extension), NFM, GlobalFS, SAN FS, Cluster FS </li></ul></ul>