Lustre: A Scalable Clustered Filesystem


I gave this presentation on Lustre in 2008 while working at Sun Microsystems in the Lustre group. Lustre is an open-source, scalable, clustered filesystem that runs on 7 of the top 10 supercomputers. This presentation describes the architecture, operation, and features of Lustre.


  1. 1. Lustre: A Scalable Clustered Filesystem Kalpak Shah 1
  2. 2. TOPICS: Lustre Introduction; Lustre Internals; Hands-on; Lustre Features currently under production; Lustre roadmap: deep-dive into some interesting features; Contributing; Q&A 2
  3. 3. Lustre: Introduction Storage architecture for clusters. A cluster file system is a shared file system like NFS. Capable of scaling up to tens of thousands of clients, petabytes of storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Object-based storage. Complete software-only solution for commodity hardware. No single point of failure, 100% POSIX compliant. Powering seven of the top ten HPC clusters. Lustre is open-source software, licensed under the GNU GPL. 3
  4. 4. Lustre: Components Metadata servers (MDS): Manage the names and directories in the filesystem. Export one or more Metadata Targets (MDTs – disk partitions). Object storage servers (OSS): Provide the file I/O service. Export one or more Object Storage Targets (OSTs – disk partitions). Clients: Mount and use the filesystem. Computation or desktop nodes. 4
  5. 5. LUSTRE: Components 5
  6. 6. Where are the files? The MDS stores inodes in ext3 MDT file systems. The MDS inodes have the expected filenames. Extended attributes (EAs) point to objects on OSTs. OSS data objects are ext3 file inodes on OST file systems. Objects have normal file data management, but the filename is just a number. 6
  7. 7. Interaction between components 7
  8. 8. Lustre: Software components 8
  9. 9. High Availability OSS is active/active. MDS is active/passive. External software such as Heartbeat and STONITH is required to facilitate failover. Failover can also be used to upgrade the software without cluster downtime. 9
  10. 10. Configuring a small Lustre cluster On the MDS: $ mkfs.lustre --mdt --mgs --fsname=test-fs /dev/sda $ mount -t lustre /dev/sda /mnt/mdt On OSS1: $ mkfs.lustre --ost --fsname=test-fs – /dev/sdb $ mount -t lustre /dev/sdb /mnt/ost1 On OSS2: $ mkfs.lustre --ost --fsname=test-fs – /dev/sdc $ mount -t lustre /dev/sdc /mnt/ost2 On clients: $ mount -t lustre /mnt/lustre 10
  11. 11. Architect a 50GB/sec cluster 64 OSS servers, each with two 8-TB targets (OSTs), give a filesystem capacity of 64 x 16 TB = 1 PB. Suppose each OSS uses 16 1-TB disks, each disk providing 50 MB/sec: that is 800 MB/sec of disk bandwidth per OSS. With a system network such as InfiniBand, each OSS can then provide 800 MB/sec of end-to-end I/O throughput. Aggregate bandwidth of the cluster: 64 x 800 MB/sec = 51,200 MB/sec, or about 50 GB/sec. The MDS should have lots of RAM and at least four processor cores. An external journal gives a 20% performance gain. 11
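The sizing arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the slide's figures, the function names are made up for illustration, and binary prefixes (1 PB = 1024 TB, 1 GB = 1024 MB) are an assumption chosen because they make the slide's rounded totals come out exactly.

```python
# Back-of-the-envelope sizing for the cluster described on the slide.
# Inputs are the slide's figures; function names are illustrative only.

NUM_OSS = 64            # object storage servers
OSTS_PER_OSS = 2        # targets (OSTs) per OSS
OST_SIZE_TB = 8         # capacity of each OST in TB
DISKS_PER_OSS = 16      # 1-TB disks behind each OSS
DISK_MB_PER_SEC = 50    # bandwidth of a single disk in MB/sec


def cluster_capacity_tb():
    """Total filesystem capacity in TB."""
    return NUM_OSS * OSTS_PER_OSS * OST_SIZE_TB


def oss_bandwidth_mb():
    """Disk bandwidth of one OSS in MB/sec."""
    return DISKS_PER_OSS * DISK_MB_PER_SEC


def aggregate_bandwidth_mb():
    """Aggregate bandwidth in MB/sec, assuming the network is not the bottleneck."""
    return NUM_OSS * oss_bandwidth_mb()


if __name__ == "__main__":
    # Assumption: binary prefixes, so 1024 TB = 1 PB and 1024 MB = 1 GB.
    print("capacity (PB):", cluster_capacity_tb() / 1024)
    print("per-OSS bandwidth (MB/sec):", oss_bandwidth_mb())
    print("aggregate bandwidth (GB/sec):", aggregate_bandwidth_mb() / 1024)
```

The aggregate figure only holds if the interconnect can carry each server's full disk bandwidth, which is the point of the slide's remark about the system network.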
  12. 12. Features currently under production - 1.8 Adaptive timeouts OST Pools Changelogs 12
  13. 13. Adaptive timeouts On large clusters (>10000 nodes) extreme server load became indistinguishable from server death. Adaptive timeouts modify RPC timeouts based on server load: timeouts increase as server load increases and vice versa. Service time histories are tracked on all servers, and estimates for future RPCs are reported back to clients. Servers send early replies to clients asking for more time if RPCs queued on the server near their timeouts. Adaptive timeouts offer these benefits: Relieves users from having to tune the obd_timeout value. Reduces RPC timeouts and disconnect/reconnect cycles. 13
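The idea behind adaptive timeouts can be illustrated with a toy model. The sketch below is not Lustre's implementation: the class name, the "worst recent service time times a margin" rule, and every parameter are assumptions chosen only to show a timeout that follows server load up and down, plus an early-reply check.

```python
from collections import deque


class ServiceTimeEstimator:
    """Toy model of a per-service history of RPC service times (hypothetical)."""

    def __init__(self, history_size=8, margin=1.5, min_timeout=1.0, max_timeout=600.0):
        # Only the most recent `history_size` samples are kept, so old load
        # spikes age out and the timeout can shrink again.
        self.history = deque(maxlen=history_size)
        self.margin = margin
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout

    def record(self, service_time):
        """Record how long the server took to service one RPC (seconds)."""
        self.history.append(service_time)

    def timeout(self):
        """Timeout to report to clients: worst recent service time plus margin."""
        if not self.history:
            return self.min_timeout
        raw = max(self.history) * self.margin
        return min(max(raw, self.min_timeout), self.max_timeout)

    def needs_early_reply(self, queued_for, client_timeout, threshold=0.8):
        """True if a queued RPC is close enough to its timeout to ask for more time."""
        return queued_for >= threshold * client_timeout


if __name__ == "__main__":
    est = ServiceTimeEstimator(history_size=3, margin=2.0)
    for sample in (0.5, 4.0, 12.0):
        est.record(sample)
        print("service time", sample, "-> timeout", est.timeout())
```

Because the history is bounded, the estimate falls back once a burst of slow requests has aged out, which mirrors the "and vice versa" on the slide.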
  14. 14. OST Pools Allow an administrator to name a group of OSTs for file striping purposes. Pools can separate heterogeneous OSTs within the same filesystem: fast vs. slow disks; local network vs. remote network; type of RAID backing storage; specific OSTs for a user/group/application. Pool usage is defined and stored along with the rest of the striping information (stripe count, stripe size) for directories or individual files. OSTs can be added to and removed from a pool at any time. OSTs can be associated with multiple pools. 14
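To make the relationship between pools and striping concrete, here is a simplified model in Python. It assumes plain round-robin (RAID-0 style) striping over OSTs drawn from a named pool; the pool names, OST labels, and both functions are hypothetical, and Lustre's real object allocator is more involved than this.

```python
# Hypothetical pools: names and OST labels are made up for illustration.
# Note that an OST may belong to more than one pool.
POOLS = {
    "fast": ["OST0000", "OST0001"],
    "slow": ["OST0002", "OST0003", "OST0004"],
    "projectx": ["OST0001", "OST0004"],
}


def choose_stripe_osts(pool_osts, stripe_count, start=0):
    """Pick up to `stripe_count` OSTs from a pool, round-robin from `start`."""
    if not pool_osts:
        raise ValueError("pool has no OSTs")
    count = min(stripe_count, len(pool_osts))
    return [pool_osts[(start + i) % len(pool_osts)] for i in range(count)]


def locate(offset, stripe_size, osts):
    """Map a file byte offset to (OST, byte offset inside that OST's object)."""
    stripe_number = offset // stripe_size
    ost = osts[stripe_number % len(osts)]
    # Each full round over all OSTs adds one stripe to every object.
    object_offset = (stripe_number // len(osts)) * stripe_size + offset % stripe_size
    return ost, object_offset


if __name__ == "__main__":
    MIB = 1024 * 1024
    layout = choose_stripe_osts(POOLS["slow"], stripe_count=2, start=2)
    print("layout:", layout)
    for off in (0, MIB + 5, 2 * MIB + 7):
        print(off, "->", locate(off, MIB, layout))
```

Restricting the candidate list to one pool before striping is the whole effect of a pool in this model: the stripe count and stripe size work exactly as they would over the full set of OSTs.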
  15. 15. Changelogs - I A changelog is a log of data or metadata changes used to track filesystem operations. Types of changelogs: MDT changes – create, delete, rename, unlink, open, close, hardlink, softlink, etc. Auditing – access times, pid, nid, uid, permission failures. Changelogs are stored persistently on disk and removed when all registered consumers have purged them. Changelogs are updated transactionally, i.e., they are made part of the actual filesystem transaction. On-demand 15
  16. 16. Changelogs - II Uses: Replication – propagate changes from a master filesystem to one or more replicas. Auditing – record auditable actions (file create, access violations, quota excesses). HSM – for policy decisions (copyout, purge). Rules-based changelogs: Per server – for example, files striped on a particular server. Per fileset. Per user, group. 16
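The retention rule for changelogs (records persist until every registered consumer has purged them) can be sketched as a small in-memory model. The class, its method names, and the record layout below are assumptions for illustration and do not reflect Lustre's on-disk format or its actual interface.

```python
class Changelog:
    """Toy changelog: records stay until all registered consumers cleared them."""

    def __init__(self):
        self.records = []      # list of (index, record type, target)
        self.next_index = 1
        self.consumers = {}    # consumer id -> highest record index it has cleared

    def register(self, consumer_id):
        """Register a consumer; it starts with nothing cleared beyond the current tail."""
        self.consumers[consumer_id] = self.next_index - 1

    def append(self, rec_type, target):
        """Log one filesystem operation and return its index."""
        index = self.next_index
        self.records.append((index, rec_type, target))
        self.next_index += 1
        return index

    def read(self, consumer_id):
        """Records this consumer has not cleared yet."""
        cleared = self.consumers[consumer_id]
        return [r for r in self.records if r[0] > cleared]

    def clear(self, consumer_id, up_to_index):
        """Mark records up to `up_to_index` as processed by this consumer."""
        self.consumers[consumer_id] = max(self.consumers[consumer_id], up_to_index)
        # A record is dropped only once every consumer has cleared it.
        low_water = min(self.consumers.values())
        self.records = [r for r in self.records if r[0] > low_water]


if __name__ == "__main__":
    log = Changelog()
    log.register("replicator")
    log.register("auditor")
    log.append("CREATE", "a")
    log.append("RENAME", "a->b")
    log.clear("replicator", 2)
    print("still retained for the auditor:", log.records)
```

A slow consumer therefore pins the log: in this model, nothing past its last cleared index is ever discarded, whatever the other consumers have done.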
  17. 17. Other features in Lustre 1.8 OSS read cache Version based recovery Directory readahead Interoperability changes 17
  18. 18. Future Features Clustered Metadata Hardened filesystem Filesets 18
  19. 19. Clustered Metadata - I Introduces ability to have more than 1 metadata server – scalable metadata performance. Salient design features: Inode groups Special directory format to hold inode group and identifier (FID) Logical Metadata Volume (LMV) layer – figures out which MDS to use 19
  20. 20. Clustered Metadata - II Single metadata protocol for client-MDS changes and MDS-MDS changes. Single recovery protocol for client-MDS and MDS-MDS service. Directories can be split across multiple MDS nodes. Cross-MDS operations. Metadata throughput can be increased by adding servers on the fly. 20
  21. 21. Hardened filesystem Drive capacity has been doubling every 2 years but drive error rates have stayed constant Failing hardware – network cards, cables, faulty firmware Data corruption on the rise – need to detect and correct if possible Journal checksums A checksum is stored for each on-disk journal transaction to avoid replaying corrupt journals. End-to-end data checksumming Guard against data corruption over the network – Lustre client can perform end-to-end checksums. Various types of checksum algorithms – adler32, CRC32 Persistent checksums by backend filesystem ZFS-based Lustre 21
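End-to-end checksumming can be demonstrated with Python's standard zlib module, which provides both algorithms named on the slide. This is a minimal sketch of the idea only: the function names and the (data, algorithm, checksum) tuple standing in for an RPC are hypothetical, not Lustre's wire protocol.

```python
import zlib

# Both functions are part of Python's standard zlib module.
CHECKSUMS = {"adler32": zlib.adler32, "crc32": zlib.crc32}


def client_send(data, algo="adler32"):
    """Client side: compute a checksum over the payload before it leaves the node."""
    return data, algo, CHECKSUMS[algo](data)


def server_receive(data, algo, expected):
    """Server side: recompute the checksum and reject a payload that does not match."""
    actual = CHECKSUMS[algo](data)
    if actual != expected:
        raise IOError(f"{algo} mismatch: expected {expected:#010x}, got {actual:#010x}")
    return data


if __name__ == "__main__":
    payload = b"lustre" * 1000
    data, algo, checksum = client_send(payload, "crc32")
    server_receive(data, algo, checksum)      # intact payload is accepted

    damaged = bytearray(data)
    damaged[10] ^= 0x01                       # simulate a single flipped bit in transit
    try:
        server_receive(bytes(damaged), algo, checksum)
    except IOError as err:
        print("corruption detected:", err)
```

A checksum like this detects corruption but cannot repair it; the usual response is for the sender to retransmit the data.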
  22. 22. Filesets A user application may want to perform an action on a very large set of files, for example: migration to slower storage; purging old files; replication of a subset to a proxy server. A fileset is an efficient representation of these file identifiers (FIDs). Defining and maintaining filesets will be done by external agents that will be consumers of Lustre changelogs. Types of filesets: Enumeration of files and directories. Inclusive file trees. Filesets as objects – operations on filesets. Maintenance. Client access – mount -t lustre mgs://fsname/fileset mntpt 22
  23. 23. Lustre roadmap HSM (Hierarchical Storage Management) ZFS-based Lustre Security Windows native client Solaris client Proxy servers Writeback cache 23
  24. 24. Do Read - for design discussions Open bugzilla: Check out our open CVS repository: 24
  25. 25. QUESTIONS? 25
  26. 26. Backup slides 26
  27. 27. OSS Read Cache Read-only caching of data on the OSS Uses regular Linux pagecache to store the data Improves repeated reads to match network speeds instead of disk speeds Overhead of this caching is very low. Most importantly, this lays the groundwork for OST write cache for small write aggregation. 27
  28. 28. Write-back cache The metadata writeback cache (MDWBC) will allow clients to delay and batch metadata operations. Advantages: Fewer RPCs. Increased client throughput. Better network and server utilization. The metadata batch is a group of MD operations performed by the client such that: It transforms the filesystem from one consistent state to another. No other client depends on seeing the filesystem in any state where some, but not all, of the MD operations in the batch are in effect. Reintegration. Dependency. Sub-tree locks. 28