How to Improve GFS/GFS2 File System Performance and Prevent Processes from Hanging
Author: John Ruemker, Shane Bradley, and Steven Whitehouse
Editor: Allison Pranger
02/04/2009, 10/12/2010

OVERVIEW
Cluster file systems such as the Red Hat Global File System (GFS) and Red Hat Global File System 2 (GFS2) are complex systems that allow multiple computers (nodes) to simultaneously share the same storage device in a cluster. There can be many reasons why performance does not match expectations. In some workloads or environments, the overhead associated with distributed locking on GFS/GFS2 file systems might affect performance or cause certain commands to appear to hang. This document addresses common problems and how to avoid them, how to discover whether a particular file system is affected by a problem, and how to know if you have found a real bug (rather than just a performance issue). It is intended for users in the design stage of a cluster who want to know how to get the best from a GFS/GFS2 file system, as well as for users of GFS/GFS2 file systems who need to track down a performance problem in the system.

NOTE: This document provides recommended values only. Values should be thoroughly tested before implementing in a production environment. Under some workloads, they might have a negative impact on the performance of GFS/GFS2 file systems.

Environment
• Red Hat Enterprise Linux 4 and later

Terminology
This document assumes a basic knowledge and understanding of file systems in general. The following subsections briefly discuss relevant terminology.

Inodes and Resource Groups
In the framework of GFS/GFS2 file systems, inodes correspond to file-system objects such as files, directories, and symlinks. A resource group corresponds to the way GFS and GFS2 keep track of areas within the file system.
Each resource group contains a number of file-system blocks, and there are bitmaps associated with each resource group that determine whether each block of that resource group is free, allocated for data, or allocated for an inode. Since the file system is shared, the resource-group information/bitmaps and inode information must be kept synchronized between nodes so the file system remains consistent (not corrupted) on all nodes of the cluster.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse
Glocks
A glock (pronounced “gee-lock”) is a cluster-wide GFS lock. GFS/GFS2 file systems use glocks to coordinate locking of file-system resources such as inodes and resource groups. The glock subsystem provides a cache-management function that is implemented using DLM as the underlying communication layer.

Holders
When a process is using a GFS/GFS2 file-system resource, it locks the glock associated with that resource and is said to be holding that glock. Each glock can have a number of holders that each lay claim on that resource. Processes waiting for a glock are considered to be waiting to hold the glock, and they also have holders attached to the glock, but in a waiting state.

THEORY OF OPERATION
Both GFS and GFS2 work like local file systems, except with regard to caching. In GFS/GFS2, caching is controlled by glocks. There are two essential things to know about caching in order to understand GFS/GFS2 performance characteristics.

The first is that the cache is split between nodes: either only a single node may cache a particular part of the file system at one time, or, in the case of a particular part of the file system being read but not modified, multiple nodes may cache the same part of the file system simultaneously. Caching granularity is per inode or per resource group, so that each object is associated with a glock (types 2 and 3, respectively) that controls its caching.

The second thing to note is that there is no other form of communication between GFS/GFS2 nodes in the file system. All cache-control information comes from the glock layer and the underlying lock manager (DLM). When a node makes an exclusive-use access request (for a write or modification operation) to locally cache some part of the file system that is currently in use elsewhere in the cluster, all the other cluster nodes must write any pending changes and empty their caches.
If a write or modification operation has just been performed on another node, this requires both log flushing and writing back of data, which can be tremendously slower than accessing data that is already cached locally.

These caching principles apply to directories as well as to files. Adding or removing a directory entry is the same (from a caching point of view) as writing to a file, and reading the directory or looking up a single entry is the same as reading a file. The speed is slower if the file or directory is larger, although it also depends on how much of the file or directory needs to be read in order to complete the operation.

Reading cached data can be very fast. In GFS2, the code path used to read cached data is almost identical to that used by the ext3 file system: the read path goes directly to the page cache in order to check the page state and copy the data to the application. There will only be a call into the file system to refresh the pages if the pages are non-existent or not up to date. GFS works slightly differently: it wraps the read call in a glock directly; however, reading data that is already cached this way is still fast. You can read the same data at the same speed in parallel across multiple nodes, and the effective transfer rate can be very large.

It is generally possible to achieve acceptable performance for most applications by being careful about how files are accessed. Simply taking an application designed to run on a single node and moving it to a cluster rarely improves performance. For further advice, contact Red Hat Support.
FILE-SYSTEM DESIGN CONSIDERATIONS
Before putting a clustered file system into production, you should spend some time designing the file system to allow for the least amount of contention between nodes in the cluster. Since access to file-system blocks is controlled by glocks that potentially require inter-node communication, you will get the best performance if you design your file system to avoid contention.

File/Directory Contention
If, for example, you have dozens of nodes that all mount the same GFS2 file system and all access the same file, then access will only be fast if all nodes have read-only access (nodes mounted with the noatime mount option). As soon as there is one writer to the shared file, performance will slow down dramatically. If the application knows when it has written a file that will not be used again on the local node, then calling fsync and then fadvise/madvise with the DONT_NEED flag will help to speed up access from other cluster nodes.

The other important item to note is that, for directories, file create/unlink activity has the same effect as a write to a regular file: it requires exclusive access to the directory to perform the operation, and subsequent access from other nodes then requires rereading the directory information into cache, which can be a slow operation for large directories. It is usually better to split up directories with a lot of write activity into several subdirectories indexed by a hash or some similar system in order to reduce the number of times each individual directory has to be reread from disk.

Resource-Group Contention
GFS/GFS2 file systems are logically divided into several areas known as resource groups. The size of the resource groups can be controlled with the mkfs.gfs/mkfs.gfs2 command (-r parameter). The GFS/GFS2 mkfs program attempts to estimate an optimal size for your resource groups, but it might not be precise enough for optimal performance.
If you have too many resource groups, the nodes in your cluster might waste unnecessary time searching through tens of thousands of resource groups trying to find one suitable for block allocation. On the other hand, if you have too few resource groups, each will cover a larger area, so block allocations might suffer from the opposite problem: too much time wasted in glock contention waiting for available resource groups. You might want to experiment with different resource-group sizes to find one that optimizes system performance.

Block-Size Considerations
When the file system is formatted with the mkfs.gfs/mkfs.gfs2 command, you may specify a block size with -b. If no size is specified, the default is 4K. Different block sizes will often provide different performance characteristics for your application. Most hardware is designed to operate efficiently with the default block size of 4K, and the default is recommended for all file systems. However, if there is a requirement for efficient storage of very small files, 1K should be considered the minimum block size (-b 1024). The ideal block size depends on how the file system is used, and you might want to experiment with different block sizes to find one that optimizes system performance.
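As a concrete illustration of the directory-splitting approach described under File/Directory Contention, the sketch below spreads files across hashed subdirectories. The base directory, file name, bucket count, and use of md5sum are illustrative assumptions, not part of GFS/GFS2 itself.

```shell
# Spread files across 16 hashed subdirectories (0-f) instead of one
# flat directory, so no single directory glock sees all of the writes.
base=$(mktemp -d)        # stand-in for a directory on the GFS2 mount
name="message-12345"     # illustrative file name

# Derive a one-hex-digit bucket (16 buckets) from the file name.
bucket=$(printf '%s' "$name" | md5sum | cut -c1)
mkdir -p "$base/$bucket"
touch "$base/$bucket/$name"
```

Readers can recompute the same hash from the file name to locate a file, so no shared index (and no extra locking) is needed.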
MOUNT OPTIONS
Unless atime support is essential, Red Hat recommends setting noatime on every GFS/GFS2 mount point. This will significantly improve performance since it prevents reads from turning into writes. With GFS2 on Red Hat Enterprise Linux 6 and later, there is also the option of relatime, which updates atime when other timestamps are being updated; however, noatime is still recommended.

Do not use the journaled data mode (chattr +j) unless it is required. The default ordered-data mode will prevent any uninitialized data from appearing in files after a crash. To ensure data is actually on disk, use fsync(2) for both the parent directory and newly created files.

ANSWERS TO COMMON QUESTIONS
If your cluster is running slowly or appears to be stopped and you are not sure why, the steps below should help to resolve the issue. Remember that Red Hat Support is always available to help.

First Steps
Begin by collecting the answers to several simple questions that Red Hat Support will ask when you contact them:
• What is the workload on the cluster?
• What applications are running, and are they using large or small files?
• How are the files organized?
• What is the architecture of the cluster?
• How many nodes?
• What is the storage, and how is it attached?
• How large is the file system(s)?
• What are the timing constraints?
• Does the issue always occur at a certain time of day or have some relationship with a particular event (for example, nightly backups)?
• Is the problem a performance problem (slow), a real bug (completely stuck, kernel panic, file-system assertion), or a corruption issue (usually indicated by a file-system withdraw)?
• Is the problem reproducible, or did it happen only once?
• Does the problem occur on a single node or in a cluster situation?
• Does the problem occur on the same node, or does it move around?
Of course, not every situation will require the same set of information, but the answers to these questions will give you a good idea of where to start looking for the root of the problem. If the problem always occurs at a specific time, look for periodic processes that might be running (not all are in crontab, but that is a good place to begin).

Is a Task Stuck or Just Slow?
It is often difficult to tell if a task is completely stuck or if it is just slow, but there are signs that point to poor performance resulting from contention for glocks. One example is increased network traffic. A significant amount of DLM network traffic indicates that there is a lot of locking activity and thus potentially a lot of
cache invalidation. What counts as significant depends on each individual situation, and locking traffic should be assessed as a proportion of the total network bandwidth rather than measured against specific metrics. Also, increased locking activity is only an indication of a problem and not a guarantee that one exists. The actual level of locking activity is highly dependent on the workload.

Information from glock dumps can be used to show whether tasks are still making progress. Take two glock dumps, spaced apart by a few seconds or a few minutes, and then look for glocks with a lot of waiting holders (ignoring granted holders). If the same glocks have exactly the same list of waiting holders in the second glock dump as they did in the first, it is an indication that the cluster has become stuck. If the list has changed at all, then the cluster is just running slowly.

Sometimes taking a glock dump can take a long time due to the amount of data involved, which depends on the number of cached glocks (GFS/GFS2 keep a large number of glocks in memory for performance purposes). The time needed will vary depending on the total memory size of the node in question and the amount of GFS/GFS2 activity that has taken place on that node. GFS2 tracepoints (available in Red Hat Enterprise Linux 6 and later) can also be used to monitor activity in order to see whether any nodes are stuck.

Which Task is Involved in the Slowdown?
The glock dump file includes details of the tasks that have requested each glock (the owner of each holder). This information can be used to find out which task is stuck or involved in contention for a glock.

Which Inode is Contended?
Glock numbers are made up of two parts. In the glock dump, glock numbers are represented as type, number. Type 2 indicates an inode, and type 3 indicates a resource group. There are additional types of glocks, but the majority of slowdown and glock-contention issues will be associated with these two glock types.
The number of the type 2 glocks (inode glocks) indicates the disk location (in file-system blocks) of the inode and also serves as the inode identification number. You can convert the inode number listed in the glock dump from hexadecimal to decimal format and use it to track down the inode associated with that glock. Identifying the contended inode should be possible using find -inum, preferably on an otherwise idle file system, since find will try to read all the inodes in the file system, making any contention problem worse.

Why Does gfs2_quotad Appear Stuck When I'm Not Even Using Quotas?
The gfs2_quotad process performs two jobs. One of those is related to quotas, and the other is updating the statfs information for the file system. If a problem occurs elsewhere in the file system, gfs2_quotad often becomes stuck since the periodic writes to update the statfs information can become queued behind other operations in the system. If gfs2_quotad appears stuck, it is usually a symptom of a different problem elsewhere in the file system.

Is It Worth Trying to Reproduce a Problem While Only a Single Node is Mounted?
In almost every case, you should try to reproduce a problem while only a single node is mounted. If the
problem does reproduce on a single node, it is probably not related to clustering at all. If the problem only appears in the cluster, it indicates either an I/O issue or a contention issue on one or more inodes in the cluster.

How Can I Calculate the Maximum Throughput of GFS/GFS2 on My Hardware?
Maximum throughput depends on the I/O pattern from the application, the I/O scheduler on each node, the distribution of I/O among the nodes, and the characteristics of the hardware itself.

One simple example involves two nodes, each performing streaming writes to its own file on a GFS2 file system. This scenario can be simulated at the block-device level by creating two streams of I/O to different parts of the shared block device using dd. This test will allow you to measure the absolute maximum performance that the hardware can sustain. Actual file-system performance will differ due to the overhead of block allocation and file-system metadata, but this test will provide an upper limit.

In this example, we are assuming that the block device is a single, shared, rotational hard disk that is able to write to only a single sector at once. The two streams of I/O will be sent to the disk by the I/O schedulers on the two different nodes, each without any knowledge of the other. The disk must perform scheduling in order to move the disk head between the two areas of the disk receiving the streams. This process will be slow, and it might even be slower than writing the two streams of I/O sequentially from a single node. If, on the other hand, we assume that the block device in the example is a RAID array with many spindles, then the two streams of I/O may be written at the same time without having to move disk heads between the two areas. This will improve performance. Storage hardware must be specified according to the expected I/O patterns from the nodes so that it is capable of delivering the level of performance that will support file-system requirements.
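The two-stream dd test described above might be sketched as follows. Here a scratch file stands in for the shared block device so the command shape can be tried safely; on real shared storage you would point dev at the LUN (typically adding oflag=direct to bypass the page cache), and the writes destroy whatever they are aimed at.

```shell
# Two concurrent streaming writes to different regions of one device,
# simulating streams arriving from two cluster nodes. A scratch file
# stands in for the shared LUN; sizes are illustrative.
dev=$(mktemp)
dd if=/dev/zero of="$dev" bs=1M count=8 seek=0  conv=notrunc 2>/dev/null &
dd if=/dev/zero of="$dev" bs=1M count=8 seek=16 conv=notrunc 2>/dev/null &
wait    # both streams must finish before any timing measurement ends
```

Timing the whole sequence (for example with time) on a single rotational disk versus a many-spindle array shows the head-seeking effect the text describes.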
What About I/O Barriers?
Beginning with Red Hat Enterprise Linux 6, GFS2 uses I/O barriers by default when flushing the log. Red Hat recommends the use of I/O barriers for most block devices; however, barriers are not required in all cases and can sometimes be detrimental to performance, depending on how the storage device implements them. If the shared block device has no write cache, or if the write cache is not volatile (for example, if it is powered from a UPS or similar device), then you might wish to turn off barrier support.

You can prevent the use of I/O barriers by setting the nobarrier option with the mount command (or in /etc/fstab). If the underlying storage does not support them, I/O barriers will automatically be turned off, indicated by a log message and the appearance of the nobarrier option in /proc/mounts as if it had been set using the command line.

GFS2 only issues a single barrier each time it flushes the log. The total number of barriers issued over time can be minimized by reducing the number of operations that result in a log flush (for example, fsync(2) or glock contention) or (depending on workload) by increasing the journal size. This has potential side effects, so you should attempt to strike a balance between performance and the potential for data loss in the event of a node failure.
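For reference, an /etc/fstab entry combining these options might look like the line below. The device path and mount point are illustrative, and nobarrier should only be added when the write cache is absent or non-volatile.

```
/dev/clustervg/gfs2lv  /mnt/gfs2  gfs2  noatime,nobarrier  0 0
```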
Can I Use Discard Requests for Thin-Provisioning?
On Red Hat Enterprise Linux 6 and later, GFS2 supports the generation of discard requests. These requests allow the file system to tell the underlying block device which blocks are no longer required due to deallocation of a file or directory. Sending a discard request implies an I/O barrier. There is a small performance penalty when these requests are generated by GFS2, and the penalty might be larger depending on how the underlying storage device interprets the requests. In order to increase performance and decrease overhead, GFS2 saves up as many requests as possible and merges them into a single request whenever it is able.

To turn on this feature, set the discard option with the mount command. It will only work when both the volume manager and the underlying storage device support the requests. Some storage devices might consider the requests a suggestion rather than a requirement (for example, if the file system requests a discard of a single block, but the underlying storage is only able to discard larger chunks of storage). Other storage devices might not deallocate any storage at all, but they might use the suggestion to implement a secure delete by zeroing out the selected blocks.

Are There Benefits to Using Solid-State Storage with GFS/GFS2?
In general, solid-state storage has a much lower seek time, which can improve overall system performance, particularly where glock contention is a major factor.

Is Network Traffic a Major Factor in GFS/GFS2 Performance?
Network traffic should not greatly affect overall system performance, provided that there is enough bandwidth for the cluster infrastructure to keep quorum and synchronize any POSIX locks, that multicast is working between all the nodes, and that DLM is able to communicate its lock requests.
However, if a network device is shared between storage and/or application traffic as well as cluster traffic, Red Hat recommends using tc to implement suitable bandwidth limits. The use of jumbograms is not recommended unless the network is carrying storage traffic. Latency is generally regarded as more important than overall throughput in terms of GFS/GFS2 performance. Watching traffic levels to ensure that the links do not saturate is a sensible policy, since saturation can be a warning sign of other issues (such as contention), but traffic statistics otherwise do not provide a great deal of useful information.

What if I Have a Different Problem?
Red Hat Support representatives are available to help if you cannot find the answer to your question here. If you have experienced a kernel Oops, assertion, or similar problem, contact Red Hat Support immediately. If you have experienced a file-system withdraw, it is almost always due to corruption of the file system, and fsck.gfs/fsck.gfs2 can usually fix the problem. Unmount the file system on all nodes, take a backup, and then run fsck on the file system. Keep any output from fsck, as it will be needed in the event that fsck cannot fix the problem. Contact Red Hat Support if fsck fails to resolve the problem.
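The recovery steps above might be wrapped in a small helper that preserves the fsck output, since Red Hat Support will want it. The helper name, device path, and log path are illustrative assumptions; the file system must be unmounted on all nodes and backed up before running it.

```shell
# Run fsck on the (unmounted-everywhere) GFS2 file system and keep a
# log of its output. Helper name and paths are illustrative.
recover_gfs2() {
    dev=$1
    log=$2
    fsck.gfs2 -y "$dev" 2>&1 | tee "$log"   # keep output for Red Hat Support
}

# Example (run as root, file system unmounted on ALL nodes, backup taken):
# recover_gfs2 /dev/clustervg/gfs2lv /root/fsck-gfs2.log
```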
APPLICATIONS

Email (imap/sendmail/etc.)
Locality of access often causes problems when email is run on clustered file systems. To optimize performance, Red Hat recommends arranging for users to have a "home" node to which each user connects by default, assuming that all the cluster nodes are working normally. If a problem occurs in the cluster, users can then be moved to another home node where all of their files are cached. This reduces cross-node invalidation. It is also possible to use the same technique on the delivery side by setting up the MTA to forward to the user's home node and letting that node write to the file system. If that node isn't available, the MTA can write the message directly. Using maildir instead of mbox also helps scalability, since each message, rather than the whole mailbox, has its own lock. When performance issues occur in maildir setups, they are almost always the result of contention on the directory lock.

Backup
Backup can affect performance since the process of backing up a node or set of nodes usually involves reading the entire file system in sequence. If a single node performs the backup, that node will retain all that information in cache until other nodes in the cluster start requesting locks. Running this type of backup program while the cluster is in operation is a sure way to reduce system performance, but there are a number of ways to reduce the detrimental effect.

One way is to drop the caches using echo -n 3 >/proc/sys/vm/drop_caches after the backup has completed. This reduces the time required by other nodes to get their glocks/caches back. However, this method is not ideal because the other nodes will have stopped caching the data that they were caching before the backup process started. Also, there is an effect on the overall cluster performance while the backup is running, which is often not acceptable. Another method is to back up at the block-device level and take a snapshot.
This is currently only supported in cases where there is provision for a snapshot at the hardware (storage array) level. A better solution is to back up the working set of each cluster node from the node itself. This distributes the workload of the backup across the cluster and keeps the working set in cache on the nodes in question. It often requires custom scripting.

Web Servers
GFS/GFS2 file systems are well suited to serving web content, since serving web pages tends to involve a large amount of data that can be cached on all nodes. Issues can arise when data has to be updated, but you can reduce the potential for contention by preparing a new copy of the website and switching over to it rather than trying to update the files in place. Red Hat recommends making the root of the website a bind mount and using mount --move to have the web server(s) use a new set of files.
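The switch-over described above might be scripted as below. The paths are illustrative assumptions, and the commands must be run as root on each web-server node.

```shell
# Publish a freshly prepared copy of the site by bind-mounting it over
# the web root, instead of updating files in place and causing
# cross-node invalidation. Paths are illustrative.
switch_webroot() {
    newroot=$1
    webroot=$2
    mount --bind "$newroot" "$webroot"
}

# Example: switch_webroot /gfs2/site-v2 /var/www/html
```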
SYSTEM CALLS

Read/Write
Read/write performance should be acceptable for most applications, provided you are careful not to cause too many cross-node accesses that require cache sync and/or invalidation.

Streaming writes on GFS2 are currently slower than on GFS. This is a direct consequence of the locking hierarchy and results from GFS2 performing writes on a per-page basis like other (local) file systems. Each page written has a certain amount of overhead. Due to the different lock ordering, GFS does not suffer from the same problem, since it is able to perform the overhead operations once for multiple pages. Speed of multiple-page write calls aside, there are many advantages to the GFS2 file system, including faster performance for cached reads and simpler code for deadlock avoidance during complicated write calls (for example, when the source page being written is from a memory-mapped file on a different file system type). Red Hat is currently working to allow multiple-page writes, which will make GFS2’s streaming write calls faster than the equivalent GFS operation. This streaming-writes issue is the only known regression in speed between GFS and GFS2. Smaller writes (page sized and below) on GFS2 are faster than on GFS.

Memory Mapping
GFS and GFS2 implement memory mapping differently. In GFS (and some earlier GFS2 kernels), a page fault on a writable shared mapping would always result in an exclusive lock being taken for the inode in question. This is a consequence of an optimization that was originally introduced for local file systems, where pages would be made writable on the initial page fault in order to avoid a potential second fault later (if the first access was a read and a subsequent access was a write). In Red Hat Enterprise Linux 6 (and some later versions of Red Hat Enterprise Linux 5) kernels, GFS2 has implemented a system of only providing a read-only mapping for read requests, significantly improving scalability.
A file that is mapped on multiple nodes of a GFS2 cluster in a shared writable manner can be cached on all nodes, provided no writes occur.

NOTE: While in theory you can use the feature of shared writable memory mapping on a single shared file to implement distributed shared memory, any such implementation would be very slow due to cache-bouncing issues. This is not recommended. Sharing a read-only shared file in this way is acceptable, but only on recent kernels with GFS2. If you need to share a file in this way on GFS, then open and map it read-only to avoid locking issues.

Cache Control (fsync/fadvise/madvise)
Both GFS and GFS2 support fsync(2), which functions the same way as in any local file system. When using fsync(2) with numerous small files, Red Hat recommends sorting them by inode number. This will improve performance by reducing the disk seeks required to complete the operation. If it is possible to save up fsync(2) on a set of files and sync them all back together, it will help performance when compared with using either O_SYNC or fsync(2) after each write.

To improve performance with GFS2, you can use the fadvise and/or madvise pair of system calls to request read-ahead or cache flushing when it is known that data will not be used again (GFS does not
support the fadvise/madvise interface). Overall performance can be significantly improved by flushing the page cache for an inode when it will not be used from a particular node again and is likely to be requested by another node.

It is also possible to drop caches globally. Using echo -n 3 >/proc/sys/vm/drop_caches will drop the caches for all file systems on a node, not just GFS/GFS2. This can be useful when you have a problem that might be caused by caching and you want to run a cache-cold test, for example. However, it should not be used in the normal course of operation (see the Backup section above).

File Locking
The locking methods below are only recommendations, as GFS and GFS2 do not support mandatory locks.

flock
The flock system call is implemented by type 6 glocks and works across the cluster in the normal way. It is affected by the localflocks mount option, as described below. Flocks are a relatively fast method of file locking and are preferred to fcntl locks on performance grounds (the difference becomes greater on clusters with larger node counts).

fcntl (POSIX Locks)
POSIX locks have been supported on a single-node basis since the early days of GFS. The ability to use these locks in a clustered environment was added in Red Hat Enterprise Linux 5. Unlike the other GFS/GFS2 locking implementations, POSIX locks do not use DLM and are instead implemented in user space via corosync. By default, POSIX locks are rate-limited to a maximum of 100 locks per second in order to conserve network bandwidth that might otherwise be flooded with POSIX-lock requests. To raise the limit, you can edit the cluster.conf file (setting the limit to 0 removes it altogether).

NOTE: Some applications using POSIX locks might use the F_GETLK fcntl to try to obtain the PID of a blocking process. This will work on GFS/GFS2, but due to clustering, that process might not be on the same node as the process that used F_GETLK.
Sending a signal is not as straightforward in this case, and care should be taken not to send signals to the wrong processes.

It is possible to make use of POSIX locks on a single-node basis by setting the localflocks mount option. This also affects flock(2), but that is not usually a problem, since it is unusual to require both forms of locking in a single application.

NOTE: Localflocks must be set for all NFS-exported GFS2 file systems, and the only supported NFS-over-GFS/GFS2 solutions are those with only a single active NFS server at a time, designed for active/passive failover. NFS is not currently supported in combination with either Samba or local applications.

Due to the user-space implementation of POSIX locks, they are not suitable for high-performance locking requirements.

Leases
Leases are not supported on either GFS or GFS2.
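As a small illustration of flock-based locking, the sketch below serializes a critical section between processes using flock(1), the command-line wrapper around the flock system call. For a cluster-wide lock the lock file would have to live on the GFS/GFS2 mount; the local temporary file here is a stand-in.

```shell
# Serialize a critical section with an exclusive flock. On GFS/GFS2,
# the same pattern locks across nodes when the lock file is on the
# shared mount; a local temp file stands in here.
lockfile=$(mktemp)
exec 9>"$lockfile"            # open file descriptor 9 on the lock file
flock -x 9                    # block until the exclusive lock is granted
result="critical section ran" # work done while holding the lock
flock -u 9                    # release (closing fd 9 would also release)
echo "$result"
```

A second process running the same sequence against the same lock file blocks at flock -x until the first releases the lock.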
DLM
There is no reason why applications should not make use of the DLM directly. Interfaces are available, and details can be found in the DLM documentation.

RECOMMENDED TUNABLE SETTINGS
The following sections describe recommended values for GFS tunable parameters.

glock_purge
In Red Hat Enterprise Linux 4.6/5.1 and later, a GFS tunable parameter, glock_purge, has been added to reduce the total number of locks cached for a particular file system on a cluster node.

NOTE: This setting does not exist in Red Hat Enterprise Linux 6 or later, and it is not a recommended solution to any problem for which there is another solution. In Red Hat Enterprise Linux 6 and later, this parameter is self-tuning, and caching can be controlled from user space via the fsync/fadvise system calls, as described earlier in this document.

This tunable parameter defines the percentage of unused glocks to clear for a file system every five seconds, as shown below, where X is an integer between 0 and 100 indicating the percentage to clear:

# gfs_tool settune /path/to/mount glock_purge X

A setting of 0 disables glock_purge. This value is typically set between 30 and 60 to start and can be tuned further based on testing and performance benchmarks. The setting is not persistent, so it must be reapplied every time the file system is mounted. It is typically placed in /etc/rc.local, or in the start function of /etc/init.d/gfs, on every node so that it is applied at boot time after the file systems are mounted.

demote_secs
Another tunable parameter, demote_secs, can be used in conjunction with glock_purge. This parameter demotes GFS write locks into less restrictive states and subsequently flushes the cached data to disk. A shorter demote interval can be used to avoid accumulating too much cached data, which would otherwise result in burst-mode flushing activity and prolong another node's wait for lock access.
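Because neither glock_purge nor demote_secs is persistent, both must be reapplied after each mount. The fragment below is a hedged sketch of the kind of /etc/rc.local addition described above; the mount points and tunable values are examples only and should be adjusted after benchmarking.

```shell
# Hypothetical /etc/rc.local fragment: reapply non-persistent GFS
# tunables on every boot, after the GFS file systems are mounted.
# Mount points and values below are illustrative, not recommendations.
for mnt in /mnt/gfs1 /mnt/gfs2; do
    gfs_tool settune "$mnt" glock_purge 50    # clear 50% of unused glocks per interval
    gfs_tool settune "$mnt" demote_secs 200   # demote write locks every 200 seconds
done
```

Note that each gfs_tool settune invocation applies to a single file system, which is why the loop is needed when more than one GFS mount point is in use.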
NOTE: This setting does not exist in Red Hat Enterprise Linux 6 or later, and it is not a recommended solution to any problem for which there is another solution. In Red Hat Enterprise Linux 6 and later, this parameter is self-tuning, and caching can be controlled from user space via the fsync/fadvise system calls, as described earlier in this document.

The default value is 300 seconds. To demote locks every 200 seconds on mount point /mnt/gfs1, enter the following command:

$ gfs_tool settune /mnt/gfs1 demote_secs 200

To set the value back to the default of 300 seconds, enter the following command:

$ gfs_tool settune /mnt/gfs1 demote_secs 300

Note that this setting applies only to an individual file system, so multiple commands must be used to apply it
to more than one mount point.

statfs_fast
The statfs_fast tunable parameter can be used in Red Hat Enterprise Linux 4.5 or later to speed up statfs calls on GFS.

NOTE: In Red Hat Enterprise Linux 6 and later, this can be set on the mount command line using the statfs_quantum and statfs_percent mount arguments. This is the preferred method, since the value is then set at mount time and does not require a separate tool to change it.

To enable statfs_fast, enter the following command:

# gfs_tool settune /path/to/mount statfs_fast 1

Red Hat recommends the use of the mount options noquota, noatime, and nodiratime for GFS file systems where possible, as they are known to improve performance in many cases. They can be added in /etc/fstab, as shown below:

/dev/clustervg/lv1 /mnt/appdata gfs defaults,noquota,noatime,nodiratime 0 0

NOTE: An issue has been reported to Red Hat Engineering regarding the use of noquota in Red Hat Enterprise Linux 5.3: Why do I get a mount error reporting 'Invalid argument' on my GFS or GFS2 file system on Red Hat Enterprise Linux 5.3?

Disabling ls Colors
It can also be beneficial to remove the aliases for the ls command that cause it to display colors in its output when using the bash or csh shells. By default, Red Hat Enterprise Linux systems are configured with the following aliases in /etc/profile.d/colorls.sh and colorls.csh:

# alias | grep 'ls'
alias l.='ls -d .* --color=tty'
alias ll='ls -l --color=tty'
alias ls='ls --color=tty'

In situations where a GFS file system is slow to respond, the first response of many users is to run ls to investigate. If the --color option is enabled, ls must run stat() against every entry, which creates additional lock requests and can create contention for those files with other processes. This can exacerbate the problem and cause slower response times for processes accessing that file system.
To prevent ls from using the --color=tty option for all users, the following lines can be added to the end of /etc/profile:

alias ll='ls -l' 2>/dev/null
alias l.='ls -d .*' 2>/dev/null
unalias ls

These lines can also be placed in a user's ~/.bash_profile to disable --color=tty on an individual basis. In general, however, it is best to avoid excessive use of the ls command because of the locking overhead.

Copyright © 2011 Red Hat, Inc. "Red Hat," Red Hat Linux, the Red Hat "Shadowman" logo, and the products listed are trademarks of Red Hat, Inc., registered in the U.S. and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. www.redhat.com
