Hi, I’m Andy Wang. I’m going to tell you about the Conquest file system I built at UCLA.
First, I’ll give a brief overview of the talk. The problem with modern file systems is that they are designed and optimized for disks. This assumption is problematic in terms of both performance and complexity. However, we now have tons of inexpensive RAM. The natural question is, “what can we do with all that RAM?”
The Conquest approach is to combine disk and persistent RAM in an interesting way, so that the resulting system is simpler and performs better. In terms of simplification, the Conquest code base is 20% smaller than those of ext2, ReiserFS, and SGI XFS. In terms of performance, Conquest is 24% to 1900% faster than LRU caching solutions.
Here is the outline of the talk. I’ll go through the motivation, the Conquest design, its major components, the performance evaluation, and the conclusion.
Now, here is the full version of the talk. Modern file systems are built for disks, and there are two major problems with the disk assumption. First, disks are slow. Second, as a consequence, researchers have been adding all kinds of complexity to mask this performance gap.
This graph shows hardware evolution over time; the vertical axis is in log scale. CPU and memory are improving at 50% per year, but disk is improving at only 15% per year, and this gap is widening from five orders of magnitude to six. Personally, I have trouble feeling those numbers, so I try to think of them on a human scale. Back in 1990, if a CPU access took 1 second, it would have taken the disk 6 days to perform one access. By 2000, this ratio had grown to 1 second versus 3 months.
So what’s going on? A disk contains mechanical components: a disk arm and disk platters. To access a piece of data, you need to move the disk arm to the correct track, wait for the data to rotate underneath the disk head, and then transfer the data.
As a result, researchers do all kinds of things to speed up disks. We schedule the disk arm to minimize its movement. We group related information on the surface of the disk. We try to figure out what the disk is going to read next and prefetch that information. We buffer write requests so we can reorder writes to minimize disk head movement. We use memory to cache information stored on disk. And we mirror data on other disks to leverage hardware parallelism.
So, what does it take for a single byte of data to traverse from user space down to the surface of the disk? First, caching creates multiple copies of data, so we need to worry about data consistency; to achieve consistency, we need synchronization. The request then goes through the predictive logic of cache replacement, which runs head to head against the predictive readahead logic. In the background, we still have buffered writes juggling write requests. After that, we need to find the correct grouping of data and take the elevator algorithm down to the disk.
So, what are some storage alternatives? This graph plots speed against cost for current storage technologies; both axes are in log scale. First, we have tape, which is slow and inexpensive. Then we have disk, which is two orders of magnitude more expensive than tape but four orders of magnitude faster. Down the horizon, we start to see flash memory, battery-backed DRAM, and magnetic RAM (still under research), which are commonly referred to as persistent RAM. Persistent RAM is again two orders of magnitude more expensive than disk, but five to six orders of magnitude faster. If you look at these cost and performance tradeoffs, persistent RAM looks promising.
Now, let us look at the price trend over time. Again, the cost axis is in log scale. First, consider the cost of paper and film, which is a critical barrier for any storage technology to cross in order to achieve an economy of scale. Once a storage technology crosses this barrier, it becomes a cheap enough storage alternative. Now let’s look at the various cost curves. For disks, new disk geometries are introduced roughly at the top boundary of the paper-and-film cost barrier. Also, notice the cost curve for persistent RAM: back in 1998, the boom in digital photography changed the slope of the curve. By 2005, we expect to see 4 to 10 GB of persistent RAM on personal desktops.
So, let’s think about these facts for a moment. I think disk will stay around, for cost reasons at least. However, RAM is now a viable storage alternative, as we see on PDAs and various other devices. I expect to see more architectural changes driven by RAM, because it is a big change in design assumptions: we need to rethink system components ranging from data structures and interfaces to applications.
For Conquest, the biggest question is: what does it take to design and build a system that assumes lots of persistent RAM as the primary storage medium? The answer is to start from the ground up.
The idea of Conquest is to design and build a disk/persistent-RAM hybrid file system that delivers all file system services from memory, with the single exception of high-capacity storage. In essence, Conquest provides two separate data paths: one to memory and one to disk. The two major benefits are simplicity and performance.
In terms of simplicity, Conquest removes disk-related complexities for most files, and it makes things simpler for the disk as well. Less complexity means fewer bugs, easier maintenance, and shorter data paths.
In terms of performance, all management is performed in memory, which improves overall performance. The memory data path has no disk-related overhead. The disk data path is faster due to simpler usage models, as I will explain later.
Conquest consists of the following components. They deal with how the storage media are used, how metadata is represented, how directory service is provided, how data and metadata are allocated, and how persistence and resiliency are supported.
First, let’s revisit common user access patterns. We know that small files take up relatively little space but represent most user accesses. Large files take up most of the space and are accessed sequentially most of the time. Of course, databases are an exception, and Conquest is not designed for database workloads.
Based on this user behavior, Conquest stores small files, metadata, executables, and shared libraries in RAM. Small files benefit the most from being stored in memory, because disk seek time and rotational delay are the dominant overheads when accessing small files. We also get fast byte-level access as opposed to block-level access, and small files are allocated contiguously. Storing metadata in memory avoids the notorious synchronous-update problem, which deserves some discussion. Basically, if there is no metadata, there is no file system, so system designers take extra caution when it comes to handling metadata. If you update a directory, for example, most disk-based file systems will propagate the change synchronously to disk, which is a serious performance problem. By storing metadata in memory, synchronous updates become much faster. We also get a single representation for metadata, as opposed to separate runtime and storage representations. Executables and shared libraries are also stored in core, so we can execute programs in place, which reduces program startup time significantly.
Now let’s take a look at the data path of conventional file systems. A storage request has to go through IO buffer management, which handles caching. If the request misses the cache, it goes through persistence support, which translates between the storage and runtime forms of metadata. The request then goes through disk management, which handles disk layout, disk arm scheduling, and so on before reaching the disk. For the Conquest memory data path, updates to metadata and data are performed in place: there is no IO buffer management and no disk management, and the persistence support does not need to translate between runtime and storage states.
Since small files and metadata are taken care of, the disk only needs to handle large files. This means we can allocate disk space in big chunks, which translates into lower access and management overhead. Also, without small objects, we don’t need to worry about fragmentation. We don’t need tricks for small files, such as storing data inside the metadata, or elaborate data structures, such as wrapping a balanced tree onto the geometry of the disk cylinders.
For large files that are accessed sequentially, the disk can deliver near-raw bandwidth, about 100 MB per second, which is 200 times faster than random disk access. Large files also have well-defined readahead semantics, and since they are mostly read, large-file handling involves little synchronization overhead.
This shows the disk data path of Conquest. Again, on the left side we have the data path of conventional file systems. Immediately, you can see that the Conquest data path bypasses the mechanisms involved in persistence support. IO buffer management is greatly simplified because we know the behavior of large-file accesses. Disk management is also greatly simplified by the absence of small files and fragmentation management.
You may ask, “what about large files that are accessed randomly?” In the literature, random accesses are commonly defined as nonsequential accesses. However, if you take a look at, say, a movie file, it typically has about 150 scene changes: there are 150 places you may randomly jump to, after which you access the disk sequentially. Or look at an MP3 file: the title is stored at the end of the file, so the typical access pattern is to jump to the end of the file and then go back to the beginning to play sequentially. Therefore, what may look like random accesses are really near-sequential accesses. With this knowledge, we can simplify the large-file metadata representation significantly.
Before I introduce the Conquest metadata representation, let’s look at how it is commonly done in the ext2 file system. First, we have a logical file. If you look inside the file, you will find the i-node, which contains the file attributes, and the data.
At the physical level, however, the data is broken into data blocks, because the disk is a block-oriented device. Therefore, the i-node has to keep track of the data locations as well.
Ext2 uses the following i-node structure to handle both small and large files. For small files, there are ten pointers for fast data access. After consuming those ten pointers, we use a singly indirect block to keep track of additional data blocks; after that, doubly indirect blocks; and finally triply indirect blocks. This demonstrates how small files introduce complexity into the data structure design. Also, an index block with high fan-out is characteristic of disk data structures such as the B-tree.
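To make the indexing concrete, here is a small sketch (my own illustration, not actual ext2 code) of how many levels of indirection an ext2-style i-node must traverse to reach a given data block, assuming the ten direct pointers from the slide and 1024 pointers per 4-KB index block:

```c
#include <assert.h>

/* Illustrative sketch, not real ext2 code: computes how many levels of
 * indirection ext2-style metadata must traverse to reach a given data
 * block.  Assumes ten direct pointers (per the slide) and 1024
 * four-byte block numbers per 4-KB index block. */
#define NDIRECT        10
#define PTRS_PER_BLOCK 1024L

int indirection_depth(long block_index)
{
    if (block_index < NDIRECT)
        return 0;                                  /* direct pointer  */
    block_index -= NDIRECT;
    if (block_index < PTRS_PER_BLOCK)
        return 1;                                  /* singly indirect */
    block_index -= PTRS_PER_BLOCK;
    if (block_index < PTRS_PER_BLOCK * PTRS_PER_BLOCK)
        return 2;                                  /* doubly indirect */
    return 3;                                      /* triply indirect */
}
```

The deeper the level, the more index blocks must be fetched before the data itself, which is exactly why access time ends up depending on the byte position within the file.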
So, here are the problems with the ext2 design. First, the metadata is designed for disk storage, and the optimization for small files makes the data structure complex. We also have a random-access data structure for large files that are mostly accessed sequentially. Both the access method and the access time depend on the byte position within the file. Finally, the maximum file size is limited by the number of pointers.
The Conquest representation is a lot simpler. For persistent RAM, we just hash the file name to the location of the data and use the byte offset directly. For disk storage, each file has a doubly linked list of disk block segments, stored in persistent RAM.
So, the Conquest design gives direct data access for files that reside in memory. For large files, the worst case for a random access is traversing the doubly linked segment list in memory. Also, the maximum file size is now limited only by the physical storage.
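As a sketch of this representation (the struct and function names here are mine, not Conquest’s), a large file’s metadata might be a segment list like the following, with a lookup that walks the list to map a byte offset to a disk block:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of Conquest-style large-file metadata: a doubly linked list of
 * contiguous disk extents kept in persistent RAM.  Mapping a byte
 * offset to a disk address walks the list; for the near-sequential
 * accesses described above, the walk is short. */
struct segment {
    long start_block;               /* first disk block of this extent */
    long nblocks;                   /* extent length in blocks         */
    struct segment *next, *prev;
};

#define BLOCK_SIZE 4096L

/* Return the disk block holding byte `offset`, or -1 past end of file. */
long offset_to_block(struct segment *head, long offset)
{
    long blk = offset / BLOCK_SIZE;
    for (struct segment *s = head; s != NULL; s = s->next) {
        if (blk < s->nblocks)
            return s->start_block + blk;     /* inside this extent */
        blk -= s->nblocks;
    }
    return -1;
}
```

Unlike the ext2 scheme, there is no fixed pointer budget here, so the file size is bounded only by the available disk blocks.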
Now, let’s look at how to provide directory service. All of you should be familiar with the ls or dir command, depending on your OS platform. Directory service should provide fast sequential traversal. It should also provide fast random lookup and hard links, meaning multiple names for the same piece of data.
For the first design, I used the double-hashing data structure commonly used in compilers. Double hashing conserves memory quite well. It handles sequential traversal by walking through the table, provides random access by hashing, and handles hard links by hashing two file names to the same data. However, I discovered that this data structure runs into problems when resizing directories. Conventional directories are maintained as regular files, so there is a notion of a current file position. As the hash table rehashes during a resize operation, this file-position information is lost. This case is particularly important for common recursive deletes.
The second design is based on a variant of the extensible hash table. Basically, it hashes the same way as any hash table, but it hashes using the upper bits of the hash key. When the table resizes, more upper bits are used for hashing, so hash entries remain in the same relative order. Hard links are again supported by hashing multiple names to the same piece of data.
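Here is a toy illustration of the upper-bit hashing idea (the details are my assumptions, not Conquest’s actual directory code). With 2^depth buckets indexed by the top bits of the key, doubling the table splits each bucket into two adjacent ones, preserving the relative order of entries:

```c
#include <assert.h>
#include <stdint.h>

/* Toy extensible-hashing index: a directory of 2^depth buckets is
 * indexed by the top `depth` bits of a 32-bit hash key.  When the
 * table doubles (depth + 1), bucket b splits into buckets 2b and
 * 2b + 1, so entries keep their relative order across resizes. */
uint32_t bucket_of(uint32_t hash, int depth)
{
    if (depth == 0)
        return 0;                  /* single bucket; shifting by 32 is UB */
    return hash >> (32 - depth);   /* top `depth` bits pick the bucket */
}
```

For any key, the bucket at depth d+1 is either 2b or 2b+1, where b is the bucket at depth d; that invariant is what lets a sequential traversal keep its position across a resize.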
Extensible hashing shows how an old data structure invented back in the 1970s fits nicely into this RAM-rich environment. However, Conquest still needs to overcome other engineering obstacles. For example, popular hash functions tend to randomize the lower bits rather than the upper bits. Also, maintaining the dynamic file position and handling collisions are tricky, and there is a constant tension in the tradeoff between memory overhead and complexity. Currently, Conquest is still transitioning to the extensible hash table solution.
Now, let’s move on to metadata allocation. The main function of the metadata allocator is to keep track of unallocated metadata entries. The allocator has to avoid reusing metadata IDs, since data is uniquely associated with a metadata ID. Also, given an ID, we should be able to retrieve the metadata very quickly.
Let’s forget about the metadata allocator for a moment and look at what the existing memory allocation already provides. The memory allocator keeps track of unallocated memory, and it makes sure we don’t have duplicate allocations of the same physical address.
Now, let’s put the two allocators side by side. We can see that the existing allocator already provides the usage status. Physical addresses can be used as unique IDs, and they provide fast retrieval of data, because once we know the address, we know where the metadata is located. Therefore, we can avoid building metadata management completely: whenever we need to allocate metadata, we use the existing memory allocator, and the unique ID is simply the physical address of the metadata being allocated.
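A user-space sketch of this trick (Conquest does the equivalent inside the kernel’s persistent-RAM allocator; these names and types are hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: let the memory allocator do the bookkeeping and use the
 * address of the metadata itself as its unique ID, so ID-to-metadata
 * lookup is a cast rather than a table search. */
struct metadata {
    long size;
    long mtime;
};

typedef uintptr_t md_id_t;

md_id_t md_alloc(void)
{
    struct metadata *m = calloc(1, sizeof *m);
    return (md_id_t)m;                  /* the address IS the ID */
}

struct metadata *md_lookup(md_id_t id)
{
    return (struct metadata *)id;       /* O(1), no mapping table */
}

void md_free(md_id_t id)
{
    /* The allocator never hands out the same live address twice,
     * which is exactly the no-ID-reuse guarantee we need. */
    free(md_lookup(id));
}
```

Note that the no-reuse guarantee holds only while the metadata is live; freeing it returns the ID to the allocator, which matches how metadata entries are recycled.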
Persistence support enables a file system to restore its state after a reboot. Data and metadata can survive a reboot by not being zeroed out by the BIOS. However, the memory manager is typically reinitialized after a reboot, which is problematic because it keeps track of metadata allocation. I must provide a way for the memory manager state to survive reboots.
Let’s first take a look at the structure of the Linux memory manager, which is organized in layers. At the bottom layer, we have the page allocator, which keeps track of individual page allocations and attribute information.
Above the page allocator, we have the zone allocator, which divides memory into zones with different purposes, such as IO buffering, DMA, and high memory. Within each zone, memory is allocated in power-of-two sizes.
At the top level, we have the slab allocator, which groups allocations by size to reduce internal memory fragmentation. For example, if you frequently allocate a data structure of 54 bytes, the slab allocator will allocate a pageful of structures of that size at a time to reduce allocation overhead and minimize fragmentation.
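To illustrate the idea, here is a minimal user-space slab sketch (not the Linux kernel’s actual slab API): carve a page into equal-size objects and serve them from a free list, so repeated allocations of one size cause no per-object heap traffic:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal user-space sketch of the slab idea: grab a page-sized chunk,
 * carve it into equal-size objects, and hand them out from a singly
 * linked free list threaded through the free objects themselves. */
#define PAGE_SIZE 4096

struct slab {
    void *page;        /* backing page                  */
    void *free_list;   /* head of the free-object chain */
    size_t obj_size;
};

void slab_init(struct slab *s, size_t obj_size)
{
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);      /* room for the free-list link */
    s->obj_size = obj_size;
    s->page = malloc(PAGE_SIZE);
    s->free_list = NULL;
    /* thread every object on the page onto the free list */
    char *p = s->page;
    for (size_t off = 0; off + obj_size <= PAGE_SIZE; off += obj_size) {
        *(void **)(p + off) = s->free_list;
        s->free_list = p + off;
    }
}

void *slab_alloc(struct slab *s)
{
    void *obj = s->free_list;
    if (obj)
        s->free_list = *(void **)obj;   /* pop */
    return obj;
}

void slab_free(struct slab *s, void *obj)
{
    *(void **)obj = s->free_list;       /* push */
    s->free_list = obj;
}
```

A real slab allocator adds per-CPU caches, constructors, and multiple pages per cache; this sketch only shows why same-size grouping avoids fragmentation.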
This layered memory management poses some difficulties for Conquest when restoring persistent state. First, Conquest needs to somehow resurrect three layers of pointer-rich mappings. Second, existing allocators have no notion of persistent versus temporary allocations, so it is difficult to resurrect the memory manager’s state selectively.
The Conquest solution is to dynamically create dedicated zones, each with its own instantiation of the memory allocators. Since they are instantiations, Conquest can share code with the existing memory manager.
In addition, because all pointers are encapsulated within each zone, pointers can survive reboots without serialization and deserialization. Swapping and paging are disabled for Conquest memory zones, but they remain enabled for non-Conquest zones for backward compatibility.
As for resiliency support, metadata commits are instantaneous under Conquest, which means there is no need for fsck. Conquest has built-in checkpointing to roll back to previous file system states. Conquest also relies heavily on pointer-switch commit semantics: if you want to modify a pointer, you just allocate and initialize a new object, switch the pointer, and deallocate the old one. In the worst case, we get a memory leak that can be garbage collected.
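A simplified, single-threaded sketch of the pointer-switch commit (the real system must make the switch atomic; the struct and names here are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pointer-switch commit sketch: build a fully initialized copy, swing
 * one pointer (the commit point), then free the old copy.  A crash
 * before the switch leaves the old state intact; a crash after it at
 * worst leaks the old object, which garbage collection can reclaim. */
struct dirent_obj {
    char name[32];
    long inode;
};

void commit_rename(struct dirent_obj **slot, const char *new_name)
{
    struct dirent_obj *old = *slot;
    struct dirent_obj *fresh = malloc(sizeof *fresh);

    *fresh = *old;                              /* 1. copy aside      */
    strncpy(fresh->name, new_name, sizeof fresh->name - 1);
    fresh->name[sizeof fresh->name - 1] = '\0'; /*    and modify      */

    *slot = fresh;                              /* 2. the commit point */
    free(old);                                  /* 3. safe to reclaim  */
}
```

The key property is that observers of `*slot` always see either the complete old object or the complete new one, never a half-updated state.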
Conquest is currently built as a kernel module under Linux 2.4.2. It is fully functional and POSIX compliant. I have modified the memory manager to support Conquest persistence. Currently, the BIOS is the limitation for distribution, and UCLA is looking for licensing opportunities at the moment.
Now, let’s take a look at Conquest’s performance. I evaluate two aspects: architectural simplification and performance improvement. For architectural simplification, I’ll present a feature count. For performance improvement, I measured a memory-only workload and a memory-and-disk workload.
First, let’s take a look at the conventional data path, which is color coded. IO buffer management involves buffer allocation, garbage collection, data caching, metadata caching, predictive readahead, write-behind, and cache replacement. Persistence support involves metadata allocation, metadata placement, and metadata translation. Disk management involves disk layout and fragmentation management.
For the memory data path of Conquest, the only things left are metadata allocation and the memory manager encapsulation code. Since metadata allocation is based on the existing memory allocator, there is really no extra implementation. However, we do have the additional code to encapsulate the memory manager.
For the disk data path, Conquest avoids metadata caching, and the predictive logic is much more lightweight. The persistence support is gone, since metadata is stored in memory, and we have simpler disk management.
Now, let’s look at the performance of Conquest. This slide shows the results for the PostMark benchmark, which models an ISP workload. The horizontal axis shows the number of files being accessed, and the vertical axis shows the transaction rate. The total file size exercised varies from 40 to 250 MB, running with 2 GB of physical RAM. The dark blue Conquest is compared against ramfs and other leading disk-based file systems. As you can see, Conquest’s performance is comparable to that of ramfs. Compared to the disk-based file systems, Conquest is at least 24% faster. Note that all of these file systems are operating within the LRU disk cache; file systems optimized for disk do not take full advantage of memory speed.
Now let’s fix the number of files at 10,000 and vary the percentage of large files from 0 to 10 percent. Since the working set is larger than memory, the graph does not include ramfs. As you can see, when both the memory and disk components are exercised, Conquest can still be several times faster than leading disk-based file systems. Here is the boundary of physical RAM. Since we can’t see the right side of the graph too well, let’s zoom in.
When the working set is greater than RAM, Conquest still runs 1.4 to 2 times faster than the various disk-based file systems, which is a very significant improvement.
Now let’s look at the microbenchmark numbers. The microbenchmark we chose was the Sprite LFS microbenchmark suite, which is divided into two sets of benchmarks. One is the small-file benchmark, which creates, reads, and deletes 10,000 files in three separate phases. The dark blue bars represent the operation rate for Conquest. First, you can notice that Conquest’s creation and deletion are not as fast as ramfs, because I didn’t disable the metadata caching inside the VFS. As it turned out, creation and deletion operations account for a relatively small fraction of file system accesses. For file reads, Conquest is 15% faster than ramfs, because Conquest bypasses many disk-related mechanisms used in ramfs.
The modified large-file benchmark has five phases: sequential write, sequential read, random write, random read, and sequential read. Each phase iterates through 10 files of 1 MB each. The vertical axis shows the bandwidth for the various file systems. Conquest outperforms ramfs by 8% to 15% across the operations. Of course, you may ask what happens when files are bigger than 1 MB.
This graph shows Conquest’s performance with 1.01-MB files. For reads, Conquest falls back to the speed of the IO buffer; for writes, Conquest also commits changes to disk.
For 100-MB files, Conquest’s performance matches leading disk-based file systems, so Conquest has done no harm to disk performance. Also, since Conquest currently uses a 4-KB access granularity as opposed to 256 KB, there is still room for improving the disk bandwidth.
So far, you have seen that Conquest is doing great. However, I went a long way to reach these performance numbers. I can still recall that one year ago, I was discussing some puzzling microbenchmark numbers with my colleague, Geoff Kuenning. He told me, “If Conquest is slower than ext2, I will toss you off the balcony.”
So, with me hanging off a balcony, I discovered some odd numbers with the original large-file microbenchmark. The benchmark runs its file operations on a single 1-MB file: a sequential write with fsync, a sequential read of the same file, a random write on a new file, a random read of that file, and a sequential read of that file.
So, I first tried to evade Geoff by pointing out, “Isn’t it strange that random reads are slower than sequential reads in memory?” Geoff replied, “Well, that’s easy to explain: the random reads are not aligned in the benchmark.”
Then he asked, “Why do the RAM-based file systems run slower than the disk-based file systems?”
So, I went through a series of hypotheses. Could it be some kind of warm-up effect? Maybe, but why would RAM-based systems warm up more slowly than disk-based systems? Did we have bad initial states? I examined the initial benchmark conditions and found nothing. How about the Pentium III streaming IO option? Is it possible that ramfs triggers streaming IO and leaves the L2 cache with little content to reuse for the subsequent read operation? From profiling, I found nothing.
Finally, I tracked down the effects of the cache footprint on microbenchmark performance. On the left, I have a file system with a large cache footprint; on the right, one with a small cache footprint. With the large footprint, after writing a file sequentially, the cache is left with only some of the dirty lines from the end of the file, so the subsequent sequential read has relatively little dirty content to evict before reading the file in again from the beginning. With a small active footprint, there is more room to cache the dirty writes, so the subsequent read has to evict more dirty content before reading in the beginning of the file. The lesson here is that a smaller cache footprint can leave more room to cache dirty writes, which can amplify the performance swing of the subsequent operation. Of course, this result also depends on the relative sizes of the file and the L2 cache.
After factoring the memory and L2 caching effects into the microbenchmark, we obtain the following graph. It’s interesting to see that random accesses in memory can be faster than sequential accesses due to the reuse of cached content.
I walked into this project not expecting Conquest to perform better than LRU. However, not only does Conquest outperform LRU, it also outperforms ramfs in many circumstances. This shows that disk handling is very heavyweight and imposes a severe penalty on accessing memory content. Also, matching user access patterns to storage media offers considerable simplification and performance improvement. This result is not automatic: since Conquest has two separate data paths, it is quite possible for the combined memory footprint to be larger than before, resulting in poorer performance. The system needs careful design.
For performance measurement, the effects of L2 caching are becoming very visible for memory workloads. This lesson is particularly important because modern workloads are more and more memory intensive. Also, we cannot blindly apply existing disk-based microbenchmarks to measure the memory performance of file systems. We really need to consider the state of the L2 cache and memory behavior at each stage of microbenchmarking.
One additional lesson: don’t discuss your performance numbers next to a balcony…unless you are Romeo and Juliet.
Now, let’s look at some related work. The first area is disk caching. The fundamental difference between disk caching and Conquest is that disk caching assumes the scarcity of memory resources. Therefore, disk caching cares a lot about moving information back and forth between memory and disk in a speculative manner, and it introduces complex mechanisms to maintain consistency between memory and disk, especially in the presence of metadata. As for RAM drives and RAM file systems, they are not meant to be persistent, and they use disk-related mechanisms to access memory content. They are also limited by their storage capacity.
Disk emulators, commonly known as solid-state disks, are memory banks connected through a SCSI interface. As one paper indicates, the SCSI interface can impose a 45% to 55% penalty on the performance of disk emulators. There are also various ad hoc approaches, such as manually transferring files to and from ramfs. This approach has a capacity limitation, and you also need to know which parts of the file system should live in memory as opposed to disk. This process can be automated through a background daemon, but there are semantic and name-space problems, such as the semantics of mounting, links, and tracking the location of storage.
So, what’s after Conquest? We have learned that the principle of matching user access patterns to storage media results in better system performance. The natural extension is to apply this principle to the distributed domain, especially with heterogeneous machines. I mention heterogeneity because heterogeneous machines introduce more opportunities for specialization, and it would be interesting to explore specialization within a cluster, preferably in a self-organizing and self-evolving manner. More and more modern systems are designed with a philosophy of statelessness. However, that does not mean stateful computing is fading away. In fact, the extensive use of caching to improve performance shows the importance of stateful computing, and Conquest further advocates the direction of state-rich computing. Basically, in addition to caching data, we can cache runtime data structures to improve system performance. The concept is similar to /tmp: instead of storing files, we store data structures.
The concept of separating the storage of metadata and data opens up more possibilities, because we gain greater flexibility in associating metadata with data. For example, metadata can be associated with data at different fidelities on computing devices of different calibers. It would be interesting to know how we can use this characteristic to replicate data across a PDA, which has only memory storage; a laptop, which has limited disk storage; and a desktop, which has a large storage capacity. Although the computing world is moving toward a memory-rich environment, file system benchmarks are still designed to measure disk performance. Therefore, it is important to design new memory benchmarks that take into account the underlying behavior of the hardware.
In this research, I have contributed early insights into the design and implementation of disk-memory hybrid file systems. I have demonstrated that a system can achieve performance and simplicity at the same time. I have also identified that disk-based microbenchmarks are not suitable for measuring the memory performance of file systems. And Conquest has opened doors to many exciting areas of research.
Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements. This thinking process can be applied to other areas of operating systems as well.
Notes for anticipated questions: Is memory reliable? I didn’t change the protection mechanism; also, Google. Paging and swapping? That would take us back to the high-overhead situation. Is the workload representative? Show that disk caching is not as fast as memory performance. Do we have enough RAM? 1 GB is enough to run Conquest. I didn’t disable metadata caching; that requires changes to an internal access interface.
Conquest: Preparing for Life After Disks. CS239 Seminar, October 24, 2002. An-I Andy Wang, University of California, Los Angeles.
[Slide: Memory Data Path of Conquest. Diagram comparing the conventional file system data path (storage requests through IO buffer management, persistence support, and disk management to disk) with the Conquest memory data path (storage requests through persistence support directly to battery-backed RAM for small file and metadata storage).]
[Slide: Disk Data Path of Conquest. Diagram comparing the conventional file system data path with the Conquest disk data path, where storage requests pass through simplified IO buffer management and disk management to a large-file-only file system on disk, alongside battery-backed RAM for small file and metadata storage.]
[Slide: Ext2 Data Representation. Diagram of an i-node with 10 direct data block pointers plus singly, doubly, and triply indirect index block pointers leading to data block locations.]
[Slide: L2 cache footprint comparison. Side-by-side diagrams of a large versus a small cache footprint while writing a file sequentially, flushing, and then reading the same file sequentially.]