This graph shows hardware evolution over time; the vertical axis is in log scale. CPU and memory are improving at roughly 50% per year, while disk is improving at only 15% per year, and the gap between them has widened from 5 orders of magnitude to 6. In human scales, back in 1990, if a CPU access took 1 second, a disk access would take 6 days. By 2000, the ratio was 1 second to 3 months. Let's think about what this means. It takes 1 second to grab a sheet of paper and write something down. If you ask Santa Claus to physically mail you a piece of paper, it may take 6 days. It takes about a month to make your own paper from papyrus, and most of that time is waiting for the paper to dry. 3 months is a long time!
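As a sanity check on that analogy, here is a quick back-of-the-envelope sketch. The improvement rates come from the talk; treating the 1990 ratio of 1 second to 6 days as the baseline and compounding the rate gap for 10 years is my own illustrative assumption.

```python
# Sketch: does a ~50%/year vs ~15%/year improvement gap turn
# "1 second : 6 days" (1990) into roughly "1 second : 3 months" (2000)?
cpu_rate, disk_rate = 1.50, 1.15   # per-year improvement factors

ratio_1990 = 6 * 24 * 3600         # disk access in "human seconds"

# The gap grows by (1.50 / 1.15) each year; compound over 10 years.
years = 10
ratio_2000 = ratio_1990 * (cpu_rate / disk_rate) ** years

print(ratio_2000 / (30 * 24 * 3600))  # in months; close to 3
```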
Now, let us look at the price trend over time. Again, the cost axis is in log scale. First, consider the cost of paper and film, which is a critical barrier for any storage technology to cross in order to achieve economy of scale. Once a storage technology crosses this barrier, it becomes cheap enough to be a storage alternative. (animate) Now let's look at the various cost curves. For disks, new disk geometries are introduced roughly at the top boundary of the paper-and-film cost barrier. Also, notice the cost curve for persistent RAM: back in 1998, the boom in digital photography changed the slope of the curve. By 2005, we would expect to see 4 to 10 GB of persistent RAM on personal desktops, cheap enough once it crosses that boundary.
The idea of Conquest is to design and build a disk/persistent RAM hybrid file system, which delivers all file system services from memory, with the single exception of high-capacity storage. Two major benefits are simplicity and performance.
It is well known that small files take up little space but account for most accesses, while large files take up most of the storage capacity and are accessed sequentially most of the time. Of course, databases are an exception, and Conquest currently does not handle database workloads.
Based on this user behavior pattern, Conquest stores the following files in persistent RAM. Small files benefit the most from being stored in memory, because seek time and rotational delay make up the bulk of the time spent accessing small objects. Also, we now have fast byte-level accesses as opposed to block-level accesses, and small files are allocated contiguously. Storing metadata in memory avoids the notorious synchronous-update problem, which deserves some discussion. Basically, if there is no metadata, there is no file system; therefore, system designers take extra caution when it comes to handling metadata. If you update a directory, for example, most disk-based file systems will propagate the change synchronously to disk, which is a serious performance problem. By storing metadata in memory, we no longer have this problem. Also, we now have a single representation for metadata, as opposed to separate runtime and storage representations. Executables and shared libraries are also stored in core, so we can execute programs in place, which reduces program startup time significantly.
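The placement policy above can be sketched in a few lines. This is not Conquest's actual code, just a minimal illustration; the 1 MB small-file cutoff and the `.so` suffix check are hypothetical stand-ins for however the real system classifies files.

```python
# Sketch of the Conquest placement policy: small files, metadata,
# executables, and shared libraries live in persistent RAM; only
# large files go to disk. Threshold and suffix list are assumptions.
SMALL_FILE_THRESHOLD = 1 << 20   # 1 MB cutoff, illustrative
SHARED_LIB_SUFFIXES = (".so",)   # shared libraries, illustrative

def placement(name, size, is_metadata=False, is_executable=False):
    """Return 'ram' or 'disk' for a file, per the policy above."""
    if is_metadata or is_executable or name.endswith(SHARED_LIB_SUFFIXES):
        return "ram"             # in-place update / execute-in-place
    return "ram" if size < SMALL_FILE_THRESHOLD else "disk"

print(placement("notes.txt", 4096))       # small file -> ram
print(placement("movie.mpg", 700 << 20))  # large file -> disk
```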
Now let's take a look at the data path of conventional file systems. A storage request has to go through IO buffer management, which handles caching. If the request is not in the cache, it has to go through persistence support, which is responsible for translating between the storage and runtime forms of metadata. The request then needs to go through disk management, which handles disk layout, disk arm scheduling, and so on, before reaching the disk. For the Conquest memory path, updates to metadata and data are in place. There is no IO buffer management and no disk management. Also, for persistence support, we don't need to translate between runtime and storage states; all we need to manage is metadata allocation, which I will describe a bit later.
Since small files and metadata are taken care of, the disk only needs to handle large files. This means we can allocate disk space in big chunks, which translates into lower access and management overhead. Also, without small objects, we don't need to worry about fragmentation. We don't need tricks for small files, such as storing data inside the metadata, or elaborate data structures, such as wrapping a balanced tree onto the geometry of the disk cylinders.
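To make the "big chunks" point concrete, here is a minimal sketch of a chunk-granularity allocator. The 4 MB chunk size and free-list layout are illustrative assumptions, not Conquest's actual on-disk format.

```python
# Sketch: once only large files live on disk, space can be handed out
# in large fixed-size chunks, so allocation bookkeeping stays trivial.
CHUNK = 4 << 20  # 4 MB chunks, assumed

class ChunkAllocator:
    def __init__(self, disk_bytes):
        # Free list of whole-chunk ids; no per-block bitmap needed.
        self.free = list(range(disk_bytes // CHUNK))

    def allocate(self, file_bytes):
        """Grab enough whole chunks to hold the file."""
        n = -(-file_bytes // CHUNK)  # ceiling division
        if len(self.free) < n:
            raise OSError("out of space")
        got, self.free = self.free[:n], self.free[n:]
        return got

alloc = ChunkAllocator(1 << 30)        # 1 GB disk
print(len(alloc.allocate(10 << 20)))   # 10 MB file -> 3 chunks
```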
For large files that are accessed sequentially, the disk can deliver near raw bandwidth, which is about 100 MB per second. That speed is roughly 200 times faster than random disk accesses. Also, large files have well-defined readahead semantics. Since they are read-mostly, large-file handling involves little synchronization overhead between memory and disk.
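The 200x figure follows from simple arithmetic: a random access pays seek plus rotational delay for every block, while a sequential transfer amortizes it over the whole file. The ~8 ms per-access cost and 4 KB block size below are my own ballpark assumptions, consistent with disks of that era.

```python
# Rough arithmetic behind the "200x" claim.
seq_bandwidth = 100e6        # ~100 MB/s near raw bandwidth (from talk)
seek_plus_rotation = 8e-3    # ~8 ms per random access, assumed
block = 4096                 # bytes transferred per random access, assumed

random_bandwidth = block / seek_plus_rotation  # ~0.5 MB/s
print(seq_bandwidth / random_bandwidth)        # roughly 200x
```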
This shows the disk data path of Conquest. Again, on the left side, we have the data path for conventional file systems. Immediately, you can see that the Conquest data path bypasses the mechanisms involved in persistence support. The IO buffer management is greatly simplified because we know the behavior of large-file accesses. Also, disk management is greatly simplified due to the absence of small files and fragmentation management.
You may ask, "what about large files that are randomly accessed?" In the literature, random accesses are commonly defined as nonsequential accesses. However, if you take a look at, say, a movie file, it typically has around 150 scene changes; there are 150 places you may randomly jump to, and from each one you perform disk accesses sequentially. Also, in an MP3 file, the title is stored at the end of the file, so the typical access pattern is to jump to the end of the file and then go back to the beginning to play sequentially. Therefore, what may look like random accesses are really near-sequential accesses. With this knowledge, we can simplify the large-file metadata representation significantly: even dumb data structures are still fast in memory.
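As a sketch of why a "dumb" structure suffices, consider mapping a large file with a plain in-memory list of extents and a linear scan. The extent layout below is hypothetical; the point is that with near-sequential access and everything in RAM, no balanced tree keyed on disk geometry is needed.

```python
# Sketch: large-file metadata as a flat list of (disk_offset, length)
# extents, scanned linearly. Fast in memory; trivially simple.
extents = [(1_000_000, 4 << 20), (9_000_000, 4 << 20)]  # hypothetical

def locate(extents, file_offset):
    """Map a file offset to a disk address by linear scan."""
    for disk_off, length in extents:
        if file_offset < length:
            return disk_off + file_offset
        file_offset -= length
    raise ValueError("offset past end of file")

print(locate(extents, (4 << 20) + 100))  # lands in the second extent
```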
Now, let's look at the performance of Conquest. This slide shows the results of the PostMark benchmark, which models an ISP workload. The graph plots the number of files against the transaction rate. Conquest, in dark blue, is compared against ramfs, ext2, reiserfs, and SGI XFS. Ramfs, in light blue, does not provide persistence, but it serves as a base case for the quality of the Conquest implementation. Ext2, in green, is the most widely used file system in the UNIX world. Reiserfs, in orange, is a journaling file system optimized for small files. SGI XFS, in red, is also a journaling file system, which is based on the IO-Lite technology. As you can see, Conquest's performance is comparable to that of ramfs. Compared to the other disk-based file systems, Conquest is at least 24% faster. Note that all these file systems are operating within the LRU disk cache; file systems optimized for disk do not take full advantage of memory speed.
Now let's fix the number of files at 10,000 and vary the percentage of large files from 0 to 10 percent. Since the working set is larger than memory, the graph does not include ramfs. As you can see, when both memory and disk components are exercised, Conquest can still be several times faster than leading disk-based file systems. Here is the boundary of physical RAM. Since we can't see the right side of the graph very well, let's zoom in.
When the working set is greater than RAM, Conquest still runs 1.4 to 2 times faster than various disk-based file systems. This improvement is very significant.
[Title slide] File System Extensibility and Non-Disk File Systems. Andy Wang, COP 5611 Advanced Operating Systems.
[Slide: Memory Data Path of Conquest. Left: conventional file system path (storage requests, IO buffer management and IO buffer, persistence support, disk management, disk). Right: Conquest memory path (storage requests, persistence support, battery-backed RAM for small-file and metadata storage).]
[Slide: Disk Data Path of Conquest. Left: conventional file system path (storage requests, IO buffer management and IO buffer, persistence support, disk management, disk). Right: Conquest disk path (storage requests, IO buffer management and IO buffer, disk management, disk as a large-file-only file system, with battery-backed RAM for small-file and metadata storage).]