Analysis of Disk Access Patterns on File Systems for Content Addressable Storage
Kuniyasu Suzaki, Kengo Iijima, Toshiki Yagi, Cyrille Artho
Research Center for Information Security
Linux Symposium 2011 at Ottawa

What I want to talk about!
• I show evidence of the affinity between file systems and CAS (fixed-size deduplication storage).
• The evidence indicates:
  – We should NOT use ext3 on deduplication storage (IaaS cloud).
  – Candidates for a good FS for deduplication storage:
    • NTFS was good on the deduplication test, but it cannot boot Linux.
    • ext4 was stable on the deduplication test and the real case, but it was not the best.
    • ReiserFS showed good results on the real case, but has weak points.
    • JFS showed a high same-chunk ratio, but its other results were not good.
    • btrfs was good on the deduplication test, but has not been tested on the real case yet.
# Please discuss or comment.

Contents
• What is CAS? What is deduplication?
• Block allocation strategy of file systems
• Preliminary evaluation of the affinity between file systems and CAS
  – Propose a file deduplication test, and evaluate 9 file systems (ext3, ext4, XFS, JFS, ReiserFS, NILFS, btrfs, FAT32 and NTFS).
• Real case evaluation
  – Ubuntu installed on ext3/ext4/JFS/ReiserFS/XFS on CAS
• Conclusion

CAS: Content Addressable Storage
• Virtual block device. Data is not addressed by its physical location; it is addressed by a unique name derived from the content (usually a secure hash).
• Identical contents are stored once as one original (same hash); the other occurrences refer to it by an indirect link (deduplication storage).
  – Plan9 has Venti [USENIX FAST02]
  – Data Domain (EMC) Deduplication [USENIX FAST08]
  – LBCAS (Loopback Content Addressable Storage) [LinuxSymp09]

[Figure: a virtual disk is indexed into a CAS storage archive. Chunks with the same SHA-1 are shared (deduplication); a new block is created with a new SHA-1.]

  Address            SHA-1
  0000000-0003FFF    4ad36ffe8…
  0004000-0007FFF    974daf34a…
  0008000-000BFFF    2d34ff3e1…
  000C000-000FFFF    974daf34a…
  …                  …
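
The indexing idea on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the LBCAS implementation: the names `build_cas` and `read_chunk` are hypothetical, and the 16 KB chunk size is an assumption taken from the 0x4000-byte address ranges in the figure.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # assumption: matches the 0x4000-byte ranges in the figure


def build_cas(disk: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a virtual disk into fixed-size chunks and index them by SHA-1."""
    index = []   # chunk number -> SHA-1 hex digest (the "address" table)
    store = {}   # SHA-1 hex digest -> chunk content, kept only once
    for offset in range(0, len(disk), chunk_size):
        chunk = disk[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)   # identical chunks share one entry
        index.append(digest)
    return index, store


def read_chunk(index, store, n):
    """Resolve chunk number n through its hash (the indirect link)."""
    return store[index[n]]


# Four chunks on the virtual disk; the 2nd and 4th have the same content.
a = b"A" * CHUNK_SIZE
b = b"B" * CHUNK_SIZE
c = b"C" * CHUNK_SIZE
disk = a + b + c + b
index, store = build_cas(disk)
print(len(index), "chunks addressed,", len(store), "chunks stored")
```

As in the figure's table, two address ranges map to the same hash, so only three chunks are actually stored.
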

Fixed Size vs. Variable Length
• Contents for deduplication are managed in units called “chunks”.
• According to the chunk size, CAS is divided into 2 categories.
  – Fixed size: efficient, but cannot find identical contents that do not match the chunk alignment.
    • A chunk is usually bigger than 4KB (the FS block) for performance.
  – Variable length: finds identical contents of any length, but is not efficient.
• In this talk, we assume CAS with fixed-size chunks.
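
A small sketch of the alignment limitation of fixed-size chunks, under stated assumptions: the 32 KB chunk and 4 KB block sizes follow the talk, while the helper names and the three disk layouts are hypothetical. The same 256 KB file content is found when it sits on a chunk boundary, and is not found when it is shifted by one 4 KB block.

```python
import hashlib
import random

CHUNK = 32 * 1024   # fixed chunk size of the CAS
BLOCK = 4 * 1024    # FS block size


def chunk_hashes(data, chunk=CHUNK):
    """SHA-1 of every fixed-size chunk of a disk image."""
    return [hashlib.sha1(data[i:i + chunk]).digest()
            for i in range(0, len(data), chunk)]


def shared_chunks(disk_a, disk_b):
    """Number of distinct chunks that appear in both images."""
    return len(set(chunk_hashes(disk_a)) & set(chunk_hashes(disk_b)))


rng = random.Random(0)
payload = rng.getrandbits(8 * 8 * CHUNK).to_bytes(8 * CHUNK, "little")  # 256 KB file

disk_a = payload                                         # file at offset 0
disk_b = bytes(CHUNK) + payload                          # file at offset 32 KB (aligned)
disk_c = bytes(BLOCK) + payload + bytes(CHUNK - BLOCK)   # file at offset 4 KB (misaligned)

print("aligned copy shares", shared_chunks(disk_a, disk_b), "chunks")
print("shifted copy shares", shared_chunks(disk_a, disk_c), "chunks")
```

A variable-length scheme would still find the shifted content, at the cost of more work per write.
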

Open Source CAS (deduplication storage)
• LBCAS: Loopback Content Addressable Storage
  – http://openlab.jp/oscircular/
• SDFS: A user space deduplication file system
  – http://www.opendedup.org/
• lessfs: Open source data deduplication for less
  – http://www.lessfs.com/
# In this talk, I use LBCAS.

Where is it used?
• The current main target is backup servers.
  – Many commercial products exist (EMC, Symantec, NetApp, etc.).
• IaaS hosts many virtual machines and keeps many virtual disks for them. Fortunately, most people use popular OSes, so the disks hold much of the same content.
• Deduplication is applied to reduce the storage consumed by many virtual disks.
• Even if the same contents are saved in virtual disks, the effect of fixed-size deduplication depends on how the file system stores the data on each virtual disk.

File Systems
• Linux has many file systems for many purposes.
• A file system works as a filter that allocates data on a disk.
  – Each filter changes the location of data by its own strategy.
  – Depending on the location, the effect of deduplication changes.

File Systems
  File System      Feature for block allocation
  ext3 *           ext2 with journaling. Block groups are imported from FFS.
  ext4 *           Successor of ext3. Extent allocation, delayed allocation.
  JFS *            Dynamic i-node allocation, extent allocation.
  XFS *            Variable block size, extent allocation.
  ReiserFS (v3) *  Block sub-allocation (tail packing).
  Nilfs            Stackable (log-structured) FS.
  Btrfs            Copy-on-write, extent allocation.
  FAT32            FS for Windows. File allocation table. No journaling.
  NTFS             FS for Windows NT. Extent allocation. Linux uses the NTFS-3G driver.
“*” indicates a bootable FS.
All file systems except FAT32 have a journaling function.

Allocation Techniques
• Extent allocation
  – Keeps contiguous physical blocks for a file and reduces fragmentation.
• Block sub-allocation (tail packing)
  – Allocates the last partial blocks (less than 4KB) of multiple files into a single block.
• Stackable (log-structured) FS
  – Allocates data in succession from the top to the tail of a disk.

To increase deduplication
• The FS (a filter that allocates data on a disk) should keep some features.
  – Alignment matching
    • If the FS allocates each file to fit the chunk alignment, the file is easily deduplicated.
  – Contiguous allocation of data blocks
    • If 4KB data blocks are not allocated contiguously, deduplication is reduced, especially on a large file. Extents solve this problem.
  – Non-contaminated chunks
    • If a chunk is shared by several files, deduplication is reduced.
    • If a 4KB data block is shared with another file (block sub-allocation), deduplication is reduced. (ReiserFS will not fit.)
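
The effect of two of these features (alignment and non-contamination) can be illustrated with a toy disk made of 4 KB blocks. This is a sketch under assumptions, not a model of any real file system: the three layouts, the 64 KB file, and the number of copies are all hypothetical.

```python
import hashlib
import random

BLOCK = 4 * 1024     # FS block size
CHUNK_BLOCKS = 8     # 32 KB chunk = 8 blocks
COPIES = 10          # hypothetical number of identical files


def unique_chunks(blocks):
    """Count distinct fixed-size chunks on a disk given as a list of blocks."""
    seen = set()
    for i in range(0, len(blocks), CHUNK_BLOCKS):
        chunk = b"".join(blocks[i:i + CHUNK_BLOCKS])
        seen.add(hashlib.sha1(chunk).digest())
    return len(seen)


rng = random.Random(1)


def random_block():
    return rng.getrandbits(8 * BLOCK).to_bytes(BLOCK, "little")


file_blocks = [random_block() for _ in range(16)]   # one 64 KB file

# Layout 1: every copy starts on a chunk boundary and is contiguous.
aligned = file_blocks * COPIES

# Layout 2: one extra 4 KB block (e.g. metadata) precedes each copy,
# so the copies drift against the chunk boundaries.
misaligned = []
for _ in range(COPIES):
    misaligned += [bytes(BLOCK)] + file_blocks

# Layout 3: the last block of each copy's second chunk holds data of
# another file, so that chunk is "contaminated" and differs per copy.
contaminated = []
for _ in range(COPIES):
    contaminated += file_blocks[:15] + [random_block()]

for name, layout in [("aligned", aligned), ("misaligned", misaligned),
                     ("contaminated", contaminated)]:
    print(name, unique_chunks(layout), "unique chunks")
```

The aligned layout needs only the two chunks of one file; the other layouts keep many more chunks for the same file contents.
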

File Deduplication TEST
• When 1,000 files with the same 1MB content are stored on a disk through a normal file system, they use 1,000 MB of storage.
• However, if the deduplication of CAS works perfectly, the files are saved in only 1MB.

File Deduplication TEST
[Figure: the same-content files are saved to a CAS system through File System A and through File System B, and the chunks of each volume are compared.
  – File System A (a filter that allocates files on the disk by its own strategy): only a few chunks are deduplicated.
  – File System B (allocates files with alignment, contiguity, and non-contamination): the chunks are identified as the same and deduplicated.]

File Deduplication TEST
• We saved identical files to fill 1 GB on a 4GB LBCAS. (We evaluated 2 chunk sizes: 32KB and 256KB.)
  – The files have the same random data.
• 5 cases
  – 100 KB file * 10,000
  – 1,000 KB (1 MB) file * 1,000
  – 10,000 KB (10 MB) file * 100
  – 256KB file * 3,906
    • Checks whether data is allocated on a power-of-2 alignment.
  – 252KB file * 3,968
    • Used for comparison with the 256KB file. If one 4KB block is used for metadata or something similar, the file plus that block (256KB) fits a power-of-2 alignment.
    • We assume a stackable FS fits the 256KB or 252KB file case.
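
The file counts of the 5 cases can be reproduced with simple arithmetic. The sketch assumes that "1 GB" means 1,000,000 KB, which is an assumption that happens to match the counts on the slide; the helper names are hypothetical.

```python
TOTAL_KB = 1_000_000                      # assumption: 1 GB counted as 1,000,000 KB
FILE_SIZES_KB = [100, 1_000, 10_000, 256, 252]
BLOCK_KB = 4                              # FS block size


def file_count(size_kb, total_kb=TOTAL_KB):
    """How many whole files of size_kb fit into total_kb."""
    return total_kb // size_kb


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


counts = {size: file_count(size) for size in FILE_SIZES_KB}
for size, count in counts.items():
    print(f"{size:>6} KB file * {count:,}")

# The 252 KB case: the file alone is not a power of 2,
# but the file plus one 4 KB block is.
print(is_power_of_two(256), is_power_of_two(252), is_power_of_two(252 + BLOCK_KB))
```
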

Result overview
[Figure panels: 32KB chunk, 256KB chunk]
• Nilfs and ext3 are bad.
• The smaller chunk has more chances to be deduplicated, but the overhead becomes heavy.
• Most FS do not treat the 10MB file well.
  – Contiguous allocation is not kept.
• The 252KB and 256KB files don’t show special features.

Result detail
[Figure panels: 32KB chunk, 256KB chunk]
• The ideal deduplication line shows the ideal (smallest) CAS size. The closer a bar is to the line, the better.
• NTFS is good on both 32KB and 256KB chunks.
• Ext4 and btrfs are good on 32KB chunk.

Result: Comparison between 32KB and 256KB chunk
• Ratio: (CAS size on 256KB chunk) / (CAS size on 32KB chunk)
• The ratio shows how much worse the result becomes with the larger chunk size (from 32KB to 256KB, x8).
• FAT32 shows durability for the larger chunk.
  – Almost 4 times on any file size, but its deduplication is not good.
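
The ratio on this slide can be computed for any disk image, as in the following sketch. The two example layouts are hypothetical and chosen only to show the two extremes: content aligned to 256 KB (ratio 1) and content aligned to 32 KB but not to 256 KB (ratio 8). The function names are assumptions, not part of LBCAS.

```python
import hashlib
import random

KB = 1024


def cas_size(disk, chunk_size):
    """Bytes kept by a fixed-size CAS: total size of the distinct chunks."""
    unique = {}
    for off in range(0, len(disk), chunk_size):
        chunk = disk[off:off + chunk_size]
        unique[hashlib.sha1(chunk).digest()] = len(chunk)
    return sum(unique.values())


def chunk_size_ratio(disk, small=32 * KB, large=256 * KB):
    """(CAS size on 256KB chunk) / (CAS size on 32KB chunk)."""
    return cas_size(disk, large) / cas_size(disk, small)


rng = random.Random(2)
file_data = rng.getrandbits(8 * 256 * KB).to_bytes(256 * KB, "little")  # 256 KB file

# Eight copies, each starting on a 256 KB boundary.
aligned = file_data * 8
# Eight copies, each preceded by 32 KB of zeros: aligned to 32 KB only.
shifted = (bytes(32 * KB) + file_data) * 8

print("aligned:", chunk_size_ratio(aligned))
print("shifted:", chunk_size_ratio(shifted))
```

A ratio near 1 means the layout survives the larger chunk; a ratio near 8 means the larger chunk loses the deduplication that the 32 KB chunk still finds.
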

Summary of File Deduplication TEST
• Ext3 and nilfs are not good for fixed-size deduplication (LBCAS).
• NTFS is good on both chunk sizes (32KB and 256KB) and on any file size (100KB, 1MB, 10MB, 252KB and 256KB).
• Ext4 and Btrfs are good on the 32KB chunk size.

Real Case Evaluation
• We evaluate installing and booting Ubuntu (11.04 desktop) on CAS.
• Ubuntu is installed on different file systems.
  – The contents on each CAS are almost the same, so the differences show the features of the file system.
• The target file systems are bootable FS that GRUB recognizes.
  – ext3, ext4, XFS, JFS, and ReiserFS
• We evaluate the dynamic behavior at installing and booting, and the static CAS images.

Evaluation condition
• Ubuntu 11.04 desktop is installed on a 4GB virtual disk (LBCAS) with a KVM virtual machine.
• The KVM guest has 768 MB memory and runs on a ThinkPad T400 (Intel Core2 Duo, 2 GB memory).
• We compared the effect of the 32 KB and 256 KB chunks of LBCAS.

Statistics for each file size in Ubuntu
• The contents installed by Ubuntu are almost 2GB on any FS.
  # Files of less than 4KB are rounded up to 4KB, because the normal block is 4KB.
• 77.9% of the files are less than 4KB, but together they occupy 20.1% of the disk space.
• The file system works as a filter and allocates them with its own strategy.
[Figure: statistics by file size; number of files (total 132,205) and amount of data (total 2GB).]
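
The rounding rule in the note above can be written down as a short sketch. The file sizes in the example are made up for illustration and are not the Ubuntu statistics; counting an empty file as one block is also an assumption.

```python
BLOCK = 4 * 1024  # normal FS block size


def allocated_size(size_bytes):
    """Size rounded up to whole 4 KB blocks (assumption: at least one block)."""
    blocks = max(1, -(-size_bytes // BLOCK))   # ceiling division
    return blocks * BLOCK


def small_file_shares(sizes):
    """Share of files below 4 KB, by file count and by allocated space."""
    small = [s for s in sizes if s < BLOCK]
    total_alloc = sum(allocated_size(s) for s in sizes)
    small_alloc = sum(allocated_size(s) for s in small)
    return len(small) / len(sizes), small_alloc / total_alloc


sizes = [100, 2000, 3000, 5000, 1_000_000]   # hypothetical file sizes in bytes
count_share, space_share = small_file_shares(sizes)
print(f"{count_share:.1%} of files, {space_share:.1%} of space")
```
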

Access Trace on each FS
[Figure: access traces on the 4GB disk for ext3, ext4, JFS, XFS, and ReiserFS. Panels: Installing (2,000 sec) and Booting (120 sec). Red is read; green is write.]

Installing
[Figure: amount (MB) of read and write requests issued from the installer, and accessed chunks. Annotations: “Reduced by more than 1GB”; “Remember the amount of files is 2GB.”]
• The amount of write requests was more than 3GB, and it was reduced on LBCAS (by more than 1GB).
  – It means the installer issues redundant write requests.
• XFS requires the most write requests, even though almost the same image is installed. JFS requires the least.

Overhead for creating FS
[Figure: amount (MB) of write requests issued from mkfs, and created chunks. Annotations: “Ext3 had more than 100MB loss.”; “JFS has almost no loss. It means the chunks are full of data.”]
• Creating a FS (mkfs) has many losses from the view of LBCAS, except on JFS.
  – It means creating a FS issues redundant write requests. However, the loss at installation (more than 1GB) is not compensated by creating the FS.

Static Disk Image (Coverage of created chunks)
[Figure: coverage (MB) of created chunks. Left is 32KB chunk; right is 256KB chunk. Annotations: “Remember the amount of files is 2GB.”; “10% is reduced by tail packing”.]
• There is only one zero-filled chunk, but it covers half of the disk.
• ReiserFS made the smallest CAS image. This comes from tail packing.

Deduplication on each Single Disk Image
[Figure: effect of deduplication on each disk image (MB). Left is 32KB chunk; right is 256KB chunk. Annotation: “Reduced by deduplication”.]
• ext3 and ext4 have many same chunks, which are deduplicated. However, the total is too small (less than 80MB) compared to the 2GB image. The impact is low within a single disk image.
• We should evaluate the ratio of same chunks across other CAS images (discussed later).

Booting
[Figure: amount (MB) of requests issued from OS booting, and chunks read for the requests.]
• The amount of chunk data read at boot time is more than the requests from the OS.
  – Redundant data is read from CAS.
  – The file system should be optimized to pack data into chunks.
    • See our paper presented at the ASPLOS2011 workshop “Resolve”.

Relation between CAS Images
• Compare the ratio of same chunks between different FS.
• Compare the ratio of same chunks between different installations with the same FS.
• The results indicate the affinity of CAS images on multi-tenant IaaS.
  – A high ratio is desired.
[Figure: CAS images for ext3, ext4, ReiserFS, JFS, and XFS are compared between different file systems, and each is compared with another CAS image from a separate installation on the same file system.]
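
One possible way to compute a "ratio of same chunks" between two images is sketched below. The slides do not give the exact definition, so the definition used here (distinct chunks of image A that also occur in image B, divided by the distinct chunks of image A) is an assumption, as are the function names and the tiny 4-byte chunk used to keep the example readable.

```python
import hashlib


def chunk_hashes(image, chunk_size):
    """Set of SHA-1 digests of the fixed-size chunks of an image."""
    return {hashlib.sha1(image[i:i + chunk_size]).hexdigest()
            for i in range(0, len(image), chunk_size)}


def same_chunk_ratio(image_a, image_b, chunk_size):
    """Fraction of image A's distinct chunks that also occur in image B."""
    a = chunk_hashes(image_a, chunk_size)
    b = chunk_hashes(image_b, chunk_size)
    return len(a & b) / len(a) if a else 0.0


CHUNK = 4                        # tiny chunk size, only for illustration
image_a = b"AAAABBBBCCCCDDDD"    # hypothetical image with 4 distinct chunks
image_b = b"CCCCDDDDEEEEFFFF"    # shares the chunks CCCC and DDDD
ratio = same_chunk_ratio(image_a, image_b, CHUNK)
print(ratio)
```

On a multi-tenant host, a high ratio means that a second image adds few new chunks to the shared store.
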

Relation between CAS Images
• From the upper graph (between different file systems)
  – There is no strong relation between different FS.
  – 4KB chunk has high similarity, because most file systems use a 4KB block.
• From the lower graph (between different installations with the same file system)
  – JFS and ReiserFS show a high same-chunk ratio on any chunk size. We guess there is block allocation repeatability.
    • ReiserFS has block sub-allocation (tail packing), and the total CAS size is reduced by 10%. However, there are many same chunks on different installations. It means that there are identical combinations of sub-allocations.

Block Allocation Repeatability
• Next challenge
  – Why is there block allocation repeatability on JFS and ReiserFS? Why not on ext3, ext4, and XFS?
  – Is it caused by the installer?
• This is important for fixed-size deduplication storage.

Conclusions
• I showed evidence of the affinity between file systems and CAS (fixed-size deduplication storage).
• The results indicate:
  – We should NOT use ext3 on deduplication storage (IaaS cloud).
  – Candidates for a good FS for deduplication storage:
    • NTFS was good on the deduplication test, but it cannot boot Linux.
    • ext4 was stable on the deduplication test and the real case, but it was not the best.
    • ReiserFS showed good results on the real case, but has weak points.
    • JFS showed a high same-chunk ratio, but its other results were not good.
    • btrfs was good on the deduplication test, but has not been tested on the real case yet.
# Please discuss or comment.

Reference
• EuroSys 2011 Tutorial “Data Deduplication” by Andre Brinkmann (University of Paderborn)
  – PDF http://bit.ly/khrs1a