SlideShare a Scribd company logo
Consistency Without Ordering 

    Vijay Chidambaram, Tushar Sharma, 
Andrea Arpaci‐Dusseau, Remzi Arpaci‐Dusseau 
                      
       The Advanced Systems Laboratory 
          University of Wisconsin Madison 
The problem: crash consistency 
•  Single operaBon updates mulBple blocks 

•  System might crash in the middle of operaBon 
      –  Some blocks updated, some blocks not updated 


•  AEer crash, file system needs to be repaired 
      –  In order to restore consistency among blocks 


2/15/12                      FAST 12                     2 
SoluBon #1: Lazy, opBmisBc approach 
•  Write blocks to disk in any order 
      –  Fix inconsistencies upon reboot 

•  Advantage: Simple, High performance 

•  Disadvantage: Expensive recovery 

•  Example: ext2 with fsck [Card94]  

2/15/12                      FAST 12        3 
SoluBon #2: Eager, pessimisBc approach 
•  Carefully order writes to disk 

•  Advantage: Quick recovery 

•  Disadvantage: Perpetual performance penalty 

•  Examples 
      –   SoE updates (FFS) [Ganger94] 
      –   Journaling (CFS) [Hangmann87] 
      –   Copy‐on‐write (ZFS) [Bonwick04] 
2/15/12                          FAST 12      4 
Ordering points considered harmful 
•  Reduce performance 
      –  Constrain scheduling of disk writes 


•  Increase complexity 

•  Require lower‐level primiBves 
      –  IDE/SATA Cache flush commands 
 
2/15/12                       FAST 12           5 
Ordering points require trust  
 •  File system runs on stack of virtual devices 
       –  Consistency fails if any device ignores commands to 
          flush cache 

F_FULLFSYNC  “…The  operaLon  may  take  quite  a  while  to  complete. 
                Certain  FireWire  drives  have  also  been  known  to  ignore 
                the request to flush their buffered data.” 


 VirtualBox  “If desired, the virtual disk images can be flushed when the 
                guest issues the IDE FLUSH CACHE command. Normally 
                these requests are ignored for improved performance” 
                 
  2/15/12                            FAST 12                                 6 
Is crash‐consistency possible 
               without ordering points? 

•  Middle ground between lazy and eager approaches 

•  Simplicity and high performance of lazy approach 

•  Strong consistency and availability of eager approach 



 2/15/12                  FAST 12                      7 
Our soluBon: 
           No‐Order File System (NoFS) 

     Order‐less file system which uses 
    mutual agreement between objects 
           to obtain consistency 



2/15/12                FAST 12            8 
Results 
•  Designed a new crash‐consistency technique 
   – Backpointer‐based consistency (BBC) 

•  TheoreBcally and experimentally verified that 
   NoFS provides strong consistency 

•  Evaluated NoFS against ext2 and ext3 
      –  NoFS performance comparable to ext2 
      –  NoFS performance equal to or beger than ext3 
2/15/12                     FAST 12                      9 
Outline 
•  IntroducBon 

•  Crash‐consistency and Object idenBty 

•  The No‐Order File System 

•  Results 

•  Conclusion 


2/15/12                 FAST 12            10 
Crash consistency and object idenBty 
    All file system inconsistencies are due to 
 ambiguity about the logical idenLty of an object 

•  Logical idenBty of an object 
      –  Data block: Owner file, offset 
      –  File: Parent directories 
•  Common inconsistencies 
      –  Two files claim the same data block 
      –  File points to garbage data 

2/15/12                      FAST 12            11 
Crash Scenario 
•  AcBons: 
    –  File A is truncated 
    –  The freed data block is allocated to File B 
    –  The updated data blocks are wrigen to disk 
•  Problem: Due to a crash, File A is not updated on disk 
•  Result: On disk, both files claim the data block 

                                     Data 
  MEMORY        File A                               File B 
                                     block 




      DISK                           Data 
                File A 
                                     block 
2/15/12                        FAST 12                         12 
Outline 
•  IntroducBon 

•  Crash‐consistency and Object idenBty 

•  The No‐Order File System 
    –  Backpointer‐based consistency (BBC) 
    –  Non‐persistent allocaBon structures 

•  Results 

•  Conclusion 


2/15/12                      FAST 12          13 
Backpointer‐based consistency (BBC) 
•  Associate object with its logical idenBty 
   –  Embed backpointer into each object 
   –  Owner(s) of the object found through backpointer 
•  Consistency obtained through mutual agreement 
•  Key AssumpBon  
   –  Object and backpointer wrigen atomically 


                                        Data 
                   File A 
                                        block 

 2/15/12                     FAST 12                      14 
Using backpointers in a crash scenario 
•  AcBons: 
    –  File A is truncated 
    –  The freed data block is allocated to File B 
    –  The updated data blocks are wrigen to disk 
•  Problem: Due to a crash, File A is not updated on disk 
•  Result: Using the backpointer, the true owner is idenBfied 

                                    Data 
  MEMORY        File A                              File B 
                                    block 




      DISK                          Data 
                File A 
                                    block 
2/15/12                       FAST 12                           15 
Backpointers of different objects 
•  Data blocks have a single backpointer to file 
•  Files can have many backpointers 
    –   One for each parent directory 
•  DetecBon of inconsistencies 
    –  Each access of an object involves checking its backpointer 



           Directory 


                                              Data 
           Directory           File 
                                              block 


2/15/12                        FAST 12                           16 
Formal Model of BBC 
•  Extended a formal model for file systems with 
   backpointers [Sivathanu05] 

•  Defined the level of consistency provided by BBC 
    –  Data consistency 
 
•  Proved that a file system with backpointers 
   provides data consistency 
2/15/12                    FAST 12                 17 
Outline 
•  IntroducBon 

•  Crash‐consistency and Object idenBty 

•  The No‐Order File System 
    –  Backpointer‐based consistency 
    –  Non‐persistent allocaBon structures 

•  Results 

•  Conclusion 


2/15/12                      FAST 12          18 
AllocaBon structures 
•  File systems need to track allocaBon status 
•  Crash may leave such structures inconsistent 
•  True allocaBon status needs to be found 


                           Data     Data block bitmap 
  MEMORY       File A 
                           block    1 
                                    0 




                                    Data block bitmap 
      DISK 
                                    0 

2/15/12                  FAST 12                         19 
AllocaBon structures 
•  AEer a crash, true allocaBon status of all 
   objects must be found 

•  TradiBonal file systems do this proacBvely 
      –  File‐system check scans disk to get status 
      –  Journaling file systems write to a log to avoid scan 




2/15/12                       FAST 12                      20 
Non‐persistent allocaBon structures 
•  NoFS does not persist allocaBon structures 

•  Why? 
      –  Cannot be trusted aEer crash, need to be verified 
      –  Complicate update protocol 




2/15/12                      FAST 12                     21 
Non‐persistent allocaBon structures 
•  How is allocaBon informaBon tracked then? 
      –  Need to know which metadata/data blocks are free 


•  Move the work of finding allocaBon informaBon 
   to the background 
      –  CreaBon of new objects can proceed without 
         complete allocaBon informaBon 



2/15/12                     FAST 12                    22 
Non‐persistent allocaBon structures 
•  Backpointers used to determine allocaBon 
       –  Object in use if pointers mutually agree 
       –  Check each object individually 
       –  Use validity bitmaps to track checked objects 


•  AllocaBon structures built up incrementally 



2/15/12                       FAST 12                      23 
Determining allocaBon informaBon 
                      ext2                                           NoFS 
                                                       Data block         Data block  
               Data block bitmap                         bitmap         validity bitmap 
                     1  0  1  0                        1  0  1  1 
                                                       ‐  ‐  ‐  ‐             1  1  0  1 
                                                                              0  0  0  0 
                                                                                    1 


           File A             Data block                  File A             Data block 


           File B             Data block                  File B             Data block 


           File C             Data block                  File C             Data block 


           File D             Data block                  File D             Data block 

2/15/12                                     FAST 12                                         24 
Background Scan 
•  Complete allocaBon informaBon not needed  

•  AllocaBon informaBon discovered using two 
   background threads 
      –  One for metadata 
      –  One for data 

•  Scheduling of scan can be configured 
      –  Run when idle 
      –  Run periodically 


2/15/12                      FAST 12            25 
Design 
Memory      Group descriptor        Inode bitmap            Data block bitmap 


                                Inode validity bitmap     Data block validity bitmap 



            Group descriptor         Inode bitmap            Data block bitmap 
 Disk 




                     Directory                    File        Data block 



 2/15/12                               FAST 12                                    26 
ImplementaBon 
•  Based on ext2 codebase 

•  Three types of backpointers 
      –  Data block backpointers {inode num, offset} 
      –  Inode backlinks {inode num} 
      –  Directory block backpointers {dot directory entry} 

•  Inode size increased to support 32 backlinks 

•  Modified the linux page cache to add checks 
2/15/12                        FAST 12                         27 
Outline 
•  IntroducBon 

•  Crash‐consistency and Object idenBty 

•  The No‐Order File System 
    –  Backpointer‐based consistency 
    –  Non‐persistent allocaBon structures 

•  Results 

•  Conclusion 


2/15/12                      FAST 12          28 
EvaluaBon 
•  Q: Is NoFS robust against crashes? 
    –  Fault injecBon tesBng 

•  Q: What is the overhead of NoFS? 
    –  Evaluated on micro and macro benchmarks 


•  Q: How does the background scan affect performance? 
    –  Measured write bandwidth, access latency during scan 


 2/15/12                        FAST 12                    29 
Is NoFS robust against crashes? 
Fault injecBon tesBng 
•  Interpose pseudo‐device driver           Writes from file system 

   between the file system and disk 
        NoFS detected all inconsistencies 
•  Discard writes to selected sectors 
                                                Pseudo‐device 
                                                     driver 
           •  Errors returned on invalid access 
•  Emulate crash with different blocks  Selected writes 
           •  Orphan inodes/blocks reclaimed 
   successfully updated on disk 
•  20 different crash scenarios                       Disk 

 


 2/15/12                       FAST 12                          30 
What is the overhead of NoFS? 
                                     Performance in micro and macro benchmarks 
                                                        ext2       NoFS       ext3 
Normalized throughput 




                           1 
                         0.8      NoFS performance comparable to ext2 
                                                    
        vs ext2 




                         0.6 
                         0.4 
                                 NoFS performance is beger than ext3 for 
                         0.2 
                           0 
                                          sync heavy workloads 
                                  SeqWrite            RandWrite             File Create          Varmail 
                                Writes to 1 GB file    4088 bytes           100K files over 100    Filebench 
                                                      per write to          directories with 
                                                        1 GB file                 fsync 


         2/15/12                                                FAST 12                                       31 
How does the background scan 
                affect performance? 
 
•  Scan reads are interleaved with file system I/O 

•  Access to objects not verified by scan incurs a 
   performance penalty 




2/15/12                   FAST 12                    32 
Scan reads are interleaved with file system I/O 

•  Scan reads interfere with applicaBon reads 
   and writes 

•  Experiment 
      –  Write a 200 MB file every 30 seconds 
      –  Measure bandwidth 




2/15/12                     FAST 12              33 
Scan reads are interleaved with file system I/O 

                                          Write bandwidth obtained  
                       70 
                       60 
   Bandwidth (MB/s) 




                       50 
                       40 
                       30 
                       20 
                       10 
                        0 
                             0    200    400    600      800     1000    1200    1400    1600 
                                                       Time (s) 


2/15/12                                            FAST 12                                 34 
Scan reads are interleaved with file system I/O 

                                                 Write bandwidth obtained 
                               70 
                               60 




                                                             Scan compleBon 
           Bandwidth (MB/s) 




                               50 

                         I/O bandwidth is reduced during scan, but peak 
                          40 
                          30 performance achieved on scan compleBon 
                               20 
                               10 
                                0 
                                     0    30  60  90  120  150  180  210  240  270  300  330  360 
                                                              Time (s) 


2/15/12                                                     FAST 12                                  35 
Access to objects not verified by scan costs more

 •  The stat problem 
       –  stat returns number of blocks allocated 
       –  This informaBon might be stale for un‐verified inode 
       –  NoFS verifies the inode upon stat 
           •  Involves checking each inode data block 




 2/15/12                      FAST 12                     36 
Access to objects not verified by scan costs more

 •  Experiment 
      –  Create a number of directories with 128 files (each 1 MB) 

      –  At each 50 second interval, starBng from file‐system mount 
          •  Run ls –l on directory 
          •  This causes a stat call on every inode 
          •  stat on un‐verified inodes requires reading all its data 

      –  Measure Bme taken 



 2/15/12                          FAST 12                            37 
Access to objects not verified by scan costs more

                              18 
  Time taken for ls –l (s) 



                              16 




                                                                      Scan compleBon 
                              14 
                              12 
                                               There is a performance cost to accessing 
                              10 
                               8 
                                                 un‐verified objects during the scan 
                               6                                   
                               4               One Bme cost, only unBl scan compleBon 
                               2 
                               0 
                                    0    50     100  150  200  250  300  350  400  450  500  550  600 
                                                   Time aKer file‐system mount (s) 

 2/15/12                                                        FAST 12                             38 
Outline 
•  IntroducBon 

•  Crash‐consistency and Object idenBty 

•  The No‐Order File System 
    –  Backpointer‐based consistency 
    –  Non‐persistent allocaBon structures 

•  Results 

•  Conclusion 


2/15/12                      FAST 12          39 
Summary 
•  Problem: Providing crash‐consistency and high 
   availability without ordering points 

•  SoluBon: NoFS with Backpointer‐based consistency 
      –  Use mutual agreement to drive consistency  


•  Advantages: 
      –  Strong consistency guarantees  
      –  Performance similar to order‐less file system  

2/15/12                       FAST 12                     40 
Conclusion 
•  Trust is implicit in many layers of storage systems 

•  Removing such trust is key to building robust, 
   reliable storage systems 




2/15/12                  FAST 12                     41 
Thank you! 


                 QuesBons? 


            Advanced Systems Lab (ADSL) 
           University of Wisconsin‐Madison 
            hcp://www.cs.wisc.edu/adsl 

2/15/12                FAST 12                42 
Backup Slides 




2/15/12         FAST 12     43 
Running Mme of scan 
                       160 
                       140 
                       120 
                       100 
           Time (s) 




                        80 
                        60 
                        40 
                        20 
                         0 
                              1    2    4    8    16  32  64  128  256  512  1024 
                                         Total data in the file system (MB) 




2/15/12                                           FAST 12                            44 
Performance cost of stat on unverified inodes 
                                  60 
                                                                                                Total data: 128 MB 
   Time for ls system call (s) 




                                  50                                                            Total data: 256 MB 
                                                                                                Total data: 512 MB 
                                  40 




                                                                            Scan compleBon 
                                  30 

                                  20 

                                  10 

                                   0 
                                        0        70      140    210  250  280                 350      420      490 
                                                                Time (s) 



2/15/12                                                          FAST 12                                               45 
Effect of background scan on write bandwidth 
                          80 
                          70 
Write bandwidth (MB/s) 




                          60 
                          50 
                          40 
                          30 
                          20                                               Writes starBng at 20s 
                                                 Background scan every 
                          10                     30 seconds                Writes starBng at 0s 
                           0 
                                0    30    60      90    120  150  180  210  240  270  300  330 
                                                               Time (s) 




     2/15/12                                                   FAST 12                              46 
Performance of data block scan 
                             100 

                              10 
           Time taken (s) 




                                1 

                              0.1 

                             0.01 
                                     1          10            100       1000 
                                             Total data scanned (MB) 




2/15/12                                           FAST 12                       47 
Lines of code:     6765 
            
           Kernel:            2869 
           File system:        3869 




2/15/12               FAST 12          48 
Use cases 

•  NoFS provides crash‐consistency without 
   ordering 
      •   BBC can be used in convenBonal file systems to ensure 
          runBme integrity 
      •  NoFs can be used as local file system in GFS, HDFS 
•  NoFS allows virtual machines to maintain 
   consistency without trusBng lower‐layer 
   primiBves 


2/15/12                        FAST 12                            49 

More Related Content

Similar to Consistency Without Ordering

Openstack Swift - Lots of small files
Openstack Swift - Lots of small filesOpenstack Swift - Lots of small files
Openstack Swift - Lots of small files
Alexandre Lecuyer
 
Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...
Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...
Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...
Kuniyasu Suzaki
 
Oracle 12c PDB insights
Oracle 12c PDB insightsOracle 12c PDB insights
Oracle 12c PDB insights
Kirill Loifman
 
final_rac
final_racfinal_rac
final_rac
malayappan
 
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...
Kuniyasu Suzaki
 
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Alex Gorbachev
 
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...
Blackboard APAC
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
Maris Elsins
 
Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012
StampedeCon
 
Caching: A Guided Tour - 10/12/2010
Caching: A Guided Tour - 10/12/2010Caching: A Guided Tour - 10/12/2010
Caching: A Guided Tour - 10/12/2010
Jason Ragsdale
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
Takashi Hoshino
 
Presentation data domain advanced features and functions
Presentation   data domain advanced features and functionsPresentation   data domain advanced features and functions
Presentation data domain advanced features and functions
xKinAnx
 
Dueling duplications RMAN vs Delphix
Dueling duplications RMAN vs DelphixDueling duplications RMAN vs Delphix
Dueling duplications RMAN vs Delphix
Kyle Hailey
 
SharePoint Storage Best Practices
SharePoint Storage Best PracticesSharePoint Storage Best Practices
SharePoint Storage Best Practices
Mark Ginnebaugh
 
Pyramid: A large-scale array-oriented active storage system
Pyramid: A large-scale array-oriented active storage systemPyramid: A large-scale array-oriented active storage system
Pyramid: A large-scale array-oriented active storage system
Viet-Trung TRAN
 
Kscope 2013 delphix
Kscope 2013 delphixKscope 2013 delphix
Kscope 2013 delphix
Kyle Hailey
 
Veeam backup Oracle DB in a VM is easy and reliable way to protect data
Veeam backup Oracle DB in a VM is easy and reliable way to protect dataVeeam backup Oracle DB in a VM is easy and reliable way to protect data
Veeam backup Oracle DB in a VM is easy and reliable way to protect data
Aleks Y
 
Storing and distributing data
Storing and distributing dataStoring and distributing data
Storing and distributing data
Phil Cryer
 
Windows Azure platform
Windows Azure platformWindows Azure platform
Windows Azure platform
GetDev.NET
 
20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar
Ben Blaiszik
 

Similar to Consistency Without Ordering (20)

Openstack Swift - Lots of small files
Openstack Swift - Lots of small filesOpenstack Swift - Lots of small files
Openstack Swift - Lots of small files
 
Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...
Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...
Linux.Conf.AU 2009 (LCA09) Slide "OS Circular: Internet bootable OS Archive" ...
 
Oracle 12c PDB insights
Oracle 12c PDB insightsOracle 12c PDB insights
Oracle 12c PDB insights
 
final_rac
final_racfinal_rac
final_rac
 
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ...
 
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
 
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ...
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
 
Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012
 
Caching: A Guided Tour - 10/12/2010
Caching: A Guided Tour - 10/12/2010Caching: A Guided Tour - 10/12/2010
Caching: A Guided Tour - 10/12/2010
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
 
Presentation data domain advanced features and functions
Presentation   data domain advanced features and functionsPresentation   data domain advanced features and functions
Presentation data domain advanced features and functions
 
Dueling duplications RMAN vs Delphix
Dueling duplications RMAN vs DelphixDueling duplications RMAN vs Delphix
Dueling duplications RMAN vs Delphix
 
SharePoint Storage Best Practices
SharePoint Storage Best PracticesSharePoint Storage Best Practices
SharePoint Storage Best Practices
 
Pyramid: A large-scale array-oriented active storage system
Pyramid: A large-scale array-oriented active storage systemPyramid: A large-scale array-oriented active storage system
Pyramid: A large-scale array-oriented active storage system
 
Kscope 2013 delphix
Kscope 2013 delphixKscope 2013 delphix
Kscope 2013 delphix
 
Veeam backup Oracle DB in a VM is easy and reliable way to protect data
Veeam backup Oracle DB in a VM is easy and reliable way to protect dataVeeam backup Oracle DB in a VM is easy and reliable way to protect data
Veeam backup Oracle DB in a VM is easy and reliable way to protect data
 
Storing and distributing data
Storing and distributing dataStoring and distributing data
Storing and distributing data
 
Windows Azure platform
Windows Azure platformWindows Azure platform
Windows Azure platform
 
20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar
 

Consistency Without Ordering