Coerced Cache Eviction and Discreet-Mode Journaling:
Dealing with Misbehaving Disks

Abhishek Rajimwale*, Vijay Chidambaram, Deepak Ramamurthi,
Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau

*Data Domain Inc
University of Wisconsin Madison
  
Disks are not perfect
  •  Expanding disk fault model
  •  Latent Sector Errors [Bairavasundaram SIGMETRICS 07]
      –  RAID-6
  •  Block Corruption [Bairavasundaram FAST 08]
      –  Checksums
  •  The disk cache
      –  Always trusted so far
  [Figure: the disk cache sitting in front of the disk surface]
  

3/13/12                                               DSN 11                                         2
Disk	
  Caches	
  
  •  Disk	
  cache	
  improves	
  performance	
  
      –  But	
  at	
  the	
  risk	
  of	
  data	
  loss	
  
                                                                   Write	
  to	
  disk	
  
  •  Order	
  of	
  writes	
  issued	
  by	
  file	
  system:	
  
      –  A,	
  B	
  ,C	
  
  •  Disks	
  reorder	
  writes	
  during	
  destaging:	
  
      –  B,	
  A,	
  C	
                                               Disk	
  Cache	
  

  •  File	
  systems	
  flush	
  the	
  disk	
  cache	
  to	
  
                                                                              	
  
     ensure	
  correct	
  ordering	
  of	
  writes	
                 Disk	
  Surface	
  
      –  A,	
  flush,	
  B,	
  flush,	
  C	
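
The "A, flush, B, flush, C" pattern can be sketched from user space as follows. This is only an illustration, not the file system's own code path: the function name and payloads are hypothetical, and os.fsync() merely requests a flush, so whether the drive actually honours it is exactly the question this talk raises.

```python
import os

def ordered_writes(path, blocks):
    """Write blocks in order, requesting a flush between consecutive blocks.

    Returns the sequence of operations issued, e.g. write, flush, write, ...
    """
    issued = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for i, block in enumerate(blocks):
            os.write(fd, block)
            issued.append("write")
            if i < len(blocks) - 1:
                # Ordering point: ask that everything so far reach the disk
                # before the next block is written.
                os.fsync(fd)
                issued.append("flush")
    finally:
        os.close(fd)
    return issued
```

Calling ordered_writes(path, [b"A", b"B", b"C"]) issues write, flush, write, flush, write.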
  


Problem: Flushing doesn’t work
   •  Disks can fail to flush data upon request
   •  One reason: Bugs
       –  Errors in the storage stack [Bairavasundaram FAST 08]
       –  Improper propagation of error codes [Bairavasundaram FAST 08]
       –  Inadequate failure policies [Prabhakaran SOSP 05]
       –  Bugs in the firmware [Ghemawat SOSP 03]
  




Disks can lie!
       •  Misbehaving disks ignore or delay flush requests
       •  Increases risk for data loss
          –  File systems usually blamed for such loss
       [Figure: Sequential writes. Avg time (msec), 0 to 50, against write size (4k, 16k, 64k, 128k, 512k, 1m), w/ cache and w/o cache]
  

Disks can lie!
•  Evidence from industry experts
    –  Microsoft
    –  Seagate
•  From the fcntl man page in Mac OSX:
F_FULLFSYNC  Does the same thing as fsync(2) then asks the drive to flush all buffered data to
             the permanent storage device (arg is ignored). This is currently implemented on
             HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems. The operation
             may take quite a while to complete.
             Certain FireWire drives have also been known to ignore the request to flush their
             buffered data.
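
For reference, a minimal sketch of how an application might request F_FULLFSYNC. Python's fcntl module defines the constant only on macOS builds, so the helper (whose name is hypothetical) falls back to os.fsync() elsewhere. As the man page warns, even a successful call is no proof that the drive complied.

```python
import fcntl
import os

def flush_to_media(fd):
    """Request the strongest flush available; return the name of the call used."""
    full = getattr(fcntl, "F_FULLFSYNC", None)  # defined on macOS builds only
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported on this file system; fall back to fsync
    os.fsync(fd)
    return "fsync"
```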




Ordering points are essential
  •  All modern file systems depend on ordering points
      –  Journaling file systems (ext3, ext4)
             •  Data before the commit block
      –  Copy on write file systems (ZFS)
             •  Data before the uber-block
  •  If ordering points are not enforced:
      –  Data corruption
      –  Inconsistent file system
  


Summary
•  We present Coerced Cache Eviction (CCE)
    –  Write extra data into the cache to evict target blocks
•  We show how to characterize 9 SATA disk drive caches
    –  Examine the wide range of caching policies
•  We implement CCE in ext3
    –  Well known journaling file system
•  CCE provides stronger enforcement for ordering points
    –  At acceptable overheads
  



Outline
•    Motivation
•    Background
•    Coerced Cache Eviction
•    Cache Fingerprinting
•    Discreet Mode Journaling
•    Evaluation
•    Conclusion
  




File System Background
•  Consider deleting a file
    –  Removing its directory entry
    –  Freeing the space occupied by the file and its metadata
•  Journaling file system
    –  Makes sure all changes get to disk or none do
    –  Groups writes into transactions
    –  Writes everything to a log first
    –  Checkpoints to disk later
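
The all-or-nothing behaviour can be illustrated with a toy in-memory journal. This is a sketch under simplifying assumptions, not ext3's implementation: the class and its method names are hypothetical, and block numbers and contents are made up.

```python
class ToyJournal:
    """In-memory model: a log of records plus the fixed-location block store."""

    def __init__(self):
        self.log = []       # journal records, in write order
        self.disk = {}      # checkpointed blocks: block number -> data
        self.next_txn = 0

    def begin(self):
        self.next_txn += 1
        return self.next_txn

    def write(self, txn, block, data):
        # Everything goes to the log first.
        self.log.append(("data", txn, block, data))

    def commit(self, txn):
        # The commit record must follow the transaction's data in the log.
        self.log.append(("commit", txn))

    def checkpoint(self):
        """Apply only committed transactions to the fixed locations."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "data" and rec[1] in committed:
                _, _, block, data = rec
                self.disk[block] = data
        # Uncommitted records stay in the log and are never applied.
        self.log = [rec for rec in self.log if rec[1] not in committed]
```

A file deletion becomes one transaction that logs the directory block and the freed-space metadata and then commits. A transaction with no commit record (for example after a crash) leaves the fixed locations untouched.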
  



File System Background
•  Ext3 file system
    –  Semi-modern journaling file system
    –  Well known, well understood
•  Variants of journaling
    –  Data journaling mode
        •  Everything (data, metadata) goes to the log first
    –  Ordered journaling mode
        •  Only metadata is logged
  



Data Journaling

[Figure: blocks M, C, D, B in memory; the disk surface holds the journal and the fixed locations]



Data Journaling

[Figure: blocks M, C, D, B in memory; the disk cache sits between memory and the disk surface, which holds the journal and the fixed locations]



Coerced Cache Eviction
 •  Ensures that cache has been truly flushed
 •  Key idea:
     –  Extra writes to flush the disk cache
     –  Desired order of writes: A, B, C
     –  With CCE:
           •  Write A
           •  Write to flush zone
           •  Write B
           •  Write to flush zone
           •  Write C
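
The write sequence above might look roughly like this. It is a sketch only: the function name, offsets, and sizes are hypothetical, and the flush-zone writes here are laid out sequentially for brevity, whereas a real flush workload is chosen per disk by fingerprinting. Writes issued through os.pwrite() also pass through the OS page cache, so a real implementation would have to bypass or flush that layer too.

```python
import os

def cce_ordered_writes(fd, blocks, flush_zone_offset, flush_writes, flush_size):
    """Write (offset, data) blocks in order, separated by flush-zone writes.

    The extra writes are meant to push the preceding block out of the disk
    cache. Returns the list of operations issued.
    """
    ops = []
    filler = b"\0" * flush_size
    for i, (offset, data) in enumerate(blocks):
        os.pwrite(fd, data, offset)
        ops.append(("write", offset))
        if i < len(blocks) - 1:
            # Ordering point: coerce eviction instead of trusting a flush command.
            for k in range(flush_writes):
                zone_offset = flush_zone_offset + k * flush_size
                os.pwrite(fd, filler, zone_offset)
                ops.append(("flush-zone", zone_offset))
    return ops
```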
  


Coerced Cache Eviction

[Figure: blocks M, C, D, B in memory, followed by a run of flush-zone writes F; the disk cache sits above the disk surface, which holds the journal, the flush zone, and the fixed locations]



Coerced Cache Eviction
  •  Desired properties:
      –  High probability of flushing target blocks
      –  Low performance overhead
  •  Need to understand the disk cache to design the flush workload
  




Cache Fingerprinting
  •  Manufacturers don’t expose details about disk caches
  •  Disk caches can vary in:
      –  Read/Write partition size
      –  Number of segments
      –  Replacement policy
  •  Poorly characterized in literature
  [Figure: disk cache]
  


Cache Fingerprinting
  •  Flush micro-benchmark:
      –  Write target block
      –  Write varied flush workload – measure cost
          –  fsync()
      –  Read target – infer eviction
  •  Micro-benchmark is repeated
      –  Probability of eviction is calculated
  •  Vary in each workload:
      –  Number of writes
      –  Amount of data in each write
      –  Sequential/Random writes
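
The structure of the micro-benchmark can be sketched against a simulated cache. This is not the authors' tool: the real benchmark runs against a physical disk and infers eviction from reading the target back, whereas here a hypothetical slot-based cache model with a "fifo" or "random" replacement policy stands in for the drive.

```python
import random

def run_trial(capacity, n_flush_writes, policy, rng):
    """One benchmark run; returns True if the target block was evicted."""
    cache = ["target"]                        # step 1: write target block
    for i in range(n_flush_writes):           # step 2: write flush workload
        if len(cache) == capacity:
            victim = 0 if policy == "fifo" else rng.randrange(len(cache))
            cache.pop(victim)
        cache.append(("flush", i))
    return "target" not in cache              # step 3: read target, infer eviction

def eviction_probability(capacity, n_flush_writes, policy, trials=200, seed=0):
    """Repeat the trial and report the fraction of runs that evicted the target."""
    rng = random.Random(seed)
    evicted = sum(run_trial(capacity, n_flush_writes, policy, rng)
                  for _ in range(trials))
    return evicted / trials
```

Sweeping n_flush_writes (and, on a real disk, write size and sequential versus random placement) yields one cell of the fingerprint per workload.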
  
Cache Fingerprinting
 •  Eviction fingerprint
     –  Probability of eviction is visually shown
     –  Darker region indicates higher probability
 [Figure legend: Eviction Probability, in bands 90–100%, 70–90, 50–70, 30–50, 10–30, 0–10]
  
Cache Fingerprinting
 •  Performance fingerprint
     –  Time taken to write flush workload
     –  Darker region indicates more time
 [Figure legend: Flush Latency, in bands 500+ ms, 100–500, 50–100, 10–50, 0–10]
  


Cache Fingerprinting
  •  Selecting a flush workload:
      –  Combine information from both fingerprints
      –  High probability of eviction
          •  Dark region in eviction fingerprint
      –  Low performance cost
          •  Light region in performance fingerprint
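
One plausible way to combine the two fingerprints is to keep only workloads that meet an eviction-probability threshold and pick the cheapest of those. The function name, the 0.9 threshold, and the data layout below are assumptions for illustration, not taken from the talk.

```python
def select_flush_workload(fingerprints, min_eviction=0.9):
    """Return the label of the cheapest workload that evicts reliably enough.

    `fingerprints` maps a workload label to (eviction_probability, latency_ms).
    Returns None when no workload reaches `min_eviction`.
    """
    candidates = {
        name: (prob, latency)
        for name, (prob, latency) in fingerprints.items()
        if prob >= min_eviction           # dark region in eviction fingerprint
    }
    if not candidates:
        return None
    # Light region in performance fingerprint: lowest latency wins.
    return min(candidates, key=lambda name: candidates[name][1])
```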
  




Cache Fingerprinting

          Manufacturer       Cache (MB)   Capacity (GB)
          Hitachi                     8              80
          Hitachi                    32            1024
          Samsung                     8             250
          Samsung                    16             250
          Western Digital            16             320
          Western Digital            64             800
          Seagate                     8             250
          Seagate                    16             320
          Seagate                    32             750
  




Cache Fingerprinting
 Sequential writes may be ineffective at flushing
    –  Regardless of the size of the write
 A number of random writes are required
 [Figure legend: Eviction Probability, in bands 90–100%, 70–90, 50–70, 30–50, 10–30, 0–10]
  




Cache Fingerprinting
 Vertical stripes indicate that the cache is segmented
    –  Each write, regardless of size, is sent to one segment
 [Figure legend: Eviction Probability, in bands 90–100%, 70–90, 50–70, 30–50, 10–30, 0–10]
  




Cache Fingerprinting
 Cache behavior of disks from the same manufacturer is
 qualitatively similar across their different models
 [Figure legend: Eviction Probability, in bands 90–100%, 70–90, 50–70, 30–50, 10–30, 0–10]
  




Cache Fingerprinting
  •  It’s not all good news however:
      –  Some caches appear to use random replacement policies
      –  For such caches, we cannot evict blocks with 100% certainty
      –  A large number of random writes are required to get high eviction probability
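
To see why certainty is out of reach, assume an idealized model (an assumption for illustration, not a measured property of any drive): the cache is full, has `slots` equally likely victims, and each flush write evicts one of them uniformly at random. The target then survives n writes with probability (1 - 1/slots)^n, which approaches but never reaches zero.

```python
import math

def writes_needed(slots, target_probability):
    """Smallest n with 1 - (1 - 1/slots)**n >= target_probability."""
    if not 0.0 <= target_probability < 1.0:
        raise ValueError("target probability must be in [0, 1)")
    if slots == 1:
        # A single-slot cache evicts the target on the first write.
        return 0 if target_probability == 0.0 else 1
    survive_one = 1.0 - 1.0 / slots
    return math.ceil(math.log(1.0 - target_probability) / math.log(survive_one))
```

Under this model an 8-slot cache needs 18 writes for a 90% eviction probability, and a probability of exactly 100% is rejected as unreachable.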
  
      	
  




Cache Fingerprinting - Results

            Drive                    Number of writes   Total Data (MB)   Eviction Probability (%)   Time (s)
            Hitachi 8 MB                            1              2.38                        100       0.05
            Hitachi 32 MB                           1                11                        100      0.087
            Seagate 8 MB                          256                31                        100       0.87
            Seagate 16 MB                         128                17                        100      0.342
            Seagate 64 MB                         128                37                        100      0.396
            Samsung 8 MB                          128                49                       ~ 90      1.328
            Samsung 16 MB                         256               128                       ~ 90      2.872
            Western Digital 16 MB                1792                19                       ~ 90      5.107
            Western Digital 64 MB                 256                 1                        100      7.705
  




Discreet Mode Journaling
 •  Incorporating CCE into ext3
     –  Fingerprint the disk to find optimal flush workload
     –  Create flush zone with suitable size
     –  Modify ext3 to issue flush zone writes:
         •  One at each ordering point
         •  # of CCE operations = # of ordering points
 •  Can be used with any disk:
     –  As long as the disk is fingerprinted first
  

Evaluation
 •  Goal:
     –  CCE provides higher reliability
     –  At what cost? Is it practical to use?
 •  Experimental setup:
     –  File system: Ext3
     –  Disk: Hitachi 8 MB
     –  Journaling mode: Data journaling
         •  (See paper for ordered journaling results)
     –  Operating system: Linux 2.6.13, Linux 2.6.23
  


Evaluation
 •  What we compare:
    –  Regular journaling with disk cache turned off
        •  “Safe” but slow
        •  Disk might not obey command to turn off cache!
    –  Regular journaling with disk cache turned on
        •  Unsafe but fast
    –  Discreet mode journaling
        •  Midway option: safe but with cost
  



Evaluation
 •  Benchmarks:
     –  OpenSSH
         •  copy, untar, configure, make
     –  Postmark
         •  Simulates a mail server
         •  Single threaded
     –  Filebench Webserver
         •  I/O intensive
     –  Filebench Varmail
         •  Multithreaded postmark
  
Evaluation – OpenSSH
 [Figure: Data Journaling Mode. Time (s), 0 to 45, for regular w/o cache, discreet, and regular w/ cache]
  

Evaluation – Postmark
 [Figure: Data Journaling Mode. Time (s), 0 to 800, for regular w/o cache, discreet, and regular w/ cache]
  

Evaluation – Filebench Webserver
 [Figure: Data Journaling Mode. Throughput (MB/s), 0 to 300, for regular w/o cache, discreet, and regular w/ cache]
  

Evaluation – Filebench Varmail
 [Figure: Data Journaling Mode. Throughput (MB/s), 0 to 12, for regular w/o cache, discreet, and regular w/ cache]
  

Evaluation – Filebench Varmail
•  Workload writes a small amount of data and calls fsync() repeatedly
•  Each fsync() causes 3 CCEs
•  Number of optimizations:
    –  Incorporate Group Commit in varmail
         •  Improves throughput for all modes
    –  We use a few other techniques as well (see paper)
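
Group commit in general amortizes one flush over several records. The sketch below shows the idea with a hypothetical in-memory log and counter; how the authors actually incorporated it into varmail is described in the paper, not here.

```python
CCES_PER_FSYNC = 3  # figure quoted on this slide

class GroupCommitLog:
    """Buffer records and issue one flush request per group of `group_size`."""

    def __init__(self, sink, group_size):
        self.sink = sink              # list standing in for stable storage
        self.group_size = group_size
        self.pending = []
        self.flushes = 0              # stands in for the number of fsync() calls

    def append(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self):
        """Commit whatever is buffered as a single group."""
        if self.pending:
            self.sink.extend(self.pending)
            self.pending.clear()
            self.flushes += 1

    def cce_operations(self):
        return self.flushes * CCES_PER_FSYNC
```

With ten records and a group size of four, the log flushes three times (9 CCEs) instead of ten times (30 CCEs), at the price of delaying durability for records waiting in a group.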
  



Evaluation – Filebench Varmail
 [Figure: two panels, "Original performance" and "With optimizations". Throughput (MB/s), 0 to 25, for regular w/o cache, discreet, and regular w/ cache]
  
Summary
•  Coerced Cache Eviction (CCE):
    –  Run file systems reliably on top of misbehaving disks
•  Characterization of 9 SATA disk caches through fingerprints
•  Discreet Mode Journaling:
    –  Implementation of CCE for ext3 filesystem
    –  Acceptable performance on 3 workloads
            •  Only if the cache doesn’t use random replacement
    –  High overhead for apps which call fsync() frequently
  


Conclusion
•  Trust in disk is weakening:
    –  Latent Sector Errors
    –  Block corruption
    –  Cache flushing
•  Cloud computing systems:
    –  Virtualized hardware
    –  Large software stack
•  Can such hardware be trusted?
•  Will coercion be more widely used?
  

Thank you!

Advanced Systems Lab (ADSL)
University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl
  


PostgreSQL + ZFS best practicesSean Chittenden
 
19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdf19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdfJESUNPK
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data DeduplicationRedWireServices
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptxJawaharPrasad3
 

Similar to Coerced Cache Eviction: Dealing with Misbehaving Disks through Discreet-Mode Journaling (20)

Optimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDsOptimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDs
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage Challenge
 
DownloadClassSessionFile (44).pdf
DownloadClassSessionFile (44).pdfDownloadClassSessionFile (44).pdf
DownloadClassSessionFile (44).pdf
 
Magnetic disk - Krishna Geetha.ppt
Magnetic disk  - Krishna Geetha.pptMagnetic disk  - Krishna Geetha.ppt
Magnetic disk - Krishna Geetha.ppt
 
4_5837027281599466545_220608_094042.pdf
4_5837027281599466545_220608_094042.pdf4_5837027281599466545_220608_094042.pdf
4_5837027281599466545_220608_094042.pdf
 
INFLOW-2014-NVM-Compression
INFLOW-2014-NVM-CompressionINFLOW-2014-NVM-Compression
INFLOW-2014-NVM-Compression
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
 
Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012
 
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
 
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System Architecture
 
DAS RAID NAS SAN
DAS RAID NAS SANDAS RAID NAS SAN
DAS RAID NAS SAN
 
Romanticos com drbd 2
Romanticos com drbd 2Romanticos com drbd 2
Romanticos com drbd 2
 
Ext4 write barrier
Ext4 write barrierExt4 write barrier
Ext4 write barrier
 
Learn about log structured file system
Learn about log structured file systemLearn about log structured file system
Learn about log structured file system
 
PostgreSQL + ZFS best practices
PostgreSQL + ZFS best practicesPostgreSQL + ZFS best practices
PostgreSQL + ZFS best practices
 
19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdf19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdf
 
Zfs intro v2
Zfs intro v2Zfs intro v2
Zfs intro v2
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 

Coerced Cache Eviction: Dealing with Misbehaving Disks through Discreet-Mode Journaling

  • 1. Coerced Cache Eviction and Discreet-Mode Journaling: Dealing with Misbehaving Disks
      Abhishek Rajimwale*, Vijay Chidambaram, Deepak Ramamurthi, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
      *Data Domain Inc
      University of Wisconsin Madison
  • 2. Disks are not perfect
      • Expanding disk fault model
      • Latent Sector Errors [Bairavasundaram SIGMETRICS 07]
          – RAID-6
      • Block Corruption [Bairavasundaram FAST 08]
          – Checksums
      • The disk cache
          – Always trusted so far
      [Diagram: disk cache sitting above the disk surface]
      3/13/12 DSN 11
  • 3. Disk Caches
      • Disk cache improves performance
          – But at the risk of data loss
      • Order of writes issued by file system:
          – A, B, C
      • Disks reorder writes during destaging:
          – B, A, C
      • File systems flush the disk cache to ensure correct ordering of writes
          – A, flush, B, flush, C
      [Diagram: a write to disk passes through the disk cache to the disk surface]
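The flush-based ordering on this slide has an application-level analogue, sketched below in Python as an illustration rather than anything taken from the talk. The helper names `write_in_order` and `demo` are hypothetical. `os.fsync` asks the operating system to push the data to stable storage; whether the drive's own cache honors that request is exactly what the talk calls into question.

```python
import os
import tempfile


def write_in_order(path, blocks):
    """Write each block and fsync before issuing the next one (A, flush, B, flush, C)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for block in blocks:
            os.write(fd, block)
            # Request persistence before the next write is issued.
            # A misbehaving disk may still ignore or delay the flush.
            os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    """Write A, B, C in order to a temporary file and read the result back."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ordered.bin")
        write_in_order(path, [b"A", b"B", b"C"])
        with open(path, "rb") as f:
            return f.read()
```

The cost of this pattern is one flush per ordering point, which is why the speed gap between cached and uncached writes matters later in the talk.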
  • 4. Problem: Flushing doesn't work
      • Disks can fail to flush data upon request
      • One reason: Bugs
          – Errors in the storage stack [Bairavasundaram FAST 08]
          – Improper propagation of error codes [Bairavasundaram FAST 08]
          – Inadequate failure policies [Prabhakaran SOSP 05]
          – Bugs in the firmware [Ghemawat SOSP 03]
  • 5. Disks can lie!
      • Misbehaving disks ignore or delay flush requests
      • Increases risk for data loss
          – File systems usually blamed for such loss
      [Figure: sequential writes; average time (msec, 0–50) vs. write size (4k, 16k, 64k, 128k, 512k, 1m), with and without cache]
  • 6. Disks can lie!
      • Evidence from industry experts
          – Microsoft
          – Seagate
      • From the fcntl man page in Mac OS X:
          F_FULLFSYNC
          Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). This is currently implemented on HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data.
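As a hedged illustration of the man page excerpt, the sketch below requests `F_FULLFSYNC` from Python. It assumes Python's `fcntl` module exposes the `F_FULLFSYNC` constant on the running platform (it is a macOS facility). The helper name `full_sync` is hypothetical: where the constant is missing or the call is rejected, it falls back to plain `fsync` and reports that no drive-level flush was requested.

```python
import fcntl
import os


def full_sync(fd):
    """fsync the descriptor, then ask for a drive-level flush if the platform offers one.

    Returns True if F_FULLFSYNC was issued successfully, False otherwise.
    """
    os.fsync(fd)
    op = getattr(fcntl, "F_FULLFSYNC", None)
    if op is None:
        return False  # constant not exposed on this platform
    try:
        fcntl.fcntl(fd, op)
    except OSError:
        return False  # file system or device rejected the request
    return True
```

Even a True result only means the request was accepted; as the man page notes, some drives are known to ignore it.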
  • 7. Ordering points are essential
      • All modern file systems depend on ordering points
          – Journaling file systems (ext3, ext4)
              • Data before the commit block
          – Copy-on-write file systems (ZFS)
              • Data before the uber-block
      • If ordering points are not enforced:
          – Data corruption
          – Inconsistent file system
  • 8. Summary
      • We present Coerced Cache Eviction (CCE)
          – Write extra data into the cache to evict target blocks
      • We show how to characterize 9 SATA disk drive caches
          – Examine the wide range of caching policies
      • We implement CCE in ext3
          – Well-known journaling file system
      • CCE provides stronger enforcement for ordering points
          – At acceptable overheads
  • 9. Outline
      • Motivation
      • Background
      • Coerced Cache Eviction
      • Cache Fingerprinting
      • Discreet Mode Journaling
      • Evaluation
      • Conclusion
  • 10. File System Background
      • Consider deleting a file
          – Removing its directory entry
          – Freeing the space occupied by the file and its metadata
      • Journaling file system
          – Makes sure all changes get to disk or none do
          – Groups writes into transactions
          – Writes everything to a log first
          – Checkpoints to disk later
  • 11. File System Background
      • Ext3 file system
          – Semi-modern journaling file system
          – Well known, well understood
      • Variants of journaling
          – Data journaling mode
              • Everything (data, metadata) goes to the log first
          – Ordered journaling mode
              • Only metadata is logged
  • 12. Data Journaling
      [Diagram: blocks M, C, D, B in memory; disk surface with journal and fixed locations]
  • 13. Data Journaling
      [Diagram: blocks M, C, D, B in memory; disk cache; disk surface with journal and fixed locations]
  • 14. Outline
      • Motivation
      • Background
      • Coerced Cache Eviction
      • Cache Fingerprinting
      • Discreet Mode Journaling
      • Evaluation
      • Conclusion
  • 15. Coerced Cache Eviction
      • Ensures that cache has been truly flushed
      • Key idea:
          – Extra writes to flush the disk cache
          – Desired order of writes: A, B, C
          – With CCE:
              • Write A
              • Write to flush zone
              • Write B
              • Write to flush zone
              • Write C
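The write sequence on this slide can be sketched as a small scheduling helper. This is a minimal illustration, not the authors' implementation: the function name `cce_schedule` and the `"flush-zone"` label are hypothetical stand-ins for the real flush workload.

```python
def cce_schedule(writes, flush_workload="flush-zone"):
    """Interleave one flush-zone write between consecutive ordered writes."""
    schedule = []
    for i, write in enumerate(writes):
        if i > 0:
            # An ordering point sits between the previous write and this one,
            # so the cache is coerced to evict before the next write is issued.
            schedule.append(flush_workload)
        schedule.append(write)
    return schedule
```

For the slide's example, `cce_schedule(["A", "B", "C"])` yields A, flush-zone, B, flush-zone, C: one flush workload per ordering point.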
  • 16. Coerced Cache Eviction
      [Diagram: blocks M, C, D, B in memory; disk cache filled with flush blocks (F); disk surface with journal, flush zone, and fixed locations]
  • 17. Coerced Cache Eviction
      • Desired properties:
          – High probability of flushing target blocks
          – Low performance overhead
      • Need to understand the disk cache to design the flush workload
  • 18. Outline
      • Motivation
      • Background
      • Coerced Cache Eviction
      • Cache Fingerprinting
      • Discreet Mode Journaling
      • Evaluation
      • Conclusion
  • 19. Cache Fingerprinting
      • Manufacturers don't expose details about disk caches
      • Disk caches can vary in:
          – Read/Write partition size
          – Number of segments
          – Replacement policy
      • Poorly characterized in literature
  • 20. Cache Fingerprinting
      • Flush micro-benchmark:
          – Write target block
          – Write varied flush workload; measure cost
          – fsync()
          – Read target; infer eviction
      • Micro-benchmark is repeated
          – Probability of eviction is calculated
      • Vary in each workload:
          – Number of writes
          – Amount of data in each write
          – Sequential/Random writes
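The repeat-and-count structure of the micro-benchmark can be illustrated with a toy simulation. Everything below is an assumption made for illustration: the real benchmark runs against a physical drive and infers eviction from the read of the target block, whereas this sketch uses a hypothetical `SimulatedCache` with a fixed number of slots and either FIFO or random replacement.

```python
import random


class SimulatedCache:
    """Toy write cache: fixed number of block slots, FIFO or random replacement."""

    def __init__(self, slots, policy="fifo", rng=None):
        self.slots = slots
        self.policy = policy
        self.rng = rng or random.Random()
        self.blocks = []

    def write(self, block):
        if block in self.blocks:
            return
        if len(self.blocks) >= self.slots:
            if self.policy == "fifo":
                victim = 0
            else:
                victim = self.rng.randrange(len(self.blocks))
            self.blocks.pop(victim)  # the evicted block is destaged to the surface
        self.blocks.append(block)

    def holds(self, block):
        return block in self.blocks


def eviction_probability(slots, policy, flush_writes, trials=200, seed=0):
    """Repeat the micro-benchmark and return the fraction of trials that evicted the target."""
    rng = random.Random(seed)
    evicted = 0
    for _ in range(trials):
        cache = SimulatedCache(slots, policy, rng)
        cache.write("target")                # step 1: write the target block
        for i in range(flush_writes):        # step 2: write the flush workload
            cache.write(("flush", i))
        if not cache.holds("target"):        # step 3: check whether the target left the cache
            evicted += 1
    return evicted / trials
```

In this model a FIFO cache with 8 slots evicts the target with certainty once 8 flush blocks are written, while a random-replacement cache only reaches a probability strictly between 0 and 1 for the same workload.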
  • 21. Cache Fingerprinting
      • Eviction fingerprint
          – Probability of eviction is visually shown
          – Darker region indicates higher probability
      [Figure legend: eviction probability 90–100%, 70–90, 50–70, 30–50, 10–30, 0–10]
  • 22. Cache Fingerprinting
      • Performance fingerprint
          – Time taken to write flush workload
          – Darker region indicates more time
      [Figure legend: flush latency 500+ ms, 100–500, 50–100, 10–50, 0–10]
  • 23. Cache Fingerprinting
      • Selecting a flush workload:
          – Combine information from both fingerprints
          – High probability of eviction
              • Dark region in eviction fingerprint
          – Low performance cost
              • Light region in performance fingerprint
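One way to combine the two fingerprints is to keep only workloads above an eviction threshold and then take the cheapest. The sketch below is an assumption about how such a selection could be coded, not the procedure used in the talk; the function name `select_flush_workload`, the dictionary keys, and the 0.9 default threshold are all hypothetical.

```python
def select_flush_workload(candidates, min_eviction=0.9):
    """Pick the lowest-latency workload whose eviction probability meets the threshold.

    Each candidate is a dict with "eviction" (0.0 to 1.0) and "latency_ms" keys.
    Returns None when no candidate is reliable enough.
    """
    eligible = [c for c in candidates if c["eviction"] >= min_eviction]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["latency_ms"])
```

With made-up measurements such as `{"writes": 1, "eviction": 0.4, "latency_ms": 5}`, `{"writes": 128, "eviction": 1.0, "latency_ms": 60}`, and `{"writes": 256, "eviction": 1.0, "latency_ms": 120}`, the second entry is chosen: it lies in the dark region of the eviction fingerprint and the lightest eligible region of the performance fingerprint.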
  • 24. Cache Fingerprinting
      Manufacturer       Cache (MB)   Capacity (GB)
      Hitachi            8            80
      Hitachi            32           1024
      Samsung            8            250
      Samsung            16           250
      Western Digital    16           320
      Western Digital    64           800
      Seagate            8            250
      Seagate            16           320
      Seagate            32           750
  • 25. Cache Fingerprinting
      • Sequential writes may be ineffective at flushing
          – Regardless of the size of the write
      • A number of random writes are required
      [Figure: eviction fingerprint, shaded by eviction probability from 0–10% to 90–100%]
  • 26. Cache Fingerprinting
      • Vertical stripes indicate that the cache is segmented
          – Each write, regardless of size, is sent to one segment
      [Figure: eviction fingerprint, shaded by eviction probability from 0–10% to 90–100%]
  • 27. Cache Fingerprinting
      • Cache behavior of disks from the same manufacturer is qualitatively similar across their different models
      [Figure: eviction fingerprint, shaded by eviction probability from 0–10% to 90–100%]
  • 28. Cache Fingerprinting
      • It's not all good news, however:
          – Some caches appear to use random replacement policies
          – For such caches, we cannot evict blocks with 100% certainty
          – A large number of random writes are required to get high eviction probability
  • 29. Cache Fingerprinting - Results
      Drive                    Number of writes   Total Data (MB)   Eviction Probability (%)   Time (s)
      Hitachi 8 MB             1                  2.38              100                        0.05
      Hitachi 32 MB            1                  11                100                        0.087
      Seagate 8 MB             256                31                100                        0.87
      Seagate 16 MB            128                17                100                        0.342
      Seagate 64 MB            128                37                100                        0.396
      Samsung 8 MB             128                49                ~90                        1.328
      Samsung 16 MB            256                128               ~90                        2.872
      Western Digital 16 MB    1792               19                ~90                        5.107
      Western Digital 64 MB    256                1                 100                        7.705
  • 30. Outline
      • Motivation
      • Background
      • Coerced Cache Eviction
      • Cache Fingerprinting
      • Discreet Mode Journaling
      • Evaluation
      • Conclusion
  • 31. Discreet Mode Journaling
      • Incorporating CCE into ext3
          – Fingerprint the disk to find optimal flush workload
          – Create flush zone with suitable size
          – Modify ext3 to issue flush zone writes:
              • One at each ordering point
              • # of CCE operations = # of ordering points
      • Can be used with any disk:
          – As long as the disk is fingerprinted first
  • 32. Outline
      • Motivation
      • Background
      • Coerced Cache Eviction
      • Cache Fingerprinting
      • Discreet Mode Journaling
      • Evaluation
      • Conclusion
  • 33. Evaluation
      • Goal:
          – CCE provides higher reliability
          – At what cost? Is it practical to use?
      • Experimental setup:
          – File system: Ext3
          – Disk: Hitachi 8 MB
          – Journaling mode: Data journaling
              • (See paper for ordered journaling results)
          – Operating system: Linux 2.6.13, Linux 2.6.23
  • 34. Evaluation
      • What we compare:
          – Regular journaling with disk cache turned off
              • "Safe" but slow
              • Disk might not obey command to turn off cache!
          – Regular journaling with disk cache turned on
              • Unsafe but fast
          – Discreet mode journaling
              • Midway option: safe but with cost
  • 35. Evaluation
      • Benchmarks:
          – OpenSSH
              • copy, untar, configure, make
          – Postmark
              • Simulates a mail server
              • Single threaded
          – Filebench Webserver
              • I/O intensive
          – Filebench Varmail
              • Multithreaded postmark
  • 36. Evaluation: OpenSSH
      [Figure: bar chart, data journaling mode; time (s), axis 0–45; bars: regular w/o cache, discreet, regular w/ cache]
  • 37. Evaluation: Postmark
      [Figure: bar chart, data journaling mode; time (s), axis 0–800; bars: regular w/o cache, discreet, regular w/ cache]
  • 38. Evaluation: Filebench Webserver
      [Figure: bar chart, data journaling mode; throughput (MB/s), axis 0–300; bars: regular w/o cache, discreet, regular w/ cache]
  • 39. Evaluation: Filebench Varmail
      [Figure: bar chart, data journaling mode; throughput (MB/s), axis 0–12; bars: regular w/o cache, discreet, regular w/ cache]
  • 40. Evaluation: Filebench Varmail
      • Workload writes a small amount of data and calls fsync() repeatedly
      • Each fsync() causes 3 CCEs
      • Number of optimizations:
          – Incorporate Group Commit in varmail
              • Improves throughput for all modes
          – We use a few other techniques as well (see paper)
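A back-of-the-envelope sketch of why frequent fsync() calls are costly under CCE, and why group commit helps. The 0.05 s figure is the Hitachi 8 MB entry in the Time (s) column of the results table, read here as the time for one flush workload. The model itself (three flush workloads per commit, shared evenly by every fsync() in the same group) is an assumption for illustration, not a result from the talk, and the name `cce_time_per_fsync` is hypothetical.

```python
def cce_time_per_fsync(flush_time_s, cces_per_fsync=3, batch=1):
    """Flush-workload time charged to each fsync() when `batch` calls share one commit."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return cces_per_fsync * flush_time_s / batch


# Hitachi 8 MB: one flush workload takes about 0.05 s (results table).
no_grouping = cce_time_per_fsync(0.05)           # about 0.15 s per fsync()
grouped = cce_time_per_fsync(0.05, batch=5)      # about 0.03 s per fsync()
```

Under this model, batching five fsync() calls into one commit cuts the per-call flush cost by a factor of five, which is the intuition behind adding group commit to varmail.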
  • 41. Evaluation: Filebench Varmail
      [Figure: two bar charts, "Original performance" and "With optimizations"; throughput (MB/s), axis 0–25; bars: regular w/o cache, discreet, regular w/ cache]
  • 42. Summary
      • Coerced Cache Eviction (CCE):
          – Run file systems reliably on top of misbehaving disks
      • Characterization of 9 SATA disk caches through fingerprints
      • Discreet Mode Journaling:
          – Implementation of CCE for ext3 file system
          – Acceptable performance on 3 workloads
              • Only if the cache doesn't use random replacement
          – High overhead for apps which call fsync() frequently
  • 43. Conclusion
      • Trust in disk is weakening:
          – Latent Sector Errors
          – Block corruption
          – Cache flushing
      • Cloud computing systems:
          – Virtualized hardware
          – Large software stack
      • Can such hardware be trusted?
      • Will coercion be more widely used?
  • 44. Thank you!
      Advanced Systems Lab (ADSL)
      University of Wisconsin-Madison
      http://www.cs.wisc.edu/adsl