"Coerced Cache Eviction: Dealing with Misbehaving Disks through Discreet-Mode Journaling", presented at DSN 2011. For more details, see http://pages.cs.wisc.edu/~vijayc/cce.htm
The document discusses disk-based storage technologies and how they fit within the memory hierarchy. It describes the components of a disk drive, including platters, read/write heads, actuators, and electronics. Details are provided on disk geometry, with tracks divided into sectors, and the steps involved in disk access, including seek time, rotational latency, and data transfer time.
Xd planning guide - storage best practices, by Nuno Alves
This document provides guidelines for planning storage infrastructure for Citrix XenDesktop environments. It discusses organizational requirements like alignment with IT strategy and high availability needs. Technical requirements covered include performance needs like typical I/O rates and functional requirements like supported protocols. The document recommends avoiding bottlenecks, choosing appropriate RAID levels based on read/write ratios, validating storage performance, and involving storage vendors in planning.
This document discusses mass storage structures including magnetic disks, solid state disks, disk structure, disk attachment methods like host-attached storage, network-attached storage, and storage area networks. It also covers disk scheduling algorithms, disk management topics, swap space management, RAID structures, and stable storage implementation. Magnetic disks are organized into platters, tracks, cylinders, and sectors. Solid state disks use flash memory or DRAM instead of magnetic platters. Disks can be attached directly to hosts or accessed over a network. Disk scheduling algorithms aim to minimize seek times and rotational latency when servicing multiple requests. RAID and swap space management improve reliability, performance and memory management respectively.
This document provides an overview of various data storage technologies and devices used in client-server systems, including magnetic disks, tapes, CD-ROMs, WORM disks, optical disks, RAID configurations, network protection devices, power protection devices, and remote system management. It describes the basic workings and purposes of these different components that are crucial for reliable data storage and system uptime in client-server computing environments.
The document discusses various aspects of disk management in computer systems, including disk structure, disk scheduling, disk formatting, boot blocks, bad block recovery, swap space management, and the file system and I/O management in Windows 2000. Specifically, it covers topics like logical vs physical disk addressing, seek and rotational latency, improving access time through scheduling, low-level vs logical formatting, bootstrapping from disk, handling defective sectors, allocating and managing virtual memory using swap space, and the role of the kernel, virtual memory manager, and I/O manager in Windows 2000.
Windows 2000 is a 32-bit operating system designed for compatibility, reliability, and performance. It includes several key components like the kernel, executive services, and environmental subsystems. The kernel schedules threads and handles exceptions/interrupts. Executive services include the object manager, virtual memory manager, process manager, and I/O manager. Environmental subsystems allow running applications from other operating systems. The document also discusses disk structure, file systems, networking, and other OS concepts.
The document discusses physical storage media used in database systems, including their characteristics and performance measures. It describes the storage hierarchy from fastest volatile cache and main memory to slower non-volatile secondary storage like magnetic disks and tertiary storage like tape. It focuses on magnetic disks, explaining their mechanical components and performance optimization techniques like disk scheduling algorithms and file organization to minimize disk arm movement.
Chapter 12 discusses mass storage systems and their role in operating systems. It describes the physical structure of disks and tapes and how they are accessed. Disks are organized into logical blocks that are mapped to physical sectors. Disks connect to computers via I/O buses and controllers. RAID systems improve reliability through redundancy across multiple disks. Operating systems provide services for disk scheduling, management, and swap space. Tertiary storage uses tape drives and removable disks to archive less frequently used data in large installations.
This document discusses physical storage in database systems. It describes different types of storage media like cache, main memory, magnetic disks, flash memory, optical storage, and tape storage. It explains the storage hierarchy and performance measures of disks. The document also covers disk organization, file organization, optimization of disk access, RAID systems, and how redundancy improves reliability.
The document discusses mass storage systems, including disk structure, disk scheduling algorithms, disk management, RAID structure, disk attachment methods, stable storage implementation, and tertiary storage devices. It provides details on disk formatting, swap space management, different RAID levels, network attached storage, stable storage implementation, removable media like tapes and optical disks, operating system issues, and hierarchical storage management.
This document defines key concepts related to virtualization and disk performance. It discusses how virtualization can compound disk fragmentation issues and slow performance. The disk subsystem is identified as the main performance bottleneck for virtualized environments due to the additional processing layers from guest to host systems. Best practices for improving disk performance in virtualized servers include advanced defragmentation of host and guest systems, adjusting filesystem settings, separating disks and partitions, and using high-performance storage.
The document provides information about I/O systems and a case study, including details about disk structure, disk scheduling algorithms, disk management techniques, direct memory access, swap space management, RAID structure, disk attachment methods, and features of the Windows 2000 and MS-DOS operating systems. Key points covered include how disks are addressed as logical blocks, techniques for minimizing seek time and maximizing disk bandwidth, common disk scheduling algorithms like SSTF and SCAN, and how swap space is allocated and managed in different operating systems.
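The SSTF and SCAN algorithms named above can be sketched in a few lines. This is a schematic illustration, not code from the document; the request queue and starting head position are the classic textbook example values.

```python
# Sketch of two common disk-scheduling policies.
# Cylinder numbers and head position are illustrative example values.

def sstf(head, requests):
    """Shortest Seek Time First: always service the nearest pending cylinder."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

def scan(head, requests):
    """SCAN (elevator): sweep toward higher cylinders, then reverse."""
    up = sorted(c for c in requests if c >= head)
    down = sorted((c for c in requests if c < head), reverse=True)
    return up + down

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(sstf(53, queue))   # services nearest cylinder first
print(scan(53, queue))   # one sweep up, then back down
```

Note how SSTF minimizes each individual seek but can starve far-away requests, while SCAN bounds the wait by sweeping the whole surface.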
This document discusses mass storage systems. It begins with an overview of disk structure, including details on disk performance characteristics like seek time and rotational latency. It then covers topics like disk scheduling algorithms, disk management in operating systems, swap space management, RAID structures, and implementing stable storage. RAID levels like mirroring and striping with parity are explained. The document provides information on technologies like solid-state disks, magnetic tape, storage arrays, and network-attached storage.
This document discusses various topics related to disk management in computer systems. It covers disk structure, disk scheduling, disk formatting, boot blocks, bad block recovery, swap space management, and features of the Windows 2000 operating system. The key points are:
- Disks are addressed as large arrays of logical blocks, typically 512 bytes each.
- Disk scheduling aims to optimize seek time, rotational latency, and bandwidth for efficient data transfer.
- The operating system handles disk initialization, partitioning, logical formatting, and recovery of bad blocks.
- Swap space is used as an extension of main memory and can be located in the file system or a separate partition.
- Windows 2000 is a 32-bit operating system designed for compatibility, reliability, and performance.
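The first bullet, that disks are addressed as large arrays of 512-byte logical blocks, can be made concrete with a small sketch. The drive geometry below is a made-up example; real drives hide geometry behind the logical block address (LBA).

```python
# Illustrative mapping from a logical block address to a byte offset and
# to cylinder/head/sector coordinates. Geometry values are hypothetical.

BLOCK_SIZE = 512          # bytes per logical block, as in the text
HEADS, SECTORS = 16, 63   # example geometry, not from the document

def lba_to_offset(lba):
    """Byte offset of a logical block on the raw device."""
    return lba * BLOCK_SIZE

def lba_to_chs(lba):
    """Decompose an LBA into (cylinder, head, sector); sectors start at 1."""
    cylinder, rem = divmod(lba, HEADS * SECTORS)
    head, sector = divmod(rem, SECTORS)
    return cylinder, head, sector + 1

print(lba_to_offset(100))   # 51200
print(lba_to_chs(100))
```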
The document discusses various mass storage devices used in computers. It provides details on floppy disks, hard disks, CDs, DVDs, and USB drives. For floppy disks, it describes the parts of the floppy disk and drive, different sizes of floppies, and how data is written to a floppy disk. For hard disks, it explains the components, how data is read and written, low-level formatting, partitioning, and high-level formatting. It also lists characteristics of different hard disk families used in PCs such as capacity, reliability, and transfer rates.
This presentation gives an overview of physical storage technologies and the various ways of accessing storage on a computer or a server. Presented at School of Engineering and Applied Science, Ahmedabad University as a part of Software Engineering course.
This document discusses mass storage systems and disk structure. It covers topics such as disk formatting, mapping logical blocks to physical sectors, disk attachment methods like SCSI and Fibre Channel, and disk scheduling algorithms. It also summarizes disk management techniques including partitioning, file systems, and swap space. Additional sections cover RAID configurations, stable storage implementation, and snapshot and replication features.
This document discusses secure data storage mechanisms like RAID and LVM. It provides an overview of RAID, including its history and common types like RAID 0, 1, 5 and 6. LVM is introduced as a way to manage logical volumes more flexibly than traditional partitioning. Advantages of RAID and LVM include data redundancy and flexibility to change storage allocation. Disadvantages include increased storage needs and complexity of management.
The document discusses Solaris memory management. It describes Solaris' memory architecture including backing store, virtual memory system, and process memory allocation. It then discusses Solaris' memory management techniques, including swapping and demand paging. Demand paging loads pages of memory on demand to lower memory footprint and startup time, while swapping is used as a last resort. Memory is shared between processes and protected via virtual memory and page protections.
This document discusses mass storage systems and their management by operating systems. It covers disk structure, disk scheduling algorithms, disk management including partitioning and file systems, swap space management, RAID configurations, and implementing stable storage. The objectives are to describe mass storage devices, explain their performance characteristics, evaluate disk scheduling, and discuss operating system services for storage like RAID.
This document discusses physical storage media and file organization in a database system. It describes different types of storage media like magnetic disks, flash memory, and tape storage. It explains the hierarchy of storage from fastest but volatile primary storage to slower but non-volatile secondary and tertiary storage. The document also discusses techniques for improving performance and reliability of disk storage, including RAID (Redundant Arrays of Independent Disks) and how it uses data striping and redundancy across multiple disks to provide improved I/O performance and fault tolerance. It outlines several RAID levels that trade off performance, reliability, and cost in different ways.
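The data striping mentioned above distributes consecutive chunks of a file across disks round-robin so they can be read in parallel. A minimal RAID-0-style sketch, with arbitrary example chunk size and disk count:

```python
# RAID-0-style striping: consecutive fixed-size chunks go to successive
# disks round-robin. Chunk size and disk count are example values.

def stripe(data, n_disks, chunk=4):
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks].extend(data[i:i + chunk])
    return [bytes(d) for d in disks]

# 16 bytes over 3 disks: the 4th chunk wraps back to disk 0.
print(stripe(b"ABCDEFGHIJKL", 3))
print(stripe(b"ABCDEFGHIJKLMNOP", 3))
```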
This document discusses NVMFS, a new file system designed to take advantage of nonvolatile memory. NVMFS is POSIX compliant and leverages the functionality of an underlying flash translation layer. It exposes NVM primitives through the standard system file interface. The document describes how NVMFS, combined with atomic writes and NVM compression in MySQL, can improve performance and endurance for databases on flash storage. Performance tests showed improvements in throughput and latency compared to conventional configurations.
1. Introduction
2. OS Structures
3. Processes
4. Threads
5. CPU Scheduling
6. Process Synchronization
7. Deadlocks
8. Memory Management
9. Virtual Memory
10. File System Interface
11. File System Implementation
12. Mass Storage Systems
13. I/O Systems
14. Protection
15. Security
16. Distributed System Structures
17. Distributed File Systems
18. Distributed Coordination
19. Real-Time Systems
20. Multimedia Systems
21. Linux
22. Windows
This document provides an overview of various database implementation techniques, including RAID, file organization, indexing, and query processing. It describes the different RAID levels for improving reliability and performance of disk storage. RAID levels use disk striping and redundancy such as mirroring or parity to provide fault tolerance. The document also discusses file organization techniques for fixed and variable length records, including using a free list or slotted pages. Indexing methods like B+ trees and hashing are introduced for efficient retrieval of records from files.
This document provides an overview of physical storage media and file organization concepts for databases. It discusses various storage media like magnetic disks, flash memory, tape storage and their characteristics. The document introduces the concept of storage hierarchy with primary, secondary and tertiary storage. It describes magnetic disks in detail and optimization techniques for disk access like RAID and file organization. RAID levels 1-4 are summarized with their performance and reliability tradeoffs.
The document discusses cache design and organization. It describes how caches work, sitting between the CPU and main memory to provide fast access to frequently used data. The key aspects covered include cache size, block size, mapping techniques, replacement algorithms, write policies, and the evolution of cache hierarchies in processors like the Pentium IV with multiple levels of on-chip and off-chip caches.
The document summarizes key characteristics of cache memory including location, capacity, unit of transfer, access methods, performance, physical types, organization, and hierarchy. It discusses cache memory in terms of where it is located (internal or external to the CPU), its typical sizes (word, block), access techniques (sequential, random, associative), performance metrics (access time, transfer rate), common physical implementations (SRAM, disk), and organizational aspects like mapping functions, replacement algorithms, and write policies. A cache sits between the CPU and main memory, using fast but small memory to speed up access to frequently used data from larger but slower main memory.
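The mapping functions mentioned in the cache summaries above boil down to slicing an address into tag, index, and offset fields. The sketch below shows this for a direct-mapped cache; the 64-byte block size and 256-line cache are example parameters, not taken from the document.

```python
# Direct-mapped cache address decomposition. Parameters are illustrative.

BLOCK_BITS = 6    # 64-byte blocks -> low 6 bits are the byte offset
INDEX_BITS = 8    # 256 cache lines -> next 8 bits select the line

def split_address(addr):
    """Return (tag, index, offset) for a direct-mapped cache lookup."""
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x1234ABCD))
```

A hit occurs when the line selected by `index` holds a valid block whose stored tag equals `tag`; set-associative caches use the same split but check several lines per index.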
This document discusses the key characteristics of computer memory, including location, capacity, unit of transfer, access methods, performance, physical type, physical characteristics, and organization. It covers different types of memory like CPU registers, main memory, cache, disk, and tape. The different access methods like sequential, direct, random, and associative access are explained. The memory hierarchy and performance aspects like access time, memory cycle time, and transfer rate are defined. Factors like cache size, mapping function, replacement algorithm, write policy, block size that impact cache performance are also summarized.
Cache is a small amount of fast memory located close to the CPU that stores frequently accessed instructions and data. It speeds up processing by allowing the CPU to access needed information more quickly than from main memory. Caches exploit the principle of locality of reference, where programs tend to access the same data/instructions repeatedly over short periods. There are multiple cache levels, with L1 cache being fastest but smallest and L3 cache being largest but slower. Caching improves performance dramatically by fulfilling over 90% of memory requests from the small cache rather than requiring slower access to main memory.
RocksDB is an embedded key-value store that is optimized for fast storage. It uses a log-structured merge-tree to organize data on storage. Optimizing RocksDB for open-channel SSDs would allow controlling data placement to exploit flash parallelism and minimize overhead. This could be done by mapping RocksDB files like SSTables and logs to virtual blocks that map to physical flash blocks in a way that considers data access patterns and flash characteristics. This would improve performance by reducing writes and garbage collection.
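The log-structured merge idea behind RocksDB can be sketched in miniature: writes land in an in-memory memtable, which is flushed as a sorted immutable run when full; reads check the memtable first, then runs newest-first. This is a schematic toy, not the RocksDB API, and the two-entry flush threshold is an arbitrary example.

```python
# Toy log-structured merge store. Schematic only; real LSM engines
# persist runs to disk and compact them in the background.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):      # newest run shadows older ones
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))
```

Because runs are immutable and sequential, all writes to storage are large and append-only, which is what makes the structure a good fit for flash and for open-channel data placement.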
Facebook's Approach to Big Data Storage Challenge, DataWorks Summit
Facebook's data warehouse cluster stores more than 100 PB of data, with 500+ terabytes entering the clusters every day. To meet the capacity requirements of future data growth, storing data cost-effectively has become a top priority for the Facebook data infrastructure team. This talk presents several solutions used to reduce the warehouse cluster's data footprint: (1) smart retention: history-based Hive table retention control; (2) increasing the RCFile compression ratio through clever sorting; (3) HDFS file-level raiding to reduce the default replication factor of 3 to a lower ratio; (4) attacking the small-file raiding problem through directory-level raiding and raid-aware compaction.
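The payoff of item (3) is easy to estimate. The 100 PB figure comes from the talk summary above; the 1.4x erasure-coded overhead is a typical Reed-Solomon figure (e.g. RS(10,4)) used here purely as an assumption, not a number from the talk.

```python
# Back-of-the-envelope raw-capacity comparison: 3-way replication vs
# parity-based raiding. The 1.4x overhead is an assumed RS-style figure.

logical_pb = 100                     # logical data, per the talk summary
raw_replicated = logical_pb * 3      # HDFS default: 3 full copies
raw_raided = logical_pb * 1.4        # assumed erasure-coded overhead

print(raw_replicated, raw_raided)    # raw PB needed under each scheme
```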
This document summarizes key concepts about physical storage systems from the textbook "Database System Concepts, 7th Ed." by Silberschatz, Korth and Sudarshan. It describes the storage hierarchy from fastest volatile primary storage (e.g. cache, main memory) to slower non-volatile secondary storage (e.g. magnetic disks, flash storage) to slowest tertiary storage (e.g. magnetic tapes). It also discusses various storage media like magnetic disks, flash storage, SSDs and RAID arrays, covering their mechanisms, performance and reliability through redundancy.
1. Magnetic disks are the primary storage medium for databases due to their large storage capacity and reliability. Disks store data in circular tracks divided into sectors, with read/write heads positioning over tracks to access data.
2. RAID (Redundant Arrays of Independent Disks) organizes multiple disks for improved performance, capacity, and reliability. Techniques like mirroring duplicate data across disks for fault tolerance, while striping distributes data across disks to enable parallel access.
3. Database designers must choose an appropriate RAID level based on factors like update frequency, capacity needs, and performance requirements to optimize the physical storage structure.
This document summarizes different types of physical storage media and RAID levels. It discusses volatile primary storage like cache and main memory, and non-volatile secondary storage like magnetic disks and tapes. Tertiary storage includes slower media like magnetic tapes. RAID levels provide data redundancy across multiple disks for reliability or performance gains, with tradeoffs in cost. Common RAID levels include RAID 0 for striping without parity, RAID 1 for mirroring, and RAID 5 for block-interleaved distributed parity. Flash storage like SSDs provide faster access than HDDs but have limitations on write endurance.
This document provides an overview of NVM compression, a hybrid flash-aware application level compression solution. It discusses the drawbacks of existing row-level compression in MySQL and outlines an architecture for NVM compression that avoids these drawbacks. Key aspects of the NVM compression approach include performing compression only during flush, using sparse addressing to avoid over-provisioning flash space, and adding a new multi-threaded flush framework. Evaluation results and building blocks of the solution are also briefly mentioned.
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3 – Hsien-Hsin Sean Lee, Ph.D.
This document discusses DRAM and storage systems. It begins by describing the basic DRAM cell and how DRAM is organized into banks, rows, and columns. It then covers DRAM operation including refreshing and different DRAM standards. The document also discusses disk organization with platters, tracks, and sectors. It provides details on disk access times and reliability techniques like RAID levels 0 through 6 which use data mirroring, striping, and error correction codes.
At StampedeCon 2012 in St. Louis, Pritam Damania presents: Reliable backup and recovery is one of the main requirements for any enterprise grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such, they are looking for backup solutions that are reliable, easy to use, and can co-exist with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve performance of backup and recovery process with minimal impact to production cluster.
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude & Benedict Reu... – NETWAYS
ZFS is the next-generation filesystem originally developed at Sun Microsystems. Available under the CDDL, it uniquely combines a volume manager and a filesystem into a powerful storage management solution for Unix systems, regardless of big or small storage requirements. ZFS offers features, for free, that are usually found only in costly enterprise storage solutions. This talk will introduce ZFS and give an overview of its features such as snapshots and rollback, compression, deduplication, and replication. We will demonstrate how these features can make a difference in the datacenter, giving administrators the power and flexibility to adapt to changing storage requirements.
Real world examples of ZFS being used in production for video streaming, virtualization, archival, and research are shown to illustrate the concepts. The talk is intended for people considering ZFS for their data storage needs and those who are interested in the features ZFS provides.
Deterministic Memory Abstraction and Supporting Multicore System Architecture – Heechul Yun
Presentation slides of the following paper at ECRTS'18.
Farzad Farshchi, Prathap Kumar Valsan, Renato Mancuso, Heechul Yun. "Deterministic Memory Abstraction and Supporting Multicore System Architecture." Euromicro Conference on Real-Time Systems (ECRTS), 2018
This document provides an overview of various data storage technologies including RAID, DAS, NAS, and SAN. It discusses RAID levels like RAID 0, 1, 5 which provide data striping and redundancy. Direct attached storage (DAS) connects directly to servers but cannot be shared, while network attached storage (NAS) uses file sharing protocols over IP networks. Storage area networks (SAN) use dedicated storage networks like Fibre Channel and iSCSI to provide block-level access to consolidated storage. The key is choosing the right solution based on capacity, performance, scalability, availability, data protection needs, and budget.
DRBD is a block device designed to mirror a block device across a network for high availability clustering. It can be understood as network-based RAID1. To set up DRBD, partitions must be prepared, configuration files created, and DRBD started to begin synchronization. Problems with DRBD can occur due to network errors disconnecting nodes, disk errors on mirrored devices, or role changes without synchronization. These issues are resolved by fixing the underlying problem and reattaching DRBD devices.
The document discusses Ext4 journaling and the write barrier feature. It notes that the write barrier forces a flush-to-disk call after writing the journal to ensure consistency. However, this can cause sluggishness when storage is full during OTA updates. Disabling the write barrier allows reordering of cache-to-disk writes, reducing latency and improving performance, though it introduces a small risk of filesystem corruption in the event of a power failure. Tests showed disabling the barrier reduced fsync latency and improved SQLite transactions per second on HDD and EMMC storage.
The document provides an overview of log structured file systems. It discusses how log structured file systems work by writing all data and metadata sequentially to a circular buffer called a log to improve write performance. It also describes how log structured file systems address issues like limited disk space through garbage collection and provide simpler crash recovery without requiring a file system check.
ZFS provides several advantages over traditional block-based filesystems when used with PostgreSQL, including preventing bitrot, improved compression ratios, and write locality. ZFS uses copy-on-write and transactional semantics to ensure data integrity and allow for snapshots and clones. Proper configuration such as enabling compression and using ZFS features like intent logging can optimize performance when used with PostgreSQL's workloads.
The document discusses various physical storage media used in computers including cache, main memory, flash memory, magnetic disks, optical disks, and magnetic tapes. It classifies storage based on characteristics like speed of access, cost, and reliability. RAID systems are described which provide storage virtualization through techniques like mirroring and striping across disks to improve performance and reliability. Different RAID levels are outlined including RAID 0, 1, 2, 3, 4, 5, and 6.
ZFS is a combined filesystem, volume manager, and RAID controller that provides immense storage capacity, simplifies administration, and ensures data integrity. It uses copy-on-write to prevent data corruption and supports features like snapshots, clones, replication, compression, and sharing data over NAS and SAN protocols. ZFS organizes storage into pools composed of virtual devices that provide fault tolerance and high performance.
Data deduplication is a hot topic in storage and saves significant disk space in many environments, with some trade-offs. We’ll discuss what deduplication is and where the open-source solutions stand versus commercial offerings. The presentation leans towards the practical, so attendees can use it in their real-world projects (what works, what doesn’t, whether you should use it in production, etcetera).
This document summarizes key concepts about storage devices including hard drives, RAID, and SSDs. It discusses the hardware components of hard drives including platters, read/write heads, and interfaces. It describes the different types of delays that occur with hard drives including rotational, seek, and transfer times. The document then covers RAID levels 0, 1, and their performance characteristics for sequential and random access workloads. It discusses challenges with ensuring consistent updates across mirrored drives in RAID 1 configurations.
Similar to Coerced Cache Eviction: Dealing with Misbehaving Disks through Discreet-Mode Journaling (20)
Coerced Cache Eviction: Dealing with Misbehaving Disks through Discreet-Mode Journaling
1. Coerced Cache Eviction and Discreet-Mode Journaling: Dealing with Misbehaving Disks
Abhishek Rajimwale*, Vijay Chidambaram, Deepak Ramamurthi
Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
*Data Domain Inc
University of Wisconsin-Madison
2. Disks are not perfect
• Expanding disk fault model
• Latent Sector Errors [Bairavasundaram SIGMETRICS 07]
– RAID-6
• Block Corruption [Bairavasundaram FAST 08]
– Checksums
• The disk cache
– Always trusted so far
(Figure: the disk cache sitting between the host and the disk surface)
3/13/12 DSN 11 2
3. Disk Caches
• Disk cache improves performance
– But at the risk of data loss
• Order of writes issued by the file system:
– A, B, C
• Disks reorder writes during destaging:
– B, A, C
• File systems flush the disk cache to ensure correct ordering of writes
– A, flush, B, flush, C
(Figure: writes to disk passing through the disk cache to the disk surface)
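The A, flush, B, flush, C discipline on this slide can be sketched with POSIX calls; a minimal illustration (the file path is arbitrary, and whether the flush actually reaches the platter is exactly the question this talk raises):

```python
import os
import tempfile

def ordered_writes(path, blocks):
    """Issue blocks in order with a cache flush between each, so the drive
    cannot reorder them during destaging: A, flush, B, flush, C."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for block in blocks:
            os.write(fd, block)
            os.fsync(fd)  # kernel flush; with barriers on, also a drive-cache flush
    finally:
        os.close(fd)

path = os.path.join(tempfile.gettempdir(), "journal.img")
ordered_writes(path, [b"A" * 512, b"B" * 512, b"C" * 512])
```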
4. Problem: Flushing doesn’t work
• Disks can fail to flush data upon request
• One reason: Bugs
– Errors in the storage stack [Bairavasundaram FAST 08]
– Improper propagation of error codes [Bairavasundaram FAST 08]
– Inadequate failure policies [Prabhakaran SOSP 05]
– Bugs in the firmware [Ghemawat SOSP 03]
5. Disks can lie!
• Misbehaving disks ignore or delay flush requests
• Increases the risk of data loss
– File systems are usually blamed for such loss
(Figure: average time in msec for sequential writes, with and without the disk cache, for write sizes from 4k to 1m)
6. Disks can lie!
• Evidence from industry experts
– Microsoft
– Seagate
• From the fcntl man page in Mac OS X:
F_FULLFSYNC
Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). This is currently implemented on HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data.
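An application can request the stronger flush described on this slide; a hedged sketch, where the fall-back-to-fsync policy is an illustrative assumption rather than anything the man page prescribes:

```python
import fcntl
import os
import tempfile

def full_fsync(fd):
    """Ask for the strongest flush available. On Mac OS X, F_FULLFSYNC tells
    the drive itself to empty its cache; on other systems, or if the drive
    refuses (as some FireWire drives reportedly do), fall back to fsync()."""
    op = getattr(fcntl, "F_FULLFSYNC", None)
    if op is not None:
        try:
            fcntl.fcntl(fd, op)
            return "F_FULLFSYNC"
        except OSError:
            pass  # drive or filesystem refused the full flush
    os.fsync(fd)
    return "fsync"

path = os.path.join(tempfile.gettempdir(), "durable.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"commit record")
used = full_fsync(fd)
os.close(fd)
```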
7. Ordering points are essential
• All modern file systems depend on ordering points
– Journaling file systems (ext3, ext4)
• Data before the commit block
– Copy-on-write file systems (ZFS)
• Data before the uber-block
• If ordering points are not enforced:
– Data corruption
– Inconsistent file system
8. Summary
• We present Coerced Cache Eviction (CCE)
– Write extra data into the cache to evict target blocks
• We show how to characterize 9 SATA disk drive caches
– Examine the wide range of caching policies
• We implement CCE in ext3
– Well-known journaling file system
• CCE provides stronger enforcement for ordering points
– At acceptable overheads
10. File System Background
• Consider deleting a file
– Removing its directory entry
– Freeing the space occupied by the file and its metadata
• Journaling file system
– Makes sure all changes get to disk or none do
– Groups writes into transactions
– Writes everything to a log first
– Checkpoints to disk later
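The log-then-checkpoint protocol above can be sketched as a toy data-journaling commit; this is not ext3's on-disk format, just the shape of the idea, with hypothetical file descriptors and a made-up "COMMIT" record:

```python
import os
import tempfile

def commit_transaction(journal_fd, home_fd, updates):
    """Toy data-journaling commit: log every block, flush, write a commit
    record, flush, then checkpoint the blocks to their home locations.
    After a crash, a complete log entry can be replayed; an incomplete
    one (no commit record) is discarded, so all-or-nothing holds."""
    log_off = 0
    for _, data in updates:                       # 1. write blocks to the log
        os.pwrite(journal_fd, data, log_off)
        log_off += len(data)
    os.fsync(journal_fd)                          # ordering point
    os.pwrite(journal_fd, b"COMMIT", log_off)     # 2. commit record
    os.fsync(journal_fd)                          # ordering point
    for home_off, data in updates:                # 3. checkpoint to home
        os.pwrite(home_fd, data, home_off)
    os.fsync(home_fd)

jfd = os.open(os.path.join(tempfile.gettempdir(), "toy.journal"),
              os.O_RDWR | os.O_CREAT, 0o644)
hfd = os.open(os.path.join(tempfile.gettempdir(), "toy.home"),
              os.O_RDWR | os.O_CREAT, 0o644)
commit_transaction(jfd, hfd, [(0, b"D" * 512), (4096, b"M" * 512)])
```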
11. File System Background
• Ext3 file system
– Semi-modern journaling file system
– Well known, well understood
• Variants of journaling
– Data journaling mode
• Everything (data, metadata) goes to the log first
– Ordered journaling mode
• Only metadata is logged
15. Coerced Cache Eviction
• Ensures that the cache has been truly flushed
• Key idea:
– Extra writes to flush the disk cache
– Desired order of writes: A, B, C
– With CCE:
• Write A
• Write to flush zone
• Write B
• Write to flush zone
• Write C
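The interleaving above can be sketched directly; a minimal illustration in which the flush-zone offset and write count are placeholder assumptions (a real flush workload comes from the fingerprinting described later in the talk):

```python
import os
import tempfile

BLOCK = 4096
# Hypothetical layout: the flush zone is a reserved run of blocks written
# only to displace cached data.
FLUSH_ZONE_OFFSET = 256 * BLOCK
FLUSH_WRITES = 32

def coerce_eviction(fd):
    """Write enough extra data that the drive cache must destage earlier
    blocks, even if the drive ignores explicit flush commands."""
    for i in range(FLUSH_WRITES):
        os.pwrite(fd, b"\0" * BLOCK, FLUSH_ZONE_OFFSET + i * BLOCK)

def ordered_writes_with_cce(fd, writes):
    """Desired order A, B, C becomes A, flush-zone writes, B, flush-zone
    writes, C: one CCE between every pair of ordered writes."""
    for i, (off, data) in enumerate(writes):
        os.pwrite(fd, data, off)
        if i < len(writes) - 1:
            coerce_eviction(fd)

fd = os.open(os.path.join(tempfile.gettempdir(), "cce.img"),
             os.O_RDWR | os.O_CREAT, 0o644)
ordered_writes_with_cce(fd, [(0, b"A" * BLOCK), (BLOCK, b"B" * BLOCK),
                             (2 * BLOCK, b"C" * BLOCK)])
```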
16. Coerced Cache Eviction
(Figure: blocks C, D, B in memory M, with the disk cache filled by flush-zone blocks F; the journal and the flush zone occupy fixed locations on the disk surface)
17. Coerced Cache Eviction
• Desired properties:
– High probability of flushing target blocks
– Low performance overhead
• Need to understand the disk cache to design the flush workload
19. Cache Fingerprinting
• Manufacturers don’t expose details about disk caches
• Disk caches can vary in:
– Read/write partition size
– Number of segments
– Replacement policy
• Poorly characterized in the literature
20. Cache Fingerprinting
• Flush micro-benchmark:
– Write target block
– Write varied flush workload – measure cost
– fsync()
– Read target – infer eviction
• Micro-benchmark is repeated
– Probability of eviction is calculated
• Vary in each workload:
– Number of writes
– Amount of data in each write
– Sequential/random writes
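One trial of the micro-benchmark might look like the sketch below. The 5 ms threshold is a placeholder, and a real tool would open the raw disk with O_DIRECT so the OS page cache does not mask the drive cache; on a plain file this only exercises the loop structure:

```python
import os
import tempfile
import time

def flush_fingerprint_trial(fd, target_off, workload, threshold_s=0.005):
    """Write the target block, write a candidate flush workload (timing it,
    for the performance fingerprint), fsync, then time a read of the target.
    A slow read suggests the block left the drive cache and came from the
    platter (the eviction fingerprint)."""
    os.pwrite(fd, b"\xaa" * 4096, target_off)
    start = time.monotonic()
    for off, data in workload:
        os.pwrite(fd, data, off)
    workload_cost = time.monotonic() - start
    os.fsync(fd)
    start = time.monotonic()
    os.pread(fd, 4096, target_off)
    evicted = (time.monotonic() - start) > threshold_s
    return evicted, workload_cost

fd = os.open(os.path.join(tempfile.gettempdir(), "fingerprint.img"),
             os.O_RDWR | os.O_CREAT, 0o644)
workload = [(8192 + i * 4096, b"f" * 4096) for i in range(4)]
evicted, cost = flush_fingerprint_trial(fd, 0, workload)
os.close(fd)
```

Repeating the trial many times per (write count, write size, sequential/random) setting yields the eviction probability plotted in the fingerprints.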
21. Cache Fingerprinting
• Eviction fingerprint
– Probability of eviction is shown visually
– Darker region indicates higher probability
(Legend: eviction probability bands 0–10, 10–30, 30–50, 50–70, 70–90, 90–100%)
22. Cache Fingerprinting
• Performance fingerprint
– Time taken to write the flush workload
– Darker region indicates more time
(Legend: flush latency bands 0–10, 10–50, 50–100, 100–500, 500+ ms)
23. Cache Fingerprinting
• Selecting a flush workload:
– Combine information from both fingerprints
– High probability of eviction
• Dark region in the eviction fingerprint
– Low performance cost
• Light region in the performance fingerprint
25. Cache Fingerprinting
• Sequential writes may be ineffective at flushing
– Regardless of the size of the write
• A number of random writes are required
(Figure: eviction fingerprint; legend as on slide 21)
26. Cache Fingerprinting
• Vertical stripes indicate that the cache is segmented
– Each write, regardless of size, is sent to one segment
(Figure: eviction fingerprint; legend as on slide 21)
27. Cache Fingerprinting
• Cache behavior of disks from the same manufacturer is qualitatively similar across their different models
(Figure: eviction fingerprint; legend as on slide 21)
28. Cache Fingerprinting
• It’s not all good news, however:
– Some caches appear to use random replacement policies
– For such caches, we cannot evict blocks with 100% certainty
– A large number of random writes are required to get high eviction probability
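The last point has a simple probabilistic reading: if each write evicts a victim chosen uniformly at random among the cache's slots, the target survives each write with probability (slots − 1)/slots, so certainty is never reached and only many writes drive the eviction probability high. A quick check (the 2048-slot count is an assumption for illustration, not a measured cache size):

```python
def eviction_probability(slots, writes):
    """P(target evicted) after `writes` insertions into a cache that picks
    its victim uniformly at random among `slots` lines."""
    return 1.0 - ((slots - 1) / slots) ** writes

# One write into a 2048-slot cache almost never evicts the target;
# a few thousand random writes make eviction likely but never certain.
p_one = eviction_probability(2048, 1)
p_many = eviction_probability(2048, 4096)
```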
29. Cache Fingerprinting – Results

Drive                  | Number of writes | Total data (MB) | Eviction probability (%) | Time (s)
Hitachi 8 MB           | 1                | 2.38            | 100                      | 0.05
Hitachi 32 MB          | 1                | 11              | 100                      | 0.087
Seagate 8 MB           | 256              | 31              | 100                      | 0.87
Seagate 16 MB          | 128              | 17              | 100                      | 0.342
Seagate 64 MB          | 128              | 37              | 100                      | 0.396
Samsung 8 MB           | 128              | 49              | ~90                      | 1.328
Samsung 16 MB          | 256              | 128             | ~90                      | 2.872
Western Digital 16 MB  | 1792             | 19              | ~90                      | 5.107
Western Digital 64 MB  | 256              | 1               | 100                      | 7.705
31. Discreet Mode Journaling
• Incorporating CCE into ext3
– Fingerprint the disk to find the optimal flush workload
– Create a flush zone of suitable size
– Modify ext3 to issue flush zone writes:
• One at each ordering point
• # of CCE operations = # of ordering points
• Can be used with any disk:
– As long as the disk is fingerprinted first
33. Evaluation
• Goal:
– CCE provides higher reliability
– At what cost? Is it practical to use?
• Experimental setup:
– File system: ext3
– Disk: Hitachi 8 MB
– Journaling mode: data journaling
• (See paper for ordered journaling results)
– Operating system: Linux 2.6.13, Linux 2.6.23
34. Evaluation
• What we compare:
– Regular journaling with the disk cache turned off
• “Safe” but slow
• Disk might not obey the command to turn off the cache!
– Regular journaling with the disk cache turned on
• Unsafe but fast
– Discreet mode journaling
• Midway option – safe but with cost
35. Evaluation
• Benchmarks:
– OpenSSH
• copy, untar, configure, make
– Postmark
• Simulates a mail server
• Single-threaded
– Filebench Webserver
• I/O intensive
– Filebench Varmail
• Multithreaded Postmark
36. Evaluation – OpenSSH (Data Journaling Mode)
(Figure: run time in seconds for regular w/o cache, discreet, and regular w/ cache)
37. Evaluation – Postmark (Data Journaling Mode)
(Figure: run time in seconds for regular w/o cache, discreet, and regular w/ cache)
40. Evaluation – Filebench Varmail
• Workload writes a small amount of data and calls fsync() repeatedly
• Each fsync() causes 3 CCEs
• A number of optimizations:
– Incorporate group commit in Varmail
• Improves throughput for all modes
– We use a few other techniques as well (see paper)
42. Summary
• Coerced Cache Eviction (CCE):
– Run file systems reliably on top of misbehaving disks
• Characterization of 9 SATA disk caches through fingerprints
• Discreet Mode Journaling:
– Implementation of CCE for the ext3 filesystem
– Acceptable performance on 3 workloads
• Only if the cache doesn’t use random replacement
– High overhead for apps which call fsync() frequently
43. Conclusion
• Trust in disks is weakening:
– Latent sector errors
– Block corruption
– Cache flushing
• Cloud computing systems:
– Virtualized hardware
– Large software stack
• Can such hardware be trusted?
• Will coercion be more widely used?
44. Thank you!
Advanced Systems Lab (ADSL)
University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl