[G2]fa ce deview_2012

Flash-Based Extended Cache for Higher Throughput and Faster Recovery

Woon-hak Kang, Sang-won Lee, and Bongki Moon

12. 9. 19.

Outline
•  Introduction
•  Related work
•  Flash as Cache Extension (FaCE)
  –  Design choice
  –  Two optimizations
•  Recovery in FaCE
•  Performance Evaluation
•  Conclusion

Introduction
•  Flash Memory Solid State Drive (SSD)
  –  NAND flash memory based non-volatile storage
•  Characteristics
  –  No mechanical parts
    •  Low access latency and high random IOPS
  –  Multi-channel and multi-plane
    •  Intrinsic parallelism, high concurrency
  –  No overwriting
    •  Erase-before-overwriting
    •  Read cost << Write cost
  –  Limited life span
    •  # of erasures of the flash block

Image from: http://www.legitreviews.com/article/1197/2/

Introduction (2)
•  IOPS (I/Os Per Second) matters in OLTP
•  IOPS/$: SSDs >> HDDs
  –  e.g. SSD 63 (= 28,495 IOPS / $450) vs. HDD 1.7 (= 409 IOPS / $240)
•  GB/$: HDDs >> SSDs
  –  e.g. SSD 0.073 (= 32GB / $440) vs. HDD 0.617 (= 146.8GB / $240)
•  Therefore, it is more sensible to use SSDs to supplement HDDs, rather than to replace them
  –  SSDs as cache between RAM and HDDs
  –  To provide both the performance of SSDs and the capacity of HDDs at as little cost as possible

Introduction (3)
•  A few existing flash-based cache schemes
  –  e.g. Oracle Exadata, IBM, MS
  –  Pages cached in SSDs are overwritten; the write pattern in SSDs is random
•  Write bandwidth disparity in SSDs
  –  e.g. random write (25MB/s = 6,314 x 4KB/s) vs. sequential write (243MB/s)

                   4KB Random Throughput (IOPS)   Sequential Bandwidth (MBPS)   Ratio Sequential/Random write
                   Read        Write              Read        Write
SSD mid A          28,495      6,314              251         243               9.85
SSD mid B          35,601      2,547              259         80                8.04
HDD Single         409         343                156         154               114.94
HDD Single (x8)    2,598       2,502              848         843               86.25
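
The last column of the table can be reproduced from the two write columns. The sketch below is only an illustration: the function name is hypothetical, and the conversion 1 MB = 1024 KB is an assumption (it happens to be the one that matches the ratios printed on the slide).

```python
PAGE_KB = 4  # the table reports random throughput for 4KB pages


def seq_to_random_write_ratio(seq_write_mbps, random_write_iops):
    """Sequential write bandwidth divided by random write bandwidth."""
    # Convert random-write IOPS to MB/s (assumption: 1 MB = 1024 KB).
    random_write_mbps = random_write_iops * PAGE_KB / 1024
    return seq_write_mbps / random_write_mbps


# (sequential write MB/s, random write IOPS), copied from the table
devices = {
    "SSD mid A": (243, 6314),
    "SSD mid B": (80, 2547),
    "HDD Single": (154, 343),
    "HDD Single (x8)": (843, 2502),
}

ratios = {
    name: round(seq_to_random_write_ratio(seq, iops), 2)
    for name, (seq, iops) in devices.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

For SSD mid A this gives 243 / (6,314 x 4 / 1024) = 9.85, which is the gap between sequential and random writes that a write-optimized cache tries to exploit.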

Introduction (4)
•  FaCE (Flash as Cache Extension): main contributions
  –  Write-optimized flash cache scheme: e.g. 3x higher throughput than the existing ones
  –  Faster database recovery support by exploiting the non-volatile cache pages in SSDs for recovery: e.g. 4x faster recovery time

[Figure: DRAM, SSD, and HDD hierarchy. Labels: random read and sequential write (low cost, high throughput) on the SSD; random read and random write on the HDD; non-volatility of flash cache for recovery (faster recovery).]

Related work
•  How to adopt SSDs in the DBMS area?
1.  SSD as faster disk
  –  VLDB ’08, Koltsidas et al., “Flashing up the Storage Layer”
  –  VLDB ’09, Canim et al., “An Object Placement Advisor for DB2 Using Solid State Storage”
  –  SIGMOD ’08, Lee et al., “A Case for Flash Memory SSD in Enterprise Database Applications”
2.  SSD as DRAM buffer extension
  –  VLDB ’10, Canim et al., “SSD Bufferpool Extensions for Database Systems”
  –  SIGMOD ’11, Do et al., “Turbocharging DBMS Buffer Pool Using SSDs”

Lazy Cleaning (LC) [SIGMOD’11]
•  Cache on exit
•  Write-back policy
•  LRU-based SSD cache replacement policy
  –  Incurs almost entirely random writes against the SSD
•  No efficient recovery mechanism provided

[Figure: RAM buffer (LRU) evicts pages to the flash memory SSD (random writes); flash hit; fetch on miss from HDD; stage out dirty pages to HDD.]

FaCE: Design Choices
1.  When to cache pages in SSD?
2.  What pages to cache in SSD?
3.  Sync policy b/w SSD and HDD
4.  SSD Cache Replacement Policy

Design Choices: When/What/Sync Policy
•  When: on entry vs. on exit
•  What: clean vs. dirty vs. both
•  Sync policy: write-thru vs. write-back

On exit: dirty pages as well as clean pages
Sync policy: write-back sync, for performance

[Figure: RAM buffer (LRU) evicts pages to flash as cache extension; fetch on miss from HDD; stage out dirty pages to HDD.]

Design Choices: SSD Cache Replacement Policy
•  What to do when a page is evicted from DRAM buffer and SSD cache is full
•  LRU vs. FIFO (First-In-First-Out)
  –  Write miss: LRU-based victim selection, write-back if dirty victim, and overwrite the old victim page with the new page being evicted
  –  Write hit: overwrite the old copy in flash cache with the updated page being evicted

[Figure: RAM buffer (LRU) evicts pages; random writes against SSD; HDD.]

Design Choices: SSD Cache Replacement Policy
•  LRU vs. FIFO (First-In-First-Out)
  –  Victims are chosen from the rear end of flash cache: “sequential writes” against SSD
  –  Write hit: no additional action is taken in order not to incur random writes.
    •  multiple versions in SSD cache

[Figure: RAM buffer (LRU) evicts pages into the Multi-Version FIFO (mvFIFO); HDD.]

Write Reduction in mvFIFO
•  Example
  –  Reduce three writes to HDD to one

[Figure: multiple versions of page P in the flash cache (Page P-v1, Page P-v2, Page P-v3). Labels: choose victim; invalidated version, discard; invalidated version, discard; write back version to HDD. RAM buffer (LRU) evicts pages; HDD.]
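
The write reduction in the example above can be traced with a toy model. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `MvFifoCache`, its method names, and the dictionary-based slots are all hypothetical, and the rule of skipping a clean page that already has a valid cached copy is an assumption of this sketch.

```python
from collections import deque


class MvFifoCache:
    """Toy multi-version FIFO flash cache (illustrative names only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()   # left = oldest slot, right = newest slot
        self.latest = {}       # page_id -> slot holding the valid version
        self.hdd_writes = []   # (page_id, version) pairs written back to HDD

    def on_dram_evict(self, page_id, version, dirty):
        """Called when the DRAM buffer evicts a page (cache on exit)."""
        if not dirty and page_id in self.latest:
            return  # assumption: a valid copy is already cached, nothing to do
        # Invalidate the older version instead of overwriting it in place,
        # so the flash device only ever sees appends.
        old = self.latest.pop(page_id, None)
        if old is not None:
            old["valid"] = False
        if len(self.queue) == self.capacity:
            self._dequeue_victim()
        slot = {"page_id": page_id, "version": version,
                "dirty": dirty, "valid": True}
        self.queue.append(slot)
        self.latest[page_id] = slot

    def _dequeue_victim(self):
        victim = self.queue.popleft()
        if not victim["valid"]:
            return  # stale version: discard without touching the HDD
        del self.latest[victim["page_id"]]
        if victim["dirty"]:
            self.hdd_writes.append((victim["page_id"], victim["version"]))


cache = MvFifoCache(capacity=3)
for version in (1, 2, 3):          # page P is evicted dirty three times
    cache.on_dram_evict("P", version, dirty=True)
for other in ("Q", "R", "S"):      # later evictions push P's versions out
    cache.on_dram_evict(other, 1, dirty=True)

print(cache.hdd_writes)
```

In this run the two invalidated versions of P are discarded at dequeue time and only the newest one reaches the HDD, which is the three-to-one reduction described on the slide.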

Design Choices: SSD Cache Replacement Policy
•  LRU vs. FIFO

                               LRU       FIFO
Write pattern                  Random    Sequential
Write performance              Low       High
# of copy pages                Single    Multiple
Space utilization              High      Low
Hit ratio & write reduction    High      Low

•  Trade-off: hit ratio <> write performance
  –  Write performance benefit by FIFO >> Performance gain from higher hit ratio by LRU

mvFIFO: Two Optimizations
•  Group Replacement (GR)
  –  Multiple pages are replaced in a group in order to exploit the internal parallelism in modern SSDs
  –  Replacement depth is limited by parallelism size (channel * plane)
  –  GR can improve SSD I/O throughput
•  Group Second Chance (GSC)
  –  GR + Second chance
  –  If a victim candidate page is valid and referenced, the victim is re-enqueued to the SSD cache
    •  A variant of the “clock” replacement algorithm for FaCE
  –  GSC can achieve higher hit ratio and more write reductions

Group Replacement (GR)
•  Single group read from SSD (64/128 pages)
•  Batch random writes to HDD
•  Single group write to SSD

[Figure: 1. Fetch on miss from HDD into the RAM buffer (LRU). 2. Evict to flash as cache extension. When the flash cache becomes FULL, check valid and dirty flag.]

Group Second Chance (GSC)
•  GR + Second Chance

[Figure: 1. Fetch on miss from HDD into the RAM buffer (LRU). 2. Evict to flash as cache extension. When the flash cache becomes FULL, check valid and dirty flag; check reference bit and, if it is ON, give the pages a 2nd chance.]
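
The group dequeue with second chance can be sketched as follows. This is an illustration under assumptions rather than the FaCE code: `Slot` and `group_second_chance` are hypothetical names, and the fallback that evicts the oldest page when every slot in the group earns a second chance is an assumption added here so that the dequeue always frees space.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Slot:
    page_id: str
    valid: bool = True
    dirty: bool = False
    referenced: bool = False


def group_second_chance(queue, group_size):
    """Dequeue one group from the front of the FIFO queue.

    Returns (hdd_batch, requeued_ids): page ids to write to HDD in one
    batch, and page ids that were given a second chance.
    """
    group = [queue.popleft() for _ in range(min(group_size, len(queue)))]
    hdd_batch, requeued = [], []
    for slot in group:
        if not slot.valid:
            continue                        # stale version: discard
        if slot.referenced:
            slot.referenced = False         # clear the bit, keep the page
            requeued.append(slot)
        elif slot.dirty:
            hdd_batch.append(slot.page_id)  # valid, dirty, not referenced
        # valid, clean, not referenced: dropped silently
    # Assumption of this sketch: if the whole group was referenced,
    # evict the oldest page anyway so that at least one slot is freed.
    if group and len(requeued) == len(group):
        oldest = requeued.pop(0)
        if oldest.dirty:
            hdd_batch.append(oldest.page_id)
    queue.extend(requeued)                  # re-enqueue at the rear
    return hdd_batch, [slot.page_id for slot in requeued]


queue = deque([
    Slot("A", valid=False, dirty=True),     # invalidated older version
    Slot("B", dirty=True),
    Slot("C", dirty=True, referenced=True),
    Slot("D"),
    Slot("E", dirty=True),
])
hdd_batch, requeued_ids = group_second_chance(queue, group_size=4)
print(hdd_batch, requeued_ids, [slot.page_id for slot in queue])
```

With a group size of four, A is discarded, B goes into the HDD batch, C returns to the rear with its reference bit cleared, and D is dropped, so one group read yields one batched HDD write and one group write back to the SSD.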

Recovery Issues in SSD Cache
•  With write-back sync policy, many recent copies of data pages are kept in SSD, not in HDD.
•  Therefore, database in HDD is in an inconsistent state after system failure
•  In this situation, one recovery approach with flash cache is to view database in hard disk as the only persistent DB [SIGMOD 11]
  –  Periodically checkpoint updated pages from SSD cache as well as DRAM buffer to HDD

[Figure: new version of page P in RAM and SSD (with SSD mapping information); old version of page P in HDD (persistent DB); checkpoints to HDD incur excessive checkpoint cost.]

Recovery Issues in SSD Cache (2)
•  Fortunately, because SSDs are non-volatile, pages cached in SSD are alive even after system failure.
•  However, the SSD mapping information is lost
•  Two approaches for recovering metadata.
1.  Rebuild lost metadata by scanning the whole pages cached in SSD (naïve approach): time-consuming scanning
2.  Write metadata persistently whenever metadata is changed [DaMoN 11]: run-time overhead for managing metadata persistently

[Figure: new version of page P in RAM and SSD (with SSD mapping information); old version of page P in HDD (persistent DB); labels: full scanning; flush every update.]

Recovery in FaCE
•  Metadata checkpointing
  –  Because a data page entering SSD cache is written to the rear in chronological order, metadata can be written regularly in a single large segment

[Figure: SSD mapping info. in RAM; 64K page metadata segment; periodically checkpoint metadata to flash as cache extension; crash; recovery: segment scanning; HDD.]
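
Because pages enter the cache in chronological order, the mapping can be rebuilt from the checkpointed segments plus a scan of only the pages written after the last checkpoint. The sketch below illustrates that idea under simplifying assumptions: the function names and the list-based layout are hypothetical, each metadata entry is reduced to a page id, and queue wrap-around is ignored.

```python
def checkpoint_segments(enqueue_log, segment_size):
    """Split FIFO-ordered metadata entries into persisted segments.

    Returns (persisted_segments, tail_start): the full segments that a
    periodic checkpoint would have written, and the index of the first
    flash slot whose metadata was still only in RAM at crash time.
    """
    full = len(enqueue_log) // segment_size * segment_size
    persisted = [enqueue_log[i:i + segment_size]
                 for i in range(0, full, segment_size)]
    return persisted, full


def recover_mapping(persisted, flash_pages, tail_start):
    """Rebuild page_id -> flash slot after a crash.

    Persisted segments are replayed in order, then only the slots after
    the last checkpoint are scanned. Later slots overwrite earlier ones,
    so the newest version of each page wins.
    """
    mapping = {}
    slot = 0
    for segment in persisted:
        for page_id in segment:
            mapping[page_id] = slot
            slot += 1
    scanned = 0
    for slot in range(tail_start, len(flash_pages)):
        mapping[flash_pages[slot]] = slot   # stands in for reading a page header
        scanned += 1
    return mapping, scanned


# Page ids in the order they were written to the flash cache.
flash_pages = ["A", "B", "A", "C", "B", "D", "A"]
persisted, tail_start = checkpoint_segments(flash_pages, segment_size=3)
mapping, scanned = recover_mapping(persisted, flash_pages, tail_start)
print(mapping, scanned)
```

Here two segments of three entries are already persistent, so recovery scans a single flash slot instead of all seven, in contrast to the naïve full scan.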

Experimental Set-Up
•  FaCE Implementation in PostgreSQL
  –  3 functions in buffer mgr.: bufferAlloc(), getFreeBuffer(), bufferSync()
  –  2 functions in bootstrap for recovery: startupXLOG(), initBufferPool()
•  Experiment Setup
  –  CentOS Linux
  –  Intel Core i7-860 2.8 GHz (quad core) and 4GB DRAM
  –  Disks: 8 RAIDed 15k rpm Seagate SAS HDDs (146.8GB)
  –  SSD: Samsung MLC (256GB)
•  Workloads
  –  TPC-C with 500 warehouses (50GB) and 50 concurrent clients
  –  BenchmarkSQL

Transaction Throughput

[Figure: transactions per minute (0 to 7000) vs. |Flash cache|/|Database| (%), from 4 to 28, for HDD only, SSD only, LC, FaCE-basic, FaCE+GR, and FaCE+GSC. Speedup labels on the plot: 1.5x, 2.1x, 2.4x, 2.6x, 2.6x, 3.1x, 3.9x.]

Hit Ratio, Write Reduction, and I/O Throughput

[Figure: Flash Cache Hit Ratio. Hit ratio (%), 60 to 100, vs. flash cache size (2GB to 10GB) for LC, FaCE, FaCE+GR, and FaCE+GSC.]

Hit Ratio, Write Reduction, and I/O Throughput (2)

[Figure: Write Reduction Ratio by Flash Cache. Ratio (%), 40 to 100, vs. flash cache size (2GB to 10GB) for LC, FaCE, FaCE+GR, and FaCE+GSC.]

Hit Ratio, Write Reduction, and I/O Throughput (3)

[Figure: Throughput of 4KB-page I/O. Throughput (4KB), 0 to 16000, vs. flash cache size (2GB to 10GB) for LC, FaCE, FaCE+GR, and FaCE+GSC.]

Recovery Performance
•  4.4x faster recovery than HDD only approach

[Figure: recovery time. Labels: redo time: 823; metadata recovery time: 2; redo time: 186.]

Conclusion
•  We presented a low-overhead caching method called FaCE that utilizes flash memory as an extension to a DRAM buffer for a recoverable database.
•  FaCE can maximize the I/O throughput of a flash caching device by turning small random writes into large sequential ones.
•  Also, FaCE takes advantage of the non-volatility of flash memory to accelerate the system restart from a failure.
