This presentation was given by Prashanth Menon at ICDE '14 on April 3, 2014 in Chicago, IL, USA.
The full paper and additional information are available at:
http://msrg.org/papers/Menon2013
Abstract:
With the ever growing size and complexity of enterprise systems there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses.
With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential vs. random access is now becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.
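The hybrid design the abstract describes can be pictured as a read-through cache: an SSD tier absorbs repeated reads so the HDD-backed store is touched only on a miss. The sketch below is an in-memory stand-in for that idea; the class and method names are invented for illustration and are not from the paper.

```python
# Illustrative sketch of an SSD read cache in front of an HDD-backed
# key-value store. Dicts stand in for the two storage tiers.
from collections import OrderedDict

class HybridKVStore:
    def __init__(self, ssd_capacity):
        self.hdd = {}                      # backing store (slow, large)
        self.ssd = OrderedDict()           # cache tier (fast, limited), LRU order
        self.ssd_capacity = ssd_capacity

    def put(self, key, value):
        self.hdd[key] = value              # writes go to the backing store
        self.ssd.pop(key, None)            # invalidate any stale cached copy

    def get(self, key):
        if key in self.ssd:                # cache hit: serve from the SSD tier
            self.ssd.move_to_end(key)
            return self.ssd[key]
        value = self.hdd[key]              # cache miss: read from the HDD tier
        self.ssd[key] = value              # populate the cache on read
        if len(self.ssd) > self.ssd_capacity:
            self.ssd.popitem(last=False)   # evict the least recently used entry
        return value
```

A read-heavy workload then hits the fast tier most of the time, which is exactly the niche the paper evaluates SSDs for.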
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra - DataStax Academy
As we move into the world of Big Data and the Internet of Things, the systems architectures and data models we've relied on for decades are becoming a hindrance. At the core of the problem is the read-modify-write cycle. In this session, Al will talk about how to build systems that don't rely on RMW, with a focus on Cassandra. Finally, for those times when RMW is unavoidable, he will cover how and when to use Cassandra's lightweight transactions and collections.
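For the unavoidable-RMW case, Cassandra's lightweight transactions provide compare-and-set semantics (e.g. `UPDATE ... IF col = expected`, returning an `[applied]` flag). The toy store below only simulates those semantics in memory to show why a conditional write avoids lost updates; it is not the Cassandra API.

```python
# In-memory simulation of compare-and-set, the semantics behind
# Cassandra's lightweight transactions. Illustrative only.
class CasStore:
    def __init__(self):
        self.rows = {}

    def insert_if_not_exists(self, key, value):
        if key in self.rows:
            return False                   # [applied] = false
        self.rows[key] = value
        return True

    def update_if(self, key, expected, new_value):
        if self.rows.get(key) != expected:
            return False                   # condition failed: nothing written
        self.rows[key] = new_value
        return True
```

If two writers both read a balance of 100 and issue conditional updates, only the first succeeds; the second sees `[applied] = false` and must re-read and retry instead of silently overwriting.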
San Francisco Cassandra Meetup - March 2014: I/O Performance tuning on AWS fo... - DataStax Academy
What You'll Learn at this Meetup
Tips and Tricks to achieve high performance when running Cassandra on AWS
• Configuration tuning for Cassandra
• Tools to benchmark raw filesystem IO
• Available AWS AMIs to boost performance
• Stress testing on AWS i2 HVM instances
• Configuring AWS EC2 instances with SSDs and EBS storage with PIOPS
Global Azure Virtual 2020: What's new on Azure IaaS for SQL VMs - Marco Obinu
How to size a VM for SQL Server in Azure IaaS, in light of the latest platform updates. Session delivered on April 24, 2020, as part of Global Azure Virtual 2020.
Session video: https://youtu.be/7o80CJUtnh4
Demo: https://github.com/OmegaMadLab/SqlIaasVmPlayground
ARM template optimized for SQL Server: https://github.com/OmegaMadLab/OptimizedSqlVm-v2
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1 - DataStax Academy
Presenter: Aaron Morton, Apache Cassandra Committer & Co-Founder of The Last Pickle
Apache Cassandra 2.0 and 2.1 include a wealth of new and updated features. Some are well known, others are known to only a few. But any of them could help you reduce latency, improve throughput, or make operations easier. This talk will take a deep dive into features that improve: Compaction, Write Performance, Memory Management, CQL 3, TTL and Tombstones, & Repair. Existing and new users will benefit from this wide ranging view of the features Apache Cassandra offers.
Presentation from the 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B... - DataStax Academy
Speaker(s): Kathryn Erickson, Engineering at DataStax
During this session we will discuss varying recommended hardware configurations for DSE. We’ll get right to the point and provide quick and solid recommendations up front. After we get the main points down take a brief tour of the history of database storage and then focus on designing a storage subsystem that won't let you down.
Introducing MagnetoDB, a key-value storage service for OpenStack - Mirantis
Introducing MagnetoDB, NoSQL database as a service for OpenStack. MagnetoDB acts as a key-value store, is tightly integrated with OpenStack, and is compatible with the Amazon DynamoDB API, so it can be used as a drop-in replacement.
Yesterday's thinking may still hold that NVMe (NVM Express) is in transition to a production-ready solution. In this session, we will discuss how NVMe has become ready for production, tracing the history and evolution of NVMe and the Linux stack to show where NVMe has progressed today: a low-latency, highly reliable key-value storage mechanism for databases that will drive the future of cloud expansion. Examples of protocol efficiencies and the types of storage engines that are optimizing for NVMe will be discussed. Please join us for an exciting session on how in-memory computing and persistence have evolved.
In this talk we report on our experience with Redis-on-Flash (RoF)—a recently introduced product that uses SSDs as a RAM extension to dramatically increase the effective dataset capacity that can be stored on a single server. This talk provides the first in-depth RoF system performance characterization: we consider different use cases (varying both RAM-to-disk access ratio and object size), and compare SATA-based RoF, NVMe-based RoF, and all-RAM Redis deployments. We show that the superior performance of NVMe drives in terms of both latency and peak bandwidth makes them a particularly good fit for RoF use cases. Specifically, we show that backing RoF with NVMe drives can deliver more than 2 million operations per second with sub-millisecond latency on a single server.
AWS June Webinar Series - Getting Started: Amazon Redshift - Amazon Web Services
Amazon Redshift is a fast, fully-managed petabyte-scale data warehouse service, for less than $1,000 per TB per year. In this presentation, you'll get an overview of Amazon Redshift, including how it uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Learn how, with just a few clicks in the AWS Management Console, you can set up a fully functional data warehouse that is ready to accept data without learning any new languages and that plugs in easily with the existing business intelligence tools and applications you use today. This webinar is ideal for anyone looking to gain deeper insight into their data, without the usual challenges of time, cost, and effort.
In this webinar, you will learn how to:
• Understand what Amazon Redshift is and how it works
• Create a data warehouse interactively through the AWS Management Console
• Load data into your new Amazon Redshift data warehouse from S3
Who should attend:
• IT professionals, developers, line-of-business managers
With AWS you can choose the right database technology and software for the job. Given the myriad of choices, from relational databases to non-relational stores, this session provides details and examples of some of the choices available to you. This session also provides details about real-world deployments from customers using Amazon RDS, Amazon ElastiCache, Amazon DynamoDB, and Amazon Redshift.
SQL Server Reporting Services Disaster Recovery webinar - Denny Lee
This is the PASS DW|BI virtual chapter webinar on SQL Server Reporting Services Disaster Recovery with Ayad Shammout and me - hosted by Julie Koesmarno (@mssqlgirl)
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things - Amazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce and Amazon Redshift, fully managed petabyte-scale data warehouse solutions.
- Learn about real time data processing with Amazon Kinesis.
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
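The columnar-storage argument above can be made concrete with a toy example: an aggregate over one column of a column-oriented table reads a single contiguous array instead of every field of every row. This is an illustrative sketch, not Redshift code.

```python
# Row layout vs. column layout for the same tiny "orders" table.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 42.25},
]

# Row store: the scan walks whole rows just to read one field.
row_total = sum(r["amount"] for r in rows)

# Column store: the same table kept as one array per column; the
# aggregate touches only the "amount" array.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 42.25],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 237.75
```

On disk the difference is I/O, not just iteration: the column scan reads a fraction of the bytes, which is where the "minimize I/O time" claim comes from.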
Getting Started with Managed Database Services on AWS - September 2016 Webina... - Amazon Web Services
On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We will cover how each service might help support your application, how much each service costs, and how to get started.
Learning Objectives:
• Overview of managed database services available on AWS
• How to combine them for high-performance cost effective architectures
• Learn how to choose between the AWS database services based on the use case
Who Should Attend:
• IT Managers, DBAs, Enterprise and Solution Architects, DevOps Engineers, and Developers
SQL Server Reporting Services Disaster Recovery Webinar - Denny Lee
This is the PASS DW/BI webinar on SQL Server Reporting Services (SSRS) Disaster Recovery. You can find the video at: http://www.youtube.com/watch?v=gfT9ETyLRlA
Best Practices for Supercharging Cloud Analytics on Amazon Redshift - SnapLogic
In this webinar, we discuss how the secret sauce to your business analytics strategy remains rooted in your approach, methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and tips and tricks on designing the right information architecture, data models, and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
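Scale-out through sharding, mentioned above, routes each document to a shard by its shard key; with hashed sharding the key is hashed first so writes spread evenly across shards. A minimal sketch of that routing idea follows (illustrative only; in a real deployment the mongos router and config servers do this, and shard names are invented here):

```python
# Hash-based shard routing: hash the shard key, pick a shard by modulo.
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]

def route(shard_key):
    # A stable hash of the key; md5 is used here only as a convenient
    # deterministic hash, not for security.
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the hash is deterministic, a later read of `route("user:42")` lands on the same shard the write went to, while sequential keys (timestamps, counters) no longer pile onto one shard.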
Selecting the Right AWS Database Solution - AWS 2017 Online Tech Talks - Amazon Web Services
• Get an overview of managed database services available on AWS
• Learn how to combine them for high-performance cost effective architectures
• Learn how to choose between the AWS database services based on your use case
On AWS you can choose from a variety of managed database services that save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We'll explain the fundamentals of Amazon RDS, a managed relational database service in the cloud; Amazon DynamoDB, a fully managed NoSQL database service; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be economical. We will cover how each service might help support your application and how to get started.
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database engine with the speed, reliability, and availability of high-end commercial databases at one-tenth the cost. This session introduces you to Amazon Aurora, explores its capabilities and features, explains common use cases, and helps you get started with Aurora.
Storage Systems for High Scalable Systems Presentation - andyman3000
Presentation from http://www.hfadeel.com/Blog/?p=151 on what kind of storage systems players like Facebook or Google use for their extreme scalability requirements.
1. If it’s not SQL, it’s not a database.
2. It takes 5+ years to build a database.
3. Listen to your users.
4. Too much magic is a bad thing.
5. It’s the cloud, stupid.
Similar to CaSSanDra: An SSD Boosted Key-Value Store
TPC-DI - The First Industry Benchmark for Data Integration - Tilmann Rabl
This presentation was given by Meikel Poess on September 3, 2014 at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL) and the tools supporting this process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive acronym, data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable and easy to maintain data integration system. This is especially important as the complexity, variety and volume of data is constantly increasing and performance of data integration systems is becoming very critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing their performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including its workload, run rules, and metric, and explains key design decisions.
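The extract-transform-load process the benchmark models can be sketched in miniature: pull records from a source format, normalize them into a unified representation, and load them into a target store. All names below are illustrative and not part of TPC-DI itself.

```python
# A toy ETL / data-integration pipeline: CSV-like text in, an
# aggregated warehouse table out.

def extract(source):
    # Extract: read raw records from the source system.
    return source.splitlines()

def transform(lines):
    # Transform: parse, clean, and normalize into a unified model.
    records = []
    for line in lines:
        name, amount = line.split(",")
        records.append({"name": name.strip().upper(),
                        "amount": float(amount)})
    return records

def load(records, target):
    # Load: merge the normalized records into the target store.
    for rec in records:
        target[rec["name"]] = target.get(rec["name"], 0.0) + rec["amount"]

source = "alice, 10.5\nbob, 3.0\nalice, 2.0"
warehouse = {}
load(transform(extract(source)), warehouse)
```

TPC-DI measures how fast and how scalably a real DI system executes this kind of pipeline over realistic, heterogeneous source data.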
This presentation was held at ISC 2014 on June 26, 2014 in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops for Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and, subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Council subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on Terasort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also actively under development including BigBench, which extends the TPC-DS benchmark for big data scenarios; Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics for different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce-style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
This tutorial was held at IEEE BigData '14 on October 29, 2014 in Bethesda, MD, USA.
Presenters: Chaitan Baru and Tilmann Rabl
More information available at:
http://msrg.org/papers/BigData14-Rabl
Summary:
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks, for creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data, e.g. structured, semi- structured, unstructured, graphs, and streams; large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application- level benchmarking.
Since May 2012, five workshops have been held on Big Data Benchmarking including participation from industry and academia. One of the outcomes of these meetings has been the creation of industry’s first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council’s benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings, by experts in big data as well as benchmarking. Two key approaches are now being pursued—one, called BigBench, is based on extending the TPC- Decision Support (TPC-DS) benchmark with big data applications characteristics. The other called Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
A BigBench Implementation in the Hadoop Ecosystem - Tilmann Rabl
This presentation was held at WBDB.us 2013 on October 10, 2013 in San Jose, CA, USA
Full paper and additional information at:
http://msrg.org/papers/WBDB2013BigBench
Abstract:
BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated. In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized using Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We will present the different design choices we took and show a proof-of-concept evaluation.
MADES - A Multi-Layered, Adaptive, Distributed Event Store - Tilmann Rabl
This demo was presented at DEBS'13 on July 1, 2013 in Arlington, Texas, USA.
The full paper and more information are available at:
http://msrg.org/papers/DEBS13Mades
Abstract:
Application performance monitoring (APM) is shifting towards capturing and analyzing every event that arises in an enterprise infrastructure. Current APM systems, for example, make it possible to monitor enterprise applications at the granularity of tracing each method invocation (i.e., an event). Naturally, there is great interest in monitoring these events in real-time to react to system and application failures and in storing the captured information for an extended period of time to enable detailed system analysis, data analytics, and future auditing of trends in the historic data. However, the high insertion-rates (up to millions of events per second) and the purposely limited resource, a small fraction of all enterprise resources (i.e., 1-2% of the overall system resources), dedicated to APM are the key challenges for applying current data management solutions in this context. Emerging distributed key-value stores, often positioned to operate at this scale, induce additional storage overhead when dealing with relatively small data points (e.g., method invocation events) inserted at a rate of millions per second. Thus, they are not a promising solution for such an important class of workloads given APM's highly constrained resource budget. In this paper, to address these shortcomings, we present Multilayered, Adaptive, Distributed Event Store (MADES): a massively distributed store for collecting, querying, and storing event data at a rate of millions of events per second.
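One tactic the problem statement implies is amortizing per-record overhead by packing many small events into a single batch before it is stored, instead of paying key-value-store overhead per tiny event. The fixed-width encoding below is an illustrative sketch, not MADES's actual format; the field layout is invented.

```python
# Pack small (timestamp_ms, duration_us) method-invocation events into
# one binary block: an 8-byte count header, then fixed 12-byte records.
import struct

def pack_events(events):
    block = bytearray(struct.pack("<Q", len(events)))
    for ts, dur in events:
        block += struct.pack("<QI", ts, dur)   # 8-byte ts + 4-byte duration
    return bytes(block)

def unpack_events(block):
    (count,) = struct.unpack_from("<Q", block, 0)
    events, offset = [], 8
    for _ in range(count):
        ts, dur = struct.unpack_from("<QI", block, offset)
        events.append((ts, dur))
        offset += 12
    return events
```

A batch of a million events costs one header plus 12 bytes each, versus per-key metadata for a million individual inserts, which is the kind of overhead the paper argues makes generic key-value stores a poor fit for APM's resource budget.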
Rapid Development of Data Generators Using Meta Generators in PDGF - Tilmann Rabl
This is a presentation that was held at the Sixth International Workshop on Testing Database Systems, collocated with ACM SIGMOD 2013, June 24, New York, USA.
Full paper and additional information available at:
http://msrg.org/papers/dbtest13-rabl
Abstract:
Generating data sets for the performance testing of database systems on a particular hardware configuration and application domain is a very time consuming and tedious process. It is time consuming because of the large amount of data that needs to be generated, and tedious because new data generators might need to be developed or existing ones adjusted. The difficulty in generating this data is amplified by constant advances in hardware and software that allow the testing of ever larger and more complicated systems. In this paper, we present an approach for rapidly developing customized data generators. Our approach, which is based on the Parallel Data Generator Framework (PDGF), deploys a new concept of so-called meta generators. Meta generators extend the concept of column-based generators in PDGF. Deploying meta generators in PDGF significantly reduces the development effort of customized data generators, facilitates their debugging, and eases their maintenance.
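The meta-generator concept can be sketched as a generator that wraps other column generators and post-processes their output, for example injecting NULLs. PDGF itself is a Java framework; the Python names below are invented for illustration, and per-row seeding stands in for PDGF's repeatable, parallel generation.

```python
# Column generators produce one value per row number; seeding the RNG
# with the row number makes generation repeatable and parallelizable.
import random

def id_generator(row):
    return row                                  # deterministic surrogate key

def name_generator(row):
    rng = random.Random(row)                    # seeded per row: repeatable
    return rng.choice(["alice", "bob", "carol"])

def null_meta_generator(inner, null_rate):
    """Meta generator: wraps an inner generator and injects NULLs."""
    def generate(row):
        rng = random.Random("null-%d" % row)    # independent per-row seed
        return None if rng.random() < null_rate else inner(row)
    return generate

# Compose generators instead of writing a new one from scratch.
email_gen = null_meta_generator(
    lambda row: name_generator(row) + "@example.com", null_rate=0.2)
```

The point of the composition is reuse: a new "email column with 20% NULLs" needs no new generator code, only a wrapper around existing pieces.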
Solving Big Data Challenges for Enterprise Application Performance Management - Tilmann Rabl
This is a presentation that was held at the 38th Conference on Very Large Databases (VLDB), 2012.
Full paper and additional information available at:
http://msrg.org/papers/vldb12-bigdata
Abstract:
As the complexity of enterprise systems increases, the need for monitoring and analyzing such systems also grows. A number of companies have built sophisticated monitoring tools that go far beyond simple resource utilization reports. For example, based on instrumentation and specialized APIs, it is now possible to monitor single method invocations and trace individual transactions across geographically distributed systems. This high-level of detail enables more precise forms of analysis and prediction but comes at the price of high data rates (i.e., big data). To maximize the benefit of data monitoring, the data has to be stored for an extended period of time for later analysis. This new wave of big data analytics imposes new challenges especially for the application performance monitoring systems. The monitoring data has to be stored in a system that can sustain the high data rates and at the same time enable an up-to-date view of the underlying infrastructure. With the advent of modern key-value stores, a variety of data storage systems have emerged that are built with a focus on scalability and the high data rates predominant in this monitoring use case.
In this work, we present our experience and a comprehensive performance evaluation of six modern (open-source) data stores in the context of application performance monitoring as part of CA Technologies initiative. We evaluated these systems with data and workloads that can be found in application performance monitoring, as well as, on-line advertisement, power monitoring, and many other use cases. We present our insights not only as performance results but also as lessons learned and our experience relating to the setup and configuration complexity of these data stores in an industry setting.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
GridMate - End to end testing is a critical piece to ensure quality and avoid...
CaSSanDra: An SSD Boosted Key-Value Store
1. CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi (*), Hans-Arno Jacobsen
University of Toronto, Middleware Systems Research Group (MSRG.ORG)
2. Outline
• Application Performance Management
• Cassandra and SSDs
• Extending Cassandra's Row Cache
• Implementing a Dynamic Schema Catalogue
• Conclusions
3. Modern Enterprise Architecture
• Many different software systems
• Complex interactions
• Stateful systems often distributed/partitioned/replicated
• Stateless systems certainly duplicated
4. Application Performance Management
• Lightweight agent attached to each software system instance
• Monitors system health
• Traces transactions
• Determines root causes
• Raw APM metric:
[Diagram: agents attached to every system instance across the enterprise]
5. Application Performance Management
• Problem: Agents have short memory and only have a local view
• What was the average response time for requests served by servlet X between December 18-31, 2011?
• What was the average time spent in each service/database to respond to client requests?
6. APM Metrics Datastore
• All agents store metric data in high write-throughput datastore
• Metric data is at a fine granularity (per-action, millisecond, etc.)
• User now has global view of metrics
• What is the best database to store APM metrics?
7. Cassandra Wins APM
• APM experiments performed by Rabl et al. [1] show Cassandra performs best for the APM use case
• In-memory workloads including 95%, 50%, and 5% reads
• Workloads requiring disk access with 95%, 50%, and 5% reads
[Figures 3-6 from Rabl et al.: throughput and read/write latency for Workload R (50% reads) and Workload RW (95% reads) across Cassandra, HBase, Voldemort, VoltDB, Redis, and MySQL, scaling from 2 to 12 nodes with the problem size scaled to the cluster size; 10-minute maximum-throughput runs on freshly installed systems. On a single node, Redis has the highest throughput (more than 50K ops/sec) followed by VoltDB; Cassandra and MySQL are about half of Redis (25K ops/sec); Voldemort is 2x slower than Cassandra (12K ops/sec); HBase is slowest at 2.5K operations per second.]
[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf
8. Cassandra
• Built at Facebook by previous Dynamo engineers
• Open sourced to Apache in 2009
• DHT with consistent hashing
• MD5 hash of key
• Multiple nodes handle segments of ring for load balancing
• Dynamo distribution and replication model + BigTable storage model
[Storage components: Commit Log, Memtable, SSTables]
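The ring placement described above can be sketched in a few lines. This is an illustrative toy, not Cassandra's implementation (which uses explicit partitioner tokens, replica placement strategies, and later virtual nodes); the node names and key are made up:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each node owns the arc ending at its token."""

    def __init__(self, nodes):
        # Token derived from the node name here; Cassandra assigns tokens explicitly.
        self.tokens = sorted((self._md5(n), n) for n in nodes)

    @staticmethod
    def _md5(key):
        # MD5 of the key determines its position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        # First node whose token is >= the key's hash, wrapping around the ring.
        h = self._md5(key)
        i = bisect.bisect_left(self.tokens, (h, ""))
        return self.tokens[i % len(self.tokens)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("HostA/AgentX/AVGResponse"))
```

Because only token boundaries move when a node joins or leaves, most keys keep their owner, which is the load-balancing property the slide refers to.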
9. Cassandra and SSDs
• Improve performance by either adding nodes or improving per-node performance
• Node performance is directly dependent on the disk I/O performance of the system
• Cassandra stores two entities on disk:
• Commit Log
• SSTables
• Should SSDs be used to store both?
• We evaluated each possible configuration
10. Experiment Setup
• Server specification:
• 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel x520 SSD
• Apache Cassandra 1.10
• Used YCSB benchmark
• 100M rows, 50GB total raw data, 'latest' distribution
• 95% read, 5% write
• Minimum three runs per workload, fresh data on each run
• Broken into phases:
• Data load
• Fragmentation
• Cache warm-up
• Workload (> 12h process)
11. SSD vs. HDD
• Location of log is irrelevant
• Location of data is important
• Dramatic performance improvement of SSD over HDD
• SSD benefits from high parallelism

Configuration  # of clients  # of threads/client  Location of Data  Location of Commit Log
C1             1             2                    RAID (HDD)        RAID (HDD)
C2             1             2                    RAID (HDD)        SSD
C3             1             2                    SSD               RAID (HDD)
C4             1             2                    SSD               SSD
C5             4             16                   RAID (HDD)        RAID (HDD)
C6             4             16                   SSD               SSD

[Fig. 4(a)-(b): throughput and latency for configurations C1-C6. From the accompanying paper text: keeping the bulk of infrequently accessed data on HDD is also motivated by the fact that SSD performance degrades with higher fill ratios, as seen in Fig. 4(c).]
12. SSD vs. HDD (II)
• SSD offers more than 7x improvement to throughput on empty disk
• SSD performance degrades by half as storage device fills up
• Filling the SSD or running it near capacity is not advisable

[Fig. 4(c)-(d): throughput and latency for HDD vs. SSD with empty vs. 99%-full disks. From the accompanying paper text: a larger portion of the hot data is cached on the SSD; this configuration stored more than twice the amount of data compared to an in-memory cache alone, achieving a cache-hit ratio of more than 85%, and a read for a row not in the off-heap memory cache requires only a single SSD seek.]
13. SSD vs. HDD: Summary
• Cassandra benefits most when storing data on SSD (not the log)
• Location of commit log not important
• SSD performance inversely proportional to fill ratio
• Storing all data on SSD is uneconomical
• Replacing 3TB HDD with 3x 1TB SSD is 10x more costly
• SSDs have limited lifetime (10-50K write-erase cycles), need replacement more frequently
• Rabl et al. [1] show adding a node is 100% costlier, with 100% throughput improvement
• Build hybrid system to get comparable performance for marginal cost
14. Cassandra: Read + Write Path
• Write path is fast:
1. Write update into commit log
2. Write update into Memtable
• Memtables flush to SSTables asynchronously when full
• Never blocks writes
• Read path can be slow:
1. Read key-value from Memtable
2. Read key-value from each SSTable on disk
3. Construct merged view of row from each input source
• Each read needs to do O(# of SSTables) I/O
[Diagram: updates go to the commit log and Memtable in memory; reads merge the Memtable with multiple SSTables on disk]
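The merge step in the read path can be illustrated with a small sketch. Assumptions: each source is a plain dict keyed by row key, and sources are ordered newest first so a newer column value shadows an older one (real Cassandra resolves conflicts with per-cell timestamps; the row values are made up):

```python
def merged_row(key, sources):
    """Combine the column fragments of one row; a newer value shadows an older one."""
    row = {}
    for table in sources:                 # newest source first
        fragment = table.get(key, {})
        for column, value in fragment.items():
            row.setdefault(column, value) # keep the newest value already seen
    return row

# Newest-first: the Memtable, then SSTables from newest to oldest.
memtable = {"99231234": {"age": 26}}
sstable_new = {"99231234": {"age": 25, "dept": "MSRG"}}
sstable_old = {"99231234": {"first": "Prashanth", "last": "Menon"}}

print(merged_row("99231234", [memtable, sstable_new, sstable_old]))
# {'age': 26, 'dept': 'MSRG', 'first': 'Prashanth', 'last': 'Menon'}
```

Note that every source holding a fragment of the row must be consulted, which is exactly the O(# of SSTables) I/O cost the slide names.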
15. Cassandra: SSTables
• Cassandra allows blind-writes
• Row data can be fragmented over multiple SSTables over time
• Bloom filters and indexes can potentially help
• Ultimately, multiple fragments need to be read from disk

[Example: a single logical row (Employee ID 99231234, First Name Prashanth, Last Name Menon, Age 25, Department ID MSRG) with its columns spread across several SSTables]
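The "Bloom filters can help" bullet works because each SSTable keeps a small filter in memory, and a definitive "no" lets the read path skip that table's disk access entirely; only a "maybe" (possibly a false positive) forces a seek. A minimal sketch, where the bit size, hash scheme, and key are illustrative rather than Cassandra's actual filter:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: 'no' is definitive, 'maybe' can be a false positive."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0                        # bit array packed into one int

    def _positions(self, key):
        # Derive k positions by salting the key; real filters use faster hashes.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

# One filter per SSTable: consult it before paying for a disk read.
bf = BloomFilter()
bf.add("99231234")
print(bf.might_contain("99231234"))   # True: added keys are always found
```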
16. Cassandra: Row Cache
• Row cache buffers full merged row in memory
• Cache miss follows regular read path, constructs merged row, brings it into cache
• Makes read path faster for frequently accessed data
• Problem: Row cache occupies memory
• Takes away precious memory from rest of system
• Extend the row cache efficiently onto SSD
[Diagram: Row Cache sits in memory alongside the Memtable, in front of the SSTables and log on disk]
17. Extended Row Cache
• Extend the row cache onto SSD
• Chained with in-memory row cache
• LRU in memory, overflow onto LRU SSD row cache
• Implemented as append-only cache files
• Efficient sequential writes
• Fast random reads
• Zero I/O for hit in first-level row cache
• One random I/O on SSD for second-level row cache
[Diagram: Memtable and 1st-level row cache in memory, with a 2nd-level cache index pointing to the 2nd-level row cache on SSD; SSTables and log remain on disk]
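The two-level design above (LRU in memory, evictions appended sequentially to an SSD-resident file, second-level hits served with one seek through an in-memory offset index) can be sketched as follows. This is a simplified model, not the paper's implementation: capacity is counted in entries, there is no garbage collection of the cache file, and all names are invented:

```python
from collections import OrderedDict
import os
import tempfile

class TwoLevelRowCache:
    """LRU row cache in memory that spills evicted rows to an append-only file."""

    def __init__(self, capacity, path):
        self.capacity = capacity
        self.memory = OrderedDict()   # first level: key -> row bytes (LRU order)
        self.index = {}               # second level: key -> (offset, length)
        self.file = open(path, "ab+")

    def put(self, key, row: bytes):
        self.memory[key] = row
        self.memory.move_to_end(key)
        if len(self.memory) > self.capacity:
            # Evict the least recently used row with a cheap sequential append.
            victim, data = self.memory.popitem(last=False)
            offset = self.file.seek(0, os.SEEK_END)
            self.file.write(data)
            self.index[victim] = (offset, len(data))

    def get(self, key):
        if key in self.memory:        # zero I/O on a first-level hit
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.index:         # one random read on a second-level hit
            offset, length = self.index[key]
            self.file.seek(offset)
            return self.file.read(length)
        return None                   # miss: fall back to the SSTable read path

path = os.path.join(tempfile.mkdtemp(), "rowcache.bin")
cache = TwoLevelRowCache(2, path)
cache.put("a", b"row-a"); cache.put("b", b"row-b"); cache.put("c", b"row-c")
print(cache.get("a"))   # b'row-a', served from the spill file with one seek
```

The append-only layout is what makes the scheme SSD-friendly: writes are sequential, and reads are single random seeks, which the slides note SSDs handle well.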
18. Evaluation: SSD Row Cache
• Setup:
• 100M rows, 50GB total data, 6GB row cache
• Results:
• 75% improvement in throughput
• 75% improvement in latency
• RAM-only cache has too low a hit ratio
[Fig. 5(a)-(b): throughput and latency at 95%, 85%, and 75% reads with the row cache disabled, RAM-only, and RAM+SSD]
19. Dynamic Schema
• Key-value stores covet schema-less data model
• Very flexible, good for highly varying data
• Schemas often change, defining up front can be detrimental
• Observation: many big data applications have relatively stable schemas
• e.g., click stream, APM, sensor data, etc.
• Redundant schemas have significant overhead in I/O and space usage

Application format:
Metric Name                Timestamp   Value  Max  Min
HostA/AgentX/AVGResponse   1332988833  4      6    1

On-disk format (column names repeated in every row):
Metric Name: HostA/AgentX/AVGResponse, Timestamp: 1332988833, Value: 4, Max: 6, Min: 1
Metric Name: HostA/AgentX/AVGResponse, Timestamp: 1332988848, Value: 5, Max: 7, Min: 1
Metric Name: HostA/AgentX/Failures, Timestamp: 1332988849, All: 4, Warn: 3, Error: 1
20. Dynamic Schema (III)
• Don't serialize redundant schema with rows
• Extract schema from data, store on SSD, serialize schema ID with data
• Allows for large number of schemas

Schema catalogue (on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New disk format (schema ID + values):
S1, HostA/AgentX/AVGResponse, 1332988833, 4, 6, 1
S1, HostA/AgentX/AVGResponse, 1332988848, 5, 7, 1
S2, HostA/AgentX/Failures, 1332988849, 4, 3, 1
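The catalogue idea can be sketched as interning column-name tuples: each distinct schema is stored once (on SSD, in the paper's design) and every row carries only a small schema ID plus its values. Class and method names here are invented for illustration:

```python
class SchemaCatalogue:
    """Intern each distinct column-name tuple once; rows become (schema_id, values)."""

    def __init__(self):
        self.by_schema = {}   # column tuple -> schema id
        self.by_id = {}       # schema id -> column tuple (the part persisted on SSD)

    def encode(self, row: dict):
        # Reuse the id if this column set was seen before, else register it.
        columns = tuple(row)
        sid = self.by_schema.setdefault(columns, len(self.by_schema))
        self.by_id[sid] = columns
        return (sid, tuple(row.values()))

    def decode(self, sid, values):
        # Rebuild the full row by zipping the catalogued names with the values.
        return dict(zip(self.by_id[sid], values))

cat = SchemaCatalogue()
r1 = cat.encode({"Metric Name": "HostA/AgentX/AVGResponse", "Timestamp": 1332988833,
                 "Value": 4, "Max": 6, "Min": 1})
r2 = cat.encode({"Metric Name": "HostA/AgentX/Failures", "Timestamp": 1332988849,
                 "All": 4, "Warn": 3, "Error": 1})
print(r1[0], r2[0])   # 0 1 -- two distinct schemas, each stored once
```

Since the metric rows above repeat the same five column names millions of times, storing each name set once and a small ID per row is where the space and I/O savings come from.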
21. Evaluation: Dynamic Schema
• Setup:
• 40M rows, variable columns 5-10 (638 schemas), 6GB row cache
• Results:
• 10% reduction in disk usage (6.8GB vs 6GB)
• Slightly improved throughput, stable latency
• Effective SSD usage (only random reads) & reduced I/O and space usage
[Fig. 5(c)-(d): throughput and latency at 95%, 50%, and 5% reads, regular vs. dynamic schema. From the accompanying paper text: data sizes averaged 6.8GB compressed after the initial load of 40 million keys, versus 6.01GB with the modified Cassandra, a savings of roughly 10% that grows with the number and length of column names.]
22. Conclusions
• Storing Cassandra commit logs on SSD doesn't help
• Running SSDs at capacity degrades their performance
• Using SSDs as a secondary row cache dramatically improves performance
• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O
23. Thanks!
• Questions?
• Contact:
• Prashanth Menon (prashanth.menon@utoronto.ca)
24. Future Work
• What types of tables benefit most from a dynamic schema?
• Impact of compaction on read-heavy workloads
• How can SSDs be used to improve the performance of compaction?
• How does performance change when storing only SSTable indexes on SSD?