1. Master Thesis, 21 July 2014, University of Crete
A Distributed Key-Value Store based on
Replicated LSM-Trees
Panagiotis Garefalakis
Computer Science Department – University of Crete
2. Motivation
• This is the age of big data
• Distributed key-value stores are key to storing and analyzing it
3. Motivation
• Companies such as Amazon and Google, and open-source communities such as Apache, have proposed several key-value stores
– Availability and fault tolerance through data replication
5. Data partitioning over LSM-Trees
6. Replication
[Figure: Primary-Backup replication. A leader (L) and followers (F) form a Replication Group (RG), kept consistent via the ZAB atomic broadcast protocol and coordinated through Zookeeper.]
7. Replicated LSM-Trees
[Figure: Primary-Backup replication via ZAB within a Replication Group (RG) of one leader (L) and followers (F). Each replica runs an LSM-Tree: writes of key-value pairs are appended to a commit log (WAL, in batch or periodic mode) and inserted into an in-memory memtable; memtables are flushed to on-disk SSTables (1, 2, 3, ..., N), which are merged by compaction.]
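To make the write path above concrete, here is a minimal single-node LSM-Tree sketch in Java. All names (LsmTree, FLUSH_THRESHOLD) and the flat file format are illustrative assumptions, not ACaZoo's actual storage code:

```java
import java.io.*;
import java.util.*;

// Minimal sketch of the LSM-Tree write path: append each update to a
// write-ahead commit log (WAL), apply it to a sorted in-memory memtable,
// and flush the memtable to an immutable on-disk SSTable when it fills.
public class LsmTree {
    private static final int FLUSH_THRESHOLD = 1024;          // entries; illustrative
    private final File dir;
    private final DataOutputStream wal;                       // commit log (WAL)
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final List<File> ssTables = new ArrayList<>();    // SSTables 1..N

    public LsmTree(File dir) throws IOException {
        this.dir = dir;
        this.wal = new DataOutputStream(new BufferedOutputStream(
                new FileOutputStream(new File(dir, "commit.log"), true)));
    }

    public synchronized void write(String key, String value) throws IOException {
        wal.writeUTF(key);           // 1. make the update durable in the WAL
        wal.writeUTF(value);
        wal.flush();                 // batch/periodic modes would defer this
        memtable.put(key, value);    // 2. apply it to the memtable (in memory)
        if (memtable.size() >= FLUSH_THRESHOLD) {
            flush();                 // 3. spill sorted entries to a new SSTable
        }
    }

    private void flush() throws IOException {
        File sst = new File(dir, "sstable-" + ssTables.size());
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(sst)))) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.writeUTF(e.getKey());   // keys are written in sorted order
                out.writeUTF(e.getValue());
            }
        }
        ssTables.add(sst);   // compaction would later merge these files
        memtable.clear();
    }
}
```

The batch and periodic WAL modes in the figure correspond to when the commit log is synced to disk: per group of writes, or on a timer, trading durability latency for throughput.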
8. Replicated LSM-Trees
[Figure: The same replicated LSM-Tree write path, realized as ACaZoo: Apache Cassandra provides the LSM-Tree storage engine (commit log/WAL with batch or periodic sync, memtable, SSTables, compaction), while Zookeeper's ZAB protocol provides Primary-Backup replication across the leader (L) and followers (F) of the Replication Group (RG).]
9. Thesis Contributions
• A high-performance data replication primitive:
– Combines the ZAB protocol with an implementation of LSM-Trees
– Key point: replication of the LSM-Tree WAL (see the sketch below)
• A novel technique that reduces the impact of LSM-Tree compactions on write performance
– Changing the leader prior to heavy compactions results in up to 60% higher throughput
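A minimal sketch of the WAL-replication idea follows, assuming a ZAB-style atomic-broadcast abstraction; the AtomicBroadcast, WriteAheadLog, and Memtable interfaces are hypothetical placeholders, not the actual Zookeeper or ACaZoo APIs:

```java
import java.util.function.Consumer;

// Sketch: instead of each node writing its WAL independently, every
// WAL record is pushed through an atomic broadcast (ZAB-style) so all
// replicas append the same records in the same order before applying
// them to their local memtables.
interface AtomicBroadcast {
    // The leader proposes a record; delivery happens on every replica,
    // in a single total order, once a quorum has acknowledged it.
    void broadcast(byte[] record);
    void onDeliver(Consumer<byte[]> handler);
}

interface WriteAheadLog {
    void append(byte[] record);   // durable local WAL append
}

interface Memtable {
    void apply(byte[] record);    // apply a key-value mutation
}

class ReplicatedWal {
    private final AtomicBroadcast zab;

    ReplicatedWal(AtomicBroadcast zab, WriteAheadLog wal, Memtable memtable) {
        this.zab = zab;
        // Every replica (leader and followers) runs the same delivery path:
        // log the record durably, then apply it to the LSM-Tree memtable.
        zab.onDeliver(record -> {
            wal.append(record);
            memtable.apply(record);
        });
    }

    // Called on the leader for each client write.
    void write(byte[] record) {
        zab.broadcast(record);
    }
}
```

Because ZAB delivers records in a single total order once a quorum acknowledges them, every replica's WAL, and therefore every replica's LSM-Tree state, stays identical to the leader's.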
10. Data model
[Figure: Column-family data model. Rows (row-1, row-2, row-5, row-6, row-7, row-10, row-18) hold versioned cells (e.g., A18-v1, B18-v3, Foo18-v1, XYZ18-v2, foobar18-v1, Bob-v1, Peter-v2, Mary-v1) grouped into two column families, with CF-prefixed column names such as cf1:col-A, cf1:col-B, cf2:col-Foo, cf2:col2-XYZ, and cf2:foobar. The coordinates of a cell are: Row Key, Column Family Name, Column Qualifier, Version.]
11. Consistent Hashing
[Figure: The same column-family data placed on a consistent-hashing ring: the md5 hash of each row key determines its position on the ring and thus the node responsible for it.]
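A minimal sketch of md5-based consistent hashing follows; the Ring class and its token representation are illustrative assumptions, not the system's actual partitioner:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch of consistent hashing: the md5 digest of a row key is read as
// a token on a ring, and each node owns the ring segment that ends at
// its own token, so routing a key is a ceiling lookup with wrap-around.
public class Ring {
    private final TreeMap<BigInteger, String> nodes = new TreeMap<>();

    public void addNode(String nodeName) {
        nodes.put(token(nodeName), nodeName);
    }

    // Route a row key to the node (or Replication Group) owning its token.
    public String nodeFor(String rowKey) {
        BigInteger t = token(rowKey);
        Map.Entry<BigInteger, String> owner = nodes.ceilingEntry(t);
        return owner != null ? owner.getValue()
                             : nodes.firstEntry().getValue();  // wrap around
    }

    private static BigInteger token(String s) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);   // non-negative 128-bit token
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a required JDK algorithm", e);
        }
    }
}
```

Because only the keys whose tokens fall in a joining or leaving node's segment move, membership changes disturb a small fraction of the data.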
12. System Architecture
13. System Architecture: Replication
14. RG leader switch policies
[Figure: An ACaZoo Replication Group (leader L, followers F, replicated via ZAB). Each replica holds its own SSTable set (1, 2, 3, ..., N) with pending compaction work ranging from Low to High. Policy question #1: when to switch the leader.]
15. RG leader switch policies
[Figure: The same Replication Group. Policy question #1: when to switch the leader. Policy question #2: whom to elect, decided via Weighted Votes; Round Robin and Random policies serve as alternatives.]
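A minimal sketch of the "whom to elect" policies follows. Reading Weighted Votes as favoring the replica with the least pending compaction work is a simplification for illustration, and all names (Replica, pendingCompactionBytes) are hypothetical:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of three leader-election policies for a Replication Group:
// Weighted Votes (prefer the replica with the least pending compaction
// work), Round Robin, and Random.
public class LeaderSwitchPolicies {
    public record Replica(String id, long pendingCompactionBytes) {}

    private int rrIndex = 0;
    private final Random random = new Random();

    // Weighted Votes: elect the replica least burdened by compactions,
    // so heavy compactions run on followers rather than the leader.
    public Replica weightedVotes(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator.comparingLong(Replica::pendingCompactionBytes))
                .orElseThrow();
    }

    // Round Robin: rotate leadership regardless of load.
    public Replica roundRobin(List<Replica> replicas) {
        return replicas.get(rrIndex++ % replicas.size());
    }

    // Random: pick any replica uniformly at random.
    public Replica randomChoice(List<Replica> replicas) {
        return replicas.get(random.nextInt(replicas.size()));
    }
}
```

The Round Robin and Random policies correspond to the RR and RA configurations measured in the evaluation below.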
16. Evaluation
• OpenStack private cloud
• VMs with 2 CPUs, 2 GB RAM, and a 20 GB remotely mounted disk
• Software:
– Apache Cassandra version 2.0.1
– Apache Zookeeper version 3.4.5
– Oracle NoSQL version 2.1.54
• Benchmarks:
– YCSB version 0.1.4
– 1 KB accesses: 10 columns of 100-byte cells
– Three operation mixes (100/0, 50/50, 0/100 reads/writes)
– Varying numbers of concurrent threads
– Postal version 0.72
– Configurable message size
– Varying numbers of concurrent threads
17. Systems compared
• ACaZoo with/without RG leader changes
– Batch and Periodic commit-log (WAL) modes
• Cassandra Quorum (2 out of 3 replicas)
– Batch and Periodic
• Cassandra Serial (based on an extension of the Paxos algorithm)
– Batch and Periodic
• Oracle NoSQL
– Absolute consistency
18. Impact of compaction
• YCSB 100% write workload, 64 threads
[Figure: Smoothed average write throughput (ops/100 ms) over time (0-200 sec) for ACaZoo without RG leader changes (left) and with RG leader changes (right); annotations mark memtable flushes, compactions, and leader elections.]
19. A deeper look into background activity

Activity                | Count (#) | Longest (sec) | Average (sec) | Total (sec)
------------------------|-----------|---------------|---------------|------------
Compaction (RA)         | 11        | 78.44         | 17.96         | 197.64
Memtable flush (RA)     | 53        | -             | -             | -
Garbage Collection (RA) | 197       | 0.91          | 0.148         | 29.33
Compaction (RR)         | 12        | 72.65         | 15.94         | 191.39
Memtable flush (RR)     | 52        | -             | -             | -
Garbage Collection (RR) | 192       | 0.85          | 0.147         | 27.84

• YCSB 100% write workload for 20 min, 256 threads
• RA: random RG leader-change policy
• RR: round-robin RG leader-change policy
20. Time correlation of compactions across replicas
[Figure: Time correlation of compactions across replicas, with annotated overlaps of 23%, 13%, and 12%.]
21. Evaluation – 3 Node RG
[Figure: Throughput comparison on a 3-node RG, with annotated gains of 25% and 40%.]
22. Evaluation – 5 Node RG
[Figure: Throughput comparison on a 5-node RG, with an annotated gain of 60%.]
23. Application Performance: CassMail
[Figure: CassMail application architecture, with each storage node running ACaZoo.]
24. CassMail on a 3-node RG
[Figure: Throughput with 50KB-500KB attachments (left) and 200KB-2MB attachments (right); annotated gains of 30% and 31%.]
25. CassMail on a 5-node RG
[Figure: Throughput with 50KB-500KB attachments (left) and 200KB-2MB attachments (right); annotated gains of 35% and 42%.]
26. Thesis Contributions
• A high-performance data replication primitive:
– Combines the ZAB protocol with an implementation of LSM-Trees
– Key point: replication of the LSM-Tree WAL
• A novel technique that reduces the impact of LSM-Tree compactions on write performance
– Changing the leader prior to heavy compactions results in up to 60% higher throughput
27. Future Work
• Elasticity: stream a number of key ranges to a newly joining RG
• Further investigate the load-balancing methodology for Zookeeper watch notifications
28. Thesis Publications
1. Panagiotis Garefalakis, Panagiotis Papadopoulos, and Kostas Magoutis, "ACaZoo: A Distributed Key-Value Store Based on Replicated LSM-Trees." In 33rd IEEE International Symposium on Reliable Distributed Systems (SRDS), IEEE, 2014.
2. Panagiotis Garefalakis, Panagiotis Papadopoulos, Ioannis Manousakis, and Kostas Magoutis, "Strengthening Consistency in the Cassandra Distributed Key-Value Store." In Distributed Applications and Interoperable Systems (DAIS), Springer, 2013.
29. Other Publications
1. Baryannis G., Garefalakis P., Kritikos K., Magoutis K., Papaioannou A., Plexousakis D., and Zeginis C., "Lifecycle Management of Service-Based Applications on Multi-Clouds: A Research Roadmap." In Proceedings of the 2013 International Workshop on Multi-Cloud Applications and Federated Clouds, ACM, 2013.
2. Zeginis C., Kritikos K., Garefalakis P., Konsolaki K., Magoutis K., and Plexousakis D., "Towards Cross-Layer Monitoring of Multi-Cloud Service-Based Applications." In Service-Oriented and Cloud Computing, Springer, 2013.
3. Garefalakis P. and Magoutis K., "Improving Datacenter Operations Management Using Wireless Sensor Networks." In 2012 IEEE International Conference on Green Computing and Communications (GreenCom), IEEE, 2012.
30. Email: pgaref@ics.forth.gr
31. RG Leader Failover
• YCSB read-only workload, 64 threads
• 1.19 sec for the client to notice the leader failure:
– 220 ms for the RG to elect a new leader
– 970 ms to propagate the new leader's address to the client through the CM
• Plus 2 sec for the client to establish a connection with the new leader, for roughly 3.2 sec of total service interruption
[Figure: Throughput (ops/100 ms) over time (sec) during leader failover, for ACaZoo (left) and Oracle NoSQL (right).]
32. Backup: Cassandra's Architecture
33. Cassandra's Architecture
34. Cassandra's Architecture
35. Cassandra's Architecture
2/3 responses return {X, Y}: need for reconciliation!
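Cassandra reconciles such divergent responses using cell timestamps (last-write-wins), with read repair writing the winner back to stale replicas. A minimal sketch, with a hypothetical VersionedValue type:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of Cassandra-style reconciliation: when a quorum read returns
// divergent values {X, Y}, the coordinator keeps the cell with the
// highest write timestamp (last-write-wins), and read repair pushes it
// back to the stale replicas.
public class Reconciler {
    public record VersionedValue(String value, long writeTimestampMicros) {}

    public static VersionedValue reconcile(List<VersionedValue> responses) {
        return responses.stream()
                .max(Comparator.comparingLong(VersionedValue::writeTimestampMicros))
                .orElseThrow();
    }
}
```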
38. Benefit of client coordinated I/O
• Yahoo Cloud Serving Benchmark (YCSB)
– 4 threads reading 1 GB of data

System                 | Throughput (ops/sec) | Read latency (average, ms) | Read latency (99th percentile, ms)
-----------------------|----------------------|----------------------------|-----------------------------------
Original Cassandra     | 317                  | 3.1                        | 4
Client Coordinated I/O | 412                  | 2.3                        | 3
39. CM load balancer
[Figure: Average latency (ms) versus number of client threads (1 to 10,000, log scale), for 1 node, 3 nodes, and 3 nodes balanced.]
Editor's Notes
Motivating this work:
In recent years, the volume of data has increased dramatically.
Image of key-value stores!
Several companies and a number of open-source communities:
eBay supports critical applications that need both real-time and analytics capabilities with the features of Cassandra.
Netflix increased the availability of member information and quality of data for its global streaming video service thanks to Cassandra.
Adobe relies on Cassandra to provide a highly scalable, low-latency database to support its distributed cache architecture.
I showed you how the implementation works for a single LSM-Tree; now consider many nodes, each running its own LSM-Tree implementation.
----- Meeting Notes (7/18/14 18:41) -----
Compaction is a problem
Cassandra no longer handles replication.
----- Meeting Notes (7/18/14 18:58) -----
If we focus on the leader, all the
----- Meeting Notes (7/18/14 18:58) -----
Three different policies: Round Robin, Random, and one inversely proportional to compaction load.
----- Meeting Notes (7/18/14 18:41) -----
Compaction is a problem
Focus on alternatives that exploit replication mechanisms.
This concludes my talk and I would be happy to take any questions
(a) 1.19 sec between the time the leader crashes until the client notices; (b) 2 sec until the client establishes a connection with the new leader and restores service. Interval (a) further breaks down into: (1) 220 ms for the RG to reconfigure (elect a new leader); (2) 970 ms to propagate the new-leader information (e.g., its IP address) to the client through the CM.
Cassandra works well with applications that share its relaxed semantics (such as customer carts in online stores).
Cassandra is not a good fit for more traditional applications requiring strong consistency.
All nodes in Cassandra are peers
No ordering guarantees, ad hoc synchronization mechanism, membership state to clients – gossip
If a replica misses a write, the row will be made consistent later via one of Cassandra’s built-in repair mechanisms: hinted handoff, read repair or anti-entropy node repairing eventually consistent