1. Master Thesis, 21 July 2014, University of Crete
A Distributed Key-Value Store based on
Replicated LSM-Trees
Panagiotis Garefalakis
Computer Science Department – University of Crete
2. Motivation
• This is the age of big data
• Distributed key-value stores are key to storing and analyzing it
3. Motivation
• Companies such as Amazon and Google, and open-source communities such as Apache, have proposed several key-value stores
– Availability and fault tolerance through data replication
5. Data partitioning over LSM-Trees
6. Replication
[Figure: Primary-Backup replication. A leader (L) and followers (F) form a Replication Group (RG), kept consistent via the ZAB atomic broadcast protocol and coordinated through Zookeeper.]
7. Replicated LSM-Trees
[Figure: Primary-Backup replication via ZAB within a Replication Group (RG) of one leader (L) and followers (F). Each replica runs an LSM-Tree: writes of key-value pairs are appended to a commit log (WAL, in batch or periodic mode) and inserted into an in-memory memtable; memtables are flushed to on-disk SSTables (1, 2, 3, ..., N), which are merged by compaction.]
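To make the write path above concrete, here is a minimal single-node LSM-Tree sketch in Java. All names (LsmTree, FLUSH_THRESHOLD) and the flat file format are illustrative assumptions, not ACaZoo's actual storage code:

```java
import java.io.*;
import java.util.*;

// Minimal sketch of the LSM-Tree write path: append each update to a
// write-ahead commit log (WAL), apply it to a sorted in-memory memtable,
// and flush the memtable to an immutable on-disk SSTable when it fills.
public class LsmTree {
    private static final int FLUSH_THRESHOLD = 1024;          // entries; illustrative
    private final File dir;
    private final DataOutputStream wal;                       // commit log (WAL)
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final List<File> ssTables = new ArrayList<>();    // SSTables 1..N

    public LsmTree(File dir) throws IOException {
        this.dir = dir;
        this.wal = new DataOutputStream(new BufferedOutputStream(
                new FileOutputStream(new File(dir, "commit.log"), true)));
    }

    public synchronized void write(String key, String value) throws IOException {
        wal.writeUTF(key);           // 1. make the update durable in the WAL
        wal.writeUTF(value);
        wal.flush();                 // batch/periodic modes would defer this
        memtable.put(key, value);    // 2. apply it to the memtable (in memory)
        if (memtable.size() >= FLUSH_THRESHOLD) {
            flush();                 // 3. spill sorted entries to a new SSTable
        }
    }

    private void flush() throws IOException {
        File sst = new File(dir, "sstable-" + ssTables.size());
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(sst)))) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.writeUTF(e.getKey());   // keys are written in sorted order
                out.writeUTF(e.getValue());
            }
        }
        ssTables.add(sst);   // compaction would later merge these files
        memtable.clear();
    }
}
```

The batch and periodic WAL modes in the figure correspond to when the commit log is synced to disk: per group of writes, or on a timer, trading durability latency for throughput.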
8. Replicated LSM-Trees
[Figure: The same replicated LSM-Tree write path, realized as ACaZoo: Apache Cassandra provides the LSM-Tree storage engine (commit log/WAL with batch or periodic sync, memtable, SSTables, compaction), while Zookeeper's ZAB protocol provides Primary-Backup replication across the leader (L) and followers (F) of the Replication Group (RG).]
9. Thesis Contributions
• A high-performance data replication primitive:
– Combines the ZAB protocol with an implementation of LSM-Trees
– Key point: replication of the LSM-Tree WAL (see the sketch below)
• A novel technique that reduces the impact of LSM-Tree compactions on write performance
– Changing the leader prior to heavy compactions results in up to 60% higher throughput
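A minimal sketch of the WAL-replication idea follows, assuming a ZAB-style atomic-broadcast abstraction; the AtomicBroadcast, WriteAheadLog, and Memtable interfaces are hypothetical placeholders, not the actual Zookeeper or ACaZoo APIs:

```java
import java.util.function.Consumer;

// Sketch: instead of each node writing its WAL independently, every
// WAL record is pushed through an atomic broadcast (ZAB-style) so all
// replicas append the same records in the same order before applying
// them to their local memtables.
interface AtomicBroadcast {
    // The leader proposes a record; delivery happens on every replica,
    // in a single total order, once a quorum has acknowledged it.
    void broadcast(byte[] record);
    void onDeliver(Consumer<byte[]> handler);
}

interface WriteAheadLog {
    void append(byte[] record);   // durable local WAL append
}

interface Memtable {
    void apply(byte[] record);    // apply a key-value mutation
}

class ReplicatedWal {
    private final AtomicBroadcast zab;

    ReplicatedWal(AtomicBroadcast zab, WriteAheadLog wal, Memtable memtable) {
        this.zab = zab;
        // Every replica (leader and followers) runs the same delivery path:
        // log the record durably, then apply it to the LSM-Tree memtable.
        zab.onDeliver(record -> {
            wal.append(record);
            memtable.apply(record);
        });
    }

    // Called on the leader for each client write.
    void write(byte[] record) {
        zab.broadcast(record);
    }
}
```

Because ZAB delivers records in a single total order once a quorum acknowledges them, every replica's WAL, and therefore every replica's LSM-Tree state, stays identical to the leader's.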
10. Data model
[Figure: Column-family data model. Rows (row-1, row-2, row-5, row-6, row-7, row-10, row-18) hold versioned cells (e.g., A18-v1, B18-v3, Foo18-v1, XYZ18-v2, foobar18-v1, Bob-v1, Peter-v2, Mary-v1) grouped into two column families, with CF-prefixed column names such as cf1:col-A, cf1:col-B, cf2:col-Foo, cf2:col2-XYZ, and cf2:foobar. The coordinates of a cell are: Row Key, Column Family Name, Column Qualifier, Version.]
11. Consistent Hashing
[Figure: The same column-family data placed on a consistent-hashing ring: the md5 hash of each row key determines its position on the ring and thus the node responsible for it.]
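A minimal sketch of md5-based consistent hashing follows; the Ring class and its token representation are illustrative assumptions, not the system's actual partitioner:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch of consistent hashing: the md5 digest of a row key is read as
// a token on a ring, and each node owns the ring segment that ends at
// its own token, so routing a key is a ceiling lookup with wrap-around.
public class Ring {
    private final TreeMap<BigInteger, String> nodes = new TreeMap<>();

    public void addNode(String nodeName) {
        nodes.put(token(nodeName), nodeName);
    }

    // Route a row key to the node (or Replication Group) owning its token.
    public String nodeFor(String rowKey) {
        BigInteger t = token(rowKey);
        Map.Entry<BigInteger, String> owner = nodes.ceilingEntry(t);
        return owner != null ? owner.getValue()
                             : nodes.firstEntry().getValue();  // wrap around
    }

    private static BigInteger token(String s) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);   // non-negative 128-bit token
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a required JDK algorithm", e);
        }
    }
}
```

Because only the keys whose tokens fall in a joining or leaving node's segment move, membership changes disturb a small fraction of the data.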
12. System Architecture
13. System Architecture: Replication
14. RG leader switch policies
[Figure: An ACaZoo Replication Group (leader L, followers F, replicated via ZAB). Each replica holds its own SSTable set (1, 2, 3, ..., N) with pending compaction work ranging from Low to High. Policy question #1: when to switch the leader.]
15. RG leader switch policies
[Figure: The same Replication Group. Policy question #1: when to switch the leader. Policy question #2: whom to elect, decided via Weighted Votes; Round Robin and Random policies serve as alternatives.]
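A minimal sketch of the "whom to elect" policies follows. Reading Weighted Votes as favoring the replica with the least pending compaction work is a simplification for illustration, and all names (Replica, pendingCompactionBytes) are hypothetical:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of three leader-election policies for a Replication Group:
// Weighted Votes (prefer the replica with the least pending compaction
// work), Round Robin, and Random.
public class LeaderSwitchPolicies {
    public record Replica(String id, long pendingCompactionBytes) {}

    private int rrIndex = 0;
    private final Random random = new Random();

    // Weighted Votes: elect the replica least burdened by compactions,
    // so heavy compactions run on followers rather than the leader.
    public Replica weightedVotes(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator.comparingLong(Replica::pendingCompactionBytes))
                .orElseThrow();
    }

    // Round Robin: rotate leadership regardless of load.
    public Replica roundRobin(List<Replica> replicas) {
        return replicas.get(rrIndex++ % replicas.size());
    }

    // Random: pick any replica uniformly at random.
    public Replica randomChoice(List<Replica> replicas) {
        return replicas.get(random.nextInt(replicas.size()));
    }
}
```

The Round Robin and Random policies correspond to the RR and RA configurations measured in the evaluation below.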
16. Evaluation
• OpenStack private cloud
• VMs with 2 CPUs, 2 GB RAM, and a 20 GB remotely mounted disk
• Software:
– Apache Cassandra version 2.0.1
– Apache Zookeeper version 3.4.5
– Oracle NoSQL version 2.1.54
• Benchmarks:
– YCSB version 0.1.4
– 1 KB accesses: 10 columns of 100-byte cells
– Three operation mixes (100/0, 50/50, 0/100 reads/writes)
– Varying numbers of concurrent threads
– Postal version 0.72
– Configurable message size
– Varying numbers of concurrent threads
17. Systems compared
• ACaZoo with/without RG leader changes
– Batch and Periodic commit-log (WAL) modes
• Cassandra Quorum (2 out of 3 replicas)
– Batch and Periodic
• Cassandra Serial (based on an extension of the Paxos algorithm)
– Batch and Periodic
• Oracle NoSQL
– Absolute consistency
18. Impact of compaction
• YCSB 100% write workload, 64 threads
[Figure: Smoothed average write throughput (ops/100 ms) over time (0-200 sec) for ACaZoo without RG leader changes (left) and with RG leader changes (right); annotations mark memtable flushes, compactions, and leader elections.]
19. A deeper look into background activity

Activity                | Count (#) | Longest (sec) | Average (sec) | Total (sec)
------------------------|-----------|---------------|---------------|------------
Compaction (RA)         | 11        | 78.44         | 17.96         | 197.64
Memtable flush (RA)     | 53        | -             | -             | -
Garbage Collection (RA) | 197       | 0.91          | 0.148         | 29.33
Compaction (RR)         | 12        | 72.65         | 15.94         | 191.39
Memtable flush (RR)     | 52        | -             | -             | -
Garbage Collection (RR) | 192       | 0.85          | 0.147         | 27.84

• YCSB 100% write workload for 20 min, 256 threads
• RA: random RG leader-change policy
• RR: round-robin RG leader-change policy
20. Time correlation of compactions across replicas
[Figure: Time correlation of compactions across replicas, with annotated overlaps of 23%, 13%, and 12%.]
21. Evaluation – 3 Node RG
[Figure: Throughput comparison on a 3-node RG, with annotated gains of 25% and 40%.]
22. Evaluation – 5 Node RG
[Figure: Throughput comparison on a 5-node RG, with an annotated gain of 60%.]
23. Application Performance: CassMail
[Figure: CassMail application architecture, with each storage node running ACaZoo.]
24. CassMail on a 3-node RG
[Figure: Throughput with 50KB-500KB attachments (left) and 200KB-2MB attachments (right); annotated gains of 30% and 31%.]
25. CassMail on a 5-node RG
[Figure: Throughput with 50KB-500KB attachments (left) and 200KB-2MB attachments (right); annotated gains of 35% and 42%.]
26. Thesis Contributions
• A high-performance data replication primitive:
– Combines the ZAB protocol with an implementation of LSM-Trees
– Key point: replication of the LSM-Tree WAL
• A novel technique that reduces the impact of LSM-Tree compactions on write performance
– Changing the leader prior to heavy compactions results in up to 60% higher throughput
27. Future Work
• Elasticity: stream a number of key ranges to a newly joining RG
• Further investigate the load-balancing methodology for Zookeeper watch notifications
28. Thesis Publications
1. Panagiotis Garefalakis, Panagiotis Papadopoulos, and Kostas Magoutis, "ACaZoo: A Distributed Key-Value Store Based on Replicated LSM-Trees." In 33rd IEEE International Symposium on Reliable Distributed Systems (SRDS), IEEE, 2014.
2. Panagiotis Garefalakis, Panagiotis Papadopoulos, Ioannis Manousakis, and Kostas Magoutis, "Strengthening Consistency in the Cassandra Distributed Key-Value Store." In Distributed Applications and Interoperable Systems (DAIS), Springer, 2013.
29. Other Publications
1. Baryannis G., Garefalakis P., Kritikos K., Magoutis K., Papaioannou A., Plexousakis D., and Zeginis C., "Lifecycle Management of Service-Based Applications on Multi-Clouds: A Research Roadmap." In Proceedings of the 2013 International Workshop on Multi-Cloud Applications and Federated Clouds, ACM, 2013.
2. Zeginis C., Kritikos K., Garefalakis P., Konsolaki K., Magoutis K., and Plexousakis D., "Towards Cross-Layer Monitoring of Multi-Cloud Service-Based Applications." In Service-Oriented and Cloud Computing, Springer, 2013.
3. Garefalakis P. and Magoutis K., "Improving Datacenter Operations Management Using Wireless Sensor Networks." In 2012 IEEE International Conference on Green Computing and Communications (GreenCom), IEEE, 2012.
30. Email: pgaref@ics.forth.gr
31. RG Leader Failover
• YCSB read-only workload, 64 threads
• 1.19 sec for the client to notice the leader failure:
– 220 ms for the RG to elect a new leader
– 970 ms to propagate the new leader's address to the client through the CM
• Plus 2 sec for the client to establish a connection with the new leader, for roughly 3.2 sec of total service interruption
[Figure: Throughput (ops/100 ms) over time (sec) during leader failover, for ACaZoo (left) and Oracle NoSQL (right).]
32. Backup: Cassandra's Architecture
33. Cassandra's Architecture
34. Cassandra's Architecture
35. Cassandra's Architecture
2/3 responses return {X, Y}: need for reconciliation!
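Cassandra reconciles such divergent responses using cell timestamps (last-write-wins), with read repair writing the winner back to stale replicas. A minimal sketch, with a hypothetical VersionedValue type:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of Cassandra-style reconciliation: when a quorum read returns
// divergent values {X, Y}, the coordinator keeps the cell with the
// highest write timestamp (last-write-wins), and read repair pushes it
// back to the stale replicas.
public class Reconciler {
    public record VersionedValue(String value, long writeTimestampMicros) {}

    public static VersionedValue reconcile(List<VersionedValue> responses) {
        return responses.stream()
                .max(Comparator.comparingLong(VersionedValue::writeTimestampMicros))
                .orElseThrow();
    }
}
```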
38. Benefit of client coordinated I/O
• Yahoo Cloud Serving Benchmark (YCSB)
– 4 threads reading 1 GB of data

System                 | Throughput (ops/sec) | Read latency (average, ms) | Read latency (99th percentile, ms)
-----------------------|----------------------|----------------------------|-----------------------------------
Original Cassandra     | 317                  | 3.1                        | 4
Client Coordinated I/O | 412                  | 2.3                        | 3
39. CM load balancer
[Figure: Average latency (ms) versus number of client threads (1 to 10,000, log scale), for 1 node, 3 nodes, and 3 nodes balanced.]
Editor's Notes
Motivating this work:
In recent years, the volume of data has increased dramatically.
Image of key-value stores!
Several companies and a number of open-source communities:
eBay supports critical applications that need both real-time and analytics capabilities with the features of Cassandra.
Netflix increased the availability of member information and quality of data for its global streaming video service thanks to Cassandra.
Adobe relies on Cassandra to provide a highly scalable, low-latency database to support its distributed cache architecture.
I showed you how the implementation works for a single LSM-Tree; now consider many nodes, each running its own LSM-Tree implementation.
----- Meeting Notes (7/18/14 18:41) -----
Compaction is a problem
Cassandra no longer handles replication.
----- Meeting Notes (7/18/14 18:58) -----
If we focus on the leader, all the
----- Meeting Notes (7/18/14 18:58) -----
Three different policies: Round Robin, Random, and one inversely proportional to compaction load.
----- Meeting Notes (7/18/14 18:41) -----
Compaction is a problem
Focus on alternatives that exploit replication mechanisms.
This concludes my talk and I would be happy to take any questions
(a) 1.19 sec between the time the leader crashes until the client notices; (b) 2 sec until the client establishes a connection with the new leader and restores service. Interval (a) further breaks down into: (1) 220 ms for the RG to reconfigure (elect a new leader); (2) 970 ms to propagate the new-leader information (e.g., its IP address) to the client through the CM.
Cassandra works well with applications that share its relaxed semantics (such as customer carts in online stores).
Cassandra is not a good fit for more traditional applications requiring strong consistency.
All nodes in Cassandra are peers
No ordering guarantees, ad hoc synchronization mechanism, membership state to clients – gossip
If a replica misses a write, the row will be made consistent later via one of Cassandra’s built-in repair mechanisms: hinted handoff, read repair or anti-entropy node repairing eventually consistent