Master Thesis, 21 July 2014, University of Crete
A Distributed Key-Value Store based on Replicated LSM-Trees
Panagiotis Garefalakis
Computer Science Department – University of Crete
Motivation
• This is the age of big data
• Distributed key-value stores are key to analyzing it
• Companies such as Amazon and Google, and open-source communities such as Apache, have proposed several key-value stores
– Availability and fault tolerance through data replication
LSM-Trees
Data partitioning over LSM-Trees
Replication
[Figure: a Replication Group (RG) with one leader (L) and two followers (F), using primary-backup replication over ZAB (Zookeeper).]
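Replicated LSM-Trees
[Figure: a Replication Group (RG) with a leader (L) and two followers (F) running primary-backup replication over ZAB. Each replica holds an LSM-Tree: a write goes to the commit log (WAL, synced in batch or periodic mode) and to the in-memory memtable of key/value pairs; the memtable is flushed to on-disk SSTables 1..N, which are merged by compaction.]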
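The write path in the figure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the class name `TinyLSM`, the `memtable_limit` parameter, and the use of in-memory lists as stand-ins for the on-disk commit log and SSTables are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

SSTable = List[Tuple[str, str]]  # an immutable run of (key, value) pairs sorted by key


class TinyLSM:
    """Toy LSM-Tree: commit log + memtable in memory, SSTables as sorted lists."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []  # stand-in for the on-disk WAL
        self.memtable: Dict[str, str] = {}
        self.sstables: List[SSTable] = []  # oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. apply to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the memtable is full

    def flush(self) -> None:
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed entries no longer need the log

    def compact(self) -> None:
        # Merge all SSTables into one; values from newer tables win.
        merged: Dict[str, str] = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```

Compaction here rewrites every SSTable in one pass, which hints at why a heavy compaction competes with foreground writes for I/O on the node that runs it.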
[Figure: the replicated LSM-Tree diagram annotated with component names: Zookeeper (ZAB replication), Apache Cassandra (LSM-Tree storage), and ACaZoo (the combined system).]
Thesis Contributions
• A high-performance data replication primitive:
– Combines the ZAB protocol with an implementation of LSM-Trees
– Key point: Replication of LSM-Tree WAL
• A novel technique that reduces the impact of LSM-Tree compactions on write performance
– Changing the leader before heavy compactions yields up to 60% higher throughput
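The idea of replicating the WAL through a leader can be illustrated with a deliberately simplified sketch. It is not ZAB: there are no epochs, elections, or recovery, and a failed write is not rolled back. The class names `Replica` and `Leader` and the majority rule as coded are assumptions made for illustration only.

```python
from typing import List, Tuple

LogEntry = Tuple[int, str, str]  # (sequence number, key, value)


class Replica:
    """A group member that keeps its own copy of the write-ahead log."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.wal: List[LogEntry] = []

    def append(self, entry: LogEntry) -> bool:
        # Append to the local WAL and acknowledge, unless the replica is down.
        if not self.up:
            return False
        self.wal.append(entry)
        return True


class Leader(Replica):
    """Orders writes and reports success once a majority has logged them."""

    def __init__(self, name: str, followers: List[Replica]) -> None:
        super().__init__(name)
        self.followers = followers
        self.next_seq = 0

    def replicate(self, key: str, value: str) -> bool:
        entry = (self.next_seq, key, value)
        self.next_seq += 1
        acks = 1 if self.append(entry) else 0
        acks += sum(f.append(entry) for f in self.followers)
        group_size = 1 + len(self.followers)
        # Simplification: an entry that misses the majority stays in the logs.
        return acks > group_size // 2
```

With a three-member group, a write still succeeds with one follower down, because two of three logs hold the entry.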
Data model
[Figure: a table of rows (row-1, row-2, row-5, row-6, row-7, row-10, row-18) spanning two column families. Column Family 1 holds cf1:col-A and cf1:col-B; Column Family 2 holds cf2:col-Foo, cf2:col2-XYZ, and cf2:foobar. Cell values shown include A18 (v1), B18 (v3), Foo18 (v1), XYZ18 (v2), foobar18 (v1), Peter (v2), Bob (v1), and Mary (v1).]
Coordinates for a Cell: Row Key, Column Family Name (CF prefix), Column Qualifier, Version
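The four cell coordinates can be modeled as a composite key. The sketch below is a hypothetical in-memory illustration (the names `CellKey`, `Table`, `put`, and `get_latest` are not from the thesis), and the pairing of example values with columns is a reading of the figure, not a confirmed fact.

```python
from typing import Dict, NamedTuple, Optional


class CellKey(NamedTuple):
    row_key: str
    column_family: str
    qualifier: str
    version: int


class Table:
    """Maps (row key, column family, qualifier, version) to a value."""

    def __init__(self) -> None:
        self.cells: Dict[CellKey, str] = {}

    def put(self, row_key: str, column_family: str, qualifier: str,
            version: int, value: str) -> None:
        self.cells[CellKey(row_key, column_family, qualifier, version)] = value

    def get_latest(self, row_key: str, column_family: str,
                   qualifier: str) -> Optional[str]:
        # Collect every stored version of the cell and return the newest one.
        versions = [k for k in self.cells
                    if k[:3] == (row_key, column_family, qualifier)]
        if not versions:
            return None
        return self.cells[max(versions, key=lambda k: k.version)]


table = Table()
table.put("row-18", "cf1", "col-A", 1, "A18")
table.put("row-18", "cf2", "col2-XYZ", 2, "XYZ18")
```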
Consistent Hashing
[Figure: the data-model table from the previous slide, with an md5 hash applied for consistent hashing.]
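A minimal consistent-hashing ring using md5 might look as follows. The slide only names md5; hashing the row key, the node names (`rg-1` and so on), and the use of a single token per node are assumptions made for this sketch.

```python
import bisect
import hashlib
from typing import List, Tuple


def md5_token(text: str) -> int:
    """Map a string to a position on the ring using its md5 digest."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Each node owns the keys whose token falls at or before its own token."""

    def __init__(self, nodes: List[str]) -> None:
        self.ring: List[Tuple[int, str]] = sorted((md5_token(n), n) for n in nodes)
        self.tokens: List[int] = [t for t, _ in self.ring]

    def owner(self, row_key: str) -> str:
        # First node token >= key token, wrapping around at the end of the ring.
        idx = bisect.bisect_left(self.tokens, md5_token(row_key))
        return self.ring[idx % len(self.ring)][1]


ring = HashRing(["rg-1", "rg-2", "rg-3"])
print(ring.owner("row-18"))
```

The property that matters for partitioning is that adding a node only moves keys onto the new node; keys never shuffle between existing nodes.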
System Architecture
System Architecture Replication
RG leader switch policies
[Figure: an ACaZoo Replication Group (RG) with a leader (L) and two followers (F) over ZAB. Each replica has its own SSTables (1..N, 1..N', 1..N'') under compaction and a load gauge ranging from Low to High.]
#1: When to switch
[Figure: the RG diagram with per-replica Low/High gauges, annotated with the labels "#1: When to switch", "Weighted Votes", and "#2: Whom to elect".]
#2: Whom to elect
• Round Robin and Random policies
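The two policy questions can be sketched as small functions. Everything concrete here is hypothetical: the slide shows only High/Low gauges, so the load metric (`leader_pending_compaction_mb`), the watermark, and the function names are assumptions, and the weighted-votes mechanism is not modeled.

```python
import random
from typing import List


def should_switch(leader_pending_compaction_mb: float,
                  high_watermark_mb: float) -> bool:
    """#1 When to switch: the leader is about to run a heavy compaction."""
    return leader_pending_compaction_mb >= high_watermark_mb


def elect_round_robin(members: List[str], leader: str) -> str:
    """#2 Whom to elect: the next member after the current leader."""
    i = members.index(leader)
    return members[(i + 1) % len(members)]


def elect_random(members: List[str], leader: str, rng: random.Random) -> str:
    """#2 Whom to elect: any member other than the current leader."""
    candidates = [m for m in members if m != leader]
    return rng.choice(candidates)


members = ["n1", "n2", "n3"]
if should_switch(leader_pending_compaction_mb=900.0, high_watermark_mb=512.0):
    print(elect_round_robin(members, leader="n3"))
```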
Evaluation
• OpenStack private cloud
• VMs with 2 CPUs, 2 GB RAM, and a 20 GB remotely mounted disk
• Software:
– Apache Cassandra version 2.0.1
– Apache Zookeeper version 3.4.5
– Oracle NoSQL version 2.1.54
• Benchmarks:
– YCSB version 0.1.4: 1 KB accesses (10 columns of 100-byte cells); three operation mixes (100/0, 50/50, 0/100 reads/writes); number of concurrent threads
– Postal version 0.72: configurable message size; number of concurrent threads
Systems compared
• ACaZoo with/without RG leader changes
– Batch and Periodic
• Cassandra Quorum (2 out of 3 replicas)
– Batch and Periodic
• Cassandra Serial (extension of Paxos algorithm)
– Batch and Periodic
• Oracle NoSQL
– Absolute consistency
Impact of compaction
• YCSB 100% write workload, 64 threads
[Figure: two panels, "ACaZoo without RG changes" and "ACaZoo with RG changes", each plotting smoothed average write throughput (ops/100 ms, 0 to 2500) against time (0 to 200 sec), with markers for memtable flush, compaction, and leader election events.]
A deeper look into background activity
Activity                  Count (#)  Longest (sec)  Average (sec)  Total (sec)
Compaction (RA)           11         78.44          17.96          197.64
Memtable flush (RA)       53         -              -              -
Garbage Collection (RA)   197        0.91           0.148          29.33
Compaction (RR)           12         72.65          15.94          191.39
Memtable flush (RR)       52         -              -              -
Garbage Collection (RR)   192        0.85           0.147          27.84
• YCSB 20 min 100% write workload, 256 threads
• RA: RG change random policy
• RR: RG change round robin policy
Time correlation of compactions across replicas
[Figure: annotated with 23%, 13%, and 12%.]
Evaluation – 3 Node RG
[Figure: annotated with 25% and 40%.]
Evaluation – 5 Node RG
[Figure: annotated with 60%.]
Application Performance: CassMail
[Figure: diagram with three components labeled ACaZoo.]
CassMail on a 3-node RG
[Figure: two panels, "50KB-500KB attachment" and "200KB-2MB attachment", annotated with 30% and 31%.]
CassMail on a 5-node RG
[Figure: two panels, "50KB-500KB attachment" and "200KB-2MB attachment", annotated with 35% and 42%.]
Thesis Contributions
• A high-performance data replication primitive:
– Combines the ZAB protocol with an implementation of LSM-Trees
– Key point: Replication of LSM-Tree WAL
• A novel technique that reduces the impact of LSM-Tree compactions on write performance
– Changing the leader before heavy compactions yields up to 60% higher throughput
Future Work
• Elasticity: stream a number of key ranges to a newly
joining RG.
• Further investigate the load balancing methodology
for Zookeeper watch notifications.
Thesis Publications
1. Panagiotis Garefalakis, Panagiotis Papadopoulos, and Kostas Magoutis, “ACaZoo: A distributed key-value store based on replicated LSM-trees,” in 33rd IEEE International Symposium on Reliable Distributed Systems (SRDS), IEEE, 2014.
2. Panagiotis Garefalakis, Panagiotis Papadopoulos, Ioannis Manousakis, and Kostas Magoutis, “Strengthening consistency in the Cassandra distributed key-value store,” in Distributed Applications and Interoperable Systems (DAIS), Springer, 2013.
Other Publications
1. Baryannis G., Garefalakis P., Kritikos K., Magoutis K., Papaioannou A., Plexousakis D., & Zeginis C., “Lifecycle management of service-based applications on multi-clouds: a research roadmap,” in Proceedings of the 2013 International Workshop on Multi-cloud Applications and Federated Clouds, ACM, 2013.
2. Zeginis C., Kritikos K., Garefalakis P., Konsolaki K., Magoutis K., & Plexousakis D., “Towards cross-layer monitoring of multi-cloud service-based applications,” in Service-Oriented and Cloud Computing, Springer, 2013.
3. Garefalakis P. & Magoutis K., “Improving Datacenter Operations Management using Wireless Sensor Networks,” in IEEE International Conference on Green Computing and Communications (GreenCom), IEEE, 2012.
Email : pgaref@ics.forth.gr
RG Leader Failover
• YCSB read-only, 64 threads
• 1.19 sec for the client to notice
– 220 ms for the RG to elect a new leader
– 970 ms to propagate to the client through the CM
• 2 sec to establish connection
[Figure: two panels, ACaZoo and Oracle NoSQL, plotting throughput (ops/100 ms) against time (sec) during a leader failover.]
Backup: Cassandra’s Architecture
2/3 Responses: {X,Y}
Need for reconciliation!
Backup: Paxos
Benefit of client coordinated I/O
• Yahoo Cloud Serving Benchmark (YCSB)
– 4 threads, reading 1 GB of data
System                  Throughput (ops/sec)  Read latency, average (ms)  Read latency, 99th percentile (ms)
Original Cassandra      317                   3.1                         4
Client Coordinated I/O  412                   2.3                         3
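The relative benefit can be worked out from the table. The helper `percent_change` is a hypothetical name introduced for this calculation; the input numbers are the ones reported on the slide.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Original Cassandra versus client-coordinated I/O, from the table above.
throughput_gain = percent_change(317, 412)     # ops/sec, about +30%
avg_latency_change = percent_change(3.1, 2.3)  # ms, about -25.8%
p99_latency_change = percent_change(4, 3)      # ms, -25%

print(f"throughput: {throughput_gain:+.1f}%")
print(f"average read latency: {avg_latency_change:+.1f}%")
print(f"99th percentile read latency: {p99_latency_change:+.1f}%")
```

By this calculation, client-coordinated I/O gives roughly 30% higher throughput and about a quarter lower read latency in this experiment.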
CM load balancer
[Figure: average latency (ms, 0 to 2500) against number of threads (1 to 10000, log scale) for three configurations: 1 node, 3 nodes, and 3 nodes balanced.]


  • 1. Master Thesis, 21 July 2014, University of Crete A Distributed Key-Value Store based on Replicated LSM-Trees Panagiotis Garefalakis Computer Science Department – University of Crete
  • 2. 21 July 2014, University of Crete Motivation • This is the age of big data • Distributed key value stores are key to analyzing them
  • 3. 21 July 2014, University of Crete Motivation • Companies such as Amazon and Google and open- source communities such as Apache have proposed several key-value stores – Availability and fault tolerance through data replication
  • 4. 21 July 2014, University of Crete LSM-Trees
  • 5. 21 July 2014, University of Crete Data partitioning over LSM-Trees
  • 6. 21 July 2014, University of Crete Replication Primary-Backup replication L Zookeeper F F ZAB Replication Group (RG) …..
  • 7. 21 July 2014, University of Crete Replicated LSM-Trees Primary-Backup replication L F F ZAB Replication Group (RG) SSTables Write # Valu e # # Key # memtable memorydisk 1 N2 3 …Commit log flush Compaction LSM Trees batch/ periodic WAL
  • 8. 21 July 2014, University of Crete Replicated LSM-Trees Primary-Backup replication L Zookeeper F F ZAB Replication Group (RG) Apache Cassandra SSTables Write # Valu e # # Key # memtable memorydisk 1 N2 3 …Commit log flush Compaction LSM Trees batch/ periodic WAL ACaZoo
  • 9. 21 July 2014, University of Crete Thesis Contributions • A high performance data replication primitive: – Combines the ZAB protocol with an implementation of LSM-Trees – Key point: Replication of LSM-Tree WAL • A novel technique that reduces the impact of LSM-Tree compactions on write performance – Changing leader prior to heavy compactions results to up to 60% higher throughput
  • 10. Data model [figure: example table with rows such as row-1, row-2 and row-18 and versioned cells (e.g. A18 v1, B18 v3, XYZ18 v2, Foo18 v1, foobar18 v1) under columns such as cf1:col-A, cf1:col-B, cf2:col-Foo, cf2:col2-XYZ and cf2:foobar, grouped into Column Family 1 and Column Family 2 by CF prefix]. Coordinates for a cell: row key, column family name, column qualifier, version.
  • 11. Consistent Hashing [figure: the same example table as on the previous slide, annotated with md5]
  • 12. System Architecture
  • 13. System Architecture: Replication
  • 14. RG leader switch policies. #1: When to switch [diagram: an ACaZoo Replication Group (RG) with a leader (L) and two followers (F) over ZAB; each replica has its own SSTables (1…N, 1…N’, 1…N’’) and a compaction indicator ranging from Low to High]
  • 15. RG leader switch policies. #1: When to switch. #2: Whom to elect: Weighted Votes; Round Robin and Random policies [diagram: same as the previous slide]
  • 16. Evaluation: OpenStack private cloud; VMs with 2 CPUs, 2 GB RAM and a 20 GB remotely mounted disk. Software: Apache Cassandra 2.0.1; Apache Zookeeper 3.4.5; Oracle NoSQL 2.1.54. Benchmarks: YCSB 0.1.4 (1 KB accesses, 10 columns of 100-byte cells; three operation mixes: 100/0, 50/50 and 0/100 reads/writes; varying number of concurrent threads); Postal 0.72 (configurable message size; varying number of concurrent threads).
  • 17. Systems compared: ACaZoo with/without RG leader changes (Batch and Periodic); Cassandra Quorum, 2 out of 3 replicas (Batch and Periodic); Cassandra Serial, an extension of the Paxos algorithm (Batch and Periodic); Oracle NoSQL (Absolute consistency).
  • 18. Impact of compaction: YCSB 100% write workload, 64 threads [two plots: smoothed average write throughput (ops/100 ms) over time (sec), for ACaZoo without RG changes and ACaZoo with RG changes, annotated with memtable flush, compaction and leader election events]
  • 19. A deeper look into background activity (YCSB 20 min 100% write workload, 256 threads; RA: RG change random policy; RR: RG round robin policy). Compaction (RA): count 11, longest 78.44 sec, average 17.96 sec, total 197.64 sec. Memtable flush (RA): count 53. Garbage Collection (RA): count 197, longest 0.91 sec, average 0.148 sec, total 29.33 sec. Compaction (RR): count 12, longest 72.65 sec, average 15.94 sec, total 191.39 sec. Memtable flush (RR): count 52. Garbage Collection (RR): count 192, longest 0.85 sec, average 0.147 sec, total 27.84 sec.
  • 20. Time correlation of compactions across replicas [figure; labels: 23%, 13%, 12%]
  • 21. Evaluation – 3 Node RG [figure; labels: 25%, 40%]
  • 22. Evaluation – 5 Node RG [figure; label: 60%]
  • 23. Application Performance: CassMail [diagram; components labeled ACaZoo]
  • 24. CassMail on a 3-node RG [two plots: 50KB-500KB attachment and 200KB-2MB attachment; labels: 30%, 31%]
  • 25. CassMail on a 5-node RG [two plots: 50KB-500KB attachment and 200KB-2MB attachment; labels: 35%, 42%]
  • 26. Thesis Contributions: (1) A high-performance data replication primitive that combines the ZAB protocol with an implementation of LSM-Trees; key point: replication of the LSM-Tree WAL. (2) A novel technique that reduces the impact of LSM-Tree compactions on write performance: changing the leader prior to heavy compactions results in up to 60% higher throughput.
  • 27. Future Work: Elasticity: stream a number of key ranges to a newly joining RG. Further investigate the load balancing methodology for Zookeeper watch notifications.
  • 28. Thesis Publications 1. Panagiotis Garefalakis, Panagiotis Papadopoulos, and Kostas Magoutis, “ACaZoo: A distributed key-value store based on replicated LSM-trees.” in 33rd IEEE International Symposium on Reliable Distributed Systems (SRDS), IEEE 2014. 2. Panagiotis Garefalakis, Panagiotis Papadopoulos, Ioannis Manousakis, and Kostas Magoutis, “Strengthening consistency in the Cassandra distributed key-value store.” in Distributed Applications and Interoperable Systems (DAIS), Springer 2013.
  • 29. Other Publications 1. Baryannis G., Garefalakis P., Kritikos K., Magoutis K., Papaioannou A., Plexousakis D., & Zeginis C. “Lifecycle management of service-based applications on multi-clouds: a research roadmap.” In Proceedings of the 2013 international workshop on Multi-cloud applications and federated clouds. ACM, 2013. 2. Zeginis C., Kritikos K., Garefalakis P., Konsolaki K., Magoutis K., & Plexousakis D. “Towards cross-layer monitoring of multi-cloud service-based applications.” In Service-Oriented and Cloud Computing. Springer, 2013. 3. Garefalakis Panagiotis, and Kostas Magoutis. "Improving Datacenter Operations Management using Wireless Sensor Networks." Green Computing and Communications (GreenCom), 2012 IEEE International Conference on. IEEE, 2012.
  • 30. Email: pgaref@ics.forth.gr
  • 31. RG Leader Failover: YCSB read-only, 64 threads. 1.19 sec for the client to notice (220 ms for the RG to elect a new leader; 970 ms to propagate to the client through the CM). 2 sec to establish a connection. [two plots: throughput (ops/100 ms) over time (sec), for ACaZoo and Oracle NoSQL]
  • 32. Backup - Cassandra’s Architecture
  • 33. Cassandra’s Architecture
  • 34. Cassandra’s Architecture
  • 35. Cassandra’s Architecture: 2/3 Responses: {X,Y}. Need for reconciliation!
  • 36. Backup-Paxos1
  • 38. Benefit of client coordinated I/O: Yahoo Cloud Serving Benchmark (YCSB), 4 threads, reading 1 GB of data. Original Cassandra: throughput 317 ops/sec, average read latency 3.1 ms, 99th-percentile read latency 4 ms. Client Coordinated I/O: throughput 412 ops/sec, average read latency 2.3 ms, 99th-percentile read latency 3 ms.
  • 39. CM load balancer [plot: average latency (ms) versus number of threads (1 to 10000), for 1 node, 3 nodes and 3 nodes balanced]
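
Slide 11 pairs consistent hashing with md5. A minimal sketch of that idea follows, under assumptions that are not in the slides: the names `HashRing` and `md5_token` and the `rg-N` group labels are hypothetical, and placing several virtual points per replication group on the ring is a common variant rather than something the deck states.

```python
import bisect
import hashlib


def md5_token(key: str) -> int:
    """Map a key to an integer position on the ring using MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Toy consistent-hash ring that maps row keys to replication groups."""

    def __init__(self, groups, vnodes=8):
        # Assumption: each group owns `vnodes` points on the ring.
        self._ring = sorted(
            (md5_token(f"{group}#{i}"), group)
            for group in groups
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def lookup(self, row_key: str) -> str:
        """Return the group owning the first ring point at or after the key."""
        idx = bisect.bisect_left(self._tokens, md5_token(row_key))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["rg-1", "rg-2", "rg-3"])
owner = ring.lookup("row-18")
print(owner)
```

The property that makes this useful for partitioning is that adding a group only moves keys to the new group; keys never shuffle between existing groups.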

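Slides 14 and 15 split the RG leader switch into two decisions: when to switch (compaction load on the leader going from Low to High) and whom to elect (Round Robin, Random, Weighted Votes). The sketch below illustrates that split only. The `Replica` type, the `pending_compaction_mb` metric, the threshold value and the least-loaded rule standing in for weighted votes are all assumptions, not ACaZoo's implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    pending_compaction_mb: float  # hypothetical compaction-load metric


HEAVY_COMPACTION_MB = 512.0  # hypothetical "High" threshold


def should_switch(leader: Replica) -> bool:
    """#1 When to switch: the leader is about to run a heavy compaction."""
    return leader.pending_compaction_mb >= HEAVY_COMPACTION_MB


def next_round_robin(replicas, leader):
    """#2 Whom to elect: the replica after the current leader, in order."""
    idx = replicas.index(leader)
    return replicas[(idx + 1) % len(replicas)]


def next_random(replicas, leader, rng):
    """#2 Whom to elect: any follower, chosen uniformly at random."""
    return rng.choice([r for r in replicas if r is not leader])


def next_least_loaded(replicas, leader):
    """#2 Whom to elect: the follower with the least pending compaction
    (an assumed stand-in for a load-weighted vote)."""
    followers = [r for r in replicas if r is not leader]
    return min(followers, key=lambda r: r.pending_compaction_mb)


group = [Replica("a", 600.0), Replica("b", 10.0), Replica("c", 300.0)]
leader = group[0]
if should_switch(leader):
    leader = next_least_loaded(group, leader)
print(leader.name)
```

Keeping the trigger separate from the election rule lets the same trigger be evaluated against each election policy, which is how the slides present the comparison.
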
Editor's Notes

  1. Motivating this work
  2. In recent years the volume of data has increased dramatically.
  3. Image of key-value stores. Several companies: eBay supports a number of critical applications that need both real-time and analytics capabilities with the features of Cassandra. Netflix increased the availability of member information and quality of data for its global streaming video service thanks to Cassandra. Adobe relies on Cassandra to provide a highly scalable, low-latency database to support its distributed cache architecture.
  4. I showed you what the implementation of a single LSM tree looks like; but when I have many nodes, with an LSM implementation on each node..
  5. ----- Meeting Notes (7/18/14 18:41) ----- Compaction is a problem
  6. Cassandra no longer handles replication.
  7. ----- Meeting Notes (7/18/14 18:58) ----- If we focus on the leader, all the
  8. ----- Meeting Notes (7/18/14 18:58) ----- 3 different policies: RR, RR and inversely proportional to the Compacti
  9. Majority: 23, 13, 12 Minority: 21, 32, 44 ----- Meeting Notes (7/18/14 18:58) ----- Since replication takes place, one would expect there to be synchronization, but that is not the case..
  10. RW: 25 -20 b-p W: 40- 33 b-p
  11. ----- Meeting Notes (7/18/14 18:41) ----- Compaction is a problem
  12. Focus on alternatives that exploit replication mechanisms.
  13. This concludes my talk and I would be happy to take any questions
  14. (a) 1.19 sec between the time the leader crashes until the client notices; (b) 2 sec until the client establishes a connection with the new leader and restores service. Interval (a) further breaks down into: (1) 220 ms for the RG to reconfigure (elect a new leader); (2) 970 ms to propagate the new-leader information (e.g., its IP address) to the client through the CM.
  15. Cassandra works well with applications that share its relaxed semantics (such as customer carts in online stores). Cassandra is not a good fit for more traditional applications requiring strong consistency. All nodes in Cassandra are peers. No ordering guarantees, ad hoc synchronization mechanism, membership state to clients (gossip). If a replica misses a write, the row will be made consistent later via one of Cassandra’s built-in repair mechanisms: hinted handoff, read repair or anti-entropy node repair (eventually consistent).
  19. 26% response time, 30% throughput
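
The closing notes quote 26% response time and 30% throughput, which appear to summarize the client coordinated I/O table on slide 38, and note 14 breaks the failover delay into parts. A small arithmetic check using only figures quoted in the slides and notes; the helper name `percent_change` is hypothetical:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Slide 38: Original Cassandra vs. Client Coordinated I/O.
throughput_gain = percent_change(317, 412)  # throughput, ops/sec
latency_change = percent_change(3.1, 2.3)   # average read latency, ms

# Note 14: time until the client notices a leader crash, in milliseconds.
elect_ms = 220      # RG elects a new leader
propagate_ms = 970  # new-leader information reaches the client via the CM
notice_ms = elect_ms + propagate_ms

print(f"throughput: {throughput_gain:+.0f}%")
print(f"average read latency: {latency_change:+.0f}%")
print(f"client notices after {notice_ms / 1000:.2f} sec")
```

The results round to a 30% throughput gain and a 26% latency reduction, and 220 ms plus 970 ms gives the quoted 1.19 sec.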