Efficient Bootstrapping for Decentralised
Shared-nothing Key-value Stores
Han Li, Srikumar Venugopal
Never Stand Still

Faculty of Engineering

Computer Science and Engineering
Agenda
• Motivations for Node Bootstrapping
• Research Gap
• Challenges and Solutions
• Evaluations
• Conclusion

School of Computer Science and Engineering
On-demand Provisioning

The Capacity versus Utilisation Curve

School of Computer Science and Engineering
Key-value Stores
• The standard component for cloud data management
• Increasing workload → Node bootstrapping
– Incorporate a new, empty node as a member of the KVS

• Decreasing workload → Node decommissioning
– Remove an existing member, whose data is now redundant, from the KVS

School of Computer Science and Engineering
Goals for Efficient Node Bootstrapping
• Minimise the overhead of data movement
– How to partition/store data?

• Balance the load at node bootstrapping
– Both data volume and workload
– How to place/allocate data?

• Maintain data consistency and availability
– How to execute data movement?

School of Computer Science and Engineering
Background: Storage model
• Shared Storage
– Access same storage
• Distributed file systems
• Network-attached storage

– E.g. GFS, HDFS
– Simply exchange metadata
• Albatross, by S. Das, UCSB

• Shared Nothing
– Use individual local storage
– Decentralised, peer-to-peer
– E.g. Dynamo, Cassandra,
Voldemort, etc.
– Require data movement
• Lightweight solutions?

School of Computer Science and Engineering
Background: Split-Move Approach
Partition at node bootstrapping
[Figure: a ring-partitioned key space (partitions A–I) spread over Nodes 1–4 plus a new node. ① Partition B is split into B1 and B2; ② the key-value pairs of the split-off range are scanned out of the source nodes and moved to the new node; ③ the redundant data left on the source nodes is deleted afterwards. Legend: master replica, slave replica, to be deleted.]

School of Computer Science and Engineering
Background: Virtual-Node Approach
Partition at system startup
[Figure: the key space is split at system startup into many small partitions (A–I, …) that are assigned as virtual nodes across Nodes 1–4; bootstrapping the new node simply moves a selected set of these partitions from existing nodes, stored as they were.]

Data skew: e.g., the majority of data is stored in a minority of partitions.
Moving around giant partitions is not a good idea.

School of Computer Science and Engineering
Research Gap
• Shared Storage vs. Shared Nothing
– Require data movement

• Centralised vs. Decentralised
– Require coordination

• Split-Move vs. Virtual-node Based
– Partition at node bootstrapping is heavyweight
– Partition at system startup causes data skew

• The Gap: A scheme of data partitioning and placement that
improves the efficiency of bootstrapping in shared-nothing KVS

School of Computer Science and Engineering
Our Solution
• Virtual-node based movement
– Each partition of data is stored in separate files
– Reduced overhead of data movement
– Many existing nodes can participate in bootstrapping

• Automatic sharding
– Split and merge partitions at runtime
– Each partition stores a bounded volume of data
• Easy to reallocate data
• Easy to balance the load

School of Computer Science and Engineering
The timing for data partitioning
• Shard partitions at writes (inserts and deletes)
– Split: keep Size(Pi) ≤ Θmax (split a partition when an insert pushes it above Θmax)
– Merge: keep Size(Pi) + Size(Pi+1) ≥ Θmin (merge two neighbours when deletes shrink them below Θmin)
– Require Θmax ≥ 2Θmin to avoid split/merge oscillation
[Figure: an insert grows partition B past the upper bound, so B is split into B1 and B2; deletes shrink two adjacent partitions below the lower bound, so they are merged. A minimal decision sketch follows below.]
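The two triggers above can be checked locally on every write. The following is a minimal sketch, assuming hypothetical threshold values and per-partition sizes in bytes; it is not the ElasCass implementation, only an illustration of the split/merge conditions:

```java
import java.util.List;

/** Sketch of the split/merge triggers; the thresholds and sizes are hypothetical. */
public class ShardTriggers {
    static final long THETA_MAX = 1L << 30;     // assumed upper bound per partition (1 GiB)
    static final long THETA_MIN = 256L << 20;   // assumed lower bound for two neighbours (256 MiB)
    // THETA_MAX >= 2 * THETA_MIN, so a freshly split half cannot immediately trigger a merge.

    /** True if partition i has outgrown the upper bound and should be split. */
    static boolean shouldSplit(List<Long> sizes, int i) {
        return sizes.get(i) > THETA_MAX;
    }

    /** True if partitions i and i+1 have shrunk enough to be merged. */
    static boolean shouldMerge(List<Long> sizes, int i) {
        return i + 1 < sizes.size() && sizes.get(i) + sizes.get(i + 1) < THETA_MIN;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(1200L << 20, 100L << 20, 80L << 20);   // MB-scale examples
        System.out.println("split P0?    " + shouldSplit(sizes, 0));   // true: 1200 MiB > 1 GiB
        System.out.println("merge P1,P2? " + shouldMerge(sizes, 1));   // true: 180 MiB < 256 MiB
    }
}
```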

School of Computer Science and Engineering
Challenge 1: Sharding coordination
• Issues
– Totally decentralised
– Each partition has multiple replicas
– Each replica is split or merged locally

• Question
– How to guarantee that all the replicas of a certain partition are sharded simultaneously?

School of Computer Science and Engineering
Challenge 1: Sharding coordination
• Solution: Election-based coordination
[Figure: election-based coordination among Node-A, Node-B, Node-C and Node-E, driven by the data/node mapping and a sorted candidate list; Node-C acts as coordinator.]
– Step 1: elect a coordinator among the nodes that hold a replica of the partition
– Step 2: the coordinator enforces the split/merge on every replica holder
– Step 3: each node finishes its local split/merge and acknowledges the coordinator
– Step 4: the coordinator announces the updated key ranges to all nodes
(A simplified sketch of these four steps follows below.)
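A toy, single-process sketch of the four steps; the node names are illustrative and the election is collapsed to picking the first candidate, whereas the notes describe Chubby-style majority voting:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the four-step, election-based sharding coordination. */
public class ShardingCoordinationSketch {
    public static void main(String[] args) {
        List<String> replicaNodes = List.of("Node-A", "Node-B", "Node-C", "Node-E");

        // Step 1: elect a coordinator among the nodes holding a replica of the partition
        // (here simply the first candidate; the real scheme requires a majority of votes).
        String coordinator = replicaNodes.get(0);

        // Step 2: the coordinator tells every replica holder to split/merge locally.
        Set<String> acks = new HashSet<>();
        for (String node : replicaNodes) {
            boolean ok = splitOrMergeLocally(node);   // local sharding of that replica's files
            // Step 3: each node acknowledges completion back to the coordinator.
            if (ok) acks.add(node);
        }

        // Step 4: only when *all* replica holders have acknowledged does the coordinator
        // announce the updated key ranges to every node in the system.
        if (acks.size() == replicaNodes.size()) {
            System.out.println(coordinator + " announces the new key ranges to all nodes");
        } else {
            System.out.println("Sharding aborted; it will be re-initiated later");
        }
    }

    static boolean splitOrMergeLocally(String node) {
        return true;   // stand-in for the local file split/merge on that node
    }
}
```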

School of Computer Science and Engineering
Challenge 2: Node failover during sharding
[Flowchart: failure handling for the coordinator and the non-coordinators before, during, and after execution of a sharding step (Steps 1–4). Failures are detected via gossip; a failed node may resurrect within a timeout and is appended to or removed from the candidate list accordingly. If the coordinator fails during execution, a new coordinator is elected or the step continues without one; if a non-coordinator remains dead past the timeout, its copy of Pi is invalidated on that node and replaced in Step 4 (Replace Replicas).]
– The scheme tolerates one node failure during sharding; if more than one node fails, the operation aborts without data loss and is re-initiated later (a rough sketch of this rule follows below).
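A very rough sketch of the tolerate-one-failure rule described in the notes, assuming the coordinator simply counts acknowledgements against a deadline; the participant names and the single-threaded structure are illustrative only:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy acknowledgement collector that tolerates at most one failed participant. */
public class ShardingFailoverSketch {
    public static void main(String[] args) {
        List<String> participants = List.of("Node-A", "Node-B", "Node-C");
        Set<String> acked = new HashSet<>(List.of("Node-A", "Node-C"));  // Node-B missed the deadline

        int failed = participants.size() - acked.size();
        if (failed == 0) {
            System.out.println("All replicas sharded: announce the new key ranges");
        } else if (failed == 1) {
            // One node may still resurrect within the time window; the operation proceeds
            // and the straggler's replica of Pi is invalidated and replaced afterwards.
            System.out.println("Proceed; invalidate and later replace the replica on the failed node");
        } else {
            // More than one failure: abort and re-initiate the sharding later.
            System.out.println("Abort sharding of Pi; no data is lost, retry later");
        }
    }
}
```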

School of Computer Science and Engineering
Challenge 3: Data consistency during sharding
• Use two sets of replicas at sharding
– Original partition and future partition
– Data from different partitions is stored in separate files

• Approach 1
– Write to future partition, roll back at failure
– Read from both partitions

• Approach 2
– Write to both partitions, abandon future partition at failure
– Read from original partition
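A hedged, in-memory sketch contrasting the two approaches during a split; the map-based "partitions" and method names are purely illustrative (the notes say Approach 1 is preferred because sharding failures are rare):

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory "partitions" used to contrast the two consistency approaches. */
public class ShardingConsistencySketch {
    static Map<String, String> original = new TreeMap<>();   // partition before the split
    static Map<String, String> future   = new TreeMap<>();   // partition being built by the split

    // Approach 1: writes go to the future partition only; reads consult both copies,
    // preferring the future one. On failure, the future partition is merged back.
    static void writeApproach1(String key, String value) { future.put(key, value); }
    static String readApproach1(String key) {
        return future.containsKey(key) ? future.get(key) : original.get(key);
    }

    // Approach 2: writes go to both partitions, so the future partition can simply
    // be abandoned on failure; reads use the original partition only.
    static void writeApproach2(String key, String value) {
        original.put(key, value);
        future.put(key, value);
    }
    static String readApproach2(String key) { return original.get(key); }

    public static void main(String[] args) {
        // Both read paths are shown on the same maps purely to contrast them.
        original.put("k1", "old");
        writeApproach1("k1", "new");
        System.out.println(readApproach1("k1"));   // "new": the future copy is visible
        System.out.println(readApproach2("k1"));   // "old": Approach 2 reads the original only
    }
}
```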

School of Computer Science and Engineering
Challenge 3: Data consistency during movement
• Use a pair of tokens for each partition
– Each token is a Boolean that approves or disapproves reads/writes
[Timeline t0–t4 for moving one partition: the source node serves both reads and writes throughout. At t1, before the files are transferred, the destination node starts accepting writes so it receives updates made during the transfer; the transfer completes at t2; at t3 the destination also starts serving reads; at t4 the source releases both of its tokens, and its data files can be deleted later. Solid lines mark a positive token, dashed lines a negative one. A token-switching sketch follows below.]
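A minimal sketch of that token switching, assuming one Boolean read token and one Boolean write token per partition on each node; the class and method names are illustrative, not ElasCass APIs:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative token pair guarding reads and writes for one partition on one node. */
class PartitionTokens {
    final AtomicBoolean readToken  = new AtomicBoolean(false);
    final AtomicBoolean writeToken = new AtomicBoolean(false);
}

public class MoveProtocolSketch {
    public static void main(String[] args) {
        PartitionTokens source = new PartitionTokens();
        PartitionTokens dest   = new PartitionTokens();

        // t0: the source serves both reads and writes for the partition.
        source.readToken.set(true);
        source.writeToken.set(true);

        // t1: before the files are transferred, the destination starts accepting
        //     writes so it does not miss updates made during the transfer.
        dest.writeToken.set(true);

        transferPartitionFiles();   // t1..t2: copy the partition's files as-is

        // t3: shortly after the transfer completes, the destination serves reads too.
        dest.readToken.set(true);

        // t4: the source releases both tokens; its copy of the files can be deleted later.
        source.readToken.set(false);
        source.writeToken.set(false);

        System.out.println("destination serves reads: " + dest.readToken.get()
                + ", writes: " + dest.writeToken.get());
    }

    static void transferPartitionFiles() {
        // Placeholder for the actual file transfer between nodes.
    }
}
```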

School of Computer Science and Engineering
Replica Placement at Node Bootstrap
• Partition re-allocation and sharding are mutually exclusive
• Maintain data availability
– Each partition has at least R replicas

• Balance the load (e.g., number of requests)
– Heavily loaded nodes have higher priority to “move out” data

• Balance the data
– Balance the number of partitions across nodes
• Each partition, via sharding, is of similar size

• Two-phase bootstrap
– Phase 1: guarantee R replicas, shift load from heavily loaded nodes
– Phase 2: achieve load and data balancing in low-priority threads
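A hedged, toy sketch of the two-phase ordering just described: first shift a couple of partitions from the most heavily loaded node, then pull partitions in the background until the new node holds the per-node average. The node names, load values, and partition counts are hypothetical, and replication (the R copies) is ignored for brevity:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy two-phase selection of partitions for a newly bootstrapped node. */
public class TwoPhaseBootstrapSketch {
    record Node(String name, double load, List<String> partitions) {}

    public static void main(String[] args) {
        List<Node> existing = List.of(
            new Node("n1", 0.9, new ArrayList<>(List.of("A", "B", "C", "D", "E"))),
            new Node("n2", 0.5, new ArrayList<>(List.of("F", "G", "H", "I"))),
            new Node("n3", 0.3, new ArrayList<>(List.of("J", "K", "L"))));

        List<String> newNode = new ArrayList<>();
        int total = existing.stream().mapToInt(n -> n.partitions().size()).sum();
        int balanceCount = total / (existing.size() + 1);   // average partitions per node after joining

        // Phase 1: take a couple of partitions from the most heavily loaded node so
        // the new node can start serving queries within minutes.
        Node hottest = existing.stream()
                .max(Comparator.comparingDouble(Node::load)).orElseThrow();
        while (newNode.size() < 2 && !hottest.partitions().isEmpty()) {
            newNode.add(hottest.partitions().remove(0));
        }

        // Phase 2: in low-priority background work, keep pulling partitions from any
        // node holding more than the average until the new node reaches the average.
        for (Node donor : existing) {
            while (newNode.size() < balanceCount && donor.partitions().size() > balanceCount) {
                newNode.add(donor.partitions().remove(0));
            }
        }
        System.out.println("new node now holds " + newNode);   // prints [A, B, F]
    }
}
```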
School of Computer Science and Engineering
Evaluation Setup
• ElasCass: an implementation of auto-sharding built on Apache Cassandra (version 1.0.5), which uses the Split-Move approach
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Test bed: Amazon EC2, m1.large instances (2 CPU cores, 8GB RAM)
• Benchmark: YCSB
• System scale: start from 1 node with 100GB of data and R = 2; scale up to 10 nodes

School of Computer Science and Engineering
Evaluation – Bootstrap Time
• In Split-Move, the data volume transferred is reduced by half with each new node from 3 nodes onwards.
• In ElasCass, the data volume transferred remains below 10GB from 2 nodes onwards.
• Bootstrap time is determined by the data volume transferred; ElasCass exhibits consistent performance at all scales.

School of Computer Science and Engineering
Evaluation – Data Volume
• ElasCass uses the two-phase bootstrap; more data is pulled in during phase 2.
• Imbalance Index = standard deviation / average (lower is better; a small example follows below). Data is well balanced in ElasCass.
• ElasCass occupies less storage space than the Split-Move approach.
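The imbalance index quoted above is simply the coefficient of variation. A small sketch with made-up per-node data volumes (in GB):

```java
public class ImbalanceIndex {
    /** Imbalance index = standard deviation / average of the per-node values. */
    static double imbalance(double[] perNode) {
        double mean = 0;
        for (double v : perNode) mean += v;
        mean /= perNode.length;
        double var = 0;
        for (double v : perNode) var += (v - mean) * (v - mean);
        var /= perNode.length;
        return Math.sqrt(var) / mean;
    }

    public static void main(String[] args) {
        double[] balanced   = {10, 11, 9, 10};   // hypothetical GB stored per node
        double[] imbalanced = {30, 5, 3, 2};
        System.out.printf("balanced:   %.2f%n", imbalance(balanced));    // close to 0
        System.out.printf("imbalanced: %.2f%n", imbalance(imbalanced));  // much larger
    }
}
```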

School of Computer Science and Engineering
Evaluation – Query Processing
• ElasCass is scalable, while Split-Move is not.
• Write throughput is higher than read throughput.
• ElasCass achieves better resource utilisation.
• ElasCass achieves a balanced load.

School of Computer Science and Engineering
Key Takeaways
• Using virtual nodes introduces less overhead in data movement and reduces the bootstrap time to below 10 minutes.
– Apache Cassandra v.1.1 uses virtual nodes

• Consolidating the partitions into bounded ranges simplifies replica
placement and facilitates load-balancing
– MySQL and MongoDB have started to auto-shard partitions

• A balanced load leads to 80% resource utilisation and throughput that scales with the number of nodes.

School of Computer Science and Engineering
Contributions and Acknowledgments
• We have designed and implemented a decentralised auto-sharding
scheme that
– consolidates each partition replica into a single transferable unit to provide efficient data movement;
– automatically shards the partitions into bounded ranges to address data skew;
– reduces the time to bootstrap nodes, achieves better load balancing, and improves query-processing performance.

• The authors would like to thank Smart Services CRC Pty Ltd for the
grant of Services Aggregation project that made this work possible.
School of Computer Science and Engineering
Thank You!

School of Computer Science and Engineering


Editor's Notes

  1. I will start from the picture that we want to achieve in the end. The workload is mostly dynamic in web applications and services. There are peak hours and off-peaks every day and every week. In the Infrastructure-as-a-service cloud, computation resources can be provisioned on demand to deal with increasing workloads, and when the workload decreases, the resources can be dismissed in order to save on economic costs, assuming the billing model is pay-as-you-go.
  2. In the cloud environment, key-value stores have become the standard reference architecture for data management. When the workload rises, key-value stores are required to bootstrap nodes, that is, to incorporate new empty nodes as their members. When the workload declines, existing members with redundant data can be eliminated. That is, node decommissioning.
  3. This work is focused on efficient node bootstrapping in key-value stores. There are a few goals to achieve. First of all, we want to minimise the overhead of data movement, so as to reduce the time required to bootstrap a node. It depends on the way the data is partitioned and stored. Second, after an empty node is added to the system, we want to balance the load, in terms of both data volume and workload each node undertakes. It depends on how the data is allocated amongst the nodes. Third, we need to maintain data consistency and availability while nodes are being added or removed. It depends on how data movement is executed.
  4. The behaviour of node bootstrapping largely depends on the storage model of the system. There is the shared-storage model, in which all the nodes access the same underlying storage. Node bootstrapping is efficient in this model, because it does not require data movement. Instead, the ownership of data can be taken over simply by exchanging the metadata. An example is Albatross proposed by Das from UCSB. In contrast, there is the shared-nothing model, in which each node of the key-value store uses individual disks for storing the data. These kinds of systems are usually deployed in a decentralised manner, and require actual data movement across nodes at bootstrapping. The question is, how to move the data in a lightweight manner?
  5. We reviewed the literature, and there are generally two approaches to data movement in shared-nothing key-value stores. The first one is what we call the Split-Move approach, which leverages hash functions to partition the key space based on the number of nodes in the system. *When a new node is added to the system, one or a few existing partitions are split to generate extra partitions. For example, Partition B is split into B1 and B2. *Then, the source nodes scan their local files to move out the data in the form of key-value pairs. The new node, which is the destination, receives the key-value pairs and reassembles the data into files. *The redundant data in the source nodes is deleted later. This approach is not efficient in two ways. First, it involves scanning and reassembling, which are heavyweight operations and not appropriate when dealing with large amounts of data. Second, only a limited number of existing nodes can participate in node bootstrapping, mostly because they use consistent-hashing-like algorithms.
  6. Alternatively, systems like Dynamo and Voldemort use the Virtual-node based approach, which originated from Chord. In this approach, the key space is split when the system starts. It results in many small partitions (or virtual nodes, as we call them). *Bootstrapping a new node becomes simpler. A list of partitions is selected to move out of the source nodes and is stored in the new node the way it was. However, the drawback of this approach is that it introduces data skew problems. Since the key space is partitioned at startup, as more data is inserted and deleted at runtime, there is no guarantee that each partition is of similar size. In the worst case, the system may end up storing the majority of its data in a minority of the partitions. Moving partitions with large amounts of data across nodes is never efficient.
  7. We have known that it is inevitable to move data in decentralised shared-nothing key-value stores. We have also reviewed the split-move and virtual-node based approaches. They are either heavyweight, or have data skew issues. What is lacking is the scheme of data partitioning and placement that handles node bootstrapping in an efficient, timely manner.
  8. We propose our strategy, which builds on the virtual-node approach. Each partition of data should be stored in separate files, so that when we move a partition replica, we can simply move the corresponding files. There is no heavyweight operation such as scanning or reassembling of key-value pairs. In addition, many existing nodes can participate in bootstrapping a new node, which also improves the performance. We also proposed to automatically split and merge the partitions at runtime, such that the data volume in each partition is of similar size. Once the partitions are consolidated into a bounded size, it becomes easier to reallocate the data and to balance the load.
  9. Now we talk about the timing for partitioning. Remember that in the split-move approach, the key space is partitioned at node bootstrapping, while in the virtual-node approach, the key space is partitioned at system startup. In our approach, the key space is partitioned at writes. We have defined an upper bound for the size of each partition. A partition is split when its size reaches the threshold due to data insertion. *We have also defined a lower bound for the total size of any two adjacent partitions. *Two neighbouring partitions are merged if the total size falls below the lower bound. Of course, to avoid oscillation, the upper bound should be considerably larger than the lower bound. The idea of our approach is simple, but realising this idea is non-trivial. There are a few challenges to address.
  10. First of all, there is the coordination issue. We assume the system is totally decentralised. Each partition has multiple copies on different nodes. Hence, the data files of each replica are split or merged locally within each node. The question is, how do we guarantee that all the replicas of a certain partition can be sharded simultaneously?
  11. The answer to this question is to elect a coordinator for each sharding operation. When a partition reaches the upper or lower bound, a poll is triggered, and any node that serves this partition can be voted as the coordinator. The election is based on the Chubby implementation, in which the coordinator obtains votes from a majority of the participating nodes. *Once we have the coordinator, it can enforce a sharding operation amongst all the nodes that serve the partition. *When a node finishes sharding locally, it sends an acknowledgement to the coordinator. If the coordinator manages to collect the acknowledgements from all the nodes that participate, the sharding is considered successful. *In the end, the coordinator broadcasts an update of the key range of the partition to all the nodes in the system.
  12. So this is the four-step coordination that I described. The second challenge is how to deal with node failure during coordination. Basically what we did here is to allow a dead node to resurrect within a time window. Even if one dead node does not come back to life, the sharding operation can still proceed and succeed. However, if more than one node fails during sharding, the operation is aborted and re-initiated later. Our paper has a detailed description. I will have to skip this slide due to time constraints. Overall, our solution can tolerate the failure of one node during sharding, and also guarantees that no data loss occurs if the sharding is aborted when multiple nodes fail.
  13. The third challenge is data consistency. There are two aspects. One is related to sharding. Remember that each partition of data is stored in separate files. So we have to use two sets of replicas. One belongs to the partition before sharding, that is the original partition. The other set of replicas belongs to the partition after sharding, that is the future partition. There are two ways to handle reads and writes. One approach is to write to the future partition, and read from both partitions. If, unfortunately, the sharding fails, the future partition is merged back to the original partition so that we can recover the latest updates. The alternative approach is to write to both partitions, so that we can simply abandon the future partition when failure occurs. We prefer the first solution, because the failure of sharding does not happen very often in practice.
  14. The other aspect of consistency is related to data movement. We proposed that each node uses a pair of tokens to control the reads and writes for every partition. Each token is a Boolean value. This figure shows how to switch the value of the tokens when moving one partition from the source node to the destination. The solid line means positive, while the dashed line means negative. As can be seen, the source node serves both reads and writes during the whole process. *The destination node should start accepting writes before the replica is transferred at Time t1, so that it can receive the latest updates during data transfer. *After the replica is successfully accepted by the destination at t2, *the destination node should also start serving reads for this partition at Time t3, which is a short while after Time t2. Once the destination node can serve both reads and writes, *the source node is allowed to release both tokens in the end. The data files can be deleted later.
  15. Now that we have the solutions to auto-sharding, let's take a look at the replica placement algorithm that is based on it. There are a few rules to follow. Number one, to make our life easier, we don't move and shard the same partition at the same time. Second, we make sure each partition has R copies, wherein R is typically equal to 3. Third, we try to balance the workload each node undertakes, so if a node is heavily loaded, a number of partitions will be moved out from it. Last but not least, we will try to balance the number of partitions if no node is heavily loaded. *Based on these rules, we propose a two-phase node bootstrapping. In Phase 1, the new node receives a limited number of hotspot replicas from the most loaded nodes, so that it can start serving queries within a few minutes. In Phase 2, the new node continues to pull in more replicas from different nodes, until it possesses an average number of partitions. This process can take one or several hours, as long as it does not affect the front-end query processing.
  16. We have implemented the scheme of auto-sharding and placement on Apache Cassandra version 1.0.5, which uses the Split-Move approach. We call our system ElasCass, that is, elastic Cassandra. We set up the evaluation on Amazon EC2, using instances with 2 CPU cores and 8GB of memory. We use the YCSB benchmark to launch queries. We have used a small cluster of 10 nodes. There are two reasons. One is because we didn't have the resources to scale up to hundreds of nodes. The other is that we can have a fine-grained analysis of system behaviour when the scale is smaller.
  17. Firstly, we have evaluated the bootstrap performance. Remember the replication number is equal to 2, so the whole data set is copied to the second node in both approaches. From the 3rd node onwards, the behaviours of node bootstrapping are totally different. In ElasCass, the volume of data transferred remains below 10GB at all times, while in Split-Move, the volume transferred is reduced by half at each step. We have analysed the result, and realised that, in Apache Cassandra, which uses Split-Move, the data is always transferred from the node with the most volume of data, which, in this case, is the first node in the system. Each time the data is moved out from the first node, the key range it serves is reduced by half. However, the data on disk is not deleted at runtime, because the data files are not mutable. As a result, the first node always stores the most data, but the actual key range it can offer is reduced exponentially by the power of 2. In contrast, ElasCass does not have this problem, because each partition is stored in separate files. It can simply move the files between the nodes. The time to bootstrap a node is determined by the volume of data transferred. Overall, ElasCass is able to bootstrap a node within 10 minutes. The BalanceVolume is the average volume of data each node should store at each scale. So let's look at the data volume at bootstrap.
  18. Remember that ElasCass bootstraps a node in two phases. In the second phase, more data is pulled in by the new node from multiple nodes, using a low-priority background thread. And it stops when the data volume reaches the balance volume. As a result, data is well balanced in ElasCass. We have used an imbalance index to evaluate load balancing. So lower is better. As can be seen in the second figure, with the Split-Move approach, the data gets more and more imbalanced as the system scales up, while the imbalance index remains low in ElasCass. In the third figure, we can see that ElasCass uses less storage space than the Split-Move approach. This is because the data files are immutable in Apache Cassandra. Even if the key-value pairs are moved out, they cannot be deleted until the files are re-constructed. In contrast, ElasCass can simply delete the data files of any specific partition.
  19. We have also used the YCSB benchmark to evaluate the performance of query processing. The two figures on top show the throughput of queries under a Zipfian distribution. The figure on the left is write throughput while read throughput is on the right. The throughput in ElasCass increases mostly linearly as the system scales up, while with the Split-Move approach the throughput stops improving from 5 nodes onwards. *This is because the data volume transferred is far less than the balance volume from 5 nodes onwards. As a result, the new node is unable to serve enough queries without a sufficient key range. The two figures at the bottom depict the CPU usage. We use CPU usage to indicate the workload each node undertakes. As can be seen in the figure on the left, the CPU usage of ElasCass is above 70% at all times, because it manages to offer higher throughput. In contrast, in Apache Cassandra, which uses Split-Move, the average CPU usage decreases as the system scales up, and the workload becomes more and more imbalanced. This is the penalty of the imbalanced data allocation that we discussed in the previous slide.
  20. These are the takeaways from me. Using the virtual-node approach reduces the overhead of data movement and thus improves the performance of node bootstrapping. Automatically sharding the partitions at runtime simplifies replica placement and improves load-balancing, which leads to 80% resource utilisation and throughput that scales with the number of nodes. After we submitted this paper, we were happy to see that Apache Cassandra, MySQL and MongoDB have started to use auto-sharding of virtual nodes. So we are quite convinced that this approach can be used to improve the data movement in shared-nothing key-value stores.
  21. In conclusion, we have used auto-sharding to reduce the time to bootstrap nodes, achieve a more balanced load, and deliver better query-processing performance. We would also like to thank Smart Services CRC for the grant that made this work possible.