1.
Scale-Out ccNUMA:
Exploiting Skew with Strongly Consistent Caching
Antonios Katsarakis*, Vasilis Gavrielatos*,
A. Joshi, N. Oswald, B. Grot, V. Nagarajan
The University of Edinburgh
This work was supported by EPSRC, ARM and Microsoft through their PhD Fellowship Programs
*The first two authors contributed equally to this work
7.
KVS Performance 101
In-memory storage:
Avoid slow disk access
Partitioning:
• Shard the dataset across multiple nodes
• Enables high-capacity in-memory storage
Remote Direct Memory Access (RDMA):
Avoid costly TCP/IP processing via
• Kernel bypass
• H/w network stack processing
Good start, but there is a problem…
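The partitioned, in-memory design above can be sketched in a few lines; the `ShardedKVS` class and modulo-hash placement are illustrative assumptions, not the systems' actual implementation (which reaches remote shards over RDMA rather than a local dict):

```python
import hashlib

class ShardedKVS:
    """Minimal sketch of a hash-partitioned, in-memory key-value store."""

    def __init__(self, num_nodes):
        # One dict per node stands in for a server's DRAM.
        self.nodes = [dict() for _ in range(num_nodes)]

    def home_node(self, key):
        # Deterministic hash: every client agrees on a key's shard.
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self.home_node(key)][key] = value

    def get(self, key):
        # In the real system a remote shard is reached via RDMA,
        # bypassing the kernel and the TCP/IP stack; here it is a lookup.
        return self.nodes[self.home_node(key)].get(key)

kvs = ShardedKVS(num_nodes=8)
kvs.put("user:42", "alice")
assert kvs.get("user:42") == "alice"
```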
9.
Skewed Access Distribution
Real-world datasets → mixed popularity
• Popularity follows a power-law distribution
• Small number of objects hot; most are not
Mixed popularity → load imbalance
• Node(s) storing hottest objects get highly loaded
• Majority of nodes are under-utilized
[Figure: YCSB, skew exponent = 0.99, 128 servers; the node holding the hottest keys is overloaded]
Skew-induced load imbalance limits system throughput
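The imbalance is easy to quantify from the distribution itself. This illustrative calculation (not from the talk) computes the share of traffic landing on the single hottest key under YCSB's Zipfian distribution with skew exponent 0.99, over 1M keys and 128 servers:

```python
# Zipf popularity: P(rank r) ∝ 1 / r**0.99 (YCSB's default skew).
NUM_KEYS = 1_000_000
NUM_SERVERS = 128
SKEW = 0.99

weights = [1.0 / r**SKEW for r in range(1, NUM_KEYS + 1)]
total = sum(weights)

# Fraction of all requests that target the single hottest key.
hottest_share = weights[0] / total

# With uniform hash partitioning, the average server sees 1/128 of
# traffic; the server storing the hottest key sees at least the
# hottest key's entire share on top of its regular shard.
uniform_share = 1.0 / NUM_SERVERS
print(f"hottest key: {hottest_share:.1%} of traffic "
      f"(~{hottest_share / uniform_share:.0f}x the per-server average)")
```

The hottest key alone draws several times the average per-server load, so whichever node stores it becomes the bottleneck while the rest idle.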
12.
Existing Skew Mitigation Techniques
Centralized cache [SOCC’11, SOSP’17]
• Dedicated node resides in front of the KVS, caching hot objects
◦ Filters the skew with a small cache
◦ Throughput is limited by the single cache
NUMA abstraction [NSDI’14, SOCC’16]
• Uniformly distribute requests to all servers
• Remote objects RDMA’ed from home node
◦ Load balances the client requests
◦ No locality → excessive network b/w: most requests require remote access
Can we get the best of both worlds?
14.
Caching + NUMA → Scale-Out ccNUMA!
via distributed caching
What are the challenges?
16.
Scale-Out ccNUMA Challenges
Challenge 1: Distributed cache architecture design
• Which items to cache and where?
• How to steer traffic for maximum load balance & hit rate?
Challenge 2: Keeping the caches consistent (i.e. what happens on a write)
• How to locate replicas?
• How to execute writes efficiently?
Solving Challenge 1 with Symmetric Caching
20.
Symmetric Caching
Which items to cache and where?
• Insight: hottest objects see most hits
• Idea: all nodes cache the hottest objects →
Implication: all caches have the same content
• Symmetric caching: a small cache with the hottest objects at each node
How to steer traffic for maximum load balance and hit rate?
• Insight: symmetric caching → all caches have the same (highest) hit rate
• Idea: uniformly spread requests
• Requests for hottest objects → served locally on any node
• Cache misses served as in the NUMA abstraction
Benefits:
• Load balances and filters the skew
• Throughput scales with the number of servers
• Less network b/w: most requests are served locally
Challenge 2: How to keep the caches consistent?
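The read path of symmetric caching can be sketched as follows; the `SymmetricCacheNode` class, its method names, and the dict-backed store are assumptions for illustration (the real system fetches misses from the home node via one-sided RDMA):

```python
import hashlib

class SymmetricCacheNode:
    """Sketch: every node holds an identical cache of the hottest keys."""

    def __init__(self, node_id, num_nodes, hot_set, backing_store):
        self.node_id = node_id
        self.num_nodes = num_nodes
        # Symmetric caching: the SAME hot objects are cached on every node,
        # so a request for a hot key hits locally no matter where it lands.
        self.cache = {k: backing_store[k] for k in hot_set}
        self.store = backing_store  # stands in for all sharded partitions

    def home_node(self, key):
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_nodes

    def read(self, key):
        # Hit: any node serves a hot object locally → the skew is filtered.
        if key in self.cache:
            return self.cache[key], "local-hit"
        # Miss: fetch from the key's home node, as in the NUMA abstraction.
        return self.store[key], f"remote-from-node-{self.home_node(key)}"

store = {"hot": 1, "cold": 2}
node = SymmetricCacheNode(0, num_nodes=4, hot_set={"hot"}, backing_store=store)
assert node.read("hot") == (1, "local-hit")
value, path = node.read("cold")
assert value == 2 and path.startswith("remote")
```

Clients can then spray requests uniformly across nodes: hits are served locally everywhere, so the skewed head of the distribution is spread evenly over all servers.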
21.
Keeping the caches consistent
21
Requirement:
On a write, inform all replicas of the new value
How to locate replicas?
- Easy with Symmetric Caching!
If object in local cache → all nodes cache it
22.
Keeping the caches consistent
22
Requirement:
On a write, inform all replicas of the new value
How to locate replicas?
- Easy with Symmetric Caching!
If object in local cache → all nodes cache it
23.
Keeping the caches consistent
23
Requirement:
On a write, inform all replicas of the new value
How to locate replicas?
- Easy with Symmetric Caching!
If object in local cache → all nodes cache it
How to execute writes efficiently?
• Typical protocols:
◦ Ensure global write ordering via a primary
◦ Primary executes all writes → hot-spot Primary executes all writes
Write( )Write( )
Primary
24.
Keeping the caches consistent
Requirement:
On a write, inform all replicas of the new value
How to locate replicas?
• Easy with Symmetric Caching!
If an object is in the local cache → all nodes cache it
How to execute writes efficiently?
• Typical protocols:
◦ Ensure global write ordering via a primary
◦ Primary executes all writes → hot-spot
• Fully distributed writes:
◦ Guarantee ordering via logical clocks
◦ Avoid hot-spots
◦ Evenly spread write propagation costs
[Figure: primary-based writes (all writes funnel through the primary) vs fully distributed writes (any node executes its own writes)]
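Why logical clocks suffice to order fully distributed writes can be shown in a few lines. In this sketch (illustrative names, not the paper's code), each write carries a (Lamport clock, node id) timestamp and every replica keeps only the highest-timestamped value, so all replicas converge regardless of delivery order and no primary is needed:

```python
class Replica:
    """A cached copy of one key, versioned by a Lamport timestamp."""

    def __init__(self):
        self.value = None
        self.ts = (0, -1)  # (logical clock, writer node id)

    def apply(self, value, ts):
        # Higher (clock, node_id) wins; ties on the clock are broken by
        # node id, so every replica deterministically picks the same
        # winner no matter the order in which broadcasts arrive.
        if ts > self.ts:
            self.value, self.ts = value, ts

# Two nodes write concurrently: no primary, each broadcasts to all replicas.
replicas = [Replica() for _ in range(4)]
write_a = ("A", (1, 0))   # node 0, clock 1
write_b = ("B", (1, 1))   # node 1, clock 1 → tie; node id decides

# Deliver the two broadcasts in a different order at each replica.
for i, r in enumerate(replicas):
    order = [write_a, write_b] if i % 2 else [write_b, write_a]
    for value, ts in order:
        r.apply(value, ts)

# All replicas converge on the same value despite the reordering.
assert all(r.value == "B" for r in replicas)
```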
29.
Protocols in Scale-Out ccNUMA
Efficient RDMA implementation
Fully distributed writes via logical clocks
Two (per-key) strongly consistent flavours:
◦ Linearizability (Lin): 2 RTTs
Broadcast Invalidations* → Broadcast Updates*
◦ Sequential Consistency (SC): 1 RTT
Broadcast Updates*
* along with logical (Lamport) clocks
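The two flavours differ only in their write path, which this sketch simulates; the `CacheNode` class and message flow are illustrative assumptions. Lin pays one round trip to invalidate every cached copy before a second round trip installs the new value, so no cache can serve a stale hit in between; SC installs the value directly in a single round trip:

```python
class CacheNode:
    def __init__(self):
        self.entry = {"state": "Valid", "value": "old", "ts": 0}

    def invalidate(self, ts):
        if ts > self.entry["ts"]:
            self.entry.update(state="Invalid", ts=ts)

    def update(self, value, ts):
        if ts >= self.entry["ts"]:
            self.entry.update(state="Valid", value=value, ts=ts)

    def read(self):
        # An Invalid entry must not serve a (possibly stale) local hit.
        return self.entry["value"] if self.entry["state"] == "Valid" else None

def lin_write(nodes, value, ts):
    # RTT 1: broadcast invalidations and wait for all acks.
    for n in nodes:
        n.invalidate(ts)
    # RTT 2: broadcast the new value, reopening the caches.
    for n in nodes:
        n.update(value, ts)
    return 2  # round trips

def sc_write(nodes, value, ts):
    # Single RTT: broadcast updates directly (weaker ordering, cheaper).
    for n in nodes:
        n.update(value, ts)
    return 1

nodes = [CacheNode() for _ in range(4)]
assert lin_write(nodes, "new", ts=1) == 2
assert all(n.read() == "new" for n in nodes)
assert sc_write(nodes, "newer", ts=2) == 1
assert all(n.read() == "newer" for n in nodes)
```

The extra invalidation round is what buys linearizability per key; dropping it halves the write latency at the cost of the weaker sequential-consistency ordering.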
34.
2.2χ
Performance
34
Both systems are network bound
Lin: >3x throughput at low write ratio, 1.6x at 5% writes
SC: higher throughput at higher write ratios: 2.2x at 5% writes
>3χ
1.6χ
35.
Conclusion
35
Scale-Out ccNUMA:
Distributed cache → best of Caching + NUMA
• Symmetric Caching:
◦ Load balances and filters skew
◦ Throughput scales with number of servers
◦ Less network b/w: most requests are local
• Fully distributed protocols:
◦ Efficient RDMA Implementation
◦ Fully distributed writes
◦ Two strong consistency guarantees
Up to 3x performance of state-of-the-art
while guaranteeing per-key Linearizability
Symmetric Caching
Fully distributed protocols
Write( ) Write( )
… … …