
Scale-out ccNUMA - Eurosys'18


These are the slides from the EuroSys'18 talk for the paper
"Scale-Out ccNUMA: Exploiting Skew with Strongly Consistent Caching"

paper link: http://homepages.inf.ed.ac.uk/vnagaraj/papers/eurosys18.pdf



  1. 1. Scale-Out ccNUMA: Exploiting Skew with Strongly Consistent Caching. Antonios Katsarakis*, Vasilis Gavrielatos*, A. Joshi, N. Oswald, B. Grot, V. Nagarajan, The University of Edinburgh. This work was supported by EPSRC, ARM and Microsoft through their PhD Fellowship Programs. *The first two authors contributed equally to this work.
  2. 2. Large-scale online services 2 Backed by Key-Value Stores (KVS). Characteristics: • Numerous users • Read-mostly workloads (e.g., Facebook: 0.2% writes [ATC'13]) Distributed KVS
  3. 3. KVS Performance 101 3
  4. 4. … KVS Performance 101 4 In-memory storage:
 Avoid slow disk access
  5. 5. … … … … KVS Performance 101 5 In-memory storage:
 Avoid slow disk access Partitioning:
 • Shard the dataset across multiple nodes • Enables high capacity in-memory storage

  6. 6. … … … … KVS Performance 101 6 In-memory storage:
 Avoid slow disk access Partitioning:
 • Shard the dataset across multiple nodes • Enables high capacity in-memory storage
 Remote Direct Memory Access (RDMA): Avoid costly TCP/IP processing via • Kernel bypass • H/w network stack processing
  7. 7. … … … … KVS Performance 101 7 In-memory storage:
 Avoid slow disk access Partitioning:
 • Shard the dataset across multiple nodes • Enables high capacity in-memory storage
 Remote Direct Memory Access (RDMA): Avoid costly TCP/IP processing via • Kernel bypass • H/w network stack processing Good start, but there is a problem…
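To make the partitioning idea above concrete, here is a minimal Python sketch of hash-based sharding with a single home node per key; the node count, hash choice, and in-process data structures are illustrative assumptions, and the remote path merely stands in for the one-sided RDMA read an RDMA-enabled KVS would issue.

```python
import hashlib

# Illustrative cluster model: each node's in-memory shard is a plain dict.
NUM_NODES = 8
shards = [dict() for _ in range(NUM_NODES)]

def home_node(key: str) -> int:
    """Map a key to the single node that stores it (hash-based sharding)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

def put(key: str, value) -> None:
    shards[home_node(key)][key] = value

def get(key: str, local_node: int):
    """Local access when this node is the key's home; otherwise a remote access
    (a one-sided RDMA read in the systems discussed here)."""
    owner = home_node(key)
    # Both branches index the in-process model; in a real deployment the
    # non-local branch is a network round trip.
    return shards[owner].get(key)

put("user:42", "alice")
print(get("user:42", local_node=3))   # remote for any node other than the key's home
```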
  8. 8. Skewed Access Distribution 8 Real-world datasets → mixed popularity • Popularity follows a power-law distribution • Small number of objects hot; most are not Mixed popularity → load imbalance • Node(s) storing hottest objects get highly loaded • Majority of nodes are under-utilized 128 Servers … … … Overloaded YCSB, skew exponent = 0.99
  9. 9. Skewed Access Distribution 9 Real-world datasets → mixed popularity • Popularity follows a power-law distribution • Small number of objects hot; most are not Mixed popularity → load imbalance • Node(s) storing hottest objects get highly loaded • Majority of nodes are under-utilized 128 Servers … … … Overloaded YCSB, skew exponent = 0.99 Skew-induced load imbalance limits system throughput
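For a rough feel of the skew, the following sketch (not from the paper) samples requests from a Zipfian popularity distribution with skew exponent 0.99, as in YCSB, and counts how many land on each shard; the node holding the hottest key serves noticeably more requests than the others.

```python
import random
from collections import Counter

NUM_KEYS, NUM_NODES, ALPHA, NUM_REQUESTS = 100_000, 8, 0.99, 200_000

# Zipfian popularity: P(rank r) is proportional to 1 / r**alpha (rank 1 = hottest key).
weights = [1.0 / (r ** ALPHA) for r in range(1, NUM_KEYS + 1)]
ranks = random.choices(range(NUM_KEYS), weights=weights, k=NUM_REQUESTS)

# Spread keys across nodes; the node that happens to hold the hottest key
# receives a disproportionate share of the total load.
load = Counter(rank % NUM_NODES for rank in ranks)
print(sorted(load.values(), reverse=True))
```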
  10. 10. Centralized cache [SOCC’11, SOSP’17] • Dedicated node resides in front of the KVS caching hot objects. ◦ Filters the skew with a small cache ◦ Throughput is limited by the single cache Existing Skew Mitigation Techniques 10 … … … ← Cache
  11. 11. Centralized cache [SOCC’11, SOSP’17] • Dedicated node resides in front of the KVS caching hot objects. ◦ Filters the skew with a small cache ◦ Throughput is limited by the single cache NUMA abstraction [NSDI’14, SOCC’16] • Uniformly distribute requests to all servers • Remote objects RDMA’ed from home node ◦ Load balance the client requests ◦ No locality → excessive network b/w Most requests require remote access Existing Skew Mitigation Techniques 11 … … … … … … ← Cache
  12. 12. Centralized cache [SOCC’11, SOSP’17] • Dedicated node resides in front of the KVS caching hot objects. ◦ Filters the skew with a small cache ◦ Throughput is limited by the single cache NUMA abstraction [NSDI’14, SOCC’16] • Uniformly distribute requests to all servers • Remote objects RDMA’ed from home node ◦ Load balance the client requests ◦ No locality → excessive network b/w Most requests require remote access Existing Skew Mitigation Techniques 12 … … … … … … Can we get the best of both worlds? ← Cache
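As a point of reference for the comparison above, here is a minimal sketch (illustrative names only) of the NUMA-abstraction baseline: a client sends each request to a uniformly random node, which serves it locally only if it happens to be the key's home and otherwise reaches across the network, which is why most requests pay a remote access.

```python
import random

NUM_NODES = 8
shards = [dict() for _ in range(NUM_NODES)]   # one partition per node

def numa_abstraction_read(key):
    """Uniformly spread requests, then fetch the value from the key's home node."""
    entry_node = random.randrange(NUM_NODES)  # load-balanced entry point
    owner = hash(key) % NUM_NODES             # the only node storing the key
    remote = owner != entry_node              # true for roughly (N-1)/N of requests
    return shards[owner].get(key), remote
```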
  13. 13. 13 Caching + NUMA → Scale-Out ccNUMA! (via distributed caching)
  14. 14. 14 Caching + NUMA → Scale-Out ccNUMA! (via distributed caching) What are the challenges?
  15. 15. Scale-Out ccNUMA Challenges 15 Challenge 1: Distributed cache architecture design • Which items to cache and where? • How to steer traffic for maximum load balance & hit rate? Challenge 2: Keeping the caches consistent 
 (i.e. what happens on a write) • How to locate replicas? • How to execute writes efficiently?
  16. 16. Scale-Out ccNUMA Challenges 16 Challenge 1: Distributed cache architecture design • Which items to cache and where? • How to steer traffic for maximum load balance & hit rate? Challenge 2: Keeping the caches consistent 
 (i.e. what happens on a write) • How to locate replicas? • How to execute writes efficiently? Solving Challenge 1 with Symmetric Caching
  17. 17. 17 Which items to cache and where? • Insight: hottest objects see most hits • Idea: all nodes cache hottest objects → Implication: all caches have same content • Symmetric caching: small cache with hottest objects at each node How to steer traffic for maximum load balance and hit rate? • Insight: symmetric caching → all caches equal (highest) hit rate • Idea: uniformly spread requests • Requests for hottest objects → served locally on any node • Cache misses served as in NUMA Abstraction Benefits: • Load balances and filters the skew • Throughput scales with number of servers • Less network b/w: most requests are served locally Symmetric Caching … … …
  18. 18. 18 Which items to cache and where? • Insight: hottest objects see most hits • Idea: all nodes cache hottest objects → Implication: all caches have same content • Symmetric caching: small cache with hottest objects at each node How to steer traffic for maximum load balance and hit rate? • Insight: symmetric caching → all caches equal (highest) hit rate • Idea: uniformly spread requests • Requests for hottest objects → served locally on any node • Cache misses served as in NUMA abstraction Benefits: • Load balances and filters the skew • Throughput scales with number of servers • Less network b/w: most requests are served locally Symmetric Caching … … …
  19. 19. 19 Which items to cache and where? • Insight: hottest objects see most hits • Idea: all nodes cache hottest objects → Implication: all caches have same content • Symmetric caching: small cache with hottest objects at each node How to steer traffic for maximum load balance and hit rate? • Insight: symmetric caching → all caches equal (highest) hit rate • Idea: uniformly spread requests • Requests for hottest objects → served locally on any node • Cache misses served as in NUMA abstraction Benefits: • Load balances and filters the skew • Throughput scales with number of servers • Less network b/w: most requests are served locally Symmetric Caching … … …
  20. 20. 20 Which items to cache and where? • Insight: hottest objects see most hits • Idea: all nodes cache hottest objects → Implication: all caches have same content • Symmetric caching: small cache with hottest objects at each node How to steer traffic for maximum load balance and hit rate? • Insight: symmetric caching → all caches equal (highest) hit rate • Idea: uniformly spread requests • Requests for hottest objects → served locally on any node • Cache misses served as in NUMA abstraction Benefits: • Load balances and filters the skew • Throughput scales with number of servers • Less network b/w: most requests are served locally Symmetric Caching … … … Challenge 2: How to keep the caches consistent?
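A minimal sketch of the symmetric-caching read path, assuming a toy cluster of node objects (all names are illustrative): every node holds an identical small cache of the hottest keys, so a hot read hits locally on whichever node the client picked, while a miss is served from the key's home node exactly as in the NUMA-abstraction baseline.

```python
NUM_NODES = 8

class SymmetricNode:
    """Each node holds the same small cache of the hottest objects plus its own shard."""

    def __init__(self, node_id: int, hot_items: dict, shard: dict):
        self.node_id = node_id
        self.cache = dict(hot_items)   # identical contents on every node
        self.shard = dict(shard)       # this node's partition of the dataset

    def read(self, key, cluster):
        if key in self.cache:                 # hot object: served locally on any node
            return self.cache[key]
        owner = hash(key) % NUM_NODES         # cold object: fetched from its home node,
        return cluster[owner].shard.get(key)  # as in the NUMA-abstraction baseline
```

Because clients spread requests uniformly and every cache filters the same hottest keys, each node absorbs an equal share of the skew and throughput scales with the number of servers.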
  21. 21. Keeping the caches consistent 21 Requirement: On a write, inform all replicas of the new value How to locate replicas? - Easy with Symmetric Caching! If object in local cache → all nodes cache it
  22. 22. Keeping the caches consistent 22 Requirement: On a write, inform all replicas of the new value How to locate replicas? - Easy with Symmetric Caching! If object in local cache → all nodes cache it
  23. 23. Keeping the caches consistent 23 Requirement: On a write, inform all replicas of the new value How to locate replicas? - Easy with Symmetric Caching! If object in local cache → all nodes cache it How to execute writes efficiently? • Typical protocols: ◦ Ensure global write ordering via a primary ◦ Primary executes all writes → hot-spot Primary executes all writes Write( )Write( ) Primary
  24. 24. Keeping the caches consistent 24 Requirement: On a write, inform all replicas of the new value How to locate replicas? - Easy with Symmetric Caching! If object in local cache → all nodes cache it How to execute writes efficiently? • Typical protocols: ◦ Ensure global write ordering via a primary ◦ Primary executes all writes → hot-spot • Fully distributed writes Can guarantee ordering via logical clocks Avoid hot-spots Evenly spread write propagation costs Primary executes all writes Write( )Write( ) Primary Fully distributed writes Write( ) Write( )
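A hedged sketch of fully distributed writes ordered by per-key Lamport timestamps: any node may initiate a write, and the (clock, node-id) pair defines a total order, so no primary is needed and concurrent writes converge to the same winner everywhere. The class and field names are assumptions for illustration, not the paper's actual data structures or message format.

```python
from dataclasses import dataclass

@dataclass
class CachedObject:
    value: object = None
    clock: int = 0          # per-key Lamport clock
    last_writer: int = -1   # node id of the last writer; breaks clock ties

class CacheNode:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.cache = {}      # key -> CachedObject (the symmetric cache)

    def local_write(self, key, value, peers):
        """Any node may coordinate a write: bump the key's clock, apply locally,
        then propagate the new version to every other cache."""
        obj = self.cache.setdefault(key, CachedObject())
        obj.clock += 1
        obj.value, obj.last_writer = value, self.node_id
        for peer in peers:
            peer.apply_remote_write(key, value, obj.clock, self.node_id)

    def apply_remote_write(self, key, value, clock, writer_id):
        """Apply a remote write only if its (clock, writer) timestamp wins the
        total order; otherwise just keep the local clock monotonic."""
        obj = self.cache.setdefault(key, CachedObject())
        if (clock, writer_id) > (obj.clock, obj.last_writer):
            obj.value, obj.clock, obj.last_writer = value, clock, writer_id
        else:
            obj.clock = max(obj.clock, clock)
```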
  25. 25. Protocols in Scale-out ccNUMA 25 Efficient RDMA implementation Fully distributed writes via logical clocks Two (per-key) strongly consistent flavours: Write( )
  26. 26. Protocols in Scale-out ccNUMA 26 Efficient RDMA implementation Fully distributed writes via logical clocks Two (per-key) strongly consistent flavours: ◦ Linearizability (Lin): 2 RTTs Write( )
  27. 27. Protocols in Scale-out ccNUMA 27 Efficient RDMA implementation Fully distributed writes via logical clocks Two (per-key) strongly consistent flavours: ◦ Linearizability (Lin): 2 RTTs Broadcast Invalidations* * along with logical (Lamport) clocks Lin Invalidate all caches Write( )
  28. 28. Protocols in Scale-out ccNUMA 28 Efficient RDMA implementation Fully distributed writes via logical clocks Two (per-key) strongly consistent flavours: ◦ Linearizability (Lin): 2 RTTs Broadcast Invalidations* Broadcast Updates* * along with logical (Lamport) clocks Lin Invalidate all caches Write( ) Broadcast Updates
  29. 29. Protocols in Scale-out ccNUMA 29 Efficient RDMA implementation Fully distributed writes via logical clocks Two (per-key) strongly consistent flavours: ◦ Linearizability (Lin): 2 RTTs Broadcast Invalidations* Broadcast Updates* ◦ Sequential Consistency (SC): 1 RTT Broadcast Updates*
 * along with logical (Lamport) clocks Lin SC Invalidate all caches Write( ) Broadcast Updates
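The two flavours can be contrasted with a toy, message-counting sketch; there is no real networking or RDMA here, and the Lamport clocks carried by every message are omitted, so this only illustrates the round-trip structure the slides describe: Lin invalidates every cached copy and only then installs the new value (~2 RTTs), while SC broadcasts the update directly (~1 RTT).

```python
class ToyReplica:
    """Counts protocol steps only; no networking, RDMA, or timestamps."""
    def __init__(self):
        self.cache, self.valid = {}, {}

    def invalidate(self, key):
        self.valid[key] = False      # reads of this key now miss or block
        return "ACK"

    def update(self, key, value):
        self.cache[key] = value
        self.valid[key] = True
        return "ACK"

def write_lin(key, value, replicas):
    """Lin flavour (~2 RTTs): invalidate all caches, wait for every ack, then
    broadcast the new value, so no reader ever returns a stale copy."""
    acks = [r.invalidate(key) for r in replicas]   # round trip 1: invalidations
    assert all(a == "ACK" for a in acks)
    for r in replicas:                             # round trip 2: updates
        r.update(key, value)

def write_sc(key, value, replicas):
    """SC flavour (~1 RTT): broadcast the update directly; the omitted per-key
    Lamport clocks give every node the same global write order."""
    for r in replicas:
        r.update(key, value)
```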
  30. 30. Evaluation 30 Hardware setup: 9 nodes • 56Gb/s FDR InfiniBand NIC • 64GB DRAM • 2x 10 core CPUs - 25MB L3 KVS Workload: • Skew exponent: α = 0.99 (YCSB) • 250 M key-value pairs - (Key = 8B, Value = 40B) Evaluated systems: • Baseline: NUMA abstraction (state-of-the-art) • Scale-out ccNUMA • Per-node symmetric cache size: 0.1% of dataset
  31. 31. Performance 31 Both systems are network bound
  32. 32. Performance 32 >3x Both systems are network bound Lin: >3x throughput at low write ratio
  33. 33. Performance 33 >3x 1.6x Both systems are network bound Lin: >3x throughput at low write ratio, 1.6x at 5% writes
  34. 34. 2.2x Performance 34 Both systems are network bound Lin: >3x throughput at low write ratio, 1.6x at 5% writes SC: higher throughput at higher write ratios: 2.2x at 5% writes >3x 1.6x
  35. 35. Conclusion 35 Scale-Out ccNUMA: Distributed cache → best of Caching + NUMA • Symmetric Caching: ◦ Load balances and filters skew ◦ Throughput scales with number of servers ◦ Less network b/w: most requests are local • Fully distributed protocols: ◦ Efficient RDMA Implementation ◦ Fully distributed writes ◦ Two strong consistency guarantees Up to 3x performance of state-of-the-art while guaranteeing per-key Linearizability Symmetric Caching Fully distributed protocols Write( ) Write( ) … … …
  36. 36. Questions? 36
  37. 37. Backup Slides 37
  38. 38. Effectiveness of caching 38 ~65% ~60%
  39. 39. Read-only (varying skew) 39
  40. 40. Request breakdown 40
  41. 41. Network traffic 41
  42. 42. Read-only performance + Coalescing 42
  43. 43. Object-size & writes 43
  44. 44. Object-size & coalescing 44
  45. 45. Latency vs xPut 45 ~an order of magnitude lower than the typical 1 ms QoS target (at max xPut)
  46. 46. Break even (+model) 46 Same performance as the ideal baseline (uniform workload)
  47. 47. Scalability (+model) 47
