Scale-out ccNUMA - Eurosys'18

Scale-Out ccNUMA:
Exploiting Skew with Strongly Consistent Caching
Antonios Katsarakis*, Vasilis Gavrielatos*, 

A. Joshi, N. Oswald, B. Grot, V. Nagarajan
The University of Edinburgh
This work was supported by EPSRC, ARM and Microsoft through their PhD Fellowship Programs
*The first two authors contributed equally to this work
Large-scale online services
2
Backed by Key-Value Stores (KVS)
Characteristics:
• Numerous users
• Read-mostly workloads
(e.g., Facebook: 0.2% writes [ATC’13])

Distributed KVS
KVS Performance 101
7
In-memory storage:
• Avoid slow disk access
Partitioning:
• Shard the dataset across multiple nodes (sketched below)
• Enables high-capacity in-memory storage
Remote Direct Memory Access (RDMA):
Avoid costly TCP/IP processing via
• Kernel bypass
• H/w network-stack processing
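To make the partitioning idea concrete, here is a minimal sketch (mine, not from the paper) of hash-based sharding: every node deterministically maps a key to the node that stores it, so any node knows where to send a request. The hash function and node count are illustrative assumptions.

```python
import hashlib

def home_node(key: bytes, num_nodes: int) -> int:
    """Deterministically map a key to the node that stores its partition."""
    digest = hashlib.md5(key).digest()              # any stable hash works here
    return int.from_bytes(digest[:8], "big") % num_nodes

# Example: with 128 servers, every node agrees on where "user:42" lives.
print(home_node(b"user:42", 128))
```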
Good start, but there is a problem…
Skewed Access Distribution
9
Real-world datasets → mixed popularity
• Popularity follows a power-law distribution
• A small number of objects are hot; most are not
Mixed popularity → load imbalance
• Node(s) storing the hottest objects get highly loaded
• Majority of nodes are under-utilized (the toy simulation below illustrates this)
[Figure: per-server load across 128 servers (YCSB, skew exponent 0.99); the server holding the hottest objects is overloaded]
Skew-induced load imbalance limits system throughput
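To see why skewed popularity overloads a single node, here is a toy simulation of my own (not from the paper): it draws requests from a truncated Zipfian distribution with exponent 0.99 and tallies per-server load under simple modulo placement. The key count, request count, and placement rule are illustrative assumptions.

```python
import random
from collections import Counter

NUM_KEYS, NUM_SERVERS, ALPHA = 100_000, 128, 0.99   # assumed toy parameters

# Truncated Zipfian popularity: the key ranked r has weight 1 / r^ALPHA.
weights = [1.0 / (rank ** ALPHA) for rank in range(1, NUM_KEYS + 1)]

# Draw requests and count how many land on each server (keys placed by modulo).
requests = random.choices(range(NUM_KEYS), weights=weights, k=100_000)
load = Counter(key % NUM_SERVERS for key in requests)

hottest = max(load.values())
average = len(requests) / NUM_SERVERS
print(f"hottest server receives {hottest / average:.1f}x the average load")
```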
Existing Skew Mitigation Techniques
12
Centralized cache [SOCC’11, SOSP’17]
• A dedicated node sits in front of the KVS, caching hot objects
◦ Filters the skew with a small cache
◦ Throughput is limited by the single cache node
NUMA abstraction [NSDI’14, SOCC’16]
• Uniformly distribute requests to all servers
• Remote objects are RDMA’ed from their home node
◦ Load balances client requests
◦ No locality → excessive network b/w: most requests require remote access
Can we get the best of both worlds?
14
Caching + NUMA → Scale-Out ccNUMA! (via distributed caching)
What are the challenges?
Scale-Out ccNUMA Challenges
16
Challenge 1: Distributed cache architecture design
• Which items to cache and where?
• How to steer traffic for maximum load balance & hit rate?
Challenge 2: Keeping the caches consistent (i.e., what happens on a write)
• How to locate replicas?
• How to execute writes efficiently?
Solving Challenge 1 with Symmetric Caching
Symmetric Caching
20
Which items to cache, and where?
• Insight: the hottest objects see the most hits
• Idea: all nodes cache the hottest objects → implication: all caches have the same content
• Symmetric caching: a small cache holding the hottest objects at each node
How to steer traffic for maximum load balance and hit rate?
• Insight: with symmetric caching, every cache has the same (highest) hit rate
• Idea: uniformly spread requests across nodes
◦ Requests for the hottest objects are served locally on any node
◦ Cache misses are served as in the NUMA abstraction (see the sketch below)
Benefits:
• Load balances and filters the skew
• Throughput scales with the number of servers
• Less network b/w: most requests are served locally
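A minimal sketch of the read path under symmetric caching, using illustrative names (home_node, partitions, local_cache) that are my assumptions, not the paper's API: a request can be sent to any node; hot keys hit the node-local cache, and misses fall back to a lookup at the key's home partition, which stands in for the one-sided RDMA read of the NUMA abstraction.

```python
import hashlib
from typing import Dict, Optional

def home_node(key: str, num_nodes: int) -> int:
    """Deterministic key → home-node mapping (simple hash sharding)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big") % num_nodes

def read(key: str,
         local_cache: Dict[str, str],
         partitions: Dict[int, Dict[str, str]],
         num_nodes: int) -> Optional[str]:
    """Read path: local symmetric cache first, then the key's home partition."""
    if key in local_cache:                    # every node caches the same hot set,
        return local_cache[key]               # so hot keys hit locally on any node
    home = home_node(key, num_nodes)          # miss: fall back to the NUMA abstraction
    return partitions[home].get(key)          # stand-in for a one-sided RDMA read

# Toy setup: 4 nodes, keys sharded by home_node, one hot key replicated in every cache.
data = {"hot:x": "7", "cold:a": "1", "cold:b": "2"}
partitions: Dict[int, Dict[str, str]] = {n: {} for n in range(4)}
for k, v in data.items():
    partitions[home_node(k, 4)][k] = v
local_cache = {"hot:x": "7"}                  # identical contents on every node
print(read("hot:x", local_cache, partitions, 4),
      read("cold:a", local_cache, partitions, 4))
```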
Challenge 2: How to keep the caches consistent?
Keeping the caches consistent
24
Requirement: on a write, inform all replicas of the new value
How to locate replicas?
• Easy with symmetric caching: if an object is in the local cache, all nodes cache it
How to execute writes efficiently?
• Typical protocols:
◦ Ensure global write ordering via a primary
◦ The primary executes all writes → hot-spot
• Fully distributed writes (see the sketch below):
◦ Guarantee ordering via logical clocks
◦ Avoid hot-spots
◦ Evenly spread write-propagation costs
[Figure: primary-based writes funnel every write through the primary; fully distributed writes let any node execute a write]
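The sketch below shows one way fully distributed writes can be ordered with Lamport clocks. It is a generic last-writer-wins illustration under my own assumptions (class and method names are not from the paper, and it omits the invalidation/update broadcasts and RDMA transport the real protocols use): each writer tags its write with a (clock, node-id) timestamp, and every replica keeps the value with the highest timestamp.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Replica:
    """Toy replica applying fully distributed writes ordered by Lamport timestamps."""
    node_id: int
    clock: int = 0
    store: Dict[str, Tuple[Tuple[int, int], str]] = field(default_factory=dict)

    def local_write(self, key: str, value: str) -> Tuple[int, int]:
        self.clock += 1
        ts = (self.clock, self.node_id)       # (Lamport clock, node id) breaks ties
        self.store[key] = (ts, value)
        return ts                             # caller broadcasts (key, ts, value) to peers

    def apply_remote(self, key: str, ts: Tuple[int, int], value: str) -> None:
        self.clock = max(self.clock, ts[0])   # Lamport rule: never fall behind a sender
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)     # last-writer-wins on the timestamp

# Two replicas write the same key concurrently; both converge on the same winner.
a, b = Replica(node_id=0), Replica(node_id=1)
ts_a = a.local_write("k", "from-A")
ts_b = b.local_write("k", "from-B")
a.apply_remote("k", ts_b, "from-B")
b.apply_remote("k", ts_a, "from-A")
print(a.store["k"] == b.store["k"])           # True: the replicas agree
```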
Protocols in Scale-out ccNUMA
29
Efficient RDMA implementation
Fully distributed writes via logical clocks
Two (per-key) strongly consistent flavours (schematic sketch below):
◦ Linearizability (Lin): 2 RTTs (broadcast Invalidations*, then broadcast Updates*)
◦ Sequential Consistency (SC): 1 RTT (broadcast Updates*)
* along with logical (Lamport) clocks
[Figure: a Lin write invalidates all caches, then broadcasts updates; an SC write broadcasts updates directly]
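The following schematic contrasts the two write flows at a very high level. It is my simplification under assumed names (Peer, invalidate, update) and omits the Lamport clocks, RDMA transport, and concurrency handling that the actual protocols rely on; it only conveys the 2-RTT vs. 1-RTT structure described on the slide.

```python
class Peer:
    """Toy cache peer; invalidate/update stand in for the broadcast messages."""
    def __init__(self):
        self.cache = {}
        self.invalid = set()

    def invalidate(self, key) -> bool:
        self.invalid.add(key)                 # reads of an invalidated key must wait or miss
        return True                           # acknowledgement

    def update(self, key, value) -> None:
        self.cache[key] = value
        self.invalid.discard(key)

def lin_write(key, value, peers, local_store):
    """Linearizable flavour: 2 RTTs (invalidate all caches, then broadcast the value)."""
    acks = [p.invalidate(key) for p in peers] # RTT 1: broadcast invalidations
    assert all(acks)
    local_store[key] = value
    for p in peers:                           # RTT 2: broadcast updates
        p.update(key, value)

def sc_write(key, value, peers, local_store):
    """Sequentially consistent flavour: 1 RTT (broadcast updates only)."""
    local_store[key] = value
    for p in peers:
        p.update(key, value)

peers, store = [Peer(), Peer()], {}
lin_write("k", "v1", peers, store)
sc_write("k", "v2", peers, store)
print(store["k"], [p.cache["k"] for p in peers])
```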
Evaluation
30
Hardware setup: 9 nodes
• 56Gb/s FDR InfiniBand NIC
• 64GB DRAM
• 2× 10-core CPUs, 25MB L3
KVS workload:
• Skew exponent: α = 0.99 (YCSB)
• 250M key-value pairs (key = 8B, value = 40B)
Evaluated systems:
• Baseline: NUMA abstraction (state-of-the-art)
• Scale-out ccNUMA
• Per-node symmetric cache size: 0.1% of dataset
Performance
34
Both systems are network-bound
• Lin: >3x throughput at low write ratios, 1.6x at 5% writes
• SC: higher throughput at higher write ratios: 2.2x at 5% writes
[Figure: throughput vs. write ratio, with the >3x, 1.6x, and 2.2x gaps annotated]
Conclusion
35
Scale-Out ccNUMA:
Distributed cache → best of Caching + NUMA
• Symmetric Caching:
◦ Load balances and filters skew
◦ Throughput scales with number of servers
◦ Less network b/w: most requests are local
• Fully distributed protocols:
◦ Efficient RDMA Implementation
◦ Fully distributed writes
◦ Two strong consistency guarantees
Up to 3x the performance of the state of the art,
while guaranteeing per-key linearizability
Questions?
36
Backup Slides
37
Effectiveness of caching
38
Read-only (varying skew)
39
Request breakdown
40
Network traffic
41
Read-only performance + Coalescing
42
Object-size & writes
43
Object-size & coalescing
44
Latency vs xPut
45
~ an order of magnitude lower than the typical 1ms QoS (at max xPut)
Break even (+model)
46
Same performance as the ideal baseline (uniform workload)
Scalability (+model)
47