
Cloud Messaging Service: Technical Overview


Matteo Merli, the tech lead for Cloud Messaging Service at Yahoo, walked through the design decisions behind the service, how the team arrived at them, and how the service uses Apache BookKeeper to implement multi-tenant messaging.

Published in: Internet

  1. Cloud Messaging Service: Technical Overview
     Presented by Matteo Merli, September 21, 2015
  2. Sections
     1. Introduction
     2. Architecture
     3. Bookkeeper
     4. Future
     5. Q & A
     CMS - Technical Overview
  3. What is CMS
     • Hosted Pub / Sub
     • Multi tenant (Auth / Quotas / Load Balancer)
     • Horizontally scalable
     • Highly available, durable and consistent storage
     • Geo Replication
     • In production since 2013
     [Diagram labels: CMS Cluster, Producer, Broker, Consumer, Bookie, ZK, Global ZK, Replication]
  4. CMS key features
     • Multi-tenancy / hosted
       • Operating a system at scale is hard and requires deep understanding of internals
     • Authentication / Self service provisioning / Quotas
     • SLAs (write latency: 2 ms average, 5 ms at the 99th percentile)
     • Maintain the same latencies and throughput under backlog draining scenarios
     • Simple high level API with clear ordering, durability and consistency semantics
     • Geo-replication
       • Single API call to configure regions to replicate to
     • Load balancer: dynamically optimize the assignment of topics to brokers
     • Support large number of topics
     • Store subscription position
       • Apps don’t need to store it
     • Able to delete data as soon as it's consumed
     • Support round-robin distribution across multiple consumers
  5. Work load examples
     Challenge              # Topics   # Producers / topic   # Subscriptions / topic   Produced msg rate / s / topic
     Fan-out                1          1                     1 K                       1 K
     Throughput & latency   1          1                     1                         100 K
     # Topics & latency     1 M        1                     10                        10
     Fan-in                 1          1 K                   1                         > 100 K
     • Designed to support a wide range of use cases
     • Needs to be cost effective in every case
  6. 2. Architecture
  7. Messaging model
     • Producers can attach to a topic and send messages to it
     • A subscription is a durable resource that receives all messages sent to the topic after its creation
     • Subscriptions have a type:
       • “Exclusive” means that only one consumer is allowed to attach to this subscription. First consumer decides the type.
       • “Shared” allows multiple consumers. Messages are sent in round-robin distribution. No ordering guarantees.
       • “Failover” allows multiple consumers, though only one is receiving messages at a given point, while others are in standby mode.
     [Diagram: Producer-X and Producer-Y publish to a topic with Subscription-A (Exclusive), Subscription-B (Shared) and Subscription-C (Failover), consumed by Consumer-1 through Consumer-5]
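The three subscription types on the messaging model slide can be illustrated with a small dispatch sketch. This is not the CMS implementation: the class and method names are hypothetical, and picking the first attached consumer as the active one in “Failover” mode is an assumption, since the slide does not say how the active consumer is chosen.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three subscription types; not the CMS code.
enum SubscriptionType { EXCLUSIVE, SHARED, FAILOVER }

final class SubscriptionSketch {
    private final SubscriptionType type;
    private final List<String> consumers = new ArrayList<>();
    private int next = 0; // round-robin position, used by SHARED only

    SubscriptionSketch(SubscriptionType type) {
        this.type = type;
    }

    /** Attaches a consumer; an EXCLUSIVE subscription rejects a second one. */
    void attach(String consumer) {
        if (type == SubscriptionType.EXCLUSIVE && !consumers.isEmpty()) {
            throw new IllegalStateException("subscription is exclusive");
        }
        consumers.add(consumer);
    }

    /** Returns the consumer that would receive the next message. */
    String pickConsumer() {
        if (consumers.isEmpty()) {
            throw new IllegalStateException("no consumer attached");
        }
        if (type == SubscriptionType.SHARED) {
            // Round-robin across all attached consumers.
            String chosen = consumers.get(next);
            next = (next + 1) % consumers.size();
            return chosen;
        }
        // EXCLUSIVE has exactly one consumer. For FAILOVER we assume the
        // first attached consumer is active and the others are on standby.
        return consumers.get(0);
    }
}
```

With two consumers attached, a SHARED subscription alternates between them, while a FAILOVER subscription keeps returning the same one until it is replaced.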
  8. Client API
     ▪ Expose messaging model concepts (producer/consumer)
     ▪ C++ and Java
     ▪ Connection pooling
     ▪ Handle recoverable failures transparently (reconnect / resend messages) without compromising ordering guarantees
     ▪ Sync / async version of every operation
  9. Java producer example
         CmsClient client = CmsClient.create("http://<broker vip>:4080");
         Producer producer = client.createProducer("my-topic");

         // handles retries in case of failure
         producer.send("my-message".getBytes());

         // Async version:
         producer.sendAsync("my-message".getBytes()).thenRun(() -> {
             // Message was persisted
         });
  10. Java consumer example
         CmsClient client = CmsClient.create("http://<broker vip>:4080");
         Consumer consumer = client.subscribe(
                 "my-topic",
                 "my-subscription-name",
                 SubscriptionType.Exclusive);

         // Blocks until message available
         Message msg = consumer.receive();

         // Do something...

         consumer.acknowledge(msg);
  11. System overview
     Broker
     • State-less
     • Maintain in memory cache of messages
     • Read from Bookkeeper when cache miss
     Bookkeeper
     • Distributed write-ahead log
     • Create many ledgers
     • Append entries
     • Read entries
     • Delete ledger
     • Consistent reads
     • Single writer (the broker)
     [Diagram labels: CMS Cluster, Broker (Native dispatcher, Managed Ledger, BK Client, Global replicators, Cache, Load Balancer), Bookie, ZK, Global ZK, Replication, Producer App with CMS client, Consumer App with CMS client]
  12. System overview
     Native dispatcher
     • Async Netty server
     Global replicators
     • If topic is global, republish messages in other regions
     Global Zookeeper
     • ZK instance with participants in multiple US regions
     • Consistent data store for customers configuration
     • Accept writes with one region down
     [Diagram: same cluster diagram as the previous slide]
  13. Partitioned topics
     ▪ Client lib has a wrapper producer/consumer implementation
     ▪ No API changes
     ▪ Producers can decide how to assign messages to partitions:
       ▪ Single partition
       ▪ Round robin
       ▪ Provide a key on the message
         ▪ Hash of the key determines the partition
       ▪ Custom routing
     [Diagram: an app producer for topic T1 with partitions P0 to P4; partitions T1-P0 to T1-P4 are spread across Broker 1, Broker 2 and Broker 3]
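The routing options on the partitioned topics slide can be sketched as a small router. Everything here is an assumption for illustration: the class name is hypothetical, round robin is used when no key is given, and the hash is Java's `Arrays.hashCode` over the UTF-8 key bytes, because the slides do not say which hash function the CMS client uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical partition router; the hash function is an assumption.
final class PartitionRouterSketch {
    private final int numPartitions;
    private int roundRobin = 0;

    PartitionRouterSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /**
     * Chooses a partition: keyed messages always map to the same partition,
     * unkeyed messages rotate through the partitions in order.
     */
    int choosePartition(String key) {
        if (key == null) {
            int partition = roundRobin;
            roundRobin = (roundRobin + 1) % numPartitions;
            return partition;
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result in [0, numPartitions) for negative hashes.
        return Math.floorMod(hash, numPartitions);
    }
}
```

Because the partition depends only on the key, all messages with the same key land in the same partition and keep their relative order.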
  14. Partitioned topics
     ▪ Consumers can use all subscription types with the same semantics
     ▪ In “Failover” subscription type, the election is done per partition
       ▪ Evenly spread the partition assignments across all available consumers
       ▪ No need for ZK coordination
     [Diagram: App Consumer-1 and App Consumer-2 each hold consumers C0 to C4 for topic T1; partitions T1-P0 to T1-P4 are spread across Broker 1, Broker 2 and Broker 3]
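One way a per-partition election could spread partitions evenly without any coordination is for each broker to apply the same deterministic rule to the set of attached consumers. The sketch below assumes such a rule (sort the consumer names, then index by partition number); the slides do not describe the actual election, so the class, method and rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical per-partition failover election; the rule is an assumption.
final class FailoverElectionSketch {
    /**
     * Picks the active consumer for one partition. Every broker that sees the
     * same consumer names computes the same answer, so no shared state is needed.
     */
    static String activeConsumer(int partition, List<String> consumers) {
        if (partition < 0 || consumers.isEmpty()) {
            throw new IllegalArgumentException("need a valid partition and a consumer");
        }
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);
        return sorted.get(partition % sorted.size());
    }
}
```

With five partitions and two consumers, this rule makes one consumer active on three partitions and the other on two.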
  15. 3. Bookkeeper
  16. CMS Bookkeeper usage
     ▪ CMS uses Bookkeeper through a higher level interface, ManagedLedger:
       › A single managed ledger represents the storage of a single topic
       › Maintains list of currently active BK ledgers
       › Maintains the subscription positions using an additional ledger to checkpoint the last acknowledged message in the stream
       › Caches data
       › Deletes ledgers when all cursors are done with them
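The rule "deletes ledgers when all cursors are done with them" can be sketched as follows. This is a simplified model, not the ManagedLedger API: the class and method names are hypothetical, a cursor is reduced to the id of the ledger it is currently reading, and the sketch assumes the newest ledger is still open for writes and is therefore never deleted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical, simplified model of ledger trimming; not the real ManagedLedger.
final class ManagedLedgerSketch {
    private final TreeSet<Long> ledgers = new TreeSet<>();     // ledger ids, oldest first
    private final Map<String, Long> cursors = new HashMap<>(); // subscription -> ledger being read

    void addLedger(long ledgerId) {
        ledgers.add(ledgerId);
    }

    void updateCursor(String subscription, long ledgerId) {
        cursors.put(subscription, ledgerId);
    }

    /** Removes and returns the ledgers that every cursor has moved past. */
    List<Long> trimConsumedLedgers() {
        List<Long> deleted = new ArrayList<>();
        if (cursors.isEmpty() || ledgers.isEmpty()) {
            return deleted;
        }
        long slowest = Collections.min(cursors.values());
        long current = ledgers.last(); // assumed to be open for writes
        // headSet(x) holds the ids strictly below x; copy before removing.
        for (Long id : new ArrayList<>(ledgers.headSet(Math.min(slowest, current)))) {
            ledgers.remove(id);
            deleted.add(id);
        }
        return deleted;
    }
}
```

The slowest cursor bounds what can be deleted, which is why a single lagging subscription keeps the whole backlog behind it on disk.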
  17. Bookie internal structure
     • Writes are written both to journal and to ledger storage (on different devices)
     • Ledger storage writes are fsynced periodically
     • Reads are only coming from ledger storage
     • Entries are interleaved in entry log files
     • Ledger indexes are used to find entries offset
  18. Bookkeeper issues
     ▪ Performance degrades when writing to many ledgers at the same time
     ▪ When there are heavy reads, the ledger storage device gets slow and will impact writes
     ▪ Ledger storage flushes need to fsync many ledger index files each time
  19. Bookie storage improvements
     • Writes are written both to journal and to in memory write cache
     • Entries are periodically flushed
     • Entries are sorted by ledger to be sequential on disk (per flush period)
     • Since entries are sequential, we added read-ahead cache
     • Location index is mostly kept in memory and only updated during flush
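The idea of buffering entries in a write cache and sorting them by ledger at flush time can be sketched like this. The class and field names are hypothetical, and the sketch only returns the entries in flush order; a real bookie would write them to an entry log and update its location index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical write cache; illustrates the sort-by-ledger flush only.
final class WriteCacheSketch {
    static final class Entry {
        final long ledgerId;
        final long entryId;
        final byte[] data;

        Entry(long ledgerId, long entryId, byte[] data) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.data = data;
        }
    }

    private final List<Entry> cache = new ArrayList<>();

    /** Buffers an entry in arrival order, interleaved across ledgers. */
    void add(long ledgerId, long entryId, byte[] data) {
        cache.add(new Entry(ledgerId, entryId, data));
    }

    /** Drains the cache in (ledgerId, entryId) order, ready for sequential writing. */
    List<Entry> flush() {
        List<Entry> sorted = new ArrayList<>(cache);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.ledgerId)
                .thenComparingLong(e -> e.entryId));
        cache.clear();
        return sorted;
    }
}
```

Once each ledger's entries are contiguous within a flush, reading one entry makes it cheap to prefetch the next ones, which is what makes a read-ahead cache worthwhile.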
  20. Bookkeeper write latency
     ▪ After hardware, the next limit to achieving low latency is JVM GC
     ▪ GC pauses are unavoidable. Try to keep them around 50 ms and as infrequent as possible
       › Switched BK client and servers to use Netty pooled ref-counted buffers and direct memory to hide it from GC and eliminate payload copies
       › Extensively profiled allocations and substantially reduced per-entry object allocations
         • Use Recycler pattern to pool objects (very efficient for same thread allocate/release)
         • Primitive collections
         • Array queue instead of linked queues in executors
         • Open hash maps instead of linked hash maps
         • BTree instead of ConcurrentSkipList
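The Recycler pattern mentioned above pools objects per thread so that the hot path reuses instances instead of allocating new ones. The sketch below is a hand-rolled stand-in built on `ThreadLocal` and `ArrayDeque`; it is not Netty's `Recycler` class, and the `RecyclableEntry` type and its fields are hypothetical.

```java
import java.util.ArrayDeque;

// Hypothetical pooled object; a hand-rolled illustration of the pattern.
final class RecyclableEntry {
    // One pool per thread, so get() and recycle() need no locking.
    private static final ThreadLocal<ArrayDeque<RecyclableEntry>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    long ledgerId;
    long entryId;

    private RecyclableEntry() {
    }

    /** Returns a pooled instance if one is available, otherwise allocates. */
    static RecyclableEntry get(long ledgerId, long entryId) {
        RecyclableEntry entry = POOL.get().pollFirst();
        if (entry == null) {
            entry = new RecyclableEntry();
        }
        entry.ledgerId = ledgerId;
        entry.entryId = entryId;
        return entry;
    }

    /** Clears the fields and hands the instance back to this thread's pool. */
    void recycle() {
        ledgerId = -1;
        entryId = -1;
        POOL.get().addFirst(this);
    }
}
```

The pattern is cheapest when the same thread allocates and releases, as the slide notes, because the instance goes straight back to the pool it came from. The caller must not touch an object after calling `recycle()`.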
  21. Bookie ledgers scalability
     Single bookie: 15K write/s
     [Chart: BK write latency (ms, axis 0 to 4), average and 99pct, against ledgers per bookie (1, 1000, 5000, 10000, 20000, 50000)]
  22. 4. Future
  23. Auto batching
     ▪ Send messages in batches throughout the system
     ▪ Transparent to application
     ▪ Configure group timing and size: e.g.: 1ms / 128Kb
     ▪ For the same byte/s throughput lower the txn/s through the system
       › Less CPU usage in broker/bookies
       › Lower GC pressure
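A batch that closes on either a size limit or a time limit, as in the "1ms / 128Kb" example, could look like the sketch below. The slide describes a future feature, so this is purely illustrative: the class name and API are hypothetical, and the caller passes the current time explicitly, whereas a real client would also flush from a timer when no further messages arrive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batcher; closes a batch on size or age, whichever comes first.
final class BatcherSketch {
    private final int maxBytes;
    private final long maxDelayNanos;
    private final List<byte[]> pending = new ArrayList<>();
    private int pendingBytes = 0;
    private long firstAddedNanos = 0;

    BatcherSketch(int maxBytes, long maxDelayNanos) {
        this.maxBytes = maxBytes;
        this.maxDelayNanos = maxDelayNanos;
    }

    /** Adds a message; returns a batch to send, or null while still filling. */
    List<byte[]> add(byte[] message, long nowNanos) {
        if (pending.isEmpty()) {
            firstAddedNanos = nowNanos; // age is measured from the first message
        }
        pending.add(message);
        pendingBytes += message.length;
        boolean full = pendingBytes >= maxBytes;
        boolean expired = nowNanos - firstAddedNanos >= maxDelayNanos;
        return (full || expired) ? drain() : null;
    }

    /** Returns whatever is pending and resets the batch. */
    List<byte[]> drain() {
        List<byte[]> batch = new ArrayList<>(pending);
        pending.clear();
        pendingBytes = 0;
        return batch;
    }
}
```

Sending one batch instead of many small messages keeps the byte rate the same while reducing the number of operations the broker and bookies have to process.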
  24. Low durability
     ▪ Current throughput bottleneck for bookie writes is journal syncs
     ▪ Could add more bookies, but at a bigger cost
     ▪ Some use cases are ok with losing data on rare occasions
     ▪ Solution
       › Store data in bookies
         • No memory limitation, can build big backlog
       › Don’t write to bookie journal
         • Data is stored in write cache in 2 bookies + broker cache
       › Can lose < 1min data in case 1 broker & 2 bookies crash
     ▪ Higher throughput with fewer bookies
     ▪ Lower publish latency
  25. 5. Q & A
