
October 2016 HUG: Pulsar, a highly scalable, low latency pub-sub messaging system


Yahoo recently open-sourced Pulsar, a highly scalable, low-latency pub-sub messaging system running on commodity hardware. It provides simple pub-sub messaging semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication. Pulsar is used across various Yahoo applications for large-scale data pipelines. Learn more about Pulsar's architecture and use cases in this talk.

Speakers:
Matteo Merli from Pulsar team at Yahoo 

Published in: Technology


  1. Distributed pub/sub platform
     github.com/yahoo/pulsar
     Matteo Merli, mmerli@yahoo-inc.com
     Bay Area Hadoop Meetup, 10/19/2016
  2. What is Pulsar?
     ▪ Hosted multi-tenant pub/sub messaging platform
     ▪ Simple messaging model
     ▪ Horizontally scalable: topics, message throughput
     ▪ Ordering, durability & delivery guarantees
     ▪ Geo-replication
     ▪ Easy to operate (add capacity, replace machines)
     ▪ A few numbers from production usage:
       › 1.5 years, 1.4 M topics, 100 B msg/day, zero data loss
       › Average publish latency < 5 ms, 99th percentile 15 ms
       › 80+ applications onboarded, self-serve provisioning
       › Presence in 8 data centers
  3. Common use cases
     ▪ Application integration
       › Server-to-server control, status propagation, notifications
     ▪ Persistent queue
       › Stream processing, buffering, feed ingestion, tasks dispatcher
     ▪ Message bus for large scale data stores
       › Durable log
       › Replication within and across geo-locations
  4. Main features
     ▪ REST / Java / Command line administrative APIs
       › Provision users / grant permissions
       › Users self-administration
       › Metrics for topics / brokers usage
     ▪ Multi tenancy
       › Authentication / Authorization
       › Storage quota management
       › Tenant isolation policies
       › Message TTL
       › Backlog and subscriptions management tools
     ▪ Message retention and replay
       › Rollback to redeliver already acknowledged messages
  5. Why build a new system?
     ▪ No existing solution satisfied the requirements
       › Multi-tenant, 1M topics, low latency, durability, geo-replication
     ▪ Kafka doesn't scale well with many topics:
       › Storage model based on an individual directory per topic partition
       › Enabling durability kills the performance
     ▪ Ability to manage large backlogs
     ▪ Operations are not very convenient
       › e.g. replacing a server requires manual commands to copy the data and involves clients
       › Client access to ZK clusters is not desirable
     ▪ No scalable support to keep consumer position
  6. Messaging Model
     [Diagram: Producer-X and Producer-Y publish to Topic-T. Subscription-A (Exclusive) has Consumer-A1. Subscription-B (Shared) has Consumer-B1, Consumer-B2 and Consumer-B3.]
     Consumer-A1 receives all messages published on T; B1, B2, B3 receive one third each.
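The delivery behaviour described on this slide can be sketched with a toy in-memory model. This is not the Pulsar client API: the class `SubscriptionSketch` and the helper `dispatch` are hypothetical, and round-robin is only an assumed way of getting the "one third each" split, since the slide does not say how a shared subscription picks a consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the slide's topic and subscriptions; not the Pulsar client API.
class SubscriptionSketch {

    // Hypothetical helper: delivers each message id to exactly one consumer of
    // a subscription, cycling through the consumers in order (assumed
    // round-robin). With a single consumer this is the exclusive case.
    static Map<String, List<Integer>> dispatch(List<String> consumers, int messageCount) {
        Map<String, List<Integer>> received = new LinkedHashMap<>();
        for (String consumer : consumers) {
            received.put(consumer, new ArrayList<>());
        }
        for (int msg = 0; msg < messageCount; msg++) {
            String target = consumers.get(msg % consumers.size());
            received.get(target).add(msg);
        }
        return received;
    }

    public static void main(String[] args) {
        int published = 9; // messages published on Topic-T by the producers

        // Every subscription sees the full stream independently of the others.
        Map<String, List<Integer>> subscriptionA =
                dispatch(Arrays.asList("Consumer-A1"), published);
        Map<String, List<Integer>> subscriptionB =
                dispatch(Arrays.asList("Consumer-B1", "Consumer-B2", "Consumer-B3"), published);

        System.out.println("Subscription-A (exclusive): " + subscriptionA);
        System.out.println("Subscription-B (shared):    " + subscriptionB);
    }
}
```

With 9 messages, Consumer-A1 ends up with all 9 and each of the three B consumers with 3, which matches the caption.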
  7. Client API

     Producer

         PulsarClient client = PulsarClient.create(
                 "http://broker.usw.example.com:8080");

         Producer producer = client.createProducer(
                 "persistent://my-prop/us-west/my-ns/my-topic");

         // Handles retries in case of failure
         producer.send("my-message".getBytes());

         // Async version:
         producer.sendAsync("my-message".getBytes())
                 .thenAccept(msgId -> {
                     // Message was persisted
                 });

     Consumer

         PulsarClient client = PulsarClient.create(
                 "http://broker.usw.example.com:8080");

         Consumer consumer = client.subscribe(
                 "persistent://my-prop/us-west/my-ns/my-topic",
                 "my-subscription-name");

         while (true) {
             // Wait for a message
             Message msg = consumer.receive();

             // Process message …

             // Acknowledge the message so that
             // it can be deleted by broker
             consumer.acknowledge(msg);
         }
  8. Main client library features
     ▪ Sync / Async operations
     ▪ Partitioned topics
     ▪ Transparent batching of messages
     ▪ Compression
     ▪ End-to-end checksum
     ▪ TLS encryption
     ▪ Individual and cumulative acknowledgment
     ▪ Client side stats
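The difference between individual and cumulative acknowledgment can be illustrated with a small model of one subscription's pending messages. This is a sketch under assumptions, not the client library: `AckSketch` and its methods are hypothetical names, and message ids are simplified to increasing `long` values.

```java
import java.util.TreeSet;

// Toy model of one subscription's unacknowledged messages; not the Pulsar client API.
class AckSketch {
    // Ids of messages that were delivered but not yet acknowledged, kept sorted.
    private final TreeSet<Long> unacked = new TreeSet<>();

    void deliver(long messageId) {
        unacked.add(messageId);
    }

    // Individual ack: only this message is marked as processed.
    void acknowledge(long messageId) {
        unacked.remove(messageId);
    }

    // Cumulative ack: this message and every earlier one are marked as processed.
    void acknowledgeCumulative(long messageId) {
        unacked.headSet(messageId, true).clear();
    }

    // Returns a copy so callers cannot modify the internal state.
    TreeSet<Long> pending() {
        return new TreeSet<>(unacked);
    }

    public static void main(String[] args) {
        AckSketch subscription = new AckSketch();
        for (long id = 0; id <= 5; id++) {
            subscription.deliver(id);
        }

        subscription.acknowledge(4);           // out-of-order, single message
        System.out.println(subscription.pending()); // [0, 1, 2, 3, 5]

        subscription.acknowledgeCumulative(2); // everything up to and including 2
        System.out.println(subscription.pending()); // [3, 5]
    }
}
```

Individual acks suit consumers that finish messages out of order, while a cumulative ack lets an in-order consumer confirm a whole prefix with one call.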
  9. Architecture
     Separate layers between brokers and storage (bookies)
     ‣ Broker and bookies can be added independently
     ‣ Traffic can be shifted very quickly across brokers
     ‣ New bookies will ramp up on traffic quickly
     [Diagram: a Pulsar Cluster with ZK, Broker 1 to Broker 3 and Bookie 1 to Bookie 5; a Producer and a Consumer connect to the cluster.]
  10. Architecture
     ‣ End-to-end async message processing
     ‣ Messages are relayed across producers, bookies and consumers with no copies
     ‣ Pooled ref-counted buffers
     ‣ Cache recent messages
     [Diagram labels: Pulsar Cluster; Broker (Load Balancer, Managed Ledger, BK Client, Cache, Dispatcher, Global replicators); Bookie; ZK; Global ZK; Service discovery; Replication; Producer App with Pulsar lib; Consumer App with Pulsar lib.]
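As an illustration of the "pooled ref-counted buffers" bullet, the sketch below shows the general technique: a buffer is shared by several holders without copying and goes back to a pool when the last holder releases it. Everything here (`RefCountedBufferSketch`, `Pool`, `PooledBuffer`) is hypothetical and single-threaded; it is not the broker's actual buffer implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Single-threaded illustration of pooled, reference-counted buffers.
class RefCountedBufferSketch {

    static final class Pool {
        private final Deque<PooledBuffer> free = new ArrayDeque<>();
        private final int bufferSize;

        Pool(int bufferSize) {
            this.bufferSize = bufferSize;
        }

        // Reuses a free buffer when one is available, otherwise allocates.
        PooledBuffer acquire() {
            PooledBuffer buffer = free.poll();
            if (buffer == null) {
                buffer = new PooledBuffer(this, new byte[bufferSize]);
            }
            buffer.refCount = 1; // the caller holds the first reference
            return buffer;
        }

        int freeCount() {
            return free.size();
        }
    }

    static final class PooledBuffer {
        final byte[] data;
        private final Pool pool;
        private int refCount;

        PooledBuffer(Pool pool, byte[] data) {
            this.pool = pool;
            this.data = data;
        }

        // Another holder (e.g. a cache or a dispatcher) shares the same bytes.
        PooledBuffer retain() {
            if (refCount <= 0) {
                throw new IllegalStateException("buffer already released");
            }
            refCount++;
            return this;
        }

        // The last release returns the buffer to the pool instead of freeing it.
        void release() {
            if (refCount <= 0) {
                throw new IllegalStateException("buffer already released");
            }
            if (--refCount == 0) {
                pool.free.push(this);
            }
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool(1024);
        PooledBuffer message = pool.acquire(); // held by the publish path
        message.retain();                      // also held by a cache
        message.release();                     // publish path is done
        message.release();                     // cache evicts the entry
        System.out.println("buffers back in pool: " + pool.freeCount());
    }
}
```

A production version would need atomic reference counts, since holders typically run on different threads.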
  11. BookKeeper
     ▪ Replicated log service
     ▪ Offers consistency and durability
     ▪ Why is it a good choice for Pulsar?
       › Very efficient storage for sequential data
       › For each topic, multiple ledgers are created over time
       › Very good distribution of IO across all bookies
       › Isolation of writes and reads
       › Flexible model for quorum writes with different tradeoffs
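The "quorum writes" and "distribution of IO across all bookies" bullets can be made concrete with a small sketch. The slide does not give the parameters, so the names used here (ensemble size, write quorum, ack quorum) and the round-robin striping rule are my summary of how BookKeeper ledgers are usually configured, stated as an assumption; `QuorumSketch` itself is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates striping entries over an ensemble of bookies with quorum acks.
class QuorumSketch {

    // Indexes (0 .. ensembleSize-1) of the bookies that store a given entry,
    // assuming round-robin striping: consecutive entries start one bookie later.
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add((int) ((entryId + i) % ensembleSize));
        }
        return bookies;
    }

    // An add is confirmed to the writer once ackQuorum bookies have responded.
    static boolean isConfirmed(int acksReceived, int ackQuorum) {
        return acksReceived >= ackQuorum;
    }

    public static void main(String[] args) {
        int ensembleSize = 5; // bookies available to the ledger
        int writeQuorum = 3;  // copies written per entry
        int ackQuorum = 2;    // responses needed before confirming

        for (long entryId = 0; entryId < 5; entryId++) {
            System.out.println("entry " + entryId + " -> bookies "
                    + writeSet(entryId, ensembleSize, writeQuorum));
        }
        System.out.println("confirmed after 2 acks: " + isConfirmed(2, ackQuorum));
    }
}
```

Raising the ack quorum toward the write quorum trades publish latency for stronger durability, and an ensemble larger than the write quorum spreads IO over more bookies.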
  12. BookKeeper - Storage
     ▪ A single bookie can serve and store thousands of ledgers
     ▪ Writes go to the journal, reads come from the ledger device:
       › Keeps read activity from impacting write latency
       › Writes are added to the in-memory write cache and committed to the journal
       › The write cache is flushed in the background to a separate ledger device
     ▪ Entries are sorted to allow for mostly sequential reads
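The write path on this slide can be sketched as follows. This is a simplified in-memory model, not bookie code: `BookieStorageSketch` is hypothetical, the two "devices" are plain lists, the flush is called by hand rather than by a background thread, and sorting by (ledger id, entry id) is an assumed sort key since the slide only says that entries are sorted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// In-memory model of a bookie's journal, write cache and ledger device.
class BookieStorageSketch {

    static final class Entry {
        final long ledgerId;
        final long entryId;
        final String payload;

        Entry(long ledgerId, long entryId, String payload) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.payload = payload;
        }
    }

    private final List<Entry> journal = new ArrayList<>();      // stands in for the journal device
    private final List<Entry> writeCache = new ArrayList<>();   // in-memory write cache
    private final List<Entry> ledgerDevice = new ArrayList<>(); // stands in for the ledger device

    // Write path: cache the entry and append it to the journal in arrival
    // order. The ledger device is not touched here.
    void addEntry(long ledgerId, long entryId, String payload) {
        Entry entry = new Entry(ledgerId, entryId, payload);
        writeCache.add(entry);
        journal.add(entry);
    }

    // Flush: sort cached entries so each ledger's entries are contiguous,
    // then move them to the ledger device.
    void flush() {
        writeCache.sort(Comparator
                .comparingLong((Entry e) -> e.ledgerId)
                .thenComparingLong(e -> e.entryId));
        ledgerDevice.addAll(writeCache);
        writeCache.clear();
    }

    int journalSize() {
        return journal.size();
    }

    // Order of entries on the ledger device, as "ledgerId:entryId".
    List<String> ledgerDeviceOrder() {
        List<String> order = new ArrayList<>();
        for (Entry e : ledgerDevice) {
            order.add(e.ledgerId + ":" + e.entryId);
        }
        return order;
    }

    public static void main(String[] args) {
        BookieStorageSketch bookie = new BookieStorageSketch();
        // Writes for two ledgers arrive interleaved.
        bookie.addEntry(2, 0, "a");
        bookie.addEntry(1, 0, "b");
        bookie.addEntry(2, 1, "c");
        bookie.addEntry(1, 1, "d");

        bookie.flush();
        System.out.println(bookie.ledgerDeviceOrder()); // [1:0, 1:1, 2:0, 2:1]
    }
}
```

Because interleaved writes end up grouped per ledger, a consumer reading one topic's backlog touches mostly adjacent data. Journal trimming after a flush is left out of the sketch.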
  13. Performance: single topic throughput and latency
     [Chart: throughput and 99pct publish latency, 1 topic, 1 producer. Vertical axis: latency (ms), 0 to 6. Horizontal axis: throughput (msg/s), 1,000 to 10,000,000. Series: 10 Bytes, 100 Bytes, 1KB. One point is labeled 1,800,000.]
  14. Final Remarks
     • Check out the code and docs at github.com/yahoo/pulsar
     • Give feedback or ask for more details on mailing lists:
       • Pulsar-Users
       • Pulsar-Dev
