High performance messaging with Apache Pulsar

Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.

  1. HIGH PERFORMANCE MESSAGING WITH APACHE PULSAR http://pulsar.apache.org
  2. WHAT IS PULSAR? “Pub-Sub messaging backed by durable log storage”
  3. WHAT IS APACHE PULSAR? • Multi-tenancy: a single cluster can support many tenants and use cases • Ordering: guaranteed ordering • Durability: data replicated and synced to disk • Delivery guarantees: at least once, at most once, and effectively once • Highly scalable: can support millions of topics • Unified messaging model: supports both topic and queue semantics in a single model • Geo-replication: out-of-the-box support for geographically distributed applications • High throughput: can reach 1.8M messages/s in a single partition • Low latency: publish latency of 5 ms at 99pct
  4. MESSAGING MODEL
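For reference, here is what the unified model looks like with the Pulsar Java client. This is a minimal sketch: the service URL, topic name, and subscription name are placeholders.

    import org.apache.pulsar.client.api.*;

    public class PubSubExample {
        public static void main(String[] args) throws Exception {
            // Connect to a cluster (placeholder URL for a local standalone broker)
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // Publish to a topic
            Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("my-topic")
                    .create();

            // A Shared subscription gives queue semantics; Exclusive/Failover
            // subscriptions on the same topic give streaming (pub-sub)
            // semantics: one model covers both
            Consumer<String> consumer = client.newConsumer(Schema.STRING)
                    .topic("my-topic")
                    .subscriptionName("my-subscription")
                    .subscriptionType(SubscriptionType.Shared)
                    .subscribe();

            producer.send("hello");

            Message<String> msg = consumer.receive();
            consumer.acknowledge(msg);

            client.close();
        }
    }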
  5. DEFINING PERFORMANCE
  6. DISTRIBUTED VS VERTICAL • While both target overall system performance, the focus is very different • Distributed systems generally focus on distributed performance
  7. DISTRIBUTED SYSTEMS PERFORMANCE • Key factors: • How different components interact • How data is replicated • Whether each component can make progress while waiting for the others
  8. STATEFUL SYSTEMS • Macro optimizations typically yield order-of-magnitude differences • Ensure throughput is not bottlenecked by waiting • In the failure path (e.g. how to replace a failed node) • In ops tasks (e.g. how to expand a cluster)
  9. STATEFUL SYSTEMS • Stateful systems can become unbalanced when traffic changes • The system needs to be designed to allow for quick reaction, distributing the load across all nodes
  10. VERTICAL OPTIMIZATIONS • Ensure a single machine can deliver the maximum throughput • Optimize thread access • Concurrent data structures • Micro-profiling
  11. ARCHITECTURAL VIEW
  12. SEGMENT CENTRIC STORAGE • In addition to partitioning, messages are stored in segments (split based on time and size) • Segments are independent from each other and spread across all storage nodes
  13. SEGMENT CENTRIC • Unbounded log storage • Instant scaling without data rebalancing • High write and read availability via maximized data placement options • Fast replica repair — many-to-many reads
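Conceptually, placement behaves like the sketch below (a hypothetical helper, not actual Pulsar or BookKeeper code): every new segment independently picks its set of storage nodes, so a freshly added node starts taking writes immediately, with no rebalancing of existing data.

    import java.util.*;

    // Illustrative only: each new segment picks its own ensemble of storage
    // nodes, independently of where earlier segments of the topic live.
    class SegmentPlacement {
        private final List<String> bookies;
        private final Random random = new Random();

        SegmentPlacement(List<String> bookies) {
            this.bookies = bookies;
        }

        // A newly added bookie shows up in `bookies` and is eligible for the
        // very next segment; existing segments are never moved.
        List<String> ensembleForNewSegment(int ensembleSize) {
            List<String> shuffled = new ArrayList<>(bookies);
            Collections.shuffle(shuffled, random);
            return shuffled.subList(0, ensembleSize);
        }
    }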
  14. SEGMENTS VS PARTITIONS
  15. COMPARISON WITH APACHE KAFKA • In Kafka, partitions are sticky to brokers • A single partition is stored entirely on a single node • Retention is limited by a single node's storage capacity • Failure recovery and capacity expansion require “rebalancing” • Rebalancing has a big impact on the system, affecting regular traffic
  16. DATA PATH 1 — Publisher sends message to broker
  17. DATA PATH 2 — Broker writes in parallel to N replicas
  18. DATA PATH 3 — Wait for a quorum of acks from bookies
  19. DATA PATH 4 — Send ack to producer — Dispatch to consumer
  20. BOOKKEEPER REPLICATION MODEL • Single writer (the Pulsar broker) • Writes in parallel to multiple storage nodes • Waits for a configurable number of acks • Supports quorum writes (e.g. write to 3 nodes — wait for 2 acks) • Recovery is performed only after the writer crashes • It establishes what the last committed entry was and “seals” the segment
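The “write 3, wait 2” behavior can be sketched with plain Java futures. This is simplified (real BookKeeper also enforces per-entry ordering and handles write failures), but it shows why the slowest replica does not gate the ack:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified sketch: the write completes once ackQuorum of the parallel
    // replica writes succeed, without waiting for the slowest replica.
    class QuorumWrite {
        static CompletableFuture<Void> writeQuorum(
                List<CompletableFuture<Void>> replicaWrites, int ackQuorum) {
            CompletableFuture<Void> done = new CompletableFuture<>();
            AtomicInteger acks = new AtomicInteger();
            for (CompletableFuture<Void> write : replicaWrites) {
                write.thenRun(() -> {
                    // e.g. with 3 replicas and ackQuorum=2, the 2nd ack wins
                    if (acks.incrementAndGet() == ackQuorum) {
                        done.complete(null);
                    }
                });
            }
            return done;
        }
    }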
  21. KAFKA REPLICATION MODEL: ISR replication
  22. LIMITATIONS OF KAFKA REPLICATION • When followers are “in-sync”, the leader has to wait for them — it cannot prune the slowest follower • Leader election happens per partition • To ensure ordering, only 1 message (or batch) can be outstanding, which limits throughput • Reads can only happen from the leader broker
  23. STORAGE
  24. STORAGE • Disk access patterns can lead to order-of-magnitude differences • Systems that rely on the page cache have unpredictable performance • The page cache runs at RAM speed until the system is under stress • After that, memory accesses can take hundreds of milliseconds
  25. BOOKKEEPER INTERNALS
  26. BOOKKEEPER STORAGE • IO isolation between write and read operations • Slow consumers won't impact write latency • Very effective IO patterns: • Journal — append-only and no reads • Storage device — bulk writes and sequential reads • The number of files is independent of the number of topics
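In the bookie configuration (bk_server.conf) this isolation comes from pointing the journal and the ledger storage at separate devices. A sketch, with placeholder paths:

    # Append-only journal on a dedicated device: writes only, never read
    # during normal operation
    journalDirectory=/mnt/journal-device/journal
    # Ledger storage on a separate device: bulk writes and sequential reads
    ledgerDirectories=/mnt/storage-device/ledgers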
  27. KAFKA STORAGE • Multiple files per partition — each segment has data and index files • Many file descriptors are needed • The page cache only works well until the active data set exceeds RAM size • IO is scattered throughout the disk
  28. OPTIMIZATIONS
  29. OPTIMIZATIONS • Payload buffer pooling — direct memory — no heap pollution • Object pooling in the data path — minimizes GC work • Operations are serialized onto a single thread to avoid mutex contention • Pulsar brokers act as a “proxy” — payloads are forwarded with zero copies from producers to storage and consumers
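As an illustration of payload buffer pooling, here is a minimal sketch using Netty's pooled direct (off-heap) buffers, the allocator family Pulsar builds on; the buffer size and payload are arbitrary:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class PooledBufferExample {
        public static void main(String[] args) {
            // Direct buffer from the shared pool: stays off the Java heap,
            // so the GC never has to trace or copy the payload bytes
            ByteBuf payload = PooledByteBufAllocator.DEFAULT.directBuffer(1024);
            payload.writeBytes("message payload".getBytes());
            try {
                // ... hand the buffer to the IO layer ...
            } finally {
                // Return the memory to the pool instead of leaving it to GC
                payload.release();
            }
        }
    }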
  30. BENCHMARK
  31. OPENMESSAGING BENCHMARK: openmessaging.cloud • openmessaging.cloud/docs/benchmarks
  32. BENCHMARK FRAMEWORK • Designed to measure the performance of distributed messaging systems • Supports various “drivers” (Kafka, Pulsar, RocketMQ, RabbitMQ) • Automated deployment on EC2 • Workloads are configured through a YAML file (see the sketch below)
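A workload definition is a small YAML file. The sketch below uses field names from the openmessaging-benchmark repository and values matching the tests later in this deck; treat the exact keys as an assumption:

    name: 1-topic-1-partition-1kb
    topics: 1
    partitionsPerTopic: 1
    messageSize: 1024            # 1KB payload
    producersPerTopic: 1
    subscriptionsPerTopic: 1
    consumerPerSubscription: 1
    producerRate: 50000          # target publish rate (msg/s)
    testDurationMinutes: 15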
  33. DISTRIBUTED EXECUTION • The coordinator takes the workload definition and propagates it to multiple workers • It collects and reports stats
  34. BENCHMARK RESULTS • Testing goals: throughput & latency under different conditions • Min 2 guaranteed copies • Running on 3 EC2 VMs with local SSDs
  35. KAFKA SETTINGS • Topic settings: replicationFactor=3 min.insync.replicas=2 log.flush.interval.ms= (left at the default, which means no fsyncs) • Kafka producer config: acks=all linger.ms=1 batch.size=131072
  36. PULSAR / BOOKKEEPER SETTINGS • Use ensemble=3 write=3 ack=2 • Write to 3 bookies and wait for 2 acks • Data is synced to disk before the ack
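In Pulsar's broker.conf these correspond to the managed-ledger quorum settings; key names are from the Pulsar distribution, and values match this benchmark setup:

    # Write each segment to 3 bookies (ensemble=3, write quorum=3)...
    managedLedgerDefaultEnsembleSize=3
    managedLedgerDefaultWriteQuorum=3
    # ...but ack the producer once 2 bookies have synced the entry to disk
    managedLedgerDefaultAckQuorum=2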
  37. Max throughput: 1 topic, 1 partition, 1KB payload
  38. Max throughput with an exactly-once producer: 1 topic, 1 partition, 1KB payload. Kafka settings: enable.idempotence=true max.in.flight.requests.per.connection=1 retries=2147483647
  39. Latency at fixed throughput: 50K msg/s, 1 topic, 1 partition, 1KB payload
  40. Latency at fixed throughput (99pct): 50K msg/s, 1 topic, 1 partition, 1KB payload
  41. Latency at fixed throughput (including Kafka-sync): 50K msg/s, 1 topic, 1 partition, 1KB payload
  42. Latency at fixed throughput (99pct): 50K msg/s, 1 topic, 1 partition, 1KB payload
  43. OPTIMIZING FOR LOW LATENCY • Testing at a smaller throughput for sub-millisecond latency • Tested on a bare-metal server • A single machine is used to isolate the impact of slow networks
  44. HARDWARE SETUP • 1 machine, bare metal • 12 CPU cores — Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz • 128 GB RAM • 2 x 1.2TB NVMe disks
  45. Latency at low throughput: 1K msg/s, 1 topic, 1 partition, 1KB payload
  46. Latency at low throughput: 1K msg/s, 1 topic, 1 partition, 1KB payload
  47. Latency at low throughput: 1K msg/s, 1 topic, 1 partition, 1KB payload
  48. QUESTIONS?
