Apache Pulsar: The Next Generation Messaging and Queuing System

Apache Pulsar is a next-generation messaging and queuing system whose design trade-offs are driven by the need for scalability and durability. Its two-layer architecture, which separates message storage from serving, unifies the flexibility and high-level constructs of messaging, queuing and lightweight computing with the scalability of log storage systems.

Published in: Data & Analytics

  1. © 2019 SPLUNK INC. The Next Generation Messaging and Queuing System
  2. Intro
     Matteo Merli: Senior Principal Engineer, Splunk; co-creator of Apache Pulsar
     Karthik Ramasamy: Senior Director of Engineering, Splunk
  3. Messaging and Streaming
  4. Messaging: message passing between components, applications, and services
  5. Streaming: analyze events that just happened
  6. Messaging vs Streaming: 2 worlds, 1 infra
  7. Use cases
     Messaging
     ● OLTP, Integration
     ● Main challenges:
       ○ Latency
       ○ Availability
       ○ Data durability
       ○ High-level features
         ■ Routing, DLQ, delays, individual acks
     Streaming
     ● Real-time analytics
     ● Main challenges:
       ○ Throughput
       ○ Ordering
       ○ Stateful processing
       ○ Batch + Real-Time
  8. Storage, Messaging, Compute
  9. Apache Pulsar: flexible Pub-Sub and Compute backed by durable log storage
     ● Durability: data replicated and synced to disk
     ● Low Latency: publish latency of 5 ms at the 99th percentile
     ● High Throughput: can reach 1.8 M messages/s in a single partition
     ● High Availability: system is available if any 2 nodes are up
     ● Cloud Native: takes advantage of dynamic cluster scaling in cloud environments
  10. Apache Pulsar: flexible Pub-Sub and Compute backed by durable log storage
      ● Unified messaging model: supports both topic and queue semantics in a single model
      ● Highly Scalable: can support millions of topics
      ● Native Compute: lightweight compute framework based on functions
      ● Multi Tenant: supports multiple users and workloads in a single cluster
      ● Geo Replication: out-of-the-box support for geographically distributed applications
  11. Apache Pulsar project in numbers: 192 contributors, 30 committers, 100s of adopters, 4.6K GitHub stars
  12. Sample of Pulsar users and contributors
  13. Messaging Model
  14. Pulsar Client libraries
      ● Java, C++, C, Python, Go, NodeJS, WebSocket APIs
      ● Partitioned topics
      ● Apache Kafka compatibility wrapper API
      ● Transparent batching and compression
      ● TLS encryption and authentication
      ● End-to-end encryption
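Partitioned topics spread one topic's messages over several partitions, and a common way to keep messages with the same key together is to hash the key. The following is a minimal sketch of that idea, assuming a plain hash-modulo scheme. `KeyRouter` and `partitionFor` are hypothetical names for illustration, not Pulsar client APIs, and the real client's hash function may differ.

```java
/** Illustrative only: hypothetical key-to-partition routing, not the Pulsar client API. */
final class KeyRouter {
    private final int numPartitions;

    KeyRouter(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /** Maps a message key to a partition index in [0, numPartitions). */
    int partitionFor(String key) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the mapping depends only on the key and the partition count, every message with the same key lands on the same partition as long as the partition count stays fixed.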
  15. Architectural view: separate layers for brokers and bookies
      ● Brokers and bookies can be added independently
      ● Traffic can be shifted very quickly across brokers
      ● New bookies ramp up on traffic quickly
  16. Apache BookKeeper: replicated log storage
      ● Low-latency durable writes
      ● Simple repeatable read consistency
      ● Highly available
      ● Store many logs per node
      ● I/O isolation
  17. Inside BookKeeper: storage optimized for sequential and immutable data
      ● I/O isolation between write and read operations
      ● Does not rely on OS page cache
      ● Slow consumers won’t impact latency
      ● Very effective I/O patterns:
        ○ Journal: append only and no reads
        ○ Storage device: bulk writes and sequential reads
      ● Number of files is independent of the number of topics
  18. Segment Centric Storage
      ● In addition to partitioning, messages are stored in segments (based on time and size)
      ● Segments are independent of each other and spread across all storage nodes
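The segment idea above can be sketched as a toy log that rolls over to a new segment when the current one is full and places each new segment on the next storage node. This is only an illustration under simplifying assumptions: rollover is by size alone (the slide also mentions time), each segment lives on a single node rather than a replicated set, and `SegmentedLog` and `Segment` are hypothetical names, not BookKeeper classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a toy segmented log; names and placement policy are assumptions. */
final class SegmentedLog {
    /** One segment: which storage node holds it and how many bytes it contains. */
    static final class Segment {
        final int node;
        long bytes;

        Segment(int node) {
            this.node = node;
        }
    }

    private final long maxSegmentBytes;
    private final int numNodes;
    private final List<Segment> segments = new ArrayList<>();

    SegmentedLog(long maxSegmentBytes, int numNodes) {
        this.maxSegmentBytes = maxSegmentBytes;
        this.numNodes = numNodes;
    }

    /** Appends a message, rolling to a new segment on the next node when the current one is full. */
    void append(int messageBytes) {
        Segment current = segments.isEmpty() ? null : segments.get(segments.size() - 1);
        if (current == null || current.bytes + messageBytes > maxSegmentBytes) {
            // Round-robin placement spreads the segments of one partition across all nodes.
            current = new Segment(segments.size() % numNodes);
            segments.add(current);
        }
        current.bytes += messageBytes;
    }

    List<Segment> segments() {
        return segments;
    }
}
```

Since only the newest segment is ever written, a newly added node can start receiving segments immediately without any existing data being moved.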
  19. Segments vs Partitions
  20. Tiered Storage
      ● Unlimited topic storage capacity
      ● Achieves true “stream-storage”: keep the raw data forever in stream form
      ● Extremely cost effective
  21. Schema Registry
      ● Store information on the data structure (stored in BookKeeper)
      ● Enforce data types on topics
      ● Allow for compatible schema evolution
  22. Schema Registry: type-safe API
      ● Integrated schema in API
      ● End-to-end type safety, enforced in the Pulsar broker

          // client is an already-built PulsarClient; all types are in org.apache.pulsar.client.api
          Producer<MyClass> producer = client
              .newProducer(Schema.JSON(MyClass.class))
              .topic("my-topic")
              .create();
          producer.send(new MyClass(1, 2));

          Consumer<MyClass> consumer = client
              .newConsumer(Schema.JSON(MyClass.class))
              .topic("my-topic")
              .subscriptionName("my-subscription")
              .subscribe();
          Message<MyClass> msg = consumer.receive();
  23. Geo Replication
      ● Scalable asynchronous replication
      ● Integrated in the broker message flow
      ● Simple configuration to add/remove regions
  24. Replicated Subscriptions: migrate subscriptions across geo-replicated clusters
      ● Consumption will restart close to where a consumer left off
        ○ Small amount of duplicates
      ● Implementation
        ○ Use markers injected into the data flow
        ○ Create a consistent snapshot of message ids across clusters
        ○ Establish a relationship: if a consumer has consumed MA-1 in Cluster-A, it must have consumed MB-2 in Cluster-B
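The snapshot relationship described above can be pictured as a sorted table of position pairs. The sketch below is a simplification under stated assumptions: message ids are reduced to a single `long` per cluster, and `SubscriptionSnapshots` with its methods is a hypothetical structure for illustration, not the broker's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: hypothetical snapshot table linking positions in two clusters. */
final class SubscriptionSnapshots {
    // Key: message position in Cluster-A; value: the equivalent position in Cluster-B.
    private final TreeMap<Long, Long> aToB = new TreeMap<>();

    /** Records a consistent snapshot: position a in Cluster-A corresponds to b in Cluster-B. */
    void addSnapshot(long a, long b) {
        aToB.put(a, b);
    }

    /**
     * Returns the Cluster-B position to resume from after consuming up to ackedInA,
     * or -1 if no snapshot is old enough. Resuming from the snapshot at or before
     * the acknowledged position may redeliver a few messages but never skips any.
     */
    long resumePositionInB(long ackedInA) {
        Map.Entry<Long, Long> entry = aToB.floorEntry(ackedInA);
        return entry == null ? -1L : entry.getValue();
    }
}
```

Choosing the latest snapshot at or before the acknowledged position is what produces the small number of duplicates mentioned on the slide: anything consumed after that snapshot is delivered again in the other cluster.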
  25. Multi-Tenancy: a single Pulsar cluster supports multiple users and mixed workloads
      ● Authentication, authorization, namespaces, admin APIs
      ● I/O isolation between writes and reads
        ○ Provided by BookKeeper
        ○ Ensures readers draining backlog won’t affect publishers
      ● Soft isolation
        ○ Storage quotas, flow control, back-pressure, rate limiting
      ● Hardware isolation
        ○ Constrain some tenants to a subset of brokers or bookies
  26. Lightweight Compute with Pulsar Functions
  27. Pulsar Functions
  28. Pulsar Functions
      ● User-supplied compute against a consumed message
        ○ ETL, data enrichment, filtering, routing
      ● Simplest possible API
        ○ Use language-specific “function” notation
        ○ No SDK required
        ○ SDK available for more advanced features (state, metrics, logging, …)
      ● Language agnostic
        ○ Java, Python and Go
        ○ Easy to support more languages
      ● Pluggable runtime
        ○ Managed or manual deployment
        ○ Run as threads, processes or containers in Kubernetes
  29. Pulsar Functions: examples

      Python

          def process(input):
              return input + '!'

      Java

          import java.util.function.Function;

          public class ExclamationFunction implements Function<String, String> {
              @Override
              public String apply(String input) {
                  return input + "!";
              }
          }
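Because the Java example above depends only on `java.util.function.Function`, it can be exercised locally with no broker or SDK. The sketch below repeats the slide's function and adds a hypothetical helper, `ExclamationDemo.shoutTwice`, purely to illustrate that such functions compose like any other Java function.

```java
import java.util.function.Function;

/** Same logic as the slide's Java example, exercised locally without a broker. */
final class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}

/** Hypothetical helper: plain functions compose like any other java.util.function.Function. */
final class ExclamationDemo {
    static String shoutTwice(String input) {
        Function<String, String> f = new ExclamationFunction();
        // andThen feeds the output of the first call into the second.
        return f.andThen(f).apply(input);
    }
}
```

This is the practical benefit of the "no SDK required" point: the business logic can be unit-tested as ordinary code before it is deployed as a function.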
  30. Pulsar Functions: state management
      ● Functions can store state in stream storage
      ● State is global and replicated
      ● Multiple instances of the same function can access the same state
      ● Functions framework provides a simple abstraction over state
  31. Pulsar Functions: state management
      ● Implemented on top of the Apache BookKeeper “Table Service”
      ● BookKeeper provides a sharded key/value store based on:
        ○ Log and snapshot, stored as BookKeeper ledgers
        ○ Warm replicas that can be quickly promoted to leader
      ● In case of leader failure there is no downtime and no huge log to replay
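The log-and-snapshot approach above can be sketched with a toy counter store: state is rebuilt from the most recent snapshot plus the log entries written after it, so recovery never replays the full history. Everything here is a hypothetical in-memory illustration (`CounterStore`, `snapshot`, `logTail`, `recover`), not the BookKeeper Table Service API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy counter store rebuilt from a snapshot plus a log tail. */
final class CounterStore {
    private final Map<String, Long> counters = new HashMap<>();
    // Entries appended since the last snapshot, each as {key, delta}.
    private final List<String[]> log = new ArrayList<>();

    /** Applies an increment and appends it to the log. */
    void increment(String key, long delta) {
        counters.merge(key, delta, Long::sum);
        log.add(new String[] {key, Long.toString(delta)});
    }

    /** Takes a snapshot of the current state and truncates the log. */
    Map<String, Long> snapshot() {
        log.clear();
        return new HashMap<>(counters);
    }

    /** Returns a copy of the log entries written since the last snapshot. */
    List<String[]> logTail() {
        return new ArrayList<>(log);
    }

    /** Rebuilds a store from a snapshot and the log tail, as a replica would on promotion. */
    static CounterStore recover(Map<String, Long> snapshot, List<String[]> tail) {
        CounterStore store = new CounterStore();
        store.counters.putAll(snapshot);
        for (String[] entry : tail) {
            store.counters.merge(entry[0], Long.parseLong(entry[1]), Long::sum);
        }
        return store;
    }

    long get(String key) {
        return counters.getOrDefault(key, 0L);
    }
}
```

A warm replica that applies log entries continuously is in effect running `recover` incrementally, which is why promotion to leader does not require replaying a large log.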
  32. Pulsar Functions: state example

          import org.apache.pulsar.functions.api.Context;
          import org.apache.pulsar.functions.api.Function;

          public class CounterFunction implements Function<String, Void> {
              @Override
              public Void process(String input, Context context) {
                  // split() takes a regex, so the literal dot must be escaped.
                  for (String word : input.split("\\.")) {
                      // Increment the replicated counter stored under this word.
                      context.incrCounter(word, 1);
                  }
                  return null;
              }
          }
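One detail worth checking in a counting function like the one above: Java's `String.split` treats its argument as a regular expression, so an unescaped `"."` matches every character and yields no tokens at all. Below is a minimal, broker-free sketch of the difference. `DotTokenCounter` and its methods are hypothetical names for illustration and are not part of the Pulsar Functions API.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: counts dot-separated tokens in memory; not the Pulsar Functions API. */
final class DotTokenCounter {
    /** Counts tokens separated by a literal dot, which must be escaped in the regex. */
    static Map<String, Integer> count(String input) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : input.split("\\.")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Number of tokens produced by the unescaped pattern ".", which matches every character. */
    static int unescapedTokenCount(String input) {
        return input.split(".").length;
    }
}
```

With the unescaped pattern, the loop body in a counter function would never run for non-empty input, so no counter would ever be incremented.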
  33. Pulsar IO: connectors framework based on Pulsar Functions
  34. Built-in Pulsar IO connectors
  35. Querying data stored in Pulsar
  36. Pulsar SQL
      ● Uses Presto for interactive SQL queries over data stored in Pulsar
      ● Query historic and real-time data
      ● Integrated with schema registry
      ● Can join with data from other sources
  37. Pulsar SQL
      ● Read data directly from BookKeeper into Presto, bypassing the Pulsar broker
      ● Many-to-many data reads
        ○ Data is split even on a single partition: multiple workers can read data in parallel from a single Pulsar partition
      ● Time-based indexing: use “publishTime” in predicates to reduce the data read from disk
  38. Pulsar Storage API
      ● Work in progress to allow direct access to data stored in Pulsar
      ● Generalization of the work done for the Presto connector
      ● Most efficient way to retrieve and process data from “batch” execution engines
  39. Thank You
