Ricardo Paiva
First impressions of Apache Pulsar features from
someone who has never used it. :)
Apache Pulsar
First Overview
Motivation
3 •
Motivation
Kafka is an amazing tool, with incredible throughput and resilience, but it has some
drawbacks and lacks a few features:
• Capacity of a partition is limited by the smallest node
• Ops: adding or removing a broker requires cluster rebalancing
• No long-term storage
• Only the pub/sub client pattern (no work queue)
• No namespace or tenancy management
• No multi-cluster replication
Key concepts
5 •
Tiered Storage
Uses Apache jclouds
6 •
Multi-tenant and Namespace
Pulsar Components
8 •
Brokers
9 •
Bookies
10 •
Producer
11 •
Consumer
12 •
ZooKeeper
13 •
Schema Registry
• Schemas are stored in BookKeeper by default, but another schema storage backend can be plugged in
• A schema can be uploaded when a typed producer is created or via the REST API
• Versioned
• Defined at topic level
• Format types:
  • String (used for UTF-8-encoded strings)
  • JSON
  • Protobuf
  • Avro
• Only works with Java
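To make "versioned" and "defined at topic level" concrete, here is a toy in-memory model. It is not Pulsar's implementation or API: `ToySchemaRegistry`, `upload` and `latest` are hypothetical names, and the rule that re-uploading an unchanged schema keeps the same version is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchemaRegistry:
    """Toy model: each topic keeps an ordered list of schema versions."""
    _schemas: dict = field(default_factory=dict)

    def upload(self, topic: str, schema_type: str, definition: str) -> int:
        """Register a schema for a topic and return its version number."""
        versions = self._schemas.setdefault(topic, [])
        entry = (schema_type, definition)
        # Assumption: uploading the latest schema again creates no new version.
        if versions and versions[-1] == entry:
            return len(versions) - 1
        versions.append(entry)
        return len(versions) - 1

    def latest(self, topic: str):
        """Return (version, schema_type, definition) for the newest schema."""
        versions = self._schemas[topic]
        schema_type, definition = versions[-1]
        return len(versions) - 1, schema_type, definition


registry = ToySchemaRegistry()
v0 = registry.upload("orders", "JSON", '{"id": "int"}')
v0_again = registry.upload("orders", "JSON", '{"id": "int"}')
v1 = registry.upload("orders", "JSON", '{"id": "int", "total": "float"}')
```

Each topic has its own version history, so a second topic would start again at version 0.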
Subscription modes
15 •
Message Acknowledgment
16 •
Retention
• Message retention
  • Applies to messages that have been acknowledged and would otherwise be deleted
  • It is a time limit applied to a topic
• TTL
  • Applies to messages that have not been consumed
  • It is a time limit on consumption within a subscription
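The difference between the two limits can be sketched with a toy model. This is a simplification and not Pulsar code: it assumes a single subscription, measures both limits from the publish time, and the names `Message`, `apply_ttl` and `apply_retention` are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    payload: str
    published_at: float        # seconds on an arbitrary clock
    acknowledged: bool = False


def apply_ttl(messages, now, ttl_seconds):
    """TTL: unacknowledged messages older than the TTL are treated as acknowledged."""
    for m in messages:
        if not m.acknowledged and now - m.published_at > ttl_seconds:
            m.acknowledged = True


def apply_retention(messages, now, retention_seconds):
    """Retention: acknowledged messages are kept until the retention time has passed."""
    return [
        m for m in messages
        if not m.acknowledged or now - m.published_at <= retention_seconds
    ]


msgs = [Message("a", 0.0, acknowledged=True), Message("b", 0.0), Message("c", 90.0)]

# At t=100: "b" exceeds the 50 s TTL and is expired; nothing is old enough to delete.
apply_ttl(msgs, now=100, ttl_seconds=50)
kept = apply_retention(msgs, now=100, retention_seconds=200)

# At t=250: "c" expires too, while "a" and "b" fall out of the 200 s retention window.
apply_ttl(kept, now=250, ttl_seconds=50)
kept_later = apply_retention(kept, now=250, retention_seconds=200)
```

In this model TTL only decides when an unconsumed message stops waiting for a consumer, while retention decides how long an already acknowledged message stays on disk.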
17 •
Exclusive
18 •
Failover
19 •
Shared (work queue)
• Message ordering is not guaranteed.
• Cumulative acknowledgment cannot be used in shared mode.
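A toy dispatcher shows how the three subscription modes differ. It is only a sketch: the `dispatch` function is hypothetical, failover is reduced to "the first consumer is the active one", and shared mode is modeled as plain round-robin.

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for a toy subscription in the given mode."""
    received = {c: [] for c in consumers}
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("exclusive mode allows a single consumer")
        received[consumers[0]] = list(messages)
    elif mode == "failover":
        # Only the first (active) consumer gets messages; the others stand by.
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Round-robin across consumers, so each consumer sees gaps in the sequence.
        for message, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(message)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return received


shared = dispatch([1, 2, 3, 4, 5], ["a", "b"], "shared")
failover = dispatch([1, 2, 3, 4, 5], ["a", "b"], "failover")
```

The shared result also hints at why cumulative acknowledgment does not fit this mode: if consumer "a" acknowledged "everything up to 3", it would also acknowledge message 2, which was delivered to consumer "b".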
Internals
21 •
Bookie Storage
22 •
Cold storage
23 •
SQL with Presto
Other features
25 •
Geo Replication (Sync)
• Requires a global ZooKeeper installation
• Region-aware placement policy
• Higher latency
26 •
Geo Replication (Async)
• Rack-aware placement policy
• Messages are first persisted to the local cluster and then replicated asynchronously to the remote clusters
• Enabled on a per-tenant basis
• Types:
  • master-slave replication
  • active-active bidirectional replication
  • full-mesh replication between multiple data centers
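The "persist locally first, replicate later" flow can be sketched as a toy model. `ToyCluster` and its methods are hypothetical; a real replicator runs in the background, whereas here `replicate()` is called by hand to make the ordering visible.

```python
from collections import deque


class ToyCluster:
    def __init__(self, name):
        self.name = name
        self.log = []              # messages persisted in this cluster
        self.remotes = []          # clusters this one replicates to
        self.pending = deque()     # (remote, message) pairs not yet replicated

    def publish(self, message):
        """Persist locally first, then queue the message for each remote."""
        self.log.append(message)
        for remote in self.remotes:
            self.pending.append((remote, message))
        return "ack"               # the producer is acknowledged before replication

    def replicate(self):
        """Drain the queue; stands in for the background replicator."""
        while self.pending:
            remote, message = self.pending.popleft()
            # Written straight to the remote log so it is not replicated back.
            remote.log.append(message)


us, eu = ToyCluster("us"), ToyCluster("eu")
us.remotes = [eu]                  # one-way; add eu.remotes = [us] for active-active

ack = us.publish("m1")
eu_before = list(eu.log)           # replication has not happened yet
us.replicate()
eu_after = list(eu.log)
```

The producer gets its acknowledgment before the remote cluster has the message, which is where the lower latency compared with the synchronous setup comes from.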
27 •
Deduplication
• Per-producer/topic sequence numbers are used to detect duplicates
• Each topic's owner broker maintains an in-memory hash map of the latest sequence number per topic/producer.
• The broker periodically snapshots the latest sequence numbers to a cursor, which allows the map to be reconstructed by another broker after a failover.
https://jack-vanlightly.com/blog/2018/10/25/testing-producer-deduplication-in-apache-kafka-and-apache-pulsar
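The sequence-number check can be sketched for a single topic as follows. This is a toy model and not broker code: `ToyBroker` is a hypothetical name, the snapshot is a plain dict instead of a cursor, and it is taken on demand rather than periodically.

```python
class ToyBroker:
    def __init__(self, snapshot=None):
        # Latest sequence number seen per producer, for one topic.
        self.last_seq = dict(snapshot or {})
        self.log = []

    def publish(self, producer, seq, payload):
        """Persist the message unless its sequence number was already seen."""
        if seq <= self.last_seq.get(producer, -1):
            return False           # duplicate: drop it
        self.log.append(payload)
        self.last_seq[producer] = seq
        return True

    def snapshot(self):
        """Stand-in for persisting the map to a cursor."""
        return dict(self.last_seq)


broker = ToyBroker()
first = broker.publish("p1", 0, "a")
second = broker.publish("p1", 1, "b")
retry = broker.publish("p1", 1, "b")      # producer retries after a timeout

# Fail-over: another broker rebuilds the map from the snapshot.
new_owner = ToyBroker(broker.snapshot())
retry_after_failover = new_owner.publish("p1", 1, "b")
next_message = new_owner.publish("p1", 2, "c")
```

Because the new owner starts from the snapshot, a retry of an already persisted message is still rejected after the fail-over.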
28 •
Pulsar Functions
• Lightweight compute framework for Pulsar
• Can run inside or outside the cluster
• State storage is handled by BookKeeper
• "Serverless" idea
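To give a feel for how lightweight a function can be, here is a minimal sketch in the style of a plain Python function that takes one message and returns one result. The name `process` and the idea that the return value is published to an output topic are assumptions to check against the Pulsar Functions documentation; deployment and topic configuration are not shown.

```python
def process(input):
    """Append an exclamation mark to each incoming message."""
    return "{}!".format(input)


# Local check, without any Pulsar cluster involved.
result = process("hello")
```

The function holds no connection or consumer code at all; wiring it to input and output topics would be the job of the framework.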

Editor's Notes

  • #2 Do a quick presentation of each other, then a short agenda (first Kafka basics, second the design choices that made it a great tool for our scale)