Pulsar - flexible pub-sub for internet scale

APACHE PULSAR
Flexible Pub-Sub system for Internet scale
http://pulsar.apache.org
Content on this presentation is licensed under a
Creative Commons Attribution 4.0 International license

Pulsar graduates as TLP project today!

WHO AM I?
• Matteo Merli
• Apache Pulsar PMC Chair
• Member of Apache BookKeeper PMC
• Co-Founder of Streamlio
• Worked on Pulsar since its beginning at Yahoo

WHAT IS APACHE PULSAR?
“Pub-Sub messaging backed by durable log storage”

WHAT IS APACHE PULSAR?
• Multi-tenancy: a single cluster can support many tenants and use cases
• Ordering: guaranteed ordering
• Durability: data replicated and synced to disk
• Delivery guarantees: at least once, at most once and effectively once
• Highly scalable: can support millions of topics
• Unified messaging model: supports both topic and queue semantics in a single model
• Geo-replication: out-of-the-box support for geographically distributed applications
• High throughput: can reach 1.8 M messages/s in a single partition
• Low latency: low publish latency of 5 ms at the 99th percentile

WHY BUILD A NEW SYSTEM?
• No existing solution to satisfy requirements
  • Multi tenant, 1M topics, low latency, durability, geo replication
• Other systems don’t scale well with many topics:
  • Storage model based on individual directory per topic partition
  • Durability kills the performance
• Ability to manage large backlogs: read old data without impacting writers
• Many other choking points: getting stats, access to metadata, flow-control
• Operations are not very convenient: replacing servers, expanding clusters, etc…

STATE OF THE PROJECT
• Project started at Yahoo around 2012 and went through various iterations
• Open-Sourced in September 2016
• Entered Apache Incubator in June 2017
• Graduated as TLP in September 2018
• 2249 commits, 22 Yahoo releases, 9 Apache releases
• 59 Contributors

ARCHITECTURAL VIEW
Separate layers for brokers and bookies
• Brokers and bookies can be added independently
• Traffic can be shifted very quickly across brokers
• New bookies will ramp up on traffic quickly

APACHE BOOKKEEPER
Replicated log storage
• Low-latency durable writes
• Simple repeatable read consistency
• Highly available
• Store many logs per node
• I/O Isolation

SEGMENT CENTRIC STORAGE
• In addition to partitioning, messages are stored in segments (based on time and size)
• Segments are independent from each other and spread across all storage nodes
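
As a rough illustration of the segment idea above, here is a toy Python model of a partition stored as a chain of segments. It is only a sketch under stated assumptions: the class name `SegmentedLog`, the size and age thresholds, and the round-robin choice of storage nodes are all hypothetical and are not taken from Pulsar or BookKeeper.

```python
class SegmentedLog:
    """Toy model: one topic partition stored as a chain of independent segments."""

    def __init__(self, nodes, replicas=2, max_bytes=1024, max_age_s=60.0):
        self.nodes = nodes            # names of the storage nodes (hypothetical)
        self.replicas = replicas      # copies kept of each segment
        self.max_bytes = max_bytes    # size-based rollover threshold
        self.max_age_s = max_age_s    # time-based rollover threshold
        self.segments = []
        self._next_node = 0

    def _roll(self, now):
        # Pick the nodes for the new segment round-robin (an assumption),
        # so consecutive segments end up on different sets of nodes.
        ensemble = [
            self.nodes[(self._next_node + i) % len(self.nodes)]
            for i in range(self.replicas)
        ]
        self._next_node = (self._next_node + 1) % len(self.nodes)
        self.segments.append(
            {"opened": now, "bytes": 0, "nodes": ensemble, "entries": []}
        )

    def append(self, payload, now):
        seg = self.segments[-1] if self.segments else None
        too_big = seg is not None and seg["bytes"] + len(payload) > self.max_bytes
        too_old = seg is not None and now - seg["opened"] >= self.max_age_s
        if seg is None or too_big or too_old:
            self._roll(now)
            seg = self.segments[-1]
        seg["entries"].append(payload)
        seg["bytes"] += len(payload)


log = SegmentedLog(["bk1", "bk2", "bk3"], replicas=2, max_bytes=100)
for t in range(5):
    log.append(b"x" * 40, now=t)   # rolls on size after every two entries
log.append(b"x", now=100)          # rolls on age
for seg in log.segments:
    print(seg["nodes"], seg["bytes"])
```

Because each segment picks its own nodes, a newly added storage node can start receiving new segments without any existing data being moved.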

DATA PATH
1. Publisher sends message to broker
2. Broker writes in parallel to N replicas
3. Wait for a quorum of acks from bookies
4. Send ack to producer and dispatch to consumer
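
The data path steps can be sketched in plain Python. This is a simulation, not the broker's code: `quorum_write` and `make_bookie` are hypothetical names, and each "bookie" is just a callable that either stores the entry or raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_bookie(store, fail=False):
    """Return a fake bookie: a callable that appends the entry or raises."""
    def write(entry):
        if fail:
            raise IOError("bookie unavailable")
        store.append(entry)
    return write


def quorum_write(entry, bookies, ack_quorum):
    """Write to all replicas in parallel; succeed once ack_quorum have acked."""
    acks = 0
    pool = ThreadPoolExecutor(max_workers=len(bookies))
    try:
        futures = [pool.submit(bookie, entry) for bookie in bookies]
        for fut in as_completed(futures):
            try:
                fut.result()       # raises if this replica failed
            except Exception:
                continue           # a failed replica does not count as an ack
            acks += 1
            if acks >= ack_quorum:
                return True        # the broker can now ack the producer
        return False               # quorum was not reached
    finally:
        pool.shutdown(wait=False)  # do not wait for slower replicas


stores = [[], [], []]
bookies = [
    make_bookie(stores[0]),
    make_bookie(stores[1]),
    make_bookie(stores[2], fail=True),
]
ok = quorum_write(b"msg-1", bookies, ack_quorum=2)
print("acked" if ok else "failed")
```

In this sketch one failed replica out of three does not block the publish, because the write only waits for the quorum and not for every replica.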

BOOKKEEPER INTERNAL
Storage optimized for sequential & immutable data
• IO isolation between write and read operations
• Slow consumers won’t impact latency
• Very effective IO patterns:
  • Journal: append only and no reads
  • Storage device: bulk write and sequential reads
• Number of files is independent from number of topics

PULSAR CLIENT LIBRARY
• Java — C++ — Python — Go — WebSocket APIs
• Partitioned topics
• Apache Kafka compatibility wrapper API
• Transparent batching and compression
• TLS encryption and authentication
• End-to-end encryption

PYTHON CLIENT
• pip install pulsar-client

import pulsar

# Connect to a broker on the local machine
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('my-topic')

# Payloads are bytes, so each string is encoded before sending
for i in range(10):
    producer.send(('Hello-%d' % i).encode('utf-8'))

client.close()

GO CLIENT
• go get -u github.com/apache/pulsar/pulsar-client-go/pulsar

// Needs "context", "fmt", "log" and the pulsar package above
client, err := pulsar.NewClient(pulsar.ClientOptions{
    URL: "pulsar://localhost:6650",
})
if err != nil {
    log.Fatal(err)
}

producer, err := client.CreateProducer(pulsar.ProducerOptions{
    Topic: "my-topic",
})
if err != nil {
    log.Fatal(err)
}

for i := 0; i < 10; i++ {
    err := producer.Send(context.Background(), pulsar.ProducerMessage{
        Payload: []byte(fmt.Sprintf("hello-%d", i)),
    })
    if err != nil {
        log.Fatal(err)
    }
}

• Based on the C++ client library; a pure Go client is being worked on

MULTI-TENANCY
• Authentication / Authorization / Namespaces / Admin APIs
• I/O isolation between writes and reads
  • Provided by BookKeeper: ensures readers draining a backlog won’t affect publishers
• Soft isolation
  • Storage quotas, flow-control, back-pressure, rate limiting
• Hardware isolation
  • Constrain some tenants to a subset of brokers or bookies

GEO REPLICATION
[Diagram: topic T1 replicated across Data Center A, Data Center B and Data Center C, with subscription S1, producers P1, P2 and P3, and consumers C1 and C2]
• Scalable asynchronous replication
• Integrated in the broker message flow
• Simple configuration to add/remove regions

SCHEMA REGISTRY
• Store information on the data structure — Stored in BookKeeper
• Enforce data types on topic
• Allow for compatible schema evolutions

TYPE-SAFE CLIENT API
// Producer: the JSON schema ties the topic to MyClass
Producer<MyClass> producer = client
        .newProducer(Schema.JSON(MyClass.class))
        .topic("my-topic")
        .create();
producer.send(new MyClass(1, 2));

// Consumer: messages arrive already deserialized as MyClass
Consumer<MyClass> consumer = client
        .newConsumer(Schema.JSON(MyClass.class))
        .topic("my-topic")
        .subscriptionName("my-subscription")
        .subscribe();
Message<MyClass> msg = consumer.receive();

• Integrated schema in API
• End-to-end type safety, enforced in the Pulsar broker

PULSAR FUNCTIONS
Managed lightweight compute framework

PULSAR FUNCTIONS / 1
• Simple compute against a consumed message
• Managed or manual deployment
• A function gets messages from 1 or more topics
• An instance of the function is invoked to process the event
• The output of the function is published on 1 or more topics

• Super simple to use — no SDK
• Python example:
def process(input):
    return input + '!'
• Supports Java & Python — Go will come next

• Good use cases for functions:
• ETL
• Data enrichment
• Data filtering
• Routing
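
To make the filtering and enrichment use cases concrete, here is a small sketch in the same no-SDK style as the earlier Python example. It assumes that a function returning None publishes nothing to the output topic; the "[checked]" tag and the comment-line rule are made up for illustration.

```python
def process(input):
    # Data filtering: drop blank lines and comment lines.
    # Assumption: returning None means no output message is published.
    line = input.strip()
    if not line or line.startswith('#'):
        return None
    # Data enrichment: normalize the record and tag it (hypothetical tag).
    return line.lower() + ' [checked]'
```

Since the function is plain Python, it can be exercised locally by calling `process('  Hello ')` before it is deployed.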

• Deployment modes:
• Local run — Manually run a function, useful for dev mode
• Managed — Worker service is running instances of functions

PULSAR IO
Connector framework based on Pulsar Functions

PULSAR IO
• Source: ingests data into a Pulsar topic
• Sink: reads data from a topic and dumps it into an external sink
• Pulsar provides a set of built-in connectors
• Users can submit customized connectors

TIERED STORAGE
Unlimited topic storage capacity
Achieves the true “stream-storage”: keep the raw data forever in stream form

TIERED STORAGE
• Leverage cloud storage services to offload cold data; completely transparent to clients
• Extremely cost effective. Backends: S3 (GCS and HDFS coming)
• Example: retain all data for 1 month, offload all messages older than 1 day to S3
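
The example policy above can be written down as a small decision function. This is only a sketch: the name `placement`, the tier labels, and treating "1 month" as 30 days are assumptions for illustration, not Pulsar configuration.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)      # "1 month", approximated as 30 days
OFFLOAD_AFTER = timedelta(days=1)   # older data moves to cloud storage


def placement(segment_closed_at, now):
    """Decide where a closed segment should live under the example policy."""
    age = now - segment_closed_at
    if age > RETENTION:
        return "delete"        # past the retention period
    if age > OFFLOAD_AFTER:
        return "s3"            # cold data, offloaded
    return "bookkeeper"        # recent data stays on the bookies


now = datetime(2018, 9, 25, 12, 0)
for hours in (3, 5 * 24, 45 * 24):
    print(hours, placement(now - timedelta(hours=hours), now))
```

Readers would not need to know which tier a segment is in, which is what "transparent to clients" refers to.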

PULSAR SQL
• Coming very soon in Pulsar 2.2
• Interactive SQL queries over data stored in Pulsar
• Query old and real-time data

PULSAR SQL / 2
• Based on Presto by Facebook — https://prestodb.io/
• Presto is a distributed query execution engine
• Fetches the data from multiple sources (HDFS, S3, MySQL, …)
• Full SQL compatibility

PULSAR SQL / 3
• Pulsar connector for Presto
• Read data directly from BookKeeper, bypassing the Pulsar broker
• Many-to-many data reads
• Data is split even on a single partition: multiple workers can read data in parallel from a single Pulsar partition
• Time-based indexing: use “publishTime” in predicates to reduce the data being read from disk
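
The last two points can be illustrated with a toy model. This is not the Presto connector's implementation: `prune_segments`, `assign_splits` and the dictionary layout are hypothetical, and publish times are plain integers for brevity.

```python
def prune_segments(segments, start, end):
    """Keep only segments whose publish-time range overlaps [start, end]."""
    return [s for s in segments if s["last"] >= start and s["first"] <= end]


def assign_splits(segments, workers):
    """Spread the segments of one partition over workers, round-robin."""
    plan = {w: [] for w in workers}
    for i, seg in enumerate(segments):
        plan[workers[i % len(workers)]].append(seg["id"])
    return plan


# One partition stored as four segments, each with its publish-time range
segments = [
    {"id": 0, "first": 0, "last": 9},
    {"id": 1, "first": 10, "last": 19},
    {"id": 2, "first": 20, "last": 29},
    {"id": 3, "first": 30, "last": 39},
]

# A predicate such as publishTime BETWEEN 12 AND 25 touches two segments
needed = prune_segments(segments, 12, 25)
print(assign_splits(needed, ["worker-1", "worker-2"]))
```

The pruning step is where a "publishTime" predicate saves disk reads, and the split step is what lets several workers read one partition in parallel.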

OPENMESSAGING BENCHMARK
openmessaging.cloud
openmessaging.cloud/docs/benchmarks

BENCHMARK FRAMEWORK
• Designed to measure performance of distributed messaging systems
• Supports various “drivers” (Kafka, Pulsar, RocketMQ, RabbitMQ)
• Automated deployment in EC2
• Configure workloads through a YAML file

DISTRIBUTED EXECUTION
The coordinator takes the workload definition and propagates it to multiple workers, then collects and reports stats

[Chart: Max throughput. 1 topic, 1 partition, 1 KB payload]

[Chart: Latency at fixed throughput. 50K msg/s, 1 topic, 1 partition, 1 KB payload]

[Chart: Latency at fixed throughput (including Kafka-sync). 50K msg/s, 1 topic, 1 partition, 1 KB payload]

[Chart: Latency at fixed throughput, 99th percentile. 50K msg/s, 1 topic, 1 partition, 1 KB payload]
