Apache Pulsar: what we learned from building a
modern messaging and streaming system for cloud
Sijie Guo
Co-founder and CEO, StreamNative
Who am I?
● Sijie Guo (@sijieg)
● Co-founder and CEO, StreamNative
● PMC Member of Pulsar/BookKeeper
● Ex Co-Founder, Streamlio
● Ex Twitter, Yahoo!
StreamNative
Founded by the creators of Apache Pulsar, StreamNative provides a
cloud-native, unified messaging and streaming platform powered by
Apache Pulsar to support multi-cloud and hybrid-cloud strategies
2010 2011
Distributed
Log
2015 2017
I build log storage for real-time data
The Log: What every software engineer should know about real-time
data's unifying abstraction
- Jay Kreps
Log-centric Designs
Apache Kafka
The Kafka API is great, but ...
“Hadoop Era” timeline (slide markers: 2008, 2011; Hadoop)
● Engineers at LinkedIn start developing “Kafka” to collect logs for Hadoop
● Kafka is open-sourced
● Engineers at Yahoo! start working on a global messaging model that can work across international teams. They decide Kafka won’t work, and set out to build their own.
But Kafka was built for the Hadoop era
● Built for on-premise deployments
● The bottleneck was disk: the slowness of spinning disks dominated how Kafka was built, so compute and disk are coupled
● Data movement was the main driving use case, so the design favors throughput over latency, durability, and consistency
The world is very different from what it was 10+ years ago
● The rise of cloud
○ Cloud computing changed the pricing dynamics around local vs. remote storage; remote storage is far cheaper than local storage in the cloud
○ Cloud data centers have robust networking infrastructure that provides low-latency, high-bandwidth communication with remote storage
● The shift towards micro-services and event-driven architecture
○ A log API doesn’t meet all the requirements
○ A unified messaging and streaming API is the trend
○ Systems need to support multiple messaging protocols
● The need for multi-cloud and hybrid-cloud
● Data requirements go beyond streaming to cover the entire data lifecycle (that is, unifying real-time data with historical context)
“Hadoop Era” timeline (slide markers: 2008, 2011; Hadoop)
● Engineers at LinkedIn start developing “Kafka” to collect logs for Hadoop
● Kafka is open-sourced
● Engineers at Yahoo! start working on a global messaging model that can work across international teams. They decide Kafka won’t work, and set out to build their own.
Pulsar is built for the Cloud-Native Era
Cloud-native era timeline (Docker / Kubernetes)
● 2013: Docker launches. Containers gain momentum
● 2014: Kubernetes launches
● 2016: Pulsar is committed to open source
● 2018: Pulsar graduates as a Top-Level Apache Project
● 2019 onward: Accelerated adoption of Pulsar as companies move to Kubernetes and seek cloud-native, multi-cloud, and hybrid-cloud strategies
Pulsar is built for the Cloud-Native Era
● Built for Kubernetes and the cloud
● Changing hardware requirements in the cloud drove a design that separates compute from storage
● Messaging for mission-critical business applications was the main driving use case; it requires high-throughput, low-latency streaming with strong durability and consistency
● Unified messaging and streaming for a multi-tenant, real-time data fabric
Look back …
What we learned from building Pulsar
● Compute and storage separation
● Messaging and Streaming unification
● Support for multiple messaging protocols
● Importance of multi-tenancy
● Geo-replication to support multi-cloud and hybrid-cloud
● Infinite stream storage to support entire data lifecycle
Compute and storage separation
● Instant scalability; no data rebalance
● High availability
● Cost-effective
[Diagram: Producer and Consumer connect to the Apache Pulsar layer (Broker 0, Broker 1, Broker 2), which sits on top of the Apache BookKeeper layer (Bookie 0 through Bookie 4).]
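To make the "no data rebalance" point concrete, here is a minimal sketch of segment-based placement. It is a toy model, not the BookKeeper API: the `SegmentStore` class, the `open_segment` and `add_bookie` methods, the replica count, and the least-loaded placement rule are all assumptions made for illustration.

```python
class SegmentStore:
    """Toy model of segment-based storage; not the BookKeeper API."""

    def __init__(self, bookies, replicas=2):
        self.bookies = list(bookies)
        self.replicas = replicas
        self.placement = {}  # segment id -> tuple of bookies holding a copy

    def add_bookie(self, name):
        # Scaling out only widens the choice for future segments.
        self.bookies.append(name)

    def open_segment(self):
        """Place a new segment on the least-loaded bookies (assumed policy)."""
        load = {b: 0 for b in self.bookies}
        for ensemble in self.placement.values():
            for b in ensemble:
                load[b] += 1
        ranked = sorted(self.bookies, key=lambda b: (load[b], b))
        segment_id = len(self.placement)
        self.placement[segment_id] = tuple(ranked[: self.replicas])
        return segment_id


store = SegmentStore(["bookie-0", "bookie-1", "bookie-2"])
for _ in range(3):
    store.open_segment()
before = dict(store.placement)

store.add_bookie("bookie-3")
new_segment = store.open_segment()
```

In this sketch the existing segments stay where they were written, and the new bookie starts receiving data as soon as the next segment opens.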
Enterprise Architecture
Messaging
● Queueing systems are ideal for work
queues that do not require tasks to
be performed in a particular
order—for example, sending one
email message to many recipients.
● RabbitMQ and Amazon SQS are
examples of popular queue-based
message systems.
Streaming
● Streaming works best in situations
where the order of messages is
important—for example, data ingestion.
● Kafka and Amazon Kinesis are
examples of messaging systems that
use streaming semantics for consuming
messages.
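Unified API
[Diagram: Producer 1 and Producer 2 publish messages m0 through m4 to a Pulsar topic/partition. Four subscriptions, labeled as covering both Streaming and Messaging, consume the same topic:]
● Subscription A (Exclusive): Consumer A-0 receives m0 through m4; Consumer A-1 is marked with an X
● Subscription B (Failover): Consumer B-0 receives m0 through m4; Consumer B-1 takes over in case of failure in Consumer B-0
● Subscription C (Shared): m0 through m4 are distributed across Consumer C-1, Consumer C-2, and Consumer C-3
● Subscription D (Key-Shared): keyed messages <k2,v1>, <k2,v3>, <k3,v2>, <k1,v0>, <k1,v4> are distributed across Consumer D-1, Consumer D-2, and Consumer D-3, grouped by key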
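The four subscription types in the diagram can be sketched as a dispatch rule. This is a simplified simulation, not the Pulsar client API: the `dispatch` function and its mode names are hypothetical, Shared is modeled as plain round-robin, and Key-Shared uses a CRC32 hash as a stand-in for whatever key hashing the broker actually applies.

```python
import zlib
from collections import defaultdict


def dispatch(messages, consumers, mode):
    """Return {consumer: [values]} for a toy model of one subscription.

    messages: list of (key, value) pairs in topic order.
    mode: "exclusive", "failover", "shared", or "key_shared".
    """
    delivered = defaultdict(list)
    for i, (key, value) in enumerate(messages):
        if mode in ("exclusive", "failover"):
            # A single active consumer receives everything, in order.
            target = consumers[0]
        elif mode == "shared":
            # Round-robin across consumers; ordering is not preserved.
            target = consumers[i % len(consumers)]
        elif mode == "key_shared":
            # Messages with the same key always reach the same consumer.
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        else:
            raise ValueError(f"unknown mode: {mode}")
        delivered[target].append(value)
    return dict(delivered)


msgs = [("k1", "v0"), ("k2", "v1"), ("k3", "v2"), ("k2", "v3"), ("k1", "v4")]
exclusive = dispatch(msgs, ["A-0"], "exclusive")
# Failover after Consumer B-0 has failed: only B-1 remains in the list.
failover = dispatch(msgs, ["B-1"], "failover")
shared = dispatch(msgs, ["C-1", "C-2", "C-3"], "shared")
key_shared = dispatch(msgs, ["D-1", "D-2", "D-3"], "key_shared")
```

Exclusive and Failover keep the whole stream on one consumer, which suits ordered streaming; Shared and Key-Shared spread the work across consumers, which suits queue-style processing.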
Unified Application and Data Services
✓ Unified storage for in-motion data
✓ Native tiered storage
✓ Single system to exchange data
✓ Teams share toolset
Support Multi Messaging Protocols
[Diagram: Apache Pulsar hosts one protocol handler per client type:]
● Pulsar Protocol Handler: Pulsar Clients (queue + stream)
● Kafka Protocol Handler: Kafka Clients
● AMQP Protocol Handler: AMQP Clients
● MQTT Protocol Handler: MQTT Clients
Break Data Silos with Multi-Tenancy Support
[Diagram: a Pulsar instance/cluster is divided into tenants, each tenant into namespaces, and each namespace into topics.]
● Tenants: Compliance, Data Services, Marketing
● Namespaces: Microservices, ETL, Campaigns, ETL, Risk Assessment
● Topics: Cust Auth, Location Resolution, Demographics, Budgeted Spend, Acct History, Risk Detection
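The tenant, namespace, and topic hierarchy shows up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. The sketch below parses such a name. The `TopicName` type and `parse_topic` helper are hypothetical, and the example name pairs a tenant, namespace, and topic from the diagram in a way the slide text does not confirm.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str     # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Hypothetical pairing of names taken from the diagram.
parsed = parse_topic("persistent://marketing/campaigns/budgeted-spend")
```

Because the tenant and namespace are part of every topic name, isolation and policies can be applied at those levels while all teams share one cluster.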
Geo-replication to support
multi-cloud and hybrid-cloud
Pulsar has built-in cross-data-center replication that is already used in production.
Infinite stream storage to support
entire data lifecycle
[Diagram 1, "Storage": Producer and Consumer connect to Apache Pulsar (Broker 0, Broker 1, Broker 2) over Apache BookKeeper (Bookie 0 through Bookie 4); blocks labeled T0 through T4 and S are spread across the layers.]
[Diagram 2, "Offloading": the same layout, with blocks labeled S being offloaded.]
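A rough sketch of the offloading idea follows, assuming the S blocks in the diagrams are stream segments. It is a toy model, not the Pulsar offloader API: the `TieredTopic` class and the "keep the newest N segments hot" rule are assumptions chosen to keep the example short.

```python
class TieredTopic:
    """Toy model of segment offloading; not the Pulsar offloader API."""

    def __init__(self, max_hot_segments=2):
        self.max_hot_segments = max_hot_segments
        self.hot = {}   # segment id -> entries still in the BookKeeper tier
        self.cold = {}  # segment id -> entries moved to cheaper storage
        self._next_id = 0

    def append_segment(self, entries):
        segment_id = self._next_id
        self._next_id += 1
        self.hot[segment_id] = list(entries)
        # Move the oldest segments out once the hot tier is over budget.
        while len(self.hot) > self.max_hot_segments:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)
        return segment_id

    def read(self, segment_id):
        """Return (tier, entries); readers address segments the same way."""
        if segment_id in self.hot:
            return "hot", self.hot[segment_id]
        return "cold", self.cold[segment_id]


topic = TieredTopic(max_hot_segments=2)
for i in range(4):
    topic.append_segment([f"m{i}"])
```

The reader addresses old and new segments the same way, so the stream can keep growing in cheaper storage while recent data stays on the fast tier.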
A hero is nothing but a product of his time.
“时势造英雄”
Pulsar is a product of the cloud-native era
