Data insights and data-driven strategies create the competitive differentiators companies thrive on today. The need for unified messaging and streaming has never been more apparent.
Pulsar started with the goal of building a global, geo-replicated infrastructure to serve Yahoo!’s messaging needs. With the increased need to process both business events (such as payment requests and billing requests) and operational events (such as log data and click events), the team at Yahoo! set out to build a truly unified infrastructure platform to handle all in-motion data. That technology became Apache Pulsar.
In this talk, Matteo Merli and Sijie Guo will dive into the landscape of unified messaging and streaming, how Pulsar helps companies achieve this vision, and what the future of Pulsar will look like.
Apache Pulsar: Why Unified Messaging and Streaming Is the Future - Pulsar Summit NA 2021 Keynote
1. Pulsar Virtual Summit North America 2021
Apache Pulsar:
Why Unified Messaging
and Streaming Is the
Future
Matteo Merli, Sijie Guo
@ Pulsar PMC
2. Who are we?
● Sijie Guo (@sijieg)
● CEO, StreamNative
● PMC Member of Pulsar/BookKeeper
● Ex Co-Founder, Streamlio
● Ex Twitter
● Matteo Merli (@merlimat)
● CTO, StreamNative
● Co-creator and PMC chair of Pulsar
● Ex Co-Founder, Streamlio
● Ex Yahoo!
3. StreamNative
Founded by the creators of Apache Pulsar, StreamNative provides a
cloud-native, unified messaging and streaming platform powered by
Apache Pulsar to support multi-cloud and hybrid-cloud strategies
8. Cloud-Native
Kubernetes Drives Adoption of Pulsar
✓ 80% of Pulsar users deploy Pulsar in a cloud environment
✓ 62% of Pulsar users deploy Pulsar on Kubernetes
✓ 49% noted Pulsar’s Cloud-Native capabilities as one of the
top reasons they chose to adopt Pulsar
9. Cloud-Native
Built for Kubernetes
VM / Early Cloud Era:
● Single Cloud Provider
● Monolithic Architectures
● Single Tenant Systems
● No Geo-replication
Containers / Modern Cloud Era:
● Containers
● Cloud Native
● Hybrid & MultiCloud
● Microservices
11. Kafka to Pulsar
More and More Kafka Users Adopt Pulsar
✓ 68% of respondents use Kafka in addition to Pulsar
✓ 34% of respondents use or plan to use Kafka-on-Pulsar
✓ Kafka and Pulsar serve different use cases
✓ Once adopted, Pulsar usage expands across organizations
12. Pulsar Adoption Use Cases
Adopted Pulsar to replace Kafka in their DSP (Data Streaming Platform).
● 1.5-2x lower capex cost
● 5-50x improvement in latency
● 2-3x lower opex
● 10 PB / day

Adopted Pulsar to power their billing platform, Midas, which processes hundreds of billions of financial transactions daily. Adoption then expanded to Tencent’s Federated Learning Platform and Tencent Gaming.

Use cases required a scalable message queue to replace RabbitMQ for serving mission-critical business applications. Now in the process of expanding use cases to build data streaming services.
15. Messaging
● Queueing systems are ideal for work
queues that do not require tasks to
be performed in a particular order—
for example, sending one email
message to many recipients.
● RabbitMQ and Amazon SQS are
examples of popular queue-based
message systems.
Streaming
● Streaming works best in situations
where the order of messages is
important—for example, data
ingestion.
● Kafka and Amazon Kinesis are
examples of messaging systems that
use streaming semantics for
consuming messages.
Data in motion
25. Step 4: Schema API
[Architecture diagram: Pub/Sub API, Reader and Batch API, and Schema API connecting publishers, subscribers, stream processors, and applications (microservices or event-driven architecture)]
26. Step 5: Functions and IO API
[Architecture diagram: adds the Functions API and Pulsar IO/Connectors (prebuilt and custom connectors) alongside the Pub/Sub, Reader and Batch, and Schema APIs]
27. Step 6: Tiered Storage
[Architecture diagram: adds Tiered Storage beneath the Pub/Sub, Reader and Batch, Schema, and Functions APIs and Pulsar IO/Connectors]
29. Step 8: Transaction API
[Architecture diagram: adds the Transaction API to the full picture: Pub/Sub, Reader and Batch, Schema, and Functions APIs, Pulsar IO/Connectors, and Tiered Storage]
32. Towards a self-adjusting
data platform
✓ Tuning data platforms to run at scale is hard
✓ Lots of configurations
✓ Requires in-depth knowledge of internals
✓ Workloads are constantly changing
33. Topic auto-partitioning
✓ Partitions are an artifact of implementation
✓ It’s not a natural property of the data
✓ Abstract the partitioning away from users
✓ Partitions are automatically split / merged
✓ Rethink how the API should look
34. Self-Adjusting Storage
✓ Ensure optimal utilization of hardware
✓ No configuration
✓ Automatically adjust strategies based on changing conditions:
✓ Disk access
✓ Cache management
✓ Queue sizes
35. Pulsar Functions
✓ The foundation is now mature — UX is still poor
✓ Simpler tooling to create & manage functions
✓ CI/CD integration — Versioning — A/B testing
✓ Observability & Debuggability
✓ Improve support for Go and Python functions
✓ DSL — Provide higher level constructs to process data
36. Stream Storage
✓ Evolve the current state of Tiered Storage
✓ Integrate with data lake technologies
Before diving into “Unified Messaging and Streaming”, let’s take a look at the trends in the Pulsar community.
To understand what is happening behind the scenes, we need to rewind to the early days of Pulsar. Back in 2012, when we first set out to build Pulsar, we thought there should be a global, geo-replicated infrastructure for all messaging data. We didn’t start with the idea of making our own software, but started by observing the gaps in the existing technologies available at the time and realized how they were insufficient to serve the needs of a data-driven organization.
Talking about these 2 different worlds
Messaging - read slide
These are like commands that represent changes that need to be made to the system
An example: we send a message that says “Process this order” or “change user to be deleted”, but we don’t actually perform that change, we just notify
Messaging systems are selected when synchronous communication breaks down
In contrast, streaming systems deal with events: the state changes themselves. So instead of sending a message saying this user wants to update their email, we actually perform the update
Events are interlinked together and may be persisted, replayed or aggregated
What we have here is a bit of an example of what we might see in a modern organization that has run into both of these issues
We have basically 2 different regimes or 2 different worlds - different teams.
Historically, these worlds often seem very different, with entirely different tech stacks and entirely different teams. However, as data becomes more critical in informing applications, applications need to make more use of what data teams and data services are producing. Likewise, getting the data out of applications and into the data realm has forced organizations to get better at doing both of these things really well. This can be a real challenge.
So on the left we have the application side: applications that interact via messages, deal with the aspects of running your systems, and provide capabilities focused on business concerns
On the right side we have services that deal with the data, in bulk and at large scale
Sometimes the right side includes real-time or batch processes such as moving large amounts of data, putting it into data lakes, computing answers from it, sending data to other services, or providing that data to other organizations that need it
These 2 worlds generally use different technologies, different tools and different processes - all leading to more complexity and cost
Read slide
Separate storage/transport systems for messaging, streaming, and big data, with ETL handled as separate processes
Messaging helps decouple apps, provides for reliable async communication, work queues, in core applications.
Streaming allows for “medium-term” storage of streams (~30 days), aggregating streams of data and real-time processing for near real-time analytics.
Batch processing and long-term object storage (S3, HDFS, etc) allows for processing historical data to learn from the past.
“Tiering” of data from messaging -> streaming -> object storage is outside of the core toolset and is maintained explicitly.
Application and Data domains are separated, data is replicated into data domain. Results from data domain are loaded (ETL) back into application domain.
Multiple teams with very different technology stacks.
====
To show how Pulsar provides that ability to be transformative, here is a common example of an e-commerce system stack that contains both a streaming set of services and also data processing
On the application side we have order services, inventory service and fulfillment
Talk to each service (think Amazon)
On the data side we have
Spark - some batch processing using Spark
Flink - real-time inventory analysis using Flink
Another use case may be some long-term storage needs versus short-term (30 days), then a data warehouse layer
Imagine a person ordering something, then we check inventory and it isn’t there. Do you delete the order or put it on backorder?
Once the inventory gets replenished, how do we notify the customers that their order is now coming?
So we need to join both sides together
It is very natural to merge both. Talk about how the technologies have evolved in a way that is able to support both.
Read slide and add more context:
“Unified” storage/transport of messages and streams with access to underlying data:
Messaging - Decoupled applications with pub/sub, shared subscriptions for work queues, exclusive subscriptions for fanout and point-to-point messaging, with flexible large numbers of non-partitioned topics.
Streaming - Ordered, scalable partitioned topics with failover and key shared subscriptions. Pub/sub (broker controlled) or reader API (client controlled) for advanced stream processing, replay, etc.
Big-data batch access - Underlying segments of topics can be read directly, allowing for scale-out parallelism.
Tiered storage is core to Pulsar, no need for external tools.
Application and data domains use a single system to exchange data, with converged “messaging” and “streaming”.
One or many teams, with a shared toolset.
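To make the “single system” point concrete, here is a minimal sketch using the Pulsar Java client for both publishing and work-queue style consumption; the service URL, topic and subscription names are illustrative assumptions, not taken from the talk.

import org.apache.pulsar.client.api.*;

public class UnifiedClientSketch {
    public static void main(String[] args) throws Exception {
        // One client and one system for both the application and data sides.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Application side: publish an order event.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/orders")
                .create();
        producer.send("order-created:1234");

        // Work queue: a Shared subscription spreads messages across worker instances.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/orders")
                .subscriptionName("fulfillment-workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();
        Message<String> msg = consumer.receive();
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}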
Talk to diagram
Talk to the slide: on the left side, say how Pulsar can process real-time streams, and on the right it can do batch processing, offload to tiered storage, read back in a parallel batch fashion, and even provide a stream back to other systems for consumption
order services, inventory service and fulfillment - they still work from the messaging domain (use cases not too different)
But now they can support processing at much higher scale; any messages they have are kept in Pulsar as a single source of truth, and these messages can be offloaded via Pulsar to long-term storage
Pulsar also provides the power to enable a unified batch and streaming job that can do batch processing by reading from the underlying storage and combine that with real-time streams, all with a single technology
Let's take a retrospective look at how Pulsar has evolved through the
years.
When we started designing Pulsar as a new platform, we always had this
idea of supporting both the Pub-Sub semantics as well as the data
streaming pipelines, which at the time were a new and emerging thing.
But it would be a lie to say we had everything pre-planned since the
beginning.
Instead, we spent a lot of time observing how people used these
platforms and we tried to fill all the gaps we were seeing, evolving Pulsar
with the changing needs of data applications.
At the very core of Pulsar there has always been the concept of the
"log". A distributed, replicated and immutable ledger where all the
events are appended.
BookKeeper has proved, throughout the years, to be the best storage
solution for streams of data. It scales to a very large number of logs,
it offers consistency, durability, low latency and high throughput and,
more importantly, very convenient operational tooling.
To summarize: using the log as a building block does a lot of the heavy
lifting required to build a truly scalable system.
Another architectural choice that came naturally from using BookKeeper
has been the separation of the storage from the data serving layer.
This comes from BookKeeper because BookKeeper requires a single
writer for each log. In our case the Broker acts as that single
writer.
This multi-layer architecture was exactly what we needed
because it allows Pulsar to have:
1. Stateless brokers - This means topics can be easily moved across
brokers without copying any data, for example when expanding the
cluster or adjusting topic assignments as conditions change.
2. Data locality - Because of this broker layer, the data for a single
topic or partition does not have to be stored in one single storage
node. Instead we can fully utilize the resources of the entire
cluster.
We just said that the log is the building block of Pulsar... but the
log on its own is a very low level construct. Applications very often
need much more sophisticated ways of interacting with the data than
just reading through the log of events.
Instead, we wanted to capture the right level of semantics needed to
support a wide range of pub-sub and streaming use cases. The core idea
was to leave the flexibility to consume data from topics in multiple
different ways, depending on what the application needs.
We ended up having 4 subscription types with different semantics
and different properties, each one with its own merits.
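As a rough sketch of how an application chooses among them with the Java client (the client is assumed to be built as in the earlier sketch; topic and subscription names are illustrative):

// Exclusive  - a single consumer, strict ordering (classic streaming)
// Failover   - one active consumer with hot standbys, still ordered
// Shared     - messages spread round-robin across consumers (work queue / messaging)
// Key_Shared - scaled out across consumers while preserving per-key ordering
Consumer<byte[]> consumer = client.newConsumer()
        .topic("persistent://public/default/events")
        .subscriptionName("analytics")
        .subscriptionType(SubscriptionType.Key_Shared)
        .subscribe();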
After the Pub/Sub API, the next addition was the Reader API.
You can think of it as the "unmanaged" way to consume data
from a topic.
While there are many reasons for using a reader, the main users
are typically Stream Processing frameworks because they tend to
have their own checkpointing mechanisms or, similarly, batch systems
that want to do a scan of the historical data.
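A minimal sketch of the Reader API, where the client rather than the broker controls the position (topic and starting point are illustrative, and the client is assumed to exist as above):

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Reader;

// "Unmanaged" consumption: no subscription, the application tracks its own position.
Reader<byte[]> reader = client.newReader()
        .topic("persistent://public/default/events")
        .startMessageId(MessageId.earliest)   // or a MessageId restored from a checkpoint
        .create();

while (reader.hasMessageAvailable()) {
    Message<byte[]> msg = reader.readNext();
    // hand the message to the processing framework's own checkpointing logic
}
reader.close();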
The common theme in the API exposed by Pulsar is the support for
Schema.
Having direct support for Schema inside Pulsar means that brokers
can validate the schema of the data being published and that the expectation
of consumers is matched as well.
But it also means that it becomes very easy to "discover" the schema of the data. The discoverability
of the schema means that you can write fully type-safe, generic
consumers that don't need to be aware of one specific schema.
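A short sketch of what this looks like from the Java client; the OrderEvent POJO, topic and subscription names are illustrative assumptions:

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.schema.GenericRecord;

public class OrderEvent {        // POJO whose Avro schema gets registered with the broker
    public String orderId;
    public double amount;
}

// Typed producer: the broker validates the schema of the data being published.
Producer<OrderEvent> producer = client.newProducer(Schema.AVRO(OrderEvent.class))
        .topic("persistent://public/default/orders")
        .create();
OrderEvent evt = new OrderEvent();
evt.orderId = "o-1234";
evt.amount = 42.0;
producer.send(evt);

// Generic consumer: discovers the schema at runtime, no compile-time binding to OrderEvent.
Consumer<GenericRecord> generic = client.newConsumer(Schema.AUTO_CONSUME())
        .topic("persistent://public/default/orders")
        .subscriptionName("audit")
        .subscribe();
Object amount = generic.receive().getValue().getField("amount");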
Next we looked at what people were trying to do with messaging
platforms and the realization was that there was always some
portion of computation involved. Applications very often need to
do simple data transformations, enrichment and similar things.
Functions were designed to provide the simplicity of the "Serverless"
model with a very tight integration in the Pulsar platform.
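A minimal Pulsar Function sketch using the Java SDK; the enrichment logic here is only a placeholder:

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

public class EnrichOrderFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        context.getLogger().info("processing {}", input);
        return input.toUpperCase();   // stand-in for a real transformation or enrichment
    }
}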
One example of how powerful Pulsar functions are is that we have
created a connector framework, Pulsar IO, entirely based on Pulsar
Functions.
With Pulsar IO, you can choose between a large set of pre-built
connectors, both sources or sinks, or build your own custom connectors.
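And because connectors are built on Functions, a custom connector is also a small amount of code; this sink skeleton only prints records and is purely illustrative:

import java.util.Map;
import org.apache.pulsar.functions.api.Record;
import org.apache.pulsar.io.core.Sink;
import org.apache.pulsar.io.core.SinkContext;

public class LoggingSink implements Sink<String> {
    @Override
    public void open(Map<String, Object> config, SinkContext context) {
        // connect to the external system here
    }

    @Override
    public void write(Record<String> record) {
        System.out.println("writing " + record.getValue());
        record.ack();   // acknowledge once the record is safely delivered
    }

    @Override
    public void close() {
        // release resources here
    }
}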
After that, the next trend we saw is that more and more users wanted
to use the "stream" concept not just as a temporary
buffer to isolate data ingestion from processing.
Instead, they increasingly want to keep the stream as a permanent,
or at least long term "storage of record".
Tiered storage was the missing link to enable this. By offloading
cold data to cloud storage providers, we can have large scale data retention
at a very effective cost, all while maintaining the stream view
of the data and the same APIs.
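As a hedged example of what this looks like operationally, the Java admin client can set a per-namespace offload threshold (the offload driver and bucket are configured on the broker; URL, namespace and threshold here are illustrative):

import org.apache.pulsar.client.admin.PulsarAdmin;

PulsarAdmin admin = PulsarAdmin.builder()
        .serviceHttpUrl("http://localhost:8080")
        .build();

// Topic data beyond roughly 10 GiB becomes eligible for offload to cloud storage.
admin.namespaces().setOffloadThreshold("public/default", 10L * 1024 * 1024 * 1024);
admin.close();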
Another realization was that, because of its nature, messaging is
always the integration point for different applications and components.
This makes migration from other platforms a bit harder. You often have to
coordinate that migration across different teams or organizations.
To make it easier, we extended the Pulsar brokers to be able to speak
several protocols, in addition to the Pulsar native protocol. With
Protocol Handlers, there is a pluggable way to add more ways to interact
with the Pulsar service and the same topic data.
We started with KoP, Kafka On Pulsar, then followed with AMQP and MQTT.
It is a very powerful mechanism for a few reasons:
1. Applications can use existing client libraries with no code or
dependency changes
2. You can mix all sorts of different protocols to interact with the same topic
3. It's exposed directly in Pulsar brokers, data is stored only once and
there is no "proxy overhead"
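For example, assuming a broker with the KoP protocol handler installed and listening on the Kafka port, an unmodified Kafka producer can publish straight into a Pulsar topic (broker address and topic are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "pulsar-broker:9092");   // Pulsar broker with KoP enabled
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("orders", "order-1234", "created"));
}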
To really complete the full picture, in Pulsar 2.8 we introduced support
for transactions.
It's now possible to do very complex interactions and take advantage of
the transactional properties, for example publishing messages atomically
across multiple topics, or consuming and producing atomically.
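A sketch of the consume-transform-produce pattern with the transaction API (transactions must be enabled on the broker; the producer and consumer are assumed to be created as in the earlier sketches):

import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.transaction.Transaction;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .enableTransaction(true)           // required to use transactions from the client
        .build();

Transaction txn = client.newTransaction()
        .withTransactionTimeout(5, TimeUnit.MINUTES)
        .build()
        .get();

Message<String> in = consumer.receive();
producer.newMessage(txn).value(in.getValue().toUpperCase()).send();   // publish inside the txn
consumer.acknowledgeAsync(in.getMessageId(), txn).get();              // ack inside the same txn
txn.commit().get();   // the publish and the ack become visible atomically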
We can say that Pulsar 2.8 is a big milestone in the journey of completing this
vision of a unified messaging and streaming platform.
We are very excited and very proud of this release. It is the culmination of months
and months of work by a “larger than ever” group of committers and contributors.
And while transactions support is the biggest new feature, it is certainly not the
only one. We have features like Exclusive Producer support, which I will
be talking about tomorrow in an ad-hoc session, a new API for package
management to improve the way we manage the functions and connectors
code artifacts, and finally a simplified way to configure memory limits in
Pulsar clients.
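For instance, the client memory limit mentioned above is now a single builder setting (the value is illustrative):

import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SizeUnit;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .memoryLimit(64, SizeUnit.MEGA_BYTES)   // caps memory used by producers/consumers in this client
        .build();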
After looking at the past, let's now take a look at some of the items
that we want to focus on in the very near future.
A problem that we're seeing overall in the data ecosystem is that these
platforms can be very difficult to tune and operate when running at a
large scale.
This is not a problem specific to Pulsar, but it is something that we
believe should be addressed.
Typically, there are a lot of configuration options and each of them
requires in-depth knowledge of the internals of the system. Worse, when
integrating multiple systems, like a compute framework, it might be very
hard to predict how a change in the configuration will affect the overall
stability and performance.
Finally, the workloads are increasingly dynamic and constantly changing.
It's not possible to have a static configuration that will have "optimal"
performance in every condition.
The first item I want to discuss is partitioning. People are used to
seeing partitioning and sharding, but these are really artifacts of how
systems are implemented. Partitions are usually not a natural property
of the data.
Because of that, we want to abstract the partition concept away from
the users' sight. Application developers should not be worried about
partitions, and operators should not be thinking about how many partitions
are needed for a certain use case.
Instead, the system should be able to figure it out on its own,
internally splitting and merging partitions, while maintaining
the fundamental ordering guarantees.
Tuning a storage system can also be a very complex
task. In particular, it can be very hard to predict the impact
of configuration on the overall performance when we're crossing
multiple layers: there is the Operating System, the disk device and the
disk controller.
In a similar way, the idea we have is to make it work with
no configuration, in a way that the storage system is
able to automatically adjust its strategies based on the
changing conditions of the traffic:
all aspects regarding the disk access pattern, the kind
of cache eviction strategy, and so on.
When we introduced Pulsar Functions, we had the idea of making
it a frictionless platform for developers to do data processing.
Over the past few years, the foundation of the Pulsar Functions runtime has
really matured into a solid platform, although the user experience
is still not great.
While it is very easy for developers to write functions, we
should strive to make it much easier to actually deploy and
manage functions.
For example, having the functions tooling well integrated
with CI/CD platforms, supporting versioning, and providing
out-of-the-box support for A/B testing.
Another aspect is observability and debuggability. The tooling
and the platform need to make it super easy for users to
discover issues in their own code or to detect performance
issues.
Finally, we are thinking about a higher-level DSL that
can provide higher-level constructs to further simplify
writing data processing functions.
We talked before about Tiered Storage and how it has enabled
completely new use cases to be supported by Pulsar.
The next step here is to make sure we can integrate with
existing data lake technologies, like Delta Lake and
Apache Hudi.
The vision is to use the Data Lake as the tiered storage backend,
so that the same data can be consumed as a stream or with the
data lake tooling.
As a final note, given the very nature of Pulsar, which sits between
different systems and platforms and links all of them together,
we want to reaffirm our commitment to work with the larger data
community to ensure that Pulsar is supported everywhere, out of the box, as a first class
citizen.
We have been partnering with many Open Source communities like
Trino, Druid, Pinot, Spark and Flink. We will continue to do so,
and more in the future.
We believe that this will benefit Pulsar, its users and the overall data
ecosystem.