This document discusses Apache Pulsar Functions, a lightweight serverless compute framework built on Apache Pulsar. Pulsar Functions lets users run stateless and stateful functions against data streams in Pulsar. Each function is a simple method, written in Java or another supported language, that processes individual messages. Functions take their input from and write their output to Pulsar topics, which suits scalable, low-latency stream processing at the edge and in cloud environments.
What do we really mean by real time?
Aims
Aim is to react to events as they happen in real time
Where do events happen/arrive?
Message bus
What’s a reaction?
An action/transformation/function
Traditional Compute API
Stitching all of this together is left to programmers
public static class SplitSentence extends BaseBasicBolt {
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }

    // Split each incoming sentence on spaces and emit one tuple per word.
    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String w : words) {
            basicOutputCollector.emit(new Values(w));
        }
    }
}
Traditional Compute API
Stitching all of this together is left to programmers
public static class WordCount extends BaseBasicBolt {
    // Running count per word, held in memory by this bolt instance.
    Map<String, Integer> counts = new HashMap<String, Integer>();

    // Increment the count for the incoming word and emit the updated pair.
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
Compute API 2.0
Functional
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);
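For comparison, the same flatMap and reduce-by-key steps can be sketched with plain Java streams over one fixed batch of sentences. This is only an illustration: the class name WindowedWordCount is hypothetical, it uses no streaming framework, and it sidesteps windowing by treating the whole input list as a single window.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of any streaming framework.
final class WindowedWordCount {
    // Counts words across one batch of sentences (standing in for a single window).
    static Map<String, Integer> countWords(List<String> sentences) {
        return sentences.stream()
                // flatMap: one sentence becomes many lower-cased words
                .flatMap(s -> Arrays.stream(s.toLowerCase().split("\\s+")))
                // skip the empty token produced by leading whitespace
                .filter(w -> !w.isEmpty())
                // reduce by key: key = word, initial value = 1, merge = sum
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }
}
```

Calling countWords on a batch returns a sorted map from each word to the number of times it occurred in that batch.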
What’s needed? Stream-Native Compute
Insight gained from Serverless
Simplest possible API
Method/Procedure/Function
Multi-language API
Scale developers
Message bus native concepts
Input/Output/Log as topics
Flexible runtime
Simple standalone applications vs system managed applications
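The idea of "a function with topics as input and output" can be sketched in plain Java. The sketch below is an in-memory stand-in, not Pulsar code: the class TopicPipeline and its process method are hypothetical, and ordinary queues play the role of topics.

```java
import java.util.Queue;
import java.util.function.Function;

// In-memory stand-in for the idea only; real topics live in the message bus.
final class TopicPipeline {
    // Drains the input "topic", applies the function to each message,
    // and appends every non-null result to the output "topic".
    // Returns the number of input messages consumed.
    static <I, O> int process(Queue<I> input, Function<I, O> fn, Queue<O> output) {
        int processed = 0;
        I message;
        while ((message = input.poll()) != null) {
            O result = fn.apply(message);
            if (result != null) {
                output.add(result);
            }
            processed++;
        }
        return processed;
    }
}
```

The user supplies only the Function; everything about where messages come from and where results go is handled outside it, which is the point of the "simplest possible API" list above.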
What is Apache Pulsar?
Hyper Converged Data Platform that includes
Messaging
Durable log storage
Lightweight processing
Open Source
How different is Apache Pulsar?
Ordering: guaranteed ordering
Multi-tenancy: a single cluster can support many tenants and use cases
High throughput: can reach 1.8 M messages/s in a single partition
Durability: data replicated and synced to disk
Geo-replication: out-of-the-box support for geographically distributed applications
Unified messaging model: supports both streaming and queuing in a single model
Delivery guarantees: at least once, at most once, and effectively once
Low latency: publish latency of 5 ms at the 99th percentile
Highly scalable: can support millions of topics
Pulsar Architecture
[Diagram: producers and consumers connect to the Pulsar brokers (Apache Pulsar layer), which sit on top of Bookies 1 to 5 (Apache BookKeeper layer)]
Stateless Serving
BROKER
Clients interact only with brokers
No state is stored in brokers
BOOKIES
Apache BookKeeper as the storage
Storage is append only
Provides high performance, low latency
Durability
No data loss. fsync before acknowledgement
Pulsar Architecture
Separation of Storage and Serving
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
    --input persistent://sample/standalone/ns1/test_input \
    --output persistent://sample/standalone/ns1/test_result \
    --className org.mycompany.ExclamationFunction \
    --jar myjar.jar
Runs as a standalone process
Run as many instances as you want. The framework automatically balances data across them
Run and manage via Mesos/Kubernetes/Nomad/your favorite tool
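The slides do not show the source of org.mycompany.ExclamationFunction. A minimal sketch of what such a class might look like, assuming it uses the plain java.util.function.Function interface and simply appends an exclamation mark, is given below. The package declaration is omitted here to keep the sketch self-contained; a real class would declare package org.mycompany to match the --className value.

```java
import java.util.function.Function;

// Assumed shape of the function named by --className; the actual
// implementation is not shown in the slides.
class ExclamationFunction implements Function<String, String> {
    // Called once per input message; the return value is the output message.
    @Override
    public String apply(String input) {
        return input + "!";
    }
}
```

Packaged into myjar.jar, a class of this kind would read each message from the input topic and write the transformed string to the output topic.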
Pulsar Functions: Use Cases
Edge Computing
Sensor devices generate tons of data
We need local actions
Simple filtering, threshold detection, regex matching, etc.
Manageability is a big concern
The fewer moving parts, the better
Resource constrained
Limited scope for full-blown schedulers/job managers
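Threshold detection of the kind listed above fits in a few lines. The sketch below is hypothetical: the class name, the THRESHOLD value, and the "ALERT" message format are all assumptions, and it assumes the runtime treats a null return value as "publish nothing".

```java
import java.util.function.Function;

// Hypothetical edge function: forwards only readings above a fixed threshold.
final class ThresholdAlertFunction implements Function<String, String> {
    // Assumed limit for illustration; a real deployment would configure this.
    static final double THRESHOLD = 100.0;

    @Override
    public String apply(String reading) {
        double value;
        try {
            value = Double.parseDouble(reading.trim());
        } catch (NumberFormatException e) {
            return null; // drop malformed readings
        }
        // Returning null is assumed to mean "no output message".
        return value > THRESHOLD ? "ALERT " + value : null;
    }
}
```

Because the function holds no state and needs no scheduler, it can run as a single local process next to the sensors.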
Pulsar Functions: Use Cases
Model Serving
Models computed via offline analysis
Incoming requests should be classified using the model
A function is a natural representation for the classification action
The model itself can be stored in BookKeeper
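As a rough sketch of classification as a function, the class below applies a linear model to a feature vector. Everything in it is an assumption for illustration: the class name, the linear form of the model, and the "positive"/"negative" labels. Loading the weights from BookKeeper is not shown; here they are passed in directly.

```java
import java.util.function.Function;

// Hypothetical classifier; the weights stand in for a model computed offline.
final class LinearClassifierFunction implements Function<double[], String> {
    private final double[] weights;
    private final double bias;

    LinearClassifierFunction(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    // Scores one request as bias + sum(weights[i] * features[i]) and labels it by sign.
    @Override
    public String apply(double[] features) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException("expected " + weights.length + " features");
        }
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score >= 0 ? "positive" : "negative";
    }
}
```

Swapping in a newly trained model only changes the weights, not the function that serves requests.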
Apache Pulsar in Production
3+ years
Serves 2.3 million topics
100 billion messages/day
Average latency < 5 ms
99th percentile latency: 15 ms (strong durability guarantees)
Zero data loss
80+ applications
Self-service provisioning
Full-mesh cross-datacenter replication across 8+ data centers
State Storage w/ BookKeeper
The built-in state management is powered by Table Service in BookKeeper
BP-30: Table Service
Originated as built-in metadata management within BookKeeper
Exposed for general usage, e.g. state management for Pulsar Functions
Developer Preview
Pulsar Functions at Pulsar 2.0
Direct usage at BookKeeper 4.7
State Storage w/ BookKeeper
Updates are written to the log streams in BookKeeper
Materialized into a key/value table view
The key/value table is indexed with RocksDB for fast lookup
The source of truth is the log streams in BookKeeper
RocksDB instances are transient key/value indexes
RocksDB instances are incrementally checkpointed and stored in BookKeeper for fast recovery
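The relationship between the log, the table view, and checkpoints can be illustrated with a small in-memory sketch. It uses neither BookKeeper nor RocksDB: the class LogBackedCounterTable is hypothetical, a list stands in for the log stream, a HashMap stands in for the index, and the checkpoint is a full snapshot rather than an incremental one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration only; it does not use BookKeeper or RocksDB.
final class LogBackedCounterTable {
    // One logged update: add delta to the counter stored under key.
    private static final class Update {
        final String key;
        final long delta;
        Update(String key, long delta) { this.key = key; this.delta = delta; }
    }

    private final List<Update> log = new ArrayList<>();     // source of truth
    private Map<String, Long> checkpoint = new HashMap<>(); // last snapshot of the table view
    private int checkpointOffset = 0;                        // log position covered by the snapshot

    // Every update is appended to the log first.
    void append(String key, long delta) {
        log.add(new Update(key, delta));
    }

    // Saves the current table view so that recovery can skip older log entries.
    void checkpoint() {
        checkpoint = rebuild();
        checkpointOffset = log.size();
    }

    // Recovers the table view: start from the snapshot, replay only newer entries.
    Map<String, Long> rebuild() {
        Map<String, Long> table = new HashMap<>(checkpoint);
        for (int i = checkpointOffset; i < log.size(); i++) {
            Update u = log.get(i);
            table.merge(u.key, u.delta, Long::sum);
        }
        return table;
    }
}
```

Losing the table is harmless in this model because rebuild can always reconstruct it from the checkpoint plus the tail of the log, which mirrors the "transient index, durable log" split described above.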