The Apache Pulsar messaging solution can perform lightweight, extensible processing on messages as they stream through the system. This presentation provides an overview of this new functionality.
2. 2
Event Driven Architectures
The rise of Real-Time
Big Data began with Batch
HDFS/MapReduce/Hive
Reaction times became important
Reduce time between data arrival and data analysis/action
Emergence of Real-Time Streaming Systems
3. 3
What do we really mean by Real-Time?
Aims
Aim is to react to events as they happen in real time
Where do Events happen/arrive?
Message Bus
What's a reaction?
An action/transformation/function
6. 6
Traditional Compute API
Stitching all of this together is left to programmers
public static class SplitSentence extends BaseBasicBolt {
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
@Override
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
String sentence = tuple.getStringByField("sentence");
String words[] = sentence.split(" ");
for (String w : words) {
basicOutputCollector.emit(new Values(w));
}
}
}
7. 7
Traditional Compute API
Stitching all of this together is left to programmers
public static class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count++;
counts.put(word, count);
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
8. 8
Compute API 2.0
Functional
Builder.newBuilder()
.newSource(() -> StreamletUtils.randomFromList(SENTENCES))
.flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
.reduceByKeyAndWindow(word -> word, word -> 1,
WindowConfig.TumblingCountWindow(50),
(x, y) -> x + y);
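To make the semantics of the pipeline above concrete, here is a minimal sketch in plain Java of what a flatMap followed by a count-based tumbling window reduction computes. The class and method names are hypothetical and nothing here uses the streamlet API itself; it assumes the window is measured in words (the tuples that exist after the split) and that a trailing partial window is also emitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of any streaming framework.
class TumblingWordCount {
    // Splits sentences into lower-case words and counts them per window of
    // `windowSize` words. A trailing partial window is emitted as well.
    static List<Map<String, Integer>> countByWindow(List<String> sentences, int windowSize) {
        List<Map<String, Integer>> windows = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        int seen = 0;
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank sentence
                }
                current.merge(word, 1, Integer::sum); // the (x, y) -> x + y reduction
                if (++seen == windowSize) {
                    windows.add(current);             // window is full: emit and reset
                    current = new LinkedHashMap<>();
                    seen = 0;
                }
            }
        }
        if (!current.isEmpty()) {
            windows.add(current);
        }
        return windows;
    }
}
```

Calling `TumblingWordCount.countByWindow(sentences, 50)` would correspond to the window size used on the slide.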
11. 11
Traditional Real-Time Systems
Developer Experience
Powerful API but complicated
Does everyone really need to learn functional programming?
Configurable/Scalable but management overhead
Edge systems have resource/manageability constraints
12. 12
Traditional Real-Time Systems
Operational Experience
Another system to operate is one too many
IoT deployments routinely have thousands of edge systems
Semantic differences
Mismatch/Duplication between Systems
Creates Developer and Operator Friction
13. 13
Lessons learnt
Use Cases
A significant percentage of transformations are simple
ETL
Reactive Services
Classification
Real-time Aggregation
Event Routing
Microservices
14. 14
Meanwhile
The world of Cloud
The emergence of Serverless
Simple Function API
Functions are submitted to the system
Run per event
Composition APIs to do complex things
Wildly popular
15. 15
Serverless vs Streaming
What's really the difference?
Both are event-driven architectures
Both can be used for analytics/serving
Both have composition APIs
Config-based for Serverless vs DSL-based for Streaming
Serverless typically doesn't care about ordering
Really a function of the underlying source
Pay per action
Really a product of the billing interface
16. 16
What's needed: Stream-Native Compute
Insight gained from serverless
Simplest possible API
Method/Procedure/Function
Multi-Language API
Scale developers
Message-bus-native concepts
Input/Output/Log as topics
Flexible runtime
Simple standalone applications vs system-managed applications
19. 19
What is Apache Pulsar?
Ordering: Guaranteed ordering
Multi-tenancy: A single cluster can support many tenants and use cases
High throughput: Can reach 1.8 M messages/s in a single partition
Durability: Data replicated and synced to disk
Geo-replication: Out-of-the-box support for geographically distributed applications
Unified messaging model: Supports both Streaming and Queuing in a single model
Delivery Guarantees: At least once, at most once and effectively once
Low Latency: Low publish latency of 5ms at 99pct
Highly scalable: Can support millions of topics
20. 20
Pulsar Architecture
[Diagram: Producer and Consumer clients connect to three Pulsar brokers (Apache Pulsar), which store data on Bookies 1 to 5 (Apache BookKeeper)]
Stateless Serving
BROKER
Clients interact only with brokers
No state is stored in brokers
BOOKIES
Apache BookKeeper as the storage
Storage is append only
Provides high performance, low latency
Durability
No data loss. fsync before acknowledgement
21. 21
Pulsar Architecture
[Diagram: Producer and Consumer clients connect to three Pulsar brokers (Apache Pulsar), which store data on Bookies 1 to 5 (Apache BookKeeper)]
Separation of Storage and Serving
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
25. 25
Multi Cluster Replication
[Diagram: Topic (T1) with Subscription (S1) replicated across Data Center A, Data Center B and Data Center C, with Producers (P1, P2, P3) and Consumers (C1, C2)]
27. 27
Pulsar Functions
API
SDK-less API
import java.util.function.Function;
public class ExclamationFunction implements Function<String, String> {
@Override
public String apply(String input) {
return input + "!";
}
}
28. 28
Pulsar Functions
API
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;
public class ExclamationFunction implements PulsarFunction<String, String> {
@Override
public String process(String input, Context context) {
return input + "!";
}
}
29. 29
Pulsar Functions
Input and Output
Function executed for every message of the input topic
Supports multiple topics as inputs
Function output goes to the output topic
Function output can be void/null
SerDe takes care of serialization/deserialization of messages
Custom SerDe can be provided by the users
Integrates with Schema Registry
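As an illustration of what a user-provided SerDe has to do, the following is a minimal sketch of a serializer/deserializer pair for a small message type. The `SimpleSerDe` interface, the `TemperatureReading` type and the comma-separated wire format are all assumptions made for this example; the interface is a local stand-in, not the Pulsar SerDe API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a SerDe contract: object <-> bytes.
interface SimpleSerDe<T> {
    byte[] serialize(T value);
    T deserialize(byte[] bytes);
}

// Hypothetical message type used only for this sketch.
class TemperatureReading {
    final String sensorId;
    final double celsius;

    TemperatureReading(String sensorId, double celsius) {
        this.sensorId = sensorId;
        this.celsius = celsius;
    }
}

// Encodes a reading as UTF-8 text "sensorId,celsius".
// Assumes sensor ids never contain a comma.
class TemperatureSerDe implements SimpleSerDe<TemperatureReading> {
    @Override
    public byte[] serialize(TemperatureReading reading) {
        String text = reading.sensorId + "," + reading.celsius;
        return text.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public TemperatureReading deserialize(byte[] bytes) {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(",", 2);
        return new TemperatureReading(parts[0], Double.parseDouble(parts[1]));
    }
}
```

The function itself then only sees `TemperatureReading` objects, while the framework would apply the SerDe on the way in from the input topic and on the way out to the output topic.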
30. 30
Pulsar Functions
Processing Guarantees
ATMOST_ONCE
Message is acked to Pulsar as soon as we receive it
ATLEAST_ONCE
Message is acked to Pulsar after the function completes
Default behaviour: not many people want to lose data
EFFECTIVELY_ONCE
Uses Pulsar’s built-in effectively-once semantics
Controlled at runtime by the user
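The difference between the first two guarantees comes down to when the acknowledgement happens relative to the function call. The sketch below simulates that ordering for a single message; it is a self-contained illustration with hypothetical names, not the runtime's actual code, and it leaves out EFFECTIVELY_ONCE because that relies on Pulsar's own deduplication.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical simulation of ack ordering for one message.
class AckOrderingSketch {
    enum Guarantee { ATMOST_ONCE, ATLEAST_ONCE }

    // Returns the ordered steps taken while handling `message`.
    static List<String> handle(String message, Function<String, String> fn, Guarantee guarantee) {
        List<String> steps = new ArrayList<>();
        if (guarantee == Guarantee.ATMOST_ONCE) {
            // Acked on receipt: if the function fails afterwards, the message is lost.
            steps.add("ack");
        }
        try {
            fn.apply(message);
            steps.add("processed");
        } catch (RuntimeException e) {
            // Under ATLEAST_ONCE no ack was sent, so the message would be redelivered.
            steps.add("failed");
            return steps;
        }
        if (guarantee == Guarantee.ATLEAST_ONCE) {
            // Acked only after the function completed successfully.
            steps.add("ack");
        }
        return steps;
    }
}
```

With a function that throws, the ATMOST_ONCE path records an ack followed by a failure (data loss), while the ATLEAST_ONCE path records only the failure, leaving the message unacknowledged.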
31. 31
Pulsar Functions
Built in State
Functions can store state in StreamStore
Framework provides a simple library around this
Supports server-side operations like counters
Simplified application development
No need to stand up an extra system
32. 32
Pulsar Functions
WordCount Topology
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;
public class CounterFunction implements PulsarFunction<String, Void> {
@Override
public Void process(String input, Context context) throws Exception {
for (String word : input.split("\\.")) {
context.incrCounter(word, 1);
}
return null;
}
}
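To try the counting logic without a cluster, a sketch like the following can help. It replaces the framework's Context with a hypothetical in-memory `InMemoryCounters` class (an assumption for this example, backed by a plain map rather than BookKeeper) and splits the input on a literal period.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the counter state behind Context.
class InMemoryCounters {
    private final Map<String, Long> counts = new HashMap<>();

    void incrCounter(String key, long amount) {
        counts.merge(key, amount, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }
}

class WordCountSketch {
    // Increments one counter per period-separated token of the input.
    static void process(String input, InMemoryCounters counters) {
        for (String word : input.split("\\.")) { // "\\." matches a literal '.'
            if (!word.isEmpty()) {
                counters.incrCounter(word, 1);
            }
        }
    }
}
```

Note that `split` takes a regular expression, so an unescaped `"."` would match every character and yield no tokens at all.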
33. 33
Built-in State Management
Pulsar uses BookKeeper as its stream storage
Functions can store state in BookKeeper
Framework provides the Context object for users to access state
Supports server-side operations like counters
Simplified application development
No need to stand up an extra system to develop/test/integrate/operate
34. 34
State Storage w/ BookKeeper
The built-in state management is powered by Table Service in BookKeeper
BP-30: Table Service
Originated as built-in metadata management within BookKeeper
Exposed for general usage, e.g. state management for Pulsar Functions
Developer Preview
Pulsar Functions in Pulsar 2.0
Direct usage in BookKeeper 4.7
35. 35
State Storage w/ BookKeeper
Updates are written to the log streams in BookKeeper
Materialized into a key/value table view
The key/value table is indexed with RocksDB for fast lookup
The source of truth is the log streams in BookKeeper
RocksDB instances are transient key/value indexes
RocksDB instances are incrementally checkpointed and stored in BookKeeper for fast recovery
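The relationship between the log, the materialized table and checkpoints can be sketched in a few lines. This is a simplified in-memory model with hypothetical names: a list stands in for the BookKeeper log stream, a map stands in for the RocksDB index, and the checkpoint is a full snapshot rather than an incremental one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: append-only log as source of truth, map as transient index.
class LogBackedTable {
    private final List<String[]> log = new ArrayList<>();     // {key, value} updates
    private Map<String, String> index = new HashMap<>();      // materialized view
    private Map<String, String> checkpoint = new HashMap<>(); // last snapshot
    private int checkpointOffset = 0;                          // log position of snapshot

    void put(String key, String value) {
        log.add(new String[] {key, value}); // write to the log first
        index.put(key, value);              // then update the view
    }

    String get(String key) {
        return index.get(key);
    }

    // Snapshots the index and remembers how much of the log it covers.
    void checkpoint() {
        checkpoint = new HashMap<>(index);
        checkpointOffset = log.size();
    }

    // Simulates losing the index: restore the snapshot, then replay only the
    // log entries written after it. Returns the number of entries replayed.
    int recover() {
        index = new HashMap<>(checkpoint);
        int replayed = 0;
        for (int i = checkpointOffset; i < log.size(); i++) {
            index.put(log.get(i)[0], log.get(i)[1]);
            replayed++;
        }
        return replayed;
    }
}
```

The checkpoint only shortens recovery; dropping it entirely would still give the same table after replaying the whole log, which is why the index can be treated as transient.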
36. 36
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
Runs as a standalone process
Run as many instances as you want. Framework automatically balances data
Run and manage via Mesos/K8s/Nomad/your favorite tool
37. 37
Pulsar Functions
Running inside Pulsar cluster
‘Create’ and ‘Delete’ Functions in a Pulsar Cluster
Pulsar brokers run functions as either threads/processes/docker containers
Unifies Messaging and Compute cluster into one, significantly improving
manageability
Ideal match for Edge or small startup environment
Serverless in a jar
38. 38
Pulsar Functions
Stepping back: Where Pulsar Functions belong
Powerful/Complicated systems have their place
Data Centers/Cloud
Complex analysis
A significant percentage of analytics/actions are mundane
ETL/Counting/Routing
Use simple tools for simple things
39. 39
Pulsar Functions: Use Cases
Edge Computing
Sensor devices generate tons of data
We need local actions
Simple filtering, threshold detection, regex matching, etc.
Manageability is a big concern
The fewer moving parts, the better
Resource Constrained
Limited scope for full-blown schedulers/Job Managers
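A threshold detector of the kind mentioned above fits the SDK-less style, since it only needs `java.util.function.Function`. The sketch below is hypothetical: the class name, the `sensorId,value` input format and the alert text are assumptions for illustration. It returns null when there is nothing to report, relying on function output being allowed to be null.

```java
import java.util.function.Function;

// Hypothetical edge function: emit an alert only when a reading exceeds a threshold.
class ThresholdAlertFunction implements Function<String, String> {
    private final double threshold;

    ThresholdAlertFunction(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public String apply(String input) {
        // Assumed input format: "sensorId,value"
        String[] parts = input.split(",", 2);
        if (parts.length != 2) {
            return null; // malformed message: produce no output
        }
        double value;
        try {
            value = Double.parseDouble(parts[1].trim());
        } catch (NumberFormatException e) {
            return null; // non-numeric reading: produce no output
        }
        // Null means "nothing to publish" for readings at or below the threshold.
        return value > threshold ? "ALERT " + parts[0].trim() + " " + value : null;
    }
}
```

Because the class has no framework dependencies, it can be unit-tested on a laptop before being deployed to a constrained edge node.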
40. 40
Pulsar Functions: Use Cases
Model Serving
Models computed via offline analysis
Incoming requests should be classified using the model
Function is a natural representation for the classification action
Model itself can be stored in BookKeeper
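As a rough illustration of classification expressed as a function, the sketch below scores each incoming request against a tiny keyword-weight model. Everything here is hypothetical: the class name, the model shape and the labels are assumptions, and the model is passed in through the constructor rather than loaded from BookKeeper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical classifier: sums per-keyword weights and compares with a cutoff.
class KeywordClassifierFunction implements Function<String, String> {
    private final Map<String, Double> weights; // model computed offline
    private final double cutoff;

    KeywordClassifierFunction(Map<String, Double> weights, double cutoff) {
        this.weights = new HashMap<>(weights); // defensive copy of the model
        this.cutoff = cutoff;
    }

    @Override
    public String apply(String request) {
        double score = 0.0;
        for (String token : request.toLowerCase().split("\\s+")) {
            score += weights.getOrDefault(token, 0.0); // unknown tokens score zero
        }
        return score >= cutoff ? "spam" : "ham";
    }
}
```

Swapping in a newly trained model would then mean constructing the function with a different weight map, with no change to the classification code.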
41. 41
Roadmap
More language support: Go, JavaScript, C++
Cross Functions: Function Composition API
More State operations exposed to Functions
42. 42
Conclusion
Stream-Native Compute (aka Functions) is the new paradigm in Messaging Systems
Stream-Native Storage (aka States) is the new paradigm in Storage Systems
Pulsar Functions bring lightweight computing capability into the messaging and storage system, which is the direction streaming applications need
https://pulsar.incubator.apache.org/docs/latest/functions/quickstart/