2. Kafka Streams: A tale of three APIs
● KSQL: high-level API, no Java experience needed
● Streams DSL: fluent API similar to Java 8 Streams
● Processor API: allows access to the underlying state stores
7. Streaming
Each input event leads to zero or more output events.

Input              Output
"Hello world"      hello → 1, world → 1
"A nice world"     nice → 1, world → 2
"and hello again"  hello → 2, again → 1
"bye bye world"    bye → 2, world → 3

Stateful: we need to remember the counts for all words seen before.
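The stateful counting above can be sketched in plain Java, with a HashMap standing in for the state store (class and method names are illustrative, not Kafka Streams API):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal stateful word count: the map plays the role of the state store.
class WordCount {
    private final Map<String, Long> counts = new HashMap<>();

    // Each input line emits one updated count per word
    // (zero or more output events per input event).
    Map<String, Long> process(String line) {
        Map<String, Long> emitted = new LinkedHashMap<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            long count = counts.merge(word, 1L, Long::sum);
            emitted.put(word, count);
        }
        return emitted;
    }
}
```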
11. StateStores in Kafka
KTables are backed by StateStores:
- key-value store (RocksDB)
  - get, put, delete, all, range
- provides data locality
  - no network roundtrips to update state
- backed up in a Kafka changelog topic
  - gives fault tolerance
No random access to state from the Streams DSL.
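The store operations listed above behave like a sorted key-value map. A minimal sketch using a TreeMap (illustrative only, not the RocksDB-backed implementation):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy stand-in for a KeyValueStore<String, Long>: same operation set,
// backed by an in-memory sorted map instead of RocksDB.
class ToyKeyValueStore {
    private final TreeMap<String, Long> map = new TreeMap<>();

    Long get(String key)              { return map.get(key); }
    void put(String key, Long value)  { map.put(key, value); }
    void delete(String key)           { map.remove(key); }
    SortedMap<String, Long> all()     { return map; }
    // range queries are inclusive on both ends in Kafka Streams
    SortedMap<String, Long> range(String from, String to) {
        return map.subMap(from, true, to, true);
    }
}
```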
12. Limitations of the Streams DSL
- no 'random' write access to state stores
- no output events without input events
- no way to trigger computation based on wall-clock time
14. Enter the Processor API
● more fine-grained control over event propagation
● ingredients:
○ Processor/Transformer:
interface Processor<K, V> {
    void process(K key, V value);
}
interface Transformer<K, V, R> {
    R transform(K key, V value);
}
○ ProcessorContext
○ StateStores
○ Punctuators
● the Streams DSL and KSQL compile down to the Processor API
● can be combined with the Streams DSL
15. WordCount with Processor
public void process(String _key, String value) {
    // split on non-word characters; note the escaped regex "\\W+"
    for (String word : value.toLowerCase().split("\\W+")) {
        if (!stopWords.contains(word)) {
            Long count = counts.get(word);
            if (count == null) count = 0L;
            count += 1;
            counts.put(word, count);
            context.forward(word, count);
        }
    }
}
Caveat: not the best use case, as we are just re-implementing DSL functionality.
16. Adding a processor to a topology
final StoreBuilder<KeyValueStore<String, Long>> countStoreBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("count_state_store"),
        Serdes.String(),
        Serdes.Long()
    );
builder.addSource("Source", "source-topic")
    .addProcessor("Process", () -> new WordCountProcessor(), "Source")
    .addStateStore(countStoreBuilder, "Process")
    .addSink("Sink", "sink-topic", "Process");
17. ProcessorContext
● allows a Processor/Transformer to access the 'outside world'
● allows access to record metadata
○ headers
○ offset
○ timestamp
○ topic name
● allows access to state stores
● use `context#forward` to send messages downstream
18. Use cases
● access to record metadata, or other (unit-testable) extensions of the DSL
● random access to state stores
● periodic computations -> punctuators
○ a cron job for your streams
○ scheduled either by wall-clock (processing) time or by stream (event) time
19. Extending the DSL: filter by record header
Task: filter records according to the value of a certain header.
class HeaderFilterTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {
    private final String headerName;
    private final String headerValue;
    private ProcessorContext context;

    public HeaderFilterTransformer(String headerName, String headerValue) {
        this.headerName = headerName;
        this.headerValue = headerValue;
    }

    public void init(ProcessorContext context) {
        this.context = context;
    }

    // returning null drops the record
    public KeyValue<K, V> transform(K key, V value) {
        for (Header header : context.headers()) {
            if (header.key().equals(headerName)) {
                return new String(header.value()).equals(headerValue)
                    ? KeyValue.pair(key, value)
                    : null;
            }
        }
        return null;
    }
}
21. Use case: aggregating CDC messages
● get CDC (change data capture) messages from a source database
○ each message represents a change to a single DB row
○ each message contains a transaction id
● need to aggregate CDC messages into a complete business entity and forward those whenever a new transaction id occurs
Solution:
● keep denormalized copies of the aggregated business entities in a state store
● update them with changes via CDC
● keep a list of the aggregated business entities which were changed during the transaction
● forward all changed entities when a new transaction id occurs
Alternative solution:
● do not 'pre-aggregate', but use range queries on state stores with compound keys and aggregate while forwarding
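The first solution above can be sketched in plain Java, with maps standing in for the state stores; the transaction-id and entity-key parameters are illustrative assumptions about the CDC message shape:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Aggregates row-level CDC changes into business entities, flushing every
// entity touched by a transaction when a new transaction id arrives.
class CdcAggregator {
    // state-store stand-in: entity key -> denormalized entity (column -> value)
    private final Map<String, Map<String, String>> entities = new HashMap<>();
    private final Set<String> changedInTx = new LinkedHashSet<>();
    private String currentTxId = null;

    // Returns the entities to forward downstream (empty unless the tx id changed).
    List<Map<String, String>> process(String txId, String entityKey,
                                      String column, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        if (currentTxId != null && !currentTxId.equals(txId)) {
            // new transaction id: forward every entity changed in the previous one
            for (String key : changedInTx) out.add(new HashMap<>(entities.get(key)));
            changedInTx.clear();
        }
        currentTxId = txId;
        entities.computeIfAbsent(entityKey, k -> new HashMap<>()).put(column, value);
        changedInTx.add(entityKey);
        return out;
    }
}
```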
22. Punctuators
● scheduled (periodic) execution of code
● two notions of time:
○ stream time (only advances when messages arrive)
○ wall-clock time
● does not run concurrently with process/transform
● cancellable
Punctuator use cases:
● implement a time to live (TTL) for state stores
○ useful since a KTable has no concept of retention
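The TTL use case can be sketched as the body of such a punctuator, with a plain map of (key → last-update timestamp) standing in for the state store; in a real Processor this loop would iterate `store.all()` inside a callback registered via `context.schedule(...)`:

```java
import java.util.Iterator;
import java.util.Map;

// Punctuator body for a state-store TTL: drop entries older than maxAgeMs.
class TtlSweeper {
    static void sweep(Map<String, Long> lastUpdated, long nowMs, long maxAgeMs) {
        Iterator<Map.Entry<String, Long>> it = lastUpdated.entrySet().iterator();
        while (it.hasNext()) {
            // remove entries whose last update is older than the TTL
            if (nowMs - it.next().getValue() > maxAgeMs) it.remove();
        }
    }
}
```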
23. Scheduled WordCount
public void init(ProcessorContext context) {
    this.context = context;
    context.schedule(
        Duration.ofSeconds(1), PunctuationType.STREAM_TIME,
        timestamp -> {
            // once per second of stream time, emit the current count of every word
            try (KeyValueIterator<String, Long> iter = counts.all()) {
                iter.forEachRemaining(entry ->
                    context.forward(entry.key, entry.value.toString()));
            }
        });
}

public void process(String dummy, String line) {
    for (String word : line.toLowerCase().split("\\W+")) {
        final Long oldValue = counts.get(word);
        final Long newValue = oldValue == null ? 1L : oldValue + 1;
        counts.put(word, newValue);
    }
}
Totally different semantics!
24. Wrap up
The Processor API allows us to augment the Streams DSL with
● random (write) access to state stores
● access to record metadata
● scheduled processing via punctuators
and is in general nothing to be afraid of!
Want to know more use cases? Check out Antony Stubbs' excellent talk:
https://www.youtube.com/watch?v=_KAFdwJ0zBA