2. Kafka Streams: A tale of three APIs
● KSQL: high-level API, no Java experience needed
● Streams DSL: fluent API similar to Java 8 Streams
● Processor API: allows access to the underlying state stores
7. Streaming
Each input event leads to zero or more output events.

Input              Output
"Hello world"      hello → 1, world → 1
"A nice world"     nice → 1, world → 2
"and hello again"  hello → 2, again → 1
"bye bye world"    bye → 2, world → 3

Stateful: we need to remember the counts for all words seen before.
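The stateful counting above can be sketched in plain Java, with a HashMap standing in for the state store (class and method names are illustrative, not Kafka Streams API):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal stateful word count: the map plays the role of the state store.
class WordCount {
    private final Map<String, Long> counts = new HashMap<>();

    // Each input line emits one updated count per word
    // (zero or more output events per input event).
    Map<String, Long> process(String line) {
        Map<String, Long> emitted = new LinkedHashMap<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            long count = counts.merge(word, 1L, Long::sum);
            emitted.put(word, count);
        }
        return emitted;
    }
}
```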
11. StateStores in Kafka
KTables are backed by StateStores:
- key-value store (RocksDB)
  - get, put, delete, all, range
- provides data locality
  - no network roundtrips to update state
- backed up in a Kafka changelog topic
  - gives fault tolerance
No random access to state from the Streams DSL.
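The store operations listed above behave like a sorted key-value map. A minimal sketch using a TreeMap (illustrative only, not the RocksDB-backed implementation):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy stand-in for a KeyValueStore<String, Long>: same operation set,
// backed by an in-memory sorted map instead of RocksDB.
class ToyKeyValueStore {
    private final TreeMap<String, Long> map = new TreeMap<>();

    Long get(String key)              { return map.get(key); }
    void put(String key, Long value)  { map.put(key, value); }
    void delete(String key)           { map.remove(key); }
    SortedMap<String, Long> all()     { return map; }
    // range queries are inclusive on both ends in Kafka Streams
    SortedMap<String, Long> range(String from, String to) {
        return map.subMap(from, true, to, true);
    }
}
```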
12. Limitations of the Streams DSL
- no 'random' write access to state stores
- no output events without input events
- no way to trigger computation based on wall-clock time
14. Enter the Processor API
● more fine-grained control over event propagation
● ingredients:
○ Processor/Transformer:
interface Processor<K, V> {
    void process(K key, V value);
}
interface Transformer<K, V, R> {
    R transform(K key, V value);
}
○ ProcessorContext
○ StateStores
○ Punctuators
● the Streams DSL and KSQL compile down to the Processor API
● can be combined with the Streams DSL
15. WordCount with Processor
public void process(String _key, String value) {
    // split on non-word characters; note the escaped regex "\\W+"
    for (String word : value.toLowerCase().split("\\W+")) {
        if (!stopWords.contains(word)) {
            Long count = counts.get(word);
            if (count == null) count = 0L;
            count += 1;
            counts.put(word, count);
            context.forward(word, count);
        }
    }
}
Caveat: not the best use case, as we are just re-implementing DSL functionality.
16. Adding a processor to a topology
final StoreBuilder<KeyValueStore<String, Long>> countStoreBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("count_state_store"),
        Serdes.String(),
        Serdes.Long()
    );
builder.addSource("Source", "source-topic")
    .addProcessor("Process", () -> new WordCountProcessor(), "Source")
    .addStateStore(countStoreBuilder, "Process")
    .addSink("Sink", "sink-topic", "Process");
17. ProcessorContext
● allows a Processor/Transformer to access the 'outside world'
● allows access to record metadata
○ headers
○ offset
○ timestamp
○ topic name
● allows access to state stores
● use `context#forward` to send messages downstream
18. Use cases
● access to record metadata, or other (unit-testable) extensions of the DSL
● random access to state stores
● periodic computations -> punctuators
○ a cron job for your streams
○ scheduled either by wall-clock (processing) time or by stream (event) time
19. Extending the DSL: filter by record header
Task: filter records according to the value of a certain header.
class HeaderFilterTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {
    private final String headerName;
    private final String headerValue;
    private ProcessorContext context;

    public HeaderFilterTransformer(String headerName, String headerValue) {
        this.headerName = headerName;
        this.headerValue = headerValue;
    }

    public void init(ProcessorContext context) {
        this.context = context;
    }

    // returning null drops the record
    public KeyValue<K, V> transform(K key, V value) {
        for (Header header : context.headers()) {
            if (header.key().equals(headerName)) {
                return new String(header.value()).equals(headerValue)
                    ? KeyValue.pair(key, value)
                    : null;
            }
        }
        return null;
    }
}
21. Use case: aggregating CDC messages
● get CDC (change data capture) messages from a source database
○ each message represents a change to a single DB row
○ each message contains a transaction id
● need to aggregate CDC messages into a complete business entity and forward those whenever a new transaction id occurs
Solution:
● keep denormalized copies of the aggregated business entities in a state store
● update them with changes via CDC
● keep a list of the aggregated business entities which were changed during the transaction
● forward all changed entities when a new transaction id occurs
Alternative solution:
● do not 'pre-aggregate', but use range queries on state stores with compound keys and aggregate while forwarding
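The first solution above can be sketched in plain Java, with maps standing in for the state stores; the transaction-id and entity-key parameters are illustrative assumptions about the CDC message shape:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Aggregates row-level CDC changes into business entities, flushing every
// entity touched by a transaction when a new transaction id arrives.
class CdcAggregator {
    // state-store stand-in: entity key -> denormalized entity (column -> value)
    private final Map<String, Map<String, String>> entities = new HashMap<>();
    private final Set<String> changedInTx = new LinkedHashSet<>();
    private String currentTxId = null;

    // Returns the entities to forward downstream (empty unless the tx id changed).
    List<Map<String, String>> process(String txId, String entityKey,
                                      String column, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        if (currentTxId != null && !currentTxId.equals(txId)) {
            // new transaction id: forward every entity changed in the previous one
            for (String key : changedInTx) out.add(new HashMap<>(entities.get(key)));
            changedInTx.clear();
        }
        currentTxId = txId;
        entities.computeIfAbsent(entityKey, k -> new HashMap<>()).put(column, value);
        changedInTx.add(entityKey);
        return out;
    }
}
```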
22. Punctuators
● scheduled (periodic) execution of code
● two notions of time:
○ stream time (only advances when messages arrive)
○ wall-clock time
● does not run concurrently with process/transform
● cancellable
Punctuator use cases:
● implement a time to live (TTL) for state stores
○ useful since a KTable has no concept of retention
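The TTL use case can be sketched as the body of such a punctuator, with a plain map of (key → last-update timestamp) standing in for the state store; in a real Processor this loop would iterate `store.all()` inside a callback registered via `context.schedule(...)`:

```java
import java.util.Iterator;
import java.util.Map;

// Punctuator body for a state-store TTL: drop entries older than maxAgeMs.
class TtlSweeper {
    static void sweep(Map<String, Long> lastUpdated, long nowMs, long maxAgeMs) {
        Iterator<Map.Entry<String, Long>> it = lastUpdated.entrySet().iterator();
        while (it.hasNext()) {
            // remove entries whose last update is older than the TTL
            if (nowMs - it.next().getValue() > maxAgeMs) it.remove();
        }
    }
}
```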
23. Scheduled WordCount
public void init(ProcessorContext context) {
    this.context = context;
    context.schedule(
        Duration.ofSeconds(1), PunctuationType.STREAM_TIME,
        timestamp -> {
            // once per second of stream time, emit the current count of every word
            try (KeyValueIterator<String, Long> iter = counts.all()) {
                iter.forEachRemaining(entry ->
                    context.forward(entry.key, entry.value.toString()));
            }
        });
}

public void process(String dummy, String line) {
    for (String word : line.toLowerCase().split("\\W+")) {
        final Long oldValue = counts.get(word);
        final Long newValue = oldValue == null ? 1L : oldValue + 1;
        counts.put(word, newValue);
    }
}
Totally different semantics!
24. Wrap up
The Processor API allows us to augment the Streams DSL with
● random (write) access to state stores
● access to record metadata
● scheduled processing via punctuators
and is in general nothing to be afraid of!
Want to know more use cases? Check out Antony Stubbs' excellent talk:
https://www.youtube.com/watch?v=_KAFdwJ0zBA