Introduction to the
Processor API
2
Kafka Streams: A tale of three APIs
● KSQL: high-level API, no Java experience needed
● Streams DSL: fluent API similar to Java 8 Streams
● Processor API: allows access to the underlying state stores
In the beginning,
there was the word!
In the beginning,
there was word count!
Streaming word count
5
final List<String> stopWords = Arrays.asList("a", "an", "and", "the");
final KStream<String, String> textLines = builder.stream("wordcount_input");
final KStream<String, Long> wordCounts = textLines
    .mapValues(line -> line.toLowerCase())
    .flatMapValues(line -> Arrays.asList(line.split("\\W+")))
    .filterNot((_k, word) -> stopWords.contains(word))
    .groupBy((_k, word) -> word)
    .count()
    .toStream();
wordCounts.to("wordcounts", Produced.with(Serdes.String(), Serdes.Long()));
An example of stateful stream processing using the Kafka Streams DSL!
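The snippet above assumes a StreamsBuilder named builder. A minimal sketch of configuring and starting such an application; the application id, bootstrap servers, and default serdes are made-up values for illustration:
final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");          // hypothetical id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // hypothetical broker
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

final StreamsBuilder builder = new StreamsBuilder();
// ... the word-count topology from above is defined against this builder ...

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));  // close cleanly on shutdown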
Stateful streaming?
Streaming: an input event leads to zero or more output events
7
Input               Output
"Hello world"       hello → 1, world → 1
"A nice world"      nice → 1, world → 2
"and hello again"   hello → 2, again → 1
"bye bye world"     bye → 2, world → 3
Stateful: we need to remember the counts for all words seen before.
Behind everything: stream and table duality
8
[Diagram: aggregating a stream (of events) yields a table (of state); the table's changelog is again a stream (of events)]
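In DSL terms, a small sketch of the duality, assuming wordStream is a KStream<String, String> keyed by word:
final KTable<String, Long> counts = wordStream.groupByKey().count();  // stream -> table (aggregate)
final KStream<String, Long> changelog = counts.toStream();            // table -> stream (changelog)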
Streams and tables in the Streams DSL
9
Where is state stored?
StateStores in Kafka
11
KTables are backed by StateStores
- Key-Value store (RocksDB), see the sketch below
  - get
  - put
  - delete
  - all
  - range
- provides data locality
  - no network roundtrips to update state
- backed up in a Kafka changelog topic
  - gives fault tolerance
No random access to state from the Streams DSL.
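As a small illustration of these operations, a sketch of using a KeyValueStore<String, Long> from inside a processor; how the store is obtained in init() is shown later, and the usage here is illustrative only:
final KeyValueStore<String, Long> store =
    (KeyValueStore<String, Long>) context.getStateStore("count_state_store");
store.put("hello", 1L);                      // create / update
final Long count = store.get("hello");       // point lookup
store.delete("hello");                       // remove
try (KeyValueIterator<String, Long> all = store.all()) { /* iterate all entries */ }
try (KeyValueIterator<String, Long> range = store.range("a", "b")) { /* iterate a key range */ }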
Limitations of Streams DSL
12
- no ‘random’ write access to state stores
- no input events means no output events
- no way to trigger computations based on wall-clock time
The processor API
Enter the Processor API
14
● more fine-grained control over event propagation
● ingredients
○ Processor/Transformer:
interface Processor<K, V> {
void process(K key, V value);
}
interface Transformer<K, V, R> {
R transform(K key, V value);
}
○ ProcessorContext
○ StateStores
○ Punctuators
● the Streams DSL and KSQL are implemented on top of (compile down to) the Processor API
● can be combined with Streams DSL
WordCount with Processor
15
public void process(String _key, String value) {
  for (String word : value.toLowerCase().split("\\W+")) {
    if (!stopWords.contains(word)) {
      Long count = counts.get(word);
      if (count == null) count = 0L;
      count += 1;
      counts.put(word, count);
      context.forward(word, count);
    }
  }
}
Caveat: not the best use case as we are just re-implementing DSL functionality
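The process() method above uses two fields, context and counts. A minimal sketch of the init() method that sets them up; the store name matches the one registered on the next slide:
private ProcessorContext context;
private KeyValueStore<String, Long> counts;

@Override
public void init(ProcessorContext context) {
  this.context = context;
  // retrieve the state store that was attached to this processor in the topology
  this.counts = (KeyValueStore<String, Long>) context.getStateStore("count_state_store");
}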
Adding processor to a topology
16
final StoreBuilder<KeyValueStore<String, Long>> countStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("count_state_store"),
    Serdes.String(),
    Serdes.Long()
);
// builder here is a Topology (Processor API), not a StreamsBuilder
builder.addSource("Source", "source-topic")
    .addProcessor("Process", () -> new WordCountProcessor(), "Source")
    // connecting the store to "Process" makes it available via context.getStateStore()
    .addStateStore(countStoreBuilder, "Process")
    .addSink("Sink", "sink-topic", "Process");
ProcessorContext
17
● allows a Processor/Transformer to access the 'outside world'
● allows access to record metadata (see the sketch below)
○ header
○ offset
○ timestamp
○ topic-name
● allows access to state stores
● use `context#forward` to send messages downstream
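A small sketch of reading that metadata from within process()/transform(); the log line itself is purely illustrative:
System.out.printf("topic=%s partition=%d offset=%d timestamp=%d headers=%s%n",
    context.topic(), context.partition(), context.offset(),
    context.timestamp(), context.headers());
context.forward(key, value);  // and send the (possibly modified) record downstream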
Use cases
18
● Access to record metadata or other (unit testable) extensions of the DSL
● random access to state-stores
● periodic computations -> punctuators
○ Cron job for your streams
○ scheduled either by wall-clock (processing) time or stream time
Extending the DSL: filter by record header
19
Task: filter records according to the value of a certain header.
class HeaderFilterTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

  private final String headerName;
  private final String headerValue;
  private ProcessorContext context;

  public HeaderFilterTransformer(String headerName, String headerValue) {
    this.headerName = headerName;
    this.headerValue = headerValue;
  }

  @Override
  public void init(ProcessorContext context) {
    this.context = context;
  }

  @Override
  public KeyValue<K, V> transform(K key, V value) {
    final Headers headers = context.headers();
    for (Header header : headers) {
      if (header.key().equals(headerName)) {
        // forward the record only if the header value matches, otherwise drop it
        return new String(header.value()).equals(headerValue)
            ? KeyValue.pair(key, value)
            : null;
      }
    }
    return null; // header not present: drop the record
  }

  @Override
  public void close() {}
}
Using HeaderFilterTransformer
20
KStream<...> filtered = originalStream.transform(
    () -> new HeaderFilterTransformer<>("headerName", "valToLookFor"));
Transformer can be independently unit tested.
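For example, a hedged sketch of such a test using TopologyTestDriver; the topic names, properties, and assertion style are assumptions for illustration:
final StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("input")
    .transform(() -> new HeaderFilterTransformer<String, String>("headerName", "valToLookFor"))
    .to("output");

final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "header-filter-test");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
  final TestInputTopic<String, String> input =
      driver.createInputTopic("input", new StringSerializer(), new StringSerializer());
  final TestOutputTopic<String, String> output =
      driver.createOutputTopic("output", new StringDeserializer(), new StringDeserializer());

  input.pipeInput("key", "value");   // record without the expected header
  assert output.isEmpty();           // so the transformer should have dropped it
}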
Use case: aggregating CDC messages
21
● get CDC (change data capture) messages from a source database
  ○ each message represents a change to a single DB row
  ○ each message contains a transaction id
● need to aggregate the CDC messages into a complete business entity and forward it whenever a new transaction id occurs
Solution (sketched below):
● keep denormalized copies of the aggregated business entities in a state store
● update them with the changes arriving via CDC
● keep a list of the business entities that were changed during the current transaction
● forward all changed entities when a new transaction id occurs
Alternative solution:
● do not ‘pre-aggregate’, but use range queries on state stores with compound keys and aggregate while forwarding
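A minimal sketch of the first solution as a Transformer; CdcMessage, BusinessEntity, the store name, and the accessor methods are hypothetical stand-ins for whatever the CDC format actually provides:
class CdcAggregatingTransformer
    implements Transformer<String, CdcMessage, KeyValue<String, BusinessEntity>> {

  private ProcessorContext context;
  private KeyValueStore<String, BusinessEntity> entities;   // denormalized business entities
  private String currentTxId;                               // simplified: kept in memory here,
  private final Set<String> touchedIds = new HashSet<>();   // a real impl would use a store

  @Override
  public void init(ProcessorContext context) {
    this.context = context;
    this.entities = (KeyValueStore<String, BusinessEntity>) context.getStateStore("entity_store");
  }

  @Override
  public KeyValue<String, BusinessEntity> transform(String key, CdcMessage change) {
    // A new transaction id means the previous transaction is complete:
    // forward every entity it touched.
    if (currentTxId != null && !currentTxId.equals(change.getTransactionId())) {
      for (String id : touchedIds) {
        context.forward(id, entities.get(id));
      }
      touchedIds.clear();
    }
    currentTxId = change.getTransactionId();

    // Apply the row-level change to the denormalized copy in the state store.
    final BusinessEntity updated = change.applyTo(entities.get(change.getEntityId()));
    entities.put(change.getEntityId(), updated);
    touchedIds.add(change.getEntityId());

    return null;  // output is emitted via context.forward, not via the return value
  }

  @Override
  public void close() {}
}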
Punctuators
22
● Scheduled (periodic) execution of code
Two notions of time:
○ stream time (only advances if messages arrive)
○ wall clock time
● Does not run concurrently with process/transform
● Cancellable
Punctuator use cases:
● Implement time to live (TTL) for state stores (see the sketch below)
● Useful since KTable has no concept of retention
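A minimal sketch of the TTL idea, assuming the processor also maintains a "last_seen_store" (a KeyValueStore<String, Long> of last-update timestamps); the store name and intervals are made up for illustration:
public void init(ProcessorContext context) {
  final KeyValueStore<String, Long> lastSeen =
      (KeyValueStore<String, Long>) context.getStateStore("last_seen_store");
  final long ttlMs = Duration.ofHours(24).toMillis();

  // Every 5 minutes of wall-clock time, delete every key not updated within the TTL.
  context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
    try (KeyValueIterator<String, Long> iter = lastSeen.all()) {
      while (iter.hasNext()) {
        final KeyValue<String, Long> entry = iter.next();
        if (timestamp - entry.value > ttlMs) {
          lastSeen.delete(entry.key);
        }
      }
    }
  });
}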
Scheduled WordCount
23
public void init(ProcessorContext context) {
  this.context = context;
  context.schedule(
      Duration.ofSeconds(1), PunctuationType.STREAM_TIME,
      (timestamp) -> {
        try (KeyValueIterator<String, Long> iter = counts.all()) {
          iter.forEachRemaining(entry ->
              context.forward(entry.key, entry.value.toString()));
        }
      });
}

public void process(String dummy, String line) {
  for (String word : line.toLowerCase().split("\\W+")) {
    final Long oldValue = counts.get(word);
    final Long newValue = oldValue == null ? 1L : oldValue + 1;
    counts.put(word, newValue);
  }
}
Totally different semantics: output is emitted on punctuation, not for every input record!
Wrap up
24
Processor API allows us to augment Streams DSL with
● random (write) access to state stores
● access to record metadata
● scheduled processing via punctuators
and is in general nothing to be afraid of!
Want to know more use cases?
Check out Antony Stubbs’ excellent talk:
https://www.youtube.com/watch?v=_KAFdwJ0zBA
Thank you!
@cschubertc
cschubert@confluent.io
cnfl.io/meetups cnfl.io/slack cnfl.io/blog
Confluent Developer: developer.confluent.io
Learn Kafka. Start building with Apache Kafka at Confluent Developer.