Follow the (Kafka) Streams

HelloWorld!
○ Mario Molina
○ Big Data Engineer @ Datio
○ Working in data & all things related since 2005.
○ You can find me at:
mmolimar_
mmolimar
mmolimar

A distributed streaming platform
○ Distributed, ordered commit log.
○ Pull-based publish/subscribe messaging with message retention on disk.
○ Fault-tolerant and horizontally scalable.
○ Each consumer group reads topics and partitions independently of other groups.
○ Binary TCP-based communication protocol.
○ Actively developed.
○ Very stable; used by industry-leading companies.
○ Excellent APIs (mainly for JVM languages).
○ Optimized for reading records in the same order they were written.
○ Optimized for massive writes.
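The commit-log and consumer-group points above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Kafka client API: `PartitionLog`, `poll` and the group names "billing" and "audit" are hypothetical names chosen for the example. It only shows that records are appended once, read in write order, and that each group advances its own offset.

```python
from collections import defaultdict


class PartitionLog:
    """Toy in-memory model of one topic partition (not the Kafka client API)."""

    def __init__(self):
        self.records = []                # append-only: records are never modified
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def append(self, value):
        """Append a record and return the offset it was written at."""
        self.records.append(value)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Pull up to max_records for a group, in write order, and advance its offset."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


log = PartitionLog()
for event in ["a", "b", "c"]:
    log.append(event)

# Two hypothetical consumer groups read the same log independently.
first_billing = log.poll("billing", max_records=2)
first_audit = log.poll("audit")
second_billing = log.poll("billing")
```

Because the log is retained rather than consumed, the "audit" group still sees every record even though "billing" has already read part of it.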

The ecosystem
[Diagram: applications around Kafka acting as Producers, Consumers, Connectors (Sources and Sinks) and Streams.]

Kafka APIs
○ Producer API.
○ Consumer API.
○ Connect API: sources and sinks.
○ Streams API.

Streams API
○ Library (Java & Scala) for stream processing (one-record-at-a-time).
○ Lightweight, with a low barrier to entry.
○ High-level DSL & low-level Processor API.
○ Semantics: at-least-once & exactly-once.
○ Fault tolerance.
○ Scalable & recoverable (improved with KIP-429 and KIP-441).
○ No external dependencies.

Key concepts in Streams API
○ The processor topology: represented by a directed acyclic graph (DAG).
○ Types of nodes in the processor topology:
    ○ Source processor.
    ○ Stream processor.
    ○ Sink processor.
○ State stores.
○ Sub-topologies.
○ Abstractions: KStream, KTable and GlobalKTable.
[Diagram: a Topology in which a Source feeds Processors and Sinks, grouped into two sub-topologies, with a state store attached to the Processor* nodes.]
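To make the source / stream processor / sink roles concrete, here is a minimal sketch of a processor topology as a small DAG in plain Python. It is not the Streams Processor API: the `Node` and `Sink` classes, the node names and the sample records are all hypothetical. Each node handles one record at a time and forwards its results to its children.

```python
class Node:
    """A node in a toy processor topology; fn maps one record to zero or more records."""

    def __init__(self, name, fn=None):
        self.name = name
        self.fn = fn or (lambda record: [record])  # default: pass the record through
        self.children = []

    def to(self, child):
        """Connect this node to a downstream node and return the child for chaining."""
        self.children.append(child)
        return child

    def process(self, record):
        for out in self.fn(record):
            for child in self.children:
                child.process(out)


class Sink(Node):
    """Terminal node: collects records instead of forwarding them."""

    def __init__(self, name):
        super().__init__(name)
        self.output = []

    def process(self, record):
        self.output.append(record)


# source -> upper-case values -> keep values longer than 2 chars -> sink
source = Node("source")
upper = source.to(Node("upper", lambda kv: [(kv[0], kv[1].upper())]))
keep_long = upper.to(Node("keep-long", lambda kv: [kv] if len(kv[1]) > 2 else []))
sink = keep_long.to(Sink("sink"))

for record in [("k1", "hello"), ("k2", "hi")]:
    source.process(record)  # one record at a time
```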

Abstractions
KStream
○ Partitioned record stream.
○ Immutable data (append only).
○ “INSERT” mode (from the SQL perspective).
KTable
○ Partitioned table.
○ Each record represents the latest state/value of its key.
○ “UPSERT” mode (from the SQL perspective).
GlobalKTable
○ Not partitioned.
○ Same as a KTable but with data from all partitions.
○ Just for the DSL.

Stream-table duality
          T0         T1                  T2                  T3
KStream   k1 -> A    k2 -> B             k1 -> C             k2 -> D
KTable    k1 -> A    k1 -> A, k2 -> B    k1 -> C, k2 -> B    k1 -> C, k2 -> D
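The duality example with keys k1 and k2 at T0..T3 can be replayed in a few lines. This is a minimal sketch in plain Python (the function name `table_snapshots` is hypothetical, and no Kafka Streams classes are involved): the stream is a list of appended records, and the table is what results from applying each record as an upsert on its key.

```python
def table_snapshots(stream):
    """Replay a record stream as upserts and return the table state after each record."""
    table = {}
    snapshots = []
    for key, value in stream:
        table[key] = value             # UPSERT: the latest value per key wins
        snapshots.append(dict(table))  # copy the table state at this point in time
    return snapshots


# The KStream from the slide: one record per time step T0..T3.
stream = [("k1", "A"), ("k2", "B"), ("k1", "C"), ("k2", "D")]
snapshots = table_snapshots(stream)

for step, state in enumerate(snapshots):
    print(f"T{step}: {state}")
```

Read the other way round, the sequence of changes applied to the table is exactly the original stream, which is why a table can always be turned back into a stream of updates.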

Types of operations (DSL)
Stateless
○ filter / filterNot.
○ mapValues.
○ flatMapValues.
○ branch.
○ toStream.
○ map(*).
○ flatMap(*).
○ selectKey(*).
○ groupByKey.
○ groupBy.
○ ...
Stateful
○ aggregate.
○ joins (inner, left, outer).
○ count.
○ reduce.
○ windowed ops.
Terminal (stateless)
○ print.
○ foreach.
○ to.
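The difference between stateless and stateful operations can be sketched as follows. This is a plain-Python illustration, not the Streams DSL: the helper names mirror `filter`, `mapValues` and `count` but are hypothetical functions, and the purchase records are made-up sample data. The stateless helpers look at one record in isolation; the count needs a store that survives between records.

```python
def filter_records(records, predicate):
    """Stateless: each record is kept or dropped on its own."""
    return [(k, v) for k, v in records if predicate(k, v)]


def map_values(records, fn):
    """Stateless: transforms the value, leaves the key untouched."""
    return [(k, fn(v)) for k, v in records]


def count_by_key(records, store):
    """Stateful: reads and updates a per-key counter held in `store`."""
    updates = []
    for key, _ in records:
        store[key] = store.get(key, 0) + 1
        updates.append((key, store[key]))  # emit the new count as an update
    return updates


# Hypothetical purchases: (customer, amount).
purchases = [("alice", 30), ("bob", 5), ("alice", 12)]

large = filter_records(purchases, lambda k, v: v >= 10)
in_cents = map_values(large, lambda v: v * 100)

state_store = {}  # stands in for a state store
counts = count_by_key(in_cents, state_store)
```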

Parallelism
[Diagram: two App instances, each running one Thread with its own Consumer and Producer; the first Thread executes two tasks, the second executes one task.]

Sample 1

Sample 2
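How tasks end up spread over threads can be approximated with a small model. This is a deliberately simplified sketch in plain Python: it assumes one task per input partition and a round-robin placement, whereas the real Kafka Streams assignor is more involved (it also considers state and stickiness). The function name `assign_tasks` and the thread names are hypothetical; the `0_<partition>` task ids follow the `<sub-topology>_<partition>` naming that Streams uses.

```python
def assign_tasks(num_partitions, threads, sub_topology=0):
    """Create one task per partition and spread the tasks round-robin over threads."""
    assignment = {thread: [] for thread in threads}
    for partition in range(num_partitions):
        thread = threads[partition % len(threads)]
        assignment[thread].append(f"{sub_topology}_{partition}")
    return assignment


# Three input partitions, two application instances with one thread each.
assignment = assign_tasks(3, ["app1-thread1", "app2-thread1"])
print(assignment)
```

With three partitions and two threads, one thread runs two tasks and the other runs one, which is the situation the parallelism diagram depicts.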

Physical plan
Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [TextLinesTopic])
      --> KSTREAM-FLATMAPVALUES-0000000001
    Processor: KSTREAM-FLATMAPVALUES-0000000001 (stores: [])
      --> KSTREAM-KEY-SELECT-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-KEY-SELECT-0000000002 (stores: [])
      --> counts-store-repartition-filter
      <-- KSTREAM-FLATMAPVALUES-0000000001
    Processor: counts-store-repartition-filter (stores: [])
      --> counts-store-repartition-sink
      <-- KSTREAM-KEY-SELECT-0000000002
    Sink: counts-store-repartition-sink (topic: counts-store-repartition)
      <-- counts-store-repartition-filter
   Sub-topology: 1
    Source: counts-store-repartition-source (topics: [counts-store-repartition])
      --> KSTREAM-AGGREGATE-0000000003
    Processor: KSTREAM-AGGREGATE-0000000003 (stores: [counts-store])
      --> KTABLE-MAPVALUES-0000000008
      <-- counts-store-repartition-source
    Processor: KTABLE-MAPVALUES-0000000008 (stores: [])
      --> KTABLE-TOSTREAM-0000000009
      <-- KSTREAM-AGGREGATE-0000000003
    Processor: KTABLE-TOSTREAM-0000000009 (stores: [])
      --> KSTREAM-SINK-0000000010
      <-- KTABLE-MAPVALUES-0000000008
    Sink: KSTREAM-SINK-0000000010 (topic: WordsWithCountsTopic)
      <-- KTABLE-TOSTREAM-0000000009
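The plan splits into two sub-topologies because the key changes (KEY-SELECT) before the aggregation, so records have to pass through the counts-store-repartition topic. The sketch below models that in plain Python under several assumptions: it is not Kafka Streams code, the word splitting and lower-casing are guesses at what the flatMapValues step does, two partitions are assumed, and `zlib.crc32` stands in for Kafka's own key hashing. The function names are hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 2  # assumed partition count of the repartition topic


def partition_for(key):
    """Pick a partition from the key hash (crc32 here; Kafka uses a different hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def sub_topology_0(lines):
    """Split lines into words, re-key by word, write to the repartition topic."""
    repartition_topic = [[] for _ in range(NUM_PARTITIONS)]
    for line in lines:
        for word in line.lower().split():
            repartition_topic[partition_for(word)].append((word, word))
    return repartition_topic


def sub_topology_1(partition_records):
    """Count words for one partition, using a local store (the counts-store)."""
    counts_store = Counter()
    for word, _ in partition_records:
        counts_store[word] += 1
    return dict(counts_store)


lines = ["hello kafka streams", "hello streams"]
repartitioned = sub_topology_0(lines)
counts_per_task = [sub_topology_1(records) for records in repartitioned]

# Every word lives in exactly one partition, so merging never overwrites a count.
merged = {}
for counts in counts_per_task:
    merged.update(counts)
```

Repartitioning guarantees that all records for a given word reach the same task, so each task can count with purely local state.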

Other interesting features
○ Windowing.
○ Interactive queries.
○ Topology optimization.

What if I don’t use it?
○ It’s OK if you just need to move data from one place to another.
○ But if you need to process, enrich or otherwise transform the data, you have to either:
    ○ Code your specific use case using the producer and consumer APIs.
    ○ Integrate another processing framework (e.g. Spark, Flink...).

Demo - Product purchases
○ Components: Kafka Connect, voluble and kukulcan.

kukulcan
○ A REPL for Apache Kafka.
○ Supports POSIX and Windows operating systems.
○ Written in Scala, Java and Python.
○ Shells in:
    ○ Ammonite REPL.
    ○ Scala REPL.
    ○ JShell.
    ○ Python shell.
○ APIs for Admin, Producer, Consumer, Connect and Streams.
○ https://github.com/mmolimar/kukulcan

voluble
○ Intelligent data generator.
○ Source code:
    ○ https://github.com/MichaelDrogalis/voluble
○ Confluent Hub:
    ○ https://www.confluent.io/hub/mdrogalis/voluble

Ammonite scripts
○ Scripts to run the demo in Kukulcan.
○ Source code:
    ○ https://github.com/mmolimar/meetups
○ Documentation:
    ○ https://github.com/mmolimar/meetups/tree/master/kafka-streams

Getting involved with Apache Kafka
○ Website: http://kafka.apache.org
○ Join the mailing lists:
○ users@kafka.apache.org
○ dev@kafka.apache.org
○ Slack: https://confluentcommunity.slack.com
○ Meetups: https://www.meetup.com/<LOCATION>-Kafka
○ Contribute: https://github.com/apache/kafka
○ Kafka Summit 2020: https://kafka-summit.org

Thanks!
mmolimar
mmolimar
mmolimar_
