Mario Molina, Datio, Software Engineer
Kafka Streams is an open source JVM library for building event streaming applications on top of Apache Kafka. Its goal is to allow programmers to create efficient, real-time, streaming applications and perform analysis and operations on the incoming data.
In this presentation we’ll cover the main features of Kafka Streams and do a live demo!
This demo will run partially on Confluent Cloud. If you haven’t already signed up, you can try Confluent Cloud for free: get $200 every month for your first three months ($600 of free usage in total). More information and sign-up here: https://cnfl.io/cloud-meetup-free
https://www.meetup.com/Mexico-Kafka/events/271972045/
2. HelloWorld!
○ Mario Molina
○ Big Data Engineer @ Datio
○ Working in data & all things related since 2005.
○ You can find me at:
mmolimar_
mmolimar
mmolimar
3. A distributed streaming platform
○ Distributed and ordered commit log.
○ Pull-based publish/subscribe messaging with message retention on disk.
○ Fault-tolerant, arbitrary scalability.
○ Isolated topics and partitions per consumer group.
○ Binary TCP-based communication protocol.
○ Actively developed.
○ Great stability. It’s used by industry-leading companies.
○ Excellent APIs (JVM languages mainly).
○ Optimized for reading records in the same order they were written.
○ Optimized for massive writes.
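To make the log model above concrete, here is a minimal in-memory sketch in plain Java. It is not Kafka's API: the class CommitLog and its methods are hypothetical names used only for illustration. Records are appended, each one gets a sequential offset, and consumers pull from an offset they track themselves, so reads follow write order and several consumers can read independently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a single partition of a commit log.
// Not Kafka's API; it only illustrates offsets and pull-based reads.
final class CommitLog {
    private final List<String> records = new ArrayList<>();

    // Append-only write: returns the offset assigned to the new record.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Pull-based read: the consumer asks for up to maxRecords starting at its own offset.
    List<String> poll(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, Math.min(fromOffset, records.size()));
        int end = Math.min(start + maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }
}
```

Because the log never moves or deletes records on read, a consumer that restarts can simply poll again from the last offset it remembers.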
6. Streams API
○ Library (Java & Scala) for stream processing (one-record-at-a-time).
○ Lightweight with a low barrier to entry.
○ High-level DSL & low-level Processor API.
○ Semantics: at-least-once & exactly-once.
○ Fault tolerance.
○ Scalable & recoverable (improved with KIP-429 and KIP-441).
○ No external dependencies.
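As a rough illustration of how the processing semantics are selected, the sketch below builds the settings a Kafka Streams application could be started with, using only java.util.Properties and literal config keys. The application id and broker address are placeholder values, and the class name StreamsSettings is hypothetical. The value "exactly_once_v2" assumes Kafka Streams 3.0 or newer; older releases use "exactly_once" instead.

```java
import java.util.Properties;

// Hypothetical helper that assembles Kafka Streams settings as plain properties.
final class StreamsSettings {
    // "demo-app" and "localhost:9092" are placeholder values for illustration.
    static Properties build(boolean exactlyOnce) {
        Properties props = new Properties();
        props.put("application.id", "demo-app");
        props.put("bootstrap.servers", "localhost:9092");
        // at_least_once is the default guarantee; exactly_once_v2 assumes Kafka Streams 3.0+.
        props.put("processing.guarantee", exactlyOnce ? "exactly_once_v2" : "at_least_once");
        return props;
    }
}
```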
7. Key concepts in Streams API
○ The processor topology: represented by a directed acyclic graph (DAG).
○ Types of nodes in the processor topology:
○ Source processor.
○ Stream processor.
○ Sink processor.
○ State stores.
○ Sub-topologies.
○ Abstractions: KStream, KTable and GlobalKTable.
[Figure: a topology made of a source, stream processors (those marked * use a state store) and sinks, grouped into two sub-topologies.]
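The node types listed above can be pictured with a small plain-Java model. This is only a sketch under simplifying assumptions: it is a linear chain rather than a general DAG, it has no state store, and MiniTopology and its methods are hypothetical names, not part of the Kafka Streams API. It shows a record entering at the source, passing through each stream processor one record at a time, and ending in the sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of a linear topology: source -> stream processors -> sink.
final class MiniTopology {
    private final List<Function<String, String>> processors = new ArrayList<>();
    private final List<String> sink = new ArrayList<>();

    // Adds a stream processor node; returning null drops the record (acts as a filter).
    MiniTopology addProcessor(Function<String, String> processor) {
        processors.add(processor);
        return this;
    }

    // Source node: accepts one record and pushes it through the chain.
    void process(String record) {
        String current = record;
        for (Function<String, String> processor : processors) {
            current = processor.apply(current);
            if (current == null) {
                return; // record was filtered out before reaching the sink
            }
        }
        sink.add(current); // sink node: collects the final records
    }

    // Returns a copy of everything that reached the sink so far.
    List<String> sinkContents() {
        return new ArrayList<>(sink);
    }
}
```

For example, chaining a processor that drops empty values with one that upper-cases them, then processing "hello", "" and "kafka", leaves HELLO and KAFKA in the sink.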
8. Abstractions
KStream
○ Partitioned record stream.
○ Immutable data (append only).
○ “INSERT” mode (from the SQL perspective).
KTable
○ Partitioned table.
○ Each record represents the latest state/value of its key.
○ “UPSERT” mode (from the SQL perspective).
GlobalKTable
○ Not partitioned.
○ Same as a KTable but with data from all partitions.
○ Just for the DSL.
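The difference between a partitioned table and a global one can be sketched in plain Java. The code below is an illustration only, not the Kafka Streams API: TableViews and its methods are hypothetical names, and the partitioner is a simplification (Kafka's default partitioner hashes the serialized key with murmur2). A KTable-like view holds the latest value per key for the partitions assigned to one application instance, while a GlobalKTable-like view holds the latest value per key across all partitions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical comparison of a partitioned table view and a global table view.
final class TableViews {
    // Simplified partitioner for illustration; not Kafka's actual hashing.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // KTable-like view: latest value per key, restricted to one assigned partition.
    static Map<String, String> partitionedView(List<String[]> records, int numPartitions, int assigned) {
        Map<String, String> table = new HashMap<>();
        for (String[] kv : records) {
            if (partitionFor(kv[0], numPartitions) == assigned) {
                table.put(kv[0], kv[1]);
            }
        }
        return table;
    }

    // GlobalKTable-like view: latest value per key across all partitions.
    static Map<String, String> globalView(List<String[]> records) {
        Map<String, String> table = new HashMap<>();
        for (String[] kv : records) {
            table.put(kv[0], kv[1]);
        }
        return table;
    }
}
```

The union of the partitioned views over every partition equals the global view, which is why a lookup against a global table never depends on how the keys were partitioned.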
Stream-table duality
         T0        T1        T2        T3
KStream  k1 -> A   k2 -> B   k1 -> C   k2 -> D
KTable   k1 -> A   k1 -> A   k1 -> C   k1 -> C
                   k2 -> B   k2 -> B   k2 -> D
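The stream-table duality can be reproduced with a few lines of plain Java. This is a sketch, not the Kafka Streams API, and the class Duality and its methods are hypothetical names. Replaying a stream of key/value records with upserts yields the table at each point in time, and a table can in turn be read back as a stream of its current entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the stream-table duality; not the Kafka Streams API.
final class Duality {
    // Replays a changelog stream and returns the table state after each record.
    // Each record is a {key, value} pair; a later value for a key overwrites the earlier one.
    static List<Map<String, String>> snapshots(List<String[]> stream) {
        Map<String, String> table = new LinkedHashMap<>();
        List<Map<String, String>> result = new ArrayList<>();
        for (String[] kv : stream) {
            table.put(kv[0], kv[1]);                 // upsert by key
            result.add(new LinkedHashMap<>(table));  // copy of the state at this point in time
        }
        return result;
    }

    // The reverse direction: read a table back as a stream of its current entries.
    static List<String[]> toStream(Map<String, String> table) {
        List<String[]> stream = new ArrayList<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            stream.add(new String[] {e.getKey(), e.getValue()});
        }
        return stream;
    }
}
```

Feeding it the records k1 -> A, k2 -> B, k1 -> C, k2 -> D produces the snapshots {k1=A}, {k1=A, k2=B}, {k1=C, k2=B} and {k1=C, k2=D}.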
14. What if I don’t use it?
○ It’s OK if you just need to move data from one place to another.
○ But if you need to process/enrich or do other things with the data:
○ Code your specific use case using the producer and consumer APIs.
○ Integrate another processing framework (e.g. Spark, Flink...).
17. kukulcan
○ A REPL for Apache Kafka.
○ Supports POSIX and Windows OS.
○ Written in Scala, Java and Python.
○ Shells in:
○ Ammonite REPL.
○ Scala REPL.
○ JShell.
○ Python shell.
○ APIs for Admin, Producer, Consumer, Connect and Streams.
https://github.com/mmolimar/kukulcan
19. Ammonite scripts
○ Scripts to run the demo in Kukulcan.
○ Source code:
○ https://github.com/mmolimar/meetups
○ Documentation:
○ https://github.com/mmolimar/meetups/tree/master/kafka-streams