This is a talk about debugging stream–table joins, based on my first-hand experience of stumbling into various pitfalls.
I will walk you through the laughter and tears of the sharper edges of ksqlDB that I encountered along the way. We will witness the power and versatility of kafkacat and uncover the number one kafkacat pitfall. And we will see stream–table join semantics in action.
Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt, TNG Technology Consulting
1. Failing to Cross the Streams
Lessons Learned the Hard Way
Philip Schmitt @philipschm1tt
2. The story of how I failed at a basic stream–table join
and the lessons I learned along the way…
…about ksqlDB
…about kafkacat
…about stream–table joins
18. -- Table of customer accounts, keyed by customer ID, read from a JSON topic.
CREATE TABLE customer_accounts (ID VARCHAR PRIMARY KEY, email VARCHAR)
WITH (kafka_topic='customer_accounts', value_format='json');
-- Stream of consent events. Only the key is declared; the value columns
-- (including consent) are inferred from the Avro schema in Schema Registry.
CREATE STREAM customer_consents (ID VARCHAR KEY)
WITH (kafka_topic='customer_consents', value_format='avro');
-- Enrich each consent event with the account's email address.
CREATE STREAM customer_consents_with_mail
WITH (kafka_topic='customer_consents_with_mail', value_format='avro') AS
SELECT customer_consents.ID,
customer_accounts.email,
customer_consents.consent
FROM customer_consents
LEFT JOIN customer_accounts ON customer_consents.ID = customer_accounts.ID;
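A quick smoke test from the ksql CLI might look like this (a sketch; it assumes the Avro schema of customer_consents has a VARCHAR consent field, and all values are hypothetical):

INSERT INTO customer_accounts (id, email) VALUES ('123', 'jane@example.com');
INSERT INTO customer_consents (id, consent) VALUES ('123', 'marketing');
SELECT * FROM customer_consents_with_mail EMIT CHANGES;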
25. show topics;
print '$topic' from beginning;
insert into $topic (id, someValue) values (123, 'some string');
ksqlDB can be useful even if you don’t have a ksqlDB cluster
Inspired by Michael Drogalis:
https://gist.github.com/MichaelDrogalis/08463ed82a4a04015eef03cb483cad26
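If you don't have a cluster handy, a throwaway ksqlDB next to an existing broker can be started roughly like this (a minimal sketch; the image tag, Docker network, and broker address are assumptions):

docker run -d --name ksqldb-server --network kafka-net \
  -p 8088:8088 \
  -e KSQL_BOOTSTRAP_SERVERS=broker:9092 \
  -e KSQL_LISTENERS=http://0.0.0.0:8088 \
  confluentinc/ksqldb-server:0.29.0

docker run -it --network kafka-net confluentinc/ksqldb-cli:0.29.0 \
  ksql http://ksqldb-server:8088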
26. It can be dangerous to assume that the data in a topic from another
team/department is complete and correct.
Pay attention to data quality – garbage in, garbage out
27. "I'll just write all mail addresses to Kafka myself!"
Filling the Table With kafkacat
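Producing keyed records with kafkacat might look like the sketch below (broker address and record contents are hypothetical). The partitioner setting is the crucial part: librdkafka's default partitioner does not match the Java producer's murmur2-based partitioning, so without it hand-written records can land on different partitions than the stream's records and silently break co-partitioning:

kafkacat -P -b localhost:9092 -t customer_accounts \
  -K: -X topic.partitioner=murmur2_random
123:{"email": "jane@example.com"}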
48. The stream message is joined to the data that was in the table at the
event time of the stream message!
Time matters in a stream–table join
See: Matthias Sax – The Flux Capacitor of Kafka Streams and ksqlDB
https://www.confluent.io/resources/kafka-summit-2020/the-flux-capacitor-of-kafka-streams-and-ksqldb/
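To make the time semantics concrete, here is an illustrative timeline (all values hypothetical):

-- t=10  consent event for key 123 arrives  -> table has no email yet -> joined email is NULL
-- t=20  account record arrives: 123 -> jane@example.com
-- t=30  consent event for key 123 arrives  -> joined email is jane@example.com
-- The consent from t=10 stays unenriched forever: an email that reaches the
-- table at t=20 is never joined to stream messages with earlier event times.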
51. We joined the email addresses to all old consent messages
53. Conclusion
We used a local ksqlDB via Docker to join two topics.
There were some data quality issues, so it wasn't as simple as we had hoped.
We produced the data ourselves, but accidentally broke co-partitioning.
We used kafkacat to debug the co-partitioning.
Then we used the murmur2_random partitioner with kafkacat.
Then we learned about the time semantics of stream–table joins.
We used ksqlDB to time travel (sketched below).
Now the join works, and everyone is happy.
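One way to implement the time-travel step (a sketch; the helper stream name, ID, email, and timestamp values are all hypothetical) is to re-produce the email records with an explicit ROWTIME that pre-dates the oldest consent message, so the table state already "exists" at the old consents' event times:

-- Helper stream over the accounts topic, used only for inserting.
CREATE STREAM accounts_backdated (id VARCHAR KEY, email VARCHAR)
WITH (kafka_topic='customer_accounts', value_format='json');

-- 1577836800000 = 2020-01-01 UTC, assumed older than any consent message.
INSERT INTO accounts_backdated (ROWTIME, id, email)
VALUES (1577836800000, '123', 'jane@example.com');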