This is a talk about debugging stream–table joins, based on my first-hand experience of stumbling into various pitfalls.
I will walk you through the laughter and tears of the sharper edges of ksqlDB that I encountered along the way. We will witness the power and versatility of kafkacat and uncover the number one kafkacat pitfall. And we will see stream–table join semantics in action.
Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt, TNG Technology Consulting
1. Failing to Cross the Streams
Lessons Learned the Hard Way
Philip Schmitt @philipschm1tt
2. The story of how I failed at a basic stream–table join
and the lessons I learned along the way…
…about ksqlDB
…about kafkacat
…about stream–table joins
18. -- Table of customer accounts, keyed by customer ID, read from a JSON topic.
CREATE TABLE customer_accounts (ID VARCHAR PRIMARY KEY, email VARCHAR)
WITH (kafka_topic='customer_accounts', value_format='json');
-- Stream of consent events. Only the key is declared; the value columns
-- (including consent) are inferred from the Avro schema in Schema Registry.
CREATE STREAM customer_consents (ID VARCHAR KEY)
WITH (kafka_topic='customer_consents', value_format='avro');
-- Enrich each consent event with the account's email address.
CREATE STREAM customer_consents_with_mail
WITH (kafka_topic='customer_consents_with_mail', value_format='avro') AS
SELECT customer_consents.ID,
customer_accounts.email,
customer_consents.consent
FROM customer_consents
LEFT JOIN customer_accounts ON customer_consents.ID = customer_accounts.ID;
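A quick smoke test from the ksql CLI might look like this (a sketch; it assumes the Avro schema of customer_consents has a VARCHAR consent field, and all values are hypothetical):

INSERT INTO customer_accounts (id, email) VALUES ('123', 'jane@example.com');
INSERT INTO customer_consents (id, consent) VALUES ('123', 'marketing');
SELECT * FROM customer_consents_with_mail EMIT CHANGES;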
25. show topics;
print '$topic' from beginning;
insert into $topic (id, someValue) values (123, 'some string');
ksqlDB can be useful even if you don’t have a ksqlDB cluster
Inspired by Michael Drogalis:
https://gist.github.com/MichaelDrogalis/08463ed82a4a04015eef03cb483cad26
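If you don't have a cluster handy, a throwaway ksqlDB next to an existing broker can be started roughly like this (a minimal sketch; the image tag, Docker network, and broker address are assumptions):

docker run -d --name ksqldb-server --network kafka-net \
  -p 8088:8088 \
  -e KSQL_BOOTSTRAP_SERVERS=broker:9092 \
  -e KSQL_LISTENERS=http://0.0.0.0:8088 \
  confluentinc/ksqldb-server:0.29.0

docker run -it --network kafka-net confluentinc/ksqldb-cli:0.29.0 \
  ksql http://ksqldb-server:8088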
26. It can be dangerous to assume that the data in a topic from another
team/department is complete and correct.
Pay attention to data quality – garbage in, garbage out
27. "I'll just write all mail addresses to Kafka myself!"
Filling the Table With kafkacat
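Producing keyed records with kafkacat might look like the sketch below (broker address and record contents are hypothetical). The partitioner setting is the crucial part: librdkafka's default partitioner does not match the Java producer's murmur2-based partitioning, so without it hand-written records can land on different partitions than the stream's records and silently break co-partitioning:

kafkacat -P -b localhost:9092 -t customer_accounts \
  -K: -X topic.partitioner=murmur2_random
123:{"email": "jane@example.com"}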
48. The stream message is joined to the data that was in the table at the
event time of the stream message!
Time matters in a stream–table join
See: Matthias Sax – The Flux Capacitor of Kafka Streams and ksqlDB
https://www.confluent.io/resources/kafka-summit-2020/the-flux-capacitor-of-kafka-streams-and-ksqldb/
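To make the time semantics concrete, here is an illustrative timeline (all values hypothetical):

-- t=10  consent event for key 123 arrives  -> table has no email yet -> joined email is NULL
-- t=20  account record arrives: 123 -> jane@example.com
-- t=30  consent event for key 123 arrives  -> joined email is jane@example.com
-- The consent from t=10 stays unenriched forever: an email that reaches the
-- table at t=20 is never joined to stream messages with earlier event times.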
51. We joined the email addresses to all old consent messages
53. Conclusion
We used a local ksqlDB via Docker to join two topics.
There were some data quality issues, so it wasn't as simple as we had hoped.
We produced the data ourselves, but accidentally broke co-partitioning.
We used kafkacat to debug the co-partitioning.
Then we used the murmur2_random partitioner with kafkacat.
Then we learned about the time semantics of stream–table joins.
We used ksqlDB to time travel (sketched below).
Now the join works, and everyone is happy.
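One way to implement the time-travel step (a sketch; the helper stream name, ID, email, and timestamp values are all hypothetical) is to re-produce the email records with an explicit ROWTIME that pre-dates the oldest consent message, so the table state already "exists" at the old consents' event times:

-- Helper stream over the accounts topic, used only for inserting.
CREATE STREAM accounts_backdated (id VARCHAR KEY, email VARCHAR)
WITH (kafka_topic='customer_accounts', value_format='json');

-- 1577836800000 = 2020-01-01 UTC, assumed older than any consent message.
INSERT INTO accounts_backdated (ROWTIME, id, email)
VALUES (1577836800000, '123', 'jane@example.com');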