Sebastian Schröder and I presented how we implemented the merging of Apache Kafka topics. We cover the core requirements (timeouts based on business timestamps, heartbeats, ...) and the operational requirements (at-least-once semantics, checkpointing, clamping the message sources on each other).
How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
1. How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
Ralf Sigmund, Sebastian Schröder
Otto GmbH & Co. KG
2. Agenda
• Who are we?
• Our Setup
• The Problem
• Our Approach
• Requirements
• Advanced Requirements
• Lessons learned
3. Who are we?
• Ralf
– Technical Designer, Team Tracking
– Cycling, Surfing
– Twitter: @sistar_hh
• Sebastian
– Developer, Team Tracking
– Taekwondo, Cycling
– Twitter: @Sebasti0n
4. Who are we working for?
• 2nd-largest e-commerce company in Germany
• more than 1 million visitors every day
• our blog: https://dev.otto.de
• our jobs: www.otto.de/jobs
12. There might be no messages
[Diagram: topic A carries messages (t: 1, id: 1), (t: 4, id: 2), (t: 2, id: 3); topic B has no business messages and emits a heartbeat with t: 5 after a time-out of 3 seconds. The merged business clock is min(clock A, clock B) = 4.]
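A minimal sketch of the clock mechanism in the diagram, with names of our own choosing (Msg, advanceClock, mergedClock are illustrative, not the talk's actual code): heartbeats advance a topic's business clock even when no regular messages arrive, and the merge may only time out messages up to the minimum of both topic clocks.

// Illustrative message type; the real tracking messages carry more fields.
case class Msg(timestamp: Long, id: Option[String], isHeartbeat: Boolean)

// A topic's business clock advances to the latest event time seen,
// whether it comes from a regular message or a heartbeat.
def advanceClock(clock: Long, msgs: Seq[Msg]): Long =
  if (msgs.isEmpty) clock else math.max(clock, msgs.map(_.timestamp).max)

// The merge may only time out messages up to the slower of the two clocks.
def mergedClock(clockA: Long, clockB: Long): Long = math.min(clockA, clockB)

// As in the diagram: topic A has seen t = 4, topic B only a heartbeat with t = 5,
// so the merged clock is min(4, 5) = 4.
assert(mergedClock(4L, 5L) == 4L)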
13. Requirements Summary
• Messages with the same key are merged
• Messages are timed out by the combined event time of both source topics
• Timed out messages are sorted by event time
15. Merge messages with the same Id
• updateStateByKey, which is defined on pair DStreams (key-value RDDs)
• It applies a function to all new messages and the existing state for a given key
• Returns a StateDStream with the resulting messages
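A minimal sketch of this merge step, assuming a simplified Tracking message type of our own (not the talk's actual code):

import org.apache.spark.streaming.dstream.DStream

// Simplified message type; the real tracking messages carry more fields.
case class Tracking(id: String, eventTime: Long, payload: Map[String, String])

// Merging two messages with the same id: keep the newer event time, union the payloads.
def merge(a: Tracking, b: Tracking): Tracking =
  Tracking(a.id, math.max(a.eventTime, b.eventTime), a.payload ++ b.payload)

// updateStateByKey folds all new messages for a key into the existing state.
def mergeById(stream: DStream[(String, Tracking)]): DStream[(String, Tracking)] =
  stream.updateStateByKey[Tracking] { (newMsgs: Seq[Tracking], state: Option[Tracking]) =>
    (state.toSeq ++ newMsgs).reduceOption(merge)
  }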
19. Flag And Remove Timed Out Msgs
[Diagram: the UpdateFunction flags merged messages (MM) as timed out (TO) once their event time falls behind the global business clock; flagged messages are removed from state and emitted downstream. This needs a global business clock.]
20. Messages are timed out by event time
• We need access to all messages from the current micro batch and the history (state) => custom StateDStream
• Compute the maximum event time of both topics and supply it to the updateStateByKey function
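The talk used a custom StateDStream for this; as a rough approximation with our own names (MergedState, advanceBatchClock, flagTimedOut are illustrative), the idea can be sketched like this:

import org.apache.spark.rdd.RDD

// State entry kept by the merge: the merged message plus a timed-out flag.
case class MergedState(id: String, eventTime: Long, timedOut: Boolean)

// Global business clock: the maximum event time seen so far across both (already unioned) topics.
def advanceBatchClock(batch: RDD[(String, Long)], previousClock: Long): Long =
  if (batch.isEmpty()) previousClock
  else math.max(previousClock, batch.values.max())

// Flag every state entry whose event time is older than (clock - timeout);
// flagged entries are emitted downstream, sorted by event time, and dropped from state.
def flagTimedOut(state: Seq[MergedState], clock: Long, timeoutMillis: Long): Seq[MergedState] =
  state.map(m => if (m.eventTime < clock - timeoutMillis) m.copy(timedOut = true) else m)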
26. Effect of different request rates in source topics
[Chart: messages read per micro batch from topicA and topicB, 0 to 10K; one topic is catching up with 5K messages per micro batch while data from the other topic is read too early.]
27. Handle different request rates in topics
• Stop reading from one topic if the event time of the other topic is too far behind
• Store the latest event time of each topic and partition on the driver
• Use a custom KafkaDStream with a clamp method which uses the event time
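The clamping itself lives in the custom KafkaDStream; the decision logic alone can be sketched as follows (TopicClock, shouldRead, and maxLeadMillis are illustrative names, not the actual implementation):

// Latest event time per topic, tracked on the driver.
case class TopicClock(topic: String, latestEventTime: Long)

// Read from a topic only while its event time has not run too far ahead of the other topic.
def shouldRead(thisTopic: TopicClock, otherTopic: TopicClock, maxLeadMillis: Long): Boolean =
  thisTopic.latestEventTime - otherTopic.latestEventTime <= maxLeadMillis

// Example: topic A is at event time 10:05, topic B still at 10:00; with a
// maximum lead of 2 minutes, reads from A are paused until B catches up.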
32. Guarantee at-least-once semantics
[Diagram: messages (offset o, topic t) flow through the Spark merge service into the sink topic; the current state and every merged message written to the sink carry the source offsets (a-offset: 1, b-offset: 5). On restart/recovery the merge service retrieves the last message from the sink topic and resumes reading both source topics from those offsets.]
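A sketch of the recovery path, assuming the 0.8-era direct Kafka API that the talk's DirectKafka extension builds on (topic names, partition numbers, and the resumeFromLastSinkMessage helper are illustrative):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// The offsets read from the last sink message tell us where to resume both
// source topics; replaying from there gives at-least-once semantics without
// relying on Spark checkpointing.
def resumeFromLastSinkMessage(ssc: StreamingContext,
                              kafkaParams: Map[String, String],
                              aOffset: Long,
                              bOffset: Long) = {
  val fromOffsets = Map(
    TopicAndPartition("topicA", 0) -> aOffset,
    TopicAndPartition("topicB", 0) -> bOffset)
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mm: MessageAndMetadata[String, String]) => (mm.key, mm.message))
}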
36. Lessons learned
• Excellent performance and scalability
• Extensible via custom RDDs and an extension to DirectKafka
• No event-time windows in Spark Streaming (see https://github.com/apache/spark/pull/2633)
• Checkpointing cannot be used when deploying new artifact versions
• The driver/executor model is powerful, but also complex