How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
1. How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
Ralf Sigmund, Sebastian Schröder
Otto GmbH & Co. KG
2. Agenda
• Who are we?
• Our Setup
• The Problem
• Our Approach
• Requirements
• Advanced Requirements
• Lessons learned
3. Who are we?
• Ralf
– Technical Designer, Team Tracking
– Cycling, Surfing
– Twitter: @sistar_hh
• Sebastian
– Developer, Team Tracking
– Taekwondo, Cycling
– Twitter: @Sebasti0n
4. Who are we working for?
• 2nd-largest e-commerce company in Germany
• more than 1 million visitors every day
• our blog: https://dev.otto.de
• our jobs: www.otto.de/jobs
12. There might be no messages
[Diagram: two source topics each deliver messages (t: 1/id: 1, t: 4/id: 2, t: 2/id: 3); a heartbeat at t: 5 stands in when a topic has no messages to send; messages time out after 3 sec; the combined clock is min(clock A, clock B) = 4]
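A minimal Scala sketch of this clock handling (all names are illustrative, not the talk's actual code): heartbeats carry only an event time, so they advance a topic's clock even when no real messages arrive, and the merge may only act on event times up to min(clock A, clock B).

  sealed trait TopicEvent { def eventTime: Long }
  case class DataMessage(eventTime: Long, id: String) extends TopicEvent
  case class Heartbeat(eventTime: Long) extends TopicEvent // no id, creates no state

  // A topic's clock is the max event time seen so far, heartbeats included.
  def advance(clock: Long, events: Seq[TopicEvent]): Long =
    (clock +: events.map(_.eventTime)).max

  // The merge may only time out messages older than the combined clock.
  def businessClock(clockA: Long, clockB: Long): Long = math.min(clockA, clockB)

  // As on the slide: clock A = 4, clock B = 5 (heartbeat) => clock = 4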
13. Requirements Summary
• Messages with the same key are merged
• Messages are timed out by the combined event time of both source topics
• Timed out messages are sorted by event time
15. Merge messages with the same Id
• updateStateByKey, which is defined on key-value (pair) DStreams
• It applies a function to all new messages and the existing state for a given key
• Returns a StateDStream with the resulting messages
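A minimal sketch of this merge with the public updateStateByKey API (the types and merge rule are illustrative, not the talk's actual code):

  import org.apache.spark.streaming.dstream.DStream

  case class InputMessage(id: String, eventTime: Long, payload: String)
  case class MergedMessage(id: String, eventTime: Long, payloads: Seq[String])

  // Called once per key and micro batch with all new messages plus the
  // existing state for that key.
  def merge(newMsgs: Seq[InputMessage],
            state: Option[MergedMessage]): Option[MergedMessage] = {
    val id = state.map(_.id).orElse(newMsgs.headOption.map(_.id))
    id.map { key =>
      MergedMessage(
        key,
        (state.map(_.eventTime).toSeq ++ newMsgs.map(_.eventTime)).max,
        state.map(_.payloads).getOrElse(Seq.empty) ++ newMsgs.map(_.payload))
    }
  }

  // Stream keyed by message id; updateStateByKey requires ssc.checkpoint(...)
  def mergeById(keyed: DStream[(String, InputMessage)]): DStream[(String, MergedMessage)] =
    keyed.updateStateByKey(merge _)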
19. Flag And Remove Timed Out Msgs
[Diagram: an UpdateFunction, which needs a global business clock, combines the DStream[InputMessage] of the current micro batch with the HistoryRDD[MergedMessage] into a DStream[MergedMessage]; it flags newly timed-out messages, then Filter (timed out) emits them downstream while Filter !(timed out) keeps the remaining messages as the new history]
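A sketch of that split, assuming a merged-message type that carries a timed-out flag (names are hypothetical):

  case class Merged(id: String, eventTime: Long, timedOut: Boolean)

  // Flag state entries older than the global business clock.
  def flagTimedOut(m: Merged, businessClock: Long, timeoutMs: Long): Merged =
    m.copy(timedOut = m.eventTime + timeoutMs < businessClock)

  // One filter emits the timed-out messages (sorted by event time, see
  // slide 13), the other keeps the rest as the next batch's history.
  def split(state: Seq[Merged]): (Seq[Merged], Seq[Merged]) =
    (state.filter(_.timedOut).sortBy(_.eventTime), state.filterNot(_.timedOut))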
20. Messages are timed out by event time
• We need access to all messages from the current micro batch and the history (state) => Custom StateDStream
• Compute the maximum event time of both topics and supply it to the updateStateByKey function
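The stock update function (Seq[V], Option[S]) => Option[S] has no access to a batch-global value, hence the custom StateDStream. A rough sketch of the clock computation it needs, with hypothetical types:

  import org.apache.spark.rdd.RDD

  // Max event time per topic over the current micro batch, merged with the
  // clocks carried over from earlier batches.
  def batchClocks(batch: RDD[(String, Long)], // (topic, eventTime)
                  previous: Map[String, Long]): Map[String, Long] = {
    val current = batch.reduceByKey((a, b) => math.max(a, b)).collect().toMap
    (previous.keySet ++ current.keySet).map { t =>
      t -> math.max(previous.getOrElse(t, 0L), current.getOrElse(t, 0L))
    }.toMap
  }

  // The clock handed to the update function is the min over both topics:
  // batchClocks(batch, previous).values.min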
26. Effect of different request rates in source topics
[Chart: messages per micro batch (0, 5K, 10K) for topicA and topicB; one topic is catching up with 5K per micro batch while the other topic's data is read too early]
27. Handle different request rates in topics
• Stop reading from one topic if the event time of the other topic is too far behind
• Store the latest event time of each topic and partition on the driver
• Use a custom KafkaDStream with a clamp method that uses the event time
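Spark's DirectKafkaInputDStream caps each batch via an internal clamp step; the talk replaces the rate-based cap with an event-time-based one. A hedged sketch of the decision logic only (names are hypothetical, not Spark's internal API):

  // Driver-side bookkeeping: latest event time and next offset per topic.
  case class TopicState(latestEventTime: Long, nextOffset: Long)

  // Hold a topic's offset if its event time is too far ahead of the
  // slowest topic, so the merge never reads data "too early".
  def clampOffset(topic: String,
                  states: Map[String, TopicState],
                  maxLeadMs: Long,
                  plannedOffset: Long): Long = {
    val own = states(topic).latestEventTime
    val slowest = states.values.map(_.latestEventTime).min
    if (own - slowest > maxLeadMs) states(topic).nextOffset // stop reading
    else plannedOffset                                      // read as planned
  }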
32. Guarantee at least once semantics
[Diagram: the Spark Merge Service reads source topics a and b (messages tagged with offset o and topic t) and writes merged messages to the sink topic; the current state and each sink message record the source offsets (e.g. a-offset: 1, b-offset: 5); on restart/recovery the service retrieves the last message from the sink topic (with a timeout) and resumes from the stored offsets]
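A sketch of the recovery step (hypothetical types): every merged message written to the sink topic carries the source offsets it was built from, so a restart can resume both source topics from there; replayed messages make the pipeline at-least-once rather than exactly-once.

  // Each sink record stores how far each source topic had been consumed.
  case class SinkRecord(payload: String, aOffset: Long, bOffset: Long)

  // On restart: read the last record from the sink topic (with a timeout,
  // in case the topic is empty) and resume from the stored offsets.
  def resumeOffsets(lastSink: Option[SinkRecord]): (Long, Long) =
    lastSink.map(r => (r.aOffset, r.bOffset)).getOrElse((0L, 0L))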
36. Lessons learned
• Excellent performance and scalability
• Extensible via custom RDDs and an extension to DirectKafka
• No event-time windows in Spark Streaming (see https://github.com/apache/spark/pull/2633)
• Checkpointing cannot be used when deploying new artifact versions
• The driver/executor model is powerful, but also complex