How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
1. How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
Ralf Sigmund, Sebastian Schröder
Otto GmbH & Co. KG
2. Agenda
• Who are we?
• Our Setup
• The Problem
• Our Approach
• Requirements
• Advanced Requirements
• Lessons learned
3. Who are we?
• Ralf
– Technical Designer, Team Tracking
– Cycling, Surfing
– Twitter: @sistar_hh
• Sebastian
– Developer, Team Tracking
– Taekwondo, Cycling
– Twitter: @Sebasti0n
4. Who are we working for?
• 2nd-largest e-commerce company in Germany
• more than 1 million visitors every day
• our blog: https://dev.otto.de
• our jobs: www.otto.de/jobs
12. There might be no messages
[Diagram: two source topics each deliver messages (t: 1/id: 1, t: 4/id: 2, t: 2/id: 3); a heartbeat at t: 5 stands in when a topic has no messages to send; messages time out after 3 sec; the combined clock is min(clock A, clock B) = 4]
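A minimal Scala sketch of this clock handling (all names are illustrative, not the talk's actual code): heartbeats carry only an event time, so they advance a topic's clock even when no real messages arrive, and the merge may only act on event times up to min(clock A, clock B).

  sealed trait TopicEvent { def eventTime: Long }
  case class DataMessage(eventTime: Long, id: String) extends TopicEvent
  case class Heartbeat(eventTime: Long) extends TopicEvent // no id, creates no state

  // A topic's clock is the max event time seen so far, heartbeats included.
  def advance(clock: Long, events: Seq[TopicEvent]): Long =
    (clock +: events.map(_.eventTime)).max

  // The merge may only time out messages older than the combined clock.
  def businessClock(clockA: Long, clockB: Long): Long = math.min(clockA, clockB)

  // As on the slide: clock A = 4, clock B = 5 (heartbeat) => clock = 4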
13. Requirements Summary
• Messages with the same key are merged
• Messages are timed out by the combined event time of both source topics
• Timed out messages are sorted by event time
15. Merge messages with the same Id
• updateStateByKey, which is defined on key-value (pair) DStreams
• It applies a function to all new messages and the existing state for a given key
• Returns a StateDStream with the resulting messages
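A minimal sketch of this merge with the public updateStateByKey API (the types and merge rule are illustrative, not the talk's actual code):

  import org.apache.spark.streaming.dstream.DStream

  case class InputMessage(id: String, eventTime: Long, payload: String)
  case class MergedMessage(id: String, eventTime: Long, payloads: Seq[String])

  // Called once per key and micro batch with all new messages plus the
  // existing state for that key.
  def merge(newMsgs: Seq[InputMessage],
            state: Option[MergedMessage]): Option[MergedMessage] = {
    val id = state.map(_.id).orElse(newMsgs.headOption.map(_.id))
    id.map { key =>
      MergedMessage(
        key,
        (state.map(_.eventTime).toSeq ++ newMsgs.map(_.eventTime)).max,
        state.map(_.payloads).getOrElse(Seq.empty) ++ newMsgs.map(_.payload))
    }
  }

  // Stream keyed by message id; updateStateByKey requires ssc.checkpoint(...)
  def mergeById(keyed: DStream[(String, InputMessage)]): DStream[(String, MergedMessage)] =
    keyed.updateStateByKey(merge _)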
19. Flag And Remove Timed Out Msgs
[Diagram: an UpdateFunction, which needs a global business clock, combines the DStream[InputMessage] of the current micro batch with the HistoryRDD[MergedMessage] into a DStream[MergedMessage]; it flags newly timed-out messages, then Filter (timed out) emits them downstream while Filter !(timed out) keeps the remaining messages as the new history]
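A sketch of that split, assuming a merged-message type that carries a timed-out flag (names are hypothetical):

  case class Merged(id: String, eventTime: Long, timedOut: Boolean)

  // Flag state entries older than the global business clock.
  def flagTimedOut(m: Merged, businessClock: Long, timeoutMs: Long): Merged =
    m.copy(timedOut = m.eventTime + timeoutMs < businessClock)

  // One filter emits the timed-out messages (sorted by event time, see
  // slide 13), the other keeps the rest as the next batch's history.
  def split(state: Seq[Merged]): (Seq[Merged], Seq[Merged]) =
    (state.filter(_.timedOut).sortBy(_.eventTime), state.filterNot(_.timedOut))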
20. Messages are timed out by event time
• We need access to all messages from the current micro batch and the history (state) => Custom StateDStream
• Compute the maximum event time of both topics and supply it to the updateStateByKey function
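The stock update function (Seq[V], Option[S]) => Option[S] has no access to a batch-global value, hence the custom StateDStream. A rough sketch of the clock computation it needs, with hypothetical types:

  import org.apache.spark.rdd.RDD

  // Max event time per topic over the current micro batch, merged with the
  // clocks carried over from earlier batches.
  def batchClocks(batch: RDD[(String, Long)], // (topic, eventTime)
                  previous: Map[String, Long]): Map[String, Long] = {
    val current = batch.reduceByKey((a, b) => math.max(a, b)).collect().toMap
    (previous.keySet ++ current.keySet).map { t =>
      t -> math.max(previous.getOrElse(t, 0L), current.getOrElse(t, 0L))
    }.toMap
  }

  // The clock handed to the update function is the min over both topics:
  // batchClocks(batch, previous).values.min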
26. Effect of different request rates in source topics
[Chart: messages per micro batch (0, 5K, 10K) for topicA and topicB; one topic is catching up with 5K per micro batch while the other topic's data is read too early]
27. Handle different request rates in topics
• Stop reading from one topic if the event time of the other topic is too far behind
• Store the latest event time of each topic and partition on the driver
• Use a custom KafkaDStream with a clamp method that uses the event time
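Spark's DirectKafkaInputDStream caps each batch via an internal clamp step; the talk replaces the rate-based cap with an event-time-based one. A hedged sketch of the decision logic only (names are hypothetical, not Spark's internal API):

  // Driver-side bookkeeping: latest event time and next offset per topic.
  case class TopicState(latestEventTime: Long, nextOffset: Long)

  // Hold a topic's offset if its event time is too far ahead of the
  // slowest topic, so the merge never reads data "too early".
  def clampOffset(topic: String,
                  states: Map[String, TopicState],
                  maxLeadMs: Long,
                  plannedOffset: Long): Long = {
    val own = states(topic).latestEventTime
    val slowest = states.values.map(_.latestEventTime).min
    if (own - slowest > maxLeadMs) states(topic).nextOffset // stop reading
    else plannedOffset                                      // read as planned
  }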
32. Guarantee at least once semantics
[Diagram: the Spark Merge Service reads source topics a and b (messages tagged with offset o and topic t) and writes merged messages to the sink topic; the current state and each sink message record the source offsets (e.g. a-offset: 1, b-offset: 5); on restart/recovery the service retrieves the last message from the sink topic (with a timeout) and resumes from the stored offsets]
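A sketch of the recovery step (hypothetical types): every merged message written to the sink topic carries the source offsets it was built from, so a restart can resume both source topics from there; replayed messages make the pipeline at-least-once rather than exactly-once.

  // Each sink record stores how far each source topic had been consumed.
  case class SinkRecord(payload: String, aOffset: Long, bOffset: Long)

  // On restart: read the last record from the sink topic (with a timeout,
  // in case the topic is empty) and resume from the stored offsets.
  def resumeOffsets(lastSink: Option[SinkRecord]): (Long, Long) =
    lastSink.map(r => (r.aOffset, r.bOffset)).getOrElse((0L, 0L))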
36. Lessons learned
• Excellent performance and scalability
• Extensible via custom RDDs and an extension to DirectKafka
• No event-time windows in Spark Streaming (see https://github.com/apache/spark/pull/2633)
• Checkpointing cannot be used when deploying new artifact versions
• The driver/executor model is powerful, but also complex