How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming

Sebastian Schröder and I presented how we implemented the merging of Apache Kafka topics. We cover the core requirements (timeouts based on business timestamps, heartbeats, …) and the operational requirements (at-least-once semantics, checkpointing, clamping the message sources to each other).

  1. How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming. Ralf Sigmund, Sebastian Schröder. Otto GmbH & Co. KG
  2. Agenda • Who are we? • Our Setup • The Problem • Our Approach • Requirements • Advanced Requirements • Lessons learned
  3. Who are we? • Ralf – Technical Designer, Team Tracking – Cycling, Surfing – Twitter: @sistar_hh • Sebastian – Developer, Team Tracking – Taekwondo, Cycling – Twitter: @Sebasti0n
  4. Who are we working for? • 2nd-largest e-commerce company in Germany • more than 1 million visitors every day • our blog: https://dev.otto.de • our jobs: www.otto.de/jobs
  5. Our Setup [architecture diagram: otto.de, Varnish Reverse Proxy, Tracking-Server, Verticals 0…n with Services 0…n]
  6. The Problem [diagram: same setup; there is a limit on the size of messages, yet large tracking messages need to be sent to the Tracking-Server]
  7. Our Approach [diagram: same setup, extended with a Spark Service between the services S0…Sn and the Tracking-Server]
  8. Requirements
  9. Messages with the same key are merged [diagram: two input streams of messages, each with an event time t and an id; messages that share the same id are merged into a single output message]
  10. Messages have to be timed out [diagram: the same message streams; merged messages are timed out after 3 sec, measured against max(business event timestamp)]
  11. Event Streams might get stuck [diagram: each source stream has its own clock (clock A, clock B); the combined business clock is min(clock A, clock B)]
  12. There might be no messages [diagram: a heartbeat message (t: 5) keeps the clock of a silent stream moving, so clock: min(clock A, clock B) = 4 and pending messages can still be timed out after 3 sec]
  13. Requirements Summary • Messages with the same key are merged • Messages are timed out by the combined event time of both source topics • Timed-out messages are sorted by event time
  14. Solution
  15. Merge messages with the same Id • updateStateByKey, which is defined on key-value (pair) DStreams • It applies a function to all new messages and the existing state for a given key • Returns a StateDStream with the resulting messages
  16. UpdateStateByKey [diagram: new messages and existing state grouped by keys 1, 2, 3] UpdateFunction: (Seq[InputMessage], Option[MergedMessage]) => Option[MergedMessage]
  17. Merge messages with the same Id
  18. Merge messages with the same Id
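The transcript shows the signature of the update function (slide 16) but not its body. Below is a minimal Scala sketch of the merge step from slides 15–18, assuming hypothetical InputMessage and MergedMessage case classes and a naive field-wise merge; the actual Otto implementation is not part of the transcript.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical message types; the real schemas are not shown in the slides.
case class InputMessage(id: String, eventTime: Long, payload: Map[String, String])
case class MergedMessage(id: String, lastEventTime: Long, payload: Map[String, String])

// Update function with the signature from slide 16:
// (Seq[InputMessage], Option[MergedMessage]) => Option[MergedMessage]
def mergeById(newMessages: Seq[InputMessage],
              state: Option[MergedMessage]): Option[MergedMessage] =
  if (newMessages.isEmpty) {
    state // no new data for this key in this micro batch, keep the existing state
  } else {
    val start = state.getOrElse(MergedMessage(newMessages.head.id, Long.MinValue, Map.empty))
    Some(newMessages.foldLeft(start) { (merged, msg) =>
      merged.copy(
        lastEventTime = math.max(merged.lastEventTime, msg.eventTime),
        payload = merged.payload ++ msg.payload) // later fields win in this naive merge
    })
  }

// updateStateByKey is defined on key-value DStreams, so the union of both
// source topics is keyed by message id first.
def merge(messages: DStream[InputMessage]): DStream[(String, MergedMessage)] =
  messages.map(m => (m.id, m)).updateStateByKey(mergeById _)
```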
  19. Flag And Remove Timed Out Msgs [diagram: merged messages (MM) in the state are flagged as timed out (TO) and then removed; the UpdateFunction needs a global Business Clock]
  20. Messages are timed out by event time • We need access to all messages from the current micro batch and the history (state) => custom StateDStream • Compute the maximum event time of both topics and supply it to the updateStateByKey function
  21. Messages are timed out by event time
  22. Messages are timed out by event time
  23. Messages are timed out by event time
  24. Messages are timed out by event time
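The custom StateDStream itself is not in the transcript. The sketch below only illustrates the time-out logic from slides 10–12 and 19–20, assuming the global business clock for a micro batch is the minimum over both topics of their maximum event time (heartbeats included) and that this clock is handed to the update function each batch (plain updateStateByKey fixes the function up front, which is why the talk uses a custom StateDStream). All names and the millisecond unit of the 3-second time-out are assumptions.

```scala
// Hypothetical types; MergedMessage is extended with a timed-out flag (slide 19: "TO").
case class InputMessage(id: String, eventTime: Long, payload: Map[String, String])
case class MergedMessage(id: String, lastEventTime: Long,
                         payload: Map[String, String], timedOut: Boolean = false)

val timeoutMillis = 3000L // "time-out after 3 sec" from slide 10, assumed to be milliseconds here

// Combined business clock (slide 11): each topic's clock is the maximum event
// time it has produced so far; the global clock is the minimum of both, so a
// slow or silent topic (kept alive by heartbeats, slide 12) holds time back.
def globalBusinessClock(maxEventTimeTopicA: Long, maxEventTimeTopicB: Long): Long =
  math.min(maxEventTimeTopicA, maxEventTimeTopicB)

// Update function that merges new messages and flags the state as timed out once
// the global clock has moved far enough past its last event time. Flagged messages
// are emitted downstream and dropped from the state in a later batch (slide 19).
def updateWithTimeout(clock: Long)(newMessages: Seq[InputMessage],
                                   state: Option[MergedMessage]): Option[MergedMessage] = {
  val merged = newMessages.foldLeft(state) { (acc, msg) =>
    val base = acc.getOrElse(MergedMessage(msg.id, Long.MinValue, Map.empty))
    Some(base.copy(lastEventTime = math.max(base.lastEventTime, msg.eventTime),
                   payload = base.payload ++ msg.payload))
  }
  merged.map(m => m.copy(timedOut = clock - m.lastEventTime >= timeoutMillis))
}
```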
  25. Advanced Requirements
  26. Effect of different request rates in source topics [chart: message rates of topicA and topicB on a 0–10K scale; the lagging topic is catching up with 5K per microbatch while data from the other topic is read too early]
  27. Handle different request rates in topics • Stop reading from one topic if the event time of the other topic is too far behind • Store the latest event time of each topic and partition on the driver • Use a custom KafkaDStream with a clamp method which uses the event time
  28. Handle different request rates in topics
  29. Handle different request rates in topics
  30. Handle different request rates in topics
  31. Handle different request rates in topics
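The clamp extension to DirectKafka is not shown in the transcript either. The sketch below only illustrates the driver-side decision from slide 27: stop reading from a topic that has run too far ahead of the other one in event time. The lag threshold and all names are assumptions, not the actual extension.

```scala
// Latest business event time per source topic and partition, tracked on the driver (slide 27).
case class TopicPartitionKey(topic: String, partition: Int)

val maxEventTimeLag = 60 * 1000L // hypothetical: allow the topics to drift at most ~60s apart

// A topic's clock is defined here by its slowest partition; no data yet counts as time 0.
def topicClock(latestEventTime: Map[TopicPartitionKey, Long], topic: String): Long = {
  val times = latestEventTime.collect { case (k, t) if k.topic == topic => t }
  if (times.isEmpty) 0L else times.min
}

/** Decide before a micro batch whether to read more data from `topic`:
  * stop (return false) if the other topic's event time is too far behind,
  * i.e. this topic has run ahead by more than the allowed lag. A custom
  * KafkaDStream can use this decision in its clamp step to cap the offsets it reads. */
def shouldRead(latestEventTime: Map[TopicPartitionKey, Long],
               topic: String, otherTopic: String): Boolean =
  topicClock(latestEventTime, topic) - topicClock(latestEventTime, otherTopic) <= maxEventTimeLag
```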
  32. Guarantee at least once semantics [diagram: messages carry an offset (o) and source topic (t: a / t: b); the current state and the sink topic both record the source offsets (a-offset: 1, b-offset: 5); on restart/recovery the Spark Merge Service retrieves the last message from the sink topic and resumes from those offsets]
  33. Guarantee at least once semantics
  34. Guarantee at least once semantics
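Slides 32–34 only show the idea: every merged message written to the sink topic carries the source offsets it was built from, and on restart the service reads the last sink message back and resumes from those offsets. A minimal sketch of that recovery step, with the Kafka read abstracted away; all names are hypothetical.

```scala
// Hypothetical representation of a message in the sink topic: besides the merged payload
// it records the source offsets it was produced from (slide 32: a-offset, b-offset).
case class SourceOffsets(aOffset: Long, bOffset: Long)
case class SinkMessage(id: String, payload: Map[String, String], offsets: SourceOffsets)

/** On restart/recovery, resume both source topics from the offsets stored in the last
  * message that reached the sink topic. `readLastSinkMessage` stands in for a small
  * consumer that seeks to the end of the sink topic. Re-reading from these offsets may
  * re-emit some already-merged messages, which is exactly the at-least-once guarantee. */
def startingOffsets(readLastSinkMessage: () => Option[SinkMessage]): SourceOffsets =
  readLastSinkMessage()
    .map(_.offsets)
    .getOrElse(SourceOffsets(aOffset = 0L, bOffset = 0L)) // empty sink topic: start from the beginning
```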
  35. Everything put together
  36. Lessons learned • Excellent Performance and Scalability • Extensible via Custom RDDs and Extension to DirectKafka • No event time windows in Spark Streaming (see https://github.com/apache/spark/pull/2633) • Checkpointing cannot be used when deploying new artifact versions • Driver/executor model is powerful, but also complex
  37. THANK YOU. Twitter: @sistar_hh @Sebasti0n Blog: dev.otto.de Jobs: otto.de/jobs
