
About time



Event-time stream processing with Akka Streams



  1. 1. Nadav Wiener Scala Tech Lead @ Riskified Scala since 2007 Akka Streams since 2016
  2. 2. RISKIFIED: 250 total employees in New York and Tel Aviv, $64M in funding secured to date, 1,000,000 global orders reviewed every day, 1000 merchants, including several publicly traded companies
  3. 3. Time Windowing ● Streaming Data Platforms vs Libraries ● Glazier: Event Time Windowing
  4. 4. Poll: Platforms or Libraries? Spark / Flink ● Kafka Streams ● Akka Streams ● Monix / fs2
  5. 5. This was our dilemma
  6. 6. Behavioral Data
  7. 7. Proxy ?
  8. 8. no proxy
  9. 9. ? Proxy ?
  10. 10. We want to: gather the lowest latencies ● per session ● per 10-second window
  11. 11. Browser → HTTP Server: latencies. Gather the lowest latencies ● per session ● per 10-second window
  12. 12. Browser → HTTP Server: latencies → write to journal. Gather the lowest latencies ● per session ● per 10-second window
  13. 13. Browser → HTTP Server: latencies → write to journal → Stream Processing: consume, 10-second windows (for each user), lowest latency → Database
  14. 14. Time Windowing
  15. 15. Time Windowing: Platforms (Spark/Flink) 😀, Libraries (Akka Streams) 😕
  16. 16. Platforms vs Libraries?
  17. 17. Platforms are: ✔ Powerful
  18. 18. Platforms are: ✔ Powerful, but: ✘ Big fish to catch ✘ Constraining
  19. 19. Platforms to Libraries: Spark / Flink ● Kafka Streams ● Akka Streams ● Monix / fs2 (you are here)
  20. 20. Platforms and Libraries: you are here
  21. 21. Stream Processing Take #1. Gather the lowest latencies ● per session ● per 10-second window. Browser → HTTP Server: latencies → write to journal → Stream Processing: consume → lowest latencies → Database
  22. 22. case class LatencyEntry(sessionId: String, latency: Duration) session id latency LatencyEntry Stream Processing Take #1
  23. 23. latencySource .groupBy(_.sessionId) .groupedWithin(10.seconds) .map(group => group.minBy(_.latency)) .mergeSubstreams .to(databaseSink) Partition into per-session substreams Stream Processing Take #1 session id latency LatencyEntry
  24. 24. latencySource .groupBy(_.sessionId) .groupedWithin(10.seconds) .map(group => group.minBy(_.latency)) .mergeSubstreams .to(databaseSink) Accumulate & emit every 10s Partition into per-session substreams Stream Processing Take #1 session id latency LatencyEntry
  25. 25. latencySource .groupBy(_.sessionId) .groupedWithin(10.seconds) .map(group => group.minBy(_.latency)) .mergeSubstreams .to(databaseSink) Accumulate & emit every 10s Lowest latency in accumulated data Partition into per-session substreams Stream Processing Take #1 session id latency LatencyEntry
  26. 26. latencySource .groupBy(_.sessionId) .groupedWithin(10.seconds) .map(group => group.minBy(_.latency)) .mergeSubstreams .to(databaseSink) Accumulate & emit every 10s Lowest latency in accumulated data Partition into per-session substreams Merge substreams & send to downstream db Stream Processing Take #1 session id latency LatencyEntry
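
One practical note on the snippet above: the real Akka Streams operators take a couple of parameters the slides leave out (groupBy also needs a cap on the number of live substreams, and groupedWithin also needs a maximum batch size). Below is a minimal, self-contained sketch of the same Take #1 pipeline, assuming Akka 2.6; the sample data, the 1024-substream cap, and the databaseSink stand-in are illustrative choices, not part of the original.

    import scala.concurrent.duration._
    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    final case class LatencyEntry(sessionId: String, latency: FiniteDuration)

    object ProcessingTimeTake1 extends App {
      implicit val system: ActorSystem = ActorSystem("take-1")

      // Stand-ins for the journal source and the database sink in the diagram.
      val latencySource = Source(List(
        LatencyEntry("a", 120.millis),
        LatencyEntry("a", 80.millis),
        LatencyEntry("b", 200.millis)
      ))
      val databaseSink = Sink.foreach[LatencyEntry](e => println(s"${e.sessionId}: ${e.latency}"))

      latencySource
        .groupBy(1024, _.sessionId)              // partition into per-session substreams (cap required)
        .groupedWithin(10000, 10.seconds)        // accumulate & emit every 10s (max batch size required)
        .map(group => group.minBy(_.latency))    // lowest latency in the accumulated batch
        .mergeSubstreams
        .to(databaseSink)
        .run()                                   // termination/shutdown omitted in this sketch
    }
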
  27. 27. But this is naive...
  28. 28. 1) Bring up only the HTTP server (Browser → HTTP Server: latencies → write to journal), and wait for latencies to accumulate
  29. 29. 1) Bring up only the HTTP server, and wait for latencies to accumulate 2) Only then bring up stream processing
  30. 30. 1) Bring up only the HTTP server, and wait for latencies to accumulate 2) Only then bring up stream processing (Browser → HTTP Server: latencies → write to journal → Stream Processing: consume → lowest latencies → Database)
  31. 31. Instead of this: 10-second windows (for each user), each yielding its lowest latency, aligned with when the latencies actually occurred
  32. 32. We get this: the accumulated journal entries are consumed in one burst, so the 10-second windows are drawn around when entries were processed rather than when they happened
  33. 33. WE SHOULDN’T BE LOOKING AT THE CLOCK
  34. 34. Processing Time vs Event Time, across the pipeline: Browser → HTTP Server: latencies → write to journal → Stream Processing: consume → lowest latencies → Database
  35. 35. Event Time ● Timestamp as payload ● Plays well with distributed systems ● Not available in libraries 😕. Processing Time ● Time derived from clock ● Less suitable for business logic ● Available in libraries 😀
  36. 36. Event Time Processing Time ?
  37. 37. Glazier
  38. 38. Event time windowing library Glazier
  39. 39. Glazier: Tour of the API ● Under the hood ● Glazier |+| Akka Streams
  40. 40. case class LatencyEntry(sessionId: String, latency: Duration, timestamp: Timestamp) session id latency timestamp LatencyEntry
  41. 41. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams .to(databaseSink) Event-time obtained from LatencyEntry.timestamp session id latency timestamp LatencyEntry
  42. 42. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams .to(databaseSink) Partitioned by session id session id latency timestamp LatencyEntry
  43. 43. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams .to(databaseSink) 10 second (event-time) windows 10s 10s
  44. 44. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams .to(databaseSink) 1 second grace period: events are not guaranteed to arrive in order, so windows stay around for late events (10s window + grace)
  45. 45. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams .to(databaseSink) Emit lowest latency in window, once it closes session id latency timestamp LatencyEntry
  46. 46. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams .to(databaseSink) Merge window substreams
  47. 47. Windowing Functions type WindowingFunction = Timestamp => immutable.Seq[Interval] def tumbling(span: Span): WindowingFunction = ... span span span span
  48. 48. Windowing Functions type WindowingFunction = Timestamp => immutable.Seq[Interval] def tumbling(span: Span): WindowingFunction = { timestamp => val elapsed = timestamp % span val start = timestamp - elapsed List(Interval(start, start + span)) } span span span span
  49. 49. Windowing Functions type WindowingFunction = Timestamp => immutable.Seq[Interval] def sliding(span: Span, step: Span): WindowingFunction = ... span
  50. 50. Windowing Functions type WindowingFunction = Timestamp => immutable.Seq[Interval] def sliding(span: Span, step: Span): WindowingFunction = ... spanstep
  51. 51. Windowing Functions type WindowingFunction = Timestamp => immutable.Seq[Interval] def sliding(span: Span, step: Span): WindowingFunction = ... step spanstep
  52. 52. Windowing Functions type WindowingFunction = Timestamp => immutable.Seq[Interval] def sliding(span: Span, step: Span): WindowingFunction = ... step step step step step step step span
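
The sliding variant is only hinted at above. The sketch below mirrors the tumbling definition from the slides; the concrete Timestamp/Span representation (epoch milliseconds as Long) and the exact window alignment are assumptions made for illustration, not necessarily how Glazier implements it.

    object WindowingFunctionsSketch {
      // Assumed concrete representations, for illustration only.
      type Timestamp = Long // e.g. epoch milliseconds
      type Span = Long
      final case class Interval(start: Timestamp, end: Timestamp)
      type WindowingFunction = Timestamp => scala.collection.immutable.Seq[Interval]

      // As on the slides: each timestamp falls into exactly one window, aligned to multiples of `span`.
      def tumbling(span: Span): WindowingFunction = { timestamp =>
        val elapsed = timestamp % span
        val start = timestamp - elapsed
        List(Interval(start, start + span))
      }

      // Sliding windows start on every multiple of `step`; a timestamp belongs to every
      // window of length `span` that still covers it (span / step windows per event).
      def sliding(span: Span, step: Span): WindowingFunction = { timestamp =>
        val lastStart = timestamp - (timestamp % step) // latest window start at or before the timestamp
        Iterator
          .iterate(lastStart)(_ - step)
          .takeWhile(start => start > timestamp - span) // i.e. start + span > timestamp
          .map(start => Interval(start, start + span))
          .toList
      }

      def main(args: Array[String]): Unit = {
        println(tumbling(10000L)(23500L))       // List(Interval(20000,30000))
        println(sliding(10000L, 5000L)(23500L)) // List(Interval(20000,30000), Interval(15000,25000))
      }
    }
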
  53. 53. Windowing State: each Step takes an Event plus the current State (Logical Clock, Active Windows) and produces Instructions plus an updated State (Logical Clock, Active Windows)
  54. 54. case class Step(presentTime: Timestamp, windows: Map[Interval, Set[Any]]) Timekeeping Newer events advance 'presentTime’ Active Windows Logical Clock
  55. 55. case class Step(presentTime: Timestamp, windows: Map[Interval, Set[Any]]) Timekeeping Windows represented as key-set by interval Active Windows Logical Clock
  56. 56. case class Step(presentTime: Timestamp, windows: Map[Interval, Set[Any]]) Timekeeping: newer events advance 'presentTime'. Example state: presentTime = 19s; windows: 0s…10s → {1, 2}, 10s…20s → {2, 5}
  57. 57. def step[A](event: Event[A]): State[Step, Vector[Instruction[A]]] = for { _ <- advanceClock(event.timestamp, maxLateness) closeInstructions <- closeWindows openInstructions <- openWindows(event) handleInstructions <- handleEvent(event) } yield closeInstructions ++ openInstructions ++ handleInstructions Timekeeping. State: presentTime = 19s; windows: 0s…10s → {1, 2}, 10s…20s → {2, 5}
  58. 58. def step[A](event: Event[A]): State[Step, Vector[Instruction[A]]] = for { _ <- advanceClock(event.timestamp, maxLateness) closeInstructions <- closeWindows openInstructions <- openWindows(event) handleInstructions <- handleEvent(event) } yield closeInstructions ++ openInstructions ++ handleInstructions Timekeeping. Event LatencyEntry(4, 100ms, 22s) arrives; presentTime advances to 22s; windows: 0s…10s → {1, 2}, 10s…20s → {2, 5}
  59. 59. def step[A](event: Event[A]): State[Step, Vector[Instruction[A]]] = for { _ <- advanceClock(event.timestamp, maxLateness) closeInstructions <- closeWindows openInstructions <- openWindows(event) handleInstructions <- handleEvent(event) } yield closeInstructions ++ openInstructions ++ handleInstructions Active Windows. Event: LatencyEntry(4, 100ms, 22s); presentTime = 22s; windows: 0s…10s → {1, 2}, 10s…20s → {2, 5}
  60. 60. def step[A](event: Event[A]): State[Step, Vector[Instruction[A]]] = for { _ <- advanceClock(event.timestamp, maxLateness) closeInstructions <- closeWindows openInstructions <- openWindows(event) handleInstructions <- handleEvent(event) } yield closeInstructions ++ openInstructions ++ handleInstructions Active Windows. Event: LatencyEntry(4, 100ms, 22s); presentTime = 22s; windows: 10s…20s → {2, 5}, 20s…30s → {4} (the 0s…10s window has closed, a window has opened for key 4)
  61. 61. def step[A](event: Event[A]): State[Step, Vector[Instruction[A]]] = for { _ <- advanceClock(event.timestamp, maxLateness) closeInstructions <- closeWindows openInstructions <- openWindows(event) handleInstructions <- handleEvent(event) } yield closeInstructions ++ openInstructions ++ handleInstructions Instructions. Event: LatencyEntry(4, 100ms, 22s); presentTime = 22s; windows: 10s…20s → {2, 5}, 20s…30s → {4}
  62. 62. def step[A](event: Event[A]): State[Step, Vector[Instruction[A]]] = for { _ <- advanceClock(event.timestamp, maxLateness) closeInstructions <- closeWindows openInstructions <- openWindows(event) handleInstructions <- handleEvent(event) } yield closeInstructions ++ openInstructions ++ handleInstructions Instructions. Event: LatencyEntry(4, 100ms, 22s); presentTime = 22s; windows: 10s…20s → {2, 5}, 20s…30s → {4}
  63. 63. closeInstructions ++ openInstructions ++ handleInstructions == List( WindowStatusChange(Window(1, Interval(0s, 10s)), Close), WindowStatusChange(Window(2, Interval(0s, 10s)), Close), WindowStatusChange(Window(4, Interval(20s, 30s)), Open), HandleEvent(Window(4, Interval(20s, 30s)), LatencyEntry(4, 100ms, 22s)) ) Instructions. Event: LatencyEntry(4, 100ms, 22s); presentTime = 22s; windows: 10s…20s → {2, 5}, 20s…30s → {4}
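
To make the walk-through above concrete, here is a stripped-down, runnable model of the same step idea. It assumes cats.data.State (which the State[Step, ...] signature suggests, though that is an inference), collapses windowing down to a single tumbling window per event, and uses a 5-second grace period so the example reproduces the pictured transition; none of this is Glazier's actual internal code.

    import cats.data.State

    object StepSketch {
      type Timestamp = Long // milliseconds, for illustration
      final case class Interval(start: Timestamp, end: Timestamp)
      final case class Event[A](key: String, value: A, timestamp: Timestamp)

      sealed trait Instruction[+A]
      final case class WindowStatusChange(key: String, interval: Interval, open: Boolean)
          extends Instruction[Nothing]
      final case class HandleEvent[A](key: String, interval: Interval, value: A) extends Instruction[A]

      // The logical clock plus the key-sets of the currently open windows.
      final case class Step(presentTime: Timestamp, windows: Map[Interval, Set[String]])

      // Single tumbling window per timestamp, to keep the model small.
      private def intervalOf(ts: Timestamp, span: Long): Interval = {
        val start = ts - (ts % span)
        Interval(start, start + span)
      }

      // Newer events move the logical clock forward; older events leave it alone.
      def advanceClock(ts: Timestamp): State[Step, Unit] =
        State.modify((s: Step) => s.copy(presentTime = math.max(s.presentTime, ts)))

      // Close every window whose end plus the grace period is behind the clock.
      def closeWindows[A](maxLateness: Long): State[Step, Vector[Instruction[A]]] =
        State { (s: Step) =>
          val (expired, live) = s.windows.partition { case (iv, _) => iv.end + maxLateness <= s.presentTime }
          val closed = expired.toVector.flatMap { case (iv, keys) =>
            keys.toVector.map(k => WindowStatusChange(k, iv, open = false))
          }
          (s.copy(windows = live), closed)
        }

      // Register the event's key under its window, opening the window for that key if needed.
      def openWindows[A](event: Event[A], span: Long): State[Step, Vector[Instruction[A]]] =
        State { (s: Step) =>
          val iv   = intervalOf(event.timestamp, span)
          val keys = s.windows.getOrElse(iv, Set.empty[String])
          val opened =
            if (keys.contains(event.key)) Vector.empty[Instruction[A]]
            else Vector(WindowStatusChange(event.key, iv, open = true))
          (s.copy(windows = s.windows.updated(iv, keys + event.key)), opened)
        }

      // Route the event's value to its (window, key) pair.
      def handleEvent[A](event: Event[A], span: Long): State[Step, Vector[Instruction[A]]] =
        State.pure(Vector(HandleEvent(event.key, intervalOf(event.timestamp, span), event.value)))

      def step[A](event: Event[A], span: Long, maxLateness: Long): State[Step, Vector[Instruction[A]]] =
        for {
          _       <- advanceClock(event.timestamp)
          closed  <- closeWindows[A](maxLateness)
          opened  <- openWindows(event, span)
          handled <- handleEvent(event, span)
        } yield closed ++ opened ++ handled

      def main(args: Array[String]): Unit = {
        // Roughly the state from the walk-through: clock at 19s, two open windows.
        val before = Step(19000L, Map(
          Interval(0L, 10000L)     -> Set("1", "2"),
          Interval(10000L, 20000L) -> Set("2", "5")
        ))
        // An event for key "4" at 22s: the 0s..10s window closes, 10s..20s stays open,
        // and a 20s..30s window opens for key "4".
        val (after, instructions) =
          step(Event("4", "latency=100ms", 22000L), span = 10000L, maxLateness = 5000L).run(before).value
        instructions.foreach(println)
        println(after)
      }
    }
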
  64. 64. def fromFlow[A](glazier: Glazier[A], flow: Flow[A, …]): SubFlow[…] = flow .scan(Glazier.Empty)((state, event) => glazier.runStep(state, event)) .mapConcat(_.instructions) .groupBy(_.window) .takeWhile { case WindowStatusChange(_, WindowStatus.Close) => false case _ => true } .collect { case HandleEvent(_, value) => value } Glazier |+| Akka Streams
  65. 65. def fromFlow[A](glazier: Glazier[A], flow: Flow[A, …]): SubFlow[…] = flow .scan(Glazier.Empty)((state, event) => glazier.runStep(state, event)) .mapConcat(_.instructions) .groupBy(_.window) .takeWhile { case WindowStatusChange(_, WindowStatus.Close) => false case _ => true } .collect { case HandleEvent(_, value) => value } Akka Streams Support
  66. 66. def fromFlow[A](glazier: Glazier[A], flow: Flow[A, …]): SubFlow[…] = flow .scan(Glazier.Empty)((state, event) => glazier.runStep(state, event)) .mapConcat(_.instructions) .groupBy(_.window) .takeWhile { case WindowStatusChange(_, WindowStatus.Close) => false case _ => true } .collect { case HandleEvent(_, value) => value } Akka Streams Support
  67. 67. def fromFlow[A](glazier: Glazier[A], flow: Flow[A, …]): SubFlow[…] = flow .scan(Glazier.Empty)((state, event) => glazier.runStep(state, event)) .mapConcat(_.instructions) .groupBy(_.window) .takeWhile { case WindowStatusChange(_, WindowStatus.Close) => false case _ => true } .collect { case HandleEvent(_, value) => value } Akka Streams Support
  68. 68. def fromFlow[A](glazier: Glazier[A], flow: Flow[A, …]): SubFlow[…] = flow .scan(Glazier.Empty)((state, event) => glazier.runStep(state, event)) .mapConcat(_.instructions) .groupBy(_.window) .takeWhile { case WindowStatusChange(_, WindowStatus.Close) => false case _ => true } .collect { case HandleEvent(_, value) => value } Akka Streams Support
  69. 69. def fromFlow[A](glazier: Glazier[A], flow: Flow[A, …]): SubFlow[…] = flow .scan(Glazier.Empty)((state, event) => glazier.runStep(state, event)) .mapConcat(_.instructions) .groupBy(_.window) .takeWhile { case WindowStatusChange(_, WindowStatus.Close) => false case _ => true } .collect { case HandleEvent(_, value) => value } Akka Streams Support
  70. 70. latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams User code in windowed substream Akka Streams Support
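
The core trick in fromFlow is that a window-close instruction terminates its own substream, which is what lets ordinary per-substream user code emit exactly once per window. The sketch below isolates that trick with stand-in instruction types and a hand-written instruction stream (both invented for illustration, not Glazier's types):

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object SubstreamPerWindowSketch extends App {
      implicit val system: ActorSystem = ActorSystem("windows")

      // Illustrative stand-ins for Glazier's instructions.
      sealed trait Instr
      final case class Close(window: String)              extends Instr
      final case class Handle(window: String, value: Int) extends Instr

      // A hand-written instruction stream, standing in for the output of the scan step.
      val instructions = Source(List[Instr](
        Handle("w1", 3), Handle("w1", 1), Handle("w2", 7), Close("w1"), Handle("w2", 2), Close("w2")
      ))

      instructions
        .groupBy(64, {                            // one substream per window
          case Close(w)     => w
          case Handle(w, _) => w
        })
        .takeWhile {
          case Close(_) => false                  // the Close marker completes this window's substream
          case _        => true
        }
        .collect { case Handle(_, v) => v }       // unwrap the user-facing values
        .fold(Int.MaxValue)((acc, v) => math.min(acc, v)) // "user code": emits once, when the substream ends
        .mergeSubstreams
        .runWith(Sink.foreach[Int](v => println(v)))      // prints 1 (w1) and 2 (w2); order may vary
    }
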
  71. 71. latencySource .assignTimestamps(_.timestamp) .keyBy(_.sessionId) .window(TumblingEventTimeWindows.of(Time.seconds(10))) .allowedLateness(Time.seconds(1)) .reduceWith { case (r1, r2) => Seq(r1, r2).minBy(_.latency) } latencySource .timestampWith(_.timestamp) .keyBy(_.sessionId) .windowBy(Window.tumbling(10.seconds), maxLateness = 1.second) .reduce(Seq(_, _).minBy(_.latency)) .mergeSubstreams vs Flink API
  72. 72. Time Windowing: Spark/Flink 😀, Akka Streams 😕, with Glazier 😀
  73. 73. Questions?
  74. 74. Takeaways
  75. 75. Platforms vs Libraries?
  76. 76. Platforms: ✔ Powerful, but ✘ Upfront investment ✘ Constraining
  77. 77. Libraries: ✔ Flexible
  78. 78. Libraries: ✔ Flexible, but ✘ Missing functionality
  79. 79. Platforms … Libraries: you are here
  80. 80. Platforms and Libraries: significant overlap
  81. 81. Thank you for your time! Glazier: https://github.com/riskified/glazier ● “Streaming Microservices”, Dean Wampler: https://slideslive.com/38908773/kafkabased-microservices-with-akka-streams-and-kafka-streams ● “Windowing data in Akka Streams”, Adam Warski: https://softwaremill.com/windowing-data-in-akka-streams/
