HOPES, REGRETS AND “BEST PRACTICES”
RISKING EVERYTHING WITH AKKA STREAMS
08-12-2016
JOACHIM HOFER (@JOHOFER)
2
0. What We Do
1. First Impressions
2. Lessons Learnt
3. Awesomeness
4. Ops
5. Verdict
TABLE OF CONTENTS
3
WHAT WE DO @ ZALANDO
4
net sales 2015: 3 billion euros
several 1000 updates / second
latency: ideally seconds
WHAT WE DO: SOME NUMBERS
5
“SOME” RISK INVOLVED!
6
Akka Streams vs. Futures, RxScala, Actors
Tech Blog “Which shoe fits you?”
https://github.com/zalando/scala-concurrency-playground
Akka Streams fits us best!
Image: Nevit Dilmen (CC BY-SA 3.0)
FIRST IMPRESSIONS — PREPARATION
7
HUH?
FIRST IMPRESSIONS — GRAPH DSL
import GraphDSL.Implicits._
val bcSqsIn = b add Broadcast[StreamCreationEvent](2)
val rules = b add ruleStore.stage.async
val eval = b add evaluator.stage.async
val publish = b add nakadi.stage.async
val ack = b add ackStage.stage.async
bcSqsIn ~> rules ~> eval.in0
bcSqsIn ~> eval.in1; eval.out ~> publish ~> ack
FlowShape(bcSqsIn.in, ack.out)
8
AH!
FIRST IMPRESSIONS — PLAIN SOURCES
products
.mapAsync(parallelism = 5)(ruleEvaluatorEvent(flowId))
.groupedWithin(3, 100 millis)
.filter(_.nonEmpty)
.mapAsync(parallelism = 50)(sqsGateway.send)
.runForeach(result => log.info(result.getFailed.size))
9
Images: Jerry Daykin (CC BY 2.0, left), The Present Group (CC BY 3.0 US, right)
FIRST IMPRESSIONS — OTHER BIG SCARIES
MATERIALISATION
BACKPRESSURE
10
TASTY DOCUMENTATION
Image: Wikivisual (CC BY-NC-SA 3.0)
11
MISTAKES WERE MADE
Image: Hapesoft (public domain)
12
Caused by onError
Completes the stream
Recover using recover / recoverWithRetries
FAILURES (VS ERRORS)
13
Caused by exceptions
Get escalated to failures by default
Recover using a Supervision Strategy
Can be ignored easily (“Resume” strategy)
ERRORS (VS FAILURES)
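The distinction on the two slides above can be reproduced in a few lines. The following is a minimal sketch, assuming Akka 2.6 or later (where an implicit ActorSystem supplies the materializer); the object name and the division example are illustrative and not taken from the talk. It runs the same throwing `map` twice: once followed by `recover`, once with a resuming supervision strategy.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorAttributes, Supervision}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

// Illustrative only: names and data are assumptions, not code from the talk.
object FailureHandlingSketch {
  def run(): (Seq[Int], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("failure-handling-sketch")
    try {
      // 100 / 0 throws an ArithmeticException inside the map stage.
      val divide = Source(List(1, 2, 0, 4)).map(100 / _)

      // Default behaviour: the exception fails the stream. recover emits one
      // final element in its place and then completes the stream.
      val recovered = divide
        .recover { case _: ArithmeticException => -1 }
        .runWith(Sink.seq)

      // Resume: the offending element is dropped and the stream carries on.
      val resumed = divide
        .withAttributes(ActorAttributes.supervisionStrategy(Supervision.resumingDecider))
        .runWith(Sink.seq)

      (Await.result(recovered, 5.seconds), Await.result(resumed, 5.seconds))
    } finally system.terminate()
  }
}
```

With `recover`, the stream ends right after the replacement element, so the 4 is never processed. With Resume, the 0 is skipped silently and the 4 still gets through, which is convenient but easy to overuse.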
14-19
RESUMING
LESSONS LEARNT
[Diagram, animated across slides 14 to 19 (the RESUMING heading appears on slides 14 to 16). Labels: RETRIEVE EVENTS, EVENTS, RETRIEVE RULES, EVALUATE; numbered markers 1 and 2 move through the graph]
20
GOING OUT-OF-SYNC
LESSONS LEARNT
[Diagram. Labels: RETRIEVE EVENTS, EVENTS, RETRIEVE RULES, EVALUATE; numbered markers 1 and 2]
21
Unconsumed response entities
Solution: discardEntityBytes
LESSONS LEARNT — AKKA-HTTP CLIENTS (1)
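A minimal sketch of the fix named on this slide, assuming a recent akka-http on the classpath. `statusOnly` is a hypothetical helper, not something from the talk or from Akka HTTP itself; only `discardEntityBytes` is the real API.

```scala
import akka.http.scaladsl.model.{HttpResponse, StatusCode}
import akka.stream.Materializer

// Illustrative only: the helper name is an assumption.
object DiscardEntitySketch {
  // For callers that only care about the status code: the response entity is
  // a stream, and leaving it unread backpressures the underlying connection.
  def statusOnly(response: HttpResponse)(implicit mat: Materializer): StatusCode = {
    response.discardEntityBytes() // drain the body we are not going to read
    response.status
  }
}
```

A helper like this would typically be mapped over the response future returned by the client API, so that every code path either reads or discards the entity.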
22
No response from the server
“Currently Akka HTTP doesn’t implement client-side request timeout
checking itself as this functionality can be regarded as a more general
purpose streaming infrastructure feature.” (akka http docs)
Solution: Use an explicit xxxTimeout stage
LESSONS LEARNT — AKKA-HTTP CLIENTS (2)
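The "xxxTimeout" on this slide refers to the family of built-in timeout stages (`initialTimeout`, `completionTimeout`, `idleTimeout` and so on). The following is a minimal sketch using `completionTimeout`, assuming Akka 2.6 or later. A `Source.maybe` that never emits stands in for a server that never responds; that stand-in and all names are assumptions for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import java.util.concurrent.TimeoutException
import scala.concurrent.Await
import scala.concurrent.duration._

// Illustrative only: names are assumptions, not code from the talk.
object TimeoutSketch {
  def run(): Boolean = {
    implicit val system: ActorSystem = ActorSystem("timeout-sketch")
    try {
      // Never emits and never completes: our "server that does not answer".
      val neverResponds = Source.maybe[String]

      val response = neverResponds
        .completionTimeout(200.millis) // fail the stream if it is not done in time
        .runWith(Sink.head)

      // Wait for the future to finish either way, then inspect how it ended.
      val outcome = Await.ready(response, 3.seconds).value
      outcome.exists(_.failed.toOption.exists(_.isInstanceOf[TimeoutException]))
    } finally system.terminate()
  }
}
```

Without the timeout stage, the materialized future in this sketch would simply never complete, which is the situation the slide warns about.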
23
Default: only up to 16?!
16 (buffer size) x 50 (parallel flows) x 10 (events per batch) = 8000 events
Solution: keep low, use explicit buffer stages
LESSONS LEARNT — INTERNAL BUFFERS
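The arithmetic and the suggested remedy on this slide can be written down as follows. This is a sketch under assumptions: the helper names are hypothetical, and the buffer sizes are examples rather than recommendations from the talk. `Attributes.inputBuffer` and `buffer` with an `OverflowStrategy` are the real Akka Streams APIs.

```scala
import akka.NotUsed
import akka.stream.{Attributes, OverflowStrategy}
import akka.stream.scaladsl.Flow

// Illustrative only: names are assumptions, not code from the talk.
object BufferSketch {
  // The slide's arithmetic: how many events can sit in internal buffers.
  val bufferSize = 16
  val parallelFlows = 50
  val eventsPerBatch = 10
  val eventsInBuffers: Int = bufferSize * parallelFlows * eventsPerBatch

  // Shrink the internal input buffer of an asynchronous section to one element.
  def withSmallInputBuffer[A, B](section: Flow[A, B, NotUsed]): Flow[A, B, NotUsed] =
    section.async.addAttributes(Attributes.inputBuffer(initial = 1, max = 1))

  // Where buffering is actually wanted, make it an explicit, visible stage.
  def withExplicitBuffer[A, B](section: Flow[A, B, NotUsed], size: Int): Flow[A, B, NotUsed] =
    section.buffer(size, OverflowStrategy.backpressure)
}
```

The design point is visibility: an explicit `buffer` stage shows up in the stream definition, whereas enlarged internal buffers hide in-flight elements behind every asynchronous boundary.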
24
CONFIGURATION AS A SOURCE?
Image: Alpha (CC BY-SA 2.0)
LESSONS LEARNT
[Diagram. Labels: RETRIEVE EVENTS, EVENTS, RETRIEVE RULES, EVALUATE, CONFIGURATION]
25
REACTIVE
MANIFESTO
26
Backpressure
Automatically asynchronous
AWESOMENESS — REACTIVE
27
AWESOMENESS — TESTABILITY
[Diagram. Labels: TEST EVENTS, EVALUATE, OUTPUTS TO CHECK]
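The idea in the diagram (feed test events into one stage, check what comes out) needs nothing beyond a `Source` and a `Sink`. The following is a minimal sketch assuming Akka 2.6 or later; the `Event` type and the "no price means unavailable" rule are hypothetical stand-ins for the real evaluation stage.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

// Illustrative only: types and rule are assumptions, not code from the talk.
object StageTestSketch {
  final case class Event(id: String, price: Option[BigDecimal])

  // Stage under test: a product counts as available only if it has a price.
  val evaluate: Flow[Event, (String, Boolean), NotUsed] =
    Flow[Event].map(event => event.id -> event.price.isDefined)

  def run(): Seq[(String, Boolean)] = {
    implicit val system: ActorSystem = ActorSystem("stage-test-sketch")
    try {
      val testEvents = List(Event("shoe", Some(BigDecimal(59))), Event("hat", None))
      // Test events in, stage under test in the middle, outputs collected to check.
      val outputs = Source(testEvents).via(evaluate).runWith(Sink.seq)
      Await.result(outputs, 5.seconds)
    } finally system.terminate()
  }
}
```

Because the stage is a plain `Flow` value with no hidden state, the same pattern works for partial graphs as well as single stages.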
28
groupedWithin
throttle
mapConcat
recoverWithRetries
…
AWESOMENESS — BUILT-IN STAGES
Flow[RuleEvaluationEvent]
.groupedWithin(batchSize, batchTimeout)
.mapAsync(parallelism = 1)(provider.deleteFromStreamCreation)
.mapConcat(identity)
.via(logAndMonitor(Checkpoint.Acknowledged, "acknowledged msg"))
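To see what the `groupedWithin` / `mapConcat(identity)` pair in the snippet above does on its own, here is a small sketch with plain integers, assuming Akka 2.6 or later. The numbers and names are illustrative; the batch size and time window are arbitrary.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

// Illustrative only: names and data are assumptions, not code from the talk.
object BatchingSketch {
  def run(): (Seq[Seq[Int]], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("batching-sketch")
    try {
      // Emit a batch after 3 elements or after 1 second, whichever comes first.
      val batches = Source(1 to 7).groupedWithin(3, 1.second)

      val grouped = Await.result(batches.runWith(Sink.seq), 5.seconds)
      // mapConcat(identity) flattens the batches back into single elements.
      val flattened = Await.result(batches.mapConcat(identity).runWith(Sink.seq), 5.seconds)

      (grouped, flattened)
    } finally system.terminate()
  }
}
```

In the slide's flow the batch exists only for the duration of the `mapAsync` call, so one request can acknowledge many messages while downstream stages still see individual events.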
29
no out-of-the-box solution
built-in stage monitor not very helpful
no access to internal buffers
OPS — MONITORING: THE BAD
30
Trace events along streams
Create your own monitoring stage
OPS — MONITORING: THE GOOD
31
EXAMPLE PASS-THROUGH STAGE LOGIC
OPS — MONITORING
…
new GraphStageLogic(shape) {
  setHandlers(in, out, new InHandler with OutHandler {
    override def onPush(): Unit = {
      push(out, grab(in))
    }
    override def onPull(): Unit = {
      pull(in)
    }
  })
}
…
32
EXAMPLE PASS-THROUGH STAGE LOGIC
OPS — MONITORING
…
new GraphStageLogic(shape) {
  setHandlers(in, out, new InHandler with OutHandler {
    override def onPush(): Unit = {
      stats.countPush()
      push(out, grab(in))
    }
    override def onPull(): Unit = {
      stats.countPull()
      pull(in)
    }
  })
}
…
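The fragment above elides the surrounding stage. One way to complete it is sketched here, assuming Akka 2.6 or later. The `Stats` class is hypothetical (the slides only show the `countPush` and `countPull` calls), as are the class and port names.

```scala
import akka.actor.ActorSystem
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical counters; thread-safe so they can be read from outside the stream.
final class Stats {
  val pushes = new AtomicLong
  val pulls = new AtomicLong
  def countPush(): Unit = { pushes.incrementAndGet(); () }
  def countPull(): Unit = { pulls.incrementAndGet(); () }
}

// Pass-through stage: forwards every element unchanged and counts the signals.
final class MonitoringStage[A](stats: Stats) extends GraphStage[FlowShape[A, A]] {
  val in: Inlet[A] = Inlet[A]("MonitoringStage.in")
  val out: Outlet[A] = Outlet[A]("MonitoringStage.out")
  override val shape: FlowShape[A, A] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandlers(in, out, new InHandler with OutHandler {
        override def onPush(): Unit = { stats.countPush(); push(out, grab(in)) }
        override def onPull(): Unit = { stats.countPull(); pull(in) }
      })
    }
}

object MonitoringSketch {
  def run(): (Seq[Int], Long) = {
    implicit val system: ActorSystem = ActorSystem("monitoring-sketch")
    try {
      val stats = new Stats
      val result = Source(1 to 3).via(new MonitoringStage[Int](stats)).runWith(Sink.seq)
      (Await.result(result, 5.seconds), stats.pushes.get)
    } finally system.terminate()
  }
}
```

Comparing the push and pull counts of such a stage at different points in a stream gives a rough picture of where demand is waiting on data and where data is waiting on demand.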
33
easy to tune
very efficient
easy to scale
see also: Gearpump Materializer
Image: Duncan Rawlinson (CC BY 2.0)
OPS — TUNING AND SCALING
34
everything under control
just keeps on running
OPS — RELIABILITY
36
Takes time to understand
Very potent
Image: Twice25 (CC BY-SA 2.5)
VERDICT: IT’S MAGIC!
37
RISK ➟ REWARD
joachim.hofer@zalando.de
@johofer
JOACHIM HOFER
Availability Engineering
Backend Engineer
08-12-2016

Editor's Notes

  • #2 clickbait; ask about experiences
  • #4 sell clothes and shoes online; monolithic fashion store -> platform for fashion services; tech: microservices, connected by event streams; we: decide if product should be available; why: based on rules (e.g. no price, image quality, missing stuff)
  • #7 BEGIN first impressions
  • #8 difficult for Scala beginners
  • #9 just using sources; a lot easier; sufficient for most of the code; best practice: start out with sources, transition to graphs later
  • #10 next up: the solution: RTFM
  • #11 best practice: RTFM; pancakes example (go find it!)
  • #12 BEGIN Lessons Learnt; next up: failure handling
  • #13 tell about failures vs errors first!
  • #14 Resume: we’re using it a lot, very convenient, but…
  • #21 out-of-sync; deadlock by backpressure; solution: debug on whiteboard, use application layer “errors” (Either)
  • #22 see documentation
  • #24 premature optimisation; starts up slowly; bugs unnoticed longer, harder to find; keep to defaults (or adapt carefully); use explicit buffer stages where necessary
  • #25 caching — HoldWithWait decoupling not as a stream source? currently traditional cache / DI, regrets… maybe just add it to event metadata
  • #26 BEGIN awesomeness; clean and well-thought-out concept
  • #28 thinking in streams: thinking “stateless by default”; easily test individual stages or test partial graphs
  • #29 very little code (!); very readable (because high-level); throttle: see canarying
  • #30 BEGIN ops; what about Kamon? contribute to akka-streams? create open source?
  • #34 tuning: dispatchers, # parallel flows, async boundaries; efficient: backpressure regulates rate “magically”; scale: stream-oriented in general: just add more machines
  • #35 threads, memory fully under control; no strange outages as of yet, fingers crossed; NEXT: VERDICT
  • #37 Fast to develop, fast to execute; ~ 300 events/sec/instance for rather complicated use case; latency down from hours (legacy) to < 1 min (1 s outside peaks, not full system yet); the perfect abstraction for our use case
  • #38 no incidents yet because of Akka Streams; there’s always more planets to land on; risk was worth taking