Akka Streams -
An Adobe data-intensive story
©2020 Adobe. All Rights Reserved. Adobe Confidential.
Bianca Tesila | Software Engineer
@biancatesila
Stefano Bonetti @Reactive Summit 2017
Stream Driven Development
Adobe Real-time Customer Data Platform
Person Attribute Data
• Name
• Gender
• Loyalty Status
Pseudonymous Behavioral Data
• Search Ad Click
• Website Visit
• Opened email offer
Advertising Ecosystem
Paid Media
Social Media e.g. Facebook
Personalization
On-site Personalization
In-app Personalization
Customer Systems
CRM
Email
SFTP/S3
Data Collection
Dozens of pre-built sources
Profile Management
Complete known & pseudonymous data
Activation
Hundreds of destinations
Real-time Customer Profile
Real-time Activation
Streaming Activation Services
Profile Management
Complete known & pseudonymous data
Activation
Hundreds of destinations
Real-time Customer Profile
HTTP
Streaming Activation Services
The problem
Streaming Activation Services
Profile Management
Complete known & pseudonymous data
Activation
Hundreds of destinations
Real-time Customer Profile
HTTP
Streaming Activation Services
• High throughput: 1M profile updates per second
• Heterogeneous & spiky traffic
• Potentially erroneous integrations
The solution
Reactive Streams specification - http://www.reactive-streams.org/
Handle streams of data whose volume is not known in advance: resource consumption must be controlled, so a fast data source cannot overwhelm a slower stream destination.
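In Akka Streams this demand-driven behaviour comes for free: a downstream stage only receives elements it has asked for. A minimal sketch of backpressure in action (names are illustrative, not from the Adobe codebase):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object BackpressureSketch extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-sketch")

  // The source could emit a million elements instantly, but throttle only
  // signals demand for 10 elements per second, so the source is
  // backpressured instead of flooding memory.
  Source(1 to 1000000)
    .throttle(10, 1.second)
    .runWith(Sink.foreach(i => println(s"processed $i")))
}
```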
It’s always about the team and the journey to production
• Roll out fast
• Roll out wisely
• Tune the performance
Roll out fast
Application reference
Akka Streams Stages
Source Flow Sink
SOURCE[Message, NotUsed]
FLOW[Message, ProfileUpdate, NotUsed]
SINK[ProfileUpdate, Future[Done]]
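Wired together, the three stages above form a runnable graph. A sketch with placeholder types (the real Message/ProfileUpdate models are not shown on the slides):

```scala
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.Future

final case class Message(raw: String)
final case class ProfileUpdate(id: String)

implicit val system: ActorSystem = ActorSystem("stages-sketch")

val source: Source[Message, NotUsed] =
  Source(List(Message("a"), Message("b")))

val deserialize: Flow[Message, ProfileUpdate, NotUsed] =
  Flow[Message].map(m => ProfileUpdate(m.raw)) // real parsing elided

val sink: Sink[ProfileUpdate, Future[Done]] = Sink.ignore

// Keep.right keeps the sink's Future[Done] as the materialized value,
// matching SINK[ProfileUpdate, Future[Done]] above.
val done: Future[Done] = source.via(deserialize).toMat(sink)(Keep.right).run()
```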
SOURCE[Message, NotUsed]
FLOW[Message, Either[ValidationError, ProfileUpdate], NotUsed]
SINK[ValidationError, NotUsed]
SINK[ProfileUpdate, Future[Done]]
SOURCE[Message]
FLOW[Message, Either[ValidationError, ProfileUpdate]]
FLOW[ProfileUpdate, Either[InvalidProfileError, EnhancedProfile]]
SINK[Error]
SINK[EnhancedProfile]
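The code slides later use a divertErrors(sinkError) helper that the deck never defines. One plausible way to write such a helper on top of Akka's divertTo operator (a hypothetical reconstruction, not the actual Adobe implementation):

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Sink}

// Route Left values to a dedicated error sink; pass Right values downstream.
def divertErrors[E, A](errorSink: Sink[E, _]): Flow[Either[E, A], A, NotUsed] =
  Flow[Either[E, A]]
    .divertTo(
      Flow[Either[E, A]].collect { case Left(e) => e }.to(errorSink),
      _.isLeft)
    .collect { case Right(a) => a }
```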
From whiteboard to code
def stream(source: Source[M, NotUsed],
           deserialize: Flow[M, Either[E, P], NotUsed],
           enhance: Flow[P, Either[E, EP], NotUsed],
           sinkError: Sink[E, Future[Done]],
           httpSink: Sink[EP, Future[Done]]): Future[Done] = {
  val stream: RunnableGraph[Future[Done]] =
    source
      .via(deserialize via divertErrors(sinkError))
      .via(enhance via divertErrors(sinkError))
      .toMat(httpSink)(Keep.right)
  stream.run()
}
Configuration driven apps
stream {
  source {
    type: "kafka-source"
    kafka-source {
      bootstrapServers: "127.0.0.1:9092"
      group: "activation"
      topic: "activation-input"
      reset: "earliest"
    }
  }
  sink {
    type: "http-sink"
    http-sink {
      schema: "https"
      endpoint: "facebook-business.com"
      port: 443
      uri: "/audience"
    }
  }
}
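HOCON files like the one above can be read with Typesafe Config, which Akka already depends on. A sketch of dispatching on the configured source type (key names mirror the snippet; the surrounding factory is hypothetical):

```scala
import com.typesafe.config.{Config, ConfigFactory}

val streamConfig: Config = ConfigFactory.load().getConfig("stream")

// "kafka-source" in the example above; the same pattern covers sinks.
val sourceType: String = streamConfig.getString("source.type")
val sourceConfig: Config = streamConfig.getConfig(s"source.$sourceType")

val bootstrapServers = sourceConfig.getString("bootstrapServers")
val topic = sourceConfig.getString("topic")
val group = sourceConfig.getString("group")
```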
Aim for composability & modularity
APPLICATION LAYER
REACTIVE LAYER
SOURCES FLOWS SINKS GRAPHS
BUSINESS LAYER
EVENTS SERVICES REPOSITORIES MONITORING
Roll out wisely
Pipelining & parallelism – Operator fusion
source.via(deserialize).via(dispatch).to(Sink.ignore)
Actor
Source map map Sink
Pipelining & parallelism – Async boundaries
source.via(deserialize).async.via(dispatch).to(Sink.ignore)
Actor 1 Actor 2
Async Boundary
Pipelining & parallelism – mapAsync(parallelism: Int)
source.via(deserialize).async.mapAsync(2)(dispatch).to(Sink.ignore)
Source map mapAsync(2) Sink
Actor 1 Actor 2
Future[T]
Pipelining & parallelism – mapAsync vs mapAsyncUnordered
[Diagram: mapAsync(3) over Future[T]; elements are emitted in upstream order]
Pipelining & parallelism – mapAsync vs mapAsyncUnordered
[Diagram: mapAsync(3) preserves upstream order; mapAsyncUnordered(3) emits elements as their futures complete]
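Both operators run up to `parallelism` futures at once; the difference is purely emission order. A sketch with artificial delays that make element 1 finish last (names are illustrative):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("mapasync-sketch")
import system.dispatcher

def slowDouble(i: Int): Future[Int] = Future {
  Thread.sleep(if (i == 1) 100 else 1) // element 1 completes last
  i * 2
}

// mapAsync(3): emits in upstream order (2, 4, 6, 8, 10), even though
// element 1's future is the last one to complete.
Source(1 to 5).mapAsync(3)(slowDouble).runWith(Sink.seq)

// mapAsyncUnordered(3): emits results as the futures complete, so the
// results for 2 and 3 can overtake the result for 1.
Source(1 to 5).mapAsyncUnordered(3)(slowDouble).runWith(Sink.seq)
```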
Pipelining & parallelism – Internal buffers
[Diagram: mapAsync(3) over Future[T] with its internal buffer]
# Default size of the internal buffers used in stream stages
akka.stream.materializer.max-input-buffer-size = 16

val dispatch: T => Future[T] = ???
Flow[T]
  .buffer(bufferSize, OverflowStrategy.backpressure)
  .mapAsync(parallelism)(dispatch)
Errors as events
Errors as events: is it enough?
def source: Source[Either[E, I], NotUsed] =
  Source.fromGraph(new GraphStage[SourceShape[Either[E, I]]] {
    override def onPull(): Unit = kafkaConsumer.fetch() match {
      case Success(i) => push(out, Right(i))
      case Failure(t) => push(out, Left(PipelineSourceError))
    }
  })
Let the errors be
def source: Source[I, NotUsed] =
  Source.fromGraph(new GraphStage[SourceShape[I]] {
    override def onPull(): Unit = pipelineConsumer.fetch() match {
      case Success(i) => push(out, i)
      case Failure(t) => throw PipelineSourceError
    }
  })
Let the errors be, but supervised
def source: Source[I, NotUsed] =
  Source.fromGraph(new GraphStage[SourceShape[I]] {
    override def onPull(): Unit = pipelineConsumer.fetch() match {
      case Success(i) => push(out, i)
      case Failure(t) => throw PipelineSourceError
    }
  })

def supervisedSource: Source[I, NotUsed] =
  RestartSource.withBackoff(minBackoff, maxBackoff, maxRestarts) { () => source }
https://doc.akka.io/docs/akka/current/stream/stream-error.html#supervision-strategies
Sometimes the stream needs to actually stop
val supervisedSource: Source[I, NotUsed] =
  RestartSource.withBackoff(minBackoff, maxBackoff, maxRestarts) { () => source }

val killSwitch: SharedKillSwitch = KillSwitches.shared("...")

val stream: RunnableGraph[Future[Done]] =
  supervisedSource
    .async
    .via(killSwitch.flow)
    .via(deserialize via divertErrors(sinkError))
    .via(dispatch via divertErrors(sinkError))
    .toMat(Sink.ignore)(Keep.right)

val runningStream: Future[Done] = stream.run()
runningStream onComplete { case _ => sys.exit(1) }

CoordinatedShutdown(system) addJvmShutdownHook {
  killSwitch.shutdown()
}
https://doc.akka.io/docs/akka/current/stream/stream-dynamic.html#controlling-stream-completion-with-killswitch
When the source fails, we want the stream to stop:
• cancel its upstream
• complete its downstream
Akka Streams Kill Switch:
• complete the stream via shutdown()
• downstream in-flight messages are processed before returning from shutdown()
Performance tuning – Case study
Reference application
[Pipeline: SOURCE → DESERIALIZE → GROUP → LOOKUP → SERIALIZE → SINK]
KPI: 1M incoming events per second
Performance baseline
Reality: 60K incoming events per minute per pod
[Pipeline: SOURCE → DESERIALIZE → GROUP → LOOKUP → SERIALIZE → SINK]
Performance baseline
Reality: 60K incoming events per minute per pod
Requirements for 1M events per second:
• 1000 pods
• 1000 Kafka partitions
[Pipeline: SOURCE → DESERIALIZE → GROUP → LOOKUP → SERIALIZE → SINK]
Finding out the limits – Trial 1
[Pipeline: SOURCE → NOOP SINK]
Result: 800K incoming events per minute per pod
Finding out the limits – Trial 2
Result: 500-800K incoming events per minute per pod
[Pipeline: SOURCE → DESERIALIZE → GROUP → LOOKUP ×3 (parallel) → SERIALIZE → SINK]
Finding out the limits – Trial 3
Result: serialize stage identified as bottleneck
[Pipeline: SOURCE → DESERIALIZE → GROUP → LOOKUP ×3 (parallel) → SERIALIZE → SINK, with BUFFER stages in between; chart: batches/min]
Finding out the limits – Trial 4
Result: 700K incoming events per minute per pod
[Pipeline: SOURCE → DESERIALIZE → GROUP → LOOKUP ×3 (parallel) → SERIALIZE ×3 (parallel) → SINK; chart: batches/min]
The journey towards 1M events per second
Initial trial
Result: 1K incoming events per second per pod
Required resources:
• 1000 pods
• 1000 Kafka partitions

Final trial
Result: 11K incoming events per second per pod
Required resources:
• 90 pods
• 90 Kafka partitions
Tools of choice
• Infrastructure: Amazon EKS
• Monitoring: Micrometer SDK, Prometheus & Grafana, New Relic
• Logging: Splunk
• Performance Testing: Gatling